PDF XMP: move field editing code to external script, document
This commit is contained in:
parent
9e046187da
commit
ef9e7a935b
@ -606,6 +606,23 @@ very slow.</para></listitem></varlistentry>
|
||||
available). This is
|
||||
normally disabled, because it does slow down PDF indexing a bit even if
|
||||
not one attachment is ever found.</para></listitem></varlistentry>
|
||||
<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.PDFEXTRAMETA">
|
||||
<term><varname>pdfextrameta</varname></term>
|
||||
<listitem><para>Extract text from selected XMP metadata tags. This
|
||||
is a space-separated list of qualified XMP tag names. Each element can also
|
||||
include a translation to a Recoll field name, separated by a '|'
|
||||
character. If the second element is absent, the tag name is used as the
|
||||
Recoll field names. You will also need to add specifications to the
|
||||
'fields' file to direct processing of the extracted data.</para></listitem></varlistentry>
|
||||
<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.PDFEXTRAMETAFIX">
|
||||
<term><varname>pdfextrametafix</varname></term>
|
||||
<listitem><para>Define name of XMP field editing script. This
|
||||
defines the name of a script to be loaded for editing XMP field
|
||||
values. The script should define a 'MetaFixer' class with a metafix()
|
||||
method which will be called with the qualified tag name and value of each
|
||||
selected field, for editing or erasing. A new instance is created for
|
||||
each document, so that the object can keep state for, e.g. eliminating
|
||||
duplicate values.</para></listitem></varlistentry>
|
||||
</sect3>
|
||||
<sect3 id="RCL.INSTALL.CONFIG.RECOLLCONF.SPECLOCATIONS">
|
||||
<title>Parameters set for specific locations </title>
|
||||
|
||||
@ -20,8 +20,8 @@ alink="#0000FF">
|
||||
<div class="titlepage">
|
||||
<div>
|
||||
<div>
|
||||
<h1 class="title"><a name="idp37528496" id=
|
||||
"idp37528496"></a>Recoll user manual</h1>
|
||||
<h1 class="title"><a name="idp35245072" id=
|
||||
"idp35245072"></a>Recoll user manual</h1>
|
||||
</div>
|
||||
|
||||
<div>
|
||||
@ -109,13 +109,13 @@ alink="#0000FF">
|
||||
multiple indexes</a></span></dt>
|
||||
|
||||
<dt><span class="sect2">2.1.3. <a href=
|
||||
"#idp43099712">Document types</a></span></dt>
|
||||
"#idp40818624">Document types</a></span></dt>
|
||||
|
||||
<dt><span class="sect2">2.1.4. <a href=
|
||||
"#idp43124208">Indexing failures</a></span></dt>
|
||||
"#idp40843200">Indexing failures</a></span></dt>
|
||||
|
||||
<dt><span class="sect2">2.1.5. <a href=
|
||||
"#idp43131216">Recovery</a></span></dt>
|
||||
"#idp40850208">Recovery</a></span></dt>
|
||||
</dl>
|
||||
</dd>
|
||||
|
||||
@ -172,29 +172,49 @@ alink="#0000FF">
|
||||
tags</a></span></dt>
|
||||
|
||||
<dt><span class="sect1">2.7. <a href=
|
||||
"#RCL.INDEXING.PDF">The PDF input
|
||||
handler</a></span></dt>
|
||||
|
||||
<dd>
|
||||
<dl>
|
||||
<dt><span class="sect2">2.7.1. <a href=
|
||||
"#RCL.INDEXING.PDF.OCR">OCR with
|
||||
Tesseract</a></span></dt>
|
||||
|
||||
<dt><span class="sect2">2.7.2. <a href=
|
||||
"#RCL.INDEXING.PDF.XMP">XMP fields
|
||||
extraction</a></span></dt>
|
||||
|
||||
<dt><span class="sect2">2.7.3. <a href=
|
||||
"#RCL.INDEXING.PDF.ATTACH">PDF attachment
|
||||
indexing</a></span></dt>
|
||||
</dl>
|
||||
</dd>
|
||||
|
||||
<dt><span class="sect1">2.8. <a href=
|
||||
"#RCL.INDEXING.PERIODIC">Periodic
|
||||
indexing</a></span></dt>
|
||||
|
||||
<dd>
|
||||
<dl>
|
||||
<dt><span class="sect2">2.7.1. <a href=
|
||||
<dt><span class="sect2">2.8.1. <a href=
|
||||
"#RCL.INDEXING.PERIODIC.EXEC">Running
|
||||
indexing</a></span></dt>
|
||||
|
||||
<dt><span class="sect2">2.7.2. <a href=
|
||||
<dt><span class="sect2">2.8.2. <a href=
|
||||
"#RCL.INDEXING.PERIODIC.AUTOMAT">Using <span class=
|
||||
"command"><strong>cron</strong></span> to automate
|
||||
indexing</a></span></dt>
|
||||
</dl>
|
||||
</dd>
|
||||
|
||||
<dt><span class="sect1">2.8. <a href=
|
||||
<dt><span class="sect1">2.9. <a href=
|
||||
"#RCL.INDEXING.MONITOR">Real time
|
||||
indexing</a></span></dt>
|
||||
|
||||
<dd>
|
||||
<dl>
|
||||
<dt><span class="sect2">2.8.1. <a href=
|
||||
<dt><span class="sect2">2.9.1. <a href=
|
||||
"#RCL.INDEXING.MONITOR.FASTFILES">Slowing down the
|
||||
reindexing rate for fast changing
|
||||
files</a></span></dt>
|
||||
@ -768,7 +788,7 @@ alink="#0000FF">
|
||||
"application">Qt</span>.</p>
|
||||
|
||||
<p>The <a class="link" href="#RCL.INDEXING.PERIODIC.EXEC"
|
||||
title="2.7.1. Running indexing">indexing process</a>
|
||||
title="2.8.1. Running indexing">indexing process</a>
|
||||
is started automatically the first time you execute the
|
||||
<span class="command"><strong>recoll</strong></span> GUI.
|
||||
Indexing can also be performed by executing the
|
||||
@ -879,21 +899,21 @@ alink="#0000FF">
|
||||
"list-style-type: disc;">
|
||||
<li class="listitem">
|
||||
<p><b><a class="link" href="#RCL.INDEXING.PERIODIC"
|
||||
title="2.7. Periodic indexing">Periodic (or
|
||||
title="2.8. Periodic indexing">Periodic (or
|
||||
batch) indexing:</a> </b>indexing takes place
|
||||
at discrete times, by executing the <span class=
|
||||
"command"><strong>recollindex</strong></span>
|
||||
command. The typical usage is to have a nightly
|
||||
indexing run <a class="link" href=
|
||||
"#RCL.INDEXING.PERIODIC.AUTOMAT" title=
|
||||
"2.7.2. Using cron to automate indexing">programmed</a>
|
||||
"2.8.2. Using cron to automate indexing">programmed</a>
|
||||
into your <span class=
|
||||
"command"><strong>cron</strong></span> file.</p>
|
||||
</li>
|
||||
|
||||
<li class="listitem">
|
||||
<p><b><a class="link" href="#RCL.INDEXING.MONITOR"
|
||||
title="2.8. Real time indexing">Real time
|
||||
title="2.9. Real time indexing">Real time
|
||||
indexing:</a> </b>indexing takes place as soon
|
||||
as a file is created or changed. <span class=
|
||||
"command"><strong>recollindex</strong></span> runs
|
||||
@ -997,8 +1017,8 @@ alink="#0000FF">
|
||||
<div class="titlepage">
|
||||
<div>
|
||||
<div>
|
||||
<h3 class="title"><a name="idp43099712" id=
|
||||
"idp43099712"></a>2.1.3. Document types</h3>
|
||||
<h3 class="title"><a name="idp40818624" id=
|
||||
"idp40818624"></a>2.1.3. Document types</h3>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
@ -1111,8 +1131,8 @@ indexedmimetypes = application/pdf
|
||||
<div class="titlepage">
|
||||
<div>
|
||||
<div>
|
||||
<h3 class="title"><a name="idp43124208" id=
|
||||
"idp43124208"></a>2.1.4. Indexing
|
||||
<h3 class="title"><a name="idp40843200" id=
|
||||
"idp40843200"></a>2.1.4. Indexing
|
||||
failures</h3>
|
||||
</div>
|
||||
</div>
|
||||
@ -1152,8 +1172,8 @@ indexedmimetypes = application/pdf
|
||||
<div class="titlepage">
|
||||
<div>
|
||||
<div>
|
||||
<h3 class="title"><a name="idp43131216" id=
|
||||
"idp43131216"></a>2.1.5. Recovery</h3>
|
||||
<h3 class="title"><a name="idp40850208" id=
|
||||
"idp40850208"></a>2.1.5. Recovery</h3>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
@ -1911,13 +1931,151 @@ metadatacmds = ; tags = tmsu tags %f
|
||||
filename.</code></p>
|
||||
</div>
|
||||
|
||||
<div class="sect1">
|
||||
<div class="titlepage">
|
||||
<div>
|
||||
<div>
|
||||
<h2 class="title" style="clear: both"><a name=
|
||||
"RCL.INDEXING.PDF" id=
|
||||
"RCL.INDEXING.PDF"></a>2.7. The PDF input
|
||||
handler</h2>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<p>The PDF format is very important for scientific and
|
||||
technical documentation, and document archival. It has
|
||||
extensive facilities for storing metadata along with the
|
||||
document, and these facilities are actually used in the
|
||||
real world.</p>
|
||||
|
||||
<p>In consequence, the <code class=
|
||||
"filename">rclpdf.py</code> PDF input handler has more
|
||||
complex capabilities than most others, and it is also more
|
||||
configurable. Specifically, <code class=
|
||||
"filename">rclpdf.py</code> can automatically use
|
||||
<span class="application">tesseract</span> to perform OCR
|
||||
if the document text is empty, it can be configured to
|
||||
extract specific metadata tags from an XMP packet, and to
|
||||
extract PDF attachments.</p>
|
||||
|
||||
<div class="sect2">
|
||||
<div class="titlepage">
|
||||
<div>
|
||||
<div>
|
||||
<h3 class="title"><a name="RCL.INDEXING.PDF.OCR"
|
||||
id="RCL.INDEXING.PDF.OCR"></a>2.7.1. OCR with
|
||||
Tesseract</h3>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<p>If both <span class="application">tesseract</span> and
|
||||
<span class="command"><strong>pdftoppm</strong></span>
|
||||
(generally from the <span class=
|
||||
"application">poppler-utils</span> package) are
|
||||
installed, the PDF handler may attempt OCR on PDF files
|
||||
with no text content. This is controlled by the <a class=
|
||||
"link" href=
|
||||
"#RCL.INSTALL.CONFIG.RECOLLCONF.PDFOCR">pdfocr</a>
|
||||
configuration variable, which is false by default because
|
||||
OCR is very slow.</p>
|
||||
|
||||
<p>The choice of language is very important for
|
||||
successfull OCR. Recoll has currently no way to determine
|
||||
this from the document itself. You can set the language
|
||||
to use through the contents of a <code class=
|
||||
"filename">.ocrpdflang</code> text file in the same
|
||||
directory as the PDF document, or through the
|
||||
<code class="envar">RECOLL_TESSERACT_LANG</code>
|
||||
environment variable, or through the contents of an
|
||||
<code class="filename">ocrpdf</code> text file inside the
|
||||
configuration directory. If none of the above are used,
|
||||
<span class="application">Recoll</span> will try to guess
|
||||
the language from the NLS environment.</p>
|
||||
</div>
|
||||
|
||||
<div class="sect2">
|
||||
<div class="titlepage">
|
||||
<div>
|
||||
<div>
|
||||
<h3 class="title"><a name="RCL.INDEXING.PDF.XMP"
|
||||
id="RCL.INDEXING.PDF.XMP"></a>2.7.2. XMP
|
||||
fields extraction</h3>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<p>The <code class="filename">rclpdf.py</code> script in
|
||||
<span class="application">Recoll</span> version 1.23.2
|
||||
and later can extract XMP metadata fields by executing
|
||||
the <span class="command"><strong>pdfinfo</strong></span>
|
||||
command (usually found with <span class=
|
||||
"application">poppler-utils</span>). This is controlled
|
||||
by the <a class="link" href=
|
||||
"#RCL.INSTALL.CONFIG.RECOLLCONF.PDFEXTRAMETA">pdfextrameta</a>
|
||||
configuration variable, which specifies which tags to
|
||||
extract and, possibly, how to rename them.</p>
|
||||
|
||||
<p>The <a class="link" href=
|
||||
"#RCL.INSTALL.CONFIG.RECOLLCONF.PDFEXTRAMETAFIX">pdfextrametafix</a>
|
||||
variable can be used to designate a file with Python code
|
||||
to edit the metadata fields (available for <span class=
|
||||
"application">Recoll</span> 1.23.3 and later. 1.23.2 has
|
||||
equivalent code inside the handler script). Example:</p>
|
||||
<pre class="programlisting">
|
||||
import sys
|
||||
import re
|
||||
|
||||
class MetaFixer(object):
|
||||
def __init__(self):
|
||||
pass
|
||||
|
||||
def metafix(self, nm, txt):
|
||||
if nm == 'bibtex:pages':
|
||||
txt = re.sub(r'--', '-', txt)
|
||||
elif nm == 'someothername':
|
||||
# do something else
|
||||
pass
|
||||
elif nm == 'stillanother':
|
||||
# etc.
|
||||
pass
|
||||
|
||||
return txt
|
||||
|
||||
</pre>
|
||||
</div>
|
||||
|
||||
<div class="sect2">
|
||||
<div class="titlepage">
|
||||
<div>
|
||||
<div>
|
||||
<h3 class="title"><a name="RCL.INDEXING.PDF.ATTACH"
|
||||
id="RCL.INDEXING.PDF.ATTACH"></a>2.7.3. PDF
|
||||
attachment indexing</h3>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<p>If <span class="application">pdftk</span> is
|
||||
installed, and if the the <a class="link" href=
|
||||
"#RCL.INSTALL.CONFIG.RECOLLCONF.PDFATTACH">pdfattach</a>
|
||||
configuration variable is set, the PDF input handler will
|
||||
try to extract PDF attachements for indexing as
|
||||
sub-documents of the PDF file. This is disabled by
|
||||
default, because it slows down PDF indexing a bit even if
|
||||
not one attachment is ever found (PDF attachments are
|
||||
uncommon in my experience).</p>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<div class="sect1">
|
||||
<div class="titlepage">
|
||||
<div>
|
||||
<div>
|
||||
<h2 class="title" style="clear: both"><a name=
|
||||
"RCL.INDEXING.PERIODIC" id=
|
||||
"RCL.INDEXING.PERIODIC"></a>2.7. Periodic
|
||||
"RCL.INDEXING.PERIODIC"></a>2.8. Periodic
|
||||
indexing</h2>
|
||||
</div>
|
||||
</div>
|
||||
@ -1929,7 +2087,7 @@ metadatacmds = ; tags = tmsu tags %f
|
||||
<div>
|
||||
<h3 class="title"><a name=
|
||||
"RCL.INDEXING.PERIODIC.EXEC" id=
|
||||
"RCL.INDEXING.PERIODIC.EXEC"></a>2.7.1. Running
|
||||
"RCL.INDEXING.PERIODIC.EXEC"></a>2.8.1. Running
|
||||
indexing</h3>
|
||||
</div>
|
||||
</div>
|
||||
@ -2037,7 +2195,7 @@ metadatacmds = ; tags = tmsu tags %f
|
||||
<div>
|
||||
<h3 class="title"><a name=
|
||||
"RCL.INDEXING.PERIODIC.AUTOMAT" id=
|
||||
"RCL.INDEXING.PERIODIC.AUTOMAT"></a>2.7.2. Using
|
||||
"RCL.INDEXING.PERIODIC.AUTOMAT"></a>2.8.2. Using
|
||||
<span class="command"><strong>cron</strong></span>
|
||||
to automate indexing</h3>
|
||||
</div>
|
||||
@ -2095,7 +2253,7 @@ metadatacmds = ; tags = tmsu tags %f
|
||||
<div>
|
||||
<h2 class="title" style="clear: both"><a name=
|
||||
"RCL.INDEXING.MONITOR" id=
|
||||
"RCL.INDEXING.MONITOR"></a>2.8. Real time
|
||||
"RCL.INDEXING.MONITOR"></a>2.9. Real time
|
||||
indexing</h2>
|
||||
</div>
|
||||
</div>
|
||||
@ -2225,7 +2383,7 @@ fs.inotify.max_user_watches=32768
|
||||
<div>
|
||||
<h3 class="title"><a name=
|
||||
"RCL.INDEXING.MONITOR.FASTFILES" id=
|
||||
"RCL.INDEXING.MONITOR.FASTFILES"></a>2.8.1. Slowing
|
||||
"RCL.INDEXING.MONITOR.FASTFILES"></a>2.9.1. Slowing
|
||||
down the reindexing rate for fast changing
|
||||
files</h3>
|
||||
</div>
|
||||
@ -9848,6 +10006,38 @@ thesame = "some string with spaces"
|
||||
because it does slow down PDF indexing a bit even
|
||||
if not one attachment is ever found.</p>
|
||||
</dd>
|
||||
|
||||
<dt><a name=
|
||||
"RCL.INSTALL.CONFIG.RECOLLCONF.PDFEXTRAMETA" id=
|
||||
"RCL.INSTALL.CONFIG.RECOLLCONF.PDFEXTRAMETA"></a><span class="term"><code class="varname">pdfextrameta</code></span></dt>
|
||||
|
||||
<dd>
|
||||
<p>Extract text from selected XMP metadata tags.
|
||||
This is a space-separated list of qualified XMP tag
|
||||
names. Each element can also include a translation
|
||||
to a Recoll field name, separated by a '|'
|
||||
character. If the second element is absent, the tag
|
||||
name is used as the Recoll field names. You will
|
||||
also need to add specifications to the 'fields'
|
||||
file to direct processing of the extracted
|
||||
data.</p>
|
||||
</dd>
|
||||
|
||||
<dt><a name=
|
||||
"RCL.INSTALL.CONFIG.RECOLLCONF.PDFEXTRAMETAFIX" id=
|
||||
"RCL.INSTALL.CONFIG.RECOLLCONF.PDFEXTRAMETAFIX"></a><span class="term"><code class="varname">pdfextrametafix</code></span></dt>
|
||||
|
||||
<dd>
|
||||
<p>Define name of XMP field editing script. This
|
||||
defines the name of a script to be loaded for
|
||||
editing XMP field values. The script should define
|
||||
a 'MetaFixer' class with a metafix() method which
|
||||
will be called with the qualified tag name and
|
||||
value of each selected field, for editing or
|
||||
erasing. A new instance is created for each
|
||||
document, so that the object can keep state for,
|
||||
e.g. eliminating duplicate values.</p>
|
||||
</dd>
|
||||
</dl>
|
||||
</div>
|
||||
|
||||
|
||||
@ -1098,6 +1098,108 @@ metadatacmds = ; tags = tmsu tags %f
|
||||
</sect1>
|
||||
|
||||
|
||||
<sect1 id="RCL.INDEXING.PDF">
|
||||
<title>The PDF input handler</title>
|
||||
|
||||
<para>The PDF format is very important for scientific and technical
|
||||
documentation, and document archival. It has extensive
|
||||
facilities for storing metadata along with the document, and these
|
||||
facilities are actually used in the real world.</para>
|
||||
|
||||
<para>In consequence, the <filename>rclpdf.py</filename> PDF input
|
||||
handler has more complex capabilities than most others, and it is
|
||||
also more configurable. Specifically, <filename>rclpdf.py</filename>
|
||||
can automatically use <application>tesseract</application> to perform
|
||||
OCR if the document text is empty, it can be configured to extract
|
||||
specific metadata tags from an XMP packet, and to extract PDF
|
||||
attachments.</para>
|
||||
|
||||
<sect2 id="RCL.INDEXING.PDF.OCR">
|
||||
<title>OCR with Tesseract</title>
|
||||
|
||||
<para>If both <application>tesseract</application> and
|
||||
<command>pdftoppm</command> (generally from the
|
||||
<application>poppler-utils</application> package) are installed,
|
||||
the PDF handler may attempt OCR on PDF files with no text
|
||||
content. This is controlled by the <link
|
||||
linkend="RCL.INSTALL.CONFIG.RECOLLCONF.PDFOCR">pdfocr</link>
|
||||
configuration variable, which is false by default because
|
||||
OCR is very slow.</para>
|
||||
|
||||
<para>The choice of language is very important for successfull
|
||||
OCR. Recoll has currently no way to determine this from the
|
||||
document itself. You can set the language to use through the
|
||||
contents of a <filename>.ocrpdflang</filename> text file in the
|
||||
same directory as the PDF document, or through the
|
||||
<envar>RECOLL_TESSERACT_LANG</envar> environment variable, or
|
||||
through the contents of an <filename>ocrpdf</filename> text file
|
||||
inside the configuration directory. If none of the above are used,
|
||||
&RCL; will try to guess the language from the NLS
|
||||
environment.</para>
|
||||
|
||||
</sect2>
|
||||
|
||||
<sect2 id="RCL.INDEXING.PDF.XMP">
|
||||
<title>XMP fields extraction</title>
|
||||
|
||||
<para>The <filename>rclpdf.py</filename> script in &RCL; version
|
||||
1.23.2 and later can extract XMP metadata fields by executing the
|
||||
<command>pdfinfo</command> command (usually found with
|
||||
<application>poppler-utils</application>). This is controlled by
|
||||
the <link
|
||||
linkend="RCL.INSTALL.CONFIG.RECOLLCONF.PDFEXTRAMETA">pdfextrameta</link>
|
||||
configuration variable, which specifies which tags to extract and,
|
||||
possibly, how to rename them.</para>
|
||||
|
||||
<para>The <link
|
||||
linkend="RCL.INSTALL.CONFIG.RECOLLCONF.PDFEXTRAMETAFIX">pdfextrametafix</link>
|
||||
variable can be used to designate a file with Python code to edit
|
||||
the metadata fields (available for &RCL; 1.23.3 and later. 1.23.2
|
||||
has equivalent code inside the handler script). Example:</para>
|
||||
<programlisting>import sys
|
||||
import re
|
||||
|
||||
class MetaFixer(object):
|
||||
def __init__(self):
|
||||
pass
|
||||
|
||||
def metafix(self, nm, txt):
|
||||
if nm == 'bibtex:pages':
|
||||
txt = re.sub(r'--', '-', txt)
|
||||
elif nm == 'someothername':
|
||||
# do something else
|
||||
pass
|
||||
elif nm == 'stillanother':
|
||||
# etc.
|
||||
pass
|
||||
|
||||
return txt
|
||||
</programlisting>
|
||||
|
||||
|
||||
<!-- <para> There is a <ulink url="&WIKI;PDFXMP.wiki">complete example of XMP
|
||||
tags setup</ulink>, including a nice result list paragraph format in the
|
||||
&RCL; Wiki </para> -->
|
||||
|
||||
|
||||
</sect2>
|
||||
|
||||
<sect2 id="RCL.INDEXING.PDF.ATTACH">
|
||||
<title>PDF attachment indexing</title>
|
||||
|
||||
<para>If <application>pdftk</application> is installed, and if the
|
||||
the <link
|
||||
linkend="RCL.INSTALL.CONFIG.RECOLLCONF.PDFATTACH">pdfattach</link>
|
||||
configuration variable is set, the PDF input handler will try to
|
||||
extract PDF attachements for indexing as sub-documents of the PDF
|
||||
file. This is disabled by default, because it slows down PDF
|
||||
indexing a bit even if not one attachment is ever found (PDF
|
||||
attachments are uncommon in my experience).</para>
|
||||
|
||||
</sect2>
|
||||
|
||||
</sect1>
|
||||
|
||||
<sect1 id="RCL.INDEXING.PERIODIC">
|
||||
<title>Periodic indexing</title>
|
||||
|
||||
|
||||
@ -98,6 +98,7 @@ class PDFExtractor:
|
||||
# (xmltag,rcltag) pairs
|
||||
self.extrameta = cf.getConfParam("pdfextrameta")
|
||||
if self.extrameta:
|
||||
self.extrametafix = cf.getConfParam("pdfextrametafix")
|
||||
self._initextrameta()
|
||||
|
||||
# Check if we need to escape portions of text where old
|
||||
@ -178,7 +179,16 @@ class PDFExtractor:
|
||||
self.re_xmlpacket = re.compile(r'<\?xpacket[ ]+begin.*\?>' +
|
||||
r'(.*)' + r'<\?xpacket[ ]+end',
|
||||
flags = re.DOTALL)
|
||||
|
||||
global EMF
|
||||
EMF = None
|
||||
if self.extrametafix:
|
||||
try:
|
||||
import imp
|
||||
EMF = imp.load_source('pdfextrametafix', self.extrametafix)
|
||||
except Exception as err:
|
||||
self.em.rclog("Import extrametafix failed: %s" % err)
|
||||
pass
|
||||
|
||||
# Extract all attachments if any into temporary directory
|
||||
def extractAttach(self):
|
||||
if self.attextractdone:
|
||||
@ -384,27 +394,12 @@ class PDFExtractor:
|
||||
# [e.text for e in elt.iter() if e.text]).strip()
|
||||
|
||||
|
||||
# This can be used for local field editing. For now you need to
|
||||
# change the program source. maybe we'll make it more dynamic one
|
||||
# day. The method receives an (original) field name, and the text
|
||||
# value, and should return the possibly modified text.
|
||||
def _extrametafix(self, nm, txt):
|
||||
if nm == 'bibtex:pages':
|
||||
txt = re.sub(r'--', '-', txt)
|
||||
elif nm == 'someothername':
|
||||
# do something else
|
||||
pass
|
||||
elif nm == 'stillanother':
|
||||
# etc.
|
||||
pass
|
||||
|
||||
return txt
|
||||
|
||||
|
||||
def _setextrameta(self, html):
|
||||
if not self.pdfinfo:
|
||||
return html
|
||||
|
||||
emf = EMF.MetaFixer() if EMF else None
|
||||
|
||||
all = subprocess.check_output([self.pdfinfo, "-meta", self.filename])
|
||||
|
||||
# Extract the XML packet
|
||||
@ -445,9 +440,10 @@ class PDFExtractor:
|
||||
continue
|
||||
if elt is not None:
|
||||
text = self._xmltreetext(elt).encode('UTF-8')
|
||||
if emf:
|
||||
text = emf.metafix(metanm, text)
|
||||
# Should we set empty values ?
|
||||
if text:
|
||||
text = self._extrametafix(metanm, text)
|
||||
# Can't use setfield as it only works for
|
||||
# text/plain output at the moment.
|
||||
metaheaders.append((rclnm, text))
|
||||
|
||||
@ -750,6 +750,27 @@ snippetMaxPosWalk = 1000000
|
||||
# not one attachment is ever found.</descr></var>
|
||||
#pdfattach = 0
|
||||
|
||||
# <var name="pdfextrameta" type="string">
|
||||
#
|
||||
# <brief>Extract text from selected XMP metadata tags.</brief><descr>This
|
||||
# is a space-separated list of qualified XMP tag names. Each element can also
|
||||
# include a translation to a Recoll field name, separated by a '|'
|
||||
# character. If the second element is absent, the tag name is used as the
|
||||
# Recoll field names. You will also need to add specifications to the
|
||||
# 'fields' file to direct processing of the extracted data.</descr></var>
|
||||
#pdfextrameta = bibtex:location|location bibtex:booktitle bibtex:pages
|
||||
|
||||
# <var name="pdfextrametafix" type="fn">
|
||||
#
|
||||
# <brief>Define name of XMP field editing script.</brief><descr>This
|
||||
# defines the name of a script to be loaded for editing XMP field
|
||||
# values. The script should define a 'MetaFixer' class with a metafix()
|
||||
# method which will be called with the qualified tag name and value of each
|
||||
# selected field, for editing or erasing. A new instance is created for
|
||||
# each document, so that the object can keep state for, e.g. eliminating
|
||||
# duplicate values.</descr></var>
|
||||
#pdfextrametafix = /path/to/fixerscript.py
|
||||
|
||||
|
||||
# <grouptitle id="SPECLOCATIONS">Parameters set for specific
|
||||
# locations</grouptitle>
|
||||
|
||||
Loading…
x
Reference in New Issue
Block a user