Jean-Francois Dockes 2020-03-01 16:08:15 +01:00
parent fe86fa9e1f
commit c110b94738
2 changed files with 63 additions and 32 deletions

@ -2131,19 +2131,32 @@ metadatacmds = ; <em class=
extensive facilities for storing metadata along with the extensive facilities for storing metadata along with the
document, and these facilities are actually used in the document, and these facilities are actually used in the
real world.</p> real world.</p>
<p>In consequence, the <code class= <p>In consequence, the <span class=
"filename">rclpdf.py</code> PDF input handler has more "command"><strong>rclpdf.py</strong></span> PDF input
complex capabilities than most others, and it is also more handler has more complex capabilities than most others, and
configurable. Specifically, <code class= it is also more configurable. Specifically, <span class=
"filename">rclpdf.py</code> can automatically use "command"><strong>rclpdf.py</strong></span> has the
<span class="application">tesseract</span> to perform OCR following features:</p>
if the document text is empty, it can be configured to <div class="itemizedlist">
extract specific metadata tags from an XMP packet, and to <ul class="itemizedlist" style="list-style-type: disc;">
extract PDF attachments.</p> <li class="listitem">
<p>The PDF handler can execute an external program to run <p>It can be configured to extract specific metadata
OCR if no text is found in the document. This is now tags from an XMP packet.</p>
described in a <a class="link" href="#RCL.INDEXING.OCR" </li>
title="2.9.&nbsp;Recoll and OCR">separate section</a>.</p> <li class="listitem">
<p>It can extract PDF attachments.</p>
</li>
<li class="listitem">
<p>It can automatically perform OCR if the document
text is empty. This is done by executing an external
program and is now described in a <a class="link"
href="#RCL.INDEXING.OCR" title=
"2.9.&nbsp;Recoll and OCR">separate section</a>,
because the OCR framework can also be used with
non-PDF image files.</p>
</li>
</ul>
</div>
<div class="sect2"> <div class="sect2">
<div class="titlepage"> <div class="titlepage">
<div> <div>
@ -2270,8 +2283,14 @@ metadatacmds = ; <em class=
</li> </li>
</ul> </ul>
</div> </div>
<p>Configuration. See the <a class="link" href= <p>To enable this feature, you need to install one of the
"#RCL.INSTALL.CONFIG.RECOLLCONF.OCR" title= supported OCR applications (<span class=
"application">tesseract</span> or <span class=
"application">ABBYY</span>), enable OCR in the PDF handler,
and tell <span class="application">Recoll</span> where the
appropriate command resides. The last parts are done by
setting configuration variables. See the <a class="link"
href="#RCL.INSTALL.CONFIG.RECOLLCONF.OCR" title=
"Parameters for OCR processing">relevant section</a>. All "Parameters for OCR processing">relevant section</a>. All
parameters can be localized in subdirectories through the parameters can be localized in subdirectories through the
usual main configuration mechanism (path sections).</p> usual main configuration mechanism (path sections).</p>

@ -1402,21 +1402,27 @@ metadatacmds = ; <replaceable>tags</replaceable> = tmsu tags %f
<title>The PDF input handler</title> <title>The PDF input handler</title>
<para>The PDF format is very important for scientific and technical <para>The PDF format is very important for scientific and technical
documentation, and document archival. It has extensive documentation, and document archival. It has extensive
facilities for storing metadata along with the document, and these facilities for storing metadata along with the document, and these
facilities are actually used in the real world.</para> facilities are actually used in the real world.</para>
<para>In consequence, the <filename>rclpdf.py</filename> PDF input <para>In consequence, the <command>rclpdf.py</command> PDF input
handler has more complex capabilities than most others, and it is handler has more complex capabilities than most others, and it is
also more configurable. Specifically, <filename>rclpdf.py</filename> also more configurable. Specifically, <command>rclpdf.py</command>
can automatically use <application>tesseract</application> to perform has the following features:
OCR if the document text is empty, it can be configured to extract <itemizedlist>
specific metadata tags from an XMP packet, and to extract PDF <listitem><para>It can be configured to extract
attachments.</para> specific metadata tags from an XMP packet.</para></listitem>
<listitem><para>It can extract PDF
<para>The PDF handler can execute an external program to run OCR if attachments.</para></listitem>
no text is found in the document. This is now described in a <listitem><para>It can automatically perform
<link linkend="RCL.INDEXING.OCR">separate section</link>.</para> OCR if the document text is empty. This is done by
executing an external program and is now described in a
<link linkend="RCL.INDEXING.OCR">separate
section</link>, because the OCR framework can also be used
with non-PDF image files.</para></listitem>
</itemizedlist>
</para>
<sect2 id="RCL.INDEXING.PDF.XMP"> <sect2 id="RCL.INDEXING.PDF.XMP">
<title>XMP fields extraction</title> <title>XMP fields extraction</title>
@ -1477,7 +1483,7 @@ metadatacmds = ; <replaceable>tags</replaceable> = tmsu tags %f
<title>PDF attachment indexing</title> <title>PDF attachment indexing</title>
<para>If <application>pdftk</application> is installed, and if the <para>If <application>pdftk</application> is installed, and if the
the the
<link linkend="RCL.INSTALL.CONFIG.RECOLLCONF.PDFATTACH">pdfattach</link> <link linkend="RCL.INSTALL.CONFIG.RECOLLCONF.PDFATTACH">pdfattach</link>
configuration variable is set, the PDF input handler will try to configuration variable is set, the PDF input handler will try to
extract PDF attachments for indexing as sub-documents of the PDF extract PDF attachments for indexing as sub-documents of the PDF
@ -1489,6 +1495,7 @@ metadatacmds = ; <replaceable>tags</replaceable> = tmsu tags %f
</sect1> </sect1>
<sect1 id="RCL.INDEXING.OCR"> <sect1 id="RCL.INDEXING.OCR">
<title>Recoll and OCR</title> <title>Recoll and OCR</title>
@ -1521,8 +1528,13 @@ metadatacmds = ; <replaceable>tags</replaceable> = tmsu tags %f
</itemizedlist> </itemizedlist>
</para> </para>
<para>Configuration. See the <para>To enable this feature, you need to install one of
<link linkend="RCL.INSTALL.CONFIG.RECOLLCONF.OCR"> the supported OCR applications
(<application>tesseract</application>
or <application>ABBYY</application>), enable OCR in the PDF
handler, and tell &RCL; where the appropriate command resides. The
last parts are done by setting configuration variables. See the
<link linkend="RCL.INSTALL.CONFIG.RECOLLCONF.OCR">
relevant section</link>. All parameters can be localized in relevant section</link>. All parameters can be localized in
subdirectories through the usual main configuration mechanism (path subdirectories through the usual main configuration mechanism (path
sections).</para> sections).</para>
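
The new feature list says the handler performs OCR only when the document text is empty, and that it does so by executing an external program. The sketch below illustrates that fallback pattern in isolation. It is not the actual rclpdf.py code: the function name `text_or_ocr`, the `ocr_command` parameter, and the injectable `run` argument are all hypothetical, and the sketch assumes the external OCR program prints the recognized text on standard output.

```python
import subprocess
from typing import Callable, Optional, Sequence


def text_or_ocr(
    extracted_text: str,
    pdf_path: str,
    ocr_command: Optional[Sequence[str]] = None,
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> str:
    """Return the extracted text, or the OCR output when the text is empty.

    Hypothetical helper: ocr_command is the external program plus its
    arguments, and the file path is appended as the last argument.
    """
    # Normal case: the PDF already has a text layer, so OCR is skipped.
    if extracted_text.strip():
        return extracted_text

    # OCR is disabled or no command was configured: index nothing.
    if not ocr_command:
        return ""

    # Run the external program and capture what it prints (assumed to be
    # the recognized text).
    result = run(
        [*ocr_command, pdf_path],
        capture_output=True,
        text=True,
        check=False,
    )
    # Treat a non-zero exit status as "no text" rather than raising.
    return result.stdout if result.returncode == 0 else ""
```

Passing `run` as a parameter keeps the sketch testable without an OCR program installed. A caller would typically leave it at its default and pass only the command list read from the configuration.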