doc
This commit is contained in:
parent
fe86fa9e1f
commit
c110b94738
@ -2131,19 +2131,32 @@ metadatacmds = ; <em class=
|
|||||||
extensive facilities for storing metadata along with the
|
extensive facilities for storing metadata along with the
|
||||||
document, and these facilities are actually used in the
|
document, and these facilities are actually used in the
|
||||||
real world.</p>
|
real world.</p>
|
||||||
<p>In consequence, the <code class=
|
<p>In consequence, the <span class=
|
||||||
"filename">rclpdf.py</code> PDF input handler has more
|
"command"><strong>rclpdf.py</strong></span> PDF input
|
||||||
complex capabilities than most others, and it is also more
|
handler has more complex capabilities than most others, and
|
||||||
configurable. Specifically, <code class=
|
it is also more configurable. Specifically, <span class=
|
||||||
"filename">rclpdf.py</code> can automatically use
|
"command"><strong>rclpdf.py</strong></span> has the
|
||||||
<span class="application">tesseract</span> to perform OCR
|
following features:</p>
|
||||||
if the document text is empty, it can be configured to
|
<div class="itemizedlist">
|
||||||
extract specific metadata tags from an XMP packet, and to
|
<ul class="itemizedlist" style="list-style-type: disc;">
|
||||||
extract PDF attachments.</p>
|
<li class="listitem">
|
||||||
<p>The PDF handler can execute an external program to run
|
<p>It can be configured to extract specific metadata
|
||||||
OCR if no text is found in the document. This is now
|
tags from an XMP packet.</p>
|
||||||
described in a <a class="link" href="#RCL.INDEXING.OCR"
|
</li>
|
||||||
title="2.9. Recoll and OCR">separate section</a>.</p>
|
<li class="listitem">
|
||||||
|
<p>It can extract PDF attachments.</p>
|
||||||
|
</li>
|
||||||
|
<li class="listitem">
|
||||||
|
<p>It can automatically perform OCR if the document
|
||||||
|
text is empty. This is done by executing an external
|
||||||
|
program and is now described in a <a class="link"
|
||||||
|
href="#RCL.INDEXING.OCR" title=
|
||||||
|
"2.9. Recoll and OCR">separate section</a>,
|
||||||
|
because the OCR framework can also be used with
|
||||||
|
non-PDF image files.</p>
|
||||||
|
</li>
|
||||||
|
</ul>
|
||||||
|
</div>
|
||||||
<div class="sect2">
|
<div class="sect2">
|
||||||
<div class="titlepage">
|
<div class="titlepage">
|
||||||
<div>
|
<div>
|
||||||
@ -2270,8 +2283,14 @@ metadatacmds = ; <em class=
|
|||||||
</li>
|
</li>
|
||||||
</ul>
|
</ul>
|
||||||
</div>
|
</div>
|
||||||
<p>Configuration. See the <a class="link" href=
|
<p>To enable this feature, you need to install one of the
|
||||||
"#RCL.INSTALL.CONFIG.RECOLLCONF.OCR" title=
|
supported OCR applications (<span class=
|
||||||
|
"application">tesseract</span> or <span class=
|
||||||
|
"application">ABBYY</span>), enable OCR in the PDF handler,
|
||||||
|
and tell <span class="application">Recoll</span> where the
|
||||||
|
appropriate command resides. The last parts are done by
|
||||||
|
setting configuration variables. See the <a class="link"
|
||||||
|
href="#RCL.INSTALL.CONFIG.RECOLLCONF.OCR" title=
|
||||||
"Parameters for OCR processing">relevant section</a>. All
|
"Parameters for OCR processing">relevant section</a>. All
|
||||||
parameters can be localized in subdirectories through the
|
parameters can be localized in subdirectories through the
|
||||||
usual main configuration mechanism (path sections).</p>
|
usual main configuration mechanism (path sections).</p>
|
||||||
|
|||||||
@ -1402,21 +1402,27 @@ metadatacmds = ; <replaceable>tags</replaceable> = tmsu tags %f
|
|||||||
<title>The PDF input handler</title>
|
<title>The PDF input handler</title>
|
||||||
|
|
||||||
<para>The PDF format is very important for scientific and technical
|
<para>The PDF format is very important for scientific and technical
|
||||||
documentation, and document archival. It has extensive
|
documentation, and document archival. It has extensive
|
||||||
facilities for storing metadata along with the document, and these
|
facilities for storing metadata along with the document, and these
|
||||||
facilities are actually used in the real world.</para>
|
facilities are actually used in the real world.</para>
|
||||||
|
|
||||||
<para>In consequence, the <filename>rclpdf.py</filename> PDF input
|
<para>In consequence, the <command>rclpdf.py</command> PDF input
|
||||||
handler has more complex capabilities than most others, and it is
|
handler has more complex capabilities than most others, and it is
|
||||||
also more configurable. Specifically, <filename>rclpdf.py</filename>
|
also more configurable. Specifically, <command>rclpdf.py</command>
|
||||||
can automatically use <application>tesseract</application> to perform
|
has the following features:
|
||||||
OCR if the document text is empty, it can be configured to extract
|
<itemizedlist>
|
||||||
specific metadata tags from an XMP packet, and to extract PDF
|
<listitem><para>It can be configured to extract
|
||||||
attachments.</para>
|
specific metadata tags from an XMP packet.</para></listitem>
|
||||||
|
<listitem><para>It can extract PDF
|
||||||
<para>The PDF handler can execute an external program to run OCR if
|
attachments.</para></listitem>
|
||||||
no text is found in the document. This is now described in a
|
<listitem><para>It can automatically perform
|
||||||
<link linkend="RCL.INDEXING.OCR">separate section</link>.</para>
|
OCR if the document text is empty. This is done by
|
||||||
|
executing an external program and is now described in a
|
||||||
|
<link linkend="RCL.INDEXING.OCR">separate
|
||||||
|
section</link>, because the OCR framework can also be used
|
||||||
|
with non-PDF image files.</para></listitem>
|
||||||
|
</itemizedlist>
|
||||||
|
</para>
|
||||||
|
|
||||||
<sect2 id="RCL.INDEXING.PDF.XMP">
|
<sect2 id="RCL.INDEXING.PDF.XMP">
|
||||||
<title>XMP fields extraction</title>
|
<title>XMP fields extraction</title>
|
||||||
@ -1477,7 +1483,7 @@ metadatacmds = ; <replaceable>tags</replaceable> = tmsu tags %f
|
|||||||
<title>PDF attachment indexing</title>
|
<title>PDF attachment indexing</title>
|
||||||
|
|
||||||
<para>If <application>pdftk</application> is installed, and if the
|
<para>If <application>pdftk</application> is installed, and if the
|
||||||
the
|
the
|
||||||
<link linkend="RCL.INSTALL.CONFIG.RECOLLCONF.PDFATTACH">pdfattach</link>
|
<link linkend="RCL.INSTALL.CONFIG.RECOLLCONF.PDFATTACH">pdfattach</link>
|
||||||
configuration variable is set, the PDF input handler will try to
|
configuration variable is set, the PDF input handler will try to
|
||||||
extract PDF attachments for indexing as sub-documents of the PDF
|
extract PDF attachments for indexing as sub-documents of the PDF
|
||||||
@ -1489,6 +1495,7 @@ metadatacmds = ; <replaceable>tags</replaceable> = tmsu tags %f
|
|||||||
|
|
||||||
</sect1>
|
</sect1>
|
||||||
|
|
||||||
|
|
||||||
<sect1 id="RCL.INDEXING.OCR">
|
<sect1 id="RCL.INDEXING.OCR">
|
||||||
<title>Recoll and OCR</title>
|
<title>Recoll and OCR</title>
|
||||||
|
|
||||||
@ -1521,8 +1528,13 @@ metadatacmds = ; <replaceable>tags</replaceable> = tmsu tags %f
|
|||||||
</itemizedlist>
|
</itemizedlist>
|
||||||
</para>
|
</para>
|
||||||
|
|
||||||
<para>Configuration. See the
|
<para>To enable this feature, you need to install one of
|
||||||
<link linkend="RCL.INSTALL.CONFIG.RECOLLCONF.OCR">
|
the supported OCR applications
|
||||||
|
(<application>tesseract</application>
|
||||||
|
or <application>ABBYY</application>), enable OCR in the PDF
|
||||||
|
handler, and tell &RCL; where the appropriate command resides. The
|
||||||
|
last parts are done by setting configuration variables. See the
|
||||||
|
<link linkend="RCL.INSTALL.CONFIG.RECOLLCONF.OCR">
|
||||||
relevant section</link>. All parameters can be localized in
|
relevant section</link>. All parameters can be localized in
|
||||||
subdirectories through the usual main configuration mechanism (path
|
subdirectories through the usual main configuration mechanism (path
|
||||||
sections).</para>
|
sections).</para>
|
||||||
|
|||||||
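For orientation, the settings that the new "To enable this feature" paragraph and the attachment hunk refer to might look like the fragment below. This is a minimal sketch, not part of the commit: only `pdfattach` is named in this diff. The other variable names (`pdfocr`, `ocrprogs`, `tesseractcmd`) are assumed from the Recoll configuration reference and should be checked against the linked "Parameters for OCR processing" section. The command path and the directory are hypothetical.

```
# recoll.conf (sketch; every name except pdfattach is an assumption, see above)

# Index PDF attachments as sub-documents (pdftk must be installed).
pdfattach = 1

# Enable OCR in the PDF handler for documents with no text.
pdfocr = 1
# OCR application(s) to try.
ocrprogs = tesseract
# Where the OCR command resides (hypothetical path).
tesseractcmd = /usr/bin/tesseract

# Path section: localize a parameter to one subdirectory (hypothetical path).
[/home/me/scans/private]
pdfocr = 0
```

The path section at the end illustrates the "localized in subdirectories" remark: values set under it apply only to that subtree.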