document the new ocr function and its config

Jean-Francois Dockes 2020-02-27 18:17:51 +01:00
parent 40ead3aa7e
commit 17d29774b0
4 changed files with 338 additions and 134 deletions
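
For orientation, a minimal sketch of how the parameters documented in this commit might be combined in a recoll.conf. The variable names come from the documentation changed below; the directory in the path section is a hypothetical example, and the values are illustrative rather than recommended settings.

```
# Enable OCR from the PDF handler; only has an effect if ocrprogs is set
pdfocr = 1
# Modules are tried in order; the first one able to process the file is used
ocrprogs = tesseract
tesseractlang = eng

# Hypothetical subtree holding French scans: override the language locally
[~/Documents/scans-fr]
tesseractlang = fra
```

Leaving ocrcachedir unset should place the cached OCR data under $RECOLL_CONFDIR/ocrcache, as described in the ocrcachedir entry.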


@ -247,8 +247,8 @@ will reduce the index size. This can only be set for a whole index, not
for a subtree.</para></listitem></varlistentry>
<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.DEHYPHENATE">
<term><varname>dehyphenate</varname></term>
<listitem><para>Determines if we index
'coworker' also when the input is 'co-worker'. This is new
<listitem><para>Determines if we index 'coworker'
also when the input is 'co-worker'. This is new
in version 1.22, and on by default. Setting the variable to off allows
restoring the previous behaviour.</para></listitem></varlistentry>
<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.BACKSLASHASLETTER">
@ -279,7 +279,8 @@ as large.</para></listitem></varlistentry>
<term><varname>indexstemminglanguages</varname></term>
<listitem><para>Languages for which to create stemming expansion
data. Stemmer names can be found by executing 'recollindex
-l', or this can also be set from a list in the GUI.</para></listitem></varlistentry>
-l', or this can also be set from a list in the GUI. The values are full
language names, e.g. english, french...</para></listitem></varlistentry>
<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.DEFAULTCHARSET">
<term><varname>defaultcharset</varname></term>
<listitem><para>Default character
@ -608,9 +609,9 @@ space issues.</para></listitem></varlistentry>
<term><varname>aspellLanguage</varname></term>
<listitem><para>Language definitions to use when creating the aspell
dictionary. The value must match a set of aspell language
definition files. You can type "aspell dicts" to see a list The default
if this is not set is to use the NLS environment to guess the
value.</para></listitem></varlistentry>
definition files. You can type "aspell dicts" to see a list. The default
if this is not set is to use the NLS environment to guess the value. The
values are the 2-letter language codes (e.g. 'en', 'fr'...).</para></listitem></varlistentry>
<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.ASPELLADDCREATEPARAM">
<term><varname>aspellAddCreateParam</varname></term>
<listitem><para>Additional option and parameter to aspell dictionary creation
@ -650,14 +651,20 @@ patterns are matched with fnmatch(pattern, path, 0) You can quote entries
containing white space with double quotes (quote the whole entry, not the
pattern). The default is empty.
Example: mondelaypatterns = *.log:20 "*with spaces.*:30"</para></listitem></varlistentry>
<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.IDXNICEPRIO">
<term><varname>idxniceprio</varname></term>
<listitem><para>"nice" process priority for the indexing processes. Default: 19
(lowest). Appeared with 1.26.5. Prior versions were fixed at 19.</para></listitem></varlistentry>
<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.MONIONICECLASS">
<term><varname>monioniceclass</varname></term>
<listitem><para>ionice class for the real time indexing process On platforms where this is supported. The default value is
3.</para></listitem></varlistentry>
<listitem><para>ionice class for the indexing process. Despite the misleading name, this affects all
indexing processes, not only the real time/monitoring ones, on platforms
where ionice is supported. The default value is 3 (use
lowest "Idle" priority).</para></listitem></varlistentry>
<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.MONIONICECLASSDATA">
<term><varname>monioniceclassdata</varname></term>
<listitem><para>ionice class parameter for the real time indexing process. On platforms where this is supported. The default is
empty.</para></listitem></varlistentry>
<listitem><para>ionice class level parameter if the class supports it. The default is empty, as the default "Idle" class has no
levels.</para></listitem></varlistentry>
</variablelist></sect3>
<sect3 id="RCL.INSTALL.CONFIG.RECOLLCONF.QUERY">
<title>Query-time parameters (no impact on the index) </title><variablelist>
@ -700,14 +707,8 @@ with possibly meaning-altering missing words.</para></listitem></varlistentry>
<title>Parameters for the PDF input script </title><variablelist>
<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.PDFOCR">
<term><varname>pdfocr</varname></term>
<listitem><para>Attempt OCR of PDF files with no text content if both tesseract and
pdftoppm are installed. This can be defined in subdirectories. The default is off because
OCR is so very slow.</para></listitem></varlistentry>
<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.PDFOCRLANG">
<term><varname>pdfocrlang</varname></term>
<listitem><para>Language to assume for PDF OCR. This is very important for having a reasonable rate of errors
with tesseract. This can also be set through a configuration variable
or directory-local parameters. See the rclpdf.py script.</para></listitem></varlistentry>
<listitem><para>Attempt OCR of PDF files with no text content. This can be defined in subdirectories. The default is off because
OCR is very slow. This only has an effect if ocrprogs is defined.</para></listitem></varlistentry>
<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.PDFATTACH">
<term><varname>pdfattach</varname></term>
<listitem><para>Enable PDF attachment extraction by executing pdftk (if
@ -732,6 +733,41 @@ selected field, for editing or erasing. A new instance is created for
each document, so that the object can keep state for, e.g. eliminating
duplicate values.</para></listitem></varlistentry>
</variablelist></sect3>
<sect3 id="RCL.INSTALL.CONFIG.RECOLLCONF.OCR">
<title>Parameters for OCR processing </title><variablelist>
<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.OCRPROGS">
<term><varname>ocrprogs</varname></term>
<listitem><para>OCR modules to try. The top OCR script will try to load the corresponding modules in
order and use the first which reports being capable of performing OCR on
the input file. Modules for tesseract and ABBYY FineReader are present in
the standard distribution.</para></listitem></varlistentry>
<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.OCRCACHEDIR">
<term><varname>ocrcachedir</varname></term>
<listitem><para>Location for caching OCR data. The default if this is empty or undefined is to store the cached
OCR data under $RECOLL_CONFDIR/ocrcache.</para></listitem></varlistentry>
<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.TESSERACTLANG">
<term><varname>tesseractlang</varname></term>
<listitem><para>Language to assume for tesseract OCR. Important for improving the OCR accuracy. This can also be set
through the contents of a file in
the currently processed directory. See the rclocrtesseract.py
script. Example values: eng, fra... See the tesseract documentation.</para></listitem></varlistentry>
<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.TESSERACTCMD">
<term><varname>tesseractcmd</varname></term>
<listitem><para>Path for the tesseract command. This is mostly useful on Windows, or for specifying a non-default
tesseract command. e.g. on Windows:
C:/Program&nbsp;Files&nbsp;(x86)/Tesseract-OCR/tesseract.exe</para></listitem></varlistentry>
<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.ABBYYLANG">
<term><varname>abbyylang</varname></term>
<listitem><para>Language to assume for ABBYY OCR. Important for improving the OCR accuracy. This can also be set
through the contents of a file in
the currently processed directory. See the rclocrabbyy.py
script. Typical values: English, French... See the ABBYY documentation.
</para></listitem></varlistentry>
<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.ABBYYCMD">
<term><varname>abbyycmd</varname></term>
<listitem><para>Path for the ABBYY command. The ABBYY directory is usually not in the path, so you should set this.
</para></listitem></varlistentry>
</variablelist></sect3>
<sect3 id="RCL.INSTALL.CONFIG.RECOLLCONF.SPECLOCATIONS">
<title>Parameters set for specific locations </title><variablelist>
<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.MHMBOXQUIRKS">


@ -3,7 +3,7 @@
<html>
<head>
<meta name="generator" content=
"HTML Tidy for HTML5 for Linux version 5.2.0">
"HTML Tidy for HTML5 for Linux version 5.6.0">
<meta http-equiv="Content-Type" content=
"text/html; charset=utf-8">
<title>Recoll user manual</title>
@ -157,20 +157,19 @@ alink="#0000FF">
<dd>
<dl>
<dt><span class="sect2">2.8.1. <a href=
"#RCL.INDEXING.PDF.OCR">OCR with
Tesseract</a></span></dt>
<dt><span class="sect2">2.8.2. <a href=
"#RCL.INDEXING.PDF.XMP">XMP fields
extraction</a></span></dt>
<dt><span class="sect2">2.8.3. <a href=
<dt><span class="sect2">2.8.2. <a href=
"#RCL.INDEXING.PDF.ATTACH">PDF attachment
indexing</a></span></dt>
</dl>
</dd>
<dt><span class="sect1">2.9. <a href=
"#RCL.INDEXING.OCR">Recoll and OCR</a></span></dt>
<dt><span class="sect1">2.10. <a href=
"#RCL.INDEXING.PERIODIC">Periodic
indexing</a></span></dt>
<dt><span class="sect1">2.10. <a href=
<dt><span class="sect1">2.11. <a href=
"#RCL.INDEXING.MONITOR"><span class=
"application">Unix</span>-like systems: real time
indexing</a></span></dt>
@ -781,7 +780,7 @@ alink="#0000FF">
"list-style-type: disc;">
<li class="listitem">
<p><b><a class="link" href="#RCL.INDEXING.PERIODIC"
title="2.9.&nbsp;Periodic indexing">Periodic (or
title="2.10.&nbsp;Periodic indexing">Periodic (or
batch) indexing</a> .&nbsp;</b><span class=
"command"><strong>recollindex</strong></span> is
executed at discrete times. On <span class=
@ -799,7 +798,7 @@ alink="#0000FF">
<li class="listitem">
<p><b><a class="link" href="#RCL.INDEXING.MONITOR"
title=
"2.10.&nbsp;Unix-like systems: real time indexing">Real
"2.11.&nbsp;Unix-like systems: real time indexing">Real
time indexing</a> .&nbsp;</b>(Only available on
<span class="application">Unix</span>-like
systems). <span class=
@ -831,7 +830,7 @@ alink="#0000FF">
indexing on a small home directory), or, with
<span class="application">Recoll</span> 1.24 and newer,
by <a class="link" href="#RCL.INDEXING.MONITOR" title=
"2.10.&nbsp;Unix-like systems: real time indexing">configuring
"2.11.&nbsp;Unix-like systems: real time indexing">configuring
the index so that only a subset of the tree will be
monitored.</a></p>
<p>The choice of method and the parameters used can be
@ -1136,8 +1135,8 @@ alink="#0000FF">
different areas of the file system to different
indexes. For example, if you were to issue the
following command:</p>
<pre class="programlisting">
recoll -c ~/.indexes-email</pre>
<pre class=
"programlisting">recoll -c ~/.indexes-email</pre>
<p>Then <span class="application">Recoll</span> would
use configuration files stored in <code class=
"filename">~/.indexes-email/</code> and, (unless
@ -2141,45 +2140,16 @@ metadatacmds = ; <em class=
if the document text is empty, it can be configured to
extract specific metadata tags from an XMP packet, and to
extract PDF attachments.</p>
<div class="sect2">
<div class="titlepage">
<div>
<div>
<h3 class="title"><a name="RCL.INDEXING.PDF.OCR"
id="RCL.INDEXING.PDF.OCR"></a>2.8.1.&nbsp;OCR with
Tesseract</h3>
</div>
</div>
</div>
<p>If both <span class="application">tesseract</span> and
<span class="command"><strong>pdftoppm</strong></span>
(generally from the <span class=
"application">poppler-utils</span> package) are
installed, the PDF handler may attempt OCR on PDF files
with no text content. This is controlled by the <a class=
"link" href=
"#RCL.INSTALL.CONFIG.RECOLLCONF.PDFOCR">pdfocr</a>
configuration variable, which is false by default because
OCR is very slow.</p>
<p>The choice of language is very important for
successfull OCR. Recoll has currently no way to determine
this from the document itself. You can set the language
to use through the contents of a <code class=
"filename">.ocrpdflang</code> text file in the same
directory as the PDF document, or through the
<code class="envar">RECOLL_TESSERACT_LANG</code>
environment variable, or through the contents of an
<code class="filename">ocrpdf</code> text file inside the
configuration directory. If none of the above are used,
<span class="application">Recoll</span> will try to guess
the language from the NLS environment.</p>
</div>
<p>The PDF handler can execute an external program to run
OCR if no text is found in the document. This is now
described in a <a class="link" href="#RCL.INDEXING.OCR"
title="2.9.&nbsp;Recoll and OCR">separate section</a>.</p>
<div class="sect2">
<div class="titlepage">
<div>
<div>
<h3 class="title"><a name="RCL.INDEXING.PDF.XMP"
id="RCL.INDEXING.PDF.XMP"></a>2.8.2.&nbsp;XMP
id="RCL.INDEXING.PDF.XMP"></a>2.8.1.&nbsp;XMP
fields extraction</h3>
</div>
</div>
@ -2236,7 +2206,7 @@ metadatacmds = ; <em class=
<div>
<div>
<h3 class="title"><a name="RCL.INDEXING.PDF.ATTACH"
id="RCL.INDEXING.PDF.ATTACH"></a>2.8.3.&nbsp;PDF
id="RCL.INDEXING.PDF.ATTACH"></a>2.8.2.&nbsp;PDF
attachment indexing</h3>
</div>
</div>
@ -2252,13 +2222,67 @@ metadatacmds = ; <em class=
uncommon in my experience).</p>
</div>
</div>
<div class="sect1">
<div class="titlepage">
<div>
<div>
<h2 class="title" style="clear: both"><a name=
"RCL.INDEXING.OCR" id=
"RCL.INDEXING.OCR"></a>2.9.&nbsp;Recoll and OCR</h2>
</div>
</div>
</div>
<p>This is new in <span class="application">Recoll</span>
1.26.5. Older versions had a more limited, non-caching
capability to execute an external OCR program in the PDF
handler. The new function has the following features:</p>
<div class="itemizedlist">
<ul class="itemizedlist" style="list-style-type: disc;">
<li class="listitem">
<p>The OCR output is cached, stored as separate
files. The caching is ultimately based on a hash
value of the original file contents, so that it is
immune to file renames. A first path-based layer
ensures fast operation for unchanged (unmoved) files,
and the data hash (which is still orders of magnitude
faster than OCR) is only re-computed if the file has
moved. OCR is only performed if the file was not
previously processed or if it changed.</p>
</li>
<li class="listitem">
<p>The support for a specific program is implemented
in a simple Python module. It should be
straightforward to add support for any OCR engine
with a capability to run from the command line.</p>
</li>
<li class="listitem">
<p>Modules initially exist for <span class=
"application">tesseract</span> (Linux and Windows),
and <span class="application">ABBYY FineReader</span>
(Linux, tested with version 11). ABBYY FineReader is
a commercial closed source program, but it sometimes
performs better than tesseract.</p>
</li>
<li class="listitem">
<p>The OCR is currently only called from the PDF
handler, but there should be no problem using it for
other image types.</p>
</li>
</ul>
</div>
<p>Configuration. See the <a class="link" href=
"#RCL.INSTALL.CONFIG.RECOLLCONF.OCR" title=
"Parameters for OCR processing">relevant section</a>. All
parameters can be localized in subdirectories through the
usual main configuration mechanism (path sections).</p>
</div>
<div class="sect1">
<div class="titlepage">
<div>
<div>
<h2 class="title" style="clear: both"><a name=
"RCL.INDEXING.PERIODIC" id=
"RCL.INDEXING.PERIODIC"></a>2.9.&nbsp;Periodic
"RCL.INDEXING.PERIODIC"></a>2.10.&nbsp;Periodic
indexing</h2>
</div>
</div>
@ -2431,7 +2455,7 @@ metadatacmds = ; <em class=
<div>
<h2 class="title" style="clear: both"><a name=
"RCL.INDEXING.MONITOR" id=
"RCL.INDEXING.MONITOR"></a>2.10.&nbsp;<span class=
"RCL.INDEXING.MONITOR"></a>2.11.&nbsp;<span class=
"application">Unix</span>-like systems: real time
indexing</h2>
</div>
@ -3759,8 +3783,8 @@ fs.inotify.max_user_watches=32768
that every user does not have to do it. The variable
should define a colon-separated list of index
directories, ie:</p>
<pre class="screen">
export RECOLL_EXTRA_DBS=/some/place/xapiandb:/some/other/db</pre>
<pre class=
"screen">export RECOLL_EXTRA_DBS=/some/place/xapiandb:/some/other/db</pre>
<p>Another environment variable, <code class=
"envar">RECOLL_ACTIVE_EXTRA_DBS</code> allows adding to
the active list of indexes. This variable was suggested
@ -4565,8 +4589,8 @@ fs.inotify.max_user_watches=32768
parent folder expansion, usually creating a file
manager window on the folder where the container file
resides. E.g.:</p>
<pre class="programlisting">
&lt;a href="F%N"&gt;%P&lt;/a&gt;</pre>
<pre class=
"programlisting">&lt;a href="F%N"&gt;%P&lt;/a&gt;</pre>
<p>A link target defined as <code class=
"literal">R%N|<em class=
"replaceable"><code>scriptname</code></em></code>
@ -4708,8 +4732,8 @@ fs.inotify.max_user_watches=32768
<span class="application">javascript</span> program to
the documents, like the following example, which would
initiate a search by double-clicking any term:</p>
<pre class="programlisting">
&lt;script language="JavaScript"&gt;
<pre class=
"programlisting">&lt;script language="JavaScript"&gt;
function recollsearch() {
var t = document.getSelection();
window.location.href = 'recoll://search/query?qtp=a&amp;p=0&amp;q=' +
@ -8838,7 +8862,8 @@ for i in range(nres):
<p>Languages for which to create stemming
expansion data. Stemmer names can be found by
executing 'recollindex -l', or this can also be
set from a list in the GUI.</p>
set from a list in the GUI. The values are full
language names, e.g. english, french...</p>
</dd>
<dt><a name=
"RCL.INSTALL.CONFIG.RECOLLCONF.DEFAULTCHARSET" id=
@ -9425,7 +9450,8 @@ for i in range(nres):
aspell language definition files. You can type
"aspell dicts" to see a list. The default if this
is not set is to use the NLS environment to guess
the value.</p>
the value. The values are the 2-letter language
codes (e.g. 'en', 'fr'...)</p>
</dd>
<dt><a name=
"RCL.INSTALL.CONFIG.RECOLLCONF.ASPELLADDCREATEPARAM"
@ -9500,21 +9526,32 @@ for i in range(nres):
*.log:20 "*with spaces.*:30"</p>
</dd>
<dt><a name=
"RCL.INSTALL.CONFIG.RECOLLCONF.IDXNICEPRIO" id=
"RCL.INSTALL.CONFIG.RECOLLCONF.IDXNICEPRIO"></a><span class="term"><code class="varname">idxniceprio</code></span></dt>
<dd>
<p>"nice" process priority for the indexing
processes. Default: 19 (lowest). Appeared with
1.26.5. Prior versions were fixed at 19.</p>
</dd>
<dt><a name=
"RCL.INSTALL.CONFIG.RECOLLCONF.MONIONICECLASS" id=
"RCL.INSTALL.CONFIG.RECOLLCONF.MONIONICECLASS"></a><span class="term"><code class="varname">monioniceclass</code></span></dt>
<dd>
<p>ionice class for the real time indexing
process On platforms where this is supported. The
default value is 3.</p>
<p>ionice class for the indexing process. Despite
the misleading name, this affects all indexing
processes, not only the real time/monitoring
ones, on platforms where ionice is supported.
The default value is 3 (use lowest "Idle"
priority).</p>
</dd>
<dt><a name=
"RCL.INSTALL.CONFIG.RECOLLCONF.MONIONICECLASSDATA"
id=
"RCL.INSTALL.CONFIG.RECOLLCONF.MONIONICECLASSDATA"></a><span class="term"><code class="varname">monioniceclassdata</code></span></dt>
<dd>
<p>ionice class parameter for the real time
indexing process. On platforms where this is
supported. The default is empty.</p>
<p>ionice class level parameter if the class
supports it. The default is empty, as the default
"Idle" class has no levels.</p>
</dd>
</dl>
</div>
@ -9611,20 +9648,10 @@ for i in range(nres):
id=
"RCL.INSTALL.CONFIG.RECOLLCONF.PDFOCR"></a><span class="term"><code class="varname">pdfocr</code></span></dt>
<dd>
<p>Attempt OCR of PDF files with no text content
if both tesseract and pdftoppm are installed.
<p>Attempt OCR of PDF files with no text content.
This can be defined in subdirectories. The
default is off because OCR is so very slow.</p>
</dd>
<dt><a name=
"RCL.INSTALL.CONFIG.RECOLLCONF.PDFOCRLANG" id=
"RCL.INSTALL.CONFIG.RECOLLCONF.PDFOCRLANG"></a><span class="term"><code class="varname">pdfocrlang</code></span></dt>
<dd>
<p>Language to assume for PDF OCR. This is very
important for having a reasonable rate of errors
with tesseract. This can also be set through a
configuration variable or directory-local
parameters. See the rclpdf.py script.</p>
default is off because OCR is very slow. This
only has an effect if ocrprogs is defined.</p>
</dd>
<dt><a name=
"RCL.INSTALL.CONFIG.RECOLLCONF.PDFATTACH" id=
@ -9666,6 +9693,80 @@ for i in range(nres):
</dl>
</div>
</div>
<div class="sect3">
<div class="titlepage">
<div>
<div>
<h4 class="title"><a name=
"RCL.INSTALL.CONFIG.RECOLLCONF.OCR" id=
"RCL.INSTALL.CONFIG.RECOLLCONF.OCR"></a>Parameters
for OCR processing</h4>
</div>
</div>
</div>
<div class="variablelist">
<dl class="variablelist">
<dt><a name=
"RCL.INSTALL.CONFIG.RECOLLCONF.OCRPROGS" id=
"RCL.INSTALL.CONFIG.RECOLLCONF.OCRPROGS"></a><span class="term"><code class="varname">ocrprogs</code></span></dt>
<dd>
<p>OCR modules to try. The top OCR script will
try to load the corresponding modules in order
and use the first which reports being capable of
performing OCR on the input file. Modules for
tesseract and ABBYY FineReader are present in the
standard distribution.</p>
</dd>
<dt><a name=
"RCL.INSTALL.CONFIG.RECOLLCONF.OCRCACHEDIR" id=
"RCL.INSTALL.CONFIG.RECOLLCONF.OCRCACHEDIR"></a><span class="term"><code class="varname">ocrcachedir</code></span></dt>
<dd>
<p>Location for caching OCR data. The default if
this is empty or undefined is to store the cached
OCR data under $RECOLL_CONFDIR/ocrcache.</p>
</dd>
<dt><a name=
"RCL.INSTALL.CONFIG.RECOLLCONF.TESSERACTLANG" id=
"RCL.INSTALL.CONFIG.RECOLLCONF.TESSERACTLANG"></a><span class="term"><code class="varname">tesseractlang</code></span></dt>
<dd>
<p>Language to assume for tesseract OCR.
Important for improving the OCR accuracy. This
can also be set through the contents of a file in
the currently processed directory. See the
rclocrtesseract.py script. Example values: eng,
fra... See the tesseract documentation.</p>
</dd>
<dt><a name=
"RCL.INSTALL.CONFIG.RECOLLCONF.TESSERACTCMD" id=
"RCL.INSTALL.CONFIG.RECOLLCONF.TESSERACTCMD"></a><span class="term"><code class="varname">tesseractcmd</code></span></dt>
<dd>
<p>Path for the tesseract command. This is mostly
useful on Windows, or for specifying a
non-default tesseract command. e.g. on Windows:
C:/Program&nbsp;Files&nbsp;(x86)/Tesseract-OCR/tesseract.exe</p>
</dd>
<dt><a name=
"RCL.INSTALL.CONFIG.RECOLLCONF.ABBYYLANG" id=
"RCL.INSTALL.CONFIG.RECOLLCONF.ABBYYLANG"></a><span class="term"><code class="varname">abbyylang</code></span></dt>
<dd>
<p>Language to assume for ABBYY OCR. Important
for improving the OCR accuracy. This can also be
set through the contents of a file in the
currently processed directory. See the
rclocrabbyy.py script. Typical values: English,
French... See the ABBYY documentation.</p>
</dd>
<dt><a name=
"RCL.INSTALL.CONFIG.RECOLLCONF.ABBYYCMD" id=
"RCL.INSTALL.CONFIG.RECOLLCONF.ABBYYCMD"></a><span class="term"><code class="varname">abbyycmd</code></span></dt>
<dd>
<p>Path for the ABBYY command. The ABBYY directory
is usually not in the path, so you should set
this.</p>
</dd>
</dl>
</div>
</div>
<div class="sect3">
<div class="titlepage">
<div>
@ -9858,8 +9959,8 @@ for i in range(nres):
"filename">.xml</code> extension but should be handled
specially, which is possible because they are usually all
located in one place. Example:</p>
<pre class="programlisting">
[~/.kde/share/apps/okular/docdata]
<pre class=
"programlisting">[~/.kde/share/apps/okular/docdata]
.xml = application/x-okular-notes</pre>
<p>The <code class="varname">recoll_noindex</code>
<code class="filename">mimemap</code> variable has been


@ -1414,30 +1414,9 @@ metadatacmds = ; <replaceable>tags</replaceable> = tmsu tags %f
specific metadata tags from an XMP packet, and to extract PDF
attachments.</para>
<sect2 id="RCL.INDEXING.PDF.OCR">
<title>OCR with Tesseract</title>
<para>If both <application>tesseract</application> and
<command>pdftoppm</command> (generally from the
<application>poppler-utils</application> package) are installed,
the PDF handler may attempt OCR on PDF files with no text
content. This is controlled by the
<link linkend="RCL.INSTALL.CONFIG.RECOLLCONF.PDFOCR">pdfocr</link>
configuration variable, which is false by default because
OCR is very slow.</para>
<para>The choice of language is very important for successfull
OCR. Recoll has currently no way to determine this from the
document itself. You can set the language to use through the
contents of a <filename>.ocrpdflang</filename> text file in the
same directory as the PDF document, or through the
<envar>RECOLL_TESSERACT_LANG</envar> environment variable, or
through the contents of an <filename>ocrpdf</filename> text file
inside the configuration directory. If none of the above are used,
&RCL; will try to guess the language from the NLS
environment.</para>
</sect2>
<para>The PDF handler can execute an external program to run OCR if
no text is found in the document. This is now described in a
<link linkend="RCL.INDEXING.OCR">separate section</link>.</para>
<sect2 id="RCL.INDEXING.PDF.XMP">
<title>XMP fields extraction</title>
@ -1510,6 +1489,47 @@ metadatacmds = ; <replaceable>tags</replaceable> = tmsu tags %f
</sect1>
<sect1 id="RCL.INDEXING.OCR">
<title>Recoll and OCR</title>
<para>This is new in &RCL; 1.26.5. Older versions had a more limited,
non-caching capability to execute an external OCR program in the PDF
handler. The new function has the following features:
<itemizedlist>
<listitem><para>The OCR output is cached, stored as separate
files. The caching is ultimately based on a hash value of the
original file contents, so that it is immune to file renames. A
first path-based layer ensures fast operation for unchanged
(unmoved) files, and the data hash (which is still orders of
magnitude faster than OCR) is only re-computed if the file has
moved. OCR is only performed if the file was not previously
processed or if it changed.</para></listitem>
<listitem><para>The support for a specific program is implemented
in a simple Python module. It should be straightforward to add
support for any OCR engine with a capability to run from the
command line.</para></listitem>
<listitem><para>Modules initially exist for
<application>tesseract</application> (Linux and Windows), and
<application>ABBYY FineReader</application> (Linux, tested with
version 11). ABBYY FineReader is a commercial closed source
program, but it sometimes performs better than
tesseract.</para></listitem>
<listitem><para>The OCR is currently only called from the PDF
handler, but there should be no problem using it for other image
types.</para></listitem>
</itemizedlist>
</para>
<para>Configuration. See the
<link linkend="RCL.INSTALL.CONFIG.RECOLLCONF.OCR">
relevant section</link>. All parameters can be localized in
subdirectories through the usual main configuration mechanism (path
sections).</para>
</sect1>
<sect1 id="RCL.INDEXING.PERIODIC">
<title>Periodic indexing</title>
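
The two-layer caching described in the "Recoll and OCR" section above can be sketched roughly as follows. This is an illustration only, not the code shipped with Recoll: the class name, the in-memory dictionaries (the real cache stores separate files), the use of SHA-256, and the (mtime, size) signature for detecting unchanged files are all assumptions made for the example.

```python
import hashlib
import os


class OcrCache:
    """Toy two-layer OCR cache kept in memory (hypothetical, for illustration)."""

    def __init__(self, run_ocr):
        self.run_ocr = run_ocr   # callable(path) -> text, stands in for an OCR module
        self.by_path = {}        # path -> ((mtime_ns, size), content hash)
        self.by_hash = {}        # content hash -> OCR text

    @staticmethod
    def _signature(path):
        # Cheap change detector for a known path (assumed: mtime and size).
        st = os.stat(path)
        return (st.st_mtime_ns, st.st_size)

    @staticmethod
    def _content_hash(path):
        # Hash of the file data (assumed: SHA-256), read in chunks.
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 16), b""):
                h.update(chunk)
        return h.hexdigest()

    def get(self, path):
        sig = self._signature(path)
        entry = self.by_path.get(path)
        if entry is not None and entry[0] == sig:
            # Layer 1: known path, unchanged file. No hashing, no OCR.
            return self.by_hash[entry[1]]
        # Layer 2: new, moved or changed file. Hash the data, which is
        # much cheaper than OCR, and run OCR only for unseen contents.
        digest = self._content_hash(path)
        if digest not in self.by_hash:
            self.by_hash[digest] = self.run_ocr(path)
        self.by_path[path] = (sig, digest)
        return self.by_hash[digest]
```

With this layout a renamed file costs one hash computation but no OCR run, and a file whose contents changed is processed again, which matches the behaviour the section describes.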


@ -350,7 +350,8 @@ indexStoreDocText = 1
#
# <brief>Languages for which to create stemming expansion
# data.</brief><descr>Stemmer names can be found by executing 'recollindex
# -l', or this can also be set from a list in the GUI.</descr></var>
# -l', or this can also be set from a list in the GUI. The values are full
# language names, e.g. english, french...</descr></var>
indexstemminglanguages = english
# <var name="defaultcharset" type="string"><brief>Default character
@ -760,9 +761,9 @@ checkneedretryindexscript = rclcheckneedretry.sh
#
# <brief>Language definitions to use when creating the aspell
# dictionary.</brief><descr>The value must match a set of aspell language
# definition files. You can type "aspell dicts" to see a list The default
# if this is not set is to use the NLS environment to guess the
# value.</descr></var>
# definition files. You can type "aspell dicts" to see a list. The default
# if this is not set is to use the NLS environment to guess the value. The
# values are the 2-letter language codes (e.g. 'en', 'fr'...).</descr></var>
#aspellLanguage = en
# <var name="aspellAddCreateParam" type="string">
@ -902,19 +903,11 @@ snippetMaxPosWalk = 1000000
# <var name="pdfocr" type="bool">
#
# <brief>Attempt OCR of PDF files with no text content if both tesseract and
# pdftoppm are installed.</brief>
# <brief>Attempt OCR of PDF files with no text content.</brief>
# <descr>This can be defined in subdirectories. The default is off because
# OCR is so very slow.</descr></var>
#pdfocr = 0
# <var name="pdfocrlang" type="string">
# <brief>Language to assume for PDF OCR.</brief>
# <descr>This is very important for having a reasonable rate of errors
# with tesseract. This can also be set through a configuration variable
# or directory-local parameters. See the rclpdf.py script.</descr>
# OCR is very slow. This only has an effect if ocrprogs is defined.</descr>
# </var>
#pdfocrlang = eng
#pdfocr = 0
# <var name="pdfattach" type="bool">
#
@ -946,6 +939,60 @@ snippetMaxPosWalk = 1000000
#pdfextrametafix = /path/to/fixerscript.py
# <grouptitle id="OCR">Parameters for OCR processing</grouptitle>
# <var name="ocrprogs" type="string">
# <brief>OCR modules to try.</brief>
# <descr>The top OCR script will try to load the corresponding modules in
# order and use the first which reports being capable of performing OCR on
# the input file. Modules for tesseract and ABBYY FineReader are present in
# the standard distribution.</descr>
# </var>
#ocrprogs = abbyy tesseract
# <var name="ocrcachedir" type="dfn">
# <brief>Location for caching OCR data.</brief>
# <descr>The default if this is empty or undefined is to store the cached
# OCR data under $RECOLL_CONFDIR/ocrcache.</descr>
# </var>
#ocrcachedir=
# <var name="tesseractlang" type="string">
# <brief>Language to assume for tesseract OCR.</brief>
# <descr>Important for improving the OCR accuracy. This can also be set
# through the contents of a file in
# the currently processed directory. See the rclocrtesseract.py
# script. Example values: eng, fra... See the tesseract documentation.</descr>
# </var>
#tesseractlang = eng
# <var name="tesseractcmd" type="fn">
# <brief>Path for the tesseract command.</brief>
# <descr>This is mostly useful on Windows, or for specifying a non-default
# tesseract command. e.g. on Windows:
# C:/Program&nbsp;Files&nbsp;(x86)/Tesseract-OCR/tesseract.exe</descr>
# </var>
#tesseractcmd = c:/Program Files (x86)/Tesseract-OCR/tesseract.exe
# <var name="abbyylang" type="string">
# <brief>Language to assume for ABBYY OCR.</brief>
# <descr>Important for improving the OCR accuracy. This can also be set
# through the contents of a file in
# the currently processed directory. See the rclocrabbyy.py
# script. Typical values: English, French... See the ABBYY documentation.
# </descr>
# </var>
#abbyylang = English
# <var name="abbyycmd" type="fn">
# <brief>Path for the ABBYY command.</brief>
# <descr>The ABBYY directory is usually not in the path, so you should set this.
# </descr>
# </var>
#abbyycmd = /opt/ABBYYOCR11/abbyyocr11
# <grouptitle id="SPECLOCATIONS">Parameters set for specific
# locations</grouptitle>