document the new ocr function and its config

Jean-Francois Dockes 2020-02-27 18:17:51 +01:00
parent 40ead3aa7e
commit 17d29774b0
4 changed files with 338 additions and 134 deletions


@ -247,8 +247,8 @@ will reduce the index size. This can only be set for a whole index, not
for a subtree.</para></listitem></varlistentry> for a subtree.</para></listitem></varlistentry>
<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.DEHYPHENATE"> <varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.DEHYPHENATE">
<term><varname>dehyphenate</varname></term> <term><varname>dehyphenate</varname></term>
<listitem><para>Determines if we index <listitem><para>Determines if we index 'coworker'
'coworker' also when the input is 'co-worker'. This is new also when the input is 'co-worker'. This is new
in version 1.22, and on by default. Setting the variable to off allows in version 1.22, and on by default. Setting the variable to off allows
restoring the previous behaviour.</para></listitem></varlistentry> restoring the previous behaviour.</para></listitem></varlistentry>
<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.BACKSLASHASLETTER"> <varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.BACKSLASHASLETTER">
@ -279,7 +279,8 @@ as large.</para></listitem></varlistentry>
<term><varname>indexstemminglanguages</varname></term> <term><varname>indexstemminglanguages</varname></term>
<listitem><para>Languages for which to create stemming expansion <listitem><para>Languages for which to create stemming expansion
data. Stemmer names can be found by executing 'recollindex data. Stemmer names can be found by executing 'recollindex
-l', or this can also be set from a list in the GUI.</para></listitem></varlistentry> -l', or this can also be set from a list in the GUI. The values are full
language names, e.g. english, french...</para></listitem></varlistentry>
<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.DEFAULTCHARSET"> <varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.DEFAULTCHARSET">
<term><varname>defaultcharset</varname></term> <term><varname>defaultcharset</varname></term>
<listitem><para>Default character <listitem><para>Default character
@ -608,9 +609,9 @@ space issues.</para></listitem></varlistentry>
<term><varname>aspellLanguage</varname></term> <term><varname>aspellLanguage</varname></term>
<listitem><para>Language definitions to use when creating the aspell <listitem><para>Language definitions to use when creating the aspell
dictionary. The value must match a set of aspell language dictionary. The value must match a set of aspell language
definition files. You can type "aspell dicts" to see a list. The default definition files. You can type "aspell dicts" to see a list. The default
if this is not set is to use the NLS environment to guess the if this is not set is to use the NLS environment to guess the value. The
value.</para></listitem></varlistentry> values are the 2-letter language codes (e.g. 'en', 'fr'...).</para></listitem></varlistentry>
<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.ASPELLADDCREATEPARAM"> <varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.ASPELLADDCREATEPARAM">
<term><varname>aspellAddCreateParam</varname></term> <term><varname>aspellAddCreateParam</varname></term>
<listitem><para>Additional option and parameter to aspell dictionary creation <listitem><para>Additional option and parameter to aspell dictionary creation
@ -650,14 +651,20 @@ patterns are matched with fnmatch(pattern, path, 0) You can quote entries
containing white space with double quotes (quote the whole entry, not the containing white space with double quotes (quote the whole entry, not the
pattern). The default is empty. pattern). The default is empty.
Example: mondelaypatterns = *.log:20 "*with spaces.*:30"</para></listitem></varlistentry> Example: mondelaypatterns = *.log:20 "*with spaces.*:30"</para></listitem></varlistentry>
<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.IDXNICEPRIO">
<term><varname>idxniceprio</varname></term>
<listitem><para>"nice" process priority for the indexing processes. Default: 19
(lowest). Appeared with 1.26.5. Prior versions were fixed at 19.</para></listitem></varlistentry>
<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.MONIONICECLASS"> <varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.MONIONICECLASS">
<term><varname>monioniceclass</varname></term> <term><varname>monioniceclass</varname></term>
<listitem><para>ionice class for the real time indexing process On platforms where this is supported. The default value is <listitem><para>ionice class for the indexing process. Despite the misleading name, and on platforms where this is
3.</para></listitem></varlistentry> supported, this affects all indexing processes,
not only the real time/monitoring ones. The default value is 3 (use
lowest "Idle" priority).</para></listitem></varlistentry>
<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.MONIONICECLASSDATA"> <varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.MONIONICECLASSDATA">
<term><varname>monioniceclassdata</varname></term> <term><varname>monioniceclassdata</varname></term>
<listitem><para>ionice class parameter for the real time indexing process. On platforms where this is supported. The default is <listitem><para>ionice class level parameter if the class supports it. The default is empty, as the default "Idle" class has no
empty.</para></listitem></varlistentry> levels.</para></listitem></varlistentry>
</variablelist></sect3> </variablelist></sect3>
<sect3 id="RCL.INSTALL.CONFIG.RECOLLCONF.QUERY"> <sect3 id="RCL.INSTALL.CONFIG.RECOLLCONF.QUERY">
<title>Query-time parameters (no impact on the index) </title><variablelist> <title>Query-time parameters (no impact on the index) </title><variablelist>
@ -700,14 +707,8 @@ with possibly meaning-altering missing words.</para></listitem></varlistentry>
<title>Parameters for the PDF input script </title><variablelist> <title>Parameters for the PDF input script </title><variablelist>
<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.PDFOCR"> <varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.PDFOCR">
<term><varname>pdfocr</varname></term> <term><varname>pdfocr</varname></term>
<listitem><para>Attempt OCR of PDF files with no text content if both tesseract and <listitem><para>Attempt OCR of PDF files with no text content. This can be defined in subdirectories. The default is off because
pdftoppm are installed. This can be defined in subdirectories. The default is off because OCR is so very slow. This only has an effect if ocrprogs is defined.</para></listitem></varlistentry>
OCR is so very slow.</para></listitem></varlistentry>
<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.PDFOCRLANG">
<term><varname>pdfocrlang</varname></term>
<listitem><para>Language to assume for PDF OCR. This is very important for having a reasonable rate of errors
with tesseract. This can also be set through a configuration variable
or directory-local parameters. See the rclpdf.py script.</para></listitem></varlistentry>
<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.PDFATTACH"> <varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.PDFATTACH">
<term><varname>pdfattach</varname></term> <term><varname>pdfattach</varname></term>
<listitem><para>Enable PDF attachment extraction by executing pdftk (if <listitem><para>Enable PDF attachment extraction by executing pdftk (if
@ -732,6 +733,41 @@ selected field, for editing or erasing. A new instance is created for
each document, so that the object can keep state for, e.g. eliminating each document, so that the object can keep state for, e.g. eliminating
duplicate values.</para></listitem></varlistentry> duplicate values.</para></listitem></varlistentry>
</variablelist></sect3> </variablelist></sect3>
<sect3 id="RCL.INSTALL.CONFIG.RECOLLCONF.OCR">
<title>Parameters for OCR processing </title><variablelist>
<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.OCRPROGS">
<term><varname>ocrprogs</varname></term>
<listitem><para>OCR modules to try. The top OCR script will try to load the corresponding modules in
order and use the first which reports being capable of performing OCR on
the input file. Modules for tesseract and ABBYY FineReader are present in
the standard distribution.</para></listitem></varlistentry>
<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.OCRCACHEDIR">
<term><varname>ocrcachedir</varname></term>
<listitem><para>Location for caching OCR data. The default if this is empty or undefined is to store the cached
OCR data under $RECOLL_CONFDIR/ocrcache.</para></listitem></varlistentry>
<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.TESSERACTLANG">
<term><varname>tesseractlang</varname></term>
<listitem><para>Language to assume for tesseract OCR. Important for improving the OCR accuracy. This can also be set
through the contents of a file in
the currently processed directory. See the rclocrtesseract.py
script. Example values: eng, fra... See the tesseract documentation.</para></listitem></varlistentry>
<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.TESSERACTCMD">
<term><varname>tesseractcmd</varname></term>
<listitem><para>Path for the tesseract command. This is mostly useful on Windows, or for specifying a non-default
tesseract command, e.g. on Windows:
C:/Program&nbsp;Files&nbsp;(x86)/Tesseract-OCR/tesseract.exe</para></listitem></varlistentry>
<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.ABBYYLANG">
<term><varname>abbyylang</varname></term>
<listitem><para>Language to assume for abbyy OCR. Important for improving the OCR accuracy. This can also be set
through the contents of a file in
the currently processed directory. See the rclocrabbyy.py
script. Typical values: English, French... See the ABBYY documentation.
</para></listitem></varlistentry>
<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.ABBYYCMD">
<term><varname>abbyycmd</varname></term>
<listitem><para>Path for the abbyy command. The ABBYY directory is usually not in the path, so you should set this.
</para></listitem></varlistentry>
</variablelist></sect3>
<sect3 id="RCL.INSTALL.CONFIG.RECOLLCONF.SPECLOCATIONS"> <sect3 id="RCL.INSTALL.CONFIG.RECOLLCONF.SPECLOCATIONS">
<title>Parameters set for specific locations </title><variablelist> <title>Parameters set for specific locations </title><variablelist>
<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.MHMBOXQUIRKS"> <varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.MHMBOXQUIRKS">


@ -3,7 +3,7 @@
<html> <html>
<head> <head>
<meta name="generator" content= <meta name="generator" content=
"HTML Tidy for HTML5 for Linux version 5.2.0"> "HTML Tidy for HTML5 for Linux version 5.6.0">
<meta http-equiv="Content-Type" content= <meta http-equiv="Content-Type" content=
"text/html; charset=utf-8"> "text/html; charset=utf-8">
<title>Recoll user manual</title> <title>Recoll user manual</title>
@ -157,20 +157,19 @@ alink="#0000FF">
<dd> <dd>
<dl> <dl>
<dt><span class="sect2">2.8.1. <a href= <dt><span class="sect2">2.8.1. <a href=
"#RCL.INDEXING.PDF.OCR">OCR with
Tesseract</a></span></dt>
<dt><span class="sect2">2.8.2. <a href=
"#RCL.INDEXING.PDF.XMP">XMP fields "#RCL.INDEXING.PDF.XMP">XMP fields
extraction</a></span></dt> extraction</a></span></dt>
<dt><span class="sect2">2.8.3. <a href= <dt><span class="sect2">2.8.2. <a href=
"#RCL.INDEXING.PDF.ATTACH">PDF attachment "#RCL.INDEXING.PDF.ATTACH">PDF attachment
indexing</a></span></dt> indexing</a></span></dt>
</dl> </dl>
</dd> </dd>
<dt><span class="sect1">2.9. <a href= <dt><span class="sect1">2.9. <a href=
"#RCL.INDEXING.OCR">Recoll and OCR</a></span></dt>
<dt><span class="sect1">2.10. <a href=
"#RCL.INDEXING.PERIODIC">Periodic "#RCL.INDEXING.PERIODIC">Periodic
indexing</a></span></dt> indexing</a></span></dt>
<dt><span class="sect1">2.10. <a href= <dt><span class="sect1">2.11. <a href=
"#RCL.INDEXING.MONITOR"><span class= "#RCL.INDEXING.MONITOR"><span class=
"application">Unix</span>-like systems: real time "application">Unix</span>-like systems: real time
indexing</a></span></dt> indexing</a></span></dt>
@ -781,7 +780,7 @@ alink="#0000FF">
"list-style-type: disc;"> "list-style-type: disc;">
<li class="listitem"> <li class="listitem">
<p><b><a class="link" href="#RCL.INDEXING.PERIODIC" <p><b><a class="link" href="#RCL.INDEXING.PERIODIC"
title="2.9.&nbsp;Periodic indexing">Periodic (or title="2.10.&nbsp;Periodic indexing">Periodic (or
batch) indexing</a> .&nbsp;</b><span class= batch) indexing</a> .&nbsp;</b><span class=
"command"><strong>recollindex</strong></span> is "command"><strong>recollindex</strong></span> is
executed at discrete times. On <span class= executed at discrete times. On <span class=
@ -799,7 +798,7 @@ alink="#0000FF">
<li class="listitem"> <li class="listitem">
<p><b><a class="link" href="#RCL.INDEXING.MONITOR" <p><b><a class="link" href="#RCL.INDEXING.MONITOR"
title= title=
"2.10.&nbsp;Unix-like systems: real time indexing">Real "2.11.&nbsp;Unix-like systems: real time indexing">Real
time indexing</a> .&nbsp;</b>(Only available on time indexing</a> .&nbsp;</b>(Only available on
<span class="application">Unix</span>-like <span class="application">Unix</span>-like
systems). <span class= systems). <span class=
@ -831,7 +830,7 @@ alink="#0000FF">
indexing on a small home directory), or, with indexing on a small home directory), or, with
<span class="application">Recoll</span> 1.24 and newer, <span class="application">Recoll</span> 1.24 and newer,
by <a class="link" href="#RCL.INDEXING.MONITOR" title= by <a class="link" href="#RCL.INDEXING.MONITOR" title=
"2.10.&nbsp;Unix-like systems: real time indexing">configuring "2.11.&nbsp;Unix-like systems: real time indexing">configuring
the index so that only a subset of the tree will be the index so that only a subset of the tree will be
monitored.</a></p> monitored.</a></p>
<p>The choice of method and the parameters used can be <p>The choice of method and the parameters used can be
@ -1136,8 +1135,8 @@ alink="#0000FF">
different areas of the file system to different different areas of the file system to different
indexes. For example, if you were to issue the indexes. For example, if you were to issue the
following command:</p> following command:</p>
<pre class="programlisting"> <pre class=
recoll -c ~/.indexes-email</pre> "programlisting">recoll -c ~/.indexes-email</pre>
<p>Then <span class="application">Recoll</span> would <p>Then <span class="application">Recoll</span> would
use configuration files stored in <code class= use configuration files stored in <code class=
"filename">~/.indexes-email/</code> and, (unless "filename">~/.indexes-email/</code> and, (unless
@ -2141,45 +2140,16 @@ metadatacmds = ; <em class=
if the document text is empty, it can be configured to if the document text is empty, it can be configured to
extract specific metadata tags from an XMP packet, and to extract specific metadata tags from an XMP packet, and to
extract PDF attachments.</p> extract PDF attachments.</p>
<div class="sect2"> <p>The PDF handler can execute an external program to run
<div class="titlepage"> OCR if no text is found in the document. This is now
<div> described in a <a class="link" href="#RCL.INDEXING.OCR"
<div> title="2.9.&nbsp;Recoll and OCR">separate section</a>.</p>
<h3 class="title"><a name="RCL.INDEXING.PDF.OCR"
id="RCL.INDEXING.PDF.OCR"></a>2.8.1.&nbsp;OCR with
Tesseract</h3>
</div>
</div>
</div>
<p>If both <span class="application">tesseract</span> and
<span class="command"><strong>pdftoppm</strong></span>
(generally from the <span class=
"application">poppler-utils</span> package) are
installed, the PDF handler may attempt OCR on PDF files
with no text content. This is controlled by the <a class=
"link" href=
"#RCL.INSTALL.CONFIG.RECOLLCONF.PDFOCR">pdfocr</a>
configuration variable, which is false by default because
OCR is very slow.</p>
<p>The choice of language is very important for
successful OCR. Recoll has currently no way to determine
this from the document itself. You can set the language
to use through the contents of a <code class=
"filename">.ocrpdflang</code> text file in the same
directory as the PDF document, or through the
<code class="envar">RECOLL_TESSERACT_LANG</code>
environment variable, or through the contents of an
<code class="filename">ocrpdf</code> text file inside the
configuration directory. If none of the above are used,
<span class="application">Recoll</span> will try to guess
the language from the NLS environment.</p>
</div>
<div class="sect2"> <div class="sect2">
<div class="titlepage"> <div class="titlepage">
<div> <div>
<div> <div>
<h3 class="title"><a name="RCL.INDEXING.PDF.XMP" <h3 class="title"><a name="RCL.INDEXING.PDF.XMP"
id="RCL.INDEXING.PDF.XMP"></a>2.8.2.&nbsp;XMP id="RCL.INDEXING.PDF.XMP"></a>2.8.1.&nbsp;XMP
fields extraction</h3> fields extraction</h3>
</div> </div>
</div> </div>
@ -2236,7 +2206,7 @@ metadatacmds = ; <em class=
<div> <div>
<div> <div>
<h3 class="title"><a name="RCL.INDEXING.PDF.ATTACH" <h3 class="title"><a name="RCL.INDEXING.PDF.ATTACH"
id="RCL.INDEXING.PDF.ATTACH"></a>2.8.3.&nbsp;PDF id="RCL.INDEXING.PDF.ATTACH"></a>2.8.2.&nbsp;PDF
attachment indexing</h3> attachment indexing</h3>
</div> </div>
</div> </div>
@ -2252,13 +2222,67 @@ metadatacmds = ; <em class=
uncommon in my experience).</p> uncommon in my experience).</p>
</div> </div>
</div> </div>
<div class="sect1">
<div class="titlepage">
<div>
<div>
<h2 class="title" style="clear: both"><a name=
"RCL.INDEXING.OCR" id=
"RCL.INDEXING.OCR"></a>2.9.&nbsp;Recoll and OCR</h2>
</div>
</div>
</div>
<p>This is new in <span class="application">Recoll</span>
1.26.5. Older versions had a more limited, non-caching
capability to execute an external OCR program in the PDF
handler. The new function has the following features:</p>
<div class="itemizedlist">
<ul class="itemizedlist" style="list-style-type: disc;">
<li class="listitem">
<p>The OCR output is cached, stored as separate
files. The caching is ultimately based on a hash
value of the original file contents, so that it is
immune to file renames. A first path-based layer
ensures fast operation for unchanged (unmoved) files,
and the data hash (which is still orders of magnitude
faster than OCR) is only re-computed if the file has
moved. OCR is only performed if the file was not
previously processed or if it changed.</p>
</li>
<li class="listitem">
<p>The support for a specific program is implemented
in a simple Python module. It should be
straightforward to add support for any OCR engine
with a capability to run from the command line.</p>
</li>
<li class="listitem">
<p>Modules initially exist for <span class=
"application">tesseract</span> (Linux and Windows),
and <span class="application">ABBYY FineReader</span>
(Linux, tested with version 11). ABBYY FineReader is
a commercial closed source program, but it sometimes
performs better than tesseract.</p>
</li>
<li class="listitem">
<p>The OCR is currently only called from the PDF
handler, but there should be no problem using it for
other image types.</p>
</li>
</ul>
</div>
<p>Configuration. See the <a class="link" href=
"#RCL.INSTALL.CONFIG.RECOLLCONF.OCR" title=
"Parameters for OCR processing">relevant section</a>. All
parameters can be localized in subdirectories through the
usual main configuration mechanism (path sections).</p>
</div>
<div class="sect1"> <div class="sect1">
<div class="titlepage"> <div class="titlepage">
<div> <div>
<div> <div>
<h2 class="title" style="clear: both"><a name= <h2 class="title" style="clear: both"><a name=
"RCL.INDEXING.PERIODIC" id= "RCL.INDEXING.PERIODIC" id=
"RCL.INDEXING.PERIODIC"></a>2.9.&nbsp;Periodic "RCL.INDEXING.PERIODIC"></a>2.10.&nbsp;Periodic
indexing</h2> indexing</h2>
</div> </div>
</div> </div>
@ -2431,7 +2455,7 @@ metadatacmds = ; <em class=
<div> <div>
<h2 class="title" style="clear: both"><a name= <h2 class="title" style="clear: both"><a name=
"RCL.INDEXING.MONITOR" id= "RCL.INDEXING.MONITOR" id=
"RCL.INDEXING.MONITOR"></a>2.10.&nbsp;<span class= "RCL.INDEXING.MONITOR"></a>2.11.&nbsp;<span class=
"application">Unix</span>-like systems: real time "application">Unix</span>-like systems: real time
indexing</h2> indexing</h2>
</div> </div>
@ -3759,8 +3783,8 @@ fs.inotify.max_user_watches=32768
that every user does not have to do it. The variable that every user does not have to do it. The variable
should define a colon-separated list of index should define a colon-separated list of index
directories, ie:</p> directories, ie:</p>
<pre class="screen"> <pre class=
export RECOLL_EXTRA_DBS=/some/place/xapiandb:/some/other/db</pre> "screen">export RECOLL_EXTRA_DBS=/some/place/xapiandb:/some/other/db</pre>
<p>Another environment variable, <code class= <p>Another environment variable, <code class=
"envar">RECOLL_ACTIVE_EXTRA_DBS</code> allows adding to "envar">RECOLL_ACTIVE_EXTRA_DBS</code> allows adding to
the active list of indexes. This variable was suggested the active list of indexes. This variable was suggested
@ -4565,8 +4589,8 @@ fs.inotify.max_user_watches=32768
parent folder expansion, usually creating a file parent folder expansion, usually creating a file
manager window on the folder where the container file manager window on the folder where the container file
resides. E.g.:</p> resides. E.g.:</p>
<pre class="programlisting"> <pre class=
&lt;a href="F%N"&gt;%P&lt;/a&gt;</pre> "programlisting">&lt;a href="F%N"&gt;%P&lt;/a&gt;</pre>
<p>A link target defined as <code class= <p>A link target defined as <code class=
"literal">R%N|<em class= "literal">R%N|<em class=
"replaceable"><code>scriptname</code></em></code> "replaceable"><code>scriptname</code></em></code>
@ -4708,8 +4732,8 @@ fs.inotify.max_user_watches=32768
<span class="application">javascript</span> program to <span class="application">javascript</span> program to
the documents, like the following example, which would the documents, like the following example, which would
initiate a search by double-clicking any term:</p> initiate a search by double-clicking any term:</p>
<pre class="programlisting"> <pre class=
&lt;script language="JavaScript"&gt; "programlisting">&lt;script language="JavaScript"&gt;
function recollsearch() { function recollsearch() {
var t = document.getSelection(); var t = document.getSelection();
window.location.href = 'recoll://search/query?qtp=a&amp;p=0&amp;q=' + window.location.href = 'recoll://search/query?qtp=a&amp;p=0&amp;q=' +
@ -8838,7 +8862,8 @@ for i in range(nres):
<p>Languages for which to create stemming <p>Languages for which to create stemming
expansion data. Stemmer names can be found by expansion data. Stemmer names can be found by
executing 'recollindex -l', or this can also be executing 'recollindex -l', or this can also be
set from a list in the GUI.</p> set from a list in the GUI. The values are full
language names, e.g. english, french...</p>
</dd> </dd>
<dt><a name= <dt><a name=
"RCL.INSTALL.CONFIG.RECOLLCONF.DEFAULTCHARSET" id= "RCL.INSTALL.CONFIG.RECOLLCONF.DEFAULTCHARSET" id=
@ -9425,7 +9450,8 @@ for i in range(nres):
aspell language definition files. You can type aspell language definition files. You can type
"aspell dicts" to see a list. The default if this "aspell dicts" to see a list. The default if this
is not set is to use the NLS environment to guess is not set is to use the NLS environment to guess
the value.</p> the value. The values are the 2-letter language
codes (e.g. 'en', 'fr'...).</p>
</dd> </dd>
<dt><a name= <dt><a name=
"RCL.INSTALL.CONFIG.RECOLLCONF.ASPELLADDCREATEPARAM" "RCL.INSTALL.CONFIG.RECOLLCONF.ASPELLADDCREATEPARAM"
@ -9500,21 +9526,32 @@ for i in range(nres):
*.log:20 "*with spaces.*:30"</p> *.log:20 "*with spaces.*:30"</p>
</dd> </dd>
<dt><a name= <dt><a name=
"RCL.INSTALL.CONFIG.RECOLLCONF.IDXNICEPRIO" id=
"RCL.INSTALL.CONFIG.RECOLLCONF.IDXNICEPRIO"></a><span class="term"><code class="varname">idxniceprio</code></span></dt>
<dd>
<p>"nice" process priority for the indexing
processes. Default: 19 (lowest). Appeared with
1.26.5. Prior versions were fixed at 19.</p>
</dd>
<dt><a name=
"RCL.INSTALL.CONFIG.RECOLLCONF.MONIONICECLASS" id= "RCL.INSTALL.CONFIG.RECOLLCONF.MONIONICECLASS" id=
"RCL.INSTALL.CONFIG.RECOLLCONF.MONIONICECLASS"></a><span class="term"><code class="varname">monioniceclass</code></span></dt> "RCL.INSTALL.CONFIG.RECOLLCONF.MONIONICECLASS"></a><span class="term"><code class="varname">monioniceclass</code></span></dt>
<dd> <dd>
<p>ionice class for the real time indexing <p>ionice class for the indexing process. Despite
process On platforms where this is supported. The the misleading name, and on platforms where this
default value is 3.</p> is supported, this affects all indexing
processes, not only the real time/monitoring
ones. The default value is 3 (use lowest "Idle"
priority).</p>
</dd> </dd>
<dt><a name= <dt><a name=
"RCL.INSTALL.CONFIG.RECOLLCONF.MONIONICECLASSDATA" "RCL.INSTALL.CONFIG.RECOLLCONF.MONIONICECLASSDATA"
id= id=
"RCL.INSTALL.CONFIG.RECOLLCONF.MONIONICECLASSDATA"></a><span class="term"><code class="varname">monioniceclassdata</code></span></dt> "RCL.INSTALL.CONFIG.RECOLLCONF.MONIONICECLASSDATA"></a><span class="term"><code class="varname">monioniceclassdata</code></span></dt>
<dd> <dd>
<p>ionice class parameter for the real time <p>ionice class level parameter if the class
indexing process. On platforms where this is supports it. The default is empty, as the default
supported. The default is empty.</p> "Idle" class has no levels.</p>
</dd> </dd>
</dl> </dl>
</div> </div>
@ -9611,20 +9648,10 @@ for i in range(nres):
id= id=
"RCL.INSTALL.CONFIG.RECOLLCONF.PDFOCR"></a><span class="term"><code class="varname">pdfocr</code></span></dt> "RCL.INSTALL.CONFIG.RECOLLCONF.PDFOCR"></a><span class="term"><code class="varname">pdfocr</code></span></dt>
<dd> <dd>
<p>Attempt OCR of PDF files with no text content <p>Attempt OCR of PDF files with no text content.
if both tesseract and pdftoppm are installed.
This can be defined in subdirectories. The This can be defined in subdirectories. The
default is off because OCR is so very slow.</p> default is off because OCR is so very slow. This
</dd> only has an effect if ocrprogs is defined.</p>
<dt><a name=
"RCL.INSTALL.CONFIG.RECOLLCONF.PDFOCRLANG" id=
"RCL.INSTALL.CONFIG.RECOLLCONF.PDFOCRLANG"></a><span class="term"><code class="varname">pdfocrlang</code></span></dt>
<dd>
<p>Language to assume for PDF OCR. This is very
important for having a reasonable rate of errors
with tesseract. This can also be set through a
configuration variable or directory-local
parameters. See the rclpdf.py script.</p>
</dd> </dd>
<dt><a name= <dt><a name=
"RCL.INSTALL.CONFIG.RECOLLCONF.PDFATTACH" id= "RCL.INSTALL.CONFIG.RECOLLCONF.PDFATTACH" id=
@ -9666,6 +9693,80 @@ for i in range(nres):
</dl> </dl>
</div> </div>
</div> </div>
<div class="sect3">
<div class="titlepage">
<div>
<div>
<h4 class="title"><a name=
"RCL.INSTALL.CONFIG.RECOLLCONF.OCR" id=
"RCL.INSTALL.CONFIG.RECOLLCONF.OCR"></a>Parameters
for OCR processing</h4>
</div>
</div>
</div>
<div class="variablelist">
<dl class="variablelist">
<dt><a name=
"RCL.INSTALL.CONFIG.RECOLLCONF.OCRPROGS" id=
"RCL.INSTALL.CONFIG.RECOLLCONF.OCRPROGS"></a><span class="term"><code class="varname">ocrprogs</code></span></dt>
<dd>
<p>OCR modules to try. The top OCR script will
try to load the corresponding modules in order
and use the first which reports being capable of
performing OCR on the input file. Modules for
tesseract and ABBYY FineReader are present in the
standard distribution.</p>
</dd>
<dt><a name=
"RCL.INSTALL.CONFIG.RECOLLCONF.OCRCACHEDIR" id=
"RCL.INSTALL.CONFIG.RECOLLCONF.OCRCACHEDIR"></a><span class="term"><code class="varname">ocrcachedir</code></span></dt>
<dd>
<p>Location for caching OCR data. The default if
this is empty or undefined is to store the cached
OCR data under $RECOLL_CONFDIR/ocrcache.</p>
</dd>
<dt><a name=
"RCL.INSTALL.CONFIG.RECOLLCONF.TESSERACTLANG" id=
"RCL.INSTALL.CONFIG.RECOLLCONF.TESSERACTLANG"></a><span class="term"><code class="varname">tesseractlang</code></span></dt>
<dd>
<p>Language to assume for tesseract OCR.
Important for improving the OCR accuracy. This
can also be set through the contents of a file in
the currently processed directory. See the
rclocrtesseract.py script. Example values: eng,
fra... See the tesseract documentation.</p>
</dd>
<dt><a name=
"RCL.INSTALL.CONFIG.RECOLLCONF.TESSERACTCMD" id=
"RCL.INSTALL.CONFIG.RECOLLCONF.TESSERACTCMD"></a><span class="term"><code class="varname">tesseractcmd</code></span></dt>
<dd>
<p>Path for the tesseract command. This is mostly
useful on Windows, or for specifying a
non-default tesseract command, e.g. on Windows:
C:/Program&nbsp;Files&nbsp;(x86)/Tesseract-OCR/tesseract.exe</p>
</dd>
<dt><a name=
"RCL.INSTALL.CONFIG.RECOLLCONF.ABBYYLANG" id=
"RCL.INSTALL.CONFIG.RECOLLCONF.ABBYYLANG"></a><span class="term"><code class="varname">abbyylang</code></span></dt>
<dd>
<p>Language to assume for abbyy OCR. Important
for improving the OCR accuracy. This can also be
set through the contents of a file in the
currently processed directory. See the
rclocrabbyy.py script. Typical values: English,
French... See the ABBYY documentation.</p>
</dd>
<dt><a name=
"RCL.INSTALL.CONFIG.RECOLLCONF.ABBYYCMD" id=
"RCL.INSTALL.CONFIG.RECOLLCONF.ABBYYCMD"></a><span class="term"><code class="varname">abbyycmd</code></span></dt>
<dd>
<p>Path for the abbyy command. The ABBYY directory
is usually not in the path, so you should set
this.</p>
</dd>
</dl>
</div>
</div>
<div class="sect3"> <div class="sect3">
<div class="titlepage"> <div class="titlepage">
<div> <div>
@ -9858,8 +9959,8 @@ for i in range(nres):
"filename">.xml</code> extension but should be handled "filename">.xml</code> extension but should be handled
specially, which is possible because they are usually all specially, which is possible because they are usually all
located in one place. Example:</p> located in one place. Example:</p>
<pre class="programlisting"> <pre class=
[~/.kde/share/apps/okular/docdata] "programlisting">[~/.kde/share/apps/okular/docdata]
.xml = application/x-okular-notes</pre> .xml = application/x-okular-notes</pre>
<p>The <code class="varname">recoll_noindex</code> <p>The <code class="varname">recoll_noindex</code>
<code class="filename">mimemap</code> variable has been <code class="filename">mimemap</code> variable has been
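
The "Recoll and OCR" section in this file describes a two-layer cache: a path-based layer for unchanged files, and a content hash that makes the cache immune to renames. The Python sketch below illustrates that idea only. The OcrCache class, its in-memory dictionaries, the use of SHA-256 and the modification-time/size signature are all assumptions made for the example; the manual says the real cache stores OCR output as separate files and does not specify these details.

```python
import hashlib
import os
from typing import Callable, Dict, Tuple


class OcrCache:
    """Illustrative two-layer OCR cache (not Recoll's actual implementation)."""

    def __init__(self) -> None:
        # Layer 1: path -> (mtime_ns, size, content digest)
        self.by_path: Dict[str, Tuple[int, int, str]] = {}
        # Layer 2: content digest -> OCR text
        self.by_hash: Dict[str, str] = {}

    def get_text(self, path: str, run_ocr: Callable[[str], str]) -> str:
        st = os.stat(path)
        entry = self.by_path.get(path)
        # Fast path: same location and same signature, so no hashing at all.
        if entry is not None and entry[:2] == (st.st_mtime_ns, st.st_size):
            return self.by_hash[entry[2]]
        # Slow path: hash the contents. This is still far cheaper than OCR.
        with open(path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        # OCR only runs when these contents were never processed before.
        if digest not in self.by_hash:
            self.by_hash[digest] = run_ocr(path)
        self.by_path[path] = (st.st_mtime_ns, st.st_size, digest)
        return self.by_hash[digest]
```

A renamed file misses the first layer but hits the second, so it costs one hash computation and no OCR run.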


@ -1414,30 +1414,9 @@ metadatacmds = ; <replaceable>tags</replaceable> = tmsu tags %f
specific metadata tags from an XMP packet, and to extract PDF specific metadata tags from an XMP packet, and to extract PDF
attachments.</para> attachments.</para>
<sect2 id="RCL.INDEXING.PDF.OCR"> <para>The PDF handler can execute an external program to run OCR if
<title>OCR with Tesseract</title> no text is found in the document. This is now described in a
<link linkend="RCL.INDEXING.OCR">separate section</link>.</para>
<para>If both <application>tesseract</application> and
<command>pdftoppm</command> (generally from the
<application>poppler-utils</application> package) are installed,
the PDF handler may attempt OCR on PDF files with no text
content. This is controlled by the
<link linkend="RCL.INSTALL.CONFIG.RECOLLCONF.PDFOCR">pdfocr</link>
configuration variable, which is false by default because
OCR is very slow.</para>
<para>The choice of language is very important for successful
OCR. Recoll has currently no way to determine this from the
document itself. You can set the language to use through the
contents of a <filename>.ocrpdflang</filename> text file in the
same directory as the PDF document, or through the
<envar>RECOLL_TESSERACT_LANG</envar> environment variable, or
through the contents of an <filename>ocrpdf</filename> text file
inside the configuration directory. If none of the above are used,
&RCL; will try to guess the language from the NLS
environment.</para>
</sect2>
<sect2 id="RCL.INDEXING.PDF.XMP"> <sect2 id="RCL.INDEXING.PDF.XMP">
<title>XMP fields extraction</title> <title>XMP fields extraction</title>
@ -1510,6 +1489,47 @@ metadatacmds = ; <replaceable>tags</replaceable> = tmsu tags %f
</sect1> </sect1>
<sect1 id="RCL.INDEXING.OCR">
<title>Recoll and OCR</title>
<para>This is new in &RCL; 1.26.5. Older versions had a more limited,
non-caching capability to execute an external OCR program in the PDF
handler. The new function has the following features:
<itemizedlist>
<listitem><para>The OCR output is cached, stored as separate
files. The caching is ultimately based on a hash value of the
original file contents, so that it is immune to file renames. A
first path-based layer ensures fast operation for unchanged
(unmoved) files, and the data hash (which is still orders of
magnitude faster than OCR) is only re-computed if the file has
moved. OCR is only performed if the file was not previously
processed or if it changed.</para></listitem>
<listitem><para>The support for a specific program is implemented
in a simple Python module. It should be straightforward to add
support for any OCR engine with a capability to run from the
command line.</para></listitem>
<listitem><para>Modules initially exist for
<application>tesseract</application> (Linux and Windows), and
<application>ABBYY FineReader</application> (Linux, tested with
version 11). ABBYY FineReader is a commercial closed source
program, but it sometimes performs better than
tesseract.</para></listitem>
<listitem><para>The OCR is currently only called from the PDF
handler, but there should be no problem using it for other image
types.</para></listitem>
</itemizedlist>
</para>
<para>For configuration, see the
<link linkend="RCL.INSTALL.CONFIG.RECOLLCONF.OCR">
relevant section</link>. All parameters can be localized in
subdirectories through the usual main configuration mechanism (path
sections).</para>
</sect1>
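The two-layer caching described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the code shipped with Recoll: the function names (`cached_ocr`, `_file_digest`), the cache file naming, the use of SHA-256, and the JSON path entries are all hypothetical choices made for the example.

```python
import hashlib
import json
import os


def _file_digest(path):
    """Hash the file contents in chunks (much cheaper than running OCR)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def cached_ocr(path, cachedir, run_ocr):
    """Return OCR text for path, calling run_ocr(path) only on a cache miss.

    Hypothetical layout: one small JSON file per path (layer 1) and one
    text file per content hash (layer 2).
    """
    os.makedirs(cachedir, exist_ok=True)
    path = os.path.abspath(path)
    st = os.stat(path)
    stamp = [st.st_mtime_ns, st.st_size]

    # Layer 1: path-based entry, valid while the file is unmoved and unchanged.
    pkey = hashlib.sha256(path.encode("utf-8", "surrogateescape")).hexdigest()
    pfile = os.path.join(cachedir, "path-" + pkey + ".json")
    digest = None
    if os.path.exists(pfile):
        with open(pfile) as f:
            entry = json.load(f)
        if entry["stamp"] == stamp:
            digest = entry["digest"]

    if digest is None:
        # The file is new, moved, or modified: recompute the content hash.
        digest = _file_digest(path)
        with open(pfile, "w") as f:
            json.dump({"stamp": stamp, "digest": digest}, f)

    # Layer 2: OCR output keyed by content hash, so renames still hit the cache.
    dfile = os.path.join(cachedir, "data-" + digest + ".txt")
    if os.path.exists(dfile):
        with open(dfile, encoding="utf-8") as f:
            return f.read()

    text = run_ocr(path)
    with open(dfile, "w", encoding="utf-8") as f:
        f.write(text)
    return text
```

In this sketch, a renamed file misses the path layer but hits the data layer after one hash computation, so the OCR engine only runs again when the contents actually change.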
<sect1 id="RCL.INDEXING.PERIODIC"> <sect1 id="RCL.INDEXING.PERIODIC">
<title>Periodic indexing</title> <title>Periodic indexing</title>


@ -350,7 +350,8 @@ indexStoreDocText = 1
# #
# <brief>Languages for which to create stemming expansion # <brief>Languages for which to create stemming expansion
# data.</brief><descr>Stemmer names can be found by executing 'recollindex # data.</brief><descr>Stemmer names can be found by executing 'recollindex
# -l', or this can also be set from a list in the GUI.</descr></var> # -l', or this can also be set from a list in the GUI. The values are full
# language names, e.g. english, french...</descr></var>
indexstemminglanguages = english indexstemminglanguages = english
# <var name="defaultcharset" type="string"><brief>Default character # <var name="defaultcharset" type="string"><brief>Default character
@ -760,9 +761,9 @@ checkneedretryindexscript = rclcheckneedretry.sh
# #
# <brief>Language definitions to use when creating the aspell # <brief>Language definitions to use when creating the aspell
# dictionary.</brief><descr>The value must match a set of aspell language # dictionary.</brief><descr>The value must match a set of aspell language
# definition files. You can type "aspell dicts" to see a list. The default # definition files. You can type "aspell dicts" to see a list. The default
# if this is not set is to use the NLS environment to guess the # if this is not set is to use the NLS environment to guess the value. The
# value.</descr></var> # values are the 2-letter language codes (e.g. 'en', 'fr'...)</descr></var>
#aspellLanguage = en #aspellLanguage = en
# <var name="aspellAddCreateParam" type="string"> # <var name="aspellAddCreateParam" type="string">
@ -902,19 +903,11 @@ snippetMaxPosWalk = 1000000
# <var name="pdfocr" type="bool"> # <var name="pdfocr" type="bool">
# #
# <brief>Attempt OCR of PDF files with no text content if both tesseract and # <brief>Attempt OCR of PDF files with no text content.</brief>
# pdftoppm are installed.</brief>
# <descr>This can be defined in subdirectories. The default is off because # <descr>This can be defined in subdirectories. The default is off because
# OCR is so very slow.</descr></var> # OCR is so very slow. Will only do anything if ocrprogs is defined.</descr>
#pdfocr = 0
# <var name="pdfocrlang" type="string">
# <brief>Language to assume for PDF OCR.</brief>
# <descr>This is very important for having a reasonable rate of errors
# with tesseract. This can also be set through a configuration variable
# or directory-local parameters. See the rclpdf.py script.</descr>
# </var> # </var>
#pdfocrlang = eng #pdfocr = 0
# <var name="pdfattach" type="bool"> # <var name="pdfattach" type="bool">
# #
@ -946,6 +939,60 @@ snippetMaxPosWalk = 1000000
#pdfextrametafix = /path/to/fixerscript.py #pdfextrametafix = /path/to/fixerscript.py
# <grouptitle id="OCR">Parameters for OCR processing</grouptitle>
# <var name="ocrprogs" type="string">
# <brief>OCR modules to try.</brief>
# <descr>The top OCR script will try to load the corresponding modules in
# order and use the first which reports being capable of performing OCR on
# the input file. Modules for tesseract and ABBYY FineReader are present in
# the standard distribution.</descr>
# </var>
#ocrprogs = abbyy tesseract
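The selection rule described for `ocrprogs` (try the modules in order, use the first one that reports it can process the file) might look like the sketch below. The `can_ocr` method, the `FakeEngine` class and the `registry` dictionary are hypothetical stand-ins for illustration; the real module interface is defined by the Recoll OCR scripts and is not reproduced here.

```python
class FakeEngine:
    """Hypothetical stand-in for an OCR module; not the real Recoll interface."""

    def __init__(self, extensions):
        self.extensions = tuple(extensions)

    def can_ocr(self, path):
        # A real module would typically also check that its command is installed.
        return path.lower().endswith(self.extensions)


def pick_ocr_module(ocrprogs, path, registry):
    """Return (name, module) for the first listed module able to handle path.

    ocrprogs is the space-separated configuration value, e.g. "abbyy tesseract".
    registry maps module names to loaded modules (assumed for this sketch).
    """
    for name in ocrprogs.split():
        module = registry.get(name)
        if module is None:
            continue  # module missing or failed to load: try the next one
        if module.can_ocr(path):
            return name, module
    return None, None


registry = {
    "abbyy": FakeEngine([".pdf"]),
    "tesseract": FakeEngine([".pdf", ".png"]),
}
name, _ = pick_ocr_module("abbyy tesseract", "scan.png", registry)
print(name)
```

Because the list is ordered, putting `abbyy` first makes it the preferred engine whenever it is available, with `tesseract` as the fallback.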
# <var name="ocrcachedir" type="dfn">
# <brief>Location for caching OCR data.</brief>
# <descr>The default if this is empty or undefined is to store the cached
# OCR data under $RECOLL_CONFDIR/ocrcache.</descr>
# </var>
#ocrcachedir=
# <var name="tesseractlang" type="string">
# <brief>Language to assume for tesseract OCR.</brief>
# <descr>Important for improving the OCR accuracy. This can also be set
# through the contents of a file in
# the currently processed directory. See the rclocrtesseract.py
# script. Example values: eng, fra... See the tesseract documentation.</descr>
# </var>
#tesseractlang = eng
# <var name="tesseractcmd" type="fn">
# <brief>Path for the tesseract command.</brief>
# <descr>This is mostly useful on Windows, or for specifying a non-default
# tesseract command. e.g. on Windows:
# C:/Program&nbsp;Files&nbsp;(x86)/Tesseract-OCR/tesseract.exe</descr>
# </var>
#tesseractcmd = c:/Program Files (x86)/Tesseract-OCR/tesseract.exe
# <var name="abbyylang" type="string">
# <brief>Language to assume for abbyy OCR.</brief>
# <descr>Important for improving the OCR accuracy. This can also be set
# through the contents of a file in
# the currently processed directory. See the rclocrabbyy.py
# script. Typical values: English, French... See the ABBYY documentation.
# </descr>
# </var>
#abbyylang = English
# <var name="abbyycmd" type="fn">
# <brief>Path for the abbyy command</brief>
# <descr>The ABBYY directory is usually not in the path, so you should set this.
# </descr>
# </var>
abbyycmd = /opt/ABBYYOCR11/abbyyocr11
# <grouptitle id="SPECLOCATIONS">Parameters set for specific # <grouptitle id="SPECLOCATIONS">Parameters set for specific
# locations</grouptitle> # locations</grouptitle>