document the new ocr function and its config
parent 40ead3aa7e
commit 17d29774b0
@@ -247,8 +247,8 @@ will reduce the index size. This can only be set for a whole index, not
|
||||
for a subtree.</para></listitem></varlistentry>
|
||||
<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.DEHYPHENATE">
|
||||
<term><varname>dehyphenate</varname></term>
|
||||
<listitem><para>Determines if we index
|
||||
'coworker' also when the input is 'co-worker'. This is new
|
||||
<listitem><para>Determines if we index 'coworker'
|
||||
also when the input is 'co-worker'. This is new
|
||||
in version 1.22, and on by default. Setting the variable to off allows
|
||||
restoring the previous behaviour.</para></listitem></varlistentry>
|
||||
<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.BACKSLASHASLETTER">
|
||||
@@ -279,7 +279,8 @@ as large.</para></listitem></varlistentry>
|
||||
<term><varname>indexstemminglanguages</varname></term>
|
||||
<listitem><para>Languages for which to create stemming expansion
|
||||
data. Stemmer names can be found by executing 'recollindex
|
||||
-l', or this can also be set from a list in the GUI.</para></listitem></varlistentry>
|
||||
-l', or this can also be set from a list in the GUI. The values are full
|
||||
language names, e.g. english, french...</para></listitem></varlistentry>
|
||||
<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.DEFAULTCHARSET">
|
||||
<term><varname>defaultcharset</varname></term>
|
||||
<listitem><para>Default character
|
||||
@@ -608,9 +609,9 @@ space issues.</para></listitem></varlistentry>
|
||||
<term><varname>aspellLanguage</varname></term>
|
||||
<listitem><para>Language definitions to use when creating the aspell
|
||||
dictionary. The value must match a set of aspell language
|
||||
definition files. You can type "aspell dicts" to see a list. The default
|
||||
if this is not set is to use the NLS environment to guess the
|
||||
value.</para></listitem></varlistentry>
|
||||
definition files. You can type "aspell dicts" to see a list. The default
|
||||
if this is not set is to use the NLS environment to guess the value. The
|
||||
values are the 2-letter language codes (e.g. 'en', 'fr'...)</para></listitem></varlistentry>
|
||||
<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.ASPELLADDCREATEPARAM">
|
||||
<term><varname>aspellAddCreateParam</varname></term>
|
||||
<listitem><para>Additional option and parameter to aspell dictionary creation
|
||||
@@ -650,14 +651,20 @@ patterns are matched with fnmatch(pattern, path, 0) You can quote entries
|
||||
containing white space with double quotes (quote the whole entry, not the
|
||||
pattern). The default is empty.
|
||||
Example: mondelaypatterns = *.log:20 "*with spaces.*:30"</para></listitem></varlistentry>
|
||||
<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.IDXNICEPRIO">
|
||||
<term><varname>idxniceprio</varname></term>
|
||||
<listitem><para>"nice" process priority for the indexing processes. Default: 19
|
||||
(lowest). Appeared with 1.26.5. Prior versions were fixed at 19.</para></listitem></varlistentry>
|
||||
<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.MONIONICECLASS">
|
||||
<term><varname>monioniceclass</varname></term>
|
||||
<listitem><para>ionice class for the real time indexing process On platforms where this is supported. The default value is
|
||||
3.</para></listitem></varlistentry>
|
||||
<listitem><para>ionice class for the indexing process. Despite the misleading name, and on platforms where this is
|
||||
supported, this affects all indexing processes,
|
||||
not only the real time/monitoring ones. The default value is 3 (use
|
||||
lowest "Idle" priority).</para></listitem></varlistentry>
|
||||
<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.MONIONICECLASSDATA">
|
||||
<term><varname>monioniceclassdata</varname></term>
|
||||
<listitem><para>ionice class parameter for the real time indexing process. On platforms where this is supported. The default is
|
||||
empty.</para></listitem></varlistentry>
|
||||
<listitem><para>ionice class level parameter if the class supports it. The default is empty, as the default "Idle" class has no
|
||||
levels.</para></listitem></varlistentry>
|
||||
</variablelist></sect3>
|
||||
<sect3 id="RCL.INSTALL.CONFIG.RECOLLCONF.QUERY">
|
||||
<title>Query-time parameters (no impact on the index) </title><variablelist>
|
||||
@@ -700,14 +707,8 @@ with possibly meaning-altering missing words.</para></listitem></varlistentry>
|
||||
<title>Parameters for the PDF input script </title><variablelist>
|
||||
<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.PDFOCR">
|
||||
<term><varname>pdfocr</varname></term>
|
||||
<listitem><para>Attempt OCR of PDF files with no text content if both tesseract and
|
||||
pdftoppm are installed. This can be defined in subdirectories. The default is off because
|
||||
OCR is so very slow.</para></listitem></varlistentry>
|
||||
<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.PDFOCRLANG">
|
||||
<term><varname>pdfocrlang</varname></term>
|
||||
<listitem><para>Language to assume for PDF OCR. This is very important for having a reasonable rate of errors
|
||||
with tesseract. This can also be set through a configuration variable
|
||||
or directory-local parameters. See the rclpdf.py script.</para></listitem></varlistentry>
|
||||
<listitem><para>Attempt OCR of PDF files with no text content. This can be defined in subdirectories. The default is off because
|
||||
OCR is so very slow. Will only do anything if ocrprogs is defined.</para></listitem></varlistentry>
|
||||
<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.PDFATTACH">
|
||||
<term><varname>pdfattach</varname></term>
|
||||
<listitem><para>Enable PDF attachment extraction by executing pdftk (if
|
||||
@@ -732,6 +733,41 @@ selected field, for editing or erasing. A new instance is created for
|
||||
each document, so that the object can keep state for, e.g. eliminating
|
||||
duplicate values.</para></listitem></varlistentry>
|
||||
</variablelist></sect3>
|
||||
<sect3 id="RCL.INSTALL.CONFIG.RECOLLCONF.OCR">
|
||||
<title>Parameters for OCR processing </title><variablelist>
|
||||
<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.OCRPROGS">
|
||||
<term><varname>ocrprogs</varname></term>
|
||||
<listitem><para>OCR modules to try. The top OCR script will try to load the corresponding modules in
|
||||
order and use the first which reports being capable of performing OCR on
|
||||
the input file. Modules for tesseract and ABBYY FineReader are present in
|
||||
the standard distribution.</para></listitem></varlistentry>
|
||||
<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.OCRCACHEDIR">
|
||||
<term><varname>ocrcachedir</varname></term>
|
||||
<listitem><para>Location for caching OCR data. The default if this is empty or undefined is to store the cached
|
||||
OCR data under $RECOLL_CONFDIR/ocrcache.</para></listitem></varlistentry>
|
||||
<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.TESSERACTLANG">
|
||||
<term><varname>tesseractlang</varname></term>
|
||||
<listitem><para>Language to assume for tesseract OCR. Important for improving the OCR accuracy. This can also be set
|
||||
through the contents of a file in
|
||||
the currently processed directory. See the rclocrtesseract.py
|
||||
script. Example values: eng, fra... See the tesseract documentation.</para></listitem></varlistentry>
|
||||
<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.TESSERACTCMD">
|
||||
<term><varname>tesseractcmd</varname></term>
|
||||
<listitem><para>Path for the tesseract command. This is mostly useful on Windows, or for specifying a non-default
|
||||
tesseract command. e.g. on Windows:
|
||||
C:/Program Files (x86)/Tesseract-OCR/tesseract.exe</para></listitem></varlistentry>
|
||||
<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.ABBYYLANG">
|
||||
<term><varname>abbyylang</varname></term>
|
||||
<listitem><para>Language to assume for abbyy OCR. Important for improving the OCR accuracy. This can also be set
|
||||
through the contents of a file in
|
||||
the currently processed directory. See the rclocrabbyy.py
|
||||
script. Typical values: English, French... See the ABBYY documentation.
|
||||
</para></listitem></varlistentry>
|
||||
<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.ABBYYCMD">
|
||||
<term><varname>abbyycmd</varname></term>
|
||||
<listitem><para>Path for the abbyy command. The ABBYY directory is usually not in the path, so you should set this.
|
||||
</para></listitem></varlistentry>
|
||||
</variablelist></sect3>
|
||||
<sect3 id="RCL.INSTALL.CONFIG.RECOLLCONF.SPECLOCATIONS">
|
||||
<title>Parameters set for specific locations </title><variablelist>
|
||||
<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.MHMBOXQUIRKS">
|
||||
|
||||
@@ -3,7 +3,7 @@
|
||||
<html>
|
||||
<head>
|
||||
<meta name="generator" content=
|
||||
"HTML Tidy for HTML5 for Linux version 5.2.0">
|
||||
"HTML Tidy for HTML5 for Linux version 5.6.0">
|
||||
<meta http-equiv="Content-Type" content=
|
||||
"text/html; charset=utf-8">
|
||||
<title>Recoll user manual</title>
|
||||
@@ -157,20 +157,19 @@ alink="#0000FF">
|
||||
<dd>
|
||||
<dl>
|
||||
<dt><span class="sect2">2.8.1. <a href=
|
||||
"#RCL.INDEXING.PDF.OCR">OCR with
|
||||
Tesseract</a></span></dt>
|
||||
<dt><span class="sect2">2.8.2. <a href=
|
||||
"#RCL.INDEXING.PDF.XMP">XMP fields
|
||||
extraction</a></span></dt>
|
||||
<dt><span class="sect2">2.8.3. <a href=
|
||||
<dt><span class="sect2">2.8.2. <a href=
|
||||
"#RCL.INDEXING.PDF.ATTACH">PDF attachment
|
||||
indexing</a></span></dt>
|
||||
</dl>
|
||||
</dd>
|
||||
<dt><span class="sect1">2.9. <a href=
|
||||
"#RCL.INDEXING.OCR">Recoll and OCR</a></span></dt>
|
||||
<dt><span class="sect1">2.10. <a href=
|
||||
"#RCL.INDEXING.PERIODIC">Periodic
|
||||
indexing</a></span></dt>
|
||||
<dt><span class="sect1">2.10. <a href=
|
||||
<dt><span class="sect1">2.11. <a href=
|
||||
"#RCL.INDEXING.MONITOR"><span class=
|
||||
"application">Unix</span>-like systems: real time
|
||||
indexing</a></span></dt>
|
||||
@@ -781,7 +780,7 @@ alink="#0000FF">
|
||||
"list-style-type: disc;">
|
||||
<li class="listitem">
|
||||
<p><b><a class="link" href="#RCL.INDEXING.PERIODIC"
|
||||
title="2.9. Periodic indexing">Periodic (or
|
||||
title="2.10. Periodic indexing">Periodic (or
|
||||
batch) indexing</a> . </b><span class=
|
||||
"command"><strong>recollindex</strong></span> is
|
||||
executed at discrete times. On <span class=
|
||||
@@ -799,7 +798,7 @@ alink="#0000FF">
|
||||
<li class="listitem">
|
||||
<p><b><a class="link" href="#RCL.INDEXING.MONITOR"
|
||||
title=
|
||||
"2.10. Unix-like systems: real time indexing">Real
|
||||
"2.11. Unix-like systems: real time indexing">Real
|
||||
time indexing</a> . </b>(Only available on
|
||||
<span class="application">Unix</span>-like
|
||||
systems). <span class=
|
||||
@@ -831,7 +830,7 @@ alink="#0000FF">
|
||||
indexing on a small home directory), or, with
|
||||
<span class="application">Recoll</span> 1.24 and newer,
|
||||
by <a class="link" href="#RCL.INDEXING.MONITOR" title=
|
||||
"2.10. Unix-like systems: real time indexing">configuring
|
||||
"2.11. Unix-like systems: real time indexing">configuring
|
||||
the index so that only a subset of the tree will be
|
||||
monitored.</a></p>
|
||||
<p>The choice of method and the parameters used can be
|
||||
@@ -1136,8 +1135,8 @@ alink="#0000FF">
|
||||
different areas of the file system to different
|
||||
indexes. For example, if you were to issue the
|
||||
following command:</p>
|
||||
<pre class="programlisting">
|
||||
recoll -c ~/.indexes-email</pre>
|
||||
<pre class=
|
||||
"programlisting">recoll -c ~/.indexes-email</pre>
|
||||
<p>Then <span class="application">Recoll</span> would
|
||||
use configuration files stored in <code class=
|
||||
"filename">~/.indexes-email/</code> and, (unless
|
||||
@@ -2141,45 +2140,16 @@ metadatacmds = ; <em class=
|
||||
if the document text is empty, it can be configured to
|
||||
extract specific metadata tags from an XMP packet, and to
|
||||
extract PDF attachments.</p>
|
||||
<div class="sect2">
|
||||
<div class="titlepage">
|
||||
<div>
|
||||
<div>
|
||||
<h3 class="title"><a name="RCL.INDEXING.PDF.OCR"
|
||||
id="RCL.INDEXING.PDF.OCR"></a>2.8.1. OCR with
|
||||
Tesseract</h3>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
<p>If both <span class="application">tesseract</span> and
|
||||
<span class="command"><strong>pdftoppm</strong></span>
|
||||
(generally from the <span class=
|
||||
"application">poppler-utils</span> package) are
|
||||
installed, the PDF handler may attempt OCR on PDF files
|
||||
with no text content. This is controlled by the <a class=
|
||||
"link" href=
|
||||
"#RCL.INSTALL.CONFIG.RECOLLCONF.PDFOCR">pdfocr</a>
|
||||
configuration variable, which is false by default because
|
||||
OCR is very slow.</p>
|
||||
<p>The choice of language is very important for
|
||||
successful OCR. Recoll currently has no way to determine
|
||||
this from the document itself. You can set the language
|
||||
to use through the contents of a <code class=
|
||||
"filename">.ocrpdflang</code> text file in the same
|
||||
directory as the PDF document, or through the
|
||||
<code class="envar">RECOLL_TESSERACT_LANG</code>
|
||||
environment variable, or through the contents of an
|
||||
<code class="filename">ocrpdf</code> text file inside the
|
||||
configuration directory. If none of the above are used,
|
||||
<span class="application">Recoll</span> will try to guess
|
||||
the language from the NLS environment.</p>
|
||||
</div>
|
||||
<p>The PDF handler can execute an external program to run
|
||||
OCR if no text is found in the document. This is now
|
||||
described in a <a class="link" href="#RCL.INDEXING.OCR"
|
||||
title="2.9. Recoll and OCR">separate section</a>.</p>
|
||||
<div class="sect2">
|
||||
<div class="titlepage">
|
||||
<div>
|
||||
<div>
|
||||
<h3 class="title"><a name="RCL.INDEXING.PDF.XMP"
|
||||
id="RCL.INDEXING.PDF.XMP"></a>2.8.2. XMP
|
||||
id="RCL.INDEXING.PDF.XMP"></a>2.8.1. XMP
|
||||
fields extraction</h3>
|
||||
</div>
|
||||
</div>
|
||||
@@ -2236,7 +2206,7 @@ metadatacmds = ; <em class=
|
||||
<div>
|
||||
<div>
|
||||
<h3 class="title"><a name="RCL.INDEXING.PDF.ATTACH"
|
||||
id="RCL.INDEXING.PDF.ATTACH"></a>2.8.3. PDF
|
||||
id="RCL.INDEXING.PDF.ATTACH"></a>2.8.2. PDF
|
||||
attachment indexing</h3>
|
||||
</div>
|
||||
</div>
|
||||
@@ -2252,13 +2222,67 @@ metadatacmds = ; <em class=
|
||||
uncommon in my experience).</p>
|
||||
</div>
|
||||
</div>
|
||||
<div class="sect1">
|
||||
<div class="titlepage">
|
||||
<div>
|
||||
<div>
|
||||
<h2 class="title" style="clear: both"><a name=
|
||||
"RCL.INDEXING.OCR" id=
|
||||
"RCL.INDEXING.OCR"></a>2.9. Recoll and OCR</h2>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
<p>This is new in <span class="application">Recoll</span>
|
||||
1.26.5. Older versions had a more limited, non-caching
|
||||
capability to execute an external OCR program in the PDF
|
||||
handler. The new function has the following features:</p>
|
||||
<div class="itemizedlist">
|
||||
<ul class="itemizedlist" style="list-style-type: disc;">
|
||||
<li class="listitem">
|
||||
<p>The OCR output is cached, stored as separate
|
||||
files. The caching is ultimately based on a hash
|
||||
value of the original file contents, so that it is
|
||||
immune to file renames. A first path-based layer
|
||||
ensures fast operation for unchanged (unmoved) files,
|
||||
and the data hash (which is still orders of magnitude
|
||||
faster than OCR) is only re-computed if the file has
|
||||
moved. OCR is only performed if the file was not
|
||||
previously processed or if it changed.</p>
|
||||
</li>
|
||||
<li class="listitem">
|
||||
<p>The support for a specific program is implemented
|
||||
in a simple Python module. It should be
|
||||
straightforward to add support for any OCR engine
|
||||
with a capability to run from the command line.</p>
|
||||
</li>
|
||||
<li class="listitem">
|
||||
<p>Modules initially exist for <span class=
|
||||
"application">tesseract</span> (Linux and Windows),
|
||||
and <span class="application">ABBYY FineReader</span>
|
||||
(Linux, tested with version 11). ABBYY FineReader is
|
||||
a commercial closed source program, but it sometimes
|
||||
performs better than tesseract.</p>
|
||||
</li>
|
||||
<li class="listitem">
|
||||
<p>The OCR is currently only called from the PDF
|
||||
handler, but there should be no problem using it for
|
||||
other image types.</p>
|
||||
</li>
|
||||
</ul>
|
||||
</div>
|
||||
<p>Configuration. See the <a class="link" href=
|
||||
"#RCL.INSTALL.CONFIG.RECOLLCONF.OCR" title=
|
||||
"Parameters for OCR processing">relevant section</a>. All
|
||||
parameters can be localized in subdirectories through the
|
||||
usual main configuration mechanism (path sections).</p>
|
||||
</div>
|
||||
<div class="sect1">
|
||||
<div class="titlepage">
|
||||
<div>
|
||||
<div>
|
||||
<h2 class="title" style="clear: both"><a name=
|
||||
"RCL.INDEXING.PERIODIC" id=
|
||||
"RCL.INDEXING.PERIODIC"></a>2.9. Periodic
|
||||
"RCL.INDEXING.PERIODIC"></a>2.10. Periodic
|
||||
indexing</h2>
|
||||
</div>
|
||||
</div>
|
||||
@@ -2431,7 +2455,7 @@ metadatacmds = ; <em class=
|
||||
<div>
|
||||
<h2 class="title" style="clear: both"><a name=
|
||||
"RCL.INDEXING.MONITOR" id=
|
||||
"RCL.INDEXING.MONITOR"></a>2.10. <span class=
|
||||
"RCL.INDEXING.MONITOR"></a>2.11. <span class=
|
||||
"application">Unix</span>-like systems: real time
|
||||
indexing</h2>
|
||||
</div>
|
||||
@@ -3759,8 +3783,8 @@ fs.inotify.max_user_watches=32768
|
||||
that every user does not have to do it. The variable
|
||||
should define a colon-separated list of index
|
||||
directories, ie:</p>
|
||||
<pre class="screen">
|
||||
export RECOLL_EXTRA_DBS=/some/place/xapiandb:/some/other/db</pre>
|
||||
<pre class=
|
||||
"screen">export RECOLL_EXTRA_DBS=/some/place/xapiandb:/some/other/db</pre>
|
||||
<p>Another environment variable, <code class=
|
||||
"envar">RECOLL_ACTIVE_EXTRA_DBS</code> allows adding to
|
||||
the active list of indexes. This variable was suggested
|
||||
@@ -4565,8 +4589,8 @@ fs.inotify.max_user_watches=32768
|
||||
parent folder expansion, usually creating a file
|
||||
manager window on the folder where the container file
|
||||
resides. E.g.:</p>
|
||||
<pre class="programlisting">
|
||||
<a href="F%N">%P</a></pre>
|
||||
<pre class=
|
||||
"programlisting"><a href="F%N">%P</a></pre>
|
||||
<p>A link target defined as <code class=
|
||||
"literal">R%N|<em class=
|
||||
"replaceable"><code>scriptname</code></em></code>
|
||||
@@ -4708,8 +4732,8 @@ fs.inotify.max_user_watches=32768
|
||||
<span class="application">javascript</span> program to
|
||||
the documents, like the following example, which would
|
||||
initiate a search by double-clicking any term:</p>
|
||||
<pre class="programlisting">
|
||||
<script language="JavaScript">
|
||||
<pre class=
|
||||
"programlisting"><script language="JavaScript">
|
||||
function recollsearch() {
|
||||
var t = document.getSelection();
|
||||
window.location.href = 'recoll://search/query?qtp=a&p=0&q=' +
|
||||
@@ -8838,7 +8862,8 @@ for i in range(nres):
|
||||
<p>Languages for which to create stemming
|
||||
expansion data. Stemmer names can be found by
|
||||
executing 'recollindex -l', or this can also be
|
||||
set from a list in the GUI.</p>
|
||||
set from a list in the GUI. The values are full
|
||||
language names, e.g. english, french...</p>
|
||||
</dd>
|
||||
<dt><a name=
|
||||
"RCL.INSTALL.CONFIG.RECOLLCONF.DEFAULTCHARSET" id=
|
||||
@@ -9425,7 +9450,8 @@ for i in range(nres):
|
||||
aspell language definition files. You can type
|
||||
"aspell dicts" to see a list. The default if this
|
||||
is not set is to use the NLS environment to guess
|
||||
the value.</p>
|
||||
the value. The values are the 2-letter language
|
||||
codes (e.g. 'en', 'fr'...)</p>
|
||||
</dd>
|
||||
<dt><a name=
|
||||
"RCL.INSTALL.CONFIG.RECOLLCONF.ASPELLADDCREATEPARAM"
|
||||
@@ -9500,21 +9526,32 @@ for i in range(nres):
|
||||
*.log:20 "*with spaces.*:30"</p>
|
||||
</dd>
|
||||
<dt><a name=
|
||||
"RCL.INSTALL.CONFIG.RECOLLCONF.IDXNICEPRIO" id=
|
||||
"RCL.INSTALL.CONFIG.RECOLLCONF.IDXNICEPRIO"></a><span class="term"><code class="varname">idxniceprio</code></span></dt>
|
||||
<dd>
|
||||
<p>"nice" process priority for the indexing
|
||||
processes. Default: 19 (lowest). Appeared with
|
||||
1.26.5. Prior versions were fixed at 19.</p>
|
||||
</dd>
|
||||
<dt><a name=
|
||||
"RCL.INSTALL.CONFIG.RECOLLCONF.MONIONICECLASS" id=
|
||||
"RCL.INSTALL.CONFIG.RECOLLCONF.MONIONICECLASS"></a><span class="term"><code class="varname">monioniceclass</code></span></dt>
|
||||
<dd>
|
||||
<p>ionice class for the real time indexing
|
||||
process On platforms where this is supported. The
|
||||
default value is 3.</p>
|
||||
<p>ionice class for the indexing process. Despite
|
||||
the misleading name, and on platforms where this
|
||||
is supported, this affects all indexing
|
||||
processes, not only the real time/monitoring
|
||||
ones. The default value is 3 (use lowest "Idle"
|
||||
priority).</p>
|
||||
</dd>
|
||||
<dt><a name=
|
||||
"RCL.INSTALL.CONFIG.RECOLLCONF.MONIONICECLASSDATA"
|
||||
id=
|
||||
"RCL.INSTALL.CONFIG.RECOLLCONF.MONIONICECLASSDATA"></a><span class="term"><code class="varname">monioniceclassdata</code></span></dt>
|
||||
<dd>
|
||||
<p>ionice class parameter for the real time
|
||||
indexing process. On platforms where this is
|
||||
supported. The default is empty.</p>
|
||||
<p>ionice class level parameter if the class
|
||||
supports it. The default is empty, as the default
|
||||
"Idle" class has no levels.</p>
|
||||
</dd>
|
||||
</dl>
|
||||
</div>
|
||||
@@ -9611,20 +9648,10 @@ for i in range(nres):
|
||||
id=
|
||||
"RCL.INSTALL.CONFIG.RECOLLCONF.PDFOCR"></a><span class="term"><code class="varname">pdfocr</code></span></dt>
|
||||
<dd>
|
||||
<p>Attempt OCR of PDF files with no text content
|
||||
if both tesseract and pdftoppm are installed.
|
||||
<p>Attempt OCR of PDF files with no text content.
|
||||
This can be defined in subdirectories. The
|
||||
default is off because OCR is so very slow.</p>
|
||||
</dd>
|
||||
<dt><a name=
|
||||
"RCL.INSTALL.CONFIG.RECOLLCONF.PDFOCRLANG" id=
|
||||
"RCL.INSTALL.CONFIG.RECOLLCONF.PDFOCRLANG"></a><span class="term"><code class="varname">pdfocrlang</code></span></dt>
|
||||
<dd>
|
||||
<p>Language to assume for PDF OCR. This is very
|
||||
important for having a reasonable rate of errors
|
||||
with tesseract. This can also be set through a
|
||||
configuration variable or directory-local
|
||||
parameters. See the rclpdf.py script.</p>
|
||||
default is off because OCR is so very slow. Will
|
||||
only do anything if ocrprogs is defined.</p>
|
||||
</dd>
|
||||
<dt><a name=
|
||||
"RCL.INSTALL.CONFIG.RECOLLCONF.PDFATTACH" id=
|
||||
@@ -9666,6 +9693,80 @@ for i in range(nres):
|
||||
</dl>
|
||||
</div>
|
||||
</div>
|
||||
<div class="sect3">
|
||||
<div class="titlepage">
|
||||
<div>
|
||||
<div>
|
||||
<h4 class="title"><a name=
|
||||
"RCL.INSTALL.CONFIG.RECOLLCONF.OCR" id=
|
||||
"RCL.INSTALL.CONFIG.RECOLLCONF.OCR"></a>Parameters
|
||||
for OCR processing</h4>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
<div class="variablelist">
|
||||
<dl class="variablelist">
|
||||
<dt><a name=
|
||||
"RCL.INSTALL.CONFIG.RECOLLCONF.OCRPROGS" id=
|
||||
"RCL.INSTALL.CONFIG.RECOLLCONF.OCRPROGS"></a><span class="term"><code class="varname">ocrprogs</code></span></dt>
|
||||
<dd>
|
||||
<p>OCR modules to try. The top OCR script will
|
||||
try to load the corresponding modules in order
|
||||
and use the first which reports being capable of
|
||||
performing OCR on the input file. Modules for
|
||||
tesseract and ABBYY FineReader are present in the
|
||||
standard distribution.</p>
|
||||
</dd>
|
||||
<dt><a name=
|
||||
"RCL.INSTALL.CONFIG.RECOLLCONF.OCRCACHEDIR" id=
|
||||
"RCL.INSTALL.CONFIG.RECOLLCONF.OCRCACHEDIR"></a><span class="term"><code class="varname">ocrcachedir</code></span></dt>
|
||||
<dd>
|
||||
<p>Location for caching OCR data. The default if
|
||||
this is empty or undefined is to store the cached
|
||||
OCR data under $RECOLL_CONFDIR/ocrcache.</p>
|
||||
</dd>
|
||||
<dt><a name=
|
||||
"RCL.INSTALL.CONFIG.RECOLLCONF.TESSERACTLANG" id=
|
||||
"RCL.INSTALL.CONFIG.RECOLLCONF.TESSERACTLANG"></a><span class="term"><code class="varname">tesseractlang</code></span></dt>
|
||||
<dd>
|
||||
<p>Language to assume for tesseract OCR.
|
||||
Important for improving the OCR accuracy. This
|
||||
can also be set through the contents of a file in
|
||||
the currently processed directory. See the
|
||||
rclocrtesseract.py script. Example values: eng,
|
||||
fra... See the tesseract documentation.</p>
|
||||
</dd>
|
||||
<dt><a name=
|
||||
"RCL.INSTALL.CONFIG.RECOLLCONF.TESSERACTCMD" id=
|
||||
"RCL.INSTALL.CONFIG.RECOLLCONF.TESSERACTCMD"></a><span class="term"><code class="varname">tesseractcmd</code></span></dt>
|
||||
<dd>
|
||||
<p>Path for the tesseract command. This is mostly
|
||||
useful on Windows, or for specifying a
|
||||
non-default tesseract command. e.g. on Windows:
|
||||
C:/Program Files (x86)/Tesseract-OCR/tesseract.exe</p>
|
||||
</dd>
|
||||
<dt><a name=
|
||||
"RCL.INSTALL.CONFIG.RECOLLCONF.ABBYYLANG" id=
|
||||
"RCL.INSTALL.CONFIG.RECOLLCONF.ABBYYLANG"></a><span class="term"><code class="varname">abbyylang</code></span></dt>
|
||||
<dd>
|
||||
<p>Language to assume for abbyy OCR. Important
|
||||
for improving the OCR accuracy. This can also be
|
||||
set through the contents of a file in the
|
||||
currently processed directory. See the
|
||||
rclocrabbyy.py script. Typical values: English,
|
||||
French... See the ABBYY documentation.</p>
|
||||
</dd>
|
||||
<dt><a name=
|
||||
"RCL.INSTALL.CONFIG.RECOLLCONF.ABBYYCMD" id=
|
||||
"RCL.INSTALL.CONFIG.RECOLLCONF.ABBYYCMD"></a><span class="term"><code class="varname">abbyycmd</code></span></dt>
|
||||
<dd>
|
||||
<p>Path for the abbyy command. The ABBYY directory
|
||||
is usually not in the path, so you should set
|
||||
this.</p>
|
||||
</dd>
|
||||
</dl>
|
||||
</div>
|
||||
</div>
|
||||
<div class="sect3">
|
||||
<div class="titlepage">
|
||||
<div>
|
||||
@@ -9858,8 +9959,8 @@ for i in range(nres):
|
||||
"filename">.xml</code> extension but should be handled
|
||||
specially, which is possible because they are usually all
|
||||
located in one place. Example:</p>
|
||||
<pre class="programlisting">
|
||||
[~/.kde/share/apps/okular/docdata]
|
||||
<pre class=
|
||||
"programlisting">[~/.kde/share/apps/okular/docdata]
|
||||
.xml = application/x-okular-notes</pre>
|
||||
<p>The <code class="varname">recoll_noindex</code>
|
||||
<code class="filename">mimemap</code> variable has been
|
||||
|
||||
@@ -1414,30 +1414,9 @@ metadatacmds = ; <replaceable>tags</replaceable> = tmsu tags %f
|
||||
specific metadata tags from an XMP packet, and to extract PDF
|
||||
attachments.</para>
|
||||
|
||||
<sect2 id="RCL.INDEXING.PDF.OCR">
|
||||
<title>OCR with Tesseract</title>
|
||||
|
||||
<para>If both <application>tesseract</application> and
|
||||
<command>pdftoppm</command> (generally from the
|
||||
<application>poppler-utils</application> package) are installed,
|
||||
the PDF handler may attempt OCR on PDF files with no text
|
||||
content. This is controlled by the
|
||||
<link linkend="RCL.INSTALL.CONFIG.RECOLLCONF.PDFOCR">pdfocr</link>
|
||||
configuration variable, which is false by default because
|
||||
OCR is very slow.</para>
|
||||
|
||||
<para>The choice of language is very important for successful
|
||||
OCR. Recoll has currently no way to determine this from the
|
||||
document itself. You can set the language to use through the
|
||||
contents of a <filename>.ocrpdflang</filename> text file in the
|
||||
same directory as the PDF document, or through the
|
||||
<envar>RECOLL_TESSERACT_LANG</envar> environment variable, or
|
||||
through the contents of an <filename>ocrpdf</filename> text file
|
||||
inside the configuration directory. If none of the above are used,
|
||||
&RCL; will try to guess the language from the NLS
|
||||
environment.</para>
|
||||
|
||||
</sect2>
|
||||
<para>The PDF handler can execute an external program to run OCR if
|
||||
no text is found in the document. This is now described in a
|
||||
<link linkend="RCL.INDEXING.OCR">separate section</link>.</para>
|
||||
|
||||
<sect2 id="RCL.INDEXING.PDF.XMP">
|
||||
<title>XMP fields extraction</title>
|
||||
@@ -1510,6 +1489,47 @@ metadatacmds = ; <replaceable>tags</replaceable> = tmsu tags %f
|
||||
|
||||
</sect1>
|
||||
|
||||
<sect1 id="RCL.INDEXING.OCR">
|
||||
<title>Recoll and OCR</title>
|
||||
|
||||
<para>This is new in &RCL; 1.26.5. Older versions had a more limited,
|
||||
non-caching capability to execute an external OCR program in the PDF
|
||||
handler. The new function has the following features:
|
||||
|
||||
<itemizedlist>
|
||||
<listitem><para>The OCR output is cached, stored as separate
|
||||
files. The caching is ultimately based on a hash value of the
|
||||
original file contents, so that it is immune to file renames. A
|
||||
first path-based layer ensures fast operation for unchanged
|
||||
(unmoved) files, and the data hash (which is still orders of
|
||||
magnitude faster than OCR) is only re-computed if the file has
|
||||
moved. OCR is only performed if the file was not previously
|
||||
processed or if it changed.</para></listitem>
|
||||
<listitem><para>The support for a specific program is implemented
|
||||
in a simple Python module. It should be straightforward to add
|
||||
support for any OCR engine with a capability to run from the
|
||||
command line.</para></listitem>
|
||||
<listitem><para>Modules initially exist for
|
||||
<application>tesseract</application> (Linux and Windows), and
|
||||
<application>ABBYY FineReader</application> (Linux, tested with
|
||||
version 11). ABBYY FineReader is a commercial closed source
|
||||
program, but it sometimes performs better than
|
||||
tesseract.</para></listitem>
|
||||
<listitem><para>The OCR is currently only called from the PDF
|
||||
handler, but there should be no problem using it for other image
|
||||
types.</para></listitem>
|
||||
</itemizedlist>
|
||||
</para>
|
||||
|
||||
<para>Configuration. See the
|
||||
<link linkend="RCL.INSTALL.CONFIG.RECOLLCONF.OCR">
|
||||
relevant section</link>. All parameters can be localized in
|
||||
subdirectories through the usual main configuration mechanism (path
|
||||
sections).</para>
|
||||
|
||||
</sect1>
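The caching behaviour described in the section above (a path-based first layer in front of a content-hash layer) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code Recoll ships: the `OcrCache` class, the SHA-256 hash, the in-memory path table and the `<hash>.txt` file layout are all hypothetical choices made for the example.

```python
import hashlib
import os


class OcrCache:
    """Toy two-layer OCR cache (hypothetical, not Recoll's implementation)."""

    def __init__(self, cachedir):
        self.cachedir = cachedir
        os.makedirs(cachedir, exist_ok=True)
        # Path layer: path -> (mtime_ns, size, content hash).
        # Kept in memory here; a real cache would persist it.
        self.by_path = {}

    @staticmethod
    def _hash_file(path):
        # Hash the file contents in chunks so large files are not read at once.
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def _content_key(self, path):
        st = os.stat(path)
        sig = (st.st_mtime_ns, st.st_size)
        entry = self.by_path.get(path)
        if entry is not None and entry[:2] == sig:
            # Same path, same mtime and size: reuse the stored hash.
            return entry[2]
        # New path (e.g. after a rename) or changed file: hash the contents.
        digest = self._hash_file(path)
        self.by_path[path] = sig + (digest,)
        return digest

    def get_text(self, path, run_ocr):
        """Return cached OCR text, calling run_ocr(path) only on a cache miss."""
        cached = os.path.join(self.cachedir, self._content_key(path) + ".txt")
        if os.path.exists(cached):
            with open(cached, encoding="utf-8") as f:
                return f.read()
        text = run_ocr(path)
        with open(cached, "w", encoding="utf-8") as f:
            f.write(text)
        return text
```

With this layout, a renamed file costs one extra hash computation but no OCR run, because the cached output is keyed by content rather than by path.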
|
||||
|
||||
|
||||
<sect1 id="RCL.INDEXING.PERIODIC">
|
||||
<title>Periodic indexing</title>
|
||||
|
||||
|
||||
@@ -350,7 +350,8 @@ indexStoreDocText = 1
|
||||
#
|
||||
# <brief>Languages for which to create stemming expansion
|
||||
# data.</brief><descr>Stemmer names can be found by executing 'recollindex
|
||||
# -l', or this can also be set from a list in the GUI.</descr></var>
|
||||
# -l', or this can also be set from a list in the GUI. The values are full
|
||||
# language names, e.g. english, french...</descr></var>
|
||||
indexstemminglanguages = english
|
||||
|
||||
# <var name="defaultcharset" type="string"><brief>Default character
|
||||
@@ -760,9 +761,9 @@ checkneedretryindexscript = rclcheckneedretry.sh
|
||||
#
|
||||
# <brief>Language definitions to use when creating the aspell
|
||||
# dictionary.</brief><descr>The value must match a set of aspell language
|
||||
# definition files. You can type "aspell dicts" to see a list. The default
|
||||
# if this is not set is to use the NLS environment to guess the
|
||||
# value.</descr></var>
|
||||
# definition files. You can type "aspell dicts" to see a list. The default
|
||||
# if this is not set is to use the NLS environment to guess the value. The
|
||||
# values are the 2-letter language codes (e.g. 'en', 'fr'...)</descr></var>
|
||||
#aspellLanguage = en
|
||||
|
||||
# <var name="aspellAddCreateParam" type="string">
|
||||
@@ -902,19 +903,11 @@ snippetMaxPosWalk = 1000000
|
||||
|
||||
# <var name="pdfocr" type="bool">
|
||||
#
|
||||
# <brief>Attempt OCR of PDF files with no text content if both tesseract and
|
||||
# pdftoppm are installed.</brief>
|
||||
# <brief>Attempt OCR of PDF files with no text content.</brief>
|
||||
# <descr>This can be defined in subdirectories. The default is off because
|
||||
# OCR is so very slow.</descr></var>
|
||||
#pdfocr = 0
|
||||
|
||||
# <var name="pdfocrlang" type="string">
|
||||
# <brief>Language to assume for PDF OCR.</brief>
|
||||
# <descr>This is very important for having a reasonable rate of errors
|
||||
# with tesseract. This can also be set through a configuration variable
|
||||
# or directory-local parameters. See the rclpdf.py script.</descr>
|
||||
# OCR is so very slow. Will only do anything if ocrprogs is defined.</descr>
|
||||
# </var>
|
||||
#pdfocrlang = eng
|
||||
#pdfocr = 0
|
||||
|
||||
# <var name="pdfattach" type="bool">
|
||||
#
|
||||
@@ -946,6 +939,60 @@ snippetMaxPosWalk = 1000000
|
||||
#pdfextrametafix = /path/to/fixerscript.py
|
||||
|
||||
|
||||
# <grouptitle id="OCR">Parameters for OCR processing</grouptitle>
|
||||
|
||||
|
||||
# <var name="ocrprogs" type="string">
|
||||
# <brief>OCR modules to try.</brief>
|
||||
# <descr>The top OCR script will try to load the corresponding modules in
|
||||
# order and use the first which reports being capable of performing OCR on
|
||||
# the input file. Modules for tesseract and ABBYY FineReader are present in
|
||||
# the standard distribution.</descr>
|
||||
# </var>
|
||||
#ocrprogs = abbyy tesseract
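The selection rule described for `ocrprogs` (try the listed modules in order, use the first one that reports it can handle the input file) could look roughly like the sketch below. Everything here is an assumption made for illustration: `FakeEngine`, its `ocr_possible` and `run_ocr` methods, the `registry` dictionary and `pick_engine` are hypothetical names, and the real module interface is defined by the Recoll OCR scripts themselves.

```python
class FakeEngine:
    """Hypothetical stand-in for an OCR module (not Recoll's real interface)."""

    def __init__(self, name, extensions):
        self.name = name
        self.extensions = extensions  # tuple of file suffixes this engine accepts

    def ocr_possible(self, path):
        # Report whether this engine is able to process the given file.
        return path.lower().endswith(self.extensions)

    def run_ocr(self, path):
        return f"[{self.name}] text of {path}"


def pick_engine(ocrprogs, registry, path):
    """Return the first engine named in ocrprogs that accepts path, else None."""
    for name in ocrprogs.split():
        engine = registry.get(name)  # unknown or unloadable module: skip it
        if engine is not None and engine.ocr_possible(path):
            return engine
    return None
```

For example, with `ocrprogs = "abbyy tesseract"` and an `abbyy` stand-in that only accepts `.tif` files, `pick_engine` falls through to `tesseract` for a `.pdf` input, so the order of the list acts as a preference order.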
|
||||
|
||||
# <var name="ocrcachedir" type="dfn">
|
||||
# <brief>Location for caching OCR data.</brief>
|
||||
# <descr>The default if this is empty or undefined is to store the cached
|
||||
# OCR data under $RECOLL_CONFDIR/ocrcache.</descr>
|
||||
# </var>
|
||||
#ocrcachedir=
|
||||
|
||||
|
||||
# <var name="tesseractlang" type="string">
|
||||
# <brief>Language to assume for tesseract OCR.</brief>
|
||||
# <descr>Important for improving the OCR accuracy. This can also be set
|
||||
# through the contents of a file in
|
||||
# the currently processed directory. See the rclocrtesseract.py
|
||||
# script. Example values: eng, fra... See the tesseract documentation.</descr>
|
||||
# </var>
|
||||
#tesseractlang = eng
|
||||
|
||||
# <var name="tesseractcmd" type="fn">
|
||||
# <brief>Path for the tesseract command.</brief>
|
||||
# <descr>This is mostly useful on Windows, or for specifying a non-default
|
||||
# tesseract command. e.g. on Windows:
|
||||
# C:/Program Files (x86)/Tesseract-OCR/tesseract.exe</descr>
|
||||
# </var>
|
||||
#tesseractcmd = c:/Program Files (x86)/Tesseract-OCR/tesseract.exe
|
||||
|
||||
# <var name="abbyylang" type="string">
|
||||
# <brief>Language to assume for abbyy OCR.</brief>
|
||||
# <descr>Important for improving the OCR accuracy. This can also be set
|
||||
# through the contents of a file in
|
||||
# the currently processed directory. See the rclocrabbyy.py
|
||||
# script. Typical values: English, French... See the ABBYY documentation.
|
||||
# </descr>
|
||||
# </var>
|
||||
#abbyylang = English
|
||||
|
||||
# <var name="abbyycmd" type="fn">
|
||||
# <brief>Path for the abbyy command</brief>
|
||||
# <descr>The ABBYY directory is usually not in the path, so you should set this.
|
||||
# </descr>
|
||||
# </var>
|
||||
abbyycmd = /opt/ABBYYOCR11/abbyyocr11
|
||||
|
||||
# <grouptitle id="SPECLOCATIONS">Parameters set for specific
|
||||
# locations</grouptitle>