document the new ocr function and its config

2020-02-27 18:17:51 +01:00 · 17d29774b0
commit 17d29774b0
parent 40ead3aa7e
4 changed files with 338 additions and 134 deletions
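The documentation added by this commit describes the OCR cache as two layers: a path-based layer for unchanged, unmoved files, and a content-hash layer that makes the cache immune to renames. The following is a minimal Python sketch of that idea, not Recoll's actual implementation: the class name `OcrCache`, the in-memory dictionaries, the use of SHA-1, and the `run_ocr` callback are all hypothetical and chosen only for illustration (the real cache stores its data as separate files).

```python
import hashlib
import os


class OcrCache:
    """Hypothetical two-layer cache: path signature first, content hash second."""

    def __init__(self):
        # Layer 1 (assumed layout): path -> (mtime_ns, size, content_hash)
        self.by_path = {}
        # Layer 2 (assumed layout): content_hash -> OCR text
        self.by_hash = {}

    @staticmethod
    def _hash(path):
        # Hash the file contents in chunks; far cheaper than running OCR.
        h = hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
        return h.hexdigest()

    def get_text(self, path, run_ocr):
        st = os.stat(path)
        sig = (st.st_mtime_ns, st.st_size)
        entry = self.by_path.get(path)
        if entry is not None and entry[:2] == sig:
            # Same path, same signature: reuse the stored hash.
            digest = entry[2]
        else:
            # New, moved, or modified file: recompute the content hash.
            digest = self._hash(path)
            self.by_path[path] = sig + (digest,)
        if digest not in self.by_hash:
            # Content never seen before: only now pay for OCR.
            self.by_hash[digest] = run_ocr(path)
        return self.by_hash[digest]
```

In this sketch a renamed file costs one re-hash but no OCR run, while a file whose contents change gets a new hash and is processed again.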
--- a/src/doc/user/recoll.conf.xml
+++ b/src/doc/user/recoll.conf.xml
@@ -247,8 +247,8 @@ will reduce the index size. This can only be set for a whole index, not
 for a subtree.</para></listitem></varlistentry>
 <varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.DEHYPHENATE">
 <term><varname>dehyphenate</varname></term>
-<listitem><para>Determines if we index
-'coworker' also when the input is 'co-worker'. This is new
+<listitem><para>Determines if we index 'coworker'
+also when the input is 'co-worker'. This is new
 in version 1.22, and on by default. Setting the variable to off allows
 restoring the previous behaviour.</para></listitem></varlistentry>
 <varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.BACKSLASHASLETTER">
@@ -279,7 +279,8 @@ as large.</para></listitem></varlistentry>
 <term><varname>indexstemminglanguages</varname></term>
 <listitem><para>Languages for which to create stemming expansion
 data. Stemmer names can be found by executing 'recollindex
-l', or this can also be set from a list in the GUI.</para></listitem></varlistentry>
+-l', or this can also be set from a list in the GUI. The values are full
+language names, e.g. english, french...</para></listitem></varlistentry>
 <varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.DEFAULTCHARSET">
 <term><varname>defaultcharset</varname></term>
 <listitem><para>Default character
@@ -608,9 +609,9 @@ space issues.</para></listitem></varlistentry>
 <term><varname>aspellLanguage</varname></term>
 <listitem><para>Language definitions to use when creating the aspell
 dictionary. The value must match a set of aspell language
-definition files. You can type "aspell dicts"  to see a list The default
-if this is not set is to use the NLS environment to guess the
-value.</para></listitem></varlistentry>
+definition files. You can type "aspell dicts" to see a list. The default
+if this is not set is to use the NLS environment to guess the value. The
+values are the 2-letter language codes (e.g. 'en', 'fr'...).</para></listitem></varlistentry>
 <varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.ASPELLADDCREATEPARAM">
 <term><varname>aspellAddCreateParam</varname></term>
 <listitem><para>Additional option and parameter to aspell dictionary creation
@@ -650,14 +651,20 @@ patterns are matched with fnmatch(pattern, path, 0) You can quote entries
 containing white space with double quotes (quote the whole entry, not the
 pattern). The default is empty.
 Example: mondelaypatterns = *.log:20 "*with spaces.*:30"</para></listitem></varlistentry>
+<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.IDXNICEPRIO">
+<term><varname>idxniceprio</varname></term>
+<listitem><para>"nice" process priority for the indexing processes. Default: 19
+(lowest). Appeared with 1.26.5. Prior versions were fixed at 19.</para></listitem></varlistentry>
 <varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.MONIONICECLASS">
 <term><varname>monioniceclass</varname></term>
-<listitem><para>ionice class for the real time indexing process On platforms where this is supported. The default value is
-3.</para></listitem></varlistentry>
+<listitem><para>ionice class for the indexing process. On platforms where this is supported, and despite the
+misleading name, this affects all indexing processes,
+not only the real time/monitoring ones. The default value is 3 (use
+lowest "Idle" priority).</para></listitem></varlistentry>
 <varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.MONIONICECLASSDATA">
 <term><varname>monioniceclassdata</varname></term>
-<listitem><para>ionice class parameter for the real time indexing process. On platforms where this is supported. The default is
-empty.</para></listitem></varlistentry>
+<listitem><para>ionice class level parameter if the class supports it. The default is empty, as the default "Idle" class has no
+levels.</para></listitem></varlistentry>
 </variablelist></sect3>
 <sect3 id="RCL.INSTALL.CONFIG.RECOLLCONF.QUERY">
 <title>Query-time parameters (no impact on the index) </title><variablelist>
@@ -700,14 +707,8 @@ with possibly meaning-altering missing words.</para></listitem></varlistentry>
 <title>Parameters for the PDF input script </title><variablelist>
 <varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.PDFOCR">
 <term><varname>pdfocr</varname></term>
-<listitem><para>Attempt OCR of PDF files with no text content if both tesseract and
-pdftoppm are installed. This can be defined in subdirectories. The default is off because
-OCR is so very slow.</para></listitem></varlistentry>
-<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.PDFOCRLANG">
-<term><varname>pdfocrlang</varname></term>
-<listitem><para>Language to assume for PDF OCR. This is very important for having a reasonable rate of errors
-with tesseract. This can also be set through a configuration variable
-or directory-local parameters. See the rclpdf.py script.</para></listitem></varlistentry>
+<listitem><para>Attempt OCR of PDF files with no text content. This can be defined in subdirectories. The default is off because
+OCR is so very slow. This only has an effect if ocrprogs is defined.</para></listitem></varlistentry>
 <varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.PDFATTACH">
 <term><varname>pdfattach</varname></term>
 <listitem><para>Enable PDF attachment extraction by executing pdftk (if
@@ -732,6 +733,41 @@ selected field, for editing or erasing. A new instance is created for
 each document, so that the object can keep state for, e.g. eliminating
 duplicate values.</para></listitem></varlistentry>
 </variablelist></sect3>
+<sect3 id="RCL.INSTALL.CONFIG.RECOLLCONF.OCR">
+<title>Parameters for OCR processing </title><variablelist>
+<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.OCRPROGS">
+<term><varname>ocrprogs</varname></term>
+<listitem><para>OCR modules to try. The top-level OCR script will try to load the corresponding modules in
+order and use the first which reports being capable of performing OCR on
+the input file. Modules for tesseract and ABBYY FineReader are present in
+the standard distribution.</para></listitem></varlistentry>
+<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.OCRCACHEDIR">
+<term><varname>ocrcachedir</varname></term>
+<listitem><para>Location for caching OCR data. The default if this is empty or undefined is to store the cached
+OCR data under $RECOLL_CONFDIR/ocrcache.</para></listitem></varlistentry>
+<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.TESSERACTLANG">
+<term><varname>tesseractlang</varname></term>
+<listitem><para>Language to assume for tesseract OCR. Important for improving the OCR accuracy. This can also be set
+through the contents of a file in
+the currently processed directory. See the rclocrtesseract.py
+script. Example values: eng, fra... See the tesseract documentation.</para></listitem></varlistentry>
+<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.TESSERACTCMD">
+<term><varname>tesseractcmd</varname></term>
+<listitem><para>Path for the tesseract command. This is mostly useful on Windows, or for specifying a non-default
+tesseract command, e.g. on Windows:
+C:/Program&nbsp;Files&nbsp;(x86)/Tesseract-OCR/tesseract.exe</para></listitem></varlistentry>
+<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.ABBYYLANG">
+<term><varname>abbyylang</varname></term>
+<listitem><para>Language to assume for ABBYY OCR. Important for improving the OCR accuracy. This can also be set
+through the contents of a file in
+the currently processed directory. See the rclocrabbyy.py
+script. Typical values: English, French... See the ABBYY documentation.
+</para></listitem></varlistentry>
+<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.ABBYYCMD">
+<term><varname>abbyycmd</varname></term>
+<listitem><para>Path for the ABBYY command. The ABBYY directory is usually not in the PATH, so you should set this.
+</para></listitem></varlistentry>
+</variablelist></sect3>
 <sect3 id="RCL.INSTALL.CONFIG.RECOLLCONF.SPECLOCATIONS">
 <title>Parameters set for specific locations </title><variablelist>
 <varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.MHMBOXQUIRKS">
--- a/src/doc/user/usermanual.html
+++ b/src/doc/user/usermanual.html
@@ -3,7 +3,7 @@
 <html>
 <head>
  <meta name="generator" content=
-  "HTML Tidy for HTML5 for Linux version 5.2.0">
+  "HTML Tidy for HTML5 for Linux version 5.6.0">
  <meta http-equiv="Content-Type" content=
  "text/html; charset=utf-8">
  <title>Recoll user manual</title>
@@ -157,20 +157,19 @@ alink="#0000FF">
            <dd>
              <dl>
                <dt><span class="sect2">2.8.1. <a href=
-                "#RCL.INDEXING.PDF.OCR">OCR with
-                Tesseract</a></span></dt>
-                <dt><span class="sect2">2.8.2. <a href=
                "#RCL.INDEXING.PDF.XMP">XMP fields
                extraction</a></span></dt>
-                <dt><span class="sect2">2.8.3. <a href=
+                <dt><span class="sect2">2.8.2. <a href=
                "#RCL.INDEXING.PDF.ATTACH">PDF attachment
                indexing</a></span></dt>
              </dl>
            </dd>
            <dt><span class="sect1">2.9. <a href=
+            "#RCL.INDEXING.OCR">Recoll and OCR</a></span></dt>
+            <dt><span class="sect1">2.10. <a href=
            "#RCL.INDEXING.PERIODIC">Periodic
            indexing</a></span></dt>
-            <dt><span class="sect1">2.10. <a href=
+            <dt><span class="sect1">2.11. <a href=
            "#RCL.INDEXING.MONITOR"><span class=
            "application">Unix</span>-like systems: real time
            indexing</a></span></dt>
@@ -781,7 +780,7 @@ alink="#0000FF">
            "list-style-type: disc;">
              <li class="listitem">
                <p><b><a class="link" href="#RCL.INDEXING.PERIODIC"
-                title="2.9.&nbsp;Periodic indexing">Periodic (or
+                title="2.10.&nbsp;Periodic indexing">Periodic (or
                batch) indexing</a> .&nbsp;</b><span class=
                "command"><strong>recollindex</strong></span> is
                executed at discrete times. On <span class=
@@ -799,7 +798,7 @@ alink="#0000FF">
              <li class="listitem">
                <p><b><a class="link" href="#RCL.INDEXING.MONITOR"
                title=
-                "2.10.&nbsp;Unix-like systems: real time indexing">Real
+                "2.11.&nbsp;Unix-like systems: real time indexing">Real
                time indexing</a> .&nbsp;</b>(Only available on
                <span class="application">Unix</span>-like
                systems). <span class=
@@ -831,7 +830,7 @@ alink="#0000FF">
            indexing on a small home directory), or, with
            <span class="application">Recoll</span> 1.24 and newer,
            by <a class="link" href="#RCL.INDEXING.MONITOR" title=
-            "2.10.&nbsp;Unix-like systems: real time indexing">configuring
+            "2.11.&nbsp;Unix-like systems: real time indexing">configuring
            the index so that only a subset of the tree will be
            monitored.</a></p>
            <p>The choice of method and the parameters used can be
@@ -1136,8 +1135,8 @@ alink="#0000FF">
              different areas of the file system to different
              indexes. For example, if you were to issue the
              following command:</p>
-              <pre class="programlisting">
-              recoll -c ~/.indexes-email</pre>
+              <pre class=
+              "programlisting">recoll -c ~/.indexes-email</pre>
              <p>Then <span class="application">Recoll</span> would
              use configuration files stored in <code class=
              "filename">~/.indexes-email/</code> and, (unless
@@ -2141,45 +2140,16 @@ metadatacmds = ; <em class=
        if the document text is empty, it can be configured to
        extract specific metadata tags from an XMP packet, and to
        extract PDF attachments.</p>
-        <div class="sect2">
-          <div class="titlepage">
-            <div>
-              <div>
-                <h3 class="title"><a name="RCL.INDEXING.PDF.OCR"
-                id="RCL.INDEXING.PDF.OCR"></a>2.8.1.&nbsp;OCR with
-                Tesseract</h3>
-              </div>
-            </div>
-          </div>
-          <p>If both <span class="application">tesseract</span> and
-          <span class="command"><strong>pdftoppm</strong></span>
-          (generally from the <span class=
-          "application">poppler-utils</span> package) are
-          installed, the PDF handler may attempt OCR on PDF files
-          with no text content. This is controlled by the <a class=
-          "link" href=
-          "#RCL.INSTALL.CONFIG.RECOLLCONF.PDFOCR">pdfocr</a>
-          configuration variable, which is false by default because
-          OCR is very slow.</p>
-          <p>The choice of language is very important for
-          successfull OCR. Recoll has currently no way to determine
-          this from the document itself. You can set the language
-          to use through the contents of a <code class=
-          "filename">.ocrpdflang</code> text file in the same
-          directory as the PDF document, or through the
-          <code class="envar">RECOLL_TESSERACT_LANG</code>
-          environment variable, or through the contents of an
-          <code class="filename">ocrpdf</code> text file inside the
-          configuration directory. If none of the above are used,
-          <span class="application">Recoll</span> will try to guess
-          the language from the NLS environment.</p>
-        </div>
+        <p>The PDF handler can execute an external program to run
+        OCR if no text is found in the document. This is now
+        described in a <a class="link" href="#RCL.INDEXING.OCR"
+        title="2.9.&nbsp;Recoll and OCR">separate section</a>.</p>
        <div class="sect2">
          <div class="titlepage">
            <div>
              <div>
                <h3 class="title"><a name="RCL.INDEXING.PDF.XMP"
-                id="RCL.INDEXING.PDF.XMP"></a>2.8.2.&nbsp;XMP
+                id="RCL.INDEXING.PDF.XMP"></a>2.8.1.&nbsp;XMP
                fields extraction</h3>
              </div>
            </div>
@@ -2236,7 +2206,7 @@ metadatacmds = ; <em class=
            <div>
              <div>
                <h3 class="title"><a name="RCL.INDEXING.PDF.ATTACH"
-                id="RCL.INDEXING.PDF.ATTACH"></a>2.8.3.&nbsp;PDF
+                id="RCL.INDEXING.PDF.ATTACH"></a>2.8.2.&nbsp;PDF
                attachment indexing</h3>
              </div>
            </div>
@@ -2252,13 +2222,67 @@ metadatacmds = ; <em class=
          uncommon in my experience).</p>
        </div>
      </div>
+      <div class="sect1">
+        <div class="titlepage">
+          <div>
+            <div>
+              <h2 class="title" style="clear: both"><a name=
+              "RCL.INDEXING.OCR" id=
+              "RCL.INDEXING.OCR"></a>2.9.&nbsp;Recoll and OCR</h2>
+            </div>
+          </div>
+        </div>
+        <p>This is new in <span class="application">Recoll</span>
+        1.26.5. Older versions had a more limited, non-caching
+        capability to execute an external OCR program in the PDF
+        handler. The new function has the following features:</p>
+        <div class="itemizedlist">
+          <ul class="itemizedlist" style="list-style-type: disc;">
+            <li class="listitem">
+              <p>The OCR output is cached, stored as separate
+              files. The caching is ultimately based on a hash
+              value of the original file contents, so that it is
+              immune to file renames. A first path-based layer
+              ensures fast operation for unchanged (unmoved files),
+              and the data hash (which is still orders of magnitude
+              faster than OCR) is only re-computed if the file has
+              moved. OCR is only performed if the file was not
+              previously processed or if it changed.</p>
+            </li>
+            <li class="listitem">
+              <p>The support for a specific program is implemented
+              in a simple Python module. It should be
+              straightforward to add support for any OCR engine
+              with a capability to run from the command line.</p>
+            </li>
+            <li class="listitem">
+              <p>Modules initially exist for <span class=
+              "application">tesseract</span> (Linux and Windows),
+              and <span class="application">ABBYY FineReader</span>
+              (Linux, tested with version 11). ABBYY FineReader is
+              a commercial closed source program, but it sometimes
+              performs better than tesseract.</p>
+            </li>
+            <li class="listitem">
+              <p>The OCR is currently only called from the PDF
+              handler, but there should be no problem using it for
+              other image types.</p>
+            </li>
+          </ul>
+        </div>
+        <p>For configuration, see the <a class="link" href=
+        "#RCL.INSTALL.CONFIG.RECOLLCONF.OCR" title=
+        "Parameters for OCR processing">relevant section</a>. All
+        parameters can be localized in subdirectories through the
+        usual main configuration mechanism (path sections).</p>
+      </div>
      <div class="sect1">
        <div class="titlepage">
          <div>
            <div>
              <h2 class="title" style="clear: both"><a name=
              "RCL.INDEXING.PERIODIC" id=
-              "RCL.INDEXING.PERIODIC"></a>2.9.&nbsp;Periodic
+              "RCL.INDEXING.PERIODIC"></a>2.10.&nbsp;Periodic
              indexing</h2>
            </div>
          </div>
@@ -2431,7 +2455,7 @@ metadatacmds = ; <em class=
            <div>
              <h2 class="title" style="clear: both"><a name=
              "RCL.INDEXING.MONITOR" id=
-              "RCL.INDEXING.MONITOR"></a>2.10.&nbsp;<span class=
+              "RCL.INDEXING.MONITOR"></a>2.11.&nbsp;<span class=
              "application">Unix</span>-like systems: real time
              indexing</h2>
            </div>
@@ -3759,8 +3783,8 @@ fs.inotify.max_user_watches=32768
          that every user does not have to do it. The variable
          should define a colon-separated list of index
          directories, ie:</p>
-          <pre class="screen">
-          export RECOLL_EXTRA_DBS=/some/place/xapiandb:/some/other/db</pre>
+          <pre class=
+          "screen">export RECOLL_EXTRA_DBS=/some/place/xapiandb:/some/other/db</pre>
          <p>Another environment variable, <code class=
          "envar">RECOLL_ACTIVE_EXTRA_DBS</code> allows adding to
          the active list of indexes. This variable was suggested
@@ -4565,8 +4589,8 @@ fs.inotify.max_user_watches=32768
              parent folder expansion, usually creating a file
              manager window on the folder where the container file
              resides. E.g.:</p>
-              <pre class="programlisting">
-              &lt;a href="F%N"&gt;%P&lt;/a&gt;</pre>
+              <pre class=
+              "programlisting">&lt;a href="F%N"&gt;%P&lt;/a&gt;</pre>
              <p>A link target defined as <code class=
              "literal">R%N|<em class=
              "replaceable"><code>scriptname</code></em></code>
@@ -4708,8 +4732,8 @@ fs.inotify.max_user_watches=32768
          <span class="application">javascript</span> program to
          the documents, like the following example, which would
          initiate a search by double-clicking any term:</p>
-          <pre class="programlisting">
-          &lt;script language="JavaScript"&gt;
+          <pre class=
+          "programlisting">&lt;script language="JavaScript"&gt;
        function recollsearch() {
        var t = document.getSelection();
        window.location.href = 'recoll://search/query?qtp=a&amp;p=0&amp;q=' +
@@ -8838,7 +8862,8 @@ for i in range(nres):
                  <p>Languages for which to create stemming
                  expansion data. Stemmer names can be found by
                  executing 'recollindex -l', or this can also be
-                  set from a list in the GUI.</p>
+                  set from a list in the GUI. The values are full
+                  language names, e.g. english, french...</p>
                </dd>
                <dt><a name=
                "RCL.INSTALL.CONFIG.RECOLLCONF.DEFAULTCHARSET" id=
@@ -9425,7 +9450,8 @@ for i in range(nres):
                  aspell language definition files. You can type
                  "aspell dicts" to see a list The default if this
                  is not set is to use the NLS environment to guess
-                  the value.</p>
+                  the value. The values are the 2-letter language
+                  codes (e.g. 'en', 'fr'...).</p>
                </dd>
                <dt><a name=
                "RCL.INSTALL.CONFIG.RECOLLCONF.ASPELLADDCREATEPARAM"
@@ -9500,21 +9526,32 @@ for i in range(nres):
                  *.log:20 "*with spaces.*:30"</p>
                </dd>
                <dt><a name=
+                "RCL.INSTALL.CONFIG.RECOLLCONF.IDXNICEPRIO" id=
+                "RCL.INSTALL.CONFIG.RECOLLCONF.IDXNICEPRIO"></a><span class="term"><code class="varname">idxniceprio</code></span></dt>
+                <dd>
+                  <p>"nice" process priority for the indexing
+                  processes. Default: 19 (lowest). Appeared with
+                  1.26.5. Prior versions were fixed at 19.</p>
+                </dd>
+                <dt><a name=
                "RCL.INSTALL.CONFIG.RECOLLCONF.MONIONICECLASS" id=
                "RCL.INSTALL.CONFIG.RECOLLCONF.MONIONICECLASS"></a><span class="term"><code class="varname">monioniceclass</code></span></dt>
                <dd>
-                  <p>ionice class for the real time indexing
-                  process On platforms where this is supported. The
-                  default value is 3.</p>
+                  <p>ionice class for the indexing process. On
+                  platforms where this is supported, and despite the
+                  misleading name, this affects all indexing
+                  processes, not only the real time/monitoring
+                  ones. The default value is 3 (use lowest "Idle"
+                  priority).</p>
                </dd>
                <dt><a name=
                "RCL.INSTALL.CONFIG.RECOLLCONF.MONIONICECLASSDATA"
                id=
                "RCL.INSTALL.CONFIG.RECOLLCONF.MONIONICECLASSDATA"></a><span class="term"><code class="varname">monioniceclassdata</code></span></dt>
                <dd>
-                  <p>ionice class parameter for the real time
-                  indexing process. On platforms where this is
-                  supported. The default is empty.</p>
+                  <p>ionice class level parameter if the class
+                  supports it. The default is empty, as the default
+                  "Idle" class has no levels.</p>
                </dd>
              </dl>
            </div>
@@ -9611,20 +9648,10 @@ for i in range(nres):
                id=
                "RCL.INSTALL.CONFIG.RECOLLCONF.PDFOCR"></a><span class="term"><code class="varname">pdfocr</code></span></dt>
                <dd>
-                  <p>Attempt OCR of PDF files with no text content
-                  if both tesseract and pdftoppm are installed.
+                  <p>Attempt OCR of PDF files with no text content.
                  This can be defined in subdirectories. The
-                  default is off because OCR is so very slow.</p>
-                </dd>
-                <dt><a name=
-                "RCL.INSTALL.CONFIG.RECOLLCONF.PDFOCRLANG" id=
-                "RCL.INSTALL.CONFIG.RECOLLCONF.PDFOCRLANG"></a><span class="term"><code class="varname">pdfocrlang</code></span></dt>
-                <dd>
-                  <p>Language to assume for PDF OCR. This is very
-                  important for having a reasonable rate of errors
-                  with tesseract. This can also be set through a
-                  configuration variable or directory-local
-                  parameters. See the rclpdf.py script.</p>
+                  default is off because OCR is so very slow. This
+                  only has an effect if ocrprogs is defined.</p>
                </dd>
                <dt><a name=
                "RCL.INSTALL.CONFIG.RECOLLCONF.PDFATTACH" id=
@@ -9666,6 +9693,80 @@ for i in range(nres):
              </dl>
            </div>
          </div>
+          <div class="sect3">
+            <div class="titlepage">
+              <div>
+                <div>
+                  <h4 class="title"><a name=
+                  "RCL.INSTALL.CONFIG.RECOLLCONF.OCR" id=
+                  "RCL.INSTALL.CONFIG.RECOLLCONF.OCR"></a>Parameters
+                  for OCR processing</h4>
+                </div>
+              </div>
+            </div>
+            <div class="variablelist">
+              <dl class="variablelist">
+                <dt><a name=
+                "RCL.INSTALL.CONFIG.RECOLLCONF.OCRPROGS" id=
+                "RCL.INSTALL.CONFIG.RECOLLCONF.OCRPROGS"></a><span class="term"><code class="varname">ocrprogs</code></span></dt>
+                <dd>
+                  <p>OCR modules to try. The top-level OCR script will
+                  try to load the corresponding modules in order
+                  and use the first which reports being capable of
+                  performing OCR on the input file. Modules for
+                  tesseract and ABBYY FineReader are present in the
+                  standard distribution.</p>
+                </dd>
+                <dt><a name=
+                "RCL.INSTALL.CONFIG.RECOLLCONF.OCRCACHEDIR" id=
+                "RCL.INSTALL.CONFIG.RECOLLCONF.OCRCACHEDIR"></a><span class="term"><code class="varname">ocrcachedir</code></span></dt>
+                <dd>
+                  <p>Location for caching OCR data. The default if
+                  this is empty or undefined is to store the cached
+                  OCR data under $RECOLL_CONFDIR/ocrcache.</p>
+                </dd>
+                <dt><a name=
+                "RCL.INSTALL.CONFIG.RECOLLCONF.TESSERACTLANG" id=
+                "RCL.INSTALL.CONFIG.RECOLLCONF.TESSERACTLANG"></a><span class="term"><code class="varname">tesseractlang</code></span></dt>
+                <dd>
+                  <p>Language to assume for tesseract OCR.
+                  Important for improving the OCR accuracy. This
+                  can also be set through the contents of a file in
+                  the currently processed directory. See the
+                  rclocrtesseract.py script. Example values: eng,
+                  fra... See the tesseract documentation.</p>
+                </dd>
+                <dt><a name=
+                "RCL.INSTALL.CONFIG.RECOLLCONF.TESSERACTCMD" id=
+                "RCL.INSTALL.CONFIG.RECOLLCONF.TESSERACTCMD"></a><span class="term"><code class="varname">tesseractcmd</code></span></dt>
+                <dd>
+                  <p>Path for the tesseract command. This is mostly
+                  useful on Windows, or for specifying a
+                  non-default tesseract command, e.g. on Windows:
+                  C:/Program&nbsp;Files&nbsp;(x86)/Tesseract-OCR/tesseract.exe</p>
+                </dd>
+                <dt><a name=
+                "RCL.INSTALL.CONFIG.RECOLLCONF.ABBYYLANG" id=
+                "RCL.INSTALL.CONFIG.RECOLLCONF.ABBYYLANG"></a><span class="term"><code class="varname">abbyylang</code></span></dt>
+                <dd>
+                  <p>Language to assume for ABBYY OCR. Important
+                  for improving the OCR accuracy. This can also be
+                  set through the contents of a file in the
+                  currently processed directory. See the
+                  rclocrabbyy.py script. Typical values: English,
+                  French... See the ABBYY documentation.</p>
+                </dd>
+                <dt><a name=
+                "RCL.INSTALL.CONFIG.RECOLLCONF.ABBYYCMD" id=
+                "RCL.INSTALL.CONFIG.RECOLLCONF.ABBYYCMD"></a><span class="term"><code class="varname">abbyycmd</code></span></dt>
+                <dd>
+                  <p>Path for the ABBYY command. The ABBYY directory
+                  is usually not in the PATH, so you should set
+                  this.</p>
+                </dd>
+              </dl>
+            </div>
+          </div>
          <div class="sect3">
            <div class="titlepage">
              <div>
@@ -9858,8 +9959,8 @@ for i in range(nres):
          "filename">.xml</code> extension but should be handled
          specially, which is possible because they are usually all
          located in one place. Example:</p>
-          <pre class="programlisting">
-          [~/.kde/share/apps/okular/docdata]
+          <pre class=
+          "programlisting">[~/.kde/share/apps/okular/docdata]
        .xml = application/x-okular-notes</pre>
          <p>The <code class="varname">recoll_noindex</code>
          <code class="filename">mimemap</code> variable has been
--- a/src/doc/user/usermanual.xml
+++ b/src/doc/user/usermanual.xml
@@ -1414,30 +1414,9 @@ metadatacmds = ; <replaceable>tags</replaceable> = tmsu tags %f
      specific metadata tags from an XMP packet, and to extract PDF
      attachments.</para>

-      <sect2 id="RCL.INDEXING.PDF.OCR">
-        <title>OCR with Tesseract</title>
-
-        <para>If both <application>tesseract</application> and
-        <command>pdftoppm</command> (generally from the
-        <application>poppler-utils</application> package) are installed,
-        the PDF handler may attempt OCR on PDF files with no text
-        content. This is controlled by the
-        <link linkend="RCL.INSTALL.CONFIG.RECOLLCONF.PDFOCR">pdfocr</link>
-        configuration variable, which is false by default because
-        OCR is very slow.</para>
-
-        <para>The choice of language is very important for successfull
-        OCR. Recoll has currently no way to determine this from the
-        document itself. You can set the language to use through the
-        contents of a <filename>.ocrpdflang</filename> text file in the
-        same directory as the PDF document, or through the
-        <envar>RECOLL_TESSERACT_LANG</envar> environment variable, or
-        through the contents of an <filename>ocrpdf</filename> text file
-        inside the configuration directory. If none of the above are used,
-        &RCL; will try to guess the language from the NLS
-        environment.</para>
-
-      </sect2>
+	  <para>The PDF handler can execute an external program to run OCR if
+	  no text is found in the document. This is now described in a 
+	  <link linkend="RCL.INDEXING.OCR">separate section</link>.</para>
      
      <sect2 id="RCL.INDEXING.PDF.XMP">
        <title>XMP fields extraction</title>
@@ -1510,6 +1489,47 @@ metadatacmds = ; <replaceable>tags</replaceable> = tmsu tags %f
      
    </sect1>

+	<sect1 id="RCL.INDEXING.OCR">
+      <title>Recoll and OCR</title>
+
+	  <para>This is new in &RCL; 1.26.5. Older versions had a more limited,
+	  non-caching capability to execute an external OCR program in the PDF
+	  handler. The new function has the following features:
+
+	  <itemizedlist>
+		<listitem><para>The OCR output is cached, stored as separate
+		files. The caching is ultimately based on a hash value of the
+		original file contents, so that it is immune to file renames. A
+		first path-based layer ensures fast operation for unchanged
+		(unmoved) files, and the data hash (which is still orders of
+		magnitude faster than OCR) is only re-computed if the file has
+		moved. OCR is only performed if the file was not previously
+		processed or if it changed.</para></listitem>
+		<listitem><para>The support for a specific program is implemented
+		in a simple Python module. It should be straightforward to add
+		support for any OCR engine with a capability to run from the
+		command line.</para></listitem>
+		<listitem><para>Modules initially exist for
+		<application>tesseract</application> (Linux and Windows), and
+		<application>ABBYY FineReader</application> (Linux, tested with
+		version 11). ABBYY FineReader is a commercial closed source
+		program, but it sometimes performs better than
+		tesseract.</para></listitem>
+		<listitem><para>The OCR is currently only called from the PDF
+		handler, but there should be no problem using it for other image
+		types.</para></listitem>
+	  </itemizedlist>
+	</para>
+
+	<para>For configuration, see the 
+	  <link linkend="RCL.INSTALL.CONFIG.RECOLLCONF.OCR">
+		relevant section</link>. All parameters can be localized in
+		subdirectories through the usual main configuration mechanism (path
+		sections).</para>
+
+    </sect1>
+
+
    <sect1 id="RCL.INDEXING.PERIODIC">
      <title>Periodic indexing</title>

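Note on the caching scheme described in the new RCL.INDEXING.OCR section: the following is a minimal sketch, not the code from this commit. It assumes an in-memory cache and a caller-supplied `run_ocr` callable. The class name `OcrCache` and its methods are hypothetical, and SHA-1 plus a (size, mtime) signature are illustrative choices only. It shows a path-based layer that avoids hashing unchanged files, backed by a content-hash layer that survives renames.

```python
import hashlib
import os


class OcrCache:
    """Toy two-layer OCR cache (hypothetical, for illustration only)."""

    def __init__(self):
        # Layer 1: path -> ((size, mtime_ns), content hash)
        self.by_path = {}
        # Layer 2: content hash -> OCR text
        self.by_hash = {}

    @staticmethod
    def _signature(path):
        # Cheap check used to decide whether the file at this path changed.
        st = os.stat(path)
        return (st.st_size, st.st_mtime_ns)

    @staticmethod
    def _hash(path):
        # Hash the file contents in chunks to keep memory use bounded.
        h = hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
        return h.hexdigest()

    def get_text(self, path, run_ocr):
        sig = self._signature(path)
        entry = self.by_path.get(path)
        if entry is not None and entry[0] == sig:
            # Same path, same signature: reuse the stored hash.
            digest = entry[1]
        else:
            # New, moved or modified file: recompute the content hash.
            digest = self._hash(path)
            self.by_path[path] = (sig, digest)
        if digest not in self.by_hash:
            # Contents never seen before: only now run the slow OCR step.
            self.by_hash[digest] = run_ocr(path)
        return self.by_hash[digest]
```

With this layout, renaming a file costs one hash computation but no OCR run, while changing its contents yields a new hash and triggers OCR again.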
--- a/src/sampleconf/recoll.conf
+++ b/src/sampleconf/recoll.conf
@ -350,7 +350,8 @@ indexStoreDocText = 1
 #
 # <brief>Languages for which to create stemming expansion
 # data.</brief><descr>Stemmer names can be found by executing 'recollindex
-# -l', or this can also be set from a list in the GUI.</descr></var>
+# -l', or this can also be set from a list in the GUI. The values are full
+# language names, e.g. english, french...</descr></var>
 indexstemminglanguages = english 

 # <var name="defaultcharset" type="string"><brief>Default character
@ -760,9 +761,9 @@ checkneedretryindexscript = rclcheckneedretry.sh
 #
 # <brief>Language definitions to use when creating the aspell
 # dictionary.</brief><descr>The value must match a set of aspell language
-# definition files. You can type "aspell dicts"  to see a list The default
-# if this is not set is to use the NLS environment to guess the
-# value.</descr></var>
+# definition files. You can type "aspell dicts" to see a list. The default
+# if this is not set is to use the NLS environment to guess the value. The
+# values are the 2-letter language codes (e.g. 'en', 'fr'...)</descr></var>
 #aspellLanguage = en

 # <var name="aspellAddCreateParam" type="string">
@ -902,19 +903,11 @@ snippetMaxPosWalk = 1000000

 # <var name="pdfocr" type="bool">
 #
-# <brief>Attempt OCR of PDF files with no text content if both tesseract and
-# pdftoppm are installed.</brief>
+# <brief>Attempt OCR of PDF files with no text content.</brief>
 # <descr>This can be defined in subdirectories. The default is off because
-# OCR is so very slow.</descr></var>
-#pdfocr = 0
-
-# <var name="pdfocrlang" type="string">
-#  <brief>Language to assume for PDF OCR.</brief>
-#  <descr>This is very important for having a reasonable rate of errors
-#   with tesseract. This can also be set through a configuration variable
-#   or directory-local parameters. See the rclpdf.py script.</descr>
+# OCR is so very slow. Will only do anything if ocrprogs is defined.</descr>
 # </var>
-#pdfocrlang = eng
+#pdfocr = 0

 # <var name="pdfattach" type="bool">
 #
@ -946,6 +939,60 @@ snippetMaxPosWalk = 1000000
 #pdfextrametafix =  /path/to/fixerscript.py


+# <grouptitle id="OCR">Parameters for OCR processing</grouptitle>
+
+
+# <var name="ocrprogs" type="string">
+# <brief>OCR modules to try.</brief>
+# <descr>The top OCR script will try to load the corresponding modules in
+# order and use the first which reports being capable of performing OCR on
+# the input file. Modules for tesseract and ABBYY FineReader are present in
+# the standard distribution.</descr>
+# </var>
+#ocrprogs = abbyy tesseract
+
+# <var name="ocrcachedir" type="dfn">
+# <brief>Location for caching OCR data.</brief>
+# <descr>The default if this is empty or undefined is to store the cached
+# OCR data under $RECOLL_CONFDIR/ocrcache.</descr>
+# </var>
+#ocrcachedir=
+
+
+# <var name="tesseractlang" type="string">
+#  <brief>Language to assume for tesseract OCR.</brief>
+#  <descr>Important for improving the OCR accuracy. This can also be set
+#  through the contents of a file in
+#  the currently processed directory. See the rclocrtesseract.py
+#  script. Example values: eng, fra... See the tesseract documentation.</descr>
+# </var>
+#tesseractlang = eng
+
+# <var name="tesseractcmd" type="fn">
+# <brief>Path for the tesseract command.</brief>
+# <descr>This is mostly useful on Windows, or for specifying a non-default
+# tesseract command. e.g. on Windows:
+# C:/Program&nbsp;Files&nbsp;(x86)/Tesseract-OCR/tesseract.exe</descr>
+# </var>
+#tesseractcmd = c:/Program Files (x86)/Tesseract-OCR/tesseract.exe
+
+# <var name="abbyylang" type="string">
+# <brief>Language to assume for abbyy OCR.</brief>
+# <descr>Important for improving the OCR accuracy. This can also be set
+# through the contents of a file in
+# the currently processed directory. See the rclocrabbyy.py
+# script. Typical values: English, French... See the ABBYY documentation.
+# </descr>
+# </var>
+#abbyylang = English
+
+# <var name="abbyycmd" type="fn">
+# <brief>Path for the abbyy command</brief>
+# <descr>The ABBYY directory is usually not in the path, so you should set this.
+# </descr>
+# </var>
+abbyycmd = /opt/ABBYYOCR11/abbyyocr11
+
 # <grouptitle id="SPECLOCATIONS">Parameters set for specific
 # locations</grouptitle>