document the new ocr function and its config

2020-02-27 18:17:51 +01:00 · 2020-02-27 18:17:51 +01:00 · 17d29774b0
commit 17d29774b0
parent 40ead3aa7e
4 changed files with 338 additions and 134 deletions
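A minimal sketch of how the OCR parameters documented in this commit might be combined in recoll.conf. The parameter names come from the documentation below; the values, the module name given to ocrprogs (assumed to follow the rclocrtesseract.py script name), and the directory paths are illustrative assumptions, not tested settings.

```ini
# Hypothetical recoll.conf fragment: all values are illustrative assumptions.
# Attempt OCR on PDF files with no text content (off by default).
pdfocr = 1
# OCR modules to try, in order. The module name is assumed from the
# rclocrtesseract.py script name.
ocrprogs = tesseract
# Language for tesseract; example values from the documentation: eng, fra.
tesseractlang = eng
# Leave unset to cache under the default $RECOLL_CONFDIR/ocrcache.
#ocrcachedir = /some/place/ocrcache

# Path section (hypothetical directory): override the language for a subtree.
[/home/me/docs/french]
tesseractlang = fra
```

Since pdfocr only has an effect when ocrprogs is defined, both are set together here; the path section relies on the statement that all OCR parameters can be localized in subdirectories.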
--- a/src/doc/user/recoll.conf.xml
+++ b/src/doc/user/recoll.conf.xml
@ -247,8 +247,8 @@ will reduce the index size. This can only be set for a whole index, not
 for a subtree.</para></listitem></varlistentry>
 <varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.DEHYPHENATE">
 <term><varname>dehyphenate</varname></term>
-<listitem><para>Determines if we index
+<listitem><para>Determines if we index 'coworker'
-'coworker' also when the input is 'co-worker'. This is new
+also when the input is 'co-worker'. This is new
 in version 1.22, and on by default. Setting the variable to off allows
 restoring the previous behaviour.</para></listitem></varlistentry>
 <varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.BACKSLASHASLETTER">
@ -279,7 +279,8 @@ as large.</para></listitem></varlistentry>
 <term><varname>indexstemminglanguages</varname></term>
 <listitem><para>Languages for which to create stemming expansion
 data. Stemmer names can be found by executing 'recollindex
-l', or this can also be set from a list in the GUI.</para></listitem></varlistentry>
+-l', or this can also be set from a list in the GUI. The values are full
 language names, e.g. english, french...</para></listitem></varlistentry>
 <varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.DEFAULTCHARSET">
 <term><varname>defaultcharset</varname></term>
 <listitem><para>Default character
@ -608,9 +609,9 @@ space issues.</para></listitem></varlistentry>
 <term><varname>aspellLanguage</varname></term>
 <listitem><para>Language definitions to use when creating the aspell
 dictionary. The value must match a set of aspell language
-definition files. You can type "aspell dicts"  to see a list The default
+definition files. You can type "aspell dicts" to see a list. The default
-if this is not set is to use the NLS environment to guess the
+if this is not set is to use the NLS environment to guess the value. The
-value.</para></listitem></varlistentry>
+values are the 2-letter language codes (e.g. 'en', 'fr'...)</para></listitem></varlistentry>
 <varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.ASPELLADDCREATEPARAM">
 <term><varname>aspellAddCreateParam</varname></term>
 <listitem><para>Additional option and parameter to aspell dictionary creation
@ -650,14 +651,20 @@ patterns are matched with fnmatch(pattern, path, 0) You can quote entries
 containing white space with double quotes (quote the whole entry, not the
 pattern). The default is empty.
 Example: mondelaypatterns = *.log:20 "*with spaces.*:30"</para></listitem></varlistentry>
 <varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.IDXNICEPRIO">
 <term><varname>idxniceprio</varname></term>
 <listitem><para>"nice" process priority for the indexing processes. Default: 19
 (lowest). Appeared with 1.26.5. Prior versions were fixed at 19.</para></listitem></varlistentry>
 <varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.MONIONICECLASS">
 <term><varname>monioniceclass</varname></term>
-<listitem><para>ionice class for the real time indexing process On platforms where this is supported. The default value is
+<listitem><para>ionice class for the indexing process. Despite the misleading name, and on platforms where this is
-3.</para></listitem></varlistentry>
+supported, this affects all indexing processes,
 not only the real time/monitoring ones. The default value is 3 (use
 lowest "Idle" priority).</para></listitem></varlistentry>
 <varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.MONIONICECLASSDATA">
 <term><varname>monioniceclassdata</varname></term>
-<listitem><para>ionice class parameter for the real time indexing process. On platforms where this is supported. The default is
+<listitem><para>ionice class level parameter if the class supports it. The default is empty, as the default "Idle" class has no
-empty.</para></listitem></varlistentry>
+levels.</para></listitem></varlistentry>
 </variablelist></sect3>
 <sect3 id="RCL.INSTALL.CONFIG.RECOLLCONF.QUERY">
 <title>Query-time parameters (no impact on the index) </title><variablelist>
@ -700,14 +707,8 @@ with possibly meaning-altering missing words.</para></listitem></varlistentry>
 <title>Parameters for the PDF input script </title><variablelist>
 <varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.PDFOCR">
 <term><varname>pdfocr</varname></term>
-<listitem><para>Attempt OCR of PDF files with no text content if both tesseract and
+<listitem><para>Attempt OCR of PDF files with no text content. This can be defined in subdirectories. The default is off because
-pdftoppm are installed. This can be defined in subdirectories. The default is off because
+OCR is so very slow. Will only do anything if ocrprogs is defined.</para></listitem></varlistentry>
 OCR is so very slow.</para></listitem></varlistentry>
 <varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.PDFOCRLANG">
 <term><varname>pdfocrlang</varname></term>
 <listitem><para>Language to assume for PDF OCR. This is very important for having a reasonable rate of errors
 with tesseract. This can also be set through a configuration variable
 or directory-local parameters. See the rclpdf.py script.</para></listitem></varlistentry>
 <varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.PDFATTACH">
 <term><varname>pdfattach</varname></term>
 <listitem><para>Enable PDF attachment extraction by executing pdftk (if
@ -732,6 +733,41 @@ selected field, for editing or erasing. A new instance is created for
 each document, so that the object can keep state for, e.g. eliminating
 duplicate values.</para></listitem></varlistentry>
 </variablelist></sect3>
 <sect3 id="RCL.INSTALL.CONFIG.RECOLLCONF.OCR">
 <title>Parameters for OCR processing </title><variablelist>
 <varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.OCRPROGS">
 <term><varname>ocrprogs</varname></term>
 <listitem><para>OCR modules to try. The top OCR script will try to load the corresponding modules in
 order and use the first which reports being capable of performing OCR on
 the input file. Modules for tesseract and ABBYY FineReader are present in
 the standard distribution.</para></listitem></varlistentry>
 <varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.OCRCACHEDIR">
 <term><varname>ocrcachedir</varname></term>
 <listitem><para>Location for caching OCR data. The default if this is empty or undefined is to store the cached
 OCR data under $RECOLL_CONFDIR/ocrcache.</para></listitem></varlistentry>
 <varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.TESSERACTLANG">
 <term><varname>tesseractlang</varname></term>
 <listitem><para>Language to assume for tesseract OCR. Important for improving the OCR accuracy. This can also be set
 through the contents of a file in
 the currently processed directory. See the rclocrtesseract.py
 script. Example values: eng, fra... See the tesseract documentation.</para></listitem></varlistentry>
 <varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.TESSERACTCMD">
 <term><varname>tesseractcmd</varname></term>
 <listitem><para>Path for the tesseract command. This is mostly useful on Windows, or for specifying a non-default
 tesseract command. e.g. on Windows:
 C:/Program&nbsp;Files&nbsp;(x86)/Tesseract-OCR/tesseract.exe</para></listitem></varlistentry>
 <varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.ABBYYLANG">
 <term><varname>abbyylang</varname></term>
 <listitem><para>Language to assume for abbyy OCR. Important for improving the OCR accuracy. This can also be set
 through the contents of a file in
 the currently processed directory. See the rclocrabbyy.py
 script. Typical values: English, French... See the ABBYY documentation.
 </para></listitem></varlistentry>
 <varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.ABBYYCMD">
 <term><varname>abbyycmd</varname></term>
 <listitem><para>Path for the abbyy command. The ABBYY directory is usually not in the path, so you should set this.
 </para></listitem></varlistentry>
 </variablelist></sect3>
 <sect3 id="RCL.INSTALL.CONFIG.RECOLLCONF.SPECLOCATIONS">
 <title>Parameters set for specific locations </title><variablelist>
 <varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.MHMBOXQUIRKS">
--- a/src/doc/user/usermanual.html
+++ b/src/doc/user/usermanual.html
@ -3,7 +3,7 @@
 <html>
 <head>
  <meta name="generator" content=
-  "HTML Tidy for HTML5 for Linux version 5.2.0">
+  "HTML Tidy for HTML5 for Linux version 5.6.0">
  <meta http-equiv="Content-Type" content=
  "text/html; charset=utf-8">
  <title>Recoll user manual</title>
@ -157,20 +157,19 @@ alink="#0000FF">
            <dd>
              <dl>
                <dt><span class="sect2">2.8.1. <a href=
                "#RCL.INDEXING.PDF.OCR">OCR with
                Tesseract</a></span></dt>
                <dt><span class="sect2">2.8.2. <a href=
                "#RCL.INDEXING.PDF.XMP">XMP fields
                extraction</a></span></dt>
-                <dt><span class="sect2">2.8.3. <a href=
+                <dt><span class="sect2">2.8.2. <a href=
                "#RCL.INDEXING.PDF.ATTACH">PDF attachment
                indexing</a></span></dt>
              </dl>
            </dd>
            <dt><span class="sect1">2.9. <a href=
            "#RCL.INDEXING.OCR">Recoll and OCR</a></span></dt>
            <dt><span class="sect1">2.10. <a href=
            "#RCL.INDEXING.PERIODIC">Periodic
            indexing</a></span></dt>
-            <dt><span class="sect1">2.10. <a href=
+            <dt><span class="sect1">2.11. <a href=
            "#RCL.INDEXING.MONITOR"><span class=
            "application">Unix</span>-like systems: real time
            indexing</a></span></dt>
@ -781,7 +780,7 @@ alink="#0000FF">
            "list-style-type: disc;">
              <li class="listitem">
                <p><b><a class="link" href="#RCL.INDEXING.PERIODIC"
-                title="2.9.&nbsp;Periodic indexing">Periodic (or
+                title="2.10.&nbsp;Periodic indexing">Periodic (or
                batch) indexing</a> .&nbsp;</b><span class=
                "command"><strong>recollindex</strong></span> is
                executed at discrete times. On <span class=
@ -799,7 +798,7 @@ alink="#0000FF">
              <li class="listitem">
                <p><b><a class="link" href="#RCL.INDEXING.MONITOR"
                title=
-                "2.10.&nbsp;Unix-like systems: real time indexing">Real
+                "2.11.&nbsp;Unix-like systems: real time indexing">Real
                time indexing</a> .&nbsp;</b>(Only available on
                <span class="application">Unix</span>-like
                systems). <span class=
@ -831,7 +830,7 @@ alink="#0000FF">
            indexing on a small home directory), or, with
            <span class="application">Recoll</span> 1.24 and newer,
            by <a class="link" href="#RCL.INDEXING.MONITOR" title=
-            "2.10.&nbsp;Unix-like systems: real time indexing">configuring
+            "2.11.&nbsp;Unix-like systems: real time indexing">configuring
            the index so that only a subset of the tree will be
            monitored.</a></p>
            <p>The choice of method and the parameters used can be
@ -1136,8 +1135,8 @@ alink="#0000FF">
              different areas of the file system to different
              indexes. For example, if you were to issue the
              following command:</p>
-              <pre class="programlisting">
+              <pre class=
-              recoll -c ~/.indexes-email</pre>
+              "programlisting">recoll -c ~/.indexes-email</pre>
              <p>Then <span class="application">Recoll</span> would
              use configuration files stored in <code class=
              "filename">~/.indexes-email/</code> and, (unless
@ -2141,45 +2140,16 @@ metadatacmds = ; <em class=
        if the document text is empty, it can be configured to
        extract specific metadata tags from an XMP packet, and to
        extract PDF attachments.</p>
-        <div class="sect2">
+        <p>The PDF handler can execute an external program to run
-          <div class="titlepage">
+        OCR if no text is found in the document. This is now
-            <div>
+        described in a <a class="link" href="#RCL.INDEXING.OCR"
-              <div>
+        title="2.9.&nbsp;Recoll and OCR">separate section</a>.</p>
                <h3 class="title"><a name="RCL.INDEXING.PDF.OCR"
                id="RCL.INDEXING.PDF.OCR"></a>2.8.1.&nbsp;OCR with
                Tesseract</h3>
              </div>
            </div>
          </div>
          <p>If both <span class="application">tesseract</span> and
          <span class="command"><strong>pdftoppm</strong></span>
          (generally from the <span class=
          "application">poppler-utils</span> package) are
          installed, the PDF handler may attempt OCR on PDF files
          with no text content. This is controlled by the <a class=
          "link" href=
          "#RCL.INSTALL.CONFIG.RECOLLCONF.PDFOCR">pdfocr</a>
          configuration variable, which is false by default because
          OCR is very slow.</p>
          <p>The choice of language is very important for
          successfull OCR. Recoll has currently no way to determine
          this from the document itself. You can set the language
          to use through the contents of a <code class=
          "filename">.ocrpdflang</code> text file in the same
          directory as the PDF document, or through the
          <code class="envar">RECOLL_TESSERACT_LANG</code>
          environment variable, or through the contents of an
          <code class="filename">ocrpdf</code> text file inside the
          configuration directory. If none of the above are used,
          <span class="application">Recoll</span> will try to guess
          the language from the NLS environment.</p>
        </div>
        <div class="sect2">
          <div class="titlepage">
            <div>
              <div>
                <h3 class="title"><a name="RCL.INDEXING.PDF.XMP"
-                id="RCL.INDEXING.PDF.XMP"></a>2.8.2.&nbsp;XMP
+                id="RCL.INDEXING.PDF.XMP"></a>2.8.1.&nbsp;XMP
                fields extraction</h3>
              </div>
            </div>
@ -2236,7 +2206,7 @@ metadatacmds = ; <em class=
            <div>
              <div>
                <h3 class="title"><a name="RCL.INDEXING.PDF.ATTACH"
-                id="RCL.INDEXING.PDF.ATTACH"></a>2.8.3.&nbsp;PDF
+                id="RCL.INDEXING.PDF.ATTACH"></a>2.8.2.&nbsp;PDF
                attachment indexing</h3>
              </div>
            </div>
@ -2252,13 +2222,67 @@ metadatacmds = ; <em class=
          uncommon in my experience).</p>
        </div>
      </div>
      <div class="sect1">
        <div class="titlepage">
          <div>
            <div>
              <h2 class="title" style="clear: both"><a name=
              "RCL.INDEXING.OCR" id=
              "RCL.INDEXING.OCR"></a>2.9.&nbsp;Recoll and OCR</h2>
            </div>
          </div>
        </div>
        <p>This is new in <span class="application">Recoll</span>
        1.26.5. Older versions had a more limited, non-caching
        capability to execute an external OCR program in the PDF
        handler. The new function has the following features:</p>
        <div class="itemizedlist">
          <ul class="itemizedlist" style="list-style-type: disc;">
            <li class="listitem">
              <p>The OCR output is cached, stored as separate
              files. The caching is ultimately based on a hash
              value of the original file contents, so that it is
              immune to file renames. A first path-based layer
               ensures fast operation for unchanged (unmoved) files,
              and the data hash (which is still orders of magnitude
              faster than OCR) is only re-computed if the file has
              moved. OCR is only performed if the file was not
              previously processed or if it changed.</p>
            </li>
            <li class="listitem">
              <p>The support for a specific program is implemented
              in a simple Python module. It should be
              straightforward to add support for any OCR engine
              with a capability to run from the command line.</p>
            </li>
            <li class="listitem">
              <p>Modules initially exist for <span class=
              "application">tesseract</span> (Linux and Windows),
              and <span class="application">ABBYY FineReader</span>
              (Linux, tested with version 11). ABBYY FineReader is
              a commercial closed source program, but it sometimes
               performs better than tesseract.</p>
            </li>
            <li class="listitem">
              <p>The OCR is currently only called from the PDF
              handler, but there should be no problem using it for
              other image types.</p>
            </li>
          </ul>
        </div>
        <p>Configuration. See the <a class="link" href=
        "#RCL.INSTALL.CONFIG.RECOLLCONF.OCR" title=
        "Parameters for OCR processing">relevant section</a>. All
        parameters can be localized in subdirectories through the
        usual main configuration mechanism (path sections).</p>
      </div>
      <div class="sect1">
        <div class="titlepage">
          <div>
            <div>
              <h2 class="title" style="clear: both"><a name=
              "RCL.INDEXING.PERIODIC" id=
-              "RCL.INDEXING.PERIODIC"></a>2.9.&nbsp;Periodic
+              "RCL.INDEXING.PERIODIC"></a>2.10.&nbsp;Periodic
              indexing</h2>
            </div>
          </div>
@ -2431,7 +2455,7 @@ metadatacmds = ; <em class=
            <div>
              <h2 class="title" style="clear: both"><a name=
              "RCL.INDEXING.MONITOR" id=
-              "RCL.INDEXING.MONITOR"></a>2.10.&nbsp;<span class=
+              "RCL.INDEXING.MONITOR"></a>2.11.&nbsp;<span class=
              "application">Unix</span>-like systems: real time
              indexing</h2>
            </div>
@ -3759,8 +3783,8 @@ fs.inotify.max_user_watches=32768
          that every user does not have to do it. The variable
          should define a colon-separated list of index
          directories, ie:</p>
-          <pre class="screen">
+          <pre class=
-          export RECOLL_EXTRA_DBS=/some/place/xapiandb:/some/other/db</pre>
+          "screen">export RECOLL_EXTRA_DBS=/some/place/xapiandb:/some/other/db</pre>
          <p>Another environment variable, <code class=
          "envar">RECOLL_ACTIVE_EXTRA_DBS</code> allows adding to
          the active list of indexes. This variable was suggested
@ -4565,8 +4589,8 @@ fs.inotify.max_user_watches=32768
              parent folder expansion, usually creating a file
              manager window on the folder where the container file
              resides. E.g.:</p>
-              <pre class="programlisting">
+              <pre class=
-              &lt;a href="F%N"&gt;%P&lt;/a&gt;</pre>
+              "programlisting">&lt;a href="F%N"&gt;%P&lt;/a&gt;</pre>
              <p>A link target defined as <code class=
              "literal">R%N|<em class=
              "replaceable"><code>scriptname</code></em></code>
@ -4708,8 +4732,8 @@ fs.inotify.max_user_watches=32768
          <span class="application">javascript</span> program to
          the documents, like the following example, which would
          initiate a search by double-clicking any term:</p>
-          <pre class="programlisting">
+          <pre class=
-          &lt;script language="JavaScript"&gt;
+          "programlisting">&lt;script language="JavaScript"&gt;
        function recollsearch() {
        var t = document.getSelection();
        window.location.href = 'recoll://search/query?qtp=a&amp;p=0&amp;q=' +
@ -8838,7 +8862,8 @@ for i in range(nres):
                  <p>Languages for which to create stemming
                  expansion data. Stemmer names can be found by
                  executing 'recollindex -l', or this can also be
-                  set from a list in the GUI.</p>
+                  set from a list in the GUI. The values are full
                  language names, e.g. english, french...</p>
                </dd>
                <dt><a name=
                "RCL.INSTALL.CONFIG.RECOLLCONF.DEFAULTCHARSET" id=
@ -9425,7 +9450,8 @@ for i in range(nres):
                  aspell language definition files. You can type
                   "aspell dicts" to see a list. The default if this
                  is not set is to use the NLS environment to guess
-                  the value.</p>
+                  the value. The values are the 2-letter language
                  codes (e.g. 'en', 'fr'...)</p>
                </dd>
                <dt><a name=
                "RCL.INSTALL.CONFIG.RECOLLCONF.ASPELLADDCREATEPARAM"
@ -9500,21 +9526,32 @@ for i in range(nres):
                  *.log:20 "*with spaces.*:30"</p>
                </dd>
                <dt><a name=
                "RCL.INSTALL.CONFIG.RECOLLCONF.IDXNICEPRIO" id=
                "RCL.INSTALL.CONFIG.RECOLLCONF.IDXNICEPRIO"></a><span class="term"><code class="varname">idxniceprio</code></span></dt>
                <dd>
                  <p>"nice" process priority for the indexing
                   processes. Default: 19 (lowest). Appeared with
                  1.26.5. Prior versions were fixed at 19.</p>
                </dd>
                <dt><a name=
                "RCL.INSTALL.CONFIG.RECOLLCONF.MONIONICECLASS" id=
                "RCL.INSTALL.CONFIG.RECOLLCONF.MONIONICECLASS"></a><span class="term"><code class="varname">monioniceclass</code></span></dt>
                <dd>
-                  <p>ionice class for the real time indexing
+                  <p>ionice class for the indexing process. Despite
-                  process On platforms where this is supported. The
+                  the misleading name, and on platforms where this
-                  default value is 3.</p>
+                  is supported, this affects all indexing
                  processes, not only the real time/monitoring
                  ones. The default value is 3 (use lowest "Idle"
                  priority).</p>
                </dd>
                <dt><a name=
                "RCL.INSTALL.CONFIG.RECOLLCONF.MONIONICECLASSDATA"
                id=
                "RCL.INSTALL.CONFIG.RECOLLCONF.MONIONICECLASSDATA"></a><span class="term"><code class="varname">monioniceclassdata</code></span></dt>
                <dd>
-                  <p>ionice class parameter for the real time
+                  <p>ionice class level parameter if the class
-                  indexing process. On platforms where this is
+                  supports it. The default is empty, as the default
-                  supported. The default is empty.</p>
+                  "Idle" class has no levels.</p>
                </dd>
              </dl>
            </div>
@ -9611,20 +9648,10 @@ for i in range(nres):
                id=
                "RCL.INSTALL.CONFIG.RECOLLCONF.PDFOCR"></a><span class="term"><code class="varname">pdfocr</code></span></dt>
                <dd>
-                  <p>Attempt OCR of PDF files with no text content
+                  <p>Attempt OCR of PDF files with no text content.
                  if both tesseract and pdftoppm are installed.
                  This can be defined in subdirectories. The
-                  default is off because OCR is so very slow.</p>
+                  default is off because OCR is so very slow. Will
-                </dd>
+                  only do anything if ocrprogs is defined.</p>
                <dt><a name=
                "RCL.INSTALL.CONFIG.RECOLLCONF.PDFOCRLANG" id=
                "RCL.INSTALL.CONFIG.RECOLLCONF.PDFOCRLANG"></a><span class="term"><code class="varname">pdfocrlang</code></span></dt>
                <dd>
                  <p>Language to assume for PDF OCR. This is very
                  important for having a reasonable rate of errors
                  with tesseract. This can also be set through a
                  configuration variable or directory-local
                  parameters. See the rclpdf.py script.</p>
                </dd>
                <dt><a name=
                "RCL.INSTALL.CONFIG.RECOLLCONF.PDFATTACH" id=
@ -9666,6 +9693,80 @@ for i in range(nres):
              </dl>
            </div>
          </div>
          <div class="sect3">
            <div class="titlepage">
              <div>
                <div>
                  <h4 class="title"><a name=
                  "RCL.INSTALL.CONFIG.RECOLLCONF.OCR" id=
                  "RCL.INSTALL.CONFIG.RECOLLCONF.OCR"></a>Parameters
                  for OCR processing</h4>
                </div>
              </div>
            </div>
            <div class="variablelist">
              <dl class="variablelist">
                <dt><a name=
                "RCL.INSTALL.CONFIG.RECOLLCONF.OCRPROGS" id=
                "RCL.INSTALL.CONFIG.RECOLLCONF.OCRPROGS"></a><span class="term"><code class="varname">ocrprogs</code></span></dt>
                <dd>
                  <p>OCR modules to try. The top OCR script will
                  try to load the corresponding modules in order
                  and use the first which reports being capable of
                  performing OCR on the input file. Modules for
                  tesseract and ABBYY FineReader are present in the
                  standard distribution.</p>
                </dd>
                <dt><a name=
                "RCL.INSTALL.CONFIG.RECOLLCONF.OCRCACHEDIR" id=
                "RCL.INSTALL.CONFIG.RECOLLCONF.OCRCACHEDIR"></a><span class="term"><code class="varname">ocrcachedir</code></span></dt>
                <dd>
                  <p>Location for caching OCR data. The default if
                  this is empty or undefined is to store the cached
                  OCR data under $RECOLL_CONFDIR/ocrcache.</p>
                </dd>
                <dt><a name=
                "RCL.INSTALL.CONFIG.RECOLLCONF.TESSERACTLANG" id=
                "RCL.INSTALL.CONFIG.RECOLLCONF.TESSERACTLANG"></a><span class="term"><code class="varname">tesseractlang</code></span></dt>
                <dd>
                  <p>Language to assume for tesseract OCR.
                  Important for improving the OCR accuracy. This
                  can also be set through the contents of a file in
                  the currently processed directory. See the
                  rclocrtesseract.py script. Example values: eng,
                  fra... See the tesseract documentation.</p>
                </dd>
                <dt><a name=
                "RCL.INSTALL.CONFIG.RECOLLCONF.TESSERACTCMD" id=
                "RCL.INSTALL.CONFIG.RECOLLCONF.TESSERACTCMD"></a><span class="term"><code class="varname">tesseractcmd</code></span></dt>
                <dd>
                  <p>Path for the tesseract command. This is mostly
                  useful on Windows, or for specifying a
                  non-default tesseract command. e.g. on Windows:
                  C:/Program&nbsp;Files&nbsp;(x86)/Tesseract-OCR/tesseract.exe</p>
                </dd>
                <dt><a name=
                "RCL.INSTALL.CONFIG.RECOLLCONF.ABBYYLANG" id=
                "RCL.INSTALL.CONFIG.RECOLLCONF.ABBYYLANG"></a><span class="term"><code class="varname">abbyylang</code></span></dt>
                <dd>
                  <p>Language to assume for abbyy OCR. Important
                  for improving the OCR accuracy. This can also be
                  set through the contents of a file in the
                  currently processed directory. See the
                  rclocrabbyy.py script. Typical values: English,
                  French... See the ABBYY documentation.</p>
                </dd>
                <dt><a name=
                "RCL.INSTALL.CONFIG.RECOLLCONF.ABBYYCMD" id=
                "RCL.INSTALL.CONFIG.RECOLLCONF.ABBYYCMD"></a><span class="term"><code class="varname">abbyycmd</code></span></dt>
                <dd>
                   <p>Path for the abbyy command. The ABBYY directory
                  is usually not in the path, so you should set
                  this.</p>
                </dd>
              </dl>
            </div>
          </div>
          <div class="sect3">
            <div class="titlepage">
              <div>
@ -9858,8 +9959,8 @@ for i in range(nres):
          "filename">.xml</code> extension but should be handled
          specially, which is possible because they are usually all
          located in one place. Example:</p>
-          <pre class="programlisting">
+          <pre class=
-          [~/.kde/share/apps/okular/docdata]
+          "programlisting">[~/.kde/share/apps/okular/docdata]
        .xml = application/x-okular-notes</pre>
          <p>The <code class="varname">recoll_noindex</code>
          <code class="filename">mimemap</code> variable has been
--- a/src/doc/user/usermanual.xml
+++ b/src/doc/user/usermanual.xml
@ -1414,30 +1414,9 @@ metadatacmds = ; <replaceable>tags</replaceable> = tmsu tags %f
      specific metadata tags from an XMP packet, and to extract PDF
      attachments.</para>
-      <sect2 id="RCL.INDEXING.PDF.OCR">
+	  <para>The PDF handler can execute an external program to run OCR if
-        <title>OCR with Tesseract</title>
+	  no text is found in the document. This is now described in a 
-
+	  <link linkend="RCL.INDEXING.OCR">separate section</link>.</para>
        <para>If both <application>tesseract</application> and
        <command>pdftoppm</command> (generally from the
        <application>poppler-utils</application> package) are installed,
        the PDF handler may attempt OCR on PDF files with no text
        content. This is controlled by the
        <link linkend="RCL.INSTALL.CONFIG.RECOLLCONF.PDFOCR">pdfocr</link>
        configuration variable, which is false by default because
        OCR is very slow.</para>
        <para>The choice of language is very important for successfull
        OCR. Recoll has currently no way to determine this from the
        document itself. You can set the language to use through the
        contents of a <filename>.ocrpdflang</filename> text file in the
        same directory as the PDF document, or through the
        <envar>RECOLL_TESSERACT_LANG</envar> environment variable, or
        through the contents of an <filename>ocrpdf</filename> text file
        inside the configuration directory. If none of the above are used,
        &RCL; will try to guess the language from the NLS
        environment.</para>
      </sect2>
      <sect2 id="RCL.INDEXING.PDF.XMP">
        <title>XMP fields extraction</title>
@ -1510,6 +1489,47 @@ metadatacmds = ; <replaceable>tags</replaceable> = tmsu tags %f
    </sect1>
 	<sect1 id="RCL.INDEXING.OCR">
      <title>Recoll and OCR</title>
 	  <para>This is new in &RCL; 1.26.5. Older versions had a more limited,
 	  non-caching capability to execute an external OCR program in the PDF
 	  handler. The new function has the following features:
 	  <itemizedlist>
 		<listitem><para>The OCR output is cached, stored as separate
 		files. The caching is ultimately based on a hash value of the
 		original file contents, so that it is immune to file renames. A
 		first path-based layer ensures fast operation for unchanged
 		(unmoved) files, and the data hash (which is still orders of
 		magnitude faster than OCR) is only re-computed if the file has
 		moved. OCR is only performed if the file was not previously
 		processed or if it changed.</para></listitem>
 		<listitem><para>The support for a specific program is implemented
 		in a simple Python module. It should be straightforward to add
 		support for any OCR engine with a capability to run from the
 		command line.</para></listitem>
 		<listitem><para>Modules initially exist for
 		<application>tesseract</application> (Linux and Windows), and
 		<application>ABBYY FineReader</application> (Linux, tested with
 		version 11). ABBYY FineReader is a commercial closed source
 		program, but it sometimes performs better than
 		tesseract.</para></listitem>
 		<listitem><para>The OCR is currently only called from the PDF
 		handler, but there should be no problem using it for other image
 		types.</para></listitem>
 	  </itemizedlist>
 	</para>
 	<para>For configuration, see the 
 	  <link linkend="RCL.INSTALL.CONFIG.RECOLLCONF.OCR">
 		relevant section</link>. All parameters can be localized in
 		subdirectories through the usual main configuration mechanism (path
 		sections).</para>
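 	<para>As a minimal sketch of such a localized setup (the directory
 	  path is hypothetical and the values are only illustrative), the
 	  following fragment would enable PDF OCR with
 	  <application>tesseract</application>, assume English by default,
 	  and assume French for one subtree:</para>
 	<programlisting>
 # Top level: applies to the whole indexed area
 pdfocr = 1
 ocrprogs = tesseract
 tesseractlang = eng

 # Path section: hypothetical directory holding French scans
 [/home/me/scans/french]
 tesseractlang = fra
 	</programlisting>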
    </sect1>
    <sect1 id="RCL.INDEXING.PERIODIC">
      <title>Periodic indexing</title>
--- a/src/sampleconf/recoll.conf
+++ b/src/sampleconf/recoll.conf
@ -350,7 +350,8 @@ indexStoreDocText = 1
 #
 # <brief>Languages for which to create stemming expansion
 # data.</brief><descr>Stemmer names can be found by executing 'recollindex
-# -l', or this can also be set from a list in the GUI.</descr></var>
+# -l', or this can also be set from a list in the GUI. The values are full
 # language names, e.g. english, french...</descr></var>
 indexstemminglanguages = english 
 # <var name="defaultcharset" type="string"><brief>Default character
@ -760,9 +761,9 @@ checkneedretryindexscript = rclcheckneedretry.sh
 #
 # <brief>Language definitions to use when creating the aspell
 # dictionary.</brief><descr>The value must match a set of aspell language
-# definition files. You can type "aspell dicts"  to see a list The default
+# definition files. You can type "aspell dicts" to see a list. The default
-# if this is not set is to use the NLS environment to guess the
+# if this is not set is to use the NLS environment to guess the value. The
-# value.</descr></var>
+# values are the 2-letter language codes (e.g. 'en', 'fr'...)</descr></var>
 #aspellLanguage = en
 # <var name="aspellAddCreateParam" type="string">
@ -902,19 +903,11 @@ snippetMaxPosWalk = 1000000
 # <var name="pdfocr" type="bool">
 #
-# <brief>Attempt OCR of PDF files with no text content if both tesseract and
+# <brief>Attempt OCR of PDF files with no text content.</brief>
 # pdftoppm are installed.</brief>
 # <descr>This can be defined in subdirectories. The default is off because
-# OCR is so very slow.</descr></var>
+# OCR is so very slow. Will only do anything if ocrprogs is defined.</descr>
 #pdfocr = 0
 # <var name="pdfocrlang" type="string">
 #  <brief>Language to assume for PDF OCR.</brief>
 #  <descr>This is very important for having a reasonable rate of errors
 #   with tesseract. This can also be set through a configuration variable
 #   or directory-local parameters. See the rclpdf.py script.</descr>
 # </var>
-#pdfocrlang = eng
+#pdfocr = 0
 # <var name="pdfattach" type="bool">
 #
@ -946,6 +939,60 @@ snippetMaxPosWalk = 1000000
 #pdfextrametafix =  /path/to/fixerscript.py
 # <grouptitle id="OCR">Parameters for OCR processing</grouptitle>
 # <var name="ocrprogs" type="string">
 # <brief>OCR modules to try.</brief>
 # <descr>The top OCR script will try to load the corresponding modules in
 # order and use the first which reports being capable of performing OCR on
 # the input file. Modules for tesseract and ABBYY FineReader are present in
 # the standard distribution.</descr>
 # </var>
 #ocrprogs = abbyy tesseract
 # <var name="ocrcachedir" type="dfn">
 # <brief>Location for caching OCR data.</brief>
 # <descr>The default if this is empty or undefined is to store the cached
 # OCR data under $RECOLL_CONFDIR/ocrcache.</descr>
 # </var>
 #ocrcachedir=
 # <var name="tesseractlang" type="string">
 #  <brief>Language to assume for tesseract OCR.</brief>
 #  <descr>Important for improving the OCR accuracy. This can also be set
 #  through the contents of a file in
 #  the currently processed directory. See the rclocrtesseract.py
 #  script. Example values: eng, fra... See the tesseract documentation.</descr>
 # </var>
 #tesseractlang = eng
 # <var name="tesseractcmd" type="fn">
 # <brief>Path for the tesseract command.</brief>
 # <descr>This is mostly useful on Windows, or for specifying a non-default
 # tesseract command. e.g. on Windows:
 # C:/Program&nbsp;Files&nbsp;(x86)/Tesseract-OCR/tesseract.exe</descr>
 # </var>
 #tesseractcmd = c:/Program Files (x86)/Tesseract-OCR/tesseract.exe
 # <var name="abbyylang" type="string">
 # <brief>Language to assume for abbyy OCR.</brief>
 # <descr>Important for improving the OCR accuracy. This can also be set
 # through the contents of a file in
 # the currently processed directory. See the rclocrabbyy.py
 # script. Typical values: English, French... See the ABBYY documentation.
 # </descr>
 # </var>
 #abbyylang = English
 # <var name="abbyycmd" type="fn">
 # <brief>Path for the abbyy command</brief>
 # <descr>The ABBYY directory is usually not in the path, so you should set this.
 # </descr>
 # </var>
 abbyycmd = /opt/ABBYYOCR11/abbyyocr11
 # <grouptitle id="SPECLOCATIONS">Parameters set for specific
 # locations</grouptitle>