doc

2020-03-01 16:08:15 +01:00 · 2020-03-01 16:08:15 +01:00 · c110b94738
commit c110b94738
parent fe86fa9e1f
2 changed files with 63 additions and 32 deletions
--- a/src/doc/user/usermanual.html
+++ b/src/doc/user/usermanual.html
@@ -2131,19 +2131,32 @@ metadatacmds = ; <em class=
        extensive facilities for storing metadata along with the
        document, and these facilities are actually used in the
        real world.</p>
-        <p>In consequence, the <code class=
+        <p>In consequence, the <span class=
-        "filename">rclpdf.py</code> PDF input handler has more
+        "command"><strong>rclpdf.py</strong></span> PDF input
-        complex capabilities than most others, and it is also more
+        handler has more complex capabilities than most others, and
-        configurable. Specifically, <code class=
+        it is also more configurable. Specifically, <span class=
-        "filename">rclpdf.py</code> can automatically use
+        "command"><strong>rclpdf.py</strong></span> has the
-        <span class="application">tesseract</span> to perform OCR
+        following features:</p>
-        if the document text is empty, it can be configured to
+        <div class="itemizedlist">
-        extract specific metadata tags from an XMP packet, and to
+          <ul class="itemizedlist" style="list-style-type: disc;">
-        extract PDF attachments.</p>
+            <li class="listitem">
-        <p>The PDF handler can execute an external program to run
+              <p>It can be configured to extract specific metadata
-        OCR if no text is found in the document. This is now
+              tags from an XMP packet.</p>
-        described in a <a class="link" href="#RCL.INDEXING.OCR"
+            </li>
-        title="2.9.&nbsp;Recoll and OCR">separate section</a>.</p>
+            <li class="listitem">
              <p>It can extract PDF attachments.</p>
            </li>
            <li class="listitem">
              <p>It can automatically perform OCR if the document
              text is empty. This is done by executing an external
              program and is now described in a <a class="link"
              href="#RCL.INDEXING.OCR" title=
              "2.9.&nbsp;Recoll and OCR">separate section</a>,
              because the OCR framework can also be used with
              non-PDF image files.</p>
            </li>
          </ul>
        </div>
        <div class="sect2">
          <div class="titlepage">
            <div>
@@ -2270,8 +2283,14 @@ metadatacmds = ; <em class=
            </li>
          </ul>
        </div>
-        <p>Configuration. See the <a class="link" href=
+        <p>To enable this feature, you need to install one of the
-        "#RCL.INSTALL.CONFIG.RECOLLCONF.OCR" title=
+        supported OCR applications (<span class=
        "application">tesseract</span> or <span class=
        "application">ABBYY</span>), enable OCR in the PDF handler,
        and tell <span class="application">Recoll</span> where the
        appropriate command resides. The last parts are done by
        setting configuration variables. See the <a class="link"
        href="#RCL.INSTALL.CONFIG.RECOLLCONF.OCR" title=
        "Parameters for OCR processing">relevant section</a>. All
        parameters can be localized in subdirectories through the
        usual main configuration mechanism (path sections).</p>
--- a/src/doc/user/usermanual.xml
+++ b/src/doc/user/usermanual.xml
@@ -1402,21 +1402,27 @@ metadatacmds = ; <replaceable>tags</replaceable> = tmsu tags %f
      <title>The PDF input handler</title>
      <para>The PDF format is very important for scientific and technical
-      documentation, and document archival. It has extensive
+		documentation, and document archival. It has extensive
-      facilities for storing metadata along with the document, and these
+		facilities for storing metadata along with the document, and these
-      facilities are actually used in the real world.</para>
+		facilities are actually used in the real world.</para>
-      <para>In consequence, the <filename>rclpdf.py</filename> PDF input
+      <para>In consequence, the <command>rclpdf.py</command> PDF input
-      handler has more complex capabilities than most others, and it is
+		handler has more complex capabilities than most others, and it is
-      also more configurable. Specifically, <filename>rclpdf.py</filename>
+		also more configurable. Specifically, <command>rclpdf.py</command>
-      can automatically use <application>tesseract</application> to perform
+		has the following features:
-      OCR if the document text is empty, it can be configured to extract
+		<itemizedlist>
-      specific metadata tags from an XMP packet, and to extract PDF
+		  <listitem><para>It can be configured to extract
-      attachments.</para>
+			  specific metadata tags from an XMP packet.</para></listitem>
-
+		  <listitem><para>It can extract PDF
-	  <para>The PDF handler can execute an external program to run OCR if
+			  attachments.</para></listitem>
-	  no text is found in the document. This is now described in a 
+		  <listitem><para>It can automatically perform
-	  <link linkend="RCL.INDEXING.OCR">separate section</link>.</para>
+			  OCR if the document text is empty. This is done by 
 			  executing an external program and is now described in a 
 			  <link linkend="RCL.INDEXING.OCR">separate
 				section</link>, because the OCR framework can also be used
 				with non-PDF image files.</para></listitem>
 		</itemizedlist>
 	  </para>
      <sect2 id="RCL.INDEXING.PDF.XMP">
        <title>XMP fields extraction</title>
@@ -1477,7 +1483,7 @@ metadatacmds = ; <replaceable>tags</replaceable> = tmsu tags %f
        <title>PDF attachment indexing</title>
        <para>If <application>pdftk</application> is installed, and if the
-        the
+          the
        <link linkend="RCL.INSTALL.CONFIG.RECOLLCONF.PDFATTACH">pdfattach</link>
        configuration variable is set, the PDF input handler will try to
        extract PDF attachements for indexing as sub-documents of the PDF
@@ -1489,6 +1495,7 @@ metadatacmds = ; <replaceable>tags</replaceable> = tmsu tags %f
    </sect1>
 	<sect1 id="RCL.INDEXING.OCR">
      <title>Recoll and OCR</title>
@@ -1521,8 +1528,13 @@ metadatacmds = ; <replaceable>tags</replaceable> = tmsu tags %f
 	  </itemizedlist>
 	</para>
-	<para>Configuration. See the 
+	  <para>To enable this feature, you need to install one of
-	  <link linkend="RCL.INSTALL.CONFIG.RECOLLCONF.OCR">
+		the supported OCR applications
 		(<application>tesseract</application>
 		or <application>ABBYY</application>), enable OCR in the PDF
 		handler, and tell &RCL; where the appropriate command resides. The
 		last parts are done by setting configuration variables. See the
 		<link linkend="RCL.INSTALL.CONFIG.RECOLLCONF.OCR">
 		relevant section</link>. All parameters can be localized in
 		subdirectories through the usual main configuration mechanism (path
 		sections).</para>