doc

2019-03-22 12:32:00 +01:00 · 2019-03-22 12:32:00 +01:00 · 2d88b2ade6
commit 2d88b2ade6
parent f5fd7dd158
2 changed files with 202 additions and 69 deletions
--- a/src/doc/user/usermanual.html
+++ b/src/doc/user/usermanual.html
@@ -5719,14 +5719,17 @@ recollindex -c "$confdir"
        cooperate to translate from the multitude of input document
        formats, simple ones as <span class=
        "application">opendocument</span>, <span class=
-        "application">acrobat</span>), or compound ones such as
+        "application">acrobat</span>, or compound ones such as
        <span class="application">Zip</span> or <span class=
        "application">Email</span>, into the final <span class=
        "application">Recoll</span> indexing input format, which is
-        plain text. Most input handlers are executable programs or
-        scripts. A few handlers are coded in C++ and live inside
-        <span class="command"><strong>recollindex</strong></span>.
-        This latter kind will not be described here.</p>
+        plain text (in many cases the processing pipeline has an
+        intermediary HTML step, which may be used for better
+        previewing presentation). Most input handlers are
+        executable programs or scripts. A few handlers are coded in
+        C++ and live inside <span class=
+        "command"><strong>recollindex</strong></span>. This latter
+        kind will not be described here.</p>
        <p>There are currently (since version 1.13) two kinds of
        external executable input handlers:</p>
        <div class="itemizedlist">
@@ -5741,26 +5744,47 @@ recollindex -c "$confdir"
              document to the standard output. Their output can be
              plain text or HTML. HTML is usually preferred because
              it can store metadata fields and it allows preserving
-              some of the formatting for the GUI preview.</p>
+              some of the formatting for the GUI preview. However,
+              these handlers have limitations:</p>
+              <div class="itemizedlist">
+                <ul class="itemizedlist" style=
+                "list-style-type: circle;">
+                  <li class="listitem">
+                    <p>They can only process one document per
+                    file.</p>
+                  </li>
+                  <li class="listitem">
+                    <p>The output MIME type must be known and
+                    fixed.</p>
+                  </li>
+                  <li class="listitem">
+                    <p>The character encoding, if relevant, must be
+                    known and fixed (or possibly just depending on
+                    location).</p>
+                  </li>
+                </ul>
+              </div>
            </li>
            <li class="listitem">
              <p>Multiple <code class="literal">execm</code>
              handlers can process multiple files (sparing the
              process startup time which can be very significant),
-              or multiple documents per file (e.g.: for
-              <span class="application">zip</span> or <span class=
-              "application">chm</span> files). They communicate
-              with the indexer through a simple protocol, but are
+              or multiple documents per file (e.g.: for archives or
+              multi-chapter publications). They communicate with
+              the indexer through a simple protocol, but are
              nevertheless a bit more complicated than the older
-              kind. Most of new handlers are written in
-              <span class="application">Python</span>, using a
-              common module to handle the protocol. There is an
-              exception, <span class=
-              "command"><strong>rclimg</strong></span> which is
-              written in Perl. The subdocuments output by these
-              handlers can be directly indexable (text or HTML), or
-              they can be other simple or compound documents that
-              will need to be processed by another handler.</p>
+              kind. Most of the new handlers are written in
+              <span class="application">Python</span> (exception:
+              <span class="command"><strong>rclimg</strong></span>
+              which is written in Perl because <code class=
+              "literal">exiftool</code> has no real Python
+              equivalent). The Python handlers use common modules
+              to factor out the boilerplate, which can make them
+              very simple in favorable cases. The subdocuments
+              output by these handlers can be directly indexable
+              (text or HTML), or they can be other simple or
+              compound documents that will need to be processed by
+              another handler.</p>
            </li>
          </ul>
        </div>
@@ -5786,10 +5810,13 @@ recollindex -c "$confdir"
        <p>The handlers that can handle multiple documents per file
        return a single piece of data to identify each document
        inside the file. This piece of data, called an <code class=
-        "literal">ipath element</code> will be sent back by
-        <span class="application">Recoll</span> to extract the
-        document at query time, for previewing, or for creating a
-        temporary file to be opened by a viewer.</p>
+        "literal">ipath</code> will be sent back by <span class=
+        "application">Recoll</span> to extract the document at
+        query time, for previewing, or for creating a temporary
+        file to be opened by a viewer. These handlers can also
+        return metadata either as HTML <code class=
+        "literal">meta</code> tags, or as named data through the
+        communication protocol.</p>
        <p>The following section describes the simple handlers, and
        the next one gives a few explanations about the
        <code class="literal">execm</code> ones. You could
@@ -5860,16 +5887,72 @@ recollindex -c "$confdir"
          </div>
          <p>If you can program and want to write an <code class=
          "literal">execm</code> handler, it should not be too
-          difficult to make sense of one of the existing modules.
-          There is a sample one with many comments, not actually
-          used by <span class="application">Recoll</span>, which
-          would index a text file as one document per line. Look
-          for <code class="filename">rcltxtlines.py</code> in the
-          <code class="filename">src/filters</code> directory in
-          the <span class="application">Recoll</span> <a class=
-          "ulink" href="https://bitbucket.org/medoc/recoll/src"
-          target="_top">BitBucket repository</a> (the sample not in
-          the distributed release at the moment).</p>
+          difficult to make sense of one of the existing
+          handlers.</p>
+          <p>The existing handlers differ in the amount of helper
+          code which they are using:</p>
+          <div class="itemizedlist">
+            <ul class="itemizedlist" style=
+            "list-style-type: disc;">
+              <li class="listitem">
+                <p><code class="literal">rclimg</code> is written
+                in Perl and handles the execm protocol all by
+                itself (showing how trivial it is).</p>
+              </li>
+              <li class="listitem">
+                <p>All the Python handlers share at least the
+                <code class="filename">rclexecm.py</code> module,
+                which handles the communication. Have a look at,
+                for example, <code class="filename">rclzip</code>
+                for a handler which uses <code class=
+                "filename">rclexecm.py</code> directly.</p>
+              </li>
+              <li class="listitem">
+                <p>Most Python handlers which process
+                single-document files by executing another command
+                are further abstracted by using the <code class=
+                "filename">rclexec1.py</code> module. See for
+                example <code class="filename">rclrtf.py</code> for
+                a simple one, or <code class=
+                "filename">rcldoc.py</code> for a slightly more
+                complicated one (possibly executing several
+                commands).</p>
+              </li>
+              <li class="listitem">
+                <p>Handlers which extract text from an XML document
+                by using an XSLT style sheet are now executed
+                inside <span class=
+                "command"><strong>recollindex</strong></span>, with
+                only the style sheet stored in the <code class=
+                "filename">filters/</code> directory. These can use
+                a single style sheet (e.g. <code class=
+                "filename">abiword.xsl</code>), or two sheets for
+                the data and metadata (e.g. <code class=
+                "filename">opendoc-body.xsl</code> and <code class=
+                "filename">opendoc-meta.xsl</code>). The
+                <code class="filename">mimeconf</code>
+                configuration file defines how the sheets are used,
+                have a look. Before the C++ import, the xsl-based
+                handlers used a common module <code class=
+                "filename">rclgenxslt.py</code>, it is still around
+                but unused. The handler for OpenXML presentations
+                is still the Python version because the format did
+                not fit with what the C++ code does. It would be a
+                good base for another similar issue.</p>
+              </li>
+            </ul>
+          </div>
+          <p>There is a sample trivial handler based on
+          <code class="filename">rclexecm.py</code>, with many
+          comments, not actually used by <span class=
+          "application">Recoll</span>. It would index a text file
+          as one document per line. Look for <code class=
+          "filename">rcltxtlines.py</code> in the <code class=
+          "filename">src/filters</code> directory in the online
+          <span class="application">Recoll</span> <a class="ulink"
+          href="https://opensourceprojects.eu/p/recoll1/" target=
+          "_top">Git repository</a> (the sample not in the
+          distributed release at the moment).</p>
          <p>You can also have a look at the slightly more complex
          <span class="command"><strong>rclzip</strong></span>
          which uses Zip file paths as identifiers (<code class=
--- a/src/doc/user/usermanual.xml
+++ b/src/doc/user/usermanual.xml
@@ -4392,16 +4392,16 @@ recollindex -c "$confdir"
      still used in many places though.</para></note>
      <para>&RCL; input handlers cooperate to translate from the multitude
-      of input document formats, simple ones
-      as <application>opendocument</application>, 
-      <application>acrobat</application>), or compound ones such
-      as <application>Zip</application>
-      or <application>Email</application>, into the final &RCL;
-      indexing input format, which is plain text.
-      Most input handlers are executable
-      programs or scripts. A few handlers are coded in C++ and live
-      inside <command>recollindex</command>. This latter kind will not
-      be described here.</para>
+      of input document formats, simple ones as
+      <application>opendocument</application>,
+      <application>acrobat</application>, or compound ones such as
+      <application>Zip</application> or <application>Email</application>,
+      into the final &RCL; indexing input format, which is plain text (in
+      many cases the processing pipeline has an intermediary HTML step,
+      which may be used for better previewing presentation).  Most input
+      handlers are executable programs or scripts. A few handlers are coded
+      in C++ and live inside <command>recollindex</command>. This latter
+      kind will not be described here.</para>
      <para>There are currently (since version 1.13) two kinds of
      external executable input handlers:
@@ -4414,23 +4414,32 @@ recollindex -c "$confdir"
        output. Their output can be plain text or HTML. HTML is
        usually preferred because it can store metadata fields and
        it allows preserving some of the formatting for the GUI
-        preview.</para>
+        preview. However, these handlers have limitations:
+        <itemizedlist>
+          <listitem><para>They can only process one document
+          per file.</para></listitem>
+          <listitem><para>The output MIME type must be known and
+          fixed.</para></listitem>
+          <listitem><para>The character encoding, if relevant, must be
+          known and fixed (or possibly just depending on
+          location).</para></listitem>
+        </itemizedlist>
+        </para>
        </listitem>
-        <listitem><para>Multiple <literal>execm</literal> handlers
-        can process multiple files (sparing the process startup
-        time which can be very significant), or multiple documents
-        per file (e.g.: for <application>zip</application> or
-        <application>chm</application> files). They communicate
-        with the indexer through a simple protocol, but are
-        nevertheless a bit more complicated than the older
-        kind. Most of new handlers are written in
-        <application>Python</application>, using a common module
-        to handle the protocol. There is an exception,
-        <command>rclimg</command> which is written in Perl. The
-        subdocuments output by these handlers can be directly
-        indexable (text or HTML), or they can be other simple or
-        compound documents that will need to be processed by
-        another handler.</para>
+        <listitem><para>Multiple <literal>execm</literal> handlers can
+        process multiple files (sparing the process startup time which can
+        be very significant), or multiple documents per file (e.g.: for
+        archives or multi-chapter publications). They communicate with the
+        indexer through a simple protocol, but are nevertheless a bit more
+        complicated than the older kind. Most of the new handlers are
+        written in <application>Python</application> (exception:
+        <command>rclimg</command> which is written in Perl because
+        <literal>exiftool</literal> has no real Python equivalent). The
+        Python handlers use common modules to factor out the boilerplate,
+        which can make them very simple in favorable cases. The
+        subdocuments output by these handlers can be directly indexable
+        (text or HTML), or they can be other simple or compound documents
+        that will need to be processed by another handler.</para>
        </listitem>
      </itemizedlist>
      </para>
@ -4458,10 +4467,12 @@ recollindex -c "$confdir"
      <para>The handlers that can handle multiple documents per file
      return a single piece of data to identify each document inside
      the file. This piece of data, called
-      an <literal>ipath element</literal> will be sent back by
+      an <literal>ipath</literal> will be sent back by
      &RCL; to extract the document at query time, for previewing,
      or for creating a temporary file to be opened by a
-      viewer.</para>  
+      viewer. These handlers can also return metadata either as HTML
+      <literal>meta</literal> tags, or as named data through the
+      communication protocol.</para>
      <para>The following section describes the simple
      handlers, and the next one gives a few explanations about
@@ -4514,14 +4525,53 @@ recollindex -c "$confdir"
        <para>If you can program and want to write
        an <literal>execm</literal> handler, it should not be too
-        difficult to make sense of one of the existing modules. There is
-        a sample one with many comments, not actually used by &RCL;,
-        which would index a text file as one document per line. Look for
-        <filename>rcltxtlines.py</filename> in the
-        <filename>src/filters</filename> directory in the &RCL; <ulink
-        url="https://bitbucket.org/medoc/recoll/src">BitBucket
-        repository</ulink> (the sample
-        not in the distributed release at the moment).</para>
+        difficult to make sense of one of the existing handlers.</para>
+
+        <para>The existing handlers differ in the amount of helper code
+        which they are using:
+        <itemizedlist>
+          <listitem><para><literal>rclimg</literal> is written in Perl and
+          handles the execm protocol all by itself (showing how trivial it
+          is).</para></listitem>
+          <listitem><para>All the Python handlers share at least the
+          <filename>rclexecm.py</filename> module, which handles the
+          communication. Have a look at, for example,
+          <filename>rclzip</filename> for a handler which uses
+          <filename>rclexecm.py</filename> directly.</para></listitem>
+          <listitem><para>Most Python handlers which process
+          single-document files by executing another command are further
+          abstracted by using the <filename>rclexec1.py</filename>
+          module. See for example <filename>rclrtf.py</filename> for a
+          simple one, or <filename>rcldoc.py</filename> for a slightly more
+          complicated one (possibly executing several
+          commands).</para></listitem>
+          <listitem><para>Handlers which extract text from an XML document
+          by using an XSLT style sheet are now executed inside
+          <command>recollindex</command>, with only the style sheet stored
+          in the <filename>filters/</filename> directory. These can
+          use a single style sheet (e.g. <filename>abiword.xsl</filename>),
+          or two sheets for the data and metadata
+          (e.g. <filename>opendoc-body.xsl</filename> and
+          <filename>opendoc-meta.xsl</filename>). The
+          <filename>mimeconf</filename> configuration file defines how the
+          sheets are used, have a look. Before the C++ import, the
+          xsl-based handlers used a common module
+          <filename>rclgenxslt.py</filename>, it is still around but
+          unused. The handler for OpenXML presentations is still the Python
+          version because the format did not fit with what the C++ code
+          does. It would be a good base for another similar
+          issue.</para></listitem>
+        </itemizedlist>
+        </para>
+        <para>There is a sample trivial handler based on
+        <filename>rclexecm.py</filename>, with many comments, not actually
+        used by &RCL;. It would index a text file as one document per
+        line. Look for <filename>rcltxtlines.py</filename> in the
+        <filename>src/filters</filename> directory in the online &RCL;
+        <ulink url="https://opensourceprojects.eu/p/recoll1/">Git
+        repository</ulink> (the sample not in the distributed release at
+        the moment).</para>
        <para>You can also have a look at the slightly more complex
        <command>rclzip</command> which uses Zip