doc

2019-03-22 12:32:00 +01:00 · 2019-03-22 12:32:00 +01:00 · 2d88b2ade6
commit 2d88b2ade6
parent f5fd7dd158
2 changed files with 202 additions and 69 deletions
--- a/src/doc/user/usermanual.html
+++ b/src/doc/user/usermanual.html
@@ -5719,14 +5719,17 @@ recollindex -c "$confdir"
        cooperate to translate from the multitude of input document
        formats, simple ones as <span class=
        "application">opendocument</span>, <span class=
-        "application">acrobat</span>), or compound ones such as
+        "application">acrobat</span>, or compound ones such as
        <span class="application">Zip</span> or <span class=
        "application">Email</span>, into the final <span class=
        "application">Recoll</span> indexing input format, which is
-        plain text. Most input handlers are executable programs or
-        scripts. A few handlers are coded in C++ and live inside
-        <span class="command"><strong>recollindex</strong></span>.
-        This latter kind will not be described here.</p>
+        plain text (in many cases the processing pipeline has an
+        intermediary HTML step, which may be used for better
+        previewing presentation). Most input handlers are
+        executable programs or scripts. A few handlers are coded in
+        C++ and live inside <span class=
+        "command"><strong>recollindex</strong></span>. This latter
+        kind will not be described here.</p>
        <p>There are currently (since version 1.13) two kinds of
        external executable input handlers:</p>
        <div class="itemizedlist">
@@ -5741,26 +5744,47 @@ recollindex -c "$confdir"
              document to the standard output. Their output can be
              plain text or HTML. HTML is usually preferred because
              it can store metadata fields and it allows preserving
-              some of the formatting for the GUI preview.</p>
+              some of the formatting for the GUI preview. However,
+              these handlers have limitations:</p>
+              <div class="itemizedlist">
+                <ul class="itemizedlist" style=
+                "list-style-type: circle;">
+                  <li class="listitem">
+                    <p>They can only process one document per
+                    file.</p>
+                  </li>
+                  <li class="listitem">
+                    <p>The output MIME type must be known and
+                    fixed.</p>
+                  </li>
+                  <li class="listitem">
+                    <p>The character encoding, if relevant, must be
+                    known and fixed (or possibly just depending on
+                    location).</p>
+                  </li>
+                </ul>
+              </div>
            </li>
            <li class="listitem">
              <p>Multiple <code class="literal">execm</code>
              handlers can process multiple files (sparing the
              process startup time which can be very significant),
-              or multiple documents per file (e.g.: for
-              <span class="application">zip</span> or <span class=
-              "application">chm</span> files). They communicate
-              with the indexer through a simple protocol, but are
+              or multiple documents per file (e.g.: for archives or
+              multi-chapter publications). They communicate with
+              the indexer through a simple protocol, but are
              nevertheless a bit more complicated than the older
-              kind. Most of new handlers are written in
-              <span class="application">Python</span>, using a
-              common module to handle the protocol. There is an
-              exception, <span class=
-              "command"><strong>rclimg</strong></span> which is
-              written in Perl. The subdocuments output by these
-              handlers can be directly indexable (text or HTML), or
-              they can be other simple or compound documents that
-              will need to be processed by another handler.</p>
+              kind. Most of the new handlers are written in
+              <span class="application">Python</span> (exception:
+              <span class="command"><strong>rclimg</strong></span>
+              which is written in Perl because <code class=
+              "literal">exiftool</code> has no real Python
+              equivalent). The Python handlers use common modules
+              to factor out the boilerplate, which can make them
+              very simple in favorable cases. The subdocuments
+              output by these handlers can be directly indexable
+              (text or HTML), or they can be other simple or
+              compound documents that will need to be processed by
+              another handler.</p>
            </li>
          </ul>
        </div>
@@ -5786,10 +5810,13 @@ recollindex -c "$confdir"
        <p>The handlers that can handle multiple documents per file
        return a single piece of data to identify each document
        inside the file. This piece of data, called an <code class=
-        "literal">ipath element</code> will be sent back by
-        <span class="application">Recoll</span> to extract the
-        document at query time, for previewing, or for creating a
-        temporary file to be opened by a viewer.</p>
+        "literal">ipath</code> will be sent back by <span class=
+        "application">Recoll</span> to extract the document at
+        query time, for previewing, or for creating a temporary
+        file to be opened by a viewer. These handlers can also
+        return metadata either as HTML <code class=
+        "literal">meta</code> tags, or as named data through the
+        communication protocol.</p>
        <p>The following section describes the simple handlers, and
        the next one gives a few explanations about the
        <code class="literal">execm</code> ones. You could
@@ -5860,16 +5887,72 @@ recollindex -c "$confdir"
          </div>
          <p>If you can program and want to write an <code class=
          "literal">execm</code> handler, it should not be too
-          difficult to make sense of one of the existing modules.
-          There is a sample one with many comments, not actually
-          used by <span class="application">Recoll</span>, which
-          would index a text file as one document per line. Look
-          for <code class="filename">rcltxtlines.py</code> in the
-          <code class="filename">src/filters</code> directory in
-          the <span class="application">Recoll</span> <a class=
-          "ulink" href="https://bitbucket.org/medoc/recoll/src"
-          target="_top">BitBucket repository</a> (the sample not in
-          the distributed release at the moment).</p>
+          difficult to make sense of one of the existing
+          handlers.</p>
+          <p>The existing handlers differ in the amount of helper
+          code which they are using:</p>
+          <div class="itemizedlist">
+            <ul class="itemizedlist" style=
+            "list-style-type: disc;">
+              <li class="listitem">
+                <p><code class="literal">rclimg</code> is written
+                in Perl and handles the execm protocol all by
+                itself (showing how trivial it is).</p>
+              </li>
+              <li class="listitem">
+                <p>All the Python handlers share at least the
+                <code class="filename">rclexecm.py</code> module,
+                which handles the communication. Have a look at,
+                for example, <code class="filename">rclzip</code>
+                for a handler which uses <code class=
+                "filename">rclexecm.py</code> directly.</p>
+              </li>
+              <li class="listitem">
+                <p>Most Python handlers which process
+                single-document files by executing another command
+                are further abstracted by using the <code class=
+                "filename">rclexec1.py</code> module. See for
+                example <code class="filename">rclrtf.py</code> for
+                a simple one, or <code class=
+                "filename">rcldoc.py</code> for a slightly more
+                complicated one (possibly executing several
+                commands).</p>
+              </li>
+              <li class="listitem">
+                <p>Handlers which extract text from an XML document
+                by using an XSLT style sheet are now executed
+                inside <span class=
+                "command"><strong>recollindex</strong></span>, with
+                only the style sheet stored in the <code class=
+                "filename">filters/</code> directory. These can use
+                a single style sheet (e.g. <code class=
+                "filename">abiword.xsl</code>), or two sheets for
+                the data and metadata (e.g. <code class=
+                "filename">opendoc-body.xsl</code> and <code class=
+                "filename">opendoc-meta.xsl</code>). The
+                <code class="filename">mimeconf</code>
+                configuration file defines how the sheets are used
+                (have a look). Before the move to C++, the XSLT-based
+                handlers used a common module, <code class=
+                "filename">rclgenxslt.py</code>, which is still around
+                but unused. The handler for OpenXML presentations
+                is still the Python version because the format did
+                not fit with what the C++ code does. It would be a
+                good base for another similar case.</p>
+              </li>
+            </ul>
+          </div>
+          <p>There is a sample trivial handler based on
+          <code class="filename">rclexecm.py</code>, with many
+          comments, not actually used by <span class=
+          "application">Recoll</span>. It would index a text file
+          as one document per line. Look for <code class=
+          "filename">rcltxtlines.py</code> in the <code class=
+          "filename">src/filters</code> directory in the online
+          <span class="application">Recoll</span> <a class="ulink"
+          href="https://opensourceprojects.eu/p/recoll1/" target=
+          "_top">Git repository</a> (the sample is not in the
+          distributed release at the moment).</p>
          <p>You can also have a look at the slightly more complex
          <span class="command"><strong>rclzip</strong></span>
          which uses Zip file paths as identifiers (<code class=
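
Review note, not part of the patch: the manual text above says that simple handlers usually prefer HTML output because it can carry metadata fields, and that metadata can be returned as HTML meta tags. A minimal sketch of what such output could look like is given below. The function name `render_handler_html` and the `author` field are illustrative assumptions; the diff does not state which field names Recoll recognizes, and a real handler would write the result to its standard output.

```python
import html


def render_handler_html(body_text, fields):
    """Wrap extracted text and metadata fields in a minimal HTML page.

    `fields` maps metadata names to values. Both names and values are
    escaped so that quotes or angle brackets cannot break the markup.
    """
    meta_tags = "\n".join(
        '<meta name="{}" content="{}">'.format(
            html.escape(name, quote=True), html.escape(value, quote=True)
        )
        for name, value in sorted(fields.items())
    )
    return (
        "<html><head>\n"
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">\n'
        + meta_tags + "\n"
        "</head><body>\n"
        "<pre>" + html.escape(body_text) + "</pre>\n"
        "</body></html>\n"
    )
```

Declaring a fixed charset in the output matches the limitation listed in the patch: for this kind of handler the output MIME type and character encoding must be known in advance.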
--- a/src/doc/user/usermanual.xml
+++ b/src/doc/user/usermanual.xml
@@ -4392,16 +4392,16 @@ recollindex -c "$confdir"
      still used in many places though.</para></note>

      <para>&RCL; input handlers cooperate to translate from the multitude
-      of input document formats, simple ones
-      as <application>opendocument</application>, 
-      <application>acrobat</application>), or compound ones such
-      as <application>Zip</application>
-      or <application>Email</application>, into the final &RCL;
-      indexing input format, which is plain text.
-      Most input handlers are executable
-      programs or scripts. A few handlers are coded in C++ and live
-      inside <command>recollindex</command>. This latter kind will not
-      be described here.</para>
+      of input document formats, simple ones as
+      <application>opendocument</application>,
+      <application>acrobat</application>, or compound ones such as
+      <application>Zip</application> or <application>Email</application>,
+      into the final &RCL; indexing input format, which is plain text (in
+      many cases the processing pipeline has an intermediary HTML step,
+      which may be used for better previewing presentation).  Most input
+      handlers are executable programs or scripts. A few handlers are coded
+      in C++ and live inside <command>recollindex</command>. This latter
+      kind will not be described here.</para>

      <para>There are currently (since version 1.13) two kinds of
      external executable input handlers:
@@ -4414,23 +4414,32 @@ recollindex -c "$confdir"
        output. Their output can be plain text or HTML. HTML is
        usually preferred because it can store metadata fields and
        it allows preserving some of the formatting for the GUI
-        preview.</para>
+        preview. However, these handlers have limitations:
+        <itemizedlist>
+          <listitem><para>They can only process one document
+          per file.</para></listitem>
+          <listitem><para>The output MIME type must be known and
+          fixed.</para></listitem>
+          <listitem><para>The character encoding, if relevant, must be
+          known and fixed (or possibly just depending on
+          location).</para></listitem>
+        </itemizedlist>
+        </para>
        </listitem>
-        <listitem><para>Multiple <literal>execm</literal> handlers
-        can process multiple files (sparing the process startup
-        time which can be very significant), or multiple documents
-        per file (e.g.: for <application>zip</application> or
-        <application>chm</application> files). They communicate
-        with the indexer through a simple protocol, but are
-        nevertheless a bit more complicated than the older
-        kind. Most of new handlers are written in
-        <application>Python</application>, using a common module
-        to handle the protocol. There is an exception,
-        <command>rclimg</command> which is written in Perl. The
-        subdocuments output by these handlers can be directly
-        indexable (text or HTML), or they can be other simple or
-        compound documents that will need to be processed by
-        another handler.</para>
+        <listitem><para>Multiple <literal>execm</literal> handlers can
+        process multiple files (sparing the process startup time which can
+        be very significant), or multiple documents per file (e.g.: for
+        archives or multi-chapter publications). They communicate with the
+        indexer through a simple protocol, but are nevertheless a bit more
+        complicated than the older kind. Most of the new handlers are
+        written in <application>Python</application> (exception:
+        <command>rclimg</command> which is written in Perl because
+        <literal>exiftool</literal> has no real Python equivalent). The
+        Python handlers use common modules to factor out the boilerplate,
+        which can make them very simple in favorable cases. The
+        subdocuments output by these handlers can be directly indexable
+        (text or HTML), or they can be other simple or compound documents
+        that will need to be processed by another handler.</para>
        </listitem>
      </itemizedlist>
      </para>
@@ -4458,10 +4467,12 @@ recollindex -c "$confdir"
      <para>The handlers that can handle multiple documents per file
      return a single piece of data to identify each document inside
      the file. This piece of data, called
-      an <literal>ipath element</literal> will be sent back by
+      an <literal>ipath</literal> will be sent back by
      &RCL; to extract the document at query time, for previewing,
      or for creating a temporary file to be opened by a
-      viewer.</para>  
+      viewer. These handlers can also return metadata either as HTML
+      <literal>meta</literal> tags, or as named data through the
+      communication protocol.</para>

      <para>The following section describes the simple
      handlers, and the next one gives a few explanations about
@@ -4514,14 +4525,53 @@ recollindex -c "$confdir"

        <para>If you can program and want to write
        an <literal>execm</literal> handler, it should not be too
-        difficult to make sense of one of the existing modules. There is
-        a sample one with many comments, not actually used by &RCL;,
-        which would index a text file as one document per line. Look for
-        <filename>rcltxtlines.py</filename> in the
-        <filename>src/filters</filename> directory in the &RCL; <ulink
-        url="https://bitbucket.org/medoc/recoll/src">BitBucket
-        repository</ulink> (the sample
-        not in the distributed release at the moment).</para>
+        difficult to make sense of one of the existing handlers.</para>
+
+        <para>The existing handlers differ in the amount of helper code
+        which they are using:
+        <itemizedlist>
+          <listitem><para><literal>rclimg</literal> is written in Perl and
+          handles the execm protocol all by itself (showing how trivial it
+          is).</para></listitem>
+          <listitem><para>All the Python handlers share at least the
+          <filename>rclexecm.py</filename> module, which handles the
+          communication. Have a look at, for example,
+          <filename>rclzip</filename> for a handler which uses
+          <filename>rclexecm.py</filename> directly.</para></listitem>
+          <listitem><para>Most Python handlers which process
+          single-document files by executing another command are further
+          abstracted by using the <filename>rclexec1.py</filename>
+          module. See for example <filename>rclrtf.py</filename> for a
+          simple one, or <filename>rcldoc.py</filename> for a slightly more
+          complicated one (possibly executing several
+          commands).</para></listitem> 
+          <listitem><para>Handlers which extract text from an XML document
+          by using an XSLT style sheet are now executed inside
+          <command>recollindex</command>, with only the style sheet stored
+          in the <filename>filters/</filename> directory. These can
+          use a single style sheet (e.g. <filename>abiword.xsl</filename>),
+          or two sheets for the data and metadata
+          (e.g. <filename>opendoc-body.xsl</filename> and
+          <filename>opendoc-meta.xsl</filename>). The
+          <filename>mimeconf</filename> configuration file defines how the
+          sheets are used (have a look). Before the move to C++, the
+          XSLT-based handlers used a common module,
+          <filename>rclgenxslt.py</filename>, which is still around but
+          unused. The handler for OpenXML presentations is still the Python
+          version because the format did not fit with what the C++ code
+          does. It would be a good base for another similar
+          case.</para></listitem>
+        </itemizedlist>
+        </para>
+
+        <para>There is a sample trivial handler based on
+        <filename>rclexecm.py</filename>, with many comments, not actually
+        used by &RCL;. It would index a text file as one document per
+        line. Look for <filename>rcltxtlines.py</filename> in the
+        <filename>src/filters</filename> directory in the online &RCL;
+        <ulink url="https://opensourceprojects.eu/p/recoll1/">Git
+        repository</ulink> (the sample is not in the distributed release at
+        the moment).</para>

        <para>You can also have a look at the slightly more complex
        <command>rclzip</command> which uses Zip
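
Review note, not part of the patch: the ipath idea described in this commit (a handler returns one piece of data per sub-document, and Recoll sends it back later to extract that document) can be sketched with the "one document per line" scheme that the text attributes to `rcltxtlines.py`. The sketch below is standalone and deliberately does not use `rclexecm.py`; the class name `LineDocuments`, its method names, and the choice of the line index as ipath are assumptions made for illustration only.

```python
class LineDocuments:
    """Treat each non-empty line of a text as a separate sub-document."""

    def __init__(self, text):
        self.lines = text.splitlines()
        self.pos = 0  # index of the next line to examine while indexing

    def getnext(self):
        """Indexing pass: return (ipath, content) for the next non-empty
        line, or None when the text is exhausted."""
        while self.pos < len(self.lines):
            idx = self.pos
            self.pos += 1
            if self.lines[idx].strip():
                # The ipath is just the line index, stored as a string.
                return str(idx), self.lines[idx]
        return None

    def getipath(self, ipath):
        """Query-time pass: return the content for an ipath that was
        handed out earlier by getnext()."""
        idx = int(ipath)
        if not 0 <= idx < len(self.lines):
            raise KeyError(ipath)
        return self.lines[idx]
```

The point of the design is that the ipath is opaque to the indexer: it only has to be something the handler itself can map back to a sub-document, such as a line index here or, per the text above, a Zip member path for `rclzip`.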