doc

2012-11-02 17:30:07 +01:00 · 2012-11-02 17:30:07 +01:00 · 39c2809b6a
commit 39c2809b6a
parent 3d59c6933a
1 changed files with 138 additions and 42 deletions
--- a/src/doc/user/usermanual.sgml
+++ b/src/doc/user/usermanual.sgml
@@ -3050,42 +3050,93 @@ dir:recoll dir:src -dir:utils -dir:common
      <para>The processing of metadata attributes for documents
        (<literal>fields</literal>) is highly configurable.</para>
      <sect1 id="rcl.program.filters">
        <title>Writing a document filter</title>
-      <para>&RCL; filters are executable programs which 
+        <para>&RCL; filters cooperate to translate from the multitude
-        translate from a specific format (ie:
+        of input document formats, simple ones
-        <application>openoffice</application>,
+        (such as <application>opendocument</application> or
-        <application>acrobat</application>, etc.) to the &RCL;
+          <application>acrobat</application>), or compound ones such
-        indexing input format, which may be
+          as <application>Zip</application>
-        <literal>text/plain</literal> or
+          or <application>Email</application>, into the final &RCL;
-        <literal>text/html</literal>.</para> 
+          indexing input format, which may
          be <literal>text/plain</literal>
          or <literal>text/html</literal>. Most filters are executable
          programs or scripts. A few filters are coded in C++ and live
          inside <command>recollindex</command>. This latter kind will not
          be described here.</para>
-      <para>As of &RCL; 1.13, there are two kinds of filters:
+        <para>There are currently (1.18 and since 1.13) two kinds of
        external executable filters:
          <itemizedlist>
-	  <listitem><para>Simple filters (the old ones) run once and
+	    <listitem><para>Simple filters (<literal>exec</literal>
-	  exit. They can be bare programs like
+	        filters) run once and
-	  <application>antiword</application>, or shell-scripts using other
+	        exit. They can be bare programs
-	  programs. They are very simple to write, because they just need
+	        like <application>antiword</application>, or scripts
-	  to output the converted to the standard output.</para>
+	        using other programs. They are very simple to write,
 	        because they just need to print the converted document
 	        to the standard output. Their output can
 	        be <literal>text/plain</literal>
 	        or <literal>text/html</literal>.</para>
 	    </listitem>
-	  <listitem><para>Multiple filters, new in 1.13, run as long as
+	    <listitem><para>Multiple filters (<literal>execm</literal>
-	  their master process (ie: recollindex) is active. They can
+	        filters), run as long as
-	  process multiple files (sparing the process startup time which
+	        their master process (<command>recollindex</command>) is
-	  can be very significant), or multiple documents per file (ie: for
+	        active. They can process multiple files (sparing the
-	  zip or chm files). They communicate with the indexer through a
+	        process startup time which can be very significant),
-	  simple protocol, but are nevertheless a bit more complicated than
+	        or multiple documents per file (e.g. for zip or chm
-	  the older kind. Most of these new filters are written in
+	        files). They communicate with the indexer through a
-	  <application>Python</application>, using a common module to
+	        simple protocol, but are nevertheless a bit more
-	  handle the protocol.</para>
+	        complicated than the older kind. Most of the new
 	        filters are written
 	        in <application>Python</application>, using a common
 	        module to handle the protocol. There is an
 	        exception, <command>rclimg</command>, which is written
 	        in Perl. The subdocuments output by these filters can
 	        be directly indexable (text or HTML), or they can be
 	        other simple or compound documents that will need to
 	        be processed by another filter.</para>
 	    </listitem>
 	  </itemizedlist>
-      The following will just describe the simple filters. If you can
+        </para>
-      program and want to write one of the other kind, it shouldn't be too
+
-      difficult to make sense of one of the existing modules. For example,
+        <para>In both cases, filters deal with regular file system
-      look at <command>rclzip</command> which uses Zip file paths as
+          files, and can process either a single document, or a
-      internal identifiers (<literal>ipath</literal>), and
+          linear list of documents in each file. &RCL; is responsible
-      <command>rclinfo</command>, which uses an integer index.</para> 
+          for performing up-to-date checks, dealing with more complex
          embedding and other upper-level issues.</para>
        <para>In the extreme case of a simple filter returning a
          document in <literal>text/plain</literal> format, no
          metadata can be transferred from the filter to the
          indexer. Generic metadata, like document size or
          modification date, will be gathered and stored by the
          indexer.</para> 
        <para>Filters that produce <literal>text/html</literal>
          format can return an arbitrary amount of metadata inside HTML
          <literal>meta</literal> tags. These will be processed
          according to the directives found in 
          the <link linkend="rcl.program.fields">
            <filename>fields</filename> configuration
            file</link>.</para>
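
To illustrate how a simple filter might pass metadata inside HTML `meta` tags, here is a minimal Python sketch. It is not taken from the Recoll sources: the names `render_html` and `convert`, the choice of a `title` field, and the assumption that the input file name arrives as the first command-line argument are all hypothetical. Which `meta` names are actually used depends on the `fields` configuration file.

```python
#!/usr/bin/env python3
"""Sketch of a simple (exec) filter that prints text/html with metadata.

Assumptions (not from the manual): the input file name is the first
command-line argument, and a "title" meta tag is a useful field.
"""
import html
import os
import sys


def render_html(body_text, metadata):
    """Return an HTML document carrying the metadata in <meta> tags."""
    metas = "\n".join(
        '<meta name="{}" content="{}">'.format(
            html.escape(name, quote=True), html.escape(value, quote=True)
        )
        for name, value in sorted(metadata.items())
    )
    return (
        "<html><head>\n"
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">\n'
        + metas + "\n"
        "</head><body>\n"
        "<pre>" + html.escape(body_text) + "</pre>\n"
        "</body></html>\n"
    )


def convert(path):
    """Read a text file and wrap it as HTML, using the file name as title."""
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        text = handle.read()
    return render_html(text, {"title": os.path.basename(path)})


if __name__ == "__main__":
    # A simple filter runs once, prints the converted document, and exits.
    if len(sys.argv) == 2:
        sys.stdout.write(convert(sys.argv[1]))
```

Escaping both the body and the attribute values keeps the output well-formed even when the source text contains characters such as `<` or `&`.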
        <para>The filters that can handle multiple documents per file
          return a single piece of data to identify each document inside
          the file. This piece of data, called
          an <literal>ipath element</literal>, will be sent back by
          &RCL; to extract the document at query time, for previewing,
          or for creating a temporary file to be opened by a
          viewer.</para>  
        <para>The following section describes the simple
          filters, and the next one gives a few explanations about
          the <literal>execm</literal> ones. You could conceivably
          write a simple filter with only the elements in the
          manual. This will not be the case for the other ones, for
          which you will have to look at the code.</para>
      <sect2 id="rcl.program.filters.simple">
        <title>Simple filters</title>
@@ -3126,6 +3177,51 @@ dir:recoll dir:src -dir:utils -dir:common
      </sect2>
      <sect2 id="rcl.program.filters.multiple">
        <title>"Multiple" filters</title>
        <para>If you can program and want to write
          an <literal>execm</literal> filter, it should not be too
          difficult to make sense of one of the existing modules. For
          example, look at <command>rclzip</command> which uses Zip
          file paths as identifiers (<literal>ipath</literal>),
          and <command>rclics</command>, which uses an integer
          index. Also have a look at the comments inside
          the <filename>internfile/mh_execm.h</filename> file and
          possibly at the corresponding module.</para>
        <para><literal>execm</literal> filters sometimes need to
          choose the form of the <literal>ipath</literal>
          elements that they use in communication with the
          indexer. Here are a few guidelines:
          <itemizedlist>
            <listitem><para>Use ASCII or UTF-8 (if the identifier is an
                integer print it, for example, like printf %d would
                do).</para></listitem>
            <listitem><para>If at all possible, the data should make some
              kind of sense when printed to a log file to help with 
                debugging.</para></listitem>
            <listitem><para>&RCL; uses a colon (<literal>:</literal>) as a
                separator to store a complex path internally (for
                deeper embedding). Colons inside
                the <literal>ipath</literal> elements output by a
                filter will be escaped, but would be a bad choice as a
                filter-specific separator (mostly, again, for
                debugging issues).</para></listitem>
          </itemizedlist>
          In any case, the main goal is that it should
          be easy for the filter to extract the target document, given
          the file name and the <literal>ipath</literal>
          element.</para>
        <para><literal>execm</literal> filters will also produce
          a document with a null <literal>ipath</literal>
          element. Depending on the type of document, this may have
          some associated data (e.g. the body of an email message), or
          none (typical for an archive file). If it is empty, this
          document will be useful anyway for some operations, as the
          parent of the actual data documents.</para>
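
The ipath guidelines and the null-ipath document described above can be sketched in Python as follows. This is only an illustration of the integer-index choice, not the actual execm protocol or the code of any existing filter. The functions `list_ipaths` and `extract`, and the in-memory list standing in for a container file, are hypothetical.

```python
def list_ipaths(documents):
    """Return one ipath per embedded document.

    Each ipath is the decimal index printed as plain ASCII (as printf %d
    would do), so it stays readable in a log file and contains no colon.
    """
    return ["%d" % index for index in range(len(documents))]


def extract(documents, ipath):
    """Return the data for the document designated by ipath.

    An empty ipath designates the container itself. In this sketch the
    container is archive-like, so it carries no data of its own.
    """
    if ipath == "":
        return ""
    return documents[int(ipath)]


# Hypothetical container holding two embedded documents.
container = ["first embedded document", "second embedded document"]
for ipath in [""] + list_ipaths(container):
    print(repr(ipath), "->", repr(extract(container, ipath)))
```

With an integer index, going from the ipath back to the target document is a single lookup, which is the main goal stated above.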
      <sect2 id="rcl.program.filters.association">
        <title>Telling &RCL; about the filter</title>