doc

2012-11-02 17:30:07 +01:00 · 2012-11-02 17:30:07 +01:00 · 39c2809b6a
commit 39c2809b6a
parent 3d59c6933a
1 changed files with 138 additions and 42 deletions
--- a/src/doc/user/usermanual.sgml
+++ b/src/doc/user/usermanual.sgml
@@ -3037,55 +3037,106 @@ dir:recoll dir:src -dir:utils -dir:common
  </chapter> <!-- Search -->
-  <chapter id="rcl.program">
+    <chapter id="rcl.program">
-    <title>Programming interface</title>
+      <title>Programming interface</title>
-    <para>&RCL; has an Application Programming Interface, usable both
+      <para>&RCL; has an Application Programming Interface, usable both
-    for indexing and searching, currently accessible from the
+        for indexing and searching, currently accessible from the
-    <application>Python</application> language.</para>
+        <application>Python</application> language.</para>
-    <para>Another less radical way to extend the application is to
+      <para>Another less radical way to extend the application is to
-    write filters for new types of documents.</para>
+        write filters for new types of documents.</para>
-    <para>The processing of metadata attributes for documents
+      <para>The processing of metadata attributes for documents
-    (<literal>fields</literal>) is highly configurable.</para>
+        (<literal>fields</literal>) is highly configurable.</para>
-    <sect1 id="rcl.program.filters">
+
      <sect1 id="rcl.program.filters">
        <title>Writing a document filter</title>
-      <para>&RCL; filters are executable programs which 
+        <para>&RCL; filters cooperate to translate from the multitude
-        translate from a specific format (ie:
+        of input document formats, simple ones (such
-        <application>openoffice</application>,
+        as <application>opendocument</application>, 
-        <application>acrobat</application>, etc.) to the &RCL;
+          <application>acrobat</application>), or compound ones such
-        indexing input format, which may be
+          as <application>Zip</application>
-        <literal>text/plain</literal> or
+          or <application>Email</application>, into the final &RCL;
-        <literal>text/html</literal>.</para> 
+          indexing input format, which may
          be <literal>text/plain</literal>
          or <literal>text/html</literal>. Most filters are executable
          programs or scripts. A few filters are coded in C++ and live
          inside <command>recollindex</command>. This latter kind will not
          be described here.</para>
-      <para>As of &RCL; 1.13, there are two kinds of filters:
+        <para>As of &RCL; 1.18 (and since 1.13), there are two kinds of
-        <itemizedlist>
+        external executable filters:
-	  <listitem><para>Simple filters (the old ones) run once and
+          <itemizedlist>
-	  exit. They can be bare programs like
+	    <listitem><para>Simple filters (<literal>exec</literal>
-	  <application>antiword</application>, or shell-scripts using other
+	        filters) run once and
-	  programs. They are very simple to write, because they just need
+	        exit. They can be bare programs
-	  to output the converted to the standard output.</para>
+	        like <application>antiword</application>, or scripts
-	  </listitem>
+	        using other programs. They are very simple to write,
-	  <listitem><para>Multiple filters, new in 1.13, run as long as
+	        because they just need to print the converted document
-	  their master process (ie: recollindex) is active. They can
+	        to the standard output. Their output can
-	  process multiple files (sparing the process startup time which
+	        be <literal>text/plain</literal>
-	  can be very significant), or multiple documents per file (ie: for
+	        or <literal>text/html</literal>.</para>
-	  zip or chm files). They communicate with the indexer through a
+	    </listitem>
-	  simple protocol, but are nevertheless a bit more complicated than
+	    <listitem><para>Multiple filters (<literal>execm</literal>
-	  the older kind. Most of these new filters are written in
+	        filters) run as long as
-	  <application>Python</application>, using a common module to
+	        their master process (<command>recollindex</command>) is
-	  handle the protocol.</para>
+	        active. They can process multiple files (sparing the
-	  </listitem>
+	        process startup time which can be very significant),
-	</itemizedlist>
+	        or multiple documents per file (e.g. for zip or chm
-      The following will just describe the simple filters. If you can
+	        files). They communicate with the indexer through a
-      program and want to write one of the other kind, it shouldn't be too
+	        simple protocol, but are nevertheless a bit more
-      difficult to make sense of one of the existing modules. For example,
+	        complicated than the older kind. Most of the new
-      look at <command>rclzip</command> which uses Zip file paths as
+	        filters are written
-      internal identifiers (<literal>ipath</literal>), and
+	        in <application>Python</application>, using a common
-      <command>rclinfo</command>, which uses an integer index.</para> 
+	        module to handle the protocol. There is an
 	        exception, <command>rclimg</command> which is written
 	        in Perl. The subdocuments output by these filters can
 	        be directly indexable (text or HTML), or they can be
 	        other simple or compound documents that will need to
 	        be processed by another filter.</para>
 	    </listitem>
 	  </itemizedlist>
        </para>
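+        <para>As a rough illustration of the <literal>exec</literal>
+          kind, the following Python sketch runs once on one file and
+          prints plain text to the standard output. It assumes that the
+          indexer passes the file name as the only command-line
+          argument. The <command>rclexample</command> name and the
+          trivial <literal>convert()</literal> step are hypothetical
+          placeholders, not part of &RCL;.</para>
+        <programlisting><![CDATA[
```python
import sys


def convert(raw: bytes) -> str:
    """Convert the (hypothetical) input format to plain text.

    Placeholder conversion: decode as UTF-8, replacing invalid bytes,
    and normalize line endings. A real filter would parse its format here.
    """
    text = raw.decode("utf-8", errors="replace")
    return text.replace("\r\n", "\n").replace("\r", "\n")


def main(argv, out=sys.stdout) -> int:
    """Process one file, write the converted text, return an exit status."""
    if len(argv) != 2:
        # Assumption: exactly one argument, the path of the file to convert.
        print("usage: rclexample <file>", file=sys.stderr)
        return 1
    with open(argv[1], "rb") as handle:
        out.write(convert(handle.read()))
    return 0
```
+        ]]></programlisting>
+        <para>A real script would end by
+          calling <literal>sys.exit(main(sys.argv))</literal>, so that
+          it runs once and exits.</para>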
        <para>In both cases, filters deal with regular file system
          files, and can process either a single document, or a
          linear list of documents in each file. &RCL; is responsible
          for performing up-to-date checks, handling more complex
          embedding, and dealing with other upper-level issues.</para>
        <para>In the extreme case of a simple filter returning a
          document in <literal>text/plain</literal> format, no
          metadata can be transferred from the filter to the
          indexer. Generic metadata, like document size or
          modification date, will be gathered and stored by the
          indexer.</para> 
        <para>Filters that produce  <literal>text/html</literal>
          format can return an arbitrary amount of metadata inside HTML
          <literal>meta</literal> tags. These will be processed
          according to the directives found in 
          the <link linkend="rcl.program.fields">
            <filename>fields</filename> configuration
            file</link>.</para>
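+        <para>The sketch below shows one way a filter written in Python
+          might wrap its converted text in <literal>text/html</literal>
+          output, with one <literal>meta</literal> tag per metadata
+          value. The <literal>to_html()</literal> helper and the field
+          names passed to it are illustrative assumptions. How each
+          field is processed still depends on
+          the <filename>fields</filename> configuration.</para>
+        <programlisting><![CDATA[
```python
from html import escape


def to_html(body: str, metadata: dict) -> str:
    """Wrap converted text in minimal HTML, one meta tag per metadata field."""
    meta_lines = [
        '<meta name="{}" content="{}">'.format(
            escape(name, quote=True), escape(value, quote=True)
        )
        for name, value in metadata.items()
    ]
    lines = ["<html><head>"]
    # Declare the encoding of the generated page.
    lines.append('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">')
    lines.extend(meta_lines)
    lines.append("</head><body>")
    # Escape the body so that markup characters in the text stay literal.
    lines.append("<pre>" + escape(body) + "</pre>")
    lines.append("</body></html>")
    return "\n".join(lines)
```
+        ]]></programlisting>
+        <para>For example, <literal>to_html(text, {"author": "J.
+          Doe"})</literal> would emit a single <literal>author</literal>
+          tag in the document head.</para>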
        <para>The filters that can handle multiple documents per file
          return a single piece of data to identify each document inside
          the file. This piece of data, called
          an <literal>ipath element</literal>, will be sent back by
          &RCL; to extract the document at query time, for previewing,
          or for creating a temporary file to be opened by a
          viewer.</para>  
        <para>The following section describes the simple
          filters, and the next one gives a few explanations about
          the <literal>execm</literal> ones. You could conceivably
          write a simple filter with only the elements in the
          manual. This will not be the case for the other ones, for
          which you will have to look at the code.</para>
      <sect2 id="rcl.program.filters.simple">
        <title>Simple filters</title>
@@ -3126,6 +3177,51 @@ dir:recoll dir:src -dir:utils -dir:common
      </sect2>
      <sect2 id="rcl.program.filters.multiple">
        <title>"Multiple" filters</title>
        <para>If you can program and want to write
          an <literal>execm</literal> filter, it should not be too
          difficult to make sense of one of the existing modules. For
          example, look at <command>rclzip</command> which uses Zip
          file paths as identifiers (<literal>ipath</literal>),
          and <command>rclics</command>, which uses an integer
          index. Also have a look at the comments inside
          the <filename>internfile/mh_execm.h</filename> file and
          possibly at the corresponding module.</para>
        <para><literal>execm</literal> filters sometimes need to make
          a choice for the nature of the <literal>ipath</literal>
          elements that they use in communication with the
          indexer. Here are a few guidelines:
          <itemizedlist>
            <listitem><para>Use ASCII or UTF-8. If the identifier is an
                integer, print it as decimal text, for example
                as <literal>printf %d</literal> would.</para></listitem>
            <listitem><para>If at all possible, the data should make some
              kind of sense when printed to a log file to help with 
                debugging.</para></listitem>
            <listitem><para>&RCL; uses a colon (<literal>:</literal>) as a
                separator to store a complex path internally (for
                deeper embedding). Colons inside
                the <literal>ipath</literal> elements output by a
                filter will be escaped, but would be a bad choice as a
                filter-specific separator (mostly, again, for
                debugging issues).</para></listitem>
          </itemizedlist>
          In any case, the main goal is that it should
          be easy for the filter to extract the target document, given
          the file name and the <literal>ipath</literal>
          element.</para>
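+        <para>To make these guidelines concrete, here is a minimal
+          sketch using only the standard
+          Python <literal>zipfile</literal> module. It is not the code
+          of <command>rclzip</command>, and the function names are
+          hypothetical. It uses the member path as
+          the <literal>ipath</literal> element, so that extraction
+          needs nothing more than the file and that element. It also
+          shows an integer identifier printed as decimal text.</para>
+        <programlisting><![CDATA[
```python
import zipfile


def list_ipaths(archive) -> list:
    """Return one ipath element per regular member: its path inside the Zip."""
    with zipfile.ZipFile(archive) as zf:
        return [info.filename for info in zf.infolist() if not info.is_dir()]


def extract(archive, ipath: str) -> bytes:
    """Return the member data, given only the archive and the ipath element."""
    with zipfile.ZipFile(archive) as zf:
        return zf.read(ipath)


def index_ipath(position: int) -> str:
    """For index-based filters: print the integer as printf %d would."""
    return "%d" % position
```
+        ]]></programlisting>
+        <para>Both kinds of identifier are readable when printed to a
+          log file, and neither relies on a colon as a filter-specific
+          separator.</para>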
        <para><literal>execm</literal> filters will also produce
          a document with a null <literal>ipath</literal>
          element. Depending on the type of document, this may have
          some associated data (e.g. the body of an email message), or
          none (typical for an archive file). If it is empty, this
          document will be useful anyway for some operations, as the
          parent of the actual data documents.</para>
      </sect2>
      <sect2 id="rcl.program.filters.association">
        <title>Telling &RCL; about the filter</title>