Jean-Francois Dockes 2012-11-02 17:30:07 +01:00
parent 3d59c6933a
commit 39c2809b6a


@ -3037,55 +3037,106 @@ dir:recoll dir:src -dir:utils -dir:common
</chapter> <!-- Search --> </chapter> <!-- Search -->
<chapter id="rcl.program"> <chapter id="rcl.program">
<title>Programming interface</title> <title>Programming interface</title>
<para>&RCL; has an Application Programming Interface, usable both <para>&RCL; has an Application Programming Interface, usable both
for indexing and searching, currently accessible from the for indexing and searching, currently accessible from the
<application>Python</application> language.</para> <application>Python</application> language.</para>
<para>Another less radical way to extend the application is to <para>Another less radical way to extend the application is to
write filters for new types of documents.</para> write filters for new types of documents.</para>
<para>The processing of metadata attributes for documents <para>The processing of metadata attributes for documents
(<literal>fields</literal>) is highly configurable.</para> (<literal>fields</literal>) is highly configurable.</para>
<sect1 id="rcl.program.filters">
<sect1 id="rcl.program.filters">
<title>Writing a document filter</title> <title>Writing a document filter</title>
<para>&RCL; filters are executable programs which <para>&RCL; filters cooperate to translate from the multitude
translate from a specific format (ie: of input document formats, simple ones
<application>openoffice</application>, (such as <application>opendocument</application>,
<application>acrobat</application>, etc.) to the &RCL; <application>acrobat</application>), or compound ones such
indexing input format, which may be as <application>Zip</application>
<literal>text/plain</literal> or or <application>Email</application>, into the final &RCL;
<literal>text/html</literal>.</para> indexing input format, which may
be <literal>text/plain</literal>
or <literal>text/html</literal>. Most filters are executable
programs or scripts. A few filters are coded in C++ and live
inside <command>recollindex</command>. This latter kind will not
be described here.</para>
<para>As of &RCL; 1.13, there are two kinds of filters: <para>There are currently (1.18 and since 1.13) two kinds of
<itemizedlist> external executable filters:
<listitem><para>Simple filters (the old ones) run once and <itemizedlist>
exit. They can be bare programs like <listitem><para>Simple filters (<literal>exec</literal>
<application>antiword</application>, or shell-scripts using other filters) run once and
programs. They are very simple to write, because they just need exit. They can be bare programs
to output the converted to the standard output.</para> like <application>antiword</application>, or scripts
</listitem> using other programs. They are very simple to write,
<listitem><para>Multiple filters, new in 1.13, run as long as because they just need to print the converted document
their master process (ie: recollindex) is active. They can to the standard output. Their output can
process multiple files (sparing the process startup time which be <literal>text/plain</literal>
can be very significant), or multiple documents per file (ie: for or <literal>text/html</literal>.</para>
zip or chm files). They communicate with the indexer through a </listitem>
simple protocol, but are nevertheless a bit more complicated than <listitem><para>Multiple filters (<literal>execm</literal>
the older kind. Most of these new filters are written in filters), run as long as
<application>Python</application>, using a common module to their master process (<command>recollindex</command>) is
handle the protocol.</para> active. They can process multiple files (sparing the
</listitem> process startup time which can be very significant),
</itemizedlist> or multiple documents per file (e.g.: for zip or chm
The following will just describe the simple filters. If you can files). They communicate with the indexer through a
program and want to write one of the other kind, it shouldn't be too simple protocol, but are nevertheless a bit more
difficult to make sense of one of the existing modules. For example, complicated than the older kind. Most of the new
look at <command>rclzip</command> which uses Zip file paths as filters are written
internal identifiers (<literal>ipath</literal>), and in <application>Python</application>, using a common
<command>rclinfo</command>, which uses an integer index.</para> module to handle the protocol. There is an
exception, <command>rclimg</command> which is written
in Perl. The subdocuments output by these filters can
be directly indexable (text or HTML), or they can be
other simple or compound documents that will need to
be processed by another filter.</para>
</listitem>
</itemizedlist>
</para>
<para>In both cases, filters deal with regular file system
files, and can process either a single document, or a
linear list of documents in each file. &RCL; is responsible
for performing up-to-date checks, dealing with more complex
embedding, and handling other upper-level issues.</para>
<para>In the extreme case of a simple filter returning a
document in <literal>text/plain</literal> format, no
metadata can be transferred from the filter to the
indexer. Generic metadata, like document size or
modification date, will be gathered and stored by the
indexer.</para>
<para>Filters that produce <literal>text/html</literal>
format can return an arbitrary amount of metadata inside HTML
<literal>meta</literal> tags. These will be processed
according to the directives found in
the <link linkend="rcl.program.fields">
<filename>fields</filename> configuration
file</link>.</para>
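<para>As an illustration of the two paragraphs above, here is a
minimal sketch of the conversion step of a simple filter that
returns <literal>text/html</literal> and passes metadata
in <literal>meta</literal> tags. The function name, the input
layout (title on the first line, author on the second) and
the <literal>author</literal> field name are assumptions made for
the example, not part of &RCL;. Whether a given field is actually
indexed depends on the <filename>fields</filename>
configuration.</para>

```python
import html
import sys


def to_html(title, author, body):
    """Wrap extracted text in minimal HTML, carrying metadata in meta tags."""
    return (
        "<html><head>\n"
        f"<title>{html.escape(title)}</title>\n"
        '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n'
        # Hypothetical metadata field: any name/content pair could go here.
        f'<meta name="author" content="{html.escape(author)}">\n'
        "</head><body>\n"
        f"<pre>{html.escape(body)}</pre>\n"
        "</body></html>\n"
    )


def main(argv):
    """Convert the file named on the command line and print it to stdout."""
    if len(argv) != 2:
        sys.stderr.write("usage: examplefilter <file>\n")
        return 1
    with open(argv[1], encoding="utf-8", errors="replace") as f:
        lines = f.read().splitlines()
    # Assumed input layout: line 1 is the title, line 2 the author.
    title = lines[0] if lines else ""
    author = lines[1] if len(lines) > 1 else ""
    sys.stdout.write(to_html(title, author, "\n".join(lines[2:])))
    return 0
```

<para>An executable script would end by
calling <literal>sys.exit(main(sys.argv))</literal>. Escaping the
values with <literal>html.escape</literal> keeps the output well
formed when the title or author contains characters such
as <literal>&amp;</literal> or quotes.</para>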
<para>The filters that can handle multiple documents per file
return a single piece of data to identify each document inside
the file. This piece of data, called
an <literal>ipath element</literal>, will be sent back by
&RCL; to extract the document at query time, for previewing,
or for creating a temporary file to be opened by a
viewer.</para>
<para>The following section describes the simple
filters, and the next one gives a few explanations about
the <literal>execm</literal> ones. You could conceivably
write a simple filter with only the elements in the
manual. This will not be the case for the other ones, for
which you will have to look at the code.</para>
<sect2 id="rcl.program.filters.simple"> <sect2 id="rcl.program.filters.simple">
<title>Simple filters</title> <title>Simple filters</title>
@ -3126,6 +3177,51 @@ dir:recoll dir:src -dir:utils -dir:common
</sect2> </sect2>
<sect2 id="rcl.program.filters.multiple">
<title>"Multiple" filters</title>
<para>If you can program and want to write
an <literal>execm</literal> filter, it should not be too
difficult to make sense of one of the existing modules. For
example, look at <command>rclzip</command> which uses Zip
file paths as identifiers (<literal>ipath</literal>),
and <command>rclics</command>, which uses an integer
index. Also have a look at the comments inside
the <filename>internfile/mh_execm.h</filename> file and
possibly at the corresponding module.</para>
<para><literal>execm</literal> filters sometimes need to
choose the nature of the <literal>ipath</literal>
elements that they use in communication with the
indexer. Here are a few guidelines:
<itemizedlist>
<listitem><para>Use ASCII or UTF-8. If the identifier is an
integer, print it in decimal, as <literal>printf %d</literal> would
do.</para></listitem>
<listitem><para>If at all possible, the data should make some
kind of sense when printed to a log file to help with
debugging.</para></listitem>
<listitem><para>&RCL; uses a colon (<literal>:</literal>) as a
separator to store a complex path internally (for
deeper embedding). Colons inside
the <literal>ipath</literal> elements output by a
filter will be escaped, but a colon is still a poor choice
of filter-specific separator, mostly, again, because it
complicates debugging.</para></listitem>
</itemizedlist>
In any case, the main goal is that it should
be easy for the filter to extract the target document, given
the file name and the <literal>ipath</literal>
element.</para>
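<para>The guidelines above can be sketched as follows. This is
not the code of <command>rclzip</command> or of any other &RCL;
filter: all function names are hypothetical, and the example only
shows the two identifier styles mentioned (a member path and a
decimal integer) together with extraction from the file and
the <literal>ipath</literal> element.</para>

```python
import io
import zipfile


def index_ipath(index):
    """Format an integer identifier in decimal, as printf %d would."""
    return "%d" % index


def list_ipaths(archive):
    """Return one ipath element per archive member, skipping directories."""
    with zipfile.ZipFile(archive) as zf:
        return [name for name in zf.namelist() if not name.endswith("/")]


def extract(archive, ipath):
    """Return the bytes of the member designated by the ipath element."""
    with zipfile.ZipFile(archive) as zf:
        return zf.read(ipath)


def make_sample_archive():
    """Build a small in-memory Zip file, used only for demonstration."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("docs/a.txt", "hello")
        zf.writestr("b.txt", "world")
    return buf
```

<para>Using the member path as the identifier keeps the data
readable in a log file, and makes extraction a single lookup given
the file and the <literal>ipath</literal> element. The functions
accept either a file name or a file-like object, as
<literal>zipfile.ZipFile</literal> does.</para>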
<para><literal>execm</literal> filters will also produce
a document with a null <literal>ipath</literal>
element. Depending on the type of document, this may have
some associated data (e.g. the body of an email message), or
none (typical for an archive file). If it is empty, this
document will be useful anyway for some operations, as the
parent of the actual data documents.</para>
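<para>A minimal sketch of this enumeration, under stated
assumptions: the null <literal>ipath</literal> element is
represented here by an empty string, the parent document is
produced first, and the function name and the dictionary of
members are hypothetical. None of this describes the actual
protocol.</para>

```python
def iter_documents(members, self_data=b""):
    """Yield (ipath, data) pairs for one container file.

    The first pair has an empty ipath and stands for the file itself.
    Its data may be empty (archive) or not (body of an email message).
    """
    yield "", self_data
    # Sorted order keeps the enumeration reproducible between runs.
    for ipath in sorted(members):
        yield ipath, members[ipath]
```

<para>For an archive, <literal>self_data</literal> stays empty and
the first pair only serves as the parent of the others. For an
email message, it would hold the message body, with the
attachments as members.</para>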
<sect2 id="rcl.program.filters.association"> <sect2 id="rcl.program.filters.association">
<title>Telling &RCL; about the filter</title> <title>Telling &RCL; about the filter</title>