doc
This commit is contained in:
parent
3d59c6933a
commit
39c2809b6a
@ -3037,55 +3037,106 @@ dir:recoll dir:src -dir:utils -dir:common
|
|||||||
</chapter> <!-- Search -->
|
</chapter> <!-- Search -->
|
||||||
|
|
||||||
|
|
||||||
<chapter id="rcl.program">
|
<chapter id="rcl.program">
|
||||||
<title>Programming interface</title>
|
<title>Programming interface</title>
|
||||||
|
|
||||||
<para>&RCL; has an Application Programming Interface, usable both
|
<para>&RCL; has an Application Programming Interface, usable both
|
||||||
for indexing and searching, currently accessible from the
|
for indexing and searching, currently accessible from the
|
||||||
<application>Python</application> language.</para>
|
<application>Python</application> language.</para>
|
||||||
|
|
||||||
<para>Another less radical way to extend the application is to
|
<para>Another less radical way to extend the application is to
|
||||||
write filters for new types of documents.</para>
|
write filters for new types of documents.</para>
|
||||||
|
|
||||||
<para>The processing of metadata attributes for documents
|
<para>The processing of metadata attributes for documents
|
||||||
(<literal>fields</literal>) is highly configurable.</para>
|
(<literal>fields</literal>) is highly configurable.</para>
|
||||||
|
|
||||||
<sect1 id="rcl.program.filters">
|
|
||||||
|
|
||||||
|
<sect1 id="rcl.program.filters">
|
||||||
<title>Writing a document filter</title>
|
<title>Writing a document filter</title>
|
||||||
|
|
||||||
<para>&RCL; filters are executable programs which
|
<para>&RCL; filters cooperate to translate from the multitude
|
||||||
translate from a specific format (ie:
|
of input document formats, simple ones (such
|
||||||
<application>openoffice</application>,
|
as <application>opendocument</application>,
|
||||||
<application>acrobat</application>, etc.) to the &RCL;
|
<application>acrobat</application>), or compound ones such
|
||||||
indexing input format, which may be
|
as <application>Zip</application>
|
||||||
<literal>text/plain</literal> or
|
or <application>Email</application>, into the final &RCL;
|
||||||
<literal>text/html</literal>.</para>
|
indexing input format, which may
|
||||||
|
be <literal>text/plain</literal>
|
||||||
|
or <literal>text/html</literal>. Most filters are executable
|
||||||
|
programs or scripts. A few filters are coded in C++ and live
|
||||||
|
inside <command>recollindex</command>. This latter kind will not
|
||||||
|
be described here.</para>
|
||||||
|
|
||||||
<para>As of &RCL; 1.13, there are two kinds of filters:
|
<para>There are currently (1.18 and since 1.13) two kinds of
|
||||||
<itemizedlist>
|
external executable filters:
|
||||||
<listitem><para>Simple filters (the old ones) run once and
|
<itemizedlist>
|
||||||
exit. They can be bare programs like
|
<listitem><para>Simple filters (<literal>exec</literal>
|
||||||
<application>antiword</application>, or shell-scripts using other
|
filters) run once and
|
||||||
programs. They are very simple to write, because they just need
|
exit. They can be bare programs
|
||||||
to output the converted to the standard output.</para>
|
like <application>antiword</application>, or scripts
|
||||||
</listitem>
|
using other programs. They are very simple to write,
|
||||||
<listitem><para>Multiple filters, new in 1.13, run as long as
|
because they just need to print the converted document
|
||||||
their master process (ie: recollindex) is active. They can
|
to the standard output. Their output can
|
||||||
process multiple files (sparing the process startup time which
|
be <literal>text/plain</literal>
|
||||||
can be very significant), or multiple documents per file (ie: for
|
or <literal>text/html</literal>.</para>
|
||||||
zip or chm files). They communicate with the indexer through a
|
</listitem>
|
||||||
simple protocol, but are nevertheless a bit more complicated than
|
<listitem><para>Multiple filters (<literal>execm</literal>
|
||||||
the older kind. Most of these new filters are written in
|
filters) run as long as
|
||||||
<application>Python</application>, using a common module to
|
their master process (<command>recollindex</command>) is
|
||||||
handle the protocol.</para>
|
active. They can process multiple files (sparing the
|
||||||
</listitem>
|
process startup time which can be very significant),
|
||||||
</itemizedlist>
|
or multiple documents per file (e.g. for zip or chm
|
||||||
The following will just describe the simple filters. If you can
|
files). They communicate with the indexer through a
|
||||||
program and want to write one of the other kind, it shouldn't be too
|
simple protocol, but are nevertheless a bit more
|
||||||
difficult to make sense of one of the existing modules. For example,
|
complicated than the older kind. Most of the new
|
||||||
look at <command>rclzip</command> which uses Zip file paths as
|
filters are written
|
||||||
internal identifiers (<literal>ipath</literal>), and
|
in <application>Python</application>, using a common
|
||||||
<command>rclinfo</command>, which uses an integer index.</para>
|
module to handle the protocol. There is an
|
||||||
|
exception, <command>rclimg</command>, which is written
|
||||||
|
in Perl. The subdocuments output by these filters can
|
||||||
|
be directly indexable (text or HTML), or they can be
|
||||||
|
other simple or compound documents that will need to
|
||||||
|
be processed by another filter.</para>
|
||||||
|
</listitem>
|
||||||
|
</itemizedlist>
|
||||||
|
</para>
|
||||||
|
|
||||||
|
<para>In both cases, filters deal with regular file system
|
||||||
|
files, and can process either a single document, or a
|
||||||
|
linear list of documents in each file. &RCL; is responsible
|
||||||
|
for performing up-to-date checks, dealing with more complex
|
||||||
|
embedding and other upper level issues.</para>
|
||||||
|
|
||||||
|
<para>In the extreme case of a simple filter returning a
|
||||||
|
document in <literal>text/plain</literal> format, no
|
||||||
|
metadata can be transferred from the filter to the
|
||||||
|
indexer. Generic metadata, like document size or
|
||||||
|
modification date, will be gathered and stored by the
|
||||||
|
indexer.</para>
|
||||||
|
|
||||||
|
<para>Filters that produce <literal>text/html</literal>
|
||||||
|
format can return an arbitrary amount of metadata inside HTML
|
||||||
|
<literal>meta</literal> tags. These will be processed
|
||||||
|
according to the directives found in
|
||||||
|
the <link linkend="rcl.program.fields">
|
||||||
|
<filename>fields</filename> configuration
|
||||||
|
file</link>.</para>
|
||||||
|
|
||||||
|
<para>The filters that can handle multiple documents per file
|
||||||
|
return a single piece of data to identify each document inside
|
||||||
|
the file. This piece of data, called
|
||||||
|
an <literal>ipath element</literal>, will be sent back by
|
||||||
|
&RCL; to extract the document at query time, for previewing,
|
||||||
|
or for creating a temporary file to be opened by a
|
||||||
|
viewer.</para>
|
||||||
|
|
||||||
|
<para>The following section describes the simple
|
||||||
|
filters, and the next one gives a few explanations about
|
||||||
|
the <literal>execm</literal> ones. You could conceivably
|
||||||
|
write a simple filter with only the elements in the
|
||||||
|
manual. This will not be the case for the other ones, for
|
||||||
|
which you will have to look at the code.</para>
|
||||||
|
|
||||||
<sect2 id="rcl.program.filters.simple">
|
<sect2 id="rcl.program.filters.simple">
|
||||||
<title>Simple filters</title>
|
<title>Simple filters</title>
|
||||||
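The simple filters described in this hunk run once, print the converted document to standard output, and exit; when the output is text/html, metadata travels in HTML meta tags. A minimal sketch of such a filter in Python follows. It is only an illustration: the function names, the "author" field, and the assumption that the input file path arrives as the single command-line argument are hypothetical and not taken from the &RCL; sources.

```python
#!/usr/bin/env python3
"""Sketch of a simple (exec-style) filter: convert one file, print HTML, exit."""
import html
import sys


def to_html(text, title, meta=None):
    """Wrap plain text in a minimal HTML page, passing metadata as meta tags."""
    meta = meta or {}
    lines = [
        "<html><head>",
        "<title>%s</title>" % html.escape(title),
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">',
    ]
    # Each metadata item becomes one <meta name=... content=...> tag.
    for name, value in sorted(meta.items()):
        lines.append('<meta name="%s" content="%s">'
                     % (html.escape(name), html.escape(value)))
    lines.append("</head><body><pre>%s</pre></body></html>" % html.escape(text))
    return "\n".join(lines)


def main(argv):
    """Assumed calling convention (hypothetical): filter <input file>."""
    with open(argv[1], encoding="utf-8", errors="replace") as f:
        text = f.read()
    # The "author" value is a placeholder; a real filter would extract it
    # from the input document.
    sys.stdout.write(to_html(text, title=argv[1], meta={"author": "unknown"}))
    return 0


if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv))
```

A filter that only printed the raw text would be a valid text/plain filter, but, as the new text notes, it could not pass any metadata to the indexer.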
@ -3126,6 +3177,51 @@ dir:recoll dir:src -dir:utils -dir:common
|
|||||||
|
|
||||||
</sect2>
|
</sect2>
|
||||||
|
|
||||||
|
<sect2 id="rcl.program.filters.multiple">
|
||||||
|
<title>"Multiple" filters</title>
|
||||||
|
|
||||||
|
<para>If you can program and want to write
|
||||||
|
an <literal>execm</literal> filter, it should not be too
|
||||||
|
difficult to make sense of one of the existing modules. For
|
||||||
|
example, look at <command>rclzip</command>, which uses Zip
|
||||||
|
file paths as identifiers (<literal>ipath</literal>),
|
||||||
|
and <command>rclics</command>, which uses an integer
|
||||||
|
index. Also have a look at the comments inside
|
||||||
|
the <filename>internfile/mh_execm.h</filename> file and
|
||||||
|
possibly at the corresponding module.</para>
|
||||||
|
|
||||||
|
<para><literal>execm</literal> filters sometimes need to make
|
||||||
|
a choice for the nature of the <literal>ipath</literal>
|
||||||
|
elements that they use in communication with the
|
||||||
|
indexer. Here are a few guidelines:
|
||||||
|
<itemizedlist>
|
||||||
|
<listitem><para>Use ASCII or UTF-8 (if the identifier is an
|
||||||
|
integer, print it, for example, like printf %d would
|
||||||
|
do).</para></listitem>
|
||||||
|
<listitem><para>If at all possible, the data should make some
|
||||||
|
kind of sense when printed to a log file to help with
|
||||||
|
debugging.</para></listitem>
|
||||||
|
<listitem><para>&RCL; uses a colon (<literal>:</literal>) as a
|
||||||
|
separator to store a complex path internally (for
|
||||||
|
deeper embedding). Colons inside
|
||||||
|
the <literal>ipath</literal> elements output by a
|
||||||
|
filter will be escaped, but a colon would be a bad choice as a
|
||||||
|
filter-specific separator (mostly, again, for
|
||||||
|
debugging issues).</para></listitem>
|
||||||
|
</itemizedlist>
|
||||||
|
In any case, the main goal is that it should
|
||||||
|
be easy for the filter to extract the target document, given
|
||||||
|
the file name and the <literal>ipath</literal>
|
||||||
|
element.</para>
|
||||||
|
|
||||||
|
<para><literal>execm</literal> filters will also produce
|
||||||
|
a document with a null <literal>ipath</literal>
|
||||||
|
element. Depending on the type of document, this may have
|
||||||
|
some associated data (e.g. the body of an email message), or
|
||||||
|
none (typical for an archive file). If it is empty, this
|
||||||
|
document will be useful anyway for some operations, as the
|
||||||
|
parent of the actual data documents.</para>
|
||||||
|
|
||||||
<sect2 id="rcl.program.filters.association">
|
<sect2 id="rcl.program.filters.association">
|
||||||
<title>Telling &RCL; about the filter</title>
|
<title>Telling &RCL; about the filter</title>
|
||||||
|
|
||||||
|
|||||||
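The ipath guidelines in this hunk (ASCII or UTF-8, readable in logs, easy for the filter to map back to a document, plus a document with a null ipath acting as parent) can be illustrated with a small Python sketch. It does not implement the execm protocol; the container is modelled as a plain list, and both function names are hypothetical.

```python
def list_ipaths(entries):
    """Return the ipath elements for one container file.

    The empty string stands for the null ipath (the parent document);
    the subdocuments are identified by their index printed in decimal,
    as printf %d would do.
    """
    return [""] + ["%d" % i for i in range(len(entries))]


def extract(entries, ipath):
    """Return the data for one ipath element.

    In this sketch the parent document is archive-like and carries no
    data of its own, so the null ipath yields None.
    """
    if ipath == "":
        return None
    return entries[int(ipath)]


if __name__ == "__main__":
    members = ["first document", "second document"]
    for ipath in list_ipaths(members):
        print(repr(ipath), "->", repr(extract(members, ipath)))
```

Decimal indices are plain ASCII, make sense in a log file, and contain no colon, so they satisfy all three guidelines; a filter for a format with named members might prefer the member paths instead, as rclzip is said to do.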