Jean-Francois Dockes 2019-03-22 12:32:00 +01:00
parent f5fd7dd158
commit 2d88b2ade6
2 changed files with 202 additions and 69 deletions


@@ -5719,14 +5719,17 @@ recollindex -c "$confdir"
cooperate to translate from the multitude of input document
formats, simple ones such as <span class=
"application">opendocument</span>, <span class=
"application">acrobat</span>), or compound ones such as
"application">acrobat</span>, or compound ones such as
<span class="application">Zip</span> or <span class=
"application">Email</span>, into the final <span class=
"application">Recoll</span> indexing input format, which is
plain text. Most input handlers are executable programs or
scripts. A few handlers are coded in C++ and live inside
<span class="command"><strong>recollindex</strong></span>.
This latter kind will not be described here.</p>
plain text (in many cases the processing pipeline has an
intermediary HTML step, which may be used for better
previewing presentation). Most input handlers are
executable programs or scripts. A few handlers are coded in
C++ and live inside <span class=
"command"><strong>recollindex</strong></span>. This latter
kind will not be described here.</p>
<p>There are currently (since version 1.13) two kinds of
external executable input handlers:</p>
<div class="itemizedlist">
@@ -5741,26 +5744,47 @@ recollindex -c "$confdir"
document to the standard output. Their output can be
plain text or HTML. HTML is usually preferred because
it can store metadata fields and it allows preserving
some of the formatting for the GUI preview.</p>
some of the formatting for the GUI preview. However,
these handlers have limitations:</p>
<div class="itemizedlist">
<ul class="itemizedlist" style=
"list-style-type: circle;">
<li class="listitem">
<p>They can only process one document per
file.</p>
</li>
<li class="listitem">
<p>The output MIME type must be known and
fixed.</p>
</li>
<li class="listitem">
<p>The character encoding, if relevant, must be
known and fixed (or possibly just depending on
location).</p>
</li>
</ul>
</div>
</li>
<li class="listitem">
<p>Multiple <code class="literal">execm</code>
handlers can process multiple files (sparing the
process startup time which can be very significant),
or multiple documents per file (e.g.: for
<span class="application">zip</span> or <span class=
"application">chm</span> files). They communicate
with the indexer through a simple protocol, but are
or multiple documents per file (e.g.: for archives or
multi-chapter publications). They communicate with
the indexer through a simple protocol, but are
nevertheless a bit more complicated than the older
kind. Most of new handlers are written in
<span class="application">Python</span>, using a
common module to handle the protocol. There is an
exception, <span class=
"command"><strong>rclimg</strong></span> which is
written in Perl. The subdocuments output by these
handlers can be directly indexable (text or HTML), or
they can be other simple or compound documents that
will need to be processed by another handler.</p>
kind. Most of the new handlers are written in
<span class="application">Python</span> (exception:
<span class="command"><strong>rclimg</strong></span>
which is written in Perl because <code class=
"literal">exiftool</code> has no real Python
equivalent). The Python handlers use common modules
to factor out the boilerplate, which can make them
very simple in favorable cases. The subdocuments
output by these handlers can be directly indexable
(text or HTML), or they can be other simple or
compound documents that will need to be processed by
another handler.</p>
</li>
</ul>
</div>
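<p>As a rough illustration of the simple handlers described above,
the following Python sketch reads one file and writes a single HTML
document, with metadata in the header, to a stream such as the
standard output. It is not taken from the Recoll sources: the
function names and the <code class="literal">author</code> field are
hypothetical, and receiving the input as a file path is an assumption
made for the example.</p>

```python
import html
import os


def render_html(text, title, author=""):
    """Wrap extracted text in a minimal HTML page carrying metadata.

    The output type (text/html) and encoding (utf-8) are fixed, which
    matches the limitations listed for this kind of handler.
    """
    head = [
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">',
        "<title>" + html.escape(title) + "</title>",
    ]
    if author:
        # Hypothetical metadata field, stored as an HTML meta tag.
        head.append('<meta name="author" content="' + html.escape(author) + '">')
    body = "<pre>" + html.escape(text) + "</pre>"
    return "<html><head>" + "".join(head) + "</head><body>" + body + "</body></html>"


def handle_file(path, out):
    """Process exactly one document per file and write it to `out`."""
    with open(path, encoding="utf-8", errors="replace") as f:
        text = f.read()
    out.write(render_html(text, title=os.path.basename(path)))
```

<p>A command-line wrapper would call something like
<code class="literal">handle_file(sys.argv[1], sys.stdout)</code>.</p>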
@@ -5786,10 +5810,13 @@ recollindex -c "$confdir"
<p>The handlers that can handle multiple documents per file
return a single piece of data to identify each document
inside the file. This piece of data, called an <code class=
"literal">ipath element</code> will be sent back by
<span class="application">Recoll</span> to extract the
document at query time, for previewing, or for creating a
temporary file to be opened by a viewer.</p>
"literal">ipath</code> will be sent back by <span class=
"application">Recoll</span> to extract the document at
query time, for previewing, or for creating a temporary
file to be opened by a viewer. These handlers can also
return metadata either as HTML <code class=
"literal">meta</code> tags, or as named data through the
communication protocol.</p>
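<p>The round trip described above can be sketched in Python for a Zip
archive, using the member path as the identifier. This only
illustrates the idea with the standard <code class=
"literal">zipfile</code> module; it does not use the Recoll helper
modules or the actual protocol, and all function names are
hypothetical.</p>

```python
import io
import zipfile


def list_ipaths(archive):
    """Indexing pass: return one identifier per document in the archive."""
    with zipfile.ZipFile(archive) as zf:
        # Directories hold no data, so only regular members are documents.
        return [info.filename for info in zf.infolist() if not info.is_dir()]


def extract(archive, ipath):
    """Query-time pass: return the bytes of the document named by ipath."""
    with zipfile.ZipFile(archive) as zf:
        return zf.read(ipath)


def build_sample_archive():
    """Build a small in-memory archive, for demonstration only."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("docs/a.txt", "alpha")
        zf.writestr("b.txt", "beta")
    return buf
```

<p>The indexer would store each identifier next to the indexed text,
and hand the same string back later to retrieve that one document for
previewing or opening.</p>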
<p>The following section describes the simple handlers, and
the next one gives a few explanations about the
<code class="literal">execm</code> ones. You could
@@ -5860,16 +5887,72 @@ recollindex -c "$confdir"
</div>
<p>If you can program and want to write an <code class=
"literal">execm</code> handler, it should not be too
difficult to make sense of one of the existing modules.
There is a sample one with many comments, not actually
used by <span class="application">Recoll</span>, which
would index a text file as one document per line. Look
for <code class="filename">rcltxtlines.py</code> in the
<code class="filename">src/filters</code> directory in
the <span class="application">Recoll</span> <a class=
"ulink" href="https://bitbucket.org/medoc/recoll/src"
target="_top">BitBucket repository</a> (the sample not in
the distributed release at the moment).</p>
difficult to make sense of one of the existing
handlers.</p>
<p>The existing handlers differ in the amount of helper
code which they are using:</p>
<div class="itemizedlist">
<ul class="itemizedlist" style=
"list-style-type: disc;">
<li class="listitem">
<p><code class="literal">rclimg</code> is written
in Perl and handles the execm protocol all by
itself (showing how trivial it is).</p>
</li>
<li class="listitem">
<p>All the Python handlers share at least the
<code class="filename">rclexecm.py</code> module,
which handles the communication. Have a look at,
for example, <code class="filename">rclzip</code>
for a handler which uses <code class=
"filename">rclexecm.py</code> directly.</p>
</li>
<li class="listitem">
<p>Most Python handlers which process
single-document files by executing another command
are further abstracted by using the <code class=
"filename">rclexec1.py</code> module. See for
example <code class="filename">rclrtf.py</code> for
a simple one, or <code class=
"filename">rcldoc.py</code> for a slightly more
complicated one (possibly executing several
commands).</p>
</li>
<li class="listitem">
<p>Handlers which extract text from an XML document
by using an XSLT style sheet are now executed
inside <span class=
"command"><strong>recollindex</strong></span>, with
only the style sheet stored in the <code class=
"filename">filters/</code> directory. These can use
a single style sheet (e.g. <code class=
"filename">abiword.xsl</code>), or two sheets for
the data and metadata (e.g. <code class=
"filename">opendoc-body.xsl</code> and <code class=
"filename">opendoc-meta.xsl</code>). The
<code class="filename">mimeconf</code>
configuration file defines how the sheets are used; it
is worth a look. Before the move to C++, the XSLT-based
handlers used a common module, <code class=
"filename">rclgenxslt.py</code>, which is still around
but unused. The handler for OpenXML presentations
is still the Python version because the format did
not fit what the C++ code does. That handler would be a
good base for another format with a similar problem.</p>
</li>
</ul>
</div>
<p>There is a sample trivial handler based on
<code class="filename">rclexecm.py</code>, with many
comments, not actually used by <span class=
"application">Recoll</span>. It would index a text file
as one document per line. Look for <code class=
"filename">rcltxtlines.py</code> in the <code class=
"filename">src/filters</code> directory in the online
<span class="application">Recoll</span> <a class="ulink"
href="https://opensourceprojects.eu/p/recoll1/" target=
"_top">Git repository</a> (the sample is not in the
distributed release at the moment).</p>
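<p>The one-document-per-line idea can be sketched without any helper
module. The code below is not <code class=
"filename">rcltxtlines.py</code>: the function names, the choice of
the line number as identifier and the skipping of blank lines are
assumptions made for the example.</p>

```python
def iter_line_documents(path):
    """Yield (ipath, text) pairs, one per non-blank line.

    The ipath is the 1-based line number as a string (an assumption).
    """
    with open(path, encoding="utf-8", errors="replace") as f:
        for number, line in enumerate(f, start=1):
            text = line.rstrip("\n")
            if text.strip():
                yield str(number), text


def get_line_document(path, ipath):
    """Return the single line identified by ipath, e.g. for a preview."""
    wanted = int(ipath)
    with open(path, encoding="utf-8", errors="replace") as f:
        for number, line in enumerate(f, start=1):
            if number == wanted:
                return line.rstrip("\n")
    raise KeyError(ipath)
```

<p>Using the line number keeps the identifier short and stable as
long as the file does not change between indexing and retrieval.</p>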
<p>You can also have a look at the slightly more complex
<span class="command"><strong>rclzip</strong></span>
which uses Zip file paths as identifiers (<code class=


@@ -4392,16 +4392,16 @@ recollindex -c "$confdir"
still used in many places though.</para></note>
<para>&RCL; input handlers cooperate to translate from the multitude
of input document formats, simple ones
as <application>opendocument</application>,
<application>acrobat</application>), or compound ones such
as <application>Zip</application>
or <application>Email</application>, into the final &RCL;
indexing input format, which is plain text.
Most input handlers are executable
programs or scripts. A few handlers are coded in C++ and live
inside <command>recollindex</command>. This latter kind will not
be described here.</para>
of input document formats, simple ones such as
<application>opendocument</application>,
<application>acrobat</application>, or compound ones such as
<application>Zip</application> or <application>Email</application>,
into the final &RCL; indexing input format, which is plain text (in
many cases the processing pipeline has an intermediary HTML step,
which may be used for better previewing presentation). Most input
handlers are executable programs or scripts. A few handlers are coded
in C++ and live inside <command>recollindex</command>. This latter
kind will not be described here.</para>
<para>There are currently (since version 1.13) two kinds of
external executable input handlers:
@@ -4414,23 +4414,32 @@ recollindex -c "$confdir"
output. Their output can be plain text or HTML. HTML is
usually preferred because it can store metadata fields and
it allows preserving some of the formatting for the GUI
preview.</para>
preview. However, these handlers have limitations:
<itemizedlist>
<listitem><para>They can only process one document
per file.</para></listitem>
<listitem><para>The output MIME type must be known and
fixed.</para></listitem>
<listitem><para>The character encoding, if relevant, must be
known and fixed (or possibly just depending on
location).</para></listitem>
</itemizedlist>
</para>
</listitem>
<listitem><para>Multiple <literal>execm</literal> handlers
can process multiple files (sparing the process startup
time which can be very significant), or multiple documents
per file (e.g.: for <application>zip</application> or
<application>chm</application> files). They communicate
with the indexer through a simple protocol, but are
nevertheless a bit more complicated than the older
kind. Most of new handlers are written in
<application>Python</application>, using a common module
to handle the protocol. There is an exception,
<command>rclimg</command> which is written in Perl. The
subdocuments output by these handlers can be directly
indexable (text or HTML), or they can be other simple or
compound documents that will need to be processed by
another handler.</para>
<listitem><para>Multiple <literal>execm</literal> handlers can
process multiple files (sparing the process startup time which can
be very significant), or multiple documents per file (e.g.: for
archives or multi-chapter publications). They communicate with the
indexer through a simple protocol, but are nevertheless a bit more
complicated than the older kind. Most of the new handlers are
written in <application>Python</application> (exception:
<command>rclimg</command> which is written in Perl because
<literal>exiftool</literal> has no real Python equivalent). The
Python handlers use common modules to factor out the boilerplate,
which can make them very simple in favorable cases. The
subdocuments output by these handlers can be directly indexable
(text or HTML), or they can be other simple or compound documents
that will need to be processed by another handler.</para>
</listitem>
</itemizedlist>
</para>
@@ -4458,10 +4467,12 @@ recollindex -c "$confdir"
<para>The handlers that can handle multiple documents per file
return a single piece of data to identify each document inside
the file. This piece of data, called
an <literal>ipath element</literal> will be sent back by
an <literal>ipath</literal> will be sent back by
&RCL; to extract the document at query time, for previewing,
or for creating a temporary file to be opened by a
viewer.</para>
viewer. These handlers can also return metadata either as HTML
<literal>meta</literal> tags, or as named data through the
communication protocol.</para>
<para>The following section describes the simple
handlers, and the next one gives a few explanations about
@@ -4514,14 +4525,53 @@ recollindex -c "$confdir"
<para>If you can program and want to write
an <literal>execm</literal> handler, it should not be too
difficult to make sense of one of the existing modules. There is
a sample one with many comments, not actually used by &RCL;,
which would index a text file as one document per line. Look for
<filename>rcltxtlines.py</filename> in the
<filename>src/filters</filename> directory in the &RCL; <ulink
url="https://bitbucket.org/medoc/recoll/src">BitBucket
repository</ulink> (the sample
not in the distributed release at the moment).</para>
difficult to make sense of one of the existing handlers.</para>
<para>The existing handlers differ in the amount of helper code
which they are using:
<itemizedlist>
<listitem><para><literal>rclimg</literal> is written in Perl and
handles the execm protocol all by itself (showing how trivial it
is).</para></listitem>
<listitem><para>All the Python handlers share at least the
<filename>rclexecm.py</filename> module, which handles the
communication. Have a look at, for example,
<filename>rclzip</filename> for a handler which uses
<filename>rclexecm.py</filename> directly.</para></listitem>
<listitem><para>Most Python handlers which process
single-document files by executing another command are further
abstracted by using the <filename>rclexec1.py</filename>
module. See for example <filename>rclrtf.py</filename> for a
simple one, or <filename>rcldoc.py</filename> for a slightly more
complicated one (possibly executing several
commands).</para></listitem>
<listitem><para>Handlers which extract text from an XML document
by using an XSLT style sheet are now executed inside
<command>recollindex</command>, with only the style sheet stored
in the <filename>filters/</filename> directory. These can
use a single style sheet (e.g. <filename>abiword.xsl</filename>),
or two sheets for the data and metadata
(e.g. <filename>opendoc-body.xsl</filename> and
<filename>opendoc-meta.xsl</filename>). The
<filename>mimeconf</filename> configuration file defines how the
sheets are used; it is worth a look. Before the move to C++, the
XSLT-based handlers used a common module,
<filename>rclgenxslt.py</filename>, which is still around but
unused. The handler for OpenXML presentations is still the Python
version because the format did not fit what the C++ code
does. That handler would be a good base for another format with a
similar problem.</para></listitem>
</itemizedlist>
</para>
<para>There is a sample trivial handler based on
<filename>rclexecm.py</filename>, with many comments, not actually
used by &RCL;. It would index a text file as one document per
line. Look for <filename>rcltxtlines.py</filename> in the
<filename>src/filters</filename> directory in the online &RCL;
<ulink url="https://opensourceprojects.eu/p/recoll1/">Git
repository</ulink> (the sample is not in the distributed release
at the moment).</para>
<para>You can also have a look at the slightly more complex
<command>rclzip</command> which uses Zip