Jean-Francois Dockes 2019-03-22 12:32:00 +01:00
parent f5fd7dd158
commit 2d88b2ade6
2 changed files with 202 additions and 69 deletions


@ -5719,14 +5719,17 @@ recollindex -c "$confdir"
cooperate to translate from the multitude of input document cooperate to translate from the multitude of input document
formats, simple ones as <span class= formats, simple ones as <span class=
"application">opendocument</span>, <span class= "application">opendocument</span>, <span class=
"application">acrobat</span>), or compound ones such as "application">acrobat</span>, or compound ones such as
<span class="application">Zip</span> or <span class= <span class="application">Zip</span> or <span class=
"application">Email</span>, into the final <span class= "application">Email</span>, into the final <span class=
"application">Recoll</span> indexing input format, which is "application">Recoll</span> indexing input format, which is
plain text. Most input handlers are executable programs or plain text (in many cases the processing pipeline has an
scripts. A few handlers are coded in C++ and live inside intermediary HTML step, which may be used for better
<span class="command"><strong>recollindex</strong></span>. previewing presentation). Most input handlers are
This latter kind will not be described here.</p> executable programs or scripts. A few handlers are coded in
C++ and live inside <span class=
"command"><strong>recollindex</strong></span>. This latter
kind will not be described here.</p>
<p>There are currently (since version 1.13) two kinds of <p>There are currently (since version 1.13) two kinds of
external executable input handlers:</p> external executable input handlers:</p>
<div class="itemizedlist"> <div class="itemizedlist">
@ -5741,26 +5744,47 @@ recollindex -c "$confdir"
document to the standard output. Their output can be document to the standard output. Their output can be
plain text or HTML. HTML is usually preferred because plain text or HTML. HTML is usually preferred because
it can store metadata fields and it allows preserving it can store metadata fields and it allows preserving
some of the formatting for the GUI preview.</p> some of the formatting for the GUI preview. However,
these handlers have limitations:</p>
<div class="itemizedlist">
<ul class="itemizedlist" style=
"list-style-type: circle;">
<li class="listitem">
<p>They can only process one document per
file.</p>
</li>
<li class="listitem">
<p>The output MIME type must be known and
fixed.</p>
</li>
<li class="listitem">
<p>The character encoding, if relevant, must be
known and fixed (or possibly just depending on
location).</p>
</li>
</ul>
</div>
</li> </li>
<li class="listitem"> <li class="listitem">
<p>Multiple <code class="literal">execm</code> <p>Multiple <code class="literal">execm</code>
handlers can process multiple files (sparing the handlers can process multiple files (sparing the
process startup time which can be very significant), process startup time which can be very significant),
or multiple documents per file (e.g.: for or multiple documents per file (e.g.: for archives or
<span class="application">zip</span> or <span class= multi-chapter publications). They communicate with
"application">chm</span> files). They communicate the indexer through a simple protocol, but are
with the indexer through a simple protocol, but are
nevertheless a bit more complicated than the older nevertheless a bit more complicated than the older
kind. Most of new handlers are written in kind. Most of the new handlers are written in
<span class="application">Python</span>, using a <span class="application">Python</span> (exception:
common module to handle the protocol. There is an <span class="command"><strong>rclimg</strong></span>
exception, <span class= which is written in Perl because <code class=
"command"><strong>rclimg</strong></span> which is "literal">exiftool</code> has no real Python
written in Perl. The subdocuments output by these equivalent). The Python handlers use common modules
handlers can be directly indexable (text or HTML), or to factor out the boilerplate, which can make them
they can be other simple or compound documents that very simple in favorable cases. The subdocuments
will need to be processed by another handler.</p> output by these handlers can be directly indexable
(text or HTML), or they can be other simple or
compound documents that will need to be processed by
another handler.</p>
</li> </li>
</ul> </ul>
</div> </div>
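<p>As an illustration of the simple kind of handler described above, here is a minimal sketch of the output side only: it wraps extracted text in an HTML page and stores metadata fields as <code class="literal">meta</code> tags. The function name <code class="literal">text_to_html</code> and the <code class="literal">title</code> field are hypothetical names chosen for this example; this is not code from an existing handler.</p>

```python
import html


def text_to_html(body, fields):
    """Wrap plain text in a minimal HTML page.

    body   -- the text extracted from the input document
    fields -- dict of metadata (hypothetical names), one meta tag each
    """
    metas = "".join(
        '<meta name="{}" content="{}">\n'.format(
            html.escape(name, quote=True), html.escape(value, quote=True))
        for name, value in fields.items())
    return (
        "<html><head>\n"
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">\n'
        + metas
        + "</head><body><pre>"
        + html.escape(body)  # escape <, > and & so the text stays literal
        + "</pre></body></html>\n")
```

<p>A handler of this kind would write the returned string to its standard output, for example <code class="literal">sys.stdout.write(text_to_html(text, {"title": "My document"}))</code>.</p>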
@ -5786,10 +5810,13 @@ recollindex -c "$confdir"
<p>The handlers that can handle multiple documents per file <p>The handlers that can handle multiple documents per file
return a single piece of data to identify each document return a single piece of data to identify each document
inside the file. This piece of data, called an <code class= inside the file. This piece of data, called an <code class=
"literal">ipath element</code> will be sent back by "literal">ipath</code> will be sent back by <span class=
<span class="application">Recoll</span> to extract the "application">Recoll</span> to extract the document at
document at query time, for previewing, or for creating a query time, for previewing, or for creating a temporary
temporary file to be opened by a viewer.</p> file to be opened by a viewer. These handlers can also
return metadata either as HTML <code class=
"literal">meta</code> tags, or as named data through the
communication protocol.</p>
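<p>The role of the <code class="literal">ipath</code> can be sketched with the Python standard <code class="literal">zipfile</code> module, using Zip member paths as identifiers. The function names below are hypothetical and the sketch leaves out the communication protocol entirely; it only shows the two operations an identifier has to support: enumerating the documents in a file, then fetching one of them again later.</p>

```python
import io
import zipfile


def list_ipaths(archive):
    """Return one identifier per document: here, the member path in the Zip."""
    with zipfile.ZipFile(archive) as zf:
        return [info.filename for info in zf.infolist() if not info.is_dir()]


def extract_by_ipath(archive, ipath):
    """Return the raw bytes of the document designated by ipath."""
    with zipfile.ZipFile(archive) as zf:
        return zf.read(ipath)


def make_demo_archive():
    """Build a small in-memory archive, for demonstration only."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("docs/a.txt", "first document")
        zf.writestr("b.txt", "second document")
    return buf
```

<p>At indexing time the handler would walk the list returned by <code class="literal">list_ipaths()</code>; at query time the stored identifier is enough to call <code class="literal">extract_by_ipath()</code> for previewing or for creating a temporary file.</p>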
<p>The following section describes the simple handlers, and <p>The following section describes the simple handlers, and
the next one gives a few explanations about the the next one gives a few explanations about the
<code class="literal">execm</code> ones. You could <code class="literal">execm</code> ones. You could
@ -5860,16 +5887,72 @@ recollindex -c "$confdir"
</div> </div>
<p>If you can program and want to write an <code class= <p>If you can program and want to write an <code class=
"literal">execm</code> handler, it should not be too "literal">execm</code> handler, it should not be too
difficult to make sense of one of the existing modules. difficult to make sense of one of the existing
There is a sample one with many comments, not actually handlers.</p>
used by <span class="application">Recoll</span>, which <p>The existing handlers differ in the amount of helper
would index a text file as one document per line. Look code which they are using:</p>
for <code class="filename">rcltxtlines.py</code> in the <div class="itemizedlist">
<code class="filename">src/filters</code> directory in <ul class="itemizedlist" style=
the <span class="application">Recoll</span> <a class= "list-style-type: disc;">
"ulink" href="https://bitbucket.org/medoc/recoll/src" <li class="listitem">
target="_top">BitBucket repository</a> (the sample not in <p><code class="literal">rclimg</code> is written
the distributed release at the moment).</p> in Perl and handles the execm protocol all by
itself (showing how trivial it is).</p>
</li>
<li class="listitem">
<p>All the Python handlers share at least the
<code class="filename">rclexecm.py</code> module,
which handles the communication. Have a look at,
for example, <code class="filename">rclzip</code>
for a handler which uses <code class=
"filename">rclexecm.py</code> directly.</p>
</li>
<li class="listitem">
<p>Most Python handlers which process
single-document files by executing another command
are further abstracted by using the <code class=
"filename">rclexec1.py</code> module. See for
example <code class="filename">rclrtf.py</code> for
a simple one, or <code class=
"filename">rcldoc.py</code> for a slightly more
complicated one (possibly executing several
commands).</p>
</li>
<li class="listitem">
<p>Handlers which extract text from an XML document
by using an XSLT style sheet are now executed
inside <span class=
"command"><strong>recollindex</strong></span>, with
only the style sheet stored in the <code class=
"filename">filters/</code> directory. These can use
a single style sheet (e.g. <code class=
"filename">abiword.xsl</code>), or two sheets for
the data and metadata (e.g. <code class=
"filename">opendoc-body.xsl</code> and <code class=
"filename">opendoc-meta.xsl</code>). The
<code class="filename">mimeconf</code>
configuration file defines how the sheets are used;
have a look at it. Before the C++ import, the xsl-based
handlers used a common module, <code class=
"filename">rclgenxslt.py</code>, which is still around
but unused. The handler for OpenXML presentations
is still the Python version because the format did
not fit with what the C++ code does. It would be a
good base for handling another similar case.</p>
</li>
</ul>
</div>
<p>There is a sample trivial handler based on
<code class="filename">rclexecm.py</code>, with many
comments, not actually used by <span class=
"application">Recoll</span>. It would index a text file
as one document per line. Look for <code class=
"filename">rcltxtlines.py</code> in the <code class=
"filename">src/filters</code> directory in the online
<span class="application">Recoll</span> <a class="ulink"
href="https://opensourceprojects.eu/p/recoll1/" target=
"_top">Git repository</a> (the sample is not in the
distributed release at the moment).</p>
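<p>The idea of indexing a text file as one document per line can be sketched as follows. This is not the code of <code class="filename">rcltxtlines.py</code> and it does not use <code class="filename">rclexecm.py</code>, whose interface is not described here. The function names, the use of the line number as identifier, and the choice to skip blank lines are all assumptions made for the example.</p>

```python
def iter_line_documents(text):
    """Yield (ipath, content) pairs, one per non-blank line.

    The ipath is the 1-based line number, as a string (an assumption
    of this sketch, not necessarily what the real sample does).
    """
    for lineno, line in enumerate(text.splitlines(), start=1):
        if line.strip():
            yield str(lineno), line


def get_line_document(text, ipath):
    """Return the single line designated by ipath."""
    lines = text.splitlines()
    index = int(ipath) - 1
    if not 0 <= index < len(lines):
        raise KeyError(ipath)
    return lines[index]
```

<p>The first function corresponds to the indexing pass over the file, the second to the later request for one specific document by its identifier.</p>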
<p>You can also have a look at the slightly more complex <p>You can also have a look at the slightly more complex
<span class="command"><strong>rclzip</strong></span> <span class="command"><strong>rclzip</strong></span>
which uses Zip file paths as identifiers (<code class= which uses Zip file paths as identifiers (<code class=


@ -4392,16 +4392,16 @@ recollindex -c "$confdir"
still used in many places though.</para></note> still used in many places though.</para></note>
<para>&RCL; input handlers cooperate to translate from the multitude <para>&RCL; input handlers cooperate to translate from the multitude
of input document formats, simple ones of input document formats, simple ones as
as <application>opendocument</application>, <application>opendocument</application>,
<application>acrobat</application>), or compound ones such <application>acrobat</application>, or compound ones such as
as <application>Zip</application> <application>Zip</application> or <application>Email</application>,
or <application>Email</application>, into the final &RCL; into the final &RCL; indexing input format, which is plain text (in
indexing input format, which is plain text. many cases the processing pipeline has an intermediary HTML step,
Most input handlers are executable which may be used for better previewing presentation). Most input
programs or scripts. A few handlers are coded in C++ and live handlers are executable programs or scripts. A few handlers are coded
inside <command>recollindex</command>. This latter kind will not in C++ and live inside <command>recollindex</command>. This latter
be described here.</para> kind will not be described here.</para>
<para>There are currently (since version 1.13) two kinds of <para>There are currently (since version 1.13) two kinds of
external executable input handlers: external executable input handlers:
@ -4414,23 +4414,32 @@ recollindex -c "$confdir"
output. Their output can be plain text or HTML. HTML is output. Their output can be plain text or HTML. HTML is
usually preferred because it can store metadata fields and usually preferred because it can store metadata fields and
it allows preserving some of the formatting for the GUI it allows preserving some of the formatting for the GUI
preview.</para> preview. However, these handlers have limitations:
<itemizedlist>
<listitem><para>They can only process one document
per file.</para></listitem>
<listitem><para>The output MIME type must be known and
fixed.</para></listitem>
<listitem><para>The character encoding, if relevant, must be
known and fixed (or possibly just depending on
location).</para></listitem>
</itemizedlist>
</para>
</listitem> </listitem>
<listitem><para>Multiple <literal>execm</literal> handlers <listitem><para>Multiple <literal>execm</literal> handlers can
can process multiple files (sparing the process startup process multiple files (sparing the process startup time which can
time which can be very significant), or multiple documents be very significant), or multiple documents per file (e.g.: for
per file (e.g.: for <application>zip</application> or archives or multi-chapter publications). They communicate with the
<application>chm</application> files). They communicate indexer through a simple protocol, but are nevertheless a bit more
with the indexer through a simple protocol, but are complicated than the older kind. Most of the new handlers are
nevertheless a bit more complicated than the older written in <application>Python</application> (exception:
kind. Most of new handlers are written in <command>rclimg</command> which is written in Perl because
<application>Python</application>, using a common module <literal>exiftool</literal> has no real Python equivalent). The
to handle the protocol. There is an exception, Python handlers use common modules to factor out the boilerplate,
<command>rclimg</command> which is written in Perl. The which can make them very simple in favorable cases. The
subdocuments output by these handlers can be directly subdocuments output by these handlers can be directly indexable
indexable (text or HTML), or they can be other simple or (text or HTML), or they can be other simple or compound documents
compound documents that will need to be processed by that will need to be processed by another handler.</para>
another handler.</para>
</listitem> </listitem>
</itemizedlist> </itemizedlist>
</para> </para>
@ -4458,10 +4467,12 @@ recollindex -c "$confdir"
<para>The handlers that can handle multiple documents per file <para>The handlers that can handle multiple documents per file
return a single piece of data to identify each document inside return a single piece of data to identify each document inside
the file. This piece of data, called the file. This piece of data, called
an <literal>ipath element</literal> will be sent back by an <literal>ipath</literal> will be sent back by
&RCL; to extract the document at query time, for previewing, &RCL; to extract the document at query time, for previewing,
or for creating a temporary file to be opened by a or for creating a temporary file to be opened by a
viewer.</para> viewer. These handlers can also return metadata either as HTML
<literal>meta</literal> tags, or as named data through the
communication protocol.</para>
<para>The following section describes the simple <para>The following section describes the simple
handlers, and the next one gives a few explanations about handlers, and the next one gives a few explanations about
@ -4514,14 +4525,53 @@ recollindex -c "$confdir"
<para>If you can program and want to write <para>If you can program and want to write
an <literal>execm</literal> handler, it should not be too an <literal>execm</literal> handler, it should not be too
difficult to make sense of one of the existing modules. There is difficult to make sense of one of the existing handlers.</para>
a sample one with many comments, not actually used by &RCL;,
which would index a text file as one document per line. Look for <para>The existing handlers differ in the amount of helper code
<filename>rcltxtlines.py</filename> in the which they are using:
<filename>src/filters</filename> directory in the &RCL; <ulink <itemizedlist>
url="https://bitbucket.org/medoc/recoll/src">BitBucket <listitem><para><literal>rclimg</literal> is written in Perl and
repository</ulink> (the sample handles the execm protocol all by itself (showing how trivial it
not in the distributed release at the moment).</para> is).</para></listitem>
<listitem><para>All the Python handlers share at least the
<filename>rclexecm.py</filename> module, which handles the
communication. Have a look at, for example,
<filename>rclzip</filename> for a handler which uses
<filename>rclexecm.py</filename> directly.</para></listitem>
<listitem><para>Most Python handlers which process
single-document files by executing another command are further
abstracted by using the <filename>rclexec1.py</filename>
module. See for example <filename>rclrtf.py</filename> for a
simple one, or <filename>rcldoc.py</filename> for a slightly more
complicated one (possibly executing several
commands).</para></listitem>
<listitem><para>Handlers which extract text from an XML document
by using an XSLT style sheet are now executed inside
<command>recollindex</command>, with only the style sheet stored
in the <filename>filters/</filename> directory. These can
use a single style sheet (e.g. <filename>abiword.xsl</filename>),
or two sheets for the data and metadata
(e.g. <filename>opendoc-body.xsl</filename> and
<filename>opendoc-meta.xsl</filename>). The
<filename>mimeconf</filename> configuration file defines how the
sheets are used; have a look at it. Before the C++ import, the
xsl-based handlers used a common module,
<filename>rclgenxslt.py</filename>, which is still around but
unused. The handler for OpenXML presentations is still the Python
version because the format did not fit with what the C++ code
does. It would be a good base for handling another similar
case.</para></listitem>
</itemizedlist>
</para>
<para>There is a sample trivial handler based on
<filename>rclexecm.py</filename>, with many comments, not actually
used by &RCL;. It would index a text file as one document per
line. Look for <filename>rcltxtlines.py</filename> in the
<filename>src/filters</filename> directory in the online &RCL;
<ulink url="https://opensourceprojects.eu/p/recoll1/">Git
repository</ulink> (the sample is not in the distributed release at
the moment).</para>
<para>You can also have a look at the slightly more complex <para>You can also have a look at the slightly more complex
<command>rclzip</command> which uses Zip <command>rclzip</command> which uses Zip