doc
parent
f5fd7dd158
commit
2d88b2ade6
@ -5719,14 +5719,17 @@ recollindex -c "$confdir"
|
|||||||
cooperate to translate from the multitude of input document
|
cooperate to translate from the multitude of input document
|
||||||
formats, simple ones as <span class=
|
formats, simple ones as <span class=
|
||||||
"application">opendocument</span>, <span class=
|
"application">opendocument</span>, <span class=
|
||||||
"application">acrobat</span>), or compound ones such as
|
"application">acrobat</span>, or compound ones such as
|
||||||
<span class="application">Zip</span> or <span class=
|
<span class="application">Zip</span> or <span class=
|
||||||
"application">Email</span>, into the final <span class=
|
"application">Email</span>, into the final <span class=
|
||||||
"application">Recoll</span> indexing input format, which is
|
"application">Recoll</span> indexing input format, which is
|
||||||
plain text. Most input handlers are executable programs or
|
plain text (in many cases the processing pipeline has an
|
||||||
scripts. A few handlers are coded in C++ and live inside
|
intermediary HTML step, which may be used for better
|
||||||
<span class="command"><strong>recollindex</strong></span>.
|
previewing presentation). Most input handlers are
|
||||||
This latter kind will not be described here.</p>
|
executable programs or scripts. A few handlers are coded in
|
||||||
|
C++ and live inside <span class=
|
||||||
|
"command"><strong>recollindex</strong></span>. This latter
|
||||||
|
kind will not be described here.</p>
|
||||||
<p>There are currently (since version 1.13) two kinds of
|
<p>There are currently (since version 1.13) two kinds of
|
||||||
external executable input handlers:</p>
|
external executable input handlers:</p>
|
||||||
<div class="itemizedlist">
|
<div class="itemizedlist">
|
||||||
@ -5741,26 +5744,47 @@ recollindex -c "$confdir"
|
|||||||
document to the standard output. Their output can be
|
document to the standard output. Their output can be
|
||||||
plain text or HTML. HTML is usually preferred because
|
plain text or HTML. HTML is usually preferred because
|
||||||
it can store metadata fields and it allows preserving
|
it can store metadata fields and it allows preserving
|
||||||
some of the formatting for the GUI preview.</p>
|
some of the formatting for the GUI preview. However,
|
||||||
|
these handlers have limitations:</p>
|
||||||
|
<div class="itemizedlist">
|
||||||
|
<ul class="itemizedlist" style=
|
||||||
|
"list-style-type: circle;">
|
||||||
|
<li class="listitem">
|
||||||
|
<p>They can only process one document per
|
||||||
|
file.</p>
|
||||||
|
</li>
|
||||||
|
<li class="listitem">
|
||||||
|
<p>The output MIME type must be known and
|
||||||
|
fixed.</p>
|
||||||
|
</li>
|
||||||
|
<li class="listitem">
|
||||||
|
<p>The character encoding, if relevant, must be
|
||||||
|
known and fixed (or possibly just depending on
|
||||||
|
location).</p>
|
||||||
|
</li>
|
||||||
|
</ul>
|
||||||
|
</div>
|
||||||
</li>
|
</li>
|
||||||
<li class="listitem">
|
<li class="listitem">
|
||||||
<p>Multiple <code class="literal">execm</code>
|
<p>Multiple <code class="literal">execm</code>
|
||||||
handlers can process multiple files (sparing the
|
handlers can process multiple files (sparing the
|
||||||
process startup time which can be very significant),
|
process startup time which can be very significant),
|
||||||
or multiple documents per file (e.g.: for
|
or multiple documents per file (e.g.: for archives or
|
||||||
<span class="application">zip</span> or <span class=
|
multi-chapter publications). They communicate with
|
||||||
"application">chm</span> files). They communicate
|
the indexer through a simple protocol, but are
|
||||||
with the indexer through a simple protocol, but are
|
|
||||||
nevertheless a bit more complicated than the older
|
nevertheless a bit more complicated than the older
|
||||||
kind. Most of new handlers are written in
|
kind. Most of the new handlers are written in
|
||||||
<span class="application">Python</span>, using a
|
<span class="application">Python</span> (exception:
|
||||||
common module to handle the protocol. There is an
|
<span class="command"><strong>rclimg</strong></span>
|
||||||
exception, <span class=
|
which is written in Perl because <code class=
|
||||||
"command"><strong>rclimg</strong></span> which is
|
"literal">exiftool</code> has no real Python
|
||||||
written in Perl. The subdocuments output by these
|
equivalent). The Python handlers use common modules
|
||||||
handlers can be directly indexable (text or HTML), or
|
to factor out the boilerplate, which can make them
|
||||||
they can be other simple or compound documents that
|
very simple in favorable cases. The subdocuments
|
||||||
will need to be processed by another handler.</p>
|
output by these handlers can be directly indexable
|
||||||
|
(text or HTML), or they can be other simple or
|
||||||
|
compound documents that will need to be processed by
|
||||||
|
another handler.</p>
|
||||||
</li>
|
</li>
|
||||||
</ul>
|
</ul>
|
||||||
</div>
|
</div>
|
||||||
@ -5786,10 +5810,13 @@ recollindex -c "$confdir"
|
|||||||
<p>The handlers that can handle multiple documents per file
|
<p>The handlers that can handle multiple documents per file
|
||||||
return a single piece of data to identify each document
|
return a single piece of data to identify each document
|
||||||
inside the file. This piece of data, called an <code class=
|
inside the file. This piece of data, called an <code class=
|
||||||
"literal">ipath element</code> will be sent back by
|
"literal">ipath</code> will be sent back by <span class=
|
||||||
<span class="application">Recoll</span> to extract the
|
"application">Recoll</span> to extract the document at
|
||||||
document at query time, for previewing, or for creating a
|
query time, for previewing, or for creating a temporary
|
||||||
temporary file to be opened by a viewer.</p>
|
file to be opened by a viewer. These handlers can also
|
||||||
|
return metadata either as HTML <code class=
|
||||||
|
"literal">meta</code> tags, or as named data through the
|
||||||
|
communication protocol.</p>
|
||||||
<p>The following section describes the simple handlers, and
|
<p>The following section describes the simple handlers, and
|
||||||
the next one gives a few explanations about the
|
the next one gives a few explanations about the
|
||||||
<code class="literal">execm</code> ones. You could
|
<code class="literal">execm</code> ones. You could
|
||||||
@ -5860,16 +5887,72 @@ recollindex -c "$confdir"
|
|||||||
</div>
|
</div>
|
||||||
<p>If you can program and want to write an <code class=
|
<p>If you can program and want to write an <code class=
|
||||||
"literal">execm</code> handler, it should not be too
|
"literal">execm</code> handler, it should not be too
|
||||||
difficult to make sense of one of the existing modules.
|
difficult to make sense of one of the existing
|
||||||
There is a sample one with many comments, not actually
|
handlers.</p>
|
||||||
used by <span class="application">Recoll</span>, which
|
<p>The existing handlers differ in the amount of helper
|
||||||
would index a text file as one document per line. Look
|
code which they are using:</p>
|
||||||
for <code class="filename">rcltxtlines.py</code> in the
|
<div class="itemizedlist">
|
||||||
<code class="filename">src/filters</code> directory in
|
<ul class="itemizedlist" style=
|
||||||
the <span class="application">Recoll</span> <a class=
|
"list-style-type: disc;">
|
||||||
"ulink" href="https://bitbucket.org/medoc/recoll/src"
|
<li class="listitem">
|
||||||
target="_top">BitBucket repository</a> (the sample not in
|
<p><code class="literal">rclimg</code> is written
|
||||||
the distributed release at the moment).</p>
|
in Perl and handles the execm protocol all by
|
||||||
|
itself (showing how trivial it is).</p>
|
||||||
|
</li>
|
||||||
|
<li class="listitem">
|
||||||
|
<p>All the Python handlers share at least the
|
||||||
|
<code class="filename">rclexecm.py</code> module,
|
||||||
|
which handles the communication. Have a look at,
|
||||||
|
for example, <code class="filename">rclzip</code>
|
||||||
|
for a handler which uses <code class=
|
||||||
|
"filename">rclexecm.py</code> directly.</p>
|
||||||
|
</li>
|
||||||
|
<li class="listitem">
|
||||||
|
<p>Most Python handlers which process
|
||||||
|
single-document files by executing another command
|
||||||
|
are further abstracted by using the <code class=
|
||||||
|
"filename">rclexec1.py</code> module. See for
|
||||||
|
example <code class="filename">rclrtf.py</code> for
|
||||||
|
a simple one, or <code class=
|
||||||
|
"filename">rcldoc.py</code> for a slightly more
|
||||||
|
complicated one (possibly executing several
|
||||||
|
commands).</p>
|
||||||
|
</li>
|
||||||
|
<li class="listitem">
|
||||||
|
<p>Handlers which extract text from an XML document
|
||||||
|
by using an XSLT style sheet are now executed
|
||||||
|
inside <span class=
|
||||||
|
"command"><strong>recollindex</strong></span>, with
|
||||||
|
only the style sheet stored in the <code class=
|
||||||
|
"filename">filters/</code> directory. These can use
|
||||||
|
a single style sheet (e.g. <code class=
|
||||||
|
"filename">abiword.xsl</code>), or two sheets for
|
||||||
|
the data and metadata (e.g. <code class=
|
||||||
|
"filename">opendoc-body.xsl</code> and <code class=
|
||||||
|
"filename">opendoc-meta.xsl</code>). The
|
||||||
|
<code class="filename">mimeconf</code>
|
||||||
|
configuration file defines how the sheets are used,
|
||||||
|
have a look. Before the C++ import, the xsl-based
|
||||||
|
handlers used a common module <code class=
|
||||||
|
"filename">rclgenxslt.py</code>, it is still around
|
||||||
|
but unused. The handler for OpenXML presentations
|
||||||
|
is still the Python version because the format did
|
||||||
|
not fit with what the C++ code does. It would be a
|
||||||
|
good base for another similar issue.</p>
|
||||||
|
</li>
|
||||||
|
</ul>
|
||||||
|
</div>
|
||||||
|
<p>There is a sample trivial handler based on
|
||||||
|
<code class="filename">rclexecm.py</code>, with many
|
||||||
|
comments, not actually used by <span class=
|
||||||
|
"application">Recoll</span>. It would index a text file
|
||||||
|
as one document per line. Look for <code class=
|
||||||
|
"filename">rcltxtlines.py</code> in the <code class=
|
||||||
|
"filename">src/filters</code> directory in the online
|
||||||
|
<span class="application">Recoll</span> <a class="ulink"
|
||||||
|
href="https://opensourceprojects.eu/p/recoll1/" target=
|
||||||
|
"_top">Git repository</a> (the sample not in the
|
||||||
|
distributed release at the moment).</p>
|
||||||
<p>You can also have a look at the slightly more complex
|
<p>You can also have a look at the slightly more complex
|
||||||
<span class="command"><strong>rclzip</strong></span>
|
<span class="command"><strong>rclzip</strong></span>
|
||||||
which uses Zip file paths as identifiers (<code class=
|
which uses Zip file paths as identifiers (<code class=
|
||||||
|
|||||||
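Note on the hunks above: the simple handlers write a single converted document to standard output, preferably as HTML so that metadata fields can travel with the text. A minimal sketch of that output step follows. It is not taken from the Recoll sources: the function names render_html and convert_file are hypothetical, and reading the input as UTF-8 is an assumption made only for this illustration.

```python
import html


def render_html(body_text, fields):
    """Wrap extracted text in minimal HTML, carrying metadata as meta tags.

    `fields` maps metadata names to values, for example {"title": "..."}.
    Both names and values are escaped so that they cannot break the markup.
    """
    metas = "".join(
        '<meta name="{}" content="{}">\n'.format(
            html.escape(name, quote=True), html.escape(value, quote=True)
        )
        for name, value in fields.items()
    )
    return (
        "<html><head>\n"
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">\n'
        + metas
        + "</head><body><pre>"
        + html.escape(body_text)
        + "</pre></body></html>\n"
    )


def convert_file(path):
    """Hypothetical helper: read one file and return its HTML rendering.

    The UTF-8 input encoding is an assumption of this sketch; the text above
    only says the encoding must be known and fixed for this kind of handler.
    """
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        text = f.read()
    return render_html(text, {"title": path})
```

A real handler of this kind would print the returned string to standard output. The one-document-per-file limitation listed above is visible here: nothing in the output can designate a second document.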
@ -4392,16 +4392,16 @@ recollindex -c "$confdir"
|
|||||||
still used in many places though.</para></note>
|
still used in many places though.</para></note>
|
||||||
|
|
||||||
<para>&RCL; input handlers cooperate to translate from the multitude
|
<para>&RCL; input handlers cooperate to translate from the multitude
|
||||||
of input document formats, simple ones
|
of input document formats, simple ones as
|
||||||
as <application>opendocument</application>,
|
<application>opendocument</application>,
|
||||||
<application>acrobat</application>), or compound ones such
|
<application>acrobat</application>, or compound ones such as
|
||||||
as <application>Zip</application>
|
<application>Zip</application> or <application>Email</application>,
|
||||||
or <application>Email</application>, into the final &RCL;
|
into the final &RCL; indexing input format, which is plain text (in
|
||||||
indexing input format, which is plain text.
|
many cases the processing pipeline has an intermediary HTML step,
|
||||||
Most input handlers are executable
|
which may be used for better previewing presentation). Most input
|
||||||
programs or scripts. A few handlers are coded in C++ and live
|
handlers are executable programs or scripts. A few handlers are coded
|
||||||
inside <command>recollindex</command>. This latter kind will not
|
in C++ and live inside <command>recollindex</command>. This latter
|
||||||
be described here.</para>
|
kind will not be described here.</para>
|
||||||
|
|
||||||
<para>There are currently (since version 1.13) two kinds of
|
<para>There are currently (since version 1.13) two kinds of
|
||||||
external executable input handlers:
|
external executable input handlers:
|
||||||
@ -4414,23 +4414,32 @@ recollindex -c "$confdir"
|
|||||||
output. Their output can be plain text or HTML. HTML is
|
output. Their output can be plain text or HTML. HTML is
|
||||||
usually preferred because it can store metadata fields and
|
usually preferred because it can store metadata fields and
|
||||||
it allows preserving some of the formatting for the GUI
|
it allows preserving some of the formatting for the GUI
|
||||||
preview.</para>
|
preview. However, these handlers have limitations:
|
||||||
|
<itemizedlist>
|
||||||
|
<listitem><para>They can only process one document
|
||||||
|
per file.</para></listitem>
|
||||||
|
<listitem><para>The output MIME type must be known and
|
||||||
|
fixed.</para></listitem>
|
||||||
|
<listitem><para>The character encoding, if relevant, must be
|
||||||
|
known and fixed (or possibly just depending on
|
||||||
|
location).</para></listitem>
|
||||||
|
</itemizedlist>
|
||||||
|
</para>
|
||||||
</listitem>
|
</listitem>
|
||||||
<listitem><para>Multiple <literal>execm</literal> handlers
|
<listitem><para>Multiple <literal>execm</literal> handlers can
|
||||||
can process multiple files (sparing the process startup
|
process multiple files (sparing the process startup time which can
|
||||||
time which can be very significant), or multiple documents
|
be very significant), or multiple documents per file (e.g.: for
|
||||||
per file (e.g.: for <application>zip</application> or
|
archives or multi-chapter publications). They communicate with the
|
||||||
<application>chm</application> files). They communicate
|
indexer through a simple protocol, but are nevertheless a bit more
|
||||||
with the indexer through a simple protocol, but are
|
complicated than the older kind. Most of the new handlers are
|
||||||
nevertheless a bit more complicated than the older
|
written in <application>Python</application> (exception:
|
||||||
kind. Most of new handlers are written in
|
<command>rclimg</command> which is written in Perl because
|
||||||
<application>Python</application>, using a common module
|
<literal>exiftool</literal> has no real Python equivalent). The
|
||||||
to handle the protocol. There is an exception,
|
Python handlers use common modules to factor out the boilerplate,
|
||||||
<command>rclimg</command> which is written in Perl. The
|
which can make them very simple in favorable cases. The
|
||||||
subdocuments output by these handlers can be directly
|
subdocuments output by these handlers can be directly indexable
|
||||||
indexable (text or HTML), or they can be other simple or
|
(text or HTML), or they can be other simple or compound documents
|
||||||
compound documents that will need to be processed by
|
that will need to be processed by another handler.</para>
|
||||||
another handler.</para>
|
|
||||||
</listitem>
|
</listitem>
|
||||||
</itemizedlist>
|
</itemizedlist>
|
||||||
</para>
|
</para>
|
||||||
@ -4458,10 +4467,12 @@ recollindex -c "$confdir"
|
|||||||
<para>The handlers that can handle multiple documents per file
|
<para>The handlers that can handle multiple documents per file
|
||||||
return a single piece of data to identify each document inside
|
return a single piece of data to identify each document inside
|
||||||
the file. This piece of data, called
|
the file. This piece of data, called
|
||||||
an <literal>ipath element</literal> will be sent back by
|
an <literal>ipath</literal> will be sent back by
|
||||||
&RCL; to extract the document at query time, for previewing,
|
&RCL; to extract the document at query time, for previewing,
|
||||||
or for creating a temporary file to be opened by a
|
or for creating a temporary file to be opened by a
|
||||||
viewer.</para>
|
viewer. These handlers can also return metadata either as HTML
|
||||||
|
<literal>meta</literal> tags, or as named data through the
|
||||||
|
communication protocol.</para>
|
||||||
|
|
||||||
<para>The following section describes the simple
|
<para>The following section describes the simple
|
||||||
handlers, and the next one gives a few explanations about
|
handlers, and the next one gives a few explanations about
|
||||||
@ -4514,14 +4525,53 @@ recollindex -c "$confdir"
|
|||||||
|
|
||||||
<para>If you can program and want to write
|
<para>If you can program and want to write
|
||||||
an <literal>execm</literal> handler, it should not be too
|
an <literal>execm</literal> handler, it should not be too
|
||||||
difficult to make sense of one of the existing modules. There is
|
difficult to make sense of one of the existing handlers.</para>
|
||||||
a sample one with many comments, not actually used by &RCL;,
|
|
||||||
which would index a text file as one document per line. Look for
|
<para>The existing handlers differ in the amount of helper code
|
||||||
<filename>rcltxtlines.py</filename> in the
|
which they are using:
|
||||||
<filename>src/filters</filename> directory in the &RCL; <ulink
|
<itemizedlist>
|
||||||
url="https://bitbucket.org/medoc/recoll/src">BitBucket
|
<listitem><para><literal>rclimg</literal> is written in Perl and
|
||||||
repository</ulink> (the sample
|
handles the execm protocol all by itself (showing how trivial it
|
||||||
not in the distributed release at the moment).</para>
|
is).</para></listitem>
|
||||||
|
<listitem><para>All the Python handlers share at least the
|
||||||
|
<filename>rclexecm.py</filename> module, which handles the
|
||||||
|
communication. Have a look at, for example,
|
||||||
|
<filename>rclzip</filename> for a handler which uses
|
||||||
|
<filename>rclexecm.py</filename> directly.</para></listitem>
|
||||||
|
<listitem><para>Most Python handlers which process
|
||||||
|
single-document files by executing another command are further
|
||||||
|
abstracted by using the <filename>rclexec1.py</filename>
|
||||||
|
module. See for example <filename>rclrtf.py</filename> for a
|
||||||
|
simple one, or <filename>rcldoc.py</filename> for a slightly more
|
||||||
|
complicated one (possibly executing several
|
||||||
|
commands).</para></listitem>
|
||||||
|
<listitem><para>Handlers which extract text from an XML document
|
||||||
|
by using an XSLT style sheet are now executed inside
|
||||||
|
<command>recollindex</command>, with only the style sheet stored
|
||||||
|
in the <filename>filters/</filename> directory. These can
|
||||||
|
use a single style sheet (e.g. <filename>abiword.xsl</filename>),
|
||||||
|
or two sheets for the data and metadata
|
||||||
|
(e.g. <filename>opendoc-body.xsl</filename> and
|
||||||
|
<filename>opendoc-meta.xsl</filename>). The
|
||||||
|
<filename>mimeconf</filename> configuration file defines how the
|
||||||
|
sheets are used, have a look. Before the C++ import, the
|
||||||
|
xsl-based handlers used a common module
|
||||||
|
<filename>rclgenxslt.py</filename>, it is still around but
|
||||||
|
unused. The handler for OpenXML presentations is still the Python
|
||||||
|
version because the format did not fit with what the C++ code
|
||||||
|
does. It would be a good base for another similar
|
||||||
|
issue.</para></listitem>
|
||||||
|
</itemizedlist>
|
||||||
|
</para>
|
||||||
|
|
||||||
|
<para>There is a sample trivial handler based on
|
||||||
|
<filename>rclexecm.py</filename>, with many comments, not actually
|
||||||
|
used by &RCL;. It would index a text file as one document per
|
||||||
|
line. Look for <filename>rcltxtlines.py</filename> in the
|
||||||
|
<filename>src/filters</filename> directory in the online &RCL;
|
||||||
|
<ulink url="https://opensourceprojects.eu/p/recoll1/">Git
|
||||||
|
repository</ulink> (the sample not in the distributed release at
|
||||||
|
the moment).</para>
|
||||||
|
|
||||||
<para>You can also have a look at the slightly more complex
|
<para>You can also have a look at the slightly more complex
|
||||||
<command>rclzip</command> which uses Zip
|
<command>rclzip</command> which uses Zip
|
||||||
|
|||||||
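Note on the hunks above: a handler for multi-document files returns an ipath with each subdocument, and Recoll later sends that ipath back to fetch the same subdocument for previewing or opening. The sketch below models only that idea, for the "one document per line" case which the text attributes to rcltxtlines.py. It does not use or reproduce the rclexecm.py interface: the class and method names are hypothetical, and using the line index as the ipath is an assumption of this sketch.

```python
class LineDocuments:
    """Hypothetical model: each non-empty line of a text file is one subdocument."""

    def __init__(self, text):
        self.lines = text.splitlines()
        self.pos = 0

    def get_next(self):
        """Return (ipath, data) for the next non-empty line, or None at the end.

        The ipath is the zero-based line index rendered as a string, so that
        it can be stored at indexing time and sent back unchanged later.
        """
        while self.pos < len(self.lines):
            index = self.pos
            self.pos += 1
            if self.lines[index].strip():
                return str(index), self.lines[index]
        return None

    def get_by_ipath(self, ipath):
        """Return the subdocument that an earlier ipath designated."""
        return self.lines[int(ipath)]
```

At indexing time the caller would loop on get_next() until it returns None, and at query time it would call get_by_ipath() with the stored value. An archive handler follows the same pattern with member paths in place of line indexes, as the text says of rclzip.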