diff --git a/src/doc/user/usermanual.html b/src/doc/user/usermanual.html
index 64e61a82..45bb1dd2 100644
--- a/src/doc/user/usermanual.html
+++ b/src/doc/user/usermanual.html
@@ -5719,14 +5719,17 @@ recollindex -c "$confdir"
cooperate to translate from the multitude of input document formats,
simple ones as opendocument,
- acrobat), or compound ones such as
+ acrobat, or compound ones such as
Zip or Email, into the final Recoll indexing input format, which is
- plain text. Most input handlers are executable programs or
- scripts. A few handlers are coded in C++ and live inside
- recollindex.
- This latter kind will not be described here.
+ plain text (in many cases the processing pipeline has an
+ intermediary HTML step, which may be used for better
+ previewing presentation). Most input handlers are
+ executable programs or scripts. A few handlers are coded in
+ C++ and live inside recollindex. This latter
+ kind will not be described here.
There are currently (since version 1.13) two kinds of external executable input handlers:
They can only process one document per
+ file.
+ The output MIME type must be known and
+ fixed.
+ The character encoding, if relevant, must be
+ known and fixed (or possibly just depending on
+ location).
+Multiple execm
handlers can process multiple files (sparing the
process startup time which can be very significant),
- or multiple documents per file (e.g.: for
- zip or chm files). They communicate
- with the indexer through a simple protocol, but are
+ or multiple documents per file (e.g.: for archives or
+ multi-chapter publications). They communicate with
+ the indexer through a simple protocol, but are
nevertheless a bit more complicated than the older
- kind. Most of new handlers are written in
- Python, using a
- common module to handle the protocol. There is an
- exception, rclimg which is
- written in Perl. The subdocuments output by these
- handlers can be directly indexable (text or HTML), or
- they can be other simple or compound documents that
- will need to be processed by another handler.
exiftool has no real Python
+ equivalent). The Python handlers use common modules
+ to factor out the boilerplate, which can make them
+ very simple in favorable cases. The subdocuments
+ output by these handlers can be directly indexable
+ (text or HTML), or they can be other simple or
+ compound documents that will need to be processed by
+ another handler.
The handlers that can handle multiple documents per file
return a single piece of data to identify each document
inside the file. This piece of data, called an ipath element, will be sent back by
- Recoll to extract the
- document at query time, for previewing, or for creating a
- temporary file to be opened by a viewer.
meta tags, or as named data through the
+ communication protocol.
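To make the ipath idea concrete, here is a minimal Python sketch. It is not taken from the Recoll sources: it assumes that Zip member paths serve as the identifiers, and the names list_ipaths, extract_by_ipath and make_sample_zip are hypothetical, chosen only for illustration.

```python
import io
import zipfile


def list_ipaths(zip_bytes):
    """Return one identifier per document inside the archive.

    Assumption: the member path plays the role of the ipath.
    Directory entries (names ending in "/") are skipped.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return [name for name in zf.namelist() if not name.endswith("/")]


def extract_by_ipath(zip_bytes, ipath):
    """Return the raw bytes of the document identified by ipath."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return zf.read(ipath)


def make_sample_zip():
    """Build a small in-memory archive so the sketch needs no files."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("docs/a.txt", "first document")
        zf.writestr("docs/b.txt", "second document")
    return buf.getvalue()


if __name__ == "__main__":
    data = make_sample_zip()
    for ipath in list_ipaths(data):
        # Indexing time: the identifier is stored with the document.
        # Query time: the same identifier is handed back to fetch it.
        print(ipath, extract_by_ipath(data, ipath))
```

The point of the sketch is the round trip: whatever string the handler returns while enumerating documents must be enough, on its own, to locate the same document again later.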
The following section describes the simple handlers, and
the next one gives a few explanations about the
execm ones. You could
@@ -5860,16 +5887,72 @@ recollindex -c "$confdir"
If you can program and want to write an execm handler, it should not be too
- difficult to make sense of one of the existing modules.
- There is a sample one with many comments, not actually
- used by Recoll, which
- would index a text file as one document per line. Look
- for rcltxtlines.py in the
- src/filters directory in
- the Recoll BitBucket repository (the sample not in
- the distributed release at the moment).
The existing handlers differ in the amount of helper
+ code which they are using:
+rclimg is written
+ in Perl and handles the execm protocol all by
+ itself (showing how trivial it is).
All the Python handlers share at least the
+ rclexecm.py module,
+ which handles the communication. Have a look at,
+ for example, rclzip
+ for a handler which uses rclexecm.py directly.
Most Python handlers which process
+ single-document files by executing another command
+ are further abstracted by using the rclexec1.py module. See for
+ example rclrtf.py for
+ a simple one, or rcldoc.py for a slightly more
+ complicated one (possibly executing several
+ commands).
Handlers which extract text from an XML document
+ by using an XSLT style sheet are now executed
+ inside recollindex, with
+ only the style sheet stored in the filters/ directory. These can use
+ a single style sheet (e.g. abiword.xsl), or two sheets for
+ the data and metadata (e.g. opendoc-body.xsl and opendoc-meta.xsl). The
+ mimeconf
+ configuration file defines how the sheets are used;
+ have a look. Before the C++ import, the xsl-based
+ handlers used a common module rclgenxslt.py, which is still around
+ but unused. The handler for OpenXML presentations
+ is still the Python version because the format did
+ not fit with what the C++ code does. It would be a
+ good base for another similar issue.
There is a sample trivial handler based on
+ rclexecm.py, with many
+ comments, not actually used by Recoll. It would index a text file
+ as one document per line. Look for rcltxtlines.py in the src/filters directory in the online
+ Recoll Git repository (the sample is not in the
+ distributed release at the moment).
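The "one document per line" idea can be sketched in a few lines of standalone Python. This is not the rcltxtlines.py code and it does not use rclexecm.py or the execm protocol; the class name LineDocuments and the choice of a 1-based line number as the identifier are assumptions made for illustration only.

```python
class LineDocuments:
    """Hypothetical helper: treat each line of a text as one document."""

    def __init__(self, text):
        # splitlines() drops the line terminators.
        self.lines = text.splitlines()

    def ipaths(self):
        """Return one identifier per line (assumed: 1-based line number)."""
        return [str(i) for i in range(1, len(self.lines) + 1)]

    def get(self, ipath):
        """Return the document (line) designated by the identifier."""
        return self.lines[int(ipath) - 1]


if __name__ == "__main__":
    docs = LineDocuments("alpha\nbeta\ngamma")
    for ipath in docs.ipaths():
        print(ipath, docs.get(ipath))
```

A real handler would additionally speak the execm protocol with the indexer; the sketch only shows the enumerate-then-fetch logic that such a handler has to provide.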
You can also have a look at the slightly more complex
rclzip
which uses Zip file paths as identifiers (