From 2d88b2ade6bf18cc58e35dbe9f7ecfecd4ab36e2 Mon Sep 17 00:00:00 2001
From: Jean-Francois Dockes
Date: Fri, 22 Mar 2019 12:32:00 +0100
Subject: [PATCH] doc
---
src/doc/user/usermanual.html | 149 +++++++++++++++++++++++++++--------
src/doc/user/usermanual.xml | 122 +++++++++++++++++++---------
2 files changed, 202 insertions(+), 69 deletions(-)
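
Reviewer note (free-form text in this position is not part of the commit): the hunks below describe rcltxtlines.py as a sample handler that indexes a text file as one document per line, with an ipath identifying each document inside the file. A minimal sketch of that idea follows. It is deliberately standalone and does not use rclexecm.py or speak the execm protocol; the class name LineDocs, its methods, and the choice of the line number as ipath are assumptions made for illustration only.

```python
import os
import tempfile


class LineDocs:
    """Hypothetical helper (not the rclexecm.py API): one document per line."""

    def __init__(self, path):
        # Read the whole file once; each line becomes one subdocument.
        with open(path, "r", encoding="utf-8") as f:
            self.lines = f.read().splitlines()

    def documents(self):
        """Yield (ipath, text) pairs; the ipath is the 1-based line number as a string."""
        for number, text in enumerate(self.lines, start=1):
            yield str(number), text

    def extract(self, ipath):
        """Return the single document identified by an ipath, as needed at query time."""
        return self.lines[int(ipath) - 1]


def demo():
    """Build a small temporary file, enumerate its documents, then fetch one by ipath."""
    with tempfile.NamedTemporaryFile(
        "w", suffix=".txt", delete=False, encoding="utf-8"
    ) as tmp:
        tmp.write("first line\nsecond line\nthird line\n")
        path = tmp.name
    try:
        docs = LineDocs(path)
        return list(docs.documents()), docs.extract("2")
    finally:
        os.unlink(path)
```

The point of the sketch is only the round trip: the identifier handed out while enumerating the file is enough, on its own, to retrieve the same document later for previewing or opening.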
diff --git a/src/doc/user/usermanual.html b/src/doc/user/usermanual.html
index 64e61a82..45bb1dd2 100644
--- a/src/doc/user/usermanual.html
+++ b/src/doc/user/usermanual.html
@@ -5719,14 +5719,17 @@ recollindex -c "$confdir"
cooperate to translate from the multitude of input document
formats, simple ones as opendocument, acrobat), or compound ones such as
+ acrobat, or compound ones such as
Zip or Email, into the final Recoll indexing input format, which is
- plain text. Most input handlers are executable programs or
- scripts. A few handlers are coded in C++ and live inside
- recollindex.
- This latter kind will not be described here.
+ plain text (in many cases the processing pipeline has an
+ intermediary HTML step, which may be used for better
+ previewing presentation). Most input handlers are
+ executable programs or scripts. A few handlers are coded in
+ C++ and live inside recollindex. This latter
+ kind will not be described here.
There are currently (since version 1.13) two kinds of
external executable input handlers:
@@ -5741,26 +5744,47 @@ recollindex -c "$confdir"
document to the standard output. Their output can be
plain text or HTML. HTML is usually preferred because
it can store metadata fields and it allows preserving
- some of the formatting for the GUI preview.
+ some of the formatting for the GUI preview. However,
+ these handlers have limitations:
+
+ - They can only process one document per
+ file.
+ - The output MIME type must be known and
+ fixed.
+ - The character encoding, if relevant, must be
+ known and fixed (or possibly depend only on
+ location).
+
Multiple execm
handlers can process multiple files (sparing the
process startup time which can be very significant),
- or multiple documents per file (e.g.: for
- zip or chm files). They communicate
- with the indexer through a simple protocol, but are
+ or multiple documents per file (e.g.: for archives or
+ multi-chapter publications). They communicate with
+ the indexer through a simple protocol, but are
nevertheless a bit more complicated than the older
- kind. Most of new handlers are written in
- Python, using a
- common module to handle the protocol. There is an
- exception, rclimg which is
- written in Perl. The subdocuments output by these
- handlers can be directly indexable (text or HTML), or
- they can be other simple or compound documents that
- will need to be processed by another handler.
+ kind. Most of the new handlers are written in
+ Python (exception:
+ rclimg
+ which is written in Perl because exiftool has no real Python
+ equivalent). The Python handlers use common modules
+ to factor out the boilerplate, which can make them
+ very simple in favorable cases. The subdocuments
+ output by these handlers can be directly indexable
+ (text or HTML), or they can be other simple or
+ compound documents that will need to be processed by
+ another handler.
@@ -5786,10 +5810,13 @@ recollindex -c "$confdir"
The handlers that can handle multiple documents per file
return a single piece of data to identify each document
inside the file. This piece of data, called an ipath element will be sent back by
- Recoll to extract the
- document at query time, for previewing, or for creating a
- temporary file to be opened by a viewer.
+ ipath will be sent back by Recoll to extract the document at
+ query time, for previewing, or for creating a temporary
+ file to be opened by a viewer. These handlers can also
+ return metadata either as HTML meta tags, or as named data through the
+ communication protocol.
The following section describes the simple handlers, and
the next one gives a few explanations about the
execm ones. You could
@@ -5860,16 +5887,72 @@ recollindex -c "$confdir"
If you can program and want to write an execm handler, it should not be too
- difficult to make sense of one of the existing modules.
- There is a sample one with many comments, not actually
- used by Recoll, which
- would index a text file as one document per line. Look
- for rcltxtlines.py in the
- src/filters directory in
- the Recoll BitBucket repository (the sample not in
- the distributed release at the moment).
+ difficult to make sense of one of the existing
+ handlers.
+ The existing handlers differ in the amount of helper
+ code which they are using:
+
+ - rclimg is written
+ in Perl and handles the execm protocol all by
+ itself (showing how trivial it is).
+ - All the Python handlers share at least the
+ rclexecm.py module,
+ which handles the communication. Have a look at,
+ for example, rclzip
+ for a handler which uses rclexecm.py directly.
+ - Most Python handlers which process
+ single-document files by executing another command
+ are further abstracted by using the rclexec1.py module. See for
+ example rclrtf.py for
+ a simple one, or rcldoc.py for a slightly more
+ complicated one (possibly executing several
+ commands).
+ - Handlers which extract text from an XML document
+ by using an XSLT style sheet are now executed
+ inside recollindex, with
+ only the style sheet stored in the filters/ directory. These can use
+ a single style sheet (e.g. abiword.xsl), or two sheets for
+ the data and metadata (e.g. opendoc-body.xsl and opendoc-meta.xsl). The
+ mimeconf
+ configuration file defines how the sheets are used;
+ have a look at it. Before the move to C++, the xsl-based
+ handlers used a common module, rclgenxslt.py, which is still around
+ but unused. The handler for OpenXML presentations
+ is still the Python version because the format did
+ not fit with what the C++ code does. It would be a
+ good base for handling another similar case.
+
+ There is a sample trivial handler based on
+ rclexecm.py, with many
+ comments, not actually used by Recoll. It would index a text file
+ as one document per line. Look for rcltxtlines.py in the src/filters directory in the online
+ Recoll Git repository (the sample is not in the
+ distributed release at the moment).
You can also have a look at the slightly more complex
rclzip
which uses Zip file paths as identifiers (
&RCL; input handlers cooperate to translate from the multitude
- of input document formats, simple ones
- as opendocument,
- acrobat), or compound ones such
- as Zip
- or Email, into the final &RCL;
- indexing input format, which is plain text.
- Most input handlers are executable
- programs or scripts. A few handlers are coded in C++ and live
- inside recollindex. This latter kind will not
- be described here.
+ of input document formats, simple ones such as
+ opendocument,
+ acrobat, or compound ones such as
+ Zip or Email,
+ into the final &RCL; indexing input format, which is plain text (in
+ many cases the processing pipeline has an intermediary HTML step,
+ which may be used for better previewing presentation). Most input
+ handlers are executable programs or scripts. A few handlers are coded
+ in C++ and live inside recollindex. This latter
+ kind will not be described here.
There are currently (since version 1.13) two kinds of
external executable input handlers:
@@ -4414,23 +4414,32 @@ recollindex -c "$confdir"
output. Their output can be plain text or HTML. HTML is
usually preferred because it can store metadata fields and
it allows preserving some of the formatting for the GUI
- preview.
+ preview. However, these handlers have limitations:
+
+ They can only process one document
+ per file.
+ The output MIME type must be known and
+ fixed.
+ The character encoding, if relevant, must be
+ known and fixed (or possibly depend only on
+ location).
+
+
- Multiple execm handlers
- can process multiple files (sparing the process startup
- time which can be very significant), or multiple documents
- per file (e.g.: for zip or
- chm files). They communicate
- with the indexer through a simple protocol, but are
- nevertheless a bit more complicated than the older
- kind. Most of new handlers are written in
- Python, using a common module
- to handle the protocol. There is an exception,
- rclimg which is written in Perl. The
- subdocuments output by these handlers can be directly
- indexable (text or HTML), or they can be other simple or
- compound documents that will need to be processed by
- another handler.
+ Multiple execm handlers can
+ process multiple files (sparing the process startup time which can
+ be very significant), or multiple documents per file (e.g.: for
+ archives or multi-chapter publications). They communicate with the
+ indexer through a simple protocol, but are nevertheless a bit more
+ complicated than the older kind. Most of the new handlers are
+ written in Python (exception:
+ rclimg which is written in Perl because
+ exiftool has no real Python equivalent). The
+ Python handlers use common modules to factor out the boilerplate,
+ which can make them very simple in favorable cases. The
+ subdocuments output by these handlers can be directly indexable
+ (text or HTML), or they can be other simple or compound documents
+ that will need to be processed by another handler.
@@ -4458,10 +4467,12 @@ recollindex -c "$confdir"
The handlers that can handle multiple documents per file
return a single piece of data to identify each document inside
the file. This piece of data, called
- an ipath element will be sent back by
+ an ipath will be sent back by
&RCL; to extract the document at query time, for previewing,
or for creating a temporary file to be opened by a
- viewer.
+ viewer. These handlers can also return metadata either as HTML
+ meta tags, or as named data through the
+ communication protocol.
The following section describes the simple
handlers, and the next one gives a few explanations about
@@ -4514,14 +4525,53 @@ recollindex -c "$confdir"
If you can program and want to write
an execm handler, it should not be too
- difficult to make sense of one of the existing modules. There is
- a sample one with many comments, not actually used by &RCL;,
- which would index a text file as one document per line. Look for
- rcltxtlines.py in the
- src/filters directory in the &RCL; BitBucket
- repository (the sample
- not in the distributed release at the moment).
+ difficult to make sense of one of the existing handlers.
+
+ The existing handlers differ in the amount of helper code
+ which they are using:
+
+ rclimg is written in Perl and
+ handles the execm protocol all by itself (showing how trivial it
+ is).
+ All the Python handlers share at least the
+ rclexecm.py module, which handles the
+ communication. Have a look at, for example,
+ rclzip for a handler which uses
+ rclexecm.py directly.
+ Most Python handlers which process
+ single-document files by executing another command are further
+ abstracted by using the rclexec1.py
+ module. See for example rclrtf.py for a
+ simple one, or rcldoc.py for a slightly more
+ complicated one (possibly executing several
+ commands).
+ Handlers which extract text from an XML document
+ by using an XSLT style sheet are now executed inside
+ recollindex, with only the style sheet stored
+ in the filters/ directory. These can
+ use a single style sheet (e.g. abiword.xsl),
+ or two sheets for the data and metadata
+ (e.g. opendoc-body.xsl and
+ opendoc-meta.xsl). The
+ mimeconf configuration file defines how the
+ sheets are used; have a look at it. Before the move to C++, the
+ xsl-based handlers used a common module,
+ rclgenxslt.py, which is still around but
+ unused. The handler for OpenXML presentations is still the Python
+ version because the format did not fit with what the C++ code
+ does. It would be a good base for handling another similar
+ case.
+
+
+
+ There is a sample trivial handler based on
+ rclexecm.py, with many comments, not actually
+ used by &RCL;. It would index a text file as one document per
+ line. Look for rcltxtlines.py in the
+ src/filters directory in the online &RCL;
+ Git
+ repository (the sample is not in the distributed release at
+ the moment).
You can also have a look at the slightly more complex
rclzip which uses Zip