From 2d88b2ade6bf18cc58e35dbe9f7ecfecd4ab36e2 Mon Sep 17 00:00:00 2001
From: Jean-Francois Dockes
Date: Fri, 22 Mar 2019 12:32:00 +0100
Subject: [PATCH] doc

---
 src/doc/user/usermanual.html | 149 +++++++++++++++++++++++++++--------
 src/doc/user/usermanual.xml | 122 +++++++++++++++++++---------
 2 files changed, 202 insertions(+), 69 deletions(-)

diff --git a/src/doc/user/usermanual.html b/src/doc/user/usermanual.html
index 64e61a82..45bb1dd2 100644
--- a/src/doc/user/usermanual.html
+++ b/src/doc/user/usermanual.html
@@ -5719,14 +5719,17 @@ recollindex -c "$confdir"
 cooperate to translate from the multitude of input document formats,
 simple ones as opendocument, acrobat), or compound ones such as
+ acrobat, or compound ones such as
 Zip or Email, into the final Recoll indexing input format, which is
- plain text. Most input handlers are executable programs or
- scripts. A few handlers are coded in C++ and live inside
- recollindex.
- This latter kind will not be described here.

+ plain text (in many cases the processing pipeline has an
+ intermediary HTML step, which may be used for better
+ previewing presentation). Most input handlers are
+ executable programs or scripts. A few handlers are coded in
+ C++ and live inside recollindex. This latter
+ kind will not be described here.

There are currently (since version 1.13) two kinds of external executable input handlers:

@@ -5741,26 +5744,47 @@ recollindex -c "$confdir" document to the standard output. Their output can be plain text or HTML. HTML is usually preferred because it can store metadata fields and it allows preserving - some of the formatting for the GUI preview.

+ some of the formatting for the GUI preview. However, + these handlers have limitations:

+
+  • They can only process one document per file.
+  • The output MIME type must be known and fixed.
+  • The character encoding, if relevant, must be known and fixed (or possibly just depending on location).
+
  • Multiple execm handlers can process multiple files (sparing the process startup time which can be very significant), - or multiple documents per file (e.g.: for - zip or chm files). They communicate - with the indexer through a simple protocol, but are + or multiple documents per file (e.g.: for archives or + multi-chapter publications). They communicate with + the indexer through a simple protocol, but are nevertheless a bit more complicated than the older - kind. Most of new handlers are written in - Python, using a - common module to handle the protocol. There is an - exception, rclimg which is - written in Perl. The subdocuments output by these - handlers can be directly indexable (text or HTML), or - they can be other simple or compound documents that - will need to be processed by another handler.

    + kind. Most of the new handlers are written in + Python (exception: + rclimg + which is written in Perl because exiftool has no real Python + equivalent). The Python handlers use common modules + to factor out the boilerplate, which can make them + very simple in favorable cases. The subdocuments + output by these handlers can be directly indexable + (text or HTML), or they can be other simple or + compound documents that will need to be processed by + another handler.

  • @@ -5786,10 +5810,13 @@ recollindex -c "$confdir"

    The handlers that can handle multiple documents per file return a single piece of data to identify each document inside the file. This piece of data, called an ipath element will be sent back by - Recoll to extract the - document at query time, for previewing, or for creating a - temporary file to be opened by a viewer.

    + "literal">ipath will be sent back by Recoll to extract the document at + query time, for previewing, or for creating a temporary + file to be opened by a viewer. These handlers can also + return metadata either as HTML meta tags, or as named data through the + communication protocol.

    The following section describes the simple handlers, and the next one gives a few explanations about the execm ones. You could @@ -5860,16 +5887,72 @@ recollindex -c "$confdir"

    If you can program and want to write an execm handler, it should not be too - difficult to make sense of one of the existing modules. - There is a sample one with many comments, not actually - used by Recoll, which - would index a text file as one document per line. Look - for rcltxtlines.py in the - src/filters directory in - the Recoll BitBucket repository (the sample not in - the distributed release at the moment).

    + difficult to make sense of one of the existing + handlers.

    +

    The existing handlers differ in the amount of helper + code which they are using:

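+
+  • rclimg is written in Perl and handles the execm protocol all by itself (showing how trivial it is).
+  • All the Python handlers share at least the rclexecm.py module, which handles the communication. Have a look at, for example, rclzip for a handler which uses rclexecm.py directly.
+  • Most Python handlers which process single-document files by executing another command are further abstracted by using the rclexec1.py module. See for example rclrtf.py for a simple one, or rcldoc.py for a slightly more complicated one (possibly executing several commands).
+  • Handlers which extract text from an XML document by using an XSLT style sheet are now executed inside recollindex, with only the style sheet stored in the filters/ directory. These can use a single style sheet (e.g. abiword.xsl), or two sheets for the data and metadata (e.g. opendoc-body.xsl and opendoc-meta.xsl). The mimeconf configuration file defines how the sheets are used, have a look. Before the C++ import, the xsl-based handlers used a common module rclgenxslt.py, it is still around but unused. The handler for OpenXML presentations is still the Python version because the format did not fit with what the C++ code does. It would be a good base for another similar issue.
+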

    There is a sample trivial handler based on + rclexecm.py, with many + comments, not actually used by Recoll. It would index a text file + as one document per line. Look for rcltxtlines.py in the src/filters directory in the online + Recoll Git repository (the sample not in the + distributed release at the moment).
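The idea of indexing a text file as one document per line can be sketched as follows. This is a standalone illustration under stated assumptions, not a copy of rcltxtlines.py, and it deliberately does not use the rclexecm.py interface: the class LineDocs, its methods get_next and get_ipath, and the choice of a 1-based line number as the identifier are all hypothetical.

```python
import os
import tempfile


class LineDocs:
    """Treat each line of a text file as a separate document (hypothetical helper)."""

    def __init__(self, path):
        with open(path, "r", encoding="utf-8") as f:
            self.lines = f.read().splitlines()
        self.current = 0

    def get_ipath(self, ipath):
        """Return the document identified by ipath (a 1-based line number as a string)."""
        index = int(ipath) - 1
        if index < 0 or index >= len(self.lines):
            raise KeyError(ipath)
        return self.lines[index]

    def get_next(self):
        """Return (ipath, text) for the next line, or None when the file is exhausted."""
        if self.current >= len(self.lines):
            return None
        self.current += 1
        return str(self.current), self.lines[self.current - 1]


def demo():
    """Write a three-line file, walk it, then fetch one line by its identifier."""
    with tempfile.NamedTemporaryFile(
        "w", suffix=".txt", delete=False, encoding="utf-8"
    ) as f:
        f.write("alpha\nbeta\ngamma\n")
        path = f.name
    try:
        docs = LineDocs(path)
        seen = []
        while True:
            item = docs.get_next()
            if item is None:
                break
            seen.append(item)
        return seen, docs.get_ipath("2")
    finally:
        os.unlink(path)


if __name__ == "__main__":
    print(demo())
```

A real handler would additionally speak the execm protocol with the indexer; the sketch only shows the two operations that any multi-document handler needs, sequential enumeration and direct access by identifier.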

    You can also have a look at the slightly more complex rclzip which uses Zip file paths as identifiers ( &RCL; input handlers cooperate to translate from the multitude - of input document formats, simple ones - as opendocument, - acrobat), or compound ones such - as Zip - or Email, into the final &RCL; - indexing input format, which is plain text. - Most input handlers are executable - programs or scripts. A few handlers are coded in C++ and live - inside recollindex. This latter kind will not - be described here. + of input document formats, simple ones as + opendocument, + acrobat, or compound ones such as + Zip or Email, + into the final &RCL; indexing input format, which is plain text (in + many cases the processing pipeline has an intermediary HTML step, + which may be used for better previewing presentation). Most input + handlers are executable programs or scripts. A few handlers are coded + in C++ and live inside recollindex. This latter + kind will not be described here. There are currently (since version 1.13) two kinds of external executable input handlers: @@ -4414,23 +4414,32 @@ recollindex -c "$confdir" output. Their output can be plain text or HTML. HTML is usually preferred because it can store metadata fields and it allows preserving some of the formatting for the GUI - preview. + preview. However, these handlers have limitations: + + They can only process one document + per file. + The output MIME type must be known and + fixed. + The character encoding, if relevant, must be + known and fixed (or possibly just depending on + location). + + - Multiple execm handlers - can process multiple files (sparing the process startup - time which can be very significant), or multiple documents - per file (e.g.: for zip or - chm files). They communicate - with the indexer through a simple protocol, but are - nevertheless a bit more complicated than the older - kind. 
Most of new handlers are written in - Python, using a common module - to handle the protocol. There is an exception, - rclimg which is written in Perl. The - subdocuments output by these handlers can be directly - indexable (text or HTML), or they can be other simple or - compound documents that will need to be processed by - another handler. + Multiple execm handlers can + process multiple files (sparing the process startup time which can + be very significant), or multiple documents per file (e.g.: for + archives or multi-chapter publications). They communicate with the + indexer through a simple protocol, but are nevertheless a bit more + complicated than the older kind. Most of the new handlers are + written in Python (exception: + rclimg which is written in Perl because + exiftool has no real Python equivalent). The + Python handlers use common modules to factor out the boilerplate, + which can make them very simple in favorable cases. The + subdocuments output by these handlers can be directly indexable + (text or HTML), or they can be other simple or compound documents + that will need to be processed by another handler. @@ -4458,10 +4467,12 @@ recollindex -c "$confdir" The handlers that can handle multiple documents per file return a single piece of data to identify each document inside the file. This piece of data, called - an ipath element will be sent back by + an ipath will be sent back by &RCL; to extract the document at query time, for previewing, or for creating a temporary file to be opened by a - viewer. + viewer. These handlers can also return metadata either as HTML + meta tags, or as named data through the + communication protocol. The following section describes the simple handlers, and the next one gives a few explanations about @@ -4514,14 +4525,53 @@ recollindex -c "$confdir" If you can program and want to write an execm handler, it should not be too - difficult to make sense of one of the existing modules. 
There is - a sample one with many comments, not actually used by &RCL;, - which would index a text file as one document per line. Look for - rcltxtlines.py in the - src/filters directory in the &RCL; BitBucket - repository (the sample - not in the distributed release at the moment). + difficult to make sense of one of the existing handlers. + + The existing handlers differ in the amount of helper code + which they are using: + + rclimg is written in Perl and + handles the execm protocol all by itself (showing how trivial it + is). + All the Python handlers share at least the + rclexecm.py module, which handles the + communication. Have a look at, for example, + rclzip for a handler which uses + rclexecm.py directly. + Most Python handlers which process + single-document files by executing another command are further + abstracted by using the rclexec1.py + module. See for example rclrtf.py for a + simple one, or rcldoc.py for a slightly more + complicated one (possibly executing several + commands). + Handlers which extract text from an XML document + by using an XSLT style sheet are now executed inside + recollindex, with only the style sheet stored + in the filters/ directory. These can + use a single style sheet (e.g. abiword.xsl), + or two sheets for the data and metadata + (e.g. opendoc-body.xsl and + opendoc-meta.xsl). The + mimeconf configuration file defines how the + sheets are used, have a look. Before the C++ import, the + xsl-based handlers used a common module + rclgenxslt.py, it is still around but + unused. The handler for OpenXML presentations is still the Python + version because the format did not fit with what the C++ code + does. It would be a good base for another similar + issue. + + + + There is a sample trivial handler based on + rclexecm.py, with many comments, not actually + used by &RCL;. It would index a text file as one document per + line. 
Look for rcltxtlines.py in the + src/filters directory in the online &RCL; + Git + repository (the sample not in the distributed release at + the moment). You can also have a look at the slightly more complex rclzip which uses Zip