From 39c2809b6a01285ae204bc64e686e933c84a3080 Mon Sep 17 00:00:00 2001 From: Jean-Francois Dockes Date: Fri, 2 Nov 2012 17:30:07 +0100 Subject: [PATCH] doc --- src/doc/user/usermanual.sgml | 180 +++++++++++++++++++++++++++-------- 1 file changed, 138 insertions(+), 42 deletions(-) diff --git a/src/doc/user/usermanual.sgml b/src/doc/user/usermanual.sgml index dc5f5cd1..6e43fc6b 100644 --- a/src/doc/user/usermanual.sgml +++ b/src/doc/user/usermanual.sgml @@ -3037,55 +3037,106 @@ dir:recoll dir:src -dir:utils -dir:common - - Programming interface + + Programming interface - &RCL; has an Application Programming Interface, usable both - for indexing and searching, currently accessible from the - Python language. + &RCL; has an Application Programming Interface, usable both + for indexing and searching, currently accessible from the + Python language. - Another less radical way to extend the application is to - write filters for new types of documents. + Another less radical way to extend the application is to + write filters for new types of documents. - The processing of metadata attributes for documents - (fields) is highly configurable. + The processing of metadata attributes for documents + (fields) is highly configurable. - + + + Writing a document filter - &RCL; filters are executable programs which - translate from a specific format (ie: - openoffice, - acrobat, etc.) to the &RCL; - indexing input format, which may be - text/plain or - text/html. + &RCL; filters cooperate to translate from the multitude + of input document formats, simple ones + (such as opendocument or + acrobat), or compound ones such + as Zip + or Email, into the final &RCL; + indexing input format, which may + be text/plain + or text/html. Most filters are executable + programs or scripts. A few filters are coded in C++ and live + inside recollindex. This latter kind will not + be described here. 
- As of &RCL; 1.13, there are two kinds of filters: - - Simple filters (the old ones) run once and - exit. They can be bare programs like - antiword, or shell-scripts using other - programs. They are very simple to write, because they just need - to output the converted to the standard output. - - Multiple filters, new in 1.13, run as long as - their master process (ie: recollindex) is active. They can - process multiple files (sparing the process startup time which - can be very significant), or multiple documents per file (ie: for - zip or chm files). They communicate with the indexer through a - simple protocol, but are nevertheless a bit more complicated than - the older kind. Most of these new filters are written in - Python, using a common module to - handle the protocol. - - - The following will just describe the simple filters. If you can - program and want to write one of the other kind, it shouldn't be too - difficult to make sense of one of the existing modules. For example, - look at rclzip which uses Zip file paths as - internal identifiers (ipath), and - rclinfo, which uses an integer index. + There are currently (as of 1.18, and since 1.13) two kinds of + external executable filters: + + Simple filters (exec + filters) run once and + exit. They can be bare programs + like antiword, or scripts + using other programs. They are very simple to write, + because they just need to print the converted document + to the standard output. Their output can + be text/plain + or text/html. + + Multiple filters (execm + filters) run as long as + their master process (recollindex) is + active. They can process multiple files (sparing the + process startup time which can be very significant), + or multiple documents per file (e.g. for zip or chm + files). They communicate with the indexer through a + simple protocol, but are nevertheless a bit more + complicated than the older kind. 
Most of the new + filters are written + in Python, using a common + module to handle the protocol. There is an + exception, rclimg, which is written + in Perl. The subdocuments output by these filters can + be directly indexable (text or HTML), or they can be + other simple or compound documents that will need to + be processed by another filter. + + + + + In both cases, filters deal with regular file system + files, and can process either a single document, or a + linear list of documents in each file. &RCL; is responsible + for performing up-to-date checks, dealing with more complex + embedding and other upper-level issues. + + In the extreme case of a simple filter returning a + document in text/plain format, no + metadata can be transferred from the filter to the + indexer. Generic metadata, like document size or + modification date, will be gathered and stored by the + indexer. + + Filters that produce text/html + format can return an arbitrary amount of metadata inside HTML + meta tags. These will be processed + according to the directives found in + the + fields configuration + file. + + The filters that can handle multiple documents per file + return a single piece of data to identify each document inside + the file. This piece of data, called + an ipath element, will be sent back by + &RCL; to extract the document at query time, for previewing, + or for creating a temporary file to be opened by a + viewer. + + The following section describes the simple + filters, and the next one gives a few explanations about + the execm ones. You could conceivably + write a simple filter with only the elements in the + manual. This will not be the case for the other ones, for + which you will have to look at the code. Simple filters @@ -3126,6 +3177,51 @@ dir:recoll dir:src -dir:utils -dir:common + + "Multiple" filters + + If you can program and want to write + an execm filter, it should not be too + difficult to make sense of one of the existing modules. 
For + example, look at rclzip, which uses Zip + file paths as identifiers (ipath), + and rclics, which uses an integer + index. Also have a look at the comments inside + the internfile/mh_execm.h file and + possibly at the corresponding module. + + execm filters sometimes need to make + a choice about the nature of the ipath + elements that they use in communication with the + indexer. Here are a few guidelines: + + Use ASCII or UTF-8 (if the identifier is an + integer, print it, for example, as printf %d would + do). + If at all possible, the data should make some + kind of sense when printed to a log file to help with + debugging. + &RCL; uses a colon (:) as a + separator to store a complex path internally (for + deeper embedding). Colons inside + the ipath elements output by a + filter will be escaped, but they would be a bad choice as a + filter-specific separator (mostly, again, for + debugging reasons). + + In any case, the main goal is that it should + be easy for the filter to extract the target document, given + the file name and the ipath + element. + + execm filters will also produce + a document with a null ipath + element. Depending on the type of document, this may have + some associated data (e.g. the body of an email message), or + none (typical for an archive file). If it is empty, this + document will be useful anyway for some operations, as the + parent of the actual data documents. + Telling &RCL; about the filter