diff --git a/src/doc/user/usermanual.sgml b/src/doc/user/usermanual.sgml index 36b0548a..c11e625e 100644 --- a/src/doc/user/usermanual.sgml +++ b/src/doc/user/usermanual.sgml @@ -2324,32 +2324,75 @@ text/html [file:///Users/uncrypted-dockes/projets/bateaux/ilur/factEtCie/r handle the protocol. - The following will just describe the simple filters, if you are - programmer enough to write one of the other kind, it shouldn't be too - difficult to make sense of one of the existing modules (ie: - rclzip). + The following will just describe the simple filters. If you can + program and want to write one of the other kinds, it shouldn't be too + difficult to make sense of one of the existing modules. For example, + look at rclzip, which uses Zip file paths as + internal identifiers (ipath), and + rclinfo, which uses an integer index. + + + Simple filters &RCL; simple filters are usually shell-scripts, but this is in - no way necessary. These programs are extremely simple and most - of the difficulty lies in extracting the text from the native - format, not outputting what is expected by &RCL;. Happily - enough, most document formats already have translators or text - extractors which handle the difficult part and can be called - from the filter. In some case the output of the translating - program is appropriate, and no intermediate shell-script is - needed. + no way necessary. Extracting the text from the native format is the + difficult part. Outputting the format expected by &RCL; is + trivial. Happily enough, most document formats have translators or + text extractors which can be called from the filter. In some cases + the output of the translating program is completely appropriate, + and no intermediate shell-script is needed. Filters are called with a single argument which is the source file name. They should output the result to stdout. 
- The RECOLL_FILTER_FORPREVIEW - environment variable (values yes, - no) tells the filter if the operation is - for indexing or previewing. Some filters use this to output a - slightly different format. This is not essential. + When writing a filter, you should decide if it will output + plain text or html. Plain text is simpler, but you will not be able + to add metadata or vary the output character encoding (this will be + defined in a configuration file). Additionally, some formatting may be + easier to preserve when previewing html. Actually the deciding factor + is metadata: &RCL; has a way to + extract metadata from the html header and use it for field + searches. + + The RECOLL_FILTER_FORPREVIEW environment + variable (values yes, no) + tells the filter if the operation is for indexing or + previewing. Some filters use this to output a slightly different + format, for example stripping uninteresting repeated keywords (e.g. + Subject: for email) when indexing. This is not + essential. + + You should look at one of the simple filters, for example + rclps, as a starting point. + + Don't forget to make your filter executable before + testing! + + + + + Telling &RCL; about the filter + + There are two elements that link a file to the filter which + should process it: the association of a file with a mime type and the + association of a mime type with a filter. + + The association of files to mime types is mostly based on + name suffixes. The types are defined inside the + + mimemap file. Example: + + +.doc = application/msword + + If no suffix association is found for the file name, &RCL; will try + to execute the file -i command to determine a + mime type. The association of file types to filters is performed in - the mimeconf file. A sample: + the + mimeconf file. A sample will probably be + more helpful than a long explanation: [index] @@ -2392,14 +2435,9 @@ application/x-chm = execm rclchm execm keyword. 
- The easiest way to write a new filter is probably to start from an - existing one. - - Filters which output text/plain text - are generally simpler, but they cannot specify the character set - and other metadata, so they are limited to cases where these - elements are not needed. + + Filter HTML output
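Reviewer note: the simple-filter contract described in this patch (one file name argument, result on stdout, optional RECOLL_FILTER_FORPREVIEW) could be illustrated with a short shell sketch like the one below. This is not one of the shipped filters. The function names `html_escape` and `sketch_filter` are hypothetical, the input is assumed to be UTF-8 plain text, and declaring the charset through a `Content-Type` meta tag in the header is an assumption about what the indexer expects.

```shell
#!/bin/sh
# Hypothetical sketch of a simple filter; all names here are illustrative
# and are not taken from the filters shipped with the indexer.

# Escape the characters that are special in HTML text (ampersand first).
html_escape() {
    sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}

# sketch_filter FILE: write an HTML rendering of a plain-text FILE to stdout.
sketch_filter() {
    infile=$1
    if [ ! -r "$infile" ]; then
        echo "sketch_filter: cannot read '$infile'" >&2
        return 1
    fi
    title=$(basename "$infile" | html_escape)
    echo '<html><head>'
    printf '<title>%s</title>\n' "$title"
    # Assumption: the input is UTF-8 and the charset is declared in the header.
    echo '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'
    echo '</head><body><pre>'
    if [ "${RECOLL_FILTER_FORPREVIEW:-no}" = "yes" ]; then
        # Previewing: keep the text as it is.
        html_escape < "$infile"
    else
        # Indexing: strip the repeated "Subject: " keyword, keep the rest.
        sed -e 's/^Subject: //' "$infile" | html_escape
    fi
    echo '</pre></body></html>'
}

# Act as a filter only when a file name argument is given.
if [ $# -gt 0 ]; then
    sketch_filter "$1"
fi
```

Saved under a hypothetical name such as `rclsketch` and made executable, it would be tried by hand with `RECOLL_FILTER_FORPREVIEW=yes ./rclsketch somefile.txt` before being referenced from a configuration file.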