== Recoll input handlers

In the end, Recoll indexes plain UTF-8 text, remembering where it came
from.

But of course, this is not what the source data looks like.
The text content of the original documents is encoded in many fashions
(e.g. pdf, ms-word, html, etc.), and it can also be stored in quite
involved ways (inside archives, email attachments ...).

To get to the data and convert it to plain text, Recoll uses a set of
modules which it calls input handlers (or filters). These operate either
on the storage structure (e.g. a zip handler), or on the storage format
(e.g. a pdf to text translator), or on both. In addition, there is a
tentative notion of a higher level storage backend which we will ignore
for now (for reference there are currently two of those: the file system
and the web history cache).

The basic task of a filter is to take a document as input and produce a
series of subdocuments as output. The format of the subdocuments is defined
either dynamically (as part of the output data), or statically, in the
filter definition.

=== Simple filters

These are executed by the *mh_exec* Recoll module. They are the vast
majority.

These filters are very simple. They are designed to perform a simple task
with a minimal interface, they mostly don't know anything about each other,
and they don't know much about their context. This makes writing a filter
quite easy as there is not much to learn about its environment.

Only one output document is produced and the format is fixed.

In practice the filter, which is most often a shell script (but could
be any executable program), takes a file name on the command line and
outputs an html or plain text document on standard output, then exits.

For example, the pdf filter takes one pdf file name as input on the command
line and produces one html document on stdout. The fact that the output is
html is statically defined in a configuration file.

For filters which produce plain text, the output character set is in
general defined in the configuration file. Otherwise it is obtained from
the locale (hoping that it makes sense).

Filters that output html can produce metadata information in the html
header (e.g. author etc.). Filters that output plain text can only output
main text data, no metadata fields.

Besides the file name, there is one other piece of input information, which
is in the form of an environment variable, and can be safely ignored:
+RECOLL_FILTER_FORPREVIEW+. This indicates whether the filter is being used
for previewing or for indexing data. Some filters will elect to suppress
repetitive parts of the output text when indexing to avoid distorting the
term statistics. For example, the man filter suppresses the section
headers (NAME, SYNOPSIS...) when indexing.
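
As an illustration of the interface described above, here is a minimal
sketch of a simple filter written in Python. It is not one of the filters
shipped with Recoll. The function names, the "CONFIDENTIAL" banner that
it suppresses, and the fixed author value are all made up for the example.
It also assumes that +RECOLL_FILTER_FORPREVIEW+ is set to "yes" when
previewing, and that an author field can be passed as a `meta` tag in the
html header: check both against your Recoll version.

```python
#!/usr/bin/env python3
"""Hypothetical simple filter: one text file in, one html document out."""
import html
import os
import sys


def build_html(text, title, author, for_preview):
    """Return the html document that the filter writes to stdout."""
    lines = text.splitlines()
    if not for_preview:
        # When indexing, drop a repetitive line (a made-up banner here) so
        # that it does not distort the term statistics.
        lines = [ln for ln in lines if ln.strip() != "CONFIDENTIAL"]
    body = html.escape("\n".join(lines))
    return (
        "<html><head>\n"
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">\n'
        f"<title>{html.escape(title)}</title>\n"
        # Assumption: metadata travels as meta tags in the html header.
        f'<meta name="author" content="{html.escape(author)}">\n'
        "</head><body>\n"
        f"<pre>{body}</pre>\n"
        "</body></html>\n"
    )


def main(path, environ):
    # Assumption: the variable holds "yes" for previewing, anything else
    # (or no variable at all) means indexing.
    for_preview = environ.get("RECOLL_FILTER_FORPREVIEW", "no") == "yes"
    with open(path, encoding="utf-8", errors="replace") as f:
        text = f.read()
    doc = build_html(text, os.path.basename(path), "unknown", for_preview)
    # Write bytes so that the output really is UTF-8 whatever the locale.
    sys.stdout.buffer.write(doc.encode("utf-8"))
    return 0


# Run only when invoked with exactly one file name argument.
if __name__ == "__main__" and len(sys.argv) == 2:
    sys.exit(main(sys.argv[1], os.environ))
```

The filter reads its only argument, writes one document to stdout and
exits, which is all that the simple interface asks of it. Ignoring the
environment variable altogether would also be valid.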

=== Multiple input filters

These filters are more complex, but still quite easy to write, especially
if you can use Python, because they can then use a common module which
manages the communication with the indexer.

Newer Recoll versions have converted many previously 'simple' filters to
this kind as part of the port to Windows.

These filters are executed by the *mh_execm* Recoll module.

They are persistent: one instance lives through a whole indexing pass and
processes multiple input files in succession (the point being to avoid the
startup performance penalty). They can also produce multiple documents per
input file if this makes sense for their input format (e.g. zip archive,
chm help file).

They use a simple communication protocol over a pipe with the main recoll
or recollindex process, with file names and a few other parameters being
sent as input, and decoded data and attributes being sent in return.

The shared Python module is 'filters/rclexecm.py'. You can look at 'rclzip'
or 'rclaudio' for reasonably straightforward examples.
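
To give an idea of what a persistent filter looks like, here is a
self-contained sketch of a request/reply loop over a pair of byte streams.
It does not use 'rclexecm.py' and it does not claim to reproduce the actual
wire format: the framing (a "name: length" line followed by the data, with
an empty line ending the message) and the parameter names `filename`,
`ipath`, `mimetype` and `document` are assumptions made for the example.
Read 'rclexecm.py' for the real protocol.

```python
import io


def write_message(stream, params):
    """Write one message: a 'name: length' line before each data block."""
    for name, data in params.items():
        stream.write(f"{name}: {len(data)}\n".encode("ascii"))
        stream.write(data)
    stream.write(b"\n")  # an empty line ends the message
    stream.flush()


def read_message(stream):
    """Read one message. Return a dict, or None when the input is closed."""
    params = {}
    while True:
        header = stream.readline()
        if not header:
            return None  # the other side closed the pipe
        if header == b"\n":
            return params  # end of message
        name, _, length = header.decode("ascii").rpartition(":")
        params[name] = stream.read(int(length))


def serve(inp, out, extract):
    """Answer requests until the input closes: one process, many files."""
    while True:
        request = read_message(inp)
        if request is None:
            break
        filename = request["filename"].decode("utf-8")
        # Hypothetical 'ipath' parameter: selects a document inside the file.
        ipath = request.get("ipath", b"").decode("utf-8")
        text = extract(filename, ipath)
        write_message(out, {"mimetype": b"text/plain",
                            "document": text.encode("utf-8")})


if __name__ == "__main__":
    # Toy stand-in for an archive: file name -> {member name -> text}.
    archive = {"a.zip": {"1": "first member", "2": "second member"}}
    requests = io.BytesIO()
    write_message(requests, {"filename": b"a.zip", "ipath": b"1"})
    write_message(requests, {"filename": b"a.zip", "ipath": b"2"})
    requests.seek(0)
    replies = io.BytesIO()
    serve(requests, replies, lambda name, ipath: archive[name][ipath])
    replies.seek(0)
    while True:
        reply = read_message(replies)
        if reply is None:
            break
        print(reply["document"].decode("utf-8"))
```

The demonstration at the end uses in-memory buffers in place of the pipe.
A real filter would run the same loop on its binary standard input and
output, so that a single process serves every file of an indexing pass.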