== Recoll input handlers

In the end, Recoll indexes plain UTF-8 text, remembering where it came
from. But of course, this is not what the source data looks like.

The text content of the original documents is encoded in many formats
(e.g. pdf, ms-word, html, etc.), and it can also be stored in quite
involved ways (inside archives, email attachments ...).

To get at the data and convert it to plain text, Recoll uses a set of
modules which it calls input handlers (or filters). These operate either on
the storage structure (e.g. a zip handler), or on the storage format (e.g.
a pdf to text translator), or on both. In addition, there is a tentative
notion of a higher level storage backend, which we will ignore for now (for
reference, there are currently two of those: the file system and the web
history cache).

The basic task of filters is to take a document as input and produce a
series of subdocuments as output. The format of the subdocuments is defined
either dynamically (as part of the output data), or statically, in the
filter definition.
=== Simple filters

These are executed by the *mh_exec* Recoll module. They are the vast
majority.

These filters are very simple. They are designed to perform a simple task
with a minimal interface, they mostly don't know anything about each other,
and they don't know much about their context. This makes writing a filter
quite easy, as there is not much to learn about the environment. Only one
output document is produced, and its format is fixed.

In practice the filter, which is most often a shell script (but could
be any executable program), takes a file name on the command line, outputs
an html or plain text document on standard output, then exits.

For example, the pdf filter takes one pdf file name as input on the command
line and produces one html document on stdout. The fact that the output is
html is statically defined in a configuration file.

For filters which produce plain text, the output character set is in
general defined in the configuration file. Otherwise it will be obtained
from the locale (hoping that it makes sense).

Filters that output html can produce metadata information in the html
header (e.g. author etc.). Filters that output plain text can only produce
the main text data, no metadata fields.
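
As an illustration, here is a minimal sketch of what such a filter could
look like, written here in Python for compactness (most real filters of
this kind are shell scripts wrapping an external converter). Only the
calling convention reflects the description above: the input format, the
+myformat_to_text()+ helper and the metadata values are invented for the
example.

[source,python]
----
#!/usr/bin/env python3
# Hypothetical "simple" input handler: one file name on the command line,
# one html document on stdout, then exit.
import sys
import html

def myformat_to_text(path):
    # Placeholder for the real decoding work (usually a call to an external
    # converter); here we just read the file as text.
    with open(path, "r", errors="replace") as f:
        return f.read()

def main():
    if len(sys.argv) != 2:
        print("Usage: rclmyformat <filename>", file=sys.stderr)
        sys.exit(1)
    text = myformat_to_text(sys.argv[1])
    # Metadata fields can only travel in the html header.
    print('<html><head>')
    print('<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">')
    print('<meta name="author" content="somebody">')
    print('</head><body><pre>')
    print(html.escape(text))
    print('</pre></body></html>')

if __name__ == "__main__":
    main()
----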

Besides the file name, there is one other piece of input information, in
the form of an environment variable, which can be safely ignored:
+RECOLL_FILTER_FORPREVIEW+. This indicates whether the filter is being used
for previewing or for indexing data. Some filters will elect to suppress
repetitive parts of the output text when indexing, to avoid distorting the
term statistics. For example, the man filter suppresses the section
headers (NAME, SYNOPSIS...) when indexing.
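
Checking the variable inside a filter is trivial. Here is a short Python
sketch; the exact value that Recoll puts in the variable is assumed to be
"yes" for previewing, so check an existing filter for the authoritative
test.

[source,python]
----
import os

# Assumption: RECOLL_FILTER_FORPREVIEW is set to "yes" when previewing.
for_preview = os.environ.get("RECOLL_FILTER_FORPREVIEW", "no") == "yes"
if not for_preview:
    # Indexing: this is where repetitive boilerplate could be dropped.
    pass
----
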
=== Multiple input filters

These filters are more complex, but still quite easy to write, especially
if you can use Python, because they can then use a common module which
manages the communication with the indexer. Newer Recoll versions have
converted many previously 'simple' filters to this kind as part of the port
to Windows.

These filters are executed by the *mh_execm* Recoll module.

They are persistent (one instance runs through a whole indexing pass) and
process multiple successive input files (the point being to avoid the
startup performance penalty), possibly extracting multiple documents per
input file when this makes sense for the input format (e.g. zip archive,
chm help file).

They use a simple communication protocol over a pipe to the main recoll
or recollindex process: file names and a few other parameters are sent as
input, and decoded data and attributes are returned.
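
To give an idea of the general shape of such a filter, here is a schematic
Python sketch. It is only an illustration: the FILENAME/NEXT/IPATH/DATA/EOF
message names, the framing, and the +decode_entries()+ helper are invented
and do not reflect the actual protocol.

[source,python]
----
#!/usr/bin/env python3
# Schematic sketch of a persistent multi-document filter: loop forever,
# reading requests on stdin and answering on stdout. The message format
# is invented; the real protocol is implemented by filters/rclexecm.py.
import sys

def decode_entries(path):
    # Placeholder: yield (internal_path, text) pairs, e.g. one per archive
    # member for a container format.
    yield ("member1", "decoded text for member1")

entries = None
for line in sys.stdin:
    req = line.strip()
    if req.startswith("FILENAME "):
        entries = decode_entries(req[len("FILENAME "):])
    elif req == "NEXT" and entries is not None:
        try:
            ipath, text = next(entries)
            sys.stdout.write("IPATH %s\nDATA %d\n%s\n" % (ipath, len(text), text))
        except StopIteration:
            sys.stdout.write("EOF\n")
    sys.stdout.flush()
----

In a real filter, this loop and the message encoding are handled by the
shared module described below; the filter itself only supplies the
format-specific decoding.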

The shared Python module is 'filters/rclexecm.py'. You can look at 'rclzip'
or 'rclaudio' for reasonably straightforward examples.