Index queries do not provide document content (only a
partial and imprecise reconstruction is performed to show the
snippet text). In order to access the actual document data,
the data extraction part of the indexing process
must be performed (subdocument access and format
translation). This is not trivial in
general. The rclextract module currently
provides a single class which can be used to access the data
content for result documents.
Methods
- Extractor(doc)
- An Extractor object is built from a Doc object, output from a query.
- Extractor.textextract(ipath)
- Extract the document defined by ipath and return a Doc object. The doc.text field has the document text converted to either text/plain or text/html, according to doc.mimetype. The typical use would be as follows:
    qdoc = query.fetchone()
    extractor = rclextract.Extractor(qdoc)
    doc = extractor.textextract(qdoc.ipath)
    # use doc.text, e.g. for previewing
- Extractor.idoctofile(ipath, targetmtype, outfile='')
- Extracts the document into an output file,
which can be given explicitly or will be created as a
temporary file to be deleted by the caller. Typical use:
    qdoc = query.fetchone()
    extractor = rclextract.Extractor(qdoc)
    filename = extractor.idoctofile(qdoc.ipath, qdoc.mimetype)
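The two methods above can be wrapped in small helpers. The following is only a sketch: preview_text and read_extracted are hypothetical names, not part of the Recoll API. They receive an already built Extractor and a query result, and rely only on the textextract and idoctofile calls described above. The sketch assumes that doc.text is a Python str and that idoctofile returns the name of the file it wrote, as the typical use suggests.

```python
import os
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects the character data of an HTML document, dropping the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def preview_text(extractor, qdoc, max_chars=200):
    """Return up to max_chars of text for a query result (hypothetical helper).

    Assumes doc.text is a str holding text/plain or text/html data,
    as indicated by doc.mimetype.
    """
    doc = extractor.textextract(qdoc.ipath)
    text = doc.text
    if doc.mimetype == "text/html":
        # Crude tag stripping, good enough for a short preview.
        collector = _TextCollector()
        collector.feed(text)
        collector.close()
        text = "".join(collector.parts)
    # Collapse runs of whitespace, then truncate.
    return " ".join(text.split())[:max_chars]


def read_extracted(extractor, qdoc):
    """Extract the document to a temporary file and return its bytes.

    No outfile is passed, so the temporary file is created by idoctofile;
    deleting it is the caller's job, which is done here in the finally clause.
    """
    filename = extractor.idoctofile(qdoc.ipath, qdoc.mimetype)
    try:
        with open(filename, "rb") as f:
            return f.read()
    finally:
        os.remove(filename)
```

With a real index, the extractor would be built from a query result as in the typical use shown above, then passed to either helper along with that same result.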

