From 03063f58dc4474c293754ae575dce061920bd7e1 Mon Sep 17 00:00:00 2001
From: Jean-Francois Dockes
By writing a custom Python program, using the
- Recoll Python
- API. The small programs or pieces of code which handle the
+ processing of the different document types for
+ Recoll used to be called
Recoll input handlers
@@ -6411,8 +6423,8 @@ or
Recoll versions after
+ 1.11 define a Python programming interface, both for
+ searching and creating/updating an index. The search interface is used in the Recoll Ubuntu Unity Lens and the
+ Recoll Web UI. It can
+ run queries on any Recoll configuration. The index update section of the API may be used to
+ create and update Recoll
+ indexes on specific configurations (separate from the
+ ones created by recollindex). The
+ resulting databases can be queried alone, or in
+ conjunction with regular ones, through the GUI or any of
+ the query interfaces. The search API is modeled on the Python database
+ API specification. There were two major changes across
+ Recoll versions: The basis for the Recoll API changed from Python
+ database API version 1.0 (Recoll versions up to 1.18.1),
+ to version 2.0 (Recoll 1.18.2 and later). We will describe the new API and package structure
+ here. A paragraph at the end of this section will explain
+ a few differences and ways to write code compatible with
+ both versions. The Python interface can be found in the source
+ package, under python/recoll. As of Recoll 1.19,
+ the module can be compiled for Python3. The normal Recoll
+ installer installs the Python2 API along with the main
+ code. The Python3 version must be explicitly built and
+ installed. When installing from a repository, and depending on
+ the distribution, the Python API can sometimes be found
+ in a separate package. As an introduction, the following small sample will
+ run a query and list the title and url for each of the
+ results. It would work with Recoll 1.19 and later.
+ A udi (unique document identifier) identifies a
- document. Because of limitations inside the index
- engine, it is restricted in length (to 200 bytes),
- which is why a regular URI cannot be used. The
- structure and contents of the udi is defined by the
- application and opaque to the index engine. For
- example, the internal file system indexer uses the
- complete document path (file path + internal path),
- truncated to length, the suppressed part being
- replaced by a hash value. This data value (set as a field in the Doc
object) is stored, along with the URL, but not
indexed by Recoll.
- Its contents are not interpreted, and its use is up
- to the application. For example, the Recoll internal file system
- indexer stores the part of the document access path
- internal to the container file ( An If this attribute is set on a document when
+ entering it in the index, it designates its
+ physical container document. In a multilevel
+ hierarchy, this may not be the immediate parent.
+ Data for an external indexer should be stored in a
- separate index, not the one for the Recoll internal file system indexer,
- except if the latter is not used at all). The reason is
- that the main document indexer purge pass would remove
- all the other indexer's documents, as they were not seen
- during indexing. The main indexer documents would also
- probably be a problem for the external indexer purge
- operation. Recoll versions
- after 1.11 define a Python programming interface, both
- for searching and indexing. The search interface is used in the Recoll Ubuntu
- Unity Lens and Recoll WebUI. The indexing section of the API has seen little use,
- and is more a proof of concept. In truth it is waiting
- for its killer app... The search API is modeled along the Python database
- API specification. There were two major changes along
- Recoll versions: The basis for the Recoll API changed from
- Python database API version 1.0 (Recoll versions up to
- 1.18.1), to version 2.0 (Recoll 1.18.2 and
- later). The We will mostly describe the new API and package
- structure here. A paragraph at the end of this section
- will explain a few differences and ways to write code
- compatible with both versions. The Python interface can be found in the source
- package, under The As of Recoll 1.19,
- the module can be compiled for Python3. The normal Recoll
- installer installs the Python2 API along with the main
- code. The Python3 version must be explicitely built and
- installed. When installing from a repository, and depending on
- the distribution, the Python API can sometimes be found
- in a separate package. The following small sample will run a query and list
- the title and url for each of the results. It would
- work with Recoll 1.19
- and later. The The The This call initializes the recoll module, and
+ it should always be performed before any other
+ call or object creation. Closes the connection. You can't do
+ anything with the These aliases return a blank Set the parameters used to build snippets
+ (sets of keywords in context text fragments).
+ Expand an expression against the index
+ term list. Performs the basic function from
+ the GUI term explorer tool. Sort results by Starts a search for Starts a search for the query defined by
+ the SearchData object. Fetches the next Fetches the next Closes the query. The object is unusable
+ after the call. Adjusts the position in the current result
+ set. Retrieves the expanded query terms as a
+ list of pairs. Meaningful only after
+ executexx. In each pair, the first entry is a
+ list of user terms (of size one for simple
+ terms, or more for group and phrase clauses),
+ the second a list of query terms as derived
+ from the user terms and used in the Xapian
+ Query. Return the Xapian query description as a
+ Unicode string. Meaningful only after
+ executexx. Will insert <span class="rclmatch">,
+ </span> tags around the match areas in
+ the input text and return the modified text.
+ Create a snippets abstract for
+ So that things like Default number of records processed by
+ fetchmany (r/w). Number of records returned by the last
+ execute. Next index to be fetched from results.
+ Normally increments after each fetchone()
+ call, but can be set/reset before the call to
+ effect seeking (equivalent to using
+ Retrieve the named doc attribute. You can
+ also use Set the the named doc attribute. You can
+ also use Retrieve the URL in byte array format (no
+ transcoding), for use as parameter to a
+ system call. Set the URL in byte array format (no
+ transcoding). Return a dictionary of doc object
+ keys/values list of doc object keys (attribute
+ names). An
Terminology
The small programs or
- pieces of code which handle the processing of the
- different document types for Recoll used to be called
+ Terminology
+
+ filters, which is still
reflected in the name of the directory which holds them
and many configuration variables. They were named this
@@ -5820,7 +5832,7 @@ dir:recoll dir:src -dir:utils -dir:common
term input handler is now
progressively substituted in the documentation.
filter is still used in many
- places though.
+ places though.
+
+ recoll module
+ became a package (with an internal recoll module) as of Recoll version 1.19, in order
+ to add more functions. For existing code, this only
+ changes the way the interface must be imported. The python/recoll/
+ directory contains the usual setup.py. After configuring the main
+ Recoll code, you can use
+ the script to build and install the Python module:
+
+
+ cd recoll-xxx/python/recoll
+ python setup.py build
+ python setup.py install
+
+python/samples source
+ directory contains several examples of Python programming
+ with Recoll, exercising
+ the extension more completely, and especially its data
+ extraction features.
+#!/usr/bin/env python
+
+from recoll import recoll
+
+db = recoll.connect()
+query = db.query()
+nres = query.execute("some query")
+results = query.fetchmany(20)
+for doc in results:
+ print(doc.url, doc.title)
+
+
-
ipath in this case is a list of
- subdocument sequential numbers). url and ipath are
- returned in every search result and permit access
- to the original document.ipath to
+ store the part of the document access path internal
+ to (possibly nested) container documents.
+ ipath in this case is
+ a vector of access elements (e.g., the first part
+ could be a path inside a zip file to an archive
+ member which happens to be an mbox file, the second
+ element would be the message sequential number
+ inside the mbox etc.). url and ipath are returned in every search
+ result and define the access to the original
+ document. ipath is
+ empty for top-level document/files (e.g. a PDF
+ document which is a filesystem file). The
+ Recoll GUI knows
+ about the structure of the ipath values used by the
+ filesystem indexer, and uses it for such functions
+ as opening the parent of a given document.
+ udi (unique
+ document identifier) identifies a document. Because
+ of limitations inside the index engine, it is
+ restricted in length (to 200 bytes), which is why a
+ regular URI cannot be used. The structure and
+ contents of the udi is
+ defined by the application and opaque to the index
+ engine. For example, the internal file system
+ indexer uses the complete document path (file path
+ + internal path), truncated to length, the
+ suppressed part being replaced by a hash value. The
+ udi is not explicit in
+ the query interface (it is used "under the hood" by
+ the rclextract
+ module), but it is an explicit element of the
+ update interface.parent_udi is
+ optional, but its use by an indexer may simplify
+ index maintenance, as Recoll will automatically
+ delete all children defined by parent_udi == udi when the
+ document designated by udi is destroyed. e.g. if a
+ Zip archive contains
+ entries which are themselves containers, like
+ mbox files, all the
+ subdocuments inside the Zip file (mbox, messages, message
+ attachments, etc.) would have the same parent_udi, matching the
+ udi for the
+ Zip file, and all
+ would be destroyed when the Zip file (identified by its
+ udi) is removed from
+ the index. The standard filesystem indexer uses
+ parent_udi.
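The relationship just described can be sketched as follows. This is an illustration only: the identifier strings are hypothetical, since Recoll defines no particular structure for udis.

```python
# Hypothetical udi / parent_udi values for a zip archive containing an
# mbox file with messages, as in the example above. The "|" separator
# syntax is invented for illustration.
zip_udi = "/home/me/docs/archive.zip"
index_entries = [
    {"udi": zip_udi + "|folder.mbox",       "parent_udi": zip_udi},
    {"udi": zip_udi + "|folder.mbox|msg:1", "parent_udi": zip_udi},
    {"udi": zip_udi + "|folder.mbox|msg:2", "parent_udi": zip_udi},
]

# When the zip file (identified by zip_udi) is removed from the index,
# every entry whose parent_udi matches it goes away with it:
remaining = [e for e in index_entries if e["parent_udi"] != zip_udi]
# remaining is now empty
```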
-
- recoll module
- became a package (with an internal recoll module) as of
- Recoll version
- 1.19, in order to add more functions. For
- existing code, this only changes the way the
- interface must be imported.python/recoll.python/recoll/
- directory contains the usual setup.py. After configuring the main
- Recoll code, you can
- use the script to build and install the Python
- module:
-
-
- cd recoll-xxx/python/recoll
- python setup.py build
- python setup.py install
-
-python/samples source directory
- contains several examples of Python programming with
- Recoll, exercising the
- extension more completely, and especially its data
- extraction features.
- from recoll import recoll
-
- db = recoll.connect()
- query = db.query()
- nres = query.execute("some query")
- results = query.fetchmany(20)
- for doc in results:
- print(doc.url, doc.title)
-
-
- recoll module
contains functions and classes used to query (or
- update) the index.connect()
+ connect()
function connects to one or several
Recoll
index(es) and returns a Db object.
+ Db object.
-
- confdir may specify a
- configuration directory. The usual defaults
- apply.confdir
+ may specify a configuration directory.
+ The usual defaults apply.extra_dbs is a list of
- additional indexes (Xapian
- directories).extra_dbs
+ is a list of additional indexes (Xapian
+ directories).writable decides if we can
- index new data through this
- connection.writable
+ decides if we can index new data through
+ this connection.
@@ -6781,8 +6863,9 @@ or
Db object after this.Db object after this.Query object for this
- index.Query object for this
+ index.maxchars defines
- the maximum total size of the abstract.
- contextwords
- defines how many terms are shown around the
- keyword.maxchars defines
+ the maximum total size of the abstract.
+ contextwords
+ defines how many terms are shown around the
+ keyword.match_type can be either of
- wildcard,
- regexp or
- stem. Returns a
- list of terms expanded from the input
- expression.match_type can be either of
+ wildcard,
+ regexp or
+ stem. Returns a
+ list of terms expanded from the input
+ expression.fieldname, in
- ascending or descending order. Must be called
- before executing the search.fieldname, in
+ ascending or descending order. Must be called
+ before executing the search.query_string, a
- Recoll search
- language string.query_string,
+ a Recoll
+ search language string.Doc objects in the current
- search results, and returns them as an array of
- the required size, which is by default the
- value of the arraysize data member.Doc objects in the current
+ search results, and returns them as an array
+ of the required size, which is by default the
+ value of the arraysize data member.Doc object from the current
- search results.Doc object from the current
+ search results.mode can be
- relative or
- absolute.mode can be
+ relative or
+ absolute.ishtml can be set
- to indicate that the input text is HTML and
- that HTML special characters should not be
- escaped. methods
- if set should be an object with methods
- startMatch(i) and endMatch() which will be
- called for each match and should return a begin
- and end tagishtml can be
+ set to indicate that the input text is HTML
+ and that HTML special characters should not
+ be escaped. methods if set should be an
+ object with methods startMatch(i) and
+ endMatch() which will be called for each
+ match and should return a begin and end
+ tagdoc (a Doc object) by selecting text
- around the match terms. If methods is set, will
- also perform highlighting. See the highlight
- method.doc (a
+ Doc object) by
+ selecting text around the match terms. If
+ methods is set, will also perform
+ highlighting. See the highlight method.for doc in query: will
- work.for doc in query: will
+ work.scroll()). Starts at 0.scroll()).
+ Starts at 0.getattr(doc,
+ key) or doc.key.setattr(doc,
+ key, value).
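The methods object mentioned in the highlight() and makedocabstract() descriptions above can be sketched as a small class. The rclmatch class name comes from the default markup described earlier; everything else here is an assumption for illustration.

```python
# Minimal 'methods' object for query.highlight() /
# query.makedocabstract(): startMatch(i) and endMatch() must return
# the begin and end tags wrapped around each match area.
class HlTags:
    def startMatch(self, i):
        # i is the index of the matched term group; it could be used
        # to vary the markup, but is ignored in this minimal version.
        return '<span class="rclmatch">'

    def endMatch(self):
        return '</span>'

# Typical use (assuming an executed query object):
#   html = query.highlight(text, ishtml=1, methods=HlTags())
```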
+ "RCL.PROGRAM.PYTHONAPI.RECOLL.CLASSES.SEARCHDATA">
The SearchData class
+ "RCL.PROGRAM.PYTHONAPI.RCLEXTRACT.CLASSES.EXTRACTOR">
The Extractor class
Extractor
- object is built from a Doc object, output from a
- query.Extractor
+ object is built from a Doc object, output from a
+ query.ipath and
return a Doc
object. The doc.text field has the document
text converted to either text/plain or
text/html according to doc.mimetype. The
- typical use would be as follows:
+ typical use would be as follows:
qdoc = query.fetchone()
extractor = recoll.Extractor(qdoc)
@@ -7106,10 +7252,10 @@ doc = extractor.textextract(qdoc.ipath)
outfile='')
Extracts document into an output file, which can be given explicitly or will be created as a temporary file to be deleted by the caller. Typical use:
qdoc = query.fetchone()
extractor = recoll.Extractor(qdoc)
@@ -7127,9 +7273,9 @@ filename = extractor.idoctofile(qdoc.ipath, qdoc.mimetype)
@@ -7167,26 +7313,281 @@ for i in range(nres):
+
+
+ The update API can be used to create an index from data which is not accessible to the regular Recoll indexer, or structured to present difficulties to the Recoll input handlers.
+ An indexer created using this API will have work equivalent to the Recoll file system indexer's: look for modified documents, extract their text, call the API to index it, and purge from the index the data for documents which no longer exist in the document store.
+ The data for such an external indexer should be stored in an index separate from any used by the Recoll internal file system indexer. The reason is that the main document indexer purge pass (removal of deleted documents) would also remove all the documents belonging to the external indexer, as they were not seen during the filesystem walk. The main indexer documents would also probably be a problem for the external indexer's own purge operation.
+ While there would be ways to enable multiple foreign indexers to cooperate on a single index, it is simpler to use separate ones, and to use the multiple index access capabilities of the query interface if needed.
+ There are two parts in the update interface:
+ +Methods inside the recoll module allow inserting
+ data into the index, to make it accessible by the
+ normal query interface.
An interface based on script execution is
+ defined to allow either the GUI or the rclextract module to access
+ original document data for previewing or
+ editing.
The following code fragments can be used to ensure that code can run with both the old and the new API (as long as it does not use the new abilities of the new API, of course).
+ The update methods are part of the recoll module described above. The
+ connect() method is used with a writable=True parameter to obtain a
+ writable Db object. The
+ following Db object
+ methods are then available.
Adapting to the new package structure:
+ Add or update index data for a given document.
+ The udi
+ string must define a unique id for the document.
+ It is an opaque interface element and not
+ interpreted inside Recoll. doc is a Doc object,
+ created from the data to be indexed (the main
+ text should be in doc.text). If parent_udi
+ is set, this is a unique identifier for the
+ top-level container (e.g. for the filesystem
+ indexer, this would be the one which is an actual
+ file).
Purge the index of all data for udi, and all documents (if any)
+ which have a matching parent_udi.
Test if the index needs to be updated for the
+ document identified by udi. If this call is to be used,
+ the doc.sig field
+ should contain a signature value when calling
+ addOrUpdate(). The
+ needUpdate() call
+ then compares its parameter value with the stored
+ sig for udi. sig is an opaque value, compared
+ as a string.
The filesystem indexer uses a concatenation of the decimal string values for file size and update time, but a hash of the contents could also be used.
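For example, a sig built the same way as the filesystem indexer's could look like this. This is a sketch, not Recoll's own code; any stable string works, since Recoll only compares sigs as opaque strings.

```python
import os

def file_sig(path):
    # Concatenate the decimal string values for file size and
    # modification time, as the filesystem indexer does. Store the
    # result in doc.sig before calling addOrUpdate(), and pass it to
    # needUpdate() on the next indexing pass.
    st = os.stat(path)
    return "%d%d" % (st.st_size, int(st.st_mtime))
```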
+ +As a side effect, if the return value is false
+ (the index is up to date), the call will set the
+ existence flag for the document (and any
+ subdocument defined by its parent_udi), so that a later
+ purge() call will
+ preserve them.
The use of needUpdate() and purge() is optional, and the
+ indexer may use another method for checking the
+ need to reindex or to delete stale entries.
Delete all documents that were not touched during the just finished indexing pass (since open-for-write). These are the documents for which the needUpdate() call was not performed, indicating that they no longer exist in the primary storage system.
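Putting the calls above together, the driving loop of an external indexer might look like the following. This is a sketch, not code from the Recoll distribution: db is assumed to come from recoll.connect(confdir=..., writable=True), make_doc stands for whatever application code builds a Doc object (e.g. recoll.Doc()) from a file, and the naive udi truncation ignores the hashing scheme described earlier.

```python
import os

def index_tree(db, make_doc, topdir):
    # Walk the document store, reindex what changed, then purge.
    for dirpath, _dirnames, filenames in os.walk(topdir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # udis are limited to 200 bytes; a real indexer would
            # replace the suppressed part with a hash instead of
            # simply truncating as done here.
            udi = path[:200]
            st = os.stat(path)
            sig = "%d%d" % (st.st_size, int(st.st_mtime))
            # needUpdate() also sets the existence flag, so documents
            # which are up to date survive the final purge().
            if not db.needUpdate(udi, sig):
                continue
            doc = make_doc(path)   # application-provided Doc builder
            doc.sig = sig
            db.addOrUpdate(udi, doc)
    # Remove index data for documents not seen during this pass:
    db.purge()
```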
+Recoll has internal
+ methods to access document data for its internal
+ (filesystem) indexer. An external indexer needs to
+ provide data access methods if it needs integration
+ with the GUI (e.g. preview function), or support for
+ the rclextract
+ module.
The index data and the access method are linked by
+ the rclbes (recoll backend
+ storage) Doc field. You
+ should set this to a short string value identifying
+ your indexer (e.g. the filesystem indexer uses either
+ "FS" or an empty value, the Web history indexer uses
+ "BGL").
The link is actually performed inside a backends configuration file (stored
+ in the configuration directory). This defines commands
+ to execute to access data from the specified indexer.
+ Example, for the mbox indexing sample found in the
+ Recoll source (which sets rclbes="MBOX"):
+ [MBOX]
+ fetch = /path/to/recoll/src/python/samples/rclmbox.py fetch
+ makesig = /path/to/recoll/src/python/samples/rclmbox.py makesig
fetch and makesig define two commands to execute
+ to respectively retrieve the document text and compute
+ the document signature (the example implementation uses
+ the same script with different first parameters to
+ perform both operations).
The scripts are called with three additional
+ arguments: udi,
+ url, and ipath, stored with the document when
+ it was indexed, and may use any or all to perform the
+ requested operation. The caller expects the result data
+ on stdout.
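A data-access script along these lines could be structured as follows. This is a sketch, not the rclmbox.py sample: the dict-based store and the way the identifiers are used are assumptions for illustration; a real script would look the document up in its own storage using the udi, url or ipath values it stored at indexing time.

```python
#!/usr/bin/env python
import sys

def handle_request(op, udi, url, ipath, store):
    # Dispatch on the operation named by the first script argument.
    # 'store' stands for the indexer's document storage; a plain dict
    # keyed by udi here, purely for illustration.
    if op == "fetch":
        return store[udi]["data"]     # the document text/data
    if op == "makesig":
        return store[udi]["sig"]      # an up-to-date signature
    raise ValueError("unknown operation: %s" % op)

if __name__ == "__main__" and len(sys.argv) >= 5:
    # Called as: script (fetch|makesig) udi url ipath
    op, udi, url, ipath = sys.argv[1:5]
    store = {}   # a real script would open its document store here
    sys.stdout.write(handle_request(op, udi, url, ipath, store))
```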
The Recoll source tree has two samples of external
+ indexers in the src/python/samples directory. The
+ more interesting one is rclmbox.py which indexes a directory
+ containing mbox folder
+ files. It exercises most features in the update
+ interface, and has a data access interface.
See the comments inside the file for more information.
+ The following code fragments can be used to ensure that code can run with both the old and the new API (as long as it does not use the new abilities of the new API, of course).
+ +Adapting to the new package structure:
+
try:
    from recoll import recoll
@@ -7196,21 +7597,21 @@ except:
    import recoll
    hasextract = False
+
- Adapting to the change of nature of the next Query member. The same test can be
- used to choose to use the scroll() method (new) or set the
- next value (old).
+ Adapting to the change of nature of the
+ next Query member. The same test can be used to choose to use the
+ scroll() method (new) or set the next value (old).
+
+ rownum = query.next if type(query.next) == int else query.rownumber