diff --git a/src/doc/user/Makefile b/src/doc/user/Makefile index 5d1860e5..14b6c9c2 100644 --- a/src/doc/user/Makefile +++ b/src/doc/user/Makefile @@ -19,7 +19,7 @@ commonoptions=--stringparam section.autolabel 1 \ # index.html chunk format target replaced by nicer webhelp (needs separate # make) in webhelp/ subdir -all: usermanual.html usermanual.pdf webh +all: usermanual.html webh usermanual.pdf webh: make -C webhelp diff --git a/src/doc/user/usermanual.html b/src/doc/user/usermanual.html index 99ec3fe9..b86a303b 100644 --- a/src/doc/user/usermanual.html +++ b/src/doc/user/usermanual.html @@ -20,8 +20,8 @@ alink="#0000FF">
By writing a custom Python program, using the - Recoll Python - API.
+ Recoll Python API.
The small programs or pieces of code which handle the
+ processing of the different document types for
+ Recoll used to be called
filters, which is still
reflected in the name of the directory which holds them
and many configuration variables. They were named this
@@ -5820,7 +5832,7 @@ dir:recoll dir:src -dir:utils -dir:common
term input handler is now
progressively substituted in the documentation.
filter is still used in many
- places though.
+ places though.
Recoll input handlers @@ -6411,8 +6423,8 @@ or
Recoll versions after + 1.11 define a Python programming interface, both for + searching and creating/updating an index.
+ +The search interface is used in the Recoll Ubuntu Unity Lens and the + Recoll Web UI. It can + run queries on any Recoll configuration.
+ +The index update section of the API may be used to + create and update Recoll + indexes on specific configurations (separate from the + ones created by recollindex). The + resulting databases can be queried alone, or in + conjunction with regular ones, through the GUI or any of + the query interfaces.
+ +The search API is modeled on the Python database + API specification. There were two major changes across + Recoll versions:
+ +The basis for the Recoll API changed from Python + database API version 1.0 (Recoll versions up to 1.18.1), + to version 2.0 (Recoll 1.18.2 and later).
+The recoll module
+ became a package (with an internal recoll module) as of Recoll version 1.19, in order
+ to add more functions. For existing code, this only
+ changes the way the interface must be imported.
We will describe the new API and package structure + here. A paragraph at the end of this section will explain + a few differences and ways to write code compatible with + both versions.
+ +The Python interface can be found in the source
+ package, under python/recoll.
The python/recoll/
+ directory contains the usual setup.py. After configuring the main
+ Recoll code, you can use
+ the script to build and install the Python module:
+
+cd recoll-xxx/python/recoll
+python setup.py build
+python setup.py install
+
As of Recoll 1.19, + the module can be compiled for Python3.
+ +The normal Recoll + installer installs the Python2 API along with the main + code. The Python3 version must be explicitly built and + installed.
+ +When installing from a repository, and depending on + the distribution, the Python API can sometimes be found + in a separate package.
+ +As an introduction, the following small sample will
+ run a query and list the title and url for each of the
+ results. It would work with Recoll 1.19 and later. The
+ python/samples source
+ directory contains several examples of Python programming
+ with Recoll, exercising
+ the extension more completely, and especially its data
+ extraction features.
+#!/usr/bin/env python
+
+from recoll import recoll
+
+db = recoll.connect()
+query = db.query()
+nres = query.execute("some query")
+results = query.fetchmany(20)
+for doc in results:
+ print(doc.url, doc.title)
+
+ An udi (unique document identifier) identifies a - document. Because of limitations inside the index - engine, it is restricted in length (to 200 bytes), - which is why a regular URI cannot be used. The - structure and contents of the udi is defined by the - application and opaque to the index engine. For - example, the internal file system indexer uses the - complete document path (file path + internal path), - truncated to length, the suppressed part being - replaced by a hash value.
-This data value (set as a field in the Doc
object) is stored, along with the URL, but not
indexed by Recoll.
- Its contents are not interpreted, and its use is up
- to the application. For example, the Recoll internal file system
- indexer stores the part of the document access path
- internal to the container file (ipath in this case is a list of
- subdocument sequential numbers). url and ipath are
- returned in every search result and permit access
- to the original document.
ipath to
+ store the part of the document access path internal
+ to (possibly nested) container documents.
+ ipath in this case is
+ a vector of access elements (e.g., the first part
+ could be a path inside a zip file to an archive
+ member which happens to be an mbox file, the second
+ element would be the message sequential number
+ inside the mbox etc.). url and ipath are returned in every search
+ result and define the access to the original
+ document. ipath is
+ empty for top-level document/files (e.g. a PDF
+ document which is a filesystem file). The
+ Recoll GUI knows
+ about the structure of the ipath values used by the
+ filesystem indexer, and uses it for such functions
+ as opening the parent of a given document.
+ A udi (unique
+ document identifier) identifies a document. Because
+ of limitations inside the index engine, it is
+ restricted in length (to 200 bytes), which is why a
+ regular URI cannot be used. The structure and
+ contents of the udi is
+ defined by the application and opaque to the index
+ engine. For example, the internal file system
+ indexer uses the complete document path (file path
+ + internal path), truncated to length, the
+ suppressed part being replaced by a hash value. The
+ udi is not explicit in
+ the query interface (it is used "under the hood" by
+ the rclextract
+ module), but it is an explicit element of the
+ update interface.
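As an illustration of the scheme just described, here is a hypothetical helper, not Recoll's actual implementation: the separator character, the choice of MD5, and the exact length accounting are all assumptions; only the 200-byte limit and the truncate-plus-hash idea come from the text above.

```python
import hashlib

MAX_UDI_LEN = 200  # length limit stated in the documentation

def make_udi(path, ipath=""):
    # Build a udi from a file path plus optional internal path.
    # When the result exceeds the limit, truncate it and replace
    # the suppressed tail with a hash of the full string, so that
    # distinct long paths still get distinct udis.
    udi = path + "|" + ipath if ipath else path
    if len(udi) <= MAX_UDI_LEN:
        return udi
    digest = hashlib.md5(udi.encode("utf-8")).hexdigest()  # 32 hex chars
    return udi[:MAX_UDI_LEN - len(digest)] + digest
```

The udi stays opaque to the index engine, so any scheme with the same uniqueness and length properties would do.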
If this attribute is set on a document when
+ entering it in the index, it designates its
+ physical container document. In a multilevel
+ hierarchy, this may not be the immediate parent.
+ parent_udi is
+ optional, but its use by an indexer may simplify
+ index maintenance, as Recoll will automatically
+ delete all children defined by parent_udi == udi when the
+ document designated by udi is destroyed. For example, if a
+ Zip archive contains
+ entries which are themselves containers, like
+ mbox files, all the
+ subdocuments inside the Zip file (mbox, messages, message
+ attachments, etc.) would have the same parent_udi, matching the
+ udi for the
+ Zip file, and all
+ would be destroyed when the Zip file (identified by its
+ udi) is removed from
+ the index. The standard filesystem indexer uses
+ parent_udi.
Data for an external indexer, should be stored in a - separate index, not the one for the Recoll internal file system indexer, - except if the latter is not used at all). The reason is - that the main document indexer purge pass would remove - all the other indexer's documents, as they were not seen - during indexing. The main indexer documents would also - probably be a problem for the external indexer purge - operation.
Recoll versions - after 1.11 define a Python programming interface, both - for searching and indexing.
- -The search interface is used in the Recoll Ubuntu - Unity Lens and Recoll WebUI.
- -The indexing section of the API has seen little use, - and is more a proof of concept. In truth it is waiting - for its killer app...
- -The search API is modeled along the Python database - API specification. There were two major changes along - Recoll versions:
- -The basis for the Recoll API changed from - Python database API version 1.0 (Recoll versions up to - 1.18.1), to version 2.0 (Recoll 1.18.2 and - later).
-The recoll module
- became a package (with an internal recoll module) as of
- Recoll version
- 1.19, in order to add more functions. For
- existing code, this only changes the way the
- interface must be imported.
We will mostly describe the new API and package - structure here. A paragraph at the end of this section - will explain a few differences and ways to write code - compatible with both versions.
- -The Python interface can be found in the source
- package, under python/recoll.
The python/recoll/
- directory contains the usual setup.py. After configuring the main
- Recoll code, you can
- use the script to build and install the Python
- module:
-- -cd recoll-xxx/python/recoll-python setup.py build-python setup.py install- -
As of Recoll 1.19, - the module can be compiled for Python3.
- -The normal Recoll - installer installs the Python2 API along with the main - code. The Python3 version must be explicitely built and - installed.
- -When installing from a repository, and depending on - the distribution, the Python API can sometimes be found - in a separate package.
- -The following small sample will run a query and list
- the title and url for each of the results. It would
- work with Recoll 1.19
- and later. The python/samples source directory
- contains several examples of Python programming with
- Recoll, exercising the
- extension more completely, and especially its data
- extraction features.
- from recoll import recoll
-
- db = recoll.connect()
- query = db.query()
- nres = query.execute("some query")
- results = query.fetchmany(20)
- for doc in results:
- print(doc.url, doc.title)
-
-
- The recoll module
contains functions and classes used to query (or
- update) the index.
connect()
+ The connect()
function connects to one or several
Recoll
index(es) and returns a Db object.
+ "literal">Db object.
confdir may specify a
- configuration directory. The usual defaults
- apply.confdir
+ may specify a configuration directory.
+ The usual defaults apply.
extra_dbs is a list of
- additional indexes (Xapian
- directories).extra_dbs
+ is a list of additional indexes (Xapian
+ directories).
writable decides if we can
- index new data through this
- connection.writable
+ decides if we can index new data through
+ this connection.
This call initializes the recoll module, and + it should always be performed before any other + call or object creation.
@@ -6710,8 +6784,8 @@ orDb object after this.Closes the connection. You can't do
+ anything with the Db object after this.
Query object for this
- index.These aliases return a blank Query object for this
+ index.
maxchars defines
- the maximum total size of the abstract.
- contextwords
- defines how many terms are shown around the
- keyword.Set the parameters used to build snippets
+ (sets of keywords in context text fragments).
+ maxchars defines
+ the maximum total size of the abstract.
+ contextwords
+ defines how many terms are shown around the
+ keyword.
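A minimal sketch of tuning the snippet parameters before running a query, assuming the recoll package is installed; the query string and fetch size are illustrative values:

```python
try:
    from recoll import recoll
except ImportError:      # the recoll package may not be installed
    recoll = None

def search_with_snippets(qstring, maxchars=150, contextwords=6):
    # Adjust abstract building on the Db object, then run the query;
    # the returned Doc objects will use the requested snippet sizing.
    db = recoll.connect()
    db.setAbstractParams(maxchars=maxchars, contextwords=contextwords)
    query = db.query()
    query.execute(qstring)
    return query.fetchmany(10)
```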
match_type can be either of
- wildcard,
- regexp or
- stem. Returns a
- list of terms expanded from the input
- expression.Expand an expression against the index
+ term list. Performs the basic function from
+ the GUI term explorer tool. match_type can be either of
+ wildcard,
+ regexp or
+ stem. Returns a
+ list of terms expanded from the input
+ expression.
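A hedged sketch of term expansion. The method name termMatch is an assumption based on the recoll module's interface (the fragment above lost the method name); verify the exact name and signature against your installed version:

```python
try:
    from recoll import recoll
except ImportError:      # the recoll package may not be installed
    recoll = None

def expand_expression(expr, match_type="wildcard"):
    # Expand an expression against the index term list, performing
    # the basic function of the GUI term explorer tool.
    # match_type is one of "wildcard", "regexp" or "stem".
    db = recoll.connect()
    return db.termMatch(match_type, expr)
```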
fieldname, in
- ascending or descending order. Must be called
- before executing the search.Sort results by fieldname, in
+ ascending or descending order. Must be called
+ before executing the search.
query_string, a
- Recoll search
- language string.Starts a search for query_string,
+ a Recoll
+ search language string.
Starts a search for the query defined by + the SearchData object.
+Doc objects in the current
- search results, and returns them as an array of
- the required size, which is by default the
- value of the arraysize data member.Fetches the next Doc objects in the current
+ search results, and returns them as an array
+ of the required size, which is by default the
+ value of the arraysize data member.
Doc object from the current
- search results.Fetches the next Doc object from the current
+ search results.
Closes the query. The object is unusable + after the call.
+mode can be
- relative or
- absolute.Adjusts the position in the current result
+ set. mode can be
+ relative or
+ absolute.
Retrieves the expanded query terms as a + list of pairs. Meaningful only after + executexx. In each pair, the first entry is a + list of user terms (of size one for simple + terms, or more for group and phrase clauses), + the second a list of query terms as derived + from the user terms and used in the Xapian + Query.
+Return the Xapian query description as a + Unicode string. Meaningful only after + executexx.
+ishtml can be set
- to indicate that the input text is HTML and
- that HTML special characters should not be
- escaped. methods
- if set should be an object with methods
- startMatch(i) and endMatch() which will be
- called for each match and should return a begin
- and end tagWill insert <span "class=rclmatch">,
+ </span> tags around the match areas in
+ the input text and return the modified text.
+ ishtml can be
+ set to indicate that the input text is HTML
+ and that HTML special characters should not
+ be escaped. methods if set should be an
+ object with methods startMatch(i) and
+ endMatch() which will be called for each
+ match and should return a begin and end
+ tag.
doc (a Doc object) by selecting text
- around the match terms. If methods is set, will
- also perform highlighting. See the highlight
- method.Create a snippets abstract for
+ doc (a
+ Doc object) by
+ selecting text around the match terms. If
+ methods is set, will also perform
+ highlighting. See the highlight method.
for doc in query: will
- work.So that things like for doc in query: will
+ work.
Default number of records processed by + fetchmany (r/w).
+Number of records returned by the last + execute.
+scroll()). Starts at 0.Next index to be fetched from results.
+ Normally increments after each fetchone()
+ call, but can be set/reset before the call to
+ effect seeking (equivalent to using
+ scroll()).
+ Starts at 0.
Retrieve the named doc attribute. You can
+ also use getattr(doc,
+ key) or doc.key.
Set the named doc attribute. You can
+ also use setattr(doc,
+ key, value).
Retrieve the URL in byte array format (no + transcoding), for use as parameter to a + system call.
+Set the URL in byte array format (no + transcoding).
Return a dictionary of doc object + keys/values.
+list of doc object keys (attribute + names).
+Extractor
- object is built from a Doc object, output from a
- query.An Extractor
+ object is built from a Doc object, output from a
+ query.
ipath and
return a Doc
object. The doc.text field has the document
text converted to either text/plain or
text/html according to doc.mimetype. The
- typical use would be as follows:
+ typical use would be as follows:
qdoc = query.fetchone()
extractor = recoll.Extractor(qdoc)
@@ -7106,10 +7252,10 @@ doc = extractor.textextract(qdoc.ipath)
outfile='')
Extracts document into an output file, + which can be given explicitly or will be + created as a temporary file to be deleted by + the caller. Typical use:
qdoc = query.fetchone()
extractor = recoll.Extractor(qdoc)
@@ -7127,9 +7273,9 @@ filename = extractor.idoctofile(qdoc.ipath, qdoc.mimetype)
@@ -7167,26 +7313,281 @@ for i in range(nres):
+
+
+ The update API can be used to create an index from + data which is not accessible to the regular Recoll indexer, or structured to + present difficulties to the Recoll input handlers.
+ +An indexer created using this API will have + the same work to do as the Recoll file system + indexer: look for modified documents, extract their text, + call the API to index it, and purge from the + index the data for documents which no longer exist in + the document store.
+ +The data for such an external indexer should be stored + in an index separate from any used by the Recoll internal file system indexer. + The reason is that the main document indexer purge pass + (removal of deleted documents) would also remove all the + documents belonging to the external indexer, as they were + not seen during the filesystem walk. The main indexer + documents would also probably be a problem for the + external indexer own purge operation.
+ +While there would be ways to enable multiple foreign + indexers to cooperate on a single index, it is just + simpler to use separate ones, and use the multiple index + access capabilities of the query interface, if + needed.
+ +There are two parts in the update interface:
+ +Methods inside the recoll module allow inserting
+ data into the index, to make it accessible by the
+ normal query interface.
An interface based on script execution is
+ defined to allow either the GUI or the rclextract module to access
+ original document data for previewing or
+ editing.
The following code fragments can be used to ensure - that code can run with both the old and the new API (as - long as it does not use the new abilities of the new - API of course).
+The update methods are part of the recoll module described above. The
+ connect() method is used with a writable=True parameter to obtain a
+ writable Db object. The
+ following Db object
+ methods are then available.
Adapting to the new package structure:
Add or update index data for a given document.
+ The udi
+ string must define a unique id for the document.
+ It is an opaque interface element and not
+ interpreted inside Recoll. doc is a Doc object,
+ created from the data to be indexed (the main
+ text should be in doc.text). If parent_udi
+ is set, this is a unique identifier for the
+ top-level container (e.g. for the filesystem
+ indexer, this would be the one which is an actual
+ file).
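The update sequence can be sketched as follows. This is an illustration, not a definitive implementation: the configuration directory, udi scheme, and field values are invented, and it assumes the recoll package is installed with a writable configuration available:

```python
try:
    from recoll import recoll
except ImportError:      # availability depends on the installation
    recoll = None

def index_text_document(confdir, udi, url, title, text, sig):
    # Open a writable connection on a dedicated configuration,
    # kept separate from the main index as recommended above.
    db = recoll.connect(confdir=confdir, writable=True)
    doc = recoll.Doc()
    doc.url = url
    doc.mimetype = "text/plain"
    doc.title = title
    doc.text = text          # the main text to be indexed
    doc.sig = sig            # opaque signature, compared by needUpdate()
    db.addOrUpdate(udi, doc)
```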
Purge the index of all data for udi, and for all documents (if any)
+ which have a matching parent_udi.
Test if the index needs to be updated for the
+ document identified by udi. If this call is to be used,
+ the doc.sig field
+ should contain a signature value when calling
+ addOrUpdate(). The
+ needUpdate() call
+ then compares its parameter value with the stored
+ sig for udi. sig is an opaque value, compared
+ as a string.
The filesystem indexer uses a concatenation of + the decimal string values for file size and + update time, but a hash of the contents could + also be used.
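The convention just described can be sketched as a small helper. This is not Recoll's own code, only an illustration of the size-plus-mtime scheme; since sig is compared as an opaque string, any stable scheme would work equally well:

```python
import os

def file_sig(path):
    # Decimal file size concatenated with the decimal modification
    # time, mimicking the filesystem indexer's convention.
    st = os.stat(path)
    return "%d%d" % (st.st_size, int(st.st_mtime))
```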
+ +As a side effect, if the return value is false
+ (the index is up to date), the call will set the
+ existence flag for the document (and any
+ subdocument defined by its parent_udi), so that a later
+ purge() call will
+ preserve them.
The use of needUpdate() and purge() is optional, and the
+ indexer may use another method for checking the
+ need to reindex or to delete stale entries.
Delete all documents that were not touched + during the just-finished indexing pass (since + open-for-write). These are the documents for which the + needUpdate() call was not performed, indicating + that they no longer exist in the primary storage + system.
+Recoll has internal
+ methods to access document data for its internal
+ (filesystem) indexer. An external indexer needs to
+ provide data access methods if it needs integration
+ with the GUI (e.g. preview function), or support for
+ the rclextract
+ module.
The index data and the access method are linked by
+ the rclbes (recoll backend
+ storage) Doc field. You
+ should set this to a short string value identifying
+ your indexer (e.g. the filesystem indexer uses either
+ "FS" or an empty value, the Web history indexer uses
+ "BGL").
The link is actually performed inside a backends configuration file (stored
+ in the configuration directory). This defines commands
+ to execute to access data from the specified indexer.
+ Example, for the mbox indexing sample found in the
+ Recoll source (which sets rclbes="MBOX"):
+[MBOX]
+fetch = /path/to/recoll/src/python/samples/rclmbox.py fetch
+makesig = /path/to/recoll/src/python/samples/rclmbox.py makesig
+
fetch and makesig define two commands to execute
+ to respectively retrieve the document text and compute
+ the document signature (the example implementation uses
+ the same script with different first parameters to
+ perform both operations).
The scripts are called with three additional
+ arguments: udi,
+ url and ipath, stored with the document when
+ it was indexed; the script may use any or all of them to perform the
+ requested operation. The caller expects the result data
+ on stdout.
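A skeleton for such a data access script. The in-memory store and the udi value are invented for illustration; a real script would read from wherever the external indexer actually keeps its documents:

```python
import sys

# Toy store standing in for the indexer's real storage; keys are the
# udi values that were used at indexing time.
STORE = {"mydocs|note1": b"text of note one"}

def run(op, udi, url, ipath):
    # The backends file maps each command name ("fetch", "makesig") to
    # this script with the operation as first argument; Recoll appends
    # udi, url and ipath, and reads the result from stdout.
    if op == "fetch":
        return STORE[udi]
    if op == "makesig":
        # Any opaque signature works; here, the document length.
        return str(len(STORE[udi])).encode("ascii")
    raise ValueError("unknown operation: " + op)

if __name__ == "__main__" and len(sys.argv) >= 5:
    sys.stdout.buffer.write(run(sys.argv[1], sys.argv[2],
                                sys.argv[3], sys.argv[4]))
```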
The Recoll source tree has two samples of external
+ indexers in the src/python/samples directory. The
+ more interesting one is rclmbox.py which indexes a directory
+ containing mbox folder
+ files. It exercises most features in the update
+ interface, and has a data access interface.
See the comments inside the file for more + information.
+The following code fragments can be used to ensure + that code can run with both the old and the new API (as + long as it does not use the new abilities of the new API + of course).
+ +Adapting to the new package structure:
+
try:
from recoll import recoll
@@ -7196,21 +7597,21 @@ except:
import recoll
hasextract = False
+
- Adapting to the change of nature of the next Query member. The same test can be
- used to choose to use the scroll() method (new) or set the
- next value (old).
+Adapting to the change of nature of the
+next Query member. The
+same test can be used to choose to use the
+scroll() method (new) or to set the
+next value (old):
+
+rownum = query.next if type(query.next) == int else \
+        query.rownumber
+