From 583877757ad6c9237c18799e7299ba75a9e93c60 Mon Sep 17 00:00:00 2001 From: dockes Date: Fri, 10 Oct 2008 08:19:12 +0000 Subject: [PATCH] added python api doc --- src/doc/user/usermanual.sgml | 321 ++++++++++++++++++++++++++++++++++- 1 file changed, 319 insertions(+), 2 deletions(-) diff --git a/src/doc/user/usermanual.sgml b/src/doc/user/usermanual.sgml index d81047e5..0a58707b 100644 --- a/src/doc/user/usermanual.sgml +++ b/src/doc/user/usermanual.sgml @@ -24,7 +24,7 @@ Dockes - $Id: usermanual.sgml,v 1.66 2008-10-08 16:12:36 dockes Exp $ + $Id: usermanual.sgml,v 1.67 2008-10-10 08:19:12 dockes Exp $ This document introduces full text search notions @@ -1575,12 +1575,329 @@ fvwm Your main database (the one the current configuration indexes to), is always implicitly active. If this is not desirable, you can set up your configuration so that it indexes, - for example, an empty directory. + for example, an empty directory. An alternative indexer may also + need to implement a way of purging the index from stale data, + + + Programming interface + + + Interface elements + + A few elements in the interface are specific and need + an explanation. + + + + + udi A udi (unique document + identifier) identifies a document. Because of limitations + inside the index engine, it is restricted in length (to + 200 bytes), which is why a regular URI cannot be used. The + structure and contents of the udi are defined by the + application and opaque to the index engine. For example, + the internal file system indexer uses the complete + document path (file path + internal path), truncated to + length, the suppressed part being replaced by a hash + value. + + + + ipath + + This data value (set as a field in the Doc + object) is stored, along with the URL, but not indexed by + &RCL;. Its contents are not interpreted, and its use is up + to the application. 
For example, the &RCL; internal file + system indexer stores the part of the document access path + internal to the container file (ipath in + this case is a list of subdocument sequential numbers). url + and ipath are returned in every search result and permit + access to the original document. + + + + + Stored and indexed fields + + The fields file inside + the &RCL; configuration defines which document fields are + either "indexed" (searchable), "stored" (retrievable with + search results), or both. + + + + + + Data for an external indexer should be stored in a + separate index, not the one for the &RCL; internal file system + indexer (except if the latter is not used at all). The reason + is that the main document indexer purge pass would remove all + the other indexer's documents, as they were not seen during + indexing. The main indexer documents would also probably be a + problem for the external indexer purge operation. + + + + + Python interface + + + Introduction + + &RCL; versions after 1.11 define a Python programming + interface, both for searching and indexing. + + The Python interface is not built by default and can be + found in the source package, under python/recoll. The + directory contains the usual setup.py + script which you can use to build and install the + module: + + + cd recoll-xxx/python/recoll + python setup.py build + python setup.py install + + + + + + + + Interface manual + + +NAME + recoll - This is an interface to the Recoll full text indexer. + +FILE + /usr/local/lib/python2.5/site-packages/recoll.so + +CLASSES + Db + Doc + Query + SearchData + + class Db(__builtin__.object) + | Db([confdir=None], [extra_dbs=None], [writable = False]) + | + | A Db object holds a connection to a Recoll index. Use the connect() + | function to create one. + | confdir specifies a Recoll configuration directory (default: + | $RECOLL_CONFDIR or ~/.recoll). 
+ | extra_dbs is a list of external databases (xapian directories) + | writable decides if we can index new data through this connection + | + | Methods defined here: + | + | + | addOrUpdate(...) + | addOrUpdate(udi, doc, parent_udi=None) -> None + | Add or update index data for a given document + | The udi string must define a unique id for the document. It is not + | interpreted inside Recoll + | doc is a Doc object + | if parent_udi is set, this is a unique identifier for the + | top-level container (e.g. an mbox file) + | + | delete(...) + | delete(udi) -> Bool. + | Purge index from all data for udi. If udi matches a container + | document, purge all subdocs (docs with a parent_udi matching udi). + | + | makeDocAbstract(...) + | makeDocAbstract(Doc, Query) -> string + | Build and return 'keyword-in-context' abstract for document + | and query. + | + | needUpdate(...) + | needUpdate(udi, sig) -> Bool. + | Check if the index is up to date for the document defined by udi, + | having the current signature sig. + | + | purge(...) + | purge() -> Bool. + | Delete all documents that were not touched during the just finished + | indexing pass (since open-for-write). These are the documents for + | which the needUpdate() call was not performed, indicating that they no + | longer exist in the primary storage system. + | + | query(...) + | query() -> Query. Return a new, blank query object for this index. + | + | setAbstractParams(...) + | setAbstractParams(maxchars, contextwords). + | Set the parameters used to build 'keyword-in-context' abstracts + | + | ---------------------------------------------------------------------- + | Data and other attributes defined here: + | + + class Doc(__builtin__.object) + | Doc() + | + | A Doc object contains index data for a given document. + | The data is extracted from the index when searching, or set by the + | indexer program when updating. The Doc object has no useful methods but + | many attributes to be read or set by its user. 
It matches exactly the + | Rcl::Doc C++ object. Some of the attributes are predefined, but, + | especially when indexing, others can be set, the name of which will be + | processed as field names by the indexing configuration. + | Inputs can be specified as unicode or strings. + | Outputs are unicode objects. + | All dates are specified as unix timestamps, printed as strings + | Predefined attributes (index/query/both): + | text (index): document plain text + | url (both) + | fbytes (both) optional file size in bytes + | filename (both) + | fmtime (both) optional file modification date. Unix time printed + | as string + | dbytes (both) document text bytes + | dmtime (both) document creation/modification date + | ipath (both) value private to the app.: internal access path + | inside file + | mtype (both) mime type for original document + | mtime (query) dmtime if set else fmtime + | origcharset (both) charset the text was converted from + | size (query) dbytes if set, else fbytes + | sig (both) app-defined file modification signature. + | For up to date checks + | relevancyrating (query) + | abstract (both) + | author (both) + | title (both) + | keywords (both) + | + | Methods defined here: + | + | + | ---------------------------------------------------------------------- + | Data and other attributes defined here: + | + + class Query(__builtin__.object) + | Recoll Query objects are used to execute index searches. + | They must be created by the Db.query() method. + | + | Methods defined here: + | + | + | execute(...) + | execute(query_string, stemming=1|0) + | + | Starts a search for query_string, a Recoll search language string + | (mostly Xesam-compatible). + | The query can be a simple list of terms (and'ed by default), or more + | complicated with field specs etc. See the Recoll manual. + | + | executesd(...) + | executesd(SearchData) + | + | Starts a search for the query defined by the SearchData object. + | + | fetchone(...) 
+ | fetchone(None) -> Doc + | + | Fetches the next Doc object in the current search results. + | + | sortby(...) + | sortby(field=fieldname, ascending=true) + | Sort results by 'fieldname', in ascending or descending order. + | Only one field can be used, no subsorts for now. + | Must be called before executing the search + | + | ---------------------------------------------------------------------- + | Data descriptors defined here: + | + | next + | Next index to be fetched from results. Normally increments after + | each fetchone() call, but can be set/reset before the call to effect + | seeking. Starts at 0 + | + | ---------------------------------------------------------------------- + | Data and other attributes defined here: + | + + class SearchData(__builtin__.object) + | SearchData() + | + | A SearchData object describes a query. It has a number of global + | parameters and a chain of search clauses. + | + | Methods defined here: + | + | + | addclause(...) + | addclause(type='and'|'or'|'excl'|'phrase'|'near'|'sub', + | qstring=string, slack=int, field=string, stemming=1|0, + | subSearch=SearchData) + | Adds a simple clause to the SearchData And/Or chain, or a subquery + | defined by another SearchData object + | + | ---------------------------------------------------------------------- + | Data and other attributes defined here: + | + +FUNCTIONS + connect(...) + connect([confdir=None], [extra_dbs=None], [writable = False]) + -> Db. + + Connects to a Recoll database and returns a Db object. + confdir specifies a Recoll configuration directory + (the default is built like for any Recoll program). + extra_dbs is a list of external databases (xapian directories) + writable decides if we can index new data through this connection + + + + + + + Example code + + The following sample would query the index with a user + language string. See the python/samples + directory inside the &RCL; source for other examples. 
+ + +#!/usr/bin/env python + +import recoll + +db = recoll.connect() +db.setAbstractParams(maxchars=80, contextwords=2) + +query = db.query() +nres = query.execute("some user question") +print "Result count: ", nres +if nres > 5: + nres = 5 +while query.next >= 0 and query.next < nres: + doc = query.fetchone() + print query.next + for k in ("title", "size"): + print k, ":", getattr(doc, k).encode('utf-8') + abs = db.makeDocAbstract(doc, query).encode('utf-8') + print abs + print + + + + + + + + + Installation