More documentation can be found in the doc/ directory or at http://www.recoll.org

                           Recoll user manual

                          Jean-Francois Dockes

                   <jean-francois.dockes@wanadoo.fr>

Copyright (c) 2005 Jean-Francois Dockes

This document introduces full text search notions and describes the
installation and use of the Recoll application. It currently describes
Recoll 1.12.

----------------------------------------------------------------------

Table of Contents

1. Introduction
     1.1. Giving it a try
     1.2. Full text search
     1.3. Recoll overview

2. Indexing
     2.1. Introduction
     2.2. Index storage
          2.2.1. Xapian index formats
          2.2.2. Security aspects
     2.3. Indexing configuration
          2.3.1. The indexing configuration GUI
     2.4. Periodic indexing
          2.4.1. Starting indexing
          2.4.2. Using cron to automate indexing
     2.5. Real time indexing

3. Searching with the Qt graphical user interface
     3.1. Simple search
     3.2. The result list
          3.2.1. The result list right-click menu
     3.3. The preview window
     3.4. The query language
     3.5. Complex/advanced search
     3.6. The term explorer tool
     3.7. More about wildcards
     3.8. Multiple databases
     3.9. Document history
     3.10. Sorting search results and collapsing duplicates
     3.11. Search tips, shortcuts
          3.11.1. Terms and search expansion
          3.11.2. Working with phrases and proximity
          3.11.3. Others
     3.12. Customizing the search interface

4. Searching with the KDE KIO slave
     4.1. What's this
     4.2. Searchable documents

5. Searching on the command line

6. Programming interface
     6.1. Writing a document filter
          6.1.1. Filter HTML output
     6.2. Field data processing configuration
     6.3. API
          6.3.1. Interface elements
          6.3.2. Python interface

7. Installation
     7.1. Installing a prebuilt copy
          7.1.1. Installing through a package system
          7.1.2. Installing a prebuilt Recoll
     7.2. Supporting packages
     7.3. Building from source
          7.3.1. Prerequisites
          7.3.2. Building
          7.3.3. Installation
     7.4. Configuration overview
          7.4.1. Main configuration file
          7.4.2. The mimemap file
          7.4.3. The mimeconf file
          7.4.4. The mimeview file
          7.4.5. Examples of configuration adjustments
     7.5. The KDE Kicker Recoll applet

----------------------------------------------------------------------

Chapter 1. Introduction

1.1. Giving it a try

If you do not like reading manuals (who does?) and would like to give
Recoll a try, just perform installation and start the recoll user
interface, which will index your home directory by default, allowing you
to search immediately after indexing completes.

Do not do this if your home directory contains a huge number of documents
and you do not want to wait or are very short on disk space. In this case,
you may first want to customize the configuration to restrict the indexed
area.

Also be aware that you may need to install the appropriate supporting
applications for document types that need them (for example antiword for
ms-word files).

----------------------------------------------------------------------

1.2. Full text search

Recoll is a full text search application. Full text search applications
let you find your data by content rather than by external attributes (like
a file name). More specifically, they will let you specify words (terms)
that should or should not appear in the text you are looking for, and
return a list of matching documents, ordered so that the most relevant
documents will appear first.

You do not need to remember in what file or email message you stored a
given piece of information. You just ask for related terms, and the tool
will return a list of documents where those terms are prominent, in a
similar way to Internet search engines.

Recoll tries to determine which documents are most relevant to the search
terms you provide. Computer algorithms for determining relevance can be
very complex, and in general are inferior to the power of the human mind
to rapidly determine relevance. The quality of relevance guessing by the
search tool is probably the most important element for a search
application.

In many cases, you are looking for all the forms of a word, not for a
specific form or spelling. These different forms may include plurals,
different tenses for a verb, or terms derived from the same root or stem
(example: floor, floors, floored, flooring...). Recoll will by default
expand queries to all such related terms (words that reduce to the same
stem). This expansion can be disabled at search time.

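The idea of expanding a query term to all index terms sharing its stem can
be sketched in a few lines of Python. This is only an illustration: the
suffix-stripping rule and the names toy_stem, build_stem_db and expand are
assumptions made for the example, and do not reflect the stemmer or the data
structures that Recoll and Xapian actually use.

```python
from collections import defaultdict

# Hypothetical, English-only suffix list used by this toy stemmer.
SUFFIXES = ("ing", "ed", "s")


def toy_stem(term):
    """Reduce a term to a crude stem by stripping one known suffix."""
    term = term.lower()
    for suffix in SUFFIXES:
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term


def build_stem_db(index_terms):
    """Map each stem to the set of unstemmed index terms that reduce to it."""
    stem_db = defaultdict(set)
    for term in index_terms:
        stem_db[toy_stem(term)].add(term)
    return stem_db


def expand(term, stem_db):
    """Return every index term sharing the stem of the query term."""
    return stem_db.get(toy_stem(term), {term})


index_terms = ["floor", "floors", "floored", "flooring", "door"]
stem_db = build_stem_db(index_terms)
print(sorted(expand("floors", stem_db)))
```

A query for floors is expanded to the four floor variants, while door is
left alone because no other index term shares its stem.
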
Stemming, by itself, does not accommodate misspellings or phonetic
searches. Recoll supports these features through a specific tool (the term
explorer) which will let you explore the set of index terms along
different modes.

----------------------------------------------------------------------

1.3. Recoll overview

Recoll uses the Xapian information retrieval library as its storage and
retrieval engine. Xapian is a very mature package using a sophisticated
probabilistic ranking model. Recoll provides the interface to get data
into (indexing) and out (searching) of the system.

In practice, Xapian works by remembering where terms appear in your
document files. The acquisition process is called indexing.

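"Remembering where terms appear" is the classic inverted index idea. The
following Python sketch shows the principle on two tiny in-memory documents.
It is a deliberate simplification with hypothetical names (build_index, the
sample file names) and says nothing about how Xapian stores its data on disk.

```python
from collections import defaultdict


def build_index(documents):
    """Map each term to {document name: [word positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for name, text in documents.items():
        for position, word in enumerate(text.lower().split()):
            index[word][name].append(position)
    return index


# Hypothetical sample documents.
docs = {
    "a.txt": "virtual reality is here",
    "b.txt": "reality is not virtual",
}
index = build_index(docs)

# Where does the term "virtual" appear?
print(dict(index["virtual"]))
```

Keeping positions, and not only document names, is what makes phrase
searches (adjacent words in a given order) possible.
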
The resulting index can be big (roughly the size of the original document
set), but it is not a document archive. Recoll can only display documents
that still exist at the place from which they were indexed. (Actually,
there is a way to reconstruct a document from the information in the
index, but the result is not nice, as all formatting, punctuation and
capitalization are lost).

Recoll stores all internal data in Unicode UTF-8 format, and it can index
files with different character sets, encodings, and languages into the
same index. It has input filters for many document types.

Stemming depends on the document language. Recoll stores the unstemmed
versions of terms and uses auxiliary databases for term expansion. It can
switch stemming languages, or add a language, without re-indexing. Storing
documents in different languages in the same index is possible, and useful
in practice, but does introduce possibilities of confusion. Recoll
currently makes no attempt at automatic language recognition.

Recoll has many parameters which define exactly what to index, and how to
classify and decode the source documents. These are kept in configuration
files. A default configuration is copied into a standard location (usually
something like /usr/[local/]share/recoll/examples) during installation.
The default parameters from this file may be overridden by values that you
set inside your personal configuration, found by default in the .recoll
sub-directory of your home directory. The default configuration will index
your home directory with default parameters and should be sufficient for
giving Recoll a try, but you may want to adjust it later.

Indexing is started automatically the first time you execute the recoll
search graphical user interface, or by executing the recollindex command.

Searches are performed inside the recoll program, which has many options
to help you find what you are looking for.

----------------------------------------------------------------------

Chapter 2. Indexing

2.1. Introduction

Indexing is the process by which the set of documents is analyzed and the
data entered into the database. Recoll indexing is normally incremental:
documents will only be processed if they have been modified. On the first
execution, of course, all documents will need processing. A full index
build can be forced later by specifying an option to the indexing command
(recollindex -z).

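The incremental principle can be sketched as follows. This manual does not
state how Recoll decides that a document was modified, so the (modification
time, size) signature used here is an assumption, as are the function and
variable names.

```python
def files_to_index(current, previous, full_rebuild=False):
    """Return the paths that need (re)processing.

    current and previous map a path to a (mtime, size) signature; this
    signature is an assumption made for the sketch.
    """
    if full_rebuild:
        # Comparable in spirit to forcing a full build with recollindex -z.
        return sorted(current)
    return sorted(
        path for path, signature in current.items()
        if previous.get(path) != signature
    )


# Hypothetical state recorded by the previous indexing run.
previous = {"notes.txt": (100, 10), "mail.mbox": (200, 20)}
# State observed now: mail.mbox changed, report.pdf is new.
current = {"notes.txt": (100, 10), "mail.mbox": (300, 25), "report.pdf": (400, 5)}

print(files_to_index(current, previous))
print(files_to_index(current, previous, full_rebuild=True))
```

Only the changed and the new file are selected by default, which is why
incremental runs are usually much faster than the first one.
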
Recoll indexing can be performed with two different methods:

  * Periodic indexing: indexing takes place at discrete times, by
    executing the recollindex command. The typical usage is to have a
    nightly indexing run programmed into your cron file.

  * Real time indexing: indexing takes place as soon as a file is created
    or changed. recollindex runs as a daemon and uses a file system
    alteration monitor such as Fam, Gamin or inotify to detect file
    changes.

The choice between the two methods is mostly a matter of preference, and
they can be combined by setting up multiple indexes (e.g. use periodic
indexing on a big documentation directory, and real time indexing on a
small home directory). Monitoring a big file system tree can consume
significant system resources, for dubious gains.

Recoll knows about quite a few different document types. The parameters
for document type recognition and processing are set in configuration
files. Most file types, like HTML or word processing files, only hold one
document. Some file types, like mail folder files, can hold many
individually indexed documents.

Recoll indexing processes plain text, HTML, openoffice and e-mail files
internally.

Other file types (e.g. postscript, pdf, ms-word, rtf...) need external
applications for preprocessing. The list is in the installation section.
After every indexing operation, Recoll updates a list of commands that
would be needed for indexing existing file types. This list can be
displayed from the recoll File menu. It is stored in a text file named
missing inside the configuration directory.

Without further configuration, Recoll will index all appropriate files
from your home directory, with a reasonable set of defaults.

In some cases, it may be interesting to index different areas of the file
system to separate databases. You can do this by using multiple
configuration directories, each indexing a file system area to a specific
database. See the section about using multiple databases for more
information on multiple configurations and indexes.

----------------------------------------------------------------------

2.2. Index storage

The default location for the index data is the xapiandb subdirectory of
the Recoll configuration directory, typically $HOME/.recoll/xapiandb/.
This can be changed via two different methods (with different purposes):

  * You can specify a different configuration directory by setting the
    RECOLL_CONFDIR environment variable, or using the -c option to the
    Recoll commands. This method would typically be used to index
    different areas of the file system to different indexes. For example,
    if you were to issue the following commands:

        export RECOLL_CONFDIR=~/.indexes-email
        recoll

    then Recoll would use configuration files stored in ~/.indexes-email/
    and (unless specified otherwise in recoll.conf) would look for the
    index in ~/.indexes-email/xapiandb/.

    Using multiple configuration directories and configuration options
    allows you to tailor multiple configurations and indexes to handle
    whatever subset of the available data you wish to make searchable.

  * You can also specify a different storage location for the index by
    setting the dbdir parameter in the configuration file (see the
    configuration section). This method would mainly be of use if you
    wanted to keep the configuration directory in its default location,
    but desired another location for the index, typically because of disk
    space concerns.

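The way these settings combine can be summarized by a small Python sketch.
It is an illustration only, not Recoll code: the function name
resolve_locations is hypothetical, the sketch assumes that a -c option takes
precedence over RECOLL_CONFDIR (this manual does not state the order), and it
treats dbdir as an absolute path.

```python
import os


def resolve_locations(cli_confdir=None, environ=None, dbdir=None):
    """Return (configuration directory, index directory).

    cli_confdir stands for the value of a -c option, dbdir for the dbdir
    configuration parameter. The precedence of -c over RECOLL_CONFDIR is an
    assumption made for this sketch.
    """
    environ = os.environ if environ is None else environ
    home = environ.get("HOME", os.path.expanduser("~"))
    confdir = (
        cli_confdir
        or environ.get("RECOLL_CONFDIR")
        or os.path.join(home, ".recoll")
    )
    index_dir = dbdir if dbdir else os.path.join(confdir, "xapiandb")
    return confdir, index_dir


# Default setup for a hypothetical user whose home is /home/me.
print(resolve_locations(environ={"HOME": "/home/me"}))

# Separate configuration (and thus separate index) for email.
print(resolve_locations(environ={
    "HOME": "/home/me",
    "RECOLL_CONFDIR": "/home/me/.indexes-email",
}))

# Default configuration directory, index moved to another disk (made-up path).
print(resolve_locations(environ={"HOME": "/home/me"}, dbdir="/bigdisk/xapiandb"))
```

The first method moves both the configuration and the index, whereas dbdir
moves the index alone.
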
The size of the index is determined by the size of the set of documents,
but the ratio can vary a lot. For a typical mixed set of documents, the
index size will often be close to the data set size. In specific cases (a
set of compressed mbox files for example), the index can become much
bigger than the documents. It may also be much smaller if the documents
contain a lot of images or other non-indexed data (an extreme example
being a set of mp3 files where only the tags would be indexed).

Of course, images, sound and video do not increase the index size. As a
result, it is quite typical nowadays (2006) for even a big index to be
negligible compared to the total amount of data on the computer.

The index data directory (xapiandb) only contains data that can be
completely rebuilt by an index run, and it can always be destroyed safely.

----------------------------------------------------------------------

2.2.1. Xapian index formats

If your first installation of Recoll was 1.9.0 or more recent, you can
skip this section.

Xapian has had two possible index formats for quite some time: the "old"
one, named Quartz, and the new one, named Flint. Xapian 0.9 used Quartz by
default, but could use Flint if a specific environment variable
(XAPIAN_PREFER_FLINT) was set. Xapian 1.0 still supports Quartz but will
use Flint by default for new index creations.

The number of disk accesses performed during indexing has been much
optimized in the new Flint engine and you may see indexing times improved
by 50% in some cases (compared to Quartz), typically for big indexes where
disk accesses dominate the indexing time. There is also a more modest
improvement of index size.

Xapian will not automatically convert an existing index from the Quartz to
the Flint format. If you have an older index and want to take advantage of
the new format (which can be done without setting the environment variable
as of Recoll 1.8.2 and Xapian 1.0.0), you will have to explicitly delete
the old index, then run a normal indexing process.

Unfortunately, using the -z option to recollindex is not sufficient to
change the format: you have to delete all files inside the index directory
(typically ~/.recoll/xapiandb) before starting indexing.

----------------------------------------------------------------------

2.2.2. Security aspects

The Recoll index does not hold copies of the indexed documents. But it
does hold enough data to allow for an almost complete reconstruction. If
confidential data is indexed, access to the database directory should be
restricted.

As of version 1.4, Recoll will create the configuration directory with a
mode of 0700 (access by owner only). As the index data directory is by
default a sub-directory of the configuration directory, this should result
in appropriate protection.

If you use another setup, you should think of the kind of protection you
need for your index, set the directory and file access modes
appropriately, and also maybe adjust the umask used during index updates.

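For a non-default index location, owner-only protection can be applied by
hand. The Python sketch below shows one possible way on a POSIX system; the
function name make_private_dir and the directory used are hypothetical, and
the demonstration runs in a temporary directory so that it does not touch a
real index.

```python
import os
import stat
import tempfile


def make_private_dir(path):
    """Create path if needed and restrict it to its owner (mode 0700)."""
    os.makedirs(path, mode=0o700, exist_ok=True)
    # chmod also fixes the mode of a directory that already existed.
    os.chmod(path, 0o700)
    return stat.S_IMODE(os.stat(path).st_mode)


# A restrictive umask keeps files created later (e.g. during index updates)
# private as well. The previous value is restored afterwards.
previous_umask = os.umask(0o077)
try:
    with tempfile.TemporaryDirectory() as base:
        # Hypothetical index location, used only for this demonstration.
        mode = make_private_dir(os.path.join(base, "private-index"))
        print(oct(mode))
finally:
    os.umask(previous_umask)
```
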
----------------------------------------------------------------------

2.3. Indexing configuration

Variables set inside the Recoll configuration files control which areas of
the file system are indexed, and how files are processed. These variables
can be set either by editing the text files or using the dialogs in the
recoll GUI.

You can also use multiple indexes defined by separate configurations,
typically to separate personal and shared indexes, or to take advantage of
the organization of your data to improve search precision.

The first time you start recoll, you will be asked whether or not you
would like recoll to build the index. If you want to adjust the
configuration before indexing, just click Cancel at this point. That way,
recoll will have created a ~/.recoll directory containing empty
configuration files.

The configuration is documented inside the installation chapter of this
document, or in the recoll.conf(5) man page, but the most current
information will most likely be the comments inside the sample file. The
most immediately useful variable you may be interested in is probably
topdirs, which determines what subtrees get indexed.

The applications needed to index file types other than text, HTML or email
(e.g. pdf, postscript, ms-word...) are described in the external packages
section.

----------------------------------------------------------------------

2.3.1. The indexing configuration GUI

Most parameters for a given indexing configuration can be set from a
recoll GUI running on this configuration (either as default, or by setting
RECOLL_CONFDIR or the -c option).

The interface is started from the Preferences menu. It has two main
panels. The first panel allows setting global variables, like the list of
top directories or the list of skipped paths. The second panel allows
setting variables that can be redefined for subdirectories. This second
panel has an initially empty list of customisation directories, to which
you can add. The variables are then set for the currently selected
directory (or at the top level if the empty line is selected).

The meaning of most entries in the interface is self-evident and
documented by a ToolTip popup on the text label. For more detail, you will
need to refer to the configuration section of this guide.

The configuration tool normally respects the comments and most of the
formatting inside the configuration file, so that it is quite possible to
use it on hand-edited files, which you might nevertheless want to back up
first.

----------------------------------------------------------------------

2.4. Periodic indexing

2.4.1. Starting indexing

Indexing is performed either by the recollindex program, or by the
indexing thread inside the recoll program (use the File menu). Both
programs will use the RECOLL_CONFDIR variable or accept a -c confdir
option to specify a non-default configuration directory.

If the recoll program finds no index when it starts, it will automatically
start indexing (except if canceled).

It is best to avoid interrupting the indexing process, as this may
sometimes leave the index in a bad state. This is not a serious problem,
as you then just need to delete the index files and restart the indexing.
The index files are normally stored in the $HOME/.recoll/xapiandb
directory, which you can just delete if needed. Alternatively, you can
start recollindex with option -z, which will reset the database before
indexing.

----------------------------------------------------------------------

2.4.2. Using cron to automate indexing

The most common way to set up indexing is to have a cron task execute it
every night. For example the following crontab entry would do it every day
at 3:30AM (supposing recollindex is in your PATH):

    30 3 * * * recollindex > /tmp/recolltrace 2>&1

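If the five leading fields of the crontab entry look cryptic, the following
Python sketch splits the entry above and names each part, following the
standard crontab field order (minute, hour, day of month, month, day of
week). The helper name describe_cron_entry is hypothetical; the sketch only
handles simple entries like this one.

```python
# Standard order of the five time fields in a crontab entry.
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def describe_cron_entry(entry):
    """Split a simple crontab line into its time fields and its command."""
    parts = entry.split(None, 5)
    schedule = dict(zip(CRON_FIELDS, parts[:5]))
    schedule["command"] = parts[5]
    return schedule


entry = "30 3 * * * recollindex > /tmp/recolltrace 2>&1"
for name, value in describe_cron_entry(entry).items():
    print(f"{name:>12}: {value}")
```

Minute 30 and hour 3, with * for the three remaining fields, means "every
day at 3:30".
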
The usual command to edit your crontab is crontab -e (which will usually
start the vi editor to edit the file). You may have more sophisticated
tools available on your system.

----------------------------------------------------------------------

2.5. Real time indexing

Real time monitoring/indexing is performed by starting the recollindex -m
command. With this option, recollindex will detach from the terminal and
become a daemon, permanently monitoring file changes and updating the
index.

The real time indexing support can be customised during package
configuration with the --with[out]-fam or --with[out]-inotify options. The
default is currently to include inotify monitoring on systems that support
it.

The rclmon.sh script can be used to easily start and stop the daemon. It
can be found in the examples directory (typically
/usr/[local/]share/recoll/examples).

Starting the daemon is normally performed as part of the user session
script. For example, my out-of-fashion xdm-based session has a .xsession
script with the following lines at the end:

    recollconf=$HOME/.recoll-home
    recolldata=/usr/local/share/recoll
    RECOLL_CONFDIR=$recollconf $recolldata/examples/rclmon.sh start

    fvwm

The indexing daemon gets started, then the window manager, for which the
session waits.

By default the indexing daemon will monitor the state of the X11 session,
and exit when it finishes, so it is not necessary to kill it explicitly.
(The X11 server monitoring can be disabled with option -x to recollindex).

Under KDE, you can place a small script to start recollindex -m under
$HOME/.kde/Autostart. This will be executed when the session begins.

There is a similar mechanism under Gnome (find the session control tool in
the menus and use the "Startup programs" tab).

By default, the indexing daemon will write its messages to a file inside
the configuration directory (this is controlled by the daemlogfilename and
daemloglevel configuration parameters). You may want to change this. Also,
the log file will only be truncated when the daemon starts. If the daemon
runs permanently, the log file may grow quite big, depending on the log
level.

While it is convenient that data is indexed in real time, repeated
indexing can generate a significant load on the system when files such as
email folders change. Also, monitoring large file trees by itself
significantly taxes system resources. You probably do not want to enable
it if your system is short on resources. Periodic indexing is adequate in
most cases.

----------------------------------------------------------------------

Chapter 3. Searching with the Qt graphical user interface

The recoll program provides the main user interface for searching. It is
based on the Qt library.

recoll has two search modes:

  * Simple search (the default, on the main screen) has a single entry
    field where you can enter multiple words.

  * Advanced search (a panel accessed through the Tools menu or the
    toolbox bar icon) has multiple entry fields, which you may use to
    build a logical condition, with additional filtering on file type and
    location in the file system.

In most cases, you can enter the terms as you think of them, even if they
contain embedded punctuation or other non-textual characters. For example,
Recoll can handle things like e-mail addresses, or arbitrary cut and paste
from another text window, punctuation and all.

The main case where you should enter text differently from how it is
printed is for East Asian languages (Chinese, Japanese, Korean). Words
composed of single or multiple characters should be entered separated by
white space in this case (they would typically be printed without white
space).

----------------------------------------------------------------------

3.1. Simple search

 1. Start the recoll program.

 2. Possibly choose a search mode: Any term, All terms, File name or Query
    language.

 3. Enter search term(s) in the text field at the top of the window.

 4. Click the Search button or hit the Enter key to start the search.

The initial default search mode is All terms. This will look for documents
containing all of the search terms (the ones with more terms will get
better scores). Any term will search for documents where at least one of
the terms appears.

File name will specifically look for file names. The entry will be split
at white space characters, and each pattern will be separately expanded.
If you want to search for a pattern including white space, you need to use
double quotes. The point of having a separate file name search is that
wild card expansion can be performed more efficiently on a relatively
small subset of the index.

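The splitting and per-pattern expansion described for File name search can
be mimicked with the Python standard library. This is a sketch of the
behaviour as described here, not Recoll's implementation: shlex and fnmatch
merely stand in for Recoll's own parsing and wildcard code, case is ignored
by lowercasing both sides, and the file names are made up.

```python
import fnmatch
import shlex


def expand_file_name_entry(entry, file_names):
    """Split the entry at white space (honouring double quotes) and expand
    each pattern separately against the known file names."""
    expansions = {}
    for pattern in shlex.split(entry):
        expansions[pattern] = sorted(
            name for name in file_names
            if fnmatch.fnmatchcase(name.lower(), pattern.lower())
        )
    return expansions


# Hypothetical set of indexed file names.
file_names = ["annual report 2008.doc", "notes.pdf", "report.pdf"]

# The quoted pattern contains white space, the second one does not.
result = expand_file_name_entry('"annual report*" *.pdf', file_names)
for pattern, names in result.items():
    print(pattern, "->", names)
```

Without the double quotes, annual and report* would be treated as two
separate patterns.
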
The fourth entry (Query Language) is described in its own section.

All search modes allow wildcards inside terms (*, ?, []). You may want to
have a look at the section about wildcards for more information about
this.

You can search for exact phrases (adjacent words in a given order) by
enclosing the input inside double quotes. Ex: "virtual reality".

Character case has no influence on search, except that you can disable
stem expansion for any term by capitalizing it. For example, a search for
floor will also normally look for flooring, floored, etc., but a search
for Floor will only look for floor, in any character case. Stemming can
also be disabled globally in the preferences.

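The capitalization rule can be expressed as a short Python sketch. The
function name plan_terms and the stemming_enabled flag (standing for the
global preference) are hypothetical; the sketch only decides, for each word,
whether stem expansion would apply.

```python
def plan_terms(query, stemming_enabled=True):
    """Return (lowercased term, expand_stem) pairs for a simple query.

    A capitalized word is searched without stem expansion; case is
    otherwise ignored, so the term itself is lowercased.
    """
    plan = []
    for word in query.split():
        expand_stem = stemming_enabled and not word[0].isupper()
        plan.append((word.lower(), expand_stem))
    return plan


print(plan_terms("floor covering"))
print(plan_terms("Floor covering"))
print(plan_terms("floor covering", stemming_enabled=False))
```

In the second call only covering is expanded, and in the third call nothing
is, because stemming is switched off globally.
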
Recoll remembers the last few searches that you performed. You can use the
simple search text entry widget (a combobox) to recall them (click the
drop-down button at the right of the text field). Please note, however,
that only the search texts are remembered, not the mode (all/any/file
name).

Typing Esc Space while entering a word in the simple search entry will
open a window with possible completions for the word. The completions are
extracted from the database.

Double-clicking on a word in the result list or a preview window will
insert it into the simple search entry field.

Note that, apart from wildcard characters (single ? characters are ok),
you can cut and paste any text into an All terms or Any term search field,
punctuation, newlines and all. Recoll will process it and produce a
meaningful search. This is what most differentiates this mode from the
Query Language mode, where you have to care about the syntax.

You can use the Tools / Advanced search dialog for more complex searches.

----------------------------------------------------------------------

3.2. The result list

After starting a search, a list of results will instantly be displayed in
the main list window.

By default, the document list is presented in order of relevance (how well
the system estimates that the document matches the query). You can specify
a different ordering by using the Tools / Sort parameters dialog.

Clicking on the Preview link for an entry will open an internal preview
window for the document. Further Preview clicks for the same search will
open tabs in the existing preview window. You can use Shift+Click to force
the creation of another preview window, which may be useful to view the
documents side by side. (You can also browse successive results in a
single preview window by typing Shift+ArrowUp/Down in the window).

Clicking the Edit link will attempt to start an external editor. The
editors can be configured through the user preferences dialog, or by
editing the mimeview configuration file.

The Preview and Edit links may not be present for all entries, meaning
that Recoll has no configured way to preview a given file type (which was
indexed by name only), or no configured external editor for the file type.
This can sometimes be adjusted simply by tweaking the mimemap and mimeview
configuration files (the latter can be modified with the user preferences
dialog).

The format of the result list entries is entirely configurable by using
the preference dialog to edit an HTML fragment.

You can click on the Query details link at the top of the results page to
see the query actually performed, after stem expansion and other
processing.

Double-clicking on any word inside the result list or a preview window
will insert it into the simple search text.

The result list is divided into pages (the size of which you can change in
the preferences). Use the arrow buttons in the toolbar or the links at the
bottom of the page to browse the results.

----------------------------------------------------------------------

3.2.1. The result list right-click menu

Apart from the preview and edit links, you can display a pop-up menu by
right-clicking over a paragraph in the result list. This menu has the
following entries:

  * Preview

  * Edit

  * Copy File Name

  * Copy Url

  * Save to File

  * Find similar

  * Parent document

The Preview and Edit entries do the same thing as the corresponding links.

The Copy File Name and Copy Url entries copy the relevant data to the
clipboard, for later pasting.

Save to File allows saving the contents of a result document to a chosen
file. This entry will only appear if the document does not correspond to
an existing file, but is a subdocument inside such a file (e.g. an email
attachment). It is especially useful to extract attachments with no
associated editor.

The Find similar entry will select a number of relevant terms from the
current document and enter them into the simple search field. You can then
start a simple search, with a good chance of finding documents related to
the current result.

The Parent document entry will appear for documents which are not actually
files but are part of, or attached to, a higher level document. This entry
is mainly useful for email attachments and permits viewing the message to
which the document is attached. Note that the entry will also appear for
an email which is part of an mbox folder file, but that you can't actually
visualize the folder (there will be an error dialog if you try). Recoll is
unfortunately not yet smart enough to disable the entry in this case.

----------------------------------------------------------------------

3.3. The preview window
|
|
|
|
The preview window opens when you first click a Preview link inside the
|
|
result list.
|
|
|
|
Subsequent preview requests for a given search open new tabs in the
|
|
existing window (except if you hold the Shift key while clicking which
|
|
will open a new window for side by side viewing).
|
|
|
|
Starting another search and requesting a preview will create a new preview
|
|
window. The old one stays open until you close it.
|
|
|
|
You can close a preview tab by typing ^W (Ctrl + W) in the window. Closing
|
|
the last tab for a window will also close the window.
|
|
|
|
Of course you can also close a preview window by using the window manager
|
|
button in the top of the frame.
|
|
|
|
You can display successive or previous documents from the result list
|
|
inside a preview tab by typing Shift+Down or Shift+Up (Down and Up are the
|
|
arrow keys).
|
|
|
|
The preview tabs have an internal incremental search function. You
|
|
initiate the search either by typing a / (slash) inside the text area or
|
|
by clicking into the Search for: text field and entering the search
|
|
string. You can then use the Next and Previous buttons to find the
|
|
next/previous occurrence. You can also type F3 inside the text area to get
|
|
to the next occurrence.
|
|
|
|
If you have a search string entered and you use ^Up/^Down to browse the
|
|
results, the search is initiated for each successive document. If the
|
|
string is found, the cursor will be positioned at the first occurrence of
|
|
the search string.
|
|
|
|
A right-click menu in the text area allows switching between displaying
|
|
the main text or the contents of fields associated to the document (ie:
|
|
author, abstract, etc.). This is especially useful in cases where the term
|
|
match did not occur in the main text but in one of the fields.
|
|
|
|
----------------------------------------------------------------------
|
|
|
|
3.4. The query language
|
|
|
|
The query language processor is activated on the simple search entry when
|
|
the search mode selector is set to Query Language.
|
|
|
|
The language is roughly based on the Xesam user search language
|
|
specification.
|
|
|
|
Here follows a sample request that we are going to explain:
|
|
|
|
author:"john doe" Beatles OR Lennon Live OR Unplugged -potatoes
|
|
|
|
|
|
This would search for all documents with John Doe appearing as a phrase in
|
|
the author field (exactly what this is would depend on the document type,
|
|
ie: the From: header, for an email message), and containing either beatles
|
|
or lennon and either live or unplugged but not potatoes (in any part of
|
|
the document).
|
|
|
|
An element is composed of an optional field specification, and a value,
|
|
separated by a colon. Examples: Beatles, author:balzac, dc:title:grandet
|
|
|
|
The colon, if present, means "contains". Xesam defines other relations,
|
|
which are not supported for now.
|
|
|
|
All elements in the search entry are normally combined with an implicit
|
|
AND. It is possible to specify that elements be OR'ed instead, as in
|
|
Beatles OR Lennon. The OR must be entered literally (capitals), and it has
|
|
priority over the AND associations: word1 word2 OR word3 means word1 AND
|
|
(word2 OR word3) not (word1 AND word2) OR word3. Do not enter explicit
|
|
parentheses, as they are not supported for now.
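The precedence rule above can be pictured with a small Python sketch. This is not Recoll's parser: the function name group_query is hypothetical, and the sketch ignores quoted phrases, field prefixes and negation. It only shows how a literal OR attaches to its immediate neighbours before the implicit AND is applied.

```python
def group_query(query):
    """Group whitespace-separated terms into AND-ed groups of OR-ed terms."""
    groups = []          # each inner list is an OR group; the groups are AND-ed
    pending_or = False   # True right after a literal OR token
    for token in query.split():
        if token == "OR":
            pending_or = True
        elif pending_or and groups:
            groups[-1].append(token)   # OR binds to the previous term
            pending_or = False
        else:
            groups.append([token])     # implicit AND starts a new group
            pending_or = False
    return groups


# word1 AND (word2 OR word3)
print(group_query("word1 word2 OR word3"))
```

Each inner list is one OR group, so the sample above yields one group holding word1 and another holding word2 and word3.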
|
|
|
|
An element preceded by a - specifies a term that should not appear. Pure
|
|
negative queries are forbidden.
|
|
|
|
As usual, words inside quotes define a phrase (the order of words is
|
|
significant), so that title:"prejudice pride" is not the same as
|
|
title:prejudice title:pride, and is unlikely to find a result.
|
|
|
|
Recoll currently manages the following default fields:
|
|
|
|
* title, subject or caption are synonyms which specify data to be
|
|
searched for in the document title or subject.
|
|
|
|
* author or from for searching the document's originators.
|
|
|
|
* recipient or to for searching the document's recipients.
|
|
|
|
* keyword for searching the document-specified keywords (few documents
|
|
actually have any).
|
|
|
|
* filename for the document's file name.
|
|
|
|
* ext specifies the file name extension (Ex: ext:html)
|
|
|
|
The field syntax also supports a few field-like, but special, criteria:
|
|
|
|
* dir for filtering the results on file location (Ex:
|
|
dir:/home/me/somedir). Please note that this is quite inefficient,
|
|
that it may produce very slow searches, and that it may be worthwhile in
|
|
some cases to set up separate databases instead.
|
|
|
|
* mime or format for specifying the mime type. This one is quite special
|
|
because you can specify several values which will be OR'ed (the normal
|
|
default for the language is AND). Ex: mime:text/plain mime:text/html.
|
|
Specifying an explicit boolean operator or negation (-) before a mime
|
|
specification is not supported and will produce strange results.
|
|
|
|
* type or rclcat for specifying the category (as in
|
|
text/media/presentation/etc.). The classification of mime types in
|
|
categories is defined in the Recoll configuration (mimeconf), and can
|
|
be modified or extended. The default category names are those which
|
|
permit filtering results in the main GUI screen. Categories are OR'ed
|
|
like mime types above.
|
|
|
|
The document filters used while indexing have the possibility to create
|
|
other fields with arbitrary names, and aliases may be defined in the
|
|
configuration, so that the exact field search possibilities may be
|
|
different for you if someone took care of the customisation.
|
|
|
|
The query language is currently the only way to use the Recoll field
|
|
search capability.
|
|
|
|
Words inside phrases and capitalized words are not stem-expanded.
|
|
Wildcards may be used anywhere inside a term. Specifying a wild-card on
|
|
the left of a term can produce a very slow search (or even an incorrect
|
|
one if the expansion is truncated because of excessive size).
|
|
|
|
You can use the show query link at the top of the result list to check the
|
|
exact query which was finally executed by Xapian.
|
|
|
|
Most Xesam phrase modifiers are unsupported, except for l (small ell) to
|
|
disable stemming, and p to turn a phrase into a NEAR (unordered) search.
|
|
Example: "prejudice pride"p
|
|
|
|
----------------------------------------------------------------------
|
|
|
|
3.5. Complex/advanced search
|
|
|
|
The advanced search dialog helps you build more complex queries. It can be
|
|
opened through the Tools menu or through the main toolbar.
|
|
|
|
The dialog has three parts:
|
|
|
|
* The top part allows constructing a query by combining multiple clauses
|
|
of different types. Each entry field is configurable for the following
|
|
modes:
|
|
|
|
* All terms.
|
|
|
|
* Any term.
|
|
|
|
* None of the terms.
|
|
|
|
* Phrase (exact terms in order within an adjustable window).
|
|
|
|
* Proximity (terms in any order within an adjustable window).
|
|
|
|
* Filename search.
|
|
|
|
Additional entry fields can be created by clicking the Add clause
|
|
button.
|
|
|
|
When searching, the non-empty clauses will be combined either with an
|
|
AND or an OR conjunction, depending on the choice made on the left
|
|
(All clauses or Any clause).
|
|
|
|
Entries of all types except "Phrase" and "Near" accept a mix of single
|
|
words and phrases enclosed in double quotes. Stemming and wildcard
|
|
expansion will be performed as for simple search.
|
|
|
|
* The next part allows filtering the results by their mime types.
|
|
|
|
The state of the file type selection can be saved as the default (the
|
|
file type filter will not be activated at program start-up, but the
|
|
lists will be in the restored state).
|
|
|
|
* The bottom part allows restricting the search results to a sub-tree of
|
|
the indexed area. If you need to do this often, you may think of
|
|
setting up multiple indexes instead, as the performance will be much
|
|
better.
|
|
|
|
Phrases and Proximity searches. These two clauses work in similar ways,
|
|
with the difference that proximity searches do not impose an order on the
|
|
words. In both cases, an adjustable number (slack) of non-matched words
|
|
may be accepted between the searched ones (use the counter on the left to
|
|
adjust this count). For phrases, the default count is zero (exact match).
|
|
For proximity it is ten (meaning that two search terms would be matched
|
|
if found within a window of twelve words). Examples: a phrase search for
|
|
quick fox with a slack of 0 will match quick fox but not quick brown fox.
|
|
With a slack of 1 it will match the latter, but not fox quick. A proximity
|
|
search for quick fox with the default slack will match the latter, and
|
|
also a fox is a cunning and quick animal.
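The slack rule can be checked with a short Python sketch. It is an illustration only and says nothing about how Xapian evaluates these clauses: the function names are hypothetical, and words are simply split on white space. The slack is counted as the number of words in the matched window minus the number of search terms, which agrees with the twelve-word window quoted above for two terms and a slack of ten.

```python
from itertools import product


def _positions(terms, text):
    """For each term, list the word positions where it occurs in text."""
    words = text.lower().split()
    return [[i for i, w in enumerate(words) if w == t.lower()] for t in terms]


def phrase_match(terms, text, slack=0):
    """True if the terms occur in order with at most `slack` extra words."""
    for combo in product(*_positions(terms, text)):
        in_order = all(a < b for a, b in zip(combo, combo[1:]))
        if in_order and (combo[-1] - combo[0] + 1) - len(terms) <= slack:
            return True
    return False


def near_match(terms, text, slack=10):
    """True if the terms occur in any order with at most `slack` extra words."""
    for combo in product(*_positions(terms, text)):
        distinct = len(set(combo)) == len(combo)
        if distinct and (max(combo) - min(combo) + 1) - len(terms) <= slack:
            return True
    return False


print(phrase_match(["quick", "fox"], "quick brown fox", slack=0))
print(phrase_match(["quick", "fox"], "quick brown fox", slack=1))
print(near_match(["quick", "fox"], "a fox is a cunning and quick animal"))
```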
|
|
|
|
Click on the Start Search button in the advanced search dialog, or type
|
|
Enter in any text field to start the search. The button in the main window
|
|
always performs a simple search.
|
|
|
|
Click on the Show query details link at the top of the result page to see
|
|
the query expansion.
|
|
|
|
----------------------------------------------------------------------
|
|
|
|
3.6. The term explorer tool
|
|
|
|
Recoll automatically manages the expansion of search terms to their
|
|
derivatives (ie: plural/singular, verb inflections). But there are other
|
|
cases where the exact search term is not known. For example, you may not
|
|
remember the exact spelling, or only know the beginning of the name.
|
|
|
|
The term explorer tool (started from the toolbar icon or from the Term
|
|
explorer entry of the Tools menu) can be used to search the full index
|
|
terms list. It has four modes of operation:
|
|
|
|
Wildcard
|
|
|
|
In this mode of operation, you can enter a search string with
|
|
shell-like wildcards (*, ?, []). ie: xapi* would display all index
|
|
terms beginning with xapi. (More about wildcards here).
|
|
|
|
Regular expression
|
|
|
|
This mode will accept a regular expression as input. Example:
|
|
word[0-9]+. The expression is implicitly anchored at the
|
|
beginning. Ie: press will match pression but not expression. You
|
|
can use .*press to match the latter, but be aware that this will
|
|
cause a full index term list scan, which can be quite long.
|
|
|
|
Stem expansion
|
|
|
|
This mode will perform the usual stem expansion normally done as
|
|
part of user input processing. As such it is probably mostly useful
|
|
to demonstrate the process.
|
|
|
|
Spelling/Phonetic
|
|
|
|
In this mode, you enter the term as you think it is spelled, and
|
|
Recoll will do its best to find index terms that sound like your
|
|
entry. This mode uses the Aspell spelling application, which must
|
|
be installed on your system for things to work (if your documents
|
|
contain non-ascii characters, Recoll needs an aspell version newer
|
|
than 0.60 for UTF-8 support). The language which is used to build
|
|
the dictionary out of the index terms (which is done at the end of
|
|
an indexing pass) is the one defined by your NLS environment.
|
|
Weird things will probably happen if languages are mixed up.
|
|
|
|
Note that in cases where Recoll does not know the beginning of the string
|
|
to search for (ie a wildcard expression like *coll), the expansion can
|
|
take quite a long time because the full index term list will have to be
|
|
processed. The expansion is currently limited at 200 results for wildcards
|
|
and regular expressions.
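The implicit anchoring of the regular expression mode can be reproduced with Python's re.match, which also only matches at the beginning of a string. This is a sketch over a made-up term list, not the term explorer's code. The function name regex_terms is hypothetical, and its limit parameter merely mirrors the 200-result cap mentioned above.

```python
import re


def regex_terms(pattern, index_terms, limit=200):
    """Return the index terms matched by pattern, anchored at the beginning."""
    rx = re.compile(pattern)
    # Every term has to be tested: this is the full term list scan.
    matches = [t for t in index_terms if rx.match(t)]
    return matches[:limit]


terms = ["press", "pression", "expression", "word42"]
print(regex_terms("press", terms))      # pression matches, expression does not
print(regex_terms(".*press", terms))    # now expression matches too
print(regex_terms("word[0-9]+", terms))
```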
|
|
|
|
Double-clicking on a term in the result list will insert it into the
|
|
simple search entry field. You can also cut/paste between the result list
|
|
and any entry field (the end of lines will be taken care of).
|
|
|
|
----------------------------------------------------------------------
|
|
|
|
3.7. More about wildcards
|
|
|
|
All words entered in Recoll search fields will be processed for wildcard
|
|
expansion before the request is finally executed.
|
|
|
|
The wildcard characters are:
|
|
|
|
* * which matches 0 or more characters.
|
|
|
|
* ? which matches a single character.
|
|
|
|
* [] which allows defining sets of characters to be matched (ex: [abc]
|
|
matches a single character which may be 'a' or 'b' or 'c', [0-9]
|
|
matches any digit).
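These wildcards behave like shell patterns, so Python's standard fnmatch module can be used to experiment with them. The sketch below expands a pattern against a small invented term list. The function name expand_wildcard is hypothetical, and Recoll's own expansion runs against the index term list rather than a Python list.

```python
from fnmatch import fnmatchcase


def expand_wildcard(pattern, index_terms):
    """Return the index terms matching a shell-like wildcard pattern."""
    return [t for t in index_terms if fnmatchcase(t, pattern)]


terms = ["xapian", "xapiandb", "recoll", "recollq", "doc1", "doc2", "docs"]
print(expand_wildcard("xapi*", terms))     # terms beginning with xapi
print(expand_wildcard("recoll?", terms))   # exactly one extra character
print(expand_wildcard("doc[0-9]", terms))  # doc followed by one digit
```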
|
|
|
|
You should be aware of a few things before using wildcards.
|
|
|
|
* Using a wildcard character at the beginning of a word can make for a
|
|
slow search because Recoll will have to scan the whole index term list
|
|
to find the matches.
|
|
|
|
* Using a * at the end of a word can produce more matches than you would
|
|
think, and strange search results. You can use the term explorer tool
|
|
to check what completions exist for a given term. You can also see
|
|
exactly what search was performed by clicking on the link at the top
|
|
of the result list. In general, for natural language terms, stem
|
|
expansion will produce better results than an ending * (stem expansion
|
|
is turned off when any wildcard character appears in the term).
|
|
|
|
----------------------------------------------------------------------
|
|
|
|
3.8. Multiple databases
|
|
|
|
Multiple Recoll databases or indexes can be created by using several
|
|
configuration directories which are usually set to index different areas
|
|
of the file system. A specific index can be selected for updating or
|
|
searching, using the RECOLL_CONFDIR environment variable or the -c option
|
|
to recoll and recollindex.
|
|
|
|
A recollindex program instance can only update one specific index.
|
|
|
|
A recoll program instance is also associated with a specific index, which
|
|
is the one to be updated by its indexing thread, but it can use any number
|
|
of Recoll indexes for searching. The external indexes can be selected
|
|
through the external indexes tab in the preferences dialog.
|
|
|
|
Index selection is performed in two phases. A set of all usable indexes
|
|
must first be defined, and then the subset of indexes to be used for
|
|
searching. Of course, these parameters are retained across program
|
|
executions (they are kept separately for each Recoll configuration). The
|
|
set of all indexes is usually quite stable, while the active ones might
|
|
typically be adjusted quite frequently.
|
|
|
|
The main index (defined by RECOLL_CONFDIR) is always active. If this is
|
|
undesirable, you can set up your base configuration to index an empty
|
|
directory.
|
|
|
|
As building the set of all indexes can be a little tedious when done
|
|
through the user interface, you can use the RECOLL_EXTRA_DBS environment
|
|
variable to provide an initial set. This might typically be set up by a
|
|
system administrator so that every user does not have to do it. The
|
|
variable should define a colon-separated list of index directories, ie:
|
|
|
|
export RECOLL_EXTRA_DBS=/some/place/xapiandb:/some/other/db
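As an illustration of the colon-separated format, here is a minimal Python sketch that reads the variable and splits it into index directories. The function name extra_index_dirs is hypothetical. Dropping empty entries is an assumption of this sketch, not documented Recoll behaviour.

```python
import os


def extra_index_dirs(value=None):
    """Split a RECOLL_EXTRA_DBS style value into a list of index directories."""
    if value is None:
        value = os.environ.get("RECOLL_EXTRA_DBS", "")
    # Ignore empty entries such as those produced by a trailing colon.
    return [d for d in value.split(":") if d]


print(extra_index_dirs("/some/place/xapiandb:/some/other/db"))
```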
|
|
|
|
A typical usage scenario for the multiple index feature would be for a
|
|
system administrator to set up a central index for shared data, that you
|
|
choose to search or not in addition to your personal data. Of course,
|
|
there are other possibilities. There are many cases where you know the
|
|
subset of files that should be searched, and where narrowing the search
|
|
can improve the results. You can achieve approximately the same effect
|
|
with the directory filter in advanced search, but multiple indexes will
|
|
have much better performance and may be worth the trouble.
|
|
|
|
----------------------------------------------------------------------
|
|
|
|
3.9. Document history
|
|
|
|
Documents that you actually view (with the internal preview or an external
|
|
tool) are entered into the document history, which is remembered.
|
|
|
|
You can display the history list by using the Tools/Doc History menu
|
|
entry.
|
|
|
|
You can erase the document history by using the Erase document history
|
|
entry in the File menu.
|
|
|
|
----------------------------------------------------------------------
|
|
|
|
3.10. Sorting search results and collapsing duplicates
|
|
|
|
The documents in a result list are normally sorted in order of relevance.
|
|
It is possible to specify different sort parameters by using the Sort
|
|
parameters dialog (located in the Tools menu).
|
|
|
|
The tool sorts a specified number of the most relevant documents in the
|
|
result list, according to specified criteria. The currently available
|
|
criteria are date and mime type.
|
|
|
|
The sort parameters stay in effect until they are explicitly reset, or the
|
|
program exits. An activated sort is indicated in the result list header.
|
|
|
|
Sort parameters are remembered between program invocations, but result
|
|
sorting is normally always inactive when the program starts. It is
|
|
possible to keep the sorting activation state between program invocations
|
|
by checking the Remember sort activation state option in the preferences.
|
|
|
|
It is also possible to hide duplicate entries inside the result list
|
|
(documents with the exact same contents as the displayed one). The test of
|
|
identity is based on an MD5 hash of the document container, not only of
|
|
the text contents (so that ie, a text document with an image added will
|
|
not be a duplicate of the text only). Duplicates hiding is controlled by
|
|
an entry in the Query configuration dialog, and is off by default.
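The idea of identifying duplicates through a hash of the whole container can be sketched in Python with hashlib. This is an illustration under stated assumptions, not Recoll's implementation. The function name collapse_duplicates and the (url, bytes) input shape are hypothetical.

```python
import hashlib


def collapse_duplicates(results):
    """Keep the first URL for each distinct container content.

    results is an iterable of (url, container_bytes) pairs.
    """
    seen = set()
    kept = []
    for url, data in results:
        digest = hashlib.md5(data).hexdigest()  # hash of the raw container
        if digest not in seen:
            seen.add(digest)
            kept.append(url)
    return kept


docs = [
    ("file:///a/report.txt", b"same text"),
    ("file:///b/copy-of-report.txt", b"same text"),
    ("file:///c/report-with-image.odt", b"same text plus image data"),
]
print(collapse_duplicates(docs))
```

The third entry is kept because its container differs, even though it carries the same text.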
|
|
|
|
----------------------------------------------------------------------
|
|
|
|
3.11. Search tips, shortcuts
|
|
|
|
3.11.1. Terms and search expansion
|
|
|
|
Term completion. Typing Esc Space in the simple search entry field while
|
|
entering a word will either complete the current word if its beginning
|
|
matches a unique term in the index, or open a window to propose a list of
|
|
completions.
|
|
|
|
Picking up new terms from result or preview text. Double-clicking on a
|
|
word in the result list or in a preview window will copy it to the simple
|
|
search entry field.
|
|
|
|
Wildcards. Wildcards can be used inside search terms in all forms of
|
|
searches. More about wildcards.
|
|
|
|
Disabling stem expansion. Entering a capitalized word in any search field
|
|
will prevent stem expansion (no search for gardening if you enter Garden
|
|
instead of garden). This is the only case where character case should make
|
|
a difference for a Recoll search. You can also disable stem expansion or
|
|
change the stemming language in the preferences.
|
|
|
|
Finding related documents. Selecting the Find similar documents entry in
|
|
the result list paragraph right-click menu will select a set of
|
|
"interesting" terms from the current result, and insert them into the
|
|
simple search entry field. You can then possibly edit the list and start a
|
|
search to find documents which may be related to the current result.
|
|
|
|
File names. File names are added as terms during indexing, and you can
|
|
specify them as ordinary terms in normal search fields (Recoll used to
|
|
index all directories in the file path as terms. This has been abandoned
|
|
as it did not seem really useful). Alternatively, you can use the specific
|
|
file name search which will only look for file names, and may be faster
|
|
than the generic search especially when using wildcards.
|
|
|
|
----------------------------------------------------------------------
|
|
|
|
3.11.2. Working with phrases and proximity
|
|
|
|
Phrases and Proximity searches. A phrase can be looked for by enclosing it
|
|
in double quotes. Example: "user manual" will look only for occurrences of
|
|
user immediately followed by manual. You can use the This phrase field of
|
|
the advanced search dialog to the same effect. Phrases can be entered
|
|
along simple terms in all simple or advanced search entry fields (except
|
|
This exact phrase).
|
|
|
|
AutoPhrases. This option can be set in the preferences dialog. If it is
|
|
set, a phrase will be automatically built and added to simple searches
|
|
when looking for Any terms. This will not radically change the results,
|
|
but will give a relevance boost to the results where the search terms
|
|
appear as a phrase. Ie: searching for virtual reality will still find all
|
|
documents where either virtual or reality or both appear, but those which
|
|
contain virtual reality should appear sooner in the list.
|
|
|
|
----------------------------------------------------------------------
|
|
|
|
3.11.3. Others
|
|
|
|
Using fields. You can use the query language and field specifications to
|
|
only search certain parts of documents. This can be especially helpful
|
|
with email, for example only searching emails from a specific originator:
|
|
search tips from:helpfulgui
|
|
|
|
Query explanation. You can get an exact description of what the query
|
|
looked for, including stem expansion, and Boolean operators used, by
|
|
clicking on the result list header.
|
|
|
|
Browsing the result list inside a preview window. Entering Shift-Down or
|
|
Shift-Up (Shift + an arrow key) in a preview window will display the next
|
|
or the previous document from the result list. Any secondary search
|
|
currently active will be executed on the new document.
|
|
|
|
Forced opening of a preview window. You can use Shift+Click on a result
|
|
list Preview link to force the creation of a preview window instead of a
|
|
new tab in the existing one.
|
|
|
|
Closing previews. Entering ^W in a tab will close it (and, for the last
|
|
tab, close the preview window). Entering Esc will close the preview window
|
|
and all its tabs.
|
|
|
|
Quitting. Entering ^Q almost anywhere will close the application.
|
|
|
|
----------------------------------------------------------------------
|
|
|
|
3.12. Customizing the search interface
|
|
|
|
It is possible to customize some aspects of the search interface by using
|
|
the Query configuration entry in the Preferences menu.
|
|
|
|
There are two tabs in the dialog, dealing with the interface itself, and
|
|
with the parameters used for searching and returning results.
|
|
|
|
User interface parameters:
|
|
|
|
* Number of results in a result page:
|
|
|
|
* Hide duplicate results: decides if result list entries are shown for
|
|
identical documents found in different places.
|
|
|
|
* Highlight color for query terms: Terms from the user query are
|
|
highlighted in the result list samples and the preview window. The
|
|
color can be chosen here. Any QT color string should work (ie red,
|
|
#ff0000). The default is blue.
|
|
|
|
* Result list font: There is quite a lot of information shown in the
|
|
result list, and you may want to customize the font and/or font size.
|
|
The rest of the fonts used by Recoll are determined by your generic QT
|
|
config (try the qtconfig command).
|
|
|
|
* Result paragraph format string: allows you to change the presentation
|
|
of each result list entry. This is a qt-html string where the
|
|
following printf-like % substitutions will be performed:
|
|
|
|
* %A. Abstract
|
|
|
|
* %D. Date
|
|
|
|
* %I. Icon image name
|
|
|
|
* %K. Keywords (if any)
|
|
|
|
* %L. Preview and Edit links
|
|
|
|
* %M. Mime type
|
|
|
|
* %N. result Number
|
|
|
|
* %R. Relevance percentage
|
|
|
|
* %S. Size information
|
|
|
|
* %T. Title
|
|
|
|
* %U. Url
|
|
|
|
The default value for the string is:
|
|
|
|
<img src="%I" align="left">%R %S %L <b>%T</b><br>
|
|
%M %D <i>%U</i><br>
|
|
%A %K
|
|
|
|
|
|
You may, for example, try the following for a more web-like
|
|
experience:
|
|
|
|
<u><b><a href="P%N">%T</a></b></u><br>
|
|
%A<font color=#008000>%U - %S</font> - %L
|
|
|
|
|
|
Or the clean looking:
|
|
|
|
<img src="%I" align="left">%L <font color="#900000">%R</font>
|
|
<b>%T</b><br>%S
|
|
<font color="#808080"><i>%U</i></font>
|
|
<table bgcolor="#e0e0e0">
|
|
<tr><td><div>%A</div></td></tr>
|
|
</table>%K
|
|
|
|
|
|
The format of the Preview and Edit links is <a href="Pdocnum"> and <a
|
|
href="Edocnum"> where docnum is what %N would print. This makes the
|
|
title a preview link in the above format.
|
|
|
|
Please note that, due to the way the program handles right mouse
|
|
clicks in the result list, if the custom formatting results in
|
|
multiple paragraphs per result, right clicks will only work inside the
|
|
first one.
|
|
|
|
* HTML help browser: this will let you choose your preferred browser
|
|
which will be started from the Help menu to read the user manual. You
|
|
can enter a simple name if the command is in your PATH, or browse for
|
|
a full pathname.
|
|
|
|
* Auto-start simple search on white space entry: if this is checked, a
|
|
search will be executed each time you enter a space in the simple
|
|
search input field. This lets you look at the result list as you enter
|
|
new terms. This is off by default, you may like it or not...
|
|
|
|
* Start with advanced search dialog open and Start with sort dialog
|
|
open: If you use these dialogs all the time, checking these entries
|
|
will get them to open when recoll starts.
|
|
|
|
* Use desktop preferences to choose document editor: if this is checked,
|
|
the xdg-open utility will be used to open files when you click the
|
|
Edit link in the result list, instead of the application defined in
|
|
mimeview. xdg-open will in turn use your desktop preferences to choose
|
|
an appropriate application.
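The printf-like substitution used by the Result paragraph format string in the list above can be imitated in a few lines of Python. This is only a sketch of the idea: the function name render_result and the sample field values are hypothetical, and Recoll's own code may treat unknown or repeated % sequences differently.

```python
import re


def render_result(fmt, fields):
    """Replace each %X in fmt by fields["X"]; unknown letters are left as-is."""
    def substitute(match):
        return str(fields.get(match.group(1), match.group(0)))
    return re.sub(r"%([A-Z])", substitute, fmt)


fields = {"N": 3, "T": "Recoll user manual", "S": "102 KiB",
          "U": "file:///usr/share/doc/recoll/usermanual.html"}
print(render_result('<a href="P%N">%T</a><br>%U - %S', fields))
```

Trying a format string this way before pasting it into the preferences makes it easier to see which parts are fixed HTML and which are substituted.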
|
|
|
|
Search parameters:
|
|
|
|
* Stemming language: stemming obviously depends on the document's
|
|
language. This listbox will let you choose among the stemming databases
|
|
which were built during indexing (this is set in the main
|
|
configuration file), or later added with recollindex -s (See the
|
|
recollindex manual). Stemming languages which are dynamically added
|
|
will be deleted at the next indexing pass unless they are also added
|
|
in the configuration file.
|
|
|
|
* Dynamically build abstracts: this decides if Recoll tries to build
|
|
document abstracts when displaying the result list. Abstracts are
|
|
constructed by taking context from the document information, around
|
|
the search terms. This can slow down result list display significantly
|
|
for big documents, and you may want to turn it off.
|
|
|
|
* Replace abstracts from documents: this decides if we should synthesize
|
|
and display an abstract in place of an explicit abstract found within
|
|
the document itself.
|
|
|
|
* Synthetic abstract size: adjust to taste...
|
|
|
|
* Synthetic abstract context words: how many words should be displayed
|
|
around each term occurrence.
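To give a feel for how the abstract size and context words parameters interact, here is a rough Python sketch that builds an abstract from the words surrounding each occurrence of a term. It is not the algorithm Recoll uses. The function name synth_abstract and its parameters are hypothetical, and overlapping contexts are not merged.

```python
def synth_abstract(text, term, context=2, max_words=20):
    """Join the context around each occurrence of term, up to max_words words."""
    words = text.split()
    pieces = []
    used = 0
    for i, word in enumerate(words):
        if word.lower() == term.lower():
            chunk = words[max(0, i - context): i + context + 1]
            if used + len(chunk) > max_words:
                break  # the size budget is exhausted
            pieces.append(" ".join(chunk))
            used += len(chunk)
    return " ... ".join(pieces)


print(synth_abstract("the quick brown fox jumps over the lazy dog", "fox", context=1))
```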
|
|
|
|
External indexes: This panel will let you browse for additional indexes
|
|
that you may want to search. External indexes are designated by their
|
|
database directory (ie: /home/someothergui/.recoll/xapiandb,
|
|
/usr/local/recollglobal/xapiandb).
|
|
|
|
Once entered, the indexes will appear in the External indexes list, and
|
|
you can choose which ones you want to use at any moment by checking or
|
|
unchecking their entries.
|
|
|
|
Your main database (the one the current configuration indexes to), is
|
|
always implicitly active. If this is not desirable, you can set up your
|
|
configuration so that it indexes, for example, an empty directory. An
|
|
alternative indexer may also need to implement a way of purging the index
|
|
from stale data.
|
|
|
|
----------------------------------------------------------------------
|
|
|
|
Chapter 4. Searching with the KDE KIO slave
|
|
|
|
4.1. What's this
|
|
|
|
The Recoll KIO slave allows performing a Recoll search by entering an
|
|
appropriate URL in a KDE open dialog, or with an HTML-based interface
|
|
displayed in Konqueror.
|
|
|
|
The HTML-based interface is similar to the QT-based interface, but
|
|
slightly less powerful for now. Its advantage is that you can perform your
|
|
search while staying fully within the KDE framework: drag and drop from
|
|
the result list works normally and you have your normal choice of
|
|
applications for opening files.
|
|
|
|
The alternative interface uses a directory view of search results. Due to
|
|
limitations in the current KIO slave interface, it is currently not
|
|
obviously useful (to me).
|
|
|
|
The interface is described in more detail inside a help file which you can
|
|
access by entering recoll:/ inside the konqueror URL line (this works only
|
|
if the recoll KIO slave has been previously installed).
|
|
|
|
The instructions for building this module are located in the source tree.
|
|
See: kde/kio/recoll/00README.txt
|
|
|
|
----------------------------------------------------------------------
|
|
|
|
4.2. Searchable documents
|
|
|
|
As a sample application, the Recoll KIO slave could allow preparing a set
|
|
of HTML documents (for example a manual) so that they become their own
|
|
search interface inside konqueror.
|
|
|
|
This can be done by either explicitly inserting <a href="recoll:/...">
|
|
links around some document areas, or automatically by adding a very small
|
|
javascript program to the documents, like the following example, which
|
|
would initiate a search by double-clicking any term:
|
|
|
|
<script language="JavaScript">
|
|
function recollsearch() {
|
|
var t = document.getSelection();
|
|
window.location.href = 'recoll://search/query?qtp=a&p=0&q=' +
|
|
encodeURIComponent(t);
|
|
}
|
|
</script>
|
|
....
|
|
<body ondblclick="recollsearch()">
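The same search URL can be generated outside the browser, for example when preparing the explicit links mentioned above. The Python sketch below reuses the URL prefix from the JavaScript example. The function name recoll_search_url is hypothetical, and urllib.parse.quote is used as a close stand-in for encodeURIComponent (the two differ slightly in which punctuation they leave unescaped).

```python
from urllib.parse import quote


def recoll_search_url(term):
    """Build a recoll:// search URL for term, percent-encoding the query."""
    return "recoll://search/query?qtp=a&p=0&q=" + quote(term, safe="")


print(recoll_search_url("user manual"))
```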
|
|
|
|
----------------------------------------------------------------------
|
|
|
|
Chapter 5. Searching on the command line
|
|
|
|
There are several ways to obtain search results as a text stream, without
|
|
a graphical interface:
|
|
|
|
* By passing option -t to the recoll program.
|
|
|
|
* By using the recollq program.
|
|
|
|
* By writing a custom Python program, using the Recoll Python API.
|
|
|
|
The first two methods work in the same way and accept/need the same
|
|
arguments (except for the additional -t to recoll). The query to be
|
|
executed is specified as command line arguments.
|
|
|
|
recollq is not built by default. You can use the Makefile in the query
|
|
directory to build it. This is a very simple program, and it will often be
|
|
useful to tailor its output format to your needs.
|
|
|
|
recollq has a man page (not installed by default, look in the doc/man
|
|
directory). The Usage string is as follows:
|
|
|
|
recollq [-o|-a|-f] <query string>
|
|
Runs a recoll query and displays result lines.
|
|
Default: will interpret the argument(s) as a query language string
|
|
-o Emulate the gui simple search in ANY TERM mode
|
|
-a Emulate the gui simple search in ALL TERMS mode
|
|
-f Emulate the gui simple search in filename mode
|
|
Common options:
|
|
-c <configdir> : specify config directory, overriding $RECOLL_CONFDIR
|
|
-d also dump file contents
|
|
-n <cnt> limit the maximum number of results (0->no limit, default 2000)
|
|
-b : basic. Just output urls, no mime types or titles
|
|
-m : dump the whole document meta[] array
|
|
-S fld : sort by field name
|
|
-D : sort descending
|
|
|
|
Sample execution:
|
|
|
|
recollq 'ilur -nautique mime:text/html'
|
|
Recoll query: ((((ilur:(wqf=11) OR ilurs) AND_NOT (nautique:(wqf=11)
|
|
OR nautiques OR nautiqu OR nautiquement)) FILTER Ttext/html))
|
|
4 results
|
|
text/html [file:///Users/uncrypted-dockes/projets/bateaux/ilur/comptes.html] [comptes.html] 18593 bytes
|
|
text/html [file:///Users/uncrypted-dockes/projets/nautique/webnautique/articles/ilur1/index.html] [Constructio...
|
|
text/html [file:///Users/uncrypted-dockes/projets/pagepers/index.html] [psxtcl/writemime/recoll]...
|
|
text/html [file:///Users/uncrypted-dockes/projets/bateaux/ilur/factEtCie/recu-chasse-maree....
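If you want to drive recollq from a script, a thin wrapper is usually enough. The Python sketch below only uses options listed in the usage string above (-o, -a, -f, -c, -n, -b). The function names and the order in which the options are placed before the query are assumptions of this sketch, and run_recollq only works if recollq has been built and is in your PATH.

```python
import subprocess


def recollq_argv(query, mode=None, limit=None, basic=False, confdir=None):
    """Build the recollq command line as an argument list."""
    argv = ["recollq"]
    if mode in ("o", "a", "f"):      # simple search emulation modes
        argv.append("-" + mode)
    if confdir:
        argv += ["-c", confdir]      # overrides $RECOLL_CONFDIR
    if limit is not None:
        argv += ["-n", str(limit)]   # 0 means no limit
    if basic:
        argv.append("-b")            # only output urls
    argv.append(query)
    return argv


def run_recollq(query, **options):
    """Run recollq and return its output lines (needs recollq in PATH)."""
    done = subprocess.run(recollq_argv(query, **options),
                          capture_output=True, text=True, check=True)
    return done.stdout.splitlines()


print(recollq_argv("ilur -nautique mime:text/html", limit=10, basic=True))
```

Passing the query as a single list element avoids any shell quoting problems with spaces or special characters.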
|
|
|
|
----------------------------------------------------------------------
|
|
|
|
Chapter 6. Programming interface
|
|
|
|
Recoll has an Application programming Interface, usable both for indexing
|
|
and searching, currently accessible from the Python language.
|
|
|
|
Another less radical way to extend the application is to write filters for
|
|
new types of documents.
|
|
|
|
The processing of metadata attributes for documents (fields) is highly
|
|
configurable.
|
|
|
|
----------------------------------------------------------------------
|
|
|
|
6.1. Writing a document filter
|
|
|
|
Recoll filters are executable programs which translate from a specific
|
|
format (ie: openoffice, acrobat, etc.) to the Recoll indexing input
|
|
format, which may be text/plain or text/html.
|
|
|
|
Recoll filters are usually shell-scripts, but this is in no way necessary.
|
|
These programs are extremely simple and most of the difficulty lies in
|
|
extracting the text from the native format, not outputting what is
|
|
expected by Recoll. Happily enough, most document formats already have
|
|
translators or text extractors which handle the difficult part and can be
|
|
called from the filter. In some cases the output of the translating program
|
|
is appropriate, and no intermediate shell-script is needed.
|
|
|
|
Filters are called with a single argument which is the source file name.
|
|
They should output the result to stdout.
|
|
|
|
The RECOLL_FILTER_FORPREVIEW environment variable (values yes, no) tells
|
|
the filter if the operation is for indexing or previewing. Some filters
|
|
use this to output a slightly different format. This is not essential.
|
|
|
|
The association of file types to filters is performed in the mimeconf
|
|
file. A sample:
|
|
|
|
|
|
[index]
|
|
application/msword = exec antiword -t -i 1 -m UTF-8;\
|
|
mimetype=text/plain;charset=utf-8
|
|
|
|
application/ogg = exec rclogg
|
|
|
|
text/rtf = exec unrtf --nopict --html; charset=iso-8859-1; mimetype=text/html
|
|
|
|
The fragment specifies that:
|
|
|
|
* application/msword files are processed by executing the antiword
|
|
program, which outputs text/plain encoded in utf-8.
|
|
|
|
* application/ogg files are processed by the rclogg script, with default
|
|
output type (text/html, with encoding specified in the header, or
|
|
utf-8 by default).
|
|
|
|
* text/rtf is processed by unrtf, which outputs text/html. The
|
|
iso-8859-1 encoding is specified because it is not the utf-8 default,
|
|
and not output by unrtf in the HTML header section.
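To make the structure of these values concrete, the following sketch
splits one of them into the command to execute and its output
attributes. It is only a reading aid, not the parser used by Recoll: the
function name parse_filter_spec is hypothetical, and the sketch assumes
that continuation lines have already been joined and that the command
itself contains no semicolon.

```python
def parse_filter_spec(value):
    """Split a mimeconf [index] value into (command, attributes)."""
    parts = [p.strip() for p in value.split(";")]
    head = parts[0].split()
    if not head or head[0] != "exec":
        raise ValueError("expected a value starting with 'exec'")
    command = head[1:]
    # Defaults described in the text: text/html output, utf-8 encoding.
    attrs = {"mimetype": "text/html", "charset": "utf-8"}
    for part in parts[1:]:
        if not part:
            continue
        name, _, val = part.partition("=")
        attrs[name.strip()] = val.strip()
    return command, attrs


cmd, attrs = parse_filter_spec(
    "exec unrtf --nopict --html; charset=iso-8859-1; mimetype=text/html")
print(cmd, attrs)
```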
|
|
|
|
The easiest way to write a new filter is probably to start from an
|
|
existing one.
|
|
|
|
Filters which output text/plain text are generally simpler, but they
|
|
cannot specify the character set and other metadata, so they are limited
|
|
to cases where these elements are not needed.
|
|
|
|
----------------------------------------------------------------------
|
|
|
|
6.1.1. Filter HTML output
|
|
|
|
The output HTML could be very minimal like the following example:
|
|
|
|
<html><head>
|
|
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
|
|
</head>
|
|
<body>some text content</body></html>
|
|
|
|
|
|
You should take care to escape some characters inside the text by
|
|
transforming them into appropriate entities. "&" should be transformed
|
|
into "&amp;", "<" should be transformed into "&lt;". This is not always
|
|
properly done by translating programs which output HTML, and of course
|
|
never by those which output plain text.
|
|
|
|
The character set needs to be specified in the header. It does not need to
|
|
be UTF-8 (Recoll will take care of translating it), but it must be
|
|
accurate for good results.
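The escaping and character set rules above can be sketched as follows
for a filter written in Python. The helper name minimal_html is
hypothetical; html.escape() is part of the Python standard library. The
caller remains responsible for actually encoding the output in the
character set that the header declares.

```python
import html


def minimal_html(text, charset="UTF-8"):
    """Wrap plain text in a minimal HTML document with a charset header."""
    # html.escape() turns "&" into "&amp;", "<" into "&lt;" and
    # ">" into "&gt;". Quotes are left alone outside of attributes.
    body = html.escape(text, quote=False)
    return (
        "<html><head>\n"
        f'<meta http-equiv="Content-Type" content="text/html;charset={charset}">\n'
        "</head>\n"
        f"<body>{body}</body></html>\n"
    )


print(minimal_html("R&D: a < b"))
```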
|
|
|
|
Recoll will also make use of other header fields if they are present:
|
|
title, description, keywords.
|
|
|
|
Filters also have the possibility to "invent" field names. This should be
|
|
output as meta tags:
|
|
|
|
<meta name="somefield" content="Some textual data" />
|
|
|
|
See the following section for details about configuring how field data is
|
|
processed by the indexer.
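A filter written in Python could generate such meta tags along the
following lines. This is a sketch only: the helper name meta_tags and
the sample field are hypothetical. Attribute values need the double
quote character escaped in addition to "&" and "<".

```python
import html


def meta_tags(fields):
    """Return one <meta> line per (name, value) pair in `fields`."""
    lines = []
    for name, value in fields.items():
        # quote=True also escapes quote characters, as needed inside
        # an attribute value.
        lines.append('<meta name="%s" content="%s" />' % (
            html.escape(name, quote=True),
            html.escape(value, quote=True)))
    return "\n".join(lines)


print(meta_tags({"somefield": 'Some "textual" data & more'}))
```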
|
|
|
|
----------------------------------------------------------------------
|
|
|
|
6.2. Field data processing configuration
|
|
|
|
Fields are named pieces of information in or about documents, like title,
|
|
author, abstract.
|
|
|
|
The field values for documents can appear in several ways during indexing:
|
|
either output by filters as meta fields in the HTML header section, or
|
|
added as attributes of the Doc object when using the API, or again
|
|
synthesized internally by Recoll.
|
|
|
|
The Recoll query language allows searching for text in a specific field.
|
|
|
|
Recoll defines a number of default fields. Additional ones can be output
|
|
by filters, and described in the fields configuration file.
|
|
|
|
Fields can be:
|
|
|
|
* indexed, meaning that their terms are separately stored in inverted
|
|
lists (with a specific prefix), and that a field-specific search is
|
|
possible.
|
|
|
|
* stored, meaning that their value is recorded in the index data record
|
|
for the document, and can be returned and displayed with search
|
|
results.
|
|
|
|
A field can be either or both indexed and stored.
|
|
|
|
A field becomes indexed by having a prefix defined in the [prefixes]
|
|
section of the fields file. See the comments in there for details.
|
|
|
|
A field becomes stored by appearing in the [stored] section of the fields
|
|
file.
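The relationship between the two sections can be illustrated with a
short sketch. Everything in the sample fragment is an assumption made
for the example: the field name somefield, the prefix value XYSOMEFIELD,
and the convention of listing a stored field with an empty value. The
comments in the fields file itself remain the reference. Python's
configparser is used here only because it happens to read this small
fragment; it is not what Recoll uses.

```python
import configparser

# Hypothetical fragment: "somefield" and "XYSOMEFIELD" are made up.
SAMPLE = """
[prefixes]
somefield = XYSOMEFIELD

[stored]
somefield =
"""

parser = configparser.ConfigParser()
parser.optionxform = str  # keep field names as written
parser.read_string(SAMPLE)

indexed = dict(parser["prefixes"])  # field name -> term prefix
stored = set(parser["stored"])      # field names kept in the data record

for name in sorted(set(indexed) | stored):
    print(name,
          "indexed" if name in indexed else "-",
          "stored" if name in stored else "-")
```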
|
|
|
|
----------------------------------------------------------------------
|
|
|
|
6.3. API
|
|
|
|
6.3.1. Interface elements
|
|
|
|
A few elements in the interface are specific and need an explanation.
|
|
|
|
udi
|
|
|
|
A udi (unique document identifier) identifies a document. Because
|
|
of limitations inside the index engine, it is restricted in length
|
|
(to 200 bytes), which is why a regular URI cannot be used. The
|
|
structure and contents of the udi is defined by the application
|
|
and opaque to the index engine. For example, the internal file
|
|
system indexer uses the complete document path (file path +
|
|
internal path), truncated to length, the suppressed part being
|
|
replaced by a hash value.
|
|
|
|
ipath
|
|
|
|
This data value (set as a field in the Doc object) is stored,
|
|
along with the URL, but not indexed by Recoll. Its contents are
|
|
not interpreted, and its use is up to the application. For
|
|
example, the Recoll internal file system indexer stores the part
|
|
of the document access path internal to the container file (ipath
|
|
in this case is a list of subdocument sequential numbers). url and
|
|
ipath are returned in every search result and permit access to the
|
|
original document.
|
|
|
|
Stored and indexed fields
|
|
|
|
The fields file inside the Recoll configuration defines which
|
|
document fields are either "indexed" (searchable), "stored"
|
|
(retrievable with search results), or both.
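The udi construction described above for the file system indexer
(complete path, truncated, with the suppressed part replaced by a hash)
could be sketched as follows. The function name make_udi, the "|"
separator between file path and internal path, and the choice of SHA-256
are assumptions made for the example; the actual scheme used by Recoll
is not specified here.

```python
import hashlib

UDI_MAX_BYTES = 200  # length limit stated in the text


def make_udi(file_path, ipath=""):
    """Build a udi of at most UDI_MAX_BYTES bytes from a document path."""
    full = (file_path + "|" + ipath).encode("utf-8")
    if len(full) <= UDI_MAX_BYTES:
        return full
    # Too long: keep the head and replace the suppressed tail by its
    # hash, so that long paths sharing a prefix still get distinct udis.
    digest_len = hashlib.sha256().digest_size * 2  # hex digits
    keep = UDI_MAX_BYTES - digest_len
    digest = hashlib.sha256(full[keep:]).hexdigest().encode("ascii")
    return full[:keep] + digest


print(make_udi("/home/me/mail/inbox", "3"))
print(len(make_udi("/very/long" * 40, "1")))
```

Any scheme works as long as the result is unique, stable between
indexing passes, and within the length limit, since the udi is opaque to
the index engine.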
|
|
|
|
Data for an external indexer should be stored in a separate index, not
|
|
the one for the Recoll internal file system indexer, except if the latter
|
|
is not used at all. The reason is that the main document indexer purge
|
|
pass would remove all the other indexer's documents, as they were not seen
|
|
during indexing. The main indexer documents would also probably be a
|
|
problem for the external indexer purge operation.
|
|
|
|
----------------------------------------------------------------------
|
|
|
|
6.3.2. Python interface
|
|
|
|
6.3.2.1. Introduction
|
|
|
|
Recoll versions after 1.11 define a Python programming interface, both for
|
|
searching and indexing.
|
|
|
|
The python interface is not built by default and can be found in the
|
|
source package, under python/recoll. The directory contains the usual
|
|
setup.py script which you can use to build and install the module:
|
|
|
|
cd recoll-xxx/python/recoll
|
|
python setup.py build
|
|
python setup.py install
|
|
|
|
|
|
----------------------------------------------------------------------
|
|
|
|
6.3.2.2. Interface manual
|
|
|
|
NAME
|
|
recoll - This is an interface to the Recoll full text indexer.
|
|
|
|
FILE
|
|
/usr/local/lib/python2.5/site-packages/recoll.so
|
|
|
|
CLASSES
|
|
Db
|
|
Doc
|
|
Query
|
|
SearchData
|
|
|
|
class Db(__builtin__.object)
|
|
| Db([confdir=None], [extra_dbs=None], [writable = False])
|
|
|
|
|
| A Db object holds a connection to a Recoll index. Use the connect()
|
|
| function to create one.
|
|
| confdir specifies a Recoll configuration directory (default:
|
|
| $RECOLL_CONFDIR or ~/.recoll).
|
|
| extra_dbs is a list of external databases (xapian directories)
|
|
| writable decides if we can index new data through this connection
|
|
|
|
|
| Methods defined here:
|
|
|
|
|
|
|
|
| addOrUpdate(...)
|
|
| addOrUpdate(udi, doc, parent_udi=None) -> None
|
|
| Add or update index data for a given document
|
|
| The udi string must define a unique id for the document. It is not
|
|
| interpreted inside Recoll
|
|
| doc is a Doc object
|
|
| if parent_udi is set, this is a unique identifier for the
|
|
| top-level container (ie mbox file)
|
|
|
|
|
| delete(...)
|
|
| delete(udi) -> Bool.
|
|
| Purge index from all data for udi. If udi matches a container
|
|
| document, purge all subdocs (docs with a parent_udi matching udi).
|
|
|
|
|
| makeDocAbstract(...)
|
|
| makeDocAbstract(Doc, Query) -> string
|
|
| Build and return 'keyword-in-context' abstract for document
|
|
| and query.
|
|
|
|
|
| needUpdate(...)
|
|
| needUpdate(udi, sig) -> Bool.
|
|
| Check if the index is up to date for the document defined by udi,
|
|
| having the current signature sig.
|
|
|
|
|
| purge(...)
|
|
| purge() -> Bool.
|
|
| Delete all documents that were not touched during the just finished
|
|
| indexing pass (since open-for-write). These are the documents for which
|
|
| the needUpdate() call was not performed, indicating that they no
|
|
| longer exist in the primary storage system.
|
|
|
|
|
| query(...)
|
|
| query() -> Query. Return a new, blank query object for this index.
|
|
|
|
|
| setAbstractParams(...)
|
|
| setAbstractParams(maxchars, contextwords).
|
|
| Set the parameters used to build 'keyword-in-context' abstracts
|
|
|
|
|
| ----------------------------------------------------------------------
|
|
| Data and other attributes defined here:
|
|
|
|
|
|
|
class Doc(__builtin__.object)
|
|
| Doc()
|
|
|
|
|
| A Doc object contains index data for a given document.
|
|
| The data is extracted from the index when searching, or set by the
|
|
| indexer program when updating. The Doc object has no useful methods but
|
|
| many attributes to be read or set by its user. It matches exactly the
|
|
| Rcl::Doc c++ object. Some of the attributes are predefined, but,
|
|
| especially when indexing, others can be set, the name of which will be
|
|
| processed as field names by the indexing configuration.
|
|
| Inputs can be specified as unicode or strings.
|
|
| Outputs are unicode objects.
|
|
| All dates are specified as unix timestamps, printed as strings
|
|
| Predefined attributes (index/query/both):
|
|
| text (index): document plain text
|
|
| url (both)
|
|
| fbytes (both) optional file size in bytes
|
|
| filename (both)
|
|
| fmtime (both) optional file modification date. Unix time printed
|
|
| as string
|
|
| dbytes (both) document text bytes
|
|
| dmtime (both) document creation/modification date
|
|
| ipath (both) value private to the app.: internal access path
|
|
| inside file
|
|
| mtype (both) mime type for original document
|
|
| mtime (query) dmtime if set else fmtime
|
|
| origcharset (both) charset the text was converted from
|
|
| size (query) dbytes if set, else fbytes
|
|
| sig (both) app-defined file modification signature.
|
|
| For up to date checks
|
|
| relevancyrating (query)
|
|
| abstract (both)
|
|
| author (both)
|
|
| title (both)
|
|
| keywords (both)
|
|
|
|
|
| Methods defined here:
|
|
|
|
|
|
|
|
| ----------------------------------------------------------------------
|
|
| Data and other attributes defined here:
|
|
|
|
|
|
|
class Query(__builtin__.object)
|
|
| Recoll Query objects are used to execute index searches.
|
|
| They must be created by the Db.query() method.
|
|
|
|
|
| Methods defined here:
|
|
|
|
|
|
|
|
| execute(...)
|
|
| execute(query_string, stemming=1|0)
|
|
|
|
|
| Starts a search for query_string, a Recoll search language string
|
|
| (mostly Xesam-compatible).
|
|
| The query can be a simple list of terms (and'ed by default), or more
|
|
| complicated with field specs etc. See the Recoll manual.
|
|
|
|
|
| executesd(...)
|
|
| executesd(SearchData)
|
|
|
|
|
| Starts a search for the query defined by the SearchData object.
|
|
|
|
|
| fetchone(...)
|
|
| fetchone(None) -> Doc
|
|
|
|
|
| Fetches the next Doc object in the current search results.
|
|
|
|
|
| sortby(...)
|
|
| sortby(field=fieldname, ascending=true)
|
|
| Sort results by 'fieldname', in ascending or descending order.
|
|
| Only one field can be used, no subsorts for now.
|
|
| Must be called before executing the search
|
|
|
|
|
| ----------------------------------------------------------------------
|
|
| Data descriptors defined here:
|
|
|
|
|
| next
|
|
| Next index to be fetched from results. Normally increments after
|
|
| each fetchone() call, but can be set/reset before the call to effect
|
|
| seeking. Starts at 0
|
|
|
|
|
| ----------------------------------------------------------------------
|
|
| Data and other attributes defined here:
|
|
|
|
|
|
|
class SearchData(__builtin__.object)
|
|
| SearchData()
|
|
|
|
|
| A SearchData object describes a query. It has a number of global
|
|
| parameters and a chain of search clauses.
|
|
|
|
|
| Methods defined here:
|
|
|
|
|
|
|
|
| addclause(...)
|
|
| addclause(type='and'|'or'|'excl'|'phrase'|'near'|'sub',
|
|
| qstring=string, slack=int, field=string, stemming=1|0,
|
|
| subSearch=SearchData)
|
|
| Adds a simple clause to the SearchData And/Or chain, or a subquery
|
|
| defined by another SearchData object
|
|
|
|
|
| ----------------------------------------------------------------------
|
|
| Data and other attributes defined here:
|
|
|
|
|
|
|
FUNCTIONS
|
|
connect(...)
|
|
connect([confdir=None], [extra_dbs=None], [writable = False])
|
|
-> Db.
|
|
|
|
Connects to a Recoll database and returns a Db object.
|
|
confdir specifies a Recoll configuration directory
|
|
(the default is built like for any Recoll program).
|
|
extra_dbs is a list of external databases (xapian directories)
|
|
writable decides if we can index new data through this connection
|
|
|
|
|
|
|
|
----------------------------------------------------------------------
|
|
|
|
6.3.2.3. Example code
|
|
|
|
The following sample would query the index with a user language string.
|
|
See the python/samples directory inside the Recoll source for other
|
|
examples.
|
|
|
|
#!/usr/bin/env python

import recoll

# Connect to the default index and set the abstract-building parameters.
db = recoll.connect()
db.setAbstractParams(maxchars=80, contextwords=2)

query = db.query()
nres = query.execute("some user question")
print "Result count: ", nres
# Show at most the first 5 results.
if nres > 5:
    nres = 5
while query.next >= 0 and query.next < nres:
    doc = query.fetchone()
    print query.next
    for k in ("title", "size"):
        print k, ":", getattr(doc, k).encode('utf-8')
    abstract = db.makeDocAbstract(doc, query).encode('utf-8')
    print abstract
    print
|
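On the indexing side, the needUpdate(), addOrUpdate() and purge()
methods documented in the interface manual fit together roughly as in
the following sketch. The function name index_pass and the FakeDb and
FakeDoc classes are hypothetical: the fakes are in-memory stand-ins so
that the sketch can run without the recoll module, and they only mimic
the behaviour described in the manual. With the real module, the objects
would presumably come from recoll.connect(writable=True) and
recoll.Doc().

```python
def index_pass(db, make_doc, items):
    """Run one indexing pass and return the number of updated documents.

    `items` is an iterable of (udi, sig, fields) tuples describing the
    documents which currently exist in the primary storage.
    """
    updated = 0
    for udi, sig, fields in items:
        # needUpdate() must be called for every existing document, even
        # unchanged ones: purge() removes those it was not called for.
        if not db.needUpdate(udi, sig):
            continue
        doc = make_doc()
        doc.sig = sig
        for name, value in fields.items():
            setattr(doc, name, value)
        db.addOrUpdate(udi, doc)
        updated += 1
    db.purge()
    return updated


class FakeDoc(object):
    """Stand-in for a Doc object: a plain attribute holder."""


class FakeDb(object):
    """In-memory stand-in for a writable Db, for illustration only."""

    def __init__(self):
        self.docs = {}
        self.seen = set()

    def needUpdate(self, udi, sig):
        self.seen.add(udi)
        return udi not in self.docs or self.docs[udi].sig != sig

    def addOrUpdate(self, udi, doc, parent_udi=None):
        self.docs[udi] = doc

    def purge(self):
        for udi in list(self.docs):
            if udi not in self.seen:
                del self.docs[udi]
        self.seen = set()
        return True


db = FakeDb()
index_pass(db, FakeDoc, [("a", "1", {"text": "x"}), ("b", "1", {"text": "y"})])
index_pass(db, FakeDoc, [("a", "2", {"text": "x, edited"})])
print(sorted(db.docs))
```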
|
|
|
|
|
|
|
----------------------------------------------------------------------
|
|
|
|
Chapter 7. Installation
|
|
|
|
7.1. Installing a prebuilt copy
|
|
|
|
Recoll binary packages from the Recoll web site are always linked
|
|
statically to the Xapian libraries, and have no other dependencies. You
|
|
will only have to check or install supporting applications for the file
|
|
types that you want to index beyond text, HTML and mail files, and maybe
|
|
have a look at the configuration section (but this may not be necessary
|
|
for a quick test with default parameters).
|
|
|
|
----------------------------------------------------------------------
|
|
|
|
7.1.1. Installing through a package system
|
|
|
|
If you use a BSD-type port system or a prebuilt package (RPM or other),
|
|
just follow the usual procedure for your system.
|
|
|
|
----------------------------------------------------------------------
|
|
|
|
7.1.2. Installing a prebuilt Recoll
|
|
|
|
The unpackaged binary versions on the Recoll web site are just compressed
|
|
tar files of a build tree, where only the useful parts were kept
|
|
(executables and sample configuration).
|
|
|
|
The executable binary files are built with a static link to libxapian and
|
|
libiconv, to make installation easier (no dependencies).
|
|
|
|
After extracting the tar file, you can proceed with installation as if you
|
|
had built the package from source (that is, just type make install). The
|
|
binary trees are built for installation to /usr/local.
|
|
|
|
----------------------------------------------------------------------
|
|
|
|
7.2. Supporting packages
|
|
|
|
Recoll uses external applications to index some file types. You need to
|
|
install them for the file types that you wish to have indexed (these are
|
|
run-time dependencies. None is needed for building Recoll).
|
|
|
|
After an indexing pass, the commands that were found missing can be
|
|
displayed from the recoll File menu. The list is stored in the missing
|
|
text file inside the configuration directory.
|
|
|
|
A list of common file types which need external commands:
|
|
|
|
* Openoffice: supported natively, but needs the unzip command to be
|
|
installed.
|
|
|
|
* PDF: pdftotext is part of the Xpdf package.
|
|
|
|
* Postscript: pstotext.
|
|
|
|
* MS Word: antiword.
|
|
|
|
* MS Excel and PowerPoint: catdoc.
|
|
|
|
* Wordperfect files: libwpd.
|
|
|
|
* RTF: unrtf
|
|
|
|
* TeX: Recoll uses the untex program. Your distribution may have a
|
|
package for it. If it doesn't, there is a copy of the source on the
|
|
Recoll web site, because the program has no obvious home. The filter
|
|
can also work with detex and will use it if it is installed.
|
|
|
|
* dvi: dvips
|
|
|
|
* djvu: DjVuLibre
|
|
|
|
* MP3: Recoll will use the id3info command from the id3lib package to
|
|
extract tag information. Without it, only the file names will be
|
|
indexed.
|
|
|
|
* Pictures: Recoll uses the Exiftool Perl package to extract tag
|
|
information. Most image file formats are supported.
|
|
|
|
Text, HTML, mail folders, Openoffice and Scribus files are processed
|
|
internally. Lyx is used to index Lyx files. Many filters need sed and awk.
|
|
|
|
----------------------------------------------------------------------
|
|
|
|
7.3. Building from source
|
|
|
|
7.3.1. Prerequisites
|
|
|
|
At the very least, you will need to download and install the xapian core
|
|
package (Recoll 1.9 normally uses version 1.0.2, but any 0.9 or 1.0.x
|
|
version will work too), and the qt run-time and development packages
|
|
(Recoll development currently uses version 3.3.5, but any 3.3 version is
|
|
probably OK).
|
|
|
|
You will most probably be able to find a binary package for qt for your
|
|
system. You may have to compile Xapian but this is not difficult (if you
|
|
are using FreeBSD, there is a port).
|
|
|
|
You may also need libiconv. Recoll currently uses version 1.9 (this should
|
|
not be critical). On Linux systems, the iconv interface is part of libc
|
|
and you should not need to do anything special.
|
|
|
|
----------------------------------------------------------------------
|
|
|
|
7.3.2. Building
|
|
|
|
Recoll has been built on Linux (redhat7.3, mandriva 2005/6, Fedora Core
|
|
3/4/5/6), FreeBSD 5/6, macosx, and Solaris 8. If you build on another
|
|
system, and need to modify things, I would very much welcome patches.
|
|
|
|
Depending on the qt configuration on your system, you may have to set the
|
|
QTDIR and QMAKESPECS variables in your environment:
|
|
|
|
* QTDIR should point to the directory above the one that holds the qt
|
|
include files (ie: if qt.h is /usr/local/qt/include/qt.h, QTDIR should
|
|
be /usr/local/qt).
|
|
|
|
* QMAKESPECS should be set to the name of one of the qt mkspecs
|
|
sub-directories (ie: linux-g++).
|
|
|
|
On many Linux systems, QTDIR is set by the login scripts, and QMAKESPECS
|
|
is not needed because there is a default link in mkspecs/.
|
|
|
|
Configure options: --without-aspell will disable the code for phonetic
|
|
matching of search terms. --with-fam or --with-inotify will enable the
|
|
code for real time indexing. Inotify support is enabled by default on
|
|
recent Linux systems.
|
|
|
|
Normal procedure:
|
|
|
|
cd recoll-xxx
|
|
configure
|
|
make
|
|
(practices usual hardship-repelling invocations)
|
|
|
|
|
|
There is little auto-configuration. The configure script will mainly link one
|
|
of the system-specific files in the mk directory to mk/sysconf. If your
|
|
system is not known yet, it will tell you as much, and you may want to
|
|
manually copy and modify one of the existing files (the new file name
|
|
should be the output of uname -s).
|
|
|
|
----------------------------------------------------------------------
|
|
|
|
7.3.3. Installation
|
|
|
|
Either type make install or execute recollinstall prefix, in the root of
|
|
the source tree. This will copy the commands to prefix/bin and the sample
|
|
configuration files, scripts and other shared data to prefix/share/recoll.
|
|
|
|
If the installation prefix given to recollinstall is different from what
|
|
was specified when executing configure, you will have to set the
|
|
RECOLL_DATADIR environment variable to indicate where the shared data is
|
|
to be found.
|
|
|
|
You can then proceed to configuration.
|
|
|
|
----------------------------------------------------------------------
|
|
|
|
7.4. Configuration overview
|
|
|
|
Most of the parameters specific to the recoll GUI are set through the
|
|
Preferences menu and stored in the standard QT place ($HOME/.qt/recollrc).
|
|
You probably do not want to edit this by hand.
|
|
|
|
Recoll indexing options are set inside text configuration files located in
|
|
a configuration directory. There can be several such directories, each of
|
|
which defines the parameters for one index.
|
|
|
|
The configuration files can be edited by hand or through the Indexing
|
|
configuration dialog (Preferences menu). The GUI tool will try to respect
|
|
your formatting and comments as much as possible, so it is quite possible
|
|
to use both ways.
|
|
|
|
The most accurate documentation for the configuration parameters is given
|
|
by comments inside the default files, and we will just give a general
|
|
overview here.
|
|
|
|
For each index, there are two sets of configuration files. System-wide
|
|
configuration files are kept in a directory named like
|
|
/usr/[local/]share/recoll/examples, and define default values, shared by
|
|
all indexes. For each index, a parallel set of files defines the
|
|
customized parameters.
|
|
|
|
The default location of the configuration is the .recoll directory in your
|
|
home. Most people will only use this directory.
|
|
|
|
This location can be changed, or others can be added with the
|
|
RECOLL_CONFDIR environment variable or the -c option parameter to recoll
|
|
and recollindex.
|
|
|
|
If the .recoll directory does not exist when recoll or recollindex are
|
|
started, it will be created with a set of empty configuration files.
|
|
recoll will give you a chance to edit the configuration file before
|
|
starting indexing. recollindex will proceed immediately. To avoid
|
|
mistakes, the automatic directory creation will only occur for the default
|
|
location, not if -c or RECOLL_CONFDIR were used (in the latter cases, you
|
|
will have to create the directory).
|
|
|
|
All configuration files share the same format. For example, a short
|
|
extract of the main configuration file might look as follows:
|
|
|
|
# Space-separated list of directories to index.
|
|
topdirs = ~/docs /usr/share/doc
|
|
|
|
[~/somedirectory-with-utf8-txt-files]
|
|
defaultcharset = utf-8
|
|
|
|
|
|
There are three kinds of lines:
|
|
|
|
* Comment (starts with #) or empty.
|
|
|
|
* Parameter assignment (name = value).
|
|
|
|
* Section definition ([somedirname]).
|
|
|
|
Section definitions allow redefining some parameters for a directory
|
|
sub-tree. They stay in effect until another section definition, or the end
|
|
of file, is encountered. Some of the parameters used for indexing are
|
|
looked up hierarchically from the current directory location upwards. Not
|
|
all parameters can be meaningfully redefined, this is specified for each
|
|
in the next section.
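The three kinds of lines and the upwards lookup can be sketched as
follows. This is not the Recoll parser: the function names parse_conf
and lookup are hypothetical, top-level parameters are stored under an
empty section name as a convention of the sketch, and the handling of
backslash continuation lines is inferred from the samples in this
manual.

```python
import os


def parse_conf(text):
    """Parse a configuration text into {section: {name: value}}."""
    conf = {"": {}}
    section = ""
    pending = ""
    for raw in text.splitlines():
        line = pending + raw.strip()
        pending = ""
        if line.endswith("\\"):  # continuation line
            pending = line[:-1]
            continue
        if not line or line.startswith("#"):
            continue
        if line.startswith("[") and line.endswith("]"):
            section = os.path.expanduser(line[1:-1])
            conf.setdefault(section, {})
        else:
            name, _, value = line.partition("=")
            conf[section][name.strip()] = value.strip()
    return conf


def lookup(conf, name, path):
    """Return the value of `name` in effect for `path`, or None."""
    path = os.path.expanduser(path)
    while True:
        if name in conf.get(path, {}):
            return conf[path][name]
        parent = os.path.dirname(path)
        if parent == path:  # reached the top
            break
        path = parent
    return conf[""].get(name)


SAMPLE = """
# Space-separated list of directories to index.
topdirs = ~/docs /usr/share/doc

[~/somedirectory-with-utf8-txt-files]
defaultcharset = utf-8
"""
conf = parse_conf(SAMPLE)
print(lookup(conf, "defaultcharset",
             "~/somedirectory-with-utf8-txt-files/sub/a.txt"))
print(lookup(conf, "defaultcharset", "~/docs/b.txt"))
```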
|
|
|
|
When found at the beginning of a file path, the tilde character (~) is
|
|
expanded to the name of the user's home directory, as a shell would do.
|
|
|
|
White space is used for separation inside lists. List elements with
|
|
embedded spaces can be quoted using double-quotes.
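As an illustration of the list and tilde conventions, the following
sketch splits a list value into its elements. The function name
split_list is hypothetical, and shlex is only an approximation: it also
interprets single quotes and backslashes, which the rules above do not
mention.

```python
import os
import shlex


def split_list(value):
    """Split a white space separated list, honouring double quotes."""
    # A leading "~" in each element is expanded to the home directory.
    return [os.path.expanduser(item) for item in shlex.split(value)]


print(split_list('~/docs "/data/My Documents" /usr/share/doc'))
```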
|
|
|
|
----------------------------------------------------------------------
|
|
|
|
7.4.1. Main configuration file
|
|
|
|
recoll.conf is the main configuration file. It defines things like what to
|
|
index (top directories and things to ignore), and the default character
|
|
set to use for document types which do not specify it internally.
|
|
|
|
The default configuration will index your home directory. If this is not
|
|
appropriate, start recoll to create a blank configuration, click Cancel,
|
|
and edit the configuration file before restarting the command. This will
|
|
start the initial indexing, which may take some time.
|
|
|
|
Parameters:
|
|
|
|
topdirs
|
|
|
|
Specifies the list of directories or files to index (recursively
|
|
for directories). The indexer will not follow symbolic links
|
|
inside the indexed trees by default (see the followLinks option
|
|
though).
|
|
|
|
dbdir
|
|
|
|
The name of the Xapian data directory. It will be created if
|
|
needed when the index is initialized. If this is not an absolute
|
|
path, it will be interpreted relative to the configuration
|
|
directory. The value can have embedded spaces but starting or
|
|
trailing spaces will be trimmed. You cannot use quotes here.
|
|
|
|
skippedNames
|
|
|
|
A space-separated list of patterns for names of files or
|
|
directories that should be completely ignored. The list defined in
|
|
the default file is:
|
|
|
|
skippedNames = #* bin CVS Cache cache* caughtspam tmp .thumbnails .svn \
|
|
*~ recollrc
|
|
|
|
The list can be redefined for sub-directories, but is only
|
|
actually changed for the top level ones in topdirs.
|
|
|
|
The top-level directories are not affected by this list (that is,
|
|
a directory in topdirs might match and would still be indexed).
|
|
|
|
The list in the default configuration does not exclude hidden
|
|
directories (names beginning with a dot), which means that it may
|
|
index quite a few things that you do not want. On the other hand,
|
|
mail user agents like thunderbird usually store messages in hidden
|
|
directories, and you probably want this indexed. One possible
|
|
solution is to have .* in skippedNames, and add things like
|
|
~/.thunderbird or ~/.evolution in topdirs.
|
|
|
|
Not even the file names are indexed for patterns in this list. See
|
|
the recoll_noindex variable in mimemap for an alternative approach
|
|
which indexes the file names.
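The effect of skippedNames on a tree walk can be approximated with the
sketch below, which uses shell-style pattern matching from the Python
standard library. The function names are hypothetical and the matching
rules of the real indexer may differ in details. Note that the top
directory itself is never tested against the list, and that symbolic
links to directories are not followed, which corresponds to the default
described for topdirs.

```python
import fnmatch
import os

# The default list quoted above.
SKIPPED_NAMES = ["#*", "bin", "CVS", "Cache", "cache*", "caughtspam",
                 "tmp", ".thumbnails", ".svn", "*~", "recollrc"]


def is_skipped(name, patterns=SKIPPED_NAMES):
    """True if a file or directory name matches one of the patterns."""
    return any(fnmatch.fnmatchcase(name, pat) for pat in patterns)


def walk(topdir, patterns=SKIPPED_NAMES):
    """Yield the file paths under topdir which are not skipped."""
    for dirpath, dirnames, filenames in os.walk(topdir):
        # Prune in place so that os.walk() does not descend into
        # skipped directories.
        dirnames[:] = [d for d in dirnames if not is_skipped(d, patterns)]
        for name in filenames:
            if not is_skipped(name, patterns):
                yield os.path.join(dirpath, name)


print(is_skipped("notes.txt~"), is_skipped("notes.txt"))
```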
|
|
|
|
skippedPaths and daemSkippedPaths
|
|
|
|
A space-separated list of patterns for paths of files or
|
|
directories that should be skipped. There is no default in the
|
|
sample configuration file, but the code always adds the
|
|
configuration and database directories in there.
|
|
|
|
skippedPaths is used both by batch and real time indexing.
|
|
daemSkippedPaths can be used to specify things that should be
|
|
indexed at startup, but not monitored.
|
|
|
|
Example of use for skipping text files only in a specific
|
|
directory:
|
|
|
|
skippedPaths = ~/somedir/*.txt
|
|
|
|
|
|
followLinks
|
|
|
|
Specifies if the indexer should follow symbolic links while
|
|
walking the file tree. The default is to ignore symbolic links to
|
|
avoid multiple indexing of linked files. No effort is made to
|
|
avoid duplication when this option is set to true. This option can
|
|
be set individually for each of the topdirs members by using
|
|
sections. It can not be changed below the topdirs level.
|
|
|
|
loglevel, daemloglevel
|
|
|
|
Verbosity level for recoll and recollindex. A value of 4 lists
|
|
quite a lot of debug/information messages. 2 only lists errors.
|
|
The daem version is specific to the indexing monitor daemon.
|
|
|
|
logfilename, daemlogfilename
|
|
|
|
Where the messages should go. 'stderr' can be used as a special
|
|
value, and is the default. The daem version is specific to the
|
|
indexing monitor daemon.
|
|
|
|
indexstemminglanguages
|
|
|
|
A list of languages for which the stem expansion databases will be
|
|
built. See recollindex(1) or use the recollindex -l command for
|
|
possible values. You can add a stem expansion database for a
|
|
different language by using recollindex -s, but it will be deleted
|
|
during the next indexing. Only languages listed in the
|
|
configuration file are permanent.
|
|
|
|
defaultcharset
|
|
|
|
The name of the character set used for files that do not contain a
|
|
character set definition (ie: plain text files). This can be
|
|
redefined for any sub-directory. If it is not set at all, the
|
|
character set used is the one defined by the nls environment
|
|
(LC_ALL, LC_CTYPE, LANG), or iso8859-1 if nothing is set.
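The fallback order for defaultcharset can be sketched as below. The
function name default_charset is hypothetical, and so is the treatment
of locale names which carry no character set part (such as "C"): the
sketch simply moves on to the next variable, which may not be what the
nls library does.

```python
def default_charset(conf_value, environ):
    """Pick the charset for files which do not declare one."""
    if conf_value:
        return conf_value
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        locale_name = environ.get(var, "")
        # Locale names look like "fr_FR.UTF-8" or
        # "de_DE.ISO-8859-15@euro".
        if "." in locale_name:
            return locale_name.split(".", 1)[1].split("@", 1)[0]
    return "iso8859-1"


print(default_charset(None, {"LANG": "fr_FR.UTF-8"}))
```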
|
|
|
|
maxfsoccuppc
|
|
|
|
Maximum file system occupation before we stop indexing. The value
|
|
is a percentage, corresponding to what the "Capacity" df output
|
|
column shows. The default value is 0, meaning no checking.
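The kind of check that maxfsoccuppc implies can be sketched with the
Python standard library. The function name fs_too_full is hypothetical,
and whether indexing stops when the limit is reached or only when it is
exceeded is an assumption of the sketch. The percentage is computed the
way df computes its "Capacity" column, from the used space and the space
available to unprivileged users.

```python
import shutil


def fs_too_full(path, maxfsoccuppc, disk_usage=shutil.disk_usage):
    """True if the file system holding `path` has reached the limit."""
    if maxfsoccuppc <= 0:  # 0 is the default: no checking
        return False
    usage = disk_usage(path)
    usable = usage.used + usage.free
    if usable == 0:
        return True
    return 100.0 * usage.used / usable >= maxfsoccuppc


print(fs_too_full(".", 0))
```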
|
|
|
|
idxflushmb
|
|
|
|
Threshold (megabytes of new text data) where we flush from memory
|
|
to disk index. Setting this can help control memory usage. A value
|
|
of 0 means no explicit flushing, letting Xapian use its own
|
|
default, which is flushing every 10000 documents (memory usage
|
|
depends on average document size). The default value is 10.
|
|
|
|
filtersdir
|
|
|
|
A directory to search for the external filter scripts used to
|
|
index some types of files. The value should not be changed, except
|
|
if you want to modify one of the default scripts. The value can be
|
|
redefined for any sub-directory.
|
|
|
|
iconsdir
|
|
|
|
The name of the directory where recoll result list icons are
|
|
stored. You can change this if you want different images.
|
|
|
|
guesscharset
|
|
|
|
Decide if we try to guess the character set of files if no
|
|
internal value is available (ie: for plain text files). This does
|
|
not work well in general, and should probably not be used.
|
|
|
|
usesystemfilecommand
|
|
|
|
Decide if we use the file -i system command as a final step for
|
|
determining the mime type for a file (the main procedure uses
|
|
suffix associations as defined in the mimemap file). This can be
|
|
useful for files with suffix-less names, but it will also cause
|
|
the indexing of many bogus "text" files.
|
|
|
|
indexedmimetypes
|
|
|
|
Recoll normally indexes any file which it knows how to read. This
|
|
list lets you restrict the indexed mime types to what you specify.
|
|
If the variable is unspecified or the list empty (the default),
|
|
all supported types are processed.
|
|
|
|
compressedfilemaxkbs
|
|
|
|
Size limit for compressed (.gz or .bz2) files. These need to be
|
|
decompressed in a temporary directory for identification, which
|
|
can be very wasteful if 'uninteresting' big compressed files are
|
|
present. Negative means no limit, 0 means no processing of any
|
|
compressed file. Defaults to -1.
|
|
|
|
indexallfilenames
|
|
|
|
Recoll indexes file names in a special section of the database to
|
|
allow specific file name searches using wild cards. This
|
|
parameter decides if file name indexing is performed only for
|
|
files with mime types that would qualify them for full text
|
|
indexing, or for all files inside the selected subtrees,
|
|
independently of mime type.
|
|
|
|
idxabsmlen
|
|
|
|
Recoll stores an abstract for each indexed file inside the
|
|
database. The text can come from an actual 'abstract' section in
|
|
the document or will just be the beginning of the document. It is
|
|
stored in the index so that it can be displayed inside the result
|
|
lists without decoding the original file. The idxabsmlen parameter
|
|
defines the size of the stored abstract. The default value is 250
|
|
bytes. The search interface gives you the choice to display this
|
|
stored text or a synthetic abstract built by extracting text
|
|
around the search terms. If you always prefer the synthetic
|
|
abstract, you can reduce this value and save a little space.
|
|
|
|
aspellLanguage
|
|
|
|
Language definitions to use when creating the aspell dictionary.
|
|
The value must match a set of aspell language definition files.
|
|
You can type "aspell config" to see where these are installed
|
|
(look for data-dir). The default if the variable is not set is to
|
|
use your desktop national language environment to guess the value.
|
|
|
|
noaspell
|
|
|
|
If this is set, the aspell dictionary generation is turned off.
|
|
Useful for cases where you don't need the functionality or when it
|
|
is unusable because aspell crashes during dictionary generation.
|
|
|
|
nocjk
|
|
|
|
If this is set to true, specific East Asian (Chinese, Korean, Japanese)
|
|
characters/word splitting is turned off. This will save a small
|
|
amount of cpu if you have no CJK documents. If your document base
|
|
does include such text but you are not interested in searching it,
|
|
setting nocjk may be a significant time and space saver.
|
|
|
|
cjkngramlen
|
|
|
|
This lets you adjust the size of n-grams used for indexing CJK
|
|
text. The default value of 2 is probably appropriate in most
|
|
cases. A value of 3 would allow more precision and efficiency on
|
|
longer words, but the index will be approximately twice as large.
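
As an illustration, some of the parameters described above could be
combined as follows in a personal recoll.conf. This is only a sketch: the
values are arbitrary, and the name = value layout, the 0/1 spelling of
boolean values and the '#' comment lines are assumed to follow the
conventions used in the rest of the file.

```
# Personal recoll.conf fragment (illustrative values only)

# Index the names of all files, whatever their mime type
indexallfilenames = 1

# Store shorter abstracts than the 250 bytes default
idxabsmlen = 100

# Build the aspell dictionary from the English definitions
aspellLanguage = en

# No CJK documents here: turn off the specific splitting
nocjk = true
```

Parameters that are left out keep their default values.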

----------------------------------------------------------------------

7.4.2. The mimemap file

mimemap specifies the file name extension to mime type mappings.

For file names without an extension, or with an unknown one, the system's
file -i command will be executed to determine the mime type (this can be
switched off inside the main configuration file).

The mappings can be specified on a per-subtree basis, which may be useful
in some cases. Example: gaim logs have a .txt extension but should be
handled specially, which is possible because they are usually all located
in one place.

mimemap also has a recoll_noindex variable which is a list of suffixes.
Matching files will be skipped (which avoids unnecessary decompressions or
file executions). This is partially redundant with skippedNames in the
main configuration file, with a few differences: it will not affect
directories, it cannot be made dependent on the file-system location (it
is a configuration-wide parameter), and the file names will still be
indexed (not even the file names are indexed for patterns in
skippedNames). recoll_noindex is used mostly for things known to be
unindexable by a given Recoll version. Having it there avoids cluttering
the more user-oriented and locally customized skippedNames.
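
The following fragment sketches what a personal mimemap using these
features might look like. The subtree path, the text/x-gaim-log type name
and the suffix list are made up for the example, and the bracketed
subtree header is assumed to work as it does in the main configuration
file.

```
# Configuration-wide list of suffixes to skip (hypothetical list)
recoll_noindex = .tar.gz .tgz

# Default mapping: suffix = mime type
.txt = text/plain

# Different mapping inside one subtree (hypothetical path and type)
[~/.gaim/logs]
.txt = text/x-gaim-log
```

The recoll_noindex line is placed before any subtree header because the
parameter applies to the whole configuration.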

----------------------------------------------------------------------

7.4.3. The mimeconf file

mimeconf specifies how the different mime types are handled for indexing,
and which icons are displayed in the recoll result lists.

Changing the parameters in the [index] section is probably not a good idea
unless you are a Recoll developer.

The [icons] section allows you to change the icons which are displayed by
recoll in the result lists. The values are the basenames of the PNG images
inside the iconsdir directory (specified in recoll.conf).

----------------------------------------------------------------------

7.4.4. The mimeview file

mimeview specifies which programs are started when you click on an Edit
link in a result list. For example, HTML is normally displayed using
firefox, but you may prefer Konqueror; your openoffice.org program might
be named ooffice instead of openoffice, etc.

Changes to this file can be made by direct editing, or through the recoll
user preferences dialog.

As for the other configuration files, the normal usage is to have a
mimeview inside your own configuration directory, with just the
non-default entries, which will override those from the central
configuration file.

Please note that these entries must be placed under a [view] section.

If "Use desktop preferences to choose document editor" is checked in the
user preferences, all mimeview entries will be ignored except the one
labelled application/x-all (which is set to use xdg-open by default).
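
A personal mimeview holding only the overrides mentioned above might look
like the sketch below. The command lines and the OpenOffice mime type are
assumptions chosen for the example; check the central mimeview file for
the entries actually used by your version. %u stands for the document URL
and %f for its file name.

```
[view]
# Use Konqueror instead of the default HTML viewer
text/html = konqueror %u
# The office suite launcher is named ooffice on this system
application/vnd.sun.xml.writer = ooffice %f
```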

----------------------------------------------------------------------

7.4.5. Examples of configuration adjustments

7.4.5.1. Adding an external viewer for a non-indexed type

Imagine that you have some kind of file which does not have indexable
content, but for which you would like to have a functional Edit link in
the result list (when found by file name). The file names end in .blob and
can be displayed by application blobviewer.

You need two entries in the configuration files for this to work:

* In $RECOLL_CONFDIR/mimemap (typically ~/.recoll/mimemap), add the
  following line:

    .blob = application/x-blobapp

  Note that the mime type is made up here, and you could call it
  diesel/oil just the same.

* In $RECOLL_CONFDIR/mimeview under the [view] section:

    application/x-blobapp = blobviewer %f

  We are supposing that blobviewer wants a file name parameter here; you
  would use %u if it liked URLs better.

If you just wanted to change the application used by Recoll to display a
mime type which it already knows, you would just need to edit mimeview.
The entries you add in your personal file override those in the central
configuration, which you do not need to alter.

----------------------------------------------------------------------

7.4.5.2. Adding indexing support for a new file type

Let us now imagine that the above .blob files actually contain indexable
text and that you know how to extract it with a command line program.
Getting Recoll to index the files is easy. You need to perform the above
alteration, and also to add data to the mimeconf file (typically in
~/.recoll/mimeconf):

* Under the [index] section, add the following line (more about the
  rclblob indexing script later):

    application/x-blobapp = exec rclblob

* Under the [icons] section, you should choose an icon to be displayed
  for the files inside the result lists. Icons are normally 64x64 pixel
  PNG files which live in /usr/[local/]share/recoll/images.

* Under the [categories] section, you should add the mime type where it
  makes sense (you can also create a category). Categories may be used
  for filtering in advanced search.
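
Put together, the three additions to the personal mimeconf could look like
the following sketch. Only the [index] line comes from the text above. The
icon name "document" is assumed to be the basename of an existing image,
and the "blobs" category is a hypothetical new one.

```
[index]
application/x-blobapp = exec rclblob

[icons]
# Assumed basename of a PNG file in the images directory
application/x-blobapp = document

[categories]
# Hypothetical new category holding only the new type
blobs = application/x-blobapp
```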

The rclblob filter should be an executable program or script which exists
inside /usr/[local/]share/recoll/filters. It will be given a file name as
argument and should output the text contents on the standard output.
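
As a rough illustration of this contract, here is a minimal sketch of what
rclblob could look like as a shell script. The .blob format is imaginary,
so the extraction step is an assumption: the script supposes that the
useful text is stored as plain ASCII mixed with binary data, and simply
drops the non-printable bytes. A real filter would call the actual
extraction program instead. The extract_blob_text function name is made up
for the example.

```shell
#!/bin/sh
# rclblob: sketch of a filter for the made-up .blob format.
# Assumption (not from the manual): the indexable text is plain ASCII
# mixed with binary data, so deleting non-printable bytes stands in
# for a real extraction command.

extract_blob_text() {
    # Keep tab, newline and printable ASCII; delete every other byte.
    LC_ALL=C tr -cd '\11\12\40-\176' < "$1"
}

# When called with a file name, write its text to standard output.
if [ "$#" -ge 1 ]; then
    if [ ! -r "$1" ]; then
        echo "rclblob: cannot read $1" >&2
        exit 1
    fi
    extract_blob_text "$1"
fi
```

Once copied to the filters directory and made executable, the script can
be checked by hand by running it on a sample file and looking at what
comes out on the standard output.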

The filter programming section describes in more detail how to write a
filter.

----------------------------------------------------------------------

7.5. The KDE Kicker Recoll applet

The Recoll source tree contains the source code to the recoll_applet, a
small application derived from the find_applet. This can be used to add a
small Recoll launcher to the KDE panel.

The applet is not automatically built with the main Recoll programs, nor
is it included with the main source distribution (because the KDE build
boilerplate makes it relatively big). You can download its source from the
recoll.org download page. Use the omnipotent configure;make;make install
incantation to build and install.

You can then add the applet to the panel by right-clicking the panel and
choosing the Add applet entry.

The recoll_applet has a small text window where you can type a Recoll
query (in query language form), and an icon which can be used to restrict
the search to certain types of files. It is quite primitive, and launches
a new recoll GUI instance every time (even if it is already running). You
may find it useful anyway.

----------------------------------------------------------------------
|