<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
"http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd" [

<!ENTITY RCL "<application>Recoll</application>">
<!ENTITY RCLAPPS "<ulink url='http://www.recoll.org/features.html#doctypes'>http://www.recoll.org/features.html</ulink>">
<!ENTITY RCLVERSION "1.22">
<!ENTITY XAP "<application>Xapian</application>">
<!ENTITY WIN "<application>Windows</application>">
<!ENTITY WIKI "http://bitbucket.org/medoc/recoll/wiki/">
]>

<book lang="en">
|
|
|
|
<bookinfo>
|
|
<title>Recoll user manual</title>
|
|
|
|
<author>
|
|
<firstname>Jean-Francois</firstname>
|
|
<surname>Dockes</surname>
|
|
<affiliation>
|
|
<address><email>jfd@recoll.org</email></address>
|
|
</affiliation>
|
|
</author>
|
|
|
|
<copyright>
|
|
<year>2005-2015</year>
|
|
<holder role="mailto:jfd@recoll.org">Jean-Francois Dockes</holder>
|
|
</copyright>
|
|
|
|
<abstract>
|
|
<para><literal>Permission is granted to copy, distribute and/or
|
|
modify this document under the terms of the GNU Free Documentation
|
|
License, Version 1.3 or any later version published by the Free
|
|
Software Foundation; with no Invariant Sections, no Front-Cover
|
|
Texts, and no Back-Cover Texts. A copy of the license can be
|
|
found at the following
|
|
location: <ulink url="http://www.gnu.org/licenses/fdl.html">GNU
|
|
web site</ulink>.</literal></para>
|
|
|
|
<para>This document introduces full text search notions
|
|
and describes the installation and use of the &RCL;
|
|
application. This version describes &RCL; &RCLVERSION;.</para>
|
|
</abstract>
|
|
|
|
|
|
</bookinfo>
|
|
|
|
<chapter id="RCL.INTRODUCTION">
|
|
<title>Introduction</title>
|
|
|
|
<para>This document introduces full text search notions
|
|
and describes the installation and use of the &RCL;
|
|
application. This version describes &RCL; &RCLVERSION;.</para>
|
|
|
|
<para>&RCL; was for a long time dedicated to Unix-like systems. It
|
|
was ported only recently (2015) to
|
|
<application>MS-Windows</application>. Many references in this
|
|
manual, especially file locations, are specific to Unix, and not
|
|
valid on &WIN;. Some described features are also not available on
|
|
&WIN;. The manual will be progressively updated. Until this happens,
|
|
most references to shared files can be translated by looking under
|
|
the Recoll installation directory (esp. the
|
|
<filename>Share</filename> subdirectory). The user configuration is
|
|
stored by default under <filename>AppData/Local/Recoll</filename>
|
|
inside the user directory, along with the index itself.</para>
|
|
|
|
<sect1 id="RCL.INTRODUCTION.TRYIT">
|
|
<title>Giving it a try</title>
|
|
|
|
<para>If you do not like reading manuals (who does?) but
|
|
wish to give &RCL; a try, just <link
|
|
linkend="RCL.INSTALL.BINARY">install</link> the application
|
|
and start the <command>recoll</command> graphical user
|
|
interface (GUI), which will ask permission to index your home
|
|
directory by default, allowing you to search immediately after
|
|
indexing completes.</para>
|
|
|
|
<para>Do not do this if your home directory contains a huge
|
|
number of documents and you do not want to wait or are very
|
|
short on disk space. In this case, you may first want to customize
|
|
the <link linkend="RCL.INDEXING.CONFIG">configuration</link>
|
|
to restrict the indexed area (for the very impatient with a completed package install, from the <command>recoll</command> GUI: <menuchoice>
|
|
<guimenu>Preferences</guimenu>
|
|
<guimenuitem>Indexing configuration</guimenuitem>
|
|
</menuchoice>, then adjust the <guilabel>Top
|
|
directories</guilabel> section).</para>
|
|
|
|
<para>Also be aware that, on Unix/Linux, you may need to install the
|
|
appropriate <link linkend="RCL.INSTALL.EXTERNAL"> supporting
|
|
applications</link> for document types that need them (for
|
|
example <application>antiword</application> for
|
|
<application>Microsoft Word</application> files).</para>
|
|
|
|
<para>The &RCL; installation for &WIN; is self-contained and includes
|
|
most useful auxiliary programs. You will just need to install Python
|
|
2.7.</para>
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="RCL.INTRODUCTION.SEARCH">
|
|
<title>Full text search</title>
|
|
|
|
<para>&RCL; is a full text search application, which means that it
|
|
finds your data by content rather than by external attributes
|
|
(like the file name). You specify words
|
|
(terms) which should or should not appear in the text you are
|
|
looking for, and receive in return a list of matching
|
|
documents, ordered so that the most
|
|
<emphasis>relevant</emphasis> documents will appear
|
|
first.</para>
|
|
|
|
<para>You do not need to remember in what file or email message you
|
|
stored a given piece of information. You just ask for related
|
|
terms, and the tool will return a list of documents where
|
|
these terms are prominent, in a similar way to Internet search
|
|
engines.</para>
|
|
|
|
<para>Full text search applications try to determine which
|
|
documents are most relevant to the search terms you
|
|
provide. Computer algorithms for determining relevance can be
|
|
very complex, and in general are inferior to the power of the
|
|
human mind to rapidly determine relevance. The quality of
|
|
relevance guessing is probably the most important aspect when
|
|
evaluating a search application.</para>
|
|
|
|
<para>In many cases, you are looking for all the forms of a
|
|
word, including plurals, different tenses for a verb, or terms
|
|
derived from the same root or <emphasis>stem</emphasis>
|
|
(example: <replaceable>floor, floors, floored,
|
|
flooring...</replaceable>). Queries are usually automatically
|
|
expanded to all such related terms (words that reduce to the
|
|
same stem). This can be prevented for searching for a specific
|
|
form.</para>
|
|
|
|
<para>Stemming, by itself, does not accommodate misspellings
|
|
or phonetic searches. A full text search application may also
|
|
support this form of approximation. For example, a search for
|
|
<replaceable>aliterattion</replaceable> returning no result may
|
|
propose, depending on index contents, <replaceable>alliteration
|
|
alteration alterations altercation</replaceable> as possible
|
|
replacement terms. </para>
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="RCL.INTRODUCTION.RECOLL">
|
|
<title>Recoll overview</title>
|
|
|
|
<para>&RCL; uses the
|
|
<ulink url="http://www.xapian.org">&XAP;</ulink> information retrieval
|
|
library as its storage and retrieval engine. &XAP; is a very
|
|
mature package using <ulink
|
|
url="http://www.xapian.org/docs/intro_ir.html">a sophisticated
|
|
probabilistic ranking model</ulink>.</para>
|
|
|
|
<para>The &XAP; library manages an index database which
|
|
describes where terms appear in your document files. It
|
|
efficiently processes the complex queries which are produced by
|
|
the &RCL; query expansion mechanism, and is in charge of the
|
|
all-important relevance computation task.</para>
|
|
|
|
<para>&RCL; provides the mechanisms and interface to get data
|
|
into and out of the index. This includes translating the many
|
|
possible document formats into pure text, handling term
|
|
variations (using &XAP; stemmers), and spelling approximations
|
|
(using the <application>aspell</application> speller),
|
|
interpreting user queries and presenting results.</para>
|
|
|
|
<para>In short, &RCL; does the dirty footwork, while &XAP;
|
|
deals with the intelligent parts of the process.</para>
|
|
|
|
<para>The &XAP; index can be big (roughly the size of the
|
|
original document set), but it is not a document
|
|
archive. &RCL; can only display documents that still exist at
|
|
the place from which they were indexed. (Actually, there is a
|
|
way to reconstruct a document from the information in the
|
|
index, but the result is not nice, as all formatting,
|
|
punctuation and capitalization are lost).</para>
|
|
|
|
<para>&RCL; stores all internal data in <application>Unicode
|
|
UTF-8</application> format, and it can index files of many types
|
|
with different character sets, encodings, and languages into the
|
|
same index. It can process documents embedded inside other
|
|
documents (for example a pdf document stored inside a Zip
|
|
archive sent as an email attachment...), down to an arbitrary
|
|
depth.</para>
|
|
|
|
<para>Stemming is the process by which &RCL; reduces words to
|
|
their radicals so that searching does not depend, for example, on a
|
|
word being singular or plural (floor, floors), or on a verb tense
|
|
(flooring, floored). Because the mechanisms used for stemming
|
|
depend on the specific grammatical rules for each language, there
|
|
is a separate &XAP; stemmer module for most common languages where
|
|
stemming makes sense.</para>
|
|
|
|
<para>&RCL; stores the unstemmed versions of terms in the main index
|
|
and uses auxiliary databases for term expansion (one for each
|
|
stemming language), which means that you can switch stemming
|
|
languages between searches, or add a language without needing a
|
|
full reindex.</para>
|
|
|
|
<para>Storing documents written in different languages in the same
|
|
index is possible, and commonly done. In this situation, you can
|
|
specify several stemming languages for the index. </para>
|
|
|
|
<para>&RCL; currently makes no attempt at automatic language
|
|
recognition, which means that the stemmer will sometimes be applied
|
|
to terms from other languages with potentially strange results. In
|
|
practice, even if this introduces possibilities of confusion, this
approach has proven quite useful, and it is much less
|
|
cumbersome than separating your documents according to what
|
|
language they are written in.</para>
|
|
|
|
<para>By default, &RCL; strips most accents and
|
|
diacritics from terms, and converts them to lower case before
|
|
either storing them in the index or searching for them. As a
|
|
consequence, it is impossible to search for a particular
|
|
capitalization of a term (<literal>US</literal> /
|
|
<literal>us</literal>), or to discriminate two terms based on
|
|
diacritics (<literal>sake</literal> / <literal>saké</literal>,
|
|
<literal>mate</literal> / <literal>maté</literal>).</para>
|
|
|
|
<para>&RCL; versions 1.18 and newer can optionally store the raw
|
|
terms, without accent stripping or case conversion. In this
|
|
configuration, default searches will behave as before, but it is
|
|
possible to perform searches sensitive to case and
|
|
diacritics. This is described in more detail
|
|
in the <link linkend="RCL.INDEXING.CONFIG.SENS">section about index
|
|
case and diacritics sensitivity</link>.</para>
|
|
|
|
<para>&RCL; has many parameters which define exactly what to
|
|
index, and how to classify and decode the source
|
|
documents. These are kept in <link
|
|
linkend="RCL.INDEXING.CONFIG">configuration files</link>. A
|
|
default configuration is copied into a standard location
|
|
(usually something like
|
|
<filename>/usr/share/recoll/examples</filename>)
|
|
during installation. The default values set by the
|
|
configuration files in this directory may be overridden by
|
|
values set inside your personal configuration, found
|
|
by default in the <filename>.recoll</filename> sub-directory
|
|
of your home directory. The default configuration will index
|
|
your home directory with default parameters and should be
|
|
sufficient for giving &RCL; a try, but you may want to adjust
|
|
it later, which can be done either by editing the text files
|
|
or by using configuration menus in the
|
|
<command>recoll</command> GUI. Some other parameters affecting only
|
|
the <command>recoll</command> GUI are stored in the standard
|
|
location defined by <application>Qt</application>.</para>
|
|
|
|
<para>The <link linkend="RCL.INDEXING.PERIODIC.EXEC">indexing
|
|
process</link> is started automatically the first time you
|
|
execute the <command>recoll</command> GUI. Indexing can also
|
|
be performed by executing the <command>recollindex</command>
|
|
command. &RCL; indexing is multithreaded by default when
|
|
appropriate hardware resources are available, and can perform
|
|
in parallel multiple tasks among text extraction, segmentation
|
|
and index updates.</para>
|
|
|
|
<para><link linkend="RCL.SEARCH">Searches</link> are usually
|
|
performed inside the <command>recoll</command> GUI, which has many
|
|
options to help you find what you are looking for. However, there
|
|
are other ways to perform &RCL; searches: mostly a <link
|
|
linkend="RCL.SEARCH.COMMANDLINE">
|
|
command line interface</link>, a
|
|
<link linkend="RCL.PROGRAM.PYTHONAPI">
|
|
<application>Python</application>
|
|
programming interface</link>, a <link linkend="RCL.SEARCH.KIO">
|
|
<application>KDE</application> KIO slave module</link>, and
|
|
Ubuntu Unity <ulink url="https://bitbucket.org/medoc/unity-lens-recoll">
|
|
Lens</ulink> (for older versions) or
|
|
<ulink url="https://bitbucket.org/medoc/unity-scope-recoll">
|
|
Scope</ulink> (for current versions) modules.
|
|
</para>
|
|
|
|
</sect1>
|
|
</chapter>
|
|
|
|
|
|
<chapter id="RCL.INDEXING">
|
|
<title>Indexing</title>
|
|
|
|
<sect1 id="RCL.INDEXING.INTRODUCTION">
|
|
<title>Introduction</title>
|
|
|
|
<para>Indexing is the process by which the set of documents is
|
|
analyzed and the data entered into the database. &RCL;
|
|
indexing is normally incremental: documents will only be
|
|
processed if they have been modified since the last run. On
|
|
the first execution, all documents will need processing. A
|
|
full index build can be forced later by specifying an option
|
|
to the indexing command (<command>recollindex</command>
|
|
<option>-z</option> or <option>-Z</option>).</para>
|
|
|
|
<para><command>recollindex</command> skips files which caused an
|
|
error during a previous pass. This is a performance
|
|
optimization, and a new behaviour in version 1.21 (failed files
|
|
were always retried by previous versions). The command line
|
|
option <option>-k</option> can be set to retry failed files, for
|
|
example after updating a filter.</para>
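<para>As an illustration, the options mentioned above could be used as
follows. This is only a sketch restricted to the options named in this
section; the <command>recollindex</command> manual page remains the
authoritative reference.</para>

<programlisting>
# Incremental update: only files modified since the last run are processed
recollindex

# Reset the index, then rebuild it from scratch
recollindex -z

# Incremental update which also retries files that failed previously
recollindex -k
</programlisting>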
|
|
|
|
<para>The following sections give an overview of different
|
|
aspects of the indexing processes and configuration, with links
|
|
to detailed sections.</para>
|
|
|
|
<para>Depending on your data, temporary files may be needed during
|
|
indexing, some of them possibly quite big. You can use the
|
|
<envar>RECOLL_TMPDIR</envar> or <envar>TMPDIR</envar> environment
|
|
variables to determine where they are created (the default is to
|
|
use <filename>/tmp</filename>). Using <envar>TMPDIR</envar> has
|
|
the nice property that it may also be taken into account by
|
|
auxiliary commands executed by <command>recollindex</command>.</para>
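<para>For example, assuming a scratch directory named
<filename>/var/tmp/recoll-scratch</filename> (a hypothetical name, to be
replaced with a location which has enough free space on your system),
the temporary file location could be set for a single indexing run as
sketched below.</para>

<programlisting>
# Hypothetical scratch directory: adjust to your own system
mkdir -p /var/tmp/recoll-scratch

# TMPDIR may also be taken into account by the commands run by recollindex
TMPDIR=/var/tmp/recoll-scratch recollindex

# Alternative, using the Recoll-specific variable
RECOLL_TMPDIR=/var/tmp/recoll-scratch recollindex
</programlisting>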
|
|
|
|
<sect2 id="RCL.INDEXING.INTRODUCTION.MODES">
|
|
<title>Indexing modes</title>
|
|
|
|
<para>&RCL; indexing can be performed in two different modes:
|
|
<itemizedlist>
|
|
<listitem>
|
|
<formalpara>
|
|
<title><link linkend="RCL.INDEXING.PERIODIC">
|
|
Periodic (or batch) indexing:</link></title>
|
|
<para>indexing takes place at discrete
|
|
times, by executing the <command>recollindex</command>
|
|
command. The typical usage is to have a nightly indexing run
|
|
<link linkend="RCL.INDEXING.PERIODIC.AUTOMAT">
|
|
programmed</link> into
|
|
your <command>cron</command> file.</para>
|
|
</formalpara>
|
|
</listitem>
|
|
<listitem>
|
|
<formalpara><title><link linkend="RCL.INDEXING.MONITOR">Real
|
|
time indexing:</link></title>
|
|
<para>indexing takes place as soon as a file is created or
|
|
changed. <command>recollindex</command> runs as a daemon
|
|
and uses a file system alteration monitor such as
|
|
<application>inotify</application>,
|
|
<application>Fam</application> or
|
|
<application>Gamin</application>
|
|
to detect file changes.</para>
|
|
</formalpara>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
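<para>As an illustration of periodic indexing, a
<command>cron</command> entry along the following lines would start an
indexing pass every night. This is only a sketch: the time of day is an
arbitrary choice, and it assumes that <command>recollindex</command> can
be found in the <envar>PATH</envar> used by
<command>cron</command>.</para>

<programlisting>
# m h dom mon dow command
# Run an incremental indexing pass every day at 03:00
0 3 * * * recollindex
</programlisting>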
|
|
<para>The choice between the two methods is mostly a matter of
|
|
preference, and they can be combined by setting up multiple
|
|
indexes (e.g. use periodic indexing on a big documentation
|
|
directory, and real time indexing on a small home
|
|
directory). Monitoring a big file system tree can consume
|
|
significant system resources.</para>
|
|
|
|
<para>The choice of method and the parameters used can be
|
|
configured from the <command>recoll</command> GUI:
|
|
<menuchoice>
|
|
<guimenu>Preferences</guimenu>
|
|
<guimenuitem>Indexing schedule</guimenuitem>
|
|
</menuchoice>
|
|
</para>
|
|
|
|
<para>The <menuchoice><guimenu>File</guimenu>
|
|
</menuchoice> menu also has entries to start or stop
|
|
the current indexing operation. Stopping indexing is performed by
|
|
killing the <command>recollindex</command> process, which will
|
|
checkpoint its state and exit. A later restart of indexing will
|
|
mostly resume from where things stopped (the file tree walk has to
|
|
be restarted from the beginning).</para>
|
|
|
|
<para>When the real time indexer is running, only a stop operation
|
|
is available from the menu. When no indexing is running, you have
|
|
a choice of updating the index or rebuilding it (the first choice
|
|
only processes changed files, the second one zeroes the index
|
|
before starting so that all files are processed).</para>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.INDEXING.INTRODUCTION.CONFIG">
|
|
<title>Configurations, multiple indexes</title>
|
|
|
|
<para>The parameters describing what is to be indexed and
|
|
local preferences are defined in text files contained in a
|
|
<link linkend="RCL.INDEXING.CONFIG">configuration
|
|
directory</link>.</para>
|
|
|
|
<para>All parameters have defaults, defined in system-wide
|
|
files.</para>
|
|
|
|
<para>Without further configuration, &RCL; will index all
|
|
appropriate files from your home directory, with a reasonable
|
|
set of defaults.</para>
|
|
|
|
<para>A default personal configuration directory
|
|
(<filename>$HOME/.recoll/</filename>) is created
|
|
when a &RCL; program is first executed. It is possible to
|
|
create other configuration directories, and use them by
|
|
setting the <envar>RECOLL_CONFDIR</envar> environment
|
|
variable, or giving the <option>-c</option> option to any of
|
|
the &RCL; commands.</para>
|
|
|
|
<para>In some cases, it may be interesting to index different
|
|
areas of the file system to separate databases. You can do this
|
|
by using multiple configuration directories, each indexing a
|
|
file system area to a specific database. Typically, this
|
|
would be done to separate personal and shared
|
|
indexes, or to take advantage of the organization of your data
|
|
to improve search precision.</para>
|
|
|
|
<para>The generated indexes can
|
|
be queried concurrently in a transparent manner.</para>
|
|
|
|
<para>For index generation, multiple configurations are
|
|
totally independent from each other. When multiple indexes need
|
|
to be used for a single search,
|
|
<link linkend="RCL.INDEXING.CONFIG.MULTIPLE">some parameters
|
|
should be consistent among the configurations</link>.</para>
|
|
|
|
</sect2>
|
|
|
|
<sect2>
|
|
<title>Document types</title>
|
|
<para>&RCL; knows about quite a few different document
|
|
types. The parameters for document types recognition and
|
|
processing are set in
|
|
<link linkend="RCL.INDEXING.CONFIG">configuration files</link>.</para>
|
|
|
|
<para>Most file types, like HTML or word processing files, only hold
|
|
one document. Some file types, like email folders or zip
|
|
archives, can hold many individually indexed documents, which may
|
|
themselves be compound ones. Such hierarchies can go quite
|
|
deep, and &RCL; can process, for example, a
|
|
<application>LibreOffice</application>
|
|
document stored as an attachment to an email message inside an
|
|
email folder archived in a zip file...</para>
|
|
|
|
<para>&RCL; indexing processes plain text, HTML, OpenDocument
|
|
(Open/LibreOffice), email formats, and a few others internally.</para>
|
|
|
|
<para>Other file types (e.g. postscript, pdf, ms-word, rtf ...)
|
|
need external applications for preprocessing. The list is in the
|
|
<link linkend="RCL.INSTALL.EXTERNAL"> installation</link>
|
|
section. After every indexing operation, &RCL; updates a list of
|
|
commands that would be needed for indexing existing file
|
|
types. This list can be displayed by selecting the menu option
|
|
<menuchoice>
|
|
<guimenu>File</guimenu>
|
|
<guimenuitem>Show Missing Helpers</guimenuitem>
|
|
</menuchoice>
|
|
in the <command>recoll</command> GUI. It is stored in the
|
|
<filename>missing</filename> text file inside the configuration
|
|
directory.</para>
|
|
|
|
<para>By default, &RCL; will try to index any file type that
|
|
it has a way to read. This is sometimes not desirable, and
|
|
there are ways to either exclude some types, or on the
|
|
contrary to define a positive list of types to be
|
|
indexed. In the latter case, any type not in the list will
|
|
be ignored.</para>
|
|
|
|
<note><title>Note about MIME types</title>
|
|
<para>When editing the <literal>indexedmimetypes</literal>
|
|
or <literal>excludedmimetypes</literal> lists, you should use the
|
|
MIME values listed in the <filename>mimemap</filename> file
|
|
or in Recoll result lists in preference to <literal>file -i</literal>
|
|
output: there are a number of differences. The
|
|
<literal>file -i</literal> output should only be used for files
|
|
without extensions, or for which the extension is not listed in
|
|
<filename>mimemap</filename>.</para></note>
|
|
|
|
<para>Excluding types can be done by adding wildcard name
|
|
patterns to the
|
|
<link linkend="RCL.INSTALL.CONFIG.RECOLLCONF.SKIPPEDNAMES">
|
|
skippedNames</link> list, which
|
|
can be done from the GUI Index configuration menu. For
|
|
versions 1.20 and later, you can alternatively set the
|
|
<link linkend="RCL.INSTALL.CONFIG.RECOLLCONF.EXCLUDEDMIMETYPES">
|
|
excludedmimetypes</link> list in the configuration file. This
|
|
can be redefined for subdirectories.</para>
|
|
|
|
<para>You can also define an exclusive list of MIME types to be
|
|
indexed (no others will be indexed), by setting
|
|
the <link linkend="RCL.INSTALL.CONFIG.RECOLLCONF.INDEXEDMIMETYPES">
|
|
indexedmimetypes</link> configuration variable. Example:<programlisting>
indexedmimetypes = text/html application/pdf
</programlisting>
It is possible to redefine this parameter for
subdirectories. Example:<programlisting>
[/path/to/my/dir]
indexedmimetypes = application/pdf
</programlisting>
|
|
(When using sections like this, don't forget that they remain
|
|
in effect until the end of the file or another section
|
|
indicator).
|
|
</para>
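<para>The <literal>excludedmimetypes</literal> list can presumably be
written in the same way. The fragment below is a sketch only: the MIME
types are chosen for illustration, and the values should be checked
against the <filename>mimemap</filename> file.</para>

<programlisting>
# Types chosen for illustration only
excludedmimetypes = image/jpeg image/png

# Redefinition for a subdirectory (hypothetical path)
[/path/to/my/dir]
excludedmimetypes = application/pdf
</programlisting>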
|
|
|
|
<para><literal>excludedmimetypes</literal> or
|
|
<literal>indexedmimetypes</literal> can be set either by
|
|
editing the <link linkend="RCL.INSTALL.CONFIG.RECOLLCONF">
|
|
main configuration file
|
|
(<filename>recoll.conf</filename>)</link>, or from the GUI
|
|
index configuration tool.</para>
|
|
|
|
|
|
</sect2>
|
|
|
|
<sect2>
|
|
<title>Indexing failures</title>
|
|
|
|
<para>Indexing may fail for some documents, for a number of
|
|
reasons: a helper program may be missing, the document may be
|
|
corrupt, we may fail to uncompress a file because no file
|
|
system space is available, etc.</para>
|
|
|
|
<para>&RCL; versions prior to 1.21 always retried indexing
|
|
files which had previously caused an error. This guaranteed
|
|
that anything that may have become indexable (for example
|
|
because a helper had been installed) would be indexed. However
|
|
this was bad for performance because some indexing failures
|
|
may be quite costly (for example failing to uncompress a big
|
|
file because of insufficient disk space).</para>
|
|
|
|
<para>The indexer in &RCL; versions 1.21 and later does not
|
|
retry failed files by default. Retrying will only occur if an
|
|
explicit option (<option>-k</option>) is set on the
|
|
<command>recollindex</command> command line, or if a script
|
|
executed when <command>recollindex</command> starts up says
|
|
so. The script is defined by a configuration variable
|
|
(<literal>checkneedretryindexscript</literal>), and makes a
|
|
rather lame attempt at deciding if a helper command may have
|
|
been installed, by checking if any of the common
|
|
<filename>bin</filename> directories have changed.</para>
|
|
|
|
</sect2>
|
|
|
|
<sect2>
|
|
<title>Recovery</title>
|
|
<para>In the rare case where the index becomes corrupted (which can
|
|
show up as weird search results or crashes), the index files
|
|
need to be erased before restarting a clean indexing pass. Just delete
|
|
the <filename>xapiandb</filename> directory (see
|
|
<link linkend="RCL.INDEXING.STORAGE">next section</link>), or,
|
|
alternatively, start the next <command>recollindex</command> with the
|
|
<option>-z</option> option, which will reset the database before
|
|
indexing.</para>
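<para>The two recovery methods could look like the following sketch,
which assumes the default configuration directory
(<filename>~/.recoll</filename>). Adjust the path if your index is
stored elsewhere.</para>

<programlisting>
# Method 1: remove the index data, then run a normal indexing pass
rm -rf ~/.recoll/xapiandb
recollindex

# Method 2: let recollindex reset the database before indexing
recollindex -z
</programlisting>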
|
|
</sect2>
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="RCL.INDEXING.STORAGE">
|
|
<title>Index storage</title>
|
|
|
|
<para>The default location for the index data is the
|
|
<filename>xapiandb</filename> subdirectory of the &RCL;
|
|
configuration directory, typically
|
|
<filename>$HOME/.recoll/xapiandb/</filename>. This can be
|
|
changed via two different methods (with different purposes):
|
|
<itemizedlist>
|
|
<listitem><para>You can specify a different configuration
|
|
directory by setting the <envar>RECOLL_CONFDIR</envar>
|
|
environment variable, or using the <option>-c</option>
|
|
option to the &RCL; commands. This method would typically be
|
|
used to index different areas of the file system to
|
|
different indexes. For example, if you were to issue the
|
|
following command:
|
|
<programlisting>recoll -c ~/.indexes-email</programlisting> then
|
|
&RCL; would use configuration files
|
|
stored in <filename>~/.indexes-email/</filename> and
|
|
(unless specified otherwise in
|
|
<filename>recoll.conf</filename>) would look for
|
|
the index in
|
|
<filename>~/.indexes-email/xapiandb/</filename>.</para>
|
|
|
|
<para>Using multiple configuration directories and <link
|
|
linkend="RCL.INSTALL.CONFIG.RECOLLCONF">configuration
|
|
options</link> allows you to tailor multiple configurations and
|
|
indexes to handle whatever subset of the available data you wish
|
|
to make searchable.</para>
|
|
|
|
</listitem>
|
|
|
|
<listitem><para>For a given configuration directory, you can
|
|
specify a non-default storage location for the index by setting
|
|
the <varname>dbdir</varname> parameter in the configuration file
|
|
(see the <link
|
|
linkend="RCL.INSTALL.CONFIG.RECOLLCONF">configuration
|
|
section</link>). This method would mainly be of use if you wanted
|
|
to keep the configuration directory in its default location, but
|
|
desired another location for the index, typically out of disk
|
|
occupation concerns.</para>
|
|
</listitem>
|
|
|
|
</itemizedlist>
|
|
</para>
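<para>As an example of the second method, the following
<filename>recoll.conf</filename> fragment would keep the configuration
directory in place but store the index elsewhere. The path is
hypothetical and only meant as an illustration.</para>

<programlisting>
# Hypothetical location on a disk with more free space
dbdir = /data/recoll/xapiandb
</programlisting>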
|
|
|
|
<para>The size of the index is determined by the size of the set
|
|
of documents, but the ratio can vary a lot. For a typical
|
|
mixed set of documents, the index size will often be close to
|
|
the data set size. In specific cases (a set of compressed mbox
|
|
files for example), the index can become much bigger than the
|
|
documents. It may also be much smaller if the documents
|
|
contain a lot of images or other non-indexed data (an extreme
|
|
example being a set of mp3 files where only the tags would be
|
|
indexed).</para>
|
|
|
|
<para>Of course, images, sound and video do not increase the
|
|
index size, which means that nowadays (2012), typically, even a big
|
|
index will be negligible against the total amount of data on the
|
|
computer.</para>
|
|
|
|
<para>The index data directory (<filename>xapiandb</filename>)
|
|
only contains data that can be completely rebuilt by an index run
|
|
(as long as the original documents exist), and it can always be
|
|
destroyed safely.</para>
|
|
|
|
<sect2 id="RCL.INDEXING.STORAGE.FORMAT">
|
|
<title>&XAP; index formats</title>
|
|
|
|
<para>&XAP; versions usually support several formats for index
|
|
storage. A given major &XAP; version will have a current format,
|
|
used to create new indexes, and will also support the format from
|
|
the previous major version.</para>
|
|
|
|
<para>&XAP; will not automatically convert an existing index
|
|
from the older format to the newer one. If you want to upgrade to
|
|
the new format, or if a very old index needs to be converted
|
|
because its format is not supported any more, you will have to
|
|
explicitly delete the old index, then run a normal indexing
|
|
process.</para>
|
|
|
|
<para>Using the <option>-z</option> option to
|
|
<command>recollindex</command> is not sufficient to change the
|
|
format: you will have to delete all files inside the index
|
|
directory (typically <filename>~/.recoll/xapiandb</filename>)
|
|
before starting the indexing.</para>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.INDEXING.STORAGE.SECURITY">
|
|
<title>Security aspects</title>
|
|
|
|
<para>The &RCL; index does not hold copies of the indexed
|
|
documents. But it does hold enough data to allow for an almost
|
|
complete reconstruction. If confidential data is indexed,
|
|
access to the database directory should be restricted. </para>
|
|
|
|
<para>&RCL; will create the configuration directory with a mode of
|
|
0700 (access by owner only). As the index data directory is by
|
|
default a sub-directory of the configuration directory, this should
|
|
result in appropriate protection.</para>
|
|
|
|
<para>If you use another setup, you should think of the kind
|
|
of protection you need for your index, set the directory
|
|
and files access modes appropriately, and also maybe adjust
|
|
the <literal>umask</literal> used during index updates.</para>
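<para>As a sketch of what this could mean in practice, assuming an index
stored in the hypothetical directory
<filename>/data/recoll/xapiandb</filename>, access could be restricted
to the owner with standard commands. The appropriate modes depend on
your own sharing needs.</para>

<programlisting>
# Hypothetical index location outside of the configuration directory
chmod 700 /data/recoll/xapiandb

# Run the indexer in a subshell with a restrictive umask, so that new
# files are created without group or other access
(umask 077; recollindex)
</programlisting>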
|
|
|
|
|
|
</sect2>
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="RCL.INDEXING.CONFIG">
|
|
<title>Index configuration</title>
|
|
|
|
<para>Variables set inside the
|
|
<link linkend="RCL.INSTALL.CONFIG">&RCL; configuration files</link>
|
|
control which areas of the file system are indexed, and how
|
|
files are processed. These variables can be set either by
|
|
editing the text files or by using the
|
|
<link linkend="RCL.INDEXING.CONFIG.GUI"> dialogs in the
|
|
<command>recoll</command> GUI</link>.</para>
|
|
|
|
<para>The first time you start <command>recoll</command>, you
|
|
will be asked whether or not you would like it to build the
|
|
index. If you want to adjust the configuration before
|
|
indexing, just click <guilabel>Cancel</guilabel> at this
|
|
point, which will get you into the configuration interface. If
|
|
you exit at this point, <command>recoll</command> will have
|
|
created a <filename>~/.recoll</filename> directory containing
|
|
empty configuration files, which you can edit by hand.</para>
|
|
|
|
<para>The configuration is documented inside the
|
|
<link linkend="RCL.INSTALL.CONFIG">installation chapter</link>
|
|
of this document, or in the
|
|
<citerefentry>
|
|
<refentrytitle>recoll.conf</refentrytitle>
|
|
<manvolnum>5</manvolnum>
|
|
</citerefentry>
|
|
man page, but the most
|
|
current information will most likely be the comments inside the
|
|
sample file. The most immediately useful variable you may
|
|
be interested in is probably
|
|
<link linkend="RCL.INSTALL.CONFIG.RECOLLCONF.TOPDIRS">
|
|
<varname>topdirs</varname></link>,
|
|
which determines what subtrees get indexed.</para>
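<para>For example, a <filename>recoll.conf</filename> fragment
restricting indexing to two subtrees could look like the following. The
directory names are hypothetical and only serve as an
illustration.</para>

<programlisting>
# Hypothetical directories: index only these two subtrees
topdirs = /home/me/Documents /home/me/Mail
</programlisting>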
|
|
|
|
<para>The applications needed to index file types other than
|
|
text, HTML or email (e.g. pdf, postscript, ms-word...) are
|
|
described in the <link linkend="RCL.INSTALL.EXTERNAL">external
|
|
packages section.</link></para>
|
|
|
|
<para>As of Recoll 1.18 there are two incompatible types of Recoll
|
|
indexes, depending on the treatment of character case and
|
|
diacritics. The two types are described in more detail in the
<link linkend="RCL.INDEXING.CONFIG.SENS">section about index case
and diacritics sensitivity</link>.</para>
|
|
|
|
<sect2 id="RCL.INDEXING.CONFIG.MULTIPLE">
|
|
<title>Multiple indexes</title>
|
|
|
|
<para>Multiple &RCL; indexes can be created by
|
|
using several configuration directories which are usually set to
|
|
index different areas of the file system. A specific index can
|
|
be selected for updating or searching, using the
|
|
<envar>RECOLL_CONFDIR</envar> environment variable or the
|
|
<option>-c</option> option to <command>recoll</command> and
|
|
<command>recollindex</command>.</para>
|
|
|
|
<para>When working with the <command>recoll</command> index
|
|
configuration GUI, the configuration directory for which parameters
|
|
are modified is the one which was selected by
|
|
<envar>RECOLL_CONFDIR</envar> or the <option>-c</option> parameter,
|
|
and there is no way to switch configurations within the GUI.</para>
|
|
|
|
<para>Additional configuration directories (beyond
|
|
<filename>~/.recoll</filename>) must be created by hand
|
|
(<command>mkdir</command> or such); the GUI will not do it. This is
|
|
to avoid mistakenly creating additional directories when an
|
|
argument is mistyped.</para>
|
|
|
|
<para>A typical usage scenario for the multiple index feature
|
|
would be for a system administrator to set up a central index
|
|
for shared data, that you choose to search or not in addition to
|
|
your personal data. Of course, there are other
|
|
possibilities. There are many cases where you know the subset of
|
|
files that should be searched, and where narrowing the search
|
|
can improve the results. You can achieve approximately the same
|
|
effect with the directory filter in advanced search, but
|
|
multiple indexes will have much better performance and may be
|
|
worth the trouble.</para>
|
|
|
|
<para>A <command>recollindex</command> program instance can only
|
|
update one specific index.</para>
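<para>As a minimal sketch of what this implies, the following Python
helper runs one <command>recollindex</command> instance per
configuration directory, one after the other. The function names and the
<literal>run</literal> parameter (present only so that the helper can be
exercised without &RCL; installed) are assumptions made for this
illustration; they are not part of &RCL;.</para>
<programlisting><![CDATA[
import os
import subprocess

def recollindex_command(confdir):
    """Build the argv list for updating the index described by confdir."""
    return ["recollindex", "-c", os.path.expanduser(confdir)]

def update_indexes(confdirs, run=subprocess.run):
    """Run one recollindex instance per configuration directory.

    Returns the list of exit codes, in the same order as confdirs.
    """
    return [run(recollindex_command(d)).returncode for d in confdirs]
]]></programlisting>
<para>For instance, calling <literal>update_indexes</literal> with
<filename>~/.recoll</filename> and a second, hand-created directory
would update both indexes in turn.</para>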
|
|
|
|
<para>The main index (defined by
|
|
<envar>RECOLL_CONFDIR</envar> or <option>-c</option>) is
|
|
always active. If this is undesirable, you can set up your
|
|
base configuration to index an empty directory.</para>
|
|
|
|
<para>The different search interfaces (GUI, command line, ...)
|
|
have different methods to define the set of indexes to be
|
|
used; see the appropriate section.</para>
|
|
|
|
<para>If a set of multiple indexes is to be used together for
|
|
searches, some configuration parameters must be consistent
|
|
among the set. These are parameters which need to be the same
|
|
when indexing and searching. As the parameters come from the
|
|
main configuration when searching, they need to be compatible
|
|
with what was set when creating the other indexes (which came
|
|
from their respective configuration directories).</para>
|
|
|
|
<para>Most importantly, all indexes to be queried concurrently must
|
|
have the same option concerning character case and diacritics
|
|
stripping, but there are other constraints. Most of the
|
|
relevant parameters are described in the
|
|
<link linkend="RCL.INSTALL.CONFIG.RECOLLCONF.TERMS">linked
|
|
section</link>.</para>
|
|
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="RCL.INDEXING.CONFIG.SENS">
|
|
<title>Index case and diacritics sensitivity</title>
|
|
|
|
<para>As of &RCL; version 1.18 you have a choice of building an
|
|
index with terms stripped of character case and diacritics, or
|
|
one with raw terms. For a source term of
|
|
<literal>Résumé</literal>, the former will store
|
|
<literal>resume</literal>, the latter
|
|
<literal>Résumé</literal>.</para>
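<para>The effect of stripping can be approximated in a few lines of
Python. This is only an illustration based on Unicode decomposition; it
is not the code that &RCL; uses, and the function name is
hypothetical.</para>
<programlisting><![CDATA[
import unicodedata

def strip_term(term):
    """Approximate what a stripped index stores for a source term."""
    # Decompose accented characters into base character + combining marks.
    decomposed = unicodedata.normalize("NFD", term)
    # Drop the combining marks (the diacritics), then fold case.
    without_marks = "".join(
        c for c in decomposed if not unicodedata.combining(c))
    return without_marks.lower()

print(strip_term("Résumé"))
]]></programlisting>
<para>A raw index would store the term unchanged instead.</para>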
|
|
|
|
<para>Each type of index allows performing searches insensitive to
|
|
case and diacritics: with a raw index, the user entry will be
|
|
expanded to match all case and diacritics variations present in
|
|
the index. With a stripped index, the search term will be stripped
|
|
before searching.</para>
|
|
|
|
<para>A raw index allows for another possibility which a stripped
|
|
index cannot offer: using case and diacritics to discriminate
|
|
between terms, returning different results when searching for
|
|
<literal>US</literal> and <literal>us</literal> or
|
|
<literal>resume</literal> and <literal>résumé</literal>.
|
|
Read the <link linkend="RCL.SEARCH.CASEDIAC">section about search
|
|
case and diacritics sensitivity</link> for more details.</para>
|
|
|
|
<para>The type of index to be created is controlled by the
|
|
<literal>indexStripChars</literal> configuration
|
|
variable which can only be changed by editing the
|
|
configuration file. Any change implies an index reset (not
|
|
automated by &RCL;), and all indexes in a search must be set
|
|
in the same way (again, not checked by &RCL;). </para>
|
|
|
|
<para>If <literal>indexStripChars</literal> is not set, &RCL;
|
|
1.18 creates a stripped index by default, for
|
|
compatibility with previous versions.</para>
|
|
|
|
<para>As a cost for added capability, a raw index will be slightly
|
|
bigger than a stripped one (around 10%). Also, searches will be
|
|
more complex, so probably slightly slower, and the feature is
|
|
still young, so that a certain amount of weirdness cannot be
|
|
excluded.</para>
|
|
|
|
<para>One of the most adverse consequences of using a raw index
|
|
is that some phrase and proximity searches may become
|
|
impossible: because each term needs to be expanded, and all
|
|
combinations searched for, the multiplicative expansion may
|
|
become unmanageable.</para>
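<para>The arithmetic behind this is simple: the number of phrases to
search for is the product of the number of variants of each term. The
sketch below uses made-up variant counts, purely to show the order of
magnitude; the function name is hypothetical.</para>
<programlisting><![CDATA[
from math import prod

def expanded_phrase_count(variants_per_term):
    """Number of term combinations for a phrase, given the number of
    case/diacritics variants found in the index for each term."""
    return prod(variants_per_term)

# Five terms with four variants each (illustrative numbers only).
print(expanded_phrase_count([4, 4, 4, 4, 4]))
]]></programlisting>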
|
|
|
|
</sect2>
|
|
|
|
|
|
|
|
|
|
|
|
<sect2 id="RCL.INDEXING.CONFIG.THREADS">
|
|
<title>Indexing threads configuration</title>
|
|
|
|
<para>The &RCL; indexing process
|
|
<command>recollindex</command> can use multiple threads to
|
|
speed up indexing on multiprocessor systems. The work done
|
|
to index files is divided in several stages and some of the
|
|
stages can be executed by multiple threads. The stages are:
|
|
<orderedlist>
|
|
<listitem>File system walking: this is always performed by
|
|
the main thread.</listitem>
|
|
<listitem>File conversion and data extraction.</listitem>
|
|
<listitem>Text processing (splitting, stemming,
|
|
etc.)</listitem>
|
|
<listitem>&XAP; index update.</listitem>
|
|
</orderedlist>
|
|
</para>
|
|
<para>You can also read a
|
|
<ulink url="http://www.recoll.org/idxthreads/threadingRecoll.html">
|
|
longer document</ulink> about the transformation of
|
|
&RCL; indexing to multithreading.</para>
|
|
|
|
<para>The threads configuration is controlled by two
|
|
configuration file parameters.</para>
|
|
|
|
<variablelist>
|
|
|
|
<varlistentry><term><varname>thrQSizes</varname></term>
|
|
<listitem><para>This variable defines the job input queues
|
|
configuration. There are three possible queues for stages
|
|
2, 3 and 4, and this parameter should give the queue depth
|
|
for each stage (three integer values). If a value of -1 is
|
|
used for a given stage, no queue is used, and the thread
|
|
will go on performing the next stage. In practice, deep
|
|
queues have not been shown to increase performance. A value
|
|
of 0 for the first queue tells &RCL; to perform
|
|
autoconfiguration (no need for anything else in this case,
|
|
thrTCounts is not used) - this is the default
|
|
configuration.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry><term><varname>thrTCounts</varname></term>
|
|
<listitem><para>This defines the number of threads used
|
|
for each stage. If a value of -1 is used for one of
|
|
the queue depths, the corresponding thread count is
|
|
ignored. It makes no sense to use a value other than 1
|
|
for the last stage because updating the &XAP; index is
|
|
necessarily single-threaded (and protected by a
|
|
mutex).</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
</variablelist>
|
|
|
|
<note><para>If the first value in <varname>thrQSizes</varname> is
|
|
0, <varname>thrTCounts</varname> is ignored.</para></note>
|
|
|
|
<para>The following example would use three queues (of depth 2),
|
|
and 4 threads for converting source documents, 2 for
|
|
processing their text, and one to update the index. This was
|
|
tested to be the best configuration on the test system
|
|
(quad-processor with multiple disks).
|
|
<programlisting>
thrQSizes = 2 2 2
thrTCounts = 4 2 1
</programlisting>
|
|
</para>
|
|
|
|
<para>The following example would use a single queue, and the
|
|
complete processing for each document would be performed by
|
|
a single thread (several documents will still be processed
|
|
in parallel in most cases). The threads will use mutual
|
|
exclusion when entering the index update stage. In practice
|
|
the performance would be close to the previous case in
|
|
general, but worse in certain cases (e.g. a Zip archive
|
|
would be processed purely sequentially), so the previous
|
|
approach is preferred. YMMV... The last two values for
|
|
<varname>thrTCounts</varname> are ignored.
|
|
<programlisting>
thrQSizes = 2 -1 -1
thrTCounts = 6 1 1
</programlisting>
|
|
</para>
|
|
|
|
<para>The following example would disable
|
|
multithreading. Indexing will be performed by a single
|
|
thread.
|
|
<programlisting>
thrQSizes = -1 -1 -1
</programlisting>
|
|
</para>
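<para>The rules given in this section for the two parameters can be
summarised by a small Python function. It is a reading aid written from
the description above, not code taken from &RCL;; the names
<literal>STAGES</literal> and <literal>thread_plan</literal> are
hypothetical.</para>
<programlisting><![CDATA[
STAGES = ("conversion", "text processing", "index update")

def thread_plan(thr_qsizes, thr_tcounts="1 1 1"):
    """Interpret thrQSizes / thrTCounts values as described in this section.

    Returns "auto" when autoconfiguration is requested, else a dict mapping
    each stage to (queue depth, thread count), or to None when the stage has
    no queue and runs in the thread that performed the previous stage.
    """
    depths = [int(v) for v in thr_qsizes.split()]
    counts = [int(v) for v in thr_tcounts.split()]
    if len(depths) != 3 or len(counts) != 3:
        raise ValueError("three values are expected for each parameter")
    if depths[0] == 0:
        return "auto"  # thrTCounts is ignored in this case
    plan = {}
    for stage, depth, count in zip(STAGES, depths, counts):
        # A depth of -1 means no queue: the thread count is ignored.
        plan[stage] = None if depth == -1 else (depth, count)
    return plan

print(thread_plan("2 2 2", "4 2 1"))
print(thread_plan("2 -1 -1", "6 1 1"))
]]></programlisting>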
|
|
|
|
</sect2>
|
|
|
|
|
|
|
|
<sect2 id="RCL.INDEXING.CONFIG.GUI">
|
|
<title>The index configuration GUI</title>
|
|
|
|
<para>Most parameters for a given index configuration can
|
|
be set from a <command>recoll</command> GUI running on this
|
|
configuration (either as default, or by setting
|
|
<envar>RECOLL_CONFDIR</envar> or the <option>-c</option>
|
|
option.)</para>
|
|
|
|
<para>The interface is started from the
|
|
<menuchoice>
|
|
<guimenu>Preferences</guimenu>
|
|
<guimenuitem>Index Configuration</guimenuitem>
|
|
</menuchoice>
|
|
menu entry. It is divided into four tabs,
|
|
<guilabel>Global parameters</guilabel>, <guilabel>Local
|
|
parameters</guilabel>, <guilabel>Web history</guilabel>
|
|
(which is explained in the next section) and <guilabel>Search
|
|
parameters</guilabel>.</para>
|
|
|
|
<para>The <guilabel>Global parameters</guilabel> tab allows setting
|
|
global variables, like the lists of top directories, skipped paths,
|
|
or stemming languages.</para>
|
|
|
|
<para>The <guilabel>Local parameters</guilabel> tab allows setting
|
|
variables that can be redefined for subdirectories. This second tab
|
|
has an initially empty list of customisation directories, to which
|
|
you can add. The variables are then set for the currently selected
|
|
directory (or at the top level if the empty line is
|
|
selected).</para>
|
|
|
|
<para>The <guilabel>Search parameters</guilabel> section defines
|
|
parameters which are used at query time, but are global to an
|
|
index and affect all search tools, not only the GUI.</para>
|
|
|
|
<para>The meaning for most entries in the interface is
|
|
self-evident and documented by a <literal>ToolTip</literal>
|
|
popup on the text label. For more detail, you will need to
|
|
refer to the <link linkend="RCL.INSTALL.CONFIG">configuration
|
|
section</link> of this guide.</para>
|
|
|
|
<para>The configuration tool normally respects the comments
|
|
and most of the formatting inside the configuration file, so
|
|
that it is quite possible to use it on hand-edited files,
|
|
which you might nevertheless want to back up first.</para>
|
|
|
|
</sect2>
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="RCL.INDEXING.WEBQUEUE">
|
|
<title>Indexing Web pages you visit</title>
|
|
|
|
<para>With the help of a <application>Firefox</application>
|
|
extension, &RCL; can index the Internet pages that you visit. The
|
|
extension was initially designed for the
|
|
<application>Beagle</application> indexer, but it has recently been
|
|
renamed and better adapted to &RCL;.</para>
|
|
|
|
<para>The extension works by copying visited Web pages to an indexing
|
|
queue directory, which &RCL; then processes, indexing the data,
|
|
storing it into a local cache, then removing the file from the
|
|
queue.</para>
|
|
|
|
<para>This feature can be enabled in the GUI
|
|
<guilabel>Index configuration</guilabel>
|
|
panel, or by editing the configuration file (set
|
|
<varname>processwebqueue</varname> to 1).</para>
|
|
|
|
<para>A current pointer to the extension can be found, along with
|
|
up-to-date instructions, on the
|
|
<ulink url="&WIKI;IndexWebHistory">Recoll wiki</ulink>.</para>
|
|
|
|
<para>A copy of the indexed Web pages is retained by &RCL; in a
|
|
local cache (from which previews can be fetched). The cache size can
|
|
be adjusted from the <guilabel>Index configuration</guilabel> /
|
|
<guilabel>Web history</guilabel> panel. Once the maximum size
|
|
is reached, old pages are purged - both from the cache and the index
|
|
- to make room for new ones, so you need to explicitly archive in
|
|
some other place the pages that you want to keep
|
|
indefinitely.</para>
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="RCL.INDEXING.EXTATTR">
|
|
<title>Extended attributes data</title>
|
|
|
|
<para>User extended attributes are named pieces of information
|
|
that most modern file systems can attach to any file.</para>
|
|
|
|
<para>&RCL; versions 1.19 and later process extended attributes
|
|
as document fields by default. For older versions, this has to
|
|
be activated at build time.</para>
|
|
|
|
<para>A
|
|
<ulink url="http://www.freedesktop.org/wiki/CommonExtendedAttributes">
|
|
freedesktop standard</ulink> defines a few special
|
|
attributes, which are handled as such by &RCL;:
|
|
<variablelist>
|
|
<varlistentry>
|
|
<term>mime_type</term>
|
|
<listitem><para>If set, this overrides any other
|
|
determination of the file MIME type.</para></listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term>charset</term>
|
|
<listitem><para>If set, this defines the file character set
(mostly useful for plain text files).</para></listitem>
|
|
</varlistentry>
|
|
</variablelist>
|
|
</para>
|
|
|
|
<para>By default, other attributes are handled as &RCL; fields.
|
|
On Linux, the <literal>user</literal> prefix is removed from
|
|
the name. This can be configured more precisely inside
|
|
the <link linkend="RCL.INSTALL.CONFIG.FIELDS">
|
|
<filename>fields</filename> configuration file</link>.
|
|
</para>
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="RCL.INDEXING.EXTTAGS">
|
|
<title>Importing external tags</title>
|
|
|
|
<para>During indexing, it is possible to import metadata for
|
|
each file by executing commands. For example, this could
|
|
extract user tag data for the file and store it in a field for
|
|
indexing.</para>
|
|
|
|
<para>See the
|
|
<link linkend="RCL.INSTALL.CONFIG.RECOLLCONF.METADATACMDS">section
|
|
about the <literal>metadatacmds</literal> field</link> in
|
|
the main configuration chapter for a description of the
|
|
configuration syntax.</para>
|
|
|
|
<para>As an example, if you wanted &RCL; to use tags managed by
|
|
<application>tmsu</application>, you would add the following to the
|
|
configuration file:</para>
|
|
|
|
<programlisting>[/some/area/of/the/fs]
metadatacmds = ; tags = tmsu tags %f
</programlisting>
|
|
|
|
<note><para>Depending on the <application>tmsu</application> version,
|
|
you may need/want to add options like
|
|
<literal>--database=/some/db</literal>.</para></note>
|
|
|
|
<para>You may want to restrict this processing to a subset of
|
|
the directory tree, because it may slow down indexing a bit
|
|
(<literal>[/some/area/of/the/fs]</literal>).</para>
|
|
<para>Note the initial semicolon after the equals sign.</para>
|
|
|
|
<para>In the example above, the output of <command>tmsu</command> is
|
|
used to set a field named <literal>tags</literal>. The field name is
|
|
arbitrary and could be <literal>tmsu</literal> or
|
|
<literal>myfield</literal> just the same, but <literal>tags</literal>
|
|
is an alias for the standard &RCL; <literal>keywords</literal> field,
|
|
and the <command>tmsu</command> output will just augment its
|
|
contents. This will avoid the need to extend the <link
|
|
linkend="RCL.PROGRAM.FIELDS">field configuration</link>.</para>
|
|
|
|
<para>Once re-indexing is performed (you'll need to force the file
|
|
reindexing, &RCL; will not detect the need by itself), you will
|
|
be able to search from the query language, through any of its
|
|
aliases: <literal>tags:some/alternate/values</literal> or
|
|
<literal>tags:all,these,values</literal> (the compact field search
|
|
syntax is supported by &RCL; 1.20 and later. For
|
|
older versions, you would need to repeat the <literal>tags:</literal>
|
|
specifier for each term, e.g. <literal>tags:some OR
|
|
tags:alternate</literal>).</para>
|
|
|
|
<para>You should be aware that tags changes will not be detected by
|
|
the indexer if the file itself did not change. One possible
|
|
workaround would be to update the file <literal>ctime</literal> when
|
|
you modify the tags, which
|
|
would be consistent with how extended attributes function. A pair of
|
|
<command>chmod</command> commands could accomplish this, or a
|
|
<literal>touch -a</literal>. Alternatively, just
|
|
couple the tag update with a <literal>recollindex -e -i
filename</literal>.</para>
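<para>Both workarounds can be scripted. The following Python sketch
shows one way to do it, assuming it is acceptable to briefly flip one
permission bit on the file; the function names are hypothetical and the
code is not part of &RCL;.</para>
<programlisting><![CDATA[
import os
import stat

def bump_ctime(path):
    """Update the file ctime with a pair of chmod calls."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, mode ^ stat.S_IXOTH)  # flip one permission bit...
    os.chmod(path, mode)                 # ...and restore the original mode

def reindex_command(path):
    """argv list for forcing the reindexing of a single file."""
    return ["recollindex", "-e", "-i", path]
]]></programlisting>
<para>A tagging script could call <literal>bump_ctime</literal> after
each tag change and leave the update to the next indexing pass, or pass
the result of <literal>reindex_command</literal> to
<literal>subprocess.run</literal> to update the index
immediately.</para>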
|
|
|
|
</sect1>
|
|
|
|
|
|
<sect1 id="RCL.INDEXING.PDF">
|
|
<title>The PDF input handler</title>
|
|
|
|
<para>The PDF format is very important for scientific and technical
|
|
documentation, and document archival. It has extensive
|
|
facilities for storing metadata along with the document, and these
|
|
facilities are actually used in the real world.</para>
|
|
|
|
<para>In consequence, the <filename>rclpdf.py</filename> PDF input
|
|
handler has more complex capabilities than most others, and it is
|
|
also more configurable. Specifically, <filename>rclpdf.py</filename>
|
|
can automatically use <application>tesseract</application> to perform
|
|
OCR if the document text is empty, it can be configured to extract
|
|
specific metadata tags from an XMP packet, and to extract PDF
|
|
attachments.</para>
|
|
|
|
<sect2 id="RCL.INDEXING.PDF.OCR">
|
|
<title>OCR with Tesseract</title>
|
|
|
|
<para>If both <application>tesseract</application> and
|
|
<command>pdftoppm</command> (generally from the
|
|
<application>poppler-utils</application> package) are installed,
|
|
the PDF handler may attempt OCR on PDF files with no text
|
|
content. This is controlled by the <link
|
|
linkend="RCL.INSTALL.CONFIG.RECOLLCONF.PDFOCR">pdfocr</link>
|
|
configuration variable, which is false by default because
|
|
OCR is very slow.</para>
|
|
|
|
<para>The choice of language is very important for successful
|
|
OCR. &RCL; currently has no way to determine this from the
|
|
document itself. You can set the language to use through the
|
|
contents of a <filename>.ocrpdflang</filename> text file in the
|
|
same directory as the PDF document, or through the
|
|
<envar>RECOLL_TESSERACT_LANG</envar> environment variable, or
|
|
through the contents of an <filename>ocrpdf</filename> text file
|
|
inside the configuration directory. If none of the above are used,
|
|
&RCL; will try to guess the language from the NLS
|
|
environment.</para>
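<para>The lookup can be pictured as follows. This Python sketch assumes
that the three sources are tried in the order in which they are listed
above, which this section does not state explicitly; the function name
is hypothetical and the code is not the one found in
<filename>rclpdf.py</filename>.</para>
<programlisting><![CDATA[
import os

def ocr_language(pdf_path, confdir, environ=None):
    """Return the configured OCR language, or None if nothing is set."""
    if environ is None:
        environ = os.environ

    def read_value(filename):
        # A missing or empty file counts as "not set".
        try:
            with open(filename, "r", encoding="utf-8") as f:
                return f.read().strip() or None
        except OSError:
            return None

    pdf_dir = os.path.dirname(os.path.abspath(pdf_path))
    # 1. .ocrpdflang file in the same directory as the PDF document.
    lang = read_value(os.path.join(pdf_dir, ".ocrpdflang"))
    # 2. RECOLL_TESSERACT_LANG environment variable.
    if not lang:
        lang = environ.get("RECOLL_TESSERACT_LANG") or None
    # 3. ocrpdf file inside the configuration directory.
    if not lang:
        lang = read_value(os.path.join(confdir, "ocrpdf"))
    # None means: fall back to guessing from the NLS environment.
    return lang
]]></programlisting>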
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.INDEXING.PDF.XMP">
|
|
<title>XMP fields extraction</title>
|
|
|
|
<para>The <filename>rclpdf.py</filename> script in &RCL; version
|
|
1.23.2 and later can extract XMP metadata fields by executing the
|
|
<command>pdfinfo</command> command (usually found with
|
|
<application>poppler-utils</application>). This is controlled by
|
|
the <link
|
|
linkend="RCL.INSTALL.CONFIG.RECOLLCONF.PDFEXTRAMETA">pdfextrameta</link>
|
|
configuration variable, which specifies which tags to extract and,
|
|
possibly, how to rename them.</para>
|
|
|
|
<para>The <link
|
|
linkend="RCL.INSTALL.CONFIG.RECOLLCONF.PDFEXTRAMETAFIX">pdfextrametafix</link>
|
|
variable can be used to designate a file with Python code to edit
|
|
the metadata fields (available for &RCL; 1.23.3 and later. 1.23.2
|
|
has equivalent code inside the handler script). Example:</para>
|
|
<programlisting>import sys
import re

class MetaFixer(object):
    def __init__(self):
        pass

    def metafix(self, nm, txt):
        if nm == 'bibtex:pages':
            txt = re.sub(r'--', '-', txt)
        elif nm == 'someothername':
            # do something else
            pass
        elif nm == 'stillanother':
            # etc.
            pass

        return txt
</programlisting>
|
|
|
|
|
|
<!-- <para> There is a <ulink url="&WIKI;PDFXMP.wiki">complete example of XMP
|
|
tags setup</ulink>, including a nice result list paragraph format in the
|
|
&RCL; Wiki </para> -->
|
|
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.INDEXING.PDF.ATTACH">
|
|
<title>PDF attachment indexing</title>
|
|
|
|
<para>If <application>pdftk</application> is installed, and if the
|
|
<link
|
|
linkend="RCL.INSTALL.CONFIG.RECOLLCONF.PDFATTACH">pdfattach</link>
|
|
configuration variable is set, the PDF input handler will try to
|
|
extract PDF attachments for indexing as sub-documents of the PDF
|
|
file. This is disabled by default, because it slows down PDF
|
|
indexing a bit even if not one attachment is ever found (PDF
|
|
attachments are uncommon in my experience).</para>
|
|
|
|
</sect2>
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="RCL.INDEXING.PERIODIC">
|
|
<title>Periodic indexing</title>
|
|
|
|
<sect2 id="RCL.INDEXING.PERIODIC.EXEC">
|
|
<title>Running indexing</title>
|
|
|
|
<para>Indexing is always performed by the
|
|
<command>recollindex</command> program, which can be started
|
|
either from the command line or from the <guimenu>File</guimenu>
|
|
menu in the <command>recoll</command> GUI program. When started
|
|
from the GUI, the indexing will run on the same configuration
|
|
<command>recoll</command> was started on. When started from the
|
|
command line, <command>recollindex</command> will use the
|
|
<envar>RECOLL_CONFDIR</envar> variable or accept a
|
|
<option>-c</option> <replaceable>confdir</replaceable> option
|
|
to specify a non-default configuration directory.</para>
|
|
|
|
<para>If the <command>recoll</command> program finds no index
|
|
when it starts, it will automatically start indexing (except
|
|
if canceled).</para>
|
|
|
|
<para>The <command>recollindex</command> indexing process can be
|
|
interrupted by sending an interrupt (<keysym>Ctrl-C</keysym>,
|
|
SIGINT) or terminate
|
|
(SIGTERM) signal. Some time may elapse before the process exits,
|
|
because it needs to properly flush and close the index. This can
|
|
also be done from the <command>recoll</command> GUI
|
|
<menuchoice>
|
|
<guimenu>File</guimenu>
|
|
<guimenuitem>Stop Indexing</guimenuitem>
|
|
</menuchoice>
|
|
menu entry.</para>
|
|
|
|
<para>After such an interruption, the index will be somewhat
|
|
inconsistent because some operations which are normally
|
|
performed at the end of the indexing pass will have been
|
|
skipped (for example, the stemming and spelling databases
|
|
will be nonexistent or out of date). You just need to restart
|
|
indexing at a later time to restore consistency. The
|
|
indexing will restart at the interruption point (the full
|
|
file tree will be traversed, but files that were indexed up
|
|
to the interruption and for which the index is still up to
|
|
date will not need to be reindexed).</para>
|
|
|
|
<para><command>recollindex</command> has a number of other options
|
|
which are described in its man page. Only a few will be
|
|
described here.</para>
|
|
<para>Option <option>-z</option> will reset the index when
|
|
starting. This is almost the same as destroying the index
|
|
files (the nuance is that the &XAP; format version will not
|
|
be changed).</para>
|
|
<para>Option <option>-Z</option> will force the update of all
|
|
documents without resetting the index first. This will not
|
|
have the "clean start" aspect of <option>-z</option>, but
|
|
the advantage is that the index will remain available for
|
|
querying while it is rebuilt, which can be a significant
|
|
advantage if it is very big (some installations need days
|
|
for a full index rebuild).</para>
|
|
|
|
<para>Option <option>-k</option> will force retrying files
|
|
which previously failed to be indexed, for example because
|
|
of a missing helper program.</para>
|
|
|
|
<para>Of special interest also, maybe, are
|
|
the <option>-i</option> and <option>-f</option>
|
|
options. <option>-i</option> allows indexing an explicit
|
|
list of files (given as command line parameters or read on
|
|
<literal>stdin</literal>). <option>-f</option> tells
|
|
<command>recollindex</command> to ignore file selection
|
|
parameters from the configuration. Together, these options
|
|
allow building a custom file selection process for some area
|
|
of the file system, by adding the top directory to the
|
|
<varname>skippedPaths</varname> list and using an
|
|
appropriate file selection method to build the file list to
|
|
be fed to <command>recollindex</command>
|
|
<option>-if</option>. Trivial example:</para>
|
|
|
|
<programlisting>
find . -name indexable.txt -print | recollindex -if
</programlisting>
|
|
|
|
<para><command>recollindex</command> <option>-i</option> will
|
|
not descend into subdirectories specified as parameters,
|
|
but just add them as index entries. It is
|
|
up to the external file selection method to build the complete
|
|
file list.</para>
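<para>Because the complete list must be built by the caller, a small
script is often more convenient than a one-line
<command>find</command>. The following Python sketch walks a tree,
selects files and feeds the list to <command>recollindex</command>
<option>-if</option> on its standard input. The selection rule (a fixed
file name), the function names and the <literal>run</literal> parameter
are assumptions made for the example.</para>
<programlisting><![CDATA[
import os
import subprocess

def select_files(top, name="indexable.txt"):
    """Yield the path of every file called name below top."""
    for dirpath, _dirnames, filenames in os.walk(top):
        if name in filenames:
            yield os.path.join(dirpath, name)

def index_files(paths, run=subprocess.run):
    """Feed the file list, one path per line, to recollindex -if."""
    listing = "".join(path + "\n" for path in paths)
    return run(["recollindex", "-if"], input=listing, text=True)
]]></programlisting>
<para>Calling <literal>index_files(select_files("."))</literal> would
then behave like the <command>find</command> example above, and the
selection function can be replaced by any other criterion.</para>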
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.INDEXING.PERIODIC.AUTOMAT">
|
|
<title>Using <command>cron</command> to automate
|
|
indexing</title>
|
|
|
|
<para>The most common way to set up indexing is to have a cron
|
|
task execute it every night. For example the following
|
|
<filename>crontab</filename> entry would do it every day at
|
|
3:30AM (supposing <command>recollindex</command> is in your
|
|
PATH):
|
|
|
|
<screen><![CDATA[
30 3 * * * recollindex > /some/tmp/dir/recolltrace 2>&1
]]></screen>
|
|
|
|
Or, using <command>anacron</command>:
|
|
<screen><![CDATA[
1 15 recollindex su mylogin -c "recollindex > /tmp/rcltraceme 2>&1"
]]></screen>
|
|
</para>
|
|
|
|
<para>As of version 1.17 the &RCL; GUI has dialogs to manage
|
|
<filename>crontab</filename> entries for
|
|
<command>recollindex</command>. You can reach them from the
|
|
<menuchoice>
|
|
<guimenu>Preferences</guimenu>
|
|
<guimenuitem>Indexing Schedule</guimenuitem>
|
|
</menuchoice>
|
|
menu. They only
|
|
work with the good old <command>cron</command>, and do not give
|
|
access to all features of <command>cron</command> scheduling.</para>
|
|
|
|
<para>The usual command to edit your
|
|
<filename>crontab</filename> is <command>crontab</command>
|
|
<option>-e</option> (which will usually start the
|
|
<command>vi</command> editor to edit the file). You may have
|
|
more sophisticated tools available on your system.</para>
|
|
|
|
<para>Please be aware that there may be differences between your
|
|
usual interactive command line environment and the one seen by
|
|
crontab commands. The PATH variable in particular may be of
|
|
concern. Please check the crontab manual pages about possible
|
|
issues.</para>
|
|
|
|
|
|
</sect2>
|
|
</sect1>
|
|
|
|
<sect1 id="RCL.INDEXING.MONITOR">
|
|
<title>Real time indexing</title>
|
|
|
|
<para>Real time monitoring/indexing is performed by starting the
|
|
<command>recollindex</command> <option>-m</option> command.
|
|
With this option, <command>recollindex</command> will detach
|
|
from the terminal and become a daemon, permanently monitoring
|
|
file changes and updating the index.</para>
|
|
|
|
<para>Under <application>KDE</application>,
|
|
<application>Gnome</application> and some other desktop
|
|
environments, the daemon can be automatically started when you log
|
|
in, by creating a desktop file inside the
|
|
<filename>~/.config/autostart</filename> directory. This can be
|
|
done for you by the &RCL; GUI. Use the
|
|
<menuchoice><guimenu>Preferences</guimenu><guimenuitem>Indexing Schedule</guimenuitem></menuchoice> menu.</para>
|
|
|
|
<para>With older <application>X11</application> setups, starting
|
|
the daemon is normally performed as part of the user session
|
|
script.</para>
|
|
|
|
<para>The <filename>rclmon.sh</filename> script can be used to
|
|
easily start and stop the daemon. It can be found in the
|
|
<filename>examples</filename> directory (typically
|
|
<filename>/usr/local/[share/]recoll/examples</filename>).</para>
|
|
|
|
<para>For example, my out of fashion
|
|
<application>xdm</application>-based session has a
|
|
<filename>.xsession</filename> script with the following lines
|
|
at the end:</para>
|
|
|
|
<programlisting>recollconf=$HOME/.recoll-home
recolldata=/usr/local/share/recoll
RECOLL_CONFDIR=$recollconf $recolldata/examples/rclmon.sh start

fvwm

</programlisting>
|
|
|
|
<para>The indexing daemon gets started, then the window manager,
|
|
for which the session waits.</para> <para>By default the
|
|
indexing daemon will monitor the state of the X11 session, and
|
|
exit when it finishes; it is not necessary to kill it
|
|
explicitly. (The <application>X11</application> server
|
|
monitoring can be disabled with option <option>-x</option> to
|
|
<command>recollindex</command>).</para>
|
|
|
|
<para>If you use the daemon completely out of an
|
|
<application>X11</application> session, you need to add option
|
|
<option>-x</option> to disable <application>X11</application>
|
|
session monitoring (else the daemon will not start).</para>
|
|
|
|
<para>By default, the messages from the indexing daemon will be
|
|
sent to the same file as those from the interactive commands
|
|
(<literal>logfilename</literal>). You may want to change this
|
|
by setting the <varname>daemlogfilename</varname> and
|
|
<varname>daemloglevel</varname> configuration parameters. Also
|
|
the log file will only be truncated when the daemon starts. If
|
|
the daemon runs permanently, the log file may grow quite big,
|
|
depending on the log level.</para>
|
|
|
|
<para>When building &RCL;, the real time indexing support can be
|
|
customised during package <link
|
|
linkend="RCL.INSTALL.BUILDING.BUILD">configuration</link> with
|
|
the <option>--with[out]-fam</option> or
|
|
<option>--with[out]-inotify</option> options. The default is
|
|
currently to include <application>inotify</application>
|
|
monitoring on systems that support it, and, as of &RCL; 1.17,
|
|
<application>gamin</application> support on
|
|
<application>FreeBSD</application>.</para>
|
|
|
|
<para>While it is convenient that data is indexed in real time,
|
|
repeated indexing can generate a significant load on the
|
|
system when files such as email folders change. Also,
|
|
monitoring large file trees by itself significantly taxes
|
|
system resources. You probably do not want to enable it if
|
|
your system is short on resources. Periodic indexing is
|
|
adequate in most cases.</para>
|
|
|
|
<note><title>Increasing resources for inotify</title>
|
|
<para>On Linux systems, monitoring a big tree may need
|
|
increasing the resources available to inotify, which are
|
|
normally defined in <filename>/etc/sysctl.conf</filename>.
|
|
<programlisting>
### inotify
#
# cat /proc/sys/fs/inotify/max_queued_events   - 16384
# cat /proc/sys/fs/inotify/max_user_instances  - 128
# cat /proc/sys/fs/inotify/max_user_watches    - 16384
#
# -- Change to:
#
fs.inotify.max_queued_events=32768
fs.inotify.max_user_instances=256
fs.inotify.max_user_watches=32768
</programlisting>
|
|
|
|
</para>
|
|
<para>In particular, you will need to trim your tree or adjust
|
|
the <literal>max_user_watches</literal> value if indexing exits with
|
|
a message about errno <literal>ENOSPC</literal> (28) from
|
|
<function>inotify_add_watch</function>.</para>
|
|
</note>
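<para>To get a rough idea of whether the current limit is sufficient,
you can compare it with the number of directories in the monitored
trees. The following Python sketch assumes that one inotify watch is
needed per directory (inotify watches are not recursive) and that
symbolic links are not followed; the function names are hypothetical,
and other programs using inotify for the same user are not accounted
for.</para>
<programlisting><![CDATA[
import os

def count_directories(top):
    """Number of directories in the tree rooted at top, top included."""
    return sum(1 for _dirpath, _dirs, _files in os.walk(top))

def read_max_user_watches(
        path="/proc/sys/fs/inotify/max_user_watches"):
    """Current per-user inotify watch limit (Linux only)."""
    with open(path) as f:
        return int(f.read().strip())

def watches_sufficient(tops, limit):
    """True if the limit covers one watch per directory under tops."""
    return sum(count_directories(top) for top in tops) <= limit
]]></programlisting>
<para>For example,
<literal>watches_sufficient(["/home/me"], read_max_user_watches())</literal>
returns false when the tree has more directories than the limit
allows.</para>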
|
|
|
|
<sect2 id="RCL.INDEXING.MONITOR.FASTFILES">
|
|
<title>Slowing down the reindexing rate for fast changing
|
|
files</title>
|
|
|
|
<para>When using the real time monitor, it may happen that some
|
|
files need to be indexed, but change so often that they impose an
|
|
excessive load for the system.</para>
|
|
|
|
<para>&RCL; provides a configuration option to specify the minimum
|
|
time before which a file, specified by a wildcard pattern, cannot be
|
|
reindexed. See the <varname>mondelaypatterns</varname> parameter in
|
|
the <link linkend="RCL.INSTALL.CONFIG.RECOLLCONF.MISC">
|
|
configuration section</link>.</para>
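
<para>The idea behind such a parameter can be sketched in a few lines of
Python. The example below is not the &RCL; implementation: it assumes,
for illustration only, that the delays are written as white
space-separated <literal>pattern:seconds</literal> items, and the function
names are hypothetical. Please refer to the configuration section for the
actual syntax.</para>

<programlisting>
import fnmatch
import os


def parse_delay_patterns(spec):
    """Parse 'pattern:seconds' items separated by white space (assumed format)."""
    delays = []
    for item in spec.split():
        pattern, _sep, seconds = item.rpartition(":")
        delays.append((pattern, int(seconds)))
    return delays


def may_reindex(path, last_indexed, now, delays):
    """Return True if the file may be indexed again at time 'now'.

    The first pattern matching the file name decides; files matching no
    pattern are never delayed.
    """
    name = os.path.basename(path)
    for pattern, seconds in delays:
        if fnmatch.fnmatchcase(name, pattern):
            return now - last_indexed >= seconds
    return True
</programlisting>

<para>With the delays parsed from <literal>*.log:20</literal>, a log file
last indexed 10 seconds ago would be skipped, and indexed again once 20
seconds have elapsed.</para>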
|
|
|
|
</sect2>
|
|
</sect1>
|
|
|
|
</chapter>
|
|
|
|
<chapter id="RCL.SEARCH">
|
|
<title>Searching</title>
|
|
|
|
<sect1 id="RCL.SEARCH.GUI">
|
|
<title>Searching with the Qt graphical user interface</title>
|
|
|
|
<para>The <command>recoll</command> program provides the main user
|
|
interface for searching. It is based on the
|
|
<application>Qt</application> library.</para>
|
|
|
|
<para><command>recoll</command> has two search modes:</para>
|
|
<itemizedlist>
|
|
<listitem><para>Simple search (the default, on the main screen) has
|
|
a single entry field where you can enter multiple words.</para>
|
|
</listitem>
|
|
<listitem><para>Advanced search (a panel accessed through the
|
|
<guilabel>Tools</guilabel> menu or the toolbox bar icon) has
|
|
multiple entry fields, which you may use to build a logical
|
|
condition, with additional filtering on file type, location
|
|
in the file system, modification date, and size.</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
|
|
<para>In most cases, you can enter the terms as you
|
|
think of them, even if they contain embedded punctuation or other
|
|
non-textual characters. For
|
|
example, &RCL; can handle things like email addresses, or
|
|
arbitrary cut and paste from another text window, punctuation
|
|
and all.</para>
|
|
|
|
<para>The main case where you should enter text differently from
|
|
how it is printed is for East Asian languages (Chinese,
|
|
Japanese, Korean). Words composed of single or multiple
|
|
characters should be entered separated by white space in this
|
|
case (they would typically be printed without white
|
|
space).</para>
|
|
|
|
<para>Some searches can be quite complex, and you may want to
|
|
re-use them later, perhaps with some tweaking. &RCL; versions
|
|
1.21 and later can save and restore searches, using XML files. See
|
|
<link linkend="RCL.SEARCH.SAVING">Saving and restoring
|
|
queries</link>.</para>
|
|
|
|
<sect2 id="RCL.SEARCH.GUI.SIMPLE">
|
|
<title>Simple search</title>
|
|
|
|
<procedure>
|
|
<step><para>Start the <command>recoll</command> program.</para>
|
|
</step>
|
|
<step><para>Possibly choose a search mode: <guilabel>Any
|
|
term</guilabel>, <guilabel>All terms</guilabel>,
|
|
<guilabel>File name</guilabel> or
|
|
<guilabel>Query language</guilabel>.</para>
|
|
</step>
|
|
<step><para>Enter search term(s) in the text field at the top of the
|
|
window.</para>
|
|
</step>
|
|
<step><para>Click the <guilabel>Search</guilabel> button or
|
|
hit the <keycap>Enter</keycap> key to start the search.</para>
|
|
</step>
|
|
</procedure>
|
|
|
|
<para>The initial default search mode is <guilabel>Query
|
|
language</guilabel>. Without special directives, this will look for
|
|
documents containing all of the search terms (the ones with more
|
|
terms will get better scores), just like the <guilabel>All
|
|
terms</guilabel> mode which will ignore such
|
|
directives. <guilabel>Any term</guilabel> will search for documents
|
|
where at least one of the terms appears.</para>
|
|
|
|
<para>The <guilabel>Query Language</guilabel> features are
|
|
described in <link linkend="RCL.SEARCH.LANG">a separate
|
|
section</link>.</para>
|
|
|
|
<para>All search modes allow wildcards inside terms
|
|
(<literal>*</literal>, <literal>?</literal>,
|
|
<literal>[]</literal>). You may want to have a look at the
|
|
<link linkend="RCL.SEARCH.WILDCARDS">section about wildcards</link>
|
|
for more information about this.</para>
|
|
|
|
<para><guilabel>File name</guilabel> will specifically look for file
|
|
names. The point of having a separate file name
|
|
search is that wild card expansion can be performed more
|
|
efficiently on a small subset of the index (allowing
|
|
wild cards on the left of terms without excessive penalty).
|
|
Things to know:
|
|
<itemizedlist>
|
|
<listitem><para>White space in the entry should match white
|
|
space in the file name, and is not treated specially.</para>
|
|
</listitem>
|
|
<listitem><para>The search is insensitive to character case and
|
|
accents, independently of the type of index.</para>
|
|
</listitem>
|
|
<listitem><para>An entry without any wild card
|
|
character and not capitalized will be prepended and appended
|
|
with '*' (ie: <replaceable>etc</replaceable> ->
|
|
<replaceable>*etc*</replaceable>, but
|
|
<replaceable>Etc</replaceable> ->
|
|
<replaceable>etc</replaceable>).</para>
|
|
</listitem>
|
|
<listitem><para>If you have a big index (many files),
|
|
excessively generic fragments may result in inefficient
|
|
searches.</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
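
<para>The rule about automatic wild cards in file name searches can be
summarized by the following Python sketch. It only restates the behaviour
described in the list above; the function name is hypothetical and the
code is not taken from &RCL;.</para>

<programlisting>
def expand_file_name_entry(entry):
    """Return the pattern used for a file name search entry.

    - An entry which already contains a wild card is used as is.
    - A capitalized entry is matched exactly (case is ignored anyway).
    - Any other entry gets '*' prepended and appended.
    """
    if any(ch in entry for ch in "*?["):
        return entry
    if entry[:1].isupper():
        return entry.lower()
    return "*" + entry + "*"
</programlisting>

<para>So <literal>expand_file_name_entry("etc")</literal> yields
<literal>*etc*</literal>, while <literal>Etc</literal> yields
<literal>etc</literal> and <literal>rec*</literal> is left
unchanged.</para>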
|
|
|
|
<para>You can search for exact phrases (adjacent words in a
|
|
given order) by enclosing the input inside double quotes. Ex:
|
|
<literal>"virtual reality"</literal>.</para>
|
|
|
|
<para>When using a stripped index, character case has no influence on
|
|
search, except that you can disable stem expansion for any term by
|
|
capitalizing it. Ie: a search for <literal>floor</literal> will also
|
|
normally look for <literal>flooring</literal>,
|
|
<literal>floored</literal>, etc., but a search for
|
|
<literal>Floor</literal> will only look for <literal>floor</literal>,
|
|
in any character case. Stemming can also be disabled globally in the
|
|
preferences. When using a raw index, <link
|
|
linkend="RCL.SEARCH.CASEDIAC">the rules are a bit more
|
|
complicated</link>.</para>
|
|
|
|
<para>&RCL; remembers the last few searches that you
|
|
performed. You can use the simple search text entry widget (a
|
|
combobox) to recall them (click on the drop-down arrow at the right of the
|
|
text field). Please note, however, that only the search texts
|
|
are remembered, not the mode (all/any/file name).</para>
|
|
|
|
<para>Typing <keycap>Esc</keycap> <keycap>Space</keycap> while
|
|
entering a word in the simple search entry will open a window
|
|
with possible completions for the word. The completions are
|
|
extracted from the database.</para>
|
|
|
|
<para>Double-clicking on a word in the result list or a preview
|
|
window will insert it into the simple search entry field.</para>
|
|
|
|
<para>You can cut and paste any text into an <guilabel>All
|
|
terms</guilabel> or <guilabel>Any term</guilabel> search field,
|
|
punctuation, newlines and all - except for wildcard characters
|
|
(single <literal>?</literal> characters are ok). &RCL; will process
|
|
it and produce a meaningful search. This is what most differentiates
|
|
this mode from the <guilabel>Query Language</guilabel> mode, where
|
|
you have to care about the syntax.</para>
|
|
|
|
<para>You can use the <link linkend="RCL.SEARCH.GUI.COMPLEX">
|
|
<menuchoice>
|
|
<guimenu>Tools</guimenu>
|
|
<guimenuitem>Advanced search</guimenuitem>
|
|
</menuchoice>
|
|
</link> dialog for more complex searches.</para>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.SEARCH.GUI.RESLIST">
|
|
<title>The default result list</title>
|
|
|
|
<para>After starting a search, a list of results will instantly
|
|
be displayed in the main list window.</para>
|
|
|
|
<para>By default, the document list is presented in order of
|
|
relevance (how well the system estimates that the document
|
|
matches the query). You can sort the result by ascending or
|
|
descending date by using the vertical arrows in the toolbar.</para>
|
|
|
|
<para>Clicking on the
|
|
<literal>Preview</literal> link for an entry will open an
|
|
internal preview window for the document. Further
|
|
<literal>Preview</literal> clicks for the same search will open
|
|
tabs in the existing preview window. You can use
|
|
<keycap>Shift</keycap>+Click to force the creation of another
|
|
preview window, which may be useful to view the documents side
|
|
by side. (You can also browse successive results in a single
|
|
preview window by typing
|
|
<keycap>Shift</keycap>+<keycap>ArrowUp/Down</keycap> in the
|
|
window).</para>
|
|
|
|
<para>Clicking the <literal>Open</literal> link will
|
|
start an external viewer for the document. By default, &RCL; lets
|
|
the desktop choose the appropriate application for most document
|
|
types (there is a short list of exceptions, see further). If you
|
|
prefer to completely customize the choice of applications, you can
|
|
uncheck the <guilabel>Use desktop preferences</guilabel> option in
|
|
the GUI preferences dialog, and click the <guilabel>Choose editor
|
|
applications</guilabel> button to adjust the predefined &RCL;
|
|
choices. The tool accepts multiple selections of MIME types (e.g. to
|
|
set up the editor for the dozens of office file types).</para>
|
|
|
|
<para>Even when <guilabel>Use desktop preferences</guilabel> is
|
|
checked, there is a small list of exceptions, for MIME types where
|
|
the &RCL; choice should override the desktop one. These are
|
|
applications which are well integrated with &RCL;, especially
|
|
<application>evince</application> for viewing PDF and Postscript
|
|
files because of its support for opening the document at a specific
|
|
page and passing a search string as an argument. Of course, you can
|
|
edit the list (in the GUI preferences) if you would prefer to lose
|
|
the functionality and use the standard desktop tool.</para>
|
|
|
|
<para>You may also change the choice of applications by editing the
|
|
<link linkend="RCL.INSTALL.CONFIG.MIMEVIEW">
|
|
<filename>mimeview</filename></link> configuration file if you find
|
|
this more convenient.</para>
|
|
|
|
<para>Each result entry also has a right-click menu with an
|
|
<guilabel>Open With</guilabel> entry. This lets you choose an
|
|
application from the list of those which registered with the desktop
|
|
for the document MIME type.</para>
|
|
|
|
<para>The <literal>Preview</literal> and <literal>Open</literal>
|
|
edit links may not be present for all entries, meaning that
|
|
&RCL; has no configured way to preview a given file type (which
|
|
was indexed by name only), or no configured external editor for
|
|
the file type. This can sometimes be adjusted simply by tweaking
|
|
the <link linkend="RCL.INSTALL.CONFIG.MIMEMAP">
|
|
<filename>mimemap</filename></link> and
|
|
<link linkend="RCL.INSTALL.CONFIG.MIMEVIEW">
|
|
<filename>mimeview</filename></link> configuration files (the latter
|
|
can be modified with the user preferences dialog).</para>
|
|
|
|
<para>The format of the result list entries is entirely
|
|
configurable by using the preference dialog to
|
|
<link linkend="RCL.SEARCH.GUI.CUSTOM.RESLIST">edit an HTML
|
|
fragment</link>.</para>
|
|
|
|
<para>You can click on the <literal>Query details</literal> link
|
|
at the top of the results page to see the query actually
|
|
performed, after stem expansion and other processing.</para>
|
|
|
|
<para>Double-clicking on any word inside the result list or a
|
|
preview window will insert it into the simple search text.</para>
|
|
|
|
<para>The result list is divided into pages (the size of which
|
|
you can change in the preferences). Use the arrow buttons in the
|
|
toolbar or the links at the bottom of the page to browse the
|
|
results.</para>
|
|
|
|
<sect3 id="RCL.SEARCH.GUI.RESLIST.SUGGS">
|
|
<title>No results: the spelling suggestions</title>
|
|
|
|
<para>When a search yields no result, and if the
|
|
<application>aspell</application> dictionary is configured, &RCL;
|
|
will try to check for misspellings among the query terms, and
|
|
will propose lists of replacements. Clicking on one of the
|
|
suggestions will replace the word and restart the search. You can
|
|
hold any of the modifier keys (Ctrl, Shift, etc.) while clicking
|
|
if you would rather stay on the suggestion screen because several
|
|
terms need replacement.</para>
|
|
|
|
</sect3>
|
|
|
|
|
|
<sect3 id="RCL.SEARCH.GUI.RESULTLIST.MENU">
|
|
<title>The result list right-click menu</title>
|
|
|
|
<para>Apart from the preview and edit links, you can display a
|
|
pop-up menu by right-clicking over a paragraph in the result
|
|
list. This menu has the following entries:</para>
|
|
|
|
<itemizedlist>
|
|
<listitem><para><guilabel>Preview</guilabel></para></listitem>
|
|
<listitem><para><guilabel>Open</guilabel></para></listitem>
|
|
<listitem><para><guilabel>Open With</guilabel></para></listitem>
|
|
<listitem><para><guilabel>Run Script</guilabel></para></listitem>
|
|
<listitem><para><guilabel>Copy File Name</guilabel></para></listitem>
|
|
<listitem><para><guilabel>Copy Url</guilabel></para></listitem>
|
|
<listitem><para><guilabel>Save to File</guilabel></para></listitem>
|
|
<listitem><para><guilabel>Find similar</guilabel></para></listitem>
|
|
<listitem><para><guilabel>Preview Parent
|
|
document</guilabel></para></listitem>
|
|
<listitem><para><guilabel>Open Parent
|
|
document</guilabel></para></listitem>
|
|
<listitem><para><guilabel>Open Snippets
|
|
Window</guilabel></para></listitem>
|
|
</itemizedlist>
|
|
|
|
<para>The <guilabel>Preview</guilabel> and
|
|
<guilabel>Open</guilabel> entries do the same thing as the
|
|
corresponding links.</para>
|
|
|
|
<para><guilabel>Open With</guilabel> lets you open the document
|
|
with one of the applications claiming to be able to handle its MIME
|
|
type (the information comes from the <literal>.desktop</literal>
|
|
files in
|
|
<filename>/usr/share/applications</filename>).</para>
|
|
|
|
<para><guilabel>Run Script</guilabel> allows starting an arbitrary
|
|
command on the result file. It will only appear for results which
|
|
are top-level files. See <link
|
|
linkend="RCL.SEARCH.GUI.RUNSCRIPT">further</link> for a more
|
|
detailed description.</para>
|
|
|
|
<para>The <guilabel>Copy File Name</guilabel> and
|
|
<guilabel>Copy Url</guilabel> copy the relevant data to the
|
|
clipboard, for later pasting.</para>
|
|
|
|
<para><guilabel>Save to File</guilabel> allows saving the
|
|
contents of a result document to a chosen file. This entry
|
|
will only appear if the document does not correspond to an
|
|
existing file, but is a subdocument inside such a file (ie: an
|
|
email attachment). It is especially useful to extract attachments
|
|
with no associated editor.</para>
|
|
|
|
<para>The <guilabel>Open/Preview Parent document</guilabel> entries
|
|
allow working with the higher level document (e.g. the email
|
|
message an attachment comes from). &RCL; is sometimes not totally
|
|
accurate as to what it can or can't do in this area. For example
|
|
the <guilabel>Parent</guilabel> entry will also appear for an
|
|
email which is part of an mbox folder file, but you can't actually
|
|
visualize the mbox (there will be an error dialog if you
|
|
try).</para>
|
|
|
|
<para>If the document is a top-level file, <guilabel>Open
|
|
Parent</guilabel> will start the default file manager on the
|
|
enclosing filesystem directory.</para>
|
|
|
|
<para>The <guilabel>Find similar</guilabel> entry will select
|
|
a number of relevant terms from the current document and enter
|
|
them into the simple search field. You can then start a simple
|
|
search, with a good chance of finding documents related to the
|
|
current result. I can't remember a single instance where this
|
|
function was actually useful to me...</para>
|
|
|
|
<para id="RCL.SEARCH.GUI.RESULTLIST.MENU.SNIPPETS">The <guilabel>Open Snippets Window</guilabel> entry will only
|
|
appear for documents which support page breaks (typically
|
|
PDF, Postscript, DVI). The snippets window lists extracts from
|
|
the document, taken around search term occurrences, along with the
|
|
corresponding page number, as links which can be used to start
|
|
the native viewer on the appropriate page. If the viewer supports
|
|
it, its search function will also be primed with one of the
|
|
search terms.</para>
|
|
|
|
</sect3>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.SEARCH.GUI.RESTABLE">
|
|
<title>The result table</title>
|
|
|
|
<para>In &RCL; 1.15 and newer, the results can be displayed in
|
|
spreadsheet-like fashion. You can switch to this presentation by
|
|
clicking the table-like icon in the toolbar (this is a toggle,
|
|
click again to restore the list).</para>
|
|
|
|
<para>Clicking on the column headers will allow sorting by the
|
|
values in the column. You can click again to invert the order, and
|
|
use the header right-click menu to reset sorting to the default
|
|
relevance order (you can also use the sort-by-date arrows to do
|
|
this).</para>
|
|
|
|
<para>Both the list and the table display the same underlying
|
|
results. The sort order set from the table is still active if you
|
|
switch back to the list mode. You can click twice on a date sort
|
|
arrow to reset it from there.</para>
|
|
|
|
<para>The header right-click menu allows adding or deleting
|
|
columns. The columns can be resized, and their order can be changed
|
|
(by dragging). All the changes are recorded when you quit
|
|
<command>recoll</command>.</para>
|
|
|
|
<para>Hovering over a table row will update the detail area at the
|
|
bottom of the window with the corresponding values. You can click
|
|
the row to freeze the display. The bottom area is equivalent to a
|
|
result list paragraph, with links for starting a preview or a
|
|
native application, and an equivalent right-click menu. Typing
|
|
<keycap>Esc</keycap> (the Escape key) will unfreeze the
|
|
display.</para>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.SEARCH.GUI.RUNSCRIPT">
|
|
<title>Running arbitrary commands on result files (1.20 and later)</title>
|
|
|
|
<para>Apart from the <guilabel>Open</guilabel> and <guilabel>Open
|
|
With</guilabel> operations, which allow starting an application on a
|
|
result document (or a temporary copy), based on its MIME type, it is
|
|
also possible to run arbitrary commands on results which are
|
|
top-level files, using the <guilabel>Run Script</guilabel> entry in
|
|
the results pop-up menu.</para>
|
|
|
|
<para>The commands which will appear in the <guilabel>Run
|
|
Script</guilabel> submenu must be defined by
|
|
<literal>.desktop</literal> files inside the
|
|
<filename>scripts</filename> subdirectory of the current
|
|
configuration directory.</para>
|
|
|
|
<para>Here follows an example of a <literal>.desktop</literal> file,
|
|
which could be named for example,
|
|
<filename>~/.recoll/scripts/myscript.desktop</filename> (the exact
|
|
file name inside the directory is irrelevant):
|
|
<programlisting>
|
|
[Desktop Entry]
Type=Application
Name=MyFirstScript
Exec=/home/me/bin/tryscript %F
MimeType=*/*
|
|
</programlisting>
|
|
The <literal>Name</literal> attribute defines the label which will
|
|
appear inside the <guilabel>Run Script</guilabel> menu. The
|
|
<literal>Exec</literal> attribute defines the program to be run,
|
|
which does not need to actually be a script, of course. The
|
|
<literal>MimeType</literal> attribute is not used, but needs to exist.
|
|
</para>
|
|
|
|
<para>The commands defined this way can also be used from links
|
|
inside the <link linkend="RCL.SEARCH.GUI.CUSTOM.RESLIST.PARA">
|
|
result paragraph</link>.</para>
|
|
|
|
<para>As an example, it might make sense to write a script which
|
|
would move the document to the trash and purge it from the &RCL;
|
|
index.</para>
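
<para>A minimal sketch of such a script, written in Python, could look
like the following. It rests on two assumptions which you should verify
on your system: that the <command>gio trash</command> command is available
on the desktop, and that <command>recollindex -e</command> erases the
index data for the files named on its command line. The function names
are hypothetical.</para>

<programlisting>
import subprocess
import sys


def build_commands(path):
    """Return the commands to run for one result file.

    Assumptions: 'gio trash' moves a file to the desktop trash, and
    'recollindex -e' erases the index data for the given file.
    """
    return [["gio", "trash", path], ["recollindex", "-e", path]]


def trash_and_purge(path, run=subprocess.run):
    """Move the file to the trash, then purge it from the index."""
    for cmd in build_commands(path):
        # check=True stops at the first failing command.
        run(cmd, check=True)


if __name__ == "__main__":
    # The file path(s) come from %F in the Exec line of the .desktop file.
    for file_path in sys.argv[1:]:
        trash_and_purge(file_path)
</programlisting>

<para>The script would be referenced from the <literal>Exec</literal>
line of a <literal>.desktop</literal> file in the
<filename>scripts</filename> directory, in the same way as
<filename>tryscript</filename> in the example above.</para>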
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.SEARCH.GUI.THUMBNAILS">
|
|
<title>Displaying thumbnails</title>
|
|
|
|
<para>The default format for the result list entries and the
|
|
detail area of the result table display an icon for each result
|
|
document. The icon is either a generic one determined from the
|
|
MIME type, or a thumbnail of the document appearance. Thumbnails
|
|
are only displayed if found in the standard
|
|
<application>freedesktop</application> location, where they would
|
|
typically have been created by a file manager.</para>
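
<para>To check whether a thumbnail exists for a given file, the following
Python sketch computes the candidate locations. It assumes the usual
<application>freedesktop</application> thumbnail layout (a PNG file named
after the MD5 hash of the file URI, under
<filename>thumbnails/normal</filename> or
<filename>thumbnails/large</filename> in the cache directory) and does not
claim to reproduce the exact lookup performed by &RCL;. The function names
are hypothetical.</para>

<programlisting>
import hashlib
from pathlib import Path


def thumbnail_candidates(file_path, cache_home=None):
    """Return the possible thumbnail paths for file_path.

    Assumption: thumbnails are stored as md5(file URI) + '.png' in the
    'normal' and 'large' subdirectories of the cache 'thumbnails' directory.
    """
    cache = Path(cache_home) if cache_home else Path.home() / ".cache"
    uri = Path(file_path).absolute().as_uri()
    name = hashlib.md5(uri.encode("utf-8")).hexdigest() + ".png"
    return [cache / "thumbnails" / size / name for size in ("normal", "large")]


def existing_thumbnail(file_path, cache_home=None):
    """Return the first existing thumbnail path, or None."""
    for candidate in thumbnail_candidates(file_path, cache_home):
        if candidate.is_file():
            return candidate
    return None
</programlisting>

<para>If <function>existing_thumbnail</function> returns
<literal>None</literal> for a document, the result list can be expected to
show the generic MIME type icon instead.</para>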
|
|
|
|
<para>&RCL; has no capability to create thumbnails. A relatively
|
|
simple trick is to use the <guilabel>Open parent
|
|
document/folder</guilabel> entry in the result list popup
|
|
menu. This should open a file manager window on the containing
|
|
directory, which should in turn create the thumbnails (depending on
|
|
your settings). Restarting the search should then display the
|
|
thumbnails.</para>
|
|
|
|
<para>There are also <ulink url="&WIKI;ResultsThumbnails.wiki">some
|
|
pointers about thumbnail generation</ulink> on the &RCL; wiki.
|
|
</para>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.SEARCH.GUI.PREVIEW">
|
|
<title>The preview window</title>
|
|
|
|
<para>The preview window opens when you first click a
|
|
<literal>Preview</literal> link inside the result list.</para>
|
|
|
|
<para>Subsequent preview requests for a given search open new
|
|
tabs in the existing window (except if you hold the
|
|
<keycap>Shift</keycap> key while clicking which will open a new
|
|
window for side by side viewing).</para>
|
|
|
|
<para>Starting another search and requesting a preview will
|
|
create a new preview window. The old one stays open until you
|
|
close it.</para>
|
|
|
|
<para>You can close a preview tab by typing <keycap>Ctrl-W</keycap>
|
|
(<keycap>Ctrl</keycap> + <keycap>W</keycap>) in the
|
|
window. Closing the last tab for a window will also close the
|
|
window.</para>
|
|
|
|
<para>Of course you can also close a preview window by using the
|
|
window manager button at the top of the frame.</para>
|
|
|
|
<para>You can display successive or previous documents from the
|
|
result list inside a preview tab by typing
|
|
<keycap>Shift</keycap>+<keycap>Down</keycap> or
|
|
<keycap>Shift</keycap>+<keycap>Up</keycap> (<keycap>Down</keycap>
|
|
and <keycap>Up</keycap> are the arrow keys).</para>
|
|
|
|
<para>A right-click menu in the text area allows switching
|
|
between displaying the main text or the contents of fields
|
|
associated with the document (ie: author, abstract, etc.). This is
|
|
especially useful in cases where the term match did not occur in
|
|
the main text but in one of the fields. In the case of
|
|
images, you can switch between three displays: the image
|
|
itself, the image metadata as extracted
|
|
by <command>exiftool</command> and the fields, which is the
|
|
metadata stored in the index.</para>
|
|
|
|
|
|
<para>You can print the current preview window contents by typing
|
|
<keycap>Ctrl-P</keycap> (<keycap>Ctrl</keycap> +
|
|
<keycap>P</keycap>) in the window text.</para>
|
|
|
|
|
|
<sect3 id="RCL.SEARCH.GUI.PREVIEW.SEARCH">
|
|
<title>Searching inside the preview</title>
|
|
|
|
<para>The preview window has an internal search capability,
|
|
mostly controlled by the panel at the bottom of the window,
|
|
which works in two modes: as a classical editor incremental
|
|
search, where we look for the text entered in the entry
|
|
zone, or as a way to walk the matches between the document
|
|
and the &RCL; query that found it.</para>
|
|
|
|
<variablelist>
|
|
<varlistentry>
|
|
<term>Incremental text search</term>
|
|
<listitem><para>The preview tabs have an internal incremental search
|
|
function. You initiate the search either by typing a
|
|
<keycap>/</keycap> (slash) or <keycap>Ctrl-F</keycap>
|
|
inside the text area or by clicking into
|
|
the <guilabel>Search for:</guilabel> text field and
|
|
entering the search string. You can then use the
|
|
<guilabel>Next</guilabel>
|
|
and <guilabel>Previous</guilabel> buttons
|
|
to find the next/previous occurrence. You can also type
|
|
<keycap>F3</keycap> inside the text area to get to the next
|
|
occurrence.</para>
|
|
<para>If you have a search string entered and you use
|
|
Shift+Up/Shift+Down to browse the results, the search is
|
|
initiated for each successive document. If the string is
|
|
found, the cursor will be positioned at the first
|
|
occurrence of the search string.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>Walking the match lists</term>
|
|
<listitem><para>If the entry area is empty when you click
|
|
the <guilabel>Next</guilabel>
|
|
or <guilabel>Previous</guilabel> buttons, the editor will
|
|
be scrolled to show the next match to any search term
|
|
(the next highlighted zone). If you select a search group
|
|
from the dropdown list and click <guilabel>Next</guilabel>
|
|
or <guilabel>Previous</guilabel>, the match list for this
|
|
group will be walked. This is not the same as a text
|
|
search, because the occurrences will include non-exact
|
|
matches (as caused by stemming or wildcards). The search
|
|
will revert to the text mode as soon as you edit the
|
|
entry area.</para></listitem>
|
|
</varlistentry>
|
|
</variablelist>
|
|
|
|
|
|
</sect3>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.SEARCH.GUI.FRAGBUTS">
|
|
<title>The Query Fragments window</title>
|
|
|
|
<para>Selecting the <menuchoice><guimenu>Tools</guimenu>
|
|
<guimenuitem>Query Fragments</guimenuitem></menuchoice> menu
|
|
entry will open a window with radio- and check-buttons which
|
|
can be used to activate query language fragments for
|
|
filtering the current query. This can be useful if you have
|
|
frequent reusable selectors, for example, filtering on
|
|
alternate directories, or searching just one category of
|
|
files, not covered by the standard category
|
|
selectors.</para>
|
|
|
|
<para>The contents of the window are entirely customizable, and
|
|
defined by the contents of the <filename>fragbuts.xml</filename>
|
|
file inside the configuration directory. The sample file
|
|
distributed with &RCL; (which you should be able to find under
|
|
<filename>/usr/share/recoll/examples/fragbuts.xml</filename>),
|
|
contains an example which filters the results from the Web
|
|
history.</para>
|
|
|
|
<para>Here follows an example:
|
|
<programlisting><![CDATA[
<?xml version="1.0" encoding="UTF-8"?>

<fragbuts version="1.0">

  <radiobuttons>

    <fragbut>
      <label>Include Web Results</label>
      <frag></frag>
    </fragbut>

    <fragbut>
      <label>Exclude Web Results</label>
      <frag>-rclbes:BGL</frag>
    </fragbut>

    <fragbut>
      <label>Only Web Results</label>
      <frag>rclbes:BGL</frag>
    </fragbut>

  </radiobuttons>

  <buttons>

    <fragbut>
      <label>Year 2010</label>
      <frag>date:2010-01-01/2010-12-31</frag>
    </fragbut>

    <fragbut>
      <label>My Great Directory Only</label>
      <frag>dir:/my/great/directory</frag>
    </fragbut>

  </buttons>

</fragbuts>
]]></programlisting>
|
|
</para>
|
|
|
|
<para>Each <literal>radiobuttons</literal> or
|
|
<literal>buttons</literal> section defines a line of
|
|
radiobuttons or checkbuttons, respectively, inside the window. Any number of
|
|
checkbuttons can be selected, but the radiobuttons in a line are mutually
|
|
exclusive.</para>
|
|
|
|
<para>Each <literal>fragbut</literal> section defines the label
|
|
for a button, and the Query Language fragment which will be
|
|
added (as an AND filter) before performing the query if the
|
|
button is active.</para>
|
|
|
|
<para>This feature is new in &RCL; 1.20, and will probably be
|
|
refined depending on user feedback.</para>
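
<para>If you maintain many fragments, it may be more convenient to
generate <filename>fragbuts.xml</filename> than to edit it by hand. The
following Python sketch builds a file with the element names used in the
example above (<literal>fragbuts</literal>,
<literal>radiobuttons</literal>, <literal>buttons</literal>,
<literal>fragbut</literal>, <literal>label</literal>,
<literal>frag</literal>). The function name and its parameters are
hypothetical.</para>

<programlisting>
import xml.etree.ElementTree as ET


def build_fragbuts(radio, check):
    """Return the text of a fragbuts.xml document.

    radio and check are lists of (label, query language fragment) pairs,
    for one line of radiobuttons and one line of checkbuttons.
    """
    root = ET.Element("fragbuts", version="1.0")
    for section, items in (("radiobuttons", radio), ("buttons", check)):
        group = ET.SubElement(root, section)
        for label, frag in items:
            button = ET.SubElement(group, "fragbut")
            ET.SubElement(button, "label").text = label
            ET.SubElement(button, "frag").text = frag
    return ET.tostring(root, encoding="unicode")
</programlisting>

<para>For example, <literal>build_fragbuts([("Only Web Results",
"rclbes:BGL")], [("Year 2010", "date:2010-01-01/2010-12-31")])</literal>
returns a document with one radiobutton and one checkbutton, which can
then be written to the configuration directory.</para>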
|
|
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="RCL.SEARCH.GUI.COMPLEX">
|
|
<title>Complex/advanced search</title>
|
|
|
|
<para>The advanced search dialog helps you build more complex queries
|
|
without memorizing the search language constructs. It can be opened
|
|
through the <guilabel>Tools</guilabel> menu or through the main
|
|
toolbar.</para>
|
|
|
|
<para>&RCL; keeps a history of searches. See
|
|
<link linkend="RCL.SEARCH.GUI.COMPLEX.HISTORY">
|
|
Advanced search history</link>.</para>
|
|
|
|
<para>The dialog has two tabs:</para>
|
|
|
|
<orderedlist>
|
|
|
|
<listitem><para>The first tab lets you specify terms to search
|
|
for, and permits specifying multiple clauses which are combined
|
|
to build the search.</para>
|
|
</listitem>
|
|
|
|
<listitem><para>The second tab lets you filter the results according
|
|
to file size, date of modification, MIME type, or
|
|
location.</para>
|
|
</listitem>
|
|
|
|
</orderedlist>
|
|
|
|
<para>Click on the <guilabel>Start Search</guilabel> button in
|
|
the advanced search dialog, or type <keycap>Enter</keycap> in
|
|
any text field to start the search. The button in
|
|
the main window always performs a simple search.</para>
|
|
|
|
<para>Click on the <literal>Show query details</literal> link at
|
|
the top of the result page to see the query expansion.</para>
|
|
|
|
<sect3 id="RCL.SEARCH.GUI.COMPLEX.TERMS">
|
|
<title>Advanced search: the "find" tab</title>
|
|
|
|
<para>This part of the dialog lets you construct a query by
|
|
combining multiple clauses of different types. Each entry
|
|
field is configurable for the following modes:</para>
|
|
|
|
<itemizedlist>
|
|
<listitem><para>All terms.</para>
|
|
</listitem>
|
|
<listitem><para>Any term.</para>
|
|
</listitem>
|
|
<listitem><para>None of the terms.</para>
|
|
</listitem>
|
|
<listitem><para>Phrase (exact terms in order within an
|
|
adjustable window).</para>
|
|
</listitem>
|
|
<listitem><para>Proximity (terms in any order within an
|
|
adjustable window).</para>
|
|
</listitem>
|
|
<listitem><para>Filename search.</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
|
|
<para>Additional entry fields can be created by clicking the
|
|
<guilabel>Add clause</guilabel> button.</para>
|
|
|
|
<para>When searching, the non-empty clauses will be
|
|
combined either with an AND or an OR conjunction, depending on
|
|
the choice made on the left (<guilabel>All clauses</guilabel> or
|
|
<guilabel>Any clause</guilabel>).</para>
|
|
|
|
<para>Entries of all types except "Phrase" and "Proximity" accept
|
|
a mix of single words and phrases enclosed in double quotes.
|
|
Stemming and wildcard expansion will be performed as for simple
|
|
search. </para>
|
|
|
|
<formalpara><title>Phrases and Proximity searches</title>
|
|
<para>These two clauses work in similar ways, with the
|
|
difference that proximity searches do not impose an order on the
|
|
words. In both cases, an adjustable number (slack) of non-matched words
|
|
may be accepted between the searched ones (use the counter on
|
|
the left to adjust this count). For phrases, the default count
|
|
is zero (exact match). For proximity it is ten (meaning that two search
|
|
terms would be matched if found within a window of twelve
|
|
words). Examples: a phrase search for <literal>quick
|
|
fox</literal> with a slack of 0 will match <literal>quick
|
|
fox</literal> but not <literal>quick brown fox</literal>. With
|
|
a slack of 1 it will match the latter, but not <literal>fox
|
|
quick</literal>. A proximity search for <literal>quick
|
|
fox</literal> with the default slack will match the
|
|
latter, and also <literal>a fox is a cunning and quick
|
|
animal</literal>.</para>
|
|
</formalpara>
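
<para>The effect of the slack value can be illustrated with a small
Python model. This is a simplified sketch of the semantics described
above (window size equal to the number of terms plus the slack), not the
matching code actually used by &RCL; or &XAP;. The function name is
hypothetical.</para>

<programlisting>
from itertools import product


def matches(query, text, slack=0, ordered=True):
    """Return True if all query words occur in text within the window.

    ordered=True models a phrase search, ordered=False a proximity search.
    """
    terms = query.lower().split()
    words = text.lower().split()
    if not terms:
        return False
    window = len(terms) + slack
    # Candidate positions of each query term in the text.
    positions = [[i for i, w in enumerate(words) if w == t] for t in terms]
    for combo in product(*positions):
        if len(set(combo)) != len(combo):
            continue  # the same word cannot serve two terms
        if max(combo) - min(combo) + 1 > window:
            continue  # the terms are too far apart
        if ordered and list(combo) != sorted(combo):
            continue  # a phrase requires the query order
        return True
    return False
</programlisting>

<para>With this model, <literal>matches("quick fox", "quick brown fox",
slack=1)</literal> is true, while the same call with
<literal>slack=0</literal> is false, and <literal>matches("quick fox",
"fox quick", slack=10, ordered=False)</literal> is true, in agreement with
the examples given above.</para>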
|
|
|
|
</sect3>
|
|
|
|
<sect3 id="RCL.SEARCH.GUI.COMPLEX.FILTER">
|
|
<title>Advanced search: the "filter" tab</title>
|
|
|
|
<para>This part of the dialog has several sections which allow
|
|
filtering the results of a search according to a number of
|
|
criteria:</para>
|
|
|
|
<itemizedlist>
|
|
|
|
<listitem>
|
|
<para>The first section allows filtering by dates of last
|
|
modification. You can specify both a minimum and a maximum date. The
|
|
initial values are set according to the oldest and newest documents
|
|
found in the index.</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>The next section allows filtering the results by
|
|
file size. There are two entries for minimum and maximum
|
|
size. Enter decimal numbers. You can use suffix multipliers:
|
|
<literal>k/K</literal>, <literal>m/M</literal>,
|
|
<literal>g/G</literal>, <literal>t/T</literal> for 1E3, 1E6,
|
|
1E9, 1E12 respectively.</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>The next section allows filtering the results by their MIME
|
|
types, or MIME categories (ie: media/text/message/etc.).</para>
|
|
<para>You can transfer the types between two boxes, to define
|
|
which will be included or excluded by the search.</para>
|
|
<para>The state of the file type selection can be saved as
|
|
the default (the file type filter will not be activated at
|
|
program start-up, but the lists will be in the restored
|
|
state).</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>The bottom section allows restricting the search results to a
|
|
sub-tree of the indexed area. You can use the
|
|
<guilabel>Invert</guilabel> checkbox to search for files not in
|
|
the sub-tree instead. If you use directory filtering often and on
|
|
big subsets of the file system, you may think of setting up
|
|
multiple indexes instead, as the performance may be
|
|
better.</para>
|
|
<para>You can use relative/partial paths for filtering. For example,
|
|
entering <literal>dirA/dirB</literal> would match either
|
|
<filename>/dir1/dirA/dirB/myfile1</filename> or
|
|
<filename>/dir2/dirA/dirB/someother/myfile2</filename>.</para>
|
|
</listitem>
|
|
|
|
</itemizedlist>
|
|
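<para>As a worked example of the size suffix multipliers described
above (the values are arbitrary illustrations, not defaults):</para>
<screen>100k    100 x 1E3 = 100000 bytes
25m     25 x 1E6  = 25000000 bytes
1g      1 x 1E9   = 1000000000 bytes</screen>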
|
|
</sect3>
|
|
|
|
<sect3 id="RCL.SEARCH.GUI.COMPLEX.HISTORY">
|
|
<title>Advanced search history</title>
|
|
|
|
<para>The advanced search tool memorizes the last 100 searches
|
|
performed. You can walk the saved searches by using the up and
|
|
down arrow keys while the keyboard focus belongs to the advanced
|
|
search dialog.</para>
|
|
|
|
<para>The complex search history can be erased, along with the
|
|
one for simple search, by selecting the <menuchoice>
|
|
<guimenu>File</guimenu>
|
|
<guimenuitem>Erase Search History</guimenuitem>
|
|
</menuchoice> menu entry.</para>
|
|
|
|
</sect3>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.SEARCH.GUI.TERMEXPLORER">
|
|
<title>The term explorer tool</title>
|
|
|
|
<para>&RCL; automatically manages the expansion of search terms
|
|
to their derivatives (ie: plural/singular, verb
|
|
inflections). But there are other cases where the exact search
|
|
term is not known. For example, you may not remember the exact
|
|
spelling, or only know the beginning of the name.</para>
|
|
|
|
<para>The search will only propose replacement terms with
|
|
spelling variations when no matching documents were found. In some
|
|
cases, both proper spellings and misspellings are present in the
|
|
index, and it may be interesting to look for them explicitly.</para>
|
|
|
|
<para>The term explorer tool (started from the toolbar icon or
|
|
from the <guilabel>Term explorer</guilabel> entry of the
|
|
<guilabel>Tools</guilabel> menu) can be used to search the full index
|
|
terms list. It has four modes of operation:</para>
|
|
<variablelist>
|
|
|
|
<varlistentry>
|
|
<term>Wildcard</term>
|
|
<listitem><para>In this mode of operation, you can enter a
|
|
search string with shell-like wildcards (*, ?, []). For example,
|
|
<replaceable>xapi*</replaceable> would display all index terms
|
|
beginning with <replaceable>xapi</replaceable>. (More
|
|
about wildcards <link
|
|
linkend="RCL.SEARCH.WILDCARDS">here</link>).</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>Regular expression</term>
|
|
<listitem><para>This mode will accept a regular expression
|
|
as input. Example:
|
|
<replaceable>word[0-9]+</replaceable>. The expression is
|
|
implicitly anchored at the beginning. For example,
|
|
<replaceable>press</replaceable> will match
|
|
<replaceable>pression</replaceable> but not
|
|
<replaceable>expression</replaceable>. You can use
|
|
<replaceable>.*press</replaceable> to match the latter,
|
|
but be aware that this will cause a full index term list
|
|
scan, which can be quite long.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
|
|
<term>Stem expansion</term>
|
|
<listitem><para>This mode will perform the usual stem expansion
|
|
normally done as part of user input processing. As such it is
|
|
probably mostly useful to demonstrate the process.
|
|
</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>Spelling/Phonetic</term> <listitem><para>In this
|
|
mode, you enter the term as you think it is spelled, and
|
|
&RCL; will do its best to find index terms that sound like
|
|
your entry. This mode uses the
|
|
<application>Aspell</application> spelling application,
|
|
which must be installed on your system for things to work
|
|
(if your documents contain non-ascii characters, &RCL;
|
|
needs an aspell version newer than 0.60 for UTF-8
|
|
support). The language which is used to build the
|
|
dictionary out of the index terms (which is done at the
|
|
end of an indexing pass) is the one defined by your NLS
|
|
environment. Weird things will probably happen if
|
|
languages are mixed up.</para></listitem>
|
|
</varlistentry>
|
|
</variablelist>
|
|
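<para>The following sample inputs, taken from the descriptions above,
summarize what the two pattern-based modes would be expected to
return:</para>
<screen>Wildcard             xapi*        terms beginning with "xapi"
Wildcard             *coll        terms ending with "coll" (full term list scan)
Regular expression   word[0-9]+   terms beginning with "word" followed by at least one digit
Regular expression   press        terms beginning with "press" (implicit anchoring)
Regular expression   .*press      terms containing "press", such as "expression" (full term list scan)</screen>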
|
|
<para>Note that in cases where &RCL; does not know the beginning
|
|
of the string to search for (ie a wildcard expression like
|
|
<replaceable>*coll</replaceable>), the expansion can take quite
|
|
a long time because the full index term list will have to be
|
|
processed. The expansion is currently limited to 10000 results for
|
|
wildcards and regular expressions. It is possible to change the
|
|
limit in the configuration file.</para>
|
|
|
|
<para>Double-clicking on a term in the result list will insert
|
|
it into the simple search entry field. You can also cut/paste
|
|
between the result list and any entry field (the end of lines
|
|
will be taken care of).</para>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.SEARCH.GUI.MULTIDB">
|
|
<title>Multiple indexes</title>
|
|
|
|
<para>See the <link linkend="RCL.INDEXING.CONFIG.MULTIPLE">section
|
|
describing the use of multiple indexes</link> for
|
|
generalities. Only the aspects concerning
|
|
the <command>recoll</command> GUI are described here.</para>
|
|
|
|
<para>A <command>recoll</command> program instance is always
|
|
associated with a specific index, which is the one to be updated
|
|
when requested from the <guimenu>File</guimenu> menu, but it can
|
|
use any number of &RCL; indexes for searching. The external
|
|
indexes can be selected through the <guilabel>external
|
|
indexes</guilabel> tab in the preferences dialog.</para>
|
|
|
|
<para>Index selection is performed in two phases. A set of all
|
|
usable indexes must first be defined, and then the subset of
|
|
indexes to be used for searching. These parameters
|
|
are retained across program executions (they are kept
|
|
separately for each &RCL; configuration). The set of all indexes
|
|
is usually quite stable, while the active ones might typically
|
|
be adjusted quite frequently.</para>
|
|
|
|
<para>The main index (defined by
|
|
<envar>RECOLL_CONFDIR</envar>) is always active. If this is
|
|
undesirable, you can set up your base configuration to index
|
|
an empty directory.</para>
|
|
|
|
<para>When adding a new index to the set, you can select either
|
|
a &RCL; configuration directory, or directly a &XAP; index
|
|
directory. In the first case, the &XAP; index directory will
|
|
be obtained from the selected configuration.</para>
|
|
|
|
<para>As building the set of all indexes can be a little tedious
|
|
when done through the user interface, you can use the
|
|
<envar>RECOLL_EXTRA_DBS</envar> environment
|
|
variable to provide an initial set. This might typically be
|
|
set up by a system administrator so that every user does not
|
|
have to do it. The variable should define a colon-separated list
|
|
of index directories, for example:
|
|
</para>
|
|
<screen>export RECOLL_EXTRA_DBS=/some/place/xapiandb:/some/other/db</screen>
|
|
|
|
<para>Another environment variable,
|
|
<envar>RECOLL_ACTIVE_EXTRA_DBS</envar> allows adding to the active
|
|
list of indexes. This variable was suggested and implemented by a
|
|
&RCL; user. It is mostly useful if you use scripts to mount
|
|
external volumes with &RCL; indexes. By using
|
|
<envar>RECOLL_EXTRA_DBS</envar> and
|
|
<envar>RECOLL_ACTIVE_EXTRA_DBS</envar>, you can add and activate
|
|
the index for the mounted volume when starting
|
|
<command>recoll</command>.
|
|
</para>
|
|
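<para>A minimal sketch of such a mount script follows. The index path
is hypothetical, and it is assumed here that
<envar>RECOLL_ACTIVE_EXTRA_DBS</envar> uses the same colon-separated
format as <envar>RECOLL_EXTRA_DBS</envar>:</para>
<screen># Hypothetical location of the index stored on the external volume
VOLDB=/media/backupdisk/recoll/xapiandb
# Make the index known to recoll...
export RECOLL_EXTRA_DBS=$VOLDB
# ...and activate it for searching (assumed to take the same list format)
export RECOLL_ACTIVE_EXTRA_DBS=$VOLDB
recoll</screen>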
|
|
<para><envar>RECOLL_ACTIVE_EXTRA_DBS</envar> is available for
|
|
&RCL; versions 1.17.2 and later. A change was made in the same
|
|
update so that <command>recoll</command> will
|
|
automatically deactivate unreachable indexes when starting
|
|
up.</para>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.SEARCH.GUI.HISTORY">
|
|
<title>Document history</title>
|
|
|
|
<para>Documents that you actually view (with the internal preview
|
|
or an external tool) are entered into the document history,
|
|
which is remembered.</para>
|
|
<para>You can display the history list by using
|
|
the <guilabel>Tools/</guilabel><guilabel>Doc History</guilabel> menu
|
|
entry.</para>
|
|
<para>You can erase the document history by using the
|
|
<guilabel>Erase document history</guilabel> entry in the
|
|
<guimenu>File</guimenu> menu.</para>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.SEARCH.GUI.SORT">
|
|
<title>Sorting search results and collapsing duplicates</title>
|
|
|
|
<para>The documents in a result list are normally sorted in
|
|
order of relevance. It is possible to specify a different sort
|
|
order, either by using the vertical arrows in the GUI toolbox to
|
|
sort by date, or switching to the result table display and clicking
|
|
on any header. The sort order chosen inside the result table
|
|
remains active if you switch back to the result list, until you
|
|
click one of the vertical arrows, or until both are unchecked (you are
|
|
back to sort by relevance).</para>
|
|
|
|
<para>Sort parameters are remembered between program
|
|
invocations, but result sorting is normally always inactive
|
|
when the program starts. It is possible to keep the sorting
|
|
activation state between program invocations by checking the
|
|
<guilabel>Remember sort activation state</guilabel> option in
|
|
the preferences.</para>
|
|
|
|
<para>It is also possible to hide duplicate entries inside
|
|
the result list (documents with the exact same contents as the
|
|
displayed one). The test of identity is based on an MD5 hash
|
|
of the document container, not only of the text contents (so
|
|
that, for example, a text document with an image added will not be a
|
|
duplicate of the text only). Duplicates hiding is controlled
|
|
by an entry in the <guilabel>GUI configuration</guilabel>
|
|
dialog, and is off by default.</para>
|
|
|
|
<para>As of release 1.19, when a result document does have
|
|
undisplayed duplicates, a <literal>Dups</literal>
|
|
link will be shown with the result list entry. Clicking the
|
|
link will display the paths (URLs + ipaths) for the duplicate
|
|
entries.</para>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.SEARCH.GUI.TIPS">
|
|
<title>Search tips, shortcuts</title>
|
|
|
|
<sect3 id="RCL.SEARCH.GUI.TIPS.TERMS">
|
|
<title>Terms and search expansion</title>
|
|
|
|
<formalpara><title>Term completion</title>
|
|
<para>Typing <keycap>Esc</keycap> <keycap>Space</keycap> in
|
|
the simple search entry field while entering a word will
|
|
either complete the current word if its beginning matches a
|
|
unique term in the index, or open a window to propose a list
|
|
of completions.</para>
|
|
</formalpara>
|
|
|
|
<formalpara><title>Picking up new terms from result or preview
|
|
text</title>
|
|
<para>Double-clicking on a word in the result list or in a
|
|
preview window will copy it to the simple search entry field.</para>
|
|
</formalpara>
|
|
|
|
<formalpara><title>Wildcards</title>
|
|
<para>Wildcards can be used inside search terms in all forms
|
|
of searches. <link linkend="RCL.SEARCH.WILDCARDS">
|
|
More about wildcards</link>.
|
|
</para>
|
|
</formalpara>
|
|
|
|
<formalpara><title>Automatic suffixes</title>
|
|
<para>Words like <literal>odt</literal> or <literal>ods</literal>
|
|
can be automatically turned into query language
|
|
<literal>ext:xxx</literal> clauses. This can be enabled in the
|
|
<guilabel>Search preferences</guilabel> panel in the GUI.
|
|
</para>
|
|
</formalpara>
|
|
|
|
<formalpara><title>Disabling stem expansion</title>
|
|
<para>Entering a capitalized word in any search field will prevent
|
|
stem expansion (no search for
|
|
<literal>gardening</literal> if you enter
|
|
<literal>Garden</literal> instead of
|
|
<literal>garden</literal>). This is the only case where
|
|
character case should make a difference for a &RCL;
|
|
search. You can also disable stem expansion or change the
|
|
stemming language in the preferences.</para>
|
|
</formalpara>
|
|
|
|
<formalpara><title>Finding related documents</title>
|
|
<para>Selecting the <guilabel>Find similar documents</guilabel> entry
|
|
in the result list paragraph right-click menu will select a
|
|
set of "interesting" terms from the current result, and insert
|
|
them into the simple search entry field. You can then possibly
|
|
edit the list and start a search to find documents which may
|
|
be related to the current result.</para>
|
|
</formalpara>
|
|
|
|
<formalpara><title>File names</title>
|
|
<para>File names are added as terms during indexing, and you can
|
|
specify them as ordinary terms in normal search fields (&RCL; used
|
|
to index all directories in the file path as terms. This has been
|
|
abandoned as it did not seem really useful). Alternatively, you
|
|
can use the specific file name search which will
|
|
<emphasis>only</emphasis> look for file names, and may be
|
|
faster than the generic search especially when using wildcards.</para>
|
|
</formalpara>
|
|
|
|
</sect3>
|
|
|
|
|
|
<sect3 id="RCL.SEARCH.GUI.TIPS.PHRASES">
|
|
<title>Working with phrases and proximity</title>
|
|
|
|
<formalpara><title>Phrases and Proximity searches</title>
|
|
<para>A phrase can be looked for by enclosing it in double
|
|
quotes. Example: <literal>"user manual"</literal> will look
|
|
only for occurrences of <literal>user</literal> immediately
|
|
followed by <literal>manual</literal>. You can use the
|
|
<guilabel>This phrase</guilabel> field of the advanced
|
|
search dialog to the same effect. Phrases can be entered along with
|
|
simple terms in all simple or advanced search entry fields
|
|
(except <guilabel>This exact phrase</guilabel>).</para>
|
|
</formalpara>
|
|
|
|
<formalpara><title>AutoPhrases</title>
|
|
<para>This option can be set in the preferences dialog. If it is
|
|
set, a phrase will be automatically built and added to simple
|
|
searches when looking for <literal>Any terms</literal>. This
|
|
will not radically change the results, but will give a relevance
|
|
boost to the results where the search terms appear as a
|
|
phrase. For example, searching for <literal>virtual reality</literal>
|
|
will still find all documents where either
|
|
<literal>virtual</literal> or <literal>reality</literal> or
|
|
both appear, but those which contain <literal>virtual
|
|
reality</literal> should appear sooner in the list.</para>
|
|
</formalpara>
|
|
|
|
<para>Phrase searches can strongly slow down a query if most of the
|
|
terms in the phrase are common. This is why the
|
|
<varname>autophrase</varname> option is off by default for &RCL;
|
|
versions before 1.17. As of version 1.17,
|
|
<varname>autophrase</varname> is on by default, but very common
|
|
terms will be removed from the constructed phrase. The removal
|
|
threshold can be adjusted from the search preferences.</para>
|
|
|
|
<formalpara><title>Phrases and abbreviations</title> <para>As of
|
|
&RCL; version 1.17, dotted abbreviations like
|
|
<literal>I.B.M.</literal> are also automatically indexed as a word
|
|
without the dots: <literal>IBM</literal>. Searching for the word
|
|
inside a phrase (ie: <literal>"the IBM company"</literal>) will only
|
|
match the dotted abbreviation if you increase the phrase slack (using the
|
|
advanced search panel control, or the <literal>o</literal> query
|
|
language modifier). Literal occurrences of the word will be matched
|
|
normally.</para></formalpara>
|
|
|
|
|
|
</sect3>
|
|
|
|
<sect3 id="RCL.SEARCH.GUI.TIPS.MISC">
|
|
<title>Others</title>
|
|
|
|
<formalpara><title>Using fields</title>
|
|
<para>You can use the <link linkend="RCL.SEARCH.LANG">query
|
|
language </link> and field specifications
|
|
to only search certain parts of documents. This can be
|
|
especially helpful with email, for example only searching
|
|
emails from a specific originator:
|
|
<literal>search tips from:helpfulgui</literal>
|
|
</para>
|
|
</formalpara>
|
|
|
|
<formalpara><title>Adjusting the result table columns</title>
|
|
<para>When displaying results in table mode, you can use a
|
|
right click on the table headers to activate a pop-up menu
|
|
which will let you adjust what columns are displayed. You can
|
|
drag the column headers to adjust their order. You can click
|
|
them to sort by the field displayed in the column. You can
|
|
also save the result list in CSV format.</para>
|
|
</formalpara>
|
|
|
|
|
|
<formalpara><title>Changing the GUI geometry</title>
|
|
<para>It is possible to configure the GUI in wide form
|
|
factor by dragging the toolbars to one of the sides (their
|
|
location is remembered between sessions), and moving the
|
|
category filters to a menu (can be set in the
|
|
<menuchoice>
|
|
<guimenu>Preferences</guimenu>
|
|
<guimenuitem>GUI configuration</guimenuitem>
|
|
<guimenuitem>User interface</guimenuitem>
|
|
</menuchoice> panel).</para>
|
|
</formalpara>
|
|
|
|
<formalpara><title>Query explanation</title>
|
|
<para>You can get an exact description of what the query
|
|
looked for, including stem expansion, and Boolean operators
|
|
used, by clicking on the result list header.</para>
|
|
</formalpara>
|
|
|
|
<formalpara><title>Advanced search history</title>
|
|
<para>As of &RCL; 1.18, you can display any of the last 100 complex
|
|
searches performed by using the up and down arrow keys while the
|
|
advanced search panel is active.</para>
|
|
</formalpara>
|
|
|
|
<formalpara><title>Browsing the result list inside a preview
|
|
window</title>
|
|
<para>Entering <keycap>Shift-Down</keycap> or <keycap>Shift-Up</keycap>
|
|
(<keycap>Shift</keycap> + an arrow key) in a preview window will
|
|
display the next or the previous document from the result
|
|
list. Any secondary search currently active will be executed on
|
|
the new document.</para>
|
|
</formalpara>
|
|
|
|
<formalpara><title>Scrolling the result list from the keyboard</title>
|
|
<para>You can use <keycap>PageUp</keycap> and <keycap>PageDown</keycap>
|
|
to scroll the result list, <keycap>Shift+Home</keycap> to go back
|
|
to the first page. These work even while the focus is in the
|
|
search entry.</para>
|
|
</formalpara>
|
|
|
|
<formalpara><title>Result table: moving the focus to the table</title>
|
|
<para>You can use <keycap>Ctrl-r</keycap> to move the focus
|
|
from the search entry to the table, and then use the arrow keys
|
|
to change the current row. <keycap>Ctrl-Shift-s</keycap> returns to
|
|
the search.</para>
|
|
</formalpara>
|
|
|
|
<formalpara><title>Result table: open / preview</title>
|
|
<para>With the focus in the result table, you can use
|
|
<keycap>Ctrl-o</keycap> to open the document from the current
|
|
row, <keycap>Ctrl-Shift-o</keycap> to open the document and close
|
|
<command>recoll</command>, <keycap>Ctrl-d</keycap> to preview
|
|
the document.</para>
|
|
</formalpara>
|
|
|
|
<formalpara><title>Editing a new search while the focus is not
|
|
in the search entry</title>
|
|
<para>You can use the <keycap>Ctrl-Shift-S</keycap> shortcut to
|
|
return the cursor to the search entry (and select the current
|
|
search text), while the focus is anywhere in the main
|
|
window.</para>
|
|
</formalpara>
|
|
|
|
<formalpara><title>Forced opening of a preview window</title>
|
|
<para>You can use <keycap>Shift</keycap>+Click on a result list
|
|
<literal>Preview</literal> link to force the creation of a
|
|
preview window instead of a new tab in the existing one.</para>
|
|
</formalpara>
|
|
|
|
<formalpara><title>Closing previews</title>
|
|
<para>Entering <keycap>Ctrl-W</keycap> in a tab will
|
|
close it (and, for the last tab, close the preview
|
|
window). Entering <keycap>Esc</keycap> will close the preview
|
|
window and all its tabs.</para>
|
|
</formalpara>
|
|
|
|
<formalpara><title>Printing previews</title>
|
|
<para>Entering <keycap>Ctrl-P</keycap> in a preview window will print
|
|
the currently displayed text.</para>
|
|
</formalpara>
|
|
|
|
<formalpara><title>Quitting</title>
|
|
<para>Entering <keycap>Ctrl-Q</keycap> almost anywhere will
|
|
close the application.</para>
|
|
</formalpara>
|
|
</sect3>
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.SEARCH.SAVING">
|
|
<title>Saving and restoring queries (1.21 and later)</title>
|
|
|
|
<para>Both simple and advanced query dialogs save recent
|
|
history, but the amount is limited: old queries will eventually
|
|
be forgotten. Also, important queries may be difficult to find
|
|
among others. This is why both types of queries can also be
|
|
explicitly saved to files, from the GUI menus:
|
|
<menuchoice>
|
|
<guimenu>File</guimenu>
|
|
<guimenuitem>Save last query / Load last query</guimenuitem>
|
|
</menuchoice>
|
|
</para>
|
|
|
|
<para>The default location for saved queries is a subdirectory
|
|
of the current configuration directory, but saved queries are
|
|
ordinary files and can be written or moved anywhere.</para>
|
|
|
|
<para>Some of the saved query parameters are part of the
|
|
preferences (e.g. <literal>autophrase</literal> or the active
|
|
external indexes), and may differ when the query is
|
|
loaded from the time it was saved. In this case, &RCL; will warn
|
|
of the differences, but will not change the user
|
|
preferences.</para>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.SEARCH.GUI.CUSTOM">
|
|
<title>Customizing the search interface</title>
|
|
|
|
<para>You can customize some aspects of the search interface by using
|
|
the <guimenu>GUI configuration</guimenu> entry in the
|
|
<guimenu>Preferences</guimenu> menu.</para>
|
|
|
|
<para>There are several tabs in the dialog, dealing with the
|
|
interface itself, the parameters used for searching and
|
|
returning results, and what indexes are searched.</para>
|
|
|
|
|
|
<formalpara id="RCL.SEARCH.GUI.CUSTOM.UI">
|
|
<title>User interface parameters:</title>
|
|
<para>
|
|
<itemizedlist>
|
|
|
|
<listitem><para><guilabel>Highlight color for query
|
|
terms</guilabel>: Terms from the user query are highlighted in
|
|
the result list samples and the preview window. The color can
|
|
be chosen here. Any Qt color string should work (ie
|
|
<literal>red</literal>, <literal>#ff0000</literal>). The
|
|
default is <literal>blue</literal>.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><guilabel>Style sheet</guilabel>:
|
|
The name of a <application>Qt</application> style sheet
|
|
text file which is applied to the whole Recoll application
|
|
on startup. The default value is empty, but there is a
|
|
skeleton style sheet (<filename>recoll.qss</filename>)
|
|
inside the <filename>/usr/share/recoll/examples</filename>
|
|
directory. Using a style sheet, you can change most
|
|
<command>recoll</command> graphical parameters:
|
|
colors, fonts, etc. See the sample file for a few
|
|
simple examples.</para>
|
|
<para>You should be aware that parameters (e.g.: the
|
|
background color) set inside the &RCL; GUI style sheet
|
|
will override global system preferences, with possible
|
|
strange side effects: for example if you set the
|
|
foreground to a light color and the background to a
|
|
dark one in the desktop preferences, but only the
|
|
background is set inside the &RCL; style sheet, and it
|
|
is light too, then text will appear light-on-light
|
|
inside the &RCL; GUI.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><guilabel>Maximum text size highlighted for
|
|
preview</guilabel>: inserting highlights on search terms inside
|
|
the text before inserting it in the preview window involves
|
|
quite a lot of processing, and can be disabled over the given
|
|
text size to speed up loading.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><guilabel>Prefer HTML to plain text for
|
|
preview</guilabel>: if set, Recoll will display HTML as such
|
|
inside the preview window. If this causes problems with the Qt
|
|
HTML display, you can uncheck it to display the plain text
|
|
version instead. </para>
|
|
</listitem>
|
|
|
|
<listitem><para><guilabel>Plain text to HTML line style</guilabel>:
|
|
when displaying plain text inside the preview window, &RCL;
|
|
tries to preserve some of the original text line breaks and
|
|
indentation. It can either use PRE HTML tags, which will
|
|
preserve the indentation well but will force horizontal
|
|
scrolling for long lines, or use BR tags to break at the
|
|
original line breaks, which will let the editor introduce
|
|
other line breaks according to the window width, but will
|
|
lose some of the original indentation. The third option has
|
|
been available in recent releases and is probably now the best
|
|
one: use PRE tags with line wrapping.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><guilabel>Choose editor
|
|
applications</guilabel>: this opens a dialog which allows you
|
|
to select the application to be used to open each MIME
|
|
type. The default is normally to use the
|
|
<command>xdg-open</command> utility, but you can override it.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><guilabel>Exceptions</guilabel>: even when
|
|
<command>xdg-open</command> is used by default for opening
|
|
documents, you can set exceptions for MIME types that will
|
|
still be opened according to &RCL; preferences. This is useful
|
|
for passing parameters like page numbers or search strings to
|
|
applications that support them
|
|
(e.g. <application>evince</application>). This cannot be done
|
|
with <command>xdg-open</command> which only supports passing
|
|
one parameter.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><guilabel>Document filter choice
|
|
style</guilabel>: this will let you choose if the document
|
|
categories are displayed as a list or a set of buttons, or a
|
|
menu.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><guilabel>Start with simple search
|
|
mode</guilabel>: this lets you choose the value of the simple
|
|
search type on program startup. Either a fixed value
|
|
(e.g. <literal>Query Language</literal>), or the value in use
|
|
when the program last exited.</para></listitem>
|
|
|
|
<listitem><para><guilabel>Auto-start simple search on white
|
|
space entry</guilabel>: if this is checked, a search will be
|
|
executed each time you enter a space in the simple search input
|
|
field. This lets you look at the result list as you enter new
|
|
terms. This is off by default, you may like it or not...</para>
|
|
</listitem>
|
|
|
|
<listitem><para><guilabel>Start with advanced search dialog open
|
|
</guilabel>: If you use this dialog frequently, checking
|
|
this entry will make it open when <command>recoll</command> starts.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><guilabel>Remember sort activation
|
|
state</guilabel>: if set, Recoll will remember the sort tool
|
|
state between invocations. It normally starts with sorting
|
|
disabled.</para>
|
|
</listitem>
|
|
|
|
</itemizedlist>
|
|
</para>
|
|
</formalpara>
|
|
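<para>As an illustration of the style sheet caveat above, the
following minimal sketch sets the foreground and background colors
together, so that text should stay readable whatever the desktop
preferences are. The selector and color values are arbitrary examples
and are not taken from the sample <filename>recoll.qss</filename>
file:</para>
<programlisting>/* Illustrative values only: always set both colors as a pair */
QWidget {
    color: #202020;
    background-color: #f5f5f5;
}</programlisting>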
|
|
|
|
<formalpara id="RCL.SEARCH.GUI.CUSTOM.RL">
|
|
<title>Result list parameters:</title>
|
|
<para>
|
|
<itemizedlist>
|
|
|
|
<listitem><para><guilabel>Number of results in a result
|
|
page</guilabel></para>
|
|
</listitem>
|
|
|
|
<listitem><para><guilabel>Result list font</guilabel>: There is
|
|
quite a lot of information shown in the result list, and you
|
|
may want to customize the font and/or font size. The rest of
|
|
the fonts used by &RCL; are determined by your generic Qt
|
|
config (try the <command>qtconfig</command> command).</para>
|
|
</listitem>
|
|
|
|
<listitem id="RCL.SEARCH.GUI.CUSTOM.RESULTPARA">
|
|
<para><guilabel>Edit result list paragraph format string</guilabel>:
|
|
allows you to change the presentation of each result list
|
|
entry. See the <link linkend="RCL.SEARCH.GUI.CUSTOM.RESLIST">
|
|
result list customisation section</link>.</para>
|
|
</listitem>
|
|
|
|
<listitem id="RCL.SEARCH.GUI.CUSTOM.RESULTHEAD">
|
|
<para><guilabel>Edit result page HTML header insert</guilabel>:
|
|
allows you to define text inserted at the end of the result
|
|
page HTML header.
|
|
More detail in the <link linkend="RCL.SEARCH.GUI.CUSTOM.RESLIST">
|
|
result list customisation section.</link></para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para><guilabel>Date format</guilabel>: allows specifying the
|
|
format used for displaying dates inside the result list. This
|
|
should be specified as an strftime() string (man strftime).</para>
|
|
</listitem>
|
|
|
|
<listitem id="RCL.SEARCH.GUI.CUSTOM.ABSSEP">
|
|
<para><guilabel>Abstract snippet separator</guilabel>:
|
|
for synthetic abstracts built from index data, which are
|
|
usually made of several snippets from different parts of the
|
|
document, this defines the snippet separator, an ellipsis by
|
|
default. </para>
|
|
</listitem>
|
|
|
|
</itemizedlist></para>
|
|
</formalpara>
|
|
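<para>For instance, the following <literal>strftime()</literal>
format strings use only standard conversion specifiers. The rendered
dates are shown for an arbitrary sample timestamp; the weekday and
month names produced by the last format depend on the locale:</para>
<screen>%Y-%m-%d %H:%M:%S    2015-03-21 14:05:09
%d/%m/%Y             21/03/2015
%a %d %b %Y          Sat 21 Mar 2015</screen>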
|
|
<formalpara id="RCL.SEARCH.GUI.CUSTOM.SEARCH">
|
|
<title>Search parameters:</title>
|
|
<para>
|
|
<itemizedlist>
|
|
|
|
<listitem><para><guilabel>Hide duplicate results</guilabel>:
|
|
decides if result list entries are shown for identical
|
|
documents found in different places.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><guilabel>Stemming language</guilabel>:
|
|
stemming obviously depends on the document's language. This
|
|
listbox will let you choose among the stemming databases which
|
|
were built during indexing (this is set in the <link
|
|
linkend="RCL.INSTALL.CONFIG.RECOLLCONF">main configuration
|
|
file</link>), or later added with <command>recollindex
|
|
-s</command> (See the recollindex manual). Stemming languages
|
|
which are dynamically added will be deleted at the next
|
|
indexing pass unless they are also added in the configuration
|
|
file.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><guilabel>Automatically add phrase to simple
|
|
searches</guilabel>: a phrase will be automatically built and
|
|
added to simple searches when looking for <literal>Any
|
|
terms</literal>. This will give a relevance boost to the
|
|
results where the search terms appear as a phrase (consecutive
|
|
and in order).</para>
|
|
</listitem>
|
|
|
|
<listitem><para><guilabel>Autophrase term frequency threshold
|
|
percentage</guilabel>: very frequent terms should not be included
|
|
in automatic phrase searches for performance reasons. The
|
|
parameter defines the cutoff percentage (percentage of the
|
|
documents where the term appears).</para>
|
|
</listitem>
|
|
|
|
<listitem><para><guilabel>Replace abstracts from
|
|
documents</guilabel>: this decides if we should synthesize and
|
|
display an abstract in place of an explicit abstract found
|
|
within the document itself.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><guilabel>Dynamically build
|
|
abstracts</guilabel>: this decides if &RCL; tries to build
|
|
document abstracts (lists of <emphasis>snippets</emphasis>)
|
|
when displaying the result list. Abstracts are constructed by
|
|
taking context from the document information, around the search
|
|
terms.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><guilabel>Synthetic abstract size</guilabel>:
|
|
adjust to taste...</para>
|
|
</listitem>
|
|
|
|
<listitem><para><guilabel>Synthetic abstract context
|
|
words</guilabel>: how many words should be displayed around
|
|
each term occurrence.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><guilabel>Query language magic file name
|
|
suffixes</guilabel>: a list of words which automatically get
|
|
turned into <literal>ext:xxx</literal> file name suffix clauses
|
|
when starting a query language query (ie: <literal>doc xls
|
|
xlsx...</literal>). This will save some typing for people who
|
|
use file types a lot when querying.</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
</formalpara>
|
|
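<para>As a hedged example of the magic file name suffixes setting:
assuming that the configured list contains <literal>xls</literal>, a
query language search would be expected to be rewritten as follows
(the search term <literal>budget</literal> is an arbitrary
illustration):</para>
<screen>typed:        budget xls
handled as:   budget ext:xls</screen>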
|
|
<formalpara id="RCL.SEARCH.GUI.CUSTOM.EXTRADB">
|
|
<title>External indexes:</title>
|
|
<para>This panel will let you browse for additional indexes
|
|
that you may want to search. External indexes are designated by
|
|
	  their database directory (e.g.
|
|
<filename>/home/someothergui/.recoll/xapiandb</filename>,
|
|
<filename>/usr/local/recollglobal/xapiandb</filename>).</para>
|
|
</formalpara>
|
|
|
|
<para>Once entered, the indexes will appear in the
|
|
<guilabel>External indexes</guilabel> list, and you can
|
|
	    choose which ones you want to use at any moment by checking or
|
|
unchecking their entries.</para>
|
|
|
|
<para>Your main database (the one the current configuration
|
|
	    indexes to) is always implicitly active. If this is not
|
|
desirable, you can set up your configuration so that it indexes,
|
|
for example, an empty directory. An alternative indexer may also
|
|
need to implement a way of purging the index from stale data,
|
|
</para>
|
|
|
|
<sect3 id="RCL.SEARCH.GUI.CUSTOM.RESLIST">
|
|
<title>The result list format</title>
|
|
|
|
<para>Newer versions of Recoll (from 1.17) normally use WebKit HTML
|
|
widgets for the result list and the
|
|
<link linkend="RCL.SEARCH.GUI.RESULTLIST.MENU.SNIPPETS">
|
|
snippets window</link> (this may be disabled at build time).
|
|
Total customisation is possible with full support for CSS and
|
|
Javascript. Conversely, there are limits to what you can do with
|
|
the older Qt QTextBrowser, but still, it is possible to decide
|
|
what data each result will contain, and how it will be
|
|
displayed.</para>
|
|
|
|
<para>The result list presentation can be exhaustively customized
|
|
by adjusting two elements:
|
|
|
|
<itemizedlist>
|
|
<listitem><para>The paragraph format</para></listitem>
|
|
<listitem><para>HTML code inside the header section. For
|
|
versions 1.21 and later, this is also used for the
|
|
<link linkend="RCL.SEARCH.GUI.RESULTLIST.MENU.SNIPPETS">
|
|
snippets window</link> </para></listitem>
|
|
</itemizedlist>
|
|
The paragraph format and the header fragment can be edited
|
|
from the <guilabel>Result list</guilabel> tab of the
|
|
<guilabel>GUI configuration</guilabel>.
|
|
</para>
|
|
|
|
<para>The header fragment is used both for the result list and
|
|
the snippets window. The snippets list is a table and has a
|
|
<literal>snippets</literal> class attribute. Each paragraph in
|
|
the result list is a table, with class
|
|
<literal>respar</literal>, but this can be changed by editing
|
|
the paragraph format.</para>
|
|
|
|
<para>There are a few examples on the
|
|
<ulink url="http://www.recoll.org/custom.html">page about
|
|
customising the result list</ulink> on the &RCL; web site.</para>
|
|
|
|
<sect4 id="RCL.SEARCH.GUI.CUSTOM.RESLIST.PARA">
|
|
<title>The paragraph format</title>
|
|
|
|
<para>This is an arbitrary HTML string where the following printf-like
|
|
<literal>%</literal> substitutions will be performed:
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<formalpara><title>%A</title><para>Abstract</para></formalpara>
|
|
</listitem>
|
|
<listitem><formalpara><title>%D</title><para>Date</para></formalpara>
|
|
</listitem>
|
|
<listitem><formalpara><title>%I</title><para>Icon image
|
|
name. This is normally determined from the MIME type. The
|
|
associations are defined inside the
|
|
<link linkend="RCL.INSTALL.CONFIG.MIMECONF">
|
|
<filename>mimeconf</filename> configuration file</link>.
|
|
If a thumbnail for the file is found at
|
|
the standard Freedesktop location, this will be displayed
|
|
instead.</para></formalpara>
|
|
</listitem>
|
|
<listitem><formalpara><title>%K</title><para>Keywords (if
|
|
any)</para></formalpara>
|
|
</listitem>
|
|
<listitem><formalpara><title>%L</title><para>Precooked Preview,
|
|
Edit, and possibly Snippets links</para></formalpara>
|
|
</listitem>
|
|
<listitem><formalpara><title>%M</title><para>MIME
|
|
type</para></formalpara>
|
|
</listitem>
|
|
<listitem><formalpara><title>%N</title><para>result Number inside
|
|
the result page</para></formalpara>
|
|
</listitem>
|
|
<listitem><formalpara><title>%P</title><para>Parent folder
|
|
Url. In the case of an embedded document, this is the parent folder
|
|
for the top level container file.</para></formalpara>
|
|
</listitem>
|
|
<listitem><formalpara><title>%R</title><para>Relevance
|
|
percentage</para></formalpara>
|
|
</listitem>
|
|
<listitem><formalpara><title>%S</title><para>Size
|
|
information</para></formalpara>
|
|
</listitem>
|
|
<listitem><formalpara><title>%T</title><para>Title or Filename if
|
|
not set.</para></formalpara>
|
|
</listitem>
|
|
<listitem><formalpara><title>%t</title><para>Title or Filename if
|
|
not set.</para></formalpara>
|
|
</listitem>
|
|
<listitem><formalpara><title>%U</title><para>Url</para></formalpara>
|
|
</listitem>
|
|
</itemizedlist>
|
|
|
|
The format of the Preview, Edit, and Snippets links is
|
|
<literal><a href="P%N"></literal>,
|
|
<literal><a href="E%N"></literal>
|
|
and
|
|
<literal><a href="A%N"></literal>
|
|
where <replaceable>docnum</replaceable> (%N) expands to the document
|
|
	      number inside the result page.</para>
|
|
|
|
<para>A link target defined as <literal>"F%N"</literal> will open
|
|
the document corresponding to the <literal>%P</literal> parent
|
|
folder expansion, usually creating a file manager window on the
|
|
folder where the container file resides. E.g.:
|
|
<programlisting><a href="F%N">%P</a></programlisting>
|
|
</para>
|
|
|
|
<para>A link target defined as
|
|
<literal>R%N|<replaceable>scriptname</replaceable></literal> will
|
|
run the corresponding script on the result file (if the document is
|
|
embedded, the script will be started on the top-level parent).
|
|
See the <link linkend="RCL.SEARCH.GUI.RUNSCRIPT">section about
|
|
defining scripts</link>.</para>
|
|
|
|
<para>In addition to the predefined values above, all strings
|
|
like <literal>%(fieldname)</literal> will be replaced by the
|
|
value of the field named <literal>fieldname</literal> for this
|
|
document. Only stored fields can be accessed in this way, the
|
|
value of indexed but not stored fields is not known at this
|
|
point in the search process
|
|
(see <link linkend="RCL.PROGRAM.FIELDS">field
|
|
configuration</link>). There are currently very few fields
|
|
stored by default, apart from the values above
|
|
(only <literal>author</literal>
|
|
and <literal>filename</literal>), so this feature will need
|
|
some custom local configuration to be useful. An example
|
|
candidate would be the <literal>recipient</literal> field
|
|
which is generated by the message input handlers.</para>
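	    <para>The substitution rules above can be illustrated with a
	    small sketch. This is not the code used by &RCL;, only a
	    hypothetical Python model of the behaviour described in this
	    section: one-letter codes and field references are looked up in
	    two dictionaries supplied by the caller (both names are
	    assumptions made for the example).</para>

```python
import re

# Matches either %(fieldname) or a single-character code such as %T.
_SUBST = re.compile(r"%(?:\((\w+)\)|(.))")


def render_paragraph(fmt, codes, fields):
    """Expand %X codes and %(fieldname) references in a format string.

    codes  -- dict mapping one-letter codes ("T", "U", ...) to values
    fields -- dict mapping stored field names to values
    In this sketch a field which is not stored expands to an empty
    string, and an unknown one-letter code is left untouched.
    """
    def repl(match):
        field, code = match.groups()
        if field is not None:
            return fields.get(field, "")
        return codes.get(code, match.group(0))

    return _SUBST.sub(repl, fmt)


if __name__ == "__main__":
    print(render_paragraph(
        "<b>%T</b> %(author) <i>%U</i>",
        {"T": "Notes", "U": "file:///tmp/notes.txt"},
        {"author": "jane"},
    ))
```

	    <para>The sketch makes one point of the text concrete: a
	    <literal>%(fieldname)</literal> reference can only produce a
	    value if the field was stored, which is modelled here by its
	    presence in the <literal>fields</literal> dictionary.</para>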
|
|
|
|
<para>The default value for the paragraph format string is:
|
|
<screen><![CDATA[
|
|
"<table class=\"respar\">\n"
|
|
"<tr>\n"
|
|
"<td><a href='%U'><img src='%I' width='64'></a></td>\n"
|
|
"<td>%L <i>%S</i> <b>%T</b><br>\n"
|
|
"<span style='white-space:nowrap'><i>%M</i> %D</span> <i>%U</i> %i<br>\n"
|
|
"%A %K</td>\n"
|
|
"</tr></table>\n"
|
|
]]></screen>
|
|
|
|
You may, for example, try the following for a more web-like
|
|
experience:
|
|
|
|
<screen><![CDATA[
|
|
<u><b><a href="P%N">%T</a></b></u><br>
|
|
%A<font color=#008000>%U - %S</font> - %L
|
|
]]></screen>
|
|
|
|
Note that the P%N link in the above paragraph makes the title a
|
|
preview link. Or the clean looking:
|
|
|
|
<screen><![CDATA[
|
|
<img src="%I" align="left">%L <font color="#900000">%R</font>
|
|
  <b>%T</b><br>%S
|
|
<font color="#808080"><i>%U</i></font>
|
|
<table bgcolor="#e0e0e0">
|
|
<tr><td><div>%A</div></td></tr>
|
|
</table>%K
|
|
]]></screen>
|
|
</para>
|
|
|
|
<para>These samples, and some others are
|
|
<ulink url="http://www.recoll.org/custom.html">on the web
|
|
site, with pictures to show how they look.</ulink></para>
|
|
|
|
<para>It is also possible to
|
|
<link linkend="RCL.SEARCH.GUI.CUSTOM.ABSSEP">
|
|
define the value of the snippet separator inside the abstract
|
|
section</link>.</para>
|
|
</sect4>
|
|
</sect3>
|
|
</sect2>
|
|
|
|
</sect1> <!-- search GUI -->
|
|
|
|
<sect1 id="RCL.SEARCH.KIO">
|
|
<title>Searching with the KDE KIO slave</title>
|
|
|
|
<sect2 id="RCL.SEARCH.KIO.INTRO">
|
|
<title>What's this</title>
|
|
|
|
<para>The &RCL; KIO slave allows performing a &RCL; search
|
|
by entering an appropriate URL in a KDE open dialog, or with an
|
|
HTML-based interface displayed in
|
|
<command>Konqueror</command>.</para>
|
|
|
|
<para>The HTML-based interface is similar to the Qt-based
|
|
interface, but slightly less powerful for now. Its advantage is
|
|
that you can perform your search while staying fully within the
|
|
KDE framework: drag and drop from the result list works normally
|
|
and you have your normal choice of applications for opening
|
|
files.</para>
|
|
|
|
<para>The alternative interface uses a directory view of search
|
|
results. Due to limitations in the current KIO slave interface,
|
|
it is currently not obviously useful (to me).</para>
|
|
|
|
<para>The interface is described in more detail inside a help
|
|
file which you can access by entering
|
|
<filename>recoll:/</filename> inside the
|
|
<command>konqueror</command> URL line (this works only if the
|
|
recoll KIO slave has been previously installed).</para>
|
|
|
|
|
|
<para>The instructions for building this module are located in the
|
|
source tree. See:
|
|
<filename>kde/kio/recoll/00README.txt</filename>. Some Linux
|
|
distributions do package the kio-recoll module, so check before
|
|
diving into the build process, maybe it's already out there ready for
|
|
one-click installation.</para>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="RCL.SEARCH.KIO.SEARCHABLEDOCS">
|
|
<title>Searchable documents</title>
|
|
|
|
<para>As a sample application, the &RCL; KIO slave could allow
|
|
preparing a set of HTML documents (for example a manual) so that
|
|
they become their own search interface inside
|
|
<command>konqueror</command>.</para>
|
|
|
|
<para>This can be done by either explicitly inserting
|
|
<literal><![CDATA[<a href="recoll://...">]]></literal> links
|
|
around some document areas, or automatically by adding a
|
|
very small <application>javascript</application> program to the
|
|
documents, like the following example, which would initiate a search by
|
|
double-clicking any term:</para>
|
|
|
|
<programlisting><script language="JavaScript">
|
|
function recollsearch() {
|
|
var t = document.getSelection();
|
|
window.location.href = 'recoll://search/query?qtp=a&p=0&q=' +
|
|
encodeURIComponent(t);
|
|
}
|
|
</script>
|
|
....
|
|
<body ondblclick="recollsearch()">
|
|
|
|
</programlisting>
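      <para>When the links are generated ahead of time instead of in
      the browser, the same URL can be built by a script. The following
      Python sketch mirrors the JavaScript above; the function name is
      hypothetical, and the <literal>qtp=a</literal> and
      <literal>p=0</literal> parameters are simply copied from the
      example above without further interpretation.</para>

```python
from urllib.parse import quote


def recoll_search_url(terms):
    """Return a recoll:// search URL for the given search terms."""
    # safe="" asks quote() to percent-encode reserved characters such
    # as "/" too, which is close to what encodeURIComponent() does.
    return "recoll://search/query?qtp=a&p=0&q=" + quote(terms, safe="")


if __name__ == "__main__":
    print(recoll_search_url("eugenie grandet"))
```

      <para>The resulting string can be used as the
      <literal>href</literal> value of a link inserted into the
      document.</para>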
|
|
</sect2>
|
|
</sect1>
|
|
|
|
|
|
<sect1 id="RCL.SEARCH.COMMANDLINE">
|
|
<title>Searching on the command line</title>
|
|
|
|
<para>There are several ways to obtain search results as a text
|
|
stream, without a graphical interface:</para>
|
|
<itemizedlist>
|
|
<listitem><para>By passing option <option>-t</option> to the
|
|
<command>recoll</command> program, or by calling it as
|
|
<command>recollq</command> (through a link).</para>
|
|
</listitem>
|
|
<listitem><para>By using the <command>recollq</command> program.</para>
|
|
</listitem>
|
|
<listitem><para>By writing a custom
|
|
<application>Python</application> program, using the
|
|
<link linkend="RCL.PROGRAM.PYTHONAPI">Recoll Python API</link>.</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
|
|
<para>The first two methods work in the same way and accept/need the same
|
|
arguments (except for the additional <option>-t</option> to
|
|
<command>recoll</command>). The query to be executed is specified
|
|
as command line arguments.</para>
|
|
|
|
<para><command>recollq</command> is not built by default. You can
|
|
use the <filename>Makefile</filename> in the
|
|
<filename>query</filename> directory to build it. This is a very
|
|
    simple program, and if you can program a little C++, you may find it
|
|
    useful to tailor its output format to your needs. Note that recollq is
|
|
only really useful on systems where the Qt libraries (or even the X11
|
|
ones) are not available. Otherwise, just use <literal>recoll
|
|
-t</literal>, which takes the exact same parameters and options which
|
|
    are described for <command>recollq</command>.</para>
|
|
|
|
<para><command>recollq</command> has a man page (not installed by
|
|
default, look in the <filename>doc/man</filename> directory). The
|
|
Usage string is as follows:</para>
|
|
<programlisting>
|
|
recollq: usage:
|
|
-P: Show the date span for all the documents present in the index
|
|
[-o|-a|-f] [-q] <query string>
|
|
Runs a recoll query and displays result lines.
|
|
Default: will interpret the argument(s) as a xesam query string
|
|
query may be like:
|
|
implicit AND, Exclusion, field spec: t1 -t2 title:t3
|
|
OR has priority: t1 OR t2 t3 OR t4 means (t1 OR t2) AND (t3 OR t4)
|
|
Phrase: "t1 t2" (needs additional quoting on cmd line)
|
|
-o Emulate the GUI simple search in ANY TERM mode
|
|
-a Emulate the GUI simple search in ALL TERMS mode
|
|
-f Emulate the GUI simple search in filename mode
|
|
-q is just ignored (compatibility with the recoll GUI command line)
|
|
Common options:
|
|
-c <configdir> : specify config directory, overriding $RECOLL_CONFDIR
|
|
-d also dump file contents
|
|
-n [first-]<cnt> define the result slice. The default value for [first]
|
|
is 0. Without the option, the default max count is 2000.
|
|
Use n=0 for no limit
|
|
-b : basic. Just output urls, no mime types or titles
|
|
-Q : no result lines, just the processed query and result count
|
|
-m : dump the whole document meta[] array for each result
|
|
-A : output the document abstracts
|
|
-S fld : sort by field <fld>
|
|
-s stemlang : set stemming language to use (must exist in index...)
|
|
Use -s "" to turn off stem expansion
|
|
-D : sort descending
|
|
-i <dbdir> : additional index, several can be given
|
|
-e use url encoding (%xx) for urls
|
|
-F <field name list> : output exactly these fields for each result.
|
|
The field values are encoded in base64, output in one line and
|
|
separated by one space character. This is the recommended format
|
|
for use by other programs. Use a normal query with option -m to
|
|
see the field names.
|
|
</programlisting>
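    <para>A script will usually assemble the
    <command>recollq</command> command line from a few parameters. The
    following Python sketch only uses options listed in the usage
    string above; the function and its parameter names are
    hypothetical. It builds the argument list and does not run
    anything.</para>

```python
def recollq_argv(query, first=0, count=None, urls_only=False,
                 sort_field=None, descending=False, confdir=None):
    """Build an argument list for recollq from the usage string options."""
    argv = ["recollq"]
    if confdir:
        argv += ["-c", confdir]          # -c <configdir>
    if count is not None:
        argv += ["-n", "%d-%d" % (first, count)]   # -n [first-]<cnt>
    if urls_only:
        argv.append("-b")                # just output urls
    if sort_field:
        argv += ["-S", sort_field]       # sort by field
        if descending:
            argv.append("-D")            # sort descending
    # The whole query is passed as a single argument.
    argv.append(query)
    return argv


if __name__ == "__main__":
    print(recollq_argv("ilur -nautique mime:text/html", count=10,
                       urls_only=True))
```

    <para>Passing such a list directly to a function like
    <literal>subprocess.run()</literal> avoids the extra shell quoting
    which phrases would otherwise need on the command line.</para>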
|
|
|
|
<para>Sample execution:</para>
|
|
<programlisting>recollq 'ilur -nautique mime:text/html'
|
|
Recoll query: ((((ilur:(wqf=11) OR ilurs) AND_NOT (nautique:(wqf=11)
|
|
OR nautiques OR nautiqu OR nautiquement)) FILTER Ttext/html))
|
|
4 results
|
|
text/html [file:///Users/uncrypted-dockes/projets/bateaux/ilur/comptes.html] [comptes.html] 18593 bytes
|
|
text/html [file:///Users/uncrypted-dockes/projets/nautique/webnautique/articles/ilur1/index.html] [Constructio...
|
|
text/html [file:///Users/uncrypted-dockes/projets/pagepers/index.html] [psxtcl/writemime/recoll]...
|
|
text/html [file:///Users/uncrypted-dockes/projets/bateaux/ilur/factEtCie/recu-chasse-maree....
|
|
</programlisting>
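    <para>The usage string recommends option <option>-F</option> for
    use by other programs: each result is then output as one line of
    base64-encoded values separated by single spaces. A minimal Python
    sketch for decoding such a line could look as follows. The function
    name and the field names used in the example are assumptions, and
    the caller is expected to pass result lines only.</para>

```python
import base64


def decode_fields(line, names):
    """Decode one -F result line into a dict keyed by field name.

    names must list the fields in the order they were requested.
    An empty value is an empty token and decodes to an empty string.
    """
    tokens = line.rstrip("\n").split(" ")
    values = [base64.b64decode(tok).decode("utf-8", "replace")
              for tok in tokens]
    return dict(zip(names, values))


if __name__ == "__main__":
    # Build a sample line the same way the values are described above.
    sample = " ".join(
        base64.b64encode(v.encode("utf-8")).decode("ascii")
        for v in ["file:///tmp/a.html", "A title"])
    print(decode_fields(sample, ["url", "title"]))
```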
|
|
</sect1>
|
|
|
|
<sect1 id="RCL.SEARCH.SYNONYMS">
|
|
<title>Using Synonyms (1.22)</title>
|
|
|
|
<formalpara><title>Term synonyms:</title>
|
|
<para>there are a number of ways to use term synonyms for searching text:
|
|
<itemizedlist>
|
|
<listitem><para>At index creation time, they can be used to alter the
|
|
indexed terms, either increasing or decreasing their number, by
|
|
expanding the original terms to all synonyms, or by
|
|
reducing all synonym terms to a canonical one.</para></listitem>
|
|
<listitem><para>At query time, they can be used to match texts
|
|
containing terms which are synonyms of the ones specified by the user,
|
|
either by expanding the query for all synonyms, or by reducing the user
|
|
entry to canonical terms (the latter only works if the corresponding
|
|
processing has been performed while creating the index).</para></listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
</formalpara>
|
|
|
|
<para>&RCL; only uses synonyms at query time. A user query term which
|
|
    is part of a synonym group will optionally be expanded into an
|
|
<literal>OR</literal> query for all terms in the group.</para>
|
|
|
|
<para>Synonym groups are defined inside ordinary text files. Each line
|
|
in the file defines a group.</para>
|
|
|
|
<para>Example:
|
|
<programlisting>
|
|
hi hello "good morning"
|
|
|
|
# not sure about "au revoir" though. Is this english ?
|
|
bye goodbye "see you" \
|
|
"au revoir"
|
|
</programlisting>
|
|
</para>
|
|
|
|
<para>As usual, lines beginning with a <literal>#</literal> are comments,
|
|
empty lines are ignored, and lines can be continued by ending them with
|
|
a backslash.
|
|
</para>
|
|
|
|
<para>Multi-word synonyms are supported, but be aware that these will
|
|
generate phrase queries, which may degrade performance and will disable
|
|
stemming expansion for the phrase terms.</para>
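    <para>The file format and the query-time expansion described above
    can be modelled with a short Python sketch. This is an illustration
    only, not the parser used by &RCL;: it assumes shell-like quoting
    for multi-word synonyms (handled with <literal>shlex</literal>),
    and both function names are hypothetical.</para>

```python
import shlex


def parse_synonyms(text):
    """Return the synonym groups defined by the text, one list per group."""
    groups = []
    pending = ""
    for raw in text.splitlines():
        line = raw.strip()
        # Comments and empty lines are skipped between groups.
        if not pending and (not line or line.startswith("#")):
            continue
        # A trailing backslash continues the group on the next line.
        if line.endswith("\\"):
            pending += line[:-1] + " "
            continue
        groups.append(shlex.split(pending + line))
        pending = ""
    if pending.strip():
        groups.append(shlex.split(pending))
    return groups


def expand_term(term, groups):
    """Turn a term into an OR query over its synonym group, if any."""
    for group in groups:
        if term in group:
            # Multi-word synonyms become phrases, hence the quotes.
            return " OR ".join('"%s"' % t if " " in t else t
                               for t in group)
    return term


if __name__ == "__main__":
    sample = 'hi hello "good morning"\n\nbye goodbye "see you" \\\n"au revoir"\n'
    print(expand_term("hello", parse_synonyms(sample)))
```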
|
|
|
|
<para>The synonyms file can be specified in the <guilabel>Search
|
|
parameters</guilabel> tab of the <guilabel>GUI configuration</guilabel>
|
|
<guilabel>Preferences</guilabel> menu entry, or as an option for
|
|
command-line searches.</para>
|
|
|
|
<para>Once the file is defined, the use of synonyms can be enabled or
|
|
disabled directly from the <guilabel>Preferences</guilabel>
|
|
menu.</para>
|
|
|
|
<para>The synonyms are searched for matches with user terms after the
|
|
latter are stem-expanded, but the contents of the synonyms file itself
|
|
    are not subjected to stem expansion. This means that a match will not be
|
|
found if the form present in the synonyms file is not present anywhere
|
|
in the document set.</para>
|
|
|
|
<para>The synonyms function is probably not going to help you find your
|
|
letters to Mr. Smith. It is best used for domain-specific searches. For
|
|
example, it was initially suggested by a user performing searches among
|
|
    historical documents: the synonyms file would contain nicknames and
|
|
aliases for each of the persons of interest.</para>
|
|
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="RCL.SEARCH.PTRANS">
|
|
<title>Path translations</title>
|
|
|
|
<para>In some cases, the document paths stored inside the index do
|
|
not match the actual ones, so that document
|
|
previews and accesses will fail. This can occur in a number of
|
|
circumstances:</para>
|
|
<itemizedlist>
|
|
<listitem><para>When using multiple indexes it is a relatively common
|
|
occurrence that some will actually reside on a remote volume, for
|
|
	  example mounted via NFS. In this case, the paths used to access
|
|
the documents on the local machine are not necessarily the same
|
|
	  as the ones used while indexing on the remote machine. For
|
|
example, <filename>/home/me</filename> may have been used as
|
|
	  a <literal>topdirs</literal> element while indexing, but the
|
|
directory might be mounted
|
|
as <filename>/net/server/home/me</filename> on the local
|
|
machine.</para></listitem>
|
|
|
|
<listitem><para>The case may also occur with removable
|
|
disks. It is perfectly possible to configure an index to
|
|
live with the documents on the removable disk, but it may
|
|
happen that the disk is not mounted at the same place so
|
|
	  that the document paths from the index are
|
|
invalid.</para></listitem>
|
|
|
|
      <listitem><para>As a last example, one could imagine that a big
|
|
directory has been moved, but that it is currently
|
|
inconvenient to run the indexer.</para></listitem>
|
|
</itemizedlist>
|
|
|
|
<para>&RCL; has a facility for rewriting access paths when
|
|
extracting the data from the index. The translations can be
|
|
defined for the main index and for any additional query
|
|
index.</para>
|
|
|
|
<para>The path translation facility will be useful
|
|
    whenever the document paths seen by the indexer are not the same
|
|
as the ones which should be used at query time.</para>
|
|
|
|
<para>In the above NFS example, &RCL; could be instructed to
|
|
rewrite any <filename>file:///home/me</filename> URL from the
|
|
index to <filename>file:///net/server/home/me</filename>,
|
|
allowing accesses from the client.</para>
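    <para>The effect of such a translation can be sketched in a few
    lines of Python. This is only a model of the rewriting described
    above: the function name is hypothetical, and the Python dictionary
    stands in for the translation table, it does not reproduce the
    <filename>ptrans</filename> file syntax.</para>

```python
def translate_url(url, translations):
    """Rewrite the path prefix of a file:// URL.

    translations maps an indexing-time path prefix to the prefix which
    should be used at query time.
    """
    scheme = "file://"
    if not url.startswith(scheme):
        return url
    path = url[len(scheme):]
    # Try the longest prefixes first so that the most specific wins.
    for old in sorted(translations, key=len, reverse=True):
        base = old.rstrip("/")
        # Only match on whole path components: /home/me must not
        # match /home/meow.
        if path == base or path.startswith(base + "/"):
            return scheme + translations[old].rstrip("/") + path[len(base):]
    return url


if __name__ == "__main__":
    table = {"/home/me": "/net/server/home/me"}
    print(translate_url("file:///home/me/docs/a.txt", table))
```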
|
|
|
|
<para>The translations are defined in the
|
|
<link linkend="RCL.INSTALL.CONFIG.PTRANS">
|
|
<filename>ptrans</filename></link> configuration file, which
|
|
can be edited by hand or from the GUI external indexes
|
|
configuration dialog: <menuchoice>
|
|
<guimenu>Preferences</guimenu>
|
|
<guimenuitem>External index dialog</guimenuitem>
|
|
</menuchoice>, then click the <guilabel>Paths
|
|
translations</guilabel> button on the right below the index
|
|
list.</para>
|
|
|
|
<note><para>Due to a current bug, the GUI must be restarted
|
|
after changing the <filename>ptrans</filename> values (even when they
|
|
were changed from the GUI).</para></note>
|
|
</sect1>
|
|
|
|
|
|
|
|
<sect1 id="RCL.SEARCH.LANG">
|
|
<title>The query language</title>
|
|
|
|
<para>The query language processor is activated in the GUI
|
|
simple search entry when the search mode selector is set to
|
|
<guilabel>Query Language</guilabel>. It can also be used with the KIO
|
|
slave or the command line search. It broadly has the same
|
|
capabilities as the complex search interface in the
|
|
GUI.</para>
|
|
|
|
<para>The language was based on the now defunct
|
|
<ulink url="http://www.xesam.org/main/XesamUserSearchLanguage95">
|
|
Xesam</ulink> user search language specification.</para>
|
|
|
|
<para>If the results of a query language search puzzle you and you
|
|
doubt what has been actually searched for, you can use the GUI
|
|
<literal>Show Query</literal> link at the top of the result list to
|
|
check the exact query which was finally executed by Xapian.</para>
|
|
|
|
<para>Here follows a sample request that we are going to
|
|
explain:</para>
|
|
|
|
<programlisting>
|
|
author:"john doe" Beatles OR Lennon Live OR Unplugged -potatoes
|
|
</programlisting>
|
|
|
|
<para>This would search for all documents with
|
|
<replaceable>John Doe</replaceable>
|
|
appearing as a phrase in the author field (exactly what this is
|
|
      would depend on the document type, e.g. the
|
|
<literal>From:</literal> header, for an email message),
|
|
and containing either <replaceable>beatles</replaceable> or
|
|
<replaceable>lennon</replaceable> and either
|
|
<replaceable>live</replaceable> or
|
|
<replaceable>unplugged</replaceable> but not
|
|
<replaceable>potatoes</replaceable> (in any part of the document).</para>
|
|
|
|
<para>An element is composed of an optional field specification,
|
|
and a value, separated by a colon (the field separator is the last
|
|
colon in the element). Examples:
|
|
<replaceable>Eugenie</replaceable>,
|
|
<replaceable>author:balzac</replaceable>,
|
|
      <replaceable>dc:title:grandet</replaceable>,
|
|
      <replaceable>dc:title:"eugenie grandet"</replaceable>.
|
|
</para>
|
|
|
|
<para>The colon, if present, means "contains". Xesam defines other
|
|
relations, which are mostly unsupported for now (except in special
|
|
cases, described further down).</para>
|
|
|
|
<para>All elements in the search entry are normally combined
|
|
with an implicit AND. It is possible to specify that elements be
|
|
OR'ed instead, as in <replaceable>Beatles</replaceable>
|
|
<literal>OR</literal> <replaceable>Lennon</replaceable>. The
|
|
<literal>OR</literal> must be entered literally (capitals), and
|
|
it has priority over the AND associations:
|
|
<replaceable>word1</replaceable>
|
|
<replaceable>word2</replaceable> <literal>OR</literal>
|
|
<replaceable>word3</replaceable>
|
|
means
|
|
<replaceable>word1</replaceable> AND
|
|
(<replaceable>word2</replaceable> <literal>OR</literal>
|
|
<replaceable>word3</replaceable>)
|
|
not
|
|
(<replaceable>word1</replaceable> AND
|
|
<replaceable>word2</replaceable>) <literal>OR</literal>
|
|
<replaceable>word3</replaceable>. </para>
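    <para>The priority rule can be checked with a small model. The
    Python sketch below (a hypothetical helper, not the &RCL; parser)
    splits a query into groups: terms inside a group are OR'ed and the
    groups are AND'ed together. It ignores quotes, field
    specifications, negation and parentheses.</para>

```python
def group_terms(query):
    """Return a list of OR groups; the groups are implicitly AND'ed."""
    groups = []
    join_next = False
    for tok in query.split():
        if tok == "OR":
            # The next term joins the group of the previous one.
            join_next = True
        elif join_next and groups:
            groups[-1].append(tok)
            join_next = False
        else:
            groups.append([tok])
            join_next = False
    return groups


if __name__ == "__main__":
    # word1 AND (word2 OR word3)
    print(group_terms("word1 word2 OR word3"))
```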
|
|
|
|
    <para>&RCL; versions 1.21 and later allow using parentheses to
|
|
group elements, which will sometimes make things clearer, and may
|
|
allow expressing combinations which would have been difficult
|
|
otherwise.</para>
|
|
|
|
<para>An element preceded by a <literal>-</literal> specifies a
|
|
term that should <emphasis>not</emphasis> appear.</para>
|
|
|
|
<para>As usual, words inside quotes define a phrase
|
|
(the order of words is significant), so that
|
|
<replaceable>title:"prejudice pride"</replaceable> is not the same as
|
|
<replaceable>title:prejudice title:pride</replaceable>, and is
|
|
unlikely to find a result.</para>
|
|
|
|
<para>Words inside phrases and capitalized words are not
|
|
stem-expanded. Wildcards may be used anywhere inside a term.
|
|
Specifying a wild-card on the left of a term can produce a very
|
|
slow search (or even an incorrect one if the expansion is
|
|
truncated because of excessive size). Also see
|
|
<link linkend="RCL.SEARCH.WILDCARDS">
|
|
More about wildcards</link>.</para>
|
|
|
|
<para>To save you some typing, recent &RCL; versions (1.20 and later)
|
|
interpret a comma-separated list of terms as an AND list inside the
|
|
field. Use slash characters ('/') for an OR list. No white space
|
|
is allowed. So
|
|
<programlisting>author:john,lennon</programlisting> will search for
|
|
documents with <literal>john</literal> and <literal>lennon</literal>
|
|
inside the <literal>author</literal> field (in any order), and
|
|
<programlisting>author:john/ringo</programlisting> would search for
|
|
<literal>john</literal> or <literal>ringo</literal>.</para>
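    <para>One way to read these shortcuts is to expand them into the
    longer form which they replace. The following Python sketch does
    this for a single fielded element. It is an illustration with a
    hypothetical function name, and it leaves alone the special
    <literal>dir</literal> and <literal>date</literal> criteria, whose
    values use slashes for other purposes.</para>

```python
def expand_list(element):
    """Expand field:a,b and field:a/b into explicit AND / OR forms."""
    # The field separator is the last colon in the element.
    field, sep, value = element.rpartition(":")
    if not sep or field in ("dir", "date"):
        return element
    prefix = field + sep
    if "," in value:
        # Elements separated by white space are implicitly AND'ed.
        return " ".join(prefix + term for term in value.split(","))
    if "/" in value:
        return " OR ".join(prefix + term for term in value.split("/"))
    return element


if __name__ == "__main__":
    print(expand_list("author:john,lennon"))
    print(expand_list("author:john/ringo"))
```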
|
|
|
|
<para>Modifiers can be set on a double-quote value, for example to specify
|
|
a proximity search (unordered). See
|
|
<link linkend="RCL.SEARCH.LANG.MODIFIERS">the modifier
|
|
section</link>. No space must separate the final
|
|
    double-quote and the modifier value, e.g. <replaceable>"two
|
|
    one"po10</replaceable>.</para>
|
|
|
|
<para>&RCL; currently manages the following default fields:</para>
|
|
|
|
<itemizedlist>
|
|
|
|
<listitem><para><literal>title</literal>,
|
|
<literal>subject</literal> or <literal>caption</literal> are
|
|
synonyms which specify data to be searched for in the
|
|
document title or subject.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><literal>author</literal> or
|
|
	<literal>from</literal> for searching the document
|
|
originators.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><literal>recipient</literal> or
|
|
	<literal>to</literal> for searching the document
|
|
recipients.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><literal>keyword</literal> for searching the
|
|
document-specified keywords (few documents actually have
|
|
any).</para>
|
|
</listitem>
|
|
|
|
<listitem><para><literal>filename</literal> for the document's
|
|
file name. This is not necessarily set for all documents:
|
|
internal documents contained inside a compound one (for example
|
|
      an EPUB section) do not inherit the container file name any more;
|
|
this was replaced by an explicit field (see next). Sub-documents
|
|
can still have a specific <literal>filename</literal>, if it is
|
|
implied by the document format, for example the attachment file
|
|
name for an email attachment.</para></listitem>
|
|
|
|
<listitem><para><literal>containerfilename</literal>. This is
|
|
set for all documents, both top-level and contained
|
|
sub-documents, and is always the name of the filesystem directory
|
|
entry which contains the data. The terms from this field can
|
|
only be matched by an explicit field specification (as opposed
|
|
to terms from <literal>filename</literal> which are also indexed
|
|
as general document content). This avoids getting matches for
|
|
all the sub-documents when searching for the container file
|
|
name.</para></listitem>
|
|
|
|
<listitem><para><literal>ext</literal> specifies the file
|
|
	name extension (Ex: <literal>ext:html</literal>).</para>
|
|
</listitem>
|
|
|
|
</itemizedlist>
|
|
|
|
<para>&RCL; 1.20 and later have a way to specify aliases for the
|
|
field names, which will save typing, for example by aliasing
|
|
<literal>filename</literal> to <replaceable>fn</replaceable> or
|
|
<literal>containerfilename</literal> to
|
|
<replaceable>cfn</replaceable>. See the <link
|
|
linkend="RCL.INSTALL.CONFIG.FIELDS">section about the
|
|
    <filename>fields</filename> file</link>.</para>
|
|
|
|
<para>The document input handlers used while indexing have the
|
|
possibility to create other fields with arbitrary names, and
|
|
aliases may be defined in the configuration, so that the exact
|
|
field search possibilities may be different for you if someone
|
|
took care of the customisation.</para>
|
|
|
|
<para>The field syntax also supports a few field-like, but
|
|
special, criteria:</para>
|
|
|
|
<itemizedlist>
|
|
|
|
<listitem><para><literal>dir</literal> for filtering the
|
|
results on file location
|
|
(Ex: <literal>dir:/home/me/somedir</literal>).
|
|
<literal>-dir</literal>
|
|
also works to find results not in the specified directory
|
|
(release >= 1.15.8). Tilde expansion will be performed as
|
|
usual (except for a bug in versions 1.19 to
|
|
1.19.11p1). Wildcards will be expanded, but
|
|
please <link linkend="RCL.SEARCH.WILDCARDS.PATH"> have a
|
|
look</link> at an important limitation of wildcards in
|
|
path filters.</para>
|
|
|
|
<para>Relative paths also make sense, for example,
|
|
<literal>dir:share/doc</literal> would match either
|
|
<filename>/usr/share/doc</filename> or
|
|
	  <filename>/usr/local/share/doc</filename>.</para>
|
|
|
|
<para>Several <literal>dir</literal> clauses can be specified,
|
|
both positive and negative. For example the following makes sense:
|
|
<programlisting>
|
|
dir:recoll dir:src -dir:utils -dir:common
|
|
</programlisting> This would select results which have both
|
|
<filename>recoll</filename> and <filename>src</filename> in the
|
|
	  path (in any order), and which have neither
|
|
	  <filename>utils</filename> nor
|
|
<filename>common</filename>.</para>
|
|
|
|
<para>You can also use <literal>OR</literal> conjunctions
|
|
with <literal>dir:</literal> clauses.</para>
|
|
|
|
<para>A special aspect of <literal>dir</literal> clauses is
|
|
that the values in the index are not transcoded to UTF-8, and
|
|
never lower-cased or unaccented, but stored as binary. This means
|
|
that you need to enter the values in the exact lower or upper
|
|
case, and that searches for names with diacritics may sometimes
|
|
be impossible because of character set conversion
|
|
issues. Non-ASCII UNIX file paths are an unending source of
|
|
trouble and are best avoided.</para>
|
|
|
|
<para>You need to use double-quotes around the path value if it
|
|
contains space characters.</para>
|
|
|
|
</listitem>
|
|
|
|
<listitem><para><literal>size</literal> for filtering the
|
|
results on file size. Example:
|
|
<literal>size<10000</literal>. You can use
|
|
<literal><</literal>, <literal>></literal> or
|
|
<literal>=</literal> as operators. You can specify a range like the
|
|
following: <literal>size>100 size<1000</literal>. The usual
|
|
<literal>k/K, m/M, g/G, t/T</literal> can be used as (decimal)
|
|
multipliers. Ex: <literal>size>1k</literal> to search for files
|
|
bigger than 1000 bytes.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><literal>date</literal> for searching or filtering
|
|
on dates. The syntax for the argument is based on the ISO8601
|
|
standard for dates and time intervals. Only dates are supported, no
|
|
times. The general syntax is 2 elements separated by a
|
|
<literal>/</literal> character. Each element can be a date or a
|
|
period of time. Periods are specified as
|
|
<literal>P</literal><replaceable>n</replaceable><literal>Y</literal><replaceable>n</replaceable><literal>M</literal><replaceable>n</replaceable><literal>D</literal>.
|
|
The <replaceable>n</replaceable> numbers are the respective numbers
|
|
of years, months or days, any of which may be missing. Dates are
|
|
specified as
|
|
<replaceable>YYYY</replaceable>-<replaceable>MM</replaceable>-<replaceable>DD</replaceable>.
|
|
The days and months parts may be missing. If the
|
|
<literal>/</literal> is present but an element is missing, the
|
|
missing element is interpreted as the lowest or highest date in the
|
|
index. Examples:</para>
|
|
|
|
<itemizedlist>
|
|
<listitem><para><literal>2001-03-01/2002-05-01</literal> the
|
|
basic syntax for an interval of dates.</para>
|
|
</listitem>
|
|
<listitem><para><literal>2001-03-01/P1Y2M</literal> the
|
|
same specified with a period.</para>
|
|
</listitem>
|
|
<listitem><para><literal>2001/</literal> from the beginning of
|
|
2001 to the latest date in the index.</para>
|
|
</listitem>
|
|
<listitem><para><literal>2001</literal> the whole year of
|
|
2001</para></listitem>
|
|
<listitem><para><literal>P2D/</literal> means 2 days ago up to
|
|
now if there are no documents with dates in the future.</para>
|
|
</listitem>
|
|
<listitem><para><literal>/2003</literal> all documents from
|
|
2003 or older.</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
<para>Periods can also be specified with lower-case letters (e.g.
p2y).</para>
|
|
</listitem>
|
|
|
|
<listitem><para><literal>mime</literal> or
|
|
<literal>format</literal> for specifying the
|
|
MIME type. These clauses are processed outside the normal
Boolean logic of the search. Multiple values will be OR'ed
(instead of the normal AND). You can specify types to be
excluded, with the usual <literal>-</literal>, and use
wildcards. Example: <replaceable>mime:text/*
-mime:text/plain</replaceable>.
Specifying an explicit Boolean
|
|
operator before a <literal>mime</literal> specification is not
|
|
supported and will produce strange results. </para>
|
|
</listitem>
|
|
|
|
<listitem><para><literal>type</literal> or
|
|
<literal>rclcat</literal> for specifying the category (as in
|
|
text/media/presentation/etc.). The classification of MIME
|
|
types in categories is defined in the &RCL; configuration
|
|
(<filename>mimeconf</filename>), and can be modified or
|
|
extended. The default category names are those which permit
|
|
filtering results in the main GUI screen. Categories are OR'ed
|
|
like MIME types above, and can be negated with
|
|
<literal>-</literal>.</para>
|
|
</listitem>
|
|
|
|
</itemizedlist>
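<para>As an illustration of the decimal multipliers accepted by the
<literal>size</literal> clause, the following Python sketch converts a
value such as <literal>1k</literal> into a number of bytes. The
<literal>size_to_bytes</literal> name is hypothetical and is not part of
&RCL;; the code only mirrors the rule stated in the list above.</para>

```python
# Illustrative only: decimal multipliers as described for the size clause.
MULTIPLIERS = {"k": 10**3, "m": 10**6, "g": 10**9, "t": 10**12}


def size_to_bytes(value):
    """Convert a size value like '1k' or '10000' to a number of bytes."""
    value = value.strip()
    suffix = value[-1:].lower()
    if suffix in MULTIPLIERS:
        return int(value[:-1]) * MULTIPLIERS[suffix]
    return int(value)


if __name__ == "__main__":
    for example in ("10000", "1k", "2M"):
        print(example, "=", size_to_bytes(example), "bytes")
```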
|
|
|
|
<note><para>
|
|
<literal>mime</literal>, <literal>rclcat</literal>,
|
|
<literal>size</literal> and <literal>date</literal> criteria
|
|
always affect the whole query (they are applied as a final
filter), even if set with other terms inside parentheses.</para>
|
|
</note>
|
|
|
|
<note><para>
|
|
<literal>mime</literal> (or the equivalent
|
|
<literal>rclcat</literal>) is the <emphasis>only</emphasis>
|
|
field with an <literal>OR</literal> default. You do need to use
|
|
<literal>OR</literal> with <literal>ext</literal> terms for
|
|
example.</para> </note>
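<para>The way partial dates and open-ended intervals are interpreted by
the <literal>date</literal> clause can be modelled with a short Python
sketch. This is a simplified model, not the &RCL; parser: the function
names are hypothetical, periods (<literal>P1Y2M</literal>) are not
handled, and an open bound is returned as <literal>None</literal>
instead of the lowest or highest date in the index.</para>

```python
import calendar
import datetime


def date_element_range(element):
    """Return (first_day, last_day) for 'YYYY', 'YYYY-MM' or 'YYYY-MM-DD'."""
    parts = [int(p) for p in element.split("-")]
    year = parts[0]
    if len(parts) == 1:
        return datetime.date(year, 1, 1), datetime.date(year, 12, 31)
    month = parts[1]
    if len(parts) == 2:
        last = calendar.monthrange(year, month)[1]
        return datetime.date(year, month, 1), datetime.date(year, month, last)
    day = datetime.date(year, month, parts[2])
    return day, day


def interval(spec):
    """Split 'start/end' into date bounds; None stands for an open bound."""
    if "/" not in spec:
        # A lone element covers its whole year, month or day.
        return date_element_range(spec)
    start, end = spec.split("/", 1)
    low = date_element_range(start)[0] if start else None
    high = date_element_range(end)[1] if end else None
    return low, high


if __name__ == "__main__":
    for spec in ("2001-03-01/2002-05-01", "2001/", "2001", "/2003"):
        print(spec, "->", interval(spec))
```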
|
|
|
|
<sect2 id="RCL.SEARCH.LANG.MODIFIERS">
|
|
<title>Modifiers</title>
|
|
|
|
<para>Some characters are recognized as search modifiers when found
|
|
immediately after the closing double quote of a phrase, as in
|
|
<literal>"some term"modifierchars</literal>. The actual "phrase"
|
|
can be a single term of course. Supported modifiers:
|
|
|
|
<itemizedlist>
|
|
<listitem><para><literal>l</literal> can be used to turn off
|
|
stemming (mostly makes sense with <literal>p</literal> because
|
|
stemming is off by default for phrases).</para>
|
|
</listitem>
|
|
|
|
<listitem><para><literal>s</literal> can be used to turn off
|
|
synonym expansion, if a synonyms file is in place (only for
|
|
&RCL; 1.22 and later).</para>
|
|
</listitem>
|
|
|
|
<listitem><para><literal>o</literal> can be used to specify a
|
|
"slack" for phrase and proximity searches: the number of
|
|
additional terms that may be found between the specified
|
|
ones. If <literal>o</literal> is followed by an integer number,
|
|
this is the slack, else the default is 10.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><literal>p</literal> can be used to turn the
|
|
default phrase search into a proximity one
|
|
(unordered). Example: <literal>"order any in"p</literal></para>
|
|
</listitem>
|
|
|
|
<listitem><para><literal>C</literal> will turn on case
|
|
sensitivity (if the index supports it).</para></listitem>
|
|
|
|
<listitem><para><literal>D</literal> will turn on diacritics
|
|
sensitivity (if the index supports it).</para></listitem>
|
|
|
|
<listitem><para>A weight can be specified for a query element
|
|
by specifying a decimal value at the start of the
|
|
modifiers. Example: <literal>"Important"2.5</literal>.</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
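<para>To make the modifier syntax more concrete, here is a small Python
sketch which decodes the characters found after the closing quote. It is
only a reading of the rules listed above: the
<literal>parse_modifiers</literal> function and the option names are
hypothetical, and the real &RCL; parser may accept or reject inputs
differently.</para>

```python
import re

# Optional decimal weight first, then any mix of l, s, p, C, D and o[N].
MODIFIERS_RE = re.compile(r"^(\d+(?:\.\d+)?)?((?:[lspCD]|o\d*)*)$")

FLAG_NAMES = {"l": "nostemming", "s": "nosynonyms", "p": "proximity",
              "C": "casesens", "D": "diacsens"}


def parse_modifiers(chars):
    """Parse the characters following a quoted phrase into a dict of options."""
    match = MODIFIERS_RE.match(chars)
    if match is None:
        raise ValueError("unsupported modifier string: %r" % chars)
    weight, flags = match.group(1), match.group(2)
    options = {name: False for name in FLAG_NAMES.values()}
    options["weight"] = float(weight) if weight else None
    options["slack"] = None
    for flag in re.findall(r"o\d*|[lspCD]", flags):
        if flag.startswith("o"):
            # 'o' alone selects the default slack of 10.
            options["slack"] = int(flag[1:]) if flag[1:] else 10
        else:
            options[FLAG_NAMES[flag]] = True
    return options


if __name__ == "__main__":
    print(parse_modifiers("2.5"))
    print(parse_modifiers("o3p"))
```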
|
|
|
|
|
|
</sect2> <!-- search modifiers -->
|
|
|
|
</sect1> <!-- rcl.search.lang -->
|
|
|
|
|
|
<sect1 id="RCL.SEARCH.CASEDIAC">
|
|
<title>Search case and diacritics sensitivity</title>
|
|
|
|
<para>For &RCL; versions 1.18 and later, and <emphasis>when working
|
|
with a raw index</emphasis> (not the default), searches can be
|
|
sensitive to character case and diacritics. How this happens
|
|
is controlled by configuration variables and what search data is
|
|
entered.</para>
|
|
|
|
<para>The general default is that searches entered without upper-case
|
|
or accented characters are insensitive to case and diacritics. An
|
|
entry of <literal>resume</literal> will match any of
|
|
<literal>Resume</literal>, <literal>RESUME</literal>,
|
|
<literal>résumé</literal>, <literal>Résumé</literal> etc.</para>
|
|
|
|
<para>Two configuration variables can automate switching on
|
|
sensitivity (they were documented but actually did nothing until
|
|
&RCL; 1.22):</para>
|
|
|
|
<variablelist>
|
|
|
|
<varlistentry>
|
|
<term>autodiacsens</term><listitem><para>If this is set, search
|
|
sensitivity to diacritics will be turned on as soon as an
|
|
accented character exists in a search term. When the variable
|
|
is set to true, <literal>resume</literal> will start a
diacritics-insensitive search, but <literal>résumé</literal>
|
|
will be matched exactly. The default value is
|
|
<emphasis>false</emphasis>.</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>autocasesens</term><listitem><para>If this is set, search
|
|
sensitivity to character case will be turned on as soon as an
|
|
upper-case character exists in a search term <emphasis>except
|
|
for the first one</emphasis>. When the variable is set to
|
|
true, <literal>us</literal> or <literal>Us</literal> will
start a case-insensitive search, but
|
|
<literal>US</literal> will be matched exactly. The default
|
|
value is <emphasis>true</emphasis> (contrary to
|
|
<literal>autodiacsens</literal>).</para></listitem>
|
|
</varlistentry>
|
|
|
|
</variablelist>
|
|
|
|
<para>As in the past, capitalizing the first letter of a word will
|
|
turn off its stem expansion and have no effect on
|
|
case-sensitivity.</para>
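<para>The rules for automatic sensitivity can be summarized by the
following Python sketch. It is a model of the behaviour described above,
not code taken from &RCL;: the function names are hypothetical, and the
keyword arguments only reuse the names and default values of the two
configuration variables.</para>

```python
import unicodedata


def has_diacritics(term):
    """True if any character decomposes into a base plus combining marks."""
    decomposed = unicodedata.normalize("NFD", term)
    return any(unicodedata.combining(ch) for ch in decomposed)


def auto_sensitivity(term, autocasesens=True, autodiacsens=False):
    """Return (case_sensitive, diacritics_sensitive) for one search term."""
    # An upper-case first letter only affects stemming, so it is skipped.
    upper_after_first = any(ch.isupper() for ch in term[1:])
    case_sensitive = autocasesens and upper_after_first
    diac_sensitive = autodiacsens and has_diacritics(term)
    return case_sensitive, diac_sensitive


if __name__ == "__main__":
    for term in ("us", "Us", "US"):
        print(term, auto_sensitivity(term))
```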
|
|
|
|
<para>You can also explicitly activate case and diacritics
|
|
sensitivity by using modifiers with the query
|
|
language. <literal>C</literal> will make the term case-sensitive, and
|
|
<literal>D</literal> will make it
|
|
diacritics-sensitive. Examples:</para>
|
|
<programlisting>
|
|
"us"C
|
|
</programlisting>
|
|
|
|
<para>will search for the term <literal>us</literal> exactly
|
|
(<literal>Us</literal> will not be a match).</para>
|
|
|
|
<programlisting>
|
|
"resume"D
|
|
</programlisting>
|
|
<para>will search for the term <literal>resume</literal> exactly
|
|
(<literal>résumé</literal> will not be a match).</para>
|
|
|
|
|
|
<para>When either case or diacritics sensitivity is activated, stem
expansion is turned off, as combining an exact match with stem
expansion would not make much sense.</para>
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="RCL.SEARCH.ANCHORWILD">
|
|
<title>Anchored searches and wildcards</title>
|
|
|
|
<para>Some special characters are interpreted by &RCL; in search
|
|
strings to expand or specialize the search. Wildcards expand a root
|
|
term in controlled ways. Anchor characters can restrict a search to
|
|
succeed only if the match is found at or near the beginning of the
|
|
document or one of its fields.</para>
|
|
|
|
<sect2 id="RCL.SEARCH.WILDCARDS">
|
|
<title>More about wildcards</title>
|
|
|
|
<para>All words entered in &RCL; search fields will be processed
|
|
for wildcard expansion before the request is finally
|
|
executed.</para>
|
|
|
|
<para>The wildcard characters are:</para>
|
|
|
|
<itemizedlist>
|
|
<listitem><para><literal>*</literal> which matches 0 or more
|
|
characters.</para>
|
|
</listitem>
|
|
<listitem><para><literal>?</literal> which matches
|
|
a single character.</para>
|
|
</listitem>
|
|
<listitem><para><literal>[]</literal> which allows
defining sets of characters to be matched (e.g.
<literal>[</literal><userinput>abc</userinput><literal>]</literal>
matches a single character which may be 'a' or 'b' or 'c', and
<literal>[</literal><userinput>0-9</userinput><literal>]</literal>
matches any single digit).</para>
|
|
</listitem>
|
|
</itemizedlist>
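<para>The three wildcard forms behave much like shell globbing. The
Python sketch below uses the standard <literal>fnmatch</literal> module
on a made-up list of terms to show what each form selects. It is an
approximation for illustration: the term list is hypothetical, and &RCL;
performs its own expansion against the index term list.</para>

```python
import fnmatch

# Hypothetical sample of index terms, for illustration only.
TERMS = ["flood", "floor", "floors", "flour", "file1", "file2", "filex"]


def expand(pattern, terms=TERMS):
    """Return the terms matching a shell-style wildcard pattern."""
    return [t for t in terms if fnmatch.fnmatchcase(t, pattern)]


if __name__ == "__main__":
    print(expand("floo*"))      # '*' matches 0 or more characters
    print(expand("flo?r"))      # '?' matches exactly one character
    print(expand("file[0-9]"))  # '[]' matches one character from the set
```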
|
|
|
|
<para>You should be aware of a few things when using
|
|
wildcards.</para>
|
|
|
|
<itemizedlist>
|
|
<listitem><para>Using a wildcard character at the beginning of
|
|
a word can make for a slow search because &RCL; will have to
|
|
scan the whole index term list to find the
|
|
matches. However, this is much less a problem for field
|
|
searches, and queries
|
|
like <replaceable>author:*@domain.com</replaceable> can
|
|
sometimes be very useful.</para></listitem>
|
|
|
|
<listitem><para>For &RCL; version 1.18 only, when working with a
raw index (preserving character case and diacritics), the
literal part of a wildcard expression will be matched
exactly for case and diacritics. This is no longer true
for versions 1.19 and later.</para></listitem>
|
|
|
|
<listitem><para>Using a <literal>*</literal> at the end of a
|
|
word can produce more matches than you would think, and
|
|
strange search results. You can use the
|
|
<link linkend="RCL.SEARCH.GUI.TERMEXPLORER">term
|
|
explorer</link> tool to check what completions exist for
|
|
a given term. You can also see exactly what search was
|
|
performed by clicking on the link at the top of the result
|
|
list. In general, for natural language terms, stem
|
|
expansion will produce better results than an
|
|
ending <literal>*</literal> (stem expansion is turned off
|
|
when any wildcard character appears in the
|
|
term).</para></listitem>
|
|
</itemizedlist>
|
|
|
|
<sect3 id="RCL.SEARCH.WILDCARDS.PATH">
|
|
<title>Wildcards and path filtering</title>
|
|
|
|
<para>Due to the way that &RCL; processes wildcards
|
|
inside <literal>dir</literal> path filtering clauses, they
|
|
will have a multiplicative effect on the query size. A clause
containing wildcards in several path elements, like, for
example,
|
|
<literal>dir:</literal><replaceable>/home/me/*/*/docdir</replaceable>,
|
|
will almost certainly fail if your indexed tree is of any realistic
|
|
size.</para>
|
|
|
|
<para>Depending on the case, you may be able to work around
|
|
the issue by specifying the path elements more narrowly, with
|
|
a constant prefix, or by using 2
|
|
separate <literal>dir:</literal> clauses instead of multiple
|
|
wildcards, as
|
|
in <literal>dir:</literal><replaceable>/home/me</replaceable> <literal>dir:</literal><replaceable>docdir</replaceable>. The
|
|
latter query is not equivalent to the initial one because it
|
|
does not specify a number of directory levels, but that's
|
|
the best we can do (and it may be actually more useful in
|
|
some cases).</para>
|
|
|
|
</sect3>
|
|
|
|
</sect2> <!-- wildchars -->
|
|
|
|
<sect2 id="RCL.SEARCH.ANCHOR">
|
|
<title>Anchored searches</title>
|
|
|
|
<para>Two characters are used to specify that a search hit should
|
|
occur at the beginning or at the end of the
|
|
text. <literal>^</literal> at the beginning of a term or phrase
|
|
constrains the search to happen at the start, and <literal>$</literal>
at the end forces it to happen at the end.</para>
|
|
|
|
<para>As this function is implemented as a phrase search it is
|
|
possible to specify a maximum distance at which the hit should
|
|
occur, either through the controls of the advanced search panel, or
|
|
using the query language, for example, as in:
|
|
<programlisting>"^someterm"o10</programlisting> which would force
|
|
<literal>someterm</literal> to be found within 10 terms of the
|
|
start of the text. This can be combined with a field search as in
|
|
<literal>somefield:"^someterm"o10</literal> or
|
|
<literal>somefield:someterm$</literal>.</para>
|
|
|
|
<para>This feature can also be used with an actual phrase search,
|
|
but in this case, the distance applies to the whole phrase and
|
|
anchor, so that, for example, <literal>bla bla my unexpected
|
|
term</literal> at the beginning of the text would be a match for
|
|
<literal>"^my term"o5</literal>.</para>
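<para>The interaction between the anchor, the phrase and the slack can
be pictured with the following simplified Python model. It is an
assumption made for illustration, not the actual &XAP; phrase
implementation: the phrase terms must appear, in order, within the first
<emphasis>number of phrase terms plus slack</emphasis> terms of the
text. The <literal>anchored_at_start</literal> name is
hypothetical.</para>

```python
def anchored_at_start(text_terms, phrase_terms, slack):
    """Simplified model of a '^phrase'oN search.

    The phrase terms must appear, in order, inside the first
    len(phrase_terms) + slack terms of the text.
    """
    window = text_terms[: len(phrase_terms) + slack]
    position = 0
    for wanted in phrase_terms:
        try:
            # Look for the next phrase term after the previous match.
            position = window.index(wanted, position) + 1
        except ValueError:
            return False
    return True


if __name__ == "__main__":
    text = "bla bla my unexpected term".split()
    print(anchored_at_start(text, ["my", "term"], 5))
    print(anchored_at_start(text, ["my", "term"], 2))
```

<para>With this model, three extra terms (<literal>bla</literal>,
<literal>bla</literal> and <literal>unexpected</literal>) separate the
start of the text from the end of the phrase, so any slack of 3 or more
produces a match.</para>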
|
|
|
|
<para>Anchored searches can be very useful for searches inside
|
|
somewhat structured documents like scientific articles, in case
|
|
explicit metadata has not been supplied (a most frequent case), for
|
|
example for looking for matches inside the abstract or the list of
|
|
authors (which occur at the top of the document).</para>
|
|
|
|
|
|
</sect2>
|
|
|
|
</sect1> <!-- wildchars and anchors -->
|
|
|
|
<sect1 id="RCL.SEARCH.DESKTOP">
|
|
<title>Desktop integration</title>
|
|
|
|
<para>Being independent of the desktop type has its drawbacks: &RCL;
|
|
desktop integration is minimal. However there are a few tools
|
|
available:
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>The <application>KDE</application> KIO Slave was
|
|
described in a <link linkend="RCL.SEARCH.KIO">previous
|
|
section</link>.</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>If you use a recent version of Ubuntu Linux, you may
|
|
find the <ulink url="&WIKI;UnityLens">Ubuntu Unity
|
|
Lens</ulink> module useful.</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>There is also an independently developed
|
|
<ulink
|
|
url="http://kde-apps.org/content/show.php/recollrunner?content=128203">
|
|
Krunner plugin</ulink>.</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
|
|
<para>Here follow a few other things that may help.</para>
|
|
|
|
<sect2 id="RCL.SEARCH.SHORTCUT">
|
|
<title>Hotkeying recoll</title>
|
|
|
|
<para>It is surprisingly convenient to be able to show or hide the
|
|
&RCL; GUI with a single keystroke. Recoll comes with a small
|
|
Python script, based on the <application>libwnck</application> window
|
|
manager interface library, which will allow you to do just
|
|
this. The detailed instructions are on
|
|
<ulink url="&WIKI;HotRecoll">this wiki page</ulink>.</para>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.KICKER-APPLET">
|
|
<title>The KDE Kicker Recoll applet</title>
|
|
|
|
<para>This is probably obsolete now. Anyway:</para>
|
|
<para>The &RCL; source tree contains the source code to the
|
|
<application>recoll_applet</application>, a small application derived
|
|
from the <application>find_applet</application>. This can be used to
|
|
add a small &RCL; launcher to the KDE panel.</para>
|
|
|
|
<para>The applet is not automatically built with the main &RCL;
|
|
programs, nor is it included with the main source distribution
|
|
(because the KDE build boilerplate makes it relatively big). You can
|
|
download its source from the recoll.org download page. Use the
|
|
omnipotent <userinput>configure;make;make install</userinput>
|
|
incantation to build and install.</para>
|
|
|
|
<para>You can then add the applet to the panel by right-clicking the
|
|
panel and choosing the <guilabel>Add applet</guilabel> entry.</para>
|
|
|
|
<para>The <application>recoll_applet</application> has a small text
|
|
window where you can type a &RCL; query (in query language form),
|
|
and an icon which can be used to restrict the search to certain
|
|
types of files. It is quite primitive, and launches a new recoll
|
|
GUI instance every time (even if it is already running). You may
|
|
find it useful anyway.</para>
|
|
|
|
</sect2>
|
|
|
|
</sect1> <!-- rcl.search.desktop -->
|
|
|
|
</chapter> <!-- Search -->
|
|
|
|
|
|
<chapter id="RCL.PROGRAM">
|
|
<title>Programming interface</title>
|
|
|
|
<para>&RCL; has an Application Programming Interface, usable both
|
|
for indexing and searching, currently accessible from the
|
|
<application>Python</application> language.</para>
|
|
|
|
<para>Another less radical way to extend the application is to
|
|
write input handlers for new types of documents.</para>
|
|
|
|
<para>The processing of metadata attributes for documents
|
|
(<literal>fields</literal>) is highly configurable.</para>
|
|
|
|
|
|
|
|
<sect1 id="RCL.PROGRAM.FILTERS">
|
|
<title>Writing a document input handler</title>
|
|
|
|
<note><title>Terminology</title><para>The small programs or pieces
|
|
of code which handle the processing of the different document
|
|
types for &RCL; used to be called <literal>filters</literal>,
|
|
which is still reflected in the name of the directory which
|
|
holds them and many configuration variables. They were named
|
|
this way because one of their primary functions is to filter
|
|
out the formatting directives and keep the text
|
|
content. However these modules may have other behaviours, and
|
|
the term <literal>input handler</literal> is now progressively
|
|
substituted in the documentation. <literal>filter</literal> is
|
|
still used in many places though.</para></note>
|
|
|
|
<para>&RCL; input handlers cooperate to translate from the multitude
|
|
of input document formats, simple ones (such
as <application>opendocument</application> or
<application>acrobat</application>), or compound ones such
|
|
as <application>Zip</application>
|
|
or <application>Email</application>, into the final &RCL;
|
|
indexing input format, which is plain text.
|
|
Most input handlers are executable
|
|
programs or scripts. A few handlers are coded in C++ and live
|
|
inside <command>recollindex</command>. This latter kind will not
|
|
be described here.</para>
|
|
|
|
<para>There are currently (since version 1.13) two kinds of
|
|
external executable input handlers:
|
|
<itemizedlist>
|
|
<listitem><para>Simple <literal>exec</literal> handlers
|
|
run once and exit. They can be bare programs like
|
|
<command>antiword</command>, or scripts using other
|
|
programs. They are very simple to write, because they just
|
|
need to print the converted document to the standard
|
|
output. Their output can be plain text or HTML. HTML is
|
|
usually preferred because it can store metadata fields and
|
|
it allows preserving some of the formatting for the GUI
|
|
preview.</para>
|
|
</listitem>
|
|
<listitem><para>Multiple <literal>execm</literal> handlers
|
|
can process multiple files (sparing the process startup
|
|
time which can be very significant), or multiple documents
|
|
per file (e.g.: for <application>zip</application> or
|
|
<application>chm</application> files). They communicate
|
|
with the indexer through a simple protocol, but are
|
|
nevertheless a bit more complicated than the older
|
|
kind. Most new handlers are written in
<application>Python</application>, using a common module
to handle the protocol. One exception is
<command>rclimg</command>, which is written in Perl. The
|
|
subdocuments output by these handlers can be directly
|
|
indexable (text or HTML), or they can be other simple or
|
|
compound documents that will need to be processed by
|
|
another handler.</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
|
|
<para>In both cases, handlers deal with regular file system
|
|
files, and can process either a single document, or a
|
|
linear list of documents in each file. &RCL; is responsible
for performing up-to-date checks, dealing with more complex
embedding, and handling other upper-level issues.</para>
|
|
|
|
<para>A simple handler returning a
document in <literal>text/plain</literal> format cannot transfer
any metadata to the indexer. Generic metadata, like document
|
|
size or modification date, will be gathered and stored by
|
|
the indexer.</para>
|
|
|
|
<para>Handlers that produce <literal>text/html</literal>
|
|
format can return an arbitrary amount of metadata inside HTML
|
|
<literal>meta</literal> tags. These will be processed
|
|
according to the directives found in
|
|
the <link linkend="RCL.PROGRAM.FIELDS">
|
|
<filename>fields</filename> configuration
|
|
file</link>.</para>
|
|
|
|
<para>The handlers that can handle multiple documents per file
|
|
return a single piece of data to identify each document inside
|
|
the file. This piece of data, called
|
|
an <literal>ipath element</literal> will be sent back by
|
|
&RCL; to extract the document at query time, for previewing,
|
|
or for creating a temporary file to be opened by a
|
|
viewer.</para>
|
|
|
|
<para>The following section describes the simple
|
|
handlers, and the next one gives a few explanations about
|
|
the <literal>execm</literal> ones. You could conceivably
|
|
write a simple handler with only the elements in the
|
|
manual. This will not be the case for the other ones, for
|
|
which you will have to look at the code.</para>
|
|
|
|
<sect2 id="RCL.PROGRAM.FILTERS.SIMPLE">
|
|
<title>Simple input handlers</title>
|
|
|
|
<para>&RCL; simple handlers are usually shell-scripts, but this is in
|
|
no way necessary. Extracting the text from the native format is the
|
|
difficult part. Outputting the format expected by &RCL; is
|
|
trivial. Happily enough, most document formats have translators or
|
|
text extractors which can be called from the handler. In some cases
|
|
the output of the translating program is completely appropriate,
|
|
and no intermediate shell-script is needed.</para>
|
|
|
|
<para>Input handlers are called with a single argument which is the
|
|
source file name. They should output the result to stdout.</para>
|
|
|
|
<para>When writing a handler, you should decide if it will output
|
|
plain text or HTML. Plain text is simpler, but you will not be able
|
|
to add metadata or vary the output character encoding (this will be
|
|
defined in a configuration file). Additionally, some formatting may
|
|
be easier to preserve when previewing HTML. Actually the deciding factor
|
|
is metadata: &RCL; has a way to <link linkend="RCL.PROGRAM.FILTERS.HTML">
|
|
extract metadata from the HTML header and use it for field
searches</link>.</para>
|
|
|
|
<para>The <envar>RECOLL_FILTER_FORPREVIEW</envar> environment
|
|
variable (values <literal>yes</literal>, <literal>no</literal>)
|
|
tells the handler if the operation is for indexing or
|
|
previewing. Some handlers use this to output a slightly different
|
|
format, for example stripping uninteresting repeated keywords (e.g.
|
|
<literal>Subject:</literal> for email) when indexing. This is not
|
|
essential.</para>
|
|
|
|
<para>You should look at one of the simple handlers, for example
|
|
<command>rclps</command> for a starting point.</para>
|
|
|
|
<para>Don't forget to make your handler executable before
|
|
testing !</para>
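<para>As a complement to the existing scripts, here is a minimal sketch
of a simple handler written in Python. It is a hypothetical example, not
a handler shipped with &RCL;: it assumes that the input is a UTF-8 text
file, takes the file name as its single argument, and prints an HTML
version of the content to the standard output. The environment variable
is read only to show where a real handler could adapt its output.</para>

```python
#!/usr/bin/env python3
"""Hypothetical simple (exec) handler: wraps a text file into minimal HTML."""
import html
import os
import sys


def to_html(text):
    """Escape the text and wrap it in a minimal HTML document."""
    return (
        "<html><head>\n"
        '<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">\n'
        "</head><body><pre>\n"
        + html.escape(text, quote=False)
        + "\n</pre></body></html>\n"
    )


def main(filename):
    # 'yes' when previewing, 'no' when indexing. Unused in this sketch:
    # a real handler could produce slightly different output in each case.
    for_preview = os.environ.get("RECOLL_FILTER_FORPREVIEW", "no") == "yes"
    with open(filename, encoding="utf-8", errors="replace") as source:
        text = source.read()
    # Write bytes so that the output matches the charset declared above.
    sys.stdout.buffer.write(to_html(text).encode("utf-8"))
    return 0


if __name__ == "__main__":
    if len(sys.argv) == 2:
        sys.exit(main(sys.argv[1]))
    sys.stderr.write("usage: handler filename\n")
```

<para>The text is placed inside a <literal>pre</literal> element and the
special characters are escaped with the standard
<literal>html.escape()</literal> function, so that the original line
structure survives in the preview.</para>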
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.PROGRAM.FILTERS.MULTIPLE">
|
|
<title>"Multiple" handlers</title>
|
|
|
|
<para>If you can program and want to write
|
|
an <literal>execm</literal> handler, it should not be too
|
|
difficult to make sense of one of the existing modules. There is
|
|
a sample one with many comments, not actually used by &RCL;,
|
|
which would index a text file as one document per line. Look for
|
|
<filename>rcltxtlines.py</filename> in the
|
|
<filename>src/filters</filename> directory in the &RCL; <ulink
|
|
url="https://bitbucket.org/medoc/recoll/src">BitBucket
|
|
repository</ulink> (the sample
is not in the distributed release at the moment).</para>
|
|
|
|
<para>You can also have a look at the slightly more complex
|
|
<command>rclzip</command> which uses Zip
|
|
file paths as identifiers (<literal>ipath</literal>).</para>
|
|
|
|
<para><literal>execm</literal> handlers sometimes need to make
|
|
a choice for the nature of the <literal>ipath</literal>
|
|
elements that they use in communication with the
|
|
indexer. Here are a few guidelines:
|
|
<itemizedlist>
|
|
<listitem><para>Use ASCII or UTF-8 (if the identifier is an
|
|
integer print it, for example, like printf %d would
|
|
do).</para></listitem>
|
|
<listitem><para>If at all possible, the data should make some
|
|
kind of sense when printed to a log file to help with
|
|
debugging.</para></listitem>
|
|
<listitem><para>&RCL; uses a colon (<literal>:</literal>) as a
|
|
separator to store a complex path internally (for
|
|
deeper embedding). Colons inside
|
|
the <literal>ipath</literal> elements output by a
|
|
handler will be escaped, but would be a bad choice as a
|
|
handler-specific separator (mostly, again, for
|
|
debugging issues).</para></listitem>
|
|
</itemizedlist>
|
|
In any case, the main goal is that it should
|
|
be easy for the handler to extract the target document, given
|
|
the file name and the <literal>ipath</literal>
|
|
element.</para>
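<para>The idea of an <literal>ipath</literal> which makes extraction
easy can be illustrated with Zip archives, where the member name is a
natural identifier. The Python sketch below is not the
<command>rclzip</command> code and does not implement the
<literal>execm</literal> protocol: the two function names are
hypothetical, and only the standard <literal>zipfile</literal> module is
used.</para>

```python
import zipfile


def list_ipaths(archive_path):
    """Return one ipath per regular member: here simply the member name."""
    with zipfile.ZipFile(archive_path) as archive:
        return [info.filename for info in archive.infolist()
                if not info.is_dir()]


def extract_document(archive_path, ipath):
    """Return the raw bytes of the member identified by ipath."""
    with zipfile.ZipFile(archive_path) as archive:
        return archive.read(ipath)
```

<para>Member names are readable in a log file, and the pair made of the
file name and the <literal>ipath</literal> is all that is needed to get
the document back.</para>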
|
|
|
|
<para><literal>execm</literal> handlers will also produce
|
|
a document with a null <literal>ipath</literal>
|
|
element. Depending on the type of document, this may have
|
|
some associated data (e.g. the body of an email message), or
|
|
none (typical for an archive file). If it is empty, this
|
|
document will be useful anyway for some operations, as the
|
|
parent of the actual data documents.</para>
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.PROGRAM.FILTERS.ASSOCIATION">
|
|
<title>Telling &RCL; about the handler</title>
|
|
|
|
<para>There are two elements that link a file to the handler which
|
|
should process it: the association of file to MIME type and the
|
|
association of a MIME type with a handler.</para>
|
|
|
|
<para>The association of files to MIME types is mostly based on
|
|
name suffixes. The types are defined inside the
|
|
<link linkend="RCL.INSTALL.CONFIG.MIMEMAP">
|
|
<filename>mimemap</filename> file</link>. Example:
|
|
<programlisting>
|
|
|
|
.doc = application/msword
|
|
</programlisting>
|
|
If no suffix association is found for the file name, &RCL; will try
|
|
to execute a system command (typically <command>file -i</command> or
|
|
<command>xdg-mime</command>) to determine a MIME type.</para>
|
|
|
|
<para>The second element is the association of MIME types to handlers
|
|
in the <link linkend="RCL.INSTALL.CONFIG.MIMECONF">
|
|
<filename>mimeconf</filename> file</link>. A sample will probably be
|
|
better than a long explanation:</para>
|
|
<programlisting>
|
|
|
|
[index]
|
|
application/msword = exec antiword -t -i 1 -m UTF-8;\
|
|
mimetype = text/plain ; charset=utf-8
|
|
|
|
application/ogg = exec rclogg
|
|
|
|
text/rtf = exec unrtf --nopict --html; charset=iso-8859-1; mimetype=text/html
|
|
|
|
application/x-chm = execm rclchm
|
|
</programlisting>
|
|
|
|
<para>The fragment specifies that:
|
|
|
|
<itemizedlist>
|
|
<listitem><para><literal>application/msword</literal> files
|
|
are processed by executing the <command>antiword</command>
|
|
program, which outputs
|
|
<literal>text/plain</literal> encoded in
|
|
<literal>utf-8</literal>.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><literal>application/ogg</literal> files are
|
|
processed by the <command>rclogg</command> script, with
|
|
default output type (<literal>text/html</literal>, with
|
|
encoding specified in the header, or <literal>utf-8</literal>
|
|
by default).</para>
|
|
</listitem>
|
|
|
|
<listitem><para><literal>text/rtf</literal> is processed by
|
|
<command>unrtf</command>, which outputs
|
|
<literal>text/html</literal>. The
|
|
<literal>iso-8859-1</literal> encoding is specified because it
|
|
is not the <literal>utf-8</literal> default, and not output by
|
|
<command>unrtf</command> in the HTML header section.</para>
|
|
</listitem>
|
|
<listitem><para><literal>application/x-chm</literal> is processed
|
|
by a persistent handler. This is determined by the
|
|
<literal>execm</literal> keyword.</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
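<para>Putting the two associations together for a new document type
could look like the following fragment. The <literal>.xyz</literal>
suffix, the <literal>application/x-xyz</literal> MIME type and the
<command>rclxyz</command> script are hypothetical names chosen for
illustration. The syntax is copied from the samples above, and the first
entry would go into <filename>mimemap</filename>, the second one into
<filename>mimeconf</filename>.</para>

```
# mimemap: associate the (hypothetical) .xyz suffix with a MIME type
.xyz = application/x-xyz

# mimeconf: associate that MIME type with a (hypothetical) handler script
[index]
application/x-xyz = exec rclxyz
```

<para>As no output type is specified, such a handler would be expected
to produce <literal>text/html</literal>, like <command>rclogg</command>
in the sample above.</para>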
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.PROGRAM.FILTERS.HTML">
|
|
<title>Input handler output</title>
|
|
|
|
<para>Both the simple and persistent input handlers can return any
|
|
MIME type to Recoll, which will further process the data according
|
|
to the MIME configuration.</para>
|
|
|
|
<para>Most input filters produce either
<literal>text/plain</literal> or <literal>text/html</literal>
data. There are exceptions, for example, filters which process
archive files (<literal>zip</literal>, <literal>tar</literal>, etc.)
|
|
will usually return the documents as they are found, without
|
|
processing them further.</para>
|
|
|
|
<para>There is nothing to say about <literal>text/plain</literal>
|
|
output, except that its character encoding should be consistent
|
|
with what is specified in the <filename>mimeconf</filename>
|
|
file.</para>
|
|
|
|
<para>For filters producing HTML, the output could be very minimal
|
|
like the following example:
|
|
<programlisting>
|
|
<html>
|
|
<head>
|
|
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
|
|
</head>
|
|
<body>
|
|
Some text content
|
|
</body>
|
|
</html>
|
|
</programlisting>
|
|
</para>
|
|
|
|
<para>You should take care to escape some
|
|
characters inside the text by transforming them into
|
|
appropriate entities. At the very minimum,
|
|
"<literal>&</literal>" should be transformed into
|
|
"<literal>&amp;</literal>", "<literal><</literal>"
|
|
should be transformed into
|
|
"<literal>&lt;</literal>". This is not always properly
|
|
done by external helper programs which output HTML, and of
|
|
course never by those which output plain text. </para>
|
|
|
|
<para>When encapsulating plain text in an HTML body,
|
|
the display of a preview may be improved by enclosing the
|
|
text inside <literal><pre></literal> tags.</para>
|
|
|
|
<para>The character set needs to be specified in the
|
|
header. It does not need to be UTF-8 (&RCL; will take care
|
|
of translating it), but it must be accurate for good
|
|
results.</para>
|
|
|
|
<para>&RCL; will process <literal>meta</literal> tags inside
|
|
the header as possible document fields candidates. Documents
|
|
fields can be processed by the indexer in different ways,
|
|
for searching or displaying inside query results. This is
|
|
described in a <link linkend="RCL.PROGRAM.FIELDS">following
section</link>.
|
|
</para>
|
|
|
|
<para>By default, the indexer will process the standard header
|
|
fields if they are present: <literal>title</literal>,
|
|
<literal>meta/description</literal>,
|
|
and <literal>meta/keywords</literal> are both indexed and stored
|
|
for query-time display.</para>
|
|
|
|
<para>A predefined non-standard <literal>meta</literal> tag
|
|
will also be processed by &RCL; without further
|
|
configuration: if a <literal>date</literal> tag is present
|
|
and has the right format, it will be used as the document
|
|
date (for display and sorting), in preference to the file
|
|
modification date. The date format should be as follows:
|
|
<programlisting>
|
|
<meta name="date" content="YYYY-mm-dd HH:MM:SS">
|
|
or
|
|
<meta name="date" content="YYYY-mm-ddTHH:MM:SS">
|
|
</programlisting>
|
|
Example:
|
|
<programlisting>
|
|
<meta name="date" content="2013-02-24 17:50:00">
|
|
</programlisting>
|
|
</para>
|
|
|
|
<para>Input handlers also have the possibility to "invent" field
|
|
names. These should also be output as meta tags:</para>
|
|
|
|
<programlisting>
|
|
<meta name="somefield" content="Some textual data" />
|
|
</programlisting>
|
|
|
|
<para>You can embed HTML markup inside the content of custom
|
|
fields, for improving the display inside result lists. In this
|
|
case, add a (wildly non-standard) <literal>markup</literal>
|
|
attribute to tell &RCL; that the value is HTML and should not
|
|
be escaped for display.</para>
|
|
|
|
<programlisting>
|
|
<meta name="somefield" markup="html" content="Some <i>textual</i> data" />
|
|
</programlisting>
|
|
|
|
<para>As written above, the processing of fields is described
|
|
in a <link linkend="RCL.PROGRAM.FIELDS">further
|
|
section</link>.</para>
|
|
|
|
|
|
<para>Persistent filters can use another, probably simpler,
|
|
method to produce metadata, by calling the
|
|
<literal>setfield()</literal> helper method. This avoids the
|
|
necessity to produce HTML, and any issue with HTML quoting. See,
|
|
for example, <filename>rclaudio</filename> in &RCL; 1.23 and
|
|
    later, a handler which outputs
|
|
<literal>text/plain</literal> and uses
|
|
<literal>setfield()</literal> to produce metadata.</para>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.PROGRAM.FILTERS.PAGES">
|
|
<title>Page numbers</title>
|
|
|
|
<para>The indexer will interpret <literal>^L</literal> characters
|
|
in the handler output as indicating page breaks, and will record
|
|
them. At query time, this allows starting a viewer on the right
|
|
page for a hit or a snippet. Currently, only the PDF, Postscript
|
|
and DVI handlers generate page breaks.</para>
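      <para>The <literal>^L</literal> character is the ASCII form
      feed (0x0C). As a minimal sketch, a handler which already has
      the text of each page could join the pages as follows (the
      <literal>join_pages()</literal> name is hypothetical):</para>

<programlisting><![CDATA[
FORM_FEED = "\f"  # the ^L character, code 0x0C


def join_pages(pages):
    """Join per-page text with form feeds (hypothetical helper).

    One form feed is emitted between two consecutive pages, so that
    each one marks a page break in the output text.
    """
    return FORM_FEED.join(pages)


text = join_pages(["first page text", "second page text"])
]]></programlisting>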
|
|
|
|
</sect2>
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="RCL.PROGRAM.FIELDS">
|
|
<title>Field data processing</title>
|
|
|
|
<para><literal>Fields</literal> are named pieces of information
|
|
in or about documents, like <literal>title</literal>,
|
|
<literal>author</literal>, <literal>abstract</literal>.</para>
|
|
|
|
<para>The field values for documents can appear in several ways
|
|
during indexing: either output by input handlers
|
|
as <literal>meta</literal> fields in the HTML header section, or
|
|
extracted from file extended attributes, or added as attributes
|
|
of the <literal>Doc</literal> object when using the API, or
|
|
    synthesized internally by &RCL;.</para>
|
|
|
|
<para>The &RCL; query language allows searching for text in a
|
|
specific field.</para>
|
|
|
|
<para>&RCL; defines a number of default fields. Additional
|
|
ones can be output by handlers, and described in the
|
|
<filename>fields</filename> configuration file.</para>
|
|
|
|
<para>Fields can be:</para>
|
|
<itemizedlist>
|
|
|
|
<listitem><para><literal>indexed</literal>, meaning that their
|
|
terms are separately stored in inverted lists (with a specific
|
|
prefix), and that a field-specific search is possible.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><literal>stored</literal>, meaning that their
|
|
value is recorded in the index data record for the document,
|
|
and can be returned and displayed with search results.</para>
|
|
</listitem>
|
|
|
|
</itemizedlist>
|
|
|
|
    <para>A field can be indexed, stored, or both. This and
|
|
    other aspects of field handling are defined inside the
|
|
<filename>fields</filename> configuration file.</para>
|
|
|
|
<para>The sequence of events for field processing is as follows:
|
|
<itemizedlist>
|
|
<listitem><para>During indexing,
|
|
<command>recollindex</command> scans all <literal>meta</literal>
|
|
fields in HTML documents (most document types are transformed
|
|
into HTML at some point). It compares the name for each element
|
|
to the configuration defining what should be done with fields
|
|
	(the <filename>fields</filename> file).</para>
|
|
</listitem>
|
|
<listitem><para>If the name for the <literal>meta</literal>
|
|
element matches one for a field that should be indexed, the
|
|
contents are processed and the terms are entered into the index
|
|
with the prefix defined in the <filename>fields</filename>
|
|
file.</para>
|
|
</listitem>
|
|
<listitem><para>If the name for the <literal>meta</literal> element
|
|
matches one for a field that should be stored, the content of the
|
|
element is stored with the document data record, from which it
|
|
can be extracted and displayed at query time.</para>
|
|
</listitem>
|
|
<listitem><para>At query time, if a field search is performed, the
|
|
index prefix is computed and the match is only performed against
|
|
appropriately prefixed terms in the index.</para>
|
|
</listitem>
|
|
<listitem><para>At query time, the field can be displayed inside
|
|
the result list by using the appropriate directive in the
|
|
definition of the <link
|
|
linkend="RCL.SEARCH.GUI.CUSTOM.RESLIST">result list paragraph
|
|
format</link>. All fields are displayed on the fields screen of
|
|
the preview window (which you can reach through the right-click
|
|
	menu). This is independent of whether the search which
	produced the results used the field.</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
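    <para>The sequence above can be modeled with a few lines of
    Python. This is a toy model and not the &RCL; implementation: the
    two tables stand in for the <filename>fields</filename>
    configuration, the prefix values and function names are
    hypothetical, and term generation is reduced to splitting on
    white space.</para>

<programlisting><![CDATA[
# Hypothetical stand-ins for the "fields" configuration data.
INDEX_PREFIXES = {"author": "A", "pdfpages": "XYPDFP"}
STORED_FIELDS = {"author", "pdfpages"}


def process_meta_fields(meta, prefixes=INDEX_PREFIXES, stored=STORED_FIELDS):
    """Index-time step: return (prefixed terms, stored record)."""
    terms = []
    record = {}
    for name, content in meta.items():
        prefix = prefixes.get(name)
        if prefix is not None:
            # Indexed field: each term is entered with the field prefix.
            terms.extend(prefix + word.lower() for word in content.split())
        if name in stored:
            # Stored field: the value is kept with the document record.
            record[name] = content
    return terms, record


def field_query_terms(name, words, prefixes=INDEX_PREFIXES):
    """Query-time step: compute the prefixed terms to match."""
    prefix = prefixes[name]
    return [prefix + word.lower() for word in words.split()]


terms, record = process_meta_fields({"author": "Jane Doe", "generator": "x"})
]]></programlisting>

    <para>In this model, a field search on <literal>author</literal>
    only matches terms carrying the author prefix, and the
    <literal>generator</literal> value is dropped because it is
    neither indexed nor stored.</para>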
|
|
|
|
<para>You can find more information in the
|
|
<link linkend="RCL.INSTALL.CONFIG.FIELDS">section about the
|
|
<filename>fields</filename> file</link>, or in comments inside the
|
|
file.</para>
|
|
|
|
<para>You can also have a look at the
|
|
<ulink url="&WIKI;HandleCustomField">example on the Wiki</ulink>,
|
|
detailing how one could add a <emphasis>page count</emphasis> field
|
|
    to PDF documents for displaying inside result lists.</para>
|
|
|
|
</sect1>
|
|
|
|
|
|
<sect1 id="RCL.PROGRAM.PYTHONAPI">
|
|
<title>Python API</title>
|
|
|
|
<sect2 id="RCL.PROGRAM.PYTHONAPI.INTRO">
|
|
<title>Introduction</title>
|
|
|
|
<para>&RCL; versions after 1.11 define a Python programming
|
|
interface, both for searching and creating/updating an
|
|
index.</para>
|
|
|
|
<para>The search interface is used in the &RCL; Ubuntu Unity Lens
|
|
and the &RCL; Web UI. It can run queries on any &RCL;
|
|
configuration.</para>
|
|
|
|
<para>The index update section of the API may be used to create and
|
|
update &RCL; indexes on specific configurations (separate from the
|
|
ones created by <command>recollindex</command>). The resulting
|
|
databases can be queried alone, or in conjunction with regular
|
|
ones, through the GUI or any of the query interfaces.</para>
|
|
|
|
<para>The search API is modeled along the Python database API
|
|
specification. There were two major changes along &RCL; versions:
|
|
<itemizedlist>
|
|
<listitem><para>The basis for the &RCL; API changed from Python
|
|
database API version 1.0 (&RCL; versions up to 1.18.1),
|
|
to version 2.0 (&RCL; 1.18.2 and later).</para></listitem>
|
|
<listitem><para>The <literal>recoll</literal> module became a
|
|
package (with an internal <literal>recoll</literal>
|
|
module) as of &RCL; version 1.19, in order to add more
|
|
functions. For existing code, this only changes the way
|
|
the interface must be imported.</para></listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
|
|
<para>We will describe the new API and package structure here. A
|
|
paragraph at the end of this section will explain a few differences
|
|
and ways to write code compatible with both versions.</para>
|
|
|
|
<para>The Python interface can be found in the source package,
|
|
under <filename>python/recoll</filename>.</para>
|
|
|
|
<para>The <filename>python/recoll/</filename> directory
|
|
contains the usual <filename>setup.py</filename>. After
|
|
configuring the main &RCL; code, you can use the script to
|
|
build and install the Python module:
|
|
<screen>
|
|
<userinput>cd recoll-xxx/python/recoll</userinput>
|
|
<userinput>python setup.py build</userinput>
|
|
<userinput>python setup.py install</userinput>
|
|
</screen>
|
|
</para>
|
|
|
|
<para>As of &RCL; 1.19, the module can be compiled for
|
|
Python3.</para>
|
|
|
|
<para>The normal &RCL; installer installs the Python2
|
|
API along with the main code. The Python3 version must be
|
|
      explicitly built and installed.</para>
|
|
|
|
<para>When installing from a repository, and depending on the
|
|
distribution, the Python API can sometimes be found in a
|
|
separate package.</para>
|
|
|
|
<para>As an introduction, the following small sample will run a
|
|
query and list the title and url for each of the results. It would
|
|
work with &RCL; 1.19 and later. The
|
|
<filename>python/samples</filename> source directory contains
|
|
several examples of Python programming with &RCL;, exercising the
|
|
extension more completely, and especially its data extraction
|
|
features.</para>
|
|
|
|
<programlisting><![CDATA[
|
|
#!/usr/bin/env python
|
|
|
|
from recoll import recoll
|
|
|
|
db = recoll.connect()
|
|
query = db.query()
|
|
nres = query.execute("some query")
|
|
results = query.fetchmany(20)
|
|
for doc in results:
|
|
print(doc.url, doc.title)
|
|
]]></programlisting>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.PROGRAM.PYTHONAPI.ELEMENTS">
|
|
<title>Interface elements</title>
|
|
|
|
      <para>A few elements in the interface are specific and need
|
|
an explanation.</para>
|
|
|
|
<variablelist>
|
|
|
|
	<varlistentry id="RCL.PROGRAM.PYTHONAPI.ELEMENTS.IPATH">
|
|
<term>ipath</term>
|
|
|
|
<listitem><para>This data value (set as a field in the Doc
|
|
object) is stored, along with the URL, but not indexed by
|
|
&RCL;. Its contents are not interpreted by the index layer, and
|
|
its use is up to the application. For example, the &RCL; file
|
|
system indexer uses the <literal>ipath</literal> to store the
|
|
part of the document access path internal to (possibly
|
|
	  nested) container documents. <literal>ipath</literal> in
|
|
	  this case is a vector of access elements (e.g. the first part
|
|
could be a path inside a zip file to an archive member which
|
|
happens to be an mbox file, the second element would be the
|
|
message sequential number inside the mbox
|
|
etc.). <literal>url</literal> and <literal>ipath</literal> are
|
|
returned in every search result and define the access to the
|
|
original document. <literal>ipath</literal> is empty for
|
|
top-level document/files (e.g. a PDF document which is a
|
|
filesystem file). The &RCL; GUI knows about the structure of the
|
|
<literal>ipath</literal> values used by the filesystem indexer,
|
|
and uses it for such functions as opening the parent of a given
|
|
document.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry id="RCL.PROGRAM.PYTHONAPI.ELEMENTS.UDI">
|
|
<term>udi</term>
|
|
|
|
	  <listitem><para>A <literal>udi</literal> (unique document
|
|
identifier) identifies a document. Because of limitations inside
|
|
the index engine, it is restricted in length (to 200 bytes),
|
|
which is why a regular URI cannot be used. The structure and
|
|
	  contents of the <literal>udi</literal> are defined by the
|
|
application and opaque to the index engine. For example, the
|
|
internal file system indexer uses the complete document path
|
|
(file path + internal path), truncated to length, the suppressed
|
|
part being replaced by a hash value. The <literal>udi</literal>
|
|
is not explicit in the query interface (it is used "under the
|
|
hood" by the <filename>rclextract</filename> module), but it is
|
|
an explicit element of the update interface.</para> </listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry id="RCL.PROGRAM.PYTHONAPI.ELEMENTS.PARENTUDI">
|
|
<term>parent_udi</term>
|
|
|
|
<listitem><para>If this attribute is set on a document when
|
|
entering it in the index, it designates its physical container
|
|
document. In a multilevel hierarchy, this may not be the
|
|
immediate parent. <literal>parent_udi</literal> is optional, but
|
|
its use by an indexer may simplify index maintenance, as &RCL;
|
|
will automatically delete all children defined by
|
|
<literal>parent_udi == udi</literal> when the document designated
|
|
	  by <literal>udi</literal> is destroyed. For example, if a
|
|
<literal>Zip</literal> archive contains entries which are
|
|
themselves containers, like <literal>mbox</literal> files, all
|
|
the subdocuments inside the <literal>Zip</literal> file (mbox,
|
|
messages, message attachments, etc.) would have the same
|
|
<literal>parent_udi</literal>, matching the
|
|
<literal>udi</literal> for the <literal>Zip</literal> file, and
|
|
all would be destroyed when the <literal>Zip</literal> file
|
|
(identified by its <literal>udi</literal>) is removed from the
|
|
index. The standard filesystem indexer uses
|
|
<literal>parent_udi</literal>.</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>Stored and indexed fields</term>
|
|
|
|
<listitem><para>The <filename>fields</filename> file inside
|
|
the &RCL; configuration defines which document fields are
|
|
either "indexed" (searchable), "stored" (retrievable with
|
|
search results), or both.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
</variablelist>
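      <para>The <literal>udi</literal> construction described above
      (path truncated to the size limit, with the suppressed part
      replaced by a hash) can be sketched as follows. This is not
      the code used by the filesystem indexer: the
      <literal>make_udi()</literal> name, the <literal>|</literal>
      separator and the choice of MD5 are assumptions made for the
      example.</para>

<programlisting><![CDATA[
import hashlib

UDI_MAX_BYTES = 200  # length limit stated in the udi description


def make_udi(path, ipath=""):
    """Build a bounded-length document identifier (hypothetical helper).

    The "|" separator between file path and internal path is an
    assumption of this sketch.
    """
    full = path + "|" + ipath if ipath else path
    raw = full.encode("utf-8")
    if len(raw) <= UDI_MAX_BYTES:
        return raw
    # Keep the beginning, and replace the suppressed tail by its hash.
    hash_len = len(hashlib.md5(b"").hexdigest())  # 32 hex characters
    keep = UDI_MAX_BYTES - hash_len
    digest = hashlib.md5(raw[keep:]).hexdigest().encode("ascii")
    return raw[:keep] + digest
]]></programlisting>

      <para>The result is a byte string, which is sufficient here
      because the value is opaque to the index engine.</para>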
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.PROGRAM.PYTHONAPI.SEARCH">
|
|
<title>Python search interface</title>
|
|
|
|
<sect3 id="RCL.PROGRAM.PYTHONAPI.PACKAGE">
|
|
<title>Recoll package</title>
|
|
|
|
<para>The <literal>recoll</literal> package contains two
|
|
modules:
|
|
<itemizedlist>
|
|
<listitem><para>The <literal>recoll</literal> module contains
|
|
functions and classes used to query (or update) the
|
|
index. This section will only describe the query part, see
|
|
further for the update part.</para></listitem>
|
|
<listitem><para>The <literal>rclextract</literal> module contains
|
|
functions and classes used to access document
|
|
data.</para></listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
</sect3>
|
|
|
|
<sect3 id="RCL.PROGRAM.PYTHONAPI.RECOLL">
|
|
<title>The recoll module</title>
|
|
|
|
<sect4 id="RCL.PROGRAM.PYTHONAPI.RECOLL.FUNCTIONS">
|
|
<title>Functions</title>
|
|
|
|
<variablelist>
|
|
<varlistentry>
|
|
<term>connect(confdir=None, extra_dbs=None,
|
|
writable = False)</term>
|
|
<listitem>
|
|
<para>The <literal>connect()</literal> function connects to
|
|
one or several &RCL; index(es) and returns
|
|
a <literal>Db</literal> object.</para>
|
|
<itemizedlist>
|
|
<listitem><para><literal>confdir</literal> may specify
|
|
a configuration directory. The usual defaults
|
|
apply.</para></listitem>
|
|
<listitem><para><literal>extra_dbs</literal> is a list of
|
|
additional indexes (Xapian directories).</para></listitem>
|
|
<listitem><para><literal>writable</literal> decides if
|
|
we can index new data through this
|
|
connection.</para></listitem>
|
|
</itemizedlist>
|
|
<para>This call initializes the recoll module, and it should
|
|
always be performed before any other call or object
|
|
creation.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
</variablelist>
|
|
</sect4>
|
|
|
|
|
|
<sect4 id="RCL.PROGRAM.PYTHONAPI.RECOLL.CLASSES">
|
|
<title>Classes</title>
|
|
|
|
<sect5 id="RCL.PROGRAM.PYTHONAPI.RECOLL.CLASSES.DB">
|
|
<title>The Db class</title>
|
|
|
|
<para>A Db object is created by
|
|
a <literal>connect()</literal> call and holds a
|
|
connection to a Recoll index.</para>
|
|
<variablelist>
|
|
<varlistentry>
|
|
<term>Db.close()</term>
|
|
<listitem><para>Closes the connection. You can't do anything
|
|
with the <literal>Db</literal> object after
|
|
this.</para></listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term>Db.query(), Db.cursor()</term> <listitem><para>These
|
|
aliases return a blank <literal>Query</literal> object
|
|
for this index.</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>Db.setAbstractParams(maxchars,
|
|
contextwords)</term> <listitem><para>Set the parameters used
|
|
to build snippets (sets of keywords in context text
|
|
fragments). <literal>maxchars</literal> defines the
|
|
maximum total size of the abstract.
|
|
<literal>contextwords</literal> defines how many
|
|
terms are shown around the keyword.</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>Db.termMatch(match_type, expr, field='',
|
|
maxlen=-1, casesens=False, diacsens=False, lang='english')
|
|
</term>
|
|
<listitem><para>Expand an expression against the
|
|
index term list. Performs the basic function from the
|
|
GUI term explorer tool. <literal>match_type</literal>
|
|
		  can be one
|
|
of <literal>wildcard</literal>, <literal>regexp</literal>
|
|
or <literal>stem</literal>. Returns a list of terms
|
|
expanded from the input expression.
|
|
</para></listitem>
|
|
</varlistentry>
|
|
|
|
</variablelist>
|
|
|
|
</sect5>
|
|
|
|
|
|
<sect5 id="RCL.PROGRAM.PYTHONAPI.RECOLL.CLASSES.QUERY">
|
|
<title>The Query class</title>
|
|
|
|
<para>A <literal>Query</literal> object (equivalent to a
|
|
cursor in the Python DB API) is created by
|
|
a <literal>Db.query()</literal> call. It is used to
|
|
execute index searches.</para>
|
|
|
|
<variablelist>
|
|
|
|
<varlistentry>
|
|
<term>Query.sortby(fieldname, ascending=True)</term>
|
|
<listitem><para>Sort results
|
|
by <replaceable>fieldname</replaceable>, in ascending
|
|
or descending order. Must be called before executing
|
|
the search.</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>Query.execute(query_string, stemming=1,
|
|
stemlang="english")</term>
|
|
<listitem><para>Starts a search
|
|
for <replaceable>query_string</replaceable>, a &RCL;
|
|
search language string.</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>Query.executesd(SearchData)</term>
|
|
<listitem><para>Starts a search for the query defined by the
|
|
SearchData object.</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>Query.fetchmany(size=query.arraysize)</term>
|
|
|
|
<listitem><para>Fetches
|
|
the next <literal>Doc</literal> objects in the current
|
|
search results, and returns them as an array of the
|
|
required size, which is by default the value of
|
|
the <literal>arraysize</literal> data member.</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>Query.fetchone()</term>
|
|
<listitem><para>Fetches the next <literal>Doc</literal> object
|
|
from the current search results.</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>Query.close()</term>
|
|
<listitem><para>Closes the query. The object is unusable
|
|
after the call.</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>Query.scroll(value, mode='relative')</term>
|
|
<listitem><para>Adjusts the position in the current result
|
|
set. <literal>mode</literal> can
|
|
be <literal>relative</literal>
|
|
or <literal>absolute</literal>. </para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>Query.getgroups()</term>
|
|
<listitem><para>Retrieves the expanded query terms as a list
|
|
		  of pairs. Meaningful only after <literal>execute()</literal> or <literal>executesd()</literal>. In each
|
|
pair, the first entry is a list of user terms (of size
|
|
one for simple terms, or more for group and phrase
|
|
clauses), the second a list of query terms as derived
|
|
from the user terms and used in the Xapian
|
|
Query.</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>Query.getxquery()</term>
|
|
<listitem><para>Return the Xapian query description as a
|
|
Unicode string.
|
|
		  Meaningful only after <literal>execute()</literal> or <literal>executesd()</literal>.</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>Query.highlight(text, ishtml = 0, methods = object)</term>
|
|
		<listitem><para>Will insert <span class="rclmatch">,
|
|
</span> tags around the match areas in the input text
|
|
and return the modified text. <literal>ishtml</literal>
|
|
can be set to indicate that the input text is HTML and
|
|
that HTML special characters should not be escaped.
|
|
<literal>methods</literal> if set should be an object
|
|
with methods startMatch(i) and endMatch() which will be
|
|
called for each match and should return a begin and end
|
|
		  tag.</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
		<term>Query.makedocabstract(doc, methods = object)</term>
|
|
<listitem><para>Create a snippets abstract
|
|
for <literal>doc</literal> (a <literal>Doc</literal>
|
|
object) by selecting text around the match terms.
|
|
If methods is set, will also perform highlighting. See
|
|
the highlight method.
|
|
</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>Query.__iter__() and Query.next()</term>
|
|
<listitem><para>So that things like <literal>for doc in
|
|
query:</literal> will work.</para></listitem>
|
|
</varlistentry>
|
|
</variablelist>
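	    <para>As a sketch of the <literal>methods</literal>
	    parameter of <literal>highlight()</literal>, the following
	    object replaces the default <literal>span</literal> tags
	    with bold markup. The <literal>BoldMarkers</literal> and
	    <literal>highlight_plain()</literal> names are hypothetical.
	    The meaning of the integer passed to
	    <literal>startMatch()</literal> is not detailed here, so
	    the sketch ignores it.</para>

<programlisting><![CDATA[
class BoldMarkers(object):
    """Supplies the begin and end tags inserted around each match."""

    def startMatch(self, idx):
        # idx is received from the caller and ignored in this sketch.
        return "<b>"

    def endMatch(self):
        return "</b>"


def highlight_plain(query, text):
    """Highlight matches in plain (non-HTML) text (hypothetical helper).

    query is expected to be a Query object on which a search has
    already been executed.
    """
    return query.highlight(text, ishtml=0, methods=BoldMarkers())
]]></programlisting>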
|
|
|
|
<variablelist>
|
|
|
|
<varlistentry><term>Query.arraysize</term>
|
|
<listitem><para>Default number of records processed by fetchmany
|
|
(r/w).</para></listitem>
|
|
</varlistentry>
|
|
<varlistentry><term>Query.rowcount</term><listitem><para>Number
|
|
of records returned by the last
|
|
execute.</para></listitem></varlistentry>
|
|
<varlistentry><term>Query.rownumber</term><listitem><para>Next index
|
|
to be fetched from results. Normally increments after
|
|
each fetchone() call, but can be set/reset before the
|
|
call to effect seeking (equivalent to
|
|
using <literal>scroll()</literal>). Starts at
|
|
0.</para></listitem>
|
|
</varlistentry>
|
|
|
|
</variablelist>
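	    <para>The fetch methods can be wrapped in a generator to
	    walk a result set in batches. The sketch below assumes
	    that <literal>fetchmany()</literal> returns an empty list
	    once the results are exhausted, following the Python
	    database API convention. The <literal>iter_docs()</literal>
	    name is hypothetical.</para>

<programlisting><![CDATA[
def iter_docs(query, batch_size=20, limit=None):
    """Yield Doc objects from an executed query, batch by batch.

    Stops after `limit` documents if a limit is given, or when
    fetchmany() returns no more results (assumed behaviour).
    """
    fetched = 0
    while limit is None or fetched < limit:
        wanted = batch_size if limit is None else min(batch_size, limit - fetched)
        docs = query.fetchmany(wanted)
        if not docs:
            break
        for doc in docs:
            yield doc
        fetched += len(docs)
]]></programlisting>

	    <para>After a call to <literal>query.execute()</literal>,
	    a loop such as <literal>for doc in iter_docs(query,
	    limit=50):</literal> would then process at most 50
	    results.</para>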
|
|
|
|
</sect5>
|
|
|
|
|
|
<sect5 id="RCL.PROGRAM.PYTHONAPI.RECOLL.CLASSES.DOC">
|
|
<title>The Doc class</title>
|
|
|
|
<para>A <literal>Doc</literal> object contains index data
|
|
for a given document. The data is extracted from the
|
|
index when searching, or set by the indexer program when
|
|
updating. The Doc object has many attributes to be read or
|
|
set by its user. It matches exactly the Rcl::Doc C++
|
|
object. Some of the attributes are predefined, but,
|
|
especially when indexing, others can be set, the name of
|
|
which will be processed as field names by the indexing
|
|
configuration. Inputs can be specified as Unicode or
|
|
strings. Outputs are Unicode objects. All dates are
|
|
specified as Unix timestamps, printed as strings. Please
|
|
refer to the <filename>rcldb/rcldoc.h</filename> C++ file
|
|
for a description of the predefined attributes.</para>
|
|
|
|
<para>At query time, only the fields that are defined
|
|
as <literal>stored</literal> either by default or in
|
|
the <filename>fields</filename> configuration file will be
|
|
meaningful in the <literal>Doc</literal>
|
|
	      object. In particular, this will not be the case for the
|
|
document text. See the <literal>rclextract</literal>
|
|
module for accessing document contents.</para>
|
|
|
|
<variablelist>
|
|
|
|
<varlistentry>
|
|
<term>get(key), [] operator</term>
|
|
|
|
<listitem><para>Retrieve the named doc
|
|
attribute. You can also use
|
|
<literal>getattr(doc, key)</literal> or
|
|
<literal>doc.key</literal>.</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>doc.key = value</term>
|
|
|
|
		<listitem><para>Set the named doc
|
|
attribute. You can also use
|
|
<literal>setattr(doc, key, value)</literal>.</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>getbinurl()</term>
|
|
|
|
<listitem><para>Retrieve the URL in byte array format (no
|
|
transcoding), for use as parameter to a system
|
|
call.</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>setbinurl(url)</term>
|
|
|
|
<listitem><para>Set the URL in byte array format (no
|
|
transcoding).</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>items()</term>
|
|
<listitem><para>Return a dictionary of doc object
|
|
		    keys/values.</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>keys()</term>
|
|
		<listitem><para>Return the list of doc object keys (attribute
|
|
names).</para></listitem>
|
|
</varlistentry>
|
|
</variablelist>
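	    <para>Since dates are Unix timestamps printed as strings,
	    they need a conversion before display. The following
	    sketch (Python 3) shows this, along with a small helper
	    copying a few fields through <literal>get()</literal>.
	    Both function names are hypothetical, and the names of the
	    date attributes should be checked in
	    <filename>rcldb/rcldoc.h</filename>.</para>

<programlisting><![CDATA[
from datetime import datetime, timezone


def parse_doc_date(value):
    """Convert a date attribute (Unix timestamp as a string) to a
    timezone-aware UTC datetime."""
    return datetime.fromtimestamp(int(value), tz=timezone.utc)


def doc_to_dict(doc, names=("url", "title")):
    """Copy the named attributes of a Doc into a plain dictionary."""
    return {name: doc.get(name) for name in names}
]]></programlisting>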
|
|
|
|
</sect5> <!-- Doc -->
|
|
|
|
<sect5 id="RCL.PROGRAM.PYTHONAPI.RECOLL.CLASSES.SEARCHDATA">
|
|
<title>The SearchData class</title>
|
|
|
|
<para>A <literal>SearchData</literal> object allows building
|
|
a query by combining clauses, for execution
|
|
by <literal>Query.executesd()</literal>. It can be used
|
|
in replacement of the query language approach. The
|
|
interface is going to change a little, so no detailed doc
|
|
for now...</para>
|
|
|
|
<variablelist>
|
|
|
|
<varlistentry>
|
|
<term>addclause(type='and'|'or'|'excl'|'phrase'|'near'|'sub',
|
|
qstring=string, slack=0, field='', stemming=1,
|
|
subSearch=SearchData)</term>
|
|
<listitem><para></para></listitem>
|
|
</varlistentry>
|
|
</variablelist>
|
|
|
|
</sect5> <!-- SearchData -->
|
|
|
|
</sect4> <!-- recoll.classes -->
|
|
</sect3> <!-- Recoll module -->
|
|
|
|
<sect3 id="RCL.PROGRAM.PYTHONAPI.RCLEXTRACT">
|
|
<title>The rclextract module</title>
|
|
|
|
<para>Index queries do not provide document content (only a
|
|
	  partial and imprecise reconstruction is performed to show the
|
|
snippets text). In order to access the actual document data,
|
|
the data extraction part of the indexing process
|
|
must be performed (subdocument access and format
|
|
translation). This is not trivial in
|
|
general. The <literal>rclextract</literal> module currently
|
|
provides a single class which can be used to access the data
|
|
content for result documents.</para>
|
|
|
|
<sect4 id="RCL.PROGRAM.PYTHONAPI.RCLEXTRACT.CLASSES">
|
|
<title>Classes</title>
|
|
|
|
<sect5 id="RCL.PROGRAM.PYTHONAPI.RCLEXTRACT.CLASSES.EXTRACTOR">
|
|
<title>The Extractor class</title>
|
|
|
|
<variablelist>
|
|
|
|
<varlistentry>
|
|
<term>Extractor(doc)</term>
|
|
<listitem><para>An <literal>Extractor</literal> object is
|
|
built from a <literal>Doc</literal> object, output
|
|
from a query.</para></listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term>Extractor.textextract(ipath)</term>
|
|
<listitem><para>Extract document defined
|
|
by <replaceable>ipath</replaceable> and return
|
|
a <literal>Doc</literal> object. The doc.text field
|
|
has the document text converted to either text/plain or
|
|
text/html according to doc.mimetype. The typical use
|
|
would be as follows:
|
|
<programlisting>
|
|
qdoc = query.fetchone()
|
|
		    extractor = rclextract.Extractor(qdoc)
|
|
doc = extractor.textextract(qdoc.ipath)
|
|
# use doc.text, e.g. for previewing
|
|
</programlisting>
|
|
</para></listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term>Extractor.idoctofile(ipath, targetmtype, outfile='')</term>
|
|
<listitem><para>Extracts document into an output file,
|
|
which can be given explicitly or will be created as a
|
|
temporary file to be deleted by the caller. Typical use:
|
|
<programlisting>
|
|
qdoc = query.fetchone()
|
|
		    extractor = rclextract.Extractor(qdoc)
|
|
filename = extractor.idoctofile(qdoc.ipath, qdoc.mimetype)</programlisting>
|
|
|
|
</para></listitem>
|
|
</varlistentry>
|
|
|
|
</variablelist>
|
|
|
|
</sect5> <!-- Extractor class -->
|
|
</sect4> <!-- rclextract classes -->
|
|
</sect3> <!-- rclextract module -->
|
|
|
|
<sect3 id="RCL.PROGRAM.PYTHONAPI.SEARCH.EXAMPLE">
|
|
<title>Search API usage example</title>
|
|
|
|
<para>The following sample would query the index with a user
|
|
language string. See the <filename>python/samples</filename>
|
|
directory inside the &RCL; source for other
|
|
examples. The <filename>recollgui</filename> subdirectory
|
|
has a very embryonic GUI which demonstrates the
|
|
highlighting and data extraction functions.</para>
|
|
|
|
<programlisting>
|
|
#!/usr/bin/env python
|
|
<![CDATA[
|
|
from recoll import recoll
|
|
|
|
db = recoll.connect()
|
|
db.setAbstractParams(maxchars=80, contextwords=4)
|
|
|
|
query = db.query()
|
|
nres = query.execute("some user question")
|
|
print "Result count: ", nres
|
|
if nres > 5:
|
|
nres = 5
|
|
for i in range(nres):
|
|
doc = query.fetchone()
|
|
print "Result #%d" % (query.rownumber,)
|
|
for k in ("title", "size"):
|
|
print k, ":", getattr(doc, k).encode('utf-8')
|
|
abs = db.makeDocAbstract(doc, query).encode('utf-8')
|
|
print abs
|
|
print
|
|
|
|
]]>
|
|
</programlisting>
|
|
|
|
</sect3>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="RCL.PROGRAM.PYTHONAPI.UPDATE">
|
|
<title>Creating Python external indexers</title>
|
|
|
|
<para>The update API can be used to create an index from data which
|
|
is not accessible to the regular &RCL; indexer, or structured to
|
|
present difficulties to the &RCL; input handlers.</para>
|
|
|
|
      <para>An indexer created using this API will have equivalent work
|
|
      to do as the &RCL; file system indexer: look for modified
|
|
documents, extract their text, call the API for indexing it, take
|
|
care of purging the index out of data from documents which do not
|
|
exist in the document store any more.</para>
|
|
|
|
<para>The data for such an external indexer should be stored in an
|
|
index separate from any used by the &RCL; internal file system
|
|
indexer. The reason is that the main document indexer purge pass
|
|
(removal of deleted documents) would also remove all the documents
|
|
belonging to the external indexer, as they were not seen during the
|
|
filesystem walk. The main indexer documents would also probably be a
|
|
      problem for the external indexer's own purge operation.</para>
|
|
|
|
<para>While there would be ways to enable multiple foreign indexers
|
|
to cooperate on a single index, it is just simpler to use separate
|
|
ones, and use the multiple index access capabilities of the query
|
|
interface, if needed.</para>
|
|
|
|
<para>There are two parts in the update interface:</para>
|
|
|
|
<itemizedlist>
|
|
<listitem><para>Methods inside the <filename>recoll</filename>
|
|
module allow inserting data into the index, to make it accessible by
|
|
the normal query interface.</para></listitem>
|
|
<listitem><para>An interface based on scripts execution is defined
|
|
to allow either the GUI or the <filename>rclextract</filename>
|
|
module to access original document data for previewing or
|
|
editing.</para></listitem>
|
|
</itemizedlist>
|
|
|
|
<sect3 id="RCL.PROGRAM.PYTHONAPI.UPDATE.UPDATE">
|
|
<title>Python update interface</title>
|
|
|
|
<para>The update methods are part of the
|
|
<filename>recoll</filename> module described above. The connect()
|
|
	method is used with a <literal>writable=True</literal> parameter to
|
|
obtain a writable <literal>Db</literal> object. The following
|
|
<literal>Db</literal> object methods are then available.</para>
|
|
|
|
<variablelist>
|
|
|
|
<varlistentry>
|
|
<term>addOrUpdate(udi, doc, parent_udi=None)</term>
|
|
	    <listitem><para>Add or update index data for a given document.
|
|
The <literal>
|
|
<link linkend="RCL.PROGRAM.PYTHONAPI.ELEMENTS.UDI">
|
|
udi</link></literal> string must define a unique id for
|
|
the document. It is an opaque interface element and not
|
|
interpreted inside Recoll. <literal>doc</literal> is a
|
|
<literal>
|
|
<link linkend="RCL.PROGRAM.PYTHONAPI.RECOLL.CLASSES.DOC">
|
|
Doc</link></literal> object, created from the data to be
|
|
indexed (the main text should be in
|
|
<literal>doc.text</literal>). If <literal>
|
|
<link linkend="RCL.PROGRAM.PYTHONAPI.ELEMENTS.PARENTUDI">
|
|
parent_udi</link></literal> is set, this is a unique
|
|
identifier for the top-level container (e.g. for the
|
|
filesystem indexer, this would be the one which is an actual
|
|
file).</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>delete(udi)</term>
|
|
<listitem><para>Purge index from all data for
|
|
<literal>udi</literal>, and all documents (if any) which have a
|
|
	      matching <literal>parent_udi</literal>. </para> </listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>needUpdate(udi, sig)</term>
|
|
<listitem><para>Test if the index needs to be updated for the
|
|
document identified by <literal>udi</literal>. If this call is
|
|
to be used, the <literal>doc.sig</literal> field should contain
|
|
a signature value when calling
|
|
<literal>addOrUpdate()</literal>. The
|
|
<literal>needUpdate()</literal> call then compares its
|
|
parameter value with the stored <literal>sig</literal> for
|
|
<literal>udi</literal>. <literal>sig</literal> is an opaque
|
|
value, compared as a string.</para>
|
|
<para>The filesystem indexer uses a
|
|
concatenation of the decimal string values for file size and
|
|
update time, but a hash of the contents could also be
|
|
used.</para>
|
|
<para>As a side effect, if the return value is false (the index
|
|
is up to date), the call will set the existence flag for the
|
|
document (and any subdocument defined by its
|
|
<literal>parent_udi</literal>), so that a later
|
|
	      <literal>purge()</literal> call will preserve them.</para>
|
|
<para>The use of <literal>needUpdate()</literal> and
|
|
<literal>purge()</literal> is optional, and the indexer may use
|
|
another method for checking the need to reindex or to delete
|
|
stale entries.</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>purge()</term>
|
|
<listitem><para>Delete all documents that were not touched
|
|
during the just finished indexing pass (since
|
|
	      open-for-write). These are the documents for which the needUpdate()
|
|
call was not performed, indicating that they no longer exist in
|
|
the primary storage system.</para></listitem>
|
|
</varlistentry>
|
|
|
|
</variablelist>
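<para>The calls above combine into an indexing pass roughly as
follows. This is only a minimal sketch, built on a hypothetical
in-memory <literal>ToyIndex</literal> class rather than on the real
module: the toy <literal>addOrUpdate()</literal> takes the signature
directly instead of a <literal>Doc</literal> object, and the exact
signature format produced by <literal>file_signature()</literal> is an
assumption.</para>
<programlisting><![CDATA[
```python
import os


def file_signature(path):
    """Build an opaque signature from file size and modification time."""
    st = os.stat(path)
    return str(st.st_size) + str(int(st.st_mtime))


class ToyIndex:
    """In-memory stand-in for an index; NOT the Recoll implementation."""

    def __init__(self):
        self.docs = {}      # udi -> (sig, parent_udi)
        self.seen = set()   # udis flagged as existing during this pass

    def needUpdate(self, udi, sig):
        stored = self.docs.get(udi)
        if stored is not None and stored[0] == sig:
            # Up to date: flag the document and its subdocuments as existing.
            self.seen.add(udi)
            self.seen.update(u for u, (_, parent) in self.docs.items()
                             if parent == udi)
            return False
        return True

    def addOrUpdate(self, udi, sig, parent_udi=None):
        self.docs[udi] = (sig, parent_udi)
        self.seen.add(udi)

    def purge(self):
        """Drop every document not flagged since the pass started."""
        stale = [udi for udi in self.docs if udi not in self.seen]
        for udi in stale:
            del self.docs[udi]
        self.seen.clear()
        return sorted(stale)


def index_pass(index, current):
    """Run one pass; `current` maps udi -> signature in primary storage."""
    updated = []
    for udi, sig in current.items():
        if index.needUpdate(udi, sig):
            index.addOrUpdate(udi, sig)
            updated.append(udi)
    return sorted(updated), index.purge()
```
]]></programlisting>
<para>In this model, a document which disappears from the primary
storage is never passed to <literal>needUpdate()</literal>, so it is
not flagged and the final <literal>purge()</literal> removes
it.</para>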
|
|
|
|
</sect3>
|
|
|
|
<sect3 id="RCL.PROGRAM.PYTHONAPI.UPDATE.ACCESS">
|
|
<title>Query data access for external indexers (1.23)</title>
|
|
|
|
<para>&RCL; has internal methods to access document data for its
|
|
internal (filesystem) indexer. An external indexer needs to provide
|
|
data access methods if it needs integration with the GUI
|
|
(e.g. preview function), or support for the
|
|
<filename>rclextract</filename> module.</para>
|
|
|
|
<para>The index data and the access method are linked by the
|
|
<literal>rclbes</literal> (recoll backend storage)
|
|
<literal>Doc</literal> field. You should set this to a short string
|
|
value identifying your indexer (e.g. the filesystem indexer uses either
|
|
"FS" or an empty value, the Web history indexer uses "BGL").</para>
|
|
|
|
<para>The link is actually performed inside a
|
|
<filename>backends</filename> configuration file (stored in the
|
|
configuration directory). This defines commands to execute to
|
|
access data from the specified indexer. Example, for the mbox
|
|
indexing sample found in the Recoll source (which sets
|
|
<literal>rclbes="MBOX"</literal>):</para>
|
|
<programlisting>[MBOX]
|
|
fetch = /path/to/recoll/src/python/samples/rclmbox.py fetch
|
|
makesig = /path/to/recoll/src/python/samples/rclmbox.py makesig
|
|
</programlisting>
|
|
<para><literal>fetch</literal> and <literal>makesig</literal>
|
|
define two commands to execute to respectively retrieve the
|
|
document text and compute the document signature (the example
|
|
implementation uses the same script with different first parameters
|
|
to perform both operations).</para>
|
|
|
|
<para>The scripts are called with three additional arguments:
|
|
<literal>udi</literal>, <literal>url</literal>,
|
|
<literal>ipath</literal>, stored with the document when it was
|
|
indexed, and may use any or all to perform the requested
|
|
operation. The caller expects the result data on
|
|
<literal>stdout</literal>.</para>
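<para>The calling convention described above can be sketched as
follows. This is not the code from <filename>rclmbox.py</filename>:
the <literal>STORE</literal> dictionary, the
<literal>handle()</literal> and <literal>main()</literal> names, and
the use of a non-zero exit status to report errors are all assumptions
made for the illustration.</para>
<programlisting><![CDATA[
```python
import sys

# Hypothetical in-memory store standing in for the real data source.
STORE = {"msg1": ("Hello from message one\n", "42-1700000000")}


def handle(argv):
    """Return (exit_status, output) for a 'fetch' or 'makesig' request.

    argv is expected to be: script, operation, udi, url, ipath.
    """
    if len(argv) != 5:
        return 1, ""
    _, operation, udi, url, ipath = argv
    record = STORE.get(udi)   # this sketch only needs the udi
    if record is None:
        return 1, ""
    text, sig = record
    if operation == "fetch":
        return 0, text
    if operation == "makesig":
        return 0, sig
    return 1, ""


def main():
    """Write the result on stdout, where the caller expects it."""
    status, output = handle(sys.argv)
    sys.stdout.write(output)
    return status

# A real script would end with: sys.exit(main())
```
]]></programlisting>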
|
|
|
|
</sect3>
|
|
|
|
<sect3 id="RCL.PROGRAM.PYTHONAPI.UPDATE.SAMPLES">
|
|
<title>External indexer samples</title>
|
|
|
|
<para>The Recoll source tree has two samples of external indexers
|
|
in the <filename>src/python/samples</filename> directory. The more
|
|
interesting one is <filename>rclmbox.py</filename> which indexes a
|
|
directory containing <literal>mbox</literal> folder files. It
|
|
exercises most features in the update interface, and has a data
|
|
access interface.</para>
|
|
|
|
<para>See the comments inside the file for more information.</para>
|
|
</sect3>
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.PROGRAM.PYTHONAPI.COMPAT">
|
|
<title>Package compatibility with the previous version</title>
|
|
|
|
<para>The following code fragments can be used to ensure that
|
|
code can run with both the old and the new API (as long as it
|
|
does not use the new abilities of the new API of
|
|
course).</para>
|
|
|
|
<para>Adapting to the new package structure:</para>
|
|
<programlisting>
|
|
<![CDATA[
|
|
try:
    from recoll import recoll
    from recoll import rclextract
    hasextract = True
except ImportError:
    import recoll
    hasextract = False
|
|
]]>
|
|
</programlisting>
|
|
|
|
<para>Adapting to the change of nature of
|
|
the <literal>next</literal> <literal>Query</literal>
|
|
member. The same test can be used to choose to use
|
|
the <literal>scroll()</literal> method (new) or set
|
|
the <literal>next</literal> value (old).</para>
|
|
|
|
<programlisting>
|
|
<![CDATA[
|
|
rownum = query.next if isinstance(query.next, int) else \
    query.rownumber
|
|
]]>
|
|
</programlisting>
|
|
|
|
</sect2> <!-- compat with previous version -->
|
|
|
|
|
|
</sect1>
|
|
</chapter>
|
|
|
|
|
|
<chapter id="RCL.INSTALL">
|
|
<title>Installation and configuration</title>
|
|
|
|
<sect1 id="RCL.INSTALL.BINARY">
|
|
<title>Installing a binary copy</title>
|
|
|
|
|
|
<para>&RCL; binary copies are always distributed as regular
packages for your system. They can be obtained through
the system's normal software distribution framework (e.g.
<application>Debian/Ubuntu apt</application>,
<application>FreeBSD</application> ports, etc.), from some type
of "backports" repository providing versions newer than the standard
ones, or, in some cases, from the &RCL; web site. The most
up-to-date information about &RCL; packages can
usually be found on the
<ulink url="http://www.recoll.org/download.html"><application>Recoll</application>
web site downloads page</ulink>.</para>
|
|
|
|
<para>There used to exist another form of binary install, as
|
|
pre-compiled source trees, but these are just less convenient than
|
|
the packages and don't exist any more.</para>
|
|
|
|
<para>The package management tools will usually automatically
|
|
deal with hard dependencies for packages obtained from a proper
|
|
package repository. You will have to deal with them by hand for
|
|
downloaded packages (for example, when <command>dpkg</command>
|
|
complains about missing dependencies).</para>
|
|
|
|
<para>In all cases, you will have to check or install <link
|
|
linkend="RCL.INSTALL.EXTERNAL">supporting applications</link>
|
|
for the file types that you want to index beyond those that are
|
|
natively processed by &RCL; (text, HTML, email files, and a few
|
|
others).</para>
|
|
|
|
<para>You may also want to have a look at the
|
|
<link linkend="RCL.INSTALL.CONFIG">configuration section</link>
|
|
(but this may not be necessary for a quick test with default
|
|
parameters). Most parameters can be more conveniently set from the
|
|
GUI.</para>
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="RCL.INSTALL.EXTERNAL">
|
|
<title>Supporting packages</title>
|
|
|
|
<note><para>The &WIN; installation of &RCL; is self-contained, and
|
|
only needs Python 2.7 to be externally installed. &WIN; users can
|
|
skip this section.</para></note>
|
|
|
|
<para>&RCL; uses external applications to index some file
|
|
types. You need to install them for the file types that you wish to
|
|
have indexed (these are optional run-time dependencies: none is
needed for building or running &RCL;, except for indexing its
specific file type).</para>
|
|
|
|
<para>After an indexing pass, the commands that were found
|
|
missing can be displayed from the <command>recoll</command>
|
|
<guilabel>File</guilabel> menu. The list is stored in the
|
|
<filename>missing</filename> text file inside the configuration
|
|
directory.</para>
|
|
|
|
<para>A list of common file types which need external
|
|
commands follows. Many of the handlers need the
|
|
<command>iconv</command> command, which is not always listed as a
|
|
dependency.</para>
|
|
|
|
<para>Please note that, due to the relatively dynamic nature of this
|
|
information, the most up to date version is now kept on &RCLAPPS;
|
|
along with links to the home pages or best source/patches pages,
|
|
and misc tips. The list below is not updated often and may be quite
|
|
stale.</para>
|
|
|
|
<para>For many Linux distributions, most of the commands listed can
|
|
be installed from the package repositories. However, the packages
|
|
are sometimes outdated, or not the best version for &RCL;, so you
|
|
should take a look at &RCLAPPS; if a file
|
|
type is important to you.</para>
|
|
|
|
<para>As of &RCL; release 1.14, a number of XML-based formats that
|
|
were handled by ad hoc handler code now use the
|
|
<command>xsltproc</command> command, which usually comes with
|
|
<application>libxslt</application>. These are: abiword, fb2
|
|
(ebooks), kword, openoffice, svg.</para>
|
|
|
|
<para>Now for the list:</para>
|
|
<itemizedlist>
|
|
|
|
<listitem><para>Openoffice files need <command>unzip</command> and
|
|
<command>xsltproc</command>.</para></listitem>
|
|
|
|
<listitem><para>PDF files need <command>pdftotext</command>
|
|
which is part of <application>Poppler</application> (usually
|
|
comes with the <literal>poppler-utils</literal>
|
|
package). Avoid the original one from
|
|
<application>Xpdf</application>.</para></listitem>
|
|
|
|
<listitem><para>Postscript files need <command>pstotext</command>.
|
|
The original version has an issue with shell special
characters in file names, which is corrected in recent
|
|
packages. See &RCLAPPS; for more detail.</para>
|
|
</listitem>
|
|
|
|
<listitem><para>MS Word needs
|
|
<command>antiword</command>. It is also useful to have
|
|
<command>wvWare</command> installed as it may be
used as a fallback for some files which
|
|
<command>antiword</command> does not handle.</para></listitem>
|
|
|
|
<listitem><para>MS Excel and PowerPoint are processed by
|
|
internal <command>Python</command> handlers.</para></listitem>
|
|
|
|
<listitem><para>MS Open XML (docx) needs
<command>xsltproc</command>.</para></listitem>
|
|
|
|
<listitem><para>Wordperfect files need <command>wpd2html</command>
|
|
from the <application>libwpd</application> (or
|
|
<application>libwpd-tools</application> on Ubuntu)
|
|
package.</para></listitem>
|
|
|
|
<listitem><para>RTF files need <command>unrtf</command>,
|
|
which, in its older versions, has much trouble with
|
|
non-western character sets. Many Linux distributions carry
|
|
outdated <command>unrtf</command> versions. Check
|
|
&RCLAPPS; for details.</para></listitem>
|
|
|
|
<listitem><para>TeX files need <command>untex</command> or
|
|
<command>detex</command>. Check &RCLAPPS; for sources if it's not
|
|
packaged for your distribution.</para></listitem>
|
|
|
|
<listitem><para>dvi files need <command>dvips</command>.</para>
|
|
</listitem>
|
|
|
|
<listitem><para>djvu files need <command>djvutxt</command> and
|
|
<command>djvused</command> from the
|
|
<application>DjVuLibre</application> package.</para></listitem>
|
|
|
|
<listitem><para>Audio files: &RCL; releases 1.14 and later use
|
|
a single <application>Python</application> handler based
|
|
on <application>mutagen</application> for all audio file
|
|
types.</para>
|
|
</listitem>
|
|
|
|
<listitem><para>Pictures: &RCL; uses the
|
|
<application>Exiftool</application>
|
|
<application>Perl</application> package to extract tag
|
|
information. Most image file formats are supported. Note that
|
|
there may not be much interest in indexing the technical tags
|
|
(image size, aperture, etc.). This is only of interest if you
|
|
store personal tags or textual descriptions inside the image
|
|
files.</para></listitem>
|
|
|
|
<listitem><para>chm: files in Microsoft help format need Python and
|
|
the <application>pychm</application> module (which needs
|
|
<application>chmlib</application>).</para></listitem>
|
|
|
|
<listitem><para>ICS: up to &RCL; 1.13, iCalendar files need
|
|
<application>Python</application>
|
|
and the <application>icalendar</application>
|
|
module. <application>icalendar</application> is not needed for newer
|
|
versions, which use internal code.</para></listitem>
|
|
|
|
<listitem><para>Zip archives need <application>Python</application>
|
|
(and the standard zipfile module). </para></listitem>
|
|
|
|
<listitem><para>Rar archives need
|
|
<application>Python</application>, the
|
|
<application>rarfile</application> Python module and the
|
|
<command>unrar</command> utility.</para></listitem>
|
|
|
|
<listitem><para>Midi karaoke files need
|
|
<application>Python</application> and the
|
|
<ulink url="http://pypi.python.org/pypi/midi/0.2.1">
|
|
<application>Midi module</application></ulink>.</para>
|
|
</listitem>
|
|
|
|
<listitem><para>Konqueror webarchive files need Python (the handler
uses the standard tarfile module).</para></listitem>
|
|
|
|
<listitem><para>Mimehtml web archive format: support is based on
the email handler, which introduces some mild weirdness, but
the result is still usable.</para></listitem>
|
|
|
|
</itemizedlist>
|
|
|
|
<para>Text, HTML, email folders, and Scribus files are
|
|
processed internally. <application>Lyx</application> is used to
|
|
index Lyx files. Many handlers need <command>iconv</command> and the
|
|
standard <command>sed</command> and <command>awk</command>.
|
|
</para>
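<para>The presence of the helper commands can be checked before
indexing. The following is a minimal sketch, not a &RCL; tool: the
<literal>HELPERS</literal> table only covers a few entries from the
list above, and the <literal>missing_helpers()</literal> name is
hypothetical. It relies on the standard Python
<literal>shutil.which()</literal> function.</para>
<programlisting><![CDATA[
```python
import shutil

# Subset of the helper commands named in the list above.
HELPERS = {
    "PDF": ["pdftotext"],
    "OpenOffice": ["unzip", "xsltproc"],
    "PostScript": ["pstotext"],
    "MS Word": ["antiword"],
    "RTF": ["unrtf"],
    "DjVu": ["djvutxt", "djvused"],
}


def missing_helpers(helpers=HELPERS, which=shutil.which):
    """Map each file type to the helper commands not found on the PATH."""
    missing = {}
    for filetype, commands in helpers.items():
        absent = [cmd for cmd in commands if which(cmd) is None]
        if absent:
            missing[filetype] = absent
    return missing
```
]]></programlisting>
<para>Calling <literal>missing_helpers()</literal> without arguments
checks the current <envar>PATH</envar>. The <literal>which</literal>
parameter only exists so that the lookup can be replaced when
testing.</para>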
|
|
|
|
</sect1>
|
|
|
|
|
|
<sect1 id="RCL.INSTALL.BUILDING">
|
|
<title>Building from source</title>
|
|
|
|
<sect2 id="RCL.INSTALL.BUILDING.PREREQS">
|
|
<title>Prerequisites</title>
|
|
|
|
<para>If you can install any or all of the following through
|
|
the package manager for your system, all the
|
|
better. Especially <application>Qt</application> is a very
|
|
big piece of software, but you will most probably be able to
|
|
find a binary package.</para>
|
|
|
|
<para>If you are building for an exotic or older system, it may
|
|
be useful to note that functional improvements in &RCL;
|
|
have been relatively marginal in recent versions,
|
|
and that you may make your life easier by using an older
|
|
release, without losing major function.</para>
|
|
|
|
<para>The shopping list:</para>
|
|
<itemizedlist>
|
|
|
|
<listitem><para>The <literal>autoconf</literal>,
|
|
<literal>automake</literal> and <literal>libtool</literal>
|
|
triad. Only <literal>autoconf</literal> is needed for &RCL; versions
up to and including 1.21.</para></listitem>
|
|
|
|
<listitem><para>C++ compiler. Recent versions require C++11
|
|
compatibility (1.23 and later).</para>
|
|
</listitem>
|
|
|
|
<listitem><para><command>bison</command> command (for &RCL; 1.21
|
|
and later).</para>
|
|
</listitem>
|
|
|
|
<listitem><para><command>xsltproc</command> command. For building
|
|
the documentation (for &RCL; 1.21
|
|
and later). This sometimes comes with the
|
|
<literal>libxslt</literal> package. And also the Docbook XML and
|
|
style sheet files.</para>
|
|
</listitem>
|
|
|
|
|
|
<listitem><para>Development files
|
|
for <ulink url="http://www.xapian.org"> <application>Xapian
|
|
core</application></ulink>.</para>
|
|
<important>
|
|
<para>If you are
|
|
building Xapian for an older CPU (before Pentium 4 or Athlon
|
|
64), you need to add the <option>--disable-sse</option> flag
|
|
to the configure command. Otherwise all Xapian applications will
|
|
crash with an <literal>illegal instruction</literal>
|
|
error.</para>
|
|
</important>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>Development files for
|
|
<ulink url="http://qt-project.org/downloads">
|
|
<application>Qt 4 or Qt 5</application> </ulink>. &RCL; 1.15.9
|
|
was the last version to support <application>Qt 3</application>.
|
|
If you do not want to install or build the
|
|
<application>Qt Webkit</application> module, &RCL;
|
|
has a configuration option to disable its use (see further).
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>Development files for <application>X11</application> and
|
|
<application>zlib</application>.</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>Development files for <application>Python</application>
|
|
(or use <literal>--disable-python-module</literal>).</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>You may also need
|
|
<ulink url="http://www.gnu.org/software/libiconv/">
|
|
libiconv</ulink>. On <application>Linux</application>
|
|
systems, the iconv interface is part of libc and you should not
|
|
need to do anything special.</para>
|
|
</listitem>
|
|
|
|
</itemizedlist>
|
|
|
|
<para>Check the <ulink url="http://www.recoll.org/download.html">
|
|
&RCL; download page</ulink> for up to date version
|
|
information.</para>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.INSTALL.BUILDING.BUILD">
|
|
<title>Building</title>
|
|
|
|
<para>&RCL; has been built on Linux, FreeBSD, Mac OS X, and Solaris.
Most versions after 2005 should be ok, and maybe some older ones too
|
|
(Solaris 8 is ok). If you build on another system, and
|
|
need to modify things,
|
|
<ulink url="mailto:jfd@recoll.org">I would
|
|
very much welcome patches</ulink>.</para>
|
|
|
|
|
|
<formalpara><title>Configure options:</title>
|
|
<para>
|
|
<itemizedlist>
|
|
|
|
<listitem><para><option>--without-aspell</option>
|
|
will disable the code for phonetic matching of search
|
|
terms. </para></listitem>
|
|
|
|
<listitem><para><option>--with-fam</option> or
|
|
<option>--with-inotify</option> will enable the code for
|
|
real time indexing. Inotify support is enabled by default on
|
|
recent Linux systems.</para></listitem>
|
|
|
|
<listitem><para><option>--with-qzeitgeist</option> will
|
|
enable sending <application>Zeitgeist</application>
|
|
events about the visited search results, and needs
|
|
the <application>qzeitgeist</application>
|
|
package.</para></listitem>
|
|
|
|
<listitem><para><option>--disable-webkit</option> is available
|
|
from version 1.17 to implement the result list with a
|
|
<application>Qt</application> QTextBrowser instead of a
|
|
WebKit widget if you do not or can't depend on the
|
|
latter.</para></listitem>
|
|
|
|
<listitem><para><option>--disable-idxthreads</option> is available
|
|
from version 1.19 to suppress multithreading inside the
|
|
indexing process. You can also use the run-time
|
|
configuration to restrict <command>recollindex</command>
|
|
to using a single thread, but the compile-time option
|
|
may disable a few more unused locks. This only applies
|
|
to the use of multithreading for the core index
|
|
processing (data input). The &RCL; monitor mode always
|
|
uses at least two threads of execution.</para></listitem>
|
|
|
|
<listitem><para><option>--disable-python-module</option> will
|
|
avoid building the <application>Python</application>
|
|
module.</para></listitem>
|
|
|
|
<listitem><para><option>--disable-xattr</option> will prevent
|
|
fetching data from file extended attributes. Beyond a
|
|
few standard attributes, fetching extended attributes
|
|
data can only be useful if some application stores data
|
|
in there, and also needs some simple configuration (see
|
|
comments in the <filename>fields</filename> configuration
|
|
file).</para></listitem>
|
|
|
|
<listitem><para><option>--enable-camelcase</option> will enable
|
|
splitting <replaceable>camelCase</replaceable> words. This
|
|
is not enabled by default as it has the unfortunate
|
|
side-effect of making some phrase searches quite
|
|
confusing: e.g., <literal>"MySQL manual"</literal> would be
|
|
matched by <literal>"MySQL manual"</literal> and
|
|
<literal>"my sql manual"</literal> but not <literal>"mysql
|
|
manual"</literal> (only inside phrase searches).</para>
|
|
</listitem>
|
|
|
|
<listitem><para><option>--with-file-command</option> specifies
the version of the <command>file</command> command to use (e.g.
<option>--with-file-command=/usr/local/bin/file</option>). This can be
useful to enable the GNU version on systems where the native one is
bad.</para> </listitem>
|
|
|
|
<listitem><para><option>--disable-qtgui</option> Disable the Qt
|
|
interface. Will allow building the indexer and the command line
|
|
search program in absence of a Qt environment.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><option>--disable-x11mon</option> Disable
|
|
<application>X11</application> connection monitoring
|
|
inside recollindex. Together with --disable-qtgui, this
|
|
allows building recoll without
|
|
<application>Qt</application> and
|
|
<application>X11</application>.</para> </listitem>
|
|
|
|
<listitem><para><option>--disable-userdoc</option>
|
|
will avoid building the user manual. This avoids having to
|
|
install the Docbook XML/XSL files and the TeX toolchain used for
|
|
translating the manual to PDF.</para></listitem>
|
|
|
|
<listitem><para><option>--disable-pic</option> (&RCL; versions up
|
|
to 1.21 only) will compile
|
|
&RCL; with position-dependent code. This is incompatible with
|
|
building the KIO or the <application>Python</application>
|
|
or <application>PHP</application> extensions, but might
|
|
yield very marginally faster code.</para></listitem>
|
|
|
|
<listitem><para>Of course the usual
|
|
<application>autoconf</application> <command>configure</command>
|
|
options, like <option>--prefix</option> apply.</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
</formalpara>
|
|
|
|
<para>Normal procedure (for source extracted from a tar
|
|
distribution):</para>
|
|
<screen>
|
|
<userinput>cd recoll-xxx</userinput>
|
|
<userinput>./configure</userinput>
|
|
<userinput>make</userinput>
|
|
<userinput>(practices usual hardship-repelling invocations)</userinput>
|
|
</screen>
|
|
|
|
<para>When building from source cloned from the BitBucket repository,
|
|
you also need to install <application>autoconf</application>,
|
|
<application>automake</application>, and
|
|
<application>libtool</application> and you must execute <literal>sh
|
|
autogen.sh</literal> in the top source directory before running
|
|
<literal>configure</literal>.</para>
|
|
|
|
<sect3 id="RCL.INSTALL.BUILDING.BUILD.SOLARIS">
|
|
<title>Building on Solaris</title>
|
|
|
|
<para>We did not test building the GUI on Solaris for recent
|
|
versions. You will need at least Qt 4.4. There are some hints
|
|
on <ulink url="http://www.recoll.org/download-1.14.html">an old
|
|
web site page</ulink>, which may still be valid.</para>
|
|
|
|
<para>Someone did test the 1.19 indexer and Python module build:
|
|
they do work, with a few minor glitches. Be sure to use
|
|
GNU <command>make</command> and <command>install</command>.</para>
|
|
</sect3>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.INSTALL.BUILDING.INSTALL">
|
|
<title>Installation</title>
|
|
|
|
<para>Use <userinput>make install</userinput>
|
|
in the root
|
|
of the source tree. This will copy the commands to
|
|
<filename><replaceable>prefix</replaceable>/bin</filename>
|
|
and the sample configuration files, scripts and other shared
|
|
data to
|
|
<filename><replaceable>prefix</replaceable>/share/recoll</filename>.
|
|
</para>
|
|
|
|
</sect2>
|
|
</sect1>
|
|
|
|
<sect1 id="RCL.INSTALL.CONFIG">
|
|
<title>Configuration overview</title>
|
|
|
|
<para>Most of the parameters specific to the
|
|
<command>recoll</command> GUI are set through the
|
|
<guilabel>Preferences</guilabel> menu and stored in the standard Qt
|
|
place (<filename>$HOME/.config/Recoll.org/recoll.conf</filename>).
|
|
You probably do not want to edit this by hand.</para>
|
|
|
|
<para>&RCL; indexing options are set inside text configuration
|
|
files located in a configuration directory. There can be
|
|
several such directories, each of which defines the parameters
|
|
for one index.</para>
|
|
|
|
<para>The configuration files can be edited by hand or through
|
|
the <guilabel>Index configuration</guilabel> dialog
|
|
(<guilabel>Preferences</guilabel> menu). The GUI tool will try
|
|
to respect your formatting and comments as much as possible,
|
|
so it is quite possible to use both approaches on the same
|
|
configuration.</para>
|
|
|
|
<para>The most accurate documentation for the
|
|
configuration parameters is given by comments inside the default
|
|
files, and we will just give a general overview here.</para>
|
|
|
|
<para>For each index, there are at least two sets of
|
|
configuration files. System-wide configuration files are kept
|
|
in a directory named
|
|
like <filename>/usr/share/recoll/examples</filename>,
|
|
and define default values, shared by all indexes. For each
|
|
index, a parallel set of files defines the customized
|
|
parameters.</para>
|
|
|
|
<para>The default location of the customized configuration is the
|
|
<filename>.recoll</filename>
|
|
directory in your home. Most people will only use this
|
|
directory.</para>
|
|
|
|
<para>This location can be changed, or others can be added with the
|
|
<envar>RECOLL_CONFDIR</envar> environment variable or the
|
|
<option>-c</option> option parameter to <command>recoll</command> and
|
|
<command>recollindex</command>.</para>
|
|
|
|
<para>In addition (as of &RCL; version 1.19.7), it is possible
|
|
to specify two additional configuration directories which will
|
|
be stacked before and after the user configuration
|
|
directory. These are defined by
|
|
the <envar>RECOLL_CONFTOP</envar>
|
|
and <envar>RECOLL_CONFMID</envar> environment
|
|
variables. Values from configuration files inside the top
|
|
directory will override user ones, values from configuration
|
|
files inside the middle directory will override system ones
|
|
and be overridden by user ones. These two variables may be of
|
|
use to applications which augment &RCL; functionality, and
|
|
need to add configuration data without disturbing the user's
|
|
files. Please note that the two, currently single, values will
|
|
probably be interpreted as colon-separated lists in the
|
|
future: do not use colon characters inside the directory
|
|
paths.</para>
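<para>The resulting lookup order can be pictured with the following
sketch. It models each configuration directory as a plain Python
dictionary, which is an assumption made for the illustration, as is
the <literal>lookup()</literal> name. It only shows the priority order
described above: top, then user, then middle, then system.</para>
<programlisting><![CDATA[
```python
def lookup(name, top=None, user=None, middle=None, system=None):
    """Return the value from the highest-priority layer defining `name`."""
    # Priority order described above: top > user > middle > system.
    for layer in (top, user, middle, system):
        if layer and name in layer:
            return layer[name]
    return None
```
]]></programlisting>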
|
|
|
|
<para>If the <filename>.recoll</filename> directory does not
|
|
exist when <command>recoll</command> or
|
|
<command>recollindex</command> is started, it will be created
|
|
with a set of empty configuration files.
|
|
<command>recoll</command> will give you a chance to edit the
|
|
configuration file before starting
|
|
indexing. <command>recollindex</command> will proceed
|
|
immediately. To avoid mistakes, the automatic directory
|
|
creation will only occur for the
|
|
default location, not if <option>-c</option> or
|
|
<envar>RECOLL_CONFDIR</envar> were used (in the latter
|
|
cases, you will have to create the directory).</para>
|
|
|
|
<para>All configuration files share the same format. For
|
|
example, a short extract of the main configuration file might
|
|
look as follows:</para>
|
|
<programlisting>
|
|
# Space-separated list of directories to index.
|
|
topdirs = ~/docs /usr/share/doc
|
|
|
|
[~/somedirectory-with-utf8-txt-files]
|
|
defaultcharset = utf-8
|
|
</programlisting>
|
|
|
|
<para>There are three kinds of lines: </para>
|
|
<itemizedlist>
|
|
<listitem><para>Comment (starts with
|
|
<emphasis>#</emphasis>) or empty.</para>
|
|
</listitem>
|
|
<listitem><para>Parameter assignment (<emphasis>name =
|
|
value</emphasis>).</para>
|
|
</listitem>
|
|
<listitem><para>Section definition
|
|
([<emphasis>somedirname</emphasis>]).</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
|
|
<para>Long lines can be broken by ending each incomplete part with
|
|
a backslash (<literal>\</literal>).</para>
|
|
|
|
<para>Depending on the type of configuration file, section
|
|
definitions either separate groups of parameters or allow
|
|
redefining some parameters for a directory sub-tree. They stay
|
|
in effect until another section definition, or the end of
|
|
file, is encountered. Some of the parameters used for indexing
|
|
are looked up hierarchically from the current directory
|
|
location upwards. Not all parameters can be meaningfully
|
|
redefined; this is specified for each in the next
|
|
section. </para>
|
|
|
|
<para>When found at the beginning of a file path, the tilde
|
|
character (~) is expanded to the name of the user's home
|
|
directory, as a shell would do.</para>
|
|
|
|
<para>Some parameters are lists of strings. White space is used for
|
|
separation. List elements with embedded spaces can be quoted using
|
|
double-quotes. Double quotes inside these elements can be escaped
|
|
with a backslash.</para>
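<para>The quoting rules for string lists can be sketched as follows.
This is an illustration of the rules as stated above, not the &RCL;
parser: the <literal>split_string_list()</literal> name is
hypothetical, and the handling of corner cases (such as an
unterminated quote) is an assumption.</para>
<programlisting><![CDATA[
```python
def split_string_list(value):
    """Split a list value on white space, honouring double quotes."""
    items, current, in_quotes, started = [], [], False, False
    i = 0
    while i < len(value):
        ch = value[i]
        if in_quotes:
            if ch == "\\" and i + 1 < len(value) and value[i + 1] == '"':
                current.append('"')   # escaped quote inside an element
                i += 1
            elif ch == '"':
                in_quotes = False     # closing quote
            else:
                current.append(ch)
        elif ch == '"':
            in_quotes, started = True, True
        elif ch.isspace():
            if started:               # white space ends the element
                items.append("".join(current))
                current, started = [], False
        else:
            current.append(ch)
            started = True
        i += 1
    if started:
        items.append("".join(current))
    return items
```
]]></programlisting>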
|
|
|
|
<para>No value inside a configuration file can contain a newline
|
|
character. Long lines can be continued by escaping the
|
|
physical newline with backslash, even inside quoted strings.</para>
|
|
<programlisting>
|
|
astringlist = "some string \
|
|
with spaces"
|
|
thesame = "some string with spaces"
|
|
</programlisting>
|
|
|
|
<para>Parameters which are not part of string lists can't be
|
|
quoted, and leading and trailing space characters are
|
|
stripped before the value is used.</para>
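<para>The file format described in this section is simple enough to
be read with a few lines of code. The following sketch is not the
&RCL; parser, and the <literal>parse_config()</literal> name is
hypothetical. It assumes that a continued line is obtained by
dropping the backslash and appending the next physical line
unchanged, and it leaves tilde expansion and string list splitting
to the caller.</para>
<programlisting><![CDATA[
```python
def parse_config(text):
    """Parse configuration text into {section: {name: value}}."""
    config = {"": {}}          # "" holds parameters before any section
    section = ""
    pending = ""
    for raw in text.split("\n"):
        line = pending + raw
        if line.endswith("\\"):
            pending = line[:-1]    # drop the backslash, keep accumulating
            continue
        pending = ""
        line = line.strip()
        if not line or line.startswith("#"):
            continue               # empty line or comment
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1].strip()
            config.setdefault(section, {})
            continue
        name, sep, value = line.partition("=")
        if sep:
            # Leading and trailing white space is stripped from the value.
            config[section][name.strip()] = value.strip()
    return config
```
]]></programlisting>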
|
|
|
|
<formalpara>
|
|
<title>Encoding issues</title>
|
|
<para>Most of the configuration parameters are plain ASCII. Two
|
|
particular sets of values may cause encoding issues:</para>
|
|
</formalpara>
|
|
|
|
|
|
<para>
|
|
<itemizedlist>
|
|
<listitem><para>File path parameters may contain non-ascii
|
|
characters and should use the exact same byte values as found in
|
|
the file system directory. Usually, this means that the
|
|
configuration file should use the system default locale
|
|
encoding.</para>
|
|
</listitem>
|
|
<listitem><para>The <envar>unac_except_trans</envar> parameter
|
|
should be encoded in UTF-8. If your system locale is not UTF-8, and
|
|
you need to also specify non-ascii file paths, this poses a
|
|
difficulty because common text editors cannot handle multiple
|
|
encodings in a single file. In this relatively unlikely case, you
|
|
can edit the configuration file as two separate text files with
|
|
appropriate encodings, and concatenate them to create the complete
|
|
configuration.</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
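<para>The concatenation approach can also be scripted. The sketch
below is only an illustration: the
<literal>build_mixed_config()</literal> name and its parameters are
hypothetical, and the caller is assumed to know the encoding used by
the file system for file names.</para>
<programlisting><![CDATA[
```python
def build_mixed_config(path_lines, unac_lines, locale_encoding):
    """Return configuration bytes with two differently encoded parts."""
    # File paths keep the byte values used by the file system.
    path_part = "".join(line + "\n" for line in path_lines)
    # The unac_except_trans value is always encoded as UTF-8.
    unac_part = "".join(line + "\n" for line in unac_lines)
    return path_part.encode(locale_encoding) + unac_part.encode("utf-8")
```
]]></programlisting>
<para>The returned bytes would then be written to the configuration
file opened in binary mode, so that no further encoding conversion
takes place.</para>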
|
|
|
|
<sect2 id="RCL.INSTALL.CONFIG.ENVIR">
|
|
<title>Environment variables</title>
|
|
|
|
<variablelist>
|
|
<varlistentry>
|
|
<term><varname>RECOLL_CONFDIR</varname></term>
|
|
<listitem><para>Defines the main configuration
|
|
directory.</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term><varname>RECOLL_TMPDIR, TMPDIR</varname></term>
|
|
<listitem><para>Locations for temporary files, in this order
|
|
of priority. The default if none of these is set is to use
|
|
<filename>/tmp</filename>. Big temporary files may be created
|
|
during indexing, mostly for decompressing, and also for
|
|
processing, e.g. email attachments.</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term><varname>RECOLL_CONFTOP, RECOLL_CONFMID</varname></term>
|
|
<listitem><para>Allow adding configuration directories with
priorities above and below the user directory, respectively (see the
Configuration overview section above for details).</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term><varname>RECOLL_EXTRA_DBS,
|
|
RECOLL_ACTIVE_EXTRA_DBS</varname></term>
|
|
<listitem><para>
|
|
Help for setting up external indexes. See <link
|
|
linkend="RCL.SEARCH.GUI.MULTIDB">this paragraph</link> for
|
|
explanations.
|
|
</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term><varname>RECOLL_DATADIR</varname></term>
|
|
<listitem><para>Defines replacement for the default location
|
|
of Recoll data files, normally found in, e.g.,
|
|
<filename>/usr/share/recoll</filename>.</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term><varname>RECOLL_FILTERSDIR</varname></term>
|
|
<listitem><para>Defines replacement for the default location
|
|
of Recoll filters, normally found in, e.g.,
|
|
<filename>/usr/share/recoll/filters</filename>.</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term><varname>ASPELL_PROG</varname></term>
|
|
<listitem><para><command>aspell</command> program to use for
|
|
creating the spelling dictionary. The result has to be
|
|
compatible with the <filename>libaspell</filename> which &RCL;
|
|
is using.</para></listitem>
|
|
</varlistentry>
|
|
|
|
|
|
|
|
</variablelist>
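<para>As an illustration of the priority order given above for the
temporary files location, here is a minimal sketch. The
<literal>recoll_tmpdir()</literal> name is hypothetical, and treating
an empty variable as unset is an assumption.</para>
<programlisting><![CDATA[
```python
import os


def recoll_tmpdir(environ=os.environ):
    """Pick the temporary directory in the documented priority order."""
    for name in ("RECOLL_TMPDIR", "TMPDIR"):
        value = environ.get(name)
        if value:
            return value
    return "/tmp"   # default when neither variable is set
```
]]></programlisting>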
|
|
|
|
</sect2>
|
|
|
|
<!-- <sect2 id="RCL.INSTALL.CONFIG.RECOLLCONF"> -->
|
|
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
|
|
href="recoll.conf.xml" />
|
|
|
|
|
|
<sect2 id="RCL.INSTALL.CONFIG.FIELDS">
|
|
<title>The fields file</title>
|
|
|
|
<para>This file contains information about dynamic fields handling
|
|
in &RCL;. Some very basic fields have hard-wired behaviour,
|
|
and, mostly, you should not change the original data inside the
|
|
<filename>fields</filename> file. But you can create custom fields
|
|
fitting your data and handle them just like they were native
|
|
ones.</para>
|
|
|
|
<para>The <filename>fields</filename> file has several sections,
|
|
which each define an aspect of fields processing. Quite often,
|
|
you'll have to modify several sections to obtain the desired
|
|
behaviour.</para>
|
|
|
|
<para>We will only give a short description here; you should refer
|
|
to the comments inside the default file for more detailed
|
|
information.</para>
|
|
|
|
<para>Field names should be lowercase alphabetic ASCII.</para>
|
|
|
|
<variablelist>

<varlistentry>
<term>[prefixes]</term>
<listitem><para>A field becomes indexed (searchable) by having
a prefix defined in this section.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>[stored]</term>
<listitem><para>A field becomes stored (displayable inside
results) by having its name listed in this section (typically
with an empty value).</para>
</listitem>
</varlistentry>
<varlistentry>
<term>[aliases]</term>
<listitem><para>This section defines lists of synonyms for the
canonical names used inside the <literal>[prefixes]</literal>
and <literal>[stored]</literal> sections.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>[queryaliases]</term>
<listitem><para>This section also defines aliases for the
canonical field names, with the difference that the substitution
will only be used at query time, avoiding any possibility that
the value would pick up random metadata from documents.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>handler-specific sections</term>
<listitem><para>Some input handlers may need specific
configuration for handling fields. Only the email message handler
currently has such a section (named
<literal>[mail]</literal>). It allows indexing arbitrary email
headers in addition to the ones indexed by default. Other such
sections may appear in the future.</para>
</listitem>
</varlistentry>

</variablelist>

<para>Here follows a small example of a personal
<filename>fields</filename>
file. This would extract a specific email header and
use it as a searchable field, with data displayable inside result
lists. (Side note: as the email handler does no decoding on the values,
only plain ASCII headers can be indexed, and only the
first occurrence will be used for headers that occur several times.)

<programlisting>[prefixes]
# Index mailmytag contents (with the given prefix)
mailmytag = XMTAG

[stored]
# Store mailmytag inside the document data record (so that it can be
# displayed - as %(mailmytag) - in result lists).
mailmytag =

[queryaliases]
filename = fn
containerfilename = cfn

[mail]
# Extract the X-My-Tag mail header, and use it internally with the
# mailmytag field name
x-my-tag = mailmytag
</programlisting>
</para>


<sect3 id="RCL.INSTALL.CONFIG.FIELDS.XATTR">
<title>Extended attributes in the fields file</title>

<para>&RCL; versions 1.19 and later process user extended
file attributes as document fields by default.</para>

<para>Attributes are processed as fields of the same name,
after removing the <literal>user</literal> prefix on
Linux.</para>

<para>The <literal>[xattrtofields]</literal>
section of the <filename>fields</filename> file allows
specifying translations from extended attribute names to
&RCL; field names. An empty translation disables use of the
corresponding attribute data.</para>
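
<para>As a minimal sketch, a personal <filename>fields</filename>
file could contain a section like the following. The attribute and
field names (<literal>project</literal>, <literal>myproject</literal>,
<literal>scratch</literal>) are made up for illustration, and the
example assumes that attribute names are written without the
<literal>user</literal> prefix, as suggested by the description
above. Depending on how the field is to be used, matching entries in
the <literal>[prefixes]</literal> or <literal>[stored]</literal>
sections may also be needed:
<programlisting>[xattrtofields]
# Hypothetical: use the "project" extended attribute as the "myproject" field
project = myproject
# Hypothetical: empty translation, so that "scratch" attribute data is ignored
scratch =
</programlisting>
</para>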

</sect3>

</sect2>

<sect2 id="RCL.INSTALL.CONFIG.MIMEMAP">
<title>The mimemap file</title>

<para><filename>mimemap</filename> specifies the
file name extension to MIME type mappings.</para>

<para>For file names without an extension, or with an unknown one,
a system command (<command>file</command> <option>-i</option>, or
<command>xdg-mime</command>) will be executed to determine the MIME
type (this can be switched off, or the command changed inside the
main configuration file).</para>

<para>All extension values in <filename>mimemap</filename> must be
entered in lower case. File name extensions are lower-cased for
comparison during indexing, meaning that an upper case
<filename>mimemap</filename> entry will never be matched.</para>

<para>The mappings can be specified on a per-subtree basis,
which may be useful in some cases. For example,
<application>okular</application> notes have a
<filename>.xml</filename> extension but
should be handled specially, which is possible because they
are usually all located in one place:
<programlisting>[~/.kde/share/apps/okular/docdata]
.xml = application/x-okular-notes</programlisting></para>

<para>As of &RCL; version 1.21, the <varname>recoll_noindex</varname>
<filename>mimemap</filename> variable has been moved to
<filename>recoll.conf</filename> and renamed to
<varname>noContentSuffixes</varname>, while keeping the same
function. For older &RCL; versions,
see the documentation for <varname>noContentSuffixes</varname>
but use <varname>recoll_noindex</varname> in
<filename>mimemap</filename>.</para>

</sect2>

<sect2 id="RCL.INSTALL.CONFIG.MIMECONF">
<title>The mimeconf file</title>

<para><filename>mimeconf</filename> specifies how the
different MIME types are handled for indexing, and which icons
are displayed in the <command>recoll</command> result lists.</para>

<para>Changing the parameters in the [index] section is
probably not a good idea unless you are a &RCL;
developer.</para>

<para>The [icons] section allows you to change the icons which
are displayed by <command>recoll</command> in the result
lists. The values are the basenames of the PNG images inside
the <filename>iconsdir</filename> directory (specified in
<filename>recoll.conf</filename>).</para>

</sect2>

<sect2 id="RCL.INSTALL.CONFIG.MIMEVIEW">
<title>The mimeview file</title>

<para><filename>mimeview</filename> specifies which programs
are started when you click on an <guilabel>Open</guilabel> link
in a result list. For example, HTML is normally displayed using
<application>firefox</application>, but you may prefer
<application>Konqueror</application>, or your
<application>openoffice.org</application>
program might be named <command>ooffice</command> instead of
<command>openoffice</command>.</para>

<para>Changes to this file can be made by direct editing, or
through the <command>recoll</command> GUI preferences dialog.</para>

<para>If <guilabel>Use desktop preferences to choose document
editor</guilabel> is checked in the &RCL; GUI preferences, all
<filename>mimeview</filename> entries will be ignored except the
one labelled <literal>application/x-all</literal> (which is set to
use <command>xdg-open</command> by default).</para>

<para>In this case, the <literal>xallexcepts</literal> top level
variable defines a list of MIME type exceptions which
will be processed according to the local entries instead of being
passed to the desktop. This is so that specific &RCL; options
such as a page number or a search string can be passed to
applications that support them, such as the
<application>evince</application> viewer.</para>

<para>As for the other configuration files, the normal usage
is to have a <filename>mimeview</filename> inside your own
configuration directory, with just the non-default entries,
which will override those from the central configuration
file.</para>

<para>All viewer definition entries must be placed under a
<literal>[view]</literal> section.</para>

<para>The keys in the file are normally MIME types. You can add an
application tag to specialize the choice for an area of the
filesystem (using a <varname>localfields</varname> specification
in <filename>mimeconf</filename>). The syntax for the key is
<replaceable>mimetype</replaceable><literal>|</literal><replaceable>tag</replaceable>.</para>

<para>The <varname>nouncompforviewmts</varname> entry (placed at
the top level, outside of the <literal>[view]</literal> section)
holds a list of MIME types that should not be uncompressed before
starting the viewer (if they are found compressed, e.g.
<replaceable>mydoc.doc.gz</replaceable>).</para>

<para>The right side of each assignment holds a command to be
executed for opening the file. The following substitutions are
performed:</para>

<itemizedlist>
<listitem>
<formalpara><title>%D</title>
<para>Document date</para></formalpara>
</listitem>

<listitem><formalpara><title>%f</title>
<para>File name. This may be the name of a temporary file if
it was necessary to create one (e.g. to extract a subdocument
from a container).</para></formalpara>
</listitem>

<listitem><formalpara><title>%i</title>
<para>Internal path, for subdocuments of containers. The
format depends on the container type. If this appears in the
command line, &RCL; will not create a temporary file to
extract the subdocument, expecting the called application
(possibly a script) to be able to handle it.</para></formalpara>
</listitem>

<listitem><formalpara><title>%M</title>
<para>MIME type</para></formalpara>
</listitem>

<listitem><formalpara><title>%p</title>
<para>Page index. Only significant for a subset of document
types, currently only PDF, PostScript and DVI files. Can be
used to start the editor at the right page for a match or
snippet.</para></formalpara>
</listitem>

<listitem><formalpara><title>%s</title>
<para>Search term. The value will only be set for documents
with indexed page numbers (e.g. PDF). The value will be one of
the matched search terms. It would allow pre-setting the
value in the "Find" entry inside Evince for example, for easy
highlighting of the term.</para></formalpara>
</listitem>

<listitem><formalpara><title>%u</title>
<para>URL.</para></formalpara>
</listitem>
</itemizedlist>

<para>In addition to the predefined values above, all strings like
<literal>%(fieldname)</literal> will be replaced by the value of
the field named <literal>fieldname</literal> for the
document. This could be used in combination with field
customisation to help with opening the document.</para>
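
<para>As an illustration only, the following personal
<filename>mimeview</filename> entry uses the <literal>%p</literal>,
<literal>%s</literal> and <literal>%f</literal> substitutions
described above. It assumes that <application>evince</application>
is installed and accepts the <option>--page-index</option> and
<option>--find</option> options. Check the default
<filename>mimeview</filename> and the viewer documentation before
relying on it:
<programlisting>[view]
# Open PDF files at the page of the match, with the search term preset
application/pdf = evince --page-index=%p --find=%s %f
</programlisting>
</para>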

</sect2>

<sect2 id="RCL.INSTALL.CONFIG.PTRANS">
<title>The <filename>ptrans</filename> file</title>

<para><filename>ptrans</filename> specifies query-time path
translations. These can be useful
in <link linkend="RCL.SEARCH.PTRANS">multiple
cases</link>.</para>
<para>The file has a section for any index which needs
translations, either the main one or additional query
indexes. The sections are named with the &XAP; index
directory names. No slash character should exist at the end
of the paths (all comparisons are textual). An example
should make things sufficiently clear:</para>

<programlisting>
[/home/me/.recoll/xapiandb]
/this/directory/moved = /to/this/place

[/path/to/additional/xapiandb]
/server/volume1/docdir = /net/server/volume1/docdir
/server/volume2/docdir = /net/server/volume2/docdir
</programlisting>

</sect2>


<sect2 id="RCL.INSTALL.CONFIG.EXAMPLES">
<title>Examples of configuration adjustments</title>

<sect3 id="RCL.INSTALL.CONFIG.EXAMPLES.ADDVIEW">
<title>Adding an external viewer for a non-indexed type</title>

<para>Imagine that you have some kind of file which does not
have indexable content, but for which you would like to have a
functional <guilabel>Open</guilabel> link in the result list
(when found by file name). The file names end in
<replaceable>.blob</replaceable> and can be displayed by
application <replaceable>blobviewer</replaceable>.</para>

<para>You need two entries in the configuration files for this
to work:</para>

<itemizedlist>
<listitem><para>In <filename>$RECOLL_CONFDIR/mimemap</filename>
(typically <filename>~/.recoll/mimemap</filename>), add the
following line:<programlisting>
.blob = application/x-blobapp
</programlisting>
Note that the MIME type is made up here, and you could
call it <replaceable>diesel/oil</replaceable> just the
same.</para>
</listitem>
<listitem><para>In <filename>$RECOLL_CONFDIR/mimeview</filename>
under the <literal>[view]</literal> section, add:</para>
<programlisting>
application/x-blobapp = blobviewer %f
</programlisting>
<para>We are supposing here
that <replaceable>blobviewer</replaceable> wants a file
name parameter. You would use <literal>%u</literal> if
it liked URLs better.</para>
</listitem>
</itemizedlist>

<para>If you just wanted to change the application used by
&RCL; to display a MIME type which it already knows, you
would just need to edit <filename>mimeview</filename>. The
entries you add in your personal file override those in the
central configuration, which you do not need to
alter. <filename>mimeview</filename> can also be modified
from the GUI.</para>

</sect3>

<sect3 id="RCL.INSTALL.CONFIG.EXAMPLES.ADDINDEX">
<title>Adding indexing support for a new file type</title>

<para>Let us now imagine that the above
<replaceable>.blob</replaceable> files actually contain
indexable text and that you know how to extract it with a
command line program. Getting &RCL; to index the files is
easy. You need to perform the above alteration, and also
add data to the <filename>mimeconf</filename> file
(typically in <filename>~/.recoll/mimeconf</filename>):</para>
<itemizedlist>
<listitem><para>Under the <literal>[index]</literal>
section, add the following line (more about the
<replaceable>rclblob</replaceable> indexing script
later):<programlisting>
application/x-blobapp = exec rclblob
</programlisting></para>
</listitem>
<listitem><para>Under the <literal>[icons]</literal>
section, you should choose an icon to be displayed for the
files inside the result lists. Icons are normally 64x64
pixel PNG files which live in
<filename>/usr/share/recoll/images</filename>.</para>
</listitem>
<listitem><para>Under the <literal>[categories]</literal>
section, you should add the MIME type where it makes sense
(you can also create a category). Categories may be used
for filtering in advanced search.</para>
</listitem>
</itemizedlist>
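
<para>Taken together, the additions to
<filename>~/.recoll/mimeconf</filename> could look like the
following sketch. The icon name <literal>document</literal> and the
category name <literal>blobs</literal> are assumptions made for the
example: use the basename of an image which actually exists in your
images directory, and whichever category suits your data:
<programlisting>[index]
application/x-blobapp = exec rclblob

[icons]
# Assumed: an image named document.png exists in the images directory
application/x-blobapp = document

[categories]
# Hypothetical new category holding only the new type
blobs = application/x-blobapp
</programlisting>
</para>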

<para>The <replaceable>rclblob</replaceable> handler should
be an executable program or script which exists inside
<filename>/usr/share/recoll/filters</filename>. It
will be given a file name as argument and should output the
text or HTML contents on the standard output.</para>

<para>The <link linkend="RCL.PROGRAM.FILTERS">filter
programming</link> section describes in more detail how
to write an input handler.</para>
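
<para>As a rough illustration of the calling convention described
above (a file name as argument, text on the standard output), a
minimal <replaceable>rclblob</replaceable> could be a shell script
like the following sketch. It assumes, purely for the example, that
a <replaceable>.blob</replaceable> file is plain text mixed with
non-printable bytes. The <literal>blob_to_text</literal> function
name is made up, and a real handler would call whatever program
actually decodes the format:
<programlisting>#!/bin/sh
# rclblob: hypothetical handler sketch for .blob files.
# Usage: rclblob filename
# Writes the extracted plain text to the standard output.

blob_to_text() {
    # Assumed extraction step: keep printable ASCII, tabs and newlines,
    # and drop every other byte.
    cat -- "$1" | LC_ALL=C tr -cd '\11\12\40-\176'
}

# The file name is expected as the single argument.
if [ "$#" -eq 1 ]; then
    blob_to_text "$1"
fi
</programlisting>
</para>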

</sect3>

</sect2>

</sect1>

</chapter>

</book>