<!-- Use this header for the FreeBSD sgml toolchain -->
|
|
<!-- NOTE: the sgml version should be saved as ISO-8859-1. -->
|
|
<!DOCTYPE BOOK PUBLIC "-//FreeBSD//DTD DocBook V4.1-Based Extension//EN" [
|
|
|
|
<!-- Use this header for going XML -->
|
|
<!-- <!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
|
|
"http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd" [ -->
|
|
|
|
<!ENTITY RCL "<application>Recoll</application>">
|
|
<!ENTITY RCLAPPS "<ulink url='http://www.recoll.org/features.html'>Recoll helper applications page</ulink>">
|
|
<!ENTITY RCLVERSION "1.19">
|
|
<!ENTITY XAP "<application>Xapian</application>">
|
|
<!ENTITY WIKI "http://bitbucket.org/medoc/recoll/wiki/">
|
|
]>
|
|
|
|
<book lang="en">
|
|
|
|
<bookinfo>
|
|
<title>Recoll user manual</title>
|
|
|
|
<author>
|
|
<firstname>Jean-Francois</firstname>
|
|
<surname>Dockes</surname>
|
|
<affiliation>
|
|
<address><email>jfd@recoll.org</email></address>
|
|
</affiliation>
|
|
</author>
|
|
|
|
<copyright>
|
|
<year>2005-2012</year>
|
|
<holder role="mailto:jfd@recoll.org">Jean-Francois Dockes</holder>
|
|
</copyright>
|
|
<abstract>
|
|
|
|
<para><literal>Permission is granted to copy, distribute and/or
|
|
modify this document under the terms of the GNU Free Documentation
|
|
License, Version 1.3 or any later version published by the Free
|
|
Software Foundation; with no Invariant Sections, no Front-Cover
|
|
Texts, and no Back-Cover Texts. A copy of the license can be
|
|
found at the following
|
|
location: <ulink url="http://www.gnu.org/licenses/fdl.html">GNU
|
|
web site</ulink>.</literal></para>
|
|
|
|
<para>This document introduces full text search notions
|
|
and describes the installation and use of the &RCL;
|
|
application. It currently describes &RCL; &RCLVERSION;.</para>
|
|
<!-- <para>[ <ulink url="index.html">Split HTML</ulink> /
|
|
<ulink url="usermanual-xml.html">Single HTML</ulink> ]</para>
|
|
-->
|
|
</abstract>
|
|
|
|
|
|
</bookinfo>
|
|
|
|
<chapter id="RCL.INTRODUCTION">
|
|
<title>Introduction</title>
|
|
|
|
<sect1 id="RCL.INTRODUCTION.TRYIT">
|
|
<title>Giving it a try</title>
|
|
|
|
<para>If you do not like reading manuals (who does?) and would like
|
|
to give &RCL; a try, just <link
|
|
linkend="RCL.INSTALL.BINARY">install</link> the application and
|
|
start the <command>recoll</command> graphical user interface (GUI),
|
|
which will ask to index your home directory by default, allowing
|
|
you to search immediately after indexing completes.</para>
|
|
|
|
<para>Do not do this if your home directory contains a huge
|
|
number of documents and you do not want to wait or are very
|
|
short on disk space. In this case, you may first want to customize
|
|
the <link linkend="RCL.INDEXING.CONFIG">configuration</link>
|
|
to restrict the indexed area.</para>
|
|
|
|
<para>Also be aware that you may need to install the
|
|
appropriate <link linkend="RCL.INSTALL.EXTERNAL"> supporting
|
|
applications</link> for document types that need them (for
|
|
example <application>antiword</application> for
|
|
<application>Microsoft Word</application> files).</para>
|
|
</sect1>
|
|
|
|
<sect1 id="RCL.INTRODUCTION.SEARCH">
|
|
<title>Full text search</title>
|
|
|
|
<para>&RCL; is a full text search application. Full text search
|
|
applications let you find your data by content rather
|
|
than by external attributes (like a file name). More
|
|
specifically, they will let you specify words (terms) that
|
|
should or should not appear in the text you are looking for,
|
|
and return a list of matching documents, ordered so that the
|
|
most <emphasis>relevant</emphasis> documents will appear
|
|
first.</para>
|
|
|
|
<para>You do not need to remember in what file or email message you
|
|
stored a given piece of information. You just ask for related
|
|
terms, and the tool will return a list of documents where
|
|
these terms are prominent, in a similar way to Internet search
|
|
engines.</para>
|
|
|
|
<para>A search application tries to determine which documents are
|
|
most relevant to the search terms you provide. Computer algorithms
|
|
for determining relevance can be very complex, and in general are
|
|
inferior to the power of the human mind to rapidly determine
|
|
relevance. The quality of relevance guessing is probably the most
|
|
important aspect when evaluating a search application.</para>
|
|
|
|
<para>In many cases, you are looking for all the forms of a
|
|
word, not for a specific form or spelling. These different forms
|
|
may include plurals, different tenses for a verb, or terms derived
|
|
from the same root or <emphasis>stem</emphasis> (example: floor,
|
|
floors, floored, flooring...). Search applications usually expand
|
|
queries to all such related terms (words that reduce to the same
|
|
stem) and also provide a way to disable this expansion if you are
|
|
actually searching for a specific form.</para>
|
|
|
|
<para>Stemming, by itself, does not accommodate misspellings or
|
|
phonetic searches. &RCL; supports these features through a specific
|
|
tool (the <literal>term explorer</literal>) which will let you
|
|
explore the set of index terms in different modes.</para>
|
|
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="RCL.INTRODUCTION.RECOLL">
|
|
<title>Recoll overview</title>
|
|
|
|
<para>&RCL; uses the
|
|
<ulink url="http://www.xapian.org">&XAP;</ulink> information retrieval
|
|
library as its storage and retrieval engine. &XAP; is a very
|
|
mature package using <ulink
|
|
url="http://www.xapian.org/docs/intro_ir.html">a sophisticated
|
|
probabilistic ranking model</ulink>. &RCL; provides the mechanisms
|
|
and interface to get data into and out of the system.</para>
|
|
|
|
<para>In practice, &XAP; works by remembering where terms appear
|
|
in your document files. The acquisition process is called
|
|
indexing. </para>
|
|
|
|
<para>The resulting index can be big (roughly the size of the
|
|
original document set), but it is not a document
|
|
archive. &RCL; can only display documents that still exist at
|
|
the place from which they were indexed. (Actually, there is a
|
|
way to reconstruct a document from the information in the
|
|
index, but the result is not nice, as all formatting,
|
|
punctuation and capitalization are lost).</para>
|
|
|
|
<para>&RCL; stores all internal data in <application>Unicode
|
|
UTF-8</application> format, and it can index files with
|
|
different character sets, encodings, and languages into the same
|
|
index. It has input filters for many document types.</para>
|
|
|
|
<para>Stemming is the process by which &RCL; reduces words to
|
|
their radicals so that searching does not depend, for example, on a
|
|
word being singular or plural (floor, floors), or on a verb tense
|
|
(flooring, floored). Because the mechanisms used for stemming
|
|
depend on the specific grammatical rules for each language, there
|
|
is a separate &XAP; stemmer module for most common languages where
|
|
stemming makes sense.</para>
|
|
|
|
<para>&RCL; stores the unstemmed versions of terms in the main index
|
|
and uses auxiliary databases for term expansion (one for each
|
|
stemming language), which means that you can switch stemming
|
|
languages between searches, or add a language without needing a
|
|
full reindex.</para>
|
|
|
|
<para>Storing documents written in different languages in the same
|
|
index is possible, and commonly done. In this situation, you can
|
|
specify several stemming languages for the index. </para>
|
|
|
|
<para>&RCL; currently makes no attempt at automatic language
|
|
recognition, which means that the stemmer will sometimes be applied
|
|
to terms from other languages with potentially strange results. In
|
|
practice, even if this introduces possibilities of confusion, this
|
|
approach has proven quite useful, and it is much less
|
|
cumbersome than separating your documents according to what
|
|
language they are written in.</para>
|
|
|
|
<para>Before version 1.18, &RCL; stripped most accents and
|
|
diacritics from terms, and converted them to lower case before
|
|
either storing them in the index or searching for them. As a
|
|
consequence, it was impossible to search for a particular
|
|
capitalization of a term (<literal>US</literal> /
|
|
<literal>us</literal>), or to discriminate two terms based on
|
|
diacritics (<literal>sake</literal> / <literal>saké</literal>,
|
|
<literal>mate</literal> / <literal>maté</literal>).</para>
|
|
|
|
<para>As of version 1.18, &RCL; can optionally store the raw terms,
|
|
without accent stripping or case conversion. In this configuration,
|
|
it is still possible (and most common) for a query to be
|
|
insensitive to case and/or diacritics. Appropriate term expansions
|
|
are performed before actually accessing the main index. This is
|
|
described in more detail in the <link
|
|
linkend="RCL.INDEXING.CONFIG.SENS">section about index case and
|
|
diacritics sensitivity</link>.</para>
|
|
|
|
<para>&RCL; has many parameters which define exactly what to
|
|
index, and how to classify and decode the source
|
|
documents. These are kept in <link
|
|
linkend="RCL.INDEXING.CONFIG">configuration files</link>. A
|
|
default configuration is copied into a standard location
|
|
(usually something like
|
|
<filename>/usr/[local/]share/recoll/examples</filename>)
|
|
during installation. The default values set by the
|
|
configuration files in this directory may be overridden by
|
|
values that you set inside your personal configuration, found
|
|
by default in the <filename>.recoll</filename> sub-directory
|
|
of your home directory. The default configuration will index
|
|
your home directory with default parameters and should be
|
|
sufficient for giving &RCL; a try, but you may want to adjust
|
|
it later, which can be done either by editing the text files
|
|
or by using configuration menus in the
|
|
<command>recoll</command> GUI. Some other parameters affecting only
|
|
the <command>recoll</command> GUI are stored in the standard
|
|
location defined by <application>Qt</application>.</para>
|
|
|
|
<para>The <link linkend="RCL.INDEXING.PERIODIC.EXEC">indexing
|
|
process</link> is started automatically the first time you
|
|
execute the <command>recoll</command> GUI. Indexing can also be
|
|
performed by executing the <command>recollindex</command>
|
|
command.</para>
|
|
|
|
<para><link linkend="RCL.SEARCH">Searches</link> are usually
|
|
performed inside the <command>recoll</command> GUI, which has many
|
|
options to help you find what you are looking for. However, there
|
|
are other ways to perform &RCL; searches: mostly a <link
|
|
linkend="RCL.SEARCH.COMMANDLINE">
|
|
command line interface</link>, a
|
|
<link linkend="RCL.PROGRAM.API.PYTHON">
|
|
<application>Python</application>
|
|
programming interface</link>, a <link linkend="RCL.SEARCH.KIO">
|
|
<application>KDE</application> KIO slave module</link>, and
|
|
an <ulink url="&WIKI;UnityLens">Ubuntu Unity Lens</ulink> module.
|
|
</para>
|
|
|
|
</sect1>
|
|
</chapter>
|
|
|
|
|
|
<chapter id="RCL.INDEXING">
|
|
<title>Indexing</title>
|
|
|
|
<sect1 id="RCL.INDEXING.INTRODUCTION">
|
|
<title>Introduction</title>
|
|
|
|
<para>Indexing is the process by which the set of documents is
|
|
analyzed and the data entered into the database. &RCL;
|
|
indexing is normally incremental: documents will only be
|
|
processed if they have been modified. On the first execution,
|
|
all documents will need processing. A full index build can be
|
|
forced later by specifying an option to the indexing command
|
|
(<command>recollindex</command> <option>-z</option>
|
|
or <option>-Z</option>).</para>
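<para>As a rough sketch of the two cases, assuming the default
configuration directory is in use:<programlisting>
# Incremental update: only new or modified documents are processed
recollindex

# Full rebuild: reset the index, then index everything again
recollindex -z
</programlisting></para>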
|
|
|
|
<para>The following sections give an overview of different
|
|
aspects of the indexing processes and configuration, with links
|
|
to detailed sections.</para>
|
|
|
|
<sect2 id="RCL.INDEXING.INTRODUCTION.MODES">
|
|
<title>Indexing modes</title>
|
|
|
|
<para>&RCL; indexing can be performed in two different modes:
|
|
<itemizedlist>
|
|
<listitem>
|
|
<formalpara>
|
|
<title><link linkend="RCL.INDEXING.PERIODIC">
|
|
Periodic (or batch) indexing:</link></title>
|
|
<para>indexing takes place at discrete
|
|
times, by executing the <command>recollindex</command>
|
|
command. The typical usage is to have a nightly indexing run
|
|
<link linkend="RCL.INDEXING.PERIODIC.AUTOMAT">
|
|
programmed</link> into
|
|
your <command>cron</command> file.</para>
|
|
</formalpara>
|
|
</listitem>
|
|
<listitem>
|
|
<formalpara><title><link linkend="RCL.INDEXING.MONITOR">Real
|
|
time indexing:</link></title>
|
|
<para>indexing takes place as soon as a file is created or
|
|
changed. <command>recollindex</command> runs as a daemon
|
|
and uses a file system alteration monitor such as
|
|
<application>inotify</application>,
|
|
<application>Fam</application> or
|
|
<application>Gamin</application>
|
|
to detect file changes.</para>
|
|
</formalpara>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
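<para>As an illustration only, the two modes could be set up as
follows. The schedule (03:00 every night) is an arbitrary choice, and
the <option>-m</option> monitor option is taken from the
<command>recollindex</command> manual page rather than from this
section, so check it against your installed version:<programlisting>
# Periodic indexing: crontab entry (edit with crontab -e)
0 3 * * * recollindex

# Real time indexing: run the indexer as a monitoring daemon
recollindex -m
</programlisting></para>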
|
|
<para>The choice between the two methods is mostly a matter of
|
|
preference, and they can be combined by setting up multiple
|
|
indexes (e.g. use periodic indexing on a big documentation
|
|
directory, and real time indexing on a small home
|
|
directory). Monitoring a big file system tree can consume
|
|
significant system resources.</para>
|
|
|
|
<para>The choice of method and the parameters used can be
|
|
configured from the <command>recoll</command> GUI:
|
|
<menuchoice>
|
|
<guimenu>Preferences</guimenu>
|
|
<guimenuitem>Indexing schedule</guimenuitem>
|
|
</menuchoice>
|
|
</para>
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.INDEXING.INTRODUCTION.CONFIG">
|
|
<title>Configurations, multiple indexes</title>
|
|
|
|
<para>The parameters describing what is to be indexed and
|
|
local preferences are defined in text files contained in a
|
|
<link linkend="RCL.INDEXING.CONFIG">configuration
|
|
directory</link>.</para>
|
|
|
|
<para>All parameters have defaults, defined in system-wide
|
|
files.</para>
|
|
|
|
<para>Without further configuration, &RCL; will index all
|
|
appropriate files from your home directory, with a reasonable
|
|
set of defaults.</para>
|
|
|
|
<para>A default personal configuration directory
|
|
(<filename>$HOME/.recoll/</filename>) is created
|
|
when a &RCL; program is first executed. It is possible to
|
|
create other configuration directories, and use them by
|
|
setting the <envar>RECOLL_CONFDIR</envar> environment
|
|
variable, or giving the <option>-c</option> option to any of
|
|
the &RCL; commands.</para>
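<para>For example, assuming a hypothetical alternate configuration
directory named <filename>~/.recoll-work</filename>, either form
below would select it:<programlisting>
# Using the environment variable
RECOLL_CONFDIR=~/.recoll-work recollindex

# Using the command line option
recollindex -c ~/.recoll-work
recoll -c ~/.recoll-work
</programlisting></para>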
|
|
|
|
<para>In some cases, it may be interesting to index different
|
|
areas of the file system to separate databases. You can do this
|
|
by using multiple configuration directories, each indexing a
|
|
file system area to a specific database. Typically, this
|
|
would be done to separate personal and shared
|
|
indexes, or to take advantage of the organization of your data
|
|
to improve search precision.</para>
|
|
|
|
<para>The generated indexes can
|
|
be queried concurrently in a transparent manner.</para>
|
|
|
|
<para>For index generation, multiple configurations are
|
|
totally independent from each other. When multiple indexes need
|
|
to be used for a single search,
|
|
<link linkend="RCL.INDEXING.CONFIG.MULTIPLE">some parameters
|
|
should be consistent among the configurations</link>.</para>
|
|
|
|
</sect2>
|
|
|
|
<sect2>
|
|
<title>Document types</title>
|
|
<para>&RCL; knows about quite a few different document
|
|
types. The parameters for document type recognition and
|
|
processing are set in
|
|
<link linkend="RCL.INDEXING.CONFIG">configuration files</link>.</para>
|
|
|
|
<para>Most file types, like HTML or word processing files, only hold
|
|
one document. Some file types, like email folders or zip
|
|
archives, can hold many individually indexed documents, which may
|
|
themselves be compound ones. Such hierarchies can go quite
|
|
deep, and &RCL; can process, for example, a
|
|
<application>LibreOffice</application>
|
|
document stored as an attachment to an email message inside an
|
|
email folder archived in a zip file...</para>
|
|
|
|
<para>&RCL; indexing processes plain text, HTML, OpenDocument
|
|
(Open/LibreOffice), email formats, and a few others internally.</para>
|
|
|
|
<para>Other file types (e.g. postscript, pdf, ms-word, rtf...)
|
|
need external applications for preprocessing. The list is in the
|
|
<link linkend="RCL.INSTALL.EXTERNAL"> installation</link>
|
|
section. After every indexing operation, &RCL; updates a list of
|
|
commands that would be needed for indexing existing file
|
|
types. This list can be displayed by selecting the menu option
|
|
<menuchoice>
|
|
<guimenu>File</guimenu>
|
|
<guimenuitem>Show Missing Helpers</guimenuitem>
|
|
</menuchoice>
|
|
in the <command>recoll</command> GUI. It is stored in the
|
|
<filename>missing</filename> text file inside the configuration
|
|
directory.</para>
|
|
|
|
<para>By default, &RCL; will try to index any file type that
|
|
it has a way to read. This is sometimes not desirable, and
|
|
there are ways to either exclude some types, or on the
|
|
contrary to define a positive list of types to be
|
|
indexed. In the latter case, any type not in the list will
|
|
be ignored.</para>
|
|
<para>Excluding types can be done by adding name patterns to
|
|
the <literal>skippedNames</literal> list, which can be done
|
|
from the GUI Index configuration menu. It is also possible
|
|
to exclude a MIME type independently of the file name by
|
|
associating it with the <filename>rclnull</filename>
|
|
filter. This can be done by editing
|
|
the <link linkend="RCL.INSTALL.CONFIG.MIMECONF">
|
|
<filename>mimeconf</filename> configuration
|
|
file</link>.</para>
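<para>A minimal sketch of a <literal>skippedNames</literal> setting
in <filename>recoll.conf</filename> follows. The patterns are
arbitrary examples. As a value set in your personal configuration
overrides the default one, you would probably want to start from the
default list found in the sample file rather than from an empty
one:<programlisting>
# File name patterns which will not be indexed (example values)
skippedNames = *~ *.bak *.tmp
</programlisting></para>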
|
|
|
|
<para>In order to define a positive list, you need to edit the
|
|
<link linkend="RCL.INSTALL.CONFIG.RECOLLCONF">main
|
|
configuration file
|
|
(<filename>recoll.conf</filename>)</link> and set
|
|
the <literal>indexedmimetypes</literal> configuration
|
|
variable. Example:<programlisting>
|
|
indexedmimetypes = text/html application/pdf
|
|
</programlisting>
|
|
There is no GUI way to do this, because this option runs a
|
|
bit contrary to the main goal of &RCL;, which is to help you find
|
|
information, independently of how it may be stored.
|
|
</para>
|
|
|
|
|
|
|
|
|
|
</sect2>
|
|
|
|
|
|
<sect2>
|
|
<title>Recovery</title>
|
|
<para>In the rare case where the index becomes corrupted (which can
|
|
manifest itself as odd search results or crashes), the index files
|
|
need to be erased before restarting a clean indexing pass. Just delete
|
|
the <filename>xapiandb</filename> directory (see
|
|
<link linkend="RCL.INDEXING.STORAGE">next section</link>), or,
|
|
alternatively, start the next <command>recollindex</command> with the
|
|
<option>-z</option> option, which will reset the database before
|
|
indexing.</para>
|
|
</sect2>
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="RCL.INDEXING.STORAGE">
|
|
<title>Index storage</title>
|
|
|
|
<para>The default location for the index data is the
|
|
<filename>xapiandb</filename> subdirectory of the &RCL;
|
|
configuration directory, typically
|
|
<filename>$HOME/.recoll/xapiandb/</filename>. This can be
|
|
changed via two different methods (with different purposes):
|
|
<itemizedlist>
|
|
<listitem><para>You can specify a different configuration
|
|
directory by setting the <envar>RECOLL_CONFDIR</envar>
|
|
environment variable, or using the <option>-c</option>
|
|
option to the &RCL; commands. This method would typically be
|
|
used to index different areas of the file system to
|
|
different indexes. For example, if you were to issue the
|
|
following commands:
|
|
<programlisting>
|
|
export RECOLL_CONFDIR=~/.indexes-email
|
|
recoll
|
|
</programlisting> Then &RCL; would use configuration files
|
|
stored in <filename>~/.indexes-email/</filename> and,
|
|
(unless specified otherwise in
|
|
<filename>recoll.conf</filename>) would look for
|
|
the index in
|
|
<filename>~/.indexes-email/xapiandb/</filename>.</para>
|
|
|
|
<para>Using multiple configuration directories and <link
|
|
linkend="RCL.INSTALL.CONFIG.RECOLLCONF">configuration
|
|
options</link> allows you to tailor multiple configurations and
|
|
indexes to handle whatever subset of the available data you wish
|
|
to make searchable.</para>
|
|
|
|
</listitem>
|
|
|
|
<listitem><para>For a given configuration directory, you can
|
|
specify a non-default storage location for the index by setting
|
|
the <varname>dbdir</varname> parameter in the configuration file
|
|
(see the <link
|
|
linkend="RCL.INSTALL.CONFIG.RECOLLCONF">configuration
|
|
section</link>). This method would mainly be of use if you wanted
|
|
to keep the configuration directory in its default location, but
|
|
desired another location for the index, typically because of disk
space concerns.</para>
|
|
</listitem>
|
|
|
|
</itemizedlist>
|
|
</para>
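<para>For instance, to keep the configuration in its default place
while storing the index on another volume (the path is a made-up
example), <filename>recoll.conf</filename> could
contain:<programlisting>
# Index data location for this configuration (hypothetical path)
dbdir = /data/recoll/xapiandb
</programlisting></para>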
|
|
|
|
<para>The size of the index is determined by the size of the set
|
|
of documents, but the ratio can vary a lot. For a typical
|
|
mixed set of documents, the index size will often be close to
|
|
the data set size. In specific cases (a set of compressed mbox
|
|
files for example), the index can become much bigger than the
|
|
documents. It may also be much smaller if the documents
|
|
contain a lot of images or other non-indexed data (an extreme
|
|
example being a set of mp3 files where only the tags would be
|
|
indexed).</para>
|
|
|
|
<para>Of course, images, sound and video do not increase the
|
|
index size, which means that nowadays (2012), typically, even a big
|
|
index will be negligible against the total amount of data on the
|
|
computer.</para>
|
|
|
|
<para>The index data directory (<filename>xapiandb</filename>)
|
|
only contains data that can be completely rebuilt by an index run
|
|
(as long as the original documents exist), and it can always be
|
|
destroyed safely.</para>
|
|
|
|
<sect2 id="RCL.INDEXING.STORAGE.FORMAT">
|
|
<title>&XAP; index formats</title>
|
|
|
|
<para>&XAP; versions usually support several formats for index
|
|
storage. A given major &XAP; version will have a current format,
|
|
used to create new indexes, and will also support the format from
|
|
the previous major version.</para>
|
|
|
|
<para>&XAP; will not automatically convert an existing index
|
|
from the older format to the newer one. If you want to upgrade to
|
|
the new format, or if a very old index needs to be converted
|
|
because its format is not supported any more, you will have to
|
|
explicitly delete the old index, then run a normal indexing
|
|
process.</para>
|
|
|
|
<para>Using the <option>-z</option> option to
|
|
<command>recollindex</command> is not sufficient to change the
|
|
format: you will have to delete all files inside the index
|
|
directory (typically <filename>~/.recoll/xapiandb</filename>)
|
|
before starting the indexing.</para>
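<para>With the default layout, the sequence might look as
follows. Adjust the path if <varname>dbdir</varname> or the
configuration directory were changed, and double-check it before
running the removal command:<programlisting>
# Remove the old index data (default location assumed)
rm -rf ~/.recoll/xapiandb

# Rebuild the index, which will use the current format
recollindex
</programlisting></para>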
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.INDEXING.STORAGE.SECURITY">
|
|
<title>Security aspects</title>
|
|
|
|
<para>The &RCL; index does not hold copies of the indexed
|
|
documents. But it does hold enough data to allow for an almost
|
|
complete reconstruction. If confidential data is indexed,
|
|
access to the database directory should be restricted. </para>
|
|
|
|
<para>&RCL; (since version 1.4) will create the configuration
|
|
directory with a mode of 0700 (access by owner only). As the
|
|
index data directory is by default a sub-directory of the
|
|
configuration directory, this should result in appropriate
|
|
protection.</para>
|
|
|
|
<para>If you use another setup, you should think of the kind
|
|
of protection you need for your index, set the directory
|
|
and files access modes appropriately, and also maybe adjust
|
|
the <literal>umask</literal> used during index updates.</para>
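<para>As a sketch, for an index stored in a hypothetical
<filename>/data/recoll/xapiandb</filename> directory, owner-only
access could be enforced with standard commands:<programlisting>
# Remove all group and other permissions on the index data
chmod -R go-rwx /data/recoll/xapiandb

# Create new index files with owner-only permissions
umask 077
recollindex
</programlisting></para>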
|
|
|
|
|
|
</sect2>
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="RCL.INDEXING.CONFIG">
|
|
<title>Index configuration</title>
|
|
|
|
<para>Variables set inside the
|
|
<link linkend="RCL.INSTALL.CONFIG">&RCL; configuration files</link>
|
|
control which areas of the file system are indexed, and how
|
|
files are processed. These variables can be set either by
|
|
editing the text files or by using the
|
|
<link linkend="RCL.INDEXING.CONFIG.GUI"> dialogs in the
|
|
<command>recoll</command> GUI</link>.</para>
|
|
|
|
<para>The first time you start <command>recoll</command>, you
|
|
will be asked whether or not you would like it to build the
|
|
index. If you want to adjust the configuration before
|
|
indexing, just click <guilabel>Cancel</guilabel> at this
|
|
point, which will get you into the configuration interface. If
|
|
you exit at this point, <command>recoll</command> will have
|
|
created a <filename>~/.recoll</filename> directory containing
|
|
empty configuration files, which you can edit by hand.</para>
|
|
|
|
<para>The configuration is documented inside the
|
|
<link linkend="RCL.INSTALL.CONFIG">installation chapter</link>
|
|
of this document, or in the
|
|
<citerefentry>
|
|
<refentrytitle>recoll.conf</refentrytitle>
|
|
<manvolnum>5</manvolnum>
|
|
</citerefentry>
|
|
man page, but the most
|
|
current information will most likely be the comments inside the
|
|
sample file. The most immediately useful variable you may
|
|
be interested in is probably
|
|
<link linkend="RCL.INSTALL.CONFIG.RECOLLCONF.TOPDIRS">
|
|
<varname>topdirs</varname></link>,
|
|
which determines what subtrees get indexed.</para>
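<para>For example, to restrict indexing to two subtrees of the home
directory (the directory names are placeholders),
<filename>recoll.conf</filename> could contain:<programlisting>
# Space-separated list of directories to index recursively
topdirs = ~/Documents ~/Mail
</programlisting></para>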
|
|
|
|
<para>The applications needed to index file types other than
|
|
text, HTML or email (e.g. pdf, postscript, ms-word...) are
|
|
described in the <link linkend="RCL.INSTALL.EXTERNAL">external
|
|
packages section</link>.</para>
|
|
|
|
<para>As of Recoll 1.18 there are two incompatible types of Recoll
|
|
indexes, depending on the treatment of character case and
|
|
diacritics. A <link linkend="RCL.INDEXING.CONFIG.SENS">later section</link> describes the two types in more
|
|
detail.</para>
|
|
|
|
<sect2 id="RCL.INDEXING.CONFIG.MULTIPLE">
|
|
<title>Multiple indexes</title>
|
|
|
|
<para>Multiple &RCL; indexes can be created by
|
|
using several configuration directories which are usually set to
|
|
index different areas of the file system. A specific index can
|
|
be selected for updating or searching, using the
|
|
<envar>RECOLL_CONFDIR</envar> environment variable or the
|
|
<option>-c</option> option to <command>recoll</command> and
|
|
<command>recollindex</command>.</para>
|
|
|
|
<para>A typical usage scenario for the multiple index feature
|
|
would be for a system administrator to set up a central index
|
|
for shared data, that you choose to search or not in addition to
|
|
your personal data. Of course, there are other
|
|
possibilities. There are many cases where you know the subset of
|
|
files that should be searched, and where narrowing the search
|
|
can improve the results. You can achieve approximately the same
|
|
effect with the directory filter in advanced search, but
|
|
multiple indexes will have much better performance and may be
|
|
worth the trouble.</para>
|
|
|
|
<para>A <command>recollindex</command> program instance can only
|
|
update one specific index.</para>
|
|
|
|
<para>The main index (defined by
|
|
<envar>RECOLL_CONFDIR</envar> or <option>-c</option>) is
|
|
always active. If this is undesirable, you can set up your
|
|
base configuration to index an empty directory.</para>
|
|
|
|
<para>The different search interfaces (GUI, command line, ...)
|
|
have different methods to define the set of indexes to be
|
|
used, see the appropriate section.</para>
|
|
|
|
<para>If a set of multiple indexes is to be used together for
|
|
searches, some configuration parameters must be consistent
|
|
among the set. These are parameters which need to be the same
|
|
when indexing and searching. As the parameters come from the
|
|
main configuration when searching, they need to be compatible
|
|
with what was set when creating the other indexes (which came
|
|
from their respective configuration directories).</para>
|
|
|
|
<para>Most importantly, all indexes to be queried concurrently must
|
|
have the same option concerning character case and diacritics
|
|
stripping, but there are other constraints. Most of the
|
|
relevant parameters are described in the
|
|
<link linkend="RCL.INSTALL.CONFIG.RECOLLCONF.TERMS">linked
|
|
section</link>.</para>
|
|
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="RCL.INDEXING.CONFIG.SENS">
|
|
<title>Index case and diacritics sensitivity</title>
|
|
|
|
<para>As of &RCL; version 1.18 you have a choice of building an
|
|
index with terms stripped of character case and diacritics, or
|
|
one with raw terms. For a source term of
|
|
<literal>Résumé</literal>, the former will store
|
|
<literal>resume</literal>, the latter
|
|
<literal>Résumé</literal>.</para>
|
|
|
|
<para>Each type of index allows performing searches insensitive to
|
|
case and diacritics: with a raw index, the user entry will be
|
|
expanded to match all case and diacritics variations present in
|
|
the index. With a stripped index, the search term will be stripped
|
|
before searching.</para>
|
|
|
|
<para>A raw index allows for another possibility which a stripped
|
|
index cannot offer: using case and diacritics to discriminate
|
|
between terms, returning different results when searching for
|
|
<literal>US</literal> and <literal>us</literal> or
|
|
<literal>resume</literal> and <literal>résumé</literal>.
|
|
Read the <link linkend="RCL.SEARCH.CASEDIAC">section about search
|
|
case and diacritics sensitivity</link> for more details.</para>
|
|
|
|
<para>The type of index to be created is controlled by the
|
|
<literal>indexStripChars</literal> configuration
|
|
variable which can only be changed by editing the
|
|
configuration file. Any change implies an index reset (not
|
|
automated by &RCL;), and all indexes in a search must be set
|
|
in the same way (again, not checked by &RCL;). </para>
|
|
|
|
<para>If the <literal>indexStripChars</literal> variable is not set, &RCL;
|
|
1.18 creates a stripped index by default, for
|
|
compatibility with previous versions.</para>
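<para>A raw index would then be requested with a line like the
following in <filename>recoll.conf</filename>. The value is given
here on the assumption that 1 means stripping and 0 means keeping raw
terms, which should be checked against the comments in the sample
file:<programlisting>
# 1: strip case and diacritics (default), 0: store raw terms
indexStripChars = 0
</programlisting>Remember to reset the index after changing this
value, for example with <command>recollindex</command>
<option>-z</option>.</para>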
|
|
|
|
<para>As a cost for added capability, a raw index will be slightly
|
|
bigger than a stripped one (around 10%). Also, searches will be
|
|
more complex, so probably slightly slower, and the feature is
|
|
still young, so that a certain amount of weirdness cannot be
|
|
excluded.</para>
|
|
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="RCL.INDEXING.CONFIG.GUI">
|
|
<title>The index configuration GUI</title>
|
|
|
|
<para>Most parameters for a given index configuration can
|
|
be set from a <command>recoll</command> GUI running on this
|
|
configuration (either as default, or by setting
|
|
<envar>RECOLL_CONFDIR</envar> or the <option>-c</option>
|
|
option).</para>
|
|
|
|
<para>The interface is started from the
|
|
<menuchoice>
|
|
<guimenu>Preferences</guimenu>
|
|
<guimenuitem>Index Configuration</guimenuitem>
|
|
</menuchoice>
|
|
menu entry. It is divided into four tabs,
|
|
<guilabel>Global parameters</guilabel>, <guilabel>Local
|
|
parameters</guilabel>, <guilabel>Web history</guilabel>
|
|
(which is explained in the next section) and <guilabel>Search
|
|
parameters</guilabel>.</para>
|
|
|
|
<para>The <guilabel>Global parameters</guilabel> tab allows setting
|
|
global variables, like the lists of top directories, skipped paths,
|
|
or stemming languages.</para>
|
|
|
|
<para>The <guilabel>Local parameters</guilabel> tab allows setting
|
|
variables that can be redefined for subdirectories. This second tab
|
|
has an initially empty list of customisation directories, to which
|
|
you can add. The variables are then set for the currently selected
|
|
directory (or at the top level if the empty line is
|
|
selected).</para>
|
|
|
|
<para>The <guilabel>Search parameters</guilabel> section defines
|
|
parameters which are used at query time, but are global to an
|
|
index and affect all search tools, not only the GUI.</para>
|
|
|
|
<para>The meaning for most entries in the interface is
|
|
self-evident and documented by a <literal>ToolTip</literal>
|
|
popup on the text label. For more detail, you will need to
|
|
refer to the <link linkend="RCL.INSTALL.CONFIG">configuration
|
|
section</link> of this guide.</para>
|
|
|
|
<para>The configuration tool normally respects the comments
|
|
and most of the formatting inside the configuration file, so
|
|
that it is quite possible to use it on hand-edited files,
|
|
        which you might nevertheless want to back up first.</para>
|
|
|
|
</sect2>
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="RCL.INDEXING.WEBQUEUE">
|
|
    <title>Indexing WEB pages you visit</title>
|
|
|
|
<para>With the help of a <application>Firefox</application>
|
|
extension, &RCL; can index the Internet pages that you visit. The
|
|
extension was initially designed for the
|
|
      <application>Beagle</application> indexer, but it has recently been
|
|
renamed and better adapted to &RCL;.</para>
|
|
|
|
<para>The extension works by copying visited WEB pages to an indexing
|
|
queue directory, which &RCL; then processes, indexing the data,
|
|
storing it into a local cache, then removing the file from the
|
|
queue.</para>
|
|
|
|
<para>This feature can be enabled in the GUI
|
|
<guilabel>Index configuration</guilabel>
|
|
panel, or by editing the configuration file (set
|
|
<varname>processwebqueue</varname> to 1).</para>
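    <para>As a minimal sketch, the corresponding line in the main
      configuration file could look like the following. This assumes
      the usual <filename>recoll.conf</filename> file inside the
      configuration directory, which this section does not name
      explicitly:</para>

<programlisting>
# Enable processing of the WEB history queue
processwebqueue = 1
</programlisting>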
|
|
|
|
<para>A current pointer to the extension can be found, along with
|
|
up-to-date instructions, on the
|
|
<ulink url="&WIKI;IndexWebHistory">Recoll wiki</ulink>.</para>
|
|
|
|
    <para>A copy of the indexed WEB pages is retained by &RCL; in a
|
|
local cache (from which previews can be fetched). The cache size can
|
|
be adjusted from the <guilabel>Index configuration</guilabel> /
|
|
<guilabel>Web history</guilabel> panel. Once the maximum size
|
|
is reached, old pages are purged - both from the cache and the index
|
|
- to make room for new ones, so you need to explicitly archive in
|
|
some other place the pages that you want to keep
|
|
indefinitely.</para>
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="RCL.INDEXING.EXTATTR">
|
|
<title>Extended attributes data</title>
|
|
|
|
<para>User extended attributes are named pieces of information
|
|
that most modern file systems can attach to any file.</para>
|
|
|
|
<para>&RCL; versions 1.19 and later process extended attributes
|
|
as document fields by default. For older versions, this has to
|
|
be activated at build time.</para>
|
|
|
|
<para>A
|
|
<ulink url="http://www.freedesktop.org/wiki/CommonExtendedAttributes">
|
|
freedesktop standard</ulink> defines a few special
|
|
attributes, which are handled as such by &RCL;:
|
|
<variablelist>
|
|
<varlistentry>
|
|
<term>mime_type</term>
|
|
<listitem><para>If set, this overrides any other
|
|
determination of the file mime type.</para></listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term>charset</term>
|
|
          <listitem><para>If set, this defines the file character set
              (mostly useful for plain text files).</para></listitem>
|
|
</varlistentry>
|
|
</variablelist>
|
|
</para>
|
|
|
|
<para>By default, other attributes are handled as &RCL; fields.
|
|
On Linux, the <literal>user</literal> prefix is removed from
|
|
the name. This can be configured more precisely inside
|
|
the <link linkend="RCL.INSTALL.CONFIG.FIELDS">
|
|
<filename>fields</filename> configuration file</link>.
|
|
</para>
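    <para>As an illustration only, the two special attributes could be
      set from the shell on a Linux system with the
      <command>setfattr</command> command. This is an assumption about
      your system: the tool belongs to the separate
      <application>attr</application> utilities, not to &RCL;, and the
      file system must support user extended attributes. The helper
      function names and the file names are hypothetical.</para>

<programlisting>
# Record the character set of a plain text file ($1) in its extended
# attributes. "user." is the Linux user attribute namespace.
set_charset() {
    setfattr -n user.charset -v "$2" "$1"
}

# Force the MIME type to be used for a file ($1).
set_mime_type() {
    setfattr -n user.mime_type -v "$2" "$1"
}

# Possible usage:
#   set_charset notes.txt iso-8859-1
#   set_mime_type report.dat text/plain
</programlisting>

    <para>The attributes are read when the file is next indexed, so the
      file probably has to be reindexed before the change becomes
      visible.</para>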
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="RCL.INDEXING.EXTTAGS">
|
|
<title>Importing external tags</title>
|
|
|
|
<para>During indexing, it is possible to import metadata for
|
|
each file by executing commands. For example, this could
|
|
extract user tag data for the file and store it in a field for
|
|
indexing.</para>
|
|
|
|
<para>See the
|
|
<link linkend="RCL.INSTALL.CONFIG.RECOLLCONF.METADATACMDS">section
|
|
about the <literal>metadatacmds</literal> field</link> in
|
|
the main configuration chapter for more detail.</para>
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="RCL.INDEXING.PERIODIC">
|
|
<title>Periodic indexing</title>
|
|
|
|
<sect2 id="RCL.INDEXING.PERIODIC.EXEC">
|
|
<title>Running indexing</title>
|
|
|
|
<para>Indexing is always performed by the
|
|
<command>recollindex</command> program, which can be started
|
|
either from the command line or from the <guimenu>File</guimenu>
|
|
menu in the <command>recoll</command> GUI program. When started
|
|
from the GUI, the indexing will run on the same configuration
|
|
<command>recoll</command> was started on. When started from the
|
|
command line, <command>recollindex</command> will use the
|
|
<envar>RECOLL_CONFDIR</envar> variable or accept a
|
|
<option>-c</option> <replaceable>confdir</replaceable> option
|
|
to specify a non-default configuration directory.</para>
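      <para>For example, assuming a hypothetical secondary
        configuration stored in <filename>~/.recoll-work</filename>,
        either of the following small wrappers would index it from the
        command line. They are written as shell functions so that
        nothing runs until one of them is called.</para>

<programlisting>
# Hypothetical configuration directory, adjust to your setup
confdir=$HOME/.recoll-work

# Select the configuration with the -c option ...
index_with_option() {
    recollindex -c "$confdir"
}

# ... or through the RECOLL_CONFDIR environment variable
index_with_env() {
    RECOLL_CONFDIR="$confdir" recollindex
}
</programlisting>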
|
|
|
|
<para>If the <command>recoll</command> program finds no index
|
|
when it starts, it will automatically start indexing (except
|
|
if canceled).</para>
|
|
|
|
<para>The <command>recollindex</command> indexing process can be
|
|
interrupted by sending an interrupt (<keysym>Ctrl-C</keysym>,
|
|
SIGINT) or terminate
|
|
(SIGTERM) signal. Some time may elapse before the process exits,
|
|
because it needs to properly flush and close the index. This can
|
|
also be done from the <command>recoll</command> GUI
|
|
<menuchoice>
|
|
<guimenu>File</guimenu>
|
|
<guimenuitem>Stop Indexing</guimenuitem>
|
|
</menuchoice>
|
|
menu entry.</para>
|
|
|
|
<para>After such an interruption, the index will be somewhat
|
|
inconsistent because some operations which are normally
|
|
performed at the end of the indexing pass will have been
|
|
skipped (for example, the stemming and spelling databases
|
|
        will be nonexistent or out of date). You just need to restart
|
|
indexing at a later time to restore consistency. The
|
|
indexing will restart at the interruption point (the full
|
|
file tree will be traversed, but files that were indexed up
|
|
to the interruption and for which the index is still up to
|
|
date will not need to be reindexed).</para>
|
|
|
|
<para><command>recollindex</command> has a number of other options
|
|
which are described in its man page. Only a few will be
|
|
described here.</para>
|
|
<para>Option <option>-z</option> will reset the index when
|
|
starting. This is almost the same as destroying the index
|
|
files (the nuance is that the &XAP; format version will not
|
|
be changed).</para>
|
|
<para>Option <option>-Z</option> will force the update of all
|
|
documents without resetting the index first. This will not
|
|
have the "clean start" aspect of <option>-z</option>, but
|
|
the advantage is that the index will remain available for
|
|
querying while it is rebuilt, which can be a significant
|
|
advantage if it is very big (some installations need days
|
|
for a full index rebuild).</para>
|
|
<para>Of special interest also, maybe, are
|
|
the <option>-i</option> and
|
|
<option>-f</option> options. <option>-i</option> allows
|
|
indexing an explicit list of files (given as command line
|
|
parameters or read on <literal>stdin</literal>).
|
|
<option>-f</option> tells
|
|
<command>recollindex</command> to ignore file selection
|
|
parameters from the configuration. Together, these options allow
|
|
building a custom file selection process for some area of the
|
|
file system, by adding the top directory to the
|
|
<varname>skippedPaths</varname> list and using an appropriate
|
|
file selection method to build the file list to be fed to
|
|
<command>recollindex</command> <option>-if</option>.
|
|
Trivial example:</para>
|
|
<programlisting>
|
|
find . -name indexable.txt -print | recollindex -if
|
|
</programlisting>
|
|
|
|
<para><command>recollindex</command> <option>-i</option> will
|
|
not descend into subdirectories specified as parameters,
|
|
but just add them as index entries. It is
|
|
up to the external file selection method to build the complete
|
|
file list.</para>
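      <para>The following sketch shows one possible file selection
        method. It lists regular files only, since directories given to
        <command>recollindex</command> <option>-i</option> are not
        descended into. The function names, the
        <literal>*.txt</literal> pattern and the one day age limit are
        arbitrary choices made for the example.</para>

<programlisting>
# Print the regular files under directory $1 which are named *.txt and
# were modified during the last day.
list_recent_text_files() {
    find "$1" -type f -name '*.txt' -mtime -1 -print
}

# Index exactly these files, ignoring the file selection parameters
# from the configuration.
index_recent_text_files() {
    list_recent_text_files "$1" | recollindex -if
}
</programlisting>

      <para>Note that file names containing newline characters would
        not survive this kind of line-oriented list.</para>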
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.INDEXING.PERIODIC.AUTOMAT">
|
|
<title>Using <command>cron</command> to automate
|
|
indexing</title>
|
|
|
|
<para>The most common way to set up indexing is to have a cron
|
|
task execute it every night. For example the following
|
|
<filename>crontab</filename> entry would do it every day at
|
|
3:30AM (supposing <command>recollindex</command> is in your
|
|
PATH):
|
|
|
|
<screen><![CDATA[
|
|
30 3 * * * recollindex > /some/tmp/dir/recolltrace 2>&1
|
|
]]></screen>
|
|
|
|
Or, using <command>anacron</command>:
|
|
<screen><![CDATA[
|
|
1  15  recollindex  su mylogin -c "recollindex > /tmp/rcltraceme 2>&1"
|
|
]]></screen>
|
|
</para>
|
|
|
|
<para>As of version 1.17 the &RCL; GUI has dialogs to manage
|
|
<filename>crontab</filename> entries for
|
|
<command>recollindex</command>. You can reach them from the
|
|
<menuchoice>
|
|
<guimenu>Preferences</guimenu>
|
|
<guimenuitem>Indexing Schedule</guimenuitem>
|
|
</menuchoice>
|
|
menu. They only
|
|
work with the good old <command>cron</command>, and do not give
|
|
access to all features of <command>cron</command> scheduling.</para>
|
|
|
|
<para>The usual command to edit your
|
|
<filename>crontab</filename> is <command>crontab</command>
|
|
<option>-e</option> (which will usually start the
|
|
<command>vi</command> editor to edit the file). You may have
|
|
more sophisticated tools available on your system.</para>
|
|
|
|
<para>Please be aware that there may be differences between your
|
|
usual interactive command line environment and the one seen by
|
|
crontab commands. Especially the PATH variable may be of
|
|
concern. Please check the crontab manual pages about possible
|
|
issues.</para>
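      <para>One possible workaround, sketched here under the assumption
        that your <command>cron</command> implementation accepts
        variable assignments inside the <filename>crontab</filename>
        (many do, but please check the manual page), is to set
        <envar>PATH</envar> explicitly before the entry. The directory
        list is only an example. Using the full path to
        <command>recollindex</command> in the entry is an
        alternative.</para>

<screen><![CDATA[
PATH=/usr/local/bin:/usr/bin:/bin
30 3 * * * recollindex > /some/tmp/dir/recolltrace 2>&1
]]></screen>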
|
|
|
|
|
|
</sect2>
|
|
</sect1>
|
|
|
|
<sect1 id="RCL.INDEXING.MONITOR">
|
|
<title>Real time indexing</title>
|
|
|
|
<para>Real time monitoring/indexing is performed by starting the
|
|
<command>recollindex</command> <option>-m</option> command.
|
|
With this option, <command>recollindex</command> will detach
|
|
from the terminal and become a daemon, permanently monitoring
|
|
file changes and updating the index.</para>
|
|
|
|
<para>Under <application>KDE</application>,
|
|
<application>Gnome</application> and some other desktop
|
|
      environments, the daemon can be started automatically when you log
|
|
in, by creating a desktop file inside the
|
|
<filename>~/.config/autostart</filename> directory. This can be
|
|
done for you by the &RCL; GUI. Use the
|
|
      <menuchoice>
        <guimenu>Preferences</guimenu>
        <guimenuitem>Indexing Schedule</guimenuitem>
      </menuchoice>
      menu.</para>
|
|
|
|
<para>With older <application>X11</application> setups, starting
|
|
the daemon is normally performed as part of the user session
|
|
script.</para>
|
|
|
|
<para>The <filename>rclmon.sh</filename> script can be used to
|
|
easily start and stop the daemon. It can be found in the
|
|
<filename>examples</filename> directory (typically
|
|
<filename>/usr/local/[share/]recoll/examples</filename>).</para>
|
|
|
|
<para>For example, my out of fashion
|
|
<application>xdm</application>-based session has a
|
|
<filename>.xsession</filename> script with the following lines
|
|
at the end:</para>
|
|
|
|
<programlisting>recollconf=$HOME/.recoll-home
|
|
recolldata=/usr/local/share/recoll
|
|
RECOLL_CONFDIR=$recollconf $recolldata/examples/rclmon.sh start
|
|
|
|
fvwm
|
|
|
|
</programlisting>
|
|
|
|
<para>The indexing daemon gets started, then the window manager,
|
|
      for which the session waits.</para>

    <para>By default the
      indexing daemon will monitor the state of the X11 session, and
      exit when it finishes, so it is not necessary to kill it
      explicitly. (The <application>X11</application> server
      monitoring can be disabled with option <option>-x</option> to
      <command>recollindex</command>.)</para>
|
|
|
|
<para>If you use the daemon completely out of an
|
|
<application>X11</application> session, you need to add option
|
|
<option>-x</option> to disable <application>X11</application> session monitoring (else
|
|
the daemon will not start).</para>
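    <para>As a small sketch, starting the daemon from a text console or
      a boot-time script could then look like the following. The
      function name is hypothetical, and only the
      <option>-m</option> and <option>-x</option> options described
      above are used.</para>

<programlisting>
# Start the real time indexing daemon with X11 session monitoring
# disabled, for use outside of any X11 session.
start_monitor_without_x11() {
    recollindex -m -x
}
</programlisting>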
|
|
|
|
<para>By default, the messages from the indexing daemon will be
|
|
discarded. You may want to change this by setting the
|
|
<varname>daemlogfilename</varname> and
|
|
<varname>daemloglevel</varname> configuration parameters. Also the
|
|
log file will only be truncated when the daemon starts. If the
|
|
daemon runs permanently, the log file may grow quite big, depending
|
|
on the log level.</para>
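    <para>For illustration, the two parameters could be set as follows
      in the main configuration file. Both values are assumptions made
      for the example: the file location is arbitrary, and the level is
      meant as a moderately verbose setting, so please check the
      configuration section for the exact meaning of the values.</para>

<programlisting>
# Hypothetical log file location for the indexing daemon
daemlogfilename = /tmp/recollindex-daemon.log
# Assumed moderately verbose level
daemloglevel = 3
</programlisting>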
|
|
|
|
<para>When building &RCL;, the real time indexing support can be
|
|
customised during package <link
|
|
linkend="RCL.INSTALL.BUILDING.BUILD">configuration</link> with
|
|
the <option>--with[out]-fam</option> or
|
|
<option>--with[out]-inotify</option> options. The default is
|
|
currently to include <application>inotify</application>
|
|
monitoring on systems that support it, and, as of &RCL; 1.17,
|
|
<application>gamin</application> support on
|
|
<application>FreeBSD</application>.</para>
|
|
|
|
<para>While it is convenient that data is indexed in real time,
|
|
repeated indexing can generate a significant load on the
|
|
system when files such as email folders change. Also,
|
|
monitoring large file trees by itself significantly taxes
|
|
system resources. You probably do not want to enable it if
|
|
your system is short on resources. Periodic indexing is
|
|
adequate in most cases.</para>
|
|
|
|
<sect2 id="RCL.INDEXING.MONITOR.FASTFILES">
|
|
<title>Slowing down the reindexing rate for fast changing
|
|
files</title>
|
|
|
|
<para>When using the real time monitor, it may happen that some
|
|
files need to be indexed, but change so often that they impose an
|
|
excessive load for the system.</para>
|
|
|
|
<para>&RCL; provides a configuration option to specify the minimum
|
|
time before which a file, specified by a wildcard pattern, cannot be
|
|
reindexed. See the <varname>mondelaypatterns</varname> parameter in
|
|
the <link linkend="RCL.INSTALL.CONFIG.RECOLLCONF.MISC">
|
|
configuration section</link>.</para>
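      <para>As a hedged sketch, an entry asking that files matching
        <literal>*.log</literal> be reindexed at most every 20 seconds
        might look like the following. The
        <replaceable>pattern</replaceable>:<replaceable>seconds</replaceable>
        form is an assumption about the value format, which should be
        checked against the configuration section.</para>

<programlisting>
# Assumed format: wildcard pattern, colon, minimum delay in seconds
mondelaypatterns = *.log:20
</programlisting>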
|
|
|
|
</sect2>
|
|
</sect1>
|
|
|
|
</chapter>
|
|
|
|
<chapter id="RCL.SEARCH">
|
|
<title>Searching</title>
|
|
|
|
<sect1 id="RCL.SEARCH.GUI">
|
|
<title>Searching with the Qt graphical user interface</title>
|
|
|
|
<para>The <command>recoll</command> program provides the main user
|
|
interface for searching. It is based on the
|
|
<application>Qt</application> library.</para>
|
|
|
|
<para><command>recoll</command> has two search modes:</para>
|
|
<itemizedlist>
|
|
<listitem><para>Simple search (the default, on the main screen) has
|
|
a single entry field where you can enter multiple words.</para>
|
|
</listitem>
|
|
<listitem><para>Advanced search (a panel accessed through the
|
|
<guilabel>Tools</guilabel> menu or the toolbox bar icon) has
|
|
multiple entry fields, which you may use to build a logical
|
|
condition, with additional filtering on file type, location
|
|
in the file system, modification date, and size.</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
|
|
      <para>In most cases, you can enter the terms as you
        think of them, even if they contain embedded punctuation or other
|
|
non-textual characters. For
|
|
example, &RCL; can handle things like email addresses, or
|
|
        arbitrary cut and paste from another text window, punctuation
|
|
and all.</para>
|
|
|
|
<para>The main case where you should enter text differently from
|
|
how it is printed is for east-asian languages (Chinese,
|
|
Japanese, Korean). Words composed of single or multiple
|
|
characters should be entered separated by white space in this
|
|
case (they would typically be printed without white
|
|
space).</para>
|
|
|
|
<sect2 id="RCL.SEARCH.GUI.SIMPLE">
|
|
<title>Simple search</title>
|
|
|
|
<procedure>
|
|
<step><para>Start the <command>recoll</command> program.</para>
|
|
</step>
|
|
<step><para>Possibly choose a search mode: <guilabel>Any
|
|
term</guilabel>, <guilabel>All terms</guilabel>,
|
|
<guilabel>File name</guilabel> or
|
|
<guilabel>Query language</guilabel>.</para>
|
|
</step>
|
|
<step><para>Enter search term(s) in the text field at the top of the
|
|
window.</para>
|
|
</step>
|
|
<step><para>Click the <guilabel>Search</guilabel> button or
|
|
hit the <keycap>Enter</keycap> key to start the search.</para>
|
|
</step>
|
|
</procedure>
|
|
|
|
<para>The initial default search mode is <guilabel>Query
|
|
language</guilabel>. Without special directives, this will look for
|
|
documents containing all of the search terms (the ones with more
|
|
terms will get better scores), just like the <guilabel>All
|
|
terms</guilabel> mode which will ignore such
|
|
directives. <guilabel>Any term</guilabel> will search for documents
|
|
        where at least one of the terms appears.</para>
|
|
|
|
<para>The <guilabel>Query Language</guilabel> features are
|
|
described in <link linkend="RCL.SEARCH.LANG">a separate
|
|
section</link>.</para>
|
|
|
|
<para>All search modes allow wildcards inside terms
|
|
(<literal>*</literal>, <literal>?</literal>,
|
|
<literal>[]</literal>). You may want to have a look at the
|
|
<link linkend="RCL.SEARCH.WILDCARDS">section about wildcards</link>
|
|
for more information about this.</para>
|
|
|
|
<para><guilabel>File name</guilabel> will specifically look for file
|
|
names. The point of having a separate file name
|
|
search is that wild card expansion can be performed more
|
|
efficiently on a small subset of the index (allowing
|
|
        wild cards on the left of terms without excessive penalty).
|
|
Things to know:
|
|
<itemizedlist>
|
|
<listitem><para>White space in the entry should match white
|
|
space in the file name, and is not treated specially.</para>
|
|
</listitem>
|
|
<listitem><para>The search is insensitive to character case and
|
|
              accents, independently of the type of index.</para>
|
|
</listitem>
|
|
          <listitem><para>An entry without any wild card
              character and not capitalized will have '*' prepended and
              appended (ie: <replaceable>etc</replaceable> ->
|
|
<replaceable>*etc*</replaceable>, but
|
|
<replaceable>Etc</replaceable> ->
|
|
<replaceable>etc</replaceable>).</para>
|
|
</listitem>
|
|
<listitem><para>If you have a big index (many files),
|
|
excessively generic fragments may result in inefficient
|
|
searches.</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
|
|
<para>You can search for exact phrases (adjacent words in a
|
|
given order) by enclosing the input inside double quotes. Ex:
|
|
<literal>"virtual reality"</literal>.</para>
|
|
|
|
<para>When using a stripped index, character case has no influence on
|
|
search, except that you can disable stem expansion for any term by
|
|
capitalizing it. Ie: a search for <literal>floor</literal> will also
|
|
normally look for <literal>flooring</literal>,
|
|
<literal>floored</literal>, etc., but a search for
|
|
<literal>Floor</literal> will only look for <literal>floor</literal>,
|
|
in any character case. Stemming can also be disabled globally in the
|
|
preferences. When using a raw index, <link
|
|
linkend="RCL.SEARCH.CASEDIAC">the rules are a bit more
|
|
complicated</link>.</para>
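      <para>To summarise the rules above for a stripped index, here are
        a few sample search entries with their expected effect. The
        terms themselves are just the ones used in this section.</para>

<screen>
floor               also matches flooring, floored, etc.
Floor               only matches floor, in any character case
"virtual reality"   matches the two adjacent words, in this order
</screen>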
|
|
|
|
<para>&RCL; remembers the last few searches that you
|
|
performed. You can use the simple search text entry widget (a
|
|
combobox) to recall them (click on the thing at the right of the
|
|
text field). Please note, however, that only the search texts
|
|
are remembered, not the mode (all/any/file name).</para>
|
|
|
|
<para>Typing <keycap>Esc</keycap> <keycap>Space</keycap> while
|
|
entering a word in the simple search entry will open a window
|
|
with possible completions for the word. The completions are
|
|
extracted from the database.</para>
|
|
|
|
<para>Double-clicking on a word in the result list or a preview
|
|
window will insert it into the simple search entry field.</para>
|
|
|
|
<para>You can cut and paste any text into an <guilabel>All
|
|
terms</guilabel> or <guilabel>Any term</guilabel> search field,
|
|
punctuation, newlines and all - except for wildcard characters
|
|
(single <literal>?</literal> characters are ok). &RCL; will process
|
|
it and produce a meaningful search. This is what most differentiates
|
|
this mode from the <guilabel>Query Language</guilabel> mode, where
|
|
you have to care about the syntax.</para>
|
|
|
|
<para>You can use the <link linkend="RCL.SEARCH.GUI.COMPLEX">
|
|
<menuchoice>
|
|
<guimenu>Tools</guimenu>
|
|
<guimenuitem>Advanced search</guimenuitem>
|
|
</menuchoice>
|
|
</link> dialog for more complex searches.</para>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.SEARCH.GUI.RESLIST">
|
|
<title>The default result list</title>
|
|
|
|
<para>After starting a search, a list of results will instantly
|
|
be displayed in the main list window.</para>
|
|
|
|
<para>By default, the document list is presented in order of
|
|
relevance (how well the system estimates that the document
|
|
matches the query). You can sort the result by ascending or
|
|
descending date by using the vertical arrows in the toolbar.</para>
|
|
|
|
<para>Clicking on the
|
|
<literal>Preview</literal> link for an entry will open an
|
|
internal preview window for the document. Further
|
|
<literal>Preview</literal> clicks for the same search will open
|
|
tabs in the existing preview window. You can use
|
|
<keycap>Shift</keycap>+Click to force the creation of another
|
|
preview window, which may be useful to view the documents side
|
|
by side. (You can also browse successive results in a single
|
|
preview window by typing
|
|
<keycap>Shift</keycap>+<keycap>ArrowUp/Down</keycap> in the
|
|
window).</para>
|
|
|
|
<para>Clicking the <literal>Open</literal> link will
|
|
start an external viewer for the document. By default, &RCL; lets
|
|
the desktop choose the appropriate application for most document
|
|
types (there is a short list of exceptions, see further). If you
|
|
prefer to completely customize the choice of applications, you can
|
|
uncheck the <guilabel>Use desktop preferences</guilabel> option in
|
|
the GUI preferences dialog, and click the <guilabel>Choose editor
|
|
applications</guilabel> button to adjust the predefined &RCL;
|
|
choices. The tool accepts multiple selections of mime types (e.g. to
|
|
set up the editor for the dozens of office file types).</para>
|
|
|
|
<para>Even when <guilabel>Use desktop preferences</guilabel> is
|
|
checked, there is a small list of exceptions, for mime types where
|
|
the &RCL; choice should override the desktop one. These are
|
|
applications which are well integrated with &RCL;, especially
|
|
<application>evince</application> for viewing PDF and Postscript
|
|
files because of its support for opening the document at a specific
|
|
page and passing a search string as an argument. Of course, you can
|
|
edit the list (in the GUI preferences) if you would prefer to lose
|
|
the functionality and use the standard desktop tool.</para>
|
|
|
|
<para>You may also change the choice of applications by editing the
|
|
<link linkend="RCL.INSTALL.CONFIG.MIMEVIEW">
|
|
<filename>mimeview</filename></link> configuration file if you find
|
|
this more convenient.</para>
|
|
|
|
<para>The <literal>Preview</literal> and <literal>Open</literal>
|
|
edit links may not be present for all entries, meaning that
|
|
&RCL; has no configured way to preview a given file type (which
|
|
was indexed by name only), or no configured external editor for
|
|
the file type. This can sometimes be adjusted simply by tweaking
|
|
the <link linkend="RCL.INSTALL.CONFIG.MIMEMAP">
|
|
<filename>mimemap</filename></link> and
|
|
<link linkend="RCL.INSTALL.CONFIG.MIMEVIEW">
|
|
<filename>mimeview</filename></link> configuration files (the latter
|
|
can be modified with the user preferences dialog).</para>
|
|
|
|
<para>The format of the result list entries is entirely
|
|
configurable by using the preference dialog to
|
|
<link linkend="RCL.SEARCH.GUI.CUSTOM.RESLIST">edit an HTML
|
|
fragment</link>.</para>
|
|
|
|
<para>You can click on the <literal>Query details</literal> link
|
|
at the top of the results page to see the query actually
|
|
performed, after stem expansion and other processing.</para>
|
|
|
|
<para>Double-clicking on any word inside the result list or a
|
|
preview window will insert it into the simple search text.</para>
|
|
|
|
<para>The result list is divided into pages (the size of which
|
|
you can change in the preferences). Use the arrow buttons in the
|
|
toolbar or the links at the bottom of the page to browse the
|
|
results.</para>
|
|
|
|
<sect3 id="RCL.SEARCH.GUI.RESLIST.SUGGS">
|
|
<title>No results: the spelling suggestions</title>
|
|
|
|
<para>When a search yields no result, and if the
|
|
<application>aspell</application> dictionary is configured, &RCL;
|
|
will try to check for misspellings among the query terms, and
|
|
will propose lists of replacements. Clicking on one of the
|
|
suggestions will replace the word and restart the search. You can
|
|
hold any of the modifier keys (Ctrl, Shift, etc.) while clicking
|
|
if you would rather stay on the suggestion screen because several
|
|
terms need replacement.</para>
|
|
|
|
</sect3>
|
|
|
|
|
|
<sect3 id="RCL.SEARCH.GUI.RESULTLIST.MENU">
|
|
<title>The result list right-click menu</title>
|
|
|
|
<para>Apart from the preview and edit links, you can display a
|
|
pop-up menu by right-clicking over a paragraph in the result
|
|
list. This menu has the following entries:</para>
|
|
|
|
<itemizedlist>
|
|
<listitem><para><guilabel>Preview</guilabel></para></listitem>
|
|
<listitem><para><guilabel>Open</guilabel></para></listitem>
|
|
<listitem><para><guilabel>Copy File Name</guilabel></para></listitem>
|
|
<listitem><para><guilabel>Copy Url</guilabel></para></listitem>
|
|
<listitem><para><guilabel>Save to File</guilabel></para></listitem>
|
|
<listitem><para><guilabel>Find similar</guilabel></para></listitem>
|
|
<listitem><para><guilabel>Preview Parent
|
|
document</guilabel></para></listitem>
|
|
<listitem><para><guilabel>Open Parent
|
|
document</guilabel></para></listitem>
|
|
<listitem><para><guilabel>Open Snippets
|
|
Window</guilabel></para></listitem>
|
|
</itemizedlist>
|
|
|
|
<para>The <guilabel>Preview</guilabel> and
|
|
<guilabel>Open</guilabel> entries do the same thing as the
|
|
corresponding links.</para>
|
|
|
|
<para>The <guilabel>Copy File Name</guilabel> and
|
|
<guilabel>Copy Url</guilabel> copy the relevant data to the
|
|
clipboard, for later pasting.</para>
|
|
|
|
<para><guilabel>Save to File</guilabel> allows saving the
|
|
contents of a result document to a chosen file. This entry
|
|
will only appear if the document does not correspond to an
|
|
existing file, but is a subdocument inside such a file (ie: an
|
|
email attachment). It is especially useful to extract attachments
|
|
with no associated editor.</para>
|
|
|
|
<para>The <guilabel>Find similar</guilabel> entry will select
|
|
          a number of relevant terms from the current document and enter
|
|
them into the simple search field. You can then start a simple
|
|
search, with a good chance of finding documents related to the
|
|
current result.</para>
|
|
|
|
<para>The <guilabel>Parent document</guilabel> entries will
|
|
appear for documents which are not actually files but are part
|
|
of, or attached to, a higher level document. This entry is mainly
|
|
useful for email attachments and permits viewing the message to
|
|
which the document is attached. Note that the entry will also
|
|
appear for an email which is part of an mbox folder file, but
|
|
that you can't actually visualize the folder (there will be an
|
|
error dialog if you try). &RCL; is unfortunately not yet smart
|
|
enough to disable the entry in this case. In other cases, the
|
|
<guilabel>Open</guilabel> option makes sense, for example to
|
|
start a <application>chm</application> viewer on the parent
|
|
document for a help page.</para>
|
|
|
|
<para>The <guilabel>Open Snippets Window</guilabel> entry will only
|
|
appear for documents which support page breaks (typically
|
|
PDF, Postscript, DVI). The snippets window lists extracts from
|
|
          the document, taken around search term occurrences, along with the
|
|
corresponding page number, as links which can be used to start
|
|
the native viewer on the appropriate page. If the viewer supports
|
|
it, its search function will also be primed with one of the
|
|
search terms.</para>
|
|
|
|
</sect3>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.SEARCH.GUI.RESTABLE">
|
|
<title>The result table</title>
|
|
|
|
<para>In &RCL; 1.15 and newer, the results can be displayed in
|
|
spreadsheet-like fashion. You can switch to this presentation by
|
|
clicking the table-like icon in the toolbar (this is a toggle,
|
|
click again to restore the list).</para>
|
|
|
|
<para>Clicking on the column headers will allow sorting by the
|
|
values in the column. You can click again to invert the order, and
|
|
use the header right-click menu to reset sorting to the default
|
|
relevance order (you can also use the sort-by-date arrows to do
|
|
this).</para>
|
|
|
|
<para>Both the list and the table display the same underlying
|
|
results. The sort order set from the table is still active if you
|
|
switch back to the list mode. You can click twice on a date sort
|
|
arrow to reset it from there.</para>
|
|
|
|
<para>The header right-click menu allows adding or deleting
|
|
columns. The columns can be resized, and their order can be changed
|
|
(by dragging). All the changes are recorded when you quit
|
|
        <command>recoll</command>.</para>
|
|
|
|
<para>Hovering over a table row will update the detail area at the
|
|
bottom of the window with the corresponding values. You can click
|
|
the row to freeze the display. The bottom area is equivalent to a
|
|
result list paragraph, with links for starting a preview or a
|
|
native application, and an equivalent right-click menu. Typing
|
|
<keycap>Esc</keycap> (the Escape key) will unfreeze the
|
|
display.</para>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.SEARCH.GUI.THUMBNAILS">
|
|
<title>Displaying thumbnails</title>
|
|
|
|
<para>The default format for the result list entries and the
|
|
detail area of the result table display an icon for each result
|
|
document. The icon is either a generic one determined from the
|
|
MIME type, or a thumbnail of the document appearance. Thumbnails
|
|
are only displayed if found in the standard
|
|
<application>freedesktop</application> location, where they would
|
|
typically have been created by a file manager.</para>
|
|
|
|
      <para>&RCL; has no capability to create thumbnails. A relatively
|
|
simple trick is to use the <guilabel>Open parent
|
|
document/folder</guilabel> entry in the result list popup
|
|
menu. This should open a file manager window on the containing
|
|
directory, which should in turn create the thumbnails (depending on
|
|
your settings). Restarting the search should then display the
|
|
thumbnails.</para>
|
|
|
|
<para>There are also <ulink url="&WIKI;ResultThumbnails.wiki">some
|
|
pointers about thumbnail generation</ulink> on the &RCL; wiki.
|
|
</para>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.SEARCH.GUI.PREVIEW">
|
|
<title>The preview window</title>
|
|
|
|
<para>The preview window opens when you first click a
|
|
<literal>Preview</literal> link inside the result list.</para>
|
|
|
|
<para>Subsequent preview requests for a given search open new
|
|
tabs in the existing window (except if you hold the
|
|
<keycap>Shift</keycap> key while clicking which will open a new
|
|
window for side by side viewing).</para>
|
|
|
|
<para>Starting another search and requesting a preview will
|
|
create a new preview window. The old one stays open until you
|
|
close it.</para>
|
|
|
|
<para>You can close a preview tab by typing <keycap>Ctrl-W</keycap>
|
|
(<keycap>Ctrl</keycap> + <keycap>W</keycap>) in the
|
|
window. Closing the last tab for a window will also close the
|
|
window.</para>
|
|
|
|
<para>Of course you can also close a preview window by using the
|
|
window manager button in the top of the frame.</para>
|
|
|
|
<para>You can display successive or previous documents from the
|
|
result list inside a preview tab by typing
|
|
<keycap>Shift</keycap>+<keycap>Down</keycap> or
|
|
<keycap>Shift</keycap>+<keycap>Up</keycap> (<keycap>Down</keycap>
|
|
and <keycap>Up</keycap> are the arrow keys).</para>
|
|
|
|
<para>A right-click menu in the text area allows switching
|
|
between displaying the main text or the contents of fields
|
|
      associated with the document (e.g. author, abstract, etc.). This is
|
|
especially useful in cases where the term match did not occur in
|
|
the main text but in one of the fields. In the case of
|
|
images, you can switch between three displays: the image
|
|
itself, the image metadata as extracted
|
|
by <command>exiftool</command> and the fields, which is the
|
|
metadata stored in the index.</para>
|
|
|
|
|
|
<para>You can print the current preview window contents by typing
|
|
<keycap>Ctrl-P</keycap> (<keycap>Ctrl</keycap> +
|
|
<keycap>P</keycap>) in the window text.</para>
|
|
|
|
|
|
<sect3 id="RCL.SEARCH.GUI.PREVIEW.SEARCH">
|
|
<title>Searching inside the preview</title>
|
|
|
|
<para>The preview window has an internal search capability,
|
|
mostly controlled by the panel at the bottom of the window,
|
|
which works in two modes: as a classical editor incremental
|
|
search, where we look for the text entered in the entry
|
|
zone, or as a way to walk the matches between the document
|
|
and the &RCL; query that found it.</para>
|
|
|
|
<variablelist>
|
|
<varlistentry>
|
|
<term>Incremental text search</term>
|
|
<listitem><para>The preview tabs have an internal incremental search
|
|
function. You initiate the search either by typing a
|
|
            <keycap>/</keycap> (slash) or <keycap>Ctrl-F</keycap>
|
|
inside the text area or by clicking into
|
|
the <guilabel>Search for:</guilabel> text field and
|
|
entering the search string. You can then use the
|
|
<guilabel>Next</guilabel>
|
|
and <guilabel>Previous</guilabel> buttons
|
|
to find the next/previous occurrence. You can also type
|
|
<keycap>F3</keycap> inside the text area to get to the next
|
|
occurrence.</para>
|
|
<para>If you have a search string entered and you use
|
|
            <keycap>Shift-Up</keycap>/<keycap>Shift-Down</keycap> to browse the results, the search is
|
|
initiated for each successive document. If the string is
|
|
found, the cursor will be positioned at the first
|
|
occurrence of the search string.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>Walking the match lists</term>
|
|
<listitem><para>If the entry area is empty when you click
|
|
the <guilabel>Next</guilabel>
|
|
or <guilabel>Previous</guilabel> buttons, the editor will
|
|
be scrolled to show the next match to any search term
|
|
(the next highlighted zone). If you select a search group
|
|
from the dropdown list and click <guilabel>Next</guilabel>
|
|
or <guilabel>Previous</guilabel>, the match list for this
|
|
group will be walked. This is not the same as a text
|
|
            search, because the occurrences will include non-exact
|
|
matches (as caused by stemming or wildcards). The search
|
|
will revert to the text mode as soon as you edit the
|
|
entry area.</para></listitem>
|
|
</varlistentry>
|
|
</variablelist>
|
|
|
|
|
|
</sect3>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.SEARCH.GUI.COMPLEX">
|
|
<title>Complex/advanced search</title>
|
|
|
|
<para>The advanced search dialog helps you build more complex queries
|
|
without memorizing the search language constructs. It can be opened
|
|
through the <guilabel>Tools</guilabel> menu or through the main
|
|
toolbar.</para>
|
|
|
|
<para>The dialog has two tabs:</para>
|
|
|
|
<orderedlist>
|
|
|
|
<listitem><para>The first tab lets you specify terms to search
|
|
for, and permits specifying multiple clauses which are combined
|
|
to build the search.</para>
|
|
</listitem>
|
|
|
|
      <listitem><para>The second tab lets you filter the results according
          to file size, date of modification, MIME type, or
|
|
location.</para>
|
|
</listitem>
|
|
|
|
</orderedlist>
|
|
|
|
<para>Click on the <guilabel>Start Search</guilabel> button in
|
|
the advanced search dialog, or type <keycap>Enter</keycap> in
|
|
any text field to start the search. The button in
|
|
the main window always performs a simple search.</para>
|
|
|
|
<para>Click on the <literal>Show query details</literal> link at
|
|
the top of the result page to see the query expansion.</para>
|
|
|
|
<sect3 id="RCL.SEARCH.GUI.COMPLEX.TERMS">
|
|
      <title>Advanced search: the "find" tab</title>
|
|
|
|
      <para>This part of the dialog lets you construct a query by
|
|
combining multiple clauses of different types. Each entry
|
|
field is configurable for the following modes:</para>
|
|
|
|
<itemizedlist>
|
|
<listitem><para>All terms.</para>
|
|
</listitem>
|
|
<listitem><para>Any term.</para>
|
|
</listitem>
|
|
<listitem><para>None of the terms.</para>
|
|
</listitem>
|
|
<listitem><para>Phrase (exact terms in order within an
|
|
adjustable window).</para>
|
|
</listitem>
|
|
<listitem><para>Proximity (terms in any order within an
|
|
adjustable window).</para>
|
|
</listitem>
|
|
<listitem><para>Filename search.</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
|
|
<para>Additional entry fields can be created by clicking the
|
|
<guilabel>Add clause</guilabel> button.</para>
|
|
|
|
<para>When searching, the non-empty clauses will be
|
|
combined either with an AND or an OR conjunction, depending on
|
|
the choice made on the left (<guilabel>All clauses</guilabel> or
|
|
<guilabel>Any clause</guilabel>).</para>
|
|
|
|
<para>Entries of all types except "Phrase" and "Near" accept
|
|
a mix of single words and phrases enclosed in double quotes.
|
|
Stemming and wildcard expansion will be performed as for simple
|
|
search. </para>
|
|
|
|
<formalpara><title>Phrases and Proximity searches</title>
|
|
<para>These two clauses work in similar ways, with the
|
|
difference that proximity searches do not impose an order on the
|
|
words. In both cases, an adjustable number (slack) of non-matched words
|
|
may be accepted between the searched ones (use the counter on
|
|
the left to adjust this count). For phrases, the default count
|
|
is zero (exact match). For proximity it is ten (meaning that two search
|
|
        terms would be matched if found within a window of twelve
|
|
words). Examples: a phrase search for <literal>quick
|
|
fox</literal> with a slack of 0 will match <literal>quick
|
|
fox</literal> but not <literal>quick brown fox</literal>. With
|
|
a slack of 1 it will match the latter, but not <literal>fox
|
|
quick</literal>. A proximity search for <literal>quick
|
|
fox</literal> with the default slack will match the
|
|
latter, and also <literal>a fox is a cunning and quick
|
|
animal</literal>.</para>
|
|
</formalpara>
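
      <para>The same searches can probably also be expressed in the
      query language by using phrase modifiers. This is only a sketch:
      it assumes that your &RCL; version supports the
      <literal>o</literal> (slack) and <literal>p</literal> (proximity)
      modifiers, so check the query language section to confirm.</para>

<programlisting>
"quick fox"      phrase, slack 0: matches "quick fox" only
"quick fox"o1    phrase, slack 1: also matches "quick brown fox"
"quick fox"p     proximity, default slack: also matches "fox quick"
</programlisting>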
|
|
|
|
</sect3>
|
|
|
|
<sect3 id="RCL.SEARCH.GUI.COMPLEX.FILTER">
|
|
      <title>Advanced search: the "filter" tab</title>
|
|
|
|
<para>This part of the dialog has several sections which allow
|
|
filtering the results of a search according to a number of
|
|
        criteria:</para>
|
|
|
|
<itemizedlist>
|
|
|
|
<listitem>
|
|
<para>The first section allows filtering by dates of last
|
|
modification. You can specify both a minimum and a maximum date. The
|
|
initial values are set according to the oldest and newest documents
|
|
found in the index.</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>The next section allows filtering the results by
|
|
file size. There are two entries for minimum and maximum
|
|
size. Enter decimal numbers. You can use suffix multipliers:
|
|
<literal>k/K</literal>, <literal>m/M</literal>,
|
|
<literal>g/G</literal>, <literal>t/T</literal> for 1E3, 1E6,
|
|
1E9, 1E12 respectively.</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
          <para>The next section allows filtering the results by their MIME
            types, or MIME categories (e.g. media/text/message/etc.).</para>
|
|
<para>You can transfer the types between two boxes, to define
|
|
which will be included or excluded by the search.</para>
|
|
<para>The state of the file type selection can be saved as
|
|
the default (the file type filter will not be activated at
|
|
program start-up, but the lists will be in the restored
|
|
state).</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>The bottom section allows restricting the search results to a
|
|
sub-tree of the indexed area. You can use the
|
|
<guilabel>Invert</guilabel> checkbox to search for files not in
|
|
the sub-tree instead. If you use directory filtering often and on
|
|
big subsets of the file system, you may think of setting up
|
|
multiple indexes instead, as the performance may be
|
|
better.</para>
|
|
          <para>You can use relative/partial paths for filtering. For example,
|
|
entering <literal>dirA/dirB</literal> would match either
|
|
<filename>/dir1/dirA/dirB/myfile1</filename> or
|
|
<filename>/dir2/dirA/dirB/someother/myfile2</filename>.</para>
|
|
</listitem>
|
|
|
|
</itemizedlist>
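
      <para>As an illustration of the size suffixes described above,
      the following values (chosen arbitrarily) would be interpreted as
      shown, since the multipliers are powers of ten and not powers of
      two:</para>

<screen>
100k  =  100 x 1E3  =        100 000 bytes
2M    =    2 x 1E6  =      2 000 000 bytes
1g    =    1 x 1E9  =  1 000 000 000 bytes
</screen>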
|
|
|
|
</sect3>
|
|
|
|
<sect3 id="RCL.SEARCH.GUI.COMPLEX.HISTORY">
|
|
      <title>Advanced search history</title>
|
|
|
|
<para>The advanced search tool memorizes the last 100 searches
|
|
performed. You can walk the saved searches by using the up and
|
|
down arrow keys while the keyboard focus belongs to the advanced
|
|
search dialog.</para>
|
|
|
|
<para>The complex search history can be erased, along with the
|
|
one for simple search, by selecting the <menuchoice>
|
|
<guimenu>File</guimenu>
|
|
<guimenuitem>Erase Search History</guimenuitem>
|
|
</menuchoice> menu entry.</para>
|
|
|
|
</sect3>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.SEARCH.GUI.TERMEXPLORER">
|
|
<title>The term explorer tool</title>
|
|
|
|
<para>&RCL; automatically manages the expansion of search terms
|
|
      to their derivatives (e.g. plural/singular, verb
|
|
inflections). But there are other cases where the exact search
|
|
term is not known. For example, you may not remember the exact
|
|
spelling, or only know the beginning of the name.</para>
|
|
|
|
<para>The term explorer tool (started from the toolbar icon or
|
|
from the <guilabel>Term explorer</guilabel> entry of the
|
|
<guilabel>Tools</guilabel> menu) can be used to search the full index
|
|
      terms list. It has four modes of operation:</para>
|
|
<variablelist>
|
|
|
|
<varlistentry>
|
|
<term>Wildcard</term>
|
|
<listitem><para>In this mode of operation, you can enter a
|
|
          search string with shell-like wildcards (*, ?, []). For example,
|
|
<replaceable>xapi*</replaceable> would display all index terms
|
|
beginning with <replaceable>xapi</replaceable>. (More
|
|
about wildcards <link
|
|
linkend="RCL.SEARCH.WILDCARDS">here</link>).</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>Regular expression</term>
|
|
<listitem><para>This mode will accept a regular expression
|
|
as input. Example:
|
|
<replaceable>word[0-9]+</replaceable>. The expression is
|
|
          implicitly anchored at the beginning. For example,
|
|
<replaceable>press</replaceable> will match
|
|
<replaceable>pression</replaceable> but not
|
|
<replaceable>expression</replaceable>. You can use
|
|
<replaceable>.*press</replaceable> to match the latter,
|
|
but be aware that this will cause a full index term list
|
|
scan, which can be quite long.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
|
|
<term>Stem expansion</term>
|
|
<listitem><para>This mode will perform the usual stem expansion
|
|
          normally done as part of user input processing. As such it is
|
|
probably mostly useful to demonstrate the process.
|
|
</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>Spelling/Phonetic</term> <listitem><para>In this
|
|
mode, you enter the term as you think it is spelled, and
|
|
&RCL; will do its best to find index terms that sound like
|
|
your entry. This mode uses the
|
|
<application>Aspell</application> spelling application,
|
|
which must be installed on your system for things to work
|
|
          (if your documents contain non-ASCII characters, &RCL;
|
|
needs an aspell version newer than 0.60 for UTF-8
|
|
support). The language which is used to build the
|
|
dictionary out of the index terms (which is done at the
|
|
end of an indexing pass) is the one defined by your NLS
|
|
environment. Weird things will probably happen if
|
|
languages are mixed up.</para></listitem>
|
|
</varlistentry>
|
|
</variablelist>
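
      <para>A few sample inputs may help to compare the first two
      modes. The wildcard semantics are the usual shell ones. The terms
      themselves are only illustrations, and what is actually displayed
      depends on the contents of your index.</para>

<screen>
Wildcard mode:
  xapi*        terms beginning with "xapi"
  gr?y         "gr", then exactly one character, then "y"
  gr[ae]y      "gray" or "grey"

Regular expression mode:
  word[0-9]+   terms beginning with "word" and at least one digit
  .*press      terms containing "press" anywhere (full term list scan)
</screen>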
|
|
|
|
<para>Note that in cases where &RCL; does not know the beginning
|
|
      of the string to search for (e.g. a wildcard expression like
|
|
<replaceable>*coll</replaceable>), the expansion can take quite
|
|
a long time because the full index term list will have to be
|
|
      processed. The expansion is currently limited to 10000 results for
|
|
wildcards and regular expressions. It is possible to change the
|
|
limit in the configuration file.</para>
|
|
|
|
<para>Double-clicking on a term in the result list will insert
|
|
it into the simple search entry field. You can also cut/paste
|
|
between the result list and any entry field (the end of lines
|
|
will be taken care of).</para>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.SEARCH.GUI.MULTIDB">
|
|
<title>Multiple indexes</title>
|
|
|
|
<para>See the <link linkend="RCL.INDEXING.CONFIG.MULTIPLE">section
|
|
describing the use of multiple indexes</link> for
|
|
generalities. Only the aspects concerning
|
|
the <command>recoll</command> GUI are described here.</para>
|
|
|
|
<para>A <command>recoll</command> program instance is always
|
|
associated with a specific index, which is the one to be updated
|
|
when requested from the <guimenu>File</guimenu> menu, but it can
|
|
use any number of &RCL; indexes for searching. The external
|
|
indexes can be selected through the <guilabel>external
|
|
indexes</guilabel> tab in the preferences dialog.</para>
|
|
|
|
<para>Index selection is performed in two phases. A set of all
|
|
usable indexes must first be defined, and then the subset of
|
|
indexes to be used for searching. These parameters
|
|
      are retained across program executions (they are kept
|
|
separately for each &RCL; configuration). The set of all indexes
|
|
is usually quite stable, while the active ones might typically
|
|
be adjusted quite frequently.</para>
|
|
|
|
<para>The main index (defined by
|
|
<envar>RECOLL_CONFDIR</envar>) is always active. If this is
|
|
undesirable, you can set up your base configuration to index
|
|
an empty directory.</para>
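
      <para>A minimal sketch of such a base configuration follows. It
      assumes the default configuration directory
      (<filename>~/.recoll</filename>) and a hypothetical empty
      directory; <literal>topdirs</literal> is the configuration
      parameter listing the directories to index.</para>

<programlisting>
# ~/.recoll/recoll.conf
# Hypothetical empty directory: nothing gets indexed by the main index
topdirs = ~/empty-for-recoll
</programlisting>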
|
|
|
|
<para>When adding a new index to the set, you can select either
|
|
a &RCL; configuration directory, or directly a &XAP; index
|
|
directory. In the first case, the &XAP; index directory will
|
|
be obtained from the selected configuration.</para>
|
|
|
|
<para>As building the set of all indexes can be a little tedious
|
|
when done through the user interface, you can use the
|
|
<envar>RECOLL_EXTRA_DBS</envar> environment
|
|
variable to provide an initial set. This might typically be
|
|
set up by a system administrator so that every user does not
|
|
have to do it. The variable should define a colon-separated list
|
|
        of index directories, for example:
|
|
</para>
|
|
<screen>export RECOLL_EXTRA_DBS=/some/place/xapiandb:/some/other/db</screen>
|
|
|
|
<para>Another environment variable,
|
|
      <envar>RECOLL_ACTIVE_EXTRA_DBS</envar>, allows adding to the active
|
|
list of indexes. This variable was suggested and implemented by a
|
|
&RCL; user. It is mostly useful if you use scripts to mount
|
|
external volumes with &RCL; indexes. By using
|
|
<envar>RECOLL_EXTRA_DBS</envar> and
|
|
<envar>RECOLL_ACTIVE_EXTRA_DBS</envar>, you can add and activate
|
|
the index for the mounted volume when starting
|
|
<command>recoll</command>.
|
|
</para>
|
|
|
|
<para><envar>RECOLL_ACTIVE_EXTRA_DBS</envar> is available for
|
|
&RCL; versions 1.17.2 and later. A change was made in the same
|
|
update so that <command>recoll</command> will
|
|
automatically deactivate unreachable indexes when starting
|
|
up.</para>
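
      <para>For example, a mount script could end with something like
      the following. This is a sketch only: the index path is
      hypothetical, and it assumes that
      <envar>RECOLL_ACTIVE_EXTRA_DBS</envar> uses the same
      colon-separated format as <envar>RECOLL_EXTRA_DBS</envar>.</para>

<screen>
# Hypothetical index directory on an external volume
idx=/media/backup/recoll/xapiandb
if [ -d "$idx" ]; then
    # Make the index known to recoll, and activate it for searching
    export RECOLL_EXTRA_DBS="$idx"
    export RECOLL_ACTIVE_EXTRA_DBS="$idx"
fi
recoll
</screen>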
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.SEARCH.GUI.HISTORY">
|
|
<title>Document history</title>
|
|
|
|
<para>Documents that you actually view (with the internal preview
|
|
or an external tool) are entered into the document history,
|
|
which is remembered.</para>
|
|
<para>You can display the history list by using
|
|
the <guilabel>Tools/</guilabel><guilabel>Doc History</guilabel> menu
|
|
entry.</para>
|
|
<para>You can erase the document history by using the
|
|
<guilabel>Erase document history</guilabel> entry in the
|
|
<guimenu>File</guimenu> menu.</para>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.SEARCH.GUI.SORT">
|
|
<title>Sorting search results and collapsing duplicates</title>
|
|
|
|
<para>The documents in a result list are normally sorted in
|
|
order of relevance. It is possible to specify a different sort
|
|
order, either by using the vertical arrows in the GUI toolbox to
|
|
sort by date, or switching to the result table display and clicking
|
|
on any header. The sort order chosen inside the result table
|
|
      remains active if you switch back to the result list, until you
      click the vertical arrows so that both are unchecked (you are then
      back to sorting by relevance).</para>
|
|
|
|
<para>Sort parameters are remembered between program
|
|
invocations, but result sorting is normally always inactive
|
|
when the program starts. It is possible to keep the sorting
|
|
activation state between program invocations by checking the
|
|
<guilabel>Remember sort activation state</guilabel> option in
|
|
the preferences.</para>
|
|
|
|
<para>It is also possible to hide duplicate entries inside
|
|
the result list (documents with the exact same contents as the
|
|
displayed one). The test of identity is based on an MD5 hash
|
|
of the document container, not only of the text contents (so
|
|
        that, for example, a text document with an image added will not be a
        duplicate of the text only). Duplicate hiding is controlled
|
|
by an entry in the <guilabel>GUI configuration</guilabel>
|
|
dialog, and is off by default.</para>
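
      <para>If you wonder why two results are, or are not, treated as
      duplicates, you can compare the files yourself. This sketch
      assumes that both documents are plain files (not members of an
      archive or mailbox), and the file names are hypothetical. If the
      two hashes printed are equal, the files have the same
      contents.</para>

<screen>
md5sum ~/docs/report.odt ~/backup/report-copy.odt
</screen>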
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.SEARCH.GUI.TIPS">
|
|
<title>Search tips, shortcuts</title>
|
|
|
|
<sect3 id="RCL.SEARCH.GUI.TIPS.TERMS">
|
|
<title>Terms and search expansion</title>
|
|
|
|
<formalpara><title>Term completion</title>
|
|
<para>Typing <keycap>Esc</keycap> <keycap>Space</keycap> in
|
|
the simple search entry field while entering a word will
|
|
either complete the current word if its beginning matches a
|
|
unique term in the index, or open a window to propose a list
|
|
of completions.</para>
|
|
</formalpara>
|
|
|
|
<formalpara><title>Picking up new terms from result or preview
|
|
text</title>
|
|
<para>Double-clicking on a word in the result list or in a
|
|
preview window will copy it to the simple search entry field.</para>
|
|
</formalpara>
|
|
|
|
<formalpara><title>Wildcards</title>
|
|
<para>Wildcards can be used inside search terms in all forms
|
|
of searches. <link linkend="RCL.SEARCH.WILDCARDS">
|
|
More about wildcards</link>.
|
|
</para>
|
|
</formalpara>
|
|
|
|
<formalpara><title>Automatic suffixes</title>
|
|
<para>Words like <literal>odt</literal> or <literal>ods</literal>
|
|
can be automatically turned into query language
|
|
<literal>ext:xxx</literal> clauses. This can be enabled in the
|
|
<guilabel>Search preferences</guilabel> panel in the GUI.
|
|
</para>
|
|
</formalpara>
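
      <para>As an illustration, with this option enabled, the first
      line below would be handled approximately like the second
      one. The search terms are arbitrary, and the exact set of
      recognized suffixes is an assumption here.</para>

<programlisting>
budget 2012 ods
budget 2012 ext:ods
</programlisting>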
|
|
|
|
<formalpara><title>Disabling stem expansion</title>
|
|
<para>Entering a capitalized word in any search field will prevent
|
|
stem expansion (no search for
|
|
<literal>gardening</literal> if you enter
|
|
<literal>Garden</literal> instead of
|
|
<literal>garden</literal>). This is the only case where
|
|
character case should make a difference for a &RCL;
|
|
search. You can also disable stem expansion or change the
|
|
stemming language in the preferences.</para>
|
|
</formalpara>
|
|
|
|
<formalpara><title>Finding related documents</title>
|
|
<para>Selecting the <guilabel>Find similar documents</guilabel> entry
|
|
in the result list paragraph right-click menu will select a
|
|
set of "interesting" terms from the current result, and insert
|
|
them into the simple search entry field. You can then possibly
|
|
edit the list and start a search to find documents which may
|
|
          be related to the current result.</para>
|
|
</formalpara>
|
|
|
|
<formalpara><title>File names</title>
|
|
<para>File names are added as terms during indexing, and you can
|
|
specify them as ordinary terms in normal search fields (&RCL; used
|
|
to index all directories in the file path as terms. This has been
|
|
abandoned as it did not seem really useful). Alternatively, you
|
|
can use the specific file name search which will
|
|
<emphasis>only</emphasis> look for file names, and may be
|
|
faster than the generic search especially when using wildcards.</para>
|
|
</formalpara>
|
|
|
|
</sect3>
|
|
|
|
|
|
<sect3 id="RCL.SEARCH.GUI.TIPS.PHRASES">
|
|
<title>Working with phrases and proximity</title>
|
|
|
|
<formalpara><title>Phrases and Proximity searches</title>
|
|
<para>A phrase can be looked for by enclosing it in double
|
|
quotes. Example: <literal>"user manual"</literal> will look
|
|
only for occurrences of <literal>user</literal> immediately
|
|
followed by <literal>manual</literal>. You can use the
|
|
<guilabel>This phrase</guilabel> field of the advanced
|
|
          search dialog to the same effect. Phrases can be entered along with
|
|
simple terms in all simple or advanced search entry fields
|
|
(except <guilabel>This exact phrase</guilabel>).</para>
|
|
</formalpara>
|
|
|
|
<formalpara><title>AutoPhrases</title>
|
|
<para>This option can be set in the preferences dialog. If it is
|
|
set, a phrase will be automatically built and added to simple
|
|
searches when looking for <literal>Any terms</literal>. This
|
|
        will not radically change the results, but will give a relevance
|
|
boost to the results where the search terms appear as a
|
|
        phrase. For example, searching for <literal>virtual reality</literal>
|
|
will still find all documents where either
|
|
<literal>virtual</literal> or <literal>reality</literal> or
|
|
both appear, but those which contain <literal>virtual
|
|
reality</literal> should appear sooner in the list.</para>
|
|
</formalpara>
|
|
|
|
<para>Phrase searches can strongly slow down a query if most of the
|
|
terms in the phrase are common. This is why the
|
|
<varname>autophrase</varname> option is off by default for &RCL;
|
|
versions before 1.17. As of version 1.17,
|
|
<varname>autophrase</varname> is on by default, but very common
|
|
terms will be removed from the constructed phrase. The removal
|
|
threshold can be adjusted from the search preferences.</para>
|
|
|
|
<formalpara><title>Phrases and abbreviations</title> <para>As of
|
|
&RCL; version 1.17, dotted abbreviations like
|
|
<literal>I.B.M.</literal> are also automatically indexed as a word
|
|
without the dots: <literal>IBM</literal>. Searching for the word
|
|
      inside a phrase (e.g. <literal>"the IBM company"</literal>) will only
      match the dotted abbreviation if you increase the phrase slack (using the
      advanced search panel control, or the <literal>o</literal> query
      language modifier). Literal occurrences of the word will be matched
|
|
normally.</para></formalpara>
|
|
|
|
|
|
</sect3>
|
|
|
|
<sect3 id="RCL.SEARCH.GUI.TIPS.MISC">
|
|
<title>Others</title>
|
|
|
|
<formalpara><title>Using fields</title>
|
|
<para>You can use the <link linkend="RCL.SEARCH.LANG">query
|
|
language </link> and field specifications
|
|
to only search certain parts of documents. This can be
|
|
especially helpful with email, for example only searching
|
|
emails from a specific originator:
|
|
<literal>search tips from:helpfulgui</literal>
|
|
</para>
|
|
</formalpara>
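
      <para>A few more field searches, as a sketch. The field names
      used here (<literal>author</literal>, <literal>title</literal>,
      <literal>ext</literal>, <literal>dir</literal>) are assumed to be
      available in your version; the query language section has the
      authoritative list. The values are hypothetical.</para>

<programlisting>
author:dockes recoll
title:"user manual"
ext:pdf dir:projects/specs invoice
</programlisting>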
|
|
|
|
      <formalpara><title>Adjusting the result table columns</title>
|
|
<para>When displaying results in table mode, you can use a
|
|
right click on the table headers to activate a pop-up menu
|
|
which will let you adjust what columns are displayed. You can
|
|
drag the column headers to adjust their order. You can click
|
|
them to sort by the field displayed in the column. You can
|
|
also save the result list in CSV format.</para>
|
|
</formalpara>
|
|
|
|
<formalpara><title>Query explanation</title>
|
|
<para>You can get an exact description of what the query
|
|
looked for, including stem expansion, and Boolean operators
|
|
used, by clicking on the result list header.</para>
|
|
</formalpara>
|
|
|
|
<formalpara><title>Advanced search history</title>
|
|
<para>As of &RCL; 1.18, you can display any of the last 100 complex
|
|
searches performed by using the up and down arrow keys while the
|
|
advanced search panel is active.</para>
|
|
</formalpara>
|
|
|
|
<formalpara><title>Browsing the result list inside a preview
|
|
window</title>
|
|
<para>Entering <keycap>Shift-Down</keycap> or <keycap>Shift-Up</keycap>
|
|
(<keycap>Shift</keycap> + an arrow key) in a preview window will
|
|
display the next or the previous document from the result
|
|
list. Any secondary search currently active will be executed on
|
|
the new document.</para>
|
|
</formalpara>
|
|
|
|
<formalpara><title>Scrolling the result list from the keyboard</title>
|
|
<para>You can use <keycap>PageUp</keycap> and <keycap>PageDown</keycap>
|
|
to scroll the result list, <keycap>Shift+Home</keycap> to go back
|
|
to the first page. These work even while the focus is in the
|
|
search entry.</para>
|
|
</formalpara>
|
|
|
|
<formalpara><title>Editing a new search while the focus is not
|
|
in the search entry</title>
|
|
<para>You can use the <keycap>Ctrl-Shift-S</keycap> shortcut to
|
|
return the cursor to the search entry (and select the current
|
|
search text), while the focus is anywhere in the main
|
|
window.</para>
|
|
</formalpara>
|
|
|
|
<formalpara><title>Forced opening of a preview window</title>
|
|
<para>You can use <keycap>Shift</keycap>+Click on a result list
|
|
<literal>Preview</literal> link to force the creation of a
|
|
preview window instead of a new tab in the existing one.</para>
|
|
</formalpara>
|
|
|
|
<formalpara><title>Closing previews</title>
|
|
<para>Entering <keycap>Ctrl-W</keycap> in a tab will
|
|
close it (and, for the last tab, close the preview
|
|
window). Entering <keycap>Esc</keycap> will close the preview
|
|
window and all its tabs.</para>
|
|
</formalpara>
|
|
|
|
<formalpara><title>Printing previews</title>
|
|
<para>Entering <keycap>Ctrl-P</keycap> in a preview window will print
|
|
the currently displayed text.</para>
|
|
</formalpara>
|
|
|
|
<formalpara><title>Quitting</title>
|
|
<para>Entering <keycap>Ctrl-Q</keycap> almost anywhere will
|
|
close the application.</para>
|
|
</formalpara>
|
|
</sect3>
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.SEARCH.GUI.CUSTOM">
|
|
<title>Customizing the search interface</title>
|
|
|
|
<para>You can customize some aspects of the search interface by using
|
|
the <guimenu>GUI configuration</guimenu> entry in the
|
|
<guimenu>Preferences</guimenu> menu.</para>
|
|
|
|
<para>There are several tabs in the dialog, dealing with the
|
|
interface itself, the parameters used for searching and
|
|
returning results, and what indexes are searched.</para>
|
|
|
|
|
|
<formalpara id="RCL.SEARCH.GUI.CUSTOM.UI">
|
|
<title>User interface parameters:</title>
|
|
<para>
|
|
<itemizedlist>
|
|
|
|
<listitem><para><guilabel>Highlight color for query
|
|
terms</guilabel>: Terms from the user query are highlighted in
|
|
the result list samples and the preview window. The color can
|
|
          be chosen here. Any Qt color string should work (e.g.
|
|
<literal>red</literal>, <literal>#ff0000</literal>). The
|
|
default is <literal>blue</literal>.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><guilabel>Style sheet</guilabel>:
|
|
The name of a <application>Qt</application> style sheet
|
|
text file which is applied to the whole Recoll application
|
|
on startup. The default value is empty, but there is a
|
|
skeleton style sheet (<filename>recoll.qss</filename>)
|
|
inside the <filename>/usr/share/recoll/examples</filename>
|
|
directory. Using a style sheet, you can change most
|
|
<command>recoll</command> graphical parameters:
|
|
colors, fonts, etc. See the sample file for a few
|
|
simple examples.</para>
|
|
<para>You should be aware that parameters (e.g.: the
|
|
background color) set inside the &RCL; GUI style sheet
|
|
will override global system preferences, with possible
|
|
strange side effects: for example if you set the
|
|
foreground to a light color and the background to a
|
|
dark one in the desktop preferences, but only the
|
|
background is set inside the &RCL; style sheet, and it
|
|
is light too, then text will appear light-on-light
|
|
inside the &RCL; GUI.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><guilabel>Maximum text size highlighted for
|
|
          preview</guilabel>: inserting highlights on search terms inside
|
|
the text before inserting it in the preview window involves
|
|
quite a lot of processing, and can be disabled over the given
|
|
text size to speed up loading.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><guilabel>Prefer HTML to plain text for
|
|
          preview</guilabel>: if set, &RCL; will display HTML as such
|
|
inside the preview window. If this causes problems with the Qt
|
|
HTML display, you can uncheck it to display the plain text
|
|
version instead. </para>
|
|
</listitem>
|
|
|
|
<listitem><para><guilabel>Plain text to HTML line style</guilabel>:
|
|
when displaying plain text inside the preview window, &RCL;
|
|
tries to preserve some of the original text line breaks and
|
|
indentation. It can either use PRE HTML tags, which will
|
|
          preserve the indentation well but will force horizontal
|
|
scrolling for long lines, or use BR tags to break at the
|
|
original line breaks, which will let the editor introduce
|
|
other line breaks according to the window width, but will
|
|
lose some of the original indentation. The third option has
|
|
been available in recent releases and is probably now the best
|
|
one: use PRE tags with line wrapping.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><guilabel>Use desktop preferences to choose
|
|
document editor</guilabel>: if this is checked, the
|
|
<command>xdg-open</command> utility will be used to open files
|
|
when you click the <guilabel>Open</guilabel> link in the result
|
|
list, instead of the application defined in
|
|
<filename>mimeview</filename>. <command>xdg-open</command> will
|
|
          in turn use your desktop preferences to choose an appropriate
|
|
application.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><guilabel>Exceptions</guilabel>: when using the
|
|
          desktop preferences for opening documents, these are MIME types
|
|
that will still be opened according to &RCL; preferences. This
|
|
is useful for passing parameters like page numbers or search
|
|
strings to applications that support them
|
|
(e.g. <application>evince</application>). This cannot be done
|
|
with <command>xdg-open</command> which only supports passing
|
|
one parameter.</para>
|
|
</listitem>
|
|
|
|
      <listitem><para><guilabel>Choose editor applications</guilabel>:
|
|
this will let you choose the command started by the
|
|
<guilabel>Open</guilabel> links inside the result list, for
|
|
specific document types.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><guilabel>Display category filter as
|
|
          toolbar...</guilabel>: this will let you choose whether the document
|
|
categories are displayed as a list or a set of buttons.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><guilabel>Auto-start simple search on white
|
|
space entry</guilabel>: if this is checked, a search will be
|
|
executed each time you enter a space in the simple search input
|
|
field. This lets you look at the result list as you enter new
|
|
          terms. This is off by default; you may or may not like it.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><guilabel>Start with advanced search dialog open
|
|
</guilabel>: If you use this dialog frequently, checking
|
|
          this entry will make it open when <command>recoll</command> starts.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><guilabel>Remember sort activation
|
|
state</guilabel>: if set, &RCL; will remember the sort tool
|
|
state between invocations. It normally starts with sorting
|
|
disabled.</para>
|
|
</listitem>
|
|
|
|
</itemizedlist>
|
|
</para>
|
|
</formalpara>
|
|
|
|
|
|
<formalpara id="RCL.SEARCH.GUI.CUSTOM.RL">
|
|
<title>Result list parameters:</title>
|
|
<para>
|
|
<itemizedlist>
|
|
|
|
<listitem><para><guilabel>Number of results in a result
|
|
page</guilabel></para>
|
|
</listitem>
|
|
|
|
<listitem><para><guilabel>Result list font</guilabel>: There is
|
|
quite a lot of information shown in the result list, and you
|
|
may want to customize the font and/or font size. The rest of
|
|
the fonts used by &RCL; are determined by your generic Qt
|
|
config (try the <command>qtconfig</command> command).</para>
|
|
</listitem>
|
|
|
|
<listitem id="RCL.SEARCH.GUI.CUSTOM.RESULTPARA">
|
|
<para><guilabel>Edit result list paragraph format string</guilabel>:
|
|
allows you to change the presentation of each result list
|
|
entry. See the <link linkend="RCL.SEARCH.GUI.CUSTOM.RESLIST">
|
|
result list customisation section</link>.</para>
|
|
</listitem>
|
|
|
|
<listitem id="RCL.SEARCH.GUI.CUSTOM.RESULTHEAD">
|
|
<para><guilabel>Edit result page HTML header insert</guilabel>:
|
|
allows you to define text inserted at the end of the result
|
|
page HTML header.
|
|
More detail in the <link linkend="RCL.SEARCH.GUI.CUSTOM.RESLIST">
|
|
result list customisation section.</link></para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para><guilabel>Date format</guilabel>: allows specifying the
|
|
format used for displaying dates inside the result list. This
|
|
should be specified as a strftime() format string (see man strftime).</para>
|
|
</listitem>
|
|
|
|
<listitem id="RCL.SEARCH.GUI.CUSTOM.ABSSEP">
|
|
<para><guilabel>Abstract snippet separator</guilabel>:
|
|
for synthetic abstracts built from index data, which are
|
|
usually made of several snippets from different parts of the
|
|
document, this defines the snippet separator, an ellipsis by
|
|
default. </para>
|
|
</listitem>
|
|
|
|
</itemizedlist></para>
|
|
</formalpara>
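<para>As an illustration of the <guilabel>Date format</guilabel>
setting above, the following Python sketch renders one date with a few
strftime() format strings. The sample date and the
<literal>show</literal> helper are made up for the example; Python is
used only because its <literal>strftime</literal> method accepts the
same kind of format string.</para>

```python
from datetime import datetime

# Hypothetical sample timestamp, used only to illustrate the format strings.
sample = datetime(2012, 3, 14, 9, 26, 53)

def show(fmt):
    """Render the sample date with a strftime() format string."""
    return sample.strftime(fmt)

print(show("%Y-%m-%d"))           # year-month-day
print(show("%Y-%m-%d %H:%M:%S"))  # date followed by time
print(show("%d/%m/%Y"))           # day first
```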
|
|
|
|
<formalpara id="RCL.SEARCH.GUI.CUSTOM.SEARCH">
|
|
<title>Search parameters:</title>
|
|
<para>
|
|
<itemizedlist>
|
|
|
|
<listitem><para><guilabel>Hide duplicate results</guilabel>:
|
|
decides if result list entries are shown for identical
|
|
documents found in different places.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><guilabel>Stemming language</guilabel>:
|
|
stemming obviously depends on the document's language. This
|
|
listbox will let you choose among the stemming databases which
|
|
were built during indexing (this is set in the <link
|
|
linkend="RCL.INSTALL.CONFIG.RECOLLCONF">main configuration
|
|
file</link>), or later added with <command>recollindex
|
|
-s</command> (See the recollindex manual). Stemming languages
|
|
which are dynamically added will be deleted at the next
|
|
indexing pass unless they are also added in the configuration
|
|
file.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><guilabel>Automatically add phrase to simple
|
|
searches</guilabel>: a phrase will be automatically built and
|
|
added to simple searches when looking for <literal>Any
|
|
terms</literal>. This will give a relevance boost to the
|
|
results where the search terms appear as a phrase (consecutive
|
|
and in order).</para>
|
|
</listitem>
|
|
|
|
<listitem><para><guilabel>Autophrase term frequency threshold
|
|
percentage</guilabel>: very frequent terms should not be included
|
|
in automatic phrase searches for performance reasons. The
|
|
parameter defines the cutoff percentage (percentage of the
|
|
documents where the term appears).</para>
|
|
</listitem>
|
|
|
|
<listitem><para><guilabel>Replace abstracts from
|
|
documents</guilabel>: this decides if we should synthesize and
|
|
display an abstract in place of an explicit abstract found
|
|
within the document itself.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><guilabel>Dynamically build
|
|
abstracts</guilabel>: this decides if &RCL; tries to build
|
|
document abstracts (lists of <emphasis>snippets</emphasis>)
|
|
when displaying the result list. Abstracts are constructed by
|
|
taking context from the document information, around the search
|
|
terms.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><guilabel>Synthetic abstract size</guilabel>:
|
|
adjust to taste...</para>
|
|
</listitem>
|
|
|
|
<listitem><para><guilabel>Synthetic abstract context
|
|
words</guilabel>: how many words should be displayed around
|
|
each term occurrence.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><guilabel>Query language magic file name
|
|
suffixes</guilabel>: a list of words which automatically get
|
|
turned into <literal>ext:xxx</literal> file name suffix clauses
|
|
when starting a query language query (e.g. <literal>doc xls
|
|
xlsx...</literal>). This will save some typing for people who
|
|
use file types a lot when querying.</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
</formalpara>
|
|
|
|
<formalpara id="RCL.SEARCH.GUI.CUSTOM.EXTRADB">
|
|
<title>External indexes:</title>
|
|
<para>This panel will let you browse for additional indexes
|
|
that you may want to search. External indexes are designated by
|
|
their database directory (e.g.
|
|
<filename>/home/someothergui/.recoll/xapiandb</filename>,
|
|
<filename>/usr/local/recollglobal/xapiandb</filename>).</para>
|
|
</formalpara>
|
|
|
|
<para>Once entered, the indexes will appear in the
|
|
<guilabel>External indexes</guilabel> list, and you can
|
|
choose which ones you want to use at any moment by checking or
|
|
unchecking their entries.</para>
|
|
|
|
<para>Your main database (the one the current configuration
|
|
indexes to), is always implicitly active. If this is not
|
|
desirable, you can set up your configuration so that it indexes,
|
|
for example, an empty directory. An alternative indexer may also
|
|
need to implement a way of purging the index from stale data.
|
|
</para>
|
|
|
|
<sect3 id="RCL.SEARCH.GUI.CUSTOM.RESLIST">
|
|
<title>The result list format</title>
|
|
|
|
<para>The result list presentation can be exhaustively customized
|
|
by adjusting two elements:</para>
|
|
<itemizedlist>
|
|
<listitem><para>The paragraph format</para></listitem>
|
|
<listitem><para>HTML code inside the header
|
|
section</para></listitem>
|
|
</itemizedlist>
|
|
|
|
<para>These can be edited from the <guilabel>Result list</guilabel>
|
|
tab of the <guilabel>GUI configuration</guilabel>.</para>
|
|
|
|
<para>Newer versions of Recoll (from 1.17) use a WebKit HTML
|
|
object by default (this may be disabled at build time), and
|
|
total customisation is possible with full support for CSS and
|
|
Javascript. Conversely, there are limits to what you can do with
|
|
the older Qt QTextBrowser, but still, it is possible to decide
|
|
what data each result will contain, and how it will be
|
|
displayed.</para>
|
|
|
|
<para>No more detail will be given about the header part (only
|
|
useful with the WebKit build), if there are restrictions to
|
|
what you can do, they are beyond this author's HTML/CSS/Javascript
|
|
abilities... There are a few examples on the
|
|
<ulink url="http://www.recoll.org/custom.html">page about
|
|
customising the result list</ulink> on the &RCL; web site.</para>
|
|
|
|
<sect4 id="RCL.SEARCH.GUI.CUSTOM.RESLIST.PARA">
|
|
<title>The paragraph format</title>
|
|
|
|
<para>This is an arbitrary HTML string where the following printf-like
|
|
<literal>%</literal> substitutions will be performed:
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<formalpara><title>%A</title><para>Abstract</para></formalpara>
|
|
</listitem>
|
|
<listitem><formalpara><title>%D</title><para>Date</para></formalpara>
|
|
</listitem>
|
|
<listitem><formalpara><title>%I</title><para>Icon image
|
|
name. This is normally determined from the mime type. The
|
|
associations are defined inside the
|
|
<link linkend="RCL.INSTALL.CONFIG.MIMECONF">
|
|
<filename>mimeconf</filename> configuration file</link>.
|
|
If a thumbnail for the file is found at
|
|
the standard Freedesktop location, this will be displayed
|
|
instead.</para></formalpara>
|
|
</listitem>
|
|
<listitem><formalpara><title>%K</title><para>Keywords (if
|
|
any)</para></formalpara>
|
|
</listitem>
|
|
<listitem><formalpara><title>%L</title><para>Precooked Preview,
|
|
Edit, and possibly Snippets links</para></formalpara>
|
|
</listitem>
|
|
<listitem><formalpara><title>%M</title><para>Mime
|
|
type</para></formalpara>
|
|
</listitem>
|
|
<listitem><formalpara><title>%N</title><para>result Number inside
|
|
the result page</para></formalpara>
|
|
</listitem>
|
|
<listitem><formalpara><title>%R</title><para>Relevance
|
|
percentage</para></formalpara>
|
|
</listitem>
|
|
<listitem><formalpara><title>%S</title><para>Size
|
|
information</para></formalpara>
|
|
</listitem>
|
|
<listitem><formalpara><title>%T</title><para>Title or Filename if
|
|
not set.</para></formalpara>
|
|
</listitem>
|
|
<listitem><formalpara><title>%t</title><para>Title, or empty if
|
|
not set.</para></formalpara>
|
|
</listitem>
|
|
<listitem><formalpara><title>%U</title><para>Url</para></formalpara>
|
|
</listitem>
|
|
</itemizedlist>
|
|
|
|
The format of the Preview, Edit, and Snippets links is
|
|
<literal>&lt;a href="P%N"&gt;</literal>,
|
|
<literal>&lt;a href="E%N"&gt;</literal>
|
|
and
|
|
<literal>&lt;a href="A%N"&gt;</literal>
|
|
where <replaceable>docnum</replaceable> (%N) expands to the document
|
|
number inside the result page.</para>
|
|
|
|
<para>In addition to the predefined values above, all strings
|
|
like <literal>%(fieldname)</literal> will be replaced by the
|
|
value of the field named <literal>fieldname</literal> for this
|
|
document. Only stored fields can be accessed in this way, the
|
|
value of indexed but not stored fields is not known at this
|
|
point in the search process
|
|
(see <link linkend="RCL.PROGRAM.FIELDS">field
|
|
configuration</link>). There are currently very few fields
|
|
stored by default, apart from the values above
|
|
(only <literal>author</literal>
|
|
and <literal>filename</literal>), so this feature will need
|
|
some custom local configuration to be useful. An example
|
|
candidate would be the <literal>recipient</literal> field
|
|
which is generated by the message filters.</para>
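<para>The substitution rules above can be sketched in a few lines of
Python. This is only an illustration of the idea, not the code &RCL;
uses: the <literal>expand_paragraph</literal> function, its arguments
and the sample values are all hypothetical, and unknown markers simply
expand to an empty string here.</para>

```python
import re

def expand_paragraph(fmt, values, fields):
    """Expand %X and %(fieldname) markers in a paragraph format string.

    values maps single letters (for example "T" or "U") to strings and
    fields maps stored field names to strings. Unknown markers expand
    to an empty string in this sketch.
    """
    pattern = re.compile(r"%\((\w+)\)|%([A-Za-z])")

    def replace(match):
        if match.group(1) is not None:
            return fields.get(match.group(1), "")
        return values.get(match.group(2), "")

    return pattern.sub(replace, fmt)

# Hypothetical data for one result.
doc_values = {"N": "3", "T": "Sailing notes",
              "U": "file:///home/me/notes.html"}
doc_fields = {"author": "J. Doe"}
print(expand_paragraph("%N: %T [%U] by %(author)", doc_values, doc_fields))
```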
|
|
|
|
<para>The default value for the paragraph format string is:
|
|
<screen><![CDATA[
|
|
<img src="%I" align="left">%R %S %L <b>%T</b><br>
|
|
%M %D <i>%U</i> %i<br>
|
|
%A %K
|
|
]]></screen>
|
|
|
|
You may, for example, try the following for a more web-like
|
|
experience:
|
|
|
|
<screen><![CDATA[
|
|
<u><b><a href="P%N">%T</a></b></u><br>
|
|
%A<font color=#008000>%U - %S</font> - %L
|
|
]]></screen>
|
|
|
|
Note that the P%N link in the above paragraph makes the title a
|
|
preview link. Or the clean looking:
|
|
|
|
<screen><![CDATA[
|
|
<img src="%I" align="left">%L <font color="#900000">%R</font>
|
|
<b>%T</b><br>%S
|
|
<font color="#808080"><i>%U</i></font>
|
|
<table bgcolor="#e0e0e0">
|
|
<tr><td><div>%A</div></td></tr>
|
|
</table>%K
|
|
]]></screen>
|
|
</para>
|
|
|
|
<para>These samples, and some others are
|
|
<ulink url="http://www.recoll.org/custom.html">on the web
|
|
site, with pictures to show how they look.</ulink></para>
|
|
|
|
<para>It is also possible to
|
|
<link linkend="RCL.SEARCH.GUI.CUSTOM.ABSSEP">
|
|
define the value of the snippet separator inside the abstract
|
|
section</link>.</para>
|
|
</sect4>
|
|
</sect3>
|
|
</sect2>
|
|
|
|
</sect1> <!-- search GUI -->
|
|
|
|
<sect1 id="RCL.SEARCH.KIO">
|
|
<title>Searching with the KDE KIO slave</title>
|
|
|
|
<sect2 id="RCL.SEARCH.KIO.INTRO">
|
|
<title>What's this</title>
|
|
|
|
<para>The &RCL; KIO slave allows performing a &RCL; search
|
|
by entering an appropriate URL in a KDE open dialog, or with an
|
|
HTML-based interface displayed in
|
|
<command>Konqueror</command>.</para>
|
|
|
|
<para>The HTML-based interface is similar to the Qt-based
|
|
interface, but slightly less powerful for now. Its advantage is
|
|
that you can perform your search while staying fully within the
|
|
KDE framework: drag and drop from the result list works normally
|
|
and you have your normal choice of applications for opening
|
|
files.</para>
|
|
|
|
<para>The alternative interface uses a directory view of search
|
|
results. Due to limitations in the current KIO slave interface,
|
|
it is currently not obviously useful (to me).</para>
|
|
|
|
<para>The interface is described in more detail inside a help
|
|
file which you can access by entering
|
|
<filename>recoll:/</filename> inside the
|
|
<command>konqueror</command> URL line (this works only if the
|
|
recoll KIO slave has been previously installed).</para>
|
|
|
|
|
|
<para>The instructions for building this module are located in the
|
|
source tree. See:
|
|
<filename>kde/kio/recoll/00README.txt</filename>. Some Linux
|
|
distributions do package the kio-recoll module, so check before
|
|
diving into the build process: it may already be available for
|
|
one-click installation.</para>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="RCL.SEARCH.KIO.SEARCHABLEDOCS">
|
|
<title>Searchable documents</title>
|
|
|
|
<para>As a sample application, the &RCL; KIO slave could allow
|
|
preparing a set of HTML documents (for example a manual) so that
|
|
they become their own search interface inside
|
|
<command>konqueror</command>.</para>
|
|
|
|
<para>This can be done by either explicitly inserting
|
|
<literal><![CDATA[<a href="recoll://...">]]></literal> links
|
|
around some document areas, or automatically by adding a
|
|
very small <application>javascript</application> program to the
|
|
documents, like the following example, which would initiate a search by
|
|
double-clicking any term:</para>
|
|
|
|
<programlisting>&lt;script language="JavaScript"&gt;
// Search for the current selection (the term just double-clicked).
function recollsearch() {
    var t = document.getSelection().toString();
    window.location.href = 'recoll://search/query?qtp=a&amp;p=0&amp;q=' +
        encodeURIComponent(t);
}
&lt;/script&gt;
 ....
&lt;body ondblclick="recollsearch()"&gt;
|
|
|
|
</programlisting>
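<para>For the explicit-link approach, the same kind of URL can be
generated ahead of time. The Python sketch below builds one; the
<literal>recoll_search_url</literal> name is hypothetical, and the
<literal>qtp=a</literal> and <literal>p=0</literal> parameters are
copied as-is from the script above rather than taken from a
description of the URL format.</para>

```python
from urllib.parse import quote

def recoll_search_url(terms):
    """Build a recoll:// search URL like the one used in the script above."""
    # safe="" percent-encodes every reserved character, including "/".
    return "recoll://search/query?qtp=a&p=0&q=" + quote(terms, safe="")

print(recoll_search_url("ilur nautique"))
```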
|
|
</sect2>
|
|
</sect1>
|
|
|
|
|
|
<sect1 id="RCL.SEARCH.COMMANDLINE">
|
|
<title>Searching on the command line</title>
|
|
|
|
<para>There are several ways to obtain search results as a text
|
|
stream, without a graphical interface:</para>
|
|
<itemizedlist>
|
|
<listitem><para>By passing option <option>-t</option> to the
|
|
<command>recoll</command> program.</para>
|
|
</listitem>
|
|
<listitem><para>By using the <command>recollq</command> program.</para>
|
|
</listitem>
|
|
<listitem><para>By writing a custom
|
|
<application>Python</application> program, using the
|
|
<link linkend="RCL.PROGRAM.API.PYTHON">Recoll Python API</link>.</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
|
|
<para>The first two methods work in the same way and accept/need the same
|
|
arguments (except for the additional <option>-t</option> to
|
|
<command>recoll</command>). The query to be executed is specified
|
|
as command line arguments.</para>
|
|
|
|
<para><command>recollq</command> is not built by default. You can
|
|
use the <filename>Makefile</filename> in the
|
|
<filename>query</filename> directory to build it. This is a very
|
|
simple program, and if you can program a little C++, you may find it
|
|
useful to tailor its output format to your needs.</para>
|
|
|
|
<para><command>recollq</command> has a man page (not installed by
|
|
default, look in the <filename>doc/man</filename> directory). The
|
|
Usage string is as follows:</para>
|
|
<programlisting>
|
|
recollq: usage:
|
|
-P: Show the date span for all the documents present in the index
|
|
[-o|-a|-f] [-q] &lt;query string&gt;
|
|
Runs a recoll query and displays result lines.
|
|
Default: will interpret the argument(s) as a xesam query string
|
|
query may be like:
|
|
implicit AND, Exclusion, field spec: t1 -t2 title:t3
|
|
OR has priority: t1 OR t2 t3 OR t4 means (t1 OR t2) AND (t3 OR t4)
|
|
Phrase: "t1 t2" (needs additional quoting on cmd line)
|
|
-o Emulate the GUI simple search in ANY TERM mode
|
|
-a Emulate the GUI simple search in ALL TERMS mode
|
|
-f Emulate the GUI simple search in filename mode
|
|
-q is just ignored (compatibility with the recoll GUI command line)
|
|
Common options:
|
|
-c &lt;configdir&gt; : specify config directory, overriding $RECOLL_CONFDIR
|
|
-d also dump file contents
|
|
-n [first-]&lt;cnt&gt; define the result slice. The default value for [first]
|
|
is 0. Without the option, the default max count is 2000.
|
|
Use n=0 for no limit
|
|
-b : basic. Just output urls, no mime types or titles
|
|
-Q : no result lines, just the processed query and result count
|
|
-m : dump the whole document meta[] array for each result
|
|
-A : output the document abstracts
|
|
-S fld : sort by field &lt;fld&gt;
|
|
-D : sort descending
|
|
-i &lt;dbdir&gt; : additional index, several can be given
|
|
-e use url encoding (%xx) for urls
|
|
-F &lt;field name list&gt; : output exactly these fields for each result.
|
|
The field values are encoded in base64, output in one line and
|
|
separated by one space character. This is the recommended format
|
|
for use by other programs. Use a normal query with option -m to
|
|
see the field names.
|
|
</programlisting>
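<para>A program reading the <option>-F</option> output might decode
each result line as sketched below in Python. The sketch relies only
on what the usage text states (base64 values, one line per result,
separated by one space) and assumes that a line contains nothing but
the requested values, in the requested order. The
<literal>decode_fields</literal> name and the sample field names are
hypothetical.</para>

```python
import base64

def decode_fields(line, names):
    """Decode one result line: space-separated base64 values, in order."""
    values = [base64.b64decode(token).decode("utf-8", "replace")
              for token in line.split(" ")]
    return dict(zip(names, values))

# Build a fake line the same way, so the sketch runs without recollq.
sample_values = ["file:///tmp/a.txt", "text/plain"]
sample_line = " ".join(base64.b64encode(v.encode("utf-8")).decode("ascii")
                       for v in sample_values)
print(decode_fields(sample_line, ["url", "mtype"]))
```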
|
|
|
|
<para>Sample execution:</para>
|
|
<programlisting>recollq 'ilur -nautique mime:text/html'
|
|
Recoll query: ((((ilur:(wqf=11) OR ilurs) AND_NOT (nautique:(wqf=11)
|
|
OR nautiques OR nautiqu OR nautiquement)) FILTER Ttext/html))
|
|
4 results
|
|
text/html [file:///Users/uncrypted-dockes/projets/bateaux/ilur/comptes.html] [comptes.html] 18593 bytes
|
|
text/html [file:///Users/uncrypted-dockes/projets/nautique/webnautique/articles/ilur1/index.html] [Constructio...
|
|
text/html [file:///Users/uncrypted-dockes/projets/pagepers/index.html] [psxtcl/writemime/recoll]...
|
|
text/html [file:///Users/uncrypted-dockes/projets/bateaux/ilur/factEtCie/recu-chasse-maree....
|
|
</programlisting>
|
|
</sect1>
|
|
|
|
<sect1 id="RCL.SEARCH.PTRANS">
|
|
<title>Path translations</title>
|
|
|
|
<para>In some cases, the document paths stored inside the index do
|
|
not match the actual ones, so that document
|
|
previews and accesses will fail. This can occur in a number of
|
|
circumstances:</para>
|
|
<itemizedlist>
|
|
<listitem><para>When using multiple indexes it is a relatively common
|
|
occurrence that some will actually reside on a remote volume, for
|
|
example, mounted via NFS. In this case, the paths used to access
|
|
the documents on the local machine are not necessarily the same
|
|
as the ones used while indexing on the remote machine. For
|
|
example, <filename>/home/me</filename> may have been used as
|
|
a <literal>topdirs</literal> element while indexing, but the
|
|
directory might be mounted
|
|
as <filename>/net/server/home/me</filename> on the local
|
|
machine.</para></listitem>
|
|
|
|
<listitem><para>The case may also occur with removable
|
|
disks. It is perfectly possible to configure an index to
|
|
live with the documents on the removable disk, but it may
|
|
happen that the disk is not mounted at the same place so
|
|
that the document paths from the index are
|
|
invalid.</para></listitem>
|
|
|
|
<listitem><para>As a last example, one could imagine that a big
|
|
directory has been moved, but that it is currently
|
|
inconvenient to run the indexer.</para></listitem>
|
|
</itemizedlist>
|
|
|
|
<para>More generally, the path translation facility may be useful
|
|
whenever the document paths seen by the indexer are not the same
|
|
as the ones which should be used at query time.</para>
|
|
|
|
<para>&RCL; has a facility for rewriting access paths when
|
|
extracting the data from the index. The translations can be
|
|
defined for the main index and for any additional query
|
|
index.</para>
|
|
|
|
<para>In the above NFS example, &RCL; could be instructed to
|
|
rewrite any <filename>file:///home/me</filename> URL from the
|
|
index to <filename>file:///net/server/home/me</filename>,
|
|
allowing accesses from the client.</para>
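<para>The effect of such a rewrite can be sketched in Python as
follows. This illustrates the idea of a prefix translation only: the
<literal>translate_url</literal> function and the list of pairs are
hypothetical, and do not reflect the format of the
<filename>ptrans</filename> file or the way &RCL; implements the
facility.</para>

```python
def translate_url(url, translations):
    """Rewrite the start of url using the first matching (old, new) pair."""
    for old, new in translations:
        prefix = old.rstrip("/")
        # Match whole path elements only, so /home/me does not match /home/meow.
        if url == prefix or url.startswith(prefix + "/"):
            return new.rstrip("/") + url[len(prefix):]
    return url

pairs = [("file:///home/me", "file:///net/server/home/me")]
print(translate_url("file:///home/me/doc/a.txt", pairs))
```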
|
|
|
|
<para>The translations are defined in the
|
|
<link linkend="RCL.INSTALL.CONFIG.PTRANS">
|
|
<filename>ptrans</filename></link> configuration file, which
|
|
can be edited by hand or from the GUI external indexes
|
|
configuration dialog.</para>
|
|
</sect1>
|
|
|
|
|
|
|
|
<sect1 id="RCL.SEARCH.LANG">
|
|
<title>The query language</title>
|
|
|
|
<para>The query language processor is activated in the GUI
|
|
simple search entry when the search mode selector is set to
|
|
<guilabel>Query Language</guilabel>. It can also be used with the KIO
|
|
slave or the command line search. It broadly has the same
|
|
capabilities as the complex search interface in the
|
|
GUI.</para>
|
|
|
|
<para>The language is based on the (seemingly defunct)
|
|
<ulink url="http://www.xesam.org/main/XesamUserSearchLanguage95">
|
|
Xesam</ulink> user search language specification.</para>
|
|
|
|
<para>If the results of a query language search puzzle you and you
|
|
doubt what has been actually searched for, you can use the GUI
|
|
<literal>Show Query</literal> link at the top of the result list to
|
|
check the exact query which was finally executed by Xapian.</para>
|
|
|
|
<para>Here follows a sample request that we are going to
|
|
explain:</para>
|
|
|
|
<programlisting>
|
|
author:"john doe" Beatles OR Lennon Live OR Unplugged -potatoes
|
|
</programlisting>
|
|
|
|
<para>This would search for all documents with
|
|
<replaceable>John Doe</replaceable>
|
|
appearing as a phrase in the author field (exactly what this is
|
|
would depend on the document type, e.g. the
|
|
<literal>From:</literal> header, for an email message),
|
|
and containing either <replaceable>beatles</replaceable> or
|
|
<replaceable>lennon</replaceable> and either
|
|
<replaceable>live</replaceable> or
|
|
<replaceable>unplugged</replaceable> but not
|
|
<replaceable>potatoes</replaceable> (in any part of the document).</para>
|
|
|
|
<para>An element is composed of an optional field specification,
|
|
and a value, separated by a colon (the field separator is the last
|
|
colon in the element). Example:
|
|
<replaceable>Eugenie</replaceable>,
|
|
<replaceable>author:balzac</replaceable>,
|
|
<replaceable>dc:title:grandet</replaceable> </para>
|
|
|
|
<para>The colon, if present, means "contains". Xesam defines other
|
|
relations, which are mostly unsupported for now (except in special
|
|
cases, described further down).</para>
|
|
|
|
<para>All elements in the search entry are normally combined
|
|
with an implicit AND. It is possible to specify that elements be
|
|
OR'ed instead, as in <replaceable>Beatles</replaceable>
|
|
<literal>OR</literal> <replaceable>Lennon</replaceable>. The
|
|
<literal>OR</literal> must be entered literally (capitals), and
|
|
it has priority over the AND associations:
|
|
<replaceable>word1</replaceable>
|
|
<replaceable>word2</replaceable> <literal>OR</literal>
|
|
<replaceable>word3</replaceable>
|
|
means
|
|
<replaceable>word1</replaceable> AND
|
|
(<replaceable>word2</replaceable> <literal>OR</literal>
|
|
<replaceable>word3</replaceable>)
|
|
not
|
|
(<replaceable>word1</replaceable> AND
|
|
<replaceable>word2</replaceable>) <literal>OR</literal>
|
|
<replaceable>word3</replaceable>. Explicit
|
|
parentheses are <emphasis>not</emphasis> supported.</para>
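<para>The priority rule can be pictured with a small Python sketch
that splits a query into AND-ed groups of OR-ed elements. It is a toy
model of the rule described above, not the &RCL; parser: the
<literal>group_query</literal> name is hypothetical, and quoted
phrases and negation are not handled.</para>

```python
def group_query(query):
    """Split a query into AND-ed groups of OR-ed elements.

    Each inner list holds elements joined by OR; the outer list is
    an implicit AND. Quoted phrases are not handled in this sketch.
    """
    groups = []
    join_next = False
    for token in query.split():
        if token == "OR":
            join_next = True
        elif join_next and groups:
            groups[-1].append(token)
            join_next = False
        else:
            groups.append([token])
            join_next = False
    return groups

print(group_query("word1 word2 OR word3"))
```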
|
|
|
|
<para>An element preceded by a <literal>-</literal> specifies a
|
|
term that should <emphasis>not</emphasis> appear. Pure negative
|
|
queries are forbidden.</para>
|
|
|
|
<para>As usual, words inside quotes define a phrase
|
|
(the order of words is significant), so that
|
|
<replaceable>title:"prejudice pride"</replaceable> is not the same as
|
|
<replaceable>title:prejudice title:pride</replaceable>, and is
|
|
unlikely to find a result.</para>
|
|
|
|
<para>Modifiers can be set on a phrase clause, for example to specify
|
|
a proximity search (unordered). See
|
|
<link linkend="RCL.SEARCH.LANG.MODIFIERS">the modifier
|
|
section</link>.</para>
|
|
|
|
<para>&RCL; currently manages the following default fields:</para>
|
|
|
|
<itemizedlist>
|
|
|
|
<listitem><para><literal>title</literal>,
|
|
<literal>subject</literal> or <literal>caption</literal> are
|
|
synonyms which specify data to be searched for in the
|
|
document title or subject.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><literal>author</literal> or
|
|
<literal>from</literal> for searching the document
|
|
originators.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><literal>recipient</literal> or
|
|
<literal>to</literal> for searching the document
|
|
recipients.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><literal>keyword</literal> for searching the
|
|
document-specified keywords (few documents actually have
|
|
any).</para>
|
|
</listitem>
|
|
|
|
<listitem><para><literal>filename</literal> for the document's
|
|
file name.</para></listitem>
|
|
|
|
<listitem><para><literal>ext</literal> specifies the file
|
|
name extension (Ex: <literal>ext:html</literal>)</para>
|
|
</listitem>
|
|
|
|
</itemizedlist>
|
|
|
|
<para>The field syntax also supports a few field-like, but
|
|
special, criteria:</para>
|
|
|
|
<itemizedlist>
|
|
|
|
<listitem><para><literal>dir</literal> for filtering the
|
|
results on file location
|
|
(Ex: <literal>dir:/home/me/somedir</literal>).
|
|
<literal>-dir</literal>
|
|
also works to find results not in the specified directory
|
|
(release >= 1.15.8). A tilde inside the value will be
|
|
expanded to the home directory. Wildcards will be
|
|
expanded, but
|
|
please <link linkend="RCL.SEARCH.WILDCARDS.PATH"> have a
|
|
look</link> at an important limitation of wildcards in
|
|
path filters.</para>
|
|
|
|
<para>Relative paths also make sense, for example,
|
|
<literal>dir:share/doc</literal> would match either
|
|
<filename>/usr/share/doc</filename> or
|
|
<filename>/usr/local/share/doc</filename> </para>
|
|
|
|
<para>Several <literal>dir</literal> clauses can be specified,
|
|
both positive and negative. For example the following makes sense:
|
|
<programlisting>
|
|
dir:recoll dir:src -dir:utils -dir:common
|
|
</programlisting> This would select results which have both
|
|
<filename>recoll</filename> and <filename>src</filename> in the
|
|
path (in any order), and which have neither
<filename>utils</filename> nor
<filename>common</filename>.</para>
|
|
|
|
<para>You can also use <literal>OR</literal> conjunctions
|
|
with <literal>dir:</literal> clauses.</para>
|
|
|
|
<para>A special aspect of <literal>dir</literal> clauses is
|
|
that the values in the index are not transcoded to UTF-8, and
|
|
never lower-cased or unaccented, but stored as binary. This means
|
|
that you need to enter the values in the exact lower or upper
|
|
case, and that searches for names with diacritics may sometimes
|
|
be impossible because of character set conversion
|
|
issues. Non-ASCII UNIX file paths are an unending source of
|
|
trouble and are best avoided.</para>
|
|
|
|
<para>You need to use double-quotes around the path value if it
|
|
contains space characters.</para>
|
|
|
|
</listitem>
|
|
|
|
<listitem><para><literal>size</literal> for filtering the
|
|
results on file size. Example:
|
|
<literal>size&lt;10000</literal>. You can use
|
|
<literal>&lt;</literal>, <literal>&gt;</literal> or
|
|
<literal>=</literal> as operators. You can specify a range like the
|
|
following: <literal>size&gt;100 size&lt;1000</literal>. The usual
|
|
<literal>k/K, m/M, g/G, t/T</literal> can be used as (decimal)
|
|
multipliers. Ex: <literal>size&gt;1k</literal> to search for files
|
|
bigger than 1000 bytes.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><literal>date</literal> for searching or filtering
|
|
on dates. The syntax for the argument is based on the ISO8601
|
|
standard for dates and time intervals. Only dates are supported, no
|
|
times. The general syntax is 2 elements separated by a
|
|
<literal>/</literal> character. Each element can be a date or a
|
|
period of time. Periods are specified as
|
|
<literal>P</literal><replaceable>n</replaceable><literal>Y</literal><replaceable>n</replaceable><literal>M</literal><replaceable>n</replaceable><literal>D</literal>.
|
|
The <replaceable>n</replaceable> numbers are the respective numbers
|
|
of years, months or days, any of which may be missing. Dates are
|
|
specified as
|
|
<replaceable>YYYY</replaceable>-<replaceable>MM</replaceable>-<replaceable>DD</replaceable>.
|
|
The days and months parts may be missing. If the
|
|
<literal>/</literal> is present but an element is missing, the
|
|
missing element is interpreted as the lowest or highest date in the
|
|
index. Examples:</para>
|
|
|
|
<itemizedlist>
|
|
<listitem><para><literal>2001-03-01/2002-05-01</literal> the
|
|
basic syntax for an interval of dates.</para>
|
|
</listitem>
|
|
<listitem><para><literal>2001-03-01/P1Y2M</literal> the
|
|
same specified with a period.</para>
|
|
</listitem>
|
|
<listitem><para><literal>2001/</literal> from the beginning of
|
|
2001 to the latest date in the index.</para>
|
|
</listitem>
|
|
<listitem><para><literal>2001</literal> the whole year of
|
|
2001</para></listitem>
|
|
<listitem><para><literal>P2D/</literal> means 2 days ago up to
|
|
now if there are no documents with dates in the future.</para>
|
|
</listitem>
|
|
<listitem><para><literal>/2003</literal> all documents from
|
|
2003 or older.</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
<para>Periods can also be specified with small letters (e.g.
|
|
p2y).</para>
|
|
</listitem>
|
|
|
|
<listitem><para><literal>mime</literal> or
|
|
<literal>format</literal> for specifying the
|
|
mime type. This one is quite special because you can specify
|
|
several values which will be OR'ed (the normal default for the
|
|
language is AND). Ex: <literal>mime:text/plain
|
|
mime:text/html</literal>. Specifying an explicit boolean
|
|
operator before a
|
|
<literal>mime</literal> specification is not supported and
|
|
will produce strange results. You can filter out certain types
|
|
by using negation (<literal>-mime:some/type</literal>), and you can
|
|
use wildcards in the value (<literal>mime:text/*</literal>).
|
|
Note that <literal>mime</literal> is
|
|
the ONLY field with an OR default. You do need to use
|
|
<literal>OR</literal> with <literal>ext</literal> terms for
|
|
example.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><literal>type</literal> or
|
|
<literal>rclcat</literal> for specifying the category (as in
|
|
text/media/presentation/etc.). The classification of mime
|
|
types in categories is defined in the &RCL; configuration
|
|
(<filename>mimeconf</filename>), and can be modified or
|
|
extended. The default category names are those which permit
|
|
filtering results in the main GUI screen. Categories are OR'ed
|
|
like mime types above. Unlike mime types, however, categories cannot
be negated with <literal>-</literal>.</para>
|
|
</listitem>
|
|
|
|
</itemizedlist>
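<para>To make the <literal>size</literal> multipliers and the
<literal>date</literal> period notation from the list above more
concrete, here is a Python sketch that converts both to plain
numbers. The <literal>parse_size</literal> and
<literal>parse_period</literal> names are hypothetical, the sketch
accepts whole numbers only, and it does not try to reproduce how &RCL;
itself parses these values.</para>

```python
import re

# Decimal multipliers, as stated for the size clause.
MULTIPLIERS = {"k": 10**3, "m": 10**6, "g": 10**9, "t": 10**12}

def parse_size(text):
    """Convert a size value such as '1k' or '10000' to a byte count."""
    match = re.fullmatch(r"(\d+)([kmgtKMGT]?)", text)
    if match is None:
        raise ValueError("bad size value: " + text)
    number, suffix = match.groups()
    return int(number) * MULTIPLIERS.get(suffix.lower(), 1)

def parse_period(text):
    """Convert a period such as 'P1Y2M' to a (years, months, days) tuple."""
    match = re.fullmatch(
        r"[pP](?:(\d+)[yY])?(?:(\d+)[mM])?(?:(\d+)[dD])?", text)
    if match is None:
        raise ValueError("bad period: " + text)
    # Missing parts count as zero.
    return tuple(int(part or 0) for part in match.groups())

print(parse_size("1k"), parse_period("P1Y2M"))
```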
|
|
|
|
<para>Words inside phrases and capitalized words are not
|
|
stem-expanded. Wildcards may be used anywhere inside a term.
|
|
Specifying a wildcard on the left of a term can produce a very
|
|
slow search (or even an incorrect one if the expansion is
|
|
truncated because of excessive size). Also see
|
|
<link linkend="RCL.SEARCH.WILDCARDS">
|
|
More about wildcards</link>.</para>
|
|
|
|
<para>The document filters used while indexing have the
|
|
possibility to create other fields with arbitrary names, and
|
|
aliases may be defined in the configuration, so that the exact
|
|
field search possibilities may be different for you if someone
|
|
took care of the customisation.</para>
|
|
|
|
<sect2 id="RCL.SEARCH.LANG.MODIFIERS">
|
|
<title>Modifiers</title>
|
|
|
|
<para>Some characters are recognized as search modifiers when found
|
|
immediately after the closing double quote of a phrase, as in
|
|
<literal>"some term"modifierchars</literal>. The actual "phrase"
|
|
can be a single term of course. Supported modifiers:
|
|
|
|
<itemizedlist>
|
|
<listitem><para><literal>l</literal> can be used to turn off
|
|
stemming (mostly makes sense with <literal>p</literal> because
|
|
stemming is off by default for phrases).</para>
|
|
</listitem>
|
|
|
|
<listitem><para><literal>o</literal> can be used to specify a
|
|
"slack" for phrase and proximity searches: the number of
|
|
additional terms that may be found between the specified
|
|
ones. If <literal>o</literal> is followed by an integer number,
|
|
this is the slack, else the default is 10.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><literal>p</literal> can be used to turn the
|
|
default phrase search into a proximity one
|
|
(unordered). Example: <literal>"order any in"p</literal></para>
|
|
</listitem>
|
|
|
|
<listitem><para><literal>C</literal> will turn on case
|
|
sensitivity (if the index supports it).</para></listitem>
|
|
|
|
<listitem><para><literal>D</literal> will turn on diacritics
|
|
sensitivity (if the index supports it).</para></listitem>
|
|
|
|
<listitem><para>A weight can be specified for a query element
|
|
by specifying a decimal value at the start of the
|
|
modifiers. Example: <literal>"Important"2.5</literal>.</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
|
|
|
|
</sect2> <!-- search modifiers -->
|
|
|
|
</sect1> <!-- rcl.search.lang -->
|
|
|
|
|
|
<sect1 id="RCL.SEARCH.CASEDIAC">
|
|
<title>Search case and diacritics sensitivity</title>
|
|
|
|
<para>For &RCL; versions 1.18 and later, and <emphasis>when working
|
|
with a raw index</emphasis> (not the default), searches can be
|
|
made sensitive
|
|
to character case and diacritics. How this happens is controlled by
|
|
configuration variables and what search data is entered.</para>
|
|
|
|
<para>The general default is that searches are insensitive to case
|
|
and diacritics. An entry of <literal>resume</literal> will match any
|
|
of <literal>Resume</literal>, <literal>RESUME</literal>,
|
|
<literal>résumé</literal>, <literal>Résumé</literal> etc.</para>
|
|
|
|
<para>Two configuration variables can automate switching on
|
|
sensitivity:</para>
|
|
|
|
<variablelist>
|
|
|
|
<varlistentry>
|
|
<term>autodiacsens</term><listitem><para>If this is set, search
|
|
sensitivity to diacritics will be turned on as soon as an
|
|
accented character exists in a search term. When the variable
|
|
is set to true, <literal>resume</literal> will start a
|
|
diacritics-insensitive search, but <literal>résumé</literal>
|
|
will be matched exactly. The default value is
|
|
<emphasis>false</emphasis>.</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>autocasesens</term><listitem><para>If this is set, search
|
|
sensitivity to character case will be turned on as soon as an
|
|
upper-case character exists in a search term <emphasis>except
|
|
for the first one</emphasis>. When the variable is set to
|
|
true, <literal>us</literal> or <literal>Us</literal> will
|
|
start a case-insensitive search, but
|
|
<literal>US</literal> will be matched exactly. The default
|
|
value is <emphasis>true</emphasis> (contrary to
|
|
<literal>autodiacsens</literal>).</para></listitem>
|
|
</varlistentry>
|
|
|
|
</variablelist>
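<para>The two rules above can be sketched in a few lines of Python. This
is only an illustration of the documented behaviour, not the actual &RCL;
code: the function names are made up for the example, and detecting
accents through Unicode decomposition is an assumption.</para>

<programlisting><![CDATA[
```python
import unicodedata

def has_diacritics(term):
    """True if any character decomposes into a base letter plus a combining mark."""
    decomposed = unicodedata.normalize("NFD", term)
    return any(unicodedata.combining(ch) for ch in decomposed)

def sensitivity(term, autocasesens=True, autodiacsens=False):
    """Return (case_sensitive, diacritics_sensitive) for one search term."""
    # An upper-case character anywhere except in first position
    # switches case sensitivity on.
    case_sensitive = autocasesens and any(ch.isupper() for ch in term[1:])
    # An accented character anywhere switches diacritics sensitivity on.
    diac_sensitive = autodiacsens and has_diacritics(term)
    return case_sensitive, diac_sensitive
```
]]></programlisting>

<para>With the default values, <literal>sensitivity("US")</literal> reports
a case-sensitive search, while <literal>us</literal> and
<literal>Us</literal> remain insensitive.</para>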
|
|
|
|
<para>As in the past, capitalizing the first letter of a word will
|
|
turn off its stem expansion and have no effect on
|
|
case-sensitivity.</para>
|
|
|
|
<para>You can also explicitly activate case and diacritics
|
|
sensitivity by using modifiers with the query
|
|
language. <literal>C</literal> will make the term case-sensitive, and
|
|
<literal>D</literal> will make it
|
|
diacritics-sensitive. Examples:</para>
|
|
<programlisting>
|
|
"us"C
|
|
</programlisting>
|
|
|
|
<para>will search for the term <literal>us</literal> exactly
|
|
(<literal>Us</literal> will not be a match).</para>
|
|
|
|
<programlisting>
|
|
"resume"D
|
|
</programlisting>
|
|
<para>will search for the term <literal>resume</literal> exactly
|
|
(<literal>résumé</literal> will not be a match).</para>
|
|
|
|
|
|
<para>When either case or diacritics sensitivity is activated, stem
|
|
expansion is turned off. Having both does not make much sense.</para>
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="RCL.SEARCH.ANCHORWILD">
|
|
<title>Anchored searches and wildcards</title>
|
|
|
|
<para>Some special characters are interpreted by &RCL; in search
|
|
strings to expand or specialize the search. Wildcards expand a root
|
|
term in controlled ways. Anchor characters can restrict a search to
|
|
succeed only if the match is found at or near the beginning of the
|
|
document or one of its fields.</para>
|
|
|
|
<sect2 id="RCL.SEARCH.WILDCARDS">
|
|
<title>More about wildcards</title>
|
|
|
|
<para>All words entered in &RCL; search fields will be processed
|
|
for wildcard expansion before the request is finally
|
|
executed.</para>
|
|
|
|
<para>The wildcard characters are:</para>
|
|
|
|
<itemizedlist>
|
|
<listitem><para><literal>*</literal> which matches 0 or more
|
|
characters.</para>
|
|
</listitem>
|
|
<listitem><para><literal>?</literal> which matches
|
|
a single character.</para>
|
|
</listitem>
|
|
<listitem><para><literal>[]</literal> which allow
|
|
defining sets of characters to be matched (ex:
|
|
<literal>[</literal><userinput>abc</userinput><literal>]</literal>
|
|
matches a single character which may be 'a' or 'b' or 'c',
|
|
<literal>[</literal><userinput>0-9</userinput><literal>]</literal>
|
|
matches any single digit).</para>
|
|
</listitem>
|
|
</itemizedlist>
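<para>The Python <literal>fnmatch</literal> module follows the same three
conventions, so it can be used to get a feeling for what a pattern will
select. The term list below is invented for the example, and the
sketch says nothing about how &RCL; actually walks the index.</para>

<programlisting><![CDATA[
```python
from fnmatch import fnmatchcase

# Hypothetical index term list, for illustration only.
TERMS = ["flood", "floor", "floors", "flour", "file1", "file2", "fileA"]

def expand(pattern, terms=TERMS):
    """Return the terms matched by a shell-style wildcard pattern."""
    return [t for t in terms if fnmatchcase(t, pattern)]

print(expand("floor*"))     # '*' matches 0 or more characters
print(expand("flo?r"))      # '?' matches exactly one character
print(expand("file[0-9]"))  # '[]' matches one character from a set
```
]]></programlisting>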
|
|
|
|
<para>You should be aware of a few things when using
|
|
wildcards.</para>
|
|
|
|
<itemizedlist>
|
|
<listitem><para>Using a wildcard character at the beginning of
|
|
a word can make for a slow search because &RCL; will have to
|
|
scan the whole index term list to find the
|
|
matches. However, this is much less a problem for field
|
|
searches, and queries
|
|
like <replaceable>author:*@domain.com</replaceable> can
|
|
sometimes be very useful.</para></listitem>
|
|
|
|
<listitem><para>For &RCL; version 1.18 only, when working with a
|
|
raw index (preserving character case and diacritics), the
|
|
literal part of a wildcard expression will be matched
|
|
exactly for case and diacritics. This is not true any
|
|
more for versions 1.19 and later.</para></listitem>
|
|
|
|
<listitem><para>Using a <literal>*</literal> at the end of a
|
|
word can produce more matches than you would think, and
|
|
strange search results. You can use the
|
|
<link linkend="RCL.SEARCH.GUI.TERMEXPLORER">term
|
|
explorer</link> tool to check what completions exist for
|
|
a given term. You can also see exactly what search was
|
|
performed by clicking on the link at the top of the result
|
|
list. In general, for natural language terms, stem
|
|
expansion will produce better results than an
|
|
ending <literal>*</literal> (stem expansion is turned off
|
|
when any wildcard character appears in the
|
|
term).</para></listitem>
|
|
</itemizedlist>
|
|
|
|
<sect3 id="RCL.SEARCH.WILDCARDS.PATH">
|
|
<title>Wildcards and path filtering</title>
|
|
|
|
<para>Due to the way that &RCL; processes wildcards
|
|
inside <literal>dir</literal> path filtering clauses, they
|
|
will have a multiplicative effect on the query size. A clause
|
|
containing wildcards in several path elements, like, for
|
|
example,
|
|
<literal>dir:</literal><replaceable>/home/me/*/*/docdir</replaceable>,
|
|
will almost certainly fail if your indexed tree is of any realistic
|
|
size.</para>
|
|
|
|
<para>Depending on the case, you may be able to work around
|
|
the issue by specifying the path elements more narrowly, with
|
|
a constant prefix, or by using 2
|
|
separate <literal>dir:</literal> clauses instead of multiple
|
|
wildcards, as
|
|
in <literal>dir:</literal><replaceable>/home/me</replaceable> <literal>dir:</literal><replaceable>docdir</replaceable>. The
|
|
latter query is not equivalent to the initial one because it
|
|
does not specify a number of directory levels, but that's
|
|
the best we can do (and it may be actually more useful in
|
|
some cases).</para>
|
|
|
|
</sect3>
|
|
|
|
</sect2> <!-- wildchars -->
|
|
|
|
<sect2 id="RCL.SEARCH.ANCHOR">
|
|
<title>Anchored searches</title>
|
|
|
|
<para>Two characters are used to specify that a search hit should
|
|
occur at the beginning or at the end of the
|
|
text. <literal>^</literal> at the beginning of a term or phrase
|
|
constrains the search to happen at the start, <literal>$</literal>
|
|
at the end forces it to happen at the end.</para>
|
|
|
|
<para>As this function is implemented as a phrase search it is
|
|
possible to specify a maximum distance at which the hit should
|
|
occur, either through the controls of the advanced search panel, or
|
|
using the query language, for example, as in:
|
|
<programlisting>"^someterm"o10</programlisting> which would force
|
|
<literal>someterm</literal> to be found within 10 terms of the
|
|
start of the text. This can be combined with a field search as in
|
|
<literal>somefield:"^someterm"o10</literal> or
|
|
<literal>somefield:someterm$</literal>.</para>
|
|
|
|
<para>This feature can also be used with an actual phrase search,
|
|
but in this case, the distance applies to the whole phrase and
|
|
anchor, so that, for example, <literal>bla bla my unexpected
|
|
term</literal> at the beginning of the text would be a match for
|
|
<literal>"^my term"o5</literal>.</para>
|
|
|
|
<para>Anchored searches can be very useful for searches inside
|
|
somewhat structured documents like scientific articles, in case
|
|
explicit metadata has not been supplied (a most frequent case), for
|
|
example for looking for matches inside the abstract or the list of
|
|
authors (which occur at the top of the document).</para>
|
|
|
|
|
|
</sect2>
|
|
|
|
</sect1> <!-- wildchars and anchors -->
|
|
|
|
<sect1 id="RCL.SEARCH.DESKTOP">
|
|
<title>Desktop integration</title>
|
|
|
|
<para>Being independent of the desktop type has its drawbacks: &RCL;
|
|
desktop integration is minimal. However there are a few tools
|
|
available:
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>The <application>KDE</application> KIO Slave was
|
|
described in a <link linkend="RCL.SEARCH.KIO">previous
|
|
section</link>.</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>If you use a recent version of Ubuntu Linux, you may
|
|
find the <ulink url="&WIKI;UnityLens">Ubuntu Unity
|
|
Lens</ulink> module useful.</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>There is also an independently developed
|
|
<ulink
|
|
url="http://kde-apps.org/content/show.php/recollrunner?content=128203">
|
|
Krunner plugin</ulink>.</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
|
|
<para>Here follow a few other things that may help.</para>
|
|
|
|
<sect2 id="RCL.SEARCH.SHORTCUT">
|
|
<title>Hotkeying recoll</title>
|
|
|
|
<para>It is surprisingly convenient to be able to show or hide the
|
|
&RCL; GUI with a single keystroke. Recoll comes with a small
|
|
Python script, based on the <application>libwnck</application> window
|
|
manager interface library, which will allow you to do just
|
|
this. The detailed instructions are on
|
|
<ulink url="&WIKI;HotRecoll">this wiki page</ulink>.</para>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.KICKER-APPLET">
|
|
<title>The KDE Kicker Recoll applet</title>
|
|
|
|
<para>This is probably obsolete now. Anyway:</para>
|
|
<para>The &RCL; source tree contains the source code to the
|
|
<application>recoll_applet</application>, a small application derived
|
|
from the <application>find_applet</application>. This can be used to
|
|
add a small &RCL; launcher to the KDE panel.</para>
|
|
|
|
<para>The applet is not automatically built with the main &RCL;
|
|
programs, nor is it included with the main source distribution
|
|
(because the KDE build boilerplate makes it relatively big). You can
|
|
download its source from the recoll.org download page. Use the
|
|
omnipotent <userinput>configure;make;make install</userinput>
|
|
incantation to build and install.</para>
|
|
|
|
<para>You can then add the applet to the panel by right-clicking the
|
|
panel and choosing the <guilabel>Add applet</guilabel> entry.</para>
|
|
|
|
<para>The <application>recoll_applet</application> has a small text
|
|
window where you can type a &RCL; query (in query language form),
|
|
and an icon which can be used to restrict the search to certain
|
|
types of files. It is quite primitive, and launches a new recoll
|
|
GUI instance every time (even if it is already running). You may
|
|
find it useful anyway.</para>
|
|
|
|
</sect2>
|
|
|
|
</sect1> <!-- rcl.search.desktop -->
|
|
|
|
</chapter> <!-- Search -->
|
|
|
|
|
|
<chapter id="RCL.PROGRAM">
|
|
<title>Programming interface</title>
|
|
|
|
<para>&RCL; has an Application Programming Interface, usable both
|
|
for indexing and searching, currently accessible from the
|
|
<application>Python</application> language.</para>
|
|
|
|
<para>Another less radical way to extend the application is to
|
|
write filters for new types of documents.</para>
|
|
|
|
<para>The processing of metadata attributes for documents
|
|
(<literal>fields</literal>) is highly configurable.</para>
|
|
|
|
|
|
|
|
<sect1 id="RCL.PROGRAM.FILTERS">
|
|
<title>Writing a document filter</title>
|
|
|
|
<para>&RCL; filters cooperate to translate from the multitude
|
|
of input document formats, simple ones (such
as <application>opendocument</application> or
<application>acrobat</application>), or compound ones such
|
|
as <application>Zip</application>
|
|
or <application>Email</application>, into the final &RCL;
|
|
indexing input format, which may
|
|
be <literal>text/plain</literal>
|
|
or <literal>text/html</literal>. Most filters are executable
|
|
programs or scripts. A few filters are coded in C++ and live
|
|
inside <command>recollindex</command>. This latter kind will not
|
|
be described here.</para>
|
|
|
|
<para>There are currently (1.18 and since 1.13) two kinds of
|
|
external executable filters:
|
|
<itemizedlist>
|
|
<listitem><para>Simple filters (<literal>exec</literal>
|
|
filters) run once and
|
|
exit. They can be bare programs
|
|
like <application>antiword</application>, or scripts
|
|
using other programs. They are very simple to write,
|
|
because they just need to print the converted document
|
|
to the standard output. Their output can
|
|
be <literal>text/plain</literal>
|
|
or <literal>text/html</literal>.</para>
|
|
</listitem>
|
|
<listitem><para>Multiple filters (<literal>execm</literal>
|
|
filters), run as long as
|
|
their master process (<command>recollindex</command>) is
|
|
active. They can process multiple files (sparing the
|
|
process startup time which can be very significant),
|
|
or multiple documents per file (e.g.: for zip or chm
|
|
files). They communicate with the indexer through a
|
|
simple protocol, but are nevertheless a bit more
|
|
complicated than the older kind. Most new
|
|
filters are written
|
|
in <application>Python</application>, using a common
|
|
module to handle the protocol. There is an
|
|
exception, <command>rclimg</command> which is written
|
|
in Perl. The subdocuments output by these filters can
|
|
be directly indexable (text or HTML), or they can be
|
|
other simple or compound documents that will need to
|
|
be processed by another filter.</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
|
|
<para>In both cases, filters deal with regular file system
|
|
files, and can process either a single document, or a
|
|
linear list of documents in each file. &RCL; is responsible
|
|
for performing up-to-date checks, dealing with more complex
|
|
embedding and other upper level issues.</para>
|
|
|
|
<para>In the extreme case of a simple filter returning a
|
|
document in <literal>text/plain</literal> format, no
|
|
metadata can be transferred from the filter to the
|
|
indexer. Generic metadata, like document size or
|
|
modification date, will be gathered and stored by the
|
|
indexer.</para>
|
|
|
|
<para>Filters that produce <literal>text/html</literal>
|
|
format can return an arbitrary amount of metadata inside HTML
|
|
<literal>meta</literal> tags. These will be processed
|
|
according to the directives found in
|
|
the <link linkend="RCL.PROGRAM.FIELDS">
|
|
<filename>fields</filename> configuration
|
|
file</link>.</para>
|
|
|
|
<para>The filters that can handle multiple documents per file
|
|
return a single piece of data to identify each document inside
|
|
the file. This piece of data, called
|
|
an <literal>ipath element</literal> will be sent back by
|
|
&RCL; to extract the document at query time, for previewing,
|
|
or for creating a temporary file to be opened by a
|
|
viewer.</para>
|
|
|
|
<para>The following section describes the simple
|
|
filters, and the next one gives a few explanations about
|
|
the <literal>execm</literal> ones. You could conceivably
|
|
write a simple filter with only the elements in the
|
|
manual. This will not be the case for the other ones, for
|
|
which you will have to look at the code.</para>
|
|
|
|
<sect2 id="RCL.PROGRAM.FILTERS.SIMPLE">
|
|
<title>Simple filters</title>
|
|
|
|
<para>&RCL; simple filters are usually shell-scripts, but this is in
|
|
no way necessary. Extracting the text from the native format is the
|
|
difficult part. Outputting the format expected by &RCL; is
|
|
trivial. Happily enough, most document formats have translators or
|
|
text extractors which can be called from the filter. In some cases
|
|
the output of the translating program is completely appropriate,
|
|
and no intermediate shell-script is needed.</para>
|
|
|
|
<para>Filters are called with a single argument which is the
|
|
source file name. They should output the result to stdout.</para>
|
|
|
|
<para>When writing a filter, you should decide if it will output
|
|
plain text or HTML. Plain text is simpler, but you will not be able
|
|
to add metadata or vary the output character encoding (this will be
|
|
defined in a configuration file). Additionally, some formatting may
|
|
be easier to preserve when previewing HTML. Actually the deciding factor
|
|
is metadata: &RCL; has a way to <link linkend="RCL.PROGRAM.FILTERS.HTML">
|
|
extract metadata from the HTML header and use it for field
|
|
searches</link>.</para>
|
|
|
|
<para>The <envar>RECOLL_FILTER_FORPREVIEW</envar> environment
|
|
variable (values <literal>yes</literal>, <literal>no</literal>)
|
|
tells the filter if the operation is for indexing or
|
|
previewing. Some filters use this to output a slightly different
|
|
format, for example stripping uninteresting repeated keywords (e.g.
|
|
<literal>Subject:</literal> for email) when indexing. This is not
|
|
essential.</para>
|
|
|
|
<para>You should look at one of the simple filters, for example
|
|
<command>rclps</command> for a starting point.</para>
|
|
|
|
<para>Don't forget to make your filter executable before
|
|
testing!</para>
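<para>As an illustration, here is a minimal sketch of a simple filter
written in Python rather than as a shell script. It assumes that the
input is already UTF-8 text, so that the only work left is escaping and
wrapping it as HTML; a real filter would call a format-specific
extractor instead. The <literal>to_html</literal> name is made up
for the example.</para>

<programlisting><![CDATA[
```python
#!/usr/bin/env python3
import html
import sys

def to_html(text):
    """Wrap plain text in a minimal HTML document, escaping special characters."""
    return (
        "<html><head>\n"
        '<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">\n'
        "</head><body><pre>\n"
        + html.escape(text) +
        "\n</pre></body></html>\n"
    )

if __name__ == "__main__" and len(sys.argv) == 2:
    # The single argument is the source file name.
    with open(sys.argv[1], encoding="utf-8", errors="replace") as f:
        # The result goes to stdout, encoded as announced in the header.
        sys.stdout.buffer.write(to_html(f.read()).encode("utf-8"))
```
]]></programlisting>

<para>A filter which needs to behave differently for previewing could
additionally test the <envar>RECOLL_FILTER_FORPREVIEW</envar>
environment variable.</para>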
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.PROGRAM.FILTERS.MULTIPLE">
|
|
<title>"Multiple" filters</title>
|
|
|
|
<para>If you can program and want to write
|
|
an <literal>execm</literal> filter, it should not be too
|
|
difficult to make sense of one of the existing modules. For
|
|
example, look at <command>rclzip</command> which uses Zip
|
|
file paths as identifiers (<literal>ipath</literal>),
|
|
and <command>rclics</command>, which uses an integer
|
|
index. Also have a look at the comments inside
|
|
the <filename>internfile/mh_execm.h</filename> file and
|
|
possibly at the corresponding module.</para>
|
|
|
|
<para><literal>execm</literal> filters sometimes need to make
|
|
a choice for the nature of the <literal>ipath</literal>
|
|
elements that they use in communication with the
|
|
indexer. Here are a few guidelines:
|
|
<itemizedlist>
|
|
<listitem><para>Use ASCII or UTF-8 (if the identifier is an
|
|
integer print it, for example, like printf %d would
|
|
do).</para></listitem>
|
|
<listitem><para>If at all possible, the data should make some
|
|
kind of sense when printed to a log file to help with
|
|
debugging.</para></listitem>
|
|
<listitem><para>&RCL; uses a colon (<literal>:</literal>) as a
|
|
separator to store a complex path internally (for
|
|
deeper embedding). Colons inside
|
|
the <literal>ipath</literal> elements output by a
|
|
filter will be escaped, but would be a bad choice as a
|
|
filter-specific separator (mostly, again, for
|
|
debugging issues).</para></listitem>
|
|
</itemizedlist>
|
|
In any case, the main goal is that it should
|
|
be easy for the filter to extract the target document, given
|
|
the file name and the <literal>ipath</literal>
|
|
element.</para>
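<para>For a Zip archive, using the member path as
<literal>ipath</literal> element satisfies all of the above. The
following Python sketch shows only this enumeration and extraction
logic, using the standard <literal>zipfile</literal> module. It does
not implement the <literal>execm</literal> communication protocol, and
the function names are invented for the example.</para>

<programlisting><![CDATA[
```python
import zipfile

def list_ipaths(container):
    """Enumerate candidate ipath elements: the member paths of a Zip file."""
    with zipfile.ZipFile(container) as zf:
        # Directory entries end with '/' and carry no document data.
        return [name for name in zf.namelist() if not name.endswith("/")]

def extract(container, ipath):
    """Return the bytes of the subdocument identified by ipath."""
    with zipfile.ZipFile(container) as zf:
        return zf.read(ipath)
```
]]></programlisting>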
|
|
|
|
<para><literal>execm</literal> filters will also produce
|
|
a document with a null <literal>ipath</literal>
|
|
element. Depending on the type of document, this may have
|
|
some associated data (e.g. the body of an email message), or
|
|
none (typical for an archive file). If it is empty, this
|
|
document will be useful anyway for some operations, as the
|
|
parent of the actual data documents.</para>
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.PROGRAM.FILTERS.ASSOCIATION">
|
|
<title>Telling &RCL; about the filter</title>
|
|
|
|
<para>There are two elements that link a file to the filter which
|
|
should process it: the association of file to mime type and the
|
|
association of a mime type with a filter.</para>
|
|
|
|
<para>The association of files to mime types is mostly based on
|
|
name suffixes. The types are defined inside the
|
|
<link linkend="RCL.INSTALL.CONFIG.MIMEMAP">
|
|
<filename>mimemap</filename> file</link>. Example:
|
|
<programlisting>
|
|
|
|
.doc = application/msword
|
|
</programlisting>
|
|
If no suffix association is found for the file name, &RCL; will try
|
|
to execute the <command>file -i</command> command to determine a
|
|
mime type.</para>
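<para>The lookup order just described can be sketched as follows in
Python. The table is a made-up excerpt, not the contents of the real
<filename>mimemap</filename>, and the parsing of the
<command>file -i</command> output assumes the usual
<literal>name: type; charset=...</literal> layout.</para>

<programlisting><![CDATA[
```python
import os
import subprocess

# Hypothetical excerpt of suffix associations, in the spirit of mimemap.
MIMEMAP = {".doc": "application/msword", ".rtf": "text/rtf"}

def mime_type(path, mimemap=MIMEMAP):
    """Try the suffix table first, then fall back on the 'file -i' command."""
    suffix = os.path.splitext(path)[1]
    if suffix in mimemap:
        return mimemap[suffix]
    # 'file -i' prints something like "name: text/plain; charset=us-ascii".
    out = subprocess.run(["file", "-i", path],
                         capture_output=True, text=True).stdout
    return out.rsplit(":", 1)[-1].split(";")[0].strip() or None
```
]]></programlisting>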
|
|
|
|
<para>The association of file types to filters is performed in
|
|
the <link linkend="RCL.INSTALL.CONFIG.MIMECONF">
|
|
<filename>mimeconf</filename> file</link>. A sample will probably be
|
|
of better help than a long explanation:</para>
|
|
<programlisting>
|
|
|
|
[index]
|
|
application/msword = exec antiword -t -i 1 -m UTF-8;\
|
|
mimetype = text/plain ; charset=utf-8
|
|
|
|
application/ogg = exec rclogg
|
|
|
|
text/rtf = exec unrtf --nopict --html; charset=iso-8859-1; mimetype=text/html
|
|
|
|
application/x-chm = execm rclchm
|
|
</programlisting>
|
|
|
|
<para>The fragment specifies that:
|
|
|
|
<itemizedlist>
|
|
<listitem><para><literal>application/msword</literal> files
|
|
are processed by executing the <command>antiword</command>
|
|
program, which outputs
|
|
<literal>text/plain</literal> encoded in
|
|
<literal>utf-8</literal>.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><literal>application/ogg</literal> files are
|
|
processed by the <command>rclogg</command> script, with
|
|
default output type (<literal>text/html</literal>, with
|
|
encoding specified in the header, or <literal>utf-8</literal>
|
|
by default).</para>
|
|
</listitem>
|
|
|
|
<listitem><para><literal>text/rtf</literal> is processed by
|
|
<command>unrtf</command>, which outputs
|
|
<literal>text/html</literal>. The
|
|
<literal>iso-8859-1</literal> encoding is specified because it
|
|
is not the <literal>utf-8</literal> default, and not output by
|
|
<command>unrtf</command> in the HTML header section.</para>
|
|
</listitem>
|
|
<listitem><para><literal>application/x-chm</literal> is processed
|
|
by a persistent filter. This is determined by the
|
|
<literal>execm</literal> keyword.</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
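<para>To make the structure of these definitions explicit, here is a
rough Python sketch which splits a value into the filter kind, the
command line and the optional attributes. It is based only on the sample
above, and is not the parser used by &RCL;; the function name is
invented.</para>

<programlisting><![CDATA[
```python
import shlex

def parse_filter_def(value):
    """Split a filter definition into (kind, argv, attributes)."""
    # Join continuation lines (backslash at the end of a line).
    value = value.replace("\\\n", " ")
    command, *attrs = [part.strip() for part in value.split(";")]
    # The first word is 'exec' or 'execm', the rest is the command line.
    kind, *argv = shlex.split(command)
    attributes = {}
    for attr in attrs:
        if attr:
            name, _, val = attr.partition("=")
            attributes[name.strip()] = val.strip()
    return kind, argv, attributes
```
]]></programlisting>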
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.PROGRAM.FILTERS.HTML">
|
|
<title>Filter HTML output</title>
|
|
|
|
<para>The output HTML could be very minimal like the following
|
|
example:
|
|
<programlisting>
|
|
&lt;html&gt;
&lt;head&gt;
&lt;meta http-equiv="Content-Type" content="text/html;charset=UTF-8"&gt;
&lt;/head&gt;
&lt;body&gt;
Some text content
&lt;/body&gt;
&lt;/html&gt;
|
|
</programlisting>
|
|
</para>
|
|
|
|
<para>You should take care to escape some
|
|
characters inside the text by transforming them into
|
|
appropriate entities. At the very minimum,
"<literal>&amp;</literal>" should be transformed into
"<literal>&amp;amp;</literal>", "<literal>&lt;</literal>"
should be transformed into
"<literal>&amp;lt;</literal>". This is not always properly
|
|
done by translating programs which output HTML, and of
|
|
course never by those which output plain text. </para>
|
|
|
|
<para>When encapsulating plain text in an HTML body,
|
|
the display of a preview may be improved by enclosing the
|
|
text inside <literal>&lt;pre&gt;</literal> tags.</para>
|
|
|
|
<para>The character set needs to be specified in the
|
|
header. It does not need to be UTF-8 (&RCL; will take care
|
|
of translating it), but it must be accurate for good
|
|
results.</para>
|
|
|
|
<para>&RCL; will process <literal>meta</literal> tags inside
|
|
the header as candidate document fields. Document
|
|
fields can be processed by the indexer in different ways,
|
|
for searching or displaying inside query results. This is
|
|
described in a <link linkend="RCL.PROGRAM.FIELDS">following
|
|
section</link>.
|
|
</para>
|
|
|
|
<para>By default, the indexer will process the standard header
|
|
fields if they are present: <literal>title</literal>,
|
|
<literal>meta/description</literal>,
|
|
and <literal>meta/keywords</literal> are both indexed and stored
|
|
for query-time display.</para>
|
|
|
|
<para>A predefined non-standard <literal>meta</literal> tag
|
|
will also be processed by &RCL; without further
|
|
configuration: if a <literal>date</literal> tag is present
|
|
and has the right format, it will be used as the document
|
|
date (for display and sorting), in preference to the file
|
|
modification date. The date format should be as follows:
|
|
<programlisting>
|
|
&lt;meta name="date" content="YYYY-mm-dd HH:MM:SS"&gt;
or
&lt;meta name="date" content="YYYY-mm-ddTHH:MM:SS"&gt;
|
|
</programlisting>
|
|
Example:
|
|
<programlisting>
|
|
&lt;meta name="date" content="2013-02-24 17:50:00"&gt;
|
|
</programlisting>
|
|
</para>
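<para>A filter written in Python could produce the tag with
<literal>strftime</literal>, as in the following sketch (the
<literal>date_meta</literal> helper is invented for the
example):</para>

<programlisting><![CDATA[
```python
from datetime import datetime

def date_meta(moment):
    """Format a datetime as a 'date' meta tag in the first accepted format."""
    return '<meta name="date" content="%s">' % moment.strftime("%Y-%m-%d %H:%M:%S")

print(date_meta(datetime(2013, 2, 24, 17, 50)))
```
]]></programlisting>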
|
|
|
|
<para>Filters also have the possibility to "invent" field
|
|
names. These should also be output as meta tags:</para>
|
|
|
|
<programlisting>
|
|
&lt;meta name="somefield" content="Some textual data" /&gt;
|
|
</programlisting>
|
|
|
|
<para>You can embed HTML markup inside the content of custom
|
|
fields, for improving the display inside result lists. In this
|
|
case, add a (wildly non-standard) <literal>markup</literal>
|
|
attribute to tell &RCL; that the value is HTML and should not
|
|
be escaped for display.</para>
|
|
|
|
<programlisting>
|
|
&lt;meta name="somefield" markup="html" content="Some &lt;i&gt;textual&lt;/i&gt; data" /&gt;
|
|
</programlisting>
|
|
|
|
<para>As written above, the processing of fields is described
|
|
in a <link linkend="RCL.PROGRAM.FIELDS">further
|
|
section</link>.</para>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.PROGRAM.FILTERS.PAGES">
|
|
<title>Page numbers</title>
|
|
|
|
<para>The indexer will interpret <literal>^L</literal> characters
|
|
in the filter output as indicating page breaks, and will record
|
|
them. At query time, this allows starting a viewer on the right
|
|
page for a hit or a snippet. Currently, only the PDF, Postscript
|
|
and DVI filters generate page breaks.</para>
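<para>The <literal>^L</literal> character is the ASCII form feed (code
12, written <literal>\f</literal> in many languages). The following
Python sketch shows how a filter could separate pages, and how a page
number can be derived from a position in the text. The helper names are
invented, and the sketch does not reflect how the indexer actually
records the breaks.</para>

<programlisting><![CDATA[
```python
FORM_FEED = "\f"  # the ^L (Control-L, code 12) character

def join_pages(pages):
    """Emit the text of each page, separated by form feeds."""
    return FORM_FEED.join(pages)

def page_of(text, offset):
    """1-based number of the page containing the character at 'offset'."""
    return text.count(FORM_FEED, 0, offset) + 1
```
]]></programlisting>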
|
|
|
|
</sect2>
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="RCL.PROGRAM.FIELDS">
|
|
<title>Field data processing</title>
|
|
|
|
<para><literal>Fields</literal> are named pieces of information
|
|
in or about documents, like <literal>title</literal>,
|
|
<literal>author</literal>, <literal>abstract</literal>.</para>
|
|
|
|
<para>The field values for documents can appear in several ways
|
|
during indexing: either output by filters
|
|
as <literal>meta</literal> fields in the HTML header section, or
|
|
extracted from file extended attributes, or added as attributes
|
|
of the <literal>Doc</literal> object when using the API, or
|
|
again synthesized internally by &RCL;.</para>
|
|
|
|
<para>The &RCL; query language allows searching for text in a
|
|
specific field.</para>
|
|
|
|
<para>&RCL; defines a number of default fields. Additional
|
|
ones can be output by filters, and described in the
|
|
<filename>fields</filename> configuration file.</para>
|
|
|
|
<para>Fields can be:</para>
|
|
<itemizedlist>
|
|
|
|
<listitem><para><literal>indexed</literal>, meaning that their
|
|
terms are separately stored in inverted lists (with a specific
|
|
prefix), and that a field-specific search is possible.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><literal>stored</literal>, meaning that their
|
|
value is recorded in the index data record for the document,
|
|
and can be returned and displayed with search results.</para>
|
|
</listitem>
|
|
|
|
</itemizedlist>
|
|
|
|
<para>A field can be either or both indexed and stored. This and
|
|
other aspects of field handling are defined inside the
|
|
<filename>fields</filename> configuration file.</para>
|
|
|
|
<para>The sequence of events for field processing is as follows:
|
|
<itemizedlist>
|
|
<listitem><para>During indexing,
|
|
<command>recollindex</command> scans all <literal>meta</literal>
|
|
fields in HTML documents (most document types are transformed
|
|
into HTML at some point). It compares the name for each element
|
|
to the configuration defining what should be done with fields
|
|
(the <filename>fields</filename> file)</para>
|
|
</listitem>
|
|
<listitem><para>If the name for the <literal>meta</literal>
|
|
element matches one for a field that should be indexed, the
|
|
contents are processed and the terms are entered into the index
|
|
with the prefix defined in the <filename>fields</filename>
|
|
file.</para>
|
|
</listitem>
|
|
<listitem><para>If the name for the <literal>meta</literal> element
|
|
matches one for a field that should be stored, the content of the
|
|
element is stored with the document data record, from which it
|
|
can be extracted and displayed at query time.</para>
|
|
</listitem>
|
|
<listitem><para>At query time, if a field search is performed, the
|
|
index prefix is computed and the match is only performed against
|
|
appropriately prefixed terms in the index.</para>
|
|
</listitem>
|
|
<listitem><para>At query time, the field can be displayed inside
|
|
the result list by using the appropriate directive in the
|
|
definition of the <link
|
|
linkend="RCL.SEARCH.GUI.CUSTOM.RESLIST">result list paragraph
|
|
format</link>. All fields are displayed on the fields screen of
|
|
the preview window (which you can reach through the right-click
|
|
menu). This is independent of whether the search which
produced the results used the field.</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
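<para>The indexing part of this sequence can be pictured with a toy
model in Python. Everything in it is hypothetical: the two tables stand
in for the <filename>fields</filename> file, the prefix values are
examples, and the term splitting is far cruder than what
<command>recollindex</command> does.</para>

<programlisting><![CDATA[
```python
from html.parser import HTMLParser

# Hypothetical configuration, standing in for the fields file.
PREFIXES = {"author": "A", "somefield": "XSOMEFIELD"}  # indexed fields
STORED = {"author"}                                    # stored fields

class MetaScanner(HTMLParser):
    """Collect name/content pairs from meta elements."""
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and "name" in attrs:
            self.fields[attrs["name"].lower()] = attrs.get("content", "")

def process(html_text):
    """Return (prefixed index terms, stored data record) for one document."""
    scanner = MetaScanner()
    scanner.feed(html_text)
    terms, record = [], {}
    for name, value in scanner.fields.items():
        if name in PREFIXES:
            # Indexed field: enter its terms with the configured prefix.
            terms += [PREFIXES[name] + word.lower() for word in value.split()]
        if name in STORED:
            # Stored field: keep the raw value for display at query time.
            record[name] = value
    return terms, record
```
]]></programlisting>

<para>A field search would then only be matched against terms carrying
the prefix for that field.</para>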
|
|
|
|
<para>You can find more information in the
|
|
<link linkend="RCL.INSTALL.CONFIG.FIELDS">section about the
|
|
<filename>fields</filename> file</link>, or in comments inside the
|
|
file.</para>
|
|
|
|
<para>You can also have a look at the
|
|
<ulink url="&WIKI;HandleCustomField">example on the Wiki</ulink>,
|
|
detailing how one could add a <emphasis>page count</emphasis> field
|
|
to pdf documents for displaying inside result lists.</para>
|
|
|
|
</sect1>
|
|
|
|
|
|
<sect1 id="RCL.PROGRAM.API">
|
|
<title>API</title>
|
|
|
|
<sect2 id="RCL.PROGRAM.API.ELEMENTS">
|
|
<title>Interface elements</title>
|
|
|
|
<para>A few elements in the interface are specific and need
|
|
an explanation.</para>
|
|
|
|
<variablelist>
|
|
|
|
<varlistentry>
|
|
<term>udi</term> <listitem><para>A udi (unique document
|
|
identifier) identifies a document. Because of limitations
|
|
inside the index engine, it is restricted in length (to
|
|
200 bytes), which is why a regular URI cannot be used. The
|
|
structure and contents of the udi is defined by the
|
|
application and opaque to the index engine. For example,
|
|
the internal file system indexer uses the complete
|
|
document path (file path + internal path), truncated to
|
|
length, the suppressed part being replaced by a hash
|
|
value.</para> </listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>ipath</term>
|
|
|
|
<listitem><para>This data value (set as a field in the Doc
|
|
object) is stored, along with the URL, but not indexed by
|
|
&RCL;. Its contents are not interpreted, and its use is up
|
|
to the application. For example, the &RCL; internal file
|
|
system indexer stores the part of the document access path
|
|
internal to the container file (<literal>ipath</literal> in
|
|
this case is a list of subdocument sequential numbers). url
|
|
and ipath are returned in every search result and permit
|
|
access to the original document.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>Stored and indexed fields</term>
|
|
|
|
<listitem><para>The <filename>fields</filename> file inside
|
|
the &RCL; configuration defines which document fields are
|
|
either "indexed" (searchable), "stored" (retrievable with
|
|
search results), or both.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
</variablelist>
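<para>As an illustration of the udi length constraint, the
following Python sketch builds an identifier from a file path and
an ipath, and replaces the overflow with a hash when the result
would exceed the limit. The function name
<literal>make_udi</literal>, the <literal>|</literal> separator and
the choice of MD5 are assumptions made for this example. They are
not taken from the &RCL; source, and an application may use any
scheme which yields unique identifiers within the limit.</para>
<programlisting>
<![CDATA[
```python
import hashlib

UDI_MAX_BYTES = 200  # length limit stated in the udi description


def make_udi(path, ipath="", maxlen=UDI_MAX_BYTES):
    """Build a udi from a file path and an internal path (hypothetical helper).

    If the concatenation fits in maxlen bytes it is returned as is.
    Otherwise it is truncated and the suppressed part is replaced by a
    hash of that part, so that long paths sharing a prefix still differ.
    The result is a byte string, because truncation works on bytes.
    """
    raw = (path + "|" + ipath).encode("utf-8")
    if len(raw) <= maxlen:
        return raw
    keep = maxlen - 32  # an MD5 hex digest is 32 characters long
    digest = hashlib.md5(raw[keep:]).hexdigest().encode("ascii")
    return raw[:keep] + digest
```
]]>
</programlisting>
<para>With this scheme, short paths stay readable inside the
index, and only the tail of a very long path is replaced by the
hash.</para>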
|
|
|
|
<para>Data for an external indexer should be stored in a
|
|
separate index, not the one for the &RCL; internal file system
|
|
indexer, except if the latter is not used at all. The reason
|
|
is that the main document indexer purge pass would remove all
|
|
the other indexer's documents, as they were not seen during
|
|
indexing. The main indexer documents would also probably be a
|
|
problem for the external indexer purge operation.</para>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.PROGRAM.API.PYTHON">
|
|
<title>Python interface</title>
|
|
|
|
<sect3 id="RCL.PROGRAM.PYTHON.INTRO">
|
|
<title>Introduction</title>
|
|
|
|
<para>&RCL; versions after 1.11 define a Python programming
|
|
interface, both for searching and indexing.</para>
|
|
|
|
<para>The API is inspired by the Python database API
|
|
specification, version 1.0 for &RCL; versions up to 1.18,
|
|
version 2.0 for &RCL; versions 1.19 and later. The package
|
|
structure changed with &RCL; 1.19 too. We will mostly
|
|
describe the new API and package structure here. A paragraph
|
|
at the end of this section will explain a few differences
|
|
and ways to write code compatible with both versions.</para>
|
|
|
|
<para>The Python interface can be found in the source package,
|
|
under <filename>python/recoll</filename>.</para>
|
|
|
|
<para>The <filename>python/recoll/</filename> directory
|
|
contains the usual <filename>setup.py</filename>. After
|
|
configuring the main &RCL; code, you can use the script to
|
|
build and install the Python module:
|
|
<screen>
|
|
<userinput>cd recoll-xxx/python/recoll</userinput>
|
|
<userinput>python setup.py build</userinput>
|
|
<userinput>python setup.py install</userinput>
|
|
</screen>
|
|
</para>
|
|
|
|
</sect3>
|
|
|
|
<sect3 id="RCL.PROGRAM.PYTHON.PACKAGE">
|
|
<title>Recoll package</title>
|
|
|
|
<para>The <literal>recoll</literal> package contains two
|
|
modules:
|
|
<itemizedlist>
|
|
<listitem><para>The <literal>recoll</literal> module contains
|
|
functions and classes used to query (or update) the
|
|
index.</para></listitem>
|
|
<listitem><para>The <literal>rclextract</literal> module contains
|
|
functions and classes used to access document
|
|
data.</para></listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
</sect3>
|
|
|
|
<sect3 id="RCL.PROGRAM.PYTHON.RECOLL">
|
|
<title>The recoll module</title>
|
|
|
|
<sect4 id="RCL.PROGRAM.PYTHON.RECOLL.FUNCTIONS">
|
|
<title>Functions</title>
|
|
|
|
<variablelist>
|
|
<varlistentry>
|
|
<term>connect(confdir=None, extra_dbs=None,
|
|
writable = False)</term>
|
|
<listitem>
|
|
The <literal>connect()</literal> function connects to
|
|
one or several &RCL; index(es) and returns
|
|
a <literal>Db</literal> object.
|
|
<itemizedlist>
|
|
<listitem><literal>confdir</literal> may specify
|
|
a configuration directory. The usual defaults
|
|
apply.</listitem>
|
|
<listitem><literal>extra_dbs</literal> is a list of
|
|
additional indexes (Xapian directories). </listitem>
|
|
<listitem><literal>writable</literal> decides if
|
|
we can index new data through this
|
|
connection.</listitem>
|
|
</itemizedlist>
|
|
This call initializes the recoll module, and it should
|
|
always be performed before any other call or object creation.
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
</variablelist>
|
|
</sect4>
|
|
|
|
|
|
<sect4 id="RCL.PROGRAM.PYTHON.RECOLL.CLASSES">
|
|
<title>Classes</title>
|
|
|
|
<sect5 id="RCL.PROGRAM.PYTHON.RECOLL.CLASSES.DB">
|
|
<title>The Db class</title>
|
|
|
|
<para>A Db object is created by
|
|
a <literal>connect()</literal> function and holds a
|
|
connection to a Recoll index.</para>
|
|
<variablelist>
|
|
<title>Methods</title>
|
|
<varlistentry>
|
|
<term>Db.close()</term>
|
|
<listitem>Closes the connection. You can't do anything
|
|
with the <literal>Db</literal> object after
|
|
this.</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term>Db.query(), Db.cursor()</term> <listitem>These
|
|
aliases return a blank <literal>Query</literal> object
|
|
for this index.</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>Db.setAbstractParams(maxchars, contextwords)</term>
|
|
<listitem>Set the parameters used to build snippets.</listitem>
|
|
</varlistentry>
|
|
|
|
</variablelist>
|
|
|
|
</sect5>
|
|
|
|
|
|
<sect5 id="RCL.PROGRAM.PYTHON.RECOLL.CLASSES.QUERY">
|
|
<title>The Query class</title>
|
|
|
|
<para>A <literal>Query</literal> object (equivalent to a
|
|
cursor in the Python DB API) is created by
|
|
a <literal>Db.query()</literal> call. It is used to
|
|
execute index searches.</para>
|
|
|
|
<variablelist>
|
|
<title>Methods</title>
|
|
|
|
<varlistentry>
|
|
<term>Query.sortby(fieldname, ascending=True)</term>
|
|
<listitem>Sort results
|
|
by <replaceable>fieldname</replaceable>, in ascending
|
|
or descending order. Must be called before executing
|
|
the search.</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>Query.execute(query_string, stemming=1,
|
|
stemlang="english")</term>
|
|
<listitem>Starts a search
|
|
for <replaceable>query_string</replaceable>, a &RCL;
|
|
search language string.</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>Query.executesd(SearchData)</term>
|
|
<listitem>Starts a search for the query defined by the
|
|
SearchData object.</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>Query.fetchmany(size=query.arraysize)</term>
|
|
|
|
<listitem>Fetches
|
|
the next <literal>Doc</literal> objects in the current
|
|
search results, and returns them as an array of the
|
|
required size, which is by default the value of
|
|
the <literal>arraysize</literal> data member.</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>Query.fetchone()</term>
|
|
<listitem>Fetches the next <literal>Doc</literal> object
|
|
from the current search results.</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>Query.close()</term>
|
|
<listitem>Closes the query. The object is unusable
|
|
after the call.</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>Query.scroll(value, mode='relative')</term>
|
|
<listitem>Adjusts the position in the current result
|
|
set. <literal>mode</literal> can
|
|
be <literal>relative</literal>
|
|
or <literal>absolute</literal>. </listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>Query.getgroups()</term>
|
|
<listitem>Retrieves the expanded query terms as a list
|
|
of pairs. Meaningful only after execute() or executesd().
|
|
In each pair, the first entry is a list of user terms,
|
|
the second a list of query terms as derived from the
|
|
user terms and used in the Xapian Query. The size of
|
|
each list is one for simple terms, or more for group
|
|
and phrase clauses.</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>Query.getxquery()</term>
|
|
<listitem>Return the Xapian query description as a Unicode string.
|
|
Meaningful only after execute() or executesd().</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>Query.highlight(text, ishtml = 0, methods = object)</term>
|
|
<listitem>Will insert &lt;span class="rclmatch"&gt;,
&lt;/span&gt; tags around the match areas in the input text
|
|
and return the modified text. <literal>ishtml</literal>
|
|
can be set to indicate that the input text is HTML and
|
|
that HTML special characters should not be escaped.
|
|
<literal>methods</literal> if set should be an object
|
|
with methods startMatch(i) and endMatch() which will be
|
|
called for each match and should return a begin and end
|
|
tag, respectively.</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>Query.makedocabstract(doc, methods = object)</term>
|
|
<listitem>Create a snippets abstract
|
|
for <literal>doc</literal> (a <literal>Doc</literal>
|
|
object) by selecting text around the match terms.
|
|
If methods is set, will also perform highlighting. See
|
|
the highlight method.
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>Query.__iter__() and Query.next()</term>
|
|
<listitem>So that things like <literal>for doc in
|
|
query:</literal> will work.</listitem>
|
|
</varlistentry>
|
|
</variablelist>
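<para>The <literal>methods</literal> argument of
<literal>highlight()</literal> can be any object providing the two
methods described above. The following sketch defines such an
object, which emits bold tags instead of the default span. The
class name <literal>BoldMarker</literal> and the helper
<literal>highlight_bold()</literal> are invented for this example,
and the argument of <literal>startMatch()</literal> is assumed to
be a match index; only the method names and
the <literal>highlight()</literal> parameters come from the
description above.</para>
<programlisting>
<![CDATA[
```python
class BoldMarker(object):
    """Hypothetical 'methods' object for Query.highlight()."""

    def startMatch(self, idx):
        # idx is assumed to identify the match being opened; unused here.
        return "<b>"

    def endMatch(self):
        return "</b>"


def highlight_bold(query, text, ishtml=False):
    """Return text with the matches of an executed query wrapped in <b>."""
    return query.highlight(text, ishtml=1 if ishtml else 0,
                           methods=BoldMarker())
```
]]>
</programlisting>
<para>The same object could be passed
to <literal>makedocabstract()</literal> to obtain highlighted
snippets.</para>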
|
|
|
|
<variablelist>
|
|
<title>Data descriptors</title>
|
|
|
|
<varlistentry><term>Query.arraysize</term> <listitem>Default
|
|
number of records processed by fetchmany (r/w).</listitem>
|
|
</varlistentry>
|
|
<varlistentry><term>Query.rowcount</term><listitem>Number of
|
|
records returned by the last execute.</listitem></varlistentry>
|
|
<varlistentry><term>Query.rownumber</term><listitem>Next index
|
|
to be fetched from results. Normally increments after
|
|
each fetchone() call, but can be set/reset before the
|
|
call to effect seeking. Starts at 0.</listitem>
|
|
</varlistentry>
|
|
|
|
</variablelist>
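<para>As a possible use of the <literal>rowcount</literal> data
member together with <literal>fetchone()</literal>, the sketch
below groups the results of an executed query into pages of a
fixed size. The generator name <literal>iter_pages</literal> is
hypothetical, and the code assumes
that <literal>rowcount</literal> is the exact number of documents
which can be fetched.</para>
<programlisting>
<![CDATA[
```python
def iter_pages(query, pagesize=10):
    """Yield successive lists of Doc objects from an executed query.

    Only Query.rowcount and Query.fetchone() are used; the last page
    may hold fewer than pagesize documents.
    """
    page = []
    for _ in range(query.rowcount):
        page.append(query.fetchone())
        if len(page) == pagesize:
            yield page
            page = []
    if page:
        yield page
```
]]>
</programlisting>
<para>Because this is a generator, documents are only fetched
when the caller asks for the next page.</para>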
|
|
|
|
</sect5>
|
|
|
|
|
|
<sect5 id="RCL.PROGRAM.PYTHON.RECOLL.CLASSES.DOC">
|
|
<title>The Doc class</title>
|
|
|
|
<para>A <literal>Doc</literal> object contains index data
|
|
for a given document. The data is extracted from the
|
|
index when searching, or set by the indexer program when
|
|
updating. The Doc object has many attributes to be read or
|
|
set by its user. It matches exactly the Rcl::Doc C++
|
|
object. Some of the attributes are predefined, but,
|
|
especially when indexing, others can be set, the name of
|
|
which will be processed as field names by the indexing
|
|
configuration. Inputs can be specified as Unicode or
|
|
strings. Outputs are Unicode objects. All dates are
|
|
specified as Unix timestamps, printed as strings. Please
|
|
refer to the <filename>rcldb/rcldoc.h</filename> C++ file
|
|
for a description of the predefined attributes.</para>
|
|
|
|
<para>At query time, only the fields that are defined
|
|
as <literal>stored</literal> either by default or in
|
|
the <filename>fields</filename> configuration file will be
|
|
meaningful in the <literal>Doc</literal>
|
|
object. In particular, this will not be the case for the
|
|
document text. See the <literal>rclextract</literal>
|
|
module for accessing document contents.</para>
|
|
|
|
<variablelist>
|
|
<title>Methods</title>
|
|
|
|
<varlistentry>
|
|
<term>get(key), [] operator</term>
|
|
<listitem>Retrieve the named doc attribute</listitem>
|
|
</varlistentry>
|
|
<varlistentry><term>getbinurl()</term><listitem>Retrieve
|
|
the URL in byte array format (no transcoding), for use as
|
|
parameter to a system call.</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term>items()</term>
|
|
<listitem>Return a dictionary of doc object
|
|
keys/values</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term>keys()</term>
|
|
<listitem>list of doc object keys (attribute
|
|
names).</listitem>
|
|
</varlistentry>
|
|
</variablelist>
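<para>The following sketch
combines <literal>keys()</literal>
and <literal>get()</literal> to copy a few fields of
a <literal>Doc</literal> object into a plain dictionary. The
helper name <literal>doc_fields</literal> is hypothetical, and the
default field names are only examples: which fields are actually
present depends on what is stored in the index.</para>
<programlisting>
<![CDATA[
```python
def doc_fields(doc, wanted=("title", "size")):
    """Return a plain dict holding the requested fields of a Doc object.

    Fields which are absent from the Doc (for example because they are
    not defined as stored) are left out of the result.
    """
    present = set(doc.keys())
    return dict((name, doc.get(name)) for name in wanted if name in present)
```
]]>
</programlisting>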
|
|
|
|
</sect5> <!-- Doc -->
|
|
|
|
<sect5 id="RCL.PROGRAM.PYTHON.RECOLL.CLASSES.SEARCHDATA">
|
|
<title>The SearchData class</title>
|
|
|
|
<para>A <literal>SearchData</literal> object allows building
|
|
a query by combining clauses, for execution
|
|
by <literal>Query.executesd()</literal>. It can be used
|
|
instead of the query language approach. The
|
|
interface is going to change a little, so no detailed doc
|
|
for now...</para>
|
|
|
|
<variablelist>
|
|
<title>Methods</title>
|
|
|
|
<varlistentry>
|
|
<term>addclause(type='and'|'or'|'excl'|'phrase'|'near'|'sub',
|
|
qstring=string, slack=0, field='', stemming=1,
|
|
subSearch=SearchData)</term>
|
|
<listitem></listitem>
|
|
</varlistentry>
|
|
</variablelist>
|
|
|
|
</sect5> <!-- SearchData -->
|
|
|
|
</sect4> <!-- recoll.classes -->
|
|
</sect3> <!-- Recoll module -->
|
|
|
|
<sect3 id="RCL.PROGRAM.PYTHON.RCLEXTRACT">
|
|
<title>The rclextract module</title>
|
|
|
|
<para>Document content is not provided by an index query. To
|
|
access it, the data extraction part of the indexing process
|
|
must be performed (subdocument access and format
|
|
translation). This is not trivial in
|
|
general. The <literal>rclextract</literal> module currently
|
|
provides a single class which can be used to access the data
|
|
content for result documents.</para>
|
|
|
|
<sect4 id="RCL.PROGRAM.PYTHON.RCLEXTRACT.CLASSES">
|
|
<title>Classes</title>
|
|
|
|
<sect5 id="RCL.PROGRAM.PYTHON.RECOLL.CLASSES.EXTRACTOR">
|
|
<title>The Extractor class</title>
|
|
|
|
<variablelist>
|
|
<title>Methods</title>
|
|
|
|
<varlistentry>
|
|
<term>Extractor(doc)</term>
|
|
<listitem>An <literal>Extractor</literal> object is
|
|
built from a <literal>Doc</literal> object, output
|
|
from a query.</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term>Extractor.textextract(ipath)</term>
|
|
<listitem>Extract document defined
|
|
by <replaceable>ipath</replaceable> and return
|
|
a <literal>Doc</literal> object. The doc.text field
|
|
has the document text as either text/plain or
|
|
text/html according to doc.mimetype.</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term>Extractor.idoctofile()</term>
|
|
<listitem>Extracts document into an output file,
|
|
which can be given explicitly or will be created as a
|
|
temporary file to be deleted by the caller.</listitem>
|
|
</varlistentry>
|
|
|
|
</variablelist>
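<para>A minimal sketch of text extraction for a query result
follows. The helper name <literal>extracted_text</literal> is
hypothetical; it only relies on
the <literal>textextract()</literal> method and on
the <literal>text</literal> and <literal>mimetype</literal> fields
described above, and expects the caller to build
the <literal>Extractor</literal> object from a result
document.</para>
<programlisting>
<![CDATA[
```python
def extracted_text(extractor, ipath):
    """Return (mimetype, text) for the document designated by ipath.

    extractor is expected to be an rclextract.Extractor built from a
    query result, for example rclextract.Extractor(doc), and ipath is
    then the ipath field of that same result (doc.ipath).
    """
    fulldoc = extractor.textextract(ipath)
    return fulldoc.mimetype, fulldoc.text
```
]]>
</programlisting>
<para>The returned MIME type tells the caller whether the text
should be handled as text/plain or text/html.</para>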
|
|
|
|
</sect5> <!-- Extractor class -->
|
|
</sect4> <!-- rclextract classes -->
|
|
</sect3> <!-- rclextract module -->
|
|
|
|
|
|
|
|
<sect3 id="RCL.PROGRAM.PYTHON.EXAMPLES">
|
|
<title>Example code</title>
|
|
|
|
<para>The following sample would query the index with a user
|
|
language string. See the <filename>python/samples</filename>
|
|
directory inside the &RCL; source for other
|
|
examples. The <filename>recollgui</filename> subdirectory
|
|
has a very embryonic GUI which demonstrates the
|
|
highlighting and data extraction functions.</para>
|
|
|
|
<programlisting>
#!/usr/bin/env python
<![CDATA[
from recoll import recoll

# Open the default index and set the snippet parameters
db = recoll.connect()
db.setAbstractParams(maxchars=80, contextwords=4)

query = db.query()
nres = query.execute("some user question")
print "Result count: ", nres
# Only display the first 5 results
if nres > 5:
    nres = 5
for i in range(nres):
    doc = query.fetchone()
    print "Result #%d" % (query.rownumber,)
    for k in ("title", "size"):
        print k, ":", getattr(doc, k).encode('utf-8')
    # Build a snippets abstract for this result
    abs = db.makeDocAbstract(doc, query).encode('utf-8')
    print abs
    print
]]>
</programlisting>
|
|
|
|
</sect3>
|
|
|
|
<sect3 id="RCL.PROGRAM.PYTHON.COMPAT">
|
|
<title>Compatibility with the previous version</title>
|
|
|
|
<para>The following code fragments can be used to ensure that
|
|
code can run with both the old and the new API (as long as it
|
|
does not use the new abilities of the new API of
|
|
course).</para>
|
|
|
|
<para>Adapting to the new package structure:</para>
|
|
<programlisting>
<![CDATA[
try:
    # New package structure: recoll is a package
    from recoll import recoll
    from recoll import rclextract
    hasextract = True
except ImportError:
    # Old structure: recoll is a plain module
    import recoll
    hasextract = False
]]>
</programlisting>
|
|
|
|
<para>Adapting to the change of nature of
|
|
the <literal>next</literal> <literal>Query</literal>
|
|
member. The same test can be used to choose to use
|
|
the <literal>scroll()</literal> method (new) or set
|
|
the <literal>next</literal> value (old).</para>
|
|
|
|
<programlisting>
<![CDATA[
rownum = query.next if type(query.next) == int else \
    query.rownumber
]]>
</programlisting>
|
|
|
|
</sect3> <!-- compat with previous version -->
|
|
</sect2>
|
|
</sect1>
|
|
</chapter>
|
|
|
|
|
|
<chapter id="RCL.INSTALL">
|
|
<title>Installation and configuration</title>
|
|
|
|
<sect1 id="RCL.INSTALL.BINARY">
|
|
<title>Installing a binary copy</title>
|
|
|
|
<para>There are three types of binary &RCL; installations:
|
|
<itemizedlist>
|
|
<listitem><para>Through your system's normal software distribution
|
|
framework (e.g. <application>Debian/Ubuntu apt</application>,
|
|
<application>FreeBSD</application> ports, etc.).</para>
|
|
</listitem>
|
|
|
|
<listitem><para>From a package downloaded from the
|
|
&RCL; web site.</para>
|
|
</listitem>
|
|
|
|
<listitem><para>From a prebuilt tree downloaded from the &RCL;
|
|
web site.</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
|
|
In all cases, the strict software dependencies (e.g. on &XAP; or
|
|
<application>iconv</application>) will be automatically satisfied,
|
|
you should not have to worry about them.</para>
|
|
|
|
<para>You will only have to check or install <link
|
|
linkend="RCL.INSTALL.EXTERNAL">supporting applications</link>
|
|
for the file types that you want to index beyond those that are
|
|
natively processed by &RCL; (text, HTML, email files, and a few
|
|
others).</para>
|
|
|
|
<para>You may also want to have a look at the
|
|
<link linkend="RCL.INSTALL.CONFIG">configuration section</link>
|
|
(but this may not be necessary for a quick test with default
|
|
parameters). Most parameters can be more conveniently set from the
|
|
GUI interface.</para>
|
|
|
|
<sect2 id="RCL.INSTALL.BINARY.PACKAGE">
|
|
<title>Installing through a package system</title>
|
|
|
|
<para>If you use a BSD-type port system or a prebuilt package (DEB,
|
|
RPM, manually or through the system software configuration
|
|
utility), just follow the usual procedure for your system.</para>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.INSTALL.BINARY.RCL">
|
|
<title>Installing a prebuilt &RCL;</title>
|
|
|
|
<para>The unpackaged binary versions on the &RCL; web site are
|
|
just compressed tar files of a build tree, where only the
|
|
useful parts were kept (executables and sample
|
|
configuration).</para>
|
|
|
|
<para>The executable binary files are built with a static link to
|
|
libxapian and libiconv, to make installation easier (no
|
|
dependencies).</para>
|
|
|
|
<para>After extracting the tar file, you can proceed with
|
|
<link linkend="RCL.INSTALL.BUILDING.INSTALL">installation</link> as
|
|
if you had built the package from source (that is, just type
|
|
<literal>make install</literal>). The binary trees are built for
|
|
installation to <filename>/usr/local</filename>.</para>
|
|
|
|
</sect2>
|
|
</sect1>
|
|
|
|
<sect1 id="RCL.INSTALL.EXTERNAL">
|
|
<title>Supporting packages</title>
|
|
|
|
<para>&RCL; uses external applications to index some file
|
|
types. You need to install them for the file types that you wish to
|
|
have indexed (these are run-time optional dependencies. None is
|
|
needed for building or running &RCL; except for indexing their
|
|
specific file type).</para>
|
|
|
|
<para>After an indexing pass, the commands that were found
|
|
missing can be displayed from the <command>recoll</command>
|
|
<guilabel>File</guilabel> menu. The list is stored in the
|
|
<filename>missing</filename> text file inside the configuration
|
|
directory.</para>
|
|
|
|
<para>A list of common file types which need external
|
|
commands follows. Many of the filters need the
|
|
<command>iconv</command> command, which is not always listed as a
|
|
dependency.</para>
|
|
|
|
<para>Please note that, due to the relatively dynamic nature of this
|
|
information, the most up to date version is now kept on the &RCLAPPS;
|
|
along with links to the home pages or best source/patches pages,
|
|
and misc tips. The list below is not updated often and may be quite
|
|
stale.</para>
|
|
|
|
<para>For many Linux distributions, most of the commands listed can
|
|
be installed from the package repositories. However, the packages
|
|
are sometimes outdated, or not the best version for &RCL;, so you
|
|
should take a look at the &RCLAPPS; if a file
|
|
type is important to you.</para>
|
|
|
|
<para>As of &RCL; release 1.14, a number of XML-based formats that
|
|
were handled by ad hoc filter code now use the
|
|
<command>xsltproc</command> command, which usually comes with
|
|
<application>libxslt</application>. These are: abiword, fb2
|
|
(ebooks), kword, openoffice, svg.</para>
|
|
|
|
<para>Now for the list:</para>
|
|
<itemizedlist>
|
|
|
|
<listitem><para>Openoffice files need <command>unzip</command> and
|
|
<command>xsltproc</command>.</para></listitem>
|
|
|
|
<listitem><para>PDF files need <command>pdftotext</command> which
|
|
is part of the <application>Xpdf</application> or
|
|
<application>Poppler</application> packages.</para></listitem>
|
|
|
|
<listitem><para>Postscript files need <command>pstotext</command>.
|
|
The original version has an issue with shell
|
|
characters in file names, which is corrected in recent
|
|
packages. See the &RCLAPPS; for more detail.</para>
|
|
</listitem>
|
|
|
|
<listitem><para>MS Word needs
|
|
<command>antiword</command>. It is also useful to have
|
|
<command>wvWare</command> installed as it may be
|
|
used as a fallback for some files which
|
|
<command>antiword</command> does not handle.</para></listitem>
|
|
|
|
<listitem><para>MS Excel and PowerPoint need <command>
|
|
catdoc</command>.</para></listitem>
|
|
|
|
<listitem><para>MS Open XML (docx) needs <command>
|
|
xsltproc</command>.</para></listitem>
|
|
|
|
<listitem><para>Wordperfect files need <command>wpd2html</command>
|
|
from the <application>libwpd</application> (or
|
|
<application>libwpd-tools</application> on Ubuntu)
|
|
package.</para></listitem>
|
|
|
|
<listitem><para>RTF files need <command>unrtf</command>, which, in
|
|
its standard version, has much trouble with non-western character
|
|
sets. Check the &RCLAPPS;.</para></listitem>
|
|
|
|
<listitem><para>TeX files need <command>untex</command> or
|
|
<command>detex</command>. Check the &RCLAPPS; for sources if it's not
|
|
packaged for your distribution.</para></listitem>
|
|
|
|
<listitem><para>dvi files need <command>dvips</command>.</para>
|
|
</listitem>
|
|
|
|
<listitem><para>djvu files need <command>djvutxt</command> and
|
|
<command>djvused</command> from the
|
|
<application>DjVuLibre</application> package.</para></listitem>
|
|
|
|
<listitem><para>Audio files: &RCL; releases before 1.13
|
|
used the <command>id3info</command> command from the <application>
|
|
id3lib</application> package to extract mp3 tag information,
|
|
<command>metaflac</command> (standard flac tools) for flac files,
|
|
and <command>ogginfo</command> (vorbis tools) for ogg
|
|
files. Releases 1.14 and later use a single
|
|
<application>Python</application> filter based
|
|
on <application>mutagen</application> for all audio file
|
|
types.</para>
|
|
</listitem>
|
|
|
|
<listitem><para>Pictures: &RCL; uses the
|
|
<application>Exiftool</application>
|
|
<application>Perl</application> package to extract tag
|
|
information. Most image file formats are supported. Note that
|
|
there may not be much interest in indexing the technical tags
|
|
(image size, aperture, etc.). This is only of interest if you
|
|
store personal tags or textual descriptions inside the image
|
|
files.</para></listitem>
|
|
|
|
<listitem><para>chm: files in Microsoft help format need Python and
|
|
the <application>pychm</application> module (which needs
|
|
<application>chmlib</application>).</para></listitem>
|
|
|
|
<listitem><para>ICS: up to &RCL; 1.13, iCalendar files need
|
|
<application>Python</application>
|
|
and the <application>icalendar</application>
|
|
module. <application>icalendar</application> is not needed for newer
|
|
versions, which use internal code.</para></listitem>
|
|
|
|
<listitem><para>Zip archives need <application>Python</application>
|
|
(and the standard zipfile module).</para></listitem>
|
|
|
|
<listitem><para>Rar archives need
|
|
<application>Python</application>, the
|
|
<application>rarfile</application> Python module and the
|
|
<command>unrar</command> utility.</para></listitem>
|
|
|
|
<listitem><para>Midi karaoke files need
|
|
<application>Python</application> and the
|
|
<ulink url="http://pypi.python.org/pypi/midi/0.2.1">
|
|
<application>Midi module</application></ulink>.</para>
|
|
</listitem>
|
|
|
|
<listitem><para>Konqueror webarchive files need Python (the filter uses the
|
|
Tarfile module).</para></listitem>
|
|
|
|
<listitem><para>mimehtml web archive format (support based on the email
|
|
filter, which introduces some mild weirdness, but still
|
|
usable).</para></listitem>
|
|
|
|
</itemizedlist>
|
|
|
|
<para>Text, HTML, email folders, and Scribus files are
|
|
processed internally. <application>Lyx</application> is used to
|
|
index Lyx files. Many filters need <command>iconv</command> and the
|
|
standard <command>sed</command> and <command>awk</command>.
|
|
</para>
|
|
|
|
</sect1>
|
|
|
|
|
|
<sect1 id="RCL.INSTALL.BUILDING">
|
|
<title>Building from source</title>
|
|
|
|
<sect2 id="RCL.INSTALL.BUILDING.PREREQS">
|
|
<title>Prerequisites</title>
|
|
|
|
<para>C++ compiler. Up to &RCL; version 1.13.04, its absence can
|
|
manifest itself by strange messages about a missing
|
|
iconv_open.</para>
|
|
|
|
<para>Development files for <ulink
|
|
url="http://www.xapian.org"> <application>Xapian
|
|
core</application></ulink>.</para> <important><para>If you are
|
|
building Xapian for an older CPU (before Pentium 4 or Athlon
|
|
64), you need to add the <option>--disable-sse</option> flag
|
|
to the configure command. Otherwise all Xapian applications will
|
|
crash with an <literal>illegal instruction</literal>
|
|
error.</para> </important>
|
|
|
|
<para>Development files for
|
|
<ulink url="http://www.trolltech.com/products/qt/index.html">
|
|
<application>Qt</application> </ulink>.</para>
|
|
|
|
<para>Development files for <application>X11</application> and
|
|
<application>zlib</application>.</para>
|
|
|
|
<para>Check the <ulink url="http://www.recoll.org/download.html">
|
|
&RCL; download page</ulink> for up to date version
|
|
information.</para>
|
|
|
|
<para>You will most probably be able to find a binary package for
|
|
<application>Qt</application> for your system. You may have to
|
|
compile &XAP; but this is not difficult (if you are using
|
|
<application>FreeBSD</application>, there is a port).</para>
|
|
|
|
<para>You may also need
|
|
<ulink
|
|
url="http://www.gnu.org/software/libiconv/">libiconv</ulink>. &RCL;
|
|
currently uses version 1.9 (this should not be critical). On
|
|
<application>Linux</application> systems, the iconv interface
|
|
is part of libc and you should not need to do anything
|
|
special.</para>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.INSTALL.BUILDING.BUILD">
|
|
<title>Building</title>
|
|
|
|
<para>&RCL; has been built on Linux, FreeBSD, Mac OS X, and Solaris,
|
|
most versions after 2005 should be ok, maybe some older ones too
|
|
(Solaris 8 is ok). If you build on another system, and
|
|
need to modify things,
|
|
<ulink url="mailto:jfd@recoll.org">I would
|
|
very much welcome patches</ulink>.</para>
|
|
|
|
<para>Depending on the <application>Qt 3</application>
|
|
configuration on your system, you may have to set the
|
|
<envar>QTDIR</envar> and <envar>QMAKESPECS</envar>
|
|
variables in your environment:</para>
|
|
<itemizedlist>
|
|
<listitem><para><envar>QTDIR</envar> should point to the
|
|
directory above the one that holds the Qt include files (e.g.:
|
|
if <filename>qt.h</filename> is
|
|
<filename>/usr/local/qt/include/qt.h</filename>, QTDIR
|
|
should be <filename>/usr/local/qt</filename>).</para>
|
|
</listitem>
|
|
<listitem><para><envar>QMAKESPECS</envar> should
|
|
be set to the name of one of the
|
|
<application>Qt</application> mkspecs sub-directories (e.g.:
|
|
<filename>linux-g++</filename>).</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
|
|
<para>On many Linux systems, <envar>QTDIR</envar> is set
|
|
by the login scripts, and <envar>QMAKESPECS</envar> is not
|
|
needed because there is a <filename>default</filename> link in
|
|
<filename>mkspecs/</filename>.</para>
|
|
|
|
<para>Neither <envar>QTDIR</envar> nor
|
|
<envar>QMAKESPECS</envar> should be needed with
|
|
<application>Qt 4</application>,
|
|
configuration details are entirely determined by
|
|
<command>qmake</command> (which is quite often installed as
|
|
<command>qmake-qt4</command>).</para>
|
|
|
|
<formalpara><title>Configure options:</title>
|
|
<para>
|
|
<itemizedlist>
|
|
<listitem><para><option>--without-aspell</option>
|
|
will disable the code for phonetic matching of search
|
|
terms. </para>
|
|
</listitem>
|
|
<listitem><para><option>--with-fam</option> or
|
|
<option>--with-inotify</option> will enable the code for
|
|
real time indexing. Inotify support is enabled by default on
|
|
recent Linux systems.</para>
|
|
</listitem>
|
|
<listitem><para><option>--disable-webkit</option> is available
|
|
from version 1.17 to implement the result list with a
|
|
<application>Qt</application> QTextBrowser instead of a
|
|
WebKit widget if you do not or can't depend on the
|
|
latter.</para>
|
|
</listitem>
|
|
<listitem><para><option>--enable-xattr</option> will enable
|
|
code to fetch data from file extended attributes. This is only
|
|
useful if some application stores data in there, and also needs
|
|
some simple configuration (see comments in the
|
|
<filename>fields</filename> configuration file).</para>
|
|
</listitem>
|
|
<listitem><para><option>--enable-camelcase</option> will enable
|
|
splitting <replaceable>camelCase</replaceable> words. This
|
|
is not enabled by default as it has the unfortunate
|
|
side-effect of making some phrase searches quite
|
|
confusing: e.g., <literal>"MySQL manual"</literal> would be
|
|
matched by <literal>"MySQL manual"</literal> and
|
|
<literal>"my sql manual"</literal> but not <literal>"mysql
|
|
manual"</literal> (only inside phrase searches).</para>
|
|
</listitem>
|
|
<listitem><para><option>--with-file-command</option> Specify
|
|
the version of the 'file' command to use (e.g.:
|
|
--with-file-command=/usr/local/bin/file). Can be useful to
|
|
enable the gnu version on systems where the native one is
|
|
bad.</para>
|
|
</listitem>
|
|
<listitem><para><option>--disable-qtgui</option> Disable the Qt
|
|
interface. Will allow building the indexer and the command line
|
|
search program in absence of a Qt environment.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><option>--disable-x11mon</option> Disable
|
|
<application>X11</application> connection monitoring
|
|
inside recollindex. Together with --disable-qtgui, this
|
|
allows building recoll without
|
|
<application>Qt</application> and
|
|
<application>X11</application>.</para> </listitem>
|
|
|
|
<listitem><para>Of course the usual
|
|
<application>autoconf</application> <command>configure</command>
|
|
options, like <option>--prefix</option> apply.</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
</formalpara>
|
|
|
|
<para>Normal procedure:</para>
|
|
<screen>
|
|
<userinput>cd recoll-xxx</userinput>
|
|
<userinput>configure</userinput>
|
|
<userinput>make</userinput>
|
|
<userinput>(practices usual hardship-repelling invocations)</userinput>
|
|
</screen>
|
|
|
|
|
|
<para>There is little auto-configuration. The
|
|
<command>configure</command> script will mainly link one of
|
|
the system-specific files in the <filename>mk</filename>
|
|
directory to <filename>mk/sysconf</filename>. If your system
|
|
is not known yet, it will tell you as much, and you may want
|
|
to manually copy and modify one of the existing files (the new
|
|
file name should be the output of <command>uname</command>
|
|
<option>-s</option>).</para>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.INSTALL.BUILDING.INSTALL">
|
|
<title>Installation</title>
|
|
|
|
<para>Either type <userinput>make install</userinput> or execute
|
|
<userinput>recollinstall
|
|
<replaceable>prefix</replaceable></userinput>, in the root
|
|
of the source tree. This will copy the commands to
|
|
<filename><replaceable>prefix</replaceable>/bin</filename>
|
|
and the sample configuration files, scripts and other shared
|
|
data to
|
|
<filename><replaceable>prefix</replaceable>/share/recoll</filename>.</para>
|
|
<para>If the installation prefix given to
|
|
<command>recollinstall</command> is different from either the
|
|
system default or the value which was
|
|
specified when executing <command>configure</command> (as in
|
|
<userinput>configure --prefix /some/path</userinput>), you
|
|
will have to set the <envar>RECOLL_DATADIR</envar>
|
|
environment variable to indicate where the shared data is to
|
|
        be found (e.g. for (ba)sh:
|
|
<userinput>export RECOLL_DATADIR=/some/path/share/recoll</userinput>).
|
|
</para>
|
|
|
|
<para>You can then proceed to <link
|
|
linkend="RCL.INSTALL.CONFIG">configuration</link>. </para>
|
|
|
|
</sect2>
|
|
</sect1>
|
|
|
|
<sect1 id="RCL.INSTALL.CONFIG">
|
|
<title>Configuration overview</title>
|
|
|
|
<para>Most of the parameters specific to the
|
|
<command>recoll</command> GUI are set through the
|
|
<guilabel>Preferences</guilabel> menu and stored in the standard Qt
|
|
place (<filename>$HOME/.config/Recoll.org/recoll.conf</filename>).
|
|
You probably do not want to edit this by hand.</para>
|
|
|
|
<para>&RCL; indexing options are set inside text configuration
|
|
files located in a configuration directory. There can be
|
|
    several such directories, each of which defines the parameters
|
|
for one index.</para>
|
|
|
|
<para>The configuration files can be edited by hand or through
|
|
the <guilabel>Index configuration</guilabel> dialog
|
|
(<guilabel>Preferences</guilabel> menu). The GUI tool will try
|
|
to respect your formatting and comments as much as possible,
|
|
so it is quite possible to use both ways.</para>
|
|
|
|
<para>The most accurate documentation for the
|
|
configuration parameters is given by comments inside the default
|
|
files, and we will just give a general overview here.</para>
|
|
|
|
<para>For each index, there are two sets of configuration
|
|
files. System-wide configuration files are kept in a directory named
|
|
like <filename>/usr/[local/]share/recoll/examples</filename>,
|
|
and define default values, shared by all indexes. For each
|
|
index, a parallel set of files defines the customized
|
|
parameters.</para>
|
|
|
|
<para>The default location of the configuration is the
|
|
<filename>.recoll</filename>
|
|
directory in your home. Most people will only use this
|
|
directory.</para>
|
|
|
|
<para>This location can be changed, or others can be added with the
|
|
<envar>RECOLL_CONFDIR</envar> environment variable or the
|
|
<option>-c</option> option parameter to <command>recoll</command> and
|
|
<command>recollindex</command>.</para>
|
|
|
|
<para>If the <filename>.recoll</filename> directory does not
|
|
exist when <command>recoll</command> or
|
|
    <command>recollindex</command> is started, it will be created
|
|
with a set of empty configuration files.
|
|
<command>recoll</command> will give you a chance to edit the
|
|
configuration file before starting
|
|
indexing. <command>recollindex</command> will proceed
|
|
immediately. To avoid mistakes, the automatic directory
|
|
creation will only occur for the
|
|
default location, not if <option>-c</option> or
|
|
<envar>RECOLL_CONFDIR</envar> were used (in the latter
|
|
cases, you will have to create the directory).</para>
|
|
|
|
|
|
<para>All configuration files share the same format. For
|
|
example, a short extract of the main configuration file might
|
|
look as follows:</para>
|
|
<programlisting>
|
|
# Space-separated list of directories to index.
|
|
topdirs = ~/docs /usr/share/doc
|
|
|
|
[~/somedirectory-with-utf8-txt-files]
|
|
defaultcharset = utf-8
|
|
</programlisting>
|
|
|
|
<para>There are three kinds of lines: </para>
|
|
<itemizedlist>
|
|
<listitem><para>Comment (starts with
|
|
<emphasis>#</emphasis>) or empty.</para>
|
|
</listitem>
|
|
      <listitem><para>Parameter assignment (<emphasis>name =
|
|
value</emphasis>).</para>
|
|
</listitem>
|
|
<listitem><para>Section definition
|
|
([<emphasis>somedirname</emphasis>]).</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
|
|
<para>Depending on the type of configuration file, section
|
|
definitions either separate groups of parameters or allow
|
|
redefining some parameters for a directory sub-tree. They stay
|
|
in effect until another section definition, or the end of
|
|
file, is encountered. Some of the parameters used for indexing
|
|
are looked up hierarchically from the current directory
|
|
location upwards. Not all parameters can be meaningfully
|
|
        redefined; this is specified for each in the next
|
|
section. </para>
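
    <para>As a minimal sketch of how section definitions scope a
    parameter, the following fragment redefines
    <varname>defaultcharset</varname> for two subtrees. The directory
    names are hypothetical and only serve the illustration:</para>
    <programlisting>
        # Global value, used wherever no section applies.
        defaultcharset = iso-8859-1

        [~/docs/utf8]
        # In effect for this subtree, until the next section definition.
        defaultcharset = utf-8

        [~/docs/central-europe]
        defaultcharset = iso-8859-2
    </programlisting>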
|
|
|
|
<para>When found at the beginning of a file path, the tilde
|
|
character (~) is expanded to the name of the user's home
|
|
directory, as a shell would do.</para>
|
|
|
|
<para>White space is used for separation inside lists.
|
|
List elements with embedded spaces can be quoted using
|
|
double-quotes.</para>
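
    <para>For illustration, a list mixing plain and quoted elements
    could look like the following. The directory names are
    hypothetical:</para>
    <programlisting>
        # Three elements: the second one contains a space and is quoted.
        topdirs = ~/docs "/home/me/My Documents" /usr/share/doc
    </programlisting>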
|
|
|
|
<formalpara>
|
|
<title>Encoding issues</title>
|
|
<para>Most of the configuration parameters are plain ASCII. Two
|
|
particular sets of values may cause encoding issues:</para>
|
|
</formalpara>
|
|
<para>
|
|
<itemizedlist>
|
|
        <listitem><para>File path parameters may contain non-ASCII
|
|
characters and should use the exact same byte values as found in
|
|
the file system directory. Usually, this means that the
|
|
configuration file should use the system default locale
|
|
encoding.</para>
|
|
</listitem>
|
|
        <listitem><para>The <varname>unac_except_trans</varname> parameter
        should be encoded in UTF-8. If your system locale is not UTF-8, and
        you need to also specify non-ASCII file paths, this poses a
|
|
difficulty because common text editors cannot handle multiple
|
|
encodings in a single file. In this relatively unlikely case, you
|
|
can edit the configuration file as two separate text files with
|
|
appropriate encodings, and concatenate them to create the complete
|
|
configuration.</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
|
|
<sect2 id="RCL.INSTALL.CONFIG.RECOLLCONF">
|
|
<title>Main configuration file</title>
|
|
|
|
<para><filename>recoll.conf</filename> is the main
|
|
configuration file. It defines things like
|
|
what to index (top directories and things to ignore), and the
|
|
default character set to use for document types which do not
|
|
specify it internally.</para>
|
|
|
|
<para>The default configuration will index your home
|
|
directory. If this is not appropriate, start
|
|
<command>recoll</command> to create a blank
|
|
configuration, click <guimenu>Cancel</guimenu>, and edit
|
|
the configuration file before restarting the command. This
|
|
will start the initial indexing, which may take some time.</para>
|
|
|
|
<para>Most of the following parameters can be changed from the
|
|
<guilabel>Index Configuration</guilabel> menu in the
|
|
<command>recoll</command> interface. Some can only be set by
|
|
editing the configuration file.</para>
|
|
|
|
<sect3 id="RCL.INSTALL.CONFIG.RECOLLCONF.FILES">
|
|
<title>Parameters affecting what documents we index:</title>
|
|
|
|
<variablelist>
|
|
|
|
<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.TOPDIRS">
|
|
<term><varname>topdirs</varname></term>
|
|
<listitem><para>Specifies the list of directories or files to
|
|
index (recursively for directories). You can use symbolic links
|
|
as elements of this list. See the
|
|
<varname>followLinks</varname> option about following symbolic links
|
|
found under the top elements (not followed by default).</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry><term><varname>skippedNames</varname></term>
|
|
<listitem>
|
|
<para>A space-separated list of patterns for
|
|
names of files or directories that should be completely
|
|
ignored. The list defined in the default file is: </para>
|
|
<programlisting>
|
|
skippedNames = #* bin CVS Cache cache* caughtspam tmp .thumbnails .svn \
|
|
*~ .beagle .git .hg .bzr loop.ps .xsession-errors \
|
|
.recoll* xapiandb recollrc recoll.conf
|
|
</programlisting>
|
|
<para>The list can be redefined at any sub-directory in the
|
|
indexed area.</para>
|
|
<para>The top-level directories are not affected by this
|
|
list (that is, a directory in <varname>topdirs</varname>
|
|
might match and would still be indexed).</para>
|
|
<para>The list in the default configuration does not
|
|
exclude hidden directories (names beginning with a
|
|
dot), which means that it may index quite a few things
|
|
that you do not want. On the other hand, email user
|
|
agents like <application>thunderbird</application>
|
|
usually store messages in hidden directories, and you
|
|
probably want this indexed. One possible solution is to
|
|
have <filename>.*</filename> in
|
|
<varname>skippedNames</varname>, and add things like
|
|
<filename>~/.thunderbird</filename> or
|
|
<filename>~/.evolution</filename> in
|
|
<varname>topdirs</varname>.</para>
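
          <para>A possible sketch of this approach follows. It
          assumes that redefining <varname>skippedNames</varname>
          replaces the default list, so the non-hidden default
          patterns are repeated; the dot-prefixed defaults are
          dropped because <filename>.*</filename> already covers
          them:</para>
          <programlisting>
          # Index the home directory, plus two hidden mail directories
          # which are named explicitly as top-level entries.
          topdirs = ~ ~/.thunderbird ~/.evolution
          # Skip all hidden names below the top-level directories
          # (assumption: this value replaces the default list).
          skippedNames = .* #* bin CVS Cache cache* caughtspam tmp \
                         *~ loop.ps xapiandb recollrc recoll.conf
          </programlisting>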
|
|
|
|
<para>Not even the file names are indexed for patterns
|
|
in this list. See the
|
|
<varname>recoll_noindex</varname> variable in
|
|
<filename>mimemap</filename> for an alternative
|
|
approach which indexes the file names.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry><term><varname>skippedPaths</varname> and
|
|
<varname>daemSkippedPaths</varname> </term>
|
|
<listitem>
|
|
<para>A space-separated list of patterns for
|
|
<emphasis>paths</emphasis> of files or directories that should be skipped.
|
|
There is no default in the sample configuration file,
|
|
but the code always adds the configuration and database
|
|
directories in there.</para>
|
|
<para><varname>skippedPaths</varname> is used both by
|
|
batch and real time
|
|
indexing. <varname>daemSkippedPaths</varname> can be
|
|
used to specify things that should be indexed at
|
|
startup, but not monitored.</para>
|
|
<para>Example of use for skipping text files only in a
|
|
specific directory:</para>
|
|
<programlisting>
|
|
skippedPaths = ~/somedir/*.txt
|
|
</programlisting>
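          <para>Similarly, a subtree which should be indexed by the
          initial pass but not watched afterwards could be listed in
          <varname>daemSkippedPaths</varname>. The directory name
          below is hypothetical:</para>
          <programlisting>
          # Indexed at startup, but not monitored by the real time indexer.
          daemSkippedPaths = ~/archives
          </programlisting>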
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.SKIPPEDPATHSFNMPATHNAME">
|
|
<term><varname>skippedPathsFnmPathname</varname></term>
|
|
<listitem><para>The values in the
|
|
<varname>*skippedPaths</varname> variables are matched by
|
|
default with <literal>fnmatch(3)</literal>, with the
|
|
FNM_PATHNAME and FNM_LEADING_DIR flags. This means that '/'
|
|
        characters must be matched explicitly. You can set
|
|
<varname>skippedPathsFnmPathname</varname> to 0 to disable
|
|
the use of FNM_PATHNAME (meaning that /*/dir3 will match
|
|
/dir1/dir2/dir3).</para>
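
        <para>The difference can be sketched with the pattern used
        above. The paths are purely illustrative:</para>
        <programlisting>
        # Default behaviour (FNM_PATHNAME set): '*' does not match '/', so
        # this pattern matches /dir1/dir3 but not /dir1/dir2/dir3.
        skippedPaths = /*/dir3
        # With FNM_PATHNAME disabled, '*' may also match '/', and the same
        # pattern then matches /dir1/dir2/dir3 as well.
        skippedPathsFnmPathname = 0
        </programlisting>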
|
|
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.FOLLOWLINKS">
|
|
<term><varname>followLinks</varname></term>
|
|
<listitem><para>Specifies if the indexer should follow
|
|
symbolic links while walking the file tree. The default is
|
|
to ignore symbolic links to avoid multiple indexing of
|
|
linked files. No effort is made to avoid duplication when
|
|
this option is set to true. This option can be set
|
|
individually for each of the <varname>topdirs</varname>
|
|
        members by using sections. It cannot be changed below the
|
|
<varname>topdirs</varname> level.</para>
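
        <para>For example, symbolic links could be followed under
        only one of two top directories by using a section. This is a
        sketch with hypothetical directory names, assuming the same
        0/1 boolean notation as used elsewhere in this file:</para>
        <programlisting>
        topdirs = ~/docs ~/shared

        # Follow symbolic links only under ~/shared.
        [~/shared]
        followLinks = 1
        </programlisting>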
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry><term><varname>indexedmimetypes</varname></term>
|
|
<listitem><para>&RCL; normally indexes any file which it
|
|
knows how to read. This list lets you restrict the indexed
|
|
mime types to what you specify. If the variable is
|
|
unspecified or the list empty (the default), all supported
|
|
types are processed.</para>
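
        <para>As an illustration, the following value would restrict
        indexing to three document types. The choice of types is
        arbitrary, and the list is assumed to be space-separated like
        the other list parameters:</para>
        <programlisting>
        indexedmimetypes = text/plain text/html application/pdf
        </programlisting>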
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry><term><varname>compressedfilemaxkbs</varname></term>
|
|
<listitem><para>Size limit for compressed (.gz or .bz2)
|
|
files. These need to be decompressed in a temporary
|
|
directory for identification, which can be very wasteful
|
|
if 'uninteresting' big compressed files are present.
|
|
Negative means no limit, 0 means no processing of any
|
|
compressed file. Defaults to -1.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry><term><varname>textfilemaxmbs</varname></term>
|
|
<listitem><para>Maximum size for text files. Very big text
|
|
files are often uninteresting logs. Set to -1 to disable
|
|
(default 20MB).</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry><term><varname>textfilepagekbs</varname></term>
|
|
<listitem><para>If set to other than -1, text files will be
|
|
indexed as multiple documents of the given page size. This may
|
|
be useful if you do want to index very big text files as it
|
|
will both reduce memory usage at index time and help with
|
|
loading data to the preview window. A size of a few megabytes
|
|
would seem reasonable (default: 1MB).</para>
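
        <para>A possible combination for indexing large text files
        is sketched below. The values are arbitrary, and the units
        are assumed to follow the parameter names (megabytes for
        <varname>textfilemaxmbs</varname>, kilobytes for
        <varname>textfilepagekbs</varname>):</para>
        <programlisting>
        # Accept text files up to 100 MB...
        textfilemaxmbs = 100
        # ...and index them as pages of approximately 2 MB each.
        textfilepagekbs = 2000
        </programlisting>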
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry><term><varname>membermaxkbs</varname></term>
|
|
<listitem><para>This defines the maximum size in kilobytes for
|
|
an archive member (zip, tar or rar at the moment). Bigger
|
|
entries will be skipped.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry><term><varname>indexallfilenames</varname></term>
|
|
<listitem><para>&RCL; indexes file names in a special
|
|
section of the database to allow specific file names
|
|
searches using wild cards. This parameter decides if
|
|
file name indexing is performed only for files with mime
|
|
types that would qualify them for full text indexing, or
|
|
for all files inside the selected subtrees, independently of
|
|
mime type.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry><term><varname>usesystemfilecommand</varname></term>
|
|
<listitem><para>Decide if we use the
|
|
<command>file</command> <option>-i</option> system command
|
|
as a final step for determining the mime type for a file
|
|
(the main procedure uses suffix associations as defined in
|
|
the <filename>mimemap</filename> file). This can be useful
|
|
for files with suffix-less names, but it will also cause
|
|
the indexing of many bogus "text" files.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry><term><varname>processwebqueue</varname></term>
|
|
<listitem><para>If this is set, process the directory where
|
|
Web browser plugins copy visited pages for indexing.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry><term><varname>webqueuedir</varname></term>
|
|
<listitem><para>The path to the web indexing queue. This is
|
|
hard-coded in the Firefox plugin as
|
|
<filename>~/.recollweb/ToIndex</filename> so there should be no
|
|
need to change it.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
</variablelist>
|
|
</sect3>
|
|
|
|
<sect3 id="RCL.INSTALL.CONFIG.RECOLLCONF.TERMS">
|
|
<title>Parameters affecting how we generate terms:</title>
|
|
|
|
<para>Changing some of these parameters will imply a full
|
|
reindex. Also, when using multiple indexes, it may not make sense
|
|
to search indexes that don't share the values for these parameters,
|
|
because they usually affect both search and index operations.</para>
|
|
|
|
<variablelist>
|
|
|
|
<varlistentry><term><varname>indexStripChars</varname></term>
|
|
<listitem><para>Decide if we strip characters of diacritics and
|
|
convert them to lower-case before terms are indexed. If we
|
|
don't, searches sensitive to case and diacritics can be
|
|
performed, but the index will be bigger, and some marginal
|
|
weirdness may sometimes occur. The default is a stripped
|
|
index (<literal>indexStripChars = 1</literal>) for
|
|
now. When using multiple indexes for a search,
|
|
this parameter must be defined identically for
|
|
all. Changing the value implies an index reset.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry><term><varname>maxTermExpand</varname></term>
|
|
<listitem><para>Maximum expansion count for a single term (e.g.:
|
|
when using wildcards). The default of 10000 is reasonable and
|
|
will avoid queries that appear frozen while the engine is
|
|
walking the term list.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry><term><varname>maxXapianClauses</varname></term>
|
|
<listitem><para>Maximum number of elementary clauses we can add
|
|
to a single Xapian query. In some cases, the result of term
|
|
expansion can be multiplicative, and we want to avoid using
|
|
excessive memory. The default of 100 000 should be both
|
|
high enough in most cases and compatible with current
|
|
typical hardware configurations.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry><term><varname>nonumbers</varname></term>
|
|
        <listitem><para>If this is set to true, no terms will be generated
        for numbers. For example, "123", "1.5e6" or "192.168.1.4" would not
|
|
be indexed ("value123" would still be). Numbers are often quite
|
|
interesting to search for, and this should probably not be set
|
|
        except for special situations, e.g., scientific documents with huge
|
|
amounts of numbers in them. This can only be set for a whole
|
|
index, not for a subtree.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry><term><varname>nocjk</varname></term>
|
|
        <listitem><para>If this is set to true, specific East Asian
        (Chinese, Japanese, Korean) character/word splitting is
        turned off. This will save a small amount of CPU if you
|
|
have no CJK documents. If your document base does include
|
|
such text but you are not interested in searching it,
|
|
setting <varname>nocjk</varname> may be a significant time
|
|
and space saver.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry><term><varname>cjkngramlen</varname></term>
|
|
<listitem><para>This lets you adjust the size of n-grams
|
|
used for indexing CJK text. The default value of 2 is
|
|
probably appropriate in most cases. A value of 3 would
|
|
allow more precision and efficiency on longer words, but
|
|
the index will be approximately twice as large.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry><term><varname>indexstemminglanguages</varname></term>
|
|
<listitem><para>A list of languages for which the stem
|
|
expansion databases will be built. See <citerefentry>
|
|
<refentrytitle>recollindex</refentrytitle>
|
|
<manvolnum>1</manvolnum> </citerefentry> or use the
|
|
<command>recollindex</command> <option>-l</option> command
|
|
for possible values. You can add a stem expansion database
|
|
for a different language by using
|
|
<command>recollindex</command> <option>-s</option>, but it
|
|
will be deleted during the next indexing. Only languages
|
|
listed in the configuration file are permanent.</para>
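
        <para>For instance, stem expansion databases for two
        languages could be requested as follows (the language names
        are examples; <command>recollindex</command>
        <option>-l</option> gives the values actually available on
        your system):</para>
        <programlisting>
        indexstemminglanguages = english french
        </programlisting>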
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry><term><varname>defaultcharset</varname></term>
|
|
<listitem><para>The name of the character set used for
|
|
        files that do not contain a character set definition (e.g.:
|
|
plain text files). This can be redefined for any
|
|
sub-directory. If it is not set at all, the character set
|
|
        used is the one defined by the NLS environment (
|
|
<envar>LC_ALL</envar>, <envar>LC_CTYPE</envar>,
|
|
<envar>LANG</envar>), or <literal>iso8859-1</literal>
|
|
if nothing is set.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry><term><varname>unac_except_trans</varname></term>
|
|
<listitem><para>This is a list of characters, encoded in UTF-8,
|
|
which should be handled specially when converting text to
|
|
unaccented lowercase. For example, in Swedish, the letter
|
|
<literal>a with diaeresis</literal> has full alphabet
|
|
citizenship and should not be turned into an
|
|
<literal>a</literal>. Each element in the space-separated list
|
|
has the special character as first element and the translation
|
|
following. The handling of both the lowercase and upper-case
|
|
        versions of a character should be specified, as membership in
        the list will turn off both standard accent and case
|
|
processing. Example for Swedish:</para>
|
|
<programlisting>
|
|
unac_except_trans = åå Åå ää Ää öö Öö
|
|
</programlisting>
|
|
|
|
<para>Note that the translation is not limited to a single
|
|
        character: you could very well have something like
|
|
<literal>üue</literal> in the list.</para>
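
        <para>As a further sketch, a value which might suit German
        text keeps the umlauts as distinct letters (folding only
        their case) and expands the sharp s to two characters. This
        is an illustration, not a recommended setting:</para>
        <programlisting>
        unac_except_trans = ää Ää öö Öö üü Üü ßss
        </programlisting>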
|
|
|
|
<para>The default value set for
|
|
<literal>unac_except_trans</literal> can't be listed here
|
|
because I have trouble with SGML and UTF-8, but it only
|
|
        contains ligature decompositions: German ss, oe, ae, fi,
|
|
fl.</para>
|
|
|
|
<para>This parameter can't be defined for subdirectories, it
|
|
is global, because there is no way to do otherwise when
|
|
querying. If you have document sets which would need different
|
|
values, you will have to index and query them separately.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry><term><varname>maildefcharset</varname></term>
|
|
<listitem><para>This can be used to define the default
|
|
character set specifically for email messages which don't
|
|
specify it. This is mainly useful for readpst (libpst) dumps,
|
|
        which are UTF-8 but do not say so.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry><term><varname>localfields</varname></term>
|
|
<listitem><para>This allows setting fields for all documents
|
|
under a given directory. Typical usage would be to set an
|
|
"rclaptg" field, to be used in <filename>mimeview</filename> to
|
|
select a specific viewer. If several fields are to be set, they
|
|
should be separated with a semi-colon (';') character, which there
|
|
is currently no way to escape. Also note the initial semi-colon.
|
|
Example:
|
|
<literal>localfields= ;rclaptg=gnus;other = val</literal>, then
|
|
        select a specific viewer with
|
|
<literal>mimetype|tag=...</literal> in
|
|
<filename>mimeview</filename>.</para>
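
        <para>Since the parameter applies to a directory, it would
        typically be set inside a section. A minimal sketch, with a
        hypothetical mail directory:</para>
        <programlisting>
        [~/Mail/gnus]
        # Note the initial semi-colon.
        localfields = ; rclaptg = gnus
        </programlisting>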
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.METADATACMDS">
|
|
<term><varname>metadatacmds</varname></term>
|
|
<listitem><para>This allows executing external commands
|
|
for each file and storing the output in a &RCL;
|
|
field. This could be used for example to index external
|
|
tag data. The value is a list of field names and commands,
|
|
don't forget an initial semi-colon. Example:
|
|
<programlisting>
|
|
[/some/area/of/the/fs]
|
|
metadatacmds = ; tags = tmsu tags %f; otherfield = somecmd -xx %f
|
|
</programlisting>
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
</variablelist>
|
|
|
|
|
|
</sect3>
|
|
|
|
<sect3 id="RCL.INSTALL.CONFIG.RECOLLCONF.STORAGE">
|
|
<title>Parameters affecting where and how we store things:</title>
|
|
|
|
<variablelist>
|
|
<varlistentry><term><varname>dbdir</varname></term>
|
|
<listitem><para>The name of the Xapian data directory. It
|
|
will be created if needed when the index is
|
|
initialized. If this is not an absolute path, it will be
|
|
interpreted relative to the configuration directory. The
|
|
value can have embedded spaces but starting or trailing
|
|
spaces will be trimmed. You cannot use quotes here.</para>
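
        <para>The two forms can be sketched as follows. The
        absolute path is hypothetical and is shown commented
        out:</para>
        <programlisting>
        # Relative path: interpreted relative to the configuration directory.
        dbdir = xapiandb
        # Absolute path with an embedded space, written without quotes:
        # dbdir = /srv/recoll data/xapiandb
        </programlisting>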
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry><term><varname>idxstatusfile</varname></term>
|
|
<listitem><para>The name of the scratch file where the indexer
|
|
process updates its status. Default:
|
|
<filename>idxstatus.txt</filename> inside the configuration
|
|
directory.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry><term><varname>maxfsoccuppc</varname></term>
|
|
<listitem><para>Maximum file system occupation before we
|
|
stop indexing. The value is a percentage, corresponding to
|
|
what the "Capacity" df output column shows. The default
|
|
value is 0, meaning no checking. </para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry><term><varname>mboxcachedir</varname></term>
|
|
<listitem><para>The directory where mbox message offsets cache
|
|
files are held. This is normally $RECOLL_CONFDIR/mboxcache, but
|
|
it may be useful to share a directory between different
|
|
configurations.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry><term><varname>mboxcacheminmbs</varname></term>
|
|
<listitem><para>The minimum mbox file size over which we
|
|
cache the offsets. There is really no sense in caching
|
|
offsets for small files. The default is 5 MB.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry><term><varname>webcachedir</varname></term>
|
|
<listitem><para>This is only used by the web browser
|
|
plugin indexing code, and defines where the cache for visited
|
|
pages will live. Default:
|
|
<filename>$RECOLL_CONFDIR/webcache</filename></para>
|
|
</listitem>
|
|
|
|
</varlistentry>
|
|
<varlistentry><term><varname>webcachemaxmbs</varname></term>
|
|
<listitem><para>This is only used by the web browser
|
|
plugin indexing code, and defines the maximum size for the web
|
|
page cache. Default: 40 MB.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
|
|
<varlistentry><term><varname>idxflushmb</varname></term>
|
|
<listitem><para>Threshold (megabytes of new text data) where we
|
|
flush from memory to disk index. Setting this can help control
|
|
memory usage. A value of 0 means no explicit flushing, letting
|
|
Xapian use its own default, which is flushing every 10000 (or
|
|
XAPIAN_FLUSH_THRESHOLD) documents, which gives little memory
|
|
usage control, as memory usage also depends on average document
|
|
size. The default value is 10, and it is probably a bit low. If
|
|
your system usually has free memory, you can try higher values
|
|
between 20 and 80. In my experience, values beyond 100 are
|
|
always counterproductive.</para>
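
        <para>For a machine with spare memory, a setting within the
        range suggested above could be combined with a file system
        occupation limit, as in this sketch (both values are
        arbitrary examples):</para>
        <programlisting>
        # Flush to the disk index after every 50 MB of new text data.
        idxflushmb = 50
        # Stop indexing when the file system is 90% full.
        maxfsoccuppc = 90
        </programlisting>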
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
</variablelist>
|
|
</sect3>
|
|
|
|
<sect3 id="RCL.INSTALL.CONFIG.RECOLLCONF.MISC">
|
|
<title>Miscellaneous parameters:</title>
|
|
|
|
<variablelist>
|
|
|
|
<varlistentry><term><varname>autodiacsens</varname></term>
|
|
        <listitem><para>If the index is not stripped, decide if we
|
|
automatically trigger diacritics sensitivity if the search
|
|
term has accented characters (not in
|
|
<literal>unac_except_trans</literal>). Else you need to use
|
|
the query language and the <literal>D</literal> modifier to
|
|
specify diacritics sensitivity. Default is no.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry><term><varname>autocasesens</varname></term>
|
|
        <listitem><para>If the index is not stripped, decide if we
|
|
automatically trigger character case sensitivity if the
|
|
search term has upper-case characters in any but the first
|
|
position. Else you need to use the query language and the
|
|
<literal>C</literal> modifier to specify character-case
|
|
sensitivity. Default is yes.</para>
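
        <para>As a sketch, a configuration enabling case and
        diacritics sensitive searches, with both automatic triggers
        turned on, might look as follows. It assumes the same 0/1
        boolean notation as for
        <varname>indexStripChars</varname>:</para>
        <programlisting>
        # Raw (unstripped) index: changing this implies an index reset.
        indexStripChars = 0
        # Accented search terms trigger diacritics sensitivity.
        autodiacsens = 1
        # Upper-case characters beyond the first position trigger
        # case sensitivity.
        autocasesens = 1
        </programlisting>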
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry><term><varname>loglevel,daemloglevel</varname></term>
|
|
<listitem><para>Verbosity level for recoll and
|
|
recollindex. A value of 4 lists quite a lot of
|
|
debug/information messages. 2 only lists errors. The
|
|
        <literal>daem</literal> version is specific to the indexing monitor
|
|
daemon.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry><term><varname>logfilename,
|
|
daemlogfilename</varname></term>
|
|
<listitem><para>Where the messages should go. 'stderr' can
|
|
be used as a special value, and is the default. The
|
|
        <literal>daem</literal> version is specific to the indexing monitor
|
|
daemon.</para>
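
        <para>A possible logging setup is sketched below: errors
        only for the interactive programs, and more verbose output to
        a file for the monitor daemon. The log file path is
        hypothetical:</para>
        <programlisting>
        loglevel = 2
        logfilename = stderr
        daemloglevel = 4
        daemlogfilename = /tmp/recollindex-monitor.log
        </programlisting>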
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry><term><varname>mondelaypatterns</varname></term>
|
|
        <listitem><para>This allows specifying wildcard path patterns
|
|
(processed with fnmatch(3) with 0 flag), to match files which
|
|
change too often and for which a delay should be observed before
|
|
re-indexing. This is a space-separated list, each entry being a
|
|
pattern and a time in seconds, separated by a colon. You can
|
|
use double quotes if a path entry contains white
|
|
space. Example:</para>
|
|
<programlisting>
|
|
mondelaypatterns = *.log:20 "this one has spaces*:10"
|
|
</programlisting>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry><term><varname>monixinterval</varname></term>
|
|
<listitem><para>Minimum interval (seconds) for processing the
|
|
indexing queue. The real time monitor does not process each
|
|
event when it comes in, but will wait this time for the queue
|
|
to accumulate to diminish overhead and in order to aggregate
|
|
        multiple events to the same file. The default is 30 seconds.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry><term><varname>monauxinterval</varname></term>
|
|
<listitem><para>Period (in seconds) at which the real time
|
|
monitor will regenerate the auxiliary databases (spelling,
|
|
stemming) if needed. The default is one hour.</para>
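
        <para>For example, a less reactive but lighter monitor
        could be sketched with longer intervals. The values are
        arbitrary and given in seconds:</para>
        <programlisting>
        # Process the indexing queue at most once per minute.
        monixinterval = 60
        # Regenerate the auxiliary databases every two hours at most.
        monauxinterval = 7200
        </programlisting>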
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry><term><varname>monioniceclass, monioniceclassdata
|
|
</varname></term><listitem><para>These allow defining the
|
|
<application>ionice</application> class and data used by the
|
|
indexer (default class 3, no data).</para>
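
        <para>As an illustration, and assuming the values are
        handed to <application>ionice</application> as its class and
        class data, the following would select the best-effort class
        at its lowest priority instead of the idle class:</para>
        <programlisting>
        # ionice class 2 is "best-effort"; its data ranges from 0 to 7,
        # 7 being the lowest priority.
        monioniceclass = 2
        monioniceclassdata = 7
        </programlisting>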
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry><term><varname>filtermaxseconds</varname></term>
|
|
<listitem><para>Maximum filter execution time, after which it
|
|
is aborted. Some postscript programs just loop...</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry><term><varname>filtersdir</varname></term>
|
|
<listitem><para>A directory to search for the external
|
|
filter scripts used to index some types of files. The
|
|
value should not be changed, except if you want to modify
|
|
one of the default scripts. The value can be redefined for
|
|
any sub-directory. </para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry><term><varname>iconsdir</varname></term>
|
|
<listitem><para>The name of the directory where
|
|
<command>recoll</command> result list icons are
|
|
stored. You can change this if you want different
|
|
images.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry><term><varname>idxabsmlen</varname></term>
|
|
<listitem><para>&RCL; stores an abstract for each indexed
|
|
file inside the database. The text can come from an actual
|
|
'abstract' section in the document or will just be the
|
|
beginning of the document. It is stored in the index so
|
|
that it can be displayed inside the result lists without
|
|
decoding the original
|
|
file. The <varname>idxabsmlen</varname> parameter defines
|
|
the size of the stored abstract. The default value is 250 bytes.
|
|
The search interface gives you the choice to display this
|
|
stored text or a synthetic abstract built by extracting
|
|
text around the search terms. If you always
|
|
prefer the synthetic abstract, you can reduce this value
|
|
and save a little space.
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry><term><varname>aspellLanguage</varname></term>
|
|
<listitem><para>Language definitions to use when creating
|
|
the aspell dictionary. The value must match a set of
|
|
aspell language definition files. You can type "aspell
|
|
config" to see where these are installed (look for
|
|
data-dir). The default if the variable is not set is to
|
|
use your desktop national language environment to guess
|
|
the value.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry><term><varname>noaspell</varname></term>
|
|
<listitem><para>If this is set, the aspell dictionary
|
|
generation is turned off. Useful for cases where you don't
|
|
need the functionality or when it is unusable because
|
|
aspell crashes during dictionary generation.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry><term><varname>mhmboxquirks</varname></term>
|
|
<listitem><para>This allows defining location-related quirks
for the mailbox handler. Currently only the
<literal>tbird</literal> flag is defined, and it should be set
for directories which hold
<application>Thunderbird</application> data, as their folder
format is nonstandard.</para>
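<para>A minimal sketch, assuming that the
<application>Thunderbird</application> data lives under the
hypothetical directory used as the section name:
<programlisting>[/home/me/.thunderbird]
mhmboxquirks = tbird
</programlisting></para>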
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
|
|
</variablelist>
|
|
</sect3>
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.INSTALL.CONFIG.FIELDS">
|
|
<title>The fields file</title>
|
|
|
|
<para>This file contains information about dynamic fields handling
|
|
in &RCL;. Some very basic fields have hard-wired behaviour,
|
|
and, mostly, you should not change the original data inside the
|
|
<filename>fields</filename> file. But you can create custom fields
fitting your data and handle them just as if they were native
ones.</para>
|
|
|
|
<para>The <filename>fields</filename> file has several sections,
|
|
which each define an aspect of fields processing. Quite often,
|
|
you'll have to modify several sections to obtain the desired
|
|
behaviour.</para>
|
|
|
|
<para>We only give a short description here. Refer
to the comments inside the file for more detailed information.</para>
|
|
|
|
<para>Field names should be lowercase alphabetic ASCII.</para>
|
|
|
|
<variablelist>
|
|
|
|
<varlistentry>
|
|
<term>[prefixes]</term>
|
|
<listitem><para>A field becomes indexed (searchable) by having
|
|
a prefix defined in this section.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term>[stored]</term>
|
|
<listitem><para>A field becomes stored (displayable inside
|
|
results) by having its name listed in this section (typically
|
|
with an empty value).</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term>[aliases]</term>
|
|
<listitem><para>This section defines lists of synonyms for the
|
|
canonical names used inside the <literal>[prefixes]</literal>
|
|
and <literal>[stored]</literal> sections.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term>filter-specific sections</term>
|
|
<listitem><para>Some filters may need specific
|
|
configuration for handling fields. Only the email message filter
|
|
currently has such a section (named
|
|
<literal>[mail]</literal>). It allows indexing arbitrary email
|
|
headers in addition to the ones indexed by default. Other such
|
|
sections may appear in the future.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
</variablelist>
|
|
|
|
<para>Here follows a small example of a personal
|
|
<filename>fields</filename>
|
|
file. This would extract a specific email header and
|
|
use it as a searchable field, with data displayable inside result
|
|
lists. (Side note: as the email filter does no decoding on the values,
only plain ASCII headers can be indexed, and only the
first occurrence is used for headers that occur several times.)
|
|
|
|
<programlisting>[prefixes]
# Index mailmytag contents (with the given prefix)
mailmytag = XMTAG

[stored]
# Store mailmytag inside the document data record (so that it can be
# displayed - as %(mailmytag) - in result lists).
mailmytag =

[mail]
# Extract the X-My-Tag mail header, and use it internally with the
# mailmytag field name
x-my-tag = mailmytag
</programlisting>
|
|
</para>
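<para>To illustrate the <literal>[aliases]</literal> section, the
example above could be extended so that the field can also be
queried under a shorter name. The <literal>mytag</literal> alias is
made up for this sketch, and the assumed syntax (canonical name on
the left, space-separated aliases on the right) should be checked
against the comments in the default <filename>fields</filename> file:
<programlisting>[aliases]
# Allow searching the mailmytag field as mytag
mailmytag = mytag
</programlisting></para>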
|
|
|
|
|
|
<sect3 id="RCL.INSTALL.CONFIG.FIELDS.XATTR">
|
|
<title>Extended attributes in the fields file</title>
|
|
|
|
<para>&RCL; versions 1.19 and later process user extended
file attributes as document fields by default.</para>
|
|
|
|
<para>Attributes are processed as fields of the same name,
|
|
after removing the <literal>user</literal> prefix on
|
|
Linux.</para>
|
|
|
|
<para>The <literal>[xattrtofields]</literal>
section of the <filename>fields</filename> file allows
specifying translations from extended attribute names to
&RCL; field names. An empty translation disables use of the
corresponding attribute data.</para>
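<para>The following sketch shows both uses. The attribute names
<literal>project</literal> and <literal>scratch</literal> are
hypothetical, and it is assumed that they are written without the
<literal>user</literal> prefix:
<programlisting>[xattrtofields]
# Use the data from the "project" attribute as the keywords field
project = keywords
# Empty translation: ignore the "scratch" attribute
scratch =
</programlisting></para>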
|
|
|
|
</sect3>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.INSTALL.CONFIG.MIMEMAP">
|
|
<title>The mimemap file</title>
|
|
|
|
<para><filename>mimemap</filename> specifies the
|
|
file name extension to mime type mappings.</para>
|
|
|
|
<para>For file names without an extension, or with an unknown
|
|
one, the system's <command>file</command> <option>-i</option>
|
|
command will be
|
|
executed to determine the mime type (this can be switched off
|
|
inside the main configuration file).</para>
|
|
|
|
<para>The mappings can be specified on a per-subtree basis,
|
|
which may be useful in some cases. Example:
|
|
<application>gaim</application> logs have a
|
|
<filename>.txt</filename> extension but
|
|
should be handled specially, which is possible because they
|
|
are usually all located in one place.</para>
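<para>Such a per-subtree mapping might look like the following
sketch. Both the directory used as the section name and the mime
type name are assumptions, to be adapted to where the logs actually
live and to the type expected by the corresponding filter:
<programlisting>[/home/me/.gaim]
.txt = text/x-gaim-log
</programlisting></para>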
|
|
|
|
<para><filename>mimemap</filename> also has a
|
|
<varname>recoll_noindex</varname> variable which is a list of
|
|
suffixes. Matching files will be skipped (which avoids
|
|
unnecessary decompressions or <command>file</command>
|
|
executions). This is partially redundant with
|
|
<varname>skippedNames</varname> in the main configuration
|
|
file, with a few differences: it will not affect directories,
it cannot be made dependent on the file-system location (it is
a configuration-wide parameter), and the file names will still
be indexed (not even the file names are indexed for patterns
in <varname>skippedNames</varname>).
|
|
<varname>recoll_noindex</varname> is used mostly for things
|
|
known to be unindexable by a given &RCL; version. Having it
|
|
there avoids cluttering the more user-oriented and locally
|
|
customized <varname>skippedNames</varname>.</para>
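<para>As an example of the syntax, the entry below would skip
the contents of files with three archive-like suffixes. The list is
purely illustrative and is not the default value:
<programlisting>recoll_noindex = .tar.gz .tgz .iso
</programlisting></para>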
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.INSTALL.CONFIG.MIMECONF">
|
|
<title>The mimeconf file</title>
|
|
|
|
<para><filename>mimeconf</filename> specifies how the
|
|
different mime types are handled for indexing, and which icons
|
|
are displayed in the <command>recoll</command> result lists.</para>
|
|
|
|
<para>Changing the parameters in the [index] section is
|
|
probably not a good idea except if you are a &RCL;
|
|
developer.</para>
|
|
|
|
<para>The [icons] section allows you to change the icons which
|
|
are displayed by <command>recoll</command> in the result
|
|
lists (the values are the basenames of the png images inside
the <filename>iconsdir</filename> directory, which is specified in
<filename>recoll.conf</filename>).</para>
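<para>For instance, the entry below would display
<filename>pdf.png</filename> for PDF documents, assuming that an
image with this name exists inside the icons directory:
<programlisting>[icons]
application/pdf = pdf
</programlisting></para>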
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.INSTALL.CONFIG.MIMEVIEW">
|
|
<title>The mimeview file</title>
|
|
|
|
<para><filename>mimeview</filename> specifies which programs
are started when you click on an <guilabel>Open</guilabel> link
in a result list. For example, HTML is normally displayed using
<application>firefox</application>, but you may prefer
<application>Konqueror</application>, or your
<application>openoffice.org</application>
program might be named <command>ooffice</command> instead of
<command>openoffice</command>.</para>
|
|
|
|
<para>Changes to this file can be done by direct editing, or
|
|
through the <command>recoll</command> GUI preferences dialog.</para>
|
|
|
|
<para>If <guilabel>Use desktop preferences to choose document
|
|
editor</guilabel> is checked in the &RCL; GUI preferences, all
|
|
<filename>mimeview</filename> entries will be ignored except the
|
|
one labelled <literal>application/x-all</literal> (which is set to
|
|
use <command>xdg-open</command> by default).</para>
|
|
|
|
<para>In this case, the <literal>xallexcepts</literal> top level
|
|
variable defines a list of mime type exceptions which
|
|
will be processed according to the local entries instead of being
|
|
passed to the desktop. This is so that specific &RCL; options
|
|
such as a page number or a search string can be passed to
|
|
applications that support them, such as the
|
|
<application>evince</application> viewer.</para>
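<para>A possible setting is sketched below. The list is an
illustration, built from the document types for which page numbers
are meaningful, and may differ from the value in the central
configuration file:
<programlisting>xallexcepts = application/pdf application/postscript application/x-dvi
</programlisting></para>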
|
|
|
|
<para>As for the other configuration files, the normal usage
|
|
is to have a <filename>mimeview</filename> inside your own
|
|
configuration directory, with just the non-default entries,
|
|
which will override those from the central configuration
|
|
file.</para>
|
|
|
|
<para>All viewer definition entries must be placed under a
|
|
<literal>[view]</literal> section.</para>
|
|
|
|
<para>The keys in the file are normally mime types. You can add an
|
|
application tag to specialize the choice for an area of the
|
|
filesystem (using a <varname>localfields</varname> specification
|
|
in <filename>mimeconf</filename>). The syntax for the key is
|
|
<replaceable>mimetype</replaceable><literal>|</literal><replaceable>tag</replaceable>.</para>
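<para>As a sketch of the key syntax, the following entry would
apply only to documents carrying a hypothetical
<literal>drafts</literal> tag, while other plain text documents
would presumably still use the untagged entry. Both the tag name and
the <command>gedit</command> viewer are assumptions made for the
illustration:
<programlisting>[view]
text/plain|drafts = gedit %f
</programlisting></para>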
|
|
|
|
<para>The <varname>nouncompforviewmts</varname> entry (placed at
the top level, outside of the <literal>[view]</literal> section)
holds a list of mime types that should not be uncompressed before
starting the viewer (if they are found compressed, for example
<replaceable>mydoc.doc.gz</replaceable>).</para>
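<para>For example, supposing a viewer which can read compressed
files by itself, the entry could look as follows. The mime type is
made up for the illustration:
<programlisting>nouncompforviewmts = application/x-blobapp
</programlisting></para>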
|
|
|
|
<para>The right side of each assignment holds a command to be
|
|
executed for opening the file. The following substitutions are
|
|
performed:</para>
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<formalpara><title>%D</title>
|
|
<para>Document date</para></formalpara>
|
|
</listitem>
|
|
|
|
<listitem><formalpara><title>%f</title>
|
|
<para>File name. This may be the name of a temporary file if
it was necessary to create one (for example, to extract a subdocument
from a container).</para></formalpara>
|
|
</listitem>
|
|
|
|
<listitem><formalpara><title>%F</title>
|
|
<para>Original file name. Same as %f except if a temporary
|
|
file is used.</para></formalpara>
|
|
</listitem>
|
|
|
|
<listitem><formalpara><title>%i</title>
|
|
<para>Internal path, for subdocuments of containers. The
|
|
format depends on the container type. If this appears in the
|
|
command line, &RCL; will not create a temporary file to
|
|
extract the subdocument, expecting the called application
|
|
(possibly a script) to be able to handle it.</para></formalpara>
|
|
</listitem>
|
|
|
|
<listitem><formalpara><title>%M</title>
|
|
<para>Mime type</para></formalpara>
|
|
</listitem>
|
|
|
|
<listitem><formalpara><title>%p</title>
|
|
<para>Page index. Only significant for a subset of document
|
|
types, currently only PDF, Postscript and DVI files. Can be
|
|
used to start the editor at the right page for a match or
|
|
snippet.</para></formalpara>
|
|
</listitem>
|
|
|
|
<listitem><formalpara><title>%s</title>
|
|
<para>Search term. The value will only be set for documents
with indexed page numbers (for example, PDF). The value will be one of
the matched search terms. This allows pre-setting the
value in the "Find" entry inside Evince, for example, for easy
highlighting of the term.</para></formalpara>
|
|
</listitem>
|
|
|
|
<listitem><formalpara><title>%U, %u</title>
|
|
<para>URL.</para></formalpara>
|
|
</listitem>
|
|
</itemizedlist>
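<para>The substitutions above can be combined as in the following
sketch, which opens a PDF document at the page of the match and
pre-sets the search string. It assumes that your
<application>evince</application> version accepts the
<option>--page-index</option> and <option>--find</option> options:
<programlisting>[view]
application/pdf = evince --page-index=%p --find=%s %f
</programlisting></para>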
|
|
|
|
<para>In addition to the predefined values above, all strings like
|
|
<literal>%(fieldname)</literal> will be replaced by the value of
|
|
the field named <literal>fieldname</literal> for the
|
|
document. This could be used in combination with field
|
|
customisation to help with opening the document.</para>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.INSTALL.CONFIG.PTRANS">
|
|
<title>The <filename>ptrans</filename> file</title>
|
|
|
|
<para><filename>ptrans</filename> specifies query-time path
|
|
translations. These can be useful
|
|
in <link linkend="RCL.SEARCH.PTRANS">multiple
|
|
cases</link>.</para>
|
|
<para>The file has a section for any index which needs
|
|
translations, either the main one or additional query
|
|
indexes. The sections are named with the &XAP; index
|
|
directory names. No slash character should exist at the end
|
|
of the paths (all comparisons are textual). An example
should make things sufficiently clear:</para>
|
|
|
|
<programlisting>
[/home/me/.recoll/xapiandb]
/this/directory/moved = /to/this/place

[/path/to/additional/xapiandb]
/server/volume1/docdir = /net/server/volume1/docdir
/server/volume2/docdir = /net/server/volume2/docdir
</programlisting>
|
|
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="RCL.INSTALL.CONFIG.EXAMPLES">
|
|
<title>Examples of configuration adjustments</title>
|
|
|
|
<sect3 id="RCL.INSTALL.CONFIG.EXAMPLES.ADDVIEW">
|
|
<title>Adding an external viewer for a non-indexed type</title>
|
|
|
|
<para>Imagine that you have some kind of file which does not
|
|
have indexable content, but for which you would like to have a
|
|
functional <guilabel>Open</guilabel> link in the result list
|
|
(when found by file name). The file names end in
|
|
<replaceable>.blob</replaceable> and can be displayed by
|
|
application <replaceable>blobviewer</replaceable>.</para>
|
|
|
|
<para>You need two entries in the configuration files for this
|
|
to work:</para>
|
|
|
|
<itemizedlist>
|
|
<listitem><para>In <filename>$RECOLL_CONFDIR/mimemap</filename>
|
|
(typically <filename>~/.recoll/mimemap</filename>), add the
|
|
following line:<programlisting>
.blob = application/x-blobapp
</programlisting>
Note that the mime type is made up here, and you could
call it <replaceable>diesel/oil</replaceable> just the
same.</para>
|
|
</listitem>
|
|
<listitem><para>In <filename>$RECOLL_CONFDIR/mimeview</filename>
|
|
under the <literal>[view]</literal> section, add:</para>
|
|
<programlisting>
application/x-blobapp = blobviewer %f
</programlisting>
<para>We are supposing here
that <replaceable>blobviewer</replaceable> wants a file
name parameter. You would use <literal>%u</literal> if
it liked URLs better.</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
|
|
<para>If you just wanted to change the application used by
|
|
&RCL; to display a mime type which it already knows, you
|
|
would just need to edit <filename>mimeview</filename>. The
|
|
entries you add in your personal file override those in the
|
|
central configuration, which you do not need to
|
|
alter. <filename>mimeview</filename> can also be modified
|
|
from the GUI.</para>
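<para>For instance, to have HTML results opened by
<application>Konqueror</application>, the personal
<filename>mimeview</filename> might contain the following. The
<command>konqueror</command> command name and the choice of
<literal>%u</literal> are assumptions to adapt to your system:
<programlisting>[view]
text/html = konqueror %u
</programlisting>
Remember that such an entry is ignored when <guilabel>Use desktop
preferences to choose document editor</guilabel> is checked, unless
the mime type is listed in <literal>xallexcepts</literal>.</para>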
|
|
|
|
</sect3>
|
|
|
|
<sect3 id="RCL.INSTALL.CONFIG.EXAMPLES.ADDINDEX">
|
|
<title>Adding indexing support for a new file type</title>
|
|
|
|
<para>Let us now imagine that the above
|
|
<replaceable>.blob</replaceable> files actually contain
|
|
indexable text and that you know how to extract it with a
|
|
command line program. Getting &RCL; to index the files is
|
|
easy. You need to perform the above alteration, and also to
|
|
add data to the <filename>mimeconf</filename> file
|
|
(typically in <filename>~/.recoll/mimeconf</filename>):</para>
|
|
<itemizedlist>
|
|
<listitem><para>Under the <literal>[index]</literal>
|
|
section, add the following line (more about the
|
|
<replaceable>rclblob</replaceable> indexing script
|
|
later):<programlisting>
application/x-blobapp = exec rclblob
</programlisting></para>
|
|
</listitem>
|
|
<listitem><para>Under the <literal>[icons]</literal>
|
|
section, you should choose an icon to be displayed for the
|
|
files inside the result lists. Icons are normally 64x64
pixel PNG files which live in
<filename>/usr/[local/]share/recoll/images</filename>.</para>
|
|
</listitem>
|
|
<listitem><para>Under the <literal>[categories]</literal>
|
|
section, you should add the mime type where it makes sense
|
|
(you can also create a category). Categories may be used
|
|
for filtering in advanced search.</para>
|
|
</listitem>
|
|
</itemizedlist>
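<para>Put together, the additions to the personal
<filename>mimeconf</filename> might look like the following sketch.
The <literal>document</literal> icon name and the
<literal>blobs</literal> category are assumptions made for the
example: use the base name of an image which actually exists in
your icons directory, and a category name of your choice.
<programlisting>[index]
application/x-blobapp = exec rclblob

[icons]
application/x-blobapp = document

[categories]
blobs = application/x-blobapp
</programlisting></para>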
|
|
|
|
<para>The <replaceable>rclblob</replaceable> filter should
|
|
be an executable program or script which exists inside
|
|
<filename>/usr/[local/]share/recoll/filters</filename>. It
|
|
will be given a file name as argument and should write the
text or HTML contents to the standard output.</para>
|
|
|
|
<para>The <link linkend="RCL.PROGRAM.FILTERS">filter
|
|
programming</link> section describes in more detail how
|
|
to write a filter.</para>
|
|
|
|
</sect3>
|
|
|
|
</sect2>
|
|
|
|
</sect1>
|
|
|
|
</chapter>
|
|
|
|
</book>
|
|
|