<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
"http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd" [
<!ENTITY RCL "<application>Recoll</application>">
<!ENTITY RCLAPPS "<ulink url='http://www.recoll.org/pages/features.html#doctypes'>http://www.recoll.org/pages/features.html</ulink>">
<!ENTITY RCLVERSION "1.29">
<!ENTITY XAP "<application>Xapian</application>">
<!ENTITY WIN "<application>Windows</application>">
<!ENTITY LIN "<application>Unix</application>-like systems">
<!ENTITY FAQS "https://www.lesbonscomptes.com/recoll/faqsandhowtos/">
<!ENTITY RCLCONF SYSTEM "recoll.conf.xml">
]>
|
|
|
|
<book lang="en">
|
|
|
|
|
|
<!-- A nice overview of docbook elements:
|
|
|
|
https://tdg.docbook.org/tdg/4.5/ch02.html#ch02-logdiv -->
|
|
|
|
<bookinfo>
|
|
<title>Recoll user manual</title>
|
|
|
|
<author>
|
|
<firstname>Jean-Francois</firstname>
|
|
<surname>Dockes</surname>
|
|
<affiliation>
|
|
<address><email>jfd@recoll.org</email></address>
|
|
</affiliation>
|
|
</author>
|
|
|
|
<copyright>
|
|
<year>2005-2020</year>
|
|
<holder role="mailto:jfd@recoll.org">Jean-Francois Dockes</holder>
|
|
</copyright>
|
|
|
|
<abstract>
|
|
<para><literal>Permission is granted to copy, distribute and/or
|
|
modify this document under the terms of the GNU Free Documentation
|
|
License, Version 1.3 or any later version published by the Free
|
|
Software Foundation; with no Invariant Sections, no Front-Cover
|
|
Texts, and no Back-Cover Texts. A copy of the license can be
|
|
found at the following
|
|
location: <ulink url="http://www.gnu.org/licenses/fdl.html">GNU
|
|
web site</ulink>.</literal></para>
|
|
|
|
<para>This document introduces full text search notions
|
|
and describes the installation and use of the &RCL;
|
|
application. This version describes &RCL; &RCLVERSION;.</para>
|
|
</abstract>
|
|
|
|
|
|
</bookinfo>
|
|
|
|
<chapter id="RCL.INTRODUCTION">
|
|
<title>Introduction</title>
|
|
|
|
<para>This document introduces full text search notions
|
|
and describes the installation and use of the &RCL;
|
|
application. It is updated for &RCL; &RCLVERSION;.</para>
|
|
|
|
<para>&RCL; was for a long time dedicated to Unix-like systems. It
was only ported to <application>MS-Windows</application> in 2015.
Many references in this manual, especially file locations, are
specific to Unix and not valid on &WIN;, where some of the described
features are also unavailable. The manual will be progressively
updated. Until this happens, on &WIN;, most references to shared
files can be translated by looking under the Recoll installation
directory (typically <filename>C:/Program Files
(x86)/Recoll</filename>). In particular, anything referenced in
<filename>/usr/share</filename> in this document will be found in
the <filename>Share</filename> subdirectory. The user configuration
is stored by default under <filename>AppData/Local/Recoll</filename>
inside the user directory, along with the index itself.</para>
|
|
|
|
<sect1 id="RCL.INTRODUCTION.TRYIT">
|
|
<title>Giving it a try</title>
|
|
|
|
<para>If you do not like reading manuals (who does?) but
|
|
wish to give &RCL; a try, just
|
|
<link linkend="RCL.INSTALL.BINARY">install</link> the application
|
|
and start the <command>recoll</command> graphical user
|
|
interface (GUI), which will ask permission to index your home
|
|
directory, allowing you to search immediately after
|
|
indexing completes.</para>
|
|
|
|
<para>Do not do this if your home directory contains a huge
|
|
number of documents and you do not want to wait or are very
|
|
short on disk space. In this case, you may first want to customize
|
|
the <link linkend="RCL.INDEXING.CONFIG">configuration</link>
|
|
to restrict the indexed area (shortcut: from the
|
|
<command>recoll</command> GUI go to:
|
|
<menuchoice>
|
|
<guimenu>Preferences</guimenu>
|
|
<guimenuitem>Indexing configuration</guimenuitem>
|
|
</menuchoice>, then adjust the <guilabel>Top
|
|
directories</guilabel> section).</para>
|
|
|
|
<para>On &LIN;, you may need to install the
|
|
appropriate
|
|
<link linkend="RCL.INSTALL.EXTERNAL">supporting applications</link>
|
|
for document types that need them (for
|
|
example <application>antiword</application> for
|
|
<application>Microsoft Word</application> files).
|
|
The &RCL; for &WIN; package is self-contained and includes
|
|
most useful auxiliary programs.</para>
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="RCL.INTRODUCTION.SEARCH">
|
|
<title>Full text search</title>
|
|
|
|
<para>&RCL; is a full text search application, which means that it
|
|
finds your data by content rather than by external attributes
|
|
(like the file name). You specify words
|
|
(terms) which should or should not appear in the text you are
|
|
looking for, and receive in return a list of matching
|
|
documents, ordered so that the most
|
|
<emphasis>relevant</emphasis> documents will appear
|
|
first.</para>
|
|
|
|
<para>You do not need to remember in what file or email message you
|
|
stored a given piece of information. You just ask for related
|
|
terms, and the tool will return a list of documents where
|
|
these terms are prominent, in a similar way to Internet search
|
|
engines.</para>
|
|
|
|
<para>Full text search applications try to determine which
|
|
documents are most relevant to the search terms you
|
|
provide. Computer algorithms for determining relevance can be
|
|
very complex, and in general are inferior to the power of the
|
|
human mind to rapidly determine relevance. The quality of
|
|
relevance guessing is probably the most important aspect when
|
|
evaluating a search application. &RCL; relies on the &XAP;
|
|
probabilistic information retrieval library to determine
|
|
relevance.</para>
|
|
|
|
<para>In many cases, you are looking for all the forms of a
|
|
word, including plurals, different tenses for a verb, or terms
|
|
derived from the same root or <emphasis>stem</emphasis>
|
|
(example: <replaceable>floor, floors, floored,
|
|
flooring...</replaceable>). Queries are usually automatically
expanded to all such related terms (words that reduce to the
same stem). This expansion can be disabled when you need to search
for one specific form.</para>
|
|
|
|
<para>Stemming, by itself, does not accommodate misspellings or
|
|
phonetic searches. A full text search application may also support
|
|
this form of approximation. For example, a search for
|
|
<replaceable>aliterattion</replaceable> returning no result might
|
|
propose <replaceable>alliteration, alteration, alterations, or
|
|
altercation</replaceable> as possible replacement terms. &RCL; bases
|
|
its suggestions on the actual index contents, so that suggestions may
|
|
be made for words which would not appear in a standard dictionary.</para>
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="RCL.INTRODUCTION.RECOLL">
|
|
<title>Recoll overview</title>
|
|
|
|
<para>&RCL; uses the
|
|
<ulink url="http://www.xapian.org">&XAP;</ulink> information retrieval
|
|
library as its storage and retrieval engine. &XAP; is a very
|
|
mature package using <ulink
|
|
url="http://www.xapian.org/docs/intro_ir.html">a sophisticated
|
|
probabilistic ranking model</ulink>.</para>
|
|
|
|
<para>The &XAP; library manages an index database which
|
|
describes where terms appear in your document files. It
|
|
efficiently processes the complex queries which are produced by
|
|
the &RCL; query expansion mechanism, and is in charge of the
|
|
all-important relevance computation task.</para>
|
|
|
|
<para>&RCL; provides the mechanisms and interface to get data
|
|
into and out of the index. This includes translating the many
|
|
possible document formats into pure text, handling term
|
|
variations (using &XAP; stemmers), and spelling approximations
|
|
(using the <application>aspell</application> speller),
|
|
interpreting user queries and presenting results.</para>
|
|
|
|
<para>In short, &RCL; does the dirty footwork, while &XAP;
deals with the intelligent parts of the process.</para>
|
|
|
|
<para>The &XAP; index can be big (roughly the size of the original
|
|
document set), but it is not a document archive. &RCL; can only
|
|
display documents that still exist at the place from which they were
|
|
indexed.</para>
|
|
|
|
<para>&RCL; stores all internal data in <application>Unicode
|
|
UTF-8</application> format, and it can index many types of files
|
|
with different character sets, encodings, and languages into the
|
|
same index. It can process documents embedded inside other
|
|
documents (for example a PDF document stored inside a Zip
|
|
archive sent as an email attachment...), down to an arbitrary
|
|
depth.</para>
|
|
|
|
<para>Stemming is the process by which &RCL; reduces words to
|
|
their radicals so that searching does not depend, for example, on a
|
|
word being singular or plural (floor, floors), or on a verb tense
|
|
(flooring, floored). Because the mechanisms used for stemming
|
|
depend on the specific grammatical rules for each language, there
|
|
is a separate &XAP; stemmer module for most common languages where
|
|
stemming makes sense.</para>
|
|
|
|
<para>&RCL; stores the unstemmed versions of terms in the main index
|
|
and uses auxiliary databases for term expansion (one for each
|
|
stemming language), which means that you can switch stemming
|
|
languages between searches, or add a language without needing a
|
|
full reindex.</para>
|
|
|
|
<para>Storing documents written in different languages in the same
|
|
index is possible, and commonly done. In this situation, you can
|
|
specify several stemming languages for the index. </para>
|
|
|
|
<para>&RCL; currently makes no attempt at automatic language
|
|
recognition, which means that the stemmer will sometimes be applied
|
|
to terms from other languages with potentially strange results. In
|
|
practice, even if this introduces possibilities of confusion, this
approach has proven quite useful, and it is much less
|
|
cumbersome than separating your documents according to what
|
|
language they are written in.</para>
|
|
|
|
<para>By default, &RCL; strips most accents and
|
|
diacritics from terms, and converts them to lower case before
|
|
either storing them in the index or searching for them. As a
|
|
consequence, it is impossible to search for a particular
|
|
capitalization of a term (<literal>US</literal> /
|
|
<literal>us</literal>), or to discriminate two terms based on
|
|
diacritics (<literal>sake</literal> / <literal>saké</literal>,
|
|
<literal>mate</literal> / <literal>maté</literal>).</para>
|
|
|
|
<para>&RCL; can optionally store the raw terms, without accent
|
|
stripping or case conversion. In this configuration, default searches
|
|
will behave as before, but it is possible to perform searches
|
|
sensitive to case and diacritics. This is described in more detail in
|
|
the section about
|
|
<link linkend="RCL.INDEXING.CONFIG.SENS">index case and diacritics sensitivity</link>.
|
|
</para>
|
|
|
|
<para>&RCL; uses many parameters to define exactly what to index,
|
|
and how to classify and decode the source documents. These are kept
|
|
in <link linkend="RCL.INDEXING.CONFIG">configuration files</link>. A
|
|
default configuration is copied into a standard location (usually
|
|
something like <filename>/usr/share/recoll/examples</filename>)
|
|
during installation. The default values set by the configuration
|
|
files in this directory may be overridden by values set inside your
|
|
personal configuration. With the default configuration, &RCL; will
|
|
index your home directory with generic parameters. Most common
|
|
parameters can be set by using
|
|
configuration menus in the <command>recoll</command> GUI. Some less
|
|
common parameters can only be set by editing the text files (the
|
|
new values will be preserved by the GUI).</para>
|
|
|
|
<para>The <link linkend="RCL.INDEXING.PERIODIC.EXEC">indexing process</link>
|
|
is started automatically (after asking permission), the
|
|
first time you execute the <command>recoll</command> GUI. Indexing
|
|
can also be performed by executing the <command>recollindex</command>
|
|
command. &RCL; indexing is multithreaded by default when appropriate
|
|
hardware resources are available, and can perform in parallel
|
|
multiple tasks for text extraction, segmentation and index
|
|
updates.</para>
|
|
|
|
<para><link linkend="RCL.SEARCH">Searches</link> are usually
|
|
performed inside the <command>recoll</command> GUI, which has many
|
|
options to help you find what you are looking for. However, there
|
|
are other ways to query the index:
|
|
<itemizedlist>
|
|
<listitem><para>A
|
|
<link linkend="RCL.SEARCH.COMMANDLINE">command line interface</link>.
|
|
</para></listitem>
|
|
<listitem><para>A
|
|
<link linkend="RCL.PROGRAM.PYTHONAPI"><application>Python</application> programming interface</link>.
|
|
</para></listitem>
|
|
<listitem><para>A <link linkend="RCL.SEARCH.KIO"><application>KDE</application> KIO slave module</link>.
|
|
</para></listitem>
|
|
<listitem><para>An Ubuntu Unity
|
|
<ulink url="https://www.lesbonscomptes.com/recoll/pages/download.html">Scope</ulink>
|
|
module.</para></listitem>
|
|
<listitem><para>A Gnome Shell
|
|
<ulink url="https://www.lesbonscomptes.com/recoll/pages/download.html">Search Provider</ulink>.
|
|
</para></listitem>
|
|
<listitem><para>A
|
|
<ulink url="https://framagit.org/medoc92/recollwebui">Web interface</ulink>.
|
|
</para></listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
|
|
</sect1>
|
|
</chapter>
|
|
|
|
|
|
<chapter id="RCL.INDEXING">
|
|
<title>Indexing</title>
|
|
|
|
<sect1 id="RCL.INDEXING.INTRODUCTION">
|
|
<title>Introduction</title>
|
|
|
|
<para>Indexing is the process by which the set of documents is
|
|
analyzed and the data entered into the database. &RCL;
|
|
indexing is normally incremental: documents will only be
|
|
processed if they have been modified since the last run. On
|
|
the first execution, all documents will need processing. A
|
|
full index build can be forced later by specifying an option
|
|
to the indexing command (<command>recollindex</command>
|
|
<option>-z</option> or <option>-Z</option>).</para>
|
|
|
|
<para><command>recollindex</command> skips files which caused an
|
|
error during a previous pass. This is a performance optimization, and
|
|
the command line option <option>-k</option> can be set to retry
|
|
failed files, for example after updating an input handler.</para>
|
|
|
|
<para>The following sections give an overview of different
|
|
aspects of the indexing processes and configuration, with links
|
|
to detailed sections.</para>
|
|
|
|
<para>Depending on your data, temporary files may be needed during
|
|
indexing, some of them possibly quite big. You can use the
|
|
<envar>RECOLL_TMPDIR</envar> or <envar>TMPDIR</envar> environment
|
|
variables to determine where they are created (the default is to
|
|
use <filename>/tmp</filename>). Using <envar>TMPDIR</envar> has
|
|
the nice property that it may also be taken into account by
|
|
auxiliary commands executed by <command>recollindex</command>.</para>
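<para>As a minimal sketch, the temporary directory could be
redirected for a single indexing run as follows. The
<filename>/mnt/bigdisk/tmp</filename> path is only a placeholder for a
location with enough free space, and a POSIX shell is assumed:</para>
<programlisting>
# Hypothetical scratch location: substitute a directory of your own.
mkdir -p /mnt/bigdisk/tmp
# The variable is set for this one command only, and is also
# inherited by the helper commands which recollindex executes.
TMPDIR=/mnt/bigdisk/tmp recollindex
</programlisting>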
|
|
|
|
<sect2 id="RCL.INDEXING.INTRODUCTION.MODES">
|
|
<title>Indexing modes</title>
|
|
|
|
<para>&RCL; indexing can be performed in two main modes:</para>
|
|
<itemizedlist>
|
|
<listitem>
|
|
<formalpara><title>
|
|
<link linkend="RCL.INDEXING.PERIODIC">Periodic (or batch) indexing</link>
|
|
</title> <para><command>recollindex</command> is executed at
|
|
discrete times. On &LIN;, the typical usage is to have a
|
|
nightly run
|
|
<link linkend="RCL.INDEXING.PERIODIC.AUTOMAT">
|
|
programmed</link>
|
|
into your <command>cron</command> file. On &WIN;, this is
|
|
the only mode available, and the Windows Task Scheduler can
|
|
be used to run indexing. In both cases, the GUI includes an
|
|
easy interface to the system batch scheduler.</para>
|
|
</formalpara>
|
|
</listitem>
|
|
<listitem>
|
|
<formalpara><title>
|
|
<link linkend="RCL.INDEXING.MONITOR">Real time indexing</link>
|
|
</title>
|
|
<para>(Only available on &LIN;). <command>recollindex</command> runs
|
|
permanently as a daemon and uses a file system alteration monitor
|
|
(e.g. <application>inotify</application>) to detect file
|
|
changes. New or updated files are indexed at once. Monitoring a
|
|
big file system tree can consume
|
|
significant system resources. </para>
|
|
</formalpara>
|
|
</listitem>
|
|
</itemizedlist>
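<para>For periodic indexing on &LIN;, a <command>cron</command> entry
along the following lines could be used. This is only a sketch: the
schedule is arbitrary, and it assumes that
<command>recollindex</command> is found in the
<envar>PATH</envar> used by <command>cron</command>:</para>
<programlisting>
# Run the indexer every night at 03:00 (edit with "crontab -e")
0 3 * * * recollindex
</programlisting>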
|
|
|
|
<simplesect><title>&LIN;: choosing an indexing mode</title>
|
|
<para>The choice between the two methods is mostly a matter of
preference, and they can be combined by setting up multiple
indexes (for example, periodic indexing on a big documentation
directory and real time indexing on a small home
directory), or, with &RCL; 1.24 and newer, by
<link linkend="RCL.INDEXING.MONITOR">configuring the index so that only a subset of the tree will be monitored</link>.
</para>
|
|
<para>The choice of method and the parameters used can be
configured from the <command>recoll</command> GUI, in the
<menuchoice>
<guimenu>Preferences</guimenu>
<guimenuitem>Indexing schedule</guimenuitem>
</menuchoice> dialog.
</para>
|
|
</simplesect>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.INDEXING.INTRODUCTION.CONFIG">
|
|
<title>Configurations, multiple indexes</title>
|
|
|
|
<para>&RCL; supports defining multiple indexes, each defined by its
|
|
own configuration directory. A configuration directory contains
|
|
<link linkend="RCL.INDEXING.CONFIG">several files</link> which
|
|
describe what should be indexed and how.</para>
|
|
|
|
<para>When <command>recoll</command> or
|
|
<command>recollindex</command> is first executed, it creates a
|
|
default configuration directory. This configuration is the one used
|
|
for indexing and querying when no specific configuration is
|
|
specified. It is located in <filename>$HOME/.recoll/</filename> on
&LIN; and <filename>%LOCALAPPDATA%\Recoll</filename> on &WIN;
(typically
<filename>C:\Users\[me]\AppData\Local\Recoll</filename>).</para>
|
|
|
|
<para>All configuration parameters have defaults, defined in
|
|
system-wide files. Without further customisation, the default
|
|
configuration will process your complete home directory, with a
|
|
reasonable set of defaults. It can be adjusted to process a
|
|
different area of the file system, select files in different ways,
|
|
and many other things.</para>
|
|
|
|
<para>In some cases, it may be useful to create additional
|
|
configuration directories, for example, to separate personal and
|
|
shared indexes, or to take advantage of the organization of your
|
|
data to improve search precision.</para>
|
|
|
|
<para>In order to do this, you would create an empty directory in a
|
|
location of your choice, and then instruct
|
|
<command>recoll</command> or <command>recollindex</command> to use
|
|
it by setting either a command line option (<literal>-c</literal>
|
|
<replaceable>/some/directory</replaceable>), or an environment
|
|
variable
|
|
(<envar>RECOLL_CONFDIR</envar>=<replaceable>/some/directory</replaceable>).
|
|
Any modification performed by the commands (e.g. configuration
|
|
customisation or searches by <command>recoll</command> or index
|
|
creation by <command>recollindex</command>) would then apply to the
|
|
new directory and not to the default one.</para>
|
|
|
|
<para>Once multiple indexes are created, you can use each of them
|
|
separately by setting the <literal>-c</literal> option or the
|
|
<envar>RECOLL_CONFDIR</envar> environment variable when starting a
|
|
command, to select the desired index.</para>
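<para>The two ways of selecting a configuration could look like this
in a POSIX shell. The <filename>$HOME/.recoll-shared</filename>
directory name is hypothetical. Because all parameters have defaults,
a real setup would normally also customise the new configuration to
restrict what gets indexed:</para>
<programlisting>
# Create an empty configuration directory (hypothetical location)
mkdir -p "$HOME/.recoll-shared"
# Build the index described by this configuration
recollindex -c "$HOME/.recoll-shared"
# Query it from the GUI, selecting it with the environment variable
RECOLL_CONFDIR="$HOME/.recoll-shared" recoll
</programlisting>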
|
|
|
|
<para>It is also possible to instruct one configuration to
|
|
query one or several other indexes in addition to its own, by using
|
|
the <guimenuitem>External index</guimenuitem> function in the
|
|
<command>recoll</command> GUI, or some other functions in the
|
|
command line and programming tools.</para>
|
|
|
|
<para>A plausible usage scenario for the multiple index feature
|
|
would be for a system administrator to set up a central index for
|
|
shared data, which you can choose to search or not in addition to
your personal data. Of course, there are other possibilities. For
example, there are many cases where you know the subset of files
|
|
that should be searched, and where narrowing the search can improve
|
|
the results. You can achieve approximately the same effect with the
|
|
directory filter in advanced search, but multiple indexes may have
|
|
better performance and may be worth the trouble in some
|
|
cases.</para>
|
|
|
|
<para>A more advanced use case would be to use multiple indexes to
|
|
improve indexing performance, by updating several indexes in
|
|
parallel (using multiple CPU cores and disks, or possibly several
|
|
machines), and then merging them, or querying them in
|
|
parallel.</para>
|
|
|
|
<para>See the section about
|
|
<link linkend="RCL.INDEXING.CONFIG.MULTIPLE">configuring multiple indexes</link>
|
|
for more detail.</para>
|
|
|
|
</sect2>
|
|
|
|
<sect2>
|
|
<title>Document types</title>
|
|
|
|
<para>&RCL; knows about quite a few different document
|
|
types. The parameters for document types recognition and
|
|
processing are set in <link linkend="RCL.INDEXING.CONFIG">
|
|
configuration files</link>.
|
|
</para>
|
|
|
|
<para>Most file types, like HTML or word processing files, only hold
|
|
one document. Some file types, like email folders or zip
|
|
archives, can hold many individually indexed documents, which may
|
|
themselves be compound ones. Such hierarchies can go quite
|
|
deep, and &RCL; can process, for example, a
|
|
<application>LibreOffice</application>
|
|
document stored as an attachment to an email message inside an
|
|
email folder archived in a zip file...</para>
|
|
|
|
<para><command>recollindex</command> processes plain text, HTML,
|
|
OpenDocument (Open/LibreOffice), email formats, and a few others
|
|
internally.</para>
|
|
|
|
<para>Other file types (e.g. PostScript, PDF, MS Word, RTF...)
|
|
need external applications for preprocessing. The list is in the
|
|
<link linkend="RCL.INSTALL.EXTERNAL">installation</link>
|
|
section. After every indexing operation, &RCL; updates a list of
commands that would be needed for indexing existing file
types. This list can be displayed by selecting the menu option
|
|
<menuchoice>
|
|
<guimenu>File</guimenu>
|
|
<guimenuitem>Show Missing Helpers</guimenuitem>
|
|
</menuchoice>
|
|
in the <command>recoll</command> GUI. It is stored in the
|
|
<filename>missing</filename> text file inside the configuration
|
|
directory.</para>
|
|
|
|
<para>After installing a missing handler, you may need to
|
|
tell <command>recollindex</command>
|
|
to retry the failed files, by adding option <literal>-k</literal>
|
|
to the command line, or by using the GUI
|
|
<menuchoice>
|
|
<guimenu>File</guimenu>
|
|
<guimenuitem>Special indexing</guimenuitem>
|
|
</menuchoice> menu. This is because <command>recollindex</command>,
|
|
in its default operation mode, will not retry files which caused an
|
|
error during an earlier pass. In special cases, it may be useful to
|
|
reset the data for a category of files before indexing. See
|
|
the <command>recollindex</command> manual page. If your index is
|
|
not too big, it may be simpler to just reset it.</para>
|
|
|
|
<para>By default, &RCL; will try to index any file type that
|
|
it has a way to read. This is sometimes not desirable, and
|
|
there are ways to either exclude some types, or on the
|
|
contrary define a positive list of types to be
|
|
indexed. In the latter case, any type not in the list will
|
|
be ignored.</para>
|
|
|
|
<para>Excluding files by name can be done by adding wildcard name
|
|
patterns to the
|
|
<link linkend="RCL.INSTALL.CONFIG.RECOLLCONF.SKIPPEDNAMES">
|
|
skippedNames</link>
|
|
list, which can be done from the GUI Index configuration
|
|
menu. Excluding by type can be done by setting the
|
|
<link linkend="RCL.INSTALL.CONFIG.RECOLLCONF.EXCLUDEDMIMETYPES">
|
|
excludedmimetypes</link>
|
|
list in the configuration file (1.20 and later). This can be
|
|
redefined for subdirectories.</para>
|
|
|
|
<para>You can also define an exclusive list of MIME types to be
indexed (no others will be indexed), by setting
the <link linkend="RCL.INSTALL.CONFIG.RECOLLCONF.INDEXEDMIMETYPES">
indexedmimetypes</link>
configuration variable. Example:<programlisting>
indexedmimetypes = text/html application/pdf
</programlisting>
It is possible to redefine this parameter for
subdirectories. Example:<programlisting>
[/path/to/my/dir]
indexedmimetypes = application/pdf
</programlisting>
(When using sections like this, don't forget that they remain
in effect until the end of the file or another section
indicator).
</para>
|
|
|
|
<para><literal>excludedmimetypes</literal> or
<literal>indexedmimetypes</literal> can be set either by editing
|
|
the <link linkend="RCL.INSTALL.CONFIG.RECOLLCONF">configuration file (<filename>recoll.conf</filename>)</link>
|
|
for the index, or by using the GUI index configuration tool.</para>
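<para>As an illustration, a <filename>recoll.conf</filename> fragment
using <literal>excludedmimetypes</literal> might look as follows. The
MIME types and the directory path are arbitrary examples, not
recommendations:</para>
<programlisting>
# Never index these types, wherever the files are found
excludedmimetypes = application/x-tar application/zip

# Redefine the list for one subtree (hypothetical path).
# The section stays in effect until the next section or end of file.
[/path/to/my/dir]
excludedmimetypes = application/pdf
</programlisting>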
|
|
|
|
<note><title>Note about MIME types</title>
|
|
<para>When editing the <literal>indexedmimetypes</literal>
|
|
or <literal>excludedmimetypes</literal> lists, you should use the
|
|
MIME values listed in the <filename>mimemap</filename> file
|
|
or in Recoll result lists in preference to <literal>file -i</literal>
|
|
output: there are a number of differences. The
|
|
<literal>file -i</literal> output should only be used for files
|
|
without extensions, or for which the extension is not listed in
|
|
<filename>mimemap</filename>.</para></note>
|
|
|
|
</sect2>
|
|
|
|
|
|
<sect2>
|
|
<title>Indexing failures</title>
|
|
|
|
<para>Indexing may fail for some documents, for a number of
|
|
reasons: a helper program may be missing, the document may be
|
|
corrupt, we may fail to uncompress a file because no file
|
|
system space is available, etc.</para>
|
|
|
|
<para>The &RCL; indexer in versions 1.21 and later does not
|
|
retry failed files by default, because some indexing failures
|
|
can be quite costly (for example failing to uncompress a big
|
|
file because of insufficient disk space).
|
|
Retrying will only occur if an explicit option
|
|
(<option>-k</option>) is set on
|
|
the <command>recollindex</command> command line, or if a script
|
|
executed when <command>recollindex</command> starts up says
|
|
so. The script is defined by a configuration variable
|
|
(<literal>checkneedretryindexscript</literal>), and makes a
|
|
rather lame attempt at deciding if a helper command may have been
|
|
installed, by checking if any of the
|
|
common <filename>bin</filename> directories have changed.</para>
|
|
|
|
</sect2>
|
|
|
|
<sect2>
|
|
<title>Recovery</title>
|
|
|
|
<para>In the rare case where the index becomes corrupted (which can
show up as weird search results or crashes), the index files
|
|
need to be erased before restarting a clean indexing pass. Just delete
|
|
the <filename>xapiandb</filename> directory (see
|
|
<link linkend="RCL.INDEXING.STORAGE">next section</link>), or,
|
|
alternatively, start the next <command>recollindex</command> with the
|
|
<option>-z</option> option, which will reset the database before
|
|
indexing. The difference between the two methods is that the
|
|
second will not change the current index format, which may be
|
|
undesirable if a newer format is supported by the &XAP;
|
|
version.</para>
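<para>The two recovery methods can be sketched as follows, assuming
the default configuration directory on &LIN;. Only one of them is
needed:</para>
<programlisting>
# First method: remove the index data, then rebuild from scratch
rm -rf "$HOME/.recoll/xapiandb"
recollindex

# Second method: let recollindex reset the database before indexing
recollindex -z
</programlisting>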
|
|
|
|
</sect2>
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="RCL.INDEXING.STORAGE">
|
|
<title>Index storage</title>
|
|
|
|
<para>The default location for the index data is the
|
|
<filename>xapiandb</filename> subdirectory of the &RCL;
|
|
configuration directory, typically
|
|
<filename>$HOME/.recoll/xapiandb/</filename>. This can be
|
|
changed via two different methods (with different purposes):
|
|
<orderedlist>
|
|
|
|
<listitem><para>For a given configuration directory, you can
|
|
specify a non-default storage location for the index by setting
|
|
the <varname>dbdir</varname> parameter in the configuration file
|
|
(see the
|
|
<link linkend="RCL.INSTALL.CONFIG.RECOLLCONF">configuration section</link>).
|
|
This method would mainly be of use if you wanted
to keep the configuration directory in its default location, but
desired another location for the index, typically because of disk
space or performance concerns.</para>
|
|
</listitem>
|
|
|
|
<listitem><para>You can specify a different configuration
|
|
directory by setting the <envar>RECOLL_CONFDIR</envar>
|
|
environment variable, or using the <option>-c</option>
|
|
option to the &RCL; commands. This method would typically be
|
|
used to index different areas of the file system to
|
|
different indexes. For example, if you were to issue the
following command:
<programlisting>recoll -c ~/.indexes-email</programlisting>
then &RCL; would use configuration files
stored in <filename>~/.indexes-email/</filename> and
(unless specified otherwise in
<filename>recoll.conf</filename>) would look for
the index in
<filename>~/.indexes-email/xapiandb/</filename>.</para>
|
|
|
|
<para>Using multiple configuration directories and
|
|
<link linkend="RCL.INSTALL.CONFIG.RECOLLCONF">configuration options</link>
|
|
allows you to tailor multiple configurations and
|
|
indexes to handle whatever subset of the available data you wish
|
|
to make searchable.</para>
|
|
</listitem>
|
|
|
|
</orderedlist>
|
|
</para>
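<para>For the first method, the relevant line in
<filename>recoll.conf</filename> could look like this (the path is a
placeholder, to be replaced with a location of your choice):</para>
<programlisting>
# Keep the configuration in place, but store the index elsewhere
dbdir = /mnt/bigdisk/recoll/xapiandb
</programlisting>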
|
|
|
|
<para>The size of the index is determined by the size of the set
|
|
of documents, but the ratio can vary a lot. For a typical
|
|
mixed set of documents, the index size will often be close to
|
|
the data set size. In specific cases (a set of compressed mbox
|
|
files for example), the index can become much bigger than the
|
|
documents. It may also be much smaller if the documents
|
|
contain a lot of images or other non-indexed data (an extreme
|
|
example being a set of mp3 files where only the tags would be
|
|
indexed).</para>
|
|
|
|
<para>Of course, images, sound and video do not increase the index
|
|
size, which means that in most cases, the space used by the index
|
|
will be negligible compared to the total amount of data on the
|
|
computer.</para>
|
|
|
|
<para>The index data directory (<filename>xapiandb</filename>)
|
|
only contains data that can be completely rebuilt by an index run
|
|
(as long as the original documents exist), and it can always be
|
|
destroyed safely.</para>
|
|
|
|
<sect2 id="RCL.INDEXING.STORAGE.FORMAT">
|
|
<title>&XAP; index formats</title>
|
|
|
|
<para>&XAP; versions usually support several formats for index
|
|
storage. A given major &XAP; version will have a current format,
|
|
used to create new indexes, and will also support the format from
|
|
the previous major version.</para>
|
|
|
|
<para>&XAP; will not automatically convert an existing index from
|
|
the older format to the newer one. If you want to upgrade to the
|
|
new format, or if a very old index needs to be converted because
|
|
its format is not supported any more, you will have to explicitly
|
|
delete the old index (typically
|
|
<filename>~/.recoll/xapiandb</filename>), then run a normal
|
|
indexing command. Using <command>recollindex</command> option
|
|
<option>-z</option> would not work in this situation.</para>
|
|
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.INDEXING.STORAGE.SECURITY">
|
|
<title>Security aspects</title>
|
|
|
|
<para>The &RCL; index does not hold complete copies of the indexed
documents (although it almost does after version 1.24), but it does
hold enough data to allow for an almost complete reconstruction. If
confidential data is indexed, access to the database directory
should be restricted.</para>
|
|
|
|
<para>&RCL; will create the configuration directory with a mode of
|
|
0700 (access by owner only). As the index data directory is by
|
|
default a sub-directory of the configuration directory, this should
|
|
result in appropriate protection.</para>
|
|
|
|
<para>If you use another setup, you should think of the kind
|
|
of protection you need for your index, set the directory
|
|
and files access modes appropriately, and also maybe adjust
|
|
the <literal>umask</literal> used during index updates.</para>
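<para>For example, with an index stored outside the configuration
directory, something like the following could be used. This is only a
sketch: <replaceable>/path/to/index/dir</replaceable> and
<replaceable>/path/to/config</replaceable> are placeholders, and the
appropriate modes depend on who needs access to the index:</para>
<programlisting># Remove all access for group and others on the existing index data
chmod -R go-rwx <replaceable>/path/to/index/dir</replaceable>
# Run the update in a subshell with a restrictive umask, so that files
# created during indexing are only accessible by their owner
(umask 077; recollindex -c <replaceable>/path/to/config</replaceable>)
</programlisting>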
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.INDEXING.STORAGE.BIG">
|
|
<title>Special considerations for big indexes</title>
|
|
|
|
<para>This need only concern you if your index is going to be
|
|
bigger than around 5 GBytes. Beyond 10 GBytes, it becomes a serious
|
|
issue. Most people have much smaller indexes. For reference, 5
|
|
GBytes would be around 2000 bibles, a lot of text. If you have a
|
|
huge text dataset (remember: images don't count, the text content
|
|
of PDFs is typically less than 5% of the file size), read on.</para>
|
|
|
|
<para>The amount of writing performed by &XAP; during index
|
|
creation is not linear with the index size (it is somewhere between
|
|
linear and quadratic). For big indexes this becomes a performance
|
|
issue, and may even be an SSD disk wear issue.</para>
|
|
|
|
<para>The problem can be mitigated by observing the following
|
|
rules:</para>
|
|
<itemizedlist>
|
|
<listitem><para>Partition the data set and create several indexes
|
|
of reasonable size rather than a huge one. These indexes can then
|
|
be queried in parallel (using the &RCL; external indexes
|
|
facility), or merged using
|
|
<command>xapian-compact</command>.</para></listitem>
|
|
<listitem><para>Have a lot of RAM available and set the
|
|
<literal>idxflushmb</literal> &RCL; configuration parameter as
|
|
high as you can without swapping (experimentation will be
|
|
needed). 200 would be a minimum in this
|
|
context.</para></listitem>
|
|
<listitem><para>Use &XAP; 1.4.10 or newer, as this version
|
|
brought a significant improvement in the amount of writes.</para>
|
|
</listitem>
|
|
</itemizedlist>
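<para>As an illustration of the second rule, the flush threshold is
set in <filename>recoll.conf</filename>. The value below is just the
minimum suggested above, not a recommendation for your system:</para>
<programlisting>idxflushmb = 200
</programlisting>
<para>Merging partial indexes, as mentioned in the first rule, could
look like the following, where the three paths are placeholders for two
existing index directories and a new destination directory:</para>
<programlisting>xapian-compact <replaceable>/path/to/index1/xapiandb</replaceable> <replaceable>/path/to/index2/xapiandb</replaceable> <replaceable>/path/to/merged/xapiandb</replaceable>
</programlisting>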
|
|
|
|
</sect2>
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="RCL.INDEXING.CONFIG">
|
|
<title>Index configuration</title>
|
|
|
|
<para>Variables stored inside the
|
|
<link linkend="RCL.INSTALL.CONFIG">&RCL; configuration files</link>
|
|
control which areas of the file system are indexed, and how files
|
|
are processed. The values can be set by editing the text
|
|
files. Most of the more commonly used ones can also be adjusted by
|
|
using the <link linkend="RCL.INDEXING.CONFIG.GUI">
|
|
dialogs in the <command>recoll</command> GUI</link>.</para>
|
|
|
|
<para>The first time you start <command>recoll</command>, you will be
|
|
asked whether or not you would like it to build the index. If you
|
|
want to adjust the configuration before indexing, just click
|
|
<guilabel>Cancel</guilabel> at this point, which will get you into
|
|
the configuration interface. If you exit at this point,
|
|
<command>recoll</command> will have created a default configuration
|
|
directory with empty configuration files, which you can then
|
|
edit.</para>
|
|
|
|
<para>The configuration is documented inside the
|
|
<link linkend="RCL.INSTALL.CONFIG">installation chapter</link>
|
|
of this document, or in the
|
|
<ulink url="https://www.lesbonscomptes.com/recoll/manpages/recoll.conf.5.html"><citerefentry><refentrytitle>recoll.conf</refentrytitle><manvolnum>5</manvolnum></citerefentry></ulink>
|
|
manual page. Both documents are automatically generated from
|
|
the comments inside the configuration file.</para>
|
|
|
|
<para>The most immediately useful variable
|
|
is probably
|
|
<link linkend="RCL.INSTALL.CONFIG.RECOLLCONF.TOPDIRS"><varname>topdirs</varname></link>,
|
|
which lists the subtrees and files to be indexed.</para>
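<para>A possible value is shown below. The directory names are only
examples. Entries are separated by white space, and a path containing
spaces needs to be enclosed in double quotes:</para>
<programlisting>topdirs = ~/Documents ~/Mail "/data/shared files"
</programlisting>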
|
|
|
|
<para>The applications needed to index file types other than
|
|
text, HTML or email (e.g. PDF, PostScript, MS Word...) are
|
|
described in the <link linkend="RCL.INSTALL.EXTERNAL">external packages section</link>.
|
|
</para>
|
|
|
|
<para>There are two incompatible types of &RCL;
|
|
indexes, depending on the treatment of character case and
|
|
diacritics. A <link linkend="RCL.INDEXING.CONFIG.SENS">further
|
|
section</link> describes the two types in more detail. The default
|
|
type is appropriate in most cases.</para>
|
|
|
|
<sect2 id="RCL.INDEXING.CONFIG.MULTIPLE">
|
|
<title>Multiple indexes</title>
|
|
|
|
<para>Multiple &RCL; indexes can be created by using several
|
|
configuration directories which are typically set to index
|
|
different areas of the file system.</para>
|
|
|
|
<para>A specific index can be selected by setting the
|
|
<envar>RECOLL_CONFDIR</envar> environment variable or giving the
|
|
<option>-c</option> option to <command>recoll</command> and
|
|
<command>recollindex</command>.</para>
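<para>For instance, the two following commands are equivalent ways of
updating the index defined by the configuration in
<replaceable>/path/to/my/new/config</replaceable> (a placeholder path).
The first one uses Bourne shell syntax to set the environment
variable for a single command:</para>
<programlisting>RECOLL_CONFDIR=<replaceable>/path/to/my/new/config</replaceable> recollindex
recollindex -c <replaceable>/path/to/my/new/config</replaceable>
</programlisting>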
|
|
|
|
<para>The <command>recollindex</command> program, used for creating
|
|
or updating indexes, always works on a single index. The different
|
|
configurations are entirely independent (no parameters are ever
|
|
shared between configurations when indexing). </para>
|
|
|
|
<para>All the search interfaces (<command>recoll</command>,
|
|
<command>recollq</command>, the Python API, etc.) operate with a
|
|
main configuration, from which both configuration and index data
|
|
are used, and can also query data from multiple additional
|
|
indexes. Only the index data from the latter is used; their
|
|
configuration parameters are ignored. This implies that some
|
|
parameters should be consistent among index configurations which
|
|
are to be used together.</para>
|
|
|
|
<para>When searching, the current main index (defined by
|
|
<envar>RECOLL_CONFDIR</envar> or <option>-c</option>) is always
|
|
active. If this is undesirable, you can set up your base
|
|
configuration to index an empty directory.</para>
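<para>For example, the base <filename>recoll.conf</filename> could
contain the following line, where the path is a placeholder for a
directory which you would create and leave empty:</para>
<programlisting>topdirs = <replaceable>/path/to/an/empty/directory</replaceable>
</programlisting>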
|
|
|
|
<para>Index configuration parameters can be set either by using a
|
|
text editor on the files, or, for most parameters, by using the
|
|
<link linkend="RCL.INDEXING.CONFIG.GUI"><command>recoll</command> index configuration GUI</link>.
|
|
In the latter case, the configuration directory for which
|
|
parameters are modified is the one which was selected by
|
|
<envar>RECOLL_CONFDIR</envar> or the <option>-c</option> parameter,
|
|
and there is no way to switch configurations within the GUI.</para>
|
|
|
|
<para>See the <link linkend="RCL.INSTALL.CONFIG.RECOLLCONF">configuration section</link>
|
|
for a detailed description of the parameters.</para>
|
|
|
|
<para>Some configuration parameters must be consistent among a set
|
|
of multiple indexes used together for searches. Most importantly,
|
|
all indexes to be queried concurrently must have the same option
|
|
concerning character case and diacritics stripping, but there are
|
|
other constraints. Most of the relevant parameters affect the
|
|
<link linkend="RCL.INSTALL.CONFIG.RECOLLCONF.TERMS">term generation</link>.
|
|
</para>
|
|
|
|
<para>Using multiple configurations implies a small
|
|
level of command line or file manager usage. The user must
|
|
explicitly create additional configuration directories, the GUI
|
|
will not do it. This is to avoid mistakenly creating additional
|
|
directories when an argument is mistyped. Also, the GUI or the
|
|
indexer must be launched with a specific option or environment to
|
|
work on the right configuration.</para>
|
|
|
|
<simplesect>
|
|
<title>In practice: creating and using an additional index</title>
|
|
|
|
|
|
<para>Initially creating the configuration directory:<programlisting>
mkdir <replaceable>/path/to/my/new/config</replaceable></programlisting></para>
|
|
|
|
<para>Configuring the new index can be done from the
<command>recoll</command> GUI, launched from the
command line to pass the <literal>-c</literal> option
(you could create a desktop file to do it for you), and then using the
<link linkend="RCL.INDEXING.CONFIG.GUI">GUI index configuration tool</link>
to set up the index.
<programlisting>
recoll -c <replaceable>/path/to/my/new/config</replaceable></programlisting>
</para>
|
|
|
|
|
|
<para>Alternatively, you can just start a text editor on the main
configuration file:
<programlisting>
<replaceable>someEditor</replaceable> <replaceable>/path/to/my/new/config</replaceable>/<link linkend="RCL.INSTALL.CONFIG.RECOLLCONF"><filename>recoll.conf</filename></link>
</programlisting>
</para>
|
|
|
|
|
|
<para>Creating and updating the index can be done from the command line:
<programlisting>recollindex -c <replaceable>/path/to/my/new/config</replaceable>
</programlisting>
or from the File menu of a GUI launched with the same option
(<command>recoll</command>, see above).</para>
|
|
|
|
<para>The same GUI would also let you set up batch indexing for
|
|
the new index. Real time indexing can only be set up from the GUI
|
|
for the default index (the menu entry will be inactive if the GUI
|
|
was started with a non-default <literal>-c</literal>
|
|
option).</para>
|
|
|
|
<para>The new index can be queried alone with<programlisting>
recoll -c <replaceable>/path/to/my/new/config</replaceable></programlisting>
or, in parallel with the default index, by starting
<command>recoll</command> without a <literal>-c</literal> option,
and using the
<menuchoice>
<guimenu>Preferences</guimenu>
<guimenuitem>External Index Dialog</guimenuitem>
</menuchoice> menu.</para>
|
|
</simplesect>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="RCL.INDEXING.CONFIG.SENS">
|
|
<title>Index case and diacritics sensitivity</title>
|
|
|
|
<para>As of &RCL; version 1.18 you have a choice of building an
|
|
index with terms stripped of character case and diacritics, or
|
|
one with raw terms. For a source term of
|
|
<literal>Résumé</literal>, the former will store
|
|
<literal>resume</literal>, the latter
|
|
<literal>Résumé</literal>.</para>
|
|
|
|
<para>Each type of index allows performing searches insensitive to
|
|
case and diacritics: with a raw index, the user entry will be
|
|
expanded to match all case and diacritics variations present in
|
|
the index. With a stripped index, the search term will be stripped
|
|
before searching.</para>
|
|
|
|
<para>A raw index allows using case and diacritics to discriminate
|
|
between terms, e.g., returning different results when searching for
|
|
<literal>US</literal> and <literal>us</literal> or
|
|
<literal>resume</literal> and <literal>résumé</literal>.
|
|
Read the
|
|
<link linkend="RCL.SEARCH.CASEDIAC">section about search case and diacritics sensitivity</link>
|
|
for more details.</para>
|
|
|
|
<para>The type of index to be created is controlled by the
|
|
<literal>indexStripChars</literal> configuration
|
|
variable which can only be changed by editing the
|
|
configuration file. Any change implies an index reset (not
|
|
automated by &RCL;), and all indexes in a search must be set
|
|
in the same way (again, not checked by &RCL;). </para>
|
|
|
|
<para>&RCL; creates a stripped index by default if
|
|
<literal>indexStripChars</literal> is not set.</para>
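<para>As a sketch, switching a configuration to a raw index would
involve setting the variable to 0 in
<filename>recoll.conf</filename>:</para>
<programlisting>indexStripChars = 0
</programlisting>
<para>and then resetting the index explicitly, for example with the
<option>-z</option> option of <command>recollindex</command> (add
<option>-c</option> if this is not the default configuration):</para>
<programlisting>recollindex -z
</programlisting>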
|
|
|
|
<para>As a cost for added capability, a raw index will be slightly
|
|
bigger than a stripped one (around 10%). Also, searches will be
|
|
more complex, so probably slightly slower, and the feature is
|
|
relatively little used, so that a certain amount of weirdness
|
|
cannot be excluded.</para>
|
|
|
|
<para>One of the most adverse consequences of using a raw index
|
|
is that some phrase and proximity searches may become
|
|
impossible: because each term needs to be expanded, and all
|
|
combinations searched for, the multiplicative expansion may
|
|
become unmanageable.</para>
|
|
|
|
</sect2>
|
|
|
|
|
|
|
|
|
|
|
|
<sect2 id="RCL.INDEXING.CONFIG.THREADS">
|
|
<title>Indexing threads configuration (&LIN;)</title>
|
|
|
|
<para>The &RCL; indexing process
|
|
<command>recollindex</command> can use multiple threads to
|
|
speed up indexing on multiprocessor systems. The work done
|
|
to index files is divided into several stages and some of the
|
|
stages can be executed by multiple threads. The stages are:
|
|
<orderedlist>
|
|
<listitem><para>File system walking: this is always performed by
|
|
the main thread.</para></listitem>
|
|
<listitem><para>File conversion and data
|
|
extraction.</para></listitem>
|
|
<listitem><para>Text processing (splitting, stemming,
|
|
etc.).</para></listitem>
|
|
<listitem><para>&XAP; index update.</para></listitem>
|
|
</orderedlist>
|
|
</para>
|
|
<para>You can also read a
|
|
<ulink url="http://www.recoll.org/pages/idxthreads/threadingRecoll.html">
|
|
longer document</ulink> about the transformation of
|
|
&RCL; indexing to multithreading.</para>
|
|
|
|
<para>The threads configuration is controlled by two
|
|
configuration file parameters.</para>
|
|
|
|
<variablelist>
|
|
|
|
<varlistentry><term><varname>thrQSizes</varname></term>
|
|
<listitem><para>This variable defines the job input queues
|
|
configuration. There are three possible queues for stages
|
|
2, 3 and 4, and this parameter should give the queue depth
|
|
for each stage (three integer values). If a value of -1 is
|
|
used for a given stage, no queue is used, and the thread
|
|
will go on performing the next stage. In practice, deep
|
|
queues have not been shown to increase performance. A value
|
|
of 0 for the first queue tells &RCL; to perform
|
|
autoconfiguration (no need for anything else in this case,
|
|
thrTCounts is not used) - this is the default
|
|
configuration.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry><term><varname>thrTCounts</varname></term>
|
|
<listitem><para>This defines the number of threads used
|
|
for each stage. If a value of -1 is used for one of
|
|
the queue depths, the corresponding thread count is
|
|
ignored. It makes no sense to use a value other than 1
|
|
for the last stage because updating the &XAP; index is
|
|
necessarily single-threaded (and protected by a
|
|
mutex).</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
</variablelist>
|
|
|
|
<note><para>If the first value in <varname>thrQSizes</varname> is
|
|
0, <varname>thrTCounts</varname> is ignored.</para></note>
|
|
|
|
<para>The following example would use three queues (of depth 2),
and 4 threads for converting source documents, 2 for
processing their text, and one to update the index. This was
tested to be the best configuration on the test system
(quad-processor with multiple disks).
<programlisting>
thrQSizes = 2 2 2
thrTCounts = 4 2 1
</programlisting>
</para>
|
|
|
|
<para>The following example would use a single queue, and the
complete processing for each document would be performed by
a single thread (several documents will still be processed
in parallel in most cases). The threads will use mutual
exclusion when entering the index update stage. In practice
the performance would be close to the preceding case in
general, but worse in certain cases (e.g. a Zip archive
would be processed purely sequentially), so the previous
approach is preferred. YMMV... The last two values for
<varname>thrTCounts</varname> are ignored.
<programlisting>
thrQSizes = 2 -1 -1
thrTCounts = 6 1 1
</programlisting>
</para>
|
|
|
|
<para>The following example would disable
multithreading. Indexing will be performed by a single
thread.
<programlisting>
thrQSizes = -1 -1 -1
</programlisting>
</para>
|
|
|
|
</sect2>
|
|
|
|
|
|
|
|
<sect2 id="RCL.INDEXING.CONFIG.GUI">
|
|
<title>The index configuration GUI</title>
|
|
|
|
<para>Most parameters for a given index configuration can
|
|
be set from a <command>recoll</command> GUI running on this
|
|
configuration (either as default, or by setting
|
|
<envar>RECOLL_CONFDIR</envar> or the <option>-c</option>
|
|
option).</para>
|
|
|
|
<para>The interface is started from the
|
|
<menuchoice>
|
|
<guimenu>Preferences</guimenu>
|
|
<guimenuitem>Index Configuration</guimenuitem>
|
|
</menuchoice>
|
|
menu entry. It is divided into four tabs,
|
|
<guilabel>Global parameters</guilabel>, <guilabel>Local
|
|
parameters</guilabel>, <guilabel>Web history</guilabel>
|
|
(which is explained in the next section) and <guilabel>Search
|
|
parameters</guilabel>.</para>
|
|
|
|
<para>The <guilabel>Global parameters</guilabel> tab allows setting
|
|
global variables, like the lists of top directories, skipped paths,
|
|
or stemming languages.</para>
|
|
|
|
<para>The <guilabel>Local parameters</guilabel> tab allows setting
|
|
variables that can be redefined for subdirectories. This second tab
|
|
has an initially empty list of customisation directories, to which
|
|
you can add. The variables are then set for the currently selected
|
|
directory (or at the top level if the empty line is
|
|
selected).</para>
|
|
|
|
<para>The <guilabel>Search parameters</guilabel> section defines
|
|
parameters which are used at query time, but are global to an
|
|
index and affect all search tools, not only the GUI.</para>
|
|
|
|
<para>The meaning for most entries in the interface is
|
|
self-evident and documented by a <literal>ToolTip</literal>
|
|
popup on the text label. For more detail, you will need to
|
|
refer to the
|
|
<link linkend="RCL.INSTALL.CONFIG">configuration section</link>
|
|
of this guide.</para>
|
|
|
|
<para>The configuration tool normally respects the comments
|
|
and most of the formatting inside the configuration file, so
|
|
that it is quite possible to use it on hand-edited files,
|
|
which you might nevertheless want to back up first.</para>
|
|
|
|
</sect2>
|
|
</sect1>
|
|
|
|
<sect1 id="RCL.INDEXING.REMOVABLE">
|
|
<title>Removable volumes</title>
|
|
|
|
<para>&RCL; used to have no support for indexing removable volumes
|
|
(portable disks, USB keys, etc.). Recent versions have improved the
|
|
situation and support indexing removable volumes in two different
|
|
ways:</para>
|
|
|
|
<itemizedlist>
|
|
<listitem><para>By indexing the volume in the main, fixed, index, and
ensuring that the volume data is not purged if the indexing runs
while the volume is not mounted (since &RCL; 1.25.2).</para></listitem>
|
|
<listitem><para>By storing a volume index on the volume
|
|
itself (since &RCL; 1.24).</para></listitem>
|
|
</itemizedlist>
|
|
|
|
<simplesect id="RCL.INDEXING.REMOVABLE.MAIN">
|
|
<title>Indexing removable volumes in the main index</title>
|
|
|
|
<para>As of version 1.25.2, &RCL; provides a simple way to ensure
|
|
that the index data for an absent volume will not be purged. Two
|
|
conditions must be met:
|
|
<itemizedlist>
|
|
<listitem><para>The volume mount
|
|
point must be a member of the <literal>topdirs</literal>
|
|
list.</para></listitem>
|
|
<listitem><para>The mount directory must be empty (when the volume
|
|
is not mounted).</para></listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
|
|
<para>If <command>recollindex</command> finds that one of the
|
|
<literal>topdirs</literal> is empty when starting up, any existing
|
|
data for the tree will be preserved by the indexing
|
|
pass (no purge for this area).</para>
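<para>For example, assuming a removable disk which is normally mounted
on <filename>/media/me/somelabel</filename> (an arbitrary mount point),
the following <filename>recoll.conf</filename> setting would index it
along with the home directory:</para>
<programlisting>topdirs = ~ /media/me/somelabel
</programlisting>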
|
|
|
|
</simplesect>
|
|
|
|
<simplesect id="RCL.INDEXING.REMOVABLE.SELF">
|
|
<title>Self contained volumes</title>
|
|
|
|
<para>As of &RCL; 1.24, it has become possible to build
|
|
self-contained datasets including a &RCL; configuration directory and
|
|
index together with the indexed documents, and to move such a dataset
|
|
around (for example copying it to a USB drive), without having to
|
|
adjust the configuration for querying the index.</para>
|
|
|
|
<note><para>This is a query-time feature only. The index must only be
|
|
updated in its original location. If an update is necessary in a
|
|
different location, the index must be reset.</para></note>
|
|
|
|
<para>The principle of operation is that the configuration stores the
|
|
location of the original configuration directory, which must reside
|
|
on the movable volume. If the volume is later mounted elsewhere,
|
|
&RCL; adjusts the paths stored inside the index by the difference
|
|
between the original and current locations of the configuration
|
|
directory.</para>
|
|
|
|
<para>To make a long story short, here follows a script to create a
|
|
&RCL; configuration and index under a given directory (given as single
|
|
parameter). The resulting data set (files + recoll directory) can later
|
|
be moved to a CDROM or thumb drive. Longer explanations come after
|
|
the script.</para>
|
|
|
|
<programlisting>#!/bin/sh

# Print a message and exit with an error status
fatal()
{
    echo $*;exit 1
}
usage()
{
    fatal "Usage: init-recoll-volume.sh &lt;top-directory&gt;"
}

test $# = 1 || usage
topdir=$1
test -d "$topdir" || fatal $topdir should be a directory

confdir="$topdir/recoll-config"
test ! -d "$confdir" || fatal $confdir should not exist

mkdir "$confdir"

# Switch to absolute paths. The configuration directory path is
# recomputed from the absolute top directory, so that the script
# also works when it is given a relative path.
cd "$topdir"
topdir=`pwd`
confdir="$topdir/recoll-config"

# Create a minimal configuration: index the whole dataset, and
# record the original location of the configuration directory
(echo topdirs = '"'$topdir'"'; \
 echo orgidxconfdir = $confdir) > "$confdir/recoll.conf"

recollindex -c "$confdir"
</programlisting>
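<para>Assuming that the script was saved as
<filename>init-recoll-volume.sh</filename>, and taking a dataset stored
under <filename>/home/me/mydata</filename> as an example, it could be
used as follows to create the index, then query it from its current
location:</para>
<programlisting>sh init-recoll-volume.sh /home/me/mydata
recoll -c /home/me/mydata/recoll-config
</programlisting>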
|
|
|
|
<para>The examples below will assume that you have a dataset under
|
|
<filename>/home/me/mydata/</filename>, with the index configuration and
|
|
data stored inside
|
|
<filename>/home/me/mydata/recoll-confdir</filename>.</para>
|
|
|
|
<para>In order to be able to run queries after the dataset has been
|
|
moved, you must ensure the following:
|
|
<itemizedlist>
|
|
<listitem><para>The main configuration file must define the
|
|
<link linkend="RCL.INSTALL.CONFIG.RECOLLCONF.ORGIDXCONFDIR">orgidxconfdir</link>
|
|
variable to be the original location of the configuration directory
|
|
(<filename>orgidxconfdir=/home/me/mydata/recoll-confdir</filename>
|
|
must be set inside
|
|
<filename>/home/me/mydata/recoll-confdir/recoll.conf</filename> in
|
|
the example above).</para></listitem>
|
|
|
|
<listitem><para>The configuration directory must exist with the
|
|
documents, somewhere under the directory which will be
|
|
moved. E.g. if you are moving <filename>/home/me/mydata</filename>
|
|
around, the configuration directory must exist somewhere below this
|
|
point, for example
|
|
<filename>/home/me/mydata/recoll-confdir</filename>, or
|
|
<filename>/home/me/mydata/sub/recoll-confdir</filename>.</para></listitem>
|
|
|
|
<listitem><para>You should keep the default locations for the index
|
|
elements which are relative to the configuration directory by
|
|
default (principally <literal>dbdir</literal>). Only the paths
|
|
referring to the documents themselves
|
|
(e.g. <literal>topdirs</literal> values) should be absolute (in
|
|
general, they are only used when indexing anyway).</para></listitem>
|
|
|
|
</itemizedlist>
|
|
</para>
|
|
|
|
<para>Only the first point needs an explicit user action: the &RCL;
|
|
defaults are compatible with the third one, and the second is
|
|
natural.</para>
|
|
|
|
<para>If, after the move, the configuration directory needs to be
|
|
copied out of the dataset (for example because the thumb drive is too
|
|
slow), you can set the
|
|
<link linkend="RCL.INSTALL.CONFIG.RECOLLCONF.CURIDXCONFDIR">curidxconfdir</link>
|
|
variable inside the copied configuration to
|
|
define the location of the moved one. For example if
|
|
<filename>/home/me/mydata</filename> is now mounted onto
|
|
<filename>/media/me/somelabel</filename>, but the configuration
|
|
directory and index have been copied to
|
|
<filename>/tmp/tempconfig</filename>, you would set
|
|
<literal>curidxconfdir</literal> to
|
|
<filename>/media/me/somelabel/recoll-confdir</filename> inside
|
|
<filename>/tmp/tempconfig/recoll.conf</filename>.
|
|
<literal>orgidxconfdir</literal> would still be
|
|
<filename>/home/me/mydata/recoll-confdir</filename> in the original and
|
|
the copy.</para>
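<para>With the values from this example,
<filename>/tmp/tempconfig/recoll.conf</filename> would then contain the
following two lines, in addition to any other settings:</para>
<programlisting>orgidxconfdir = /home/me/mydata/recoll-confdir
curidxconfdir = /media/me/somelabel/recoll-confdir
</programlisting>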
|
|
|
|
<para>If you are regularly copying the configuration out of the
|
|
dataset, it will be useful to write a script to automate the
|
|
procedure. This can't really be done inside &RCL; because there are
|
|
probably many possible variants. One example would be to copy the
|
|
configuration to make it writable, but keep the index data on the
|
|
medium because it is too big - in this case, the script would also need
|
|
to set <literal>dbdir</literal> in the copied configuration.</para>
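<para>A minimal sketch of such a script follows, for the variant just
described. The paths are the ones used in the examples above. The
script assumes that <filename>recoll.conf</filename> on the volume
already defines <literal>orgidxconfdir</literal>, that it contains no
subdirectory sections (so that appended lines apply globally), that no
other configuration file needs to be copied, and that the index is in
the default <filename>xapiandb</filename> subdirectory. Treat it as a
starting point rather than a complete solution:</para>
<programlisting>#!/bin/sh
# Location of the configuration directory on the mounted volume
volconf=/media/me/somelabel/recoll-confdir
# Writable copy of the configuration
tmpconf=/tmp/tempconfig

mkdir -p "$tmpconf" || exit 1
cp "$volconf/recoll.conf" "$tmpconf/recoll.conf" || exit 1

# Record where the moved configuration currently is, and keep
# using the index data stored on the volume
(echo "curidxconfdir = $volconf"; \
 echo "dbdir = $volconf/xapiandb") >> "$tmpconf/recoll.conf"

recoll -c "$tmpconf"
</programlisting>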
|
|
|
|
<para>The same set of modifications (&RCL; 1.24) has also made it
|
|
possible to run queries from a readonly configuration directory (with
|
|
slightly reduced function of course, such as not recording the query
|
|
history).</para>
|
|
</simplesect>
|
|
</sect1>
|
|
|
|
<sect1 id="RCL.INDEXING.WebQUEUE">
|
|
<title>&LIN;: indexing visited Web pages</title>
|
|
|
|
<para>With the help of a <application>Firefox</application>
|
|
extension, &RCL; can index the Internet pages that you visit. The
|
|
extension has a long history: it was initially designed for
|
|
the <application>Beagle</application> indexer, then adapted to
|
|
&RCL; and
|
|
the <application>Firefox</application> <application>XUL</application>
|
|
API. The current version of the extension is located in
|
|
the <ulink url="https://addons.mozilla.org/en-US/firefox/addon/recoll-we/">Mozilla
|
|
add-ons repository</ulink>, uses
|
|
the <application>WebExtensions</application> API, and works with
|
|
current <application>Firefox</application> versions.</para>
|
|
|
|
<para>The extension works by copying visited Web pages to an indexing
|
|
queue directory, which &RCL; then processes, storing the data into a
|
|
local cache, then indexing it, then removing the file from the
|
|
queue.</para>
|
|
|
|
<note><title>The local cache is not an archive</title><para>As
|
|
mentioned above, a copy of the indexed Web pages is retained by
|
|
&RCL; in a local cache (from which data is fetched for previews,
|
|
or when resetting the index). The cache is not changed by an
|
|
index reset, just read for indexing. The cache has a maximum
|
|
size, which can be adjusted from the <guilabel>Index
|
|
configuration</guilabel> / <guilabel>Web history</guilabel> panel
|
|
(<literal>webcachemaxmbs</literal> parameter
|
|
in <filename>recoll.conf</filename>). Once the maximum size is
|
|
reached, old pages are erased to make room for new ones. The
|
|
pages which you want to keep indefinitely need to be explicitly
|
|
archived elsewhere. Using a very high value for the cache size
|
|
can avoid data erasure, but see the 'Howto' page linked below for more
|
|
details and gotchas.</para></note>
|
|
|
|
<para>The visited Web pages indexing feature can be enabled on the
|
|
&RCL; side from the GUI <guilabel>Index configuration</guilabel>
|
|
panel, or by editing the configuration file (set
|
|
<varname>processwebqueue</varname> to 1).</para>
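<para>In <filename>recoll.conf</filename>, this could look like the
following. The cache size is an arbitrary illustrative value in
megabytes, not a recommendation:</para>
<programlisting>processwebqueue = 1
webcachemaxmbs = 100
</programlisting>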
|
|
|
|
<para>The &RCL; GUI has a tool to list and edit the contents of the
|
|
Web cache. (<menuchoice><guimenu>Tools</guimenu><guimenuitem>Webcache
|
|
editor</guimenuitem></menuchoice>)</para>
|
|
<para>The <command>recollindex</command> command has two options to
|
|
help manage the Web cache:</para>
|
|
<itemizedlist>
|
|
<listitem><para><option>--webcache-compact</option> will recover
the space from erased entries. It may need to use twice the disk space
currently needed for the Web cache.</para></listitem>
<listitem><para><option>--webcache-burst <replaceable>destdir</replaceable></option>
will extract all current entries into pairs of metadata and data
files created
inside <replaceable>destdir</replaceable>.</para></listitem>
|
|
</itemizedlist>
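<para>For example (the destination directory is a placeholder, and we
assume here that it already exists):</para>
<programlisting># Recover the space used by erased entries
recollindex --webcache-compact
# Extract all entries as pairs of metadata and data files
recollindex --webcache-burst <replaceable>/path/to/destdir</replaceable>
</programlisting>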
|
|
|
|
<para>You can find more details on Web indexing, its usage and configuration
|
|
in a <ulink url="&FAQS;IndexWebHistory">Recoll 'Howto'
|
|
entry</ulink>.</para>
|
|
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="RCL.INDEXING.EXTATTR">
|
|
<title>&LIN;: using extended attributes</title>
|
|
|
|
<para>User extended attributes are named pieces of information
|
|
that most modern file systems can attach to any file.</para>
|
|
|
|
<para>&RCL; processes extended attributes as document fields by
|
|
default.</para>
|
|
|
|
<para>A
|
|
<ulink url="http://www.freedesktop.org/wiki/CommonExtendedAttributes">
|
|
freedesktop standard</ulink> defines a few special
|
|
attributes, which are handled as such by &RCL;:
|
|
<variablelist>
|
|
<varlistentry>
|
|
<term>mime_type</term>
|
|
<listitem><para>If set, this overrides any other
|
|
determination of the file MIME type.</para></listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term>charset</term>
|
|
<listitem><para>If set, this defines the file character set
|
|
(mostly useful for plain text files).</para></listitem>
|
|
</varlistentry>
|
|
</variablelist>
|
|
</para>
|
|
|
|
<para>By default, other attributes are handled as &RCL; fields of the
|
|
same name.</para>
|
|
|
|
<para>On Linux, the <literal>user</literal> prefix is removed from
|
|
the name.</para>
|
|
|
|
<para>The name translation can be configured more precisely inside the
|
|
<link linkend="RCL.INSTALL.CONFIG.FIELDS"><filename>fields</filename> configuration file</link>.
|
|
</para>
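<para>As an illustration, on Linux, the attributes can be set and
listed with the <command>setfattr</command> and
<command>getfattr</command> commands from the
<application>attr</application> package (this is an assumption about
your system, other tools can be used). The file name and the values
are examples only:</para>
<programlisting># Declare the character set of a plain text file
setfattr -n user.charset -v iso-8859-1 notes.txt
# Force the MIME type
setfattr -n user.mime_type -v text/plain notes.txt
# List the user attributes set on the file
getfattr -d notes.txt
</programlisting>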
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="RCL.INDEXING.EXTTAGS">
|
|
<title>&LIN;: importing external tags</title>
|
|
|
|
<para>During indexing, it is possible to import metadata for each
|
|
file by executing commands. This allows, for example, extracting tag
|
|
data from an external application and storing it in a field for
|
|
indexing.</para>
|
|
|
|
<para>See the
|
|
<link linkend="RCL.INSTALL.CONFIG.RECOLLCONF.METADATACMDS">section about the <literal>metadatacmds</literal> field</link>
|
|
in the main configuration chapter for a description of the
|
|
configuration syntax.</para>
|
|
|
|
<para>For example, if you want &RCL; to use tags managed by
|
|
<application>tmsu</application> in a field named
|
|
<replaceable>tags</replaceable>, you would add the following to the
|
|
configuration file:</para>
|
|
|
|
<programlisting>[/some/area/of/the/fs]
metadatacmds = ; <replaceable>tags</replaceable> = tmsu tags %f
</programlisting>
|
|
|
|
<note><para>Depending on the <application>tmsu</application> version,
|
|
you may need/want to add options like
|
|
<literal>--database=/some/db</literal>.</para></note>
|
|
|
|
<para>You may want to restrict this processing to a subset of
|
|
the directory tree, because it may slow down indexing a bit
|
|
(<literal>[/some/area/of/the/fs]</literal>).</para>
|
|
<para>Note the initial semi-colon after the equal sign.</para>
|
|
|
|
<para>In the example above, the output of <command>tmsu</command> is
|
|
used to set a field named <replaceable>tags</replaceable>. The field
|
|
name is arbitrary and could be <replaceable>tmsu</replaceable> or
|
|
<replaceable>myfield</replaceable> just the same, but
|
|
<replaceable>tags</replaceable> is an alias for the standard &RCL;
|
|
<literal>keywords</literal> field, and the <command>tmsu</command>
|
|
output will just augment its contents. This will avoid the need to
|
|
extend the <link linkend="RCL.PROGRAM.FIELDS">field
|
|
configuration</link>.</para>
|
|
|
|
<para>Once re-indexing is performed (you will need to force the file
reindexing, as &RCL; will not detect the need by itself), you will be
able to search the field from the query language, through any of its
aliases:
<replaceable>tags:some/alternate/values</replaceable> or
<replaceable>tags:all,these,values</replaceable>. The compact field
search syntax is supported by &RCL; 1.20 and later. For older versions,
you would need to repeat the <replaceable>tags:</replaceable>
specifier for each term, e.g. <replaceable>tags:some</replaceable>
<literal>OR</literal>
<replaceable>tags:alternate</replaceable>.</para>
|
|
|
|
<para>Tag changes will not be detected by
|
|
the indexer if the file itself did not change. One possible
|
|
workaround would be to update the file <literal>ctime</literal> when
|
|
you modify the tags, which
|
|
would be consistent with how extended attributes function. A pair of
|
|
<command>chmod</command> commands could accomplish this, or a
|
|
<literal>touch -a</literal>. Alternatively, just
|
|
couple the tag update with a
|
|
<literal>recollindex -e -i</literal> <replaceable>/path/to/the/file</replaceable>.</para>
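<para>The two workarounds above could be scripted along the following
lines. This is only a sketch: the <literal>bump_ctime</literal> and
<literal>reindex_command</literal> helper names are hypothetical, the
<literal>ctime</literal> behaviour assumed is the one found on &LIN;, and
only the <command>recollindex</command> options quoted above are
used.</para>

<programlisting><![CDATA[
```python
import os
import stat


def bump_ctime(path):
    """Refresh the ctime of path with a pair of chmod calls.

    The mode is changed, then restored, so the permissions end up
    unchanged. Assumption: a Unix-like system, where chmod updates
    the inode change time.
    """
    mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, mode ^ stat.S_IXOTH)  # flip one permission bit...
    os.chmod(path, mode)                 # ...and put it back


def reindex_command(path):
    """Build the command line which forces reindexing a single file."""
    return ["recollindex", "-e", "-i", path]
```
]]></programlisting>

<para>A tagging script could then either call
<literal>bump_ctime(path)</literal> and wait for the next indexing pass,
or execute the list returned by <literal>reindex_command(path)</literal>
(for example with <literal>subprocess.run()</literal>) right after
updating the tags.</para>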
|
|
|
|
</sect1>
|
|
|
|
|
|
<sect1 id="RCL.INDEXING.PDF">
|
|
<title>The PDF input handler</title>
|
|
|
|
<para>The PDF format is very important for scientific and technical
|
|
documentation, and document archival. It has extensive
|
|
facilities for storing metadata along with the document, and these
|
|
facilities are actually used in the real world.</para>
|
|
|
|
<para>In consequence, the <command>rclpdf.py</command> PDF input
|
|
handler has more complex capabilities than most others, and it is
|
|
also more configurable. Specifically, <command>rclpdf.py</command>
|
|
has the following features:
|
|
<itemizedlist>
|
|
<listitem><para>It can be configured to extract
|
|
specific metadata tags from an XMP packet.</para></listitem>
|
|
<listitem><para>It can extract PDF
|
|
attachments.</para></listitem>
|
|
<listitem><para>It can automatically perform
|
|
OCR if the document text is empty. This is done by
|
|
executing an external program and is now described in a
|
|
<link linkend="RCL.INDEXING.OCR">separate
|
|
section</link>, because the OCR framework can also be used
|
|
with non-PDF image files.</para></listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
|
|
<sect2 id="RCL.INDEXING.PDF.XMP">
|
|
<title>XMP fields extraction</title>
|
|
|
|
<para>The <filename>rclpdf.py</filename> script in &RCL; version
|
|
1.23.2 and later can extract XMP metadata fields by executing the
|
|
<command>pdfinfo</command> command (usually found with
|
|
<application>poppler-utils</application>). This is controlled by
|
|
the
|
|
<link linkend="RCL.INSTALL.CONFIG.RECOLLCONF.PDFEXTRAMETA">pdfextrameta</link>
|
|
configuration variable, which specifies which tags to extract and,
|
|
possibly, how to rename them.</para>
|
|
|
|
<para>The
|
|
<link linkend="RCL.INSTALL.CONFIG.RECOLLCONF.PDFEXTRAMETAFIX">pdfextrametafix</link>
|
|
variable can be used to designate a file with Python code to edit
|
|
the metadata fields (available in &RCL; 1.23.3 and later; 1.23.2
has equivalent code inside the handler script). Example:</para>
|
|
|
|
<programlisting>import sys
import re

class MetaFixer(object):
    def __init__(self):
        pass

    def metafix(self, nm, txt):
        if nm == 'bibtex:pages':
            txt = re.sub(r'--', '-', txt)
        elif nm == 'someothername':
            # do something else
            pass
        elif nm == 'stillanother':
            # etc.
            pass

        return txt

    def wrapup(self, metaheaders):
        pass
</programlisting>
|
|
|
|
<para>If the 'metafix()' method is defined, it is called for each
|
|
metadata field. A new MetaFixer object is created for each PDF
|
|
document (so the object can keep state, for example to
eliminate duplicate values). If the 'wrapup()' method is defined, it
|
|
is called at the end of XMP fields processing with the whole
|
|
metadata as parameter, as an array of '(nm, val)' pairs, allowing
|
|
an alternate approach for editing or adding/deleting fields.</para>
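<para>As an illustration of the 'wrapup()' approach, the following
sketch removes duplicate '(nm, val)' pairs once all fields have been
seen. It assumes that the handler takes into account changes made in
place to the <literal>metaheaders</literal> list; this detail is not
stated above and should be checked against the handler script for your
version.</para>

<programlisting><![CDATA[
```python
import re


class MetaFixer(object):
    def __init__(self):
        pass

    def metafix(self, nm, txt):
        # Per-field editing, same rule as in the example above.
        if nm == 'bibtex:pages':
            txt = re.sub(r'--', '-', txt)
        return txt

    def wrapup(self, metaheaders):
        # metaheaders is a list of (nm, val) pairs. Keep the first
        # occurrence of each pair and drop the repeats.
        # Assumption: the caller uses the list as modified in place.
        seen = set()
        unique = []
        for nm, val in metaheaders:
            if (nm, val) not in seen:
                seen.add((nm, val))
                unique.append((nm, val))
        metaheaders[:] = unique
```
]]></programlisting>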
|
|
|
|
<!-- <para> There is a <ulink url="&FAQS;PDFXMP.wiki">complete example of XMP
|
|
tags setup</ulink>, including a nice result list paragraph format in the
|
|
&RCL; Wiki </para> -->
|
|
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.INDEXING.PDF.ATTACH">
|
|
<title>PDF attachment indexing</title>
|
|
|
|
<para>If <application>pdftk</application> is installed, and if the
|
|
<link linkend="RCL.INSTALL.CONFIG.RECOLLCONF.PDFATTACH">pdfattach</link>
|
|
configuration variable is set, the PDF input handler will try to
|
|
extract PDF attachments for indexing as sub-documents of the PDF
|
|
file. This is disabled by default, because it slows down PDF
|
|
indexing a bit even if not one attachment is ever found (PDF
|
|
attachments are uncommon in my experience).</para>
|
|
|
|
</sect2>
|
|
|
|
</sect1>
|
|
|
|
|
|
<sect1 id="RCL.INDEXING.OCR">
|
|
<title>Recoll and OCR</title>
|
|
|
|
<para>This is new in &RCL; 1.26.5. Older versions had a more limited,
|
|
non-caching capability to execute an external OCR program in the PDF
|
|
handler. The new function has the following features:
|
|
|
|
<itemizedlist>
|
|
<listitem><para>The OCR output is cached, stored as separate
|
|
files. The caching is ultimately based on a hash value of the
|
|
original file contents, so that it is immune to file renames. A
|
|
first path-based layer ensures fast operation for unchanged
(unmoved) files, and the data hash (which is still orders of
|
|
magnitude faster than OCR) is only re-computed if the file has
|
|
moved. OCR is only performed if the file was not previously
|
|
processed or if it changed.</para></listitem>
|
|
<listitem><para>The support for a specific program is implemented
|
|
in a simple Python module. It should be straightforward to add
|
|
support for any OCR engine with a capability to run from the
|
|
command line.</para></listitem>
|
|
<listitem><para>Modules initially exist for
|
|
<application>tesseract</application> (Linux and Windows), and
|
|
<application>ABBYY FineReader</application> (Linux, tested with
|
|
version 11). ABBYY FineReader is a commercial closed source
|
|
program, but it sometimes performs better than
|
|
tesseract.</para></listitem>
|
|
<listitem><para>The OCR is currently only called from the PDF
|
|
handler, but there should be no problem using it for other image
|
|
types.</para></listitem>
|
|
</itemizedlist>
|
|
</para>
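<para>The two-layer caching idea described in the first item can be
sketched as follows. This is a toy, in-memory model written for this
explanation, not the actual &RCL; implementation: the
<literal>OcrCache</literal> class, the use of SHA-256 and the
size/modification time signature are all assumptions.</para>

<programlisting><![CDATA[
```python
import hashlib
import os


class OcrCache:
    """Toy two-layer OCR cache: path layer first, content hash second."""

    def __init__(self):
        self.by_path = {}  # path -> ((size, mtime_ns), digest)
        self.by_hash = {}  # digest -> OCR text

    def _digest(self, path):
        h = hashlib.sha256()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(1 << 20), b''):
                h.update(chunk)
        return h.hexdigest()

    def get_text(self, path, run_ocr):
        st = os.stat(path)
        sig = (st.st_size, st.st_mtime_ns)
        entry = self.by_path.get(path)
        if entry is not None and entry[0] == sig:
            # Fast layer: same path, same signature, no hashing needed.
            digest = entry[1]
        else:
            # New, moved or changed file: compute the content hash.
            digest = self._digest(path)
            self.by_path[path] = (sig, digest)
        if digest not in self.by_hash:
            # Unknown contents: this is the only place where OCR runs.
            self.by_hash[digest] = run_ocr(path)
        return self.by_hash[digest]
```
]]></programlisting>

<para>With this model, renaming a file costs one hash computation but no
OCR run, and only a change of the file contents causes
<literal>run_ocr</literal> to be called again.</para>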
|
|
|
|
<para>To enable this feature, you need to install one of
|
|
the supported OCR applications
|
|
(<application>tesseract</application>
|
|
or <application>ABBYY</application>), enable OCR in the PDF
|
|
handler, and tell &RCL; where the appropriate command resides. The
|
|
last parts are done by setting configuration variables. See the
|
|
<link linkend="RCL.INSTALL.CONFIG.RECOLLCONF.OCR">
|
|
relevant section</link>. All parameters can be localized in
|
|
subdirectories through the usual main configuration mechanism (path
|
|
sections).</para>
|
|
|
|
</sect1>
|
|
|
|
|
|
<sect1 id="RCL.INDEXING.PERIODIC">
|
|
<title>Periodic indexing</title>
|
|
|
|
<simplesect id="RCL.INDEXING.PERIODIC.EXEC">
|
|
<title>Running the indexer</title>
|
|
|
|
<para>The <command>recollindex</command> program performs index
|
|
updates. You can start it either from the command line or from the
|
|
<guimenu>File</guimenu> menu in the <command>recoll</command> GUI
|
|
program. When started from the GUI, the indexing will run on the
|
|
same configuration <command>recoll</command> was started on. When
|
|
started from the command line, <command>recollindex</command> will
|
|
use the <envar>RECOLL_CONFDIR</envar> variable or accept a
|
|
<option>-c</option> <replaceable>confdir</replaceable> option to
|
|
specify a non-default configuration directory.</para>
|
|
|
|
<para>If the <command>recoll</command> program finds no index
|
|
when it starts, it will automatically start indexing (except
|
|
if canceled).</para>
|
|
|
|
<para>The GUI <menuchoice><guimenu>File</guimenu> </menuchoice>
|
|
menu has entries to start or stop the current indexing
|
|
operation. When indexing is not currently running, you have a
|
|
choice between <guimenuitem>Update
|
|
Index</guimenuitem> or <guimenuitem>Rebuild Index</guimenuitem>.
|
|
The first choice only processes changed files, the second one
|
|
erases the index before starting so that all files are
|
|
processed.</para>
|
|
|
|
<para>On Linux and Windows, the GUI can be used to manage the indexing
|
|
operation. Stopping the indexer can be done
|
|
from the <command>recoll</command> GUI
|
|
<menuchoice>
|
|
<guimenu>File</guimenu>
|
|
<guimenuitem>Stop Indexing</guimenuitem>
|
|
</menuchoice>
|
|
menu entry.
|
|
</para>
|
|
|
|
<para>On Linux, the <command>recollindex</command> indexing process
|
|
can be interrupted by sending an interrupt
|
|
(<keysym>Ctrl-C</keysym>, SIGINT) or terminate (SIGTERM)
|
|
signal.
|
|
</para>
|
|
|
|
<para>When stopped, some time may elapse before
|
|
<command>recollindex</command> exits, because it needs to properly
|
|
flush and close the index.</para>
|
|
|
|
<para>After an interruption, the index will be somewhat
|
|
inconsistent because some operations which are normally
|
|
performed at the end of the indexing pass will have been
|
|
skipped (for example, the stemming and spelling databases
|
|
will be missing or out of date). You just need to restart
|
|
indexing at a later time to restore consistency. The
|
|
indexing will restart at the interruption point (the full
|
|
file tree will be traversed, but files that were indexed up
|
|
to the interruption and for which the index is still up to
|
|
date will not need to be reindexed).</para>
|
|
</simplesect>
|
|
|
|
<simplesect id="RCL.INDEXING.PERIODIC.CMDLINE">
|
|
<title>recollindex command line</title>
|
|
|
|
<para><command>recollindex</command> has many options
|
|
which are listed in its
|
|
<ulink url="https://www.lesbonscomptes.com/recoll/manpages/recollindex.1.html">manual page</ulink>.
|
|
Only a few will be described here.</para>
|
|
|
|
<para>Option <option>-z</option> will reset the index when
|
|
starting. This is almost the same as destroying the index
|
|
files (the nuance is that the &XAP; format version will not
|
|
be changed).</para>
|
|
<para>Option <option>-Z</option> will force the update of all
|
|
documents without resetting the index first. This will not
|
|
have the "clean start" aspect of <option>-z</option>, but
|
|
the advantage is that the index will remain available for
|
|
querying while it is rebuilt, which can be a significant
|
|
advantage if it is very big (some installations need days
|
|
for a full index rebuild).</para>
|
|
|
|
<para>Option <option>-k</option> will force retrying files
|
|
which previously failed to be indexed, for example because
|
|
of a missing helper program.</para>
|
|
|
|
<para>Also of particular interest are
|
|
the <option>-i</option> and <option>-f</option>
|
|
options. <option>-i</option> allows indexing an explicit
|
|
list of files (given as command line parameters or read on
|
|
<literal>stdin</literal>). <option>-f</option> tells
|
|
<command>recollindex</command> to ignore file selection
|
|
parameters from the configuration. Together, these options
|
|
allow building a custom file selection process for some area
|
|
of the file system, by adding the top directory to the
|
|
<varname>skippedPaths</varname> list and using an
|
|
appropriate file selection method to build the file list to
|
|
be fed to <command>recollindex</command>
|
|
<option>-if</option>. Trivial example:</para>
|
|
|
|
<programlisting>
|
|
find . -name indexable.txt -print | recollindex -if
|
|
</programlisting>
|
|
|
|
<para><command>recollindex</command> <option>-i</option> will
|
|
not descend into subdirectories specified as parameters,
|
|
but just add them as index entries. It is
|
|
up to the external file selection method to build the complete
|
|
file list.</para>
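<para>The same kind of custom selection can be written in a script when
the criteria are too complex for <command>find</command>. The sketch
below uses hypothetical helper names
(<literal>select_files</literal>, <literal>index_files</literal>) and
assumes that <command>recollindex</command> <option>-if</option> reads
one path per line on <literal>stdin</literal>, as in the
<command>find</command> example above.</para>

<programlisting><![CDATA[
```python
import os
import subprocess


def select_files(top, wanted_name="indexable.txt"):
    """Return the sorted paths of all files below top named wanted_name."""
    selected = []
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in filenames:
            if name == wanted_name:
                selected.append(os.path.join(dirpath, name))
    return sorted(selected)


def index_files(paths):
    """Feed an explicit file list to 'recollindex -if' on stdin.

    Assumption: one path per line, like the output of 'find -print'.
    """
    file_list = "".join(path + "\n" for path in paths)
    return subprocess.run(["recollindex", "-if"], input=file_list,
                          text=True, check=False)
```
]]></programlisting>

<para>A typical call would be
<literal>index_files(select_files("/some/area/of/the/fs"))</literal>.
The selection function builds the complete file list by itself, since
<command>recollindex</command> <option>-i</option> does not descend into
directories.</para>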
|
|
</simplesect>
|
|
|
|
<simplesect id="RCL.INDEXING.PERIODIC.AUTOMAT">
|
|
<title>Linux: using <command>cron</command> to automate indexing</title>
|
|
|
|
<para>The most common way to set up indexing is to have a cron
|
|
task execute it every night. For example the following
|
|
<filename>crontab</filename> entry would do it every day at
|
|
3:30AM (supposing <command>recollindex</command> is in your
|
|
PATH):
|
|
|
|
<screen><![CDATA[
|
|
30 3 * * * recollindex > /some/tmp/dir/recolltrace 2>&1
|
|
]]></screen>
|
|
|
|
Or, using <command>anacron</command>:
|
|
<screen><![CDATA[
|
|
1 15 recollindex su mylogin -c "recollindex > /tmp/rcltraceme 2>&1"
|
|
]]></screen>
|
|
</para>
|
|
|
|
<para>The &RCL; GUI has dialogs to manage
|
|
<filename>crontab</filename> entries for
|
|
<command>recollindex</command>. You can reach them from the
|
|
<menuchoice>
|
|
<guimenu>Preferences</guimenu>
|
|
<guimenuitem>Indexing Schedule</guimenuitem>
|
|
</menuchoice>
|
|
menu. They only
|
|
work with the good old <command>cron</command>, and do not give
|
|
access to all features of <command>cron</command>
|
|
scheduling. Entries created via the tool are marked with
|
|
a <literal>RCLCRON_RCLINDEX=</literal> marker so that the tool
|
|
knows which entries belong to it. As a side effect, this sets an
|
|
environment variable for the process, but the variable is not actually
used: it is just a marker.</para>
|
|
|
|
<para>The usual command to edit your
|
|
<filename>crontab</filename> is <command>crontab</command>
|
|
<option>-e</option> (which will usually start the
|
|
<command>vi</command> editor to edit the file). You may have
|
|
more sophisticated tools available on your system.</para>
|
|
|
|
<para>Please be aware that there may be differences between your
|
|
usual interactive command line environment and the one seen by
|
|
crontab commands. The PATH variable, in particular, may be of
concern. Please check the crontab manual pages about possible
|
|
issues.</para>
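<para>One way to sidestep PATH differences is to write the absolute path
of <command>recollindex</command> in the <filename>crontab</filename>
entry. The following sketch generates such a line; the
<literal>crontab_line</literal> and <literal>recollindex_path</literal>
names are hypothetical, and the output format simply mirrors the entry
shown above.</para>

<programlisting><![CDATA[
```python
import shutil


def crontab_line(minute, hour, command, logfile):
    """Format a daily crontab entry which logs stdout and stderr."""
    return "%d %d * * * %s > %s 2>&1" % (minute, hour, command, logfile)


def recollindex_path():
    """Return the absolute path of recollindex, or None if not in PATH."""
    return shutil.which("recollindex")
```
]]></programlisting>

<para>Running
<literal>crontab_line(30, 3, recollindex_path() or "recollindex", "/some/tmp/dir/recolltrace")</literal>
from your interactive shell prints a line that can be pasted into the
<filename>crontab</filename>.</para>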
|
|
|
|
|
|
</simplesect>
|
|
</sect1>
|
|
|
|
<sect1 id="RCL.INDEXING.MONITOR">
|
|
<title>&LIN;: real time indexing</title>
|
|
|
|
<para>Real time monitoring/indexing is performed by starting the
|
|
<command>recollindex</command> <option>-m</option> command.
|
|
With this option, <command>recollindex</command> will detach
|
|
from the terminal and become a daemon, permanently monitoring
|
|
file changes and updating the index.</para>
|
|
|
|
<para>In this situation, the <command>recoll</command>
|
|
GUI <menuchoice><guimenu>File</guimenu></menuchoice> menu makes two
|
|
operations available: <guimenuitem>Stop</guimenuitem>
|
|
and <guimenuitem>Trigger incremental pass</guimenuitem>.
|
|
</para>
|
|
|
|
<para><guimenuitem>Trigger incremental pass</guimenuitem> has the
|
|
same effect as restarting the indexer, and will cause a complete
|
|
walk of the indexed area, processing the changed files, then switch
|
|
to monitoring. This is only marginally useful, maybe in cases where
|
|
the indexer is configured to delay updates, or to force an
|
|
immediate rebuild of the stemming and phonetic data, which are only
|
|
processed at intervals by the real time indexer.</para>
|
|
|
|
<para>While it is convenient that data is indexed in real time,
|
|
repeated indexing can generate a significant load on the
|
|
system when files such as email folders change. Also,
|
|
monitoring large file trees by itself significantly taxes
|
|
system resources. You probably do not want to enable it if
|
|
your system is short on resources. Periodic indexing is
|
|
adequate in most cases.</para>
|
|
|
|
<para>As of &RCL; 1.24, you can set the
|
|
<link linkend="RCL.INSTALL.CONFIG.RECOLLCONF.MONITORDIRS">monitordirs</link>
|
|
configuration variable to specify that only a subset of your indexed
|
|
files will be monitored for instant indexing. In this situation, an
|
|
incremental pass on the full tree can be triggered by either
|
|
restarting the indexer, or just running
|
|
<command>recollindex</command>, which will notify the running
|
|
process. The <command>recoll</command> GUI also has a menu entry for
|
|
this.</para>
|
|
|
|
<simplesect id="RCL.INDEXING.MONITOR.START.SYSTEMD">
|
|
<title>Automatic daemon start with systemd</title>
|
|
|
|
<para>The installation contains two example files
|
|
(in <filename>share/recoll/examples</filename>) for starting the indexing daemon with
|
|
<application>systemd</application>.</para>
|
|
<para><filename>recollindex-user.service</filename> would be used for
|
|
starting <command>recollindex</command> as a user service, and can be installed with the
|
|
following commands:
|
|
<programlisting>systemctl --user link /usr/share/recoll/examples/recollindex-user.service
|
|
systemctl --user enable --now recollindex-user.service</programlisting>
|
|
The indexer will start when the user logs in and run while there is a session open for
|
|
them.</para>
|
|
|
|
<para><filename>recollindex-system.service</filename> would be used for starting the indexer
|
|
at boot time, running as a specific user. It can be useful when running the text search as a
|
|
shared service (e.g. when users access it through the WEB UI). You will need to edit it to
|
|
replace the @SOMEUSER@ value with something which makes sense in your case, then install it
|
|
as a regular <application>systemd</application> system service. Of course, if you want to
|
|
run several such units, you will also need to rename the installed file.</para>
|
|
|
|
</simplesect>
|
|
|
|
<simplesect id="RCL.INDEXING.MONITOR.START">
|
|
<title>Automatic daemon start from the desktop session</title>
|
|
|
|
<para>Under <application>KDE</application>,
|
|
<application>Gnome</application> and some other desktop
|
|
environments, the daemon can be automatically started when you log
|
|
in, by creating a desktop file inside the
|
|
<filename>~/.config/autostart</filename> directory. This can be
|
|
done for you by the &RCL; GUI. Use the
|
|
<menuchoice><guimenu>Preferences</guimenu><guimenuitem>Indexing Schedule</guimenuitem></menuchoice> menu.</para>
|
|
|
|
<para>With older <application>X11</application> setups, starting
|
|
the daemon is normally performed as part of the user session
|
|
script.</para>
|
|
|
|
<para>The <filename>rclmon.sh</filename> script can be used to
|
|
easily start and stop the daemon. It can be found in the
|
|
<filename>examples</filename> directory (typically
|
|
<filename>/usr/local/[share/]recoll/examples</filename>).</para>
|
|
|
|
<para>For example, a good old <application>xdm</application>-based
|
|
session could have a <filename>.xsession</filename> script with the
|
|
following lines at the end:</para>
|
|
|
|
<programlisting>recollconf=$HOME/.recoll-home
|
|
recolldata=/usr/local/share/recoll
|
|
RECOLL_CONFDIR=$recollconf $recolldata/examples/rclmon.sh start
|
|
|
|
fvwm
|
|
</programlisting>
|
|
|
|
<para>The indexing daemon gets started, then the window manager,
|
|
for which the session waits.</para> <para>By default the
|
|
indexing daemon will monitor the state of the X11 session, and
|
|
exit when it finishes, so it is not necessary to kill it
|
|
explicitly. (The <application>X11</application> server
|
|
monitoring can be disabled with option <option>-x</option> to
|
|
<command>recollindex</command>).</para>
|
|
|
|
<para>If you use the daemon completely out of an
|
|
<application>X11</application> session, you need to add option
|
|
<option>-x</option> to disable <application>X11</application>
|
|
session monitoring (else the daemon will not start).</para>
|
|
</simplesect>
|
|
|
|
<simplesect id="RCL.INDEXING.MONITOR.DETAILS">
|
|
<title>Miscellaneous details</title>
|
|
|
|
<para>By default, the messages from the indexing daemon will be
|
|
sent to the same file as those from the interactive commands
|
|
(<literal>logfilename</literal>). You may want to change this
|
|
by setting the <varname>daemlogfilename</varname> and
|
|
<varname>daemloglevel</varname> configuration parameters. Also
|
|
the log file will only be truncated when the daemon starts. If
|
|
the daemon runs permanently, the log file may grow quite big,
|
|
depending on the log level.</para>
|
|
|
|
<formalpara><title>Increasing resources for inotify</title>
|
|
<para>On Linux systems, monitoring a big tree may require
|
|
increasing the resources available to inotify, which are
|
|
normally defined in <filename>/etc/sysctl.conf</filename>.
|
|
<programlisting>
|
|
### inotify
|
|
#
|
|
# cat /proc/sys/fs/inotify/max_queued_events - 16384
|
|
# cat /proc/sys/fs/inotify/max_user_instances - 128
|
|
# cat /proc/sys/fs/inotify/max_user_watches - 16384
|
|
#
|
|
# -- Change to:
|
|
#
|
|
fs.inotify.max_queued_events=32768
|
|
fs.inotify.max_user_instances=256
|
|
fs.inotify.max_user_watches=32768
|
|
</programlisting>
|
|
|
|
In particular, you will need to trim your tree or adjust
|
|
the <literal>max_user_watches</literal> value if indexing exits with
|
|
a message about errno <literal>ENOSPC</literal> (28) from
|
|
<function>inotify_add_watch</function>.
|
|
</para>
|
|
</formalpara>
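<para>To get a rough idea of whether a tree fits within the current
limit, you can compare the number of directories in the tree with the
<literal>max_user_watches</literal> value. The sketch below assumes
about one watch per monitored directory, which is an approximation made
for this example; other applications also consume watches, so some
margin is needed. The function names are hypothetical.</para>

<programlisting><![CDATA[
```python
import os

MAX_WATCHES_FILE = "/proc/sys/fs/inotify/max_user_watches"


def count_directories(top):
    """Count top and all the directories below it (symlinks not followed)."""
    return sum(1 for _entry in os.walk(top))


def read_watch_limit(path=MAX_WATCHES_FILE):
    """Return the current max_user_watches value, or None if unreadable."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def tree_fits(top, limit):
    """Tell if top needs no more watches than limit.

    Assumption: roughly one inotify watch per directory.
    """
    return limit is not None and count_directories(top) <= limit
```
]]></programlisting>

<para>For example,
<literal>tree_fits("/home/me", read_watch_limit())</literal> returning
<literal>False</literal> suggests raising
<literal>fs.inotify.max_user_watches</literal> or trimming the monitored
tree.</para>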
|
|
|
|
|
|
<formalpara><title>Slowing down the reindexing rate for fast changing
|
|
files</title>
|
|
<para>When using the real time monitor, it may happen that some
|
|
files need to be indexed, but change so often that they impose an
|
|
excessive load for the system.
|
|
|
|
&RCL; provides a configuration option to specify the minimum
|
|
time before which a file, specified by a wildcard pattern, cannot be
|
|
reindexed. See the <varname>mondelaypatterns</varname> parameter in
|
|
the <link linkend="RCL.INSTALL.CONFIG.RECOLLCONF.MISC">configuration section</link>.
|
|
</para>
|
|
</formalpara>
|
|
|
|
</simplesect>
|
|
|
|
</sect1>
|
|
|
|
</chapter>
|
|
|
|
<chapter id="RCL.SEARCH">
|
|
<title>Searching</title>
|
|
|
|
<sect1 id="RCL.SEARCH.INTRODUCTION">
|
|
<title>Introduction</title>
|
|
|
|
<para>Getting answers to specific queries is of course the whole
|
|
point of &RCL;. The multiple provided interfaces always understand
|
|
simple queries made of one or several words, and return appropriate
|
|
results in most cases.</para>
|
|
|
|
<para>In order to make the most of &RCL; though, it may be worthwhile
|
|
to understand how it processes your input. Five different modes
|
|
exist:
|
|
<itemizedlist>
|
|
<listitem><para>In <literal>All Terms</literal> mode, &RCL; looks
|
|
for documents containing all your input terms.</para></listitem>
|
|
<listitem><para><literal>Query Language</literal> mode behaves like
|
|
|
|
<literal>All Terms</literal> in the absence of special input, but
|
|
it can also do much more. This is the best mode for getting the
|
|
most out of &RCL;.</para></listitem>
|
|
|
|
<listitem><para>In <literal>Any Term</literal> mode, &RCL; looks
|
|
for documents containing any of your input terms, preferring those
|
|
which contain more.</para></listitem>
|
|
|
|
<listitem><para>In <literal>File Name</literal> mode, &RCL; will
|
|
only match file names, not content. Using a small subset of the
|
|
index allows things like left-hand wildcards without performance
|
|
issues, and may sometimes be useful.</para></listitem>
|
|
|
|
<listitem><para>The GUI <literal>Advanced Search</literal> mode is
|
|
actually not more powerful than the query language, but it helps
|
|
you build complex queries without having to remember the language,
|
|
and avoids any interpretation ambiguity, as it bypasses the user
|
|
input parser.</para></listitem>
|
|
|
|
</itemizedlist>
|
|
</para>
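<para>The difference between the <literal>All Terms</literal> and
<literal>Any Term</literal> modes can be illustrated with a toy model.
This sketch only shows the matching logic on a small in-memory
collection: it performs no stemming, and the ordering by number of
matched terms is a simplification, not the relevance computation
actually performed by &XAP;.</para>

<programlisting><![CDATA[
```python
def tokenize(text):
    """Split text into a set of lowercase words."""
    return set(text.lower().split())


def all_terms(docs, query):
    """Names of the documents which contain every query term."""
    terms = tokenize(query)
    return [name for name, text in docs.items()
            if terms.issubset(tokenize(text))]


def any_term(docs, query):
    """Names of the documents which contain at least one query term,
    those matching more terms first."""
    terms = tokenize(query)
    scored = [(len(terms.intersection(tokenize(text))), name)
              for name, text in docs.items()]
    matches = [(count, name) for count, name in scored if count > 0]
    matches.sort(key=lambda m: (-m[0], m[1]))
    return [name for _count, name in matches]
```
]]></programlisting>

<para>With documents containing <literal>virtual reality headset</literal>
and <literal>virtual machine</literal>, the query
<literal>virtual reality</literal> selects only the first one in the
"all terms" model, and both of them, the first one leading, in the
"any term" model.</para>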
|
|
|
|
<para>These five input modes are supported by the different user
|
|
interfaces which are described in the following sections.</para>
|
|
|
|
</sect1>
|
|
|
|
|
|
<sect1 id="RCL.SEARCH.GUI">
|
|
<title>Searching with the Qt graphical user interface</title>
|
|
|
|
<para>The <command>recoll</command> program provides the main user
|
|
interface for searching. It is based on the
|
|
<application>Qt</application> library.</para>
|
|
|
|
<para><command>recoll</command> has two search interfaces:</para>
|
|
<itemizedlist>
|
|
<listitem><para>Simple search (the default, on the main screen) has
|
|
a single entry field where you can enter multiple words.</para>
|
|
</listitem>
|
|
<listitem><para>Advanced search (a panel accessed through the
|
|
<guilabel>Tools</guilabel> menu or the toolbox bar icon) has
|
|
multiple entry fields, which you may use to build a logical
|
|
condition, with additional filtering on file type, location
|
|
in the file system, modification date, and size.</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
|
|
<para>In most cases, you can enter the terms as you think of them, even
|
|
if they contain embedded punctuation or other non-textual characters
|
|
(e.g. &RCL; can handle things like email addresses).</para>
|
|
|
|
<para>The main case where you should enter text differently from
|
|
how it is printed is for East Asian languages (Chinese,
|
|
Japanese, Korean). Words composed of single or multiple
|
|
characters should be entered separated by white space in this
|
|
case (they would typically be printed without white
|
|
space).</para>
|
|
|
|
<para>Some searches can be quite complex, and you may want to re-use
|
|
them later, perhaps with some tweaking. &RCL; can save and restore
|
|
searches. See <link linkend="RCL.SEARCH.SAVING">Saving and restoring
|
|
queries</link>.
|
|
</para>
|
|
|
|
<sect2 id="RCL.SEARCH.GUI.SIMPLE">
|
|
<title>Simple search</title>
|
|
|
|
<procedure>
|
|
<step><para>Start the <command>recoll</command> program.</para>
|
|
</step>
|
|
<step><para>Possibly choose a search mode: <guilabel>Any
|
|
term</guilabel>, <guilabel>All terms</guilabel>,
|
|
<guilabel>File name</guilabel> or
|
|
<guilabel>Query language</guilabel>.</para>
|
|
</step>
|
|
<step><para>Enter search term(s) in the text field at the top of the
|
|
window.</para>
|
|
</step>
|
|
<step><para>Click the <guilabel>Search</guilabel> button or
|
|
hit the <keycap>Enter</keycap> key to start the search.</para>
|
|
</step>
|
|
</procedure>
|
|
|
|
<para>The initial default search mode is <guilabel>Query
|
|
language</guilabel>. Without special directives, this will look for
|
|
documents containing all of the search terms (the ones with more
|
|
terms will get better scores), just like the <guilabel>All
|
|
terms</guilabel> mode. <guilabel>Any term</guilabel> will search
|
|
for documents where at least one of the terms
|
|
appears. <guilabel>File name</guilabel> will exclusively look for
|
|
file names, not contents.</para>
|
|
|
|
<para>All search modes allow terms to be expanded with wildcards
|
|
characters (<literal>*</literal>, <literal>?</literal>,
|
|
<literal>[]</literal>). See the
|
|
<link linkend="RCL.SEARCH.WILDCARDS">section about wildcards</link> for
|
|
more details.</para>
|
|
|
|
<para>In all modes except <guilabel>File name</guilabel>, you can
|
|
search for exact phrases (adjacent words in a given order) by
|
|
enclosing the input inside double quotes. Ex:
|
|
<literal>"virtual reality"</literal>.</para>
|
|
|
|
<para>The <guilabel>Query Language</guilabel> features are
|
|
described in
|
|
<link linkend="RCL.SEARCH.LANG">a separate section</link>.
|
|
</para>
|
|
|
|
<para>When using a stripped index (the default), character case has
|
|
no influence on search, except that you can disable stem expansion
|
|
for any term by capitalizing it. For example, a search for
|
|
<literal>floor</literal> will also normally look for
|
|
<literal>flooring</literal>, <literal>floored</literal>, etc., but
|
|
a search for <literal>Floor</literal> will only look for
|
|
<literal>floor</literal>, in any character case. Stemming can also
|
|
be disabled globally in the preferences. When using a raw index,
|
|
<link linkend="RCL.SEARCH.CASEDIAC">the rules are a bit more complicated</link>.</para>
|
|
|
|
<para>&RCL; remembers the last few searches that you performed. You
|
|
can directly access the search history by clicking the clock button
|
|
on the right of the search entry, while the latter is
|
|
empty. Otherwise, the history is used for entry completion (see
|
|
next). Only the search texts are remembered, not the mode
|
|
(all/any/file name).</para>
|
|
|
|
<para>While text is entered in the search area,
|
|
<command>recoll</command> will display possible completions,
|
|
filtered from the history and the index search terms. This can be
|
|
disabled with a GUI Preferences option.</para>
|
|
|
|
<para>Double-clicking on a word in the result list or a preview
|
|
window will insert it into the simple search entry field.</para>
|
|
|
|
<para>You can cut and paste any text into an <guilabel>All
|
|
terms</guilabel> or <guilabel>Any term</guilabel> search field,
|
|
punctuation, newlines and all - except for wildcard characters
|
|
(single <literal>?</literal> characters are ok). &RCL; will process
|
|
it and produce a meaningful search. This is what most differentiates
|
|
this mode from the <guilabel>Query Language</guilabel> mode, where
|
|
you have to care about the syntax.</para>
|
|
|
|
<para>You can use the <link linkend="RCL.SEARCH.GUI.COMPLEX"><menuchoice><guimenu>Tools</guimenu><guimenuitem>Advanced search</guimenuitem></menuchoice></link>
|
|
dialog for more complex searches.</para>
|
|
|
|
<para>The <guilabel>File name</guilabel> search mode will
|
|
specifically look for file names. The point of having a separate
|
|
file name search is that wild card expansion can be performed more
|
|
efficiently on a small subset of the index (allowing wild cards on
|
|
the left of terms without excessive cost). Things to know:
|
|
<itemizedlist>
|
|
<listitem><para>White space in the entry should match white
|
|
space in the file name, and is not treated specially.</para>
|
|
</listitem>
|
|
<listitem><para>The search is insensitive to character case and
|
|
accents, independently of the type of index.</para>
|
|
</listitem>
|
|
<listitem><para>An entry without any wild card
|
|
character and not capitalized will be prepended and appended
|
|
with '*' (ie: <replaceable>etc</replaceable> ->
|
|
<replaceable>*etc*</replaceable>, but
|
|
<replaceable>Etc</replaceable> ->
|
|
<replaceable>etc</replaceable>).</para>
|
|
</listitem>
|
|
<listitem><para>If you have a big index (many files),
|
|
excessively generic fragments may result in inefficient
|
|
searches.</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
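<para>The wildcard rule from the list above can be modelled as follows.
This is a sketch for illustration, not the code used by &RCL;: the
function names are hypothetical, case insensitivity is emulated by
lowercasing both sides, accent folding is left out, and lowercasing
entries which already contain wildcards is an assumption.</para>

<programlisting><![CDATA[
```python
import fnmatch

WILDCARD_CHARS = "*?["


def expand_file_name_entry(entry):
    """Return the pattern actually searched for a file name entry."""
    has_wildcard = any(ch in entry for ch in WILDCARD_CHARS)
    if has_wildcard or entry[:1].isupper():
        # Explicit wildcards or a capitalized entry: no automatic '*'.
        return entry.lower()
    # Plain lowercase entry: match it anywhere in the file name.
    return "*" + entry.lower() + "*"


def matches(entry, file_name):
    """Tell if file_name matches the entry, ignoring character case."""
    pattern = expand_file_name_entry(entry)
    return fnmatch.fnmatchcase(file_name.lower(), pattern)
```
]]></programlisting>

<para>In this model, <literal>etc</literal> matches
<filename>fetcher.py</filename>, while <literal>Etc</literal> only
matches a file named <filename>etc</filename> (in any character
case).</para>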
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.SEARCH.GUI.RESLIST">
|
|
<title>The result list</title>
|
|
|
|
<para>After starting a search, a list of results will instantly
|
|
be displayed in the main window.</para>
|
|
|
|
<para>By default, the document list is presented in order of
|
|
relevance (how well the system estimates that the document
|
|
matches the query). You can sort the result by ascending or
|
|
descending date by using the vertical arrows in the toolbar.</para>
|
|
|
|
<para>Clicking the <literal>Preview</literal> link for an entry
|
|
will open an internal preview window for the document. Further
|
|
<literal>Preview</literal> clicks for the same search will open
|
|
tabs in the existing preview window. You can use
|
|
<keycap>Shift</keycap>+Click to force the creation of another
|
|
preview window, which may be useful to view the documents side
|
|
by side. (You can also browse successive results in a single
|
|
preview window by typing
|
|
<keycap>Shift</keycap>+<keycap>ArrowUp/Down</keycap> in the
|
|
window).</para>
|
|
|
|
<para>Clicking the <literal>Open</literal> link will
|
|
start an external viewer for the document. By default, &RCL; lets
|
|
the desktop choose the appropriate application for most document
|
|
types. See
|
|
<link linkend="RCL.SEARCH.GUI.RESLIST.APPLICATIONS">below</link>
|
|
for customizing the applications.</para>
|
|
|
|
<para>You can click on the <literal>Query details</literal> link
|
|
at the top of the results page to see the query actually
|
|
performed, after stem expansion and other processing.</para>
|
|
|
|
<para>Double-clicking on any word inside the result list or a
|
|
preview window will insert it into the simple search text.</para>
|
|
|
|
<para>The result list is divided into pages (the size of which
|
|
you can change in the preferences). Use the arrow buttons in the
|
|
toolbar or the links at the bottom of the page to browse the
|
|
results.</para>
|
|
|
|
<para>The <literal>Preview</literal> and <literal>Open</literal>
|
|
edit links may not be present for all entries, meaning that
|
|
&RCL; has no configured way to preview a given file type (which
|
|
was indexed by name only), or no configured external editor for
|
|
the file type. This can sometimes be adjusted simply by tweaking
|
|
the <link linkend="RCL.INSTALL.CONFIG.MIMEMAP">
|
|
<filename>mimemap</filename></link>
|
|
and <link linkend="RCL.INSTALL.CONFIG.MIMEVIEW">
|
|
<filename>mimeview</filename></link>
|
|
configuration files (the latter can be modified with the user
|
|
preferences dialog).</para>
|
|
|
|
<para>The format of the result list entries is entirely
|
|
configurable by using the preference dialog to
|
|
<link linkend="RCL.SEARCH.GUI.CUSTOM.RESLIST">
|
|
edit an HTML fragment</link>.</para>
|
|
|
|
<simplesect id="RCL.SEARCH.GUI.RESLIST.APPLICATIONS">
|
|
<title>Customising the applications</title>
|
|
|
|
<para>By default &RCL; lets the desktop choose what
|
|
application should be used to open a given document, with
|
|
exceptions.</para>
|
|
|
|
<para>The details of this behaviour can be customized with the
|
|
<menuchoice>
|
|
<guimenu>Preferences</guimenu>
|
|
<guimenuitem>GUI configuration</guimenuitem>
|
|
<guimenuitem>User interface</guimenuitem>
|
|
<guimenuitem>Choose editor applications</guimenuitem>
|
|
</menuchoice> dialog or by editing
|
|
the <link linkend="RCL.INSTALL.CONFIG.MIMEVIEW">
<filename>mimeview</filename></link> configuration file.</para>
|
|
|
|
<para>When <guilabel>Use desktop preferences</guilabel>, at the
|
|
top of the dialog, is checked, the desktop default is generally
|
|
used, but there is a small default list of exceptions, for MIME
|
|
types where the &RCL; choice should override the desktop
|
|
one. These are applications which are well integrated with
|
|
&RCL;, for example, on Linux, <application>evince</application>
|
|
for viewing PDF and Postscript files because of its support for
|
|
opening the document at a specific page and passing a search
|
|
string as an argument. You can add or remove document types to
|
|
the exceptions by using the dialog.</para>
|
|
|
|
<para>If you prefer to completely customize the choice of
|
|
applications, you can uncheck <guilabel>Use desktop
|
|
preferences</guilabel>, in which case the &RCL; predefined
|
|
applications will be used, and can be changed for each document
|
|
type. This is probably not the most convenient approach in most
|
|
cases.</para>
|
|
|
|
<para>In all cases, the applications choice dialog accepts
|
|
multiple selections of MIME types in the top section, and lets
|
|
you define how they are processed in the bottom one. In most
|
|
cases, you will be using <literal>%f</literal> as a placeholder
to be replaced by the file name in the application
|
|
command line.</para>
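<para>The following Python sketch models the substitution of the
<literal>%f</literal> placeholder. It is a simplified illustration
written for this manual, not the actual &RCL; code: the
<literal>expand_command</literal> function and the sample
<command>evince</command> command line are only examples, and the
sketch assumes that the replacement is done argument by argument, so
that a file name containing spaces stays a single argument.</para>
<programlisting><![CDATA[
import shlex


def expand_command(template, filename):
    """Split a viewer command template and replace %f in each argument."""
    args = shlex.split(template)
    return [arg.replace("%f", filename) for arg in args]


if __name__ == "__main__":
    # Hypothetical file name, chosen to show that spaces are preserved.
    print(expand_command("evince %f", "/home/me/my report.pdf"))
]]></programlisting>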
|
|
|
|
<para>You may also change the choice of applications by editing
|
|
the
|
|
<link linkend="RCL.INSTALL.CONFIG.MIMEVIEW">
|
|
<filename>mimeview</filename></link>
|
|
configuration file if you find this more convenient.</para>
|
|
|
|
<para>Under &LIN;, each result list entry also has a right-click
|
|
menu with an
|
|
<guilabel>Open With</guilabel> entry. This lets you choose an
|
|
application from the list of those which registered with the desktop
|
|
for the document MIME type, on a case by case basis.</para>
|
|
</simplesect>
|
|
|
|
|
|
<sect3 id="RCL.SEARCH.GUI.RESLIST.SUGGS">
|
|
<title>No results: the spelling suggestions</title>
|
|
|
|
<para>When a search yields no result, and if the
|
|
<application>aspell</application> dictionary is configured, &RCL;
|
|
will try to check for misspellings among the query terms, and
|
|
will propose lists of replacements. Clicking on one of the
|
|
suggestions will replace the word and restart the search. You can
|
|
hold any of the modifier keys (Ctrl, Shift, etc.) while clicking
|
|
if you would rather stay on the suggestion screen because several
|
|
terms need replacement.</para>
|
|
|
|
</sect3>
|
|
|
|
|
|
<sect3 id="RCL.SEARCH.GUI.RESULTLIST.MENU">
|
|
<title>The result list right-click menu</title>
|
|
|
|
<para>Apart from the preview and edit links, you can display a
|
|
pop-up menu by right-clicking over a paragraph in the result
|
|
list. This menu has the following entries:</para>
|
|
|
|
<itemizedlist>
|
|
<listitem><para><guilabel>Preview</guilabel></para></listitem>
|
|
<listitem><para><guilabel>Open</guilabel></para></listitem>
|
|
<listitem><para><guilabel>Open With</guilabel></para></listitem>
|
|
<listitem><para><guilabel>Run Script</guilabel></para></listitem>
|
|
<listitem><para><guilabel>Copy File Name</guilabel></para></listitem>
|
|
<listitem><para><guilabel>Copy Url</guilabel></para></listitem>
|
|
<listitem><para><guilabel>Save to File</guilabel></para></listitem>
|
|
<listitem><para><guilabel>Find similar</guilabel></para></listitem>
|
|
<listitem><para><guilabel>Preview Parent
|
|
document</guilabel></para></listitem>
|
|
<listitem><para><guilabel>Open Parent
|
|
document</guilabel></para></listitem>
|
|
<listitem><para><guilabel>Open Snippets
|
|
Window</guilabel></para></listitem>
|
|
</itemizedlist>
|
|
|
|
<para>The <guilabel>Preview</guilabel> and
|
|
<guilabel>Open</guilabel> entries do the same thing as the
|
|
corresponding links.</para>
|
|
|
|
<para><guilabel>Open With</guilabel> (&LIN;) lets you open the
|
|
document with one of the applications claiming to be able to
|
|
handle its MIME type (the information comes from
|
|
the <literal>.desktop</literal> files
|
|
in <filename>/usr/share/applications</filename>).</para>
|
|
|
|
<para><guilabel>Run Script</guilabel> (&LIN;) allows starting an
|
|
arbitrary command on the result file. It will only appear for
|
|
results which are top-level
|
|
files. See <link linkend="RCL.SEARCH.GUI.RUNSCRIPT">further</link>
|
|
for a more detailed description.</para>
|
|
|
|
<para>The <guilabel>Copy File Name</guilabel> and
<guilabel>Copy Url</guilabel> entries copy the relevant data to the
|
|
clipboard, for later pasting.</para>
|
|
|
|
<para><guilabel>Save to File</guilabel> allows saving the
|
|
contents of a result document to a chosen file. This entry
|
|
will only appear if the document does not correspond to an
|
|
existing file, but is a subdocument inside such a file (e.g. an
|
|
email attachment). It is especially useful to extract attachments
|
|
with no associated editor.</para>
|
|
|
|
<para>The <guilabel>Open/Preview Parent document</guilabel> entries
|
|
allow working with the higher level document (e.g. the email
|
|
message an attachment comes from). &RCL; is sometimes not totally
|
|
accurate as to what it can or can't do in this area. For example
|
|
the <guilabel>Parent</guilabel> entry will also appear for an
|
|
email which is part of an mbox folder file, but you can't actually
|
|
visualize the mbox (there will be an error dialog if you
|
|
try).</para>
|
|
|
|
<para>If the document is a top-level file, <guilabel>Open
|
|
Parent</guilabel> will start the default file manager on the
|
|
enclosing filesystem directory.</para>
|
|
|
|
<para>The <guilabel>Find similar</guilabel> entry will select
|
|
a number of relevant terms from the current document and enter
|
|
them into the simple search field. You can then start a simple
|
|
search, with a good chance of finding documents related to the
|
|
current result. I can't remember a single instance where this
|
|
function was actually useful to me...</para>
|
|
|
|
<para id="RCL.SEARCH.GUI.RESULTLIST.MENU.SNIPPETS">The
|
|
<guilabel>Open Snippets Window</guilabel> entry will only
|
|
appear for documents which support page breaks (typically
|
|
PDF, Postscript, DVI). The snippets window lists extracts from
|
|
the document, taken around search term occurrences, along with the
|
|
corresponding page number, as links which can be used to start
|
|
the native viewer on the appropriate page. If the viewer supports
|
|
it, its search function will also be primed with one of the
|
|
search terms.</para>
|
|
|
|
</sect3>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.SEARCH.GUI.RESTABLE">
|
|
<title>The result table</title>
|
|
|
|
<para>As an alternative to the result list, the results can also be
|
|
displayed in spreadsheet-like fashion. You can switch to this
|
|
presentation by clicking the table-like icon in the toolbar (this
|
|
is a toggle, click again to restore the list).</para>
|
|
|
|
<para>Clicking on the column headers will allow sorting by the
|
|
values in the column. You can click again to invert the order, and
|
|
use the header right-click menu to reset sorting to the default
|
|
relevance order (you can also use the sort-by-date arrows to do
|
|
this).</para>
|
|
|
|
<para>Both the list and the table display the same underlying
|
|
results. The sort order set from the table is still active if you
|
|
switch back to the list mode. You can click twice on a date sort
|
|
arrow to reset it from there.</para>
|
|
|
|
<para>The header right-click menu allows adding or deleting
|
|
columns. The columns can be resized, and their order can be changed
|
|
(by dragging). All the changes are recorded when you quit
|
|
<command>recoll</command>.</para>
|
|
|
|
<para>Hovering over a table row will update the detail area at the
|
|
bottom of the window with the corresponding values. You can click
|
|
the row to freeze the display. The bottom area is equivalent to a
|
|
result list paragraph, with links for starting a preview or a
|
|
native application, and an equivalent right-click menu. Typing
|
|
<keycap>Esc</keycap> (the Escape key) will unfreeze the
|
|
display.</para>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.SEARCH.GUI.RUNSCRIPT">
|
|
<title>&LIN;: running arbitrary commands on result files</title>
|
|
|
|
<para>Apart from the <guilabel>Open</guilabel> and <guilabel>Open
|
|
With</guilabel> operations, which allow starting an application on a
|
|
result document (or a temporary copy), based on its MIME type, it is
|
|
also possible to run arbitrary commands on results which are
|
|
top-level files, using the <guilabel>Run Script</guilabel> entry in
|
|
the results pop-up menu.</para>
|
|
|
|
<para>The commands which will appear in the <guilabel>Run
|
|
Script</guilabel> submenu must be defined by
|
|
<literal>.desktop</literal> files inside the
|
|
<filename>scripts</filename> subdirectory of the current
|
|
configuration directory.</para>
|
|
|
|
<para>Here is an example of a <literal>.desktop</literal> file,
which could be stored as
<filename>~/.recoll/scripts/myscript.desktop</filename> (the exact
file name inside the directory is irrelevant):
|
|
<programlisting>
[Desktop Entry]
Type=Application
Name=MyFirstScript
Exec=/home/me/bin/tryscript %F
MimeType=*/*
</programlisting>
|
|
The <literal>Name</literal> attribute defines the label which will
|
|
appear inside the <guilabel>Run Script</guilabel> menu. The
|
|
<literal>Exec</literal> attribute defines the program to be run,
|
|
which does not need to actually be a script, of course. The
|
|
<literal>MimeType</literal> attribute is not used, but needs to exist.
|
|
</para>
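<para>If you want to check such a file before copying it to the
<filename>scripts</filename> directory, a short Python sketch like
the following can read it and print the menu label and the command.
This only illustrates the file format. The
<literal>read_script_entry</literal> function is a hypothetical
helper, and nothing here claims to reproduce how &RCL; itself parses
the file.</para>
<programlisting><![CDATA[
import configparser

SAMPLE = """\
[Desktop Entry]
Type=Application
Name=MyFirstScript
Exec=/home/me/bin/tryscript %F
MimeType=*/*
"""


def read_script_entry(text):
    """Return (label, command) from the text of a .desktop file."""
    # interpolation=None keeps the literal "%F" in the Exec value.
    parser = configparser.ConfigParser(interpolation=None)
    # Keep the key names case-sensitive, as they appear in the file.
    parser.optionxform = str
    parser.read_string(text)
    entry = parser["Desktop Entry"]
    return entry["Name"], entry["Exec"]


if __name__ == "__main__":
    label, command = read_script_entry(SAMPLE)
    print(label, "runs", command)
]]></programlisting>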
|
|
|
|
<para>The commands defined this way can also be used from links
|
|
inside the
|
|
<link linkend="RCL.SEARCH.GUI.CUSTOM.RESLIST.PARA">result paragraph</link>.
|
|
</para>
|
|
|
|
<para>As an example, it might make sense to write a script which
|
|
would move the document to the trash and purge it from the &RCL;
|
|
index.</para>
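<para>A minimal sketch of such a script, in Python, could look as
follows. It rests on two assumptions which you should check on your
system: that the <command>gio trash</command> command from GLib is
available to move a file to the desktop trash, and that
<command>recollindex -e</command> erases the data for the given
files from the index of the default configuration. The function
names are illustrative only.</para>
<programlisting><![CDATA[
#!/usr/bin/env python3
import subprocess
import sys


def build_commands(path):
    """Return the two commands to run for one result file."""
    return [
        ["gio", "trash", path],        # assumed: moves the file to the trash
        ["recollindex", "-e", path],   # assumed: erases the file from the index
    ]


def main(paths, dry_run=False):
    """Trash and unindex each path, or only print the commands."""
    for path in paths:
        for command in build_commands(path):
            if dry_run:
                print(" ".join(command))
            else:
                subprocess.run(command, check=True)


if __name__ == "__main__":
    # The Exec line of the .desktop file passes the file path(s) here.
    main(sys.argv[1:])
]]></programlisting>
<para>The script would be referenced from the
<literal>Exec</literal> line of a <literal>.desktop</literal> file
in the <filename>scripts</filename> directory, as described
above. Calling <literal>main</literal> with
<literal>dry_run=True</literal> only prints the commands, which is a
safer way to try it first.</para>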
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.SEARCH.GUI.THUMBNAILS">
|
|
<title>&LIN;: displaying thumbnails</title>
|
|
|
|
<para>The default format for the result list entries and for the
detail area of the result table displays an icon for each result
|
|
document. The icon is either a generic one determined from the
|
|
MIME type, or a thumbnail of the document appearance. Thumbnails
|
|
are only displayed if found in the standard
|
|
<application>freedesktop</application> location, where they would
|
|
typically have been created by a file manager.</para>
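<para>To check whether a thumbnail exists for a given document, you
can compute the file name which the freedesktop thumbnail
convention would use: the MD5 hash of the document URI, with a
<filename>.png</filename> extension, under the
<filename>thumbnails</filename> directory of the user cache. The
Python sketch below assumes the usual
<filename>~/.cache</filename> default location and a simple
percent-encoding of the path. File managers may encode some unusual
characters differently, so treat the result as a hint. The
<literal>thumbnail_path</literal> function is a name chosen for this
example.</para>
<programlisting><![CDATA[
import hashlib
import os
from pathlib import Path
from urllib.parse import quote


def thumbnail_path(document, size="normal"):
    """Return where a freedesktop thumbnail for 'document' would be stored."""
    # The thumbnail name is derived from the file:// URI of the document.
    uri = "file://" + quote(os.path.abspath(document))
    name = hashlib.md5(uri.encode("utf-8")).hexdigest() + ".png"
    cache = os.environ.get("XDG_CACHE_HOME") or str(Path.home() / ".cache")
    return Path(cache) / "thumbnails" / size / name


if __name__ == "__main__":
    # Hypothetical document path: replace it with one of your own files.
    candidate = thumbnail_path("/home/me/docs/report.pdf")
    print(candidate, "exists" if candidate.exists() else "is missing")
]]></programlisting>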
|
|
|
|
<para>&RCL; has no capability to create thumbnails. A relatively
|
|
simple trick is to use the <guilabel>Open parent
|
|
document/folder</guilabel> entry in the result list popup
|
|
menu. This should open a file manager window on the containing
|
|
directory, which should in turn create the thumbnails (depending on
|
|
your settings). Restarting the search should then display the
|
|
thumbnails.</para>
|
|
|
|
<para>There are also <ulink url="&FAQS;ResultsThumbnails.html">some
|
|
pointers about thumbnail generation</ulink> in the &RCL;
|
|
FAQ.</para>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.SEARCH.GUI.PREVIEW">
|
|
<title>The preview window</title>
|
|
|
|
<para>The preview window opens when you first click a
|
|
<literal>Preview</literal> link inside the result list.</para>
|
|
|
|
<para>Subsequent preview requests for a given search open new
|
|
tabs in the existing window (except if you hold the
|
|
<keycap>Shift</keycap> key while clicking, which will open a new
|
|
window for side by side viewing).</para>
|
|
|
|
<para>Starting another search and requesting a preview will
|
|
create a new preview window. The old one stays open until you
|
|
close it.</para>
|
|
|
|
<para>You can close a preview tab by typing <keycap>Ctrl-W</keycap>
|
|
(<keycap>Ctrl</keycap> + <keycap>W</keycap>) in the window. Closing
|
|
the last tab, or using the window manager button in the top of the
|
|
frame will also close the window.</para>
|
|
|
|
<para>You can display successive or previous documents from the
|
|
result list inside a preview tab by typing
|
|
<keycap>Shift</keycap>+<keycap>Down</keycap> or
|
|
<keycap>Shift</keycap>+<keycap>Up</keycap> (<keycap>Down</keycap>
|
|
and <keycap>Up</keycap> are the arrow keys).</para>
|
|
|
|
<para>A right-click menu in the text area allows switching
|
|
between displaying the main text or the contents of fields
|
|
associated with the document (e.g. author, abstract). This is
|
|
especially useful in cases where the term match did not occur in
|
|
the main text but in one of the fields. In the case of
|
|
images, you can switch between three displays: the image
|
|
itself, the image metadata as extracted
|
|
by <command>exiftool</command> and the fields, which is the
|
|
metadata stored in the index.</para>
|
|
|
|
|
|
<para>You can print the current preview window contents by typing
|
|
<keycap>Ctrl-P</keycap> (<keycap>Ctrl</keycap> +
|
|
<keycap>P</keycap>) in the window text.</para>
|
|
|
|
|
|
<sect3 id="RCL.SEARCH.GUI.PREVIEW.SEARCH">
|
|
<title>Searching inside the preview</title>
|
|
|
|
<para>The preview window has an internal search capability,
|
|
mostly controlled by the panel at the bottom of the window,
|
|
which works in two modes: as a classical editor incremental
|
|
search, where we look for the text entered in the entry
|
|
zone, or as a way to walk the matches between the document
|
|
and the &RCL; query that found it.</para>
|
|
|
|
<variablelist>
|
|
<varlistentry>
|
|
<term>Incremental text search</term>
|
|
<listitem><para>The preview tabs have an internal incremental search
|
|
function. You initiate the search either by typing a
|
|
<keycap>/</keycap> (slash) or <keycap>Ctrl-F</keycap>
|
|
inside the text area or by clicking into
|
|
the <guilabel>Search for:</guilabel> text field and
|
|
entering the search string. You can then use the
|
|
<guilabel>Next</guilabel>
|
|
and <guilabel>Previous</guilabel> buttons
|
|
to find the next/previous occurrence. You can also type
|
|
<keycap>F3</keycap> inside the text area to get to the next
|
|
occurrence.</para>
|
|
<para>If you have a search string entered and you use
|
|
<keycap>Shift</keycap>+<keycap>Up</keycap>/<keycap>Down</keycap> to browse the results, the search is
|
|
initiated for each successive document. If the string is
|
|
found, the cursor will be positioned at the first
|
|
occurrence of the search string.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>Walking the match lists</term>
|
|
<listitem><para>If the entry area is empty when you click
|
|
the <guilabel>Next</guilabel>
|
|
or <guilabel>Previous</guilabel> buttons, the editor will
|
|
be scrolled to show the next match to any search term
|
|
(the next highlighted zone). If you select a search group
|
|
from the dropdown list and click <guilabel>Next</guilabel>
|
|
or <guilabel>Previous</guilabel>, the match list for this
|
|
group will be walked. This is not the same as a text
|
|
search, because the occurrences will include non-exact
|
|
matches (as caused by stemming or wildcards). The search
|
|
will revert to the text mode as soon as you edit the
|
|
entry area.</para></listitem>
|
|
</varlistentry>
|
|
</variablelist>
|
|
|
|
|
|
</sect3>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.SEARCH.GUI.FRAGBUTS">
|
|
<title>The Query Fragments window</title>
|
|
|
|
<para>Selecting the <menuchoice><guimenu>Tools</guimenu>
|
|
<guimenuitem>Query Fragments</guimenuitem></menuchoice> menu
|
|
entry will open a window with radio- and check-buttons which
|
|
can be used to activate query language fragments for
|
|
filtering the current query. This can be useful if you have
|
|
frequent reusable selectors, for example, filtering on
|
|
alternate directories, or searching just one category of
|
|
files, not covered by the standard category
|
|
selectors.</para>
|
|
|
|
<para>The contents of the window are entirely customizable, and
|
|
defined by the contents of the <filename>fragbuts.xml</filename>
|
|
file inside the configuration directory. The sample file
|
|
distributed with &RCL; (which you should be able to find under
|
|
<filename>/usr/share/recoll/examples/fragbuts.xml</filename>)
|
|
contains an example which filters the results from the Web
|
|
history.</para>
|
|
|
|
|
|
<para>Here follows an example:
|
|
<programlisting><![CDATA[
<?xml version="1.0" encoding="UTF-8"?>
<fragbuts version="1.0">

  <radiobuttons>
    <!-- Actually useful: toggle Web queue results inclusion -->
    <fragbut>
      <label>Include Web Results</label>
      <frag></frag>
    </fragbut>

    <fragbut>
      <label>Exclude Web Results</label>
      <frag>-rclbes:BGL</frag>
    </fragbut>

    <fragbut>
      <label>Only Web Results</label>
      <frag>rclbes:BGL</frag>
    </fragbut>

  </radiobuttons>

  <buttons>

    <fragbut>
      <label>Example: Year 2010</label>
      <frag>date:2010-01-01/2010-12-31</frag>
    </fragbut>

    <fragbut>
      <label>Example: c++ files</label>
      <frag>ext:cpp OR ext:cxx</frag>
    </fragbut>

    <fragbut>
      <label>Example: My Great Directory</label>
      <frag>dir:/my/great/directory</frag>
    </fragbut>

  </buttons>

</fragbuts>
]]></programlisting>
|
|
</para>
|
|
|
|
<para>Each <literal>radiobuttons</literal> or
<literal>buttons</literal> section defines a line of
radiobuttons or checkbuttons, respectively, inside the window. Any
number of checkbuttons can be selected, but the radiobuttons in a
line are mutually exclusive.</para>
|
|
|
|
<para>Each <literal>fragbut</literal> section defines the label
|
|
for a button, and the Query Language fragment which will be
|
|
added (as an AND filter) before performing the query if the
|
|
button is active.</para>
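<para>If you maintain many directory filters, it may be easier to
generate the file than to edit it by hand. The Python sketch below
builds a <filename>fragbuts.xml</filename> with one checkbutton per
directory, using only the elements shown in the sample above. The
<literal>build_fragbuts</literal> function and the directory names
are hypothetical, and the sketch assumes paths without spaces (a
path containing spaces would need quoting in the query
language).</para>
<programlisting><![CDATA[
import xml.etree.ElementTree as ET


def build_fragbuts(directories):
    """Return fragbuts.xml text with one checkbutton per (label, path)."""
    root = ET.Element("fragbuts", version="1.0")
    buttons = ET.SubElement(root, "buttons")
    for label, path in directories:
        fragbut = ET.SubElement(buttons, "fragbut")
        ET.SubElement(fragbut, "label").text = label
        # Each fragment is a query language filter on a directory.
        ET.SubElement(fragbut, "frag").text = "dir:" + path
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body + "\n"


if __name__ == "__main__":
    # Hypothetical directories: replace them with your own.
    print(build_fragbuts([("Projects", "/home/me/projects"),
                          ("Mail", "/home/me/mail")]))
]]></programlisting>
<para>The output would be saved as
<filename>fragbuts.xml</filename> inside the configuration
directory.</para>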
|
|
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="RCL.SEARCH.GUI.COMPLEX">
|
|
<title>Complex/advanced search</title>
|
|
|
|
<para>The advanced search dialog helps you build more complex queries
|
|
without memorizing the search language constructs. It can be opened
|
|
through the <guilabel>Tools</guilabel> menu or through the main
|
|
toolbar.</para>
|
|
|
|
<para>&RCL; keeps a history of searches. See
|
|
<link linkend="RCL.SEARCH.GUI.COMPLEX.HISTORY">Advanced search history</link>.
|
|
</para>
|
|
|
|
<para>The dialog has two tabs:</para>
|
|
|
|
<orderedlist>
|
|
|
|
<listitem><para>The first tab lets you specify terms to search
|
|
for, and permits specifying multiple clauses which are combined
|
|
to build the search.</para>
|
|
</listitem>
|
|
|
|
<listitem><para>The second tab lets you filter the results according
|
|
to file size, date of modification, MIME type, or
|
|
location.</para>
|
|
</listitem>
|
|
|
|
</orderedlist>
|
|
|
|
<para>Click on the <guilabel>Start Search</guilabel> button in
|
|
the advanced search dialog, or type <keycap>Enter</keycap> in
|
|
any text field to start the search. The button in
|
|
the main window always performs a simple search.</para>
|
|
|
|
<para>Click on the <literal>Show query details</literal> link at
|
|
the top of the result page to see the query expansion.</para>
|
|
|
|
<sect3 id="RCL.SEARCH.GUI.COMPLEX.TERMS">
|
|
<title>Advanced search: the "find" tab</title>
|
|
|
|
<para>This part of the dialog lets you construct a query by
|
|
combining multiple clauses of different types. Each entry
|
|
field is configurable for the following modes:</para>
|
|
|
|
<itemizedlist>
|
|
<listitem><para>All terms.</para>
|
|
</listitem>
|
|
<listitem><para>Any term.</para>
|
|
</listitem>
|
|
<listitem><para>None of the terms.</para>
|
|
</listitem>
|
|
<listitem><para>Phrase (exact terms in order within an
|
|
adjustable window).</para>
|
|
</listitem>
|
|
<listitem><para>Proximity (terms in any order within an
|
|
adjustable window).</para>
|
|
</listitem>
|
|
<listitem><para>Filename search.</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
|
|
<para>Additional entry fields can be created by clicking the
|
|
<guilabel>Add clause</guilabel> button.</para>
|
|
|
|
<para>When searching, the non-empty clauses will be
|
|
combined either with an AND or an OR conjunction, depending on
|
|
the choice made on the left (<guilabel>All clauses</guilabel> or
|
|
<guilabel>Any clause</guilabel>).</para>
|
|
|
|
<para>Entries of all types except "Phrase" and "Near" accept
|
|
a mix of single words and phrases enclosed in double quotes.
|
|
Stemming and wildcard expansion will be performed as for simple
|
|
search. </para>
|
|
|
|
<formalpara><title>Phrases and Proximity searches</title>
|
|
<para>These two clauses work in similar ways, with the difference
|
|
that proximity searches do not impose an order on the words. In
|
|
both cases, an adjustable number (slack) of non-matched words may
|
|
be accepted between the searched ones (use the counter on the
|
|
left to adjust this count). For phrases, the default count is
|
|
zero (exact match). For proximity it is ten (meaning that two
search terms would be matched if found within a window of twelve
|
|
words). Examples: a phrase search for
|
|
<literal>quick fox</literal> with a slack of 0 will match
|
|
<literal>quick fox</literal> but not
|
|
<literal>quick brown fox</literal>. With
|
|
a slack of 1 it will match the latter, but not
|
|
<literal>fox quick</literal>. A proximity search for
|
|
<literal>quick fox</literal> with the default slack will
|
|
match the latter, and also
|
|
<literal>a fox is a cunning and quick animal</literal>.</para>
|
|
</formalpara>
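<para>The effect of the slack value can be modelled with a few
lines of Python. This is a simplified sketch for illustration, not
the algorithm used by &RCL; or &XAP;: it works on whitespace-separated
lowercase words, ignores stemming, and the
<literal>slack_match</literal> function is a name made up for this
example. The terms match if they can all be found inside a window of
(number of terms + slack) words, in the given order for a phrase, in
any order for a proximity search.</para>
<programlisting><![CDATA[
from itertools import product


def slack_match(terms, text, slack=0, ordered=True):
    """Tell if all terms occur within a window of len(terms) + slack words."""
    words = text.lower().split()
    # For each term, the list of word positions where it occurs.
    positions = [[i for i, w in enumerate(words) if w == t.lower()]
                 for t in terms]
    window = len(terms) + slack
    for combo in product(*positions):
        if len(set(combo)) != len(combo):
            continue    # the same word cannot serve two terms
        if ordered and list(combo) != sorted(combo):
            continue    # a phrase search imposes the term order
        if max(combo) - min(combo) + 1 <= window:
            return True
    return False


if __name__ == "__main__":
    sample = "a fox is a cunning and quick animal"
    print("phrase:", slack_match(["quick", "fox"], sample, slack=10))
    print("proximity:", slack_match(["quick", "fox"], sample, slack=10,
                                    ordered=False))
]]></programlisting>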
|
|
|
|
</sect3>
|
|
|
|
<sect3 id="RCL.SEARCH.GUI.COMPLEX.FILTER">
|
|
<title>Advanced search: the "filter" tab</title>
|
|
|
|
<para>This part of the dialog has several sections which allow
|
|
filtering the results of a search according to a number of
|
|
criteria:</para>
|
|
|
|
<itemizedlist>
|
|
|
|
<listitem>
|
|
<para>The first section allows filtering by dates of last
|
|
modification. You can specify both a minimum and a maximum
|
|
date. The initial values are set according to the oldest and
|
|
newest documents found in the index.</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>The next section allows filtering the results by
|
|
file size. There are two entries for minimum and maximum
|
|
size. Enter decimal numbers. You can use suffix multipliers:
|
|
<literal>k/K</literal>, <literal>m/M</literal>,
|
|
<literal>g/G</literal>, <literal>t/T</literal> for 1E3, 1E6,
|
|
1E9, 1E12 respectively.</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>The next section allows filtering the results by their MIME
|
|
types, or MIME categories (e.g. media, text, message).</para>
|
|
<para>You can transfer the types between two boxes, to define
|
|
which will be included or excluded by the search.</para>
|
|
<para>The state of the file type selection can be saved as
|
|
the default (the file type filter will not be activated at
|
|
program start-up, but the lists will be in the restored
|
|
state).</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>The bottom section allows restricting the search results to a
|
|
sub-tree of the indexed area. You can use the
|
|
<guilabel>Invert</guilabel> checkbox to search for files not in
|
|
the sub-tree instead. If you use directory filtering often and on
|
|
big subsets of the file system, you may think of setting up
|
|
multiple indexes instead, as the performance may be
|
|
better.</para>
|
|
<para>You can use relative/partial paths for filtering. For example,
|
|
entering <literal>dirA/dirB</literal> would match either
|
|
<filename>/dir1/dirA/dirB/myfile1</filename> or
|
|
<filename>/dir2/dirA/dirB/someother/myfile2</filename>.</para>
|
|
</listitem>
|
|
|
|
</itemizedlist>
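<para>As an illustration of the size suffixes described above, the
following Python sketch converts an entry such as
<literal>10k</literal> to a plain number, using the decimal
multipliers listed (1E3, 1E6, 1E9, 1E12). The
<literal>parse_size</literal> function is a hypothetical helper
written for this manual. It only shows the arithmetic and is not
taken from the &RCL; source.</para>
<programlisting><![CDATA[
# Decimal multipliers for the k/K, m/M, g/G and t/T suffixes.
MULTIPLIERS = {"k": 10**3, "m": 10**6, "g": 10**9, "t": 10**12}


def parse_size(text):
    """Convert a size entry such as '10k' or '2M' to a number."""
    text = text.strip()
    suffix = text[-1:].lower()
    if suffix in MULTIPLIERS:
        return int(text[:-1]) * MULTIPLIERS[suffix]
    return int(text)


if __name__ == "__main__":
    for entry in ("512", "10k", "2M"):
        print(entry, "is", parse_size(entry))
]]></programlisting>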
|
|
|
|
</sect3>
|
|
|
|
<sect3 id="RCL.SEARCH.GUI.COMPLEX.HISTORY">
|
|
<title>Advanced search history</title>
|
|
|
|
<para>The advanced search tool memorizes the last 100 searches
|
|
performed. You can walk the saved searches by using the up and
|
|
down arrow keys while the keyboard focus belongs to the advanced
|
|
search dialog.</para>
|
|
|
|
<para>The complex search history can be erased, along with the
|
|
one for simple search, by selecting the <menuchoice>
|
|
<guimenu>File</guimenu>
|
|
<guimenuitem>Erase Search History</guimenuitem>
|
|
</menuchoice> menu entry.</para>
|
|
|
|
</sect3>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.SEARCH.GUI.TERMEXPLORER">
|
|
<title>The term explorer tool</title>
|
|
|
|
<para>&RCL; automatically manages the expansion of search terms
|
|
to their derivatives (e.g. plural/singular, verb
|
|
inflections). But there are other cases where the exact search
|
|
term is not known. For example, you may not remember the exact
|
|
spelling, or only know the beginning of the name.</para>
|
|
|
|
<para>The search will only propose replacement terms with
|
|
spelling variations when no matching documents were found. In some
cases, both proper spellings and misspellings are present in the
|
|
index, and it may be interesting to look for them explicitly.</para>
|
|
|
|
<para>The term explorer tool (started from the toolbar icon or
|
|
from the <guilabel>Term explorer</guilabel> entry of the
|
|
<guilabel>Tools</guilabel> menu) can be used to search the full index
|
|
terms list. It has several modes of operation:</para>
|
|
<variablelist>
|
|
|
|
<varlistentry>
|
|
<term>Wildcard</term>
|
|
<listitem><para>In this mode of operation, you can enter a
|
|
search string with shell-like wildcards (*, ?, []). For example,
<replaceable>xapi*</replaceable> would display all index terms
beginning with <replaceable>xapi</replaceable>. (More
about wildcards
<link linkend="RCL.SEARCH.WILDCARDS">here</link>).</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>Regular expression</term>
|
|
<listitem><para>This mode will accept a regular expression
|
|
as input. Example:
|
|
<replaceable>word[0-9]+</replaceable>. The expression is
|
|
implicitly anchored at the beginning. For example,
|
|
<replaceable>press</replaceable> will match
|
|
<replaceable>pression</replaceable> but not
|
|
<replaceable>expression</replaceable>. You can use
|
|
<replaceable>.*press</replaceable> to match the latter,
|
|
but be aware that this will cause a full index term list
|
|
scan, which can be quite long.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>Stem expansion</term>
|
|
<listitem><para>This mode will perform the usual stem expansion
|
|
normally done as part of user input processing. As such it is
|
|
probably mostly useful to demonstrate the process.
|
|
</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>Spelling/Phonetic</term> <listitem><para>In this
|
|
mode, you enter the term as you think it is spelled, and
|
|
&RCL; will do its best to find index terms that sound like
|
|
your entry. This mode uses the
|
|
<application>Aspell</application> spelling application,
|
|
which must be installed on your system for things to work
|
|
(if your documents contain non-ascii characters, &RCL;
|
|
needs an aspell version newer than 0.60 for UTF-8
|
|
support). The language which is used to build the
|
|
dictionary out of the index terms (which is done at the
|
|
end of an indexing pass) is the one defined by your NLS
|
|
environment. Weird things will probably happen if
|
|
languages are mixed up.</para></listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term>Show index statistics</term> <listitem><para>This will
|
|
print a long list of boring numbers about the index.</para>
|
|
</listitem></varlistentry>
|
|
<varlistentry>
|
|
<term>List files which could not be indexed</term>
|
|
<listitem><para>This will show the files which caused errors,
|
|
usually because <command>recollindex</command> could not
|
|
translate their format into text.</para>
|
|
</listitem></varlistentry>
|
|
</variablelist>
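<para>The difference between the wildcard and regular expression
modes can be tried out with the Python sketch below, which applies
both kinds of patterns to a small made-up term list. It is only a
model: the term list and function names are invented for the
example, and the regular expression dialect used by &RCL; may differ
in details from the Python one. The implicit anchoring at the
beginning is imitated with <literal>re.match</literal>.</para>
<programlisting><![CDATA[
import fnmatch
import re

# A made-up extract of an index term list.
TERMS = ["expression", "press", "pression", "xapian", "xapiand"]


def wildcard_expand(pattern, terms=TERMS):
    """Return the terms matching a shell-like wildcard pattern."""
    return [t for t in terms if fnmatch.fnmatchcase(t, pattern)]


def regex_expand(pattern, terms=TERMS):
    """Return the terms matching a regular expression anchored at the start."""
    # re.match only matches at the beginning of the term.
    return [t for t in terms if re.match(pattern, t)]


if __name__ == "__main__":
    print(wildcard_expand("xapi*"))
    print(regex_expand("press"))
    print(regex_expand(".*press"))
]]></programlisting>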
|
|
|
|
<para>Note that in cases where &RCL; does not know the beginning
|
|
of the string to search for (e.g. a wildcard expression like
|
|
<replaceable>*coll</replaceable>), the expansion can take quite
|
|
a long time because the full index term list will have to be
|
|
processed. The expansion is currently limited to 10000 results for
|
|
wildcards and regular expressions. It is possible to change the
|
|
limit in the configuration file.</para>
|
|
|
|
<para>Double-clicking on a term in the result list will insert
|
|
it into the simple search entry field. You can also cut/paste
|
|
between the result list and any entry field (the end of lines
|
|
will be taken care of).</para>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.SEARCH.GUI.MULTIDB">
|
|
<title>Multiple indexes</title>
|
|
|
|
<para>See the section describing
|
|
<link linkend="RCL.INDEXING.CONFIG.MULTIPLE">the use of multiple indexes</link> for
|
|
generalities. Only the aspects concerning the
|
|
<command>recoll</command> GUI are described here.</para>
|
|
|
|
<para>A <command>recoll</command> program instance is always
|
|
associated with a specific index, which is the one to be updated
|
|
when requested from the <guimenu>File</guimenu> menu, but it can
|
|
use any number of &RCL; indexes for searching. The external
|
|
indexes can be selected through the <guilabel>external
|
|
indexes</guilabel> tab in the preferences dialog.</para>
|
|
|
|
<para>Index selection is performed in two phases. A set of all usable
|
|
indexes must first be defined, and then the subset of indexes to be
|
|
used for searching. These parameters are retained across program
|
|
executions (they are kept separately for each &RCL;
|
|
configuration). The set of all indexes is usually quite stable, while
|
|
the active ones might typically be adjusted quite frequently.</para>
|
|
|
|
<para>The main index (defined by
|
|
<envar>RECOLL_CONFDIR</envar>) is always active. If this is
|
|
undesirable, you can set up your base configuration to index
|
|
an empty directory.</para>
|
|
|
|
<para>When adding a new index to the set, you can select either
|
|
a &RCL; configuration directory, or directly a &XAP; index
|
|
directory. In the first case, the &XAP; index directory will
|
|
be obtained from the selected configuration.</para>
|
|
|
|
<para>As building the set of all indexes can be a little tedious
|
|
when done through the user interface, you can use the
|
|
<envar>RECOLL_EXTRA_DBS</envar> environment
|
|
variable to provide an initial set. This might typically be
|
|
set up by a system administrator so that every user does not
|
|
have to do it. The variable should define a colon-separated list
|
|
of index directories, for example:
|
|
</para>
|
|
<screen>export RECOLL_EXTRA_DBS=/some/place/xapiandb:/some/other/db</screen>
|
|
|
|
<para>Another environment
|
|
variable, <envar>RECOLL_ACTIVE_EXTRA_DBS</envar> allows adding to
|
|
the active list of indexes. This variable was suggested and
|
|
implemented by a &RCL; user. It is mostly useful if you use scripts
|
|
to mount external volumes with &RCL; indexes. By
|
|
using <envar>RECOLL_EXTRA_DBS</envar>
|
|
and <envar>RECOLL_ACTIVE_EXTRA_DBS</envar>, you can add and
|
|
activate the index for the mounted volume when
|
|
starting <command>recoll</command>. Unreachable indexes will
|
|
automatically be deactivated when starting up.</para>
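<para>A wrapper script can prepare both variables before starting
the GUI. The Python sketch below builds such an environment from a
list of candidate index directories, keeping only those which
currently exist. It assumes that
<envar>RECOLL_ACTIVE_EXTRA_DBS</envar> uses the same colon-separated
format as <envar>RECOLL_EXTRA_DBS</envar>. The
<literal>recoll_environment</literal> function and the mount point
are hypothetical.</para>
<programlisting><![CDATA[
import os


def recoll_environment(index_dirs, base_env=None):
    """Return an environment which adds and activates the reachable indexes."""
    env = dict(os.environ if base_env is None else base_env)
    # Skip the indexes on volumes which are not mounted at the moment.
    reachable = [d for d in index_dirs if os.path.isdir(d)]
    if reachable:
        joined = ":".join(reachable)
        env["RECOLL_EXTRA_DBS"] = joined
        env["RECOLL_ACTIVE_EXTRA_DBS"] = joined
    return env


if __name__ == "__main__":
    # Hypothetical mount point: adjust it to your own external volume.
    env = recoll_environment(["/media/backup/recoll/xapiandb"])
    print(env.get("RECOLL_ACTIVE_EXTRA_DBS", "(no reachable index)"))
]]></programlisting>
<para>The returned dictionary can then be passed as the
<literal>env</literal> argument of
<literal>subprocess.run(["recoll"], env=env)</literal> to start the
GUI with these settings.</para>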
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.SEARCH.GUI.HISTORY">
|
|
<title>Document history</title>
|
|
|
|
<para>Documents that you actually view (with the internal preview
|
|
or an external tool) are entered into the document history,
|
|
which is remembered.</para>
|
|
<para>You can display the history list by using
|
|
the <guilabel>Tools/</guilabel><guilabel>Doc History</guilabel> menu
|
|
entry.</para>
|
|
<para>You can erase the document history by using the
|
|
<guilabel>Erase document history</guilabel> entry in the
|
|
<guimenu>File</guimenu> menu.</para>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.SEARCH.GUI.SORT">
|
|
<title>Sorting search results and collapsing duplicates</title>
|
|
|
|
<para>The documents in a result list are normally sorted in
|
|
order of relevance. It is possible to specify a different sort
|
|
order, either by using the vertical arrows in the GUI toolbox to
|
|
sort by date, or switching to the result table display and clicking
|
|
on any header. The sort order chosen inside the result table
remains active if you switch back to the result list, until you
click one of the vertical arrows. When both arrows are unchecked,
results are sorted by relevance again.</para>
|
|
|
|
<para>Sort parameters are remembered between program
|
|
invocations, but result sorting is normally inactive
|
|
when the program starts. It is possible to keep the sorting
|
|
activation state between program invocations by checking the
|
|
<guilabel>Remember sort activation state</guilabel> option in
|
|
the preferences.</para>
|
|
|
|
<para>It is also possible to hide duplicate entries inside
|
|
the result list (documents with the exact same contents as the
|
|
displayed one). The test of identity is based on an MD5 hash
|
|
of the document container, not only of the text contents (so
|
|
that, for example, a text document with an image added will not be a
|
|
duplicate of the text only). Duplicate hiding is controlled
|
|
by an entry in the <guilabel>GUI configuration</guilabel>
|
|
dialog, and is off by default.</para>
|
|
|
|
<para>When a result document has undisplayed duplicates,
|
|
a <literal>Dups</literal> link will be shown with the result list
|
|
entry. Clicking the link will display the paths (URLs + ipaths)
|
|
for the duplicate entries.</para>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.SEARCH.GUI.SHORTCUTS">
|
|
<title>Keyboard shortcuts</title>
|
|
|
|
<para>A number of common actions within the graphical interface can
|
|
be triggered through keyboard shortcuts. As of &RCL; 1.29, many
|
|
of the shortcut values can be customised from a screen in the GUI
|
|
preferences. Most shortcuts are specific to a given context
|
|
(e.g. within a preview window, within the result table).</para>
|
|
|
|
<table frame='all'>
|
|
<title>Keyboard shortcuts</title>
|
|
<tgroup cols='2' align='left' colsep='1' rowsep='1'>
|
|
<colspec colname='c1'/>
|
|
<colspec colname='c2'/>
|
|
<thead>
|
|
<row><entry>Description</entry><entry>Default value</entry></row>
|
|
</thead>
|
|
|
|
<tbody>
|
|
|
|
<row><entry namest="c1" nameend="c2">
|
|
<command>Context: almost everywhere</command></entry></row>
|
|
<row>
|
|
<entry>Program exit</entry>
|
|
<entry>Ctrl+Q</entry>
|
|
</row>
|
|
|
|
<row><entry namest="c1" nameend="c2">
|
|
<command>Context: advanced search</command></entry></row>
|
|
<row>
|
|
<entry>Load the next entry from the search history</entry>
|
|
<entry>Up</entry>
|
|
</row>
|
|
<row>
|
|
<entry>Load the previous entry from the search history</entry>
|
|
<entry>Down</entry>
|
|
</row>
|
|
|
|
<row><entry namest="c1" nameend="c2">
|
|
<command>Context: main window</command></entry>
|
|
</row>
|
|
<row>
|
|
<entry>Clear search. This will move the keyboard cursor to
|
|
the simple search entry and erase the current text</entry>
|
|
<entry>Ctrl+S</entry>
|
|
</row>
|
|
<row>
|
|
<entry>Move the keyboard cursor to the search entry area
|
|
without erasing the current text</entry>
|
|
<entry>Ctrl+L</entry>
|
|
</row>
|
|
<row>
|
|
<entry>Move the keyboard cursor to the search entry area
|
|
without erasing the current text</entry>
|
|
<entry>Ctrl+Shift+S</entry>
|
|
</row>
|
|
<row>
|
|
<entry>Toggle displaying the current results as a table or
|
|
as a list</entry>
|
|
<entry>Ctrl+T</entry>
|
|
</row>
|
|
|
|
<row><entry namest="c1" nameend="c2">
|
|
<command>Context: main window, when showing the results
|
|
as a table</command></entry>
|
|
</row>
|
|
<row>
|
|
<entry>Move the keyboard cursor to the currently selected row
|
|
in the table, or to the first one if none is selected</entry>
|
|
<entry>Ctrl+R</entry>
|
|
</row>
|
|
<row>
|
|
<entry>Jump to row 0-9 or a-z in the table</entry>
|
|
<entry>Ctrl+[0-9] or Ctrl+Shift+[a-z]</entry>
|
|
</row>
|
|
<row>
|
|
<entry>Cancel the current selection</entry><entry>Esc</entry>
|
|
</row>
|
|
|
|
|
|
<row><entry namest="c1" nameend="c2">
|
|
<command>Context: preview window</command></entry>
|
|
</row>
|
|
<row>
|
|
<entry>Close the preview window</entry>
|
|
<entry>Esc</entry>
|
|
</row>
|
|
<row>
|
|
<entry>Close the current tab</entry>
|
|
<entry>Ctrl+W</entry>
|
|
</row>
|
|
<row>
|
|
<entry>Open a print dialog for the current tab contents</entry>
|
|
<entry>Ctrl+P</entry>
|
|
</row>
|
|
<row>
|
|
<entry>Load the next result from the list to the current tab</entry>
|
|
<entry>Shift+Down</entry>
|
|
</row>
|
|
<row>
|
|
<entry>Load the previous result from the list to the current tab</entry>
|
|
<entry>Shift+Up</entry>
|
|
</row>
|
|
|
|
<row><entry namest="c1" nameend="c2">
|
|
<command>Context: result table</command></entry></row>
|
|
<row>
|
|
<entry>Copy the text contained in the selected
|
|
document to the clipboard</entry> <entry>Ctrl+G</entry>
|
|
</row>
|
|
<row>
|
|
<entry>Open the current document and exit Recoll</entry>
|
|
<entry>Ctrl+Shift+O</entry>
|
|
</row>
|
|
<row>
|
|
<entry>Open the current document</entry>
|
|
<entry>Ctrl+O</entry>
|
|
</row>
|
|
<row>
|
|
<entry>Show a full preview for the current document</entry>
|
|
<entry>Ctrl+D</entry>
|
|
</row>
|
|
<row>
|
|
<entry>Toggle showing the column names</entry>
|
|
<entry>Ctrl+H</entry>
|
|
</row>
|
|
<row>
|
|
<entry>Show a snippets (keyword in context) list for the current document</entry>
|
|
<entry>Ctrl+E</entry>
|
|
</row>
|
|
<row>
|
|
<entry>Toggle showing the row letters/numbers</entry>
|
|
<entry>Ctrl+V</entry>
|
|
</row>
|
|
|
|
<row><entry namest="c1" nameend="c2">
|
|
<command>Context: snippets window</command></entry></row>
|
|
<row>
|
|
<entry>Close the snippets window</entry>
|
|
<entry>Esc</entry>
|
|
</row>
|
|
<row>
|
|
<entry>Find in the snippets list (method #1)</entry>
|
|
<entry>Ctrl+F</entry>
|
|
</row>
|
|
<row>
|
|
<entry>Find in the snippets list (method #2)</entry>
|
|
<entry>/</entry>
|
|
</row>
|
|
<row>
|
|
<entry>Find the next instance of the search term</entry>
|
|
<entry>F3</entry>
|
|
</row>
|
|
<row>
|
|
<entry>Find the previous instance of the search term</entry>
|
|
<entry>Shift+F3</entry>
|
|
</row>
|
|
</tbody>
|
|
</tgroup>
|
|
</table>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.SEARCH.GUI.TIPS">
|
|
<title>Search tips</title>
|
|
|
|
<sect3 id="RCL.SEARCH.GUI.TIPS.TERMS">
|
|
<title>Terms and search expansion</title>
|
|
|
|
<formalpara><title>Term completion</title>
|
|
<para>While typing into the
|
|
simple search entry, a popup menu will appear and show
|
|
completions for the current string. Values preceded by a clock
|
|
icon come from the history, those preceded by a magnifier icon
|
|
come from the index terms. This can be disabled in the
|
|
preferences.</para>
|
|
</formalpara>
|
|
|
|
<formalpara><title>Picking up new terms from result or preview
|
|
text</title>
|
|
<para>Double-clicking on a word in the result list or in a
|
|
preview window will copy it to the simple search entry field.</para>
|
|
</formalpara>
|
|
|
|
<formalpara><title>Wildcards</title>
|
|
<para>Wildcards can be used inside search terms in all forms
|
|
of searches. <link linkend="RCL.SEARCH.WILDCARDS">More about wildcards</link>.
|
|
</para>
|
|
</formalpara>
|
|
|
|
<formalpara><title>Automatic suffixes</title>
|
|
<para>Words like <literal>odt</literal> or <literal>ods</literal>
|
|
can be automatically turned into query language
|
|
<literal>ext:xxx</literal> clauses. This can be enabled in the
|
|
<guilabel>Search preferences</guilabel> panel in the GUI.
|
|
</para>
|
|
</formalpara>
|
|
|
|
<formalpara><title>Disabling stem expansion</title>
|
|
<para>Entering a capitalized word in any search field will prevent
|
|
stem expansion (no search for
|
|
<literal>gardening</literal> if you enter
|
|
<literal>Garden</literal> instead of
|
|
<literal>garden</literal>). This is the only case where
|
|
character case should make a difference for a &RCL;
|
|
search. You can also disable stem expansion or change the
|
|
stemming language in the preferences.</para>
|
|
</formalpara>
|
|
|
|
<formalpara><title>Finding related documents</title>
|
|
<para>Selecting the <guilabel>Find similar documents</guilabel> entry
|
|
in the result list paragraph right-click menu will select a
|
|
set of "interesting" terms from the current result, and insert
|
|
them into the simple search entry field. You can then possibly
|
|
edit the list and start a search to find documents which may
|
|
be related to the current result.</para>
|
|
</formalpara>
|
|
|
|
<formalpara><title>File names</title>
|
|
<para>File names are added as terms during indexing, and you can
|
|
specify them as ordinary terms in normal search fields (&RCL; used
|
|
to index all directories in the file path as terms. This has been
|
|
abandoned as it did not seem really useful). Alternatively, you
|
|
can use the specific file name search which will
|
|
<emphasis>only</emphasis> look for file names, and may be
|
|
faster than the generic search especially when using wildcards.</para>
|
|
</formalpara>
|
|
|
|
</sect3>
|
|
|
|
|
|
<sect3 id="RCL.SEARCH.GUI.TIPS.PHRASES">
|
|
<title>Working with phrases and proximity</title>
|
|
|
|
<formalpara><title>Phrase searches</title>
|
|
<para>A phrase can be
|
|
looked for by enclosing a number of terms in double
|
|
quotes. Example: <literal>"user manual"</literal> will look only
|
|
for occurrences of <literal>user</literal> immediately followed
|
|
by <literal>manual</literal>. You can use
|
|
the <guilabel>"Phrase"</guilabel> field of the advanced search
|
|
dialog to the same effect. Phrases can be entered along with simple
|
|
terms in all simple or advanced search entry fields,
|
|
except <guilabel>"Phrase"</guilabel>. </para></formalpara>
|
|
|
|
<formalpara><title>Proximity searches</title>
|
|
<para>A proximity search differs from a phrase search in that
|
|
it does not impose an order on the terms. Proximity searches
|
|
can be entered by specifying
|
|
the <guilabel>"Proximity"</guilabel> type in the advanced
|
|
search, or by postfixing a phrase search with a 'p'. Example:
|
|
"user manual"p would also match "manual user". Also
|
|
see <link linkend="RCL.SEARCH.LANG.MODIFIERS">the modifier
|
|
section</link> from the query language
|
|
documentation.</para></formalpara>
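<para>The two forms above can be compared side by side. This is only a
sketch of what would be typed into a search field; the text after each
query is a comment, not part of the query:</para>
<screen>"user manual"     user immediately followed by manual
"user manual"p    both terms close to each other, in any order
                  (also matches: manual user)</screen>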
|
|
|
|
<formalpara><title>AutoPhrases</title>
|
|
<para>This option can be set in the preferences dialog. If it is
|
|
set, a phrase will be automatically built and added to simple
|
|
searches when looking for <literal>Any terms</literal>. This
|
|
will not radically change the results, but will give a relevance
|
|
boost to the results where the search terms appear as a
|
|
phrase. For example, searching for <literal>virtual reality</literal>
|
|
will still find all documents where either
|
|
<literal>virtual</literal> or <literal>reality</literal> or
|
|
both appear, but those which contain
|
|
<literal>virtual reality</literal> should appear sooner in the
|
|
list.</para></formalpara>
|
|
|
|
<para>Phrase searches can slow down a query if most of the
|
|
terms in the phrase are common. If
|
|
the <varname>autophrase</varname> option is on, very common
|
|
terms will be removed from the automatically constructed
|
|
phrase. The removal threshold can be adjusted from the search
|
|
preferences.</para>
|
|
|
|
<formalpara><title>Phrases and abbreviations</title>
|
|
<para>Dotted abbreviations like
|
|
<literal>I.B.M.</literal> are also automatically indexed as a
|
|
word without the dots: <literal>IBM</literal>. Searching for
|
|
the word inside a phrase (e.g. <literal>"the IBM
company"</literal>) will only match the dotted abbreviation
|
|
if you increase the phrase slack (using the advanced search
|
|
panel control, or the <literal>o</literal> query language
|
|
modifier). Literal occurrences of the word will be matched
|
|
normally.</para>
|
|
</formalpara>
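<para>As a sketch, assuming that the <literal>o</literal> modifier
directly follows the closing quote in the same way as the
<literal>p</literal> modifier, adding slack to the phrase above could be
written as follows (see the query language modifier section for the exact
syntax and for how to set the slack value):</para>
<screen>"the IBM company"o</screen>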
|
|
|
|
</sect3>
|
|
|
|
<sect3 id="RCL.SEARCH.GUI.TIPS.MISC">
|
|
<title>Others</title>
|
|
|
|
<formalpara><title>Using fields</title>
|
|
<para>You can use the <link linkend="RCL.SEARCH.LANG">query
|
|
language</link> and field specifications
|
|
to only search certain parts of documents. This can be
|
|
especially helpful with email, for example only searching
|
|
emails from a specific originator:
|
|
<literal>search tips from:helpfulgui</literal>
|
|
</para></formalpara>
|
|
|
|
<formalpara><title>Adjusting the result table columns</title>
|
|
<para>When displaying results in table mode, you can use a
|
|
right click on the table headers to activate a pop-up menu
|
|
which will let you adjust what columns are displayed. You can
|
|
drag the column headers to adjust their order. You can click
|
|
them to sort by the field displayed in the column. You can
|
|
also save the result list in CSV format.</para>
|
|
</formalpara>
|
|
|
|
|
|
<formalpara><title>Changing the GUI geometry</title>
|
|
<para>It is possible to configure the GUI in wide form
|
|
factor by dragging the toolbars to one of the sides (their
|
|
location is remembered between sessions), and moving the
|
|
category filters to a menu (can be set in the
|
|
<menuchoice>
|
|
<guimenu>Preferences</guimenu>
|
|
<guimenuitem>GUI configuration</guimenuitem>
|
|
<guimenuitem>User interface</guimenuitem>
|
|
</menuchoice> panel).</para>
|
|
</formalpara>
|
|
|
|
<formalpara><title>Query explanation</title>
|
|
<para>You can get an exact description of what the query
|
|
looked for, including stem expansion, and Boolean operators
|
|
used, by clicking on the result list header.</para>
|
|
</formalpara>
|
|
|
|
<formalpara><title>Advanced search history</title> <para>You can
|
|
display any of the last 100 complex searches performed by
|
|
using the up and down arrow keys while the advanced search
|
|
panel is active.</para>
|
|
</formalpara>
|
|
|
|
<formalpara><title>Forced opening of a preview window</title>
|
|
<para>You can use <keycap>Shift</keycap>+Click on a result list
|
|
<literal>Preview</literal> link to force the creation of a
|
|
preview window instead of a new tab in the existing one.</para>
|
|
</formalpara>
|
|
|
|
</sect3>
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.SEARCH.SAVING">
|
|
<title>Saving and restoring queries (1.21 and later)</title>
|
|
|
|
<para>Both simple and advanced query dialogs save recent
|
|
history, but the amount is limited: old queries will eventually
|
|
be forgotten. Also, important queries may be difficult to find
|
|
among others. This is why both types of queries can also be
|
|
explicitly saved to files, from the GUI menus:
|
|
<menuchoice>
|
|
<guimenu>File</guimenu>
|
|
<guimenuitem>Save last query / Load last query</guimenuitem>
|
|
</menuchoice>
|
|
</para>
|
|
|
|
<para>The default location for saved queries is a subdirectory
|
|
of the current configuration directory, but saved queries are
|
|
ordinary files and can be written or moved anywhere.</para>
|
|
|
|
<para>Some of the saved query parameters are part of the
|
|
preferences (e.g. <literal>autophrase</literal> or the active
|
|
external indexes), and may differ when the query is
|
|
loaded from the time it was saved. In this case, &RCL; will warn
|
|
of the differences, but will not change the user
|
|
preferences.</para>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.SEARCH.GUI.CUSTOM">
|
|
<title>Customizing the search interface</title>
|
|
|
|
<para>You can customize some aspects of the search interface by using
|
|
the <guimenu>GUI configuration</guimenu> entry in the
|
|
<guimenu>Preferences</guimenu> menu.</para>
|
|
|
|
<para>There are several tabs in the dialog, dealing with the
|
|
interface itself, the parameters used for searching and
|
|
returning results, and what indexes are searched.</para>
|
|
|
|
|
|
<formalpara id="RCL.SEARCH.GUI.CUSTOM.UI">
|
|
<title>User interface parameters:</title>
|
|
<para>
|
|
<itemizedlist>
|
|
|
|
<listitem><para><guilabel>Highlight color for query
|
|
terms</guilabel>: Terms from the user query are highlighted in
|
|
the result list samples and the preview window. The color can
|
|
be chosen here. Any Qt color string should work (e.g.
|
|
<literal>red</literal>, <literal>#ff0000</literal>). The
|
|
default is <literal>blue</literal>.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><guilabel>Style sheet</guilabel>:
|
|
The name of a <application>Qt</application> style sheet
|
|
text file which is applied to the whole Recoll application
|
|
on startup. The default value is empty, but there is a
|
|
skeleton style sheet (<filename>recoll.qss</filename>)
|
|
inside the <filename>/usr/share/recoll/examples</filename>
|
|
directory. Using a style sheet, you can change most
|
|
<command>recoll</command> graphical parameters:
|
|
colors, fonts, etc. See the sample file for a few
|
|
simple examples.</para>
|
|
<para>You should be aware that parameters (e.g.: the
|
|
background color) set inside the &RCL; GUI style sheet
|
|
will override global system preferences, with possible
|
|
strange side effects: for example if you set the
|
|
foreground to a light color and the background to a
|
|
dark one in the desktop preferences, but only the
|
|
background is set inside the &RCL; style sheet, and it
|
|
is light too, then text will appear light-on-light
|
|
inside the &RCL; GUI.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><guilabel>Maximum text size highlighted for
|
|
preview</guilabel>: inserting highlights on search terms inside
|
|
the text before inserting it in the preview window involves
|
|
quite a lot of processing, and can be disabled over the given
|
|
text size to speed up loading.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><guilabel>Prefer HTML to plain text for
|
|
preview</guilabel>: if set, Recoll will display HTML as such
|
|
inside the preview window. If this causes problems with the Qt
|
|
HTML display, you can uncheck it to display the plain text
|
|
version instead. </para>
|
|
</listitem>
|
|
|
|
<listitem><para><guilabel>Activate links in
|
|
preview</guilabel>: if set, Recoll will turn HTTP links found
|
|
inside plain text into proper HTML anchors, and clicking a
|
|
link inside a preview window will start the default browser
|
|
on the link target.</para> </listitem>
|
|
|
|
<listitem><para><guilabel>Plain text to HTML line
|
|
style</guilabel>: when displaying plain text inside the
|
|
preview window, &RCL; tries to preserve some of the original
|
|
text line breaks and indentation. It can either use PRE HTML
|
|
tags, which will preserve the indentation well but will force
|
|
horizontal scrolling for long lines, or use BR tags to break
|
|
at the original line breaks, which will let the editor
|
|
introduce other line breaks according to the window width,
|
|
but will lose some of the original indentation. The third
|
|
option has been available in recent releases and is probably
|
|
now the best one: use PRE tags with line wrapping.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><guilabel>Choose editor
|
|
application</guilabel>: this opens a dialog which allows you
|
|
to select the application to be used to open each MIME
|
|
type. The default is to use the <command>xdg-open</command>
|
|
utility, but you can use this dialog to override it, setting
|
|
exceptions for MIME types that will still be opened according
|
|
to &RCL; preferences. This is useful for passing parameters
|
|
like page numbers or search strings to applications that
|
|
support them (e.g. <application>evince</application>). This
|
|
cannot be done with <command>xdg-open</command> which only
|
|
supports passing one parameter.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><guilabel>Disable Qt autocompletion in search
|
|
entry</guilabel>: this will disable the completion popup. It
|
|
will only appear, and display the full history, either if you
|
|
enter only white space in the search area, or if you click
|
|
the clock button on the right of the area.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><guilabel>Document filter choice
|
|
style</guilabel>: this will let you choose if the document
|
|
categories are displayed as a list or a set of buttons, or a
|
|
menu.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><guilabel>Start with simple search
|
|
mode</guilabel>: this lets you choose the value of the simple
|
|
search type on program startup. Either a fixed value
|
|
(e.g. <literal>Query Language</literal>), or the value in use
|
|
when the program last exited.</para></listitem>
|
|
|
|
<listitem><para><guilabel>Start with advanced search dialog open
|
|
</guilabel>: if you use this dialog frequently, checking
this entry will make it open when <command>recoll</command> starts.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><guilabel>Remember sort activation
|
|
state</guilabel>: if set, Recoll will remember the sort tool
state between invocations. It normally starts with sorting
|
|
disabled.</para>
|
|
</listitem>
|
|
|
|
</itemizedlist>
|
|
</para>
|
|
</formalpara>
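<para>As an illustration of the <guilabel>Style sheet</guilabel> entry
above, the following minimal sketch (not taken from the
<filename>recoll.qss</filename> sample file) sets the background and
foreground colors together, which avoids the light-on-light problem
described above. The color values are arbitrary:</para>
<programlisting>/* Always set both colors together so that the text stays readable */
QWidget {
    background-color: #202020;
    color: #e0e0e0;
}</programlisting>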
|
|
|
|
|
|
<formalpara id="RCL.SEARCH.GUI.CUSTOM.RL">
|
|
<title>Result list parameters:</title>
|
|
<para>
|
|
<itemizedlist>
|
|
|
|
<listitem><para><guilabel>Number of results in a result
|
|
page</guilabel></para>
|
|
</listitem>
|
|
|
|
<listitem><para><guilabel>Result list font</guilabel>: There is
|
|
quite a lot of information shown in the result list, and you
|
|
may want to customize the font and/or font size. The rest of
|
|
the fonts used by &RCL; are determined by your generic Qt
|
|
config (try the <command>qtconfig</command> command).</para>
|
|
</listitem>
|
|
|
|
<listitem id="RCL.SEARCH.GUI.CUSTOM.RESULTPARA">
|
|
<para><guilabel>Edit result list paragraph format string</guilabel>:
|
|
allows you to change the presentation of each result list
|
|
entry. See the
|
|
<link linkend="RCL.SEARCH.GUI.CUSTOM.RESLIST">result list customisation section</link>.
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem id="RCL.SEARCH.GUI.CUSTOM.RESULTHEAD">
|
|
<para><guilabel>Edit result page HTML header insert</guilabel>:
|
|
allows you to define text inserted at the end of the result
|
|
page HTML header.
|
|
More detail in the
|
|
<link linkend="RCL.SEARCH.GUI.CUSTOM.RESLIST">result list customisation section</link>.
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para><guilabel>Date format</guilabel>: allows specifying the
|
|
format used for displaying dates inside the result list. This
|
|
should be specified as an strftime() string (man strftime).</para>
|
|
</listitem>
|
|
|
|
<listitem id="RCL.SEARCH.GUI.CUSTOM.ABSSEP">
|
|
<para><guilabel>Abstract snippet separator</guilabel>:
|
|
for synthetic abstracts built from index data, which are
|
|
usually made of several snippets from different parts of the
|
|
document, this defines the snippet separator, an ellipsis by
|
|
default. </para>
|
|
</listitem>
|
|
|
|
</itemizedlist></para>
|
|
</formalpara>
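<para>As an example for the <guilabel>Date format</guilabel> entry
above, the following value uses standard strftime() conversion
specifiers (4-digit year, month, day, then hours and minutes on a 24-hour
clock). The chosen layout is only an illustration:</para>
<screen>%Y-%m-%d %H:%M</screen>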
|
|
|
|
<formalpara id="RCL.SEARCH.GUI.CUSTOM.SEARCH">
|
|
<title>Search parameters:</title>
|
|
<para>
|
|
<itemizedlist>
|
|
|
|
<listitem><para><guilabel>Hide duplicate results</guilabel>:
|
|
decides if result list entries are shown for identical
|
|
documents found in different places.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><guilabel>Stemming language</guilabel>:
|
|
stemming obviously depends on the document's language. This
|
|
listbox will let you choose among the stemming databases which
|
|
were built during indexing (this is set in the
|
|
<link linkend="RCL.INSTALL.CONFIG.RECOLLCONF">main configuration file</link>),
|
|
or later added with
|
|
<command>recollindex -s</command> (see the recollindex
|
|
manual). Stemming languages
|
|
which are dynamically added will be deleted at the next
|
|
indexing pass unless they are also added in the configuration
|
|
file.</para>
|
|
</listitem>
|
|
|
|
<listitem><para>
|
|
<guilabel>Automatically add phrase to simple searches</guilabel>:
|
|
a phrase will be automatically built and
|
|
added to simple searches when looking for
|
|
<literal>Any terms</literal>. This will give a relevance
|
|
boost to the results where the search terms appear as a
|
|
phrase (consecutive and in order).</para>
|
|
</listitem>
|
|
|
|
<listitem><para><guilabel>Autophrase term frequency threshold
|
|
percentage</guilabel>: very frequent terms should not be included
|
|
in automatic phrase searches for performance reasons. The
|
|
parameter defines the cutoff percentage (percentage of the
|
|
documents where the term appears).</para>
|
|
</listitem>
|
|
|
|
<listitem><para><guilabel>Replace abstracts from
|
|
documents</guilabel>: this decides if we should synthesize and
|
|
display an abstract in place of an explicit abstract found
|
|
within the document itself.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><guilabel>Dynamically build
|
|
abstracts</guilabel>: this decides if &RCL; tries to build
|
|
document abstracts (lists of <emphasis>snippets</emphasis>)
|
|
when displaying the result list. Abstracts are constructed by
|
|
taking context from the document information, around the search
|
|
terms.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><guilabel>Synthetic abstract size</guilabel>:
|
|
adjust to taste...</para>
|
|
</listitem>
|
|
|
|
<listitem><para><guilabel>Synthetic abstract context
|
|
words</guilabel>: how many words should be displayed around
|
|
each term occurrence.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><guilabel>Query language magic file name
|
|
suffixes</guilabel>: a list of words which automatically get
|
|
turned into <literal>ext:xxx</literal> file name suffix clauses
|
|
when starting a query language query (e.g.:
|
|
<literal>doc xls xlsx...</literal>).
|
|
This will save some typing for people who
|
|
use file types a lot when querying.</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
</formalpara>
|
|
|
|
<formalpara id="RCL.SEARCH.GUI.CUSTOM.EXTRADB">
|
|
<title>External indexes:</title>
|
|
<para>This panel will let you browse for additional indexes
|
|
that you may want to search. External indexes are designated by
|
|
their database directory (e.g.
|
|
<filename>/home/someothergui/.recoll/xapiandb</filename>,
|
|
<filename>/usr/local/recollglobal/xapiandb</filename>).</para>
|
|
</formalpara>
|
|
|
|
<para>Once entered, the indexes will appear in the
|
|
<guilabel>External indexes</guilabel> list, and you can
|
|
choose which ones you want to use at any moment by checking or
|
|
unchecking their entries.</para>
|
|
|
|
<para>Your main database (the one the current configuration
|
|
indexes to), is always implicitly active. If this is not
|
|
desirable, you can set up your configuration so that it indexes,
|
|
for example, an empty directory. An alternative indexer may also
need to implement a way of purging stale data from the index.
|
|
</para>
|
|
|
|
<sect3 id="RCL.SEARCH.GUI.CUSTOM.RESLIST">
|
|
<title>The result list format</title>
|
|
|
|
<para>Recoll normally uses a full function HTML processor to
|
|
display the result list and the
|
|
<link linkend="RCL.SEARCH.GUI.RESULTLIST.MENU.SNIPPETS">
|
|
snippets window</link>. Depending on the version, this may be
|
|
based on either Qt WebKit or Qt WebEngine.
|
|
It is then possible to completely customise the result list with full
|
|
support for CSS and JavaScript.</para>
|
|
|
|
<para>It is also possible to build &RCL; to use a simpler Qt
|
|
QTextBrowser widget to display the HTML, which may be necessary
|
|
if the ones above are not available on the system, or to reduce
|
|
the application size and dependencies. There are limits to what
|
|
you can do in this case, but it is still possible to decide
|
|
what data each result will contain, and how it will be
|
|
displayed.</para>
|
|
|
|
<para>The result list presentation can be customized
|
|
by adjusting two elements:
|
|
|
|
<itemizedlist>
|
|
<listitem><para>The paragraph format</para></listitem>
|
|
<listitem><para>HTML code inside the header section. For
|
|
versions 1.21 and later, this is also used for the
|
|
<link linkend="RCL.SEARCH.GUI.RESULTLIST.MENU.SNIPPETS">snippets window</link>.
|
|
</para></listitem>
|
|
</itemizedlist>
|
|
The paragraph format and the header fragment can be edited
|
|
from the <guilabel>Result list</guilabel> tab of the
|
|
<guilabel>GUI configuration</guilabel>.
|
|
</para>
|
|
|
|
<para>The header fragment is used both for the result list and
|
|
the snippets window. The snippets list is a table and has a
|
|
<literal>snippets</literal> class attribute. Each paragraph in
|
|
the result list is a table, with class
|
|
<literal>respar</literal>, but this can be changed by editing
|
|
the paragraph format.</para>
|
|
|
|
<para>There are a few examples on the
|
|
<ulink url="http://www.recoll.org/pages/custom.html">page about
|
|
customising the result list</ulink> on the &RCL; web site.</para>
|
|
|
|
<sect4 id="RCL.SEARCH.GUI.CUSTOM.RESLIST.PARA">
|
|
<title>The paragraph format</title>
|
|
|
|
<para>This is an arbitrary HTML string where the following printf-like
|
|
<literal>%</literal> substitutions will be performed:
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<formalpara><title>%A</title><para>Abstract</para></formalpara>
|
|
</listitem>
|
|
<listitem><formalpara><title>%D</title><para>Date</para></formalpara>
|
|
</listitem>
|
|
<listitem><formalpara><title>%I</title><para>Icon image
|
|
name. This is normally determined from the MIME type. The
|
|
associations are defined inside the
|
|
<link linkend="RCL.INSTALL.CONFIG.MIMECONF"><filename>mimeconf</filename> configuration file</link>.
|
|
If a thumbnail for the file is found at
|
|
the standard Freedesktop location, this will be displayed
|
|
instead.</para></formalpara>
|
|
</listitem>
|
|
<listitem><formalpara><title>%K</title><para>Keywords (if
|
|
any)</para></formalpara>
|
|
</listitem>
|
|
<listitem><formalpara><title>%L</title><para>Precooked Preview,
|
|
Edit, and possibly Snippets links</para></formalpara>
|
|
</listitem>
|
|
<listitem><formalpara><title>%M</title><para>MIME
|
|
type</para></formalpara>
|
|
</listitem>
|
|
<listitem><formalpara><title>%N</title><para>result Number inside
|
|
the result page</para></formalpara>
|
|
</listitem>
|
|
<listitem><formalpara><title>%P</title><para>Parent folder
|
|
Url. In the case of an embedded document, this is the parent folder
|
|
for the top level container file.</para></formalpara>
|
|
</listitem>
|
|
<listitem><formalpara><title>%R</title><para>Relevance
|
|
percentage</para></formalpara>
|
|
</listitem>
|
|
<listitem><formalpara><title>%S</title><para>Size
|
|
information</para></formalpara>
|
|
</listitem>
|
|
<listitem><formalpara><title>%T</title><para>Title or Filename if
|
|
not set.</para></formalpara>
|
|
</listitem>
|
|
<listitem><formalpara><title>%t</title><para>Title or empty.
|
|
</para></formalpara>
|
|
</listitem>
|
|
<listitem><formalpara><title>%(filename)</title><para>File name.
|
|
</para></formalpara>
|
|
</listitem>
|
|
<listitem><formalpara><title>%U</title><para>Url</para></formalpara>
|
|
</listitem>
|
|
</itemizedlist>
|
|
|
|
The format of the Preview, Edit, and Snippets links is
<literal>&lt;a href="P%N"&gt;</literal>,
<literal>&lt;a href="E%N"&gt;</literal>
and
<literal>&lt;a href="A%N"&gt;</literal>,
where <literal>%N</literal> expands to the document
number inside the result page.</para>
|
|
|
|
<para>A link target defined as <literal>"F%N"</literal> will open
|
|
the document corresponding to the <literal>%P</literal> parent
|
|
folder expansion, usually creating a file manager window on the
|
|
folder where the container file resides. E.g.:
|
|
<programlisting><a href="F%N">%P</a></programlisting>
|
|
</para>
|
|
|
|
<para>A link target defined as
|
|
<literal>R%N|<replaceable>scriptname</replaceable></literal> will
|
|
run the corresponding script on the result file (if the document is
|
|
embedded, the script will be started on the top-level parent).
|
|
See the <link linkend="RCL.SEARCH.GUI.RUNSCRIPT">section about defining scripts</link>.</para>
|
|
|
|
<para>In addition to the predefined values above, all strings
|
|
like <literal>%(fieldname)</literal> will be replaced by the
|
|
value of the field named <literal>fieldname</literal> for this
|
|
document. Only stored fields can be accessed in this way: the
|
|
value of indexed but not stored fields is not known at this
|
|
point in the search process
|
|
(see <link linkend="RCL.PROGRAM.FIELDS">field configuration</link>). There are currently very few fields
|
|
stored by default, apart from the values above
|
|
(only <literal>author</literal>
|
|
and <literal>filename</literal>), so this feature will need
|
|
some custom local configuration to be useful. An example
|
|
candidate would be the <literal>recipient</literal> field
|
|
which is generated by the message input handlers.</para>
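<para>The substitution rules described above can be illustrated with a
minimal sketch. This is not the &RCL; implementation: the
<literal>expand_format</literal> function and the sample values are
hypothetical, and unknown markers are simply assumed to expand to an
empty string.</para>

```python
import re

def expand_format(fmt, doc):
    """Expand %X and %(fieldname) markers in a result paragraph format.

    `doc` is a hypothetical dict holding the values for one result,
    keyed by single letters ("T", "U", ...) or by field names.
    Unknown markers expand to an empty string in this sketch.
    """
    def repl(match):
        name = match.group(1) or match.group(2)
        return str(doc.get(name, ""))
    # %(fieldname) is tried first, then a single-letter marker such as %T
    return re.sub(r"%\(([^)]+)\)|%([A-Za-z])", repl, fmt)

# Made-up values for one result
doc = {"T": "Annual report", "U": "file:///home/me/report.pdf",
       "N": 3, "author": "J. Doe"}
line = expand_format('<a href="P%N">%T</a> by %(author) - %U', doc)
print(line)
```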
|
|
|
|
<para>The default value for the paragraph format string is:
|
|
<screen><![CDATA[
|
|
"<table class=\"respar\">\n"
|
|
"<tr>\n"
|
|
"<td><a href='%U'><img src='%I' width='64'></a></td>\n"
|
|
"<td>%L <i>%S</i> <b>%T</b><br>\n"
|
|
"<span style='white-space:nowrap'><i>%M</i> %D</span> <i>%U</i> %i<br>\n"
|
|
"%A %K</td>\n"
|
|
"</tr></table>\n"
|
|
]]></screen>
|
|
|
|
You may, for example, try the following for a more web-like
|
|
experience:
|
|
|
|
<screen><![CDATA[
|
|
<u><b><a href="P%N">%T</a></b></u><br>
|
|
%A<font color=#008000>%U - %S</font> - %L
|
|
]]></screen>
|
|
|
|
Note that the P%N link in the above paragraph makes the title a
|
|
preview link. Or the clean looking:
|
|
|
|
<screen><![CDATA[
|
|
<img src="%I" align="left">%L <font color="#900000">%R</font>
|
|
<b>%T</b><br>%S
|
|
<font color="#808080"><i>%U</i></font>
|
|
<table bgcolor="#e0e0e0">
|
|
<tr><td><div>%A</div></td></tr>
|
|
</table>%K
|
|
]]></screen>
|
|
</para>
|
|
|
|
<para>These samples, and some others, are
|
|
<ulink url="http://www.recoll.org/pages/custom.html">on the web
|
|
site, with pictures to show how they look.</ulink></para>
|
|
|
|
<para>It is also possible to
|
|
<link linkend="RCL.SEARCH.GUI.CUSTOM.ABSSEP">define the value of the snippet separator inside the abstract section</link>.</para>
|
|
</sect4>
|
|
</sect3>
|
|
</sect2>
|
|
|
|
</sect1> <!-- search GUI -->
|
|
|
|
<sect1 id="RCL.SEARCH.KIO">
|
|
<title>Searching with the KDE KIO slave</title>
|
|
|
|
<sect2 id="RCL.SEARCH.KIO.INTRO">
|
|
<title>What's this</title>
|
|
|
|
<para>The &RCL; KIO slave allows performing a &RCL; search
|
|
by entering an appropriate URL in a KDE open dialog, or with an
|
|
HTML-based interface displayed in
|
|
<command>Konqueror</command>.</para>
|
|
|
|
<para>The HTML-based interface is similar to the Qt-based
|
|
interface, but slightly less powerful for now. Its advantage is
|
|
that you can perform your search while staying fully within the
|
|
KDE framework: drag and drop from the result list works normally
|
|
and you have your normal choice of applications for opening
|
|
files.</para>
|
|
|
|
<para>The alternative interface uses a directory view of search
|
|
results. Due to limitations in the current KIO slave interface,
|
|
it is currently not obviously useful (to me).</para>
|
|
|
|
<para>The interface is described in more detail inside a help
|
|
file which you can access by entering
|
|
<filename>recoll:/</filename> inside the
|
|
<command>konqueror</command> URL line (this works only if the
|
|
recoll KIO slave has been previously installed).</para>
|
|
|
|
|
|
<para>The instructions for building this module are located in the
|
|
source tree. See:
|
|
<filename>kde/kio/recoll/00README.txt</filename>. Some Linux
|
|
distributions do package the kio-recoll module, so check before
diving into the build process: it may already be available, ready for
one-click installation.</para>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="RCL.SEARCH.KIO.SEARCHABLEDOCS">
|
|
<title>Searchable documents</title>
|
|
|
|
<para>As a sample application, the &RCL; KIO slave could allow
|
|
preparing a set of HTML documents (for example a manual) so that
|
|
they become their own search interface inside
|
|
<command>konqueror</command>.</para>
|
|
|
|
<para>This can be done by either explicitly inserting
|
|
<literal><![CDATA[<a href="recoll://...">]]></literal> links
|
|
around some document areas, or automatically by adding a
|
|
very small <application>javascript</application> program to the
|
|
documents, like the following example, which would initiate a search by
|
|
double-clicking any term:</para>
|
|
|
|
<programlisting><script language="JavaScript">
|
|
function recollsearch() {
|
|
var t = document.getSelection();
|
|
window.location.href = 'recoll://search/query?qtp=a&p=0&q=' +
|
|
encodeURIComponent(t);
|
|
}
|
|
</script>
|
|
....
|
|
<body ondblclick="recollsearch()">
|
|
|
|
</programlisting>
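<para>The same kind of URL could also be generated when preparing the
documents, for example from a script. The following is only a sketch:
the <literal>recoll_search_url</literal> helper is hypothetical, and
<literal>quote()</literal> with an empty safe set is used as a close
stand-in for <literal>encodeURIComponent()</literal>, not an exact
equivalent.</para>

```python
from urllib.parse import quote

def recoll_search_url(term):
    """Build a recoll:// search URL for `term`, percent-encoding it.

    The URL prefix is copied from the JavaScript example above.
    """
    return "recoll://search/query?qtp=a&p=0&q=" + quote(term, safe="")

print(recoll_search_url("eugenie grandet"))
# recoll://search/query?qtp=a&p=0&q=eugenie%20grandet
```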
|
|
</sect2>
|
|
</sect1>
|
|
|
|
|
|
<sect1 id="RCL.SEARCH.COMMANDLINE">
|
|
<title>Searching on the command line</title>
|
|
|
|
<para>There are several ways to obtain search results as a text
|
|
stream, without a graphical interface:</para>
|
|
<itemizedlist>
|
|
<listitem><para>By passing option <option>-t</option> to the
|
|
<command>recoll</command> program, or by calling it as
|
|
<command>recollq</command> (through a link).</para>
|
|
</listitem>
|
|
<listitem><para>By using the <command>recollq</command> program.</para>
|
|
</listitem>
|
|
<listitem><para>By writing a custom
|
|
<application>Python</application> program, using the
|
|
<link linkend="RCL.PROGRAM.PYTHONAPI">Recoll Python API</link>.</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
|
|
<para>The first two methods work in the same way and accept/need the same
|
|
arguments (except for the additional <option>-t</option> to
|
|
<command>recoll</command>). The query to be executed is specified
|
|
as command line arguments.</para>
|
|
|
|
<para><command>recollq</command> is not always built by default. You
|
|
can use the <filename>Makefile</filename> in the
|
|
<filename>query</filename> directory to build it. This is a very
|
|
simple program, and if you can program a little C++, you may find it
useful to tailor its output format to your needs. Apart from being
|
|
easily customised, <command>recollq</command> is only really useful
|
|
on systems where the Qt libraries are not available, else it is
|
|
redundant with <literal>recoll -t</literal>.</para>
|
|
|
|
<para><command>recollq</command> has a
|
|
<ulink url="https://www.lesbonscomptes.com/recoll/manpages/recollq.1.html">man page</ulink>.
|
|
|
|
The Usage string follows:</para>
|
|
<programlisting><![CDATA[
|
|
recollq: usage:
|
|
-P: Show the date span for all the documents present in the index
|
|
[-o|-a|-f] [-q] <query string>
|
|
Runs a recoll query and displays result lines.
|
|
Default: will interpret the argument(s) as a xesam query string
|
|
Query elements:
|
|
* Implicit AND, exclusion, field spec: t1 -t2 title:t3
|
|
* OR has priority: t1 OR t2 t3 OR t4 means (t1 OR t2) AND (t3 OR t4)
|
|
* Phrase: "t1 t2" (needs additional quoting on cmd line)
|
|
-o Emulate the GUI simple search in ANY TERM mode
|
|
-a Emulate the GUI simple search in ALL TERMS mode
|
|
-f Emulate the GUI simple search in filename mode
|
|
-q is just ignored (compatibility with the recoll GUI command line)
|
|
Common options:
|
|
-c <configdir> : specify config directory, overriding $RECOLL_CONFDIR
|
|
-d also dump file contents
|
|
-n [first-]<cnt> define the result slice. The default value for [first]
|
|
is 0. Without the option, the default max count is 2000.
|
|
Use n=0 for no limit
|
|
-b : basic. Just output urls, no mime types or titles
|
|
-Q : no result lines, just the processed query and result count
|
|
-m : dump the whole document meta[] array for each result
|
|
-A : output the document abstracts
|
|
-S fld : sort by field <fld>
|
|
-D : sort descending
|
|
-s stemlang : set stemming language to use (must exist in index...)
|
|
Use -s "" to turn off stem expansion
|
|
-T <synonyms file>: use the parameter (Thesaurus) for word expansion
|
|
-i <dbdir> : additional index, several can be given
|
|
-e use url encoding (%xx) for urls
|
|
-F <field name list> : output exactly these fields for each result.
|
|
The field values are encoded in base64, output in one line and
|
|
separated by one space character. This is the recommended format
|
|
for use by other programs. Use a normal query with option -m to
|
|
see the field names. Use -F '' to output all fields, but you probably
|
|
also want option -N in this case
|
|
-N : with -F, print the (plain text) field names before the field values
|
|
]]></programlisting>
|
|
|
|
<para>Sample execution:</para>
|
|
<programlisting>
|
|
recollq 'ilur -nautique mime:text/html'
|
|
Recoll query: ((((ilur:(wqf=11) OR ilurs) AND_NOT (nautique:(wqf=11) OR nautiques OR nautiqu OR nautiquement)) FILTER Ttext/html))
|
|
4 results
|
|
text/html [file:///Users/uncrypted-dockes/projets/bateaux/ilur/comptes.html] [comptes.html] 18593 bytes
|
|
text/html [file:///Users/uncrypted-dockes/projets/nautique/webnautique/articles/ilur1/index.html] [Constructio...
|
|
text/html [file:///Users/uncrypted-dockes/projets/pagepers/index.html] [psxtcl/writemime/recoll]...
|
|
text/html [file:///Users/uncrypted-dockes/projets/bateaux/ilur/factEtCie/recu-chasse-maree....
|
|
</programlisting>
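<para>When the output is meant for another program, the usage text
recommends option <option>-F</option>. A minimal sketch of decoding one
such result line follows. It assumes that the values appear in the order
in which the fields were requested, and that any leading lines (query
description, result count) have already been skipped by the caller; the
field names and the sample line are made up for the illustration.</para>

```python
import base64

def decode_fields_line(line, names):
    """Decode one result line produced with the -F option.

    Per the usage text, the values are base64-encoded and separated by
    single spaces. Pairing them with `names` in request order is an
    assumption of this sketch.
    """
    values = [base64.b64decode(v).decode("utf-8", "replace")
              for v in line.rstrip("\n").split(" ")]
    return dict(zip(names, values))

# A made-up line standing in for real recollq output
sample = " ".join(base64.b64encode(s.encode()).decode()
                  for s in ["file:///tmp/a.html", "text/html"])
print(decode_fields_line(sample, ["url", "mtype"]))
```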
|
|
</sect1>
|
|
|
|
|
|
<sect1 id="RCL.SEARCH.LANG">
|
|
<title>The query language</title>
|
|
|
|
<para>The query language processor is activated in the GUI
|
|
simple search entry when the search mode selector is set to
|
|
<guilabel>Query Language</guilabel>. It can also be used with the KIO
|
|
slave or the command line search. It broadly has the same
|
|
capabilities as the complex search interface in the
|
|
GUI.</para>
|
|
|
|
<para>The language was based on the now defunct
|
|
<ulink url="http://www.xesam.org/main/XesamUserSearchLanguage95">
|
|
Xesam</ulink> user search language specification.</para>
|
|
|
|
<para>If the results of a query language search puzzle you and you
|
|
doubt what has been actually searched for, you can use the GUI
|
|
<literal>Show Query</literal> link at the top of the result list to
|
|
check the exact query which was finally executed by Xapian.</para>
|
|
|
|
<para>Here follows a sample request that we are going to
|
|
explain:</para>
|
|
|
|
<programlisting>
|
|
author:"john doe" Beatles OR Lennon Live OR Unplugged -potatoes
|
|
</programlisting>
|
|
|
|
<para>This would search for all documents with
|
|
<replaceable>John Doe</replaceable>
|
|
appearing as a phrase in the author field (exactly what this is
depends on the document type, e.g. the
<literal>From:</literal> header for an email message),
|
|
and containing either <replaceable>beatles</replaceable> or
|
|
<replaceable>lennon</replaceable> and either
|
|
<replaceable>live</replaceable> or
|
|
<replaceable>unplugged</replaceable> but not
|
|
<replaceable>potatoes</replaceable> (in any part of the document).</para>
|
|
|
|
<para>An element is composed of an optional field specification,
|
|
and a value, separated by a colon (the field separator is the last
|
|
colon in the element). Examples:
|
|
<replaceable>Eugenie</replaceable>,
|
|
<replaceable>author:balzac</replaceable>,
|
|
<replaceable>dc:title:grandet</replaceable>,
|
|
<replaceable>dc:title:"eugenie grandet"</replaceable>
|
|
</para>
|
|
|
|
<para>The colon, if present, means "contains". Xesam defines other
|
|
relations, which are mostly unsupported for now (except in special
|
|
cases, described further down).</para>
|
|
|
|
<para>All elements in the search entry are normally combined
|
|
with an implicit AND. It is possible to specify that elements be
|
|
OR'ed instead, as in <replaceable>Beatles</replaceable>
|
|
<literal>OR</literal> <replaceable>Lennon</replaceable>. The
|
|
<literal>OR</literal> must be entered literally (capitals), and
|
|
it has priority over the AND associations:
|
|
<replaceable>word1</replaceable>
|
|
<replaceable>word2</replaceable> <literal>OR</literal>
|
|
<replaceable>word3</replaceable>
|
|
means
|
|
<replaceable>word1</replaceable> AND
|
|
(<replaceable>word2</replaceable> <literal>OR</literal>
|
|
<replaceable>word3</replaceable>)
|
|
not
|
|
(<replaceable>word1</replaceable> AND
|
|
<replaceable>word2</replaceable>) <literal>OR</literal>
|
|
<replaceable>word3</replaceable>. </para>
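<para>The priority rule can be pictured with a small sketch which groups
a list of plain terms. It is only an illustration of the precedence, not
the &RCL; parser: the <literal>group_terms</literal> function is
hypothetical and ignores fields, phrases, exclusions and
parentheses.</para>

```python
def group_terms(tokens):
    """Group query tokens so that OR binds tighter than the implicit AND.

    Returns a list of OR-groups; the groups are ANDed together.
    """
    groups = []
    join_next = False
    for tok in tokens:
        if tok == "OR":
            join_next = True       # next term joins the previous group
        elif join_next and groups:
            groups[-1].append(tok)
            join_next = False
        else:
            groups.append([tok])
            join_next = False
    return groups

print(group_terms("word1 word2 OR word3".split()))
# [['word1'], ['word2', 'word3']]
```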
|
|
|
|
<para>&RCL; versions 1.21 and later allow using parentheses to
|
|
group elements, which will sometimes make things clearer, and may
|
|
allow expressing combinations which would have been difficult
|
|
otherwise.</para>
|
|
|
|
<para>An element preceded by a <literal>-</literal> specifies a
|
|
term that should <emphasis>not</emphasis> appear.</para>
|
|
|
|
<para>As usual, words inside quotes define a phrase
|
|
(the order of words is significant), so that
|
|
<replaceable>title:"prejudice pride"</replaceable> is not the same as
|
|
<replaceable>title:prejudice title:pride</replaceable>, and is
|
|
unlikely to find a result.</para>
|
|
|
|
<para>Words inside phrases and capitalized words are not
|
|
stem-expanded. Wildcards may be used anywhere inside a term.
|
|
Specifying a wildcard on the left of a term can produce a very
|
|
slow search (or even an incorrect one if the expansion is
|
|
truncated because of excessive size). Also see
|
|
<link linkend="RCL.SEARCH.WILDCARDS">More about wildcards</link>.
|
|
</para>
|
|
|
|
<para>To save you some typing, recent &RCL; versions (1.20 and later)
|
|
interpret a comma-separated list of terms for a field as an AND list
|
|
inside the field. Use slash characters ('/') for an OR list. No white
|
|
space is allowed. So
|
|
<programlisting>author:john,lennon</programlisting> will search for
|
|
documents with <literal>john</literal> and <literal>lennon</literal>
|
|
inside the <literal>author</literal> field (in any order), and
|
|
<programlisting>author:john/ringo</programlisting> would search for
|
|
<literal>john</literal> or <literal>ringo</literal>. This behaviour
only applies to field queries: without a field name, comma- or
slash-separated input will produce a phrase search. You can use a
|
|
<literal>text</literal> field name to search the main text this
|
|
way.</para>
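<para>As an illustration of this shortcut, the sketch below splits such
an element into a field name, an operator and a list of terms. The
<literal>expand_field_list</literal> function is hypothetical, and
mixing commas and slashes inside one element is deliberately left
out.</para>

```python
def expand_field_list(element):
    """Split 'field:a,b' (AND list) or 'field:a/b' (OR list).

    Returns (field, operator, terms). The field separator is taken to
    be the last colon in the element.
    """
    field, _, value = element.rpartition(":")
    if not field:
        raise ValueError("this shortcut only applies to field queries")
    if "," in value:
        return field, "AND", value.split(",")
    if "/" in value:
        return field, "OR", value.split("/")
    return field, "AND", [value]

print(expand_field_list("author:john,lennon"))
# ('author', 'AND', ['john', 'lennon'])
```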
|
|
|
|
<para>Modifiers can be set on a double-quoted value, for example to specify
a proximity search (unordered). See
<link linkend="RCL.SEARCH.LANG.MODIFIERS">the modifier section</link>.
No space may separate the final double quote from the modifiers,
e.g. <replaceable>"two one"po10</replaceable>.</para>
|
|
|
|
<para>&RCL; currently manages the following default fields:</para>
|
|
|
|
<itemizedlist>
|
|
|
|
<listitem><para><literal>title</literal>,
|
|
<literal>subject</literal> or <literal>caption</literal> are
|
|
synonyms which specify data to be searched for in the
|
|
document title or subject.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><literal>author</literal> or
|
|
<literal>from</literal> for searching the document's
originators.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><literal>recipient</literal> or
|
|
<literal>to</literal> for searching the document's
recipients.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><literal>keyword</literal> for searching the
|
|
document-specified keywords (few documents actually have
|
|
any).</para>
|
|
</listitem>
|
|
|
|
<listitem><para><literal>filename</literal> for the document's
|
|
file name. This is not necessarily set for all documents:
|
|
internal documents contained inside a compound one (for example
|
|
an EPUB section) do not inherit the container file name any more:
|
|
this was replaced by an explicit field (see next). Sub-documents
|
|
can still have a specific <literal>filename</literal>, if it is
|
|
implied by the document format, for example the attachment file
|
|
name for an email attachment.</para></listitem>
|
|
|
|
<listitem><para><literal>containerfilename</literal>. This is
|
|
set for all documents, both top-level and contained
|
|
sub-documents, and is always the name of the filesystem directory
|
|
entry which contains the data. The terms from this field can
|
|
only be matched by an explicit field specification (as opposed
|
|
to terms from <literal>filename</literal> which are also indexed
|
|
as general document content). This avoids getting matches for
|
|
all the sub-documents when searching for the container file
|
|
name.</para></listitem>
|
|
|
|
<listitem><para><literal>ext</literal> specifies the file
|
|
name extension
|
|
(Ex: <literal>ext:html</literal>).</para></listitem>
|
|
|
|
<listitem><para><literal>rclmd5</literal> the MD5 checksum for the
|
|
document. This is used for displaying the duplicates of a
|
|
search result (when querying with the option to collapse
|
|
duplicate results). Incidentally, this could be used to find
|
|
the duplicates of any given file by computing its MD5 checksum
|
|
and executing a query with just the <literal>rclmd5</literal>
|
|
value.</para>
|
|
</listitem>
|
|
|
|
</itemizedlist>
|
|
|
|
<para>&RCL; 1.20 and later have a way to specify aliases for the
|
|
field names, which will save typing, for example by aliasing
|
|
<literal>filename</literal> to <replaceable>fn</replaceable> or
|
|
<literal>containerfilename</literal> to
|
|
<replaceable>cfn</replaceable>. See the
|
|
<link linkend="RCL.INSTALL.CONFIG.FIELDS">section about the <filename>fields</filename> file</link>.
|
|
</para>
|
|
|
|
<para>The document input handlers used while indexing have the
|
|
possibility to create other fields with arbitrary names, and
|
|
aliases may be defined in the configuration, so that the exact
|
|
field search possibilities may be different for you if someone
|
|
took care of the customisation.</para>
|
|
|
|
<para>The field syntax also supports a few field-like, but
|
|
special, criteria:</para>
|
|
|
|
<itemizedlist>
|
|
|
|
<listitem><para><literal>dir</literal> for filtering the
|
|
results on file location
|
|
(Ex: <literal>dir:/home/me/somedir</literal>).
|
|
<literal>-dir</literal>
|
|
also works to find results not in the specified directory
|
|
(release >= 1.15.8). Tilde expansion will be performed as
|
|
usual (except for a bug in versions 1.19 to
|
|
1.19.11p1). Wildcards will be expanded, but
|
|
please
|
|
<link linkend="RCL.SEARCH.WILDCARDS.PATH"> have a look</link>
|
|
at an important limitation of wildcards in path filters.</para>
|
|
|
|
<para>Relative paths also make sense, for example,
|
|
<literal>dir:share/doc</literal> would match either
|
|
<filename>/usr/share/doc</filename> or
|
|
<filename>/usr/local/share/doc</filename>.</para>
|
|
|
|
<para>Several <literal>dir</literal> clauses can be specified,
|
|
both positive and negative. For example the following makes sense:
|
|
<programlisting>
|
|
dir:recoll dir:src -dir:utils -dir:common
|
|
</programlisting> This would select results which have both
|
|
<filename>recoll</filename> and <filename>src</filename> in the
|
|
path (in any order), and which have neither
<filename>utils</filename> nor
<filename>common</filename> in it.</para>
|
|
|
|
<para>You can also use <literal>OR</literal> conjunctions
|
|
with <literal>dir:</literal> clauses.</para>
|
|
|
|
<para>A special aspect of <literal>dir</literal> clauses is
|
|
that the values in the index are not transcoded to UTF-8, and
|
|
never lower-cased or unaccented, but stored as binary. This means
|
|
that you need to enter the values in the exact lower or upper
|
|
case, and that searches for names with diacritics may sometimes
|
|
be impossible because of character set conversion
|
|
issues. Non-ASCII UNIX file paths are an unending source of
|
|
trouble and are best avoided.</para>
|
|
|
|
<para>You need to use double-quotes around the path value if it
|
|
contains space characters.</para>
|
|
|
|
</listitem>
|
|
|
|
<listitem><para><literal>size</literal> for filtering the
|
|
results on file size. Example:
|
|
<literal>size<10000</literal>. You can use
|
|
<literal><</literal>, <literal>></literal> or
|
|
<literal>=</literal> as operators. You can specify a range like the
|
|
following: <literal>size>100 size<1000</literal>. The usual
|
|
<literal>k/K, m/M, g/G, t/T</literal> can be used as (decimal)
|
|
multipliers. Ex: <literal>size>1k</literal> to search for files
|
|
bigger than 1000 bytes.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><literal>date</literal> for searching or filtering
|
|
on dates. The syntax for the argument is based on the ISO8601
|
|
standard for dates and time intervals. Only dates are supported, no
|
|
times. The general syntax is 2 elements separated by a
|
|
<literal>/</literal> character. Each element can be a date or a
|
|
period of time. Periods are specified as
|
|
<literal>P</literal><replaceable>n</replaceable><literal>Y</literal><replaceable>n</replaceable><literal>M</literal><replaceable>n</replaceable><literal>D</literal>.
|
|
The <replaceable>n</replaceable> numbers are the respective numbers
|
|
of years, months or days, any of which may be missing. Dates are
|
|
specified as
|
|
<replaceable>YYYY</replaceable>-<replaceable>MM</replaceable>-<replaceable>DD</replaceable>.
|
|
The days and months parts may be missing. If the
|
|
<literal>/</literal> is present but an element is missing, the
|
|
missing element is interpreted as the lowest or highest date in the
|
|
index. Examples:</para>
|
|
|
|
<itemizedlist>
|
|
<listitem><para><literal>2001-03-01/2002-05-01</literal> the
|
|
basic syntax for an interval of dates.</para>
|
|
</listitem>
|
|
<listitem><para><literal>2001-03-01/P1Y2M</literal> the
|
|
same specified with a period.</para>
|
|
</listitem>
|
|
<listitem><para><literal>2001/</literal> from the beginning of
|
|
2001 to the latest date in the index.</para>
|
|
</listitem>
|
|
<listitem><para><literal>2001</literal> the whole year of
|
|
2001</para></listitem>
|
|
<listitem><para><literal>P2D/</literal> means 2 days ago up to
|
|
now if there are no documents with dates in the future.</para>
|
|
</listitem>
|
|
<listitem><para><literal>/2003</literal> all documents from
|
|
2003 or older.</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
<para>Periods can also be specified with lower-case letters (e.g.
<literal>p2y</literal>).</para>
|
|
</listitem>
|
|
|
|
<listitem><para><literal>mime</literal> or
|
|
<literal>format</literal> for specifying the
|
|
MIME type. These clauses are processed outside the normal
|
|
Boolean logic of the search. Multiple values will be OR'ed
|
|
(instead of the normal AND). You can specify types to be
|
|
excluded, with the usual <literal>-</literal>, and use
|
|
wildcards. Example: <replaceable>mime:text/*
-mime:text/plain</replaceable>.
Specifying an explicit Boolean
|
|
operator before a <literal>mime</literal> specification is not
|
|
supported and will produce strange results. </para>
|
|
</listitem>
|
|
|
|
<listitem><para><literal>type</literal> or
|
|
<literal>rclcat</literal> for specifying the category (as in
|
|
text/media/presentation/etc.). The classification of MIME
|
|
types in categories is defined in the &RCL; configuration
|
|
(<filename>mimeconf</filename>), and can be modified or
|
|
extended. The default category names are those which permit
|
|
filtering results in the main GUI screen. Categories are OR'ed
|
|
like MIME types above, and can be negated with
|
|
<literal>-</literal>.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><literal>issub</literal>
|
|
for specifying that only standalone (<literal>issub:0</literal>) or
|
|
only embedded (<literal>issub:1</literal>) documents should be
|
|
returned as results.</para>
|
|
</listitem>
|
|
|
|
</itemizedlist>
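<para>The <literal>size</literal> multipliers and the period notation
used by <literal>date</literal> can be illustrated with a short sketch.
Both functions are hypothetical helpers written for this example; they
only cover the syntax described above and do not reproduce the &RCL;
parser.</para>

```python
import re

# Decimal multipliers, as stated in the text (1k is 1000 bytes)
MULTIPLIERS = {"k": 10**3, "m": 10**6, "g": 10**9, "t": 10**12}

def parse_size_clause(clause):
    """Parse 'size>1k' into (operator, size_in_bytes)."""
    m = re.fullmatch(r"size([<>=])(\d+)([kKmMgGtT]?)", clause)
    if m is None:
        raise ValueError("not a size clause: " + clause)
    op, number, suffix = m.groups()
    return op, int(number) * MULTIPLIERS.get(suffix.lower(), 1)

def parse_period(text):
    """Parse a period such as 'P1Y2M' or 'p2d' into (years, months, days)."""
    m = re.fullmatch(r"[pP](?:(\d+)[yY])?(?:(\d+)[mM])?(?:(\d+)[dD])?", text)
    if m is None or not any(m.groups()):
        raise ValueError("not a period: " + text)
    # Missing parts count as zero
    return tuple(int(g) if g else 0 for g in m.groups())

print(parse_size_clause("size>1k"))   # ('>', 1000)
print(parse_period("P1Y2M"))          # (1, 2, 0)
```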
|
|
|
|
<note><para>
|
|
<literal>mime</literal>, <literal>rclcat</literal>, <literal>size</literal>,
|
|
<literal>issub</literal> and <literal>date</literal> criteria always
|
|
affect the whole query (they are applied as a final filter), even if set
with other terms inside parentheses.</para>
|
|
</note>
|
|
|
|
<note><para>
|
|
<literal>mime</literal> (or the equivalent
|
|
<literal>rclcat</literal>) is the <emphasis>only</emphasis>
|
|
field with an <literal>OR</literal> default. You do need to use
|
|
<literal>OR</literal> with <literal>ext</literal> terms for
|
|
example.</para> </note>
|
|
|
|
<sect2 id="RCL.SEARCH.LANG.RANGES">
|
|
<title>Range clauses</title>
|
|
|
|
<para>&RCL; 1.24 and later support range clauses on fields which
|
|
have been configured to support them. No default field uses them
|
|
currently, so this paragraph is only interesting if you modified
|
|
the fields configuration and possibly use a custom input
|
|
handler.</para>
|
|
|
|
<para>A range clause looks like one of the following:</para>
|
|
<programlisting><replaceable>myfield</replaceable>:<replaceable>small</replaceable>..<replaceable>big</replaceable>
|
|
<replaceable>myfield</replaceable>:<replaceable>small</replaceable>..
|
|
<replaceable>myfield</replaceable>:..<replaceable>big</replaceable>
|
|
</programlisting>
|
|
|
|
<para>The nature of the clause is indicated by the two dots
|
|
<literal>..</literal>, and the effect is to filter the results for
|
|
which the <replaceable>myfield</replaceable> value is in the
|
|
possibly open-ended interval.</para>
|
|
|
|
<para>See the section about the
|
|
<link linkend="RCL.INSTALL.CONFIG.FIELDS"><filename>fields</filename> configuration file</link>
|
|
for the details of configuring a field for range searches (list
|
|
them in the [values] section).</para>
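<para>The three forms of a range clause can be told apart as in the
following sketch. The <literal>parse_range_clause</literal> function is
hypothetical; it returns <literal>None</literal> for a missing bound and
leaves the interpretation of the values (numeric or not) to the field
configuration.</para>

```python
def parse_range_clause(element):
    """Split 'field:small..big' into (field, low, high).

    A missing bound is returned as None (open-ended interval).
    """
    field, sep, value = element.rpartition(":")
    if not sep or ".." not in value:
        raise ValueError("not a range clause: " + element)
    low, _, high = value.partition("..")
    return field, low or None, high or None

print(parse_range_clause("myfield:10..20"))   # ('myfield', '10', '20')
print(parse_range_clause("myfield:10.."))     # ('myfield', '10', None)
print(parse_range_clause("myfield:..20"))     # ('myfield', None, '20')
```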
|
|
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.SEARCH.LANG.MODIFIERS">
|
|
<title>Modifiers</title>
|
|
|
|
<para>Some characters are recognized as search modifiers when found
|
|
immediately after the closing double quote of a phrase, as in
|
|
<literal>"some term"modifierchars</literal>. The actual "phrase"
|
|
can be a single term of course. Supported modifiers:
|
|
|
|
<itemizedlist>
|
|
<listitem><para><literal>l</literal> can be used to turn off
|
|
stemming (mostly makes sense with <literal>p</literal> because
|
|
stemming is off by default for phrases).</para>
|
|
</listitem>
|
|
|
|
<listitem><para><literal>s</literal> can be used to turn off
|
|
synonym expansion, if a synonyms file is in place (only for
|
|
&RCL; 1.22 and later).</para>
|
|
</listitem>
|
|
|
|
<listitem><para><literal>o</literal> can be used to specify a
|
|
"slack" for phrase and proximity searches: the number of
|
|
additional terms that may be found between the specified
|
|
ones. If <literal>o</literal> is followed by an integer number,
|
|
this is the slack, else the default is 10.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><literal>p</literal> can be used to turn the
|
|
default phrase search into a proximity one
|
|
(unordered). Example: <literal>"order any in"p</literal></para>
|
|
</listitem>
|
|
|
|
<listitem><para><literal>C</literal> will turn on case
|
|
sensitivity (if the index supports it).</para></listitem>
|
|
|
|
<listitem><para><literal>D</literal> will turn on diacritics
|
|
sensitivity (if the index supports it).</para></listitem>
|
|
|
|
<listitem><para>A weight can be specified for a query element
|
|
by specifying a decimal value at the start of the
|
|
modifiers. Example: <literal>"Important"2.5</literal>.</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
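<para>To make the modifier syntax concrete, here is a sketch which
decodes the characters following the closing double quote. The
<literal>parse_modifiers</literal> function and the key names of the
returned dictionary are invented for the example; only the modifier
letters and the default slack of 10 come from the list above.</para>

```python
import re

FLAG_NAMES = {"l": "no_stemming", "s": "no_synonyms", "p": "proximity",
              "C": "case_sensitive", "D": "diacritics_sensitive"}

def parse_modifiers(mods):
    """Interpret the characters following a closing double quote."""
    # Optional leading decimal weight, then any number of modifier letters
    m = re.fullmatch(r"(\d+(?:\.\d+)?)?((?:[lspCD]|o\d*)*)", mods)
    if m is None:
        raise ValueError("unsupported modifiers: " + mods)
    weight, flags = m.groups()
    result = {}
    if weight:
        result["weight"] = float(weight)
    for flag, num in re.findall(r"([lspCDo])(\d*)", flags):
        if flag == "o":
            result["slack"] = int(num) if num else 10
        else:
            result[FLAG_NAMES[flag]] = True
    return result

print(parse_modifiers("po10"))   # {'proximity': True, 'slack': 10}
print(parse_modifiers("2.5"))    # {'weight': 2.5}
```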
|
|
|
|
|
|
</sect2> <!-- search modifiers -->
|
|
|
|
</sect1> <!-- rcl.search.lang -->
|
|
|
|
<sect1 id="RCL.SEARCH.ANCHORWILD">
|
|
<title>Anchored searches and wildcards</title>
|
|
|
|
<para>Some special characters are interpreted by &RCL; in search
|
|
strings to expand or specialize the search. Wildcards expand a root
|
|
term in controlled ways. Anchor characters can restrict a search to
|
|
succeed only if the match is found at or near the beginning of the
|
|
document or one of its fields.</para>
|
|
|
|
<sect2 id="RCL.SEARCH.WILDCARDS">
|
|
<title>More about wildcards</title>
|
|
|
|
<para>All words entered in &RCL; search fields will be processed
|
|
for wildcard expansion before the request is finally
|
|
executed.</para>
|
|
|
|
<para>The wildcard characters are:</para>
|
|
|
|
<itemizedlist>
|
|
<listitem><para><literal>*</literal> which matches 0 or more
|
|
characters.</para>
|
|
</listitem>
|
|
<listitem><para><literal>?</literal> which matches
|
|
a single character.</para>
|
|
</listitem>
|
|
<listitem><para><literal>[]</literal> which allows
defining sets of characters to be matched (ex:
<literal>[</literal><userinput>abc</userinput><literal>]</literal>
matches a single character which may be 'a' or 'b' or 'c',
<literal>[</literal><userinput>0-9</userinput><literal>]</literal>
matches any single digit).</para>
|
|
</listitem>
|
|
</itemizedlist>
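<para>These are the usual shell-style wildcards. Their effect can be
tried out on a small list of words with the standard
<application>Python</application> <literal>fnmatch</literal> module, as
in the sketch below. The term list is made up, and the module is only
used here as a stand-in with the same pattern syntax, not as a
description of how &RCL; expands terms against the index.</para>

```python
from fnmatch import fnmatchcase

# A tiny stand-in for the index term list
terms = ["flood", "floor", "floors", "flow", "f1", "f2", "fa"]

def expand(pattern):
    """Return the terms matching a wildcard pattern."""
    return [t for t in terms if fnmatchcase(t, pattern)]

print(expand("flo*"))     # ['flood', 'floor', 'floors', 'flow']
print(expand("floo?"))    # ['flood', 'floor']
print(expand("f[0-9]"))   # ['f1', 'f2']
```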
|
|
|
|
<para>You should be aware of a few things when using
|
|
wildcards.</para>
|
|
|
|
<itemizedlist>
|
|
<listitem><para>Using a wildcard character at the beginning of
|
|
a word can make for a slow search because &RCL; will have to
|
|
scan the whole index term list to find the
|
|
matches. However, this is much less a problem for field
|
|
searches, and queries
|
|
like <replaceable>author:*@domain.com</replaceable> can
|
|
sometimes be very useful.</para></listitem>
|
|
|
|
<listitem><para>For &RCL; version 1.18 only, when working with a
raw index (preserving character case and diacritics), the
literal part of a wildcard expression will be matched
exactly for case and diacritics. This is no longer true
for versions 1.19 and later.</para></listitem>
|
|
|
|
<listitem><para>Using a <literal>*</literal> at the end of a
|
|
word can produce more matches than you would think, and
|
|
strange search results. You can use the
|
|
<link linkend="RCL.SEARCH.GUI.TERMEXPLORER">term explorer</link>
|
|
tool to check what completions exist for
|
|
a given term. You can also see exactly what search was
|
|
performed by clicking on the link at the top of the result
|
|
list. In general, for natural language terms, stem
|
|
expansion will produce better results than an
|
|
ending <literal>*</literal> (stem expansion is turned off
|
|
when any wildcard character appears in the
|
|
term).</para></listitem>
|
|
</itemizedlist>
|
|
|
|
<sect3 id="RCL.SEARCH.WILDCARDS.PATH">
|
|
<title>Wildcards and path filtering</title>
|
|
|
|
<para>Due to the way that &RCL; processes wildcards
|
|
inside <literal>dir</literal> path filtering clauses, they
|
|
will have a multiplicative effect on the query size. A clause
|
|
containing wildcards in several path elements, like, for
|
|
example,
|
|
<literal>dir:</literal><replaceable>/home/me/*/*/docdir</replaceable>,
|
|
will almost certainly fail if your indexed tree is of any realistic
|
|
size.</para>
|
|
|
|
<para>Depending on the case, you may be able to work around
|
|
the issue by specifying the path elements more narrowly, with
|
|
a constant prefix, or by using 2
|
|
separate <literal>dir:</literal> clauses instead of multiple
|
|
wildcards, as
|
|
in <literal>dir:</literal><replaceable>/home/me</replaceable> <literal>dir:</literal><replaceable>docdir</replaceable>. The
|
|
latter query is not equivalent to the initial one because it
|
|
does not specify a number of directory levels, but that's
|
|
the best we can do (and it may be actually more useful in
|
|
some cases).</para>
|
|
|
|
</sect3>
|
|
|
|
</sect2> <!-- wildchars -->
|
|
|
|
<sect2 id="RCL.SEARCH.ANCHOR">
|
|
<title>Anchored searches</title>
|
|
|
|
<para>Two characters are used to specify that a search hit should
|
|
occur at the beginning or at the end of the
|
|
text. <literal>^</literal> at the beginning of a term or phrase
|
|
constrains the search to happen at the start, <literal>$</literal>
|
|
at the end forces it to happen at the end.</para>
|
|
|
|
<para>As this function is implemented as a phrase search it is
|
|
possible to specify a maximum distance at which the hit should
|
|
occur, either through the controls of the advanced search panel, or
|
|
using the query language, for example, as in:
|
|
<programlisting>"^someterm"o10</programlisting> which would force
|
|
<literal>someterm</literal> to be found within 10 terms of the
|
|
start of the text. This can be combined with a field search as in
|
|
<literal>somefield:"^someterm"o10</literal> or
|
|
<literal>somefield:someterm$</literal>.</para>
|
|
|
|
<para>This feature can also be used with an actual phrase search,
|
|
but in this case, the distance applies to the whole phrase and
|
|
anchor, so that, for example,
|
|
<literal>bla bla my unexpected term</literal> at the
|
|
beginning of the text would be a match for
|
|
<literal>"^my term"o5</literal>.</para>
|
|
|
|
<para>Anchored searches can be very useful for searches inside
|
|
somewhat structured documents like scientific articles, in case
|
|
explicit metadata has not been supplied (a most frequent case), for
|
|
example for looking for matches inside the abstract or the list of
|
|
authors (which occur at the top of the document).</para>
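<para>As a rough illustration of how the distance interacts with the
anchor, here is a toy Python model. It is not &RCL; or &XAP; code: the
function name <literal>anchored_match</literal>, the whitespace
tokenization and the exact window rule (phrase length plus distance,
counted from the first term of the text) are assumptions made for the
sketch.</para>
<programlisting><![CDATA[
```python
def anchored_match(text, phrase, slack=0):
    """Toy model of a start-anchored phrase search such as "^my term"o5.

    Assumption: the phrase terms must appear, in order, within the first
    len(phrase) + slack terms of the text.
    """
    terms = text.casefold().split()
    wanted = phrase.casefold().split()
    window = terms[:len(wanted) + slack]
    pos = 0
    for word in wanted:
        try:
            # Look for the next phrase term after the previous match.
            pos = window.index(word, pos) + 1
        except ValueError:
            return False
    return True
```
]]></programlisting>
<para>Under these assumptions,
<literal>anchored_match("bla bla my unexpected term", "my term", 5)</literal>
succeeds, while the same call with a distance of 0 fails because the
window then only covers the first two terms.</para>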
|
|
|
|
|
|
</sect2>
|
|
|
|
</sect1> <!-- wildchars and anchors -->
|
|
|
|
|
|
|
|
<sect1 id="RCL.SEARCH.SYNONYMS">
|
|
<title>Using Synonyms (1.22)</title>
|
|
|
|
<formalpara><title>Term synonyms and text search:</title> <para>in
|
|
general, there are two main ways to use term synonyms for
|
|
searching text:
|
|
<itemizedlist>
|
|
<listitem><para>At index creation time, they can be used to alter the
|
|
indexed terms, either increasing or decreasing their number, by
|
|
expanding the original terms to all synonyms, or by
|
|
reducing all synonym terms to a canonical one.</para></listitem>
|
|
<listitem><para>At query time, they can be used to match texts
|
|
containing terms which are synonyms of the ones specified by the user,
|
|
either by expanding the query for all synonyms, or by reducing the user
|
|
entry to canonical terms (the latter only works if the corresponding
|
|
processing has been performed while creating the
|
|
index).</para></listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
</formalpara>
|
|
|
|
<para>&RCL; only uses synonyms at query time. A user query term which
|
|
is part of a synonym group will be optionally expanded into an
|
|
<literal>OR</literal> query for all terms in the group.</para>
|
|
|
|
<para>Synonym groups are defined inside ordinary text files. Each line
|
|
in the file defines a group.</para>
|
|
|
|
<para>Example:
|
|
<programlisting>
|
|
hi hello "good morning"
|
|
|
|
# not sure about "au revoir" though. Is this english ?
|
|
bye goodbye "see you" \
|
|
"au revoir"
|
|
</programlisting>
|
|
</para>
|
|
|
|
<para>As usual, lines beginning with a <literal>#</literal> are comments,
|
|
empty lines are ignored, and lines can be continued by ending them with
|
|
a backslash.
|
|
</para>
|
|
|
|
<para>Multi-word synonyms are supported, but be aware that these will
|
|
generate phrase queries, which may degrade performance and will disable
|
|
stemming expansion for the phrase terms.</para>
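<para>The file format and the <literal>OR</literal> expansion can be
sketched in a few lines of Python. This is only an illustration of the
rules given above, not the &RCL; parser: the function names
<literal>parse_synonyms</literal> and <literal>expand</literal> are
invented for the example, and the handling of quotes is deliberately
minimal.</para>
<programlisting><![CDATA[
```python
import re

def parse_synonyms(text):
    """Return a list of synonym groups (lists of casefolded terms)."""
    groups = []
    pending = ""
    for raw in text.splitlines():
        line = pending + raw.strip()
        pending = ""
        if line.endswith("\\"):
            # Continuation: keep the text and join it with the next line.
            pending = line[:-1] + " "
            continue
        if not line or line.startswith("#"):
            continue
        # A term is either a double-quoted phrase or a run of non-blanks.
        terms = [q or w for q, w in re.findall(r'"([^"]*)"|(\S+)', line)]
        groups.append([t.casefold() for t in terms])
    return groups

def expand(term, groups):
    """Build an OR query text for a term which belongs to a group."""
    for group in groups:
        if term.casefold() in group:
            # Multi-word synonyms become quoted phrases.
            return " OR ".join(f'"{t}"' if " " in t else t for t in group)
    return term
```
]]></programlisting>
<para>With the example file shown earlier, <literal>expand("hello",
groups)</literal> would yield <literal>hi OR hello OR "good
morning"</literal>, and a term which is in no group is returned
unchanged.</para>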
|
|
|
|
<para>The contents of the synonyms file must be casefolded (not only
|
|
lowercased), because this is what is expected at the point in the query
|
|
processing where it is used. There are a few cases where this makes a
|
|
difference, for example, German sharp s should be expressed as
|
|
<literal>ss</literal>, Greek final sigma as sigma. For reference,
|
|
Python 3 has an easy way to casefold words (<literal>str.casefold()</literal>).</para>
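<para>For instance, a synonyms file could be prepared with a small
script along the following lines. The helper name
<literal>casefold_synonyms</literal> is made up for this sketch;
comment lines are left untouched.</para>
<programlisting><![CDATA[
```python
def casefold_synonyms(text):
    """Casefold every line of a synonyms file, keeping comments as they are."""
    out = []
    for line in text.splitlines():
        out.append(line if line.lstrip().startswith("#") else line.casefold())
    return "\n".join(out)

# lower() keeps the sharp s, casefold() turns it into "ss".
print("Straße".lower())                            # straße
print(casefold_synonyms('Straße "Große Straße"'))  # strasse "grosse strasse"
```
]]></programlisting>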
|
|
|
|
<para>The synonyms file can be specified in the <guilabel>Search
|
|
parameters</guilabel> tab of the <guilabel>GUI configuration</guilabel>
|
|
<guilabel>Preferences</guilabel> menu entry, or as an option for
|
|
command-line searches.</para>
|
|
|
|
<para>Once the file is defined, the use of synonyms can be enabled or
|
|
disabled directly from the <guilabel>Preferences</guilabel>
|
|
menu.</para>
|
|
|
|
<para>The synonyms are searched for matches with user terms after the
|
|
latter are stem-expanded, but the contents of the synonyms file itself
|
|
are not subjected to stem expansion. This means that a match will not be
|
|
found if the form present in the synonyms file is not present anywhere
|
|
in the document set (same with accents when using a raw index).</para>
|
|
|
|
<para>The synonyms function is probably not going to help you find your
|
|
letters to Mr. Smith. It is best used for domain-specific searches. For
|
|
example, it was initially suggested by a user performing searches among
|
|
historical documents: the synonyms file would contain nicknames and
|
|
aliases for each of the persons of interest.</para>
|
|
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="RCL.SEARCH.PTRANS">
|
|
<title>Path translations</title>
|
|
|
|
<para>In some cases, the document paths stored inside the index do
|
|
not match the actual ones, so that document
|
|
previews and accesses will fail. This can occur in a number of
|
|
circumstances:</para>
|
|
<itemizedlist>
|
|
<listitem><para>When using multiple indexes it is a relatively common
|
|
occurrence that some will actually reside on a remote volume, for
|
|
example mounted via NFS. In this case, the paths used to access
|
|
the documents on the local machine are not necessarily the same
|
|
as the ones used while indexing on the remote machine. For
|
|
example, <filename>/home/me</filename> may have been used as
|
|
a <literal>topdirs</literal> element while indexing, but the
|
|
directory might be mounted
|
|
as <filename>/net/server/home/me</filename> on the local
|
|
machine.</para></listitem>
|
|
|
|
<listitem><para>The case may also occur with removable
|
|
disks. It is perfectly possible to configure an index to
|
|
live with the documents on the removable disk, but it may
|
|
happen that the disk is not mounted at the same place so
|
|
that the document paths from the index are
|
|
invalid.</para></listitem>
|
|
|
|
<listitem><para>As a last example, one could imagine that a big
|
|
directory has been moved, but that it is currently
|
|
inconvenient to run the indexer.</para></listitem>
|
|
</itemizedlist>
|
|
|
|
<para>&RCL; has a facility for rewriting access paths when
|
|
extracting the data from the index. The translations can be
|
|
defined for the main index and for any additional query
|
|
index.</para>
|
|
|
|
<para>The path translation facility will be useful
|
|
whenever the document paths seen by the indexer are not the same
|
|
as the ones which should be used at query time.</para>
|
|
|
|
<para>In the above NFS example, &RCL; could be instructed to
|
|
rewrite any <filename>file:///home/me</filename> URL from the
|
|
index to <filename>file:///net/server/home/me</filename>,
|
|
allowing accesses from the client.</para>
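<para>The effect of such a translation can be pictured with the
following Python sketch. It only illustrates prefix rewriting on
<literal>file://</literal> URLs: it is not the &RCL; implementation,
it does not describe the <filename>ptrans</filename> file syntax, and
the <literal>translate_url</literal> name and the
longest-prefix-first rule are assumptions.</para>
<programlisting><![CDATA[
```python
def translate_url(url, translations):
    """Rewrite the path prefix of a file:// URL, longest prefix first."""
    scheme = "file://"
    if not url.startswith(scheme):
        return url
    path = url[len(scheme):]
    for old in sorted(translations, key=len, reverse=True):
        prefix = old.rstrip("/")
        # Only match on whole path elements: /home/me must not match /home/mel.
        if path == prefix or path.startswith(prefix + "/"):
            return scheme + translations[old].rstrip("/") + path[len(prefix):]
    return url

nfs = {"/home/me": "/net/server/home/me"}
print(translate_url("file:///home/me/docs/report.pdf", nfs))
# file:///net/server/home/me/docs/report.pdf
```
]]></programlisting>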
|
|
|
|
<para>The translations are defined in the
|
|
<link linkend="RCL.INSTALL.CONFIG.PTRANS"><filename>ptrans</filename></link>
|
|
configuration file, which
|
|
can be edited by hand or from the GUI external indexes
|
|
configuration dialog: <menuchoice>
|
|
<guimenu>Preferences</guimenu>
|
|
<guimenuitem>External index dialog</guimenuitem>
|
|
</menuchoice>, then click the <guilabel>Paths
|
|
translations</guilabel> button on the right below the index
|
|
list.</para>
|
|
|
|
<note><para>Due to a current bug, the GUI must be restarted
|
|
after changing the <filename>ptrans</filename> values (even when they
|
|
were changed from the GUI).</para></note>
|
|
</sect1>
|
|
|
|
<sect1 id="RCL.SEARCH.CASEDIAC">
|
|
<title>Search case and diacritics sensitivity</title>
|
|
|
|
<para>For &RCL; versions 1.18 and later, and <emphasis>when working
|
|
with a raw index</emphasis> (not the default), searches can be
|
|
sensitive to character case and diacritics. How this happens
|
|
is controlled by configuration variables and what search data is
|
|
entered.</para>
|
|
|
|
<para>The general default is that searches entered without upper-case
|
|
or accented characters are insensitive to case and diacritics. An
|
|
entry of <literal>resume</literal> will match any of
|
|
<literal>Resume</literal>, <literal>RESUME</literal>,
|
|
<literal>résumé</literal>, <literal>Résumé</literal> etc.</para>
|
|
|
|
<para>Two configuration variables can automate switching on
|
|
sensitivity (they were documented but actually did nothing until
|
|
&RCL; 1.22):</para>
|
|
|
|
<variablelist>
|
|
|
|
<varlistentry>
|
|
<term>autodiacsens</term><listitem><para>If this is set, search
|
|
sensitivity to diacritics will be turned on as soon as an
|
|
accented character exists in a search term. When the variable
|
|
is set to true, <literal>resume</literal> will start a
|
|
diacritics-insensitive search, but <literal>résumé</literal>
|
|
will be matched exactly. The default value is
|
|
<emphasis>false</emphasis>.</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>autocasesens</term><listitem><para>If this is set, search
|
|
sensitivity to character case will be turned on as soon as an
|
|
upper-case character exists in a search term <emphasis>except
|
|
for the first one</emphasis>. When the variable is set to
|
|
true, <literal>us</literal> or <literal>Us</literal> will
|
|
start a case-insensitive search, but
|
|
<literal>US</literal> will be matched exactly. The default
|
|
value is <emphasis>true</emphasis> (contrary to
|
|
<literal>autodiacsens</literal>).</para></listitem>
|
|
</varlistentry>
|
|
|
|
</variablelist>
|
|
|
|
<para>As in the past, capitalizing the first letter of a word will
|
|
turn off its stem expansion and have no effect on
|
|
case-sensitivity.</para>
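<para>The automatic rules can be summarized by a short Python sketch.
It is a model of the behaviour described here, not code taken from
&RCL;: the <literal>sensitivity</literal> function is hypothetical,
its keyword arguments simply mirror the two configuration variables
and their default values, and the use of Unicode decomposition to
detect accents is an assumption.</para>
<programlisting><![CDATA[
```python
import unicodedata

def has_diacritics(term):
    # After NFD decomposition, accents show up as combining marks.
    decomposed = unicodedata.normalize("NFD", term)
    return any(unicodedata.combining(c) for c in decomposed)

def sensitivity(term, autocasesens=True, autodiacsens=False):
    """Return (case_sensitive, diacritics_sensitive) for one search term."""
    # An upper-case first letter alone does not trigger case sensitivity.
    case = autocasesens and any(c.isupper() for c in term[1:])
    diac = autodiacsens and has_diacritics(term)
    return case, diac
```
]]></programlisting>
<para>With the defaults, <literal>sensitivity("Us")</literal> returns
<literal>(False, False)</literal> and
<literal>sensitivity("US")</literal> returns <literal>(True,
False)</literal>. Remember that none of this applies unless the index
is a raw one.</para>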
|
|
|
|
<para>You can also explicitly activate case and diacritics
|
|
sensitivity by using modifiers with the query
|
|
language. <literal>C</literal> will make the term case-sensitive, and
|
|
<literal>D</literal> will make it
|
|
diacritics-sensitive. Examples:</para>
|
|
<programlisting>
|
|
"us"C
|
|
</programlisting>
|
|
|
|
<para>will search for the term <literal>us</literal> exactly
|
|
(<literal>Us</literal> will not be a match).</para>
|
|
|
|
<programlisting>
|
|
"resume"D
|
|
</programlisting>
|
|
<para>will search for the term <literal>resume</literal> exactly
|
|
(<literal>résumé</literal> will not be a match).</para>
|
|
|
|
|
|
<para>When either case or diacritics sensitivity is activated, stem
|
|
expansion is turned off. Having both does not make much sense.</para>
|
|
|
|
</sect1>
|
|
|
|
|
|
<sect1 id="RCL.SEARCH.DESKTOP">
|
|
<title>Desktop integration</title>
|
|
|
|
<para>Being independent of the desktop type has its drawbacks: &RCL;
|
|
desktop integration is minimal. However there are a few tools
|
|
available:
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>Users of recent Ubuntu-derived distributions, or
|
|
any other Gnome desktop systems (e.g. Fedora) can install the
|
|
<ulink
|
|
url="https://www.lesbonscomptes.com/recoll/pages/download.html#gssp">
|
|
Recoll GSSP</ulink> (Gnome Shell Search Provider).</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>The <application>KDE</application> KIO Slave was described
|
|
in a <link linkend="RCL.SEARCH.KIO">previous
|
|
section</link>. It can provide search results
|
|
inside <command>Dolphin</command>. </para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>If you use an oldish version of Ubuntu Linux, you may
|
|
find the <ulink url="&FAQS;UnityLens">Ubuntu Unity
|
|
Lens</ulink> module useful.</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>There is also an independently developed
|
|
<ulink
|
|
url="http://kde-apps.org/content/show.php/recollrunner?content=128203">
|
|
Krunner plugin</ulink>.</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
|
|
<para>Here follow a few other things that may help.</para>
|
|
|
|
<sect2 id="RCL.SEARCH.SHORTCUT">
|
|
<title>Hotkeying recoll</title>
|
|
|
|
<para>It is surprisingly convenient to be able to show or hide the
|
|
&RCL; GUI with a single keystroke. Recoll comes with a small
|
|
Python script, based on the <application>libwnck</application> window
|
|
manager interface library, which will allow you to do just
|
|
this. The detailed instructions are on
|
|
<ulink url="&FAQS;HotRecoll">this wiki page</ulink>.</para>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.KICKER-APPLET">
|
|
<title>The KDE Kicker Recoll applet</title>
|
|
|
|
<para>This is probably obsolete now. Anyway:</para>
|
|
<para>The &RCL; source tree contains the source code to the
|
|
<application>recoll_applet</application>, a small application derived
|
|
from the <application>find_applet</application>. This can be used to
|
|
add a small &RCL; launcher to the KDE panel.</para>
|
|
|
|
<para>The applet is not automatically built with the main &RCL;
|
|
programs, nor is it included with the main source distribution
|
|
(because the KDE build boilerplate makes it relatively big). You can
|
|
download its source from the recoll.org download page. Use the
|
|
omnipotent <userinput>configure;make;make install</userinput>
|
|
incantation to build and install.</para>
|
|
|
|
<para>You can then add the applet to the panel by right-clicking the
|
|
panel and choosing the <guilabel>Add applet</guilabel> entry.</para>
|
|
|
|
<para>The <application>recoll_applet</application> has a small text
|
|
window where you can type a &RCL; query (in query language form),
|
|
and an icon which can be used to restrict the search to certain
|
|
types of files. It is quite primitive, and launches a new recoll
|
|
GUI instance every time (even if it is already running). You may
|
|
find it useful anyway.</para>
|
|
|
|
</sect2>
|
|
|
|
</sect1> <!-- rcl.search.desktop -->
|
|
|
|
</chapter> <!-- Search -->
|
|
|
|
<chapter id="RCL.PROGRAM">
|
|
<title>Programming interface</title>
|
|
|
|
<para>&RCL; has an Application Programming Interface, usable both
|
|
for indexing and searching, currently accessible from the
|
|
<application>Python</application> language.</para>
|
|
|
|
<para>Another less radical way to extend the application is to
|
|
write input handlers for new types of documents.</para>
|
|
|
|
<para>The processing of metadata attributes for documents
|
|
(<literal>fields</literal>) is highly configurable.</para>
|
|
|
|
|
|
|
|
<sect1 id="RCL.PROGRAM.FILTERS">
|
|
<title>Writing a document input handler</title>
|
|
|
|
<note><title>Terminology</title><para>The small programs or pieces
|
|
of code which handle the processing of the different document
|
|
types for &RCL; used to be called <literal>filters</literal>,
|
|
which is still reflected in the name of the directory which
|
|
holds them and many configuration variables. They were named
|
|
this way because one of their primary functions is to filter
|
|
out the formatting directives and keep the text
|
|
content. However these modules may have other behaviours, and
|
|
the term <literal>input handler</literal> is now progressively
|
|
substituted in the documentation. <literal>filter</literal> is
|
|
still used in many places though.</para></note>
|
|
|
|
<para>&RCL; input handlers cooperate to translate from the multitude
|
|
of input document formats, simple ones such as
|
|
<application>opendocument</application>,
|
|
<application>acrobat</application>, or compound ones such as
|
|
<application>Zip</application> or <application>Email</application>,
|
|
into the final &RCL; indexing input format, which is plain text (in
|
|
many cases the processing pipeline has an intermediary HTML step,
|
|
which may be used for better previewing presentation). Most input
|
|
handlers are executable programs or scripts. A few handlers are coded
|
|
in C++ and live inside <command>recollindex</command>. This latter
|
|
kind will not be described here.</para>
|
|
|
|
<para>There are two kinds of external executable input handlers:
|
|
<itemizedlist>
|
|
<listitem><para>Simple <literal>exec</literal> handlers
|
|
run once and exit. They can be bare programs like
|
|
<command>antiword</command>, or scripts using other
|
|
programs. They are very simple to write, because they just
|
|
need to print the converted document to the standard
|
|
output. Their output can be plain text or HTML. HTML is
|
|
usually preferred because it can store metadata fields and
|
|
it allows preserving some of the formatting for the GUI
|
|
preview. However, these handlers have limitations:
|
|
<itemizedlist>
|
|
<listitem><para>They can only process one document
|
|
per file.</para></listitem>
|
|
<listitem><para>The output MIME type must be known and
|
|
fixed.</para></listitem>
|
|
<listitem><para>The character encoding, if relevant, must be
|
|
known and fixed (or possibly just depending on
|
|
location).</para></listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
</listitem>
|
|
<listitem><para>Multiple <literal>execm</literal> handlers can
|
|
process multiple files (sparing the process startup time which can
|
|
be very significant), or multiple documents per file (e.g.: for
|
|
archives or multi-chapter publications). They communicate with the
|
|
indexer through a simple protocol, but are nevertheless a bit more
|
|
complicated than the older kind. Most of the new handlers are
|
|
written in <application>Python</application> (exception:
|
|
<command>rclimg</command> which is written in Perl because
|
|
<literal>exiftool</literal> has no real Python equivalent). The
|
|
Python handlers use common modules to factor out the boilerplate,
|
|
which can make them very simple in favorable cases. The
|
|
subdocuments output by these handlers can be directly indexable
|
|
(text or HTML), or they can be other simple or compound documents
|
|
that will need to be processed by another handler.</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
|
|
<para>In both cases, handlers deal with regular file system
|
|
files, and can process either a single document, or a
|
|
linear list of documents in each file. &RCL; is responsible
|
|
for performing up-to-date checks, dealing with more complex
|
|
embedding and other upper level issues.</para>
|
|
|
|
<para>A simple handler returning a
|
|
document in <literal>text/plain</literal> format can transfer
|
|
no metadata to the indexer. Generic metadata, like document
|
|
size or modification date, will be gathered and stored by
|
|
the indexer.</para>
|
|
|
|
<para>Handlers that produce <literal>text/html</literal>
|
|
format can return an arbitrary amount of metadata inside HTML
|
|
<literal>meta</literal> tags. These will be processed
|
|
according to the directives found in
|
|
the <link linkend="RCL.PROGRAM.FIELDS"><filename>fields</filename> configuration file</link>.
|
|
</para>
|
|
|
|
<para>The handlers that can handle multiple documents per file
|
|
return a single piece of data to identify each document inside
|
|
the file. This piece of data, called
|
|
an <literal>ipath</literal> will be sent back by
|
|
&RCL; to extract the document at query time, for previewing,
|
|
or for creating a temporary file to be opened by a
|
|
viewer. These handlers can also return metadata either as HTML
|
|
<literal>meta</literal> tags, or as named data through the
|
|
communication protocol.</para>
|
|
|
|
<para>The following section describes the simple
|
|
handlers, and the next one gives a few explanations about
|
|
the <literal>execm</literal> ones. You could conceivably
|
|
write a simple handler with only the elements in the
|
|
manual. This will not be the case for the other ones, for
|
|
which you will have to look at the code.</para>
|
|
|
|
<sect2 id="RCL.PROGRAM.FILTERS.SIMPLE">
|
|
<title>Simple input handlers</title>
|
|
|
|
<para>&RCL; simple handlers are usually shell-scripts, but this is in
|
|
no way necessary. Extracting the text from the native format is the
|
|
difficult part. Outputting the format expected by &RCL; is
|
|
trivial. Happily enough, most document formats have translators or
|
|
text extractors which can be called from the handler. In some cases
|
|
the output of the translating program is completely appropriate,
|
|
and no intermediate shell-script is needed.</para>
|
|
|
|
<para>Input handlers are called with a single argument which is the
|
|
source file name. They should output the result to stdout.</para>
|
|
|
|
<para>When writing a handler, you should decide if it will output
|
|
plain text or HTML. Plain text is simpler, but you will not be able
|
|
to add metadata or vary the output character encoding (this will be
|
|
defined in a configuration file). Additionally, some formatting may
|
|
be easier to preserve when previewing HTML. Actually the deciding factor
|
|
is metadata: &RCL; has a way to
|
|
<link linkend="RCL.PROGRAM.FILTERS.HTML">extract metadata from the HTML header and use it for field searches</link>.
|
|
</para>
|
|
|
|
<para>The <envar>RECOLL_FILTER_FORPREVIEW</envar> environment
|
|
variable (values <literal>yes</literal>, <literal>no</literal>)
|
|
tells the handler if the operation is for indexing or
|
|
previewing. Some handlers use this to output a slightly different
|
|
format, for example stripping uninteresting repeated keywords (such as
|
|
<literal>Subject:</literal> for email) when indexing. This is not
|
|
essential.</para>
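<para>Although most simple handlers are shell scripts, the calling
convention is easy to show in Python. The following is a minimal
sketch of a handler producing <literal>text/plain</literal> output,
under stated assumptions: the input is taken to be UTF-8 text, the
<literal>convert</literal> function and the <literal>KEYWORD</literal>
constant are invented for the example, and the matching
<filename>mimeconf</filename> entry would have to declare the
<literal>utf-8</literal> output encoding.</para>
<programlisting><![CDATA[
```python
#!/usr/bin/env python3
import os
import sys

KEYWORD = "Subject:"  # hypothetical repeated keyword, as in the email example

def convert(data, for_preview):
    """Decode the input and return the text to print on stdout."""
    text = data.decode("utf-8", errors="replace")
    if for_preview:
        return text
    # When indexing, strip the repeated keyword at the start of lines.
    lines = [l[len(KEYWORD):].lstrip() if l.startswith(KEYWORD) else l
             for l in text.splitlines()]
    return "\n".join(lines)

def main(argv):
    for_preview = os.environ.get("RECOLL_FILTER_FORPREVIEW", "no") == "yes"
    with open(argv[1], "rb") as f:
        out = convert(f.read(), for_preview)
    # Write bytes so that the output encoding does not depend on the locale.
    sys.stdout.buffer.write(out.encode("utf-8"))
    return 0

# Only run as a handler when called with exactly one file name argument.
if __name__ == "__main__" and len(sys.argv) == 2:
    sys.exit(main(sys.argv))
```
]]></programlisting>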
|
|
|
|
<para>You should look at one of the simple handlers, for example
|
|
<command>rclps</command> for a starting point.</para>
|
|
|
|
<para>Don't forget to make your handler executable before
|
|
testing!</para>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.PROGRAM.FILTERS.MULTIPLE">
|
|
<title>"Multiple" handlers</title>
|
|
|
|
<para>If you can program and want to write
|
|
an <literal>execm</literal> handler, it should not be too
|
|
difficult to make sense of one of the existing handlers.</para>
|
|
|
|
<para>The existing handlers differ in the amount of helper code
|
|
which they are using:
|
|
<itemizedlist>
|
|
<listitem><para><literal>rclimg</literal> is written in Perl and
|
|
handles the execm protocol all by itself (showing how trivial it
|
|
is).</para></listitem> <listitem><para>All the Python handlers
|
|
share at least the <filename>rclexecm.py</filename> module, which
|
|
handles the communication. Have a look at, for
|
|
example, <filename>rclzip</filename> for a handler which
|
|
uses <filename>rclexecm.py</filename>
|
|
directly.</para></listitem> <listitem><para>Most Python handlers
|
|
which process single-document files by executing another command
|
|
are further abstracted by using
|
|
the <filename>rclexec1.py</filename> module. See for
|
|
example <filename>rclrtf.py</filename> for a simple one,
|
|
or <filename>rcldoc.py</filename> for a slightly more complicated
|
|
one (possibly executing several
|
|
commands).</para></listitem> <listitem><para>Handlers which
|
|
extract text from an XML document by using an XSLT style sheet
|
|
are now executed inside <command>recollindex</command>, with only
|
|
the style sheet stored in the <filename>filters/</filename>
|
|
directory. These can use a single style sheet
|
|
(e.g. <filename>abiword.xsl</filename>), or two sheets for the
|
|
data and metadata (e.g. <filename>opendoc-body.xsl</filename>
|
|
and <filename>opendoc-meta.xsl</filename>). The <filename>mimeconf</filename>
|
|
configuration file defines how the sheets are used; have a
look. Before the C++ import, the xsl-based handlers used a common
module, <filename>rclgenxslt.py</filename>, which is still around but
|
|
unused at the moment. The handler for OpenXML presentations is
|
|
still the Python version because the format did not fit with what
|
|
the C++ code does. It would be a good base for another similar
|
|
issue.</para></listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
|
|
<para>There is a sample trivial handler based on
|
|
<filename>rclexecm.py</filename>, with many comments, not actually
|
|
used by &RCL;. It would index a text file as one document per
|
|
line. Look for <filename>rcltxtlines.py</filename> in the
|
|
<filename>src/filters</filename> directory in the online &RCL;
|
|
<ulink url="https://framagit.org/medoc92/recoll">Git
|
|
repository</ulink> (the sample is not in the distributed release at
|
|
the moment).</para>
|
|
|
|
<para>You can also have a look at the slightly more complex
|
|
<command>rclzip</command> which uses Zip
|
|
file paths as identifiers (<literal>ipath</literal>).</para>
|
|
|
|
<para><literal>execm</literal> handlers sometimes need to make
|
|
a choice for the nature of the <literal>ipath</literal>
|
|
elements that they use in communication with the
|
|
indexer. Here are a few guidelines:
|
|
<itemizedlist>
|
|
<listitem><para>Use ASCII or UTF-8 (if the identifier is an
|
|
integer print it, for example, like printf %d would
|
|
do).</para></listitem>
|
|
<listitem><para>If at all possible, the data should make some
|
|
kind of sense when printed to a log file to help with
|
|
debugging.</para></listitem>
|
|
<listitem><para>&RCL; uses a colon (<literal>:</literal>) as a
|
|
separator to store a complex path internally (for
|
|
deeper embedding). Colons inside
|
|
the <literal>ipath</literal> elements output by a
|
|
handler will be escaped, but would be a bad choice as a
|
|
handler-specific separator (mostly, again, for
|
|
debugging issues).</para></listitem>
|
|
</itemizedlist>
|
|
In any case, the main goal is that it should
|
|
be easy for the handler to extract the target document, given
|
|
the file name and the <literal>ipath</literal>
|
|
element.</para>
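<para>To make the guidelines concrete, here is a hedged Python sketch
of the naming choice for a Zip archive, where the member path inside
the archive serves as the <literal>ipath</literal>. It is not the
<command>rclzip</command> source and it leaves out the
<literal>execm</literal> protocol entirely; the function names
<literal>list_ipaths</literal> and <literal>extract</literal> are
assumptions.</para>
<programlisting><![CDATA[
```python
import io
import zipfile

def list_ipaths(archive):
    """Member names double as ipath values: readable and unique."""
    with zipfile.ZipFile(archive) as zf:
        # Directory entries end with a slash and hold no document.
        return [n for n in zf.namelist() if not n.endswith("/")]

def extract(archive, ipath):
    """Return the bytes of the document designated by an ipath."""
    with zipfile.ZipFile(archive) as zf:
        return zf.read(ipath)

# Small in-memory archive to exercise the two functions.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as zf:
    zf.writestr("docs/a.txt", "alpha")
    zf.writestr("b.txt", "beta")
for ipath in list_ipaths(demo):
    print(ipath, len(extract(demo, ipath)))
```
]]></programlisting>
<para>The identifiers are plain text, they make sense in a log file,
and the handler can retrieve a document directly from the file name
and the <literal>ipath</literal>, which is the main goal stated
above.</para>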
|
|
|
|
<para><literal>execm</literal> handlers will also produce
|
|
a document with a null <literal>ipath</literal>
|
|
element. Depending on the type of document, this may have
|
|
some associated data (e.g. the body of an email message), or
|
|
none (typical for an archive file). If it is empty, this
|
|
document will be useful anyway for some operations, as the
|
|
parent of the actual data documents.</para>
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.PROGRAM.FILTERS.ASSOCIATION">
|
|
<title>Telling &RCL; about the handler</title>
|
|
|
|
<para>There are two elements that link a file to the handler which
|
|
should process it: the association of a file to a MIME type and the
|
|
association of a MIME type with a handler.</para>
|
|
|
|
<para>The association of files to MIME types is mostly based on
|
|
name suffixes. The types are defined inside the
|
|
<link linkend="RCL.INSTALL.CONFIG.MIMEMAP"><filename>mimemap</filename> file</link>.
|
|
Example:
|
|
<programlisting>
|
|
|
|
.doc = application/msword
|
|
</programlisting>
|
|
If no suffix association is found for the file name, &RCL; will try
|
|
to execute a system command (typically <command>file -i</command> or
|
|
<command>xdg-mime</command>) to determine a MIME type.</para>
|
|
|
|
<para>The second element is the association of MIME types to handlers
|
|
in the <link linkend="RCL.INSTALL.CONFIG.MIMECONF"><filename>mimeconf</filename> file</link>.
|
|
A sample will probably be better than a long explanation:</para>
|
|
<programlisting>
|
|
|
|
[index]
|
|
application/msword = exec antiword -t -i 1 -m UTF-8;\
|
|
mimetype = text/plain ; charset=utf-8
|
|
|
|
application/ogg = exec rclogg
|
|
|
|
text/rtf = exec unrtf --nopict --html; charset=iso-8859-1; mimetype=text/html
|
|
|
|
application/x-chm = execm rclchm
|
|
</programlisting>
|
|
|
|
<para>The fragment specifies that:
|
|
|
|
<itemizedlist>
|
|
<listitem><para><literal>application/msword</literal> files
|
|
are processed by executing the <command>antiword</command>
|
|
program, which outputs
|
|
<literal>text/plain</literal> encoded in
|
|
<literal>utf-8</literal>.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><literal>application/ogg</literal> files are
|
|
processed by the <command>rclogg</command> script, with
|
|
default output type (<literal>text/html</literal>, with
|
|
encoding specified in the header, or <literal>utf-8</literal>
|
|
by default).</para>
|
|
</listitem>
|
|
|
|
<listitem><para><literal>text/rtf</literal> is processed by
|
|
<command>unrtf</command>, which outputs
|
|
<literal>text/html</literal>. The
|
|
<literal>iso-8859-1</literal> encoding is specified because it
|
|
is not the <literal>utf-8</literal> default, and not output by
|
|
<command>unrtf</command> in the HTML header section.</para>
|
|
</listitem>
|
|
<listitem><para><literal>application/x-chm</literal> is processed
|
|
by a persistent handler. This is determined by the
|
|
<literal>execm</literal> keyword.</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
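<para>The two-step association can be pictured with the following
Python sketch, which looks up a suffix, then a MIME type, then splits
the handler definition into its parts. The dictionaries are
hypothetical in-memory copies of the sample entries, and the parsing
is a simplification written for this example; &RCL; reads the real
configuration files with its own code.</para>
<programlisting><![CDATA[
```python
import os

# Hypothetical in-memory copies of the sample entries.
MIMEMAP = {".doc": "application/msword"}
MIMECONF = {
    "application/msword":
        "exec antiword -t -i 1 -m UTF-8; mimetype = text/plain ; charset=utf-8",
    "application/x-chm": "execm rclchm",
}

def parse_handler(value):
    """Split a handler definition into (kind, command line, attributes)."""
    command, *extras = [part.strip() for part in value.split(";")]
    kind, _, cmdline = command.partition(" ")
    attrs = {}
    for extra in extras:
        name, _, val = extra.partition("=")
        attrs[name.strip()] = val.strip()
    return kind, cmdline.split(), attrs

def handler_for(filename):
    """Suffix -> MIME type -> handler, or None if either step fails."""
    suffix = os.path.splitext(filename)[1]
    mimetype = MIMEMAP.get(suffix)
    if mimetype is None:
        return None  # this is where a system command would be tried
    value = MIMECONF.get(mimetype)
    return parse_handler(value) if value else None

print(handler_for("letter.doc"))
```
]]></programlisting>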
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.PROGRAM.FILTERS.HTML">
|
|
<title>Input handler output</title>
|
|
|
|
<para>Both the simple and persistent input handlers can return any
|
|
MIME type to Recoll, which will further process the data according
|
|
to the MIME configuration.</para>
|
|
|
|
<para>Most input filters produce either
|
|
<literal>text/plain</literal> or <literal>text/html</literal>
|
|
data. There are exceptions, for example, filters which process
|
|
archive files (<literal>zip</literal>, <literal>tar</literal>, etc.)
|
|
will usually return the documents as they are found, without
|
|
processing them further.</para>
|
|
|
|
<para>There is nothing to say about <literal>text/plain</literal>
|
|
output, except that its character encoding should be consistent
|
|
with what is specified in the <filename>mimeconf</filename>
|
|
file.</para>
|
|
|
|
<para>For filters producing HTML, the output could be very minimal
|
|
like the following example:
|
|
<programlisting><![CDATA[
|
|
<html>
|
|
<head>
|
|
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8"/>
|
|
</head>
|
|
<body>
|
|
Some text content
|
|
</body>
|
|
</html>
|
|
]]></programlisting>
|
|
</para>
|
|
|
|
<para>You should take care to escape some
|
|
characters inside the text by transforming them into
|
|
appropriate entities. At the very minimum,
"<literal>&amp;</literal>" should be transformed into
"<literal>&amp;amp;</literal>", "<literal>&lt;</literal>"
should be transformed into
"<literal>&amp;lt;</literal>". This is not always properly
|
|
done by external helper programs which output HTML, and of
|
|
course never by those which output plain text. </para>
|
|
|
|
<para>When encapsulating plain text in an HTML body,
|
|
the display of a preview may be improved by enclosing the
|
|
text inside <literal>&lt;pre&gt;</literal> tags.</para>
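<para>A handler written in Python could do the escaping and the
wrapping with the standard <literal>html</literal> module, as in this
minimal sketch. The <literal>wrap_as_html</literal> function is a
made-up helper, and the output is assumed to be written out as UTF-8
so that it agrees with the declared character set.</para>
<programlisting><![CDATA[
```python
import html

def wrap_as_html(text, title=""):
    """Wrap plain text in a minimal HTML document for the indexer."""
    return (
        "<html><head>\n"
        f"<title>{html.escape(title, quote=False)}</title>\n"
        '<meta http-equiv="Content-Type" content="text/html;charset=UTF-8"/>\n'
        "</head><body><pre>\n"
        # html.escape() replaces &, < and > with the matching entities.
        f"{html.escape(text, quote=False)}\n"
        "</pre></body></html>\n"
    )

print(wrap_as_html("if a < b && b > c: ...", title="R&D notes"))
```
]]></programlisting>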
|
|
|
|
<para>The character set needs to be specified in the
|
|
header. It does not need to be UTF-8 (&RCL; will take care
|
|
of translating it), but it must be accurate for good
|
|
results.</para>
|
|
|
|
<para>&RCL; will process <literal>meta</literal> tags inside
|
|
the header as possible document field candidates. Document
|
|
fields can be processed by the indexer in different ways,
|
|
for searching or displaying inside query results. This is
|
|
described in a <link linkend="RCL.PROGRAM.FIELDS">following section.</link>
|
|
</para>
|
|
|
|
<para>By default, the indexer will process the standard header
|
|
fields if they are present: <literal>title</literal>,
|
|
<literal>meta/description</literal>,
|
|
and <literal>meta/keywords</literal> are both indexed and stored
|
|
for query-time display.</para>
|
|
|
|
<para>A predefined non-standard <literal>meta</literal> tag
|
|
will also be processed by &RCL; without further
|
|
configuration: if a <literal>date</literal> tag is present
|
|
and has the right format, it will be used as the document
|
|
date (for display and sorting), in preference to the file
|
|
modification date. The date format should be as follows:
|
|
<programlisting>
|
|
&lt;meta name="date" content="YYYY-mm-dd HH:MM:SS"&gt;
or
&lt;meta name="date" content="YYYY-mm-ddTHH:MM:SS"&gt;
|
|
</programlisting>
|
|
Example:
|
|
<programlisting>
|
|
<meta name="date" content="2013-02-24 17:50:00">
|
|
</programlisting>
|
|
</para>
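<para>A Python handler could produce such a tag from a
<literal>datetime</literal> value as sketched below. The
<literal>date_meta()</literal> name is hypothetical, and the sketch
makes no assumption about time zones: it just formats the value it is
given, using the first of the two formats shown above.</para>

<programlisting><![CDATA[
import datetime

def date_meta(when):
    """Return a meta date tag for a datetime.datetime value (hypothetical helper)."""
    # %Y-%m-%d %H:%M:%S corresponds to the YYYY-mm-dd HH:MM:SS layout
    return '<meta name="date" content="%s">' % when.strftime("%Y-%m-%d %H:%M:%S")
]]></programlisting>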
|
|
|
|
<para>Input handlers also have the possibility to "invent" field
names. These should also be output as meta tags:</para>
|
|
|
|
<programlisting>
|
|
&lt;meta name="somefield" content="Some textual data" /&gt;
|
|
</programlisting>
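<para>When the field value comes from arbitrary data, it needs the
same kind of escaping as the document text. The following sketch
shows one way to do this in a Python handler. The
<literal>meta_field()</literal> name is hypothetical, and the sketch
assumes that entities inside attribute values are decoded by the
indexer, as an HTML parser normally does. It does not cover values
which contain HTML markup on purpose.</para>

<programlisting><![CDATA[
import html

def meta_field(name, value):
    """Format one custom field as a meta tag (hypothetical helper)."""
    # quote=True also escapes double quotes, which would otherwise
    # terminate the attribute value too early
    return '<meta name="%s" content="%s" />' % (
        html.escape(name, quote=True), html.escape(value, quote=True))
]]></programlisting>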
|
|
|
|
<para>You can embed HTML markup inside the content of custom
|
|
fields, for improving the display inside result lists. In this
|
|
case, add a (wildly non-standard) <literal>markup</literal>
|
|
attribute to tell &RCL; that the value is HTML and should not
|
|
be escaped for display.</para>
|
|
|
|
<programlisting>
|
|
&lt;meta name="somefield" markup="html" content="Some &lt;i&gt;textual&lt;/i&gt; data" /&gt;
|
|
</programlisting>
|
|
|
|
<para>As written above, the processing of fields is described
|
|
in a <link linkend="RCL.PROGRAM.FIELDS">further section</link>.</para>
|
|
|
|
|
|
<para>Persistent filters can use another, probably simpler,
|
|
method to produce metadata, by calling the
|
|
<literal>setfield()</literal> helper method. This avoids the
|
|
necessity to produce HTML, and any issue with HTML quoting. See,
for example, <filename>rclaudio</filename> in &RCL; 1.23 and
later, a handler which outputs
<literal>text/plain</literal> and uses
<literal>setfield()</literal> to produce metadata.</para>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.PROGRAM.FILTERS.PAGES">
|
|
<title>Page numbers</title>
|
|
|
|
<para>The indexer will interpret <literal>^L</literal> characters
|
|
in the handler output as indicating page breaks, and will record
|
|
them. At query time, this allows starting a viewer on the right
|
|
page for a hit or a snippet. Currently, only the PDF, PostScript
|
|
and DVI handlers generate page breaks.</para>
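<para>A handler which already has the text of each page could mark
the page breaks as in the sketch below. The
<literal>join_pages()</literal> name is hypothetical. The only fact
taken from the description above is that <literal>^L</literal> (form
feed, character code 12) separates pages.</para>

<programlisting><![CDATA[
def join_pages(pages):
    """Join per-page texts with form feed characters (hypothetical helper)."""
    # "\f" is the form feed character (^L, code 0x0C)
    return "\f".join(pages)
]]></programlisting>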
|
|
|
|
</sect2>
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="RCL.PROGRAM.FIELDS">
|
|
<title>Field data processing</title>
|
|
|
|
<para><literal>Fields</literal> are named pieces of information
|
|
in or about documents, like <literal>title</literal>,
|
|
<literal>author</literal>, <literal>abstract</literal>.</para>
|
|
|
|
<para>The field values for documents can appear in several ways
|
|
during indexing: either output by input handlers
|
|
as <literal>meta</literal> fields in the HTML header section, or
|
|
extracted from file extended attributes, or added as attributes
|
|
of the <literal>Doc</literal> object when using the API, or
|
|
again synthesized internally by &RCL;.</para>
|
|
|
|
<para>The &RCL; query language allows searching for text in a
|
|
specific field.</para>
|
|
|
|
<para>&RCL; defines a number of default fields. Additional
|
|
ones can be output by handlers, and described in the
|
|
<filename>fields</filename> configuration file.</para>
|
|
|
|
<para>Fields can be:</para>
|
|
<itemizedlist>
|
|
|
|
<listitem><para><literal>indexed</literal>, meaning that their
|
|
terms are separately stored in inverted lists (with a specific
|
|
prefix), and that a field-specific search is possible.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><literal>stored</literal>, meaning that their
|
|
value is recorded in the index data record for the document,
|
|
and can be returned and displayed with search results.</para>
|
|
</listitem>
|
|
|
|
</itemizedlist>
|
|
|
|
<para>A field can be indexed, stored, or both. This and
other aspects of field handling are defined inside the
<filename>fields</filename> configuration file.</para>
|
|
|
|
<para>Some fields may also be designated as supporting range queries,
meaning that the results may be selected for an interval of their
values. See the <link linkend="RCL.INSTALL.CONFIG.FIELDS">configuration section</link> for more details.</para>
|
|
|
|
<para>The sequence of events for field processing is as follows:
|
|
<itemizedlist>
|
|
<listitem><para>During indexing,
|
|
<command>recollindex</command> scans all <literal>meta</literal>
|
|
fields in HTML documents (most document types are transformed
|
|
into HTML at some point). It compares the name for each element
|
|
to the configuration defining what should be done with fields
|
|
(the <filename>fields</filename> file).</para>
|
|
</listitem>
|
|
<listitem><para>If the name for the <literal>meta</literal>
|
|
element matches one for a field that should be indexed, the
|
|
contents are processed and the terms are entered into the index
|
|
with the prefix defined in the <filename>fields</filename>
|
|
file.</para>
|
|
</listitem>
|
|
<listitem><para>If the name for the <literal>meta</literal> element
|
|
matches one for a field that should be stored, the content of the
|
|
element is stored with the document data record, from which it
|
|
can be extracted and displayed at query time.</para>
|
|
</listitem>
|
|
<listitem><para>At query time, if a field search is performed, the
|
|
index prefix is computed and the match is only performed against
|
|
appropriately prefixed terms in the index.</para>
|
|
</listitem>
|
|
<listitem><para>At query time, the field can be displayed inside
|
|
the result list by using the appropriate directive in the
|
|
definition of the
|
|
<link linkend="RCL.SEARCH.GUI.CUSTOM.RESLIST">result list paragraph format</link>.
|
|
All fields are displayed on the fields screen of
the preview window (which you can reach through the right-click
menu). This works whether or not the search which
produced the results used the field.</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
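<para>The indexing-time part of this sequence can be pictured with
the following toy model. It is only meant to make the difference
between indexed and stored fields concrete: it is not how &RCL; is
implemented, the <literal>process_meta()</literal> name is
hypothetical, the word splitting is deliberately naive, and the
<literal>A</literal> prefix used in the usage note is an illustrative
value.</para>

<programlisting><![CDATA[
def process_meta(metas, indexed_prefixes, stored_names):
    """Toy model of field processing (not Recoll's actual code).

    metas: dict of meta element name -> content
    indexed_prefixes: dict of field name -> index prefix
    stored_names: set of field names kept in the document data record
    """
    terms, record = [], {}
    for name, content in metas.items():
        prefix = indexed_prefixes.get(name)
        if prefix is not None:
            # Indexed field: each term is entered with the field prefix
            terms.extend(prefix + word.lower() for word in content.split())
        if name in stored_names:
            # Stored field: the value is kept for display at query time
            record[name] = content
    return terms, record
]]></programlisting>

<para>With an <literal>author</literal> field declared as indexed with
prefix <literal>A</literal> and a <literal>pagecount</literal> field
declared as stored, an author value of <literal>Jane Doe</literal>
yields the prefixed terms <literal>Ajane</literal> and
<literal>Adoe</literal>, while the page count only ends up in the data
record.</para>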
|
|
|
|
<para>You can find more information in the
|
|
<link linkend="RCL.INSTALL.CONFIG.FIELDS">section about the <filename>fields</filename> file</link>,
|
|
or in comments inside the file.</para>
|
|
|
|
<para>You can also have a look at the
|
|
<ulink url="&FAQS;HandleCustomField">example in the FAQs area</ulink>,
|
|
detailing how one could add a <emphasis>page count</emphasis> field
|
|
to PDF documents for displaying inside result lists.</para>
|
|
|
|
</sect1>
|
|
|
|
|
|
<sect1 id="RCL.PROGRAM.PYTHONAPI">
|
|
<title>Python API</title>
|
|
|
|
<sect2 id="RCL.PROGRAM.PYTHONAPI.INTRO">
|
|
<title>Introduction</title>
|
|
|
|
<para>The &RCL; Python programming interface can be used both for
|
|
searching and for creating/updating an index. Bindings exist for
|
|
Python2 and Python3 (Jan 2021: python2 support will be dropped
|
|
soon).</para>
|
|
|
|
<para>The search interface is used in a number of active projects:
|
|
the <ulink
|
|
url="https://www.lesbonscomptes.com/recoll/pages/download.html#gssp">
|
|
&RCL; <application>Gnome Shell Search Provider</application>
|
|
</ulink>,
|
|
the <ulink url="https://framagit.org/medoc92/recollwebui">
|
|
&RCL; Web UI</ulink>, and the
|
|
<ulink
|
|
url="https://www.lesbonscomptes.com/upmpdcli/upmpdcli-manual.html#UPRCL">
|
|
upmpdcli UPnP Media Server</ulink>, in addition
|
|
to many small scripts.</para>
|
|
|
|
<para>The index update section of the API may be used to create and
|
|
update &RCL; indexes on specific configurations (separate from the
|
|
ones created by <command>recollindex</command>). The resulting
|
|
databases can be queried alone, or in conjunction with regular
|
|
ones, through the GUI or any of the query interfaces.</para>
|
|
|
|
<para>The search API is modeled along the Python database API
|
|
version 2.0 specification (early versions used the version 1.0 spec).</para>
|
|
|
|
<para>The <literal>recoll</literal> package contains two modules:</para>
|
|
<itemizedlist>
|
|
<listitem><para>The <literal>recoll</literal> module contains
|
|
functions and classes used to query (or update) the
|
|
index.</para></listitem>
|
|
|
|
<listitem><para>The <literal>rclextract</literal> module contains
|
|
functions and classes used at query time to access document
|
|
data. The <literal>recoll</literal> module must be imported
|
|
before <literal>rclextract</literal></para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
|
|
<para>There is a good chance that your system repository has
|
|
packages for the Recoll Python API, sometimes in a package separate
|
|
from the main one (maybe named something like python-recoll). Otherwise,
refer to the <link linkend="RCL.INSTALL.BUILDING">Building from source chapter</link>.</para>
|
|
|
|
<para>As an introduction, the following small sample will run a
|
|
query and list the title and url for each of the results. The
|
|
<filename>python/samples</filename> source directory contains
|
|
several examples of Python programming with &RCL;, exercising the
|
|
extension more completely, and especially its data extraction
|
|
features.</para>
|
|
|
|
<programlisting><![CDATA[
#!/usr/bin/python3

from recoll import recoll

# Connect to the default index and run a query
db = recoll.connect()
query = db.query()
nres = query.execute("some query")
# Fetch at most 20 results and print their URL and title
results = query.fetchmany(20)
for doc in results:
    print("%s %s" % (doc.url, doc.title))
]]></programlisting>
|
|
|
|
<para>You can also take a look at the source for the
|
|
<ulink
|
|
url="https://framagit.org/medoc92/recollwebui/-/blob/master/webui.py">
|
|
Recoll WebUI</ulink>, the
|
|
<ulink url="https://framagit.org/medoc92/upmpdcli/-/blob/master/src/mediaserver/cdplugins/uprcl/uprclfolders.py">
|
|
upmpdcli local media server</ulink>, or the
|
|
<ulink
|
|
url="https://framagit.org/medoc92/recoll-gssp/-/blob/master/gssp-recoll.py">
|
|
Gnome Shell Search Provider</ulink>.</para>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.PROGRAM.PYTHONAPI.ELEMENTS">
|
|
<title>Interface elements</title>
|
|
|
|
<para>A few elements in the interface are specific to &RCL; and need
an explanation.</para>
|
|
|
|
<variablelist>
|
|
|
|
<varlistentry id="RCL.PROGRAM.PYTHONAPI.ELEMENTS.IPATH">
|
|
<term>ipath</term>
|
|
<listitem><para>This data value (set as a field in the Doc
|
|
object) is stored, along with the URL, but not indexed by
|
|
&RCL;. Its contents are not interpreted by the index layer, and
|
|
its use is up to the application. For example, the &RCL; file
|
|
system indexer uses the <literal>ipath</literal> to store the
|
|
part of the document access path internal to (possibly
nested) container documents. <literal>ipath</literal> in
this case is a vector of access elements (e.g., the first part
could be a path inside a zip file to an archive member which
happens to be an mbox file, the second element would be the
message sequential number inside the mbox,
etc.). <literal>url</literal> and <literal>ipath</literal> are
|
|
returned in every search result and define the access to the
|
|
original document. <literal>ipath</literal> is empty for
|
|
top-level document/files (e.g. a PDF document which is a
|
|
filesystem file). The &RCL; GUI knows about the structure of the
|
|
<literal>ipath</literal> values used by the filesystem indexer,
|
|
and uses it for such functions as opening the parent of a given
|
|
document.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry id="RCL.PROGRAM.PYTHONAPI.ELEMENTS.UDI">
|
|
<term>udi</term>
|
|
<listitem><para>A <literal>udi</literal> (unique document
identifier) identifies a document. Because of limitations inside
the index engine, it is restricted in length (to 200 bytes),
which is why a regular URI cannot be used. The structure and
contents of the <literal>udi</literal> are defined by the
application and opaque to the index engine. For example, the
|
|
internal file system indexer uses the complete document path
|
|
(file path + internal path), truncated to length, the suppressed
|
|
part being replaced by a hash value. The <literal>udi</literal>
|
|
is not explicit in the query interface (it is used "under the
|
|
hood" by the <filename>rclextract</filename> module), but it is
|
|
an explicit element of the update interface.</para> </listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry id="RCL.PROGRAM.PYTHONAPI.ELEMENTS.PARENTUDI">
|
|
<term>parent_udi</term>
|
|
<listitem><para>If this attribute is set on a document when
|
|
entering it in the index, it designates its physical container
|
|
document. In a multilevel hierarchy, this may not be the
|
|
immediate parent. <literal>parent_udi</literal> is optional, but
|
|
its use by an indexer may simplify index maintenance, as &RCL;
|
|
will automatically delete all children defined by
|
|
<literal>parent_udi == udi</literal> when the document designated
by <literal>udi</literal> is destroyed. For example, if a
<literal>Zip</literal> archive contains entries which are
|
|
themselves containers, like <literal>mbox</literal> files, all
|
|
the subdocuments inside the <literal>Zip</literal> file (mbox,
|
|
messages, message attachments, etc.) would have the same
|
|
<literal>parent_udi</literal>, matching the
|
|
<literal>udi</literal> for the <literal>Zip</literal> file, and
|
|
all would be destroyed when the <literal>Zip</literal> file
|
|
(identified by its <literal>udi</literal>) is removed from the
|
|
index. The standard filesystem indexer uses
|
|
<literal>parent_udi</literal>.</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>Stored and indexed fields</term>
|
|
<listitem><para>The
|
|
<link linkend="RCL.INSTALL.CONFIG.FIELDS"><filename>fields</filename> file</link>
|
|
inside the &RCL; configuration defines which
|
|
document fields are either <literal>indexed</literal>
|
|
(searchable), <literal>stored</literal> (retrievable with
|
|
search results), or both. Apart from a few standard/internal
|
|
fields, only the <literal>stored</literal> fields are
|
|
retrievable through the Python search interface.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
</variablelist>
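<para>An external indexer has to build its own
<literal>udi</literal> values within the length limit. The sketch
below shows one possible scheme, in the spirit of the truncate-and-hash
approach described above. It is not the algorithm used by the &RCL;
filesystem indexer: the <literal>make_udi()</literal> name, the
<literal>|</literal> separator and the choice of MD5 are all
assumptions made for the example. The function works on bytes because
the limit is expressed in bytes.</para>

<programlisting><![CDATA[
import hashlib

def make_udi(path, ipath="", maxlen=200):
    """Build a bounded-length identifier (illustrative, not Recoll's algorithm)."""
    full = (path + "|" + ipath).encode("utf-8")
    if len(full) <= maxlen:
        return full
    # Keep the beginning, and replace the part which does not fit
    # with a 32 character hexadecimal hash of that part
    keep = maxlen - 32
    digest = hashlib.md5(full[keep:]).hexdigest().encode("ascii")
    return full[:keep] + digest
]]></programlisting>

<para>Two long paths which only differ near the end still yield
different identifiers, because the hash covers the whole suppressed
part.</para>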
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.PROGRAM.PYTHONAPI.LOG">
|
|
<title>Log messages for Python scripts</title>
|
|
|
|
<para>Two specific configuration variables,
<literal>pyloglevel</literal> and <literal>pylogfilename</literal>,
allow overriding the generic logging parameters for Python programs. Set
<literal>pyloglevel</literal> to 2 to suppress default startup messages
(printed at level 3).</para>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.PROGRAM.PYTHONAPI.SEARCH">
|
|
<title>Python search interface</title>
|
|
|
|
<sect3 id="RCL.PROGRAM.PYTHONAPI.RECOLL">
|
|
<title>The recoll module</title>
|
|
|
|
<simplesect id="RCL.PROGRAM.PYTHONAPI.RECOLL.CONNECT">
|
|
<title>connect(confdir=None, extra_dbs=None, writable = False)</title>
|
|
|
|
<para>The <literal>connect()</literal> function connects to
|
|
one or several &RCL; index(es) and returns
|
|
a <literal>Db</literal> object.</para>
|
|
<para>This call initializes the recoll module, and it should
|
|
always be performed before any other call or object
|
|
creation.</para>
|
|
<itemizedlist>
|
|
<listitem><para><literal>confdir</literal> may specify
|
|
a configuration directory. The usual defaults
|
|
apply.</para></listitem>
|
|
<listitem><para><literal>extra_dbs</literal> is a list of
|
|
additional indexes (Xapian directories).</para></listitem>
|
|
<listitem><para><literal>writable</literal> decides if
|
|
we can index new data through this
|
|
connection.</para></listitem>
|
|
</itemizedlist>
|
|
</simplesect>
|
|
|
|
<simplesect id="RCL.PROGRAM.PYTHONAPI.RECOLL.DB">
|
|
<title>The Db class</title>
|
|
|
|
<para>A Db object is created by a <literal>connect()</literal>
|
|
call and holds a connection to a Recoll index.</para>
|
|
<variablelist>
|
|
<varlistentry>
|
|
<term>Db.close()</term>
|
|
<listitem><para>Closes the connection. You can't do anything
|
|
with the <literal>Db</literal> object after
|
|
this.</para></listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term>Db.query(), Db.cursor()</term> <listitem><para>These
|
|
aliases return a blank <literal>Query</literal> object
|
|
for this index.</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>Db.setAbstractParams(maxchars,
|
|
contextwords)</term> <listitem><para>Set the parameters used
|
|
to build snippets (sets of keywords in context text
|
|
fragments). <literal>maxchars</literal> defines the
|
|
maximum total size of the abstract.
|
|
<literal>contextwords</literal> defines how many
|
|
terms are shown around the keyword.</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>Db.termMatch(match_type, expr, field='', maxlen=-1, casesens=False, diacsens=False, lang='english')</term>
|
|
<listitem><para>Expand an expression against the
|
|
index term list. Performs the basic function from the
|
|
GUI term explorer tool. <literal>match_type</literal>
can be one
of <literal>wildcard</literal>, <literal>regexp</literal>
or <literal>stem</literal>. Returns a list of terms
expanded from the input expression.
|
|
</para></listitem>
|
|
</varlistentry>
|
|
|
|
</variablelist>
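<para>As an example of term expansion, the following sketch lists the
index terms which begin with a given string. The
<literal>expand_prefix()</literal> name is hypothetical, and
<literal>db</literal> is assumed to be a <literal>Db</literal> object
returned by <literal>connect()</literal>. The
<literal>maxlen</literal> parameter is passed through as is: its exact
meaning is not detailed here, and it is assumed to bound the size of
the result.</para>

<programlisting><![CDATA[
def expand_prefix(db, prefix, maxlen=20):
    """Return the index terms starting with prefix (db is a recoll Db object)."""
    # A trailing * turns the prefix into a wildcard expression
    return db.termMatch("wildcard", prefix + "*", maxlen=maxlen)
]]></programlisting>

<para>Typical use would be <literal>expand_prefix(recoll.connect(),
"reco")</literal>.</para>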
|
|
|
|
</simplesect>
|
|
<simplesect id="RCL.PROGRAM.PYTHONAPI.RECOLL.CLASSES.QUERY">
|
|
<title>The Query class</title>
|
|
|
|
<para>A <literal>Query</literal> object (equivalent to a
|
|
cursor in the Python DB API) is created by
|
|
a <literal>Db.query()</literal> call. It is used to
|
|
execute index searches.</para>
|
|
|
|
<variablelist>
|
|
|
|
<varlistentry>
|
|
<term>Query.sortby(fieldname, ascending=True)</term>
|
|
<listitem><para>Sort results
|
|
by <replaceable>fieldname</replaceable>, in ascending
|
|
or descending order. Must be called before executing
|
|
the search.</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>Query.execute(query_string, stemming=1, stemlang="english", fetchtext=False, collapseduplicates=False)</term>
|
|
<listitem><para>Starts a search
|
|
for <replaceable>query_string</replaceable>, a &RCL;
|
|
search language string. If the index stores the document
|
|
texts and <literal>fetchtext</literal> is True, store the
|
|
document extracted text in
|
|
<literal>doc.text</literal>.</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>Query.executesd(SearchData, fetchtext=False, collapseduplicates=False)</term>
|
|
<listitem><para>Starts a search for the query defined by
|
|
the SearchData object. If the index stores the document
|
|
texts and <literal>fetchtext</literal> is True, store the
|
|
document extracted text in
|
|
<literal>doc.text</literal>.</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>Query.fetchmany(size=query.arraysize)</term>
|
|
<listitem><para>Fetches
|
|
the next <literal>Doc</literal> objects in the current
|
|
search results, and returns them as an array of the
|
|
required size, which is by default the value of
|
|
the <literal>arraysize</literal> data member.</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>Query.fetchone()</term> <listitem><para>Fetches the
|
|
next <literal>Doc</literal> object from the current
|
|
search results. Generates a StopIteration exception if
|
|
there are no results left.</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>Query.close()</term>
|
|
<listitem><para>Closes the query. The object is unusable
|
|
after the call.</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>Query.scroll(value, mode='relative')</term>
|
|
<listitem><para>Adjusts the position in the current result
|
|
set. <literal>mode</literal> can
|
|
be <literal>relative</literal>
|
|
or <literal>absolute</literal>. </para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>Query.getgroups()</term>
|
|
<listitem><para>Retrieves the expanded query terms as a list
of pairs. Meaningful only after an <literal>execute()</literal> or
<literal>executesd()</literal> call. In each
pair, the first entry is a list of user terms (of size
one for simple terms, or more for group and phrase
clauses), the second a list of query terms as derived
from the user terms and used in the Xapian
Query.</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>Query.getxquery()</term>
|
|
<listitem><para>Return the Xapian query description as a
Unicode string.
Meaningful only after an <literal>execute()</literal> or
<literal>executesd()</literal> call.</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>Query.highlight(text, ishtml = 0, methods = object)</term>
|
|
<listitem><para>Will insert <literal>&lt;span class="rclmatch"&gt;</literal>,
<literal>&lt;/span&gt;</literal> tags around the match areas in the input text
and return the modified text. <literal>ishtml</literal>
can be set to indicate that the input text is HTML and
that HTML special characters should not be escaped.
<literal>methods</literal>, if set, should be an object
with methods <literal>startMatch(i)</literal> and <literal>endMatch()</literal> which will be
called for each match and should return a begin and end
tag.</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>Query.makedocabstract(doc, methods = object)</term>
|
|
<listitem><para>Create a snippets abstract
|
|
for <literal>doc</literal> (a <literal>Doc</literal>
|
|
object) by selecting text around the match terms.
|
|
If methods is set, will also perform highlighting. See
|
|
the highlight method.
|
|
</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>Query.getsnippets(doc, maxoccs = -1, ctxwords = -1,
|
|
sortbypage=False, methods = object)</term>
|
|
<listitem><para>Will return a list of extracts from the result
|
|
document by selecting text around the match terms. Each
|
|
entry in the result list is a triple: page number, term,
|
|
text. By default, the most relevant snippets appear first
|
|
in the list. Set <literal>sortbypage</literal> to sort by
|
|
page number instead. If <literal>methods</literal> is set,
|
|
the fragments will be highlighted (see the highlight
|
|
method). If <literal>maxoccs</literal> is set, it defines
|
|
the maximum result list
|
|
length. <literal>ctxwords</literal> allows adjusting the
|
|
individual snippet context size. </para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>Query.__iter__() and Query.next()</term>
|
|
<listitem><para>So that things like
|
|
<literal>for doc in query:</literal>
|
|
will work.</para></listitem>
|
|
</varlistentry>
|
|
</variablelist>
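<para>The following sketch shows how a <literal>methods</literal>
object could be combined with <literal>getsnippets()</literal>. The
<literal>BracketTags</literal> class and the
<literal>snippet_lines()</literal> function are hypothetical names.
The sketch relies only on what is described above: the object supplies
<literal>startMatch(i)</literal> and <literal>endMatch()</literal>,
and each snippet is a (page number, term, text) triple.</para>

<programlisting><![CDATA[
class BracketTags:
    """Highlighting callbacks which mark matches with square brackets."""
    def startMatch(self, idx):
        # idx is the value passed in for the match; it is not used here
        return "["
    def endMatch(self):
        return "]"

def snippet_lines(query, doc, maxoccs=5):
    """Return one 'p. N: text' line per snippet, sorted by page number."""
    snippets = query.getsnippets(doc, maxoccs=maxoccs, sortbypage=True,
                                 methods=BracketTags())
    return ["p. %s: %s" % (page, text) for page, term, text in snippets]
]]></programlisting>

<para>Here <literal>query</literal> is a <literal>Query</literal>
object on which a search was executed, and <literal>doc</literal> is
one of its results.</para>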
|
|
|
|
<variablelist>
|
|
|
|
<varlistentry>
|
|
<term>Query.arraysize</term>
|
|
<listitem><para>Default number of records processed by
|
|
fetchmany (r/w).</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>Query.rowcount</term>
|
|
<listitem><para>Number of records returned by the last
|
|
execute.</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>Query.rownumber</term>
|
|
<listitem><para>Next index to be fetched from
|
|
results. Normally increments after each fetchone() call, but
|
|
can be set/reset before the call to effect seeking
|
|
(equivalent to using <literal>scroll()</literal>). Starts at
|
|
0.</para></listitem>
|
|
</varlistentry>
|
|
|
|
</variablelist>
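<para>Positioning and fetching can be combined to display results page
by page, as in the sketch below. The <literal>fetch_page()</literal>
name is hypothetical. What happens when scrolling past the end of the
results is not described here, so a real program may want to compare
the requested position with <literal>rowcount</literal> first.</para>

<programlisting><![CDATA[
def fetch_page(query, pageno, pagesize=10):
    """Return the Doc objects for result page pageno (counted from 0)."""
    # Jump to the first result of the page, then read one page of results
    query.scroll(pageno * pagesize, mode="absolute")
    return query.fetchmany(pagesize)
]]></programlisting>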
|
|
|
|
</simplesect>
|
|
<simplesect id="RCL.PROGRAM.PYTHONAPI.RECOLL.CLASSES.DOC">
|
|
<title>The Doc class</title>
|
|
|
|
<para>A <literal>Doc</literal> object contains index data
|
|
for a given document. The data is extracted from the
|
|
index when searching, or set by the indexer program when
|
|
updating. The Doc object has many attributes to be read or
|
|
set by its user. It mostly matches the Rcl::Doc C++
|
|
object. Some of the attributes are predefined, but,
|
|
especially when indexing, others can be set, the name of
|
|
which will be processed as field names by the indexing
|
|
configuration. Inputs can be specified as Unicode or
|
|
strings. Outputs are Unicode objects. All dates are
|
|
specified as Unix timestamps, printed as strings. Please
|
|
refer to the <filename>rcldb/rcldoc.cpp</filename> C++ file
|
|
for a full description of the predefined attributes. Here
|
|
follows a short list.</para>
|
|
|
|
<para><itemizedlist>
|
|
<listitem><para><literal>url</literal> the document URL but
|
|
see also <literal>getbinurl()</literal></para></listitem>
|
|
|
|
<listitem><para><literal>ipath</literal> the document
|
|
<literal>ipath</literal> for embedded
|
|
documents.</para></listitem>
|
|
|
|
<listitem><para><literal>fbytes, dbytes</literal> the document
|
|
file and text sizes.</para></listitem>
|
|
<listitem><para><literal>fmtime, dmtime</literal> the document
|
|
file and document times.</para></listitem>
|
|
|
|
<listitem><para><literal>xdocid</literal> the document
|
|
Xapian document ID. This is useful if you want to access
|
|
the document through a direct Xapian
|
|
operation.</para></listitem>
|
|
|
|
<listitem><para><literal>mtype</literal> the document
|
|
MIME type.</para></listitem>
|
|
|
|
<listitem><para>Fields stored by default:
|
|
<literal>author</literal>, <literal>filename</literal>,
|
|
<literal>keywords</literal>,
|
|
<literal>recipient</literal></para></listitem>
|
|
|
|
</itemizedlist>
|
|
</para>
|
|
|
|
<para>At query time, only the fields that are defined as
|
|
<literal>stored</literal> either by default or in the
|
|
<filename>fields</filename> configuration file will be meaningful
|
|
in the <literal>Doc</literal> object. The document processed text
may be present or not, depending on whether the index stores the text at
all, and if it does, on the <literal>fetchtext</literal> query
|
|
execute option. See also the <literal>rclextract</literal> module
|
|
for accessing document contents.</para>
|
|
|
|
<variablelist>
|
|
|
|
<varlistentry>
|
|
<term>get(key), [] operator</term>
|
|
|
|
<listitem><para>Retrieve the named document
|
|
attribute. You can also use
|
|
<literal>getattr(doc, key)</literal> or
|
|
<literal>doc.key</literal>.</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>doc.key = value</term>
|
|
|
|
<listitem><para>Set the named document attribute. You
|
|
can also use
|
|
<literal>setattr(doc, key, value)</literal>.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>getbinurl()</term>
|
|
|
|
<listitem><para>Retrieve the URL in byte array format (no
|
|
transcoding), for use as parameter to a system
|
|
call.</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>setbinurl(url)</term>
|
|
|
|
<listitem><para>Set the URL in byte array format (no
|
|
transcoding).</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>items()</term>
|
|
<listitem><para>Return a dictionary of doc object
keys/values.</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>keys()</term>
|
|
<listitem><para>Return the list of doc object keys (attribute
|
|
names).</para></listitem>
|
|
</varlistentry>
|
|
</variablelist>
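<para>The accessors above can be used to copy a few fields into a
plain Python dictionary, for example before serializing a result. The
<literal>doc_summary()</literal> name is hypothetical. The sketch
limits itself to fields listed above as stored by default, because
whether predefined attributes such as <literal>url</literal> appear in
<literal>keys()</literal> is not stated here.</para>

<programlisting><![CDATA[
def doc_summary(doc, wanted=("title", "author", "filename")):
    """Copy the selected fields which are present in doc into a dict."""
    present = set(doc.keys())
    # The [] operator retrieves the named attribute
    return {key: doc[key] for key in wanted if key in present}
]]></programlisting>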
|
|
|
|
</simplesect> <!-- Doc -->
|
|
|
|
<simplesect id="RCL.PROGRAM.PYTHONAPI.RECOLL.CLASSES.SEARCHDATA">
|
|
<title>The SearchData class</title>
|
|
|
|
<para>A <literal>SearchData</literal> object allows building
|
|
a query by combining clauses, for execution
|
|
by <literal>Query.executesd()</literal>. It can be used
|
|
in replacement of the query language approach. The
|
|
interface is going to change a little, so no detailed doc
|
|
for now...</para>
|
|
|
|
<variablelist>
|
|
|
|
<varlistentry>
|
|
<term>addclause(type='and'|'or'|'excl'|'phrase'|'near'|'sub',
|
|
qstring=string, slack=0, field='', stemming=1,
|
|
subSearch=SearchData)</term>
|
|
<listitem><para></para></listitem>
|
|
</varlistentry>
|
|
</variablelist>
|
|
|
|
</simplesect> <!-- SearchData -->
|
|
|
|
</sect3> <!-- Recoll module -->
|
|
|
|
<sect3 id="RCL.PROGRAM.PYTHONAPI.RCLEXTRACT">
|
|
<title>The rclextract module</title>
|
|
|
|
|
|
<para>Prior to &RCL; 1.25, index queries could not provide document
|
|
content because it was never stored. &RCL; 1.25 and later usually
|
|
store the document text, which can be optionally retrieved when
|
|
running a query (see <literal>query.execute()</literal>
|
|
above - the result is always plain text).</para>
|
|
|
|
<para>The <literal>rclextract</literal> module can give access to
|
|
the original document and to the document text content (if not
|
|
stored by the index, or to access an HTML version of the text).
|
|
Accessing the original document is particularly useful if it is
|
|
embedded (e.g. an email attachment).</para>
|
|
|
|
<para>You need to import the <literal>recoll</literal> module
|
|
before the <literal>rclextract</literal> module.</para>
|
|
|
|
<simplesect id="RCL.PROGRAM.PYTHONAPI.RCLEXTRACT.CLASSES.EXTRACTOR">
|
|
<title>The Extractor class</title>
|
|
|
|
<variablelist>
|
|
|
|
<varlistentry>
|
|
<term>Extractor(doc)</term>
|
|
<listitem><para>An <literal>Extractor</literal> object is
|
|
built from a <literal>Doc</literal> object, output
|
|
from a query.</para></listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term>Extractor.textextract(ipath)</term>
|
|
<listitem><para>Extract document defined by
|
|
<replaceable>ipath</replaceable> and return a
|
|
<literal>Doc</literal> object. The
|
|
<literal>doc.text</literal> field has the document text
|
|
converted to either text/plain or text/html according to
|
|
<literal>doc.mimetype</literal>. The typical use would be
|
|
as follows:</para>
|
|
<programlisting>
from recoll import recoll, rclextract

# query is a recoll Query object on which execute() was called
qdoc = query.fetchone()
extractor = rclextract.Extractor(qdoc)
doc = extractor.textextract(qdoc.ipath)
# use doc.text, e.g. for previewing</programlisting>
|
|
|
|
<para>Passing <literal>qdoc.ipath</literal> to
|
|
<literal>textextract()</literal> is redundant, but
|
|
reflects the fact that the <literal>Extractor</literal>
|
|
object actually has the capability to access the other
|
|
entries in a compound document.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term>Extractor.idoctofile(ipath, targetmtype, outfile='')</term>
|
|
<listitem><para>Extracts document into an output file,
|
|
which can be given explicitly or will be created as a
|
|
temporary file to be deleted by the caller. Typical
|
|
use:</para>
|
|
<programlisting>
from recoll import recoll, rclextract

# query is a recoll Query object on which execute() was called
qdoc = query.fetchone()
extractor = rclextract.Extractor(qdoc)
filename = extractor.idoctofile(qdoc.ipath, qdoc.mimetype)</programlisting>
|
|
|
|
<para>In all cases the output is a copy, even if the
|
|
requested document is a regular system file, which may be
|
|
wasteful in some cases. If you want to avoid this, you
|
|
can test for a simple file document as follows:
|
|
<programlisting>
|
|
not doc.ipath and ("rclbes" not in doc.keys() or doc["rclbes"] == "FS")
|
|
</programlisting>
|
|
</para></listitem>
|
|
</varlistentry>
|
|
|
|
</variablelist>
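<para>The simple file test given above can be wrapped in a small
function, so that a program only asks for a copy when it needs one.
The <literal>is_plain_file()</literal> name is hypothetical. The
condition itself is the one shown above, and <literal>doc</literal>
is a <literal>Doc</literal> object from a query.</para>

<programlisting><![CDATA[
def is_plain_file(doc):
    """True if doc is a regular file, as opposed to an embedded document."""
    # Empty ipath: the document is not embedded in a container.
    # rclbes absent or equal to "FS": the document comes from the
    # filesystem indexer.
    return not doc.ipath and ("rclbes" not in doc.keys() or doc["rclbes"] == "FS")
]]></programlisting>

<para>When the function returns <literal>True</literal>, the file
designated by the document URL can be used directly instead of calling
<literal>idoctofile()</literal>.</para>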
|
|
|
|
</simplesect>
|
|
</sect3> <!-- rclextract module -->
|
|
|
|
|
|
<sect3 id="RCL.PROGRAM.PYTHONAPI.SEARCH.EXAMPLE">
|
|
<title>Search API usage example</title>
|
|
|
|
<para>The following sample would query the index with a user
|
|
language string. See the <filename>python/samples</filename>
|
|
directory inside the &RCL; source for other
|
|
examples. The <filename>recollgui</filename> subdirectory
|
|
has a very embryonic GUI which demonstrates the
|
|
highlighting and data extraction functions.</para>
|
|
|
|
<programlisting><![CDATA[
#!/usr/bin/python3

from recoll import recoll

db = recoll.connect()
db.setAbstractParams(maxchars=80, contextwords=4)

query = db.query()
nres = query.execute("some user question")
print("Result count: %d" % nres)
# Show at most the first five results
if nres > 5:
    nres = 5
for i in range(nres):
    doc = query.fetchone()
    print("Result #%d" % (query.rownumber))
    for k in ("title", "size"):
        print("%s : %s" % (k, getattr(doc, k)))
    print("%s\n" % db.makeDocAbstract(doc, query))
]]></programlisting>
|
|
|
|
</sect3>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="RCL.PROGRAM.PYTHONAPI.UPDATE">
|
|
<title>Creating Python external indexers</title>
|
|
|
|
<para>The update API can be used to create an index from data which
|
|
is not accessible to the regular &RCL; indexer, or structured to
|
|
present difficulties to the &RCL; input handlers.</para>
|
|
|
|
<para>An indexer created using this API will have work
to do equivalent to that of the &RCL; file system indexer: look for modified
documents, extract their text, call the API for indexing it, and take
care of purging the index of data from documents which do not
exist in the document store any more.</para>
|
|
|
|
<para>The data for such an external indexer should be stored in an
|
|
index separate from any used by the &RCL; internal file system
|
|
indexer. The reason is that the main document indexer purge pass
|
|
(removal of deleted documents) would also remove all the documents
|
|
belonging to the external indexer, as they were not seen during the
|
|
filesystem walk. The main indexer documents would also probably be a
|
|
problem for the external indexer's own purge operation.</para>
|
|
|
|
<para>While there would be ways to enable multiple foreign indexers
|
|
to cooperate on a single index, it is just simpler to use separate
|
|
ones, and use the multiple index access capabilities of the query
|
|
interface, if needed.</para>
|
|
|
|
<para>There are two parts in the update interface:</para>
|
|
|
|
<itemizedlist>
|
|
<listitem><para>Methods inside the <filename>recoll</filename>
|
|
module allow inserting data into the index, to make it accessible by
|
|
the normal query interface.</para></listitem>
|
|
<listitem><para>An interface based on scripts execution is defined
|
|
to allow either the GUI or the <filename>rclextract</filename>
|
|
module to access original document data for previewing or
|
|
editing.</para></listitem>
|
|
</itemizedlist>
|
|
|
|
<sect3 id="RCL.PROGRAM.PYTHONAPI.UPDATE.UPDATE">
|
|
<title>Python update interface</title>
|
|
|
|
<para>The update methods are part of the
|
|
<filename>recoll</filename> module described above. The connect()
|
|
method is used with a <literal>writable=True</literal> parameter to
|
|
obtain a writable <literal>Db</literal> object. The following
|
|
<literal>Db</literal> object methods are then available.</para>
|
|
|
|
<variablelist>
|
|
|
|
<varlistentry>
|
|
<term>addOrUpdate(udi, doc, parent_udi=None)</term>
|
|
<listitem><para>Add or update index data for a given document.
|
|
The
|
|
<literal><link linkend="RCL.PROGRAM.PYTHONAPI.ELEMENTS.UDI">udi</link></literal>
|
|
string must define a unique id for
|
|
the document. It is an opaque interface element and not
|
|
interpreted inside Recoll. <literal>doc</literal> is a
|
|
<literal><link linkend="RCL.PROGRAM.PYTHONAPI.RECOLL.CLASSES.DOC">Doc</link></literal>
|
|
object, created from the data to be
|
|
indexed (the main text should be in
|
|
<literal>doc.text</literal>). If
|
|
<literal><link linkend="RCL.PROGRAM.PYTHONAPI.ELEMENTS.PARENTUDI">parent_udi</link></literal>
|
|
is set, this is a unique identifier for the top-level
|
|
container (e.g. for the filesystem indexer, this would
|
|
be the one which is an actual file).</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>delete(udi)</term>
|
|
<listitem><para>Purge index from all data for
|
|
<literal>udi</literal>, and all documents (if any) which have a
|
|
matching <literal>parent_udi</literal>. </para> </listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>needUpdate(udi, sig)</term>
|
|
<listitem><para>Test if the index needs to be updated for the
|
|
document identified by <literal>udi</literal>. If this call is
|
|
to be used, the <literal>doc.sig</literal> field should contain
|
|
a signature value when calling
|
|
<literal>addOrUpdate()</literal>. The
|
|
<literal>needUpdate()</literal> call then compares its
|
|
parameter value with the stored <literal>sig</literal> for
|
|
<literal>udi</literal>. <literal>sig</literal> is an opaque
|
|
value, compared as a string.</para>
|
|
<para>The filesystem indexer uses a
|
|
concatenation of the decimal string values for file size and
|
|
update time, but a hash of the contents could also be
|
|
used.</para>
|
|
<para>As a side effect, if the return value is false (the index
|
|
is up to date), the call will set the existence flag for the
|
|
document (and any subdocument defined by its
|
|
<literal>parent_udi</literal>), so that a later
|
|
<literal>purge()</literal> call will preserve them.</para>
|
|
<para>The use of <literal>needUpdate()</literal> and
|
|
<literal>purge()</literal> is optional, and the indexer may use
|
|
another method for checking the need to reindex or to delete
|
|
stale entries.</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>purge()</term>
|
|
<listitem><para>Delete all documents that were not touched
|
|
during the just finished indexing pass (since
|
|
open-for-write). These are the documents for which the needUpdate()
|
|
call was not performed, indicating that they no longer exist in
|
|
the primary storage system.</para></listitem>
|
|
</varlistentry>
|
|
|
|
</variablelist>
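<para>As an illustration, the calls above could be combined as in
the following minimal sketch, which is not taken from the &RCL;
samples. It assumes that every document is a plain text file under a
single directory, that the file path can be used as the
<literal>udi</literal>, and that the signature is built from the file
size and modification time. The <literal>make_sig</literal> and
<literal>index_tree</literal> names and the <literal>MYIDX</literal>
backend identifier are hypothetical. <literal>db</literal> is expected
to be a writable <literal>Db</literal> object and
<literal>new_doc</literal> a callable returning an empty
<literal>Doc</literal> (e.g. <literal>recoll.Doc</literal>).</para>

<programlisting><![CDATA[
import os


def make_sig(path):
    """Opaque signature: decimal file size followed by modification time."""
    st = os.stat(path)
    return "%d%d" % (st.st_size, int(st.st_mtime))


def index_tree(db, new_doc, topdir, backend="MYIDX"):
    """Index all files under topdir, then purge vanished documents.

    db: writable Db object, e.g. from recoll.connect(writable=True).
    new_doc: callable returning an empty Doc, e.g. recoll.Doc.
    Returns the number of documents which were (re)indexed.
    """
    updated = 0
    for dirpath, _dirnames, filenames in os.walk(topdir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            udi = path  # assumption: the path is usable as a unique id
            sig = make_sig(path)
            # A false return value means that the index is up to date.
            # The call also marks the document as existing, so that
            # purge() will preserve it.
            if not db.needUpdate(udi, sig):
                continue
            doc = new_doc()
            doc.url = "file://" + path
            doc.mimetype = "text/plain"
            doc.sig = sig
            doc.rclbes = backend
            with open(path, encoding="utf-8", errors="replace") as f:
                doc.text = f.read()
            db.addOrUpdate(udi, doc)
            updated += 1
    # Remove index data for documents which were not seen during this pass.
    db.purge()
    return updated
]]></programlisting>

<para>Passing the <literal>Db</literal> object and the
<literal>Doc</literal> factory as parameters is only a convenience
for trying the loop without a real index. An actual indexer would
probably create them itself.</para>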
|
|
|
|
</sect3>
|
|
|
|
<sect3 id="RCL.PROGRAM.PYTHONAPI.UPDATE.ACCESS">
|
|
<title>Query data access for external indexers (1.23)</title>
|
|
|
|
<para>&RCL; has internal methods to access document data for its
|
|
internal (filesystem) indexer. An external indexer needs to provide
|
|
data access methods if it needs integration with the GUI
|
|
(e.g. preview function), or support for the
|
|
<filename>rclextract</filename> module.</para>
|
|
|
|
<para>The index data and the access method are linked by the
|
|
<literal>rclbes</literal> (recoll backend storage)
|
|
<literal>Doc</literal> field. You should set this to a short string
|
|
value identifying your indexer (e.g. the filesystem indexer uses either
|
|
"FS" or an empty value, the Web history indexer uses "BGL").</para>
|
|
|
|
<para>The link is actually performed inside a
|
|
<filename>backends</filename> configuration file (stored in the
|
|
configuration directory). This defines commands to execute to
|
|
access data from the specified indexer. For example, for the mbox
|
|
indexing sample found in the Recoll source (which sets
|
|
<literal>rclbes="MBOX"</literal>):</para>
|
|
<programlisting>[MBOX]
|
|
fetch = /path/to/recoll/src/python/samples/rclmbox.py fetch
|
|
makesig = /path/to/recoll/src/python/samples/rclmbox.py makesig
|
|
</programlisting>
|
|
|
|
<para><literal>fetch</literal> and <literal>makesig</literal>
|
|
define two commands to execute to respectively retrieve the
|
|
document text and compute the document signature (the example
|
|
implementation uses the same script with different first parameters
|
|
to perform both operations).</para>
|
|
|
|
<para>The scripts are called with three additional arguments:
|
|
<literal>udi</literal>, <literal>url</literal>,
|
|
<literal>ipath</literal>, stored with the document when it was
|
|
indexed, and may use any or all to perform the requested
|
|
operation. The caller expects the result data on
|
|
<literal>stdout</literal>.</para>
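<para>The calling convention described above could be handled by a
script shaped like the following sketch. This is not the
<filename>rclmbox.py</filename> code: it assumes that the documents
were indexed with <literal>file://</literal> URLs pointing at plain
files, and that the signature is the file size followed by the
modification time. All function names are hypothetical.</para>

<programlisting><![CDATA[
import os
import sys


def url_to_path(url):
    """Assumption: the indexer stored plain file:// URLs."""
    prefix = "file://"
    if not url.startswith(prefix):
        raise ValueError("unsupported url: " + url)
    return url[len(prefix):]


def fetch(udi, url, ipath, out):
    """Write the document data to the out binary stream."""
    with open(url_to_path(url), "rb") as f:
        out.write(f.read())


def makesig(udi, url, ipath, out):
    """Write the document signature to the out binary stream."""
    st = os.stat(url_to_path(url))
    out.write(("%d%d" % (st.st_size, int(st.st_mtime))).encode("ascii"))


def main(argv, out):
    # Expected argv: [script, operation, udi, url, ipath]
    if len(argv) != 5 or argv[1] not in ("fetch", "makesig"):
        sys.stderr.write("usage: %s fetch|makesig udi url ipath\n" % argv[0])
        return 1
    operation = fetch if argv[1] == "fetch" else makesig
    operation(argv[2], argv[3], argv[4], out)
    return 0

# A real script would end with:
#   sys.exit(main(sys.argv, sys.stdout.buffer))
]]></programlisting>

<para>This sketch ignores <literal>udi</literal> and
<literal>ipath</literal>. An indexer storing several documents per
file (like an mbox folder) would need at least one of them to locate
the right entry.</para>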
|
|
|
|
</sect3>
|
|
|
|
<sect3 id="RCL.PROGRAM.PYTHONAPI.UPDATE.SAMPLES">
|
|
<title>External indexer samples</title>
|
|
|
|
<para>The Recoll source tree has two samples of external indexers
|
|
in the <filename>src/python/samples</filename> directory. The more
|
|
interesting one is <filename>rclmbox.py</filename> which indexes a
|
|
directory containing <literal>mbox</literal> folder files. It
|
|
exercises most features in the update interface, and has a data
|
|
access interface.</para>
|
|
|
|
<para>See the comments inside the file for more information.</para>
|
|
</sect3>
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.PROGRAM.PYTHONAPI.COMPAT">
|
|
<title>Package compatibility with the previous version</title>
|
|
|
|
<para>The following code fragments can be used to ensure that
|
|
code can run with both the old and the new API (as long as it
|
|
does not use the new abilities of the new API of
|
|
course).</para>
|
|
|
|
<para>Adapting to the new package structure:</para>
|
|
<programlisting><![CDATA[
try:
    # New package structure
    from recoll import recoll
    from recoll import rclextract
    hasextract = True
except ImportError:
    # Old single-module structure
    import recoll
    hasextract = False
]]></programlisting>
|
|
|
|
<para>Adapting to the change of nature of
|
|
the <literal>next</literal> <literal>Query</literal>
|
|
member. The same test can be used to choose to use
|
|
the <literal>scroll()</literal> method (new) or set
|
|
the <literal>next</literal> value (old).</para>
|
|
|
|
<programlisting><![CDATA[rownum = query.next if type(query.next) == int else query.rownumber]]></programlisting>
|
|
|
|
</sect2> <!-- compat with previous version -->
|
|
|
|
|
|
</sect1>
|
|
</chapter>
|
|
|
|
|
|
<chapter id="RCL.INSTALL">
|
|
<title>Installation and configuration</title>
|
|
|
|
<sect1 id="RCL.INSTALL.BINARY">
|
|
<title>Installing a binary copy</title>
|
|
|
|
|
|
<para>&RCL; binary copies are always distributed as regular
|
|
packages for your system. They can be obtained either through
|
|
the system's normal software distribution framework (e.g.
|
|
<application>Debian/Ubuntu apt</application>,
|
|
<application>FreeBSD</application> ports, etc.), or from some type
|
|
of "backports" repository providing versions newer than the standard
|
|
ones, or found on the &RCL; Web site in some
|
|
cases. The most up-to-date information about Recoll packages can
|
|
usually be found on the
|
|
<ulink url="http://www.recoll.org/pages/download.html">
|
|
<application>Recoll</application> Web site downloads
|
|
page</ulink>.</para>
|
|
|
|
<para>The &WIN; version of Recoll comes in a self-contained setup
file: there is nothing else to install.</para>
|
|
|
|
<para>On &LIN;, the package management tools will automatically
|
|
install hard dependencies for packages obtained from a proper package
|
|
repository. You will have to deal with them by hand for downloaded
|
|
packages (for example, when <command>dpkg</command> complains about
|
|
missing dependencies).</para>
|
|
|
|
<para>In all cases, you will have to check or install
|
|
<link linkend="RCL.INSTALL.EXTERNAL">supporting applications</link>
|
|
for the file types that you want to index beyond those that are
|
|
natively processed by &RCL; (text, HTML, email files, and a few
|
|
others).</para>
|
|
|
|
<para>You may also want to have a look at the
|
|
<link linkend="RCL.INSTALL.CONFIG">configuration section</link>
|
|
(but this may not be necessary for a quick test with default
|
|
parameters). Most parameters can be more conveniently set from the
|
|
GUI.</para>
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="RCL.INSTALL.EXTERNAL">
|
|
<title>Supporting packages</title>
|
|
|
|
<note><para>The &WIN; installation of &RCL; is self-contained.
|
|
&WIN; users can skip this section.</para></note>
|
|
|
|
<para>&RCL; uses external applications to index some file
|
|
types. You need to install them for the file types that you wish to
|
|
have indexed (these are run-time optional dependencies. None is
|
|
needed for building or running &RCL; except for indexing their
|
|
specific file type).</para>
|
|
|
|
<para>After an indexing pass, the commands that were found
|
|
missing can be displayed from the <command>recoll</command>
|
|
<guilabel>File</guilabel> menu. The list is stored in the
|
|
<filename>missing</filename> text file inside the configuration
|
|
directory.</para>
|
|
|
|
<para>The past has proven that I was unable to maintain an up to date
|
|
application list in this manual. Please check &RCLAPPS; for a
|
|
complete list along with links to the home pages or best
|
|
source/patches pages, and misc tips. What follows is only a
|
|
very short extract of the stable essentials.</para>
|
|
|
|
<itemizedlist>
|
|
|
|
<listitem><para>PDF files need <command>pdftotext</command>
|
|
which is part of <application>Poppler</application> (usually
|
|
comes with the <literal>poppler-utils</literal>
|
|
package). Avoid the original one from
|
|
<application>Xpdf</application>.</para></listitem>
|
|
|
|
<listitem><para>MS Word documents need
|
|
<command>antiword</command>. It is also useful to have
|
|
<command>wvWare</command> installed as it may be
used as a fallback for some files which
|
|
<command>antiword</command> does not handle.</para></listitem>
|
|
|
|
<listitem><para>RTF files need <command>unrtf</command>,
|
|
which, in its older versions, has much trouble with
|
|
non-western character sets. Many Linux distributions carry
|
|
outdated <command>unrtf</command> versions. Check
|
|
&RCLAPPS; for details.</para></listitem>
|
|
|
|
<listitem><para>Pictures: &RCL; uses the
|
|
<application>Exiftool</application>
|
|
<application>Perl</application> package to extract tag
|
|
information. Most image file formats are
|
|
supported.</para></listitem>
|
|
|
|
<listitem><para>Up to &RCL; 1.24, many XML-based formats need the
|
|
<command>xsltproc</command> command, which usually comes with
|
|
<application>libxslt</application>. These are: abiword, fb2
|
|
ebooks, kword, openoffice, opendocument, svg. &RCL; 1.25 and later
|
|
process them internally (using libxslt).</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
|
|
|
|
</sect1>
|
|
|
|
|
|
<sect1 id="RCL.INSTALL.BUILDING">
|
|
<title>Building from source</title>
|
|
|
|
<sect2 id="RCL.INSTALL.BUILDING.PREREQS">
|
|
<title>Prerequisites</title>
|
|
|
|
<para>The following prerequisites are described in broad terms and
|
|
not as specific package names (which will depend on the exact
|
|
platform). The dependencies should be available as packages on most
|
|
common Unix derivatives, and it should be quite uncommon that you
|
|
would have to build one of them.</para>
|
|
|
|
<para>If you do not need the GUI, you can avoid all GUI
|
|
dependencies by disabling its build. (See the configure section
|
|
further).</para>
|
|
|
|
<para>The shopping list:</para>
|
|
<itemizedlist>
|
|
|
|
<listitem><para>If you start from git code, you will need the
|
|
<command>autoconf</command>, <command>automake</command> and
|
|
<command>libtool</command> triad. They are not needed for
|
|
building from tar distributions.</para></listitem>
|
|
|
|
<listitem><para>C++ compiler. Recent versions require C++11
|
|
compatibility (1.23 and later).</para></listitem>
|
|
|
|
<listitem><para><command>bison</command> command (for &RCL; 1.21
|
|
and later).</para></listitem>
|
|
|
|
<listitem><para>For building the documentation: the
|
|
<command>xsltproc</command> command, and the Docbook XML and
|
|
style sheet files. You can avoid this dependency by disabling
|
|
documentation building with the
|
|
<literal>--disable-userdoc</literal> <command>configure</command>
|
|
option.</para></listitem>
|
|
|
|
<listitem><para>Development files
|
|
for <ulink url="http://www.xapian.org"> <application>Xapian
|
|
core</application></ulink>.</para>
|
|
<important>
|
|
<para>If you are
|
|
building Xapian for an older CPU (before Pentium 4 or Athlon
|
|
64), you need to add the <option>--disable-sse</option> flag
|
|
to the configure command. Otherwise all Xapian applications will
|
|
crash with an <literal>illegal instruction</literal>
|
|
error.</para>
|
|
</important>
|
|
</listitem>
|
|
|
|
<listitem> <para>Development files for
|
|
<ulink url="http://qt-project.org/downloads"><application>Qt 5</application></ulink>
and its own dependencies (X11 etc.).</para> </listitem>
|
|
|
|
<listitem><para>Development files for libxslt</para></listitem>
|
|
|
|
<listitem><para>Development files for
|
|
<application>zlib</application>.</para> </listitem>
|
|
|
|
<listitem><para>Development files for
|
|
<application>Python</application> (or use
|
|
<literal>--disable-python-module</literal>).</para></listitem>
|
|
|
|
<listitem><para>Development files for libchm</para></listitem>
|
|
|
|
<listitem><para>You may also need
|
|
<ulink url="http://www.gnu.org/software/libiconv/">libiconv</ulink>.
|
|
|
|
On <application>Linux</application> systems, the iconv
|
|
interface is part of libc and you should not need to do
|
|
anything special.</para></listitem>
|
|
|
|
</itemizedlist>
|
|
|
|
<para>Check the <ulink url="http://www.recoll.org/pages/download.html">
|
|
&RCL; download page</ulink> for up to date version
|
|
information.</para>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.INSTALL.BUILDING.BUILDING">
|
|
<title>Building</title>
|
|
|
|
<para>&RCL; has been built on Linux, FreeBSD, Mac OS X, and Solaris.
Most versions after 2005 should be ok, maybe some older ones too
|
|
(Solaris 8 used to be ok). If you build on another system, and
|
|
need to modify things,
|
|
<ulink url="mailto:jfd@recoll.org">I would
|
|
very much welcome patches</ulink>.</para>
|
|
|
|
|
|
<formalpara>
|
|
<title>Configure options:</title>
|
|
<para>
|
|
<itemizedlist>
|
|
|
|
<listitem><para><option>--without-aspell</option>
|
|
will disable the code for phonetic matching of search
|
|
terms. </para></listitem>
|
|
|
|
<listitem><para><option>--with-fam</option> or
|
|
<option>--with-inotify</option> will enable the code for real
|
|
time indexing. Inotify support is enabled by default on Linux
|
|
systems.</para></listitem>
|
|
|
|
|
|
<listitem><para><option>--with-qzeitgeist</option> will
|
|
enable sending <application>Zeitgeist</application>
|
|
events about the visited search results, and needs
|
|
the <application>qzeitgeist</application>
|
|
package.</para></listitem>
|
|
|
|
<listitem><para><option>--disable-webkit</option> is available
|
|
from version 1.17 to implement the result list with a
|
|
<application>Qt</application> QTextBrowser instead of a
|
|
WebKit widget if you do not or can't depend on the
|
|
latter.</para></listitem>
|
|
|
|
<listitem><para><option>--disable-qtgui</option> Disable the Qt
|
|
interface. Will allow building the indexer and the command line
|
|
search program in the absence of a Qt environment.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><option>--enable-webengine</option> Enable the
|
|
use of Qt Webengine (only meaningful if the Qt GUI
|
|
is enabled), in place of Qt Webkit.</para></listitem>
|
|
|
|
<listitem><para><option>--disable-idxthreads</option> is available
|
|
from version 1.19 to suppress multithreading inside the
|
|
indexing process. You can also use the run-time
|
|
configuration to restrict <command>recollindex</command>
|
|
to using a single thread, but the compile-time option
|
|
may disable a few more unused locks. This only applies
|
|
to the use of multithreading for the core index
|
|
processing (data input). The &RCL; monitor mode always
|
|
uses at least two threads of execution.</para></listitem>
|
|
|
|
<listitem><para><option>--disable-python-module</option> will
|
|
avoid building the <application>Python</application>
|
|
module.</para></listitem>
|
|
|
|
<listitem><para><option>--disable-python-chm</option> will
|
|
avoid building the Python libchm interface used to index CHM
|
|
files.</para></listitem>
|
|
|
|
<listitem><para><option>--enable-camelcase</option> will enable
|
|
splitting <replaceable>camelCase</replaceable> words. This
|
|
is not enabled by default as it has the unfortunate
|
|
side-effect of making some phrase searches quite
|
|
confusing: for example, <literal>"MySQL manual"</literal> would be
|
|
matched by <literal>"MySQL manual"</literal> and
|
|
<literal>"my sql manual"</literal> but not
|
|
<literal>"mysql manual"</literal> (only inside phrase
|
|
searches).</para>
|
|
</listitem>
|
|
|
|
<listitem><para><option>--with-file-command</option> Specify
|
|
the version of the 'file' command to use (e.g.
|
|
--with-file-command=/usr/local/bin/file). Can be useful to
|
|
enable the gnu version on systems where the native one is
|
|
bad.</para> </listitem>
|
|
|
|
<listitem><para><option>--disable-x11mon</option> Disable
|
|
<application>X11</application> connection monitoring
|
|
inside recollindex. Together with --disable-qtgui, this
|
|
allows building recoll without
|
|
<application>Qt</application> and
|
|
<application>X11</application>.</para> </listitem>
|
|
|
|
<listitem><para><option>--disable-userdoc</option>
|
|
will avoid building the user manual. This avoids having to
|
|
install the Docbook XML/XSL files and the TeX toolchain used for
|
|
translating the manual to PDF.</para></listitem>
|
|
|
|
<listitem><para><option>--enable-recollq</option> Enable
|
|
building the <command>recollq</command> command line query
|
|
tool (recoll -t without need for Qt). This is done by
|
|
default if --disable-qtgui is set but this option
|
|
enables forcing it.</para></listitem>
|
|
|
|
<listitem><para><option>--disable-pic</option> (&RCL; versions up
|
|
to 1.21 only) will compile
|
|
&RCL; with position-dependent code. This is incompatible with
|
|
building the KIO or the <application>Python</application>
|
|
or <application>PHP</application> extensions, but might
|
|
yield very marginally faster code.</para></listitem>
|
|
|
|
<listitem><para>Of course the usual
|
|
<application>autoconf</application> <command>configure</command>
|
|
options, like <option>--prefix</option> apply.</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
</formalpara>
|
|
|
|
<para>Normal procedure (for source extracted from a tar
|
|
distribution):</para>
|
|
<screen>
|
|
<userinput>cd recoll-xxx</userinput>
|
|
<userinput>./configure</userinput>
|
|
<userinput>make</userinput>
|
|
<userinput>(practices usual hardship-repelling invocations)</userinput>
|
|
</screen>
|
|
|
|
<para>When building from source cloned from the git repository,
|
|
you also need to install <application>autoconf</application>,
|
|
<application>automake</application>, and
|
|
<application>libtool</application> and you must execute
|
|
<literal>sh autogen.sh</literal> in the top source directory
|
|
before running <literal>configure</literal>.</para>
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.INSTALL.BUILDING.INSTALL">
|
|
<title>Installing</title>
|
|
|
|
<para>Use <userinput>make install</userinput>
|
|
in the root
|
|
of the source tree. This will copy the commands to
|
|
<filename><replaceable>prefix</replaceable>/bin</filename>
|
|
and the sample configuration files, scripts and other shared
|
|
data to
|
|
<filename><replaceable>prefix</replaceable>/share/recoll</filename>.
|
|
</para>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.INSTALL.BUILDING.PYTHON">
|
|
<title>Python API package</title>
|
|
|
|
<para>The Python interface can be found in the source tree,
|
|
under the <filename>python/recoll</filename> directory.</para>
|
|
|
|
<para>As of &RCL; 1.19, the module can be compiled for
|
|
Python3.</para>
|
|
|
|
<para>The normal &RCL; build procedure (see above) installs the API
|
|
package for the default system version (python) along with the main
|
|
code. The package for other Python versions (e.g. python3 if the
|
|
system default is python2) must be explicitly built and
|
|
installed.</para>
|
|
|
|
<para>The <filename>python/recoll/</filename> directory contains
|
|
the usual <filename>setup.py</filename>. After configuring and
|
|
building the main &RCL; code, you can use the script to build and
|
|
install the Python module:
|
|
<screen>
|
|
<userinput>cd recoll-xxx/python/recoll</userinput>
|
|
<userinput>pythonX setup.py build</userinput>
|
|
<userinput>sudo pythonX setup.py install</userinput>
|
|
</screen>
|
|
</para>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.INSTALL.BUILDING.SOLARIS">
|
|
<title>Building on Solaris</title>
|
|
|
|
<para>We did not test building the GUI on Solaris for recent
|
|
versions. You will need at least Qt 4.4. There are some hints
|
|
on <ulink url="http://www.recoll.org/download-1.14.html">an old
|
|
web site page</ulink>; they may still be valid.</para>
|
|
|
|
<para>Someone did test the 1.19 indexer and Python module build:
they do work, with a few minor glitches. Be sure to use
|
|
GNU <command>make</command> and <command>install</command>.</para>
|
|
</sect2>
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="RCL.INSTALL.CONFIG">
|
|
<title>Configuration overview</title>
|
|
|
|
<para>Most of the parameters specific to the
|
|
<command>recoll</command> GUI are set through the
|
|
<guilabel>Preferences</guilabel> menu and stored in the standard Qt
|
|
place (<filename>$HOME/.config/Recoll.org/recoll.conf</filename>).
|
|
You probably do not want to edit this by hand.</para>
|
|
|
|
<para>&RCL; indexing options are set inside text configuration
|
|
files located in a configuration directory. There can be
|
|
several such directories, each of which defines the parameters
|
|
for one index.</para>
|
|
|
|
<para>The configuration files can be edited by hand or through
|
|
the <guilabel>Index configuration</guilabel> dialog
|
|
(<guilabel>Preferences</guilabel> menu). The GUI tool will try
|
|
to respect your formatting and comments as much as possible,
|
|
so it is quite possible to use both approaches on the same
|
|
configuration.</para>
|
|
|
|
<para>The most accurate documentation for the
|
|
configuration parameters is given by comments inside the default
|
|
files, and we will just give a general overview here.</para>
|
|
|
|
<para>For each index, there are at least two sets of
|
|
configuration files. System-wide configuration files are kept
|
|
in a directory named
|
|
like <filename>/usr/share/recoll/examples</filename>,
|
|
and define default values, shared by all indexes. For each
|
|
index, a parallel set of files defines the customized
|
|
parameters.</para>
|
|
|
|
<para>The default location of the customized configuration is the
|
|
<filename>.recoll</filename>
|
|
directory in your home. Most people will only use this
|
|
directory.</para>
|
|
|
|
<para>This location can be changed, or others can be added with the
|
|
<envar>RECOLL_CONFDIR</envar> environment variable or the
|
|
<option>-c</option> option parameter to <command>recoll</command> and
|
|
<command>recollindex</command>.</para>
|
|
|
|
<para>In addition (as of &RCL; version 1.19.7), it is possible
|
|
to specify two additional configuration directories which will
|
|
be stacked before and after the user configuration
|
|
directory. These are defined by
|
|
the <envar>RECOLL_CONFTOP</envar>
|
|
and <envar>RECOLL_CONFMID</envar> environment
|
|
variables. Values from configuration files inside the top
|
|
directory will override user ones, values from configuration
|
|
files inside the middle directory will override system ones
|
|
and be overridden by user ones. These two variables may be of
|
|
use to applications which augment &RCL; functionality, and
|
|
need to add configuration data without disturbing the user's
|
|
files. Please note that the two, currently single, values will
|
|
probably be interpreted as colon-separated lists in the
|
|
future: do not use colon characters inside the directory
|
|
paths.</para>
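<para>The resulting lookup order can be pictured with the following
sketch. It only illustrates the priorities described above (top, then
user, then middle, then system) and does not reflect how &RCL;
actually implements them. The <literal>stacked_config</literal>
function is hypothetical and the parameter values are made up for the
example.</para>

<programlisting><![CDATA[
from collections import ChainMap


def stacked_config(system, user, top=None, mid=None):
    """Return a view of the layers where the earlier ones win.

    Priority, highest first: top, user, mid, system.
    """
    layers = [top or {}, user, mid or {}, system]
    return ChainMap(*layers)


# Illustrative values only
system = {"loglevel": "3", "indexstemminglanguages": "english"}
mid = {"loglevel": "2", "topdirs": "~/shared"}
user = {"topdirs": "~/docs"}
top = {"loglevel": "4"}

conf = stacked_config(system, user, top=top, mid=mid)
print(conf["loglevel"])                # from the top layer
print(conf["topdirs"])                 # the user value overrides the middle one
print(conf["indexstemminglanguages"])  # only set at the system level
]]></programlisting>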
|
|
|
|
<para>If the <filename>.recoll</filename> directory does not
|
|
exist when <command>recoll</command> or
|
|
<command>recollindex</command> are started, it will be created
|
|
with a set of empty configuration files.
|
|
<command>recoll</command> will give you a chance to edit the
|
|
configuration file before starting
|
|
indexing. <command>recollindex</command> will proceed
|
|
immediately. To avoid mistakes, the automatic directory
|
|
creation will only occur for the
|
|
default location, not if <option>-c</option> or
|
|
<envar>RECOLL_CONFDIR</envar> were used (in the latter
|
|
cases, you will have to create the directory).</para>
|
|
|
|
<para>All configuration files share the same format. For
|
|
example, a short extract of the main configuration file might
|
|
look as follows:</para>
|
|
<programlisting>
|
|
# Space-separated list of files and directories to index.
|
|
topdirs = ~/docs /usr/share/doc
|
|
|
|
[~/somedirectory-with-utf8-txt-files]
|
|
defaultcharset = utf-8
|
|
</programlisting>
|
|
|
|
<para>There are three kinds of lines: </para>
|
|
<itemizedlist>
|
|
<listitem><para>Comment (starts with
|
|
<emphasis>#</emphasis>) or empty.</para>
|
|
</listitem>
|
|
<listitem><para>Parameter assignment (<emphasis>name =
|
|
value</emphasis>).</para>
|
|
</listitem>
|
|
<listitem><para>Section definition
|
|
([<emphasis>somedirname</emphasis>]).</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
|
|
<para>Long lines can be broken by ending each incomplete part with
|
|
a backslash (<literal>\</literal>).</para>
|
|
|
|
<para>Depending on the type of configuration file, section
|
|
definitions either separate groups of parameters or allow
|
|
redefining some parameters for a directory sub-tree. They stay
|
|
in effect until another section definition, or the end of
|
|
file, is encountered. Some of the parameters used for indexing
|
|
are looked up hierarchically from the current directory
|
|
location upwards. Not all parameters can be meaningfully
redefined; this is specified for each in the next
|
|
section. </para>
|
|
|
|
<important>
|
|
<para>Global parameters <emphasis>must not</emphasis> be defined in
|
|
a directory subsection, else they will not be found at all by the
|
|
&RCL; code, which looks for them at the top level
|
|
(e.g. <literal>skippedPaths</literal>).</para>
|
|
</important>
|
|
|
|
<para>When found at the beginning of a file path, the tilde
|
|
character (~) is expanded to the name of the user's home
|
|
directory, as a shell would do.</para>
|
|
|
|
<para>Some parameters are lists of strings. White space is used for
|
|
separation. List elements with embedded spaces can be quoted using
|
|
double-quotes. Double quotes inside these elements can be escaped
|
|
with a backslash.</para>
|
|
|
|
<para>No value inside a configuration file can contain a newline
|
|
character. Long lines can be continued by escaping the
|
|
physical newline with backslash, even inside quoted strings.</para>
|
|
<programlisting>
|
|
astringlist = "some string \
|
|
with spaces"
|
|
thesame = "some string with spaces"
|
|
</programlisting>
|
|
|
|
<para>Parameters which are not part of string lists can't be
|
|
quoted, and leading and trailing space characters are
|
|
stripped before the value is used.</para>
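<para>The format rules above can be summarized by the following
reader sketch. It is not the &RCL; parser: the
<literal>parse_config</literal> and <literal>string_list</literal>
names are hypothetical, and list splitting is approximated with the
Python <literal>shlex</literal> module, which also treats single
quotes as quoting characters, so it may differ from &RCL; on unusual
values.</para>

<programlisting><![CDATA[
import shlex


def parse_config(text):
    """Parse text into {section: {name: value}}; "" is the top level."""
    # Join physical lines ending with a backslash.
    text = text.replace("\\\n", "")
    conf = {"": {}}
    section = ""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                  # comment or empty line
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]      # section definition
            conf.setdefault(section, {})
            continue
        name, sep, value = line.partition("=")
        if sep:                       # parameter assignment
            conf[section][name.strip()] = value.strip()
    return conf


def string_list(value):
    """Split a list value (approximation using shell-like rules)."""
    return shlex.split(value)


sample = r'''
# Space-separated list of files and directories to index.
topdirs = ~/docs /usr/share/doc
astringlist = "some string \
with spaces" other

[~/somedirectory-with-utf8-txt-files]
defaultcharset = utf-8
'''
conf = parse_config(sample)
print(string_list(conf[""]["topdirs"]))
print(string_list(conf[""]["astringlist"]))
print(conf["~/somedirectory-with-utf8-txt-files"]["defaultcharset"])
]]></programlisting>

<para>The sketch does not expand the tilde character and does not
implement the hierarchical lookup of directory sections.</para>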
|
|
|
|
<formalpara>
|
|
<title>Encoding issues</title>
|
|
<para>Most of the configuration parameters are plain ASCII. Two
|
|
particular sets of values may cause encoding issues:</para>
|
|
</formalpara>
|
|
|
|
|
|
<para>
|
|
<itemizedlist>
|
|
<listitem><para>File path parameters may contain non-ASCII
|
|
characters and should use the exact same byte values as found in
|
|
the file system directory. Usually, this means that the
|
|
configuration file should use the system default locale
|
|
encoding.</para>
|
|
</listitem>
|
|
<listitem><para>The <envar>unac_except_trans</envar> parameter
|
|
should be encoded in UTF-8. If your system locale is not UTF-8, and
|
|
you need to also specify non-ASCII file paths, this poses a
|
|
difficulty because common text editors cannot handle multiple
|
|
encodings in a single file. In this relatively unlikely case, you
|
|
can edit the configuration file as two separate text files with
|
|
appropriate encodings, and concatenate them to create the complete
|
|
configuration.</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
|
|
<sect2 id="RCL.INSTALL.CONFIG.ENVIR">
|
|
<title>Environment variables</title>
|
|
|
|
<variablelist>
|
|
<varlistentry>
|
|
<term><varname>RECOLL_CONFDIR</varname></term>
|
|
<listitem><para>Defines the main configuration
|
|
directory.</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term><varname>RECOLL_TMPDIR, TMPDIR</varname></term>
|
|
<listitem><para>Locations for temporary files, in this order
|
|
of priority. The default if none of these is set is to use
|
|
<filename>/tmp</filename>. Big temporary files may be created
|
|
during indexing, mostly for decompressing, and also for
|
|
processing, e.g. email attachments.</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term><varname>RECOLL_CONFTOP, RECOLL_CONFMID</varname></term>
|
|
<listitem><para>Allow adding configuration directories with
|
|
priorities above and below the user directory, respectively (see the
Configuration overview section above for details).</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term><varname>RECOLL_EXTRA_DBS, RECOLL_ACTIVE_EXTRA_DBS</varname></term>
|
|
<listitem><para>
|
|
Help for setting up external indexes. See
|
|
<link linkend="RCL.SEARCH.GUI.MULTIDB">this paragraph</link> for
|
|
explanations.
|
|
</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term><varname>RECOLL_DATADIR</varname></term>
|
|
<listitem><para>Defines a replacement for the default location
of Recoll data files, normally found in, e.g.,
<filename>/usr/share/recoll</filename>.</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term><varname>RECOLL_FILTERSDIR</varname></term>
|
|
<listitem><para>Defines a replacement for the default location
of Recoll filters, normally found in, e.g.,
<filename>/usr/share/recoll/filters</filename>.</para></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term><varname>ASPELL_PROG</varname></term>
|
|
<listitem><para><command>aspell</command> program to use for
|
|
creating the spelling dictionary. The result has to be
|
|
compatible with the <filename>libaspell</filename> which &RCL;
|
|
is using.</para></listitem>
|
|
</varlistentry>
|
|
|
|
</variablelist>
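<para>As an illustration of the variables above, the following
lines could be added to a shell startup file. This is only a
sketch: both directory names are hypothetical and should be
adapted to your own setup.</para>

<programlisting># Hypothetical locations, adjust to your own layout
# Use an alternate main configuration directory
export RECOLL_CONFDIR="$HOME/.recoll-work"
# Put big temporary files on a volume with enough free space
export RECOLL_TMPDIR="/var/tmp/recoll-tmp"
</programlisting>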
|
|
|
|
</sect2>
|
|
|
|
<!-- <sect2 id="RCL.INSTALL.CONFIG.RECOLLCONF"> -->
|
|
&RCLCONF;
|
|
|
|
|
|
<sect2 id="RCL.INSTALL.CONFIG.FIELDS">
|
|
<title>The fields file</title>
|
|
|
|
<para>This file contains information about dynamic fields handling
|
|
in &RCL;. Some very basic fields have hard-wired behaviour,
|
|
and, mostly, you should not change the original data inside the
|
|
<filename>fields</filename> file. But you can create custom fields
|
|
fitting your data and handle them as if they were native
|
|
ones.</para>
|
|
|
|
<para>The <filename>fields</filename> file has several sections,
|
|
which each define an aspect of fields processing. Quite often,
|
|
you'll have to modify several sections to obtain the desired
|
|
behaviour.</para>
|
|
|
|
<para>We will only give a short description here; you should refer
to the comments inside the default file for more detailed
information.</para>
|
|
|
|
<para>Field names should be lowercase alphabetic ASCII.</para>
|
|
|
|
<variablelist>
|
|
|
|
<varlistentry>
|
|
<term>[prefixes]</term>
|
|
<listitem><para>A field becomes indexed (searchable) by having
a prefix defined in this section. The comments in the default
<filename>fields</filename> file give a more complete explanation
of what prefixes are and of those used by a standard recoll
installation. In a nutshell: extension prefixes should be short,
all caps, and begin with XY, e.g. XYMFLD.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term>[values]</term>
|
|
<listitem><para>Fields listed in this section will be stored as
&XAP; <literal>values</literal> inside the index. This makes
them available for range queries, allowing results to be filtered
according to the field value. This feature currently supports
string and integer data. See the comments in the file for more
details.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term>[stored]</term>
|
|
<listitem><para>A field becomes stored (displayable inside
|
|
results) by having its name listed in this section (typically
|
|
with an empty value).</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term>[aliases]</term>
|
|
<listitem><para>This section defines lists of synonyms for the
|
|
canonical names used inside the <literal>[prefixes]</literal>
|
|
and <literal>[stored]</literal> sections.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term>[queryaliases]</term>
|
|
<listitem><para>This section also defines aliases for the
canonical field names, with the difference that the substitution
will only be used at query time, avoiding any possibility that
the value would pick up random metadata from documents.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term>handler-specific sections</term>
|
|
<listitem><para>Some input handlers may need specific
|
|
configuration for handling fields. Only the email message handler
|
|
currently has such a section (named
|
|
<literal>[mail]</literal>). It allows indexing arbitrary email
|
|
headers in addition to the ones indexed by default. Other such
|
|
sections may appear in the future.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
</variablelist>
|
|
|
|
<para>Here follows a small example of a personal
|
|
<filename>fields</filename>
|
|
file. This would extract a specific email header and
|
|
use it as a searchable field, with data displayable inside result
|
|
lists. (Side note: as the email handler does no decoding on the values,
only plain ASCII headers can be indexed, and only the
first occurrence will be used for headers that occur several times.)

<programlisting>[prefixes]
# Index mailmytag contents (with the given prefix)
mailmytag = XMTAG

[stored]
# Store mailmytag inside the document data record (so that it can be
# displayed - as %(mailmytag) - in result lists).
mailmytag =

[queryaliases]
filename = fn
containerfilename = cfn

[mail]
# Extract the X-My-Tag mail header, and use it internally with the
# mailmytag field name
x-my-tag = mailmytag
</programlisting>
|
|
</para>
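<para>The example above does not use the
<literal>[aliases]</literal> section. A minimal sketch follows,
assuming the <literal>canonicalname = alias1 alias2</literal>
layout used in the default file. The alias names are made up for
illustration.</para>

<programlisting>[aliases]
# Hypothetical aliases: document metadata named mytag or usertag
# would be processed as the mailmytag field
mailmytag = mytag usertag
</programlisting>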
|
|
|
|
|
|
<sect3 id="RCL.INSTALL.CONFIG.FIELDS.XATTR">
|
|
<title>Extended attributes in the fields file</title>
|
|
|
|
<para>&RCL; versions 1.19 and later process user extended
|
|
file attributes as document fields by default.</para>
|
|
|
|
<para>Attributes are processed as fields of the same name,
|
|
after removing the <literal>user</literal> prefix on
|
|
Linux.</para>
|
|
|
|
<para>The <literal>[xattrtofields]</literal>
|
|
section of the <filename>fields</filename> file allows
|
|
specifying translations from extended attribute names to
|
|
&RCL; field names. An empty translation disables use of the
|
|
corresponding attribute data.</para>
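<para>For illustration, and assuming two hypothetical attribute
names, such a section might look as follows. The first entry sends
the attribute data to a field named <literal>keywords</literal>,
the second one uses an empty translation to disable the
attribute.</para>

<programlisting>[xattrtofields]
# Hypothetical attribute names, given without the user prefix
mytags = keywords
internalnote =
</programlisting>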
|
|
|
|
</sect3>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.INSTALL.CONFIG.MIMEMAP">
|
|
<title>The mimemap file</title>
|
|
|
|
<para><filename>mimemap</filename> specifies the
|
|
file name extension to MIME type mappings.</para>
|
|
|
|
<para>For file names without an extension, or with an unknown one,
|
|
a system command (<command>file</command> <option>-i</option>, or
|
|
<command>xdg-mime</command>) will be executed to determine the MIME
|
|
type (this can be switched off, or the command changed inside the
|
|
main configuration file).</para>
|
|
|
|
<para>All extension values in <filename>mimemap</filename> must be
|
|
entered in lower case. File name extensions are lower-cased for
|
|
comparison during indexing, meaning that an upper case
|
|
<filename>mimemap</filename> entry will never be matched.</para>
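<para>A short sketch of the consequence, reusing the made-up
<filename>.blob</filename> extension and MIME type from the
configuration examples later in this chapter:</para>

<programlisting># Matched for both file.blob and FILE.BLOB, because the file name
# extension is lower-cased before comparison
.blob = application/x-blobapp
# Never matched, because the entry itself is in upper case
.BLOB = application/x-blobapp
</programlisting>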
|
|
|
|
<para>The mappings can be specified on a per-subtree basis,
which may be useful in some cases. For example,
<application>okular</application> notes have a
<filename>.xml</filename> extension but
should be handled specially, which is possible because they
are usually all located in one place:
<programlisting>[~/.kde/share/apps/okular/docdata]
.xml = application/x-okular-notes</programlisting></para>
|
|
|
|
<para>The <varname>recoll_noindex</varname>
|
|
<filename>mimemap</filename> variable has been moved to
|
|
<filename>recoll.conf</filename> and renamed to
|
|
<varname>noContentSuffixes</varname>, while keeping the same
|
|
function, as of &RCL; version 1.21. For older &RCL; versions,
|
|
see the documentation for <varname>noContentSuffixes</varname>
|
|
but use <varname>recoll_noindex</varname> in
|
|
<filename>mimemap</filename>.</para>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.INSTALL.CONFIG.MIMECONF">
|
|
<title>The mimeconf file</title>
|
|
|
|
<para>The main purpose of the <filename>mimeconf</filename> file is
|
|
to specify how the different MIME types are handled for
|
|
indexing. This is done in the <literal>[index]</literal>
|
|
section, which should not be modified casually. See the comments in
|
|
the file.</para>
|
|
|
|
<para>The file also contains other definitions which affect the
|
|
query language and the GUI, and which, in retrospect, should have
|
|
been stored elsewhere.</para>
|
|
|
|
<para>The <literal>[icons]</literal> section allows you to change
the icons which are displayed by the <command>recoll</command> GUI
in the result lists. The values are the basenames of the
<literal>png</literal> images inside the
<filename>iconsdir</filename> directory (which is itself defined
in <filename>recoll.conf</filename>).</para>
|
|
|
|
<para>The <literal>[categories]</literal> section defines the
|
|
groupings of MIME types into <literal>categories</literal> as used
|
|
when adding an <literal>rclcat</literal> clause to a
|
|
<link linkend="RCL.SEARCH.LANG">query language</link>
|
|
query. <literal>rclcat</literal> clauses are also used by the
|
|
default <literal>guifilters</literal> buttons in the GUI (see
|
|
next).</para>
|
|
|
|
<para>The filter controls appear at the top of the
|
|
<command>recoll</command> GUI, either as checkboxes just above the
|
|
result list, or as a drop-down list in the tool area.</para>
|
|
|
|
<para>By default, they are labeled: <literal>media</literal>,
|
|
<literal>message</literal>, <literal>other</literal>,
|
|
<literal>presentation</literal>, <literal>spreadsheet</literal> and
|
|
<literal>text</literal>, and each maps to a document category. This
|
|
is determined in the <literal>[guifilters]</literal> section, where
|
|
each control is defined by a variable naming a query language
|
|
fragment.</para>
|
|
|
|
<para>A simple example will hopefully make things clearer.</para>
|
|
|
|
<programlisting>[guifilters]

Big Books = dir:"~/My Books" size&gt;10K
My Docs = dir:"~/My Documents"
Small Books = dir:"~/My Books" size&lt;10K
System Docs = dir:/usr/share/doc
</programlisting>
|
|
|
|
|
|
<para>The above definition would create four filter checkboxes,
|
|
labelled <literal>Big Books</literal>, <literal>My Docs</literal>,
|
|
etc.</para>
|
|
|
|
<para>The text after the equal sign must be a valid query language
|
|
fragment, and, when the button is checked, it will be combined with
|
|
the rest of the query with an AND conjunction.</para>
|
|
|
|
<para>Any name text before a colon character will be erased in the
|
|
display, but used for sorting. You can use this to display the
|
|
checkboxes in any order you like. For example, the following would
|
|
do exactly the same as above, but display the checkboxes in
reverse order.</para>
|
|
|
|
<programlisting>[guifilters]

d:Big Books = dir:"~/My Books" size&gt;10K
c:My Docs = dir:"~/My Documents"
b:Small Books = dir:"~/My Books" size&lt;10K
a:System Docs = dir:/usr/share/doc
</programlisting>
|
|
|
|
<para>As you may have guessed, the default
<literal>[guifilters]</literal> section looks like:</para>
<programlisting>[guifilters]
text = rclcat:text
spreadsheet = rclcat:spreadsheet
presentation = rclcat:presentation
media = rclcat:media
message = rclcat:message
other = rclcat:other
</programlisting>
|
|
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="RCL.INSTALL.CONFIG.MIMEVIEW">
|
|
<title>The mimeview file</title>
|
|
|
|
<para><filename>mimeview</filename> specifies which programs
are started when you click on an <guilabel>Open</guilabel> link
in a result list. For example, HTML is normally displayed using
<application>firefox</application>, but you may prefer
<application>Konqueror</application>, or your
<application>openoffice.org</application>
program might be named <command>ooffice</command> instead of
<command>openoffice</command>, etc.</para>
|
|
|
|
<para>Changes to this file can be done by direct editing, or
|
|
through the <command>recoll</command> GUI preferences dialog.</para>
|
|
|
|
<para>If <guilabel>Use desktop preferences to choose document
|
|
editor</guilabel> is checked in the &RCL; GUI preferences, all
|
|
<filename>mimeview</filename> entries will be ignored except the
|
|
one labelled <literal>application/x-all</literal> (which is set to
|
|
use <command>xdg-open</command> by default).</para>
|
|
|
|
<para>In this case, the <literal>xallexcepts</literal> top level
|
|
variable defines a list of MIME type exceptions which
|
|
will be processed according to the local entries instead of being
|
|
passed to the desktop. This is so that specific &RCL; options
|
|
such as a page number or a search string can be passed to
|
|
applications that support them, such as the
|
|
<application>evince</application> viewer.</para>
|
|
|
|
<para>As for the other configuration files, the normal usage
|
|
is to have a <filename>mimeview</filename> inside your own
|
|
configuration directory, with just the non-default entries,
|
|
which will override those from the central configuration
|
|
file.</para>
|
|
|
|
<para>All viewer definition entries must be placed under a
|
|
<literal>[view]</literal> section.</para>
|
|
|
|
<para>The keys in the file are normally MIME types. You can add an
application tag to specialize the choice for an area of the
filesystem (using a <varname>localfields</varname> specification
in <filename>recoll.conf</filename>). The syntax for the key is
<replaceable>mimetype</replaceable><literal>|</literal><replaceable>tag</replaceable>.</para>
|
|
|
|
<para>The <varname>nouncompforviewmts</varname> entry (placed at
the top level, outside of the <literal>[view]</literal> section)
holds a list of MIME types that should not be uncompressed before
starting the viewer (if they are found compressed, e.g.
<replaceable>mydoc.doc.gz</replaceable>).</para>
|
|
|
|
<para>The right side of each assignment holds a command to be
|
|
executed for opening the file. The following substitutions are
|
|
performed:</para>
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<formalpara><title>%D</title>
|
|
<para>Document date</para></formalpara>
|
|
</listitem>
|
|
|
|
<listitem><formalpara><title>%f</title>
|
|
<para>File name. This may be the name of a temporary file if
it was necessary to create one (e.g. to extract a subdocument
from a container).</para></formalpara>
|
|
</listitem>
|
|
|
|
<listitem><formalpara><title>%i</title>
|
|
<para>Internal path, for subdocuments of containers. The
|
|
format depends on the container type. If this appears in the
|
|
command line, &RCL; will not create a temporary file to
|
|
extract the subdocument, expecting the called application
|
|
(possibly a script) to be able to handle it.</para></formalpara>
|
|
</listitem>
|
|
|
|
<listitem><formalpara><title>%M</title>
|
|
<para>MIME type</para></formalpara>
|
|
</listitem>
|
|
|
|
<listitem><formalpara><title>%p</title>
|
|
<para>Page index. Only significant for a subset of document
|
|
types, currently only PDF, PostScript and DVI files. Can be
|
|
used to start the editor at the right page for a match or
|
|
snippet.</para></formalpara>
|
|
</listitem>
|
|
|
|
<listitem><formalpara><title>%s</title>
|
|
<para>Search term. The value will only be set for documents
with indexed page numbers (e.g. PDF). The value will be one of
the matched search terms. This allows, for example, pre-setting
the value in the "Find" entry inside Evince, for easy
highlighting of the term.</para></formalpara>
|
|
</listitem>
|
|
|
|
<listitem><formalpara><title>%u</title>
|
|
<para>URL.</para></formalpara>
|
|
</listitem>
|
|
</itemizedlist>
|
|
|
|
<para>In addition to the predefined values above, all strings like
|
|
<literal>%(fieldname)</literal> will be replaced by the value of
|
|
the field named <literal>fieldname</literal> for the
|
|
document. This could be used in combination with field
|
|
customisation to help with opening the document.</para>
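<para>As an illustration of the substitutions, a personal
<filename>mimeview</filename> could contain entries like the
following. This is a sketch under assumptions: the
<command>evince</command> options shown should be checked against
the version installed on your system, and the choice of programs
is only an example.</para>

<programlisting>[view]
# Open PDF files at the page of the match (%p), with one of the
# matched search terms (%s) preset in the viewer search entry
application/pdf = evince --page-index=%p --find=%s %f
# Pass the document URL (%u) instead of a file name
text/html = firefox %u
</programlisting>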
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="RCL.INSTALL.CONFIG.PTRANS">
|
|
<title>The <filename>ptrans</filename> file</title>
|
|
|
|
<para><filename>ptrans</filename> specifies query-time path
|
|
translations. These can be useful
|
|
in <link linkend="RCL.SEARCH.PTRANS">multiple cases</link>.
|
|
</para>
|
|
|
|
<para>The file has a section for any index which needs
|
|
translations, either the main one or additional query
|
|
indexes. The sections are named with the &XAP; index
|
|
directory names. No slash character should exist at the end
|
|
of the paths (all comparisons are textual). An example
|
|
should make things sufficiently clear:</para>

<programlisting>
[/home/me/.recoll/xapiandb]
/this/directory/moved = /to/this/place

[/path/to/additional/xapiandb]
/server/volume1/docdir = /net/server/volume1/docdir
/server/volume2/docdir = /net/server/volume2/docdir
</programlisting>
|
|
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="RCL.INSTALL.CONFIG.EXAMPLES">
|
|
<title>Examples of configuration adjustments</title>
|
|
|
|
<sect3 id="RCL.INSTALL.CONFIG.EXAMPLES.ADDVIEW">
|
|
<title>Adding an external viewer for a non-indexed type</title>
|
|
|
|
<para>Imagine that you have some kind of file which does not
|
|
have indexable content, but for which you would like to have a
|
|
functional <guilabel>Open</guilabel> link in the result list
|
|
(when found by file name). The file names end in
|
|
<replaceable>.blob</replaceable> and can be displayed by
|
|
application <replaceable>blobviewer</replaceable>.</para>
|
|
|
|
<para>You need two entries in the configuration files for this
|
|
to work:</para>
|
|
|
|
<itemizedlist>
|
|
<listitem><para>In <filename>$RECOLL_CONFDIR/mimemap</filename>
|
|
(typically <filename>~/.recoll/mimemap</filename>), add the
|
|
following line:<programlisting>
.blob = application/x-blobapp
</programlisting>
|
|
Note that the MIME type is made up here, and you could
|
|
call it <replaceable>diesel/oil</replaceable> just the
|
|
same.</para>
|
|
</listitem>
|
|
<listitem><para>In <filename>$RECOLL_CONFDIR/mimeview</filename>
|
|
under the <literal>[view]</literal> section, add:</para>
|
|
<programlisting>
application/x-blobapp = blobviewer %f
</programlisting>
<para>We are supposing here
that <replaceable>blobviewer</replaceable> wants a file
name parameter; you would use <literal>%u</literal> if
it liked URLs better.</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
|
|
<para>If you just wanted to change the application used by
|
|
&RCL; to display a MIME type which it already knows, you
|
|
would just need to edit <filename>mimeview</filename>. The
|
|
entries you add in your personal file override those in the
|
|
central configuration, which you do not need to
|
|
alter. <filename>mimeview</filename> can also be modified
|
|
from the GUI.</para>
|
|
|
|
</sect3>
|
|
|
|
<sect3 id="RCL.INSTALL.CONFIG.EXAMPLES.ADDINDEX">
|
|
<title>Adding indexing support for a new file type</title>
|
|
|
|
<para>Let us now imagine that the above
|
|
<replaceable>.blob</replaceable> files actually contain
|
|
indexable text and that you know how to extract it with a
|
|
command line program. Getting &RCL; to index the files is
|
|
easy. You need to perform the above alteration, and also to
|
|
add data to the <filename>mimeconf</filename> file
|
|
(typically in <filename>~/.recoll/mimeconf</filename>):</para>
|
|
<itemizedlist>
|
|
<listitem><para>Under the <literal>[index]</literal>
|
|
section, add the following line (more about the
|
|
<replaceable>rclblob</replaceable> indexing script
|
|
later):<programlisting>
application/x-blobapp = exec rclblob</programlisting>
Or, if the files are mostly text and you don't need to process them
for indexing:<programlisting>
application/x-blobapp = internal text/plain</programlisting>
|
|
</para>
|
|
</listitem>
|
|
<listitem><para>Under the <literal>[icons]</literal>
|
|
section, you should choose an icon to be displayed for the
|
|
files inside the result lists. Icons are normally 64x64
|
|
pixels PNG files which live in
|
|
<filename>/usr/share/recoll/images</filename>.</para>
|
|
</listitem>
|
|
<listitem><para>Under the <literal>[categories]</literal>
|
|
section, you should add the MIME type where it makes sense
|
|
(you can also create a category). Categories may be used
|
|
for filtering in advanced search.</para>
|
|
</listitem>
|
|
</itemizedlist>
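<para>Put together, the additions to
<filename>~/.recoll/mimeconf</filename> could look like the
following sketch. The icon basename and the category name are
assumptions: use an image which actually exists in your icons
directory, and a category name which suits you.</para>

<programlisting>[index]
application/x-blobapp = exec rclblob

[icons]
# Assumes that an image named document.png exists in the icons directory
application/x-blobapp = document

[categories]
# Hypothetical new category, selected with an rclcat:blobs clause
blobs = application/x-blobapp
</programlisting>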
|
|
|
|
<para>The <replaceable>rclblob</replaceable> handler should
|
|
be an executable program or script which exists inside
|
|
<filename>/usr/share/recoll/filters</filename>. It
|
|
will be given a file name as argument and should output the
|
|
text or HTML contents on the standard output.</para>
|
|
|
|
<para>The <link linkend="RCL.PROGRAM.FILTERS">filter programming</link>
|
|
section describes in more detail how to write an input handler.</para>
|
|
|
|
|
|
</sect3>
|
|
</sect2>
|
|
</sect1>
|
|
</chapter>
|
|
</book>
|