<!DOCTYPE BOOK PUBLIC "-//FreeBSD//DTD DocBook V4.1-Based Extension//EN" [
<!ENTITY RCL "<application>Recoll</application>">
<!ENTITY XAP "<application>Xapian</application>">
]>

<book lang="en">

<bookinfo>
<title>Recoll user manual</title>

<author>
<firstname>Jean-Francois</firstname>
<surname>Dockes</surname>
<affiliation>
<address><email>jean-francois.dockes@wanadoo.fr</email></address>
</affiliation>
</author>

<copyright>
<year>2005</year>
<holder role="mailto:jean-francois.dockes@wanadoo.fr">Jean-Francois
Dockes</holder>
</copyright>

<releaseinfo>$Id: usermanual.sgml,v 1.15 2006-09-08 09:02:47 dockes Exp $</releaseinfo>

<abstract>
<para>This document introduces full text search notions
and describes the installation and use of the &RCL; application.</para>
</abstract>

</bookinfo>

<chapter id="rcl.introduction">
|
|
<title>Introduction</title>
|
|
|
|
<sect1 id="rcl.introduction.tryit">
|
|
<title>Giving it a try</title>
|
|
|
|
<para>If you do not like reading manuals (who does?) and would
|
|
like to give &RCL; a try, just perform <link
|
|
linkend="rcl.install">installation</link> and start the
|
|
<command>recoll</command> user interface, which will index your
|
|
home directory by default, allowing you to search immediately after
|
|
indexing completes.</para>
|
|
|
|
<para>Do not do this if your home directory holds a very large
number of documents and you do not want to wait, or if you are very
short on disk space. In this case, you may want to edit the <link
linkend="rcl.indexing.config">configuration file</link> first to
restrict the indexed area.</para>

<para>Also be aware that you will need to install the
appropriate <link linkend="rcl.install.external">supporting
applications</link> for document types that need
them (for example <application>antiword</application> for
MS Word files).</para>

</sect1>

<sect1 id="rcl.introduction.search">
|
|
<title>Full text search</title>
|
|
|
|
<para>&RCL; is a full text search application. Full text search
|
|
applications let you find your data by content rather
|
|
than by external attributes (like a file name). More
|
|
specifically, they will let you specify words (terms) that
|
|
should or should not appear in the text you are looking for,
|
|
and return a list of matching documents, ordered so that the
|
|
most <emphasis>relevant</emphasis> documents will appear
|
|
first.</para>
|
|
|
|
<para>You do not need to remember in what file or email message you
|
|
stored a given piece of information. You just ask for related
|
|
terms, and the tool will return a list of documents where
|
|
those terms are prominent, in a similar way to internet search
|
|
engines.</para>
|
|
|
|
<para>&RCL; tries to determine which documents are most relevant to
|
|
the search terms you provide. Computer algorithms for determining
|
|
relevance can be very complex, and in general are inferior to the
|
|
power of the human mind to rapidly determine relevance. The quality
|
|
of relevance guessing by the search tool is probably the most
|
|
important element for a search application.</para>
|
|
|
|
<para>In many cases, you are looking for all the forms of a
word, not for a specific form or spelling. These different
forms may include plurals, different tenses for a verb, or
terms derived from the same root or <emphasis>stem</emphasis>
(example: floor, floors, floored, floorings...). &RCL; will by
default expand queries to all such related terms (words that
reduce to the same stem). This expansion can be disabled at
search time.</para>

<para>Stemming, by itself, does not accommodate misspellings or
phonetic searches. &RCL; currently does not support these
features.</para>
|
|
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="rcl.introduction.recoll">
|
|
<title>Recoll overview</title>
|
|
|
|
<para>&RCL; uses the
|
|
<ulink url="http://www.xapian.org">&XAP;</ulink> information retrieval
|
|
library as its storage and retrieval engine. &XAP; is a very
|
|
mature package using <ulink
|
|
url="http://www.xapian.org/docs/intro_ir.html">a sophisticated
|
|
probabilistic ranking model</ulink>. &RCL; provides the interface
|
|
to get data into (indexing) and out (searching) of the system.</para>
|
|
|
|
<para>In practice, &XAP; works by remembering where terms appear
|
|
in your document files. The acquisition process is called
|
|
indexing. </para>
|
|
|
|
<para>The resulting index can be big (roughly the size of the
|
|
original document set), but it is not a document
|
|
archive. &RCL; can only display documents that still exist at
|
|
the place from which they were indexed. (Actually, there is a
|
|
way to reconstruct a document from the information in the
|
|
index, but the result is not nice, as all formatting,
|
|
punctuation and capitalisation are lost).</para>
|
|
|
|
<para>&RCL; stores all internal data in <application>Unicode
|
|
UTF-8</application> format, and it can index files with
|
|
different character sets, encodings, and languages into the same
|
|
index. It has input filters for many document types.</para>
|
|
|
|
<para>Stemming depends on the document language. &RCL; stores
|
|
the unstemmed versions of terms and uses auxiliary databases for
|
|
term expansion. It can switch stemming languages, or add a
|
|
language, without reindexing. Storing documents in different
|
|
languages in the same index is possible, and useful in
|
|
practice, but does introduce possibilities of confusion. &RCL;
|
|
currently makes no attempt at automatic language recognition.</para>
|
|
|
|
<para>&RCL; has many parameters which define exactly what to
|
|
index, and how to classify and decode the source
|
|
documents. These are kept in a <link
|
|
linkend="rcl.indexing.config">configuration file</link>. A
|
|
default configuration is copied into a standard location
|
|
(usually something like
|
|
<filename>/usr/[local/]share/recoll/examples</filename>)
|
|
during installation. The default parameters from this file may
be overridden by values that you set inside your personal
|
|
configuration, found by default in the
|
|
<filename>.recoll</filename> subdirectory of your home
|
|
directory. The default configuration will index your home
|
|
directory with default parameters and should be sufficient for
|
|
giving &RCL; a try, but you may want to adjust it
|
|
later.</para>
|
|
|
|
<para><link linkend="rcl.indexing.exec">Indexing</link> is started
|
|
automatically the first time you execute the
|
|
<command>recoll</command> search graphical user interface, or by
|
|
executing the <command>recollindex</command> command.</para>
|
|
|
|
<para><link linkend="rcl.search">Searches</link> are
|
|
performed inside the <command>recoll</command>
|
|
program, which has many options to help you find what you are
|
|
looking for.</para>
|
|
|
|
</sect1>
|
|
</chapter>
|
|
|
|
|
|
<chapter id="rcl.indexing">
|
|
<title>Indexing</title>
|
|
|
|
<sect1 id="rcl.indexing.introduction">
|
|
<title>Introduction</title>
|
|
|
|
<para>Indexing is the process by which the set of documents is
|
|
analyzed and the data entered into the database. &RCL; indexing
|
|
is normally incremental: documents will only be processed if
|
|
they have been modified. On the first execution, of course, all
|
|
documents will need processing. A full index build can be forced
|
|
later on by specifying an option to the indexing command
|
|
(<command>recollindex -z</command>).</para>
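
<para>As a sketch of the two modes described above (this assumes that
<command>recollindex</command> is in your PATH and that your
configuration is in the default location):</para>

<screen>
<userinput>recollindex</userinput>      (incremental: only new or modified documents are processed)
<userinput>recollindex -z</userinput>   (full rebuild: the index is reset first)
</screen>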
|
|
|
|
<para>&RCL; indexing takes place at discrete times. There is
|
|
currently no interface to real time file modification
|
|
monitors. The typical usage is to have a nightly indexing run
|
|
<link linkend="rcl.indexing.automat">programmed</link> into your
|
|
<command>cron</command> file.</para>
|
|
|
|
<sidebar><para>Side note: there is nothing in &RCL; and &XAP;
|
|
that would prevent interfacing with a real time file
|
|
modification monitor, but this would tend to consume significant
|
|
system resources for dubious gain, because you rarely need a
|
|
full text search to find documents you just
|
|
modified. <command>recollindex -i</command> can be used to add
|
|
individual files to the index if you want to play with this, see
|
|
the manual page.</para>
|
|
</sidebar>
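
<para>For example, assuming that <command>recollindex -i</command>
takes the files to add as arguments (check the manual page for the
exact syntax; the file names below are hypothetical):</para>

<screen>
<userinput>recollindex -i $HOME/docs/report.txt $HOME/docs/notes.html</userinput>
</screen>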
|
|
|
|
|
|
<para>&RCL; knows about quite a few different document
types. The parameters for document type recognition and
processing are set in the
<link linkend="rcl.indexing.config">configuration files</link>.
Most file types, like HTML or word processing files, only hold
one document. Some file types, like mail folder files, can hold
many individually indexed documents.</para>
|
|
|
|
<para>&RCL; indexing processes plain text, HTML, OpenOffice
and e-mail files internally. Other types (e.g. PostScript, PDF,
MS Word, RTF) need external applications for preprocessing. The
list is in the <link
linkend="rcl.install.building.prereqs">installation</link>
section.</para>
|
|
|
|
<para>Without further configuration, &RCL; will index all
|
|
appropriate files from your home directory, with a reasonable
|
|
set of defaults.</para>
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="rcl.indexing.storage">
|
|
<title>Index storage</title>
|
|
|
|
<para>The default location for the index data is the
|
|
<filename>$HOME/.recoll/xapiandb/</filename> directory. This can
|
|
be changed by setting the <literal>RECOLL_CONFDIR</literal>
|
|
environment variable, or by specifying the
|
|
<literal>dbdir</literal> parameter in the configuration file
|
|
(see the <link linkend="rcl.install.config">configuration
|
|
section</link>).</para>
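
<para>As an illustration, the following uses a hypothetical
alternate configuration directory. Since the index is by default
stored in the <filename>xapiandb</filename> subdirectory of the
configuration directory, the index data for this run should then end
up in <filename>/some/place/recollconf/xapiandb</filename>, unless
<literal>dbdir</literal> says otherwise:</para>

<screen>
<userinput>export RECOLL_CONFDIR=/some/place/recollconf</userinput>
<userinput>recollindex</userinput>
</screen>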
|
|
|
|
<para>The size of the index is determined by the size of the set
|
|
of documents, but the ratio can vary a lot. For a typical mixed
|
|
set of documents, the index size will often be close to
|
|
the data set size. In specific cases (a set of compressed
|
|
mbox files for example), the index can become much bigger than
|
|
the documents. It may also be much smaller if the documents
|
|
contain a lot of images or other non-indexed data (an extreme
|
|
example being a set of mp3 files where only the tags would be
|
|
indexed).</para>
|
|
|
|
<para>Images, sound and video do not increase the index size. As
a result, it is quite typical nowadays (2006) for even a big index
to be negligible compared to the total amount of data on the
computer.</para>

<para>The index data directory only contains data that will be
rebuilt by an index run, so it can safely be deleted.</para>
|
|
|
|
<sect2 id="rcl.indexing.storage.security">
|
|
<title>Security aspects</title>
|
|
|
|
<para>The &RCL; index does not hold copies of the indexed
|
|
documents. But it does hold enough data to allow for an almost
|
|
complete reconstruction. If confidential data is indexed,
|
|
access to the database directory should be restricted. </para>
|
|
|
|
<para>As of version 1.4, &RCL; will create the configuration
|
|
directory with a mode of 0700 (access by owner only). As the
|
|
index directory is by default a subdirectory of the
|
|
configuration directory, this should result in appropriate
|
|
protection. </para>
|
|
|
|
<para>If you use another setup, you should think of the kind
|
|
of protection you need for your index, and set the directory
|
|
access modes appropriately.</para>
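
<para>For instance, if the index lives in a directory of its own
(the path below is hypothetical), standard commands can restrict it
to its owner and let you check the result:</para>

<screen>
<userinput>chmod 700 /some/place/xapiandb</userinput>
<userinput>ls -ld /some/place/xapiandb</userinput>
</screen>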
|
|
|
|
</sect2>
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="rcl.indexing.config">
|
|
<title>The indexing configuration</title>
|
|
|
|
<para>Values set in the system-wide configuration file (typically
<filename>/usr/[local/]share/recoll/examples/recoll.conf</filename>)
can be overridden by those set in the personal one, named
<filename>$HOME/.recoll/recoll.conf</filename> by default or
<filename>$RECOLL_CONFDIR/recoll.conf</filename> if
<literal>RECOLL_CONFDIR</literal> is set.</para>

<para>The most accurate documentation for editing the file is
given by the comments inside the system-wide one. If you want to adjust
the configuration before indexing, just click
<guilabel>Cancel</guilabel> when the program asks if it should
start initial indexing. This will have created a
<filename>.recoll</filename> directory containing empty
configuration files.</para>
|
|
|
|
<para>The configuration is also documented inside the <link
|
|
linkend="rcl.install.config.recollconf">installation chapter</link> of
|
|
this document, or in the recoll.conf(5) man page.</para>
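
<para>As a minimal sketch, a personal
<filename>recoll.conf</filename> restricting the indexed area and
moving the index could look like the following. The
<literal>dbdir</literal> parameter is the one mentioned in the index
storage section; the <literal>topdirs</literal> parameter name and
all the paths are assumptions made for this illustration, to be
checked against the comments in the system-wide file:</para>

<programlisting># $HOME/.recoll/recoll.conf
# Directories to index (assumed parameter name, hypothetical paths)
topdirs = ~/Documents ~/Mail
# Where to store the index data (hypothetical path)
dbdir = /some/place/xapiandb</programlisting>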
|
|
|
|
<para>The applications needed to index file types other than
text, HTML or email (e.g. PDF, PostScript, MS Word...) are
described in the <link linkend="rcl.install.external">external
packages section</link>.</para>
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="rcl.indexing.exec">
|
|
<title>Starting indexing</title>
|
|
|
|
<para>Indexing is performed either by the
<command>recollindex</command> program, or by the
indexing thread inside the <command>recoll</command>
program (use the <guimenu>File</guimenu> menu).</para>
|
|
|
|
<para>If the <command>recoll</command> program finds no index
|
|
when it starts, it will automatically start indexing (except
|
|
if cancelled).</para>
|
|
|
|
<para>It is best to avoid interrupting the indexing process, as
|
|
this may sometimes leave the database in a bad state. This is
|
|
not a serious problem, as you then just need to clear
|
|
everything and restart the indexing: the index files are
|
|
normally stored in the <filename>$HOME/.recoll/xapiandb</filename>
|
|
directory,
|
|
which you can just delete if needed. Alternatively, you can
|
|
start <command>recollindex -z</command>, which will
|
|
reset the database before indexing.</para>
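
<para>The two recovery options above, sketched for the default index
location (adjust the path if you changed <literal>dbdir</literal> or
<literal>RECOLL_CONFDIR</literal>):</para>

<screen>
<userinput>rm -rf $HOME/.recoll/xapiandb</userinput>
<userinput>recollindex</userinput>
</screen>

<para>or, alternatively:</para>

<screen>
<userinput>recollindex -z</userinput>
</screen>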
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="rcl.indexing.automat">
|
|
<title>Using <command>cron</command> to automate
|
|
indexing</title>
|
|
|
|
<para>The most common way to set up indexing is to have a cron
|
|
task execute it every night. For example the following
|
|
<filename>crontab</filename> entry would do it every day at
|
|
3:30AM (supposing <command>recollindex</command> is in your PATH):</para>
|
|
|
|
<programlisting>30 3 * * * recollindex > /tmp/recolltrace 2>&1</programlisting>
|
|
|
|
<para>The usual command to edit your
|
|
<filename>crontab</filename> is
|
|
<userinput>crontab -e</userinput> (which will usually start the
|
|
<command>vi</command> editor to edit the file). You may have
|
|
more sophisticated tools available on your system.</para>
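
<para>If <command>recollindex</command> is not in the PATH used by
<command>cron</command>, the entry can name the program by its full
path. The installation prefix below is only an assumption; use the
one chosen when installing:</para>

<programlisting>30 3 * * * /usr/local/bin/recollindex > /tmp/recolltrace 2>&1</programlisting>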
|
|
|
|
</sect1>
|
|
|
|
</chapter>
|
|
|
|
<chapter id="rcl.search">
|
|
<title>Search</title>
|
|
|
|
<para>The <command>recoll</command> program provides the user
|
|
interface for searching. It is based on the
|
|
<application>QT</application> library.</para>
|
|
|
|
<sect1 id="rcl.search.simple">
|
|
<title>Simple search</title>
|
|
|
|
<procedure>
|
|
<step><para>Start the <command>recoll</command> program.</para>
|
|
</step>
|
|
<step><para>Possibly choose a search mode: <guilabel>Any
|
|
term</guilabel> or <guilabel>All terms</guilabel> or
|
|
<guilabel>File name</guilabel>.</para>
|
|
</step>
|
|
<step><para>Enter search term(s) in the text field at the top of the
|
|
window.</para>
|
|
</step>
|
|
<step><para>Click the <guilabel>Search</guilabel> button or
|
|
hit the <keycap>Enter</keycap> key to start the search.</para>
|
|
</step>
|
|
</procedure>
|
|
|
|
<para>The initial default search mode is <guilabel>Any
|
|
term</guilabel>. This will look for documents with any of the
|
|
search terms (the ones with more terms will get better scores).
|
|
<guilabel>All terms</guilabel> will ensure
|
|
that only documents with all the terms will be
|
|
returned. <guilabel>File name</guilabel> will specifically
|
|
look for file names, and allows using wildcards
|
|
(<literal>*</literal>, <literal>?</literal> ,
|
|
<literal>[]</literal>). </para>
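
<para>A few illustrative file name patterns, assuming the usual
shell-style meaning of these wildcards (the file names are
hypothetical):</para>

<programlisting>*.pdf           any name ending in .pdf
report?.txt     report1.txt, reportA.txt, ...
[a-c]*          any name beginning with a, b or c</programlisting>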
|
|
|
|
<para>&RCL; remembers the last few searches that you
performed. You can use the simple search text entry widget (a
combobox) to recall them (click the drop-down button at the right of
the text field). Please note, however, that only the search texts
are remembered, not the mode (all/any/filename).</para>
|
|
|
|
<para>You can use the <guilabel>Tools</guilabel> / <guilabel>Advanced
|
|
search</guilabel> dialog for more complex searches.</para>
|
|
|
|
<para>After starting a search, a list of results will instantly
|
|
be displayed in the main list window. Clicking on the
|
|
<literal>Preview</literal> link for an entry will open an
|
|
internal preview window for the document. Clicking the
|
|
<literal>Edit</literal> link will attempt to start an external
|
|
viewer (have a look at the <filename>mimeconf</filename>
|
|
configuration file to see how these are configured).</para>
|
|
|
|
<para>By default, the document list is presented in order of
|
|
relevance (how well the system estimates that the document
|
|
matches the query). You can specify a different ordering by
|
|
using the <link linkend="rcl.search.sort"><guilabel>Tools</guilabel>
|
|
/ <guilabel>Sort parameters</guilabel></link> dialog.</para>
|
|
|
|
<para>The <literal>Preview</literal> and <literal>Edit</literal>
links may not be present for all entries, meaning that
|
|
&RCL; has no configured way to preview a given file type (which
|
|
was indexed by name only), or no configured external viewer for
|
|
the file type. This can sometimes be adjusted simply by tweaking
|
|
the <link linkend="rclinstall.config.mimemap">
|
|
<filename>mimemap</filename></link> and
|
|
<link linkend="rclinstall.config.mimeconf">
|
|
<filename>mimeconf</filename></link> configuration files.</para>
|
|
|
|
<para>You can click on the <literal>Query details</literal> link
|
|
at the top of the results page to see the query actually
|
|
performed, after stem expansion and other processing.</para>
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="rcl.search.complex">
|
|
<title>Complex/advanced search</title>
|
|
|
|
<para>The advanced search dialog has fields that will allow a more
|
|
refined search, looking for documents with all given words, a
|
|
given exact phrase, none of the given words, or a given file
|
|
name (with wildcard expansion). All relevant fields will be
|
|
combined by an implicit AND clause.</para>
|
|
|
|
<para>It will let you search for documents of specific MIME
types (e.g. only <literal>text/plain</literal>, or
<literal>text/html</literal>, or
<literal>application/pdf</literal>).</para>
|
|
|
|
<para>It will let you restrict the search results to a subtree of
|
|
the indexed area.</para>
|
|
|
|
<para>Click on the <guilabel>Start Search</guilabel> button in
|
|
the advanced search dialog to start the search. The button in
|
|
the main window always performs a simple search.</para>
|
|
|
|
<para>Click on the <literal>Show query details</literal> link at
|
|
the top of the result page to see the query expansion.</para>
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="rcl.search.multidb">
|
|
<title>Multiple databases</title>
|
|
|
|
<para>Your &RCL; configuration always defines a main index. This
|
|
is what gets updated, for example, when you execute
|
|
<command>recollindex</command>. </para>
|
|
|
|
<para>You can use the <link
|
|
linkend="rcl.search.custom.extradb">search configuration
|
|
tool</link> to define additional databases to be searched. These
|
|
databases can be made active or inactive at any moment.</para>
|
|
|
|
<para>The typical use of this feature is for a system
|
|
administrator to set up a central index, that you may choose to
|
|
search, or not, in addition to your personal data. Of course,
|
|
there are other possibilities.</para>
|
|
|
|
<para>The main index (defined by your personal configuration) is
|
|
always active.</para>
|
|
|
|
<para>The list of searchable databases may also be defined by
|
|
the <literal>RECOLL_EXTRA_DBS</literal> environment
|
|
variable. This should hold a colon-separated list of index
|
|
directories, for example:
|
|
<screen>export RECOLL_EXTRA_DBS=/some/place/xapiandb:/some/other/db</screen>
|
|
</para>
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="rcl.search.history">
|
|
<title>Document history</title>
|
|
|
|
<para>Documents that you actually view (with the internal preview
|
|
or an external tool) are entered into the document history,
|
|
which is remembered. You can display the history list by using
|
|
the <guilabel>Tools</guilabel> / <guilabel>Doc History</guilabel> menu
|
|
entry.</para>
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="rcl.search.sort">
|
|
<title>Result list sorting</title>
|
|
|
|
<para>The documents in a result list are normally sorted in
|
|
order of relevance. It is possible to specify different sort
|
|
parameters by using the <guimenu>Sort parameters</guimenu>
|
|
dialog (located in the <guimenu>Tools</guimenu>
|
|
menu).</para>
|
|
|
|
<para>The tool sorts a specified number of the most
|
|
relevant documents in the result list, according to
|
|
specified criteria. The currently available criteria are
|
|
<emphasis>date</emphasis> and <emphasis>mime type</emphasis>.</para>
|
|
|
|
<para>The sort parameters stay in effect until they are explicitly
|
|
reset, or the program exits. An activated sort is indicated in
|
|
the result list header.</para>
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="rcl.search.resultlist">
|
|
<title>Additional result list functionality</title>
|
|
|
|
<para>Apart from the preview and edit links, you can display a
|
|
popup menu by right-clicking over a paragraph in the result
|
|
list. This menu has the following entries:</para>
|
|
|
|
<itemizedlist>
|
|
<listitem><para><guilabel>Preview</guilabel></para></listitem>
|
|
<listitem><para><guilabel>Edit</guilabel></para></listitem>
|
|
<listitem><para><guilabel>Copy File Name</guilabel></para></listitem>
|
|
<listitem><para><guilabel>Copy Url</guilabel></para></listitem>
|
|
<listitem><para><guilabel>Find similar</guilabel></para></listitem>
|
|
</itemizedlist>
|
|
|
|
<para>The <guilabel>Preview</guilabel> and
|
|
<guilabel>Edit</guilabel> entries do the same thing as the
|
|
corresponding links. The next two entries will copy either
the file path or a URL to the clipboard, for pasting into
|
|
another application.</para>
|
|
|
|
<para>The <guilabel>Find similar</guilabel> entry will select
|
|
a number of relevant terms from the current document and enter
|
|
them into the simple search field. You can then start a simple
|
|
search, with a good chance of finding documents related to the
|
|
current result.</para>
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="rcl.search.tips">
|
|
<title>Search tips, shortcuts</title>
|
|
|
|
<formalpara><title>Disabling stem expansion</title>
|
|
<para>Entering a capitalized word in any search field will prevent
|
|
stem expansion (no search for
|
|
<literal>gardening</literal> if you enter
|
|
<literal>Garden</literal> instead of
|
|
<literal>garden</literal>). This is the only case where
|
|
character case should make a difference for a &RCL;
|
|
search.</para>
|
|
</formalpara>
|
|
|
|
<formalpara><title>Phrases</title>
|
|
<para>A phrase can be looked for by enclosing it in double
|
|
quotes. Example: <literal>"user manual"</literal> will look
|
|
only for occurrences of <literal>user</literal> immediately
|
|
followed by <literal>manual</literal>. You can use the
|
|
<guilabel>This exact phrase</guilabel> field of the advanced
|
|
search dialog to the same effect.</para>
|
|
</formalpara>
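
<para>To illustrate the two tips above, here are three entries for the
simple search field and what they should look for (which forms a stem
expands to depends on the stemming language):</para>

<programlisting>garden          garden and related forms such as gardening
Garden          garden only, without stem expansion
"user manual"   user immediately followed by manual</programlisting>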
|
|
|
|
<formalpara><title>Term completion</title>
|
|
<para>Typing <keycap>^TAB</keycap> (Control+Tab) in the simple
|
|
search entry field while entering a word will either complete
|
|
the current word if its beginning matches a unique term in the
|
|
index, or open a window to propose a list of completions.</para>
|
|
</formalpara>
|
|
|
|
<formalpara><title>Picking up new terms for search from displayed
|
|
documents</title>
|
|
<para>Double-clicking on a word in the result list or in a
|
|
preview window will copy it to the simple search entry field.</para>
|
|
</formalpara>
|
|
|
|
<formalpara><title>Finding related documents</title>
|
|
<para>Selecting the <guilabel>More like this</guilabel> entry
|
|
in the result list paragraph right-click menu will select a
|
|
set of "interesting" terms from the current result, and insert
|
|
them into the simple search entry field. You can then possibly
|
|
edit the list and start a search to find documents which may
|
|
be related to the current result.</para>
|
|
</formalpara>
|
|
|
|
<formalpara><title>Query explanation</title>
|
|
<para>You can get an exact description of what the query
|
|
looked for, including stem expansion, and boolean operators
|
|
used, by clicking on the result list header.</para>
|
|
</formalpara>
|
|
|
|
<formalpara><title>File names</title>
|
|
<para>File names are added as terms during indexing, and you can
specify them as ordinary terms in normal search fields. (&RCL; used
to index all directories in the file path as terms. This has been
abandoned as it did not seem really useful.) Alternatively, you
can use the specific file name search, which will
<emphasis>only</emphasis> look for file names and can use wildcard
expansion.</para>
|
|
</formalpara>
|
|
|
|
<formalpara><title>Quitting</title>
|
|
<para>Entering <keycap>^Q</keycap> almost anywhere will
|
|
close the application.</para>
|
|
</formalpara>
|
|
|
|
<formalpara><title>Closing previews</title>
|
|
<para>Entering <keycap>Esc</keycap> will close the preview
|
|
window and all its tabs. Entering <keycap>^W</keycap> in a tab will
|
|
close it (and, for the last tab, close the preview window).</para>
|
|
</formalpara>
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="rcl.search.custom">
|
|
<title>Customising the search interface</title>
|
|
|
|
<para>It is possible to customise some aspects of the search
|
|
interface by using the <guimenu>Query configuration</guimenu> entry
|
|
in the <guimenu>Preferences</guimenu> menu.</para>
|
|
|
|
<para>There are two tabs in the dialog, dealing with the
|
|
interface itself, and with the parameters used for searching and
|
|
returning results.</para>
|
|
|
|
<formalpara><title>User interface parameters:</title>
|
|
<para>
|
|
<itemizedlist>
|
|
|
|
<listitem><para><guilabel>Number of results in a result
|
|
page</guilabel></para>
|
|
</listitem>
|
|
|
|
<listitem><para><guilabel>Result list font</guilabel>: There
|
|
is quite a lot of information shown in the result list, and
|
|
you may want to customise the font and/or font size. The rest
|
|
of the fonts used by &RCL; are determined by your generic Qt
configuration (try the <command>qtconfig</command> command).</para>
|
|
</listitem>
|
|
|
|
<listitem><para><guilabel>Html help browser</guilabel>: this
|
|
will let you choose your preferred browser, which will be
|
|
started from the <guimenu>Help</guimenu> menu to read the user
|
|
manual. You can enter a simple name if the command is in your
|
|
PATH, or browse for a full pathname.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><guilabel>Show document type icons in result
|
|
list</guilabel>: icons in the result list can be turned
|
|
off. They take quite a lot of space and convey relatively
|
|
little useful information.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><guilabel>Auto-start simple search on
|
|
whitespace entry</guilabel>: if this is checked, a search will
|
|
be executed each time you enter a space in the simple search
|
|
input field. This lets you look at the result list as you
|
|
enter new terms. This is off by default; you may or may not
like it.</para>
|
|
</listitem>
|
|
|
|
</itemizedlist>
|
|
</para>
|
|
</formalpara>
|
|
|
|
|
|
<formalpara><title>Search parameters:</title>
|
|
<para>
|
|
<itemizedlist>
|
|
<listitem><para><guilabel>Stemming language</guilabel>:
|
|
stemming obviously depends on the document's language. This
|
|
listbox will let you choose among the stemming databases which
|
|
were built during indexing (this is set in the <link
|
|
linkend="rcl.install.config.recollconf">main configuration
|
|
file</link>), or later added with
|
|
<command>recollindex -s</command> (See the recollindex
|
|
manual). Stemming languages which are dynamically added will be
|
|
deleted at the next indexing pass unless they are also added in
|
|
the configuration file.</para>
|
|
</listitem>
|
|
|
|
<listitem><para><guilabel>Dynamically build
|
|
abstracts</guilabel>: this decides if &RCL; tries to build
|
|
document abstracts when displaying the result list. Abstracts
|
|
are constructed by taking context from the document
|
|
information, around the search terms. This can slow down
|
|
result list display significantly for big documents, and you
|
|
may want to turn it off.</para>
|
|
</listitem>
|
|
<listitem><para><guilabel>Replace abstracts from
|
|
documents</guilabel>: this decides if we should synthesize and
|
|
display an abstract in place of an explicit abstract found
|
|
within the document itself.</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
</formalpara>
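
<para>For example, to add a stemming database after indexing, assuming
that <command>recollindex -s</command> takes the language name as its
argument (the language name below is illustrative; see the recollindex
manual page for the exact syntax):</para>

<screen>
<userinput>recollindex -s french</userinput>
</screen>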
|
|
|
|
<formalpara id="rcl.search.custom.extradb"><title>Extra
databases:</title>
<para>This panel will let you browse for additional databases
that you may want to search. Extra databases are designated by
their database directory (e.g.
<filename>/home/someothergui/.recoll/xapiandb</filename>,
<filename>/usr/local/recollglobal/xapiandb</filename>).</para>
</formalpara>
|
|
|
|
<para>Once entered, the databases will appear in the
|
|
<guilabel>All extra databases</guilabel> list, and you can
|
|
choose which ones you want to use at any moment by transferring
|
|
them to/from the <guilabel>Active extra databases</guilabel>
|
|
list.</para>
|
|
<para>Your main database (the one the current configuration
|
|
indexes to) is always implicitly active. If this is not
|
|
desirable, you can set up your configuration so that it indexes,
|
|
for example, an empty directory.</para>
|
|
|
|
</sect1>
|
|
|
|
</chapter>
|
|
|
|
|
|
<chapter id="rcl.install">
|
|
<title>Installation</title>
|
|
|
|
<sect1 id="rcl.install.building">
|
|
<title>Building from source</title>
|
|
|
|
<sect2 id="rcl.install.building.prereqs">
|
|
<title>Prerequisites</title>
|
|
|
|
<para>At the very least, you will need to download and install the
|
|
<ulink url="http://www.xapian.org">xapian core package</ulink>
|
|
(&RCL; development currently uses version 0.9.5), and the <ulink
|
|
url="http://www.trolltech.com/products/qt/index.html">qt
|
|
runtime and development packages</ulink> (&RCL; development
|
|
currently uses version 3.3.5, but any 3.3 version is
|
|
probably ok).</para>
|
|
|
|
<para>You will most probably be able to find a binary package for
|
|
<application>qt</application> for your system. You may have to
|
|
compile &XAP; but this is not difficult (if you are using
|
|
<application>FreeBSD</application>, there is a port).</para>
|
|
|
|
<para>You may also need
|
|
<ulink
|
|
url="http://www.gnu.org/software/libiconv/">libiconv</ulink>. &RCL;
|
|
currently uses version 1.9 (this should not be critical). On
|
|
<application>Linux</application> systems, the iconv interface
|
|
is part of libc and you should not need to do anything
special.</para>

</sect2>
|
|
|
|
<sect2 id="rcl.install.building.build">
|
|
<title>Building</title>
|
|
|
|
<para>&RCL; has been built on
Linux (Red Hat 7.3, Mandriva 2005, Fedora Core 3), FreeBSD and
Solaris 8. If you build on another system, <ulink
url="mailto:jean-francois.dockes@wanadoo.fr">I would very much
welcome patches</ulink>.</para>
|
|
|
|
<para>Depending on the <application>qt</application>
|
|
configuration on your system, you may have to set the
|
|
<literal>QTDIR</literal> and <literal>QMAKESPECS</literal>
|
|
variables in your environment:</para>
|
|
<itemizedlist>
<listitem><para><literal>QTDIR</literal> should point to the
directory above the one that holds the qt include files (e.g.
<filename>qt.h</filename>).</para>
</listitem>
<listitem><para><literal>QMAKESPECS</literal> should
be set to the name of one of the
<application>qt</application> mkspecs subdirectories (e.g.
<literal>linux-g++</literal>).</para>
</listitem>
</itemizedlist>
|
|
|
|
<para>On many Linux systems, <literal>QTDIR</literal> is set
by the login scripts, and <literal>QMAKESPECS</literal> is not
needed because there is a <filename>default</filename> link in
<filename>mkspecs/</filename>.</para>

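<para>If the variables are not already set, they can be defined by hand
before running <command>configure</command>. The following is only a
sketch for a Bourne-type shell: the <filename>/usr/lib/qt3</filename>
location is an assumption and will differ between systems, and
<literal>linux-g++</literal> is just the example value mentioned
above.</para>
<screen>
<userinput>QTDIR=/usr/lib/qt3; export QTDIR</userinput>
<userinput>QMAKESPECS=linux-g++; export QMAKESPECS</userinput>
<userinput>PATH=$QTDIR/bin:$PATH; export PATH</userinput>
</screen>
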
<para>In releases after 1.1.1, the &RCL; <command>configure</command>
script does a better job of checking these variables. With earlier
releases, unexplained errors will occur during
compilation if the environment is not set up. Also, for 1.1.0, the
<command>qmake</command> command should be in your
<literal>PATH</literal> (later releases can also find it in
<filename>$QTDIR/bin</filename>).</para>

<para>Normal procedure:</para>
<screen>
<userinput>cd recoll-xxx</userinput>
<userinput>./configure</userinput>
<userinput>make</userinput>
<userinput>(practises usual hardship-repelling invocations)</userinput>
</screen>

<para>There is little autoconfiguration. The
<command>configure</command> script will mainly link one of
the system-specific files in the <filename>mk</filename>
directory to <filename>mk/sysconf</filename>. If your system
is not known yet, it will tell you as much, and you may want
to manually copy and modify one of the existing files (the new
file name should be the output of <command>uname -s</command>).</para>
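<para>As a rough sketch of that manual step, assuming that one of the
existing files is named <filename>Linux</filename> and is close enough
to your system to serve as a starting point (both are assumptions to
check against the actual contents of the <filename>mk</filename>
directory):</para>
<screen>
<userinput>cd recoll-xxx/mk</userinput>
<userinput>cp Linux `uname -s`</userinput>
<userinput>(edit the new file, then go back up and run configure again)</userinput>
</screen>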
</sect2>

<sect2 id="rcl.install.building.install">
<title>Installation</title>

<para>Either type <userinput>make install</userinput> or execute
<userinput>recollinstall
<replaceable>prefix</replaceable></userinput>, in the root
of the source tree. This will copy the commands to
<filename><replaceable>prefix</replaceable>/bin</filename>
and the sample configuration files, scripts and other shared
data to
<filename><replaceable>prefix</replaceable>/share/recoll</filename>.</para>
<para>You can then proceed to <link
linkend="rcl.install.config">configuration</link>.</para>

</sect2>
</sect1>

<sect1 id="rcl.install.binary">
<title>Installing a prebuilt copy</title>

<sect2 id="rcl.install.binary.package">
<title>Installing through a package system</title>

<para>If you are lucky enough to be using a port system or a
prebuilt package (RPM or other), just follow the usual
procedure, and have a look at the <link
linkend="rcl.install.config">configuration
section</link>.</para>
</sect2>

<sect2 id="rcl.install.binary.rcl">
<title>Installing a prebuilt &RCL;</title>

<para>The unpackaged binary versions are just compressed tar
files of a build
tree, where only the useful parts were kept (executables and
sample configuration).</para>

<para>The executable binary files are built with a static link to
libxapian and libiconv, to make installation easier (no
dependencies). However, this also means that you cannot change
the versions which are used.</para>

<para>After extracting the tar file, you can proceed with
<link
linkend="rcl.install.building.install">installation</link> as
if you had built the package from source.</para>
</sect2>

</sect1>

<sect1 id="rcl.install.external">
<title>Packages needed for external file types</title>

<para>&RCL; uses external applications
to index some file types. You need to install them for the
file types that you wish to have indexed (these are run-time
dependencies; none is needed for building &RCL;):</para>

<itemizedlist>

<listitem><para>PDF: <command>pdftotext</command> is part of the <ulink
url="http://www.foolabs.com/xpdf/">Xpdf</ulink> package.</para>
</listitem>

<listitem><para>PostScript: <ulink
url="http://www.cs.wisc.edu/~ghost/doc/pstotext.htm">pstotext</ulink>.</para>
</listitem>

<listitem><para>MS Word: <ulink
url="http://www.winfield.demon.nl">antiword</ulink>.</para>
</listitem>

<listitem>
<para>RTF: <ulink
url="http://www.gnu.org/software/unrtf/unrtf.html">unrtf</ulink>.</para>
</listitem>

<listitem>
<para>DVI: <ulink
url="http://www.radicaleye.com/dvips.html">dvips</ulink>.</para>
</listitem>

<listitem>
<para>DjVu: <ulink
url="http://djvulibre.djvuzone.org/doc/index.html">DjVuLibre</ulink>.</para>
</listitem>

<listitem><para>MP3: &RCL; will use the
<command>id3info</command> command from the <ulink
url="http://id3lib.sourceforge.net/">id3lib</ulink> package to
extract tag information. Without it, only the filenames will
be indexed.</para>
</listitem>
</itemizedlist>

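<para>One quick way to see which of these helper commands are missing
is a small shell loop such as the following sketch. The command names
are the ones listed above (DjVuLibre is left out because its command
name is not given here), and <command>command -v</command> is the
POSIX shell way of looking a command up in the
<literal>PATH</literal>:</para>
<programlisting>
for cmd in pdftotext pstotext antiword unrtf dvips id3info; do
    # Print only the helpers that cannot be found in the PATH.
    command -v "$cmd" &gt;/dev/null 2&gt;&amp;1 || echo "missing: $cmd"
done
</programlisting>
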
<para>Text, HTML, mail folders and OpenOffice files are
processed internally.</para>
</sect1>

<sect1 id="rcl.install.config">
<title>Configuration overview</title>

<para>There are two sets of configuration files. The system-wide
files are kept in a directory named like
<filename>/usr/[local/]share/recoll/examples</filename>;
they define default values for the system. A parallel set of
files exists by default in the <filename>.recoll</filename> directory
in your home. This directory can be changed with the
<literal>RECOLL_CONFDIR</literal> environment variable or the
<literal>-c</literal> option to <command>recoll</command> and
<command>recollindex</command>.</para>

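<para>For illustration, either of the following would make
<command>recollindex</command> use an alternate configuration
directory. The <filename>.recoll-test</filename> name is purely
hypothetical, and the sketch assumes that <literal>-c</literal> takes
the directory as its argument:</para>
<screen>
<userinput>RECOLL_CONFDIR=$HOME/.recoll-test recollindex</userinput>
<userinput>recollindex -c $HOME/.recoll-test</userinput>
</screen>
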
<para>If the <filename>.recoll</filename> directory does not
exist when <command>recoll</command> or
<command>recollindex</command> is started, it
will be created with a set of empty configuration files.
<command>recoll</command> will give you a
chance to edit the configuration file before starting
indexing. <command>recollindex</command> will
proceed immediately.</para>

<para>Most of the parameters specific to the
<command>recoll</command> GUI are set through the
<guilabel>Preferences</guilabel> menu and stored in the
standard Qt location
(<filename>$HOME/.qt/recollrc</filename>). You probably do not
want to edit this by hand.</para>

<para>For other options, &RCL; uses text configuration
files. You will have to edit them by hand for
now (there is still some hope for a GUI configuration tool
in the future). The most accurate documentation for the
configuration parameters is given by comments inside the default
files, and we will just give a general overview here.</para>

<para>All configuration files share the same format. For
example, a short extract of the main configuration file might
look as follows:</para>
<programlisting>
# Space-separated list of directories to index.
topdirs = ~/docs /usr/share/doc

[~/somedirectory-with-utf8-txt-files]
defaultcharset = utf-8
</programlisting>

<para>There are three kinds of lines:</para>
<itemizedlist>
<listitem><para>Comment (starts with
<emphasis>#</emphasis>) or empty.</para>
</listitem>
<listitem><para>Parameter assignment (<emphasis>name =
value</emphasis>).</para>
</listitem>
<listitem><para>Section definition
([<emphasis>somedirname</emphasis>]).</para>
</listitem>
</itemizedlist>

<para>Section lines allow redefining some parameters for a
directory subtree. Some of the parameters used for indexing
are looked up hierarchically from the more to the less
specific. Not all parameters can be meaningfully redefined;
this is specified for each in the next section.</para>

<para>The tilde character (~) is expanded in file names to the
name of the user's home directory.</para>

<para>White space is used for separation inside lists.
Elements with embedded spaces can be quoted using
double-quotes.</para>

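<para>The following extract is a sketch combining these rules; the
directory names are made up for the illustration:</para>
<programlisting>
# The second element contains a space, so it is double-quoted.
topdirs = ~/docs "/home/me/My Documents"

# Parameters set after a section line only apply to that subtree.
[~/docs/legacy]
defaultcharset = iso8859-1
</programlisting>
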
<sect2 id="rcl.install.config.recollconf">
<title>Main configuration file</title>

<para><filename>recoll.conf</filename> is the main
configuration file. It defines things like
what to index (top directories and things to ignore), and the
default character set to use for document types which do not
specify it internally.</para>

<para>The default configuration will index your home
directory. If this is not appropriate, start
<command>recoll</command> to create a blank
configuration, click <guimenu>Cancel</guimenu>, and edit
the configuration file before restarting the command. Restarting
will begin the initial indexing, which may take some time.</para>

<para>Parameters:</para>

<variablelist>

<varlistentry><term><literal>topdirs</literal></term>
<listitem><para>Specifies the list of directories or files to
index (recursively for directories). The indexer will not
follow symbolic links inside the indexed trees. If an entry in
the <literal>topdirs</literal> list is a symbolic link,
indexing will not start and will generate an error.</para>
</listitem>
</varlistentry>

<varlistentry><term><literal>skippedNames</literal></term>
<listitem>
<para>A space-separated list of patterns for
names of files or directories that should be completely
ignored. The list defined in the default file is:</para>
<programlisting>
*~ #* bin CVS Cache caughtspam tmp
</programlisting>
<para>The list can be redefined for subdirectories, but is only
actually changed for the top-level ones in
<literal>topdirs</literal>.</para>
<para>The top-level directories are not affected by this
list (that is, a directory in <literal>topdirs</literal>
might match and would still be indexed).</para>
<para>The list in the default configuration does not
exclude hidden directories (names beginning with a
dot), which means that it may index quite a few things
that you do not want. On the other hand, mail user
agents like <application>thunderbird</application>
usually store messages in hidden directories, and you
probably want this indexed. One possible solution is to
have <userinput>.*</userinput> in
<literal>skippedNames</literal>, and add things like
<filename>~/.thunderbird</filename> or
<filename>~/.evolution</filename> in
<literal>topdirs</literal>.</para>
</listitem>
</varlistentry>

<varlistentry><term><literal>loglevel</literal></term>
<listitem><para>Verbosity level for <command>recoll</command> and
<command>recollindex</command>. A value of 4 lists quite a lot of
debug/information messages; a value of 2 only lists errors.</para>
</listitem>
</varlistentry>

<varlistentry><term><literal>logfilename</literal></term>
<listitem><para>Where the messages should go.
<literal>stderr</literal> can
be used as a special value, and is the default.</para>
</listitem>
</varlistentry>

<varlistentry><term><literal>filtersdir</literal></term>
<listitem><para>A directory to search for the external
filter scripts used to index some types of files. The
value should not be changed, except if you want to modify
one of the default scripts. The value can be redefined for
any subdirectory.</para>
</listitem>
</varlistentry>

<varlistentry><term><literal>indexstemminglanguages</literal></term>
<listitem><para>A list of languages for which the stem
expansion databases will be built. See recollindex(1) for
possible values. You can add a stem expansion database for
a different language by using <command>recollindex
-s</command>, but it will be deleted during the next
indexing. Only languages listed in the configuration
file are permanent.</para>
</listitem>
</varlistentry>

<varlistentry><term><literal>iconsdir</literal></term>
<listitem><para>The name of the directory where
<command>recoll</command> result list icons are
stored. You can change this if you want different
images.</para>
</listitem>
</varlistentry>

<varlistentry><term><literal>dbdir</literal></term>
<listitem><para>The name of the Xapian data directory. It
will be created if needed when the index is
initialized. If this is not an absolute path, it will be
interpreted relative to the configuration directory.</para>
</listitem>
</varlistentry>

<varlistentry><term><literal>defaultcharset</literal></term>
<listitem><para>The name of the character set used for
files that do not contain a character set definition (for example,
plain text files). This can be redefined for any
subdirectory. If it is not set at all, the character set
used is the one defined by the NLS environment (LC_ALL,
LC_CTYPE, LANG), or iso8859-1 if nothing is set.</para>
</listitem>
</varlistentry>

<varlistentry><term><literal>guesscharset</literal></term>
<listitem><para>Decides whether &RCL; tries to guess the character
set of files when no internal value is available (for example, for
plain text files). This does not work well in general, and
should probably not be used.</para>
</listitem>
</varlistentry>

<varlistentry><term><literal>usesystemfilecommand</literal></term>
<listitem><para>Decides whether the <command>file -i</command>
system command is used as a final step for determining the mime
type for a file (the main procedure uses suffix
associations as defined in the <filename>mimemap</filename>
file). This can be useful for files with suffixless names,
but it will also cause the indexing of many bogus "text"
files.</para>
</listitem>
</varlistentry>

<varlistentry><term><literal>indexallfilenames</literal></term>
<listitem><para>&RCL; indexes file names in a special
section of the database to allow specific file name
searches using wild cards. This parameter decides whether
file name indexing is performed only for files with mime
types that would qualify them for full text indexing, or
for all files inside the selected subtrees, independent of
mime type.</para>
</listitem>
</varlistentry>

</variablelist>

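<para>As a summary, a small <filename>recoll.conf</filename> using
several of these parameters might look like the sketch below. All
values are illustrative assumptions rather than recommendations:
the <literal>xapiandb</literal> and log file names are arbitrary, and
<literal>english</literal> should be checked against the values listed
in recollindex(1).</para>
<programlisting>
# Index the documents directory, plus the mail kept in a hidden directory.
topdirs = ~/docs ~/.thunderbird
# Default list, with .* added to skip other hidden files and directories.
skippedNames = .* *~ #* bin CVS Cache caughtspam tmp
# Only log errors, to a file instead of stderr.
loglevel = 2
logfilename = /tmp/recoll-index.log
# Relative path: interpreted relative to the configuration directory.
dbdir = xapiandb
indexstemminglanguages = english
defaultcharset = iso8859-1
</programlisting>
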
</sect2>

<sect2 id="rclinstall.config.mimemap">
<title>The mimemap file</title>

<para><filename>mimemap</filename> specifies the
file name extension to mime type mappings.</para>

<para>For file names without an extension, or with an unknown
one, the system's <command>file -i</command> command will be
executed to determine the mime type (this can be switched off
inside the main configuration file).</para>

<para>The mappings can be specified on a per-subtree basis,
which may be useful in some cases. Example:
<application>gaim</application> logs have a
<filename>.txt</filename> extension but
should be handled specially, which is possible because they
are usually all located in one place.</para>

<para><filename>mimemap</filename> also has a
<literal>recoll_noindex</literal> variable which is a list of
suffixes. Matching files will be skipped, which avoids unnecessary
decompressions or <command>file</command> executions. This is
partially redundant with <literal>skippedNames</literal> in
the main configuration file, with two differences: it will not
affect directories, and it can be changed for any
subdirectory.</para>

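<para>A short extract might look like the following sketch. It assumes
that suffix mappings use the same <emphasis>name = value</emphasis>
form as the other configuration files; the suffix list, the
<filename>~/.gaim/logs</filename> location and the
<literal>text/x-gaim-log</literal> type name are illustrative
assumptions, so check the default file for the actual values.</para>
<programlisting>
# Suffix to mime type mappings.
.txt = text/plain
.html = text/html

# Files with these suffixes are skipped.
recoll_noindex = .tar.gz .tgz .zip

# Per-subtree override: gaim logs are .txt files needing special handling.
[~/.gaim/logs]
.txt = text/x-gaim-log
</programlisting>
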
</sect2>

<sect2 id="rclinstall.config.mimeconf">
<title>The mimeconf file</title>

<para><filename>mimeconf</filename> specifies how the
different mime types are handled for indexing, and for
display.</para>

<para>Changing the indexing parameters is probably not a
good idea unless you are a &RCL; developer.</para>

<para>You may want to adjust the external viewers defined in
this file (for example, html is either previewed internally or
displayed using <application>firefox</application>, but you may prefer
<application>mozilla</application>, or your
<application>openoffice.org</application>
program might be named <command>ooffice</command> instead of
<command>openoffice</command>). Look
for the <literal>[view]</literal> section.</para>

<para>You can also change the icons which are displayed by
<command>recoll</command> in the result lists (the values are
the basenames of the png images inside the
<filename>iconsdir</filename> directory, which is specified in
<filename>recoll.conf</filename>).</para>

</sect2>

</sect1>
</chapter>

</book>