This commit is contained in:
Jean-Francois Dockes 2015-01-19 16:57:03 +01:00
parent 4a987b708e
commit d6acbdfd9e


@ -50,18 +50,23 @@
<sect1 id="RCL.INTRODUCTION.TRYIT"> <sect1 id="RCL.INTRODUCTION.TRYIT">
<title>Giving it a try</title> <title>Giving it a try</title>
<para>If you do not like reading manuals (who does?) and would like <para>If you do not like reading manuals (who does?) but
to give &RCL; a try, just <link wish to give &RCL; a try, just <link
linkend="RCL.INSTALL.BINARY">install</link> the application and linkend="RCL.INSTALL.BINARY">install</link> the application
start the <command>recoll</command> graphical user interface (GUI), and start the <command>recoll</command> graphical user
which will ask to index your home directory by default, allowing interface (GUI), which will ask permission to index your home
you to search immediately after indexing completes.</para> directory by default, allowing you to search immediately after
indexing completes.</para>
<para>Do not do this if your home directory contains a huge <para>Do not do this if your home directory contains a huge
number of documents and you do not want to wait or are very number of documents and you do not want to wait or are very
short on disk space. In this case, you may first want to customize short on disk space. In this case, you may first want to customize
the <link linkend="RCL.INDEXING.CONFIG">configuration</link> the <link linkend="RCL.INDEXING.CONFIG">configuration</link>
to restrict the indexed area.</para> to restrict the indexed area (for the very impatient with a completed package install, from the <command>recoll</command> GUI: <menuchoice>
<guimenu>Preferences</guimenu>
<guimenuitem>Indexing configuration</guimenuitem>
</menuchoice>, then adjust the <guilabel>Top
directories</guilabel> section).</para>
<para>Also be aware that you may need to install the <para>Also be aware that you may need to install the
appropriate <link linkend="RCL.INSTALL.EXTERNAL"> supporting appropriate <link linkend="RCL.INSTALL.EXTERNAL"> supporting
@ -74,12 +79,12 @@
<title>Full text search</title> <title>Full text search</title>
<para>&RCL; is a full text search application. Full text search <para>&RCL; is a full text search application. Full text search
applications let you find your data by content rather finds your data by content rather than by external attributes
than by external attributes (like a file name). More (like a file name). You specify words
specifically, they will let you specify words (terms) that (terms) which should or should not appear in the text you are
should or should not appear in the text you are looking for, looking for, and receive in return a list of matching
and return a list of matching documents, ordered so that the documents, ordered so that the most
most <emphasis>relevant</emphasis> documents will appear <emphasis>relevant</emphasis> documents will appear
first.</para> first.</para>
<para>You do not need to remember in what file or email message you <para>You do not need to remember in what file or email message you
@ -88,27 +93,30 @@
these terms are prominent, in a similar way to Internet search these terms are prominent, in a similar way to Internet search
engines.</para> engines.</para>
<para>A search application tries to determine which documents are <para>Full text search applications try to determine which
most relevant to the search terms you provide. Computer algorithms documents are most relevant to the search terms you
for determining relevance can be very complex, and in general are provide. Computer algorithms for determining relevance can be
inferior to the power of the human mind to rapidly determine very complex, and in general are inferior to the power of the
relevance. The quality of relevance guessing is probably the most human mind to rapidly determine relevance. The quality of
important aspect when evaluating a search application.</para> relevance guessing is probably the most important aspect when
evaluating a search application.</para>
<para>In many cases, you are looking for all the forms of a <para>In many cases, you are looking for all the forms of a
word, not for a specific form or spelling. These different forms word, including plurals, different tenses for a verb, or terms
may include plurals, different tenses for a verb, or terms derived derived from the same root or <emphasis>stem</emphasis>
from the same root or <emphasis>stem</emphasis> (example: floor, (example: <replaceable>floor, floors, floored,
floors, floored, flooring...). Search applications usually expand flooring...</replaceable>). Queries are usually automatically
queries to all such related terms (words that reduce to the same expanded to all such related terms (words that reduce to the
stem) and also provide a way to disable this expansion if you are same stem). This expansion can be disabled when searching for a
actually searching for a specific form.</para> specific form.</para>
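<para>The expansion described above can be pictured with a small sketch. It uses a toy suffix stripper, which is an assumption made for illustration only and not the stemmer that &XAP; provides; the names <literal>toy_stem</literal> and <literal>expand_query</literal> are hypothetical.</para>

```python
# Toy suffix-stripping stemmer: an illustration only, not the real stemmer.
SUFFIXES = ("ing", "ed", "s")  # checked in this order


def toy_stem(word):
    """Strip one common English suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def expand_query(term, index_terms):
    """Return every index term that reduces to the same stem as `term`."""
    stem = toy_stem(term)
    return sorted(t for t in index_terms if toy_stem(t) == stem)


index_terms = ["floor", "floors", "floored", "flooring", "flood"]
expanded = expand_query("floors", index_terms)
print(expanded)
```

<para>Here a query for one form is widened to every indexed form sharing its stem, while an unrelated term such as <replaceable>flood</replaceable> is left out. Disabling the expansion would amount to searching for the literal term only.</para>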
<para>Stemming, by itself, does not accommodate for misspellings or
phonetic searches. &RCL; supports these features through a specific
tool (the <literal>term explorer</literal>) which will let you
explore the set of index terms along different modes.</para>
<para>Stemming, by itself, does not accommodate misspellings
or phonetic searches. A full text search application may also
support this form of approximation. For example, a search for
<replaceable>aliterattion</replaceable> returning no result may
propose, depending on index contents, <replaceable>alliteration
alteration alterations altercation</replaceable> as possible
replacement terms. </para>
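<para>As a rough sketch of this kind of approximation, the following uses the <literal>difflib</literal> module from the Python standard library to propose index terms close to a misspelled word. This is an assumed stand-in chosen for illustration, not the mechanism &RCL; actually uses; the <literal>suggest</literal> function and the sample term list are hypothetical.</para>

```python
import difflib


def suggest(term, index_terms, max_suggestions=4):
    """Propose close index terms, but only when `term` itself is not indexed."""
    if term in index_terms:
        return []  # the term matches as typed: nothing to propose
    return difflib.get_close_matches(
        term, index_terms, n=max_suggestions, cutoff=0.6
    )


index_terms = ["alliteration", "alteration", "alterations",
               "altercation", "floor"]
suggestions = suggest("aliterattion", index_terms)
print(suggestions)
```

<para>Which terms are proposed depends entirely on the index contents and on the similarity threshold, so a real index would typically yield a different list.</para>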
</sect1> </sect1>
@ -120,14 +128,25 @@
library as its storage and retrieval engine. &XAP; is a very library as its storage and retrieval engine. &XAP; is a very
mature package using <ulink mature package using <ulink
url="http://www.xapian.org/docs/intro_ir.html">a sophisticated url="http://www.xapian.org/docs/intro_ir.html">a sophisticated
probabilistic ranking model</ulink>. &RCL; provides the mechanisms probabilistic ranking model</ulink>.</para>
and interface to get data into and out of the system.</para>
<para>In practice, &XAP; works by remembering where terms appear <para>The &XAP; library manages an index database which
in your document files. The acquisition process is called describes where terms appear in your document files. It
indexing. </para> efficiently processes the complex queries which are produced by
the &RCL; query expansion mechanism, and is in charge of the
all-important relevance computation task.</para>
<para>The resulting index can be big (roughly the size of the <para>&RCL; provides the mechanisms and interface to get data
into and out of the index. This includes translating the many
possible document formats into pure text, handling term
variations (using &XAP; stemmers), and spelling approximations
(using the <application>aspell</application> speller),
interpreting user queries and presenting results.</para>
<para>In short, &RCL; does the dirty footwork, and &XAP;
deals with the intelligent parts of the process.</para>
<para>The &XAP; index can be big (roughly the size of the
original document set), but it is not a document original document set), but it is not a document
archive. &RCL; can only display documents that still exist at archive. &RCL; can only display documents that still exist at
the place from which they were indexed. (Actually, there is a the place from which they were indexed. (Actually, there is a
@ -136,9 +155,12 @@
punctuation and capitalization are lost).</para> punctuation and capitalization are lost).</para>
<para>&RCL; stores all internal data in <application>Unicode <para>&RCL; stores all internal data in <application>Unicode
UTF-8</application> format, and it can index files with UTF-8</application> format, and it can index files of many types
different character sets, encodings, and languages into the same with different character sets, encodings, and languages into the
index. It has can process many document types.</para> same index. It can process documents embedded inside other
documents (for example a pdf document stored inside a Zip
archive sent as an email attachment...), down to an arbitrary
depth.</para>
<para>Stemming is the process by which &RCL; reduces words to <para>Stemming is the process by which &RCL; reduces words to
their radicals so that searching does not depend, for example, on a their radicals so that searching does not depend, for example, on a
@ -206,9 +228,12 @@
<para>The <link linkend="RCL.INDEXING.PERIODIC.EXEC">indexing <para>The <link linkend="RCL.INDEXING.PERIODIC.EXEC">indexing
process</link> is started automatically the first time you process</link> is started automatically the first time you
execute the <command>recoll</command> GUI. Indexing can also be execute the <command>recoll</command> GUI. Indexing can also
performed by executing the <command>recollindex</command> be performed by executing the <command>recollindex</command>
command.</para> command. &RCL; indexing is multithreaded by default when
appropriate hardware resources are available, and can run
text extraction, segmentation and index updates in
parallel.</para>
<para><link linkend="RCL.SEARCH">Searches</link> are usually <para><link linkend="RCL.SEARCH">Searches</link> are usually
performed inside the <command>recoll</command> GUI, which has many performed inside the <command>recoll</command> GUI, which has many
@ -220,7 +245,10 @@
<application>Python</application> <application>Python</application>
programming interface</link>, a <link linkend="RCL.SEARCH.KIO"> programming interface</link>, a <link linkend="RCL.SEARCH.KIO">
<application>KDE</application> KIO slave module</link>, and <application>KDE</application> KIO slave module</link>, and
a <ulink url="&WIKI;UnityLens">Ubuntu Unity Lens</ulink> module. Ubuntu Unity <ulink url="https://bitbucket.org/medoc/unity-lens-recoll">
Lens</ulink> (for older versions) or
<ulink url="https://bitbucket.org/medoc/unity-scope-recoll">
Scope</ulink> (for current versions) modules.
</para> </para>
</sect1> </sect1>
@ -236,11 +264,11 @@
<para>Indexing is the process by which the set of documents is <para>Indexing is the process by which the set of documents is
analyzed and the data entered into the database. &RCL; analyzed and the data entered into the database. &RCL;
indexing is normally incremental: documents will only be indexing is normally incremental: documents will only be
processed if they have been modified. On the first execution, processed if they have been modified since the last run. On
all documents will need processing. A full index build can be the first execution, all documents will need processing. A
forced later by specifying an option to the indexing command full index build can be forced later by specifying an option
(<command>recollindex</command> <option>-z</option> to the indexing command (<command>recollindex</command>
or <option>-Z</option>).</para> <option>-z</option> or <option>-Z</option>).</para>
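<para>The idea of incremental indexing can be sketched as follows. The sketch assumes that a document counts as modified when its modification time or size differs from the values recorded on the previous run; this is an illustrative assumption, not a description of the test &RCL; performs, and <literal>file_signature</literal> and <literal>files_to_index</literal> are hypothetical names.</para>

```python
import os
import tempfile


def file_signature(path):
    """Return a cheap change marker: modification time and size."""
    st = os.stat(path)
    return (st.st_mtime_ns, st.st_size)


def files_to_index(paths, recorded, force=False):
    """Return the paths needing processing and update the recorded signatures.

    With force=True every path is returned, as in a full index rebuild.
    """
    todo = []
    for path in paths:
        sig = file_signature(path)
        if force or recorded.get(path) != sig:
            todo.append(path)
        recorded[path] = sig
    return todo


with tempfile.TemporaryDirectory() as topdir:
    doc = os.path.join(topdir, "note.txt")
    with open(doc, "w") as f:
        f.write("full text search")
    recorded = {}
    first_run = files_to_index([doc], recorded)    # nothing recorded yet
    second_run = files_to_index([doc], recorded)   # unchanged since last run
    forced_run = files_to_index([doc], recorded, force=True)
    print(first_run == [doc], second_run, forced_run == [doc])
```

<para>The first pass processes everything, the second pass skips the unchanged file, and the forced pass processes it again regardless of the recorded state.</para>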
<para>The following sections give an overview of different <para>The following sections give an overview of different
aspects of the indexing processes and configuration, with links aspects of the indexing processes and configuration, with links
@ -1853,6 +1881,11 @@ MimeType=*/*
term is not known. For example, you may not remember the exact term is not known. For example, you may not remember the exact
spelling, or only know the beginning of the name.</para> spelling, or only know the beginning of the name.</para>
<para>The search will only propose replacement terms with
spelling variations when no matching documents were found. In some
cases, both proper spellings and misspellings are present in the
index, and it may be interesting to look for them explicitly.</para>
<para>The term explorer tool (started from the toolbar icon or <para>The term explorer tool (started from the toolbar icon or
from the <guilabel>Term explorer</guilabel> entry of the from the <guilabel>Term explorer</guilabel> entry of the
<guilabel>Tools</guilabel> menu) can be used to search the full index <guilabel>Tools</guilabel> menu) can be used to search the full index
@ -4636,9 +4669,11 @@ except:
<listitem><para>Openoffice files need <command>unzip</command> and <listitem><para>Openoffice files need <command>unzip</command> and
<command>xsltproc</command>.</para></listitem> <command>xsltproc</command>.</para></listitem>
<listitem><para>PDF files need <command>pdftotext</command> which <listitem><para>PDF files need <command>pdftotext</command>
is part of the <application>Xpdf</application> or which is part of <application>Poppler</application> (usually
<application>Poppler</application> packages.</para></listitem> comes with the <literal>poppler-utils</literal>
package). Avoid the original one from
<application>Xpdf</application>.</para></listitem>
<listitem><para>Postscript files need <command>pstotext</command>. <listitem><para>Postscript files need <command>pstotext</command>.
The original version has an issue with shell The original version has an issue with shell
@ -4663,9 +4698,11 @@ except:
<application>libwpd-tools</application> on Ubuntu) <application>libwpd-tools</application> on Ubuntu)
package.</para></listitem> package.</para></listitem>
<listitem><para>RTF files need <command>unrtf</command>, which, in <listitem><para>RTF files need <command>unrtf</command>,
its standard version, has much trouble with non-western character which, in its older versions, has much trouble with
sets. Check &RCLAPPS;.</para></listitem> non-western character sets. Many Linux distributions carry
outdated <command>unrtf</command> versions. Check
&RCLAPPS; for details.</para></listitem>
<listitem><para>TeX files need <command>untex</command> or <listitem><para>TeX files need <command>untex</command> or
<command>detex</command>. Check &RCLAPPS; for sources if it's not <command>detex</command>. Check &RCLAPPS; for sources if it's not