Jean-Francois Dockes 2011-10-20 13:39:44 +02:00
parent 8d52e928d1
commit 90233c0426


@ -140,20 +140,20 @@
currently makes no attempt at automatic language recognition.</para> currently makes no attempt at automatic language recognition.</para>
<para>&RCL; has many parameters which define exactly what to <para>&RCL; has many parameters which define exactly what to
index, and how to classify and decode the source index, and how to classify and decode the source documents. These
documents. These are kept in <link are kept in <link linkend="rcl.indexing.config">configuration
linkend="rcl.indexing.config">configuration files</link>. A files</link>. A default configuration is copied into a standard
default configuration is copied into a standard location location (usually something like
(usually something like <filename>/usr/[local/]share/recoll/examples</filename>) during
<filename>/usr/[local/]share/recoll/examples</filename>) installation. The default parameters from this file may be
during installation. The default parameters from this file may overridden by values that you set inside your personal
be overridden by values that you set inside your personal configuration, found by default in the <filename>.recoll</filename>
configuration, found by default in the sub-directory of your home directory. The default configuration
<filename>.recoll</filename> sub-directory of your home will index your home directory with default parameters and should
directory. The default configuration will index your home be sufficient for giving &RCL; a try, but you may want to adjust it
directory with default parameters and should be sufficient for later, which can be done either by editing the text files or by
giving &RCL; a try, but you may want to adjust it using configuration menus in the <command>recoll</command>
later.</para> GUI.</para>
<para><link linkend="rcl.indexing.periodic.exec">Indexing</link> <para><link linkend="rcl.indexing.periodic.exec">Indexing</link>
is started automatically the first time you execute the is started automatically the first time you execute the
@ -184,7 +184,7 @@
<para>Indexing is the process by which the set of documents is <para>Indexing is the process by which the set of documents is
analyzed and the data entered into the database. &RCL; indexing analyzed and the data entered into the database. &RCL; indexing
is normally incremental: documents will only be processed if is normally incremental: documents will only be processed if
they have been modified. On the first execution, of course, all they have been modified. On the first execution, all
documents will need processing. A full index build can be forced documents will need processing. A full index build can be forced
later by specifying an option to the indexing command later by specifying an option to the indexing command
(<command>recollindex -z</command>).</para> (<command>recollindex -z</command>).</para>
@ -238,7 +238,7 @@
a folder file archived inside a zip file...</para> a folder file archived inside a zip file...</para>
<para>&RCL; indexing processes plain text, HTML, openoffice <para>&RCL; indexing processes plain text, HTML, openoffice
and e-mail files internally (a few more actually).</para> and e-mail files, and a few others internally.</para>
<para>Other file types (ie: postscript, pdf, ms-word, rtf ...) <para>Other file types (ie: postscript, pdf, ms-word, rtf ...)
need external applications for preprocessing. The list is in the need external applications for preprocessing. The list is in the
@ -342,40 +342,23 @@ recoll
<sect2 id="rcl.indexing.storage.format"> <sect2 id="rcl.indexing.storage.format">
<title>Xapian index formats</title> <title>Xapian index formats</title>
<para>If your first installation of &RCL; was 1.9.0 or more <para>&XAP; versions usually support several formats for index
recent, you can skip this section.</para> storage. A given major &XAP; version will have a current format,
used to create new indexes, and will also support the format from
<para>&XAP; has had two possible index formats for quite some the previous major version.</para>
time. The "old" one named <literal>Quartz</literal>, and the
new one named <literal>Flint</literal>. &XAP; 0.9 used
<literal>Quartz</literal> by default, but could use
<literal>Flint</literal> if a specific environment variable
(<literal>XAPIAN_PREFER_FLINT</literal>) was set. &XAP; 1.0
still supports <literal>Quartz</literal> but will use
<literal>Flint</literal> by default for new index
creations.</para>
<para>The number of disk accesses performed during indexing
has been much optimized in the new <literal>Flint</literal>
engine and you may see indexing times improved by 50% in some
cases (compared to <literal>Quartz</literal>), typically for
big indexes where disk accesses dominate the indexing
time. There is also a more modest improvement of index
size.</para>
<para>&XAP; will not convert automatically an existing index <para>&XAP; will not convert automatically an existing index
from the <literal>Quartz</literal> to the from the older format to the newer one. If you want to upgrade to
<literal>Flint</literal> format. If you have an older index the new format, or if a very old index needs to be converted
and want to take advantage of the new format (which can be because its format is not supported any more, you will have to
done without setting the environment variable as of &RCL; explicitly delete the old index, then run a normal indexing
1.8.2 and &XAP; 1.0.0), you will have to explicitly delete process.</para>
the old index, then run a normal indexing process.</para>
<para>Unfortunately, using the <literal>-z</literal> option to <para>Unfortunately, using the <literal>-z</literal> option to
<command>recollindex</command> is not sufficient to change the <command>recollindex</command> is not sufficient to change the
format, you have to delete all files inside the index format, you will have to delete all files inside the index
directory (typically <filename>~/.recoll/xapiandb</filename>) directory (typically <filename>~/.recoll/xapiandb</filename>)
before starting indexing.</para> before starting the indexing.</para>
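<para>The manual steps above can be sketched in Python. This is only an illustration: the helper name <literal>clear_index_dir</literal> is hypothetical, and the path is the typical default quoted above, which may differ on your system.</para>

```python
import shutil
from pathlib import Path


def clear_index_dir(index_dir):
    """Delete everything inside index_dir, keeping the directory itself.

    Hypothetical helper, not part of Recoll. Returns the sorted names
    of the entries that were removed.
    """
    index_dir = Path(index_dir).expanduser()
    removed = []
    for entry in index_dir.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)  # sub-directory: remove recursively
        else:
            entry.unlink()        # regular file or symlink
        removed.append(entry.name)
    return sorted(removed)


# Example usage, left commented out because it is destructive:
# import subprocess
# clear_index_dir("~/.recoll/xapiandb")        # typical default location
# subprocess.run(["recollindex"], check=True)  # then a normal indexing run
```

<para>The directory itself is kept, so that only the files holding the old format are removed before the next normal indexing run recreates the index.</para>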
</sect2> </sect2>
@ -387,7 +370,7 @@ recoll
complete reconstruction. If confidential data is indexed, complete reconstruction. If confidential data is indexed,
access to the database directory should be restricted. </para> access to the database directory should be restricted. </para>
<para>As of version 1.4, &RCL; will create the configuration <para>&RCL; (since version 1.4) will create the configuration
directory with a mode of 0700 (access by owner only). As the directory with a mode of 0700 (access by owner only). As the
index data directory is by default a sub-directory of the index data directory is by default a sub-directory of the
configuration directory, this should result in appropriate configuration directory, this should result in appropriate
@ -511,16 +494,16 @@ recoll
<title>Running indexing</title> <title>Running indexing</title>
<para>Indexing is performed either by the <para>Indexing is performed either by the
<command>recollindex</command> program, or by the <command>recollindex</command> program, or by the indexing thread
indexing thread inside the <command>recoll</command> inside the <command>recoll</command> program (start it from the
program (use the <guimenu>File</guimenu> menu). Both programs <guimenu>File</guimenu> menu). Both programs will use the
will use the <literal>RECOLL_CONFDIR</literal> <literal>RECOLL_CONFDIR</literal> variable or accept a
variable or accept a <literal>-c</literal> <literal>-c</literal> <replaceable>confdir</replaceable> option
<replaceable>confdir</replaceable> option to specify a non-default to specify a non-default configuration directory.</para>
configuration directory.</para>
<para>Reasons to use either the indexing thread or the <para>There are reasons to use either the indexing thread or the
<command>recollindex</command> command: <command>recollindex</command> command, but it is also a matter of
personal preferences:
<itemizedlist> <itemizedlist>
<listitem><para>Starting the indexing thread is more convenient, <listitem><para>Starting the indexing thread is more convenient,
being just one click away.</para> being just one click away.</para>
@ -534,14 +517,15 @@ recoll
but who knows...)</para> but who knows...)</para>
</listitem> </listitem>
<listitem><para>The <command>recollindex</command> command uses <listitem><para>The <command>recollindex</command> command uses
<command>setpriority/nice</command> to lower its priority while <command>setpriority/nice</command> to lower its priority
indexing while indexing. When available (and for &RCL; version
(it will also use <command>ionice</command> when this becomes 1.16.2 and newer), it also uses the
more widely available), the thread can't do it, else it would <command>ionice</command> command to lower its IO
also slow down the user/search interface.</para> priority. The thread can't do it, else it would also slow
down the user/search interface.</para>
</listitem> </listitem>
</itemizedlist> </itemizedlist>
I'll let the reader decide where my heart belongs...</para> </para>
<para>If the <command>recoll</command> program finds no index <para>If the <command>recoll</command> program finds no index
when it starts, it will automatically start indexing (except when it starts, it will automatically start indexing (except
@ -631,7 +615,7 @@ recoll
with the <literal>--with[out]-fam</literal> or with the <literal>--with[out]-fam</literal> or
<literal>--with[out]-inotify</literal> options. The default is <literal>--with[out]-inotify</literal> options. The default is
currently to include inotify monitoring on systems that support currently to include inotify monitoring on systems that support
it.</para> it, and, as of recoll 1.17, gamin support on FreeBSD.</para>
<para>The <filename>rclmon.sh</filename> script can be used to <para>The <filename>rclmon.sh</filename> script can be used to
easily start and stop the daemon. It can be found in the easily start and stop the daemon. It can be found in the
@ -1311,19 +1295,13 @@ fvwm
<title>Sorting search results and collapsing duplicates</title> <title>Sorting search results and collapsing duplicates</title>
<para>The documents in a result list are normally sorted in <para>The documents in a result list are normally sorted in
order of relevance. It is possible to specify different sort order of relevance. It is possible to specify a different sort
parameters by using the <guimenu>Sort parameters</guimenu> order, either by using the vertical arrows in the GUI toolbox to
dialog (located in the <guimenu>Tools</guimenu> menu).</para> sort by date, or switching to the result table display and clicking
on any header. The sort order chosen inside the result table
<para>The tool sorts a specified number of the most remains active if you switch back to the result list, and stays so
relevant documents in the result list, according to specified until you click the vertical arrows so that both are unchecked
criteria. The currently available criteria are (which returns to sorting by relevance).</para>
<emphasis>date</emphasis> and <emphasis>mime
type</emphasis>.</para>
<para>The sort parameters stay in effect until they are
explicitly reset, or the program exits. An activated sort is
indicated in the result list header.</para>
<para>Sort parameters are remembered between program <para>Sort parameters are remembered between program
invocations, but result sorting is normally always inactive invocations, but result sorting is normally always inactive
@ -1427,15 +1405,34 @@ fvwm
<formalpara><title>AutoPhrases</title> <formalpara><title>AutoPhrases</title>
<para>This option can be set in the preferences dialog. If it is <para>This option can be set in the preferences dialog. If it is
set, a phrase will be automatically built and added to simple set, a phrase will be automatically built and added to simple
searches when looking for <literal>Any terms</literal>. This searches when looking for <literal>Any terms</literal>. This
will not change radically the results, but will give a relevance will not change radically the results, but will give a relevance
boost to the results where the search terms appear as a boost to the results where the search terms appear as a
phrase. Ie: searching for <literal>virtual reality</literal> phrase. Ie: searching for <literal>virtual reality</literal>
will still find all documents where either will still find all documents where either
<literal>virtual</literal> or <literal>reality</literal> or <literal>virtual</literal> or <literal>reality</literal> or
both appear, but those which contain <literal>virtual both appear, but those which contain <literal>virtual
reality</literal> should appear sooner in the list.</para> reality</literal> should appear sooner in the list.</para>
<para>Phrase searches can strongly slow down a query if most of the
terms in the phrase are common. This is why the
<literal>autophrase</literal> option is off by default for &RCL;
versions before 1.17. As of version 1.17,
<literal>autophrase</literal> is on by default, but very common
terms will be removed from the constructed phrase. The removal
threshold can be adjusted from the search preferences.</para>
<formalpara><title>Phrases and abbreviations</title> <para>As of
&RCL; version 1.17, dotted abbreviations like
<literal>I.B.M.</literal> are also automatically indexed as a word
without the dots: <literal>IBM</literal>. Searching for the word
inside a phrase (ie: <literal>"the IBM company"</literal>) will only
match the dotted abbreviation if you increase the phrase slack (using the
advanced search panel control, or the <literal>o</literal> query
language modifier). Literal occurrences of the word will be matched
normally.</para>
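<para>As a rough illustration of the abbreviation handling described above, the sketch below emits both the dotted and the undotted form of a word. The function name and the regular expression are assumptions made for this example; the actual rules used by the indexer may differ.</para>

```python
import re

# Assumption: a "dotted abbreviation" is two or more single letters,
# each followed by a dot, such as "I.B.M.". The real indexer may use
# different rules.
DOTTED_ABBREVIATION = re.compile(r"(?:[A-Za-z]\.){2,}")


def index_terms(word):
    """Return the terms to index for one word (hypothetical helper)."""
    terms = [word]
    if DOTTED_ABBREVIATION.fullmatch(word):
        # Also index the abbreviation as a plain word without the dots.
        terms.append(word.replace(".", ""))
    return terms
```

<para>With this sketch, <literal>index_terms("I.B.M.")</literal> yields both <literal>I.B.M.</literal> and <literal>IBM</literal>, while an ordinary word yields only itself.</para>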
</sect3> </sect3>
@ -3406,6 +3403,13 @@ skippedNames = #* bin CVS Cache cache* caughtspam tmp .thumbnails .svn \
<programlisting> <programlisting>
skippedPaths = ~/somedir/&lowast;.txt skippedPaths = ~/somedir/&lowast;.txt
</programlisting> </programlisting>
<para>The values in the <literal>*skippedPaths</literal>
variables are currently matched with
<literal>fnmatch(3)</literal>, with the FNM_PATHNAME and
FNM_LEADING_DIR flags. This means that '/' characters must
be matched explicitly (a wildcard will not match across a
directory separator), which is probably unfortunate.</para>
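<para>The effect of the two flags can be approximated in pure Python. The sketch below is an emulation written for illustration, not the code used by &RCL;: the function name is hypothetical, bracket expressions are not handled, and the example paths are invented.</para>

```python
import re


def skipped_path_match(pattern, path):
    """Approximate fnmatch(3) with FNM_PATHNAME and FNM_LEADING_DIR.

    Simplified emulation: only '*' and '?' are treated as wildcards,
    bracket expressions are taken literally.
    """
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append("[^/]*")   # FNM_PATHNAME: '*' never matches '/'
        elif ch == "?":
            parts.append("[^/]")    # FNM_PATHNAME: '?' never matches '/'
        else:
            parts.append(re.escape(ch))
    # FNM_LEADING_DIR: anything after a matched leading directory is ignored.
    regex = "".join(parts) + "(?:/.*)?"
    return re.fullmatch(regex, path) is not None


# Invented example paths:
# skipped_path_match("/home/me/somedir/*.txt", "/home/me/somedir/a.txt")      is True
# skipped_path_match("/home/me/somedir/*.txt", "/home/me/somedir/sub/a.txt")  is False
# skipped_path_match("/home/me/tmp*", "/home/me/tmp1/x/y")                    is True
```

<para>The second call shows the consequence noted above: the wildcard stops at the directory separator, so a file one level deeper is not matched by the same pattern.</para>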
</listitem> </listitem>
</varlistentry> </varlistentry>