Jean-Francois Dockes 2012-10-15 08:40:50 +02:00
parent 268e3824dc
commit 6bd88ca32f

@ -64,8 +64,8 @@
<para>Also be aware that you may need to install the <para>Also be aware that you may need to install the
appropriate <link linkend="rcl.install.external"> supporting appropriate <link linkend="rcl.install.external"> supporting
applications</link> for document types that need them (for applications</link> for document types that need them (for
example <application>antiword</application> for ms-word example <application>antiword</application> for
files).</para> <application>Microsoft Word</application> files).</para>
</sect1> </sect1>
<sect1 id="rcl.introduction.search"> <sect1 id="rcl.introduction.search">
@ -83,7 +83,7 @@
<para>You do not need to remember in what file or email message you <para>You do not need to remember in what file or email message you
stored a given piece of information. You just ask for related stored a given piece of information. You just ask for related
terms, and the tool will return a list of documents where terms, and the tool will return a list of documents where
those terms are prominent, in a similar way to Internet search these terms are prominent, in a similar way to Internet search
engines.</para> engines.</para>
<para>A search application tries to determine which documents are <para>A search application tries to determine which documents are
@ -143,7 +143,7 @@
word being singular or plural (floor, floors), or on a verb tense word being singular or plural (floor, floors), or on a verb tense
(flooring, floored). Because the mechanisms used for stemming (flooring, floored). Because the mechanisms used for stemming
depend on the specific grammatical rules for each language, there depend on the specific grammatical rules for each language, there
is a separate stemmer module for most common languages where is a separate &XAP; stemmer module for most common languages where
stemming makes sense.</para> stemming makes sense.</para>
<para>&RCL; stores the unstemmed versions of terms in the main index <para>&RCL; stores the unstemmed versions of terms in the main index
@ -160,26 +160,27 @@
recognition, which means that the stemmer will sometimes be applied recognition, which means that the stemmer will sometimes be applied
to terms from other languages with potentially strange results. In to terms from other languages with potentially strange results. In
practise, even if this introduces possibilities of confusion, this practise, even if this introduces possibilities of confusion, this
approach has been proven quite useful, and, awaiting the addition approach has been proven quite useful, and it is much less
of an automatic language recognition module to &RCL;, it is much cumbersome than separating your documents according to what
less cumbersome than separating your documents according to what
language they are written in.</para> language they are written in.</para>
<para>Before version 1.18, &RCL; always stripped most accents and <para>Before version 1.18, &RCL; stripped most accents and
diacritics from terms, and converted them to lower case before diacritics from terms, and converted them to lower case before
storing them in the index. As a consequence, it was impossible to either storing them in the index or searching for them. As a
search for a particular capitalization of a term consequence, it was impossible to search for a particular
(<literal>US</literal> / <literal>us</literal>), or to capitalization of a term (<literal>US</literal> /
discriminate two terms based on diacritics (<literal>sake</literal> <literal>us</literal>), or to discriminate two terms based on
/ <literal>saké</literal>, <literal>mate</literal> / diacritics (<literal>sake</literal> / <literal>saké</literal>,
<literal>maté</literal>).</para> <literal>mate</literal> / <literal>maté</literal>).</para>
<para>As of version 1.18, &RCL; can optionally store the raw terms, <para>As of version 1.18, &RCL; can optionally store the raw terms,
without accent stripping or case conversion. Expansions necessary without accent stripping or case conversion. In this configuration,
for searches insensitive to case and/or diacritics are then it is still possible (and most common) for a query to be
performed when searching. This is described in more detail in the insensitive to case and/or diacritics. Appropriate term expansions
<link linkend="RCL.INDEXING.CONFIG.SENS">section about index case are performed before actually accessing the main index. This is
and diacritics sensitivity</link>.</para> described in more detail in the <link
linkend="RCL.INDEXING.CONFIG.SENS">section about index case and
diacritics sensitivity</link>.</para>
<para>&RCL; has many parameters which define exactly what to <para>&RCL; has many parameters which define exactly what to
index, and how to classify and decode the source index, and how to classify and decode the source
@ -197,7 +198,9 @@
sufficient for giving &RCL; a try, but you may want to adjust sufficient for giving &RCL; a try, but you may want to adjust
it later, which can be done either by editing the text files it later, which can be done either by editing the text files
or by using configuration menus in the or by using configuration menus in the
<command>recoll</command> GUI</para> <command>recoll</command> GUI. Some other parameters affecting only
the <command>recoll</command> GUI are stored in the standard
location defined by <application>Qt</application>.</para>
<para>The <link linkend="rcl.indexing.periodic.exec">indexing <para>The <link linkend="rcl.indexing.periodic.exec">indexing
process</link> is started automatically the first time you process</link> is started automatically the first time you
@ -241,7 +244,7 @@
aspects of the indexing processes and configuration, with links aspects of the indexing processes and configuration, with links
to detailed sections.</para> to detailed sections.</para>
<sect2> <sect2 id="rcl.indexing.introduction.modes">
<title>Indexing modes</title> <title>Indexing modes</title>
<para>&RCL; indexing can be performed along two different modes: <para>&RCL; indexing can be performed along two different modes:
@ -279,20 +282,30 @@
directory). Monitoring a big file system tree can consume directory). Monitoring a big file system tree can consume
significant system resources.</para> significant system resources.</para>
<para>The choice of method and the parameters used can be
configured from the <command>recoll</command> GUI:
<menuchoice>
<guimenu>Preferences</guimenu>
<guimenuitem>Indexing schedule</guimenuitem>
</menuchoice>
</para>
</sect2> </sect2>
<sect2> <sect2 id="rcl.indexing.introduction.config">
<title>Configurations, multiple indexes</title> <title>Configurations, multiple indexes</title>
<para>The parameters describing what is to be indexed and <para>The parameters describing what is to be indexed and
local preferences are defined in text files contained in a local preferences are defined in text files contained in a
<link linkend="rcl.indexing.config">configuration <link linkend="rcl.indexing.config">configuration
directory</link>.</para> directory</link>.</para>
<para>All parameters have defaults, defined in system-wide <para>All parameters have defaults, defined in system-wide
files.</para> files.</para>
<para>Without further configuration, &RCL; will index all <para>Without further configuration, &RCL; will index all
appropriate files from your home directory, with a reasonable appropriate files from your home directory, with a reasonable
set of defaults.</para> set of defaults.</para>
<para>A default personal configuration directory <para>A default personal configuration directory
(<filename>$HOME/.recoll/</filename>) is created (<filename>$HOME/.recoll/</filename>) is created
when a &RCL; program is first executed. It is possible to when a &RCL; program is first executed. It is possible to
@ -308,14 +321,14 @@
would be done to separate personal and shared would be done to separate personal and shared
indexes, or to take advantage of the organization of your data indexes, or to take advantage of the organization of your data
to improve search precision.</para> to improve search precision.</para>
<para>The generated indexes can <para>The generated indexes can
be <link linkend="rcl.search.multidb">queried be queried concurrently in a transparent manner.</para>
concurrently</link> in a transparent manner.</para>
<para>For index generation, multiple configurations are <para>For index generation, multiple configurations are
totally independant from each other. When multiple indexes need totally independant from each other. When multiple indexes need
to be used for a single search, to be used for a single search,
<link linkend="rcl.search.multidb">some parameters <link linkend="rcl.indexing.config.multiple">some parameters
should be consistent among the configurations</link>.</para> should be consistent among the configurations</link>.</para>
</sect2> </sect2>
@ -331,8 +344,8 @@
one document. Some file types, like email folders or zip one document. Some file types, like email folders or zip
archives, can hold many individually indexed documents, which may archives, can hold many individually indexed documents, which may
themselves be compound ones. Such hierarchies can go quite themselves be compound ones. Such hierarchies can go quite
deep, and &RCL; can process, for example, an deep, and &RCL; can process, for example, a
<application>ms-word</application> <application>LibreOffice</application>
document stored as an attachment to an email message inside an document stored as an attachment to an email message inside an
email folder archived in a zip file...</para> email folder archived in a zip file...</para>
@ -395,22 +408,23 @@ recoll
the index in the index in
<filename>~/.indexes-email/xapiandb/</filename>.</para> <filename>~/.indexes-email/xapiandb/</filename>.</para>
<para>Using multiple configuration directories and <para>Using multiple configuration directories and <link
<link linkend="rcl.install.config.recollconf">configuration linkend="rcl.install.config.recollconf">configuration
options</link> allows you to tailor multiple configurations options</link> allows you to tailor multiple configurations and
and indexes to handle whatever subset of the available data indexes to handle whatever subset of the available data you wish
that you wish to make searchable.</para> to make searchable.</para>
</listitem> </listitem>
<listitem><para>You can also specify a different storage <listitem><para>For a given configuration directory, you can
location for the index by setting the <varname>dbdir</varname> specify a non-default storage location for the index by setting
parameter in the configuration file the <varname>dbdir</varname> parameter in the configuration file
(see the <link linkend="rcl.install.config.recollconf">configuration (see the <link
section</link>). This method would mainly be of use if you linkend="rcl.install.config.recollconf">configuration
wanted to keep the configuration directory in its default location, section</link>). This method would mainly be of use if you wanted
but desired another location for the index, typically out of to keep the configuration directory in its default location, but
disk occupation concerns.</para> desired another location for the index, typically out of disk
occupation concerns.</para>
</listitem> </listitem>
</itemizedlist> </itemizedlist>
@ -437,7 +451,7 @@ recoll
destroyed safely.</para> destroyed safely.</para>
<sect2 id="rcl.indexing.storage.format"> <sect2 id="rcl.indexing.storage.format">
<title>Xapian index formats</title> <title>&XAP; index formats</title>
<para>&XAP; versions usually support several formats for index <para>&XAP; versions usually support several formats for index
storage. A given major &XAP; version will have a current format, storage. A given major &XAP; version will have a current format,
@ -490,8 +504,9 @@ recoll
<link linkend="rcl.install.config">&RCL; configuration files</link> <link linkend="rcl.install.config">&RCL; configuration files</link>
control which areas of the file system are indexed, and how control which areas of the file system are indexed, and how
files are processed. These variables can be set either by files are processed. These variables can be set either by
editing the text files or using the dialogs in the editing the text files or by using the
<command>recoll</command> GUI.</para> <link linkend="rcl.indexing.config.gui"> dialogs in the
<command>recoll</command> GUI</link>.</para>
<para>The first time you start <command>recoll</command>, you <para>The first time you start <command>recoll</command>, you
will be asked whether or not you would like it to build the will be asked whether or not you would like it to build the
@ -522,6 +537,61 @@ recoll
described in the <link linkend="rcl.install.external">external described in the <link linkend="rcl.install.external">external
packages section.</link></para> packages section.</link></para>
<para>As of Recoll 1.18 there are two incompatible types of Recoll
indexes, depending on the treatment of character case and
diacritics. The next section describes the two types in more
detail.</para>
<sect2 id="rcl.indexing.config.multiple">
<title>Multiple indexes</title>
<para>Multiple &RCL; indexes can be created by
using several configuration directories which are usually set to
index different areas of the file system. A specific index can
be selected for updating or searching, using the
<envar>RECOLL_CONFDIR</envar> environment variable or the
<option>-c</option> option to <command>recoll</command> and
<command>recollindex</command>.</para>
<para>A typical usage scenario for the multiple index feature
would be for a system administrator to set up a central index
for shared data, that you choose to search or not in addition to
your personal data. Of course, there are other
possibilities. There are many cases where you know the subset of
files that should be searched, and where narrowing the search
can improve the results. You can achieve approximately the same
effect with the directory filter in advanced search, but
multiple indexes will have much better performance and may be
worth the trouble.</para>
<para>A <command>recollindex</command> program instance can only
update one specific index.</para>
<para>The main index (defined by
<envar>RECOLL_CONFDIR</envar> or <option>-c</option>) is
always active. If this is undesirable, you can set up your
base configuration to index an empty directory.</para>
<para>The different search interfaces (GUI, command line, ...)
have different methods to define the set of indexes to be
used, see the appropriate section.</para>
<para>If a set of multiple indexes are to be used together for
searches, some configuration parameters must be consistent
among the set. These are parameters which need to be the same
when indexing and searching. As the parameters come from the
main configuration when searching, they need to be compatible
with what was set when creating the other indexes (which came
from their respective configuration directories).</para>
<para>Most importantly, all indexes to be queried concurrently must
have the same option concerning character case and diacritics
stripping, but there are other constraints. Most of the
relevant parameters are described in the
<link linkend="rcl.install.config.recollconf.terms">linked
section</link>.</para>
</sect2>
<sect2 id="rcl.indexing.config.sens"> <sect2 id="rcl.indexing.config.sens">
@ -562,7 +632,7 @@ recoll
<para>As a cost for added capability, a raw index will be slightly <para>As a cost for added capability, a raw index will be slightly
bigger than a stripped one (around 10%). Also, searches will be bigger than a stripped one (around 10%). Also, searches will be
more complex, so probably slightly slower, and the feature is more complex, so probably slightly slower, and the feature is
still young, and a certain amount of weirdness cannot be still young, so that a certain amount of weirdness cannot be
excluded.</para> excluded.</para>
</sect2> </sect2>
@ -709,7 +779,7 @@ recoll
described here.</para> described here.</para>
<para>Option <option>-z</option> will reset the index when <para>Option <option>-z</option> will reset the index when
starting. This is almost the same as destroying the index starting. This is almost the same as destroying the index
files (the nuance is that the Xapian format version will not files (the nuance is that the &XAP; format version will not
be changed).</para> be changed).</para>
<para>Option <option>-Z</option> will force the update of all <para>Option <option>-Z</option> will force the update of all
documents without resetting the index first. This will not documents without resetting the index first. This will not
@ -905,8 +975,8 @@ fvwm
<listitem><para>Advanced search (a panel accessed through the <listitem><para>Advanced search (a panel accessed through the
<guilabel>Tools</guilabel> menu or the toolbox bar icon) has <guilabel>Tools</guilabel> menu or the toolbox bar icon) has
multiple entry fields, which you may use to build a logical multiple entry fields, which you may use to build a logical
condition, with additional filtering on file type and location condition, with additional filtering on file type, location
in the file system.</para> in the file system, modification date, and size.</para>
</listitem> </listitem>
</itemizedlist> </itemizedlist>
@ -955,60 +1025,53 @@ fvwm
described in <link linkend="rcl.search.lang">a separate described in <link linkend="rcl.search.lang">a separate
section</link>.</para> section</link>.</para>
<para><guilabel>File name</guilabel> will specifically look for file
names. The entry will be split at white space characters,
and each fragment will be separately expanded, then the search will
be for file names matching all fragments (this is new in 1.15,
older releases did an OR of the whole thing which did not make
sense). Things to know:
<itemizedlist>
<listitem><para>The search is case- and accent-insensitive.</para>
</listitem>
<listitem><para>Fragments without any wild card
character and not capitalized will be prepended and appended
with '*' (ie: <replaceable>etc</replaceable> ->
<replaceable>*etc*</replaceable>, but
<replaceable>Etc</replaceable> ->
<replaceable>etc</replaceable>). Of course it does not make
sense to have multiple fragments if one of them is capitalized
(as this one will require an exact match).</para>
</listitem>
<listitem><para>If you want to search for a pattern including
white space, use double quotes (ie: <replaceable>"admin
note*"</replaceable>).</para>
</listitem>
<listitem><para>If you have a big index (many files),
excessively generic fragments may result in inefficient
searches.</para>
</listitem>
<listitem><para>As an example, <replaceable>inst
recoll</replaceable> would match
<replaceable>recollinstall.in</replaceable> (and quite a few
others...).</para>
</listitem>
</itemizedlist>
The point of having a separate file name
search is that wild card expansion can be performed more
efficiently on a relatively small subset of the index (allowing
wild cards on the left of terms without excessive penality).</para>
<para>All search modes allow wildcards inside terms <para>All search modes allow wildcards inside terms
(<literal>*</literal>, <literal>?</literal>, (<literal>*</literal>, <literal>?</literal>,
<literal>[]</literal>). You may want to have a look at the <literal>[]</literal>). You may want to have a look at the
<link linkend="rcl.search.wildcards">section about wildcards</link> <link linkend="rcl.search.wildcards">section about wildcards</link>
for more information about this.</para> for more information about this.</para>
<para><guilabel>File name</guilabel> will specifically look for file
names. The point of having a separate file name
search is that wild card expansion can be performed more
efficiently on a small subset of the index (allowing
wild cards on the left of terms without excessive penality).
Things to know:
<itemizedlist>
<listitem><para>White space in the entry should match white
space in the file name, and is not treated specially.</para>
</listitem>
<listitem><para>The search is insensitive to character case and
accents, independantly of the type of index.</para>
</listitem>
<listitem><para>An entry without any wild card
character and not capitalized will be prepended and appended
with '*' (ie: <replaceable>etc</replaceable> ->
<replaceable>*etc*</replaceable>, but
<replaceable>Etc</replaceable> ->
<replaceable>etc</replaceable>).</para>
</listitem>
<listitem><para>If you have a big index (many files),
excessively generic fragments may result in inefficient
searches.</para>
</listitem>
</itemizedlist>
</para>
<para>You can search for exact phrases (adjacent words in a <para>You can search for exact phrases (adjacent words in a
given order) by enclosing the input inside double quotes. Ex: given order) by enclosing the input inside double quotes. Ex:
<literal>"virtual reality"</literal>.</para> <literal>"virtual reality"</literal>.</para>
<para>Character case has no influence on search, except that you <para>When using a stripped index, character case has no influence on
can disable stem expansion for any term by capitalizing it. Ie: search, except that you can disable stem expansion for any term by
a search for <literal>floor</literal> will also normally look for capitalizing it. Ie: a search for <literal>floor</literal> will also
<literal>flooring</literal>, <literal>floored</literal>, etc., but normally look for <literal>flooring</literal>,
a search for <literal>Floor</literal> will only look for <literal>floored</literal>, etc., but a search for
<literal>floor</literal>, in any character case. Stemming can <literal>Floor</literal> will only look for <literal>floor</literal>,
also be disabled globally in the preferences. </para> in any character case. Stemming can also be disabled globally in the
preferences. When using a raw index, <link
linkend="rcl.search.casediac">the rules are a bit more
complicated</link>.</para>
<para>&RCL; remembers the last few searches that you <para>&RCL; remembers the last few searches that you
performed. You can use the simple search text entry widget (a performed. You can use the simple search text entry widget (a
@ -1050,10 +1113,7 @@ fvwm
<para>By default, the document list is presented in order of <para>By default, the document list is presented in order of
relevance (how well the system estimates that the document relevance (how well the system estimates that the document
matches the query). You can sort the result by ascending or matches the query). You can sort the result by ascending or
descending date by using the vertical arrows in the toolbar (the old descending date by using the vertical arrows in the toolbar.</para>
sort tool is gone after release 1.15, because the new <link
linkend="rcl.search.gui.restable">result table</link> has much better
capability).</para>
<para>Clicking on the <para>Clicking on the
<literal>Preview</literal> link for an entry will open an <literal>Preview</literal> link for an entry will open an
@ -1520,7 +1580,7 @@ fvwm
of the string to search for (ie a wildcard expression like of the string to search for (ie a wildcard expression like
<replaceable>*coll</replaceable>), the expansion can take quite <replaceable>*coll</replaceable>), the expansion can take quite
a long time because the full index term list will have to be a long time because the full index term list will have to be
processed. The expansion is currently limited at 200 results for processed. The expansion is currently limited at 10000 results for
wildcards and regular expressions.</para> wildcards and regular expressions.</para>
<para>Double-clicking on a term in the result list will insert <para>Double-clicking on a term in the result list will insert
@ -1531,9 +1591,9 @@ fvwm
</sect2> </sect2>
<sect2 id="rcl.search.gui.multidb"> <sect2 id="rcl.search.gui.multidb">
<title>Multiple databases</title> <title>Multiple indexes</title>
<para>See the <link linkend="rcl.search.multidb">section <para>See the <link linkend="rcl.indexing.config.multiple">section
describing the use of multiple indexes</link> for describing the use of multiple indexes</link> for
generalities. Only the aspects concerning generalities. Only the aspects concerning
the <command>recoll</command> GUI are described here.</para> the <command>recoll</command> GUI are described here.</para>
@ -1627,7 +1687,7 @@ fvwm
of the document container, not only of the text contents (so of the document container, not only of the text contents (so
that ie, a text document with an image added will not be a that ie, a text document with an image added will not be a
duplicate of the text only). Duplicates hiding is controlled duplicate of the text only). Duplicates hiding is controlled
by an entry in the <guilabel>Query configuration</guilabel> by an entry in the <guilabel>GUI configuration</guilabel>
dialog, and is off by default.</para> dialog, and is off by default.</para>
</sect2> </sect2>
@ -1821,7 +1881,7 @@ fvwm
<title>Customizing the search interface</title> <title>Customizing the search interface</title>
<para>You can customize some aspects of the search interface by using <para>You can customize some aspects of the search interface by using
the <guimenu>Query configuration</guimenu> entry in the the <guimenu>GUI configuration</guimenu> entry in the
<guimenu>Preferences</guimenu> menu.</para> <guimenu>Preferences</guimenu> menu.</para>
<para>There are several tabs in the dialog, dealing with the <para>There are several tabs in the dialog, dealing with the
@ -1868,8 +1928,7 @@ fvwm
version instead. </para> version instead. </para>
</listitem> </listitem>
<listitem><para><guilabel>Use &lt;PRE&gt; tags instead of <listitem><para><guilabel>Plain text to HTML line style</guilabel>:
&lt;BR&gt; to display plain text as HTML in preview</guilabel>:
when displaying plain text inside the preview window, &RCL; when displaying plain text inside the preview window, &RCL;
tries to preserve some of the original text line breaks and tries to preserve some of the original text line breaks and
indentation. It can either use PRE HTML tags, which will indentation. It can either use PRE HTML tags, which will
@ -1877,7 +1936,9 @@ fvwm
scrolling for long lines, or use BR tags to break at the scrolling for long lines, or use BR tags to break at the
original line breaks, which will let the editor introduce original line breaks, which will let the editor introduce
other line breaks according to the window width, but will other line breaks according to the window width, but will
lose some of the original indentation.</para> lose some of the original indentation. The third option has
been available in recent releases and is probably now the best
one: use PRE tags with line wrapping.</para>
</listitem> </listitem>
<listitem><para><guilabel>Use desktop preferences to choose <listitem><para><guilabel>Use desktop preferences to choose
@ -1895,7 +1956,9 @@ fvwm
that will still be opened according to &RCL; preferences. This that will still be opened according to &RCL; preferences. This
is useful for passing parameters like page numbers or search is useful for passing parameters like page numbers or search
strings to applications that support them strings to applications that support them
(e.g. <application>evince</application>).</para> (e.g. <application>evince</application>). This cannot be done
with <command>xdg-open</command> which only supports passing
one parameter.</para>
</listitem> </listitem>
<listitem><para><guilabel>Choose editor applications</guilabel> <listitem><para><guilabel>Choose editor applications</guilabel>
@ -1917,9 +1980,8 @@ fvwm
</listitem> </listitem>
<listitem><para><guilabel>Start with advanced search dialog open <listitem><para><guilabel>Start with advanced search dialog open
</guilabel> and <guilabel>Start with sort dialog </guilabel>: If you use this dialog frequently, checking
open</guilabel>: If you use these dialogs all the time, checking the entries will get it to open when recoll starts.</para>
these entries will get them to open when recoll starts.</para>
</listitem> </listitem>
<listitem><para><guilabel>Remember sort activation <listitem><para><guilabel>Remember sort activation
@ -1957,9 +2019,9 @@ fvwm
</listitem> </listitem>
<listitem id="rcl.search.gui.custom.resulthead"> <listitem id="rcl.search.gui.custom.resulthead">
<para><guilabel>Edit result page html header insert</guilabel>: <para><guilabel>Edit result page HTML header insert</guilabel>:
allows you to define text inserted at the end of the result allows you to define text inserted at the end of the result
page html header. page HTML header.
More detail in the <link linkend="rcl.search.gui.custom.reslist"> More detail in the <link linkend="rcl.search.gui.custom.reslist">
result list customisation section.</link></para> result list customisation section.</link></para>
</listitem> </listitem>
@ -2026,11 +2088,10 @@ fvwm
<listitem><para><guilabel>Dynamically build <listitem><para><guilabel>Dynamically build
abstracts</guilabel>: this decides if &RCL; tries to build abstracts</guilabel>: this decides if &RCL; tries to build
document abstracts when displaying the result list. Abstracts document abstracts (lists of <emphasis>snippets</emphasis>)
are constructed by taking context from the document when displaying the result list. Abstracts are constructed by
information, around the search terms. This can slow down taking context from the document information, around the search
result list display significantly for big documents, and you terms.</para>
may want to turn it off.</para>
</listitem> </listitem>
<listitem><para><guilabel>Synthetic abstract size</guilabel>: <listitem><para><guilabel>Synthetic abstract size</guilabel>:
@ -2081,12 +2142,12 @@ fvwm
by adjusting two elements:</para> by adjusting two elements:</para>
<itemizedlist> <itemizedlist>
<listitem><para>The paragraph format</para></listitem> <listitem><para>The paragraph format</para></listitem>
<listitem><para>Html code inside the header <listitem><para>HTML code inside the header
section</para></listitem> section</para></listitem>
</itemizedlist> </itemizedlist>
<para>These can be edited from the <guilabel>Result list</guilabel> <para>These can be edited from the <guilabel>Result list</guilabel>
tab of the <guilabel>Query configuration</guilabel>.</para> tab of the <guilabel>GUI configuration</guilabel>.</para>
<para>Newer versions of Recoll (from 1.17) use a WebKit HTML <para>Newer versions of Recoll (from 1.17) use a WebKit HTML
object by default (this may be disabled at build time), and object by default (this may be disabled at build time), and
@ -2115,10 +2176,6 @@ fvwm
</listitem> </listitem>
<listitem><formalpara><title>%D</title><para>Date</para></formalpara> <listitem><formalpara><title>%D</title><para>Date</para></formalpara>
</listitem> </listitem>
<listitem><formalpara><title>%E</title><para>Precooked Snippets
link (will only appear for documents indexed with page
numbers)</para></formalpara>
</listitem>
<listitem><formalpara><title>%I</title><para>Icon image <listitem><formalpara><title>%I</title><para>Icon image
name. This is normally determined from the mime type. The name. This is normally determined from the mime type. The
associations are defined inside the associations are defined inside the
@ -2131,8 +2188,8 @@ fvwm
<listitem><formalpara><title>%K</title><para>Keywords (if <listitem><formalpara><title>%K</title><para>Keywords (if
any)</para></formalpara> any)</para></formalpara>
</listitem> </listitem>
<listitem><formalpara><title>%L</title><para>Precooked Preview and <listitem><formalpara><title>%L</title><para>Precooked Preview,
Edit links</para></formalpara> Edit, and possibly Snippets links</para></formalpara>
</listitem> </listitem>
<listitem><formalpara><title>%M</title><para>Mime <listitem><formalpara><title>%M</title><para>Mime
type</para></formalpara> type</para></formalpara>
@ -2156,10 +2213,11 @@ fvwm
</listitem> </listitem>
</itemizedlist> </itemizedlist>
The format of the Preview and Edit links is The format of the Preview, Edit, and Snippets links is
<literal>&lt;a href="P%N"&gt;</literal> <literal>&lt;a href="P%N"&gt;</literal>,
and
<literal>&lt;a href="E%N"&gt;</literal> <literal>&lt;a href="E%N"&gt;</literal>
and
<literal>&lt;a href="A%N"&gt;</literal>
where <replaceable>docnum</replaceable> (%N) expands to the document where <replaceable>docnum</replaceable> (%N) expands to the document
number inside the result page.</para> number inside the result page.</para>
@ -2377,7 +2435,7 @@ text/html [file:///Users/uncrypted-dockes/projets/bateaux/ilur/factEtCie/r
capabilities as the complex search interface in the capabilities as the complex search interface in the
GUI.</para> GUI.</para>
<para>The language is roughly based on the (seemingly defunct) <para>The language is based on the (seemingly defunct)
<ulink url="http://www.xesam.org/main/XesamUserSearchLanguage95"> <ulink url="http://www.xesam.org/main/XesamUserSearchLanguage95">
Xesam</ulink> user search language specification.</para> Xesam</ulink> user search language specification.</para>
@ -2405,13 +2463,15 @@ text/html [file:///Users/uncrypted-dockes/projets/bateaux/ilur/factEtCie/r
<replaceable>potatoes</replaceable> (in any part of the document).</para> <replaceable>potatoes</replaceable> (in any part of the document).</para>
<para>An element is composed of an optional field specification, <para>An element is composed of an optional field specification,
and a value, separated by a colon. Example: and a value, separated by a colon (the field separator is the last
<replaceable>Beatles</replaceable>, colon in the element). Example:
<replaceable>Eugenie</replaceable>,
<replaceable>author:balzac</replaceable>, <replaceable>author:balzac</replaceable>,
<replaceable>dc:title:grandet</replaceable> </para> <replaceable>dc:title:grandet</replaceable> </para>
<para>The colon, if present, means "contains". Xesam defines other <para>The colon, if present, means "contains". Xesam defines other
relations, which are not supported for now.</para> relations, which are mostly unsupported for now (except in special
cases, described further down).</para>
<para>All elements in the search entry are normally combined <para>All elements in the search entry are normally combined
with an implicit AND. It is possible to specify that elements be with an implicit AND. It is possible to specify that elements be
@ -2429,8 +2489,8 @@ text/html [file:///Users/uncrypted-dockes/projets/bateaux/ilur/factEtCie/r
not not
(<replaceable>word1</replaceable> AND (<replaceable>word1</replaceable> AND
<replaceable>word2</replaceable>) <literal>OR</literal> <replaceable>word2</replaceable>) <literal>OR</literal>
<replaceable>word3</replaceable>. Do not enter explicit <replaceable>word3</replaceable>. Explicit
parenthesis, they are not supported for now.</para> parentheses are <emphasis>not</emphasis> supported.</para>
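<para>The precedence rule can be pictured with a small Python sketch. It assumes a query made only of bare words and the <literal>OR</literal> keyword (no phrases, fields or negation), and the function name <literal>group_query</literal> is hypothetical. It is a toy model of the grouping, not the &RCL; implementation.</para>

```python
def group_query(query):
    """Group a flat query into OR-groups that are ANDed together.

    Toy model, assuming only bare words and the OR keyword.
    OR binds tighter than the implicit AND, so
    'word1 word2 OR word3' becomes [['word1'], ['word2', 'word3']],
    that is: word1 AND (word2 OR word3).
    """
    groups = []
    join_next = False
    for token in query.split():
        if token == "OR":
            # The next word joins the group of the previous word.
            join_next = True
        elif join_next and groups:
            groups[-1].append(token)
            join_next = False
        else:
            # Implicit AND: start a new group.
            groups.append([token])
            join_next = False
    return groups


if __name__ == "__main__":
    print(group_query("word1 word2 OR word3"))
```

<para>Each inner list is an OR-group, and the outer list is the implicit AND of those groups.</para>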
<para>An element preceded by a <literal>-</literal> specifies a <para>An element preceded by a <literal>-</literal> specifies a
term that should <emphasis>not</emphasis> appear. Pure negative term that should <emphasis>not</emphasis> appear. Pure negative
@ -2777,6 +2837,11 @@ dir:recoll dir:src -dir:utils -dir:common
a word can make for a slow search because &RCL; will have to a word can make for a slow search because &RCL; will have to
scan the whole index term list to find the matches.</para> scan the whole index term list to find the matches.</para>
</listitem> </listitem>
<listitem><para>When working with a raw index (preserving
character case and diacritics), the literal part of a wildcard
expression will be matched exactly for case and
diacritics.</para>
</listitem>
<listitem><para>Using a <literal>*</literal> at the end of a <listitem><para>Using a <literal>*</literal> at the end of a
word can produce more matches than you would think, and word can produce more matches than you would think, and
strange search results. You can use the <link strange search results. You can use the <link
@ -2817,7 +2882,14 @@ dir:recoll dir:src -dir:utils -dir:common
term</literal> at the beginning of the text would be a match for term</literal> at the beginning of the text would be a match for
<literal>"^my term"o5</literal>.</para> <literal>"^my term"o5</literal>.</para>
</sect2> <para>Anchored searches can be very useful for searches inside
somewhat structured documents like scientific articles, in case
explicit metadata has not been supplied (a most frequent case), for
example for looking for matches inside the abstract or the list of
authors (which occur at the top of the document).</para>
</sect2>
</sect1> <!-- wildchars and anchors --> </sect1> <!-- wildchars and anchors -->
@ -2892,61 +2964,13 @@ dir:recoll dir:src -dir:utils -dir:common
</sect1> <!-- rcl.search.desktop --> </sect1> <!-- rcl.search.desktop -->
<sect1 id="rcl.search.multidb">
<title>Multiple databases</title>
<para>Multiple &RCL; databases or indexes can be created by
using several configuration directories which are usually set to
index different areas of the file system. A specific index can
be selected for updating or searching, using the
<envar>RECOLL_CONFDIR</envar> environment variable or the
<option>-c</option> option to <command>recoll</command> and
<command>recollindex</command>.</para>
<para>A typical usage scenario for the multiple index feature
would be for a system administrator to set up a central index
for shared data, that you choose to search or not in addition to
your personal data. Of course, there are other
possibilities. There are many cases where you know the subset of
files that should be searched, and where narrowing the search
can improve the results. You can achieve approximately the same
effect with the directory filter in advanced search, but
multiple indexes will have much better performance and may be
worth the trouble.</para>
<para>A <command>recollindex</command> program instance can only
update one specific index.</para>
<para>The main index (defined by
<envar>RECOLL_CONFDIR</envar> or <option>-c</option>) is
always active. If this is undesirable, you can set up your
base configuration to index an empty directory.</para>
<para>The different search interfaces (GUI, command line, ...)
have different methods to define the set of indexes to be
used, see the appropriate section.</para>
<para>If a set of multiple indexes are to be used together for
searches, some configuration parameters must be consistent
among the set. These are parameters which need to be the same
when indexing and searching. As the parameters come from the
main configuration when searching, they need to be compatible
with what was set when creating the other indexes (which came
from their respective configuration directories). Most of the
relevant parameters are described in the following
<link linkend="rcl.install.config.recollconf.terms">linked
section</link>.</para>
</sect1> <!-- multiple databases -->
</chapter> <!-- Search --> </chapter> <!-- Search -->
<chapter id="rcl.program"> <chapter id="rcl.program">
<title>Programming interface</title> <title>Programming interface</title>
<para>&RCL; has an Application programming Interface, usable both <para>&RCL; has an Application Programming Interface, usable both
for indexing and searching, currently accessible from the for indexing and searching, currently accessible from the
<application>Python</application> language.</para> <application>Python</application> language.</para>
@ -2972,8 +2996,8 @@ dir:recoll dir:src -dir:utils -dir:common
<listitem><para>Simple filters (the old ones) run once and <listitem><para>Simple filters (the old ones) run once and
exit. They can be bare programs like exit. They can be bare programs like
<application>antiword</application>, or shell-scripts using other <application>antiword</application>, or shell-scripts using other
programs. They are very simple to write, just having to write the programs. They are very simple to write, because they just need
text to the standard output.</para> to output the converted text to the standard output.</para>
</listitem> </listitem>
<listitem><para>Multiple filters, new in 1.13, run as long as <listitem><para>Multiple filters, new in 1.13, run as long as
their master process (ie: recollindex) is active. They can their master process (ie: recollindex) is active. They can
@ -3008,12 +3032,12 @@ dir:recoll dir:src -dir:utils -dir:common
source file name. They should output the result to stdout.</para> source file name. They should output the result to stdout.</para>
<para>When writing a filter, you should decide if it will output <para>When writing a filter, you should decide if it will output
plain text or html. Plain text is simpler, but you will not be able plain text or HTML. Plain text is simpler, but you will not be able
to add metadata or vary the output character encoding (this will be to add metadata or vary the output character encoding (this will be
defined in a configuration file). Additionally, some formatting may defined in a configuration file). Additionally, some formatting may
easier to preserve when previewing html. Actually the deciding factor be easier to preserve when previewing HTML. Actually the deciding factor
is metadata: &RCL; has a way to <link linkend="rcl.program.filters.html"> is metadata: &RCL; has a way to <link linkend="rcl.program.filters.html">
extract metadata from the html header and use it for field extract metadata from the HTML header and use it for field
searches.</link></para> searches.</link></para>
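<para>To make the HTML option more concrete, here is a hedged Python sketch of what a simple filter producing HTML might look like. The function name <literal>to_html</literal> and the choice of an <literal>author</literal> meta element are assumptions made for illustration; which meta names are actually used for field searches depends on the <filename>fields</filename> configuration.</para>

```python
import html
import sys


def to_html(text, title="", author=""):
    """Wrap extracted text in a minimal HTML document.

    Sketch only: the character set is declared in the header and the
    text is escaped so that '<' and '&' cannot be taken for markup.
    """
    metas = ['<meta http-equiv="Content-Type" '
             'content="text/html; charset=UTF-8">']
    if author:
        # Hypothetical metadata field carried in the HTML header.
        metas.append('<meta name="author" content="%s">'
                     % html.escape(author, quote=True))
    head = "<title>%s</title>\n%s" % (html.escape(title), "\n".join(metas))
    body = "<pre>%s</pre>" % html.escape(text)
    return ("<html><head>\n%s\n</head><body>\n%s\n</body></html>\n"
            % (head, body))


if __name__ == "__main__" and len(sys.argv) > 1:
    # A simple filter gets the source file name as argument and
    # writes the converted document to the standard output.
    with open(sys.argv[1], encoding="utf-8", errors="replace") as f:
        sys.stdout.write(to_html(f.read(), title=sys.argv[1]))
```

<para>A plain text filter would just print the text, which is simpler but leaves no place for the metadata or the character set declaration.</para>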
<para>The <envar>RECOLL_FILTER_FORPREVIEW</envar> environment <para>The <envar>RECOLL_FILTER_FORPREVIEW</envar> environment
@ -3121,7 +3145,7 @@ application/x-chm = execm rclchm
should be transformed into should be transformed into
"<literal>&amp;lt;</literal>". This is not always properly "<literal>&amp;lt;</literal>". This is not always properly
done by translating programs which output HTML, and of done by translating programs which output HTML, and of
course nerver by those which output plain text.</para> course never by those which output plain text.</para>
<para>The character set needs to be specified in the <para>The character set needs to be specified in the
header. It does not need to be UTF-8 (&RCL; will take care header. It does not need to be UTF-8 (&RCL; will take care
@ -3197,11 +3221,51 @@ application/x-chm = execm rclchm
other aspects of fields handling is defined inside the other aspects of fields handling is defined inside the
<filename>fields</filename> configuration file.</para> <filename>fields</filename> configuration file.</para>
<para>The sequence of events for field processing is as follows:
<itemizedlist>
<listitem><para>During indexing,
<command>recollindex</command> scans all <literal>meta</literal>
fields in HTML documents (most document types are transformed
into HTML at some point). It compares the name for each element
to the configuration defining what should be done with fields
(the <filename>fields</filename> file)</para>
</listitem>
<listitem><para>If the name for the <literal>meta</literal>
element matches one for a field that should be indexed, the
contents are processed and the terms are entered into the index
with the prefix defined in the <filename>fields</filename>
file.</para>
</listitem>
<listitem><para>If the name for the <literal>meta</literal> element
matches one for a field that should be stored, the content of the
element is stored with the document data record, from which it
can be extracted and displayed at query time.</para>
</listitem>
<listitem><para>At query time, if a field search is performed, the
index prefix is computed and the match is only performed against
appropriately prefixed terms in the index.</para>
</listitem>
<listitem><para>At query time, the field can be displayed inside
the result list by using the appropriate directive in the
definition of the <link
linkend="rcl.search.gui.custom.reslist">result list paragraph
format</link>. All fields are displayed on the fields screen of
the preview window (which you can reach through the right-click
menu). This is independent of whether the search which
produced the results used the field or not.</para>
</listitem>
</itemizedlist>
</para>
<para>You can find more information in the <para>You can find more information in the
<link linkend="rcl.install.config.fields">section about the <link linkend="rcl.install.config.fields">section about the
<filename>fields</filename> file</link>, or in comments inside the <filename>fields</filename> file</link>, or in comments inside the
file.</para> file.</para>
<para>You can also have a look at the <ulink
url="https://bitbucket.org/medoc/recoll/wiki/HandleCustomField">example
on the Wiki</ulink>, detailing
how one could add a <emphasis>page count</emphasis> field to PDF
documents for displaying inside result lists.</para>
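<para>The sequence of events for field processing described above can be sketched in Python as follows. This is only a simulation under stated assumptions: the dictionaries standing in for the <filename>fields</filename> file, the prefixes <literal>A</literal> and <literal>XYPC</literal>, the <literal>pagecount</literal> field name and all function names are hypothetical, and the real term processing in <command>recollindex</command> is more elaborate.</para>

```python
from html.parser import HTMLParser

# Hypothetical stand-ins for the 'fields' configuration file:
# field name -> index prefix, and the names of stored fields.
INDEXED_PREFIXES = {"author": "A", "pagecount": "XYPC"}
STORED_FIELDS = {"pagecount"}


class MetaCollector(HTMLParser):
    """Collect the name/content pairs of <meta> elements."""

    def __init__(self):
        super().__init__()
        self.metas = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            name, content = attrs.get("name"), attrs.get("content")
            if name and content is not None:
                self.metas[name.lower()] = content


def process_fields(html_text):
    """Indexing side: return (prefixed index terms, stored fields)."""
    collector = MetaCollector()
    collector.feed(html_text)
    terms, stored = [], {}
    for name, content in collector.metas.items():
        prefix = INDEXED_PREFIXES.get(name)
        if prefix:
            # Indexed field: enter the terms with the field prefix.
            terms.extend(prefix + word.lower() for word in content.split())
        if name in STORED_FIELDS:
            # Stored field: keep the raw content for display.
            stored[name] = content
    return terms, stored


def field_query_term(field, word):
    """Query side: compute the prefixed term to look up."""
    return INDEXED_PREFIXES[field] + word.lower()
```

<para>In this model, a search for <replaceable>author:balzac</replaceable> only matches terms carrying the author prefix, while the stored <literal>pagecount</literal> value remains available for display whether or not the search used that field.</para>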
</sect1> </sect1>
@ -3276,8 +3340,7 @@ application/x-chm = execm rclchm
<para>&RCL; versions after 1.11 define a Python programming <para>&RCL; versions after 1.11 define a Python programming
interface, both for searching and indexing.</para> interface, both for searching and indexing.</para>
<para>The Python interface is not built by default and can be <para>The Python interface can be found in the source package,
found in the source package,
under <filename>python/recoll</filename>.</para> under <filename>python/recoll</filename>.</para>
<para>In order to build the module, you should first build <para>In order to build the module, you should first build
or re-build the Recoll library using position-independent or re-build the Recoll library using position-independent
@ -4389,6 +4452,12 @@ unac_except_trans =
character, you could very well have something like character, you could very well have something like
<literal>üue</literal> in the list.</para> <literal>üue</literal> in the list.</para>
<para>The default value set for
<literal>unac_except_trans</literal> can't be listed here
because I have trouble with SGML and UTF-8, but it only
contains ligature decompositions: German ss, oe, ae, fi,
fl.</para>
<para>This parameter can't be defined for subdirectories, it <para>This parameter can't be defined for subdirectories, it
is global, because there is no way to do otherwise when is global, because there is no way to do otherwise when
querying. If you have document sets which would need different querying. If you have document sets which would need different