Jean-Francois Dockes 2019-04-14 16:18:39 +02:00
parent 48bc71da70
commit 567aaa2035
4 changed files with 875 additions and 1037 deletions

@ -54,12 +54,20 @@ home directory.
Where values are lists, white space is used for separation, and elements with Where values are lists, white space is used for separation, and elements with
embedded spaces can be quoted with double-quotes. embedded spaces can be quoted with double-quotes.
.SH OPTIONS .SH OPTIONS
.TP .TP
.BI "topdirs = "string .BI "topdirs = "string
Space-separated list of files or Space-separated list of files or
directories to recursively index. Default to ~ (indexes directories to recursively index. Default to ~ (indexes
$HOME). You can use symbolic links in the list, they will be followed, $HOME). You can use symbolic links in the list, they will be followed,
independently of the value of the followLinks variable. independently of the value of the followLinks variable.
.TP
.BI "monitordirs = "string
Space-separated list of files or directories to monitor for
updates. When running the real-time indexer, this allows monitoring only a
subset of the whole indexed area. The elements must be included in the
tree defined by the 'topdirs' members.
.TP .TP
.BI "skippedNames = "string .BI "skippedNames = "string
Files and directories which should be ignored. Files and directories which should be ignored.
@ -69,13 +77,21 @@ names. The list in the default configuration does not exclude hidden
directories (names beginning with a dot), which means that it may index directories (names beginning with a dot), which means that it may index
quite a few things that you do not want. On the other hand, email user quite a few things that you do not want. On the other hand, email user
agents like Thunderbird usually store messages in hidden directories, and agents like Thunderbird usually store messages in hidden directories, and
you probably want this indexed. One possible solution is to have '.*' you probably want this indexed. One possible solution is to have ".*" in
in 'skippedNames', and add things like '~/.thunderbird' '~/.evolution' "skippedNames", and add things like "~/.thunderbird" "~/.evolution" to
to 'topdirs'. Not even the file names are indexed for patterns in this "topdirs". Not even the file names are indexed for patterns in this
list, see the 'noContentSuffixes' variable for an alternative approach list, see the "noContentSuffixes" variable for an alternative approach
which indexes the file names. Can be redefined for any which indexes the file names. Can be redefined for any
subtree. subtree.
.TP .TP
.BI "skippedNames- = "string
List of name endings to remove from the default skippedNames
list.
.TP
.BI "skippedNames+ = "string
List of name endings to add to the default skippedNames
list.
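The effect of the "skippedNames-" and "skippedNames+" variables on the default list can be sketched as follows. This is only an illustration of the described semantics, not Recoll's implementation: the function name effective_list is hypothetical, and double-quoted elements with embedded spaces are ignored for brevity.

```python
def effective_list(default, minus, plus):
    """Return the default list without the 'minus' entries, then with 'plus' appended.

    All three arguments are white-space separated strings, as in the
    configuration file. Quoted elements are not handled in this sketch.
    """
    removed = set(minus.split())
    result = [name for name in default.split() if name not in removed]
    for name in plus.split():
        if name not in result:
            result.append(name)
    return result


# Hypothetical values, chosen for illustration only.
names = effective_list("*~ .git CVS", minus="CVS", plus="*.bak")
print(names)
```

The same reasoning applies to the "noContentSuffixes-" and "noContentSuffixes+" pair.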
.TP
.BI "noContentSuffixes = "string .BI "noContentSuffixes = "string
List of name endings (not necessarily dot-separated suffixes) for List of name endings (not necessarily dot-separated suffixes) for
which we don't try MIME type identification, and don't uncompress or which we don't try MIME type identification, and don't uncompress or
@ -87,38 +103,59 @@ from skippedNames because these are name ending matches only (not
wildcard patterns), and the file name itself gets indexed normally. This wildcard patterns), and the file name itself gets indexed normally. This
can be redefined for subdirectories. can be redefined for subdirectories.
.TP .TP
.BI "noContentSuffixes- = "string
List of name endings to remove from the default noContentSuffixes
list.
.TP
.BI "noContentSuffixes+ = "string
List of name endings to add to the default noContentSuffixes
list.
.TP
.BI "skippedPaths = "string .BI "skippedPaths = "string
Paths we should not go into. Space-separated list of Absolute paths we should not go into. Space-separated list of wildcard expressions for absolute
wildcard expressions for filesystem paths. Can contain files and filesystem paths. Must be defined at the top level of the configuration
directories. The database and configuration directories will file, not in a subsection. Can contain files and directories. The database and
automatically be added. The expressions are matched using 'fnmatch(3)' configuration directories will automatically be added. The expressions
with the FNM_PATHNAME flag set by default. This means that '/' characters are matched using 'fnmatch(3)' with the FNM_PATHNAME flag set by
must be matched explicitly. You can set 'skippedPathsFnmPathname' to 0 default. This means that '/' characters must be matched explicitly. You
to disable the use of FNM_PATHNAME (meaning that '/*/dir3' will can set 'skippedPathsFnmPathname' to 0 to disable the use of FNM_PATHNAME
match '/dir1/dir2/dir3'). The default value contains the usual mount point (meaning that '/*/dir3' will match '/dir1/dir2/dir3'). The default value
for removable media to remind you that it is a bad idea to have Recoll work contains the usual mount point for removable media to remind you that it
on these (esp. with the monitor: media gets indexed on mount, all data is a bad idea to have Recoll work on these (esp. with the monitor: media
gets erased on unmount). Explicitly adding '/media/xxx' to the topdirs gets indexed on mount, all data gets erased on unmount). Explicitly
will override this. adding '/media/xxx' to the 'topdirs' variable will override
this.
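The difference made by FNM_PATHNAME can be illustrated with a small Python sketch. Python's fnmatch module has no FNM_PATHNAME flag, so the first function below approximates it by matching one path component at a time. Both function names are hypothetical, and this is not how Recoll itself performs the match.

```python
from fnmatch import fnmatchcase


def match_with_pathname(path, pattern):
    """Approximate fnmatch(3) with FNM_PATHNAME: wildcards never match '/'."""
    path_parts = path.split("/")
    pattern_parts = pattern.split("/")
    # With FNM_PATHNAME every '/' must be matched explicitly, so both
    # sides must have the same number of components.
    if len(path_parts) != len(pattern_parts):
        return False
    return all(fnmatchcase(c, p) for c, p in zip(path_parts, pattern_parts))


def match_without_pathname(path, pattern):
    """Plain matching: '*' may also match '/' characters."""
    return fnmatchcase(path, pattern)


print(match_with_pathname("/dir1/dir2/dir3", "/*/dir3"))
print(match_without_pathname("/dir1/dir2/dir3", "/*/dir3"))
```

With the component-wise match, '/*/dir3' only matches paths with exactly one directory level above 'dir3'. The plain match corresponds to setting 'skippedPathsFnmPathname' to 0.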
.TP .TP
.BI "skippedPathsFnmPathname = "bool .BI "skippedPathsFnmPathname = "bool
Set to 0 to Set to 0 to
override use of FNM_PATHNAME for matching skipped override use of FNM_PATHNAME for matching skipped
paths. paths.
.TP .TP
.BI "nowalkfn = "string
File name which will cause its parent directory to be skipped. Any directory containing a file with this name will be skipped as
if it was part of the skippedPaths list. Ex: .recoll-noindex
.TP
.BI "daemSkippedPaths = "string .BI "daemSkippedPaths = "string
skippedPaths equivalent specific to skippedPaths equivalent specific to
real time indexing. This enables having parts of the tree real time indexing. This enables having parts of the tree
which are initially indexed but not monitored. If daemSkippedPaths is which are initially indexed but not monitored. If daemSkippedPaths is
not set, the daemon uses skippedPaths. not set, the daemon uses skippedPaths.
.TP
.BI "zipUseSkippedNames = "bool
Use skippedNames inside Zip archives. Fetched
directly by the rclzip handler. Skip the patterns defined by skippedNames
inside Zip archives. Can be redefined for subdirectories.
See https://www.lesbonscomptes.com/recoll/faqsandhowtos/FilteringOutZipArchiveMembers.html
.TP .TP
.BI "zipSkippedNames = "string .BI "zipSkippedNames = "string
Space-separated list of wildcard expressions for names that should Space-separated list of wildcard expressions for names that should
be ignored inside zip archives. This is used directly by be ignored inside zip archives. This is used directly by
the zip handler, and has a function similar to skippedNames, but works the zip handler. If zipUseSkippedNames is not set, zipSkippedNames
independently. Can be redefined for subdirectories. Supported by recoll defines the patterns to be skipped inside archives. If zipUseSkippedNames
1.20 and newer. See is set, the two lists are concatenated and used. Can be redefined for
https://bitbucket.org/medoc/recoll/wiki/Filtering%20out%20Zip%20archive%20members subdirectories.
See https://www.lesbonscomptes.com/recoll/faqsandhowtos/FilteringOutZipArchiveMembers.html
.TP .TP
.BI "followLinks = "bool .BI "followLinks = "bool
@ -133,16 +170,27 @@ followed.
.BI "indexedmimetypes = "string .BI "indexedmimetypes = "string
Restrictive list of Restrictive list of
indexed mime types. Normally not set (in which case all indexed mime types. Normally not set (in which case all
supported types are indexed). If it is set, supported types are indexed). If it is set, only the types from the list
only the types from the list will have their contents indexed. The names will have their contents indexed. The names will be indexed anyway if
will be indexed anyway if indexallfilenames is set (default). MIME indexallfilenames is set (default). MIME type names should be taken from
type names should be taken from the mimemap file. Can be redefined for the mimemap file (the values may be different from xdg-mime or file -i
subtrees. output in some cases). Can be redefined for subtrees.
.TP .TP
.BI "excludedmimetypes = "string .BI "excludedmimetypes = "string
List of excluded MIME List of excluded MIME
types. Lets you exclude some types from indexing. Can be types. Lets you exclude some types from indexing. MIME type
redefined for subtrees. names should be taken from the mimemap file (the values may be different
from xdg-mime or file -i output in some cases). Can be redefined for
subtrees.
.TP
.BI "nomd5types = "string
Don't compute md5 for these types. md5 checksums are used only for deduplicating results, and can be
very expensive to compute on multimedia or other big files. This list
lets you turn off md5 computation for selected types. It is global (no
redefinition for subtrees). At the moment, it only has an effect for
external handlers (exec and execm). The file types can be specified by
listing either MIME types (e.g. audio/mpeg) or handler names
(e.g. rclaudio).
.TP .TP
.BI "compressedfilemaxkbs = "int .BI "compressedfilemaxkbs = "int
Size limit for compressed Size limit for compressed
@ -173,9 +221,9 @@ for the command used.
Command used to guess Command used to guess
MIME types if the internal methods fail. This should be a MIME types if the internal methods fail. This should be a
"file -i" workalike. The file path will be added as a last parameter to "file -i" workalike. The file path will be added as a last parameter to
the command line. 'xdg-mime' works better than the traditional 'file' the command line. "xdg-mime" works better than the traditional "file"
command, and is now the configured default (with a hard-coded fallback command, and is now the configured default (with a hard-coded fallback to
to 'file') "file")
.TP .TP
.BI "processwebqueue = "bool .BI "processwebqueue = "bool
Decide if we process the Decide if we process the
@ -204,6 +252,34 @@ will be bigger, and some marginal weirdness may sometimes occur. The
default is a stripped index. When using multiple indexes for a search, default is a stripped index. When using multiple indexes for a search,
this parameter must be defined identically for all. Changing the value this parameter must be defined identically for all. Changing the value
implies an index reset. implies an index reset.
.TP
.BI "indexStoreDocText = "bool
Decide if we store the
documents' text content in the index. Storing the text
allows extracting snippets from it at query time, instead of building
them from index position data.
Newer Xapian index formats have rendered our use of position lists
unacceptably slow in some cases. The last Xapian index format with good
performance for the old method is Chert, which is default for 1.2, still
supported but not default in 1.4 and will be dropped in 1.6.
The stored document text is translated from its original format to UTF-8
plain text, but not stripped of upper-case, diacritics, or punctuation
signs. Storing it increases the index size by 10-20% typically, but also
allows for nicer snippets, so it may be worth enabling it even if not
strictly needed for performance if you can afford the space.
The variable only has an effect when creating an index, meaning that the
xapiandb directory must not exist yet. Its exact effect depends on the
Xapian version.
For Xapian 1.4, if the variable is set to 0, the Chert format will be
used, and the text will not be stored. If the variable is 1, Glass will
be used, and the text stored.
For Xapian 1.2, and for versions 1.5 and newer, the index format is
always the default, but the variable controls if the text is stored or
not, and the abstract generation method. With Xapian 1.5 and later, and
the variable set to 0, abstract generation may be very slow, but this
setting may still be useful to save space if you do not use abstract
generation at all.
.TP .TP
.BI "nonumbers = "bool .BI "nonumbers = "bool
Decides if terms will be Decides if terms will be
@ -216,9 +292,19 @@ will reduce the index size. This can only be set for a whole index, not
for a subtree. for a subtree.
.TP .TP
.BI "dehyphenate = "bool .BI "dehyphenate = "bool
Determines if we index 'coworker' also when the input is 'co-worker'. Determines if we index
This is new in version 1.22, and on by default. Setting the variable to off 'coworker' also when the input is 'co-worker'. This is new
allows restoring the previous behaviour. in version 1.22, and on by default. Setting the variable to off allows
restoring the previous behaviour.
.TP
.BI "backslashasletter = "bool
Process backslash as a normal letter. This may make sense for people wanting to index TeX commands as
such but is not of much general use.
.TP
.BI "maxtermlength = "int
Maximum term length. Words longer than this will be discarded.
The default is 40 and used to be hard-coded, but it can now be
adjusted. You need an index reset if you change the value.
.TP .TP
.BI "nocjk = "bool .BI "nocjk = "bool
Decides if specific East Asian Decides if specific East Asian
@ -263,24 +349,16 @@ lowercase and upper-case versions of a character should be specified, as
membership in the list will turn off both standard accent and case membership in the list will turn off both standard accent and case
processing. The value is global and affects both indexing and querying. processing. The value is global and affects both indexing and querying.
Examples: Examples:
Swedish: Swedish:
unac_except_trans = ää Ää öö Öö üü Üü ßss œoe Œoe æae Æae ffff fifi flfl åå Åå unac_except_trans = ää Ää öö Öö üü Üü ßss œoe Œoe æae Æae ffff fifi flfl åå Åå
. German:
German:
unac_except_trans = ää Ää öö Öö üü Üü ßss œoe Œoe æae Æae ffff fifi flfl unac_except_trans = ää Ää öö Öö üü Üü ßss œoe Œoe æae Æae ffff fifi flfl
In French, you probably want to decompose oe and ae and nobody would type In French, you probably want to decompose oe and ae and nobody would type
a German ß a German ß
unac_except_trans = ßss œoe Œoe æae Æae ffff fifi flfl unac_except_trans = ßss œoe Œoe æae Æae ffff fifi flfl
. The default for all until someone protests follows. These decompositions
The default for all until someone protests follows. These decompositions
are not performed by unac, but it is unlikely that someone would type the are not performed by unac, but it is unlikely that someone would type the
composed forms in a search. composed forms in a search.
unac_except_trans = ßss œoe Œoe æae Æae ffff fifi flfl unac_except_trans = ßss œoe Œoe æae Æae ffff fifi flfl
.TP .TP
.BI "maildefcharset = "string .BI "maildefcharset = "string
@ -352,7 +430,7 @@ over which we stop indexing. The value is a percentage,
corresponding to what the "Capacity" df output column shows. The default corresponding to what the "Capacity" df output column shows. The default
value is 0, meaning no checking. value is 0, meaning no checking.
.TP .TP
.BI "xapiandb = "dfn .BI "dbdir = "dfn
Xapian database directory Xapian database directory
location. This will be created on first indexing. If the location. This will be created on first indexing. If the
value is not an absolute path, it will be interpreted as relative to value is not an absolute path, it will be interpreted as relative to
@ -386,9 +464,17 @@ Default: 40 MB.
Reducing the size will not physically truncate the file. Reducing the size will not physically truncate the file.
.TP .TP
.BI "webqueuedir = "fn .BI "webqueuedir = "fn
The path to the Web indexing queue. This is The path to the Web indexing queue. This used to be
hard-coded in the plugin as ~/.recollweb/ToIndex so there should be no hard-coded in the old plugin as ~/.recollweb/ToIndex so there would be no
need or possibility to change it. need or possibility to change it, but the WebExtensions plugin now downloads
the files to the user Downloads directory, and a script moves them to
webqueuedir. The script reads this value from the config so it has become
possible to change it.
.TP
.BI "webdownloadsdir = "fn
The path to browser downloads directory. This is
where the new browser add-on extension has to create the files. They are
then moved by a script to webqueuedir.
.TP .TP
.BI "aspellDicDir = "dfn .BI "aspellDicDir = "dfn
Aspell dictionary storage directory location. The Aspell dictionary storage directory location. The
@ -415,10 +501,11 @@ which lets Xapian perform its own thing, meaning flushing every
$XAPIAN_FLUSH_THRESHOLD documents created, modified or deleted: as memory $XAPIAN_FLUSH_THRESHOLD documents created, modified or deleted: as memory
usage depends on average document size, not only document count, the usage depends on average document size, not only document count, the
Xapian approach is not very useful, and you should let Recoll manage Xapian approach is not very useful, and you should let Recoll manage
the flushes. The default value of idxflushmb is 10 MB, and may be a bit the flushes. The program compiled value is 0. The configured default
low. If you are looking for maximum speed, you may want to experiment value (from this file) is now 50 MB, and should be ok in many cases.
with values between 20 and You can set it as low as 10 to conserve memory, but if you are looking
80. In my experience, values beyond 100 are always counterproductive. If for maximum speed, you may want to experiment with values between 20 and
200. In my experience, values beyond this are always counterproductive. If
you find otherwise, please drop me a note. you find otherwise, please drop me a note.
.TP .TP
.BI "filtermaxseconds = "int .BI "filtermaxseconds = "int
@ -481,6 +568,25 @@ Override logfilename for the indexer in real time
mode. The default is to use the idx... values if set, else mode. The default is to use the idx... values if set, else
the log... values. the log... values.
.TP .TP
.BI "orgidxconfdir = "dfn
Original location of the configuration directory. This is used exclusively for movable datasets. Locating the
configuration directory inside the directory tree makes it possible to
provide automatic query time path translations once the data set has
moved (for example, because it has been mounted on another
location).
.TP
.BI "curidxconfdir = "dfn
Current location of the configuration directory. Complements orgidxconfdir for movable datasets. This should be used
if the configuration directory has been copied from the dataset to
another location, either because the dataset is readonly and an r/w copy
is desired, or for performance reasons. This records the original moved
location before copy, to allow path translation computations. For
example if a dataset originally indexed as '/home/me/mydata/config' has
been mounted to '/media/me/mydata', and the GUI is running from a copied
configuration, orgidxconfdir would be '/home/me/mydata/config', and
curidxconfdir (as set in the copied configuration) would be
'/media/me/mydata/config'.
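The kind of path translation that these two variables make possible can be sketched as follows, using the example values above. This is an illustration only: the function name translate_path is hypothetical, and the sketch assumes that the configuration directory sits directly under the dataset root, which may not match what Recoll actually computes.

```python
import posixpath


def translate_path(doc_path, orgidxconfdir, curidxconfdir):
    """Map a path recorded at indexing time to its current location.

    Assumption: the dataset root is the parent of the configuration
    directory, both at indexing time and now.
    """
    org_root = posixpath.dirname(orgidxconfdir.rstrip("/"))
    cur_root = posixpath.dirname(curidxconfdir.rstrip("/"))
    if doc_path == org_root or doc_path.startswith(org_root + "/"):
        return cur_root + doc_path[len(org_root):]
    # Paths outside the dataset are left untouched.
    return doc_path


print(translate_path("/home/me/mydata/docs/a.txt",
                     "/home/me/mydata/config",
                     "/media/me/mydata/config"))
```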
.TP
.BI "idxrundir = "dfn .BI "idxrundir = "dfn
Indexing process current directory. The input Indexing process current directory. The input
handlers sometimes leave temporary files in the current directory, so it handlers sometimes leave temporary files in the current directory, so it
@ -519,6 +625,12 @@ amount of data stored in the index for the purpose of displaying fields
inside result lists or previews. The default value is 150 bytes which inside result lists or previews. The default value is 150 bytes which
may be too low if you have custom fields. may be too low if you have custom fields.
.TP .TP
.BI "idxtexttruncatelen = "int
Truncation length for all document texts. Only index
the beginning of documents. This is not recommended except if you are
sure that the interesting keywords are at the top and have severe disk
space issues.
.TP
.BI "aspellLanguage = "string .BI "aspellLanguage = "string
Language definitions to use when creating the aspell Language definitions to use when creating the aspell
dictionary. The value must match a set of aspell language dictionary. The value must match a set of aspell language
@ -612,16 +724,39 @@ Attempt OCR of PDF files with no text content if both tesseract and
pdftoppm are installed. The default is off because OCR is so pdftoppm are installed. The default is off because OCR is so
very slow. very slow.
.TP .TP
.BI "pdfocrlang = "string
Language to assume for PDF OCR. This is very important for having a reasonable rate of errors
with tesseract. This can also be set through a configuration variable
or directory-local parameters. See the rclpdf.py script.
.TP
.BI "pdfattach = "bool .BI "pdfattach = "bool
Enable PDF attachment extraction by executing pdftk (if Enable PDF attachment extraction by executing pdftk (if
available). This is available). This is
normally disabled, because it does slow down PDF indexing a bit even if normally disabled, because it does slow down PDF indexing a bit even if
not one attachment is ever found. not one attachment is ever found.
.TP .TP
.BI "pdfextrameta = "string
Extract text from selected XMP metadata tags. This
is a space-separated list of qualified XMP tag names. Each element can also
include a translation to a Recoll field name, separated by a '|'
character. If the second element is absent, the tag name is used as the
Recoll field name. You will also need to add specifications to the
"fields" file to direct processing of the extracted data.
.TP
.BI "pdfextrametafix = "fn
Define name of XMP field editing script. This
defines the name of a script to be loaded for editing XMP field
values. The script should define a 'MetaFixer' class with a metafix()
method which will be called with the qualified tag name and value of each
selected field, for editing or erasing. A new instance is created for
each document, so that the object can keep state for, e.g. eliminating
duplicate values.
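A minimal sketch of such an editing script follows. The class and method names come from the description above. The parameter names, the use of the return value as the edited text, and the use of an empty string to erase a value are assumptions made for illustration, and should be checked against the rclpdf.py script.

```python
class MetaFixer(object):
    """Drops repeated values for a tag within one document.

    One instance is created per document, so the set of seen values
    starts empty for each new document.
    """

    def __init__(self):
        self.seen = set()

    def metafix(self, nm, txt):
        # nm: qualified XMP tag name, txt: field value (assumed to be text).
        value = txt.strip()
        key = (nm, value)
        if key in self.seen:
            # Assumption: returning an empty string erases the value.
            return ""
        self.seen.add(key)
        return value


fixer = MetaFixer()
print(repr(fixer.metafix("dc:subject", " recoll ")))
print(repr(fixer.metafix("dc:subject", "recoll")))
```

Keeping the state in the instance rather than in a module-level variable relies on the fact, stated above, that a new instance is created for each document.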
.TP
.BI "mhmboxquirks = "string .BI "mhmboxquirks = "string
Enable thunderbird/mozilla-seamonkey mbox format quirks Set this for the directory where the email mbox files are Enable thunderbird/mozilla-seamonkey mbox format quirks Set this for the directory where the email mbox files are
stored. stored.
.SH SEE ALSO .SH SEE ALSO
.PP .PP
recollindex(1) recoll(1) recollindex(1) recoll(1)

File diff suppressed because it is too large

@ -8,6 +8,7 @@
<!ENTITY RCLVERSION "1.25"> <!ENTITY RCLVERSION "1.25">
<!ENTITY XAP "<application>Xapian</application>"> <!ENTITY XAP "<application>Xapian</application>">
<!ENTITY WIN "<application>Windows</application>"> <!ENTITY WIN "<application>Windows</application>">
<!ENTITY LIN "<application>Unix</application>-like systems">
<!ENTITY FAQS "https://www.lesbonscomptes.com/recoll/faqsandhowtos/"> <!ENTITY FAQS "https://www.lesbonscomptes.com/recoll/faqsandhowtos/">
]> ]>
@ -89,7 +90,7 @@
</menuchoice>, then adjust the <guilabel>Top </menuchoice>, then adjust the <guilabel>Top
directories</guilabel> section).</para> directories</guilabel> section).</para>
<para>On Unix/Linux, you may need to install the <para>On &LIN;, you may need to install the
appropriate appropriate
<link linkend="RCL.INSTALL.EXTERNAL">supporting applications</link> <link linkend="RCL.INSTALL.EXTERNAL">supporting applications</link>
for document types that need them (for for document types that need them (for
@ -177,16 +178,13 @@
<para>The &XAP; index can be big (roughly the size of the original <para>The &XAP; index can be big (roughly the size of the original
document set), but it is not a document archive. &RCL; can only document set), but it is not a document archive. &RCL; can only
display documents that still exist at the place from which they were display documents that still exist at the place from which they were
indexed. (Actually, there is a way to reconstruct a document from the indexed.</para>
information in the index, but only the pure text is saved, possibly
without punctuation and capitalization, depending on &RCL;
version).</para>
<para>&RCL; stores all internal data in <application>Unicode <para>&RCL; stores all internal data in <application>Unicode
UTF-8</application> format, and it can index files of many types UTF-8</application> format, and it can index many types of files
with different character sets, encodings, and languages into the with different character sets, encodings, and languages into the
same index. It can process documents embedded inside other same index. It can process documents embedded inside other
documents (for example a pdf document stored inside a Zip documents (for example a PDF document stored inside a Zip
archive sent as an email attachment...), down to an arbitrary archive sent as an email attachment...), down to an arbitrary
depth.</para> depth.</para>
@ -233,25 +231,17 @@
<link linkend="RCL.INDEXING.CONFIG.SENS">index case and diacritics sensitivity</link>. <link linkend="RCL.INDEXING.CONFIG.SENS">index case and diacritics sensitivity</link>.
</para> </para>
<para>&RCL; has many parameters which define exactly what to <para>&RCL; uses many parameters to define exactly what to index,
index, and how to classify and decode the source and how to classify and decode the source documents. These are kept
documents. These are kept in in <link linkend="RCL.INDEXING.CONFIG">configuration files</link>. A
<link linkend="RCL.INDEXING.CONFIG">configuration files</link>. default configuration is copied into a standard location (usually
A default configuration is copied into a standard location something like <filename>/usr/share/recoll/examples</filename>)
(usually something like during installation. The default values set by the configuration
<filename>/usr/share/recoll/examples</filename>) files in this directory may be overridden by values set inside your
during installation. The default values set by the personal configuration. With the default configuration, &RCL; will
configuration files in this directory may be overridden by index your home directory with generic parameters. The configuration
values set inside your personal configuration, found can be customized either by editing the text files or by using
by default in the <filename>.recoll</filename> sub-directory configuration menus in the <command>recoll</command> GUI.</para>
of your home directory. The default configuration will index
your home directory with default parameters and should be
sufficient for giving &RCL; a try, but you may want to adjust
it later, which can be done either by editing the text files
or by using configuration menus in the
<command>recoll</command> GUI. Some other parameters affecting only
the <command>recoll</command> GUI are stored in the standard
location defined by <application>Qt</application>.</para>
<para>The <link linkend="RCL.INDEXING.PERIODIC.EXEC">indexing process</link> <para>The <link linkend="RCL.INDEXING.PERIODIC.EXEC">indexing process</link>
is started automatically (after asking permission), the is started automatically (after asking permission), the
@ -265,7 +255,7 @@
<para><link linkend="RCL.SEARCH">Searches</link> are usually <para><link linkend="RCL.SEARCH">Searches</link> are usually
performed inside the <command>recoll</command> GUI, which has many performed inside the <command>recoll</command> GUI, which has many
options to help you find what you are looking for. However, there options to help you find what you are looking for. However, there
are other ways to perform &RCL; searches: are other ways to query the index:
<itemizedlist> <itemizedlist>
<listitem><para>A <listitem><para>A
<link linkend="RCL.SEARCH.COMMANDLINE">command line interface</link>. <link linkend="RCL.SEARCH.COMMANDLINE">command line interface</link>.
@ -328,41 +318,44 @@
<sect2 id="RCL.INDEXING.INTRODUCTION.MODES"> <sect2 id="RCL.INDEXING.INTRODUCTION.MODES">
<title>Indexing modes</title> <title>Indexing modes</title>
<para>&RCL; indexing can be performed along two main modes: <para>&RCL; indexing can be performed along two main modes:</para>
<itemizedlist> <itemizedlist>
<listitem> <listitem>
<formalpara> <formalpara><title>
<title><link linkend="RCL.INDEXING.PERIODIC">Periodic (or batch) indexing:</link></title> <link linkend="RCL.INDEXING.PERIODIC">Periodic (or batch) indexing</link>
</title>
<para><command>recollindex</command> is executed <para><command>recollindex</command> is executed
at discrete times. The typical usage is to have a nightly run at discrete times. On &LIN;, the typical usage is to have a
<link linkend="RCL.INDEXING.PERIODIC.AUTOMAT">programmed</link> into nightly run
your <command>cron</command> file.</para> <link linkend="RCL.INDEXING.PERIODIC.AUTOMAT">programmed</link>
into your <command>cron</command> file. On &WIN;, this is
the only mode available, and the indexer is usually started
from the GUI (but there is nothing to prevent starting it
from a command script).</para>
</formalpara> </formalpara>
</listitem> </listitem>
<listitem> <listitem>
<formalpara><title><link linkend="RCL.INDEXING.MONITOR">Real time indexing:</link></title> <formalpara><title>
<para><command>recollindex</command> runs permanently as a <link linkend="RCL.INDEXING.MONITOR">Real time indexing</link>
daemon and uses a file system alteration monitor </title>
<para>(Only available on &LIN;). <command>recollindex</command> runs
permanently as a daemon and uses a file system alteration monitor
(e.g. <application>inotify</application>) to detect file (e.g. <application>inotify</application>) to detect file
changes. New or updated files are indexed at once.</para> changes. New or updated files are indexed at once. Monitoring a
big file system tree can consume
significant system resources. </para>
</formalpara> </formalpara>
</listitem> </listitem>
</itemizedlist> </itemizedlist>
</para>
<simplesect><title>&LIN;: choosing an indexing mode</title>
<para>The choice between the two methods is mostly a matter of <para>The choice between the two methods is mostly a matter of
preference, and they can be combined by setting up multiple preference, and they can be combined by setting up multiple
indexes (ie: use periodic indexing on a big documentation indexes (ie: use periodic indexing on a big documentation
directory, and real time indexing on a small home directory, and real time indexing on a small home
directory). Monitoring a big file system tree can consume directory), or, with &RCL; 1.24 and newer, by
significant system resources.</para> <link linkend="RCL.INDEXING.MONITOR">configuring the index so that only a subset of the tree will be monitored.</link>
</para>
<para>With &RCL; 1.24 and newer, it is also possible to set up an
index so that only a subset of the tree will be monitored and the
rest will be covered by batch/incremental indexing. (See the
details in the <link linkend="RCL.INDEXING.MONITOR">Real time indexing</link>
section.</para>
<para>The choice of method and the parameters used can be <para>The choice of method and the parameters used can be
configured from the <command>recoll</command> GUI: configured from the <command>recoll</command> GUI:
<menuchoice> <menuchoice>
@ -370,21 +363,7 @@
<guimenuitem>Indexing schedule</guimenuitem> <guimenuitem>Indexing schedule</guimenuitem>
</menuchoice> </menuchoice>
</para> </para>
</simplesect>
<para>The GUI <menuchoice><guimenu>File</guimenu>
</menuchoice> menu also has entries to start or stop
the current indexing operation. Stopping indexing is performed by
killing the <command>recollindex</command> process, which will
checkpoint its state and exit. A later restart of indexing will
mostly resume from where things stopped (the file tree walk has to
be restarted from the beginning).</para>
<para>When the real time indexer is running, two operations are
available from the menu: 'Stop' and 'Trigger incremental pass'.
When no indexing is running, you have a choice of updating the
index or rebuilding it (the first choice only processes changed
files, the second one zeroes the index before starting so that all
files are processed).</para>
</sect2> </sect2>
@ -396,11 +375,13 @@
in which several configuration files describe in which several configuration files describe
what should be indexed and how.</para> what should be indexed and how.</para>
<para>A default personal configuration directory <para>When <command>recoll</command> or
(<filename>$HOME/.recoll/</filename>) is created <command>recollindex</command> is first executed, it creates a
when a &RCL; program is first executed. This configuration is default configuration directory. This configuration is the one used
the one used for indexing and querying when no specific for indexing and querying when no specific configuration is
configuration is specified.</para> specified. It is located in <filename>$HOME/.recoll/</filename> for
&LIN; and <filename>%LOCALAPPDATA%</filename> on &WIN;
(typically <filename>C:\Users\[me]\Appdata\Local</filename>).</para>
<para>All configuration parameters have defaults, defined in <para>All configuration parameters have defaults, defined in
system-wide files. Without further customisation, the default system-wide files. Without further customisation, the default
@ -431,33 +412,6 @@
machines), and then merging them, or querying them in machines), and then merging them, or querying them in
parallel.</para> parallel.</para>
<para>A specific configuration can be selected by setting the
<envar>RECOLL_CONFDIR</envar> environment variable, or giving the
<option>-c</option> option to any of the &RCL; commands.</para>
<para>When creating or updating indexes, the different
configurations are entirely independent (no parameters are ever
shared between configurations when indexing). The
<command>recollindex</command> program always works on a single
index.</para>
<para>When querying, multiple indexes can be accessed concurrently,
either from the GUI or the command line. When doing this, there is
always one main configuration, from which both configuration and
index data are used. Only the index data from the additional
indexes is used (their configuration parameters are
ignored).</para>
<para>The behaviour of index update and query regarding multiple
configurations is important and sometimes confusing, so it will be
rephrased here: for index generation, multiple configurations are
totally independent from each other. When querying, configuration
and data are used from the main index (the one designated by
<literal>-c</literal> or <envar>RECOLL_CONFDIR</envar>), and only
the data from the additional indexes is used. This implies
that some parameters should be consistent among the configurations
for indexes which are to be used together.</para>
<para>See the section about <para>See the section about
<link linkend="RCL.INDEXING.CONFIG.MULTIPLE">configuring multiple indexes</link> <link linkend="RCL.INDEXING.CONFIG.MULTIPLE">configuring multiple indexes</link>
for more detail.</para> for more detail.</para>
@ -751,27 +705,26 @@
<link linkend="RCL.INDEXING.CONFIG.GUI">dialogs in the <command>recoll</command> GUI</link>. <link linkend="RCL.INDEXING.CONFIG.GUI">dialogs in the <command>recoll</command> GUI</link>.
</para> </para>
<para>The first time you start <command>recoll</command>, you <para>The first time you start <command>recoll</command>, you will be
will be asked whether or not you would like it to build the asked whether or not you would like it to build the index. If you
index. If you want to adjust the configuration before want to adjust the configuration before indexing, just click
indexing, just click <guilabel>Cancel</guilabel> at this <guilabel>Cancel</guilabel> at this point, which will get you into
point, which will get you into the configuration interface. If the configuration interface. If you exit at this point,
you exit at this point, <filename>recoll</filename> will have <filename>recoll</filename> will have created a default configuration
created a <filename>~/.recoll</filename> directory containing directory with empty configuration files, which you can then
empty configuration files, which you can edit by hand.</para> edit.</para>
<para>The configuration is documented inside the <para>The configuration is documented inside the
<link linkend="RCL.INSTALL.CONFIG">installation chapter</link> <link linkend="RCL.INSTALL.CONFIG">installation chapter</link>
of this document, or in the of this document, or in the
<citerefentry> <ulink url="https://www.lesbonscomptes.com/recoll/manpages/recoll.conf.5.html"><citerefentry><refentrytitle>recoll.conf</refentrytitle><manvolnum>5</manvolnum></citerefentry></ulink>
<refentrytitle>recoll.conf</refentrytitle> manual page. Both documents are automatically generated from
<manvolnum>5</manvolnum> the comments inside the configuration file.</para>
</citerefentry>
man page, but the most current information will most likely be the <para>The most immediately useful variable
comments inside the sample file. The most immediately useful variable
is probably is probably
<link linkend="RCL.INSTALL.CONFIG.RECOLLCONF.TOPDIRS"><varname>topdirs</varname></link>, <link linkend="RCL.INSTALL.CONFIG.RECOLLCONF.TOPDIRS"><varname>topdirs</varname></link>,
which determines what subtrees and files get indexed.</para> which lists the subtrees and files to be indexed.</para>
<para>The applications needed to index file types other than <para>The applications needed to index file types other than
text, HTML or email (ie: pdf, postscript, ms-word...) are text, HTML or email (ie: pdf, postscript, ms-word...) are
@ -789,67 +742,62 @@
<para>Multiple &RCL; indexes can be created by using several <para>Multiple &RCL; indexes can be created by using several
configuration directories which are typically set to index configuration directories which are typically set to index
different areas of the file system. A specific index can be different areas of the file system.</para>
selected for updating or searching, using the
<envar>RECOLL_CONFDIR</envar> environment variable or the <para>A specific index can be selected by setting the
<envar>RECOLL_CONFDIR</envar> environment variable or giving the
<option>-c</option> option to <command>recoll</command> and <option>-c</option> option to <command>recoll</command> and
<command>recollindex</command>.</para> <command>recollindex</command>.</para>
<para>Index configuration parameters can be set either by using a <para>The <command>recollindex</command> program, used for creating
text editor on the files, or, for most parameters, by using the or updating indexes, always works on a single index. The different
<command>recoll</command> index configuration GUI. In the latter configurations are entirely independent (no parameters are ever
case, the configuration directory for which parameters are modified shared between configurations when indexing). </para>
is the one which was selected by <envar>RECOLL_CONFDIR</envar> or
the <option>-c</option> parameter, and there is no way to switch
configurations within the GUI.</para>
<para>As a reminder from a previous section, a <para>All the search interfaces (<command>recoll</command>,
<command>recollindex</command> program instance can only update one
specific index, and it will only use parameters from a single
configuration (no parameters are ever shared between configurations
when indexing). All the query methods (<command>recoll</command>,
<command>recollq</command>, the Python API, etc.) operate with a <command>recollq</command>, the Python API, etc.) operate with a
main configuration, from which both configuration and index data main configuration, from which both configuration and index data
are used, but can also query data from multiple additional are used, and can also query data from multiple additional
indexes. Only the index data from the latter is used, their indexes. Only the index data from the latter is used, their
configuration parameters are ignored.</para> configuration parameters are ignored. This implies that some
parameters should be consistent among index configurations which
are to be used together.</para>
<para>When searching, the current main index (defined by <para>When searching, the current main index (defined by
<envar>RECOLL_CONFDIR</envar> or <option>-c</option>) is always <envar>RECOLL_CONFDIR</envar> or <option>-c</option>) is always
active. If this is undesirable, you can set up your base active. If this is undesirable, you can set up your base
configuration to index an empty directory.</para> configuration to index an empty directory.</para>
<para>If a set of multiple indexes are to be used together for <para>Index configuration parameters can be set either by using a
searches, some configuration parameters must be consistent text editor on the files, or, for most parameters, by using the
among the set. These are parameters which need to be the same <link linkend="RCL.INDEXING.CONFIG.GUI"><command>recoll</command> index configuration GUI</link>.
when indexing and searching. As the parameters come from the In the latter case, the configuration directory for which
main configuration when searching, they need to be compatible parameters are modified is the one which was selected by
with what was set when creating the other indexes (which came <envar>RECOLL_CONFDIR</envar> or the <option>-c</option> parameter,
from their respective configuration directories).</para> and there is no way to switch configurations within the GUI.</para>
<para>Most importantly, all indexes to be queried concurrently must <para>See the <link linkend="RCL.INSTALL.CONFIG.RECOLLCONF">configuration section</link>
have the same option concerning character case and diacritics for a detailed description of the parameters.</para>
stripping, but there are other constraints. Most of the
relevant parameters are described in the <para>Some configuration parameters must be consistent among a set
<link linkend="RCL.INSTALL.CONFIG.RECOLLCONF.TERMS">linked section</link>. of multiple indexes used together for searches. Most importantly,
all indexes to be queried concurrently must have the same option
concerning character case and diacritics stripping, but there are
other constraints. Most of the relevant parameters affect the
<link linkend="RCL.INSTALL.CONFIG.RECOLLCONF.TERMS">term generation</link>.
</para> </para>
<para>The different search interfaces (GUI, command line, ...) <para>Using multiple configurations implies a small
have different methods to define the set of indexes to be level of command line or file manager usage. The user must
used, see the appropriate section.</para> explicitly create additional configuration directories, the GUI
will not do it. This is to avoid mistakenly creating additional
directories when an argument is mistyped. Also, the GUI or the
indexer must be launched with a specific option or environment to
work on the right configuration.</para>
<para>At the moment, using multiple configurations implies a small <simplesect>
level of command line usage. Additional configuration directories <title>In practice: creating and using an additional index</title>
(beyond <filename>~/.recoll</filename>) must be created by hand
(<command>mkdir</command> or such), the GUI will not do it. This is
to avoid mistakenly creating additional directories when an
argument is mistyped. Also, the GUI or the indexer must be launched
with a specific option or environment to work on the right
configuration.</para>
<para>To be more practical, here follow a few examples of the
commands needed to create, configure, update, and query an additional
index.</para>
<para>Initially creating the configuration and index:<programlisting> <para>Initially creating the configuration and index:<programlisting>
mkdir <replaceable>/path/to/my/new/config</replaceable></programlisting></para> mkdir <replaceable>/path/to/my/new/config</replaceable></programlisting></para>
@ -858,15 +806,19 @@ mkdir <replaceable>/path/to/my/new/config</replaceable></programlisting></para>
<command>recoll</command> GUI, launched from the <command>recoll</command> GUI, launched from the
command line to pass the <literal>-c</literal> option command line to pass the <literal>-c</literal> option
(you could create a desktop file to do it for you), and then using the (you could create a desktop file to do it for you), and then using the
GUI index configuration tool to set up the index. <link linkend="RCL.INDEXING.CONFIG.GUI">GUI index configuration tool</link>
to set up the index.
<programlisting> <programlisting>
recoll -c <replaceable>/path/to/my/new/config</replaceable></programlisting> recoll -c <replaceable>/path/to/my/new/config</replaceable></programlisting>
</para> </para>
<para>Alternatively, you can just start a text editor on the main <para>Alternatively, you can just start a text editor on the main
configuration file configuration file:
<link linkend="RCL.INSTALL.CONFIG.RECOLLCONF"><filename>recoll.conf</filename></link>.</para> <programlisting>
<replaceable>someEditor</replaceable> <replaceable>/path/to/my/new/config</replaceable>/<link linkend="RCL.INSTALL.CONFIG.RECOLLCONF"><filename>recoll.conf</filename></link>
</programlisting>
</para>
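<para>As a rough sketch of what such an edit might produce, the following shell commands write a minimal <filename>recoll.conf</filename> holding a single <varname>topdirs</varname> entry. The configuration directory location and the indexed directory names are hypothetical placeholders, not &RCL; defaults; list elements are separated by white space, and an element with embedded spaces is quoted with double quotes.</para>

```shell
# Hypothetical location for the additional configuration directory;
# any writable path of your choosing works.
confdir="${TMPDIR:-/tmp}/recoll-example-config"
mkdir -p "$confdir"

# Write a minimal recoll.conf. topdirs lists the trees to index;
# the two directory names below are placeholders for illustration.
cat > "$confdir/recoll.conf" <<'EOF'
topdirs = /home/me/Documents "/home/me/My Projects"
EOF

# Show the result for a quick visual check.
cat "$confdir/recoll.conf"
```

<para>All other parameters keep their defaults, so a file like this is only a starting point that can be refined later with a text editor or with the GUI.</para>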
<para>Creating and updating the index can be done from the command line: <para>Creating and updating the index can be done from the command line:
@ -891,7 +843,7 @@ recoll -c <replaceable>/path/to/my/new/config</replaceable></programlisting>
<guimenu>Preferences</guimenu> <guimenu>Preferences</guimenu>
<guimenuitem>External Index Dialog</guimenuitem> <guimenuitem>External Index Dialog</guimenuitem>
</menuchoice> menu.</para> </menuchoice> menu.</para>
</simplesect>
</sect2> </sect2>
@ -911,9 +863,8 @@ recoll -c <replaceable>/path/to/my/new/config</replaceable></programlisting>
the index. With a stripped index, the search term will be stripped the index. With a stripped index, the search term will be stripped
before searching.</para> before searching.</para>
<para>A raw index allows for another possibility which a stripped <para>A raw index allows using case and diacritics to discriminate
index cannot offer: using case and diacritics to discriminate between terms, e.g., returning different results when searching for
between terms, returning different results when searching for
<literal>US</literal> and <literal>us</literal> or <literal>US</literal> and <literal>us</literal> or
<literal>resume</literal> and <literal>résumé</literal>. <literal>resume</literal> and <literal>résumé</literal>.
Read the Read the
@ -927,15 +878,14 @@ recoll -c <replaceable>/path/to/my/new/config</replaceable></programlisting>
automated by &RCL;), and all indexes in a search must be set automated by &RCL;), and all indexes in a search must be set
in the same way (again, not checked by &RCL;). </para> in the same way (again, not checked by &RCL;). </para>
<para>If the <literal>indexStripChars</literal> is not set, &RCL; <para>&RCL; creates a stripped index by default if
1.18 creates a stripped index by default, for <literal>indexStripChars</literal> is not set.</para>
compatibility with previous versions.</para>
<para>As a cost for added capability, a raw index will be slightly <para>As a cost for added capability, a raw index will be slightly
bigger than a stripped one (around 10%). Also, searches will be bigger than a stripped one (around 10%). Also, searches will be
more complex, so probably slightly slower, and the feature is more complex, so probably slightly slower, and the feature is
still young, so that a certain amount of weirdness cannot be relatively little used, so that a certain amount of weirdness
excluded.</para> cannot be excluded.</para>
<para>One of the most adverse consequences of using a raw index <para>One of the most adverse consequences of using a raw index
is that some phrase and proximity searches may become is that some phrase and proximity searches may become
@ -950,7 +900,7 @@ recoll -c <replaceable>/path/to/my/new/config</replaceable></programlisting>
<sect2 id="RCL.INDEXING.CONFIG.THREADS"> <sect2 id="RCL.INDEXING.CONFIG.THREADS">
<title>Indexing threads configuration</title> <title>Indexing threads configuration (&LIN;)</title>
<para>The &RCL; indexing process <para>The &RCL; indexing process
<command>recollindex</command> can use multiple threads to <command>recollindex</command> can use multiple threads to
@ -1363,7 +1313,7 @@ recoll -c <replaceable>/path/to/my/new/config</replaceable></programlisting>
<sect1 id="RCL.INDEXING.PERIODIC"> <sect1 id="RCL.INDEXING.PERIODIC">
<title>Periodic indexing</title> <title>Periodic indexing</title>
<sect2 id="RCL.INDEXING.PERIODIC.EXEC"> <simplesect id="RCL.INDEXING.PERIODIC.EXEC">
<title>Running indexing</title> <title>Running indexing</title>
<para>Indexing is always performed by the <para>Indexing is always performed by the
@ -1381,19 +1331,36 @@ recoll -c <replaceable>/path/to/my/new/config</replaceable></programlisting>
when it starts, it will automatically start indexing (except when it starts, it will automatically start indexing (except
if canceled).</para> if canceled).</para>
<para>The <command>recollindex</command> indexing process can be <para>The GUI <menuchoice><guimenu>File</guimenu> </menuchoice>
interrupted by sending an interrupt (<keysym>Ctrl-C</keysym>, menu has entries to start or stop the current indexing
SIGINT) or terminate operation.</para>
(SIGTERM) signal. Some time may elapse before the process exits,
because it needs to properly flush and close the index. This can <para>When no indexing is running, you have a choice of updating the
also be done from the <command>recoll</command> GUI index or rebuilding it (the first choice only processes changed
files, the second one zeroes the index before starting so that all
files are processed).</para>
<para>On Linux, the <command>recollindex</command> indexing process
can be interrupted by sending an interrupt
(<keysym>Ctrl-C</keysym>, SIGINT) or terminate (SIGTERM)
signal.
</para>
<para>On Linux and Windows, the GUI can be used to manage the indexing
operation. Stopping the indexer can be done
from the <command>recoll</command> GUI
<menuchoice> <menuchoice>
<guimenu>File</guimenu> <guimenu>File</guimenu>
<guimenuitem>Stop Indexing</guimenuitem> <guimenuitem>Stop Indexing</guimenuitem>
</menuchoice> </menuchoice>
menu entry.</para> menu entry.
</para>
<para>After such an interruption, the index will be somewhat <para>When stopped, some time may elapse before
<command>recollindex</command> exits, because it needs to properly
flush and close the index.</para>
<para>After an interruption, the index will be somewhat
inconsistent because some operations which are normally inconsistent because some operations which are normally
performed at the end of the indexing pass will have been performed at the end of the indexing pass will have been
skipped (for example, the stemming and spelling databases skipped (for example, the stemming and spelling databases
@ -1404,9 +1371,11 @@ recoll -c <replaceable>/path/to/my/new/config</replaceable></programlisting>
to the interruption and for which the index is still up to to the interruption and for which the index is still up to
date will not need to be reindexed).</para> date will not need to be reindexed).</para>
<para><command>recollindex</command> has a number of other options <para><command>recollindex</command> has many options
which are described in its man page. Only a few will be which are listed in its
described here.</para> <ulink url="https://www.lesbonscomptes.com/recoll/manpages/recollindex.1.html">manual page</ulink>.
Only a few will be described here.</para>
<para>Option <option>-z</option> will reset the index when <para>Option <option>-z</option> will reset the index when
starting. This is almost the same as destroying the index starting. This is almost the same as destroying the index
files (the nuance is that the &XAP; format version will not files (the nuance is that the &XAP; format version will not
@ -1446,11 +1415,10 @@ recoll -c <replaceable>/path/to/my/new/config</replaceable></programlisting>
but just add them as index entries. It is but just add them as index entries. It is
up to the external file selection method to build the complete up to the external file selection method to build the complete
file list.</para> file list.</para>
</sect2> </simplesect>
<sect2 id="RCL.INDEXING.PERIODIC.AUTOMAT"> <simplesect id="RCL.INDEXING.PERIODIC.AUTOMAT">
<title>Using <command>cron</command> to automate <title>Linux: using <command>cron</command> to automate indexing</title>
indexing</title>
<para>The most common way to set up indexing is to have a cron <para>The most common way to set up indexing is to have a cron
task execute it every night. For example the following task execute it every night. For example the following
@ -1468,7 +1436,7 @@ recoll -c <replaceable>/path/to/my/new/config</replaceable></programlisting>
]]></screen> ]]></screen>
</para> </para>
<para>As of version 1.17 the &RCL; GUI has dialogs to manage <para>The &RCL; GUI has dialogs to manage
<filename>crontab</filename> entries for <filename>crontab</filename> entries for
<command>recollindex</command>. You can reach them from the <command>recollindex</command>. You can reach them from the
<menuchoice> <menuchoice>
@ -1492,11 +1460,11 @@ recoll -c <replaceable>/path/to/my/new/config</replaceable></programlisting>
issues.</para> issues.</para>
</sect2> </simplesect>
</sect1> </sect1>
<sect1 id="RCL.INDEXING.MONITOR"> <sect1 id="RCL.INDEXING.MONITOR">
<title>Real time indexing</title> <title>&LIN;: real time indexing</title>
<para>Real time monitoring/indexing is performed by starting the <para>Real time monitoring/indexing is performed by starting the
<command>recollindex</command> <option>-m</option> command. <command>recollindex</command> <option>-m</option> command.
@ -1504,6 +1472,11 @@ recoll -c <replaceable>/path/to/my/new/config</replaceable></programlisting>
from the terminal and become a daemon, permanently monitoring from the terminal and become a daemon, permanently monitoring
file changes and updating the index.</para> file changes and updating the index.</para>
<para>In this situation, the <command>recoll</command> GUI
<menuchoice><guimenu>File</guimenu></menuchoice> menu
makes two operations available: 'Stop' and 'Trigger incremental pass'.
</para>
<para>While it is convenient that data is indexed in real time, <para>While it is convenient that data is indexed in real time,
repeated indexing can generate a significant load on the repeated indexing can generate a significant load on the
system when files such as email folders change. Also, system when files such as email folders change. Also,
@ -1522,8 +1495,8 @@ recoll -c <replaceable>/path/to/my/new/config</replaceable></programlisting>
process. The <command>recoll</command> GUI also has a menu entry for process. The <command>recoll</command> GUI also has a menu entry for
this.</para> this.</para>
<sect2 id="RCL.INDEXING.MONITOR.START"> <simplesect id="RCL.INDEXING.MONITOR.START">
<title>Real time indexing: automatic daemon start</title> <title>Automatic daemon start</title>
<para>Under <application>KDE</application>, <para>Under <application>KDE</application>,
<application>Gnome</application> and some other desktop <application>Gnome</application> and some other desktop
@ -1542,17 +1515,15 @@ recoll -c <replaceable>/path/to/my/new/config</replaceable></programlisting>
<filename>examples</filename> directory (typically <filename>examples</filename> directory (typically
<filename>/usr/local/[share/]recoll/examples</filename>).</para> <filename>/usr/local/[share/]recoll/examples</filename>).</para>
<para>For example, my out of fashion <para>For example, a good old <application>xdm</application>-based
<application>xdm</application>-based session has a session could have a <filename>.xsession</filename> script with the
<filename>.xsession</filename> script with the following lines following lines at the end:</para>
at the end:</para>
<programlisting>recollconf=$HOME/.recoll-home <programlisting>recollconf=$HOME/.recoll-home
recolldata=/usr/local/share/recoll recolldata=/usr/local/share/recoll
RECOLL_CONFDIR=$recollconf $recolldata/examples/rclmon.sh start RECOLL_CONFDIR=$recollconf $recolldata/examples/rclmon.sh start
fvwm fvwm
</programlisting> </programlisting>
<para>The indexing daemon gets started, then the window manager, <para>The indexing daemon gets started, then the window manager,
@ -1567,10 +1538,10 @@ recoll -c <replaceable>/path/to/my/new/config</replaceable></programlisting>
<application>X11</application> session, you need to add option <application>X11</application> session, you need to add option
<option>-x</option> to disable <application>X11</application> <option>-x</option> to disable <application>X11</application>
session monitoring (else the daemon will not start).</para> session monitoring (else the daemon will not start).</para>
</sect2> </simplesect>
<sect2 id="RCL.INDEXING.MONITOR.DETAILS"> <simplesect id="RCL.INDEXING.MONITOR.DETAILS">
<title>Real time indexing: miscellaneous details</title> <title>Miscellaneous details</title>
<para>By default, the messages from the indexing daemon will be <para>By default, the messages from the indexing daemon will be
sent to the same file as those from the interactive commands sent to the same file as those from the interactive commands
@ -1581,17 +1552,7 @@ recoll -c <replaceable>/path/to/my/new/config</replaceable></programlisting>
the daemon runs permanently, the log file may grow quite big, the daemon runs permanently, the log file may grow quite big,
depending on the log level.</para> depending on the log level.</para>
<para>When building &RCL;, the real time indexing support can be <formalpara><title>Increasing resources for inotify</title>
customised during package
<link linkend="RCL.INSTALL.BUILDING">configuration</link>
with the <option>--with[out]-fam</option> or
<option>--with[out]-inotify</option> options. The default is
currently to include <application>inotify</application>
monitoring on systems that support it, and, as of &RCL; 1.17,
<application>gamin</application> support on
<application>FreeBSD</application>.</para>
<note><title>Increasing resources for inotify</title>
<para>On Linux systems, monitoring a big tree may need <para>On Linux systems, monitoring a big tree may need
increasing the resources available to inotify, which are increasing the resources available to inotify, which are
normally defined in <filename>/etc/sysctl.conf</filename>. normally defined in <filename>/etc/sysctl.conf</filename>.
@ -1609,29 +1570,28 @@ recoll -c <replaceable>/path/to/my/new/config</replaceable></programlisting>
fs.inotify.max_user_watches=32768 fs.inotify.max_user_watches=32768
</programlisting> </programlisting>
</para> Especially, you will need to trim your tree or adjust
<para>Especially, you will need to trim your tree or adjust
the <literal>max_user_watches</literal> value if indexing exits with the <literal>max_user_watches</literal> value if indexing exits with
a message about errno <literal>ENOSPC</literal> (28) from a message about errno <literal>ENOSPC</literal> (28) from
<function>inotify_add_watch</function>.</para> <function>inotify_add_watch</function>.
</note> </para>
</formalpara>
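<para>Before changing <filename>/etc/sysctl.conf</filename>, it can help to see which limit is currently in effect. The following is a minimal sketch, assuming that the kernel exposes the value under <filename>/proc/sys/fs/inotify/</filename> (the usual place on Linux); the function name is only illustrative.</para>

```shell
# Print the current inotify watch limit, or "unknown" when it cannot be
# read (for example on a non-Linux system). The function name is
# illustrative only and is not part of any Recoll tool.
inotify_watch_limit() {
    limit_file=/proc/sys/fs/inotify/max_user_watches
    if [ -r "$limit_file" ]; then
        cat "$limit_file"
    else
        echo unknown
    fi
}

echo "fs.inotify.max_user_watches: $(inotify_watch_limit)"
```

<para>If the number printed is smaller than the number of directories in the monitored tree, the <literal>ENOSPC</literal> error described above becomes likely.</para>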
<note><title>Slowing down the reindexing rate for fast changing <formalpara><title>Slowing down the reindexing rate for fast changing
files</title> files</title>
<para>When using the real time monitor, it may happen that some <para>When using the real time monitor, it may happen that some
files need to be indexed, but change so often that they impose an files need to be indexed, but change so often that they impose an
excessive load for the system.</para> excessive load for the system.
<para>&RCL; provides a configuration option to specify the minimum &RCL; provides a configuration option to specify the minimum
time before which a file, specified by a wildcard pattern, cannot be time before which a file, specified by a wildcard pattern, cannot be
reindexed. See the <varname>mondelaypatterns</varname> parameter in reindexed. See the <varname>mondelaypatterns</varname> parameter in
the <link linkend="RCL.INSTALL.CONFIG.RECOLLCONF.MISC">configuration section</link>. the <link linkend="RCL.INSTALL.CONFIG.RECOLLCONF.MISC">configuration section</link>.
</para> </para>
</note> </formalpara>
</sect2> </simplesect>
</sect1> </sect1>
@ -1660,12 +1620,9 @@ recoll -c <replaceable>/path/to/my/new/config</replaceable></programlisting>
</listitem> </listitem>
</itemizedlist> </itemizedlist>
<para>In most cases, you can enter the terms as you <para>In most cases, you can enter the terms as you think them, even
think them, even if they contain embedded punctuation or other if they contain embedded punctuation or other non-textual characters
non-textual characters. For (e.g. &RCL; can handle things like email addresses).</para>
example, &RCL; can handle things like email addresses, or
arbitrary cut and paste from another text window, punctuation
and all.</para>
<para>The main case where you should enter text differently from <para>The main case where you should enter text differently from
how it is printed is for east-asian languages (Chinese, how it is printed is for east-asian languages (Chinese,
@ -1674,10 +1631,10 @@ recoll -c <replaceable>/path/to/my/new/config</replaceable></programlisting>
case (they would typically be printed without white case (they would typically be printed without white
space).</para> space).</para>
<para>Some searches can be quite complex, and you may want to <para>Some searches can be quite complex, and you may want to re-use
re-use them later, perhaps with some tweaking. &RCL; versions them later, perhaps with some tweaking. &RCL; can save and restore
1.21 and later can save and restore searches, using XML files. See searches. See <link linkend="RCL.SEARCH.SAVING">Saving and restoring
<link linkend="RCL.SEARCH.SAVING">Saving and restoring queries</link>. queries</link>.
</para> </para>
<sect2 id="RCL.SEARCH.GUI.SIMPLE"> <sect2 id="RCL.SEARCH.GUI.SIMPLE">
@ -1704,12 +1661,9 @@ recoll -c <replaceable>/path/to/my/new/config</replaceable></programlisting>
documents containing all of the search terms (the ones with more documents containing all of the search terms (the ones with more
terms will get better scores), just like the <guilabel>All terms will get better scores), just like the <guilabel>All
terms</guilabel> mode. <guilabel>Any term</guilabel> will search terms</guilabel> mode. <guilabel>Any term</guilabel> will search
for documents where at least one of the terms appears.</para> for documents where at least one of the terms
appears. <guilabel>File name</guilabel> will exclusively look for
<para>The <guilabel>Query Language</guilabel> features are file names, not contents.</para>
described in
<link linkend="RCL.SEARCH.LANG">a separate section</link>.
</para>
<para>All search modes allow terms to be expanded with wildcards <para>All search modes allow terms to be expanded with wildcards
characters (<literal>*</literal>, <literal>?</literal>, characters (<literal>*</literal>, <literal>?</literal>,
@ -1717,11 +1671,21 @@ recoll -c <replaceable>/path/to/my/new/config</replaceable></programlisting>
<link linkend="RCL.SEARCH.WILDCARDS">section about wildcards</link> for <link linkend="RCL.SEARCH.WILDCARDS">section about wildcards</link> for
more details.</para> more details.</para>
<para>In all modes except <guilabel>File name</guilabel>, you can
search for exact phrases (adjacent words in a given order) by
enclosing the input inside double quotes. Ex:
<literal>"virtual reality"</literal>.</para>
<para>The <guilabel>Query Language</guilabel> features are
described in
<link linkend="RCL.SEARCH.LANG">a separate section</link>.
</para>
<para>The <guilabel>File name</guilabel> search mode will <para>The <guilabel>File name</guilabel> search mode will
specifically look for file names. The point of having a separate specifically look for file names. The point of having a separate
file name search is that wild card expansion can be performed more file name search is that wild card expansion can be performed more
efficiently on a small subset of the index (allowing wild cards on efficiently on a small subset of the index (allowing wild cards on
the left of terms without excessive penalty). Things to know: the left of terms without excessive cost). Things to know:
<itemizedlist> <itemizedlist>
<listitem><para>White space in the entry should match white <listitem><para>White space in the entry should match white
space in the file name, and is not treated specially.</para> space in the file name, and is not treated specially.</para>
@ -1743,11 +1707,6 @@ recoll -c <replaceable>/path/to/my/new/config</replaceable></programlisting>
</itemizedlist> </itemizedlist>
</para> </para>
<para>In all modes except <guilabel>File name</guilabel>, you can
search for exact phrases (adjacent words in a given order) by
enclosing the input inside double quotes. Ex:
<literal>"virtual reality"</literal>.</para>
<para>When using a stripped index (the default), character case has <para>When using a stripped index (the default), character case has
no influence on search, except that you can disable stem expansion no influence on search, except that you can disable stem expansion
for any term by capitalizing it. Ie: a search for for any term by capitalizing it. Ie: a search for
@ -3403,20 +3362,19 @@ recoll -c <replaceable>/path/to/my/new/config</replaceable></programlisting>
<command>recoll</command>). The query to be executed is specified <command>recoll</command>). The query to be executed is specified
as command line arguments.</para> as command line arguments.</para>
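The following is a minimal sketch, not part of Recoll, of how a script might call recollq with a query passed as a command line argument and split the result lines into fields. It assumes the result layout seen in the sample execution in this section (MIME type, URL in brackets, title in brackets, size in bytes); the layout may differ with other options or versions. The names parse_result_line and run_query are hypothetical.

```python
import re
import subprocess

# Assumed layout of one result line, taken from the sample execution:
#   mimetype [url] [title] size bytes
RESULT_RE = re.compile(r"^(\S+)\s+\[(.*?)\]\s+\[(.*)\]\s+(\d+)\s+bytes\s*$")


def parse_result_line(line):
    """Return a dict for a result line, or None for any other line."""
    m = RESULT_RE.match(line)
    if m is None:
        return None
    mime, url, title, size = m.groups()
    return {"mime": mime, "url": url, "title": title, "bytes": int(size)}


def run_query(query, program="recollq"):
    """Run the query and return the parsed result lines.

    The query is passed as a single command line argument, as in the
    sample execution. Lines that do not look like results (the query
    dump, the result count, truncated lines) are skipped.
    """
    proc = subprocess.run([program, query], capture_output=True,
                          text=True, check=True)
    results = []
    for line in proc.stdout.splitlines():
        parsed = parse_result_line(line)
        if parsed is not None:
            results.append(parsed)
    return results
```

For instance, run_query('ilur -nautique mime:text/html') would return one dictionary per complete result line, provided recollq is installed and an index exists.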
<para><command>recollq</command> is not built by default. You can <para><command>recollq</command> is not always built by default. You
use the <filename>Makefile</filename> in the can use the <filename>Makefile</filename> in the
<filename>query</filename> directory to build it. This is a very <filename>query</filename> directory to build it. This is a very
simple program, and if you can program a little C++, you may find it simple program, and if you can program a little C++, you may find it
useful to tailor its output format to your needs. Note that recollq is useful to tailor its output format to your needs. Apart from being
only really useful on systems where the Qt libraries (or even the X11 easily customised, <command>recollq</command> is only really useful
ones) are not available. Otherwise, just use on systems where the Qt libraries are not available, else it is
<literal>recoll -t</literal>, which takes the exact same redundant with <literal>recoll -t</literal>.</para>
parameters and options which
are described for <command>recollq</command></para>
<para><command>recollq</command> has a man page (not installed by <para><command>recollq</command> has a
default, look in the <filename>doc/man</filename> directory). The <ulink url="https://www.lesbonscomptes.com/recoll/manpages/recollq.1.html">man page</ulink>.
Usage string is as follows:</para>
The Usage string is as follows:</para>
<programlisting> <programlisting>
recollq: usage: recollq: usage:
-P: Show the date span for all the documents present in the index -P: Show the date span for all the documents present in the index
@ -3455,9 +3413,9 @@ recoll -c <replaceable>/path/to/my/new/config</replaceable></programlisting>
</programlisting> </programlisting>
<para>Sample execution:</para> <para>Sample execution:</para>
<programlisting>recollq 'ilur -nautique mime:text/html' <programlisting>
Recoll query: ((((ilur:(wqf=11) OR ilurs) AND_NOT (nautique:(wqf=11) recollq 'ilur -nautique mime:text/html'
OR nautiques OR nautiqu OR nautiquement)) FILTER Ttext/html)) Recoll query: ((((ilur:(wqf=11) OR ilurs) AND_NOT (nautique:(wqf=11) OR nautiques OR nautiqu OR nautiquement)) FILTER Ttext/html))
4 results 4 results
text/html [file:///Users/uncrypted-dockes/projets/bateaux/ilur/comptes.html] [comptes.html] 18593 bytes text/html [file:///Users/uncrypted-dockes/projets/bateaux/ilur/comptes.html] [comptes.html] 18593 bytes
text/html [file:///Users/uncrypted-dockes/projets/nautique/webnautique/articles/ilur1/index.html] [Constructio... text/html [file:///Users/uncrypted-dockes/projets/nautique/webnautique/articles/ilur1/index.html] [Constructio...
@ -5835,9 +5793,8 @@ for i in range(nres):
<sect1 id="RCL.INSTALL.EXTERNAL"> <sect1 id="RCL.INSTALL.EXTERNAL">
<title>Supporting packages</title> <title>Supporting packages</title>
<note><para>The &WIN; installation of &RCL; is self-contained, and <note><para>The &WIN; installation of &RCL; is self-contained.
only needs Python 2.7 to be externally installed. &WIN; users can &WIN; users can skip this section.</para></note>
skip this section.</para></note>
<para>&RCL; uses external applications to index some file <para>&RCL; uses external applications to index some file
types. You need to install them for the file types that you wish to types. You need to install them for the file types that you wish to
@ -5851,134 +5808,46 @@ for i in range(nres):
<filename>missing</filename> text file inside the configuration <filename>missing</filename> text file inside the configuration
directory.</para> directory.</para>
<para>A list of common file types which need external <para>The past has proven that I was unable to maintain an up to date
commands follows. Many of the handlers need the application list in this manual. Please check &RCLAPPS; for a
<command>iconv</command> command, which is not always listed as a complete list along with links to the home pages or best
dependency.</para> source/patches pages, and misc tips. What follows is only a
very short extract of the stable essentials.</para>
<para>Please note that, due to the relatively dynamic nature of this
information, the most up to date version is now kept on &RCLAPPS;
along with links to the home pages or best source/patches pages,
and misc tips. The list below is not updated often and may be quite
stale.</para>
<para>For many Linux distributions, most of the commands listed can
be installed from the package repositories. However, the packages
are sometimes outdated, or not the best version for &RCL;, so you
should take a look at &RCLAPPS; if a file
type is important to you.</para>
<para>As of &RCL; release 1.14, a number of XML-based formats that
were handled by ad hoc handler code now use the
<command>xsltproc</command> command, which usually comes with
<application>libxslt</application>. These are: abiword, fb2
(ebooks), kword, openoffice, svg.</para>
<para>Now for the list:</para>
<itemizedlist> <itemizedlist>
<listitem><para>Openoffice files need <command>unzip</command> and
<command>xsltproc</command>.</para></listitem>
<listitem><para>PDF files need <command>pdftotext</command> <listitem><para>PDF files need <command>pdftotext</command>
which is part of <application>Poppler</application> (usually which is part of <application>Poppler</application> (usually
comes with the <literal>poppler-utils</literal> comes with the <literal>poppler-utils</literal>
package). Avoid the original one from package). Avoid the original one from
<application>Xpdf</application>.</para></listitem> <application>Xpdf</application>.</para></listitem>
<listitem><para>Postscript files need <command>pstotext</command>. <listitem><para>MS Word documents need
The original version has an issue with shell
characters in file names, which is corrected in recent
packages. See &RCLAPPS; for more detail.</para>
</listitem>
<listitem><para>MS Word needs
<command>antiword</command>. It is also useful to have <command>antiword</command>. It is also useful to have
<command>wvWare</command> installed as it may be <command>wvWare</command> installed as it may be
used as a fallback for some files which used as a fallback for some files which
<command>antiword</command> does not handle.</para></listitem> <command>antiword</command> does not handle.</para></listitem>
<listitem><para>MS Excel and PowerPoint are processed by
internal <command>Python</command> handlers.</para></listitem>
<listitem><para>MS Open XML (docx) needs <command>
xsltproc</command>.</para></listitem>
<listitem><para>Wordperfect files need <command>wpd2html</command>
from the <application>libwpd</application> (or
<application>libwpd-tools</application> on Ubuntu)
package.</para></listitem>
<listitem><para>RTF files need <command>unrtf</command>, <listitem><para>RTF files need <command>unrtf</command>,
which, in its older versions, has much trouble with which, in its older versions, has much trouble with
non-western character sets. Many Linux distributions carry non-western character sets. Many Linux distributions carry
outdated <command>unrtf</command> versions. Check outdated <command>unrtf</command> versions. Check
&RCLAPPS; for details.</para></listitem> &RCLAPPS; for details.</para></listitem>
<listitem><para>TeX files need <command>untex</command> or
<command>detex</command>. Check &RCLAPPS; for sources if it's not
packaged for your distribution.</para></listitem>
<listitem><para>dvi files need <command>dvips</command>.</para>
</listitem>
<listitem><para>djvu files need <command>djvutxt</command> and
<command>djvused</command> from the
<application>DjVuLibre</application> package.</para></listitem>
<listitem><para>Audio files: &RCL; releases 1.14 and later use
a single <application>Python</application> handler based
on <application>mutagen</application> for all audio file
types.</para>
</listitem>
<listitem><para>Pictures: &RCL; uses the <listitem><para>Pictures: &RCL; uses the
<application>Exiftool</application> <application>Exiftool</application>
<application>Perl</application> package to extract tag <application>Perl</application> package to extract tag
information. Most image file formats are supported. Note that information. Most image file formats are
there may not be much interest in indexing the technical tags supported.</para></listitem>
(image size, aperture, etc.). This is only of interest if you
store personal tags or textual descriptions inside the image
files.</para></listitem>
<listitem><para>chm: files in Microsoft help format need Python and <listitem><para>Up to &RCL; 1.24, many XML-based formats need the
the <application>pychm</application> module (which needs <command>xsltproc</command> command, which usually comes with
<application>chmlib</application>).</para></listitem> <application>libxslt</application>. These are: abiword, fb2
ebooks, kword, openoffice, opendocument, svg. &RCL; 1.25 and later
<listitem><para>ICS: up to &RCL; 1.13, iCalendar files need process them internally (using libxslt).</para>
<application>Python</application>
and the <application>icalendar</application>
module. <application>icalendar</application> is not needed for newer
versions, which use internal code.</para></listitem>
<listitem><para>Zip archives need <application>Python</application>
(and the standard zipfile module). </para></listitem>
<listitem><para>Rar archives need
<application>Python</application>, the
<application>rarfile</application> Python module and the
<command>unrar</command> utility.</para></listitem>
<listitem><para>Midi karaoke files need
<application>Python</application> and the
<ulink url="http://pypi.python.org/pypi/midi/0.2.1">
<application>Midi module</application></ulink></para>
</listitem> </listitem>
<listitem><para>Konqueror webarchive format with Python (uses the
Tarfile module).</para></listitem>
<listitem><para>Mimehtml web archive format (support based on
the email handler, which introduces some mild weirdness, but
still usable).</para></listitem>
</itemizedlist> </itemizedlist>
<para>Text, HTML, email folders, and Scribus files are
processed internally. <application>Lyx</application> is used to
index Lyx files. Many handlers need <command>iconv</command> and the
standard <command>sed</command> and <command>awk</command>.
</para>
</sect1> </sect1>
@ -6089,9 +5958,10 @@ for i in range(nres):
terms. </para></listitem> terms. </para></listitem>
<listitem><para><option>--with-fam</option> or <listitem><para><option>--with-fam</option> or
<option>--with-inotify</option> will enable the code for <option>--with-inotify</option> will enable the code for real
real time indexing. Inotify support is enabled by default on time indexing. Inotify support is enabled by default on Linux
recent Linux systems.</para></listitem> systems.</para></listitem>
<listitem><para><option>--with-qzeitgeist</option> will <listitem><para><option>--with-qzeitgeist</option> will
enable sending <application>Zeitgeist</application> enable sending <application>Zeitgeist</application>


@ -216,9 +216,9 @@ usesystemfilecommand = 1
# <var name="systemfilecommand" type="string"><brief>Command used to guess # <var name="systemfilecommand" type="string"><brief>Command used to guess
# MIME types if the internal methods fail</brief><descr>This should be a # MIME types if the internal methods fail</brief><descr>This should be a
# "file -i" workalike. The file path will be added as a last parameter to # "file -i" workalike. The file path will be added as a last parameter to
# the command line. 'xdg-mime' works better than the traditional 'file' # the command line. "xdg-mime" works better than the traditional "file"
# command, and is now the configured default (with a hard-coded fallback to # command, and is now the configured default (with a hard-coded fallback to
# 'file')</descr></var> # "file")</descr></var>
systemfilecommand = xdg-mime query filetype systemfilecommand = xdg-mime query filetype
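The behaviour described for systemfilecommand can be illustrated with a short sketch. This is not Recoll's implementation, only an illustration under stated assumptions: the configured value is split into words, the file path is appended as the last parameter, and "file -i" is tried if the configured command fails. The names build_command and guess_mime are hypothetical.

```python
import shlex
import subprocess


def build_command(systemfilecommand, path):
    """Split the configured command and append the file path last."""
    return shlex.split(systemfilecommand) + [path]


def guess_mime(path, systemfilecommand="xdg-mime query filetype"):
    """Return the raw output of the first command that succeeds, else None.

    Assumption: the fallback is the "file -i" command. Its output has
    the form "path: type; charset=...", so a real caller would need to
    extract the MIME type from it.
    """
    for command in (systemfilecommand, "file -i"):
        try:
            proc = subprocess.run(build_command(command, path),
                                  capture_output=True, text=True, check=True)
        except (OSError, subprocess.CalledProcessError):
            # Command missing or returned an error: try the next one.
            continue
        output = proc.stdout.strip()
        if output:
            return output
    return None
```

Passing the path as a separate list element, rather than pasting it into a shell string, keeps file names with spaces or shell characters intact.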
# <var name="processwebqueue" type="bool"><brief>Decide if we process the # <var name="processwebqueue" type="bool"><brief>Decide if we process the
@ -885,7 +885,7 @@ snippetMaxPosWalk = 1000000
# include a translation to a Recoll field name, separated by a '|' # include a translation to a Recoll field name, separated by a '|'
# character. If the second element is absent, the tag name is used as the # character. If the second element is absent, the tag name is used as the
# Recoll field name. You will also need to add specifications to the # Recoll field name. You will also need to add specifications to the
# 'fields' file to direct processing of the extracted data.</descr></var> # "fields" file to direct processing of the extracted data.</descr></var>
#pdfextrameta = bibtex:location|location bibtex:booktitle bibtex:pages #pdfextrameta = bibtex:location|location bibtex:booktitle bibtex:pages
# <var name="pdfextrametafix" type="fn"> # <var name="pdfextrametafix" type="fn">