Jean-Francois Dockes 2018-11-16 17:29:55 +01:00
parent 55e2fe5d27
commit 6f1be83251
2 changed files with 336 additions and 310 deletions

@ -35,7 +35,7 @@ alink="#0000FF">
</div> </div>
</div> </div>
<div> <div>
<p class="copyright">Copyright © 2005-2015 Jean-Francois <p class="copyright">Copyright © 2005-2018 Jean-Francois
Dockes</p> Dockes</p>
</div> </div>
<div> <div>
@ -92,11 +92,11 @@ alink="#0000FF">
"#RCL.INDEXING.INTRODUCTION.CONFIG">Configurations, "#RCL.INDEXING.INTRODUCTION.CONFIG">Configurations,
multiple indexes</a></span></dt> multiple indexes</a></span></dt>
<dt><span class="sect2">2.1.3. <a href= <dt><span class="sect2">2.1.3. <a href=
"#idm223">Document types</a></span></dt> "#idm224">Document types</a></span></dt>
<dt><span class="sect2">2.1.4. <a href= <dt><span class="sect2">2.1.4. <a href=
"#idm264">Indexing failures</a></span></dt> "#idm265">Indexing failures</a></span></dt>
<dt><span class="sect2">2.1.5. <a href= <dt><span class="sect2">2.1.5. <a href=
"#idm276">Recovery</a></span></dt> "#idm277">Recovery</a></span></dt>
</dl> </dl>
</dd> </dd>
<dt><span class="sect1">2.2. <a href= <dt><span class="sect1">2.2. <a href=
@ -176,9 +176,11 @@ alink="#0000FF">
<dd> <dd>
<dl> <dl>
<dt><span class="sect2">2.9.1. <a href= <dt><span class="sect2">2.9.1. <a href=
"#RCL.INDEXING.MONITOR.FASTFILES">Slowing down the "#RCL.INDEXING.MONITOR.START">Real time indexing:
reindexing rate for fast changing automatic daemon start</a></span></dt>
files</a></span></dt> <dt><span class="sect2">2.9.2. <a href=
"#RCL.INDEXING.MONITOR.DETAILS">Real time indexing:
miscellaneous details</a></span></dt>
</dl> </dl>
</dd> </dd>
</dl> </dl>
@ -481,9 +483,8 @@ alink="#0000FF">
"guimenuitem">Indexing configuration</span>, then adjust "guimenuitem">Indexing configuration</span>, then adjust
the <span class="guilabel">Top directories</span> the <span class="guilabel">Top directories</span>
section).</p> section).</p>
<p>Also be aware that, on Unix/Linux, you may need to <p>On Unix/Linux, you may need to install the appropriate
install the appropriate <a class="link" href= <a class="link" href="#RCL.INSTALL.EXTERNAL" title=
"#RCL.INSTALL.EXTERNAL" title=
"6.2.&nbsp;Supporting packages">supporting applications</a> "6.2.&nbsp;Supporting packages">supporting applications</a>
for document types that need them (for example <span class= for document types that need them (for example <span class=
"application">antiword</span> for <span class= "application">antiword</span> for <span class=
@ -594,9 +595,10 @@ alink="#0000FF">
"application">Recoll</span> can only display documents that "application">Recoll</span> can only display documents that
still exist at the place from which they were indexed. still exist at the place from which they were indexed.
(Actually, there is a way to reconstruct a document from (Actually, there is a way to reconstruct a document from
the information in the index, but the result is not nice, the information in the index, but only the pure text is
as all formatting, punctuation and capitalization are saved, possibly without punctuation and capitalization,
lost).</p> depending on <span class="application">Recoll</span>
version).</p>
<p><span class="application">Recoll</span> stores all <p><span class="application">Recoll</span> stores all
internal data in <span class="application">Unicode internal data in <span class="application">Unicode
UTF-8</span> format, and it can index files of many types UTF-8</span> format, and it can index files of many types
@ -796,11 +798,10 @@ alink="#0000FF">
<li class="listitem"> <li class="listitem">
<p><b><a class="link" href="#RCL.INDEXING.PERIODIC" <p><b><a class="link" href="#RCL.INDEXING.PERIODIC"
title="2.8.&nbsp;Periodic indexing">Periodic (or title="2.8.&nbsp;Periodic indexing">Periodic (or
batch) indexing:</a>&nbsp;</b>indexing takes place batch) indexing:</a>&nbsp;</b><span class=
at discrete times, by executing the <span class= "command"><strong>recollindex</strong></span> is
"command"><strong>recollindex</strong></span> executed at discrete times. The typical usage is to
command. The typical usage is to have a nightly have a nightly run <a class="link" href=
indexing run <a class="link" href=
"#RCL.INDEXING.PERIODIC.AUTOMAT" title= "#RCL.INDEXING.PERIODIC.AUTOMAT" title=
"2.8.2.&nbsp;Using cron to automate indexing">programmed</a> "2.8.2.&nbsp;Using cron to automate indexing">programmed</a>
into your <span class= into your <span class=
@ -809,13 +810,13 @@ alink="#0000FF">
<li class="listitem"> <li class="listitem">
<p><b><a class="link" href="#RCL.INDEXING.MONITOR" <p><b><a class="link" href="#RCL.INDEXING.MONITOR"
title="2.9.&nbsp;Real time indexing">Real time title="2.9.&nbsp;Real time indexing">Real time
indexing:</a>&nbsp;</b>indexing takes place as soon indexing:</a>&nbsp;</b><span class=
as a file is created or changed. <span class=
"command"><strong>recollindex</strong></span> runs "command"><strong>recollindex</strong></span> runs
as a daemon and uses a file system alteration permanently as a daemon and uses a file system
monitor (e.g. <span class= alteration monitor (e.g. <span class=
"application">inotify</span>) to detect file "application">inotify</span>) to detect file
changes.</p> changes. New or updated files are indexed at
once.</p>
</li> </li>
</ul> </ul>
</div> </div>
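The periodic mode described in the hunk above amounts to a single scheduled command. A minimal sketch, assuming a 03:00 nightly run (the time is arbitrary) and a hypothetical helper named `nightly_index_entry` that only prints the crontab line:

```shell
#!/bin/sh
# Sketch: build a crontab line that runs recollindex once a day.
# nightly_index_entry is a hypothetical helper, not part of Recoll.
# Usage: nightly_index_entry [hour] [minute]   (defaults: 3 and 0)
nightly_index_entry() {
    # crontab fields: minute hour day-of-month month day-of-week command
    printf '%s %s * * * recollindex\n' "${2:-0}" "${1:-3}"
}

# Print the entry so that it can be reviewed before installation.
nightly_index_entry

# One possible way to append it to the current user's crontab:
#   ( crontab -l 2>/dev/null; nightly_index_entry ) | crontab -
```

Printing the line first, rather than installing it directly, leaves the existing crontab untouched until the schedule has been checked.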
@ -825,7 +826,7 @@ alink="#0000FF">
documentation directory, and real time indexing on a documentation directory, and real time indexing on a
small home directory). Monitoring a big file system tree small home directory). Monitoring a big file system tree
can consume significant system resources.</p> can consume significant system resources.</p>
<p>With <span class="application">Recoll</span> 1.25 and <p>With <span class="application">Recoll</span> 1.24 and
newer, it is also possible to set up an index so that newer, it is also possible to set up an index so that
only a subset of the tree will be monitored and the rest only a subset of the tree will be monitored and the rest
will be covered by batch/incremental indexing. (See the will be covered by batch/incremental indexing. (See the
@ -838,9 +839,9 @@ alink="#0000FF">
"command"><strong>recoll</strong></span> GUI: "command"><strong>recoll</strong></span> GUI:
<span class="guimenu">Preferences</span><span class= <span class="guimenu">Preferences</span><span class=
"guimenuitem">Indexing schedule</span></p> "guimenuitem">Indexing schedule</span></p>
<p>The <span class="guimenu">File</span> menu also has <p>The GUI <span class="guimenu">File</span> menu also
entries to start or stop the current indexing operation. has entries to start or stop the current indexing
Stopping indexing is performed by killing the operation. Stopping indexing is performed by killing the
<span class="command"><strong>recollindex</strong></span> <span class="command"><strong>recollindex</strong></span>
process, which will checkpoint its state and exit. A process, which will checkpoint its state and exit. A
later restart of indexing will mostly resume from where later restart of indexing will mostly resume from where
@ -900,7 +901,7 @@ alink="#0000FF">
<p>When generating indexes, the different configurations <p>When generating indexes, the different configurations
are entirely independent (no parameters are ever shared are entirely independent (no parameters are ever shared
between configurations when indexing).</p> between configurations when indexing).</p>
<p>Multiple indexes can queried concurrently, either <p>Multiple indexes can be queried concurrently, either
from the GUI or the command line. When doing this, there from the GUI or the command line. When doing this, there
is always a main configuration, from which both is always a main configuration, from which both
configuration and index data are used. Only the index configuration and index data are used. Only the index
@ -923,8 +924,8 @@ alink="#0000FF">
<div class="titlepage"> <div class="titlepage">
<div> <div>
<div> <div>
<h3 class="title"><a name="idm223" id= <h3 class="title"><a name="idm224" id=
"idm223"></a>2.1.3.&nbsp;Document types</h3> "idm224"></a>2.1.3.&nbsp;Document types</h3>
</div> </div>
</div> </div>
</div> </div>
@ -943,10 +944,10 @@ alink="#0000FF">
<span class="application">LibreOffice</span> document <span class="application">LibreOffice</span> document
stored as an attachment to an email message inside an stored as an attachment to an email message inside an
email folder archived in a zip file...</p> email folder archived in a zip file...</p>
<p><span class="application">Recoll</span> indexing <p><span class=
processes plain text, HTML, OpenDocument "command"><strong>recollindex</strong></span> processes
(Open/LibreOffice), email formats, and a few others plain text, HTML, OpenDocument (Open/LibreOffice), email
internally.</p> formats, and a few others internally.</p>
<p>Other file types (e.g. postscript, pdf, ms-word, rtf <p>Other file types (e.g. postscript, pdf, ms-word, rtf
...) need external applications for preprocessing. The ...) need external applications for preprocessing. The
list is in the <a class="link" href= list is in the <a class="link" href=
@ -967,15 +968,15 @@ alink="#0000FF">
to either exclude some types, or on the contrary define a to either exclude some types, or on the contrary define a
positive list of types to be indexed. In the latter case, positive list of types to be indexed. In the latter case,
any type not in the list will be ignored.</p> any type not in the list will be ignored.</p>
<p>Excluding file types can be done by adding wildcard <p>Excluding files by name can be done by adding wildcard
name patterns to the <a class="link" href= name patterns to the <a class="link" href=
"#RCL.INSTALL.CONFIG.RECOLLCONF.SKIPPEDNAMES">skippedNames</a> "#RCL.INSTALL.CONFIG.RECOLLCONF.SKIPPEDNAMES">skippedNames</a>
list, which can be done from the GUI Index configuration list, which can be done from the GUI Index configuration
menu. For versions 1.20 and later, you can alternatively menu. Excluding by type can be done by setting the
set the <a class="link" href= <a class="link" href=
"#RCL.INSTALL.CONFIG.RECOLLCONF.EXCLUDEDMIMETYPES">excludedmimetypes</a> "#RCL.INSTALL.CONFIG.RECOLLCONF.EXCLUDEDMIMETYPES">excludedmimetypes</a>
list in the configuration file. This can be redefined for list in the configuration file (1.20 and later). This can
subdirectories.</p> be redefined for subdirectories.</p>
<p>You can also define an exclusive list of MIME types to <p>You can also define an exclusive list of MIME types to
be indexed (no others will be indexed), by setting the be indexed (no others will be indexed), by setting the
<a class="link" href= <a class="link" href=
@ -1021,8 +1022,8 @@ alink="#0000FF">
<div class="titlepage"> <div class="titlepage">
<div> <div>
<div> <div>
<h3 class="title"><a name="idm264" id= <h3 class="title"><a name="idm265" id=
"idm264"></a>2.1.4.&nbsp;Indexing failures</h3> "idm265"></a>2.1.4.&nbsp;Indexing failures</h3>
</div> </div>
</div> </div>
</div> </div>
@ -1039,7 +1040,7 @@ alink="#0000FF">
may be quite costly (for example failing to uncompress a may be quite costly (for example failing to uncompress a
big file because of insufficient disk space).</p> big file because of insufficient disk space).</p>
<p>The indexer in <span class="application">Recoll</span> <p>The indexer in <span class="application">Recoll</span>
versions 1.21 and later does not retry failed file by versions 1.21 and later does not retry failed files by
default. Retrying will only occur if an explicit option default. Retrying will only occur if an explicit option
(<code class="option">-k</code>) is set on the (<code class="option">-k</code>) is set on the
<span class="command"><strong>recollindex</strong></span> <span class="command"><strong>recollindex</strong></span>
@ -1057,8 +1058,8 @@ alink="#0000FF">
<div class="titlepage"> <div class="titlepage">
<div> <div>
<div> <div>
<h3 class="title"><a name="idm276" id= <h3 class="title"><a name="idm277" id=
"idm276"></a>2.1.5.&nbsp;Recovery</h3> "idm277"></a>2.1.5.&nbsp;Recovery</h3>
</div> </div>
</div> </div>
</div> </div>
@ -1153,9 +1154,9 @@ alink="#0000FF">
non-indexed data (an extreme example being a set of mp3 non-indexed data (an extreme example being a set of mp3
files where only the tags would be indexed).</p> files where only the tags would be indexed).</p>
<p>Of course, images, sound and video do not increase the <p>Of course, images, sound and video do not increase the
index size, which means that nowadays, typically, even a index size, which means that typically, even a big index
big index will be negligible against the total amount of will be negligible against the total amount of data on the
data on the computer.</p> computer.</p>
<p>The index data directory (<code class= <p>The index data directory (<code class=
"filename">xapiandb</code>) only contains data that can be "filename">xapiandb</code>) only contains data that can be
completely rebuilt by an index run (as long as the original completely rebuilt by an index run (as long as the original
@ -1200,10 +1201,11 @@ alink="#0000FF">
</div> </div>
</div> </div>
<p>The <span class="application">Recoll</span> index does <p>The <span class="application">Recoll</span> index does
not hold copies of the indexed documents. But it does not hold complete copies of the indexed documents (it
hold enough data to allow for an almost complete almost does after version 1.24). But it does hold enough
reconstruction. If confidential data is indexed, access data to allow for an almost complete reconstruction. If
to the database directory should be restricted.</p> confidential data is indexed, access to the database
directory should be restricted.</p>
<p><span class="application">Recoll</span> will create <p><span class="application">Recoll</span> will create
the configuration directory with a mode of 0700 (access the configuration directory with a mode of 0700 (access
by owner only). As the index data directory is by default by owner only). As the index data directory is by default
@ -1256,8 +1258,7 @@ alink="#0000FF">
"refentrytitle">recoll.conf</span>(5)</span> man page, but "refentrytitle">recoll.conf</span>(5)</span> man page, but
the most current information will most likely be the the most current information will most likely be the
comments inside the sample file. The most immediately comments inside the sample file. The most immediately
useful variable you may interested in is probably <a class= useful variable is probably <a class="link" href=
"link" href=
"#RCL.INSTALL.CONFIG.RECOLLCONF.TOPDIRS"><code class= "#RCL.INSTALL.CONFIG.RECOLLCONF.TOPDIRS"><code class=
"varname">topdirs</code></a>, which determines what "varname">topdirs</code></a>, which determines what
subtrees and files get indexed.</p> subtrees and files get indexed.</p>
@ -1271,9 +1272,8 @@ alink="#0000FF">
Recoll indexes, depending on the treatment of character Recoll indexes, depending on the treatment of character
case and diacritics. A <a class="link" href= case and diacritics. A <a class="link" href=
"#RCL.INDEXING.CONFIG.SENS" title= "#RCL.INDEXING.CONFIG.SENS" title=
"2.3.2.&nbsp;Index case and diacritics sensitivity">a "2.3.2.&nbsp;Index case and diacritics sensitivity">further
further section</a> describes the two types in more section</a> describes the two types in more detail.</p>
detail.</p>
<div class="sect2"> <div class="sect2">
<div class="titlepage"> <div class="titlepage">
<div> <div>
@ -1317,7 +1317,7 @@ alink="#0000FF">
where narrowing the search can improve the results. You where narrowing the search can improve the results. You
can achieve approximately the same effect with the can achieve approximately the same effect with the
directory filter in advanced search, but multiple indexes directory filter in advanced search, but multiple indexes
will have much better performance and may be worth the will have better performance and may be worth the
trouble.</p> trouble.</p>
<p>A <span class= <p>A <span class=
"command"><strong>recollindex</strong></span> program "command"><strong>recollindex</strong></span> program
@ -1325,7 +1325,7 @@ alink="#0000FF">
only use parameters from a single configuration (no only use parameters from a single configuration (no
parameters are ever shared between configurations when parameters are ever shared between configurations when
indexing).</p> indexing).</p>
<p>Multiple indexes can queried concurrently, either <p>Multiple indexes can be queried concurrently, either
from the GUI or the command line. When doing this, there from the GUI or the command line. When doing this, there
is always a main configuration, from which both is always a main configuration, from which both
configuration and index data are used. Only the index configuration and index data are used. Only the index
@ -2082,68 +2082,6 @@ alink="#0000FF">
"command"><strong>recollindex</strong></span> will detach "command"><strong>recollindex</strong></span> will detach
from the terminal and become a daemon, permanently from the terminal and become a daemon, permanently
monitoring file changes and updating the index.</p> monitoring file changes and updating the index.</p>
<p>Under <span class="application">KDE</span>, <span class=
"application">Gnome</span> and some other desktop
environments, the daemon can be automatically started when you
log in, by creating a desktop file inside the <code class=
"filename">~/.config/autostart</code> directory. This can
be done for you by the <span class=
"application">Recoll</span> GUI. Use the <span class=
"guimenu">Preferences-&gt;Indexing Schedule</span>
menu.</p>
<p>With older <span class="application">X11</span> setups,
starting the daemon is normally performed as part of the
user session script.</p>
<p>The <code class="filename">rclmon.sh</code> script can
be used to easily start and stop the daemon. It can be
found in the <code class="filename">examples</code>
directory (typically <code class=
"filename">/usr/local/[share/]recoll/examples</code>).</p>
<p>For example, my out of fashion <span class=
"application">xdm</span>-based session has a <code class=
"filename">.xsession</code> script with the following lines
at the end:</p>
<pre class="programlisting">recollconf=$HOME/.recoll-home
recolldata=/usr/local/share/recoll
RECOLL_CONFDIR=$recollconf $recolldata/examples/rclmon.sh start
fvwm
</pre>
<p>The indexing daemon gets started, then the window
manager, for which the session waits.</p>
<p>By default the indexing daemon will monitor the state of
the X11 session, and exit when it finishes; it is not
necessary to kill it explicitly. (The <span class=
"application">X11</span> server monitoring can be disabled
with option <code class="option">-x</code> to <span class=
"command"><strong>recollindex</strong></span>).</p>
<p>If you use the daemon completely out of an <span class=
"application">X11</span> session, you need to add option
<code class="option">-x</code> to disable <span class=
"application">X11</span> session monitoring (else the
daemon will not start).</p>
<p>By default, the messages from the indexing daemon will
be sent to the same file as those from the interactive
commands (<code class="literal">logfilename</code>). You
may want to change this by setting the <code class=
"varname">daemlogfilename</code> and <code class=
"varname">daemloglevel</code> configuration parameters.
Also the log file will only be truncated when the daemon
starts. If the daemon runs permanently, the log file may
grow quite big, depending on the log level.</p>
<p>When building <span class="application">Recoll</span>,
the real time indexing support can be customised during
package <a class="link" href="#RCL.INSTALL.BUILDING" title=
"6.3.&nbsp;Building from source">configuration</a> with the
<code class="option">--with[out]-fam</code> or <code class=
"option">--with[out]-inotify</code> options. The default is
currently to include <span class=
"application">inotify</span> monitoring on systems that
support it, and, as of <span class=
"application">Recoll</span> 1.17, <span class=
"application">gamin</span> support on <span class=
"application">FreeBSD</span>.</p>
<p>While it is convenient that data is indexed in real <p>While it is convenient that data is indexed in real
time, repeated indexing can generate a significant load on time, repeated indexing can generate a significant load on
the system when files such as email folders change. Also, the system when files such as email folders change. Also,
@ -2151,68 +2089,149 @@ alink="#0000FF">
system resources. You probably do not want to enable it if system resources. You probably do not want to enable it if
your system is short on resources. Periodic indexing is your system is short on resources. Periodic indexing is
adequate in most cases.</p> adequate in most cases.</p>
<p>As of <span class="application">Recoll</span> 1.25, you <p>As of <span class="application">Recoll</span> 1.24, you
can set the <a class="link" href= can set the <a class="link" href=
"#RCL.INSTALL.CONFIG.RECOLLCONF.MONITORDIRS">monitordirs</a> "#RCL.INSTALL.CONFIG.RECOLLCONF.MONITORDIRS">monitordirs</a>
configuration variable to specify that only a subset of configuration variable to specify that only a subset of
your indexed files will be monitored for instant indexing. your indexed files will be monitored for instant indexing.
In this situation, an incremental pass on the full tree can In this situation, an incremental pass on the full tree can
be triggered by either restarting the indexer, or just be triggered by either restarting the indexer, or just
running the <span class= running <span class=
"command"><strong>recollindex</strong></span>, which will "command"><strong>recollindex</strong></span>, which will
notify the running process. The <span class= notify the running process. The <span class=
"command"><strong>recoll</strong></span> GUI also has a "command"><strong>recoll</strong></span> GUI also has a
menu entry for this.</p> menu entry for this.</p>
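The `monitordirs` paragraph above can be illustrated with a small configuration fragment. This is only a sketch: the directory names are hypothetical, and `monitor_subset_conf` is a helper invented here to print the two `recoll.conf` lines. It assumes that the whole tree is listed in `topdirs` and the monitored subset in `monitordirs`.

```shell
#!/bin/sh
# Sketch: print a recoll.conf fragment in which only part of the indexed
# tree is monitored in real time. monitor_subset_conf is a hypothetical
# helper; topdirs and monitordirs are the configuration variables.
# Usage: monitor_subset_conf "<all indexed dirs>" "<monitored dirs>"
monitor_subset_conf() {
    cat <<EOF
topdirs = $1
monitordirs = $2
EOF
}

# Example with made-up directories: index two trees, monitor only one.
monitor_subset_conf "~/Documents ~/Mail" "~/Documents"

# Per the text above, running recollindex again while the daemon is
# active notifies it to make an incremental pass over the full tree.
```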
<div class="note" style= <div class="sect2">
"margin-left: 0.5in; margin-right: 0.5in;"> <div class="titlepage">
<h3 class="title">Increasing resources for inotify</h3> <div>
<p>On Linux systems, monitoring a big tree may need <div>
increasing the resources available to inotify, which are <h3 class="title"><a name=
normally defined in <code class= "RCL.INDEXING.MONITOR.START" id=
"filename">/etc/sysctl.conf</code>.</p> "RCL.INDEXING.MONITOR.START"></a>2.9.1.&nbsp;Real
<pre class="programlisting"> time indexing: automatic daemon start</h3>
### inotify </div>
# </div>
# cat /proc/sys/fs/inotify/max_queued_events - 16384 </div>
# cat /proc/sys/fs/inotify/max_user_instances - 128 <p>Under <span class="application">KDE</span>,
# cat /proc/sys/fs/inotify/max_user_watches - 16384 <span class="application">Gnome</span> and some other
# desktop environments, the daemon can be automatically
# -- Change to: started when you log in, by creating a desktop file
# inside the <code class=
fs.inotify.max_queued_events=32768 "filename">~/.config/autostart</code> directory. This can
fs.inotify.max_user_instances=256 be done for you by the <span class=
fs.inotify.max_user_watches=32768 "application">Recoll</span> GUI. Use the <span class=
</pre> "guimenu">Preferences-&gt;Indexing Schedule</span>
<p>Especially, you will need to trim your tree or adjust menu.</p>
the <code class="literal">max_user_watches</code> value <p>With older <span class="application">X11</span>
if indexing exits with a message about errno <code class= setups, starting the daemon is normally performed as part
"literal">ENOSPC</code> (28) from <code class= of the user session script.</p>
"function">inotify_add_watch</code>.</p> <p>The <code class="filename">rclmon.sh</code> script can
be used to easily start and stop the daemon. It can be
found in the <code class="filename">examples</code>
directory (typically <code class=
"filename">/usr/local/[share/]recoll/examples</code>).</p>
<p>For example, my out of fashion <span class=
"application">xdm</span>-based session has a <code class=
"filename">.xsession</code> script with the following
lines at the end:</p>
<pre class="programlisting">recollconf=$HOME/.recoll-home
recolldata=/usr/local/share/recoll
RECOLL_CONFDIR=$recollconf $recolldata/examples/rclmon.sh start
fvwm
</pre>
<p>The indexing daemon gets started, then the window
manager, for which the session waits.</p>
<p>By default the indexing daemon will monitor the state
of the X11 session, and exit when it finishes; it is not
necessary to kill it explicitly. (The <span class=
"application">X11</span> server monitoring can be
disabled with option <code class="option">-x</code> to
<span class=
"command"><strong>recollindex</strong></span>).</p>
<p>If you use the daemon completely out of an
<span class="application">X11</span> session, you need to
add option <code class="option">-x</code> to disable
<span class="application">X11</span> session monitoring
(else the daemon will not start).</p>
</div> </div>
<div class="sect2"> <div class="sect2">
<div class="titlepage"> <div class="titlepage">
<div> <div>
<div> <div>
<h3 class="title"><a name= <h3 class="title"><a name=
"RCL.INDEXING.MONITOR.FASTFILES" id= "RCL.INDEXING.MONITOR.DETAILS" id=
"RCL.INDEXING.MONITOR.FASTFILES"></a>2.9.1.&nbsp;Slowing "RCL.INDEXING.MONITOR.DETAILS"></a>2.9.2.&nbsp;Real
down the reindexing rate for fast changing time indexing: miscellaneous details</h3>
files</h3>
</div> </div>
</div> </div>
</div> </div>
<p>When using the real time monitor, it may happen that <p>By default, the messages from the indexing daemon will
some files need to be indexed, but change so often that be sent to the same file as those from the interactive
they impose an excessive load for the system.</p> commands (<code class="literal">logfilename</code>). You
<p><span class="application">Recoll</span> provides a may want to change this by setting the <code class=
configuration option to specify the minimum time before "varname">daemlogfilename</code> and <code class=
which a file, specified by a wildcard pattern, cannot be "varname">daemloglevel</code> configuration parameters.
reindexed. See the <code class= Also the log file will only be truncated when the daemon
"varname">mondelaypatterns</code> parameter in the starts. If the daemon runs permanently, the log file may
<a class="link" href= grow quite big, depending on the log level.</p>
"#RCL.INSTALL.CONFIG.RECOLLCONF.MISC" title= <p>When building <span class="application">Recoll</span>,
"6.4.2.5.&nbsp;Miscellaneous parameters">configuration the real time indexing support can be customised during
section</a>.</p> package <a class="link" href="#RCL.INSTALL.BUILDING"
title="6.3.&nbsp;Building from source">configuration</a>
with the <code class="option">--with[out]-fam</code> or
<code class="option">--with[out]-inotify</code> options.
The default is currently to include <span class=
"application">inotify</span> monitoring on systems that
support it, and, as of <span class=
"application">Recoll</span> 1.17, <span class=
"application">gamin</span> support on <span class=
"application">FreeBSD</span>.</p>
<div class="note" style=
"margin-left: 0.5in; margin-right: 0.5in;">
<h3 class="title">Increasing resources for inotify</h3>
<p>On Linux systems, monitoring a big tree may need
increasing the resources available to inotify, which
are normally defined in <code class=
"filename">/etc/sysctl.conf</code>.</p>
<pre class="programlisting">
### inotify
#
# cat /proc/sys/fs/inotify/max_queued_events - 16384
# cat /proc/sys/fs/inotify/max_user_instances - 128
# cat /proc/sys/fs/inotify/max_user_watches - 16384
#
# -- Change to:
#
fs.inotify.max_queued_events=32768
fs.inotify.max_user_instances=256
fs.inotify.max_user_watches=32768
</pre>
<p>Especially, you will need to trim your tree or
adjust the <code class=
"literal">max_user_watches</code> value if indexing
exits with a message about errno <code class=
"literal">ENOSPC</code> (28) from <code class=
"function">inotify_add_watch</code>.</p>
</div>
<div class="note" style=
"margin-left: 0.5in; margin-right: 0.5in;">
<h3 class="title">Slowing down the reindexing rate for
fast changing files</h3>
<p>When using the real time monitor, it may happen that
some files need to be indexed, but change so often that
they impose an excessive load for the system.</p>
<p><span class="application">Recoll</span> provides a
configuration option to specify the minimum time before
which a file, specified by a wildcard pattern, cannot
be reindexed. See the <code class=
"varname">mondelaypatterns</code> parameter in the
<a class="link" href=
"#RCL.INSTALL.CONFIG.RECOLLCONF.MISC" title=
"6.4.2.5.&nbsp;Miscellaneous parameters">configuration
section</a>.</p>
</div>
</div> </div>
</div> </div>
</div> </div>
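Before editing `/etc/sysctl.conf` as the inotify note suggests, it can help to check the values currently in effect. The following is a sketch only: `show_inotify_limits` is a hypothetical helper, and its optional directory argument exists purely so that it can be pointed at a test directory instead of `/proc/sys/fs/inotify`.

```shell
#!/bin/sh
# Sketch: report the inotify limits named in the note above, in the same
# key=value form used by /etc/sysctl.conf.
# show_inotify_limits is a hypothetical helper, not part of Recoll.
show_inotify_limits() {
    dir="${1:-/proc/sys/fs/inotify}"
    for name in max_queued_events max_user_instances max_user_watches; do
        # Skip silently any value that is missing (e.g. not on Linux).
        if [ -r "$dir/$name" ]; then
            printf 'fs.inotify.%s=%s\n' "$name" "$(cat "$dir/$name")"
        fi
    done
}

show_inotify_limits

# To raise a limit until the next reboot (needs root):
#   sysctl -w fs.inotify.max_user_watches=32768
# To make the change permanent, add the line to /etc/sysctl.conf and
# reload it with: sysctl -p
```

If indexing stops with `ENOSPC` from `inotify_add_watch`, `max_user_watches` is the value to compare against the number of directories being monitored.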

@ -25,7 +25,7 @@
</author> </author>
<copyright> <copyright>
<year>2005-2015</year> <year>2005-2018</year>
<holder role="mailto:jfd@recoll.org">Jean-Francois Dockes</holder> <holder role="mailto:jfd@recoll.org">Jean-Francois Dockes</holder>
</copyright> </copyright>
@ -89,7 +89,7 @@
</menuchoice>, then adjust the <guilabel>Top </menuchoice>, then adjust the <guilabel>Top
directories</guilabel> section).</para> directories</guilabel> section).</para>
<para>Also be aware that, on Unix/Linux, you may need to install the <para>On Unix/Linux, you may need to install the
appropriate <link linkend="RCL.INSTALL.EXTERNAL"> supporting appropriate <link linkend="RCL.INSTALL.EXTERNAL"> supporting
applications</link> for document types that need them (for applications</link> for document types that need them (for
example <application>antiword</application> for example <application>antiword</application> for
@ -175,13 +175,13 @@
<para>In a shorter way, &RCL; does the dirty footwork, &XAP; <para>In a shorter way, &RCL; does the dirty footwork, &XAP;
deals with the intelligent parts of the process.</para> deals with the intelligent parts of the process.</para>
<para>The &XAP; index can be big (roughly the size of the <para>The &XAP; index can be big (roughly the size of the original
original document set), but it is not a document document set), but it is not a document archive. &RCL; can only
archive. &RCL; can only display documents that still exist at display documents that still exist at the place from which they were
the place from which they were indexed. (Actually, there is a indexed. (Actually, there is a way to reconstruct a document from the
way to reconstruct a document from the information in the information in the index, but only the pure text is saved, possibly
index, but the result is not nice, as all formatting, without punctuation and capitalization, depending on &RCL;
punctuation and capitalization are lost).</para> version).</para>
<para>&RCL; stores all internal data in <application>Unicode <para>&RCL; stores all internal data in <application>Unicode
UTF-8</application> format, and it can index files of many types UTF-8</application> format, and it can index files of many types
@ -332,9 +332,8 @@
<formalpara> <formalpara>
<title><link linkend="RCL.INDEXING.PERIODIC"> <title><link linkend="RCL.INDEXING.PERIODIC">
Periodic (or batch) indexing:</link></title> Periodic (or batch) indexing:</link></title>
<para>indexing takes place at discrete <para><command>recollindex</command> is executed
times, by executing the <command>recollindex</command> at discrete times. The typical usage is to have a nightly run
command. The typical usage is to have a nightly indexing run
<link linkend="RCL.INDEXING.PERIODIC.AUTOMAT"> <link linkend="RCL.INDEXING.PERIODIC.AUTOMAT">
programmed</link> into programmed</link> into
your <command>cron</command> file.</para> your <command>cron</command> file.</para>
@ -342,12 +341,12 @@
</listitem> </listitem>
<listitem> <listitem>
<formalpara><title><link linkend="RCL.INDEXING.MONITOR">Real <formalpara><title><link linkend="RCL.INDEXING.MONITOR">Real
time indexing:</link></title> <para>indexing takes place as time indexing:</link></title>
soon as a file is created or <para><command>recollindex</command> runs permanently as a
changed. <command>recollindex</command> runs as a daemon and daemon and uses a file system alteration monitor
uses a file system alteration monitor
(e.g. <application>inotify</application>) to detect file (e.g. <application>inotify</application>) to detect file
changes.</para> </formalpara> changes. New or updated files are indexed at once.</para>
</formalpara>
</listitem> </listitem>
</itemizedlist> </itemizedlist>
</para> </para>
@ -359,7 +358,7 @@
directory). Monitoring a big file system tree can consume directory). Monitoring a big file system tree can consume
significant system resources.</para> significant system resources.</para>
<para>With &RCL; 1.25 and newer, it is also possible to set up an <para>With &RCL; 1.24 and newer, it is also possible to set up an
index so that only a subset of the tree will be monitored and the index so that only a subset of the tree will be monitored and the
rest will be covered by batch/incremental indexing. (See the rest will be covered by batch/incremental indexing. (See the
details in the <link linkend="RCL.INDEXING.MONITOR">Real time details in the <link linkend="RCL.INDEXING.MONITOR">Real time
@ -373,7 +372,7 @@
</menuchoice> </menuchoice>
</para> </para>
<para>The <menuchoice><guimenu>File</guimenu> <para>The GUI <menuchoice><guimenu>File</guimenu>
</menuchoice> menu also has entries to start or stop </menuchoice> menu also has entries to start or stop
the current indexing operation. Stopping indexing is performed by the current indexing operation. Stopping indexing is performed by
killing the <command>recollindex</command> process, which will killing the <command>recollindex</command> process, which will
@ -430,10 +429,10 @@
entirely independent (no parameters are ever shared between entirely independent (no parameters are ever shared between
configurations when indexing).</para> configurations when indexing).</para>
<para>Multiple indexes can queried concurrently, either from the <para>Multiple indexes can be queried concurrently, either from
GUI or the command line. When doing this, there is always a main the GUI or the command line. When doing this, there is always a
configuration, from which both configuration and index data are main configuration, from which both configuration and index data
used. Only the index data from the additional indexes is used are used. Only the index data from the additional indexes is used
(their configuration parameters are ignored).</para> (their configuration parameters are ignored).</para>
<para>This is important and sometimes confusing, so it will be <para>This is important and sometimes confusing, so it will be
@ -464,8 +463,9 @@
document stored as an attachment to an email message inside an document stored as an attachment to an email message inside an
email folder archived in a zip file...</para> email folder archived in a zip file...</para>
<para>&RCL; indexing processes plain text, HTML, OpenDocument <para><command>recollindex</command> processes plain text, HTML,
(Open/LibreOffice), email formats, and a few others internally.</para> OpenDocument (Open/LibreOffice), email formats, and a few others
internally.</para>
<para>Other file types (e.g. postscript, pdf, ms-word, rtf ...) <para>Other file types (e.g. postscript, pdf, ms-word, rtf ...)
need external applications for preprocessing. The list is in the need external applications for preprocessing. The list is in the
@ -488,14 +488,15 @@
indexed. In the latter case, any type not in the list will indexed. In the latter case, any type not in the list will
be ignored.</para> be ignored.</para>
<para>Excluding file types can be done by adding wildcard name <para>Excluding files by name can be done by adding wildcard name
patterns to the patterns to the
<link linkend="RCL.INSTALL.CONFIG.RECOLLCONF.SKIPPEDNAMES"> <link linkend="RCL.INSTALL.CONFIG.RECOLLCONF.SKIPPEDNAMES">
skippedNames</link> list, which skippedNames</link> list, which
can be done from the GUI Index configuration menu. For can be done from the GUI Index configuration menu. Excluding by
versions 1.20 and later, you can alternatively set the type can be done by setting the
<link linkend="RCL.INSTALL.CONFIG.RECOLLCONF.EXCLUDEDMIMETYPES"> <link linkend="RCL.INSTALL.CONFIG.RECOLLCONF.EXCLUDEDMIMETYPES">
excludedmimetypes</link> list in the configuration file. This excludedmimetypes</link> list in the configuration file (1.20
and later). This
can be redefined for subdirectories.</para> can be redefined for subdirectories.</para>
<para>You can also define an exclusive list of MIME types to be <para>You can also define an exclusive list of MIME types to be
@ -550,7 +551,7 @@
file because of insufficient disk space).</para> file because of insufficient disk space).</para>
<para>The indexer in &RCL; versions 1.21 and later does not <para>The indexer in &RCL; versions 1.21 and later does not
retry failed file by default. Retrying will only occur if an retry failed files by default. Retrying will only occur if an
explicit option (<option>-k</option>) is set on the explicit option (<option>-k</option>) is set on the
<command>recollindex</command> command line, or if a script <command>recollindex</command> command line, or if a script
executed when <command>recollindex</command> starts up says executed when <command>recollindex</command> starts up says
@ -636,10 +637,9 @@
example being a set of mp3 files where only the tags would be example being a set of mp3 files where only the tags would be
indexed).</para> indexed).</para>
<para>Of course, images, sound and video do not increase the <para>Of course, images, sound and video do not increase the index
index size, which means that nowadays, typically, even a big size, which means that typically, even a big index will be negligible
index will be negligible against the total amount of data on the against the total amount of data on the computer.</para>
computer.</para>
<para>The index data directory (<filename>xapiandb</filename>) <para>The index data directory (<filename>xapiandb</filename>)
only contains data that can be completely rebuilt by an index run only contains data that can be completely rebuilt by an index run
@ -669,10 +669,11 @@
<sect2 id="RCL.INDEXING.STORAGE.SECURITY"> <sect2 id="RCL.INDEXING.STORAGE.SECURITY">
<title>Security aspects</title> <title>Security aspects</title>
<para>The &RCL; index does not hold copies of the indexed <para>The &RCL; index does not hold complete copies of the indexed
documents. But it does hold enough data to allow for an almost documents (it almost does after version 1.24). But it does
complete reconstruction. If confidential data is indexed, hold enough data to allow for an almost complete reconstruction. If
access to the database directory should be restricted. </para> confidential data is indexed, access to the database directory
should be restricted. </para>
<para>&RCL; will create the configuration directory with a mode of <para>&RCL; will create the configuration directory with a mode of
0700 (access by owner only). As the index data directory is by 0700 (access by owner only). As the index data directory is by
@ -716,10 +717,9 @@
<refentrytitle>recoll.conf</refentrytitle> <refentrytitle>recoll.conf</refentrytitle>
<manvolnum>5</manvolnum> <manvolnum>5</manvolnum>
</citerefentry> </citerefentry>
man page, but the most man page, but the most current information will most likely be the
current information will most likely be the comments inside the comments inside the sample file. The most immediately useful variable
sample file. The most immediately useful variable you may is probably
interested in is probably
<link linkend="RCL.INSTALL.CONFIG.RECOLLCONF.TOPDIRS"> <link linkend="RCL.INSTALL.CONFIG.RECOLLCONF.TOPDIRS">
<varname>topdirs</varname></link>, <varname>topdirs</varname></link>,
which determines what subtrees and files get indexed.</para> which determines what subtrees and files get indexed.</para>
@ -731,7 +731,7 @@
<para>As of Recoll 1.18 there are two incompatible types of Recoll <para>As of Recoll 1.18 there are two incompatible types of Recoll
indexes, depending on the treatment of character case and indexes, depending on the treatment of character case and
diacritics. A <link linkend="RCL.INDEXING.CONFIG.SENS">a further diacritics. A <link linkend="RCL.INDEXING.CONFIG.SENS">further
section</link> describes the two types in more detail.</para> section</link> describes the two types in more detail.</para>
<sect2 id="RCL.INDEXING.CONFIG.MULTIPLE"> <sect2 id="RCL.INDEXING.CONFIG.MULTIPLE">
@ -757,26 +757,25 @@
to avoid mistakenly creating additional directories when an to avoid mistakenly creating additional directories when an
argument is mistyped.</para> argument is mistyped.</para>
<para>A typical usage scenario for the multiple index feature <para>A typical usage scenario for the multiple index feature would
would be for a system administrator to set up a central index be for a system administrator to set up a central index for shared
for shared data, that you choose to search or not in addition to data, that you choose to search or not in addition to your personal
your personal data. Of course, there are other data. Of course, there are other possibilities. There are many
possibilities. There are many cases where you know the subset of cases where you know the subset of files that should be searched,
files that should be searched, and where narrowing the search and where narrowing the search can improve the results. You can
can improve the results. You can achieve approximately the same achieve approximately the same effect with the directory filter in
effect with the directory filter in advanced search, but advanced search, but multiple indexes will have better performance
multiple indexes will have much better performance and may be and may be worth the trouble.</para>
worth the trouble.</para>
<para>A <command>recollindex</command> program instance can only <para>A <command>recollindex</command> program instance can only
update one specific index, and it will only use parameters from a update one specific index, and it will only use parameters from a
single configuration (no parameters are ever shared between single configuration (no parameters are ever shared between
configurations when indexing).</para> configurations when indexing).</para>
<para>Multiple indexes can queried concurrently, either from the <para>Multiple indexes can be queried concurrently, either from
GUI or the command line. When doing this, there is always a main the GUI or the command line. When doing this, there is always a
configuration, from which both configuration and index data are main configuration, from which both configuration and index data
used. Only the index data from the additional indexes is used are used. Only the index data from the additional indexes is used
(their configuration parameters are ignored).</para> (their configuration parameters are ignored).</para>
<para>When searching, the current main index (defined by <para>When searching, the current main index (defined by
@ -1416,68 +1415,6 @@
from the terminal and become a daemon, permanently monitoring from the terminal and become a daemon, permanently monitoring
file changes and updating the index.</para> file changes and updating the index.</para>
<para>Under <application>KDE</application>,
<application>Gnome</application> and some other desktop
environments, the daemon can be automatically started when you log
in, by creating a desktop file inside the
<filename>~/.config/autostart</filename> directory. This can be
done for you by the &RCL; GUI. Use the
<guimenu>Preferences->Indexing Schedule</guimenu> menu.</para>
<para>With older <application>X11</application> setups, starting
the daemon is normally performed as part of the user session
script.</para>
<para>The <filename>rclmon.sh</filename> script can be used to
easily start and stop the daemon. It can be found in the
<filename>examples</filename> directory (typically
<filename>/usr/local/[share/]recoll/examples</filename>).</para>
<para>For example, my out of fashion
<application>xdm</application>-based session has a
<filename>.xsession</filename> script with the following lines
at the end:</para>
<programlisting>recollconf=$HOME/.recoll-home
recolldata=/usr/local/share/recoll
RECOLL_CONFDIR=$recollconf $recolldata/examples/rclmon.sh start
fvwm
</programlisting>
<para>The indexing daemon gets started, then the window manager,
for which the session waits.</para> <para>By default the
indexing daemon will monitor the state of the X11 session, and
exit when it finishes; it is not necessary to kill it
explicitly. (The <application>X11</application> server
monitoring can be disabled with option <option>-x</option> to
<command>recollindex</command>).</para>
<para>If you use the daemon completely out of an
<application>X11</application> session, you need to add option
<option>-x</option> to disable <application>X11</application>
session monitoring (else the daemon will not start).</para>
<para>By default, the messages from the indexing daemon will be
sent to the same file as those from the interactive commands
(<literal>logfilename</literal>). You may want to change this
by setting the <varname>daemlogfilename</varname> and
<varname>daemloglevel</varname> configuration parameters. Also
the log file will only be truncated when the daemon starts. If
the daemon runs permanently, the log file may grow quite big,
depending on the log level.</para>
<para>When building &RCL;, the real time indexing support can be
customised during package <link
linkend="RCL.INSTALL.BUILDING">configuration</link> with
the <option>--with[out]-fam</option> or
<option>--with[out]-inotify</option> options. The default is
currently to include <application>inotify</application>
monitoring on systems that support it, and, as of &RCL; 1.17,
<application>gamin</application> support on
<application>FreeBSD</application>.</para>
<para>While it is convenient that data is indexed in real time, <para>While it is convenient that data is indexed in real time,
repeated indexing can generate a significant load on the repeated indexing can generate a significant load on the
system when files such as email folders change. Also, system when files such as email folders change. Also,
@ -1486,44 +1423,112 @@
your system is short on resources. Periodic indexing is your system is short on resources. Periodic indexing is
adequate in most cases.</para> adequate in most cases.</para>
<para>As of &RCL; 1.25, you can set the <link <para>As of &RCL; 1.24, you can set the <link
linkend="RCL.INSTALL.CONFIG.RECOLLCONF.MONITORDIRS">monitordirs</link> linkend="RCL.INSTALL.CONFIG.RECOLLCONF.MONITORDIRS">monitordirs</link>
configuration variable to specify that only a subset of your indexed configuration variable to specify that only a subset of your indexed
files will be monitored for instant indexing. In this situation, an files will be monitored for instant indexing. In this situation, an
incremental pass on the full tree can be triggered by either incremental pass on the full tree can be triggered by either
restarting the indexer, or just running the restarting the indexer, or just running
<command>recollindex</command>, which will notify the running <command>recollindex</command>, which will notify the running
process. The <command>recoll</command> GUI also has a menu entry for process. The <command>recoll</command> GUI also has a menu entry for
this.</para> this.</para>
<sect2 id="RCL.INDEXING.MONITOR.START">
<title>Real time indexing: automatic daemon start</title>
<note><title>Increasing resources for inotify</title> <para>Under <application>KDE</application>,
<para>On Linux systems, monitoring a big tree may need <application>Gnome</application> and some other desktop
increasing the resources available to inotify, which are environments, the daemon can be automatically started when you log
normally defined in <filename>/etc/sysctl.conf</filename>. in, by creating a desktop file inside the
<programlisting> <filename>~/.config/autostart</filename> directory. This can be
### inotify done for you by the &RCL; GUI. Use the
# <guimenu>Preferences->Indexing Schedule</guimenu> menu.</para>
# cat /proc/sys/fs/inotify/max_queued_events - 16384
# cat /proc/sys/fs/inotify/max_user_instances - 128
# cat /proc/sys/fs/inotify/max_user_watches - 16384
#
# -- Change to:
#
fs.inotify.max_queued_events=32768
fs.inotify.max_user_instances=256
fs.inotify.max_user_watches=32768
</programlisting>
</para> <para>With older <application>X11</application> setups, starting
<para>Especially, you will need to trim your tree or adjust the daemon is normally performed as part of the user session
the <literal>max_user_watches</literal> value if indexing exits with script.</para>
a message about errno <literal>ENOSPC</literal> (28) from
<function>inotify_add_watch</function>.</para>
</note>
<sect2 id="RCL.INDEXING.MONITOR.FASTFILES"> <para>The <filename>rclmon.sh</filename> script can be used to
<title>Slowing down the reindexing rate for fast changing easily start and stop the daemon. It can be found in the
<filename>examples</filename> directory (typically
<filename>/usr/local/[share/]recoll/examples</filename>).</para>
<para>For example, my out of fashion
<application>xdm</application>-based session has a
<filename>.xsession</filename> script with the following lines
at the end:</para>
<programlisting>recollconf=$HOME/.recoll-home
recolldata=/usr/local/share/recoll
RECOLL_CONFDIR=$recollconf $recolldata/examples/rclmon.sh start
fvwm
</programlisting>
<para>The indexing daemon gets started, then the window manager,
for which the session waits.</para> <para>By default the
indexing daemon will monitor the state of the X11 session, and
exit when it finishes; it is not necessary to kill it
explicitly. (The <application>X11</application> server
monitoring can be disabled with option <option>-x</option> to
<command>recollindex</command>).</para>
<para>If you use the daemon completely out of an
<application>X11</application> session, you need to add option
<option>-x</option> to disable <application>X11</application>
session monitoring (else the daemon will not start).</para>
</sect2>
<sect2 id="RCL.INDEXING.MONITOR.DETAILS">
<title>Real time indexing: miscellaneous details</title>
<para>By default, the messages from the indexing daemon will be
sent to the same file as those from the interactive commands
(<literal>logfilename</literal>). You may want to change this
by setting the <varname>daemlogfilename</varname> and
<varname>daemloglevel</varname> configuration parameters. Also
the log file will only be truncated when the daemon starts. If
the daemon runs permanently, the log file may grow quite big,
depending on the log level.</para>
<para>When building &RCL;, the real time indexing support can be
customised during package <link
linkend="RCL.INSTALL.BUILDING">configuration</link> with
the <option>--with[out]-fam</option> or
<option>--with[out]-inotify</option> options. The default is
currently to include <application>inotify</application>
monitoring on systems that support it, and, as of &RCL; 1.17,
<application>gamin</application> support on
<application>FreeBSD</application>.</para>
<note><title>Increasing resources for inotify</title>
<para>On Linux systems, monitoring a big tree may need
increasing the resources available to inotify, which are
normally defined in <filename>/etc/sysctl.conf</filename>.
<programlisting>
### inotify
#
# cat /proc/sys/fs/inotify/max_queued_events - 16384
# cat /proc/sys/fs/inotify/max_user_instances - 128
# cat /proc/sys/fs/inotify/max_user_watches - 16384
#
# -- Change to:
#
fs.inotify.max_queued_events=32768
fs.inotify.max_user_instances=256
fs.inotify.max_user_watches=32768
</programlisting>
</para>
<para>Especially, you will need to trim your tree or adjust
the <literal>max_user_watches</literal> value if indexing exits with
a message about errno <literal>ENOSPC</literal> (28) from
<function>inotify_add_watch</function>.</para>
</note>
<note><title>Slowing down the reindexing rate for fast changing
files</title> files</title>
<para>When using the real time monitor, it may happen that some <para>When using the real time monitor, it may happen that some
@ -1535,8 +1540,10 @@
reindexed. See the <varname>mondelaypatterns</varname> parameter in reindexed. See the <varname>mondelaypatterns</varname> parameter in
the <link linkend="RCL.INSTALL.CONFIG.RECOLLCONF.MISC"> the <link linkend="RCL.INSTALL.CONFIG.RECOLLCONF.MISC">
configuration section</link>.</para> configuration section</link>.</para>
</note>
</sect2> </sect2>
</sect1> </sect1>
</chapter> </chapter>