Jean-Francois Dockes 2019-02-04 16:24:03 +01:00
parent c7b2587f40
commit 7cd6f90554
2 changed files with 134 additions and 40 deletions


@ -110,6 +110,9 @@ alink="#0000FF">
<dt><span class="sect2">2.2.2. <a href= <dt><span class="sect2">2.2.2. <a href=
"#RCL.INDEXING.STORAGE.SECURITY">Security "#RCL.INDEXING.STORAGE.SECURITY">Security
aspects</a></span></dt> aspects</a></span></dt>
<dt><span class="sect2">2.2.3. <a href=
"#RCL.INDEXING.STORAGE.BIG">Special considerations
for big indexes</a></span></dt>
</dl> </dl>
</dd> </dd>
<dt><span class="sect1">2.3. <a href= <dt><span class="sect1">2.3. <a href=
@ -1098,8 +1101,22 @@ alink="#0000FF">
"filename">$HOME/.recoll/xapiandb/</code>. This can be "filename">$HOME/.recoll/xapiandb/</code>. This can be
changed via two different methods (with different changed via two different methods (with different
purposes):</p> purposes):</p>
<div class="itemizedlist"> <div class="orderedlist">
<ul class="itemizedlist" style="list-style-type: disc;"> <ol class="orderedlist" type="1">
<li class="listitem">
<p>For a given configuration directory, you can
specify a non-default storage location for the index
by setting the <code class="varname">dbdir</code>
parameter in the configuration file (see the
<a class="link" href="#RCL.INSTALL.CONFIG.RECOLLCONF"
title=
"6.4.2.&nbsp;Recoll main configuration file, recoll.conf">
configuration section</a>). This method would mainly
be of use if you wanted to keep the configuration
directory in its default location, but desired
another location for the index, typically out of disk
occupation or performance concerns.</p>
</li>
<li class="listitem"> <li class="listitem">
<p>You can specify a different configuration <p>You can specify a different configuration
directory by setting the <code class= directory by setting the <code class=
@ -1128,21 +1145,7 @@ alink="#0000FF">
whatever subset of the available data you wish to whatever subset of the available data you wish to
make searchable.</p> make searchable.</p>
</li> </li>
<li class="listitem"> </ol>
<p>For a given configuration directory, you can
specify a non-default storage location for the index
by setting the <code class="varname">dbdir</code>
parameter in the configuration file (see the
<a class="link" href="#RCL.INSTALL.CONFIG.RECOLLCONF"
title=
"6.4.2.&nbsp;Recoll main configuration file, recoll.conf">
configuration section</a>). This method would mainly
be of use if you wanted to keep the configuration
directory in its default location, but desired
another location for the index, typically out of disk
occupation concerns.</p>
</li>
</ul>
</div> </div>
<p>The size of the index is determined by the size of the <p>The size of the index is determined by the size of the
set of documents, but the ratio can vary a lot. For a set of documents, but the ratio can vary a lot. For a
@ -1154,9 +1157,9 @@ alink="#0000FF">
non-indexed data (an extreme example being a set of mp3 non-indexed data (an extreme example being a set of mp3
files where only the tags would be indexed).</p> files where only the tags would be indexed).</p>
<p>Of course, images, sound and video do not increase the <p>Of course, images, sound and video do not increase the
index size, which means that typically, even a big index index size, which means that in most cases, the space used
will be negligible against the total amount of data on the by the index will be negligible against the total amount of
computer.</p> data on the computer.</p>
<p>The index data directory (<code class= <p>The index data directory (<code class=
"filename">xapiandb</code>) only contains data that can be "filename">xapiandb</code>) only contains data that can be
completely rebuilt by an index run (as long as the original completely rebuilt by an index run (as long as the original
@ -1186,8 +1189,10 @@ alink="#0000FF">
because its format is not supported any more, you will because its format is not supported any more, you will
have to explicitly delete the old index (typically have to explicitly delete the old index (typically
<code class="filename">~/.recoll/xapiandb</code>), then <code class="filename">~/.recoll/xapiandb</code>), then
run a normal indexing command. Using option <code class= run a normal indexing command. Using <span class=
"option">-z</code> would not work in this situation.</p> "command"><strong>recollindex</strong></span> option
<code class="option">-z</code> would not work in this
situation.</p>
</div> </div>
<div class="sect2"> <div class="sect2">
<div class="titlepage"> <div class="titlepage">
@ -1217,6 +1222,59 @@ alink="#0000FF">
adjust the <code class="literal">umask</code> used during adjust the <code class="literal">umask</code> used during
index updates.</p> index updates.</p>
</div> </div>
<div class="sect2">
<div class="titlepage">
<div>
<div>
<h3 class="title"><a name=
"RCL.INDEXING.STORAGE.BIG" id=
"RCL.INDEXING.STORAGE.BIG"></a>2.2.3.&nbsp;Special
considerations for big indexes</h3>
</div>
</div>
</div>
            <p>The issue described in this section need only concern
            you if your index is going to be bigger than around 5
            GBytes. Beyond 10 GBytes, it becomes a serious issue. Most
            people have much smaller indexes. For reference, 5 GBytes
            would be around 2000 Bibles, a lot of text. If you have a
            huge text dataset (remember: images don't count, and the
            text content of PDFs is typically less than 5% of the file
            size), read on.</p>
            <p>The amount of writing performed by Xapian during index
            creation grows faster than linearly with the index size (it
            is somewhere between linear and quadratic). For big indexes
            this becomes a performance issue, and may even become an
            SSD wear issue.</p>
<p>The problem can be mitigated by observing the
following rules:</p>
<div class="itemizedlist">
<ul class="itemizedlist" style=
"list-style-type: disc;">
<li class="listitem">
<p>Partition the data set and create several
indexes of reasonable size rather than a huge one.
These indexes can then be queried in parallel
(using the <span class="application">Recoll</span>
external indexes facility), or merged using
<span class=
"command"><strong>xapian-compact</strong></span>.</p>
</li>
<li class="listitem">
<p>Have a lot of RAM available and set the
<code class="literal">idxflushmb</code>
<span class="application">Recoll</span>
configuration parameter as high as you can without
swapping (experimentation will be needed). 200
would be a minimum in this context.</p>
</li>
<li class="listitem">
                <p>Use Xapian 1.4.10 or newer, as this version
                significantly reduced the amount of
                writes.</p>
</li>
</ul>
</div>
</div>
</div> </div>
<div class="sect1"> <div class="sect1">
<div class="titlepage"> <div class="titlepage">


@ -590,7 +590,19 @@
configuration directory, typically configuration directory, typically
<filename>$HOME/.recoll/xapiandb/</filename>. This can be <filename>$HOME/.recoll/xapiandb/</filename>. This can be
changed via two different methods (with different purposes): changed via two different methods (with different purposes):
<itemizedlist> <orderedlist>
<listitem><para>For a given configuration directory, you can
specify a non-default storage location for the index by setting
the <varname>dbdir</varname> parameter in the configuration file
(see the <link
linkend="RCL.INSTALL.CONFIG.RECOLLCONF">configuration
section</link>). This method would mainly be of use if you wanted
to keep the configuration directory in its default location, but
desired another location for the index, typically out of disk
occupation or performance concerns.</para>
</listitem>
<listitem><para>You can specify a different configuration <listitem><para>You can specify a different configuration
directory by setting the <envar>RECOLL_CONFDIR</envar> directory by setting the <envar>RECOLL_CONFDIR</envar>
environment variable, or using the <option>-c</option> environment variable, or using the <option>-c</option>
@ -611,20 +623,9 @@
options</link> allows you to tailor multiple configurations and options</link> allows you to tailor multiple configurations and
indexes to handle whatever subset of the available data you wish indexes to handle whatever subset of the available data you wish
to make searchable.</para> to make searchable.</para>
</listitem> </listitem>
<listitem><para>For a given configuration directory, you can </orderedlist>
specify a non-default storage location for the index by setting
the <varname>dbdir</varname> parameter in the configuration file
(see the <link
linkend="RCL.INSTALL.CONFIG.RECOLLCONF">configuration
section</link>). This method would mainly be of use if you wanted
to keep the configuration directory in its default location, but
desired another location for the index, typically out of disk
occupation concerns.</para>
</listitem>
</itemizedlist>
</para> </para>
<para>The size of the index is determined by the size of the set <para>The size of the index is determined by the size of the set
@ -638,8 +639,9 @@
indexed).</para> indexed).</para>
<para>Of course, images, sound and video do not increase the index <para>Of course, images, sound and video do not increase the index
size, which means that typically, even a big index will be negligible size, which means that in most cases, the space used by the index
against the total amount of data on the computer.</para> will be negligible against the total amount of data on the
computer.</para>
<para>The index data directory (<filename>xapiandb</filename>) <para>The index data directory (<filename>xapiandb</filename>)
only contains data that can be completely rebuilt by an index run only contains data that can be completely rebuilt by an index run
@ -660,8 +662,8 @@
its format is not supported any more, you will have to explicitly its format is not supported any more, you will have to explicitly
delete the old index (typically delete the old index (typically
<filename>~/.recoll/xapiandb</filename>), then run a normal <filename>~/.recoll/xapiandb</filename>), then run a normal
indexing command. Using option <option>-z</option> would not work indexing command. Using <command>recollindex</command> option
in this situation.</para> <option>-z</option> would not work in this situation.</para>
</sect2> </sect2>
@ -684,10 +686,44 @@
of protection you need for your index, set the directory of protection you need for your index, set the directory
and files access modes appropriately, and also maybe adjust and files access modes appropriately, and also maybe adjust
the <literal>umask</literal> used during index updates.</para> the <literal>umask</literal> used during index updates.</para>
</sect2> </sect2>
<sect2 id="RCL.INDEXING.STORAGE.BIG">
<title>Special considerations for big indexes</title>
    <para>The issue described in this section need only concern you if
    your index is going to be bigger than around 5 GBytes. Beyond 10
    GBytes, it becomes a serious issue. Most people have much smaller
    indexes. For reference, 5 GBytes would be around 2000 Bibles, a lot
    of text. If you have a huge text dataset (remember: images don't
    count, and the text content of PDFs is typically less than 5% of
    the file size), read on.</para>
    <para>The amount of writing performed by Xapian during index
    creation grows faster than linearly with the index size (it is
    somewhere between linear and quadratic). For big indexes this
    becomes a performance issue, and may even become an SSD wear
    issue.</para>
<para>The problem can be mitigated by observing the following
rules:</para>
<itemizedlist>
<listitem><para>Partition the data set and create several indexes
of reasonable size rather than a huge one. These indexes can then
be queried in parallel (using the &RCL; external indexes
facility), or merged using
<command>xapian-compact</command>.</para></listitem>
<listitem><para>Have a lot of RAM available and set the
<literal>idxflushmb</literal> &RCL; configuration parameter as
high as you can without swapping (experimentation will be
needed). 200 would be a minimum in this
context.</para></listitem>
    <listitem><para>Use Xapian 1.4.10 or newer, as this version
    significantly reduced the amount of writes.</para>
</listitem>
</itemizedlist>
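    <para>As an illustration only, the <varname>dbdir</varname> and
    <literal>idxflushmb</literal> parameters discussed above could be
    set as follows in <filename>recoll.conf</filename>. The path and
    the value are assumptions chosen for the example, not
    recommendations: adjust them to your own disks and available
    memory.</para>

    <programlisting>
# Hypothetical location: keep the index on a bigger or faster disk
dbdir = /data/recoll/xapiandb
# Flush threshold, in megabytes of new text data. 200 is the
# minimum suggested above for big indexes; raise it if RAM allows.
idxflushmb = 200
</programlisting>

    <para>Similarly, two partial indexes could be merged with a command
    along these lines, where the last argument is the destination
    and all three paths are hypothetical:</para>

    <screen>
xapian-compact /data/idx1/xapiandb /data/idx2/xapiandb /data/merged/xapiandb
</screen>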
</sect2>
</sect1> </sect1>
<sect1 id="RCL.INDEXING.CONFIG"> <sect1 id="RCL.INDEXING.CONFIG">