Jean-Francois Dockes 2019-02-04 16:24:03 +01:00
parent c7b2587f40
commit 7cd6f90554
2 changed files with 134 additions and 40 deletions


@ -110,6 +110,9 @@ alink="#0000FF">
<dt><span class="sect2">2.2.2. <a href=
"#RCL.INDEXING.STORAGE.SECURITY">Security
aspects</a></span></dt>
<dt><span class="sect2">2.2.3. <a href=
"#RCL.INDEXING.STORAGE.BIG">Special considerations
for big indexes</a></span></dt>
</dl>
</dd>
<dt><span class="sect1">2.3. <a href=
@ -1098,8 +1101,22 @@ alink="#0000FF">
"filename">$HOME/.recoll/xapiandb/</code>. This can be
changed via two different methods (with different
purposes):</p>
<div class="itemizedlist">
<ul class="itemizedlist" style="list-style-type: disc;">
<div class="orderedlist">
<ol class="orderedlist" type="1">
<li class="listitem">
<p>For a given configuration directory, you can
specify a non-default storage location for the index
by setting the <code class="varname">dbdir</code>
parameter in the configuration file (see the
<a class="link" href="#RCL.INSTALL.CONFIG.RECOLLCONF"
title=
"6.4.2.&nbsp;Recoll main configuration file, recoll.conf">
configuration section</a>). This method would mainly
be of use if you wanted to keep the configuration
directory in its default location, but desired
another location for the index, typically out of disk
occupation or performance concerns.</p>
</li>
<li class="listitem">
<p>You can specify a different configuration
directory by setting the <code class=
@ -1128,21 +1145,7 @@ alink="#0000FF">
whatever subset of the available data you wish to
make searchable.</p>
</li>
<li class="listitem">
<p>For a given configuration directory, you can
specify a non-default storage location for the index
by setting the <code class="varname">dbdir</code>
parameter in the configuration file (see the
<a class="link" href="#RCL.INSTALL.CONFIG.RECOLLCONF"
title=
"6.4.2.&nbsp;Recoll main configuration file, recoll.conf">
configuration section</a>). This method would mainly
be of use if you wanted to keep the configuration
directory in its default location, but desired
another location for the index, typically out of disk
occupation concerns.</p>
</li>
</ul>
</ol>
</div>
<p>The size of the index is determined by the size of the
set of documents, but the ratio can vary a lot. For a
@ -1154,9 +1157,9 @@ alink="#0000FF">
non-indexed data (an extreme example being a set of mp3
files where only the tags would be indexed).</p>
<p>Of course, images, sound and video do not increase the
index size, which means that typically, even a big index
will be negligible against the total amount of data on the
computer.</p>
index size, which means that in most cases, the space used
by the index will be negligible against the total amount of
data on the computer.</p>
<p>The index data directory (<code class=
"filename">xapiandb</code>) only contains data that can be
completely rebuilt by an index run (as long as the original
@ -1186,8 +1189,10 @@ alink="#0000FF">
because its format is not supported any more, you will
have to explicitly delete the old index (typically
<code class="filename">~/.recoll/xapiandb</code>), then
run a normal indexing command. Using option <code class=
"option">-z</code> would not work in this situation.</p>
run a normal indexing command. Using <span class=
"command"><strong>recollindex</strong></span> option
<code class="option">-z</code> would not work in this
situation.</p>
</div>
<div class="sect2">
<div class="titlepage">
@ -1217,6 +1222,59 @@ alink="#0000FF">
adjust the <code class="literal">umask</code> used during
index updates.</p>
</div>
<div class="sect2">
<div class="titlepage">
<div>
<div>
<h3 class="title"><a name=
"RCL.INDEXING.STORAGE.BIG" id=
"RCL.INDEXING.STORAGE.BIG"></a>2.2.3.&nbsp;Special
considerations for big indexes</h3>
</div>
</div>
</div>
<p>This need only concern you if your index is going to
be bigger than around 5 GBytes. Beyond 10 GBytes, it
becomes a serious issue. Most people have much smaller
indexes. For reference, 5 GBytes would be around 2000
Bibles, a lot of text. If you have a huge text dataset
(remember: images don't count, and the text content of PDFs
is typically less than 5% of the file size), read on.</p>
<p>The amount of writing performed by Xapian during index
creation does not grow linearly with the index size (it is
somewhere between linear and quadratic). For big indexes
this becomes a performance issue, and may even become an
SSD wear issue.</p>
<p>The problem can be mitigated by observing the
following rules:</p>
<div class="itemizedlist">
<ul class="itemizedlist" style=
"list-style-type: disc;">
<li class="listitem">
<p>Partition the data set and create several
indexes of reasonable size rather than a huge one.
These indexes can then be queried in parallel
(using the <span class="application">Recoll</span>
external indexes facility), or merged using
<span class=
"command"><strong>xapian-compact</strong></span>.</p>
</li>
<li class="listitem">
<p>Have a lot of RAM available and set the
<code class="literal">idxflushmb</code>
<span class="application">Recoll</span>
configuration parameter as high as you can without
swapping (experimentation will be needed). 200
would be a minimum in this context.</p>
</li>
<li class="listitem">
<p>Use Xapian 1.4.10 or newer, as this version
brought a significant reduction in the amount of
writes.</p>
</li>
</ul>
</div>
</div>
</div>
<div class="sect1">
<div class="titlepage">
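As an illustration of the two configuration parameters this change discusses (`dbdir` for relocating the index and `idxflushmb` for big indexes), a `recoll.conf` fragment might look like the sketch below. The storage path is hypothetical, and 200 is only the minimum the new text suggests for big indexes, not a tuned value.

```
# recoll.conf fragment (illustrative sketch, not taken from the commit)

# Store the index outside the configuration directory.
# The path is a hypothetical example.
dbdir = /mnt/bigdisk/recoll/xapiandb

# Flush to disk after roughly this many megabytes of indexed text.
# The manual text suggests 200 as a minimum for big indexes; raise it
# as far as available RAM allows without swapping.
idxflushmb = 200
```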


@ -590,7 +590,19 @@
configuration directory, typically
<filename>$HOME/.recoll/xapiandb/</filename>. This can be
changed via two different methods (with different purposes):
<itemizedlist>
<orderedlist>
<listitem><para>For a given configuration directory, you can
specify a non-default storage location for the index by setting
the <varname>dbdir</varname> parameter in the configuration file
(see the <link
linkend="RCL.INSTALL.CONFIG.RECOLLCONF">configuration
section</link>). This method would mainly be of use if you wanted
to keep the configuration directory in its default location, but
desired another location for the index, typically out of disk
occupation or performance concerns.</para>
</listitem>
<listitem><para>You can specify a different configuration
directory by setting the <envar>RECOLL_CONFDIR</envar>
environment variable, or using the <option>-c</option>
@ -611,20 +623,9 @@
options</link> allows you to tailor multiple configurations and
indexes to handle whatever subset of the available data you wish
to make searchable.</para>
</listitem>
<listitem><para>For a given configuration directory, you can
specify a non-default storage location for the index by setting
the <varname>dbdir</varname> parameter in the configuration file
(see the <link
linkend="RCL.INSTALL.CONFIG.RECOLLCONF">configuration
section</link>). This method would mainly be of use if you wanted
to keep the configuration directory in its default location, but
desired another location for the index, typically out of disk
occupation concerns.</para>
</listitem>
</itemizedlist>
</orderedlist>
</para>
<para>The size of the index is determined by the size of the set
@ -638,8 +639,9 @@
indexed).</para>
<para>Of course, images, sound and video do not increase the index
size, which means that typically, even a big index will be negligible
against the total amount of data on the computer.</para>
size, which means that in most cases, the space used by the index
will be negligible against the total amount of data on the
computer.</para>
<para>The index data directory (<filename>xapiandb</filename>)
only contains data that can be completely rebuilt by an index run
@ -660,8 +662,8 @@
its format is not supported any more, you will have to explicitly
delete the old index (typically
<filename>~/.recoll/xapiandb</filename>), then run a normal
indexing command. Using option <option>-z</option> would not work
in this situation.</para>
indexing command. Using <command>recollindex</command> option
<option>-z</option> would not work in this situation.</para>
</sect2>
@ -684,10 +686,44 @@
of protection you need for your index, set the directory
and files access modes appropriately, and also maybe adjust
the <literal>umask</literal> used during index updates.</para>
</sect2>
<sect2 id="RCL.INDEXING.STORAGE.BIG">
<title>Special considerations for big indexes</title>
<para>This need only concern you if your index is going to be
bigger than around 5 GBytes. Beyond 10 GBytes, it becomes a serious
issue. Most people have much smaller indexes. For reference, 5
GBytes would be around 2000 Bibles, a lot of text. If you have a
huge text dataset (remember: images don't count, and the text content
of PDFs is typically less than 5% of the file size), read on.</para>
<para>The amount of writing performed by Xapian during index
creation does not grow linearly with the index size (it is somewhere
between linear and quadratic). For big indexes this becomes a
performance issue, and may even become an SSD wear issue.</para>
<para>The problem can be mitigated by observing the following
rules:</para>
<itemizedlist>
<listitem><para>Partition the data set and create several indexes
of reasonable size rather than a huge one. These indexes can then
be queried in parallel (using the &RCL; external indexes
facility), or merged using
<command>xapian-compact</command>.</para></listitem>
<listitem><para>Have a lot of RAM available and set the
<literal>idxflushmb</literal> &RCL; configuration parameter as
high as you can without swapping (experimentation will be
needed). 200 would be a minimum in this
context.</para></listitem>
<listitem><para>Use Xapian 1.4.10 or newer, as this version
brought a significant reduction in the amount of writes.</para>
</listitem>
</itemizedlist>
</sect2>
</sect1>
<sect1 id="RCL.INDEXING.CONFIG">
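To check whether the new "big indexes" section applies to a given installation, the on-disk size of the index directory can be compared against the 5 and 10 GByte figures quoted in the text. The following is a minimal sketch under stated assumptions: the function names are hypothetical, "GBytes" is read as decimal gigabytes, and the default `~/.recoll/xapiandb` location is assumed (a `dbdir` setting would change it).

```python
import os

# Assumption: the manual's "GBytes" is read as decimal gigabytes.
GB = 10**9


def dir_size_bytes(path):
    """Sum the sizes of regular files under path, skipping symlinks."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def classify_index_size(size_bytes):
    """Map a size to the rough bands described in the manual text."""
    if size_bytes > 10 * GB:
        return "serious"   # beyond 10 GBytes: a serious issue
    if size_bytes > 5 * GB:
        return "consider"  # above about 5 GBytes: read the section
    return "ok"            # most indexes are much smaller


if __name__ == "__main__":
    # Default index location; a dbdir setting would override this.
    index_dir = os.path.expanduser("~/.recoll/xapiandb")
    size = dir_size_bytes(index_dir)
    print(f"{index_dir}: {size / GB:.2f} GB -> {classify_index_size(size)}")
```

The thresholds are approximate in the source ("around 5 GBytes"), so the bands are a rough guide rather than hard limits.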