doc
This commit is contained in:
parent c7b2587f40
commit 7cd6f90554
@@ -110,6 +110,9 @@ alink="#0000FF">
<dt><span class="sect2">2.2.2. <a href=
"#RCL.INDEXING.STORAGE.SECURITY">Security
aspects</a></span></dt>
<dt><span class="sect2">2.2.3. <a href=
"#RCL.INDEXING.STORAGE.BIG">Special considerations
for big indexes</a></span></dt>
</dl>
</dd>
<dt><span class="sect1">2.3. <a href=
@@ -1098,8 +1101,22 @@ alink="#0000FF">
"filename">$HOME/.recoll/xapiandb/</code>. This can be
changed via two different methods (with different
purposes):</p>
<div class="itemizedlist">
<ul class="itemizedlist" style="list-style-type: disc;">
<div class="orderedlist">
<ol class="orderedlist" type="1">
<li class="listitem">
<p>For a given configuration directory, you can
specify a non-default storage location for the index
by setting the <code class="varname">dbdir</code>
parameter in the configuration file (see the
<a class="link" href="#RCL.INSTALL.CONFIG.RECOLLCONF"
title=
"6.4.2. Recoll main configuration file, recoll.conf">
configuration section</a>). This method would mainly
be of use if you wanted to keep the configuration
directory in its default location, but desired
another location for the index, typically out of disk
occupation or performance concerns.</p>
</li>
<li class="listitem">
<p>You can specify a different configuration
directory by setting the <code class=
@@ -1128,21 +1145,7 @@ alink="#0000FF">
whatever subset of the available data you wish to
make searchable.</p>
</li>
<li class="listitem">
<p>For a given configuration directory, you can
specify a non-default storage location for the index
by setting the <code class="varname">dbdir</code>
parameter in the configuration file (see the
<a class="link" href="#RCL.INSTALL.CONFIG.RECOLLCONF"
title=
"6.4.2. Recoll main configuration file, recoll.conf">
configuration section</a>). This method would mainly
be of use if you wanted to keep the configuration
directory in its default location, but desired
another location for the index, typically out of disk
occupation concerns.</p>
</li>
</ul>
</ol>
</div>
<p>The size of the index is determined by the size of the
set of documents, but the ratio can vary a lot. For a
@@ -1154,9 +1157,9 @@ alink="#0000FF">
non-indexed data (an extreme example being a set of mp3
files where only the tags would be indexed).</p>
<p>Of course, images, sound and video do not increase the
index size, which means that typically, even a big index
will be negligible against the total amount of data on the
computer.</p>
index size, which means that in most cases, the space used
by the index will be negligible against the total amount of
data on the computer.</p>
<p>The index data directory (<code class=
"filename">xapiandb</code>) only contains data that can be
completely rebuilt by an index run (as long as the original
@@ -1186,8 +1189,10 @@ alink="#0000FF">
because its format is not supported any more, you will
have to explicitly delete the old index (typically
<code class="filename">~/.recoll/xapiandb</code>), then
run a normal indexing command. Using option <code class=
"option">-z</code> would not work in this situation.</p>
run a normal indexing command. Using <span class=
"command"><strong>recollindex</strong></span> option
<code class="option">-z</code> would not work in this
situation.</p>
</div>
<div class="sect2">
<div class="titlepage">
@@ -1217,6 +1222,59 @@ alink="#0000FF">
adjust the <code class="literal">umask</code> used during
index updates.</p>
</div>
<div class="sect2">
<div class="titlepage">
<div>
<div>
<h3 class="title"><a name=
"RCL.INDEXING.STORAGE.BIG" id=
"RCL.INDEXING.STORAGE.BIG"></a>2.2.3. Special
considerations for big indexes</h3>
</div>
</div>
</div>
<p>This only needs concern you if your index is going to
be bigger than around 5 GBytes. Beyond 10 GBytes, it
becomes a serious issue. Most people have much smaller
indexes. For reference, 5 GBytes would be around 2000
bibles, a lot of text. If you have a huge text dataset
(remember: images don't count, the text content of PDFs
is typically less than 5% of the file size), read on.</p>
<p>The amount of writing performed by Xapian during index
creation is not linear with the index size (it is
somewhere between linear and quadratic). For big indexes
this becomes a performance issue, and may even be an SSD
disk wear issue.</p>
<p>The problem can be mitigated by observing the
following rules:</p>
<div class="itemizedlist">
<ul class="itemizedlist" style=
"list-style-type: disc;">
<li class="listitem">
<p>Partition the data set and create several
indexes of reasonable size rather than a huge one.
These indexes can then be queried in parallel
(using the <span class="application">Recoll</span>
external indexes facility), or merged using
<span class=
"command"><strong>xapian-compact</strong></span>.</p>
</li>
<li class="listitem">
<p>Have a lot of RAM available and set the
<code class="literal">idxflushmb</code>
<span class="application">Recoll</span>
configuration parameter as high as you can without
swapping (experimentation will be needed). 200
would be a minimum in this context.</p>
</li>
<li class="listitem">
<p>Use Xapian 1.4.10 or newer, as this version
brought a significant improvement in the amount of
writes.</p>
</li>
</ul>
</div>
</div>
</div>
<div class="sect1">
<div class="titlepage">

@@ -590,7 +590,19 @@
configuration directory, typically
<filename>$HOME/.recoll/xapiandb/</filename>. This can be
changed via two different methods (with different purposes):
<itemizedlist>
<orderedlist>

<listitem><para>For a given configuration directory, you can
specify a non-default storage location for the index by setting
the <varname>dbdir</varname> parameter in the configuration file
(see the <link
linkend="RCL.INSTALL.CONFIG.RECOLLCONF">configuration
section</link>). This method would mainly be of use if you wanted
to keep the configuration directory in its default location, but
desired another location for the index, typically out of disk
occupation or performance concerns.</para>
</listitem>

<listitem><para>You can specify a different configuration
directory by setting the <envar>RECOLL_CONFDIR</envar>
environment variable, or using the <option>-c</option>
@@ -611,20 +623,9 @@
options</link> allows you to tailor multiple configurations and
indexes to handle whatever subset of the available data you wish
to make searchable.</para>

</listitem>

<listitem><para>For a given configuration directory, you can
specify a non-default storage location for the index by setting
the <varname>dbdir</varname> parameter in the configuration file
(see the <link
linkend="RCL.INSTALL.CONFIG.RECOLLCONF">configuration
section</link>). This method would mainly be of use if you wanted
to keep the configuration directory in its default location, but
desired another location for the index, typically out of disk
occupation concerns.</para>
</listitem>
</itemizedlist>
</orderedlist>
</para>

<para>The size of the index is determined by the size of the set
@@ -638,8 +639,9 @@
indexed).</para>

<para>Of course, images, sound and video do not increase the index
size, which means that typically, even a big index will be negligible
against the total amount of data on the computer.</para>
size, which means that in most cases, the space used by the index
will be negligible against the total amount of data on the
computer.</para>

<para>The index data directory (<filename>xapiandb</filename>)
only contains data that can be completely rebuilt by an index run
@@ -660,8 +662,8 @@
its format is not supported any more, you will have to explicitly
delete the old index (typically
<filename>~/.recoll/xapiandb</filename>), then run a normal
indexing command. Using option <option>-z</option> would not work
in this situation.</para>
indexing command. Using <command>recollindex</command> option
<option>-z</option> would not work in this situation.</para>


</sect2>
@@ -684,10 +686,44 @@
of protection you need for your index, set the directory
and files access modes appropriately, and also maybe adjust
the <literal>umask</literal> used during index updates.</para>


</sect2>

<sect2 id="RCL.INDEXING.STORAGE.BIG">
<title>Special considerations for big indexes</title>

<para>This only needs concern you if your index is going to be
bigger than around 5 GBytes. Beyond 10 GBytes, it becomes a serious
issue. Most people have much smaller indexes. For reference, 5
GBytes would be around 2000 bibles, a lot of text. If you have a
huge text dataset (remember: images don't count, the text content
of PDFs is typically less than 5% of the file size), read on.</para>

<para>The amount of writing performed by Xapian during index
creation is not linear with the index size (it is somewhere between
linear and quadratic). For big indexes this becomes a performance
issue, and may even be an SSD disk wear issue.</para>

<para>The problem can be mitigated by observing the following
rules:</para>
<itemizedlist>
<listitem><para>Partition the data set and create several indexes
of reasonable size rather than a huge one. These indexes can then
be queried in parallel (using the &RCL; external indexes
facility), or merged using
<command>xapian-compact</command>.</para></listitem>
<listitem><para>Have a lot of RAM available and set the
<literal>idxflushmb</literal> &RCL; configuration parameter as
high as you can without swapping (experimentation will be
needed). 200 would be a minimum in this
context.</para></listitem>
<listitem><para>Use Xapian 1.4.10 or newer, as this version
brought a significant improvement in the amount of writes.</para>
</listitem>
</itemizedlist>

</sect2>

</sect1>

<sect1 id="RCL.INDEXING.CONFIG">

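The advice in the new "Special considerations for big indexes" section could be illustrated with a short example. The DocBook fragment below is only a sketch of what such an example might look like and is not part of this commit. The directory paths and index names are hypothetical. `dbdir` and `idxflushmb` are the parameters named in the text, and 200 is the minimum the text suggests for big indexes. The `xapian-compact` line assumes the usual invocation form of one or more source databases followed by a destination.

```xml
<!-- Illustrative only: all paths and index names below are assumptions. -->
<para>Example <filename>recoll.conf</filename> settings for a big
index:</para>
<programlisting>
# Keep the configuration directory where it is, store the index elsewhere
dbdir = /mnt/bigdisk/recoll/xapiandb
# Flush threshold in megabytes; raise it as long as the machine does not swap
idxflushmb = 200
</programlisting>
<para>Merging two partial indexes into a single one:</para>
<screen>
xapian-compact /data/idx-part1/xapiandb /data/idx-part2/xapiandb /data/idx-merged/xapiandb
</screen>
```

Placed inside the `RCL.INDEXING.STORAGE.BIG` section after the list of rules, a fragment like this would keep the three rules readable as a list while giving readers something concrete to copy and adapt.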