doc
parent c7b2587f40
commit 7cd6f90554
@ -110,6 +110,9 @@ alink="#0000FF">
|
|||||||
<dt><span class="sect2">2.2.2. <a href=
|
<dt><span class="sect2">2.2.2. <a href=
|
||||||
"#RCL.INDEXING.STORAGE.SECURITY">Security
|
"#RCL.INDEXING.STORAGE.SECURITY">Security
|
||||||
aspects</a></span></dt>
|
aspects</a></span></dt>
|
||||||
|
<dt><span class="sect2">2.2.3. <a href=
|
||||||
|
"#RCL.INDEXING.STORAGE.BIG">Special considerations
|
||||||
|
for big indexes</a></span></dt>
|
||||||
</dl>
|
</dl>
|
||||||
</dd>
|
</dd>
|
||||||
<dt><span class="sect1">2.3. <a href=
|
<dt><span class="sect1">2.3. <a href=
|
||||||
@ -1098,8 +1101,22 @@ alink="#0000FF">
|
|||||||
"filename">$HOME/.recoll/xapiandb/</code>. This can be
|
"filename">$HOME/.recoll/xapiandb/</code>. This can be
|
||||||
changed via two different methods (with different
|
changed via two different methods (with different
|
||||||
purposes):</p>
|
purposes):</p>
|
||||||
<div class="itemizedlist">
|
<div class="orderedlist">
|
||||||
<ul class="itemizedlist" style="list-style-type: disc;">
|
<ol class="orderedlist" type="1">
|
||||||
|
<li class="listitem">
|
||||||
|
<p>For a given configuration directory, you can
|
||||||
|
specify a non-default storage location for the index
|
||||||
|
by setting the <code class="varname">dbdir</code>
|
||||||
|
parameter in the configuration file (see the
|
||||||
|
<a class="link" href="#RCL.INSTALL.CONFIG.RECOLLCONF"
|
||||||
|
title=
|
||||||
|
"6.4.2. Recoll main configuration file, recoll.conf">
|
||||||
|
configuration section</a>). This method would mainly
|
||||||
|
be of use if you wanted to keep the configuration
|
||||||
|
directory in its default location, but wanted
another location for the index, typically because of disk
space or performance concerns.</p>
|
||||||
|
</li>
|
||||||
<li class="listitem">
|
<li class="listitem">
|
||||||
<p>You can specify a different configuration
|
<p>You can specify a different configuration
|
||||||
directory by setting the <code class=
|
directory by setting the <code class=
|
||||||
@ -1128,21 +1145,7 @@ alink="#0000FF">
|
|||||||
whatever subset of the available data you wish to
|
whatever subset of the available data you wish to
|
||||||
make searchable.</p>
|
make searchable.</p>
|
||||||
</li>
|
</li>
|
||||||
<li class="listitem">
|
</ol>
|
||||||
<p>For a given configuration directory, you can
|
|
||||||
specify a non-default storage location for the index
|
|
||||||
by setting the <code class="varname">dbdir</code>
|
|
||||||
parameter in the configuration file (see the
|
|
||||||
<a class="link" href="#RCL.INSTALL.CONFIG.RECOLLCONF"
|
|
||||||
title=
|
|
||||||
"6.4.2. Recoll main configuration file, recoll.conf">
|
|
||||||
configuration section</a>). This method would mainly
|
|
||||||
be of use if you wanted to keep the configuration
|
|
||||||
directory in its default location, but desired
|
|
||||||
another location for the index, typically out of disk
|
|
||||||
occupation concerns.</p>
|
|
||||||
</li>
|
|
||||||
</ul>
|
|
||||||
</div>
|
</div>
|
||||||
<p>The size of the index is determined by the size of the
|
<p>The size of the index is determined by the size of the
|
||||||
set of documents, but the ratio can vary a lot. For a
|
set of documents, but the ratio can vary a lot. For a
|
||||||
@ -1154,9 +1157,9 @@ alink="#0000FF">
|
|||||||
non-indexed data (an extreme example being a set of mp3
|
non-indexed data (an extreme example being a set of mp3
|
||||||
files where only the tags would be indexed).</p>
|
files where only the tags would be indexed).</p>
|
||||||
<p>Of course, images, sound and video do not increase the
|
<p>Of course, images, sound and video do not increase the
|
||||||
index size, which means that typically, even a big index
|
index size, which means that in most cases, the space used
|
||||||
will be negligible against the total amount of data on the
|
by the index will be negligible compared to the total amount of
|
||||||
computer.</p>
|
data on the computer.</p>
|
||||||
<p>The index data directory (<code class=
|
<p>The index data directory (<code class=
|
||||||
"filename">xapiandb</code>) only contains data that can be
|
"filename">xapiandb</code>) only contains data that can be
|
||||||
completely rebuilt by an index run (as long as the original
|
completely rebuilt by an index run (as long as the original
|
||||||
@ -1186,8 +1189,10 @@ alink="#0000FF">
|
|||||||
because its format is not supported any more, you will
|
because its format is not supported any more, you will
|
||||||
have to explicitly delete the old index (typically
|
have to explicitly delete the old index (typically
|
||||||
<code class="filename">~/.recoll/xapiandb</code>), then
|
<code class="filename">~/.recoll/xapiandb</code>), then
|
||||||
run a normal indexing command. Using option <code class=
|
run a normal indexing command. Using <span class=
|
||||||
"option">-z</code> would not work in this situation.</p>
|
"command"><strong>recollindex</strong></span> option
|
||||||
|
<code class="option">-z</code> would not work in this
|
||||||
|
situation.</p>
|
||||||
</div>
|
</div>
|
||||||
<div class="sect2">
|
<div class="sect2">
|
||||||
<div class="titlepage">
|
<div class="titlepage">
|
||||||
@ -1217,6 +1222,59 @@ alink="#0000FF">
|
|||||||
adjust the <code class="literal">umask</code> used during
|
adjust the <code class="literal">umask</code> used during
|
||||||
index updates.</p>
|
index updates.</p>
|
||||||
</div>
|
</div>
|
||||||
|
<div class="sect2">
|
||||||
|
<div class="titlepage">
|
||||||
|
<div>
|
||||||
|
<div>
|
||||||
|
<h3 class="title"><a name=
|
||||||
|
"RCL.INDEXING.STORAGE.BIG" id=
|
||||||
|
"RCL.INDEXING.STORAGE.BIG"></a>2.2.3. Special
|
||||||
|
considerations for big indexes</h3>
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
<p>This need only concern you if your index is going to
be bigger than around 5 GBytes. Beyond 10 GBytes, it
becomes a serious issue. Most people have much smaller
indexes. For reference, 5 GBytes would be around 2000
bibles, a lot of text. If you have a huge text dataset
(remember: images don't count, and the text content of PDFs
is typically less than 5% of the file size), read on.</p>
|
||||||
|
<p>The amount of writing performed by Xapian during index
creation grows faster than linearly with the index size (it is
somewhere between linear and quadratic). For big indexes
this becomes a performance issue, and may even become an SSD
wear issue.</p>
|
||||||
|
<p>The problem can be mitigated by observing the
|
||||||
|
following rules:</p>
|
||||||
|
<div class="itemizedlist">
|
||||||
|
<ul class="itemizedlist" style=
|
||||||
|
"list-style-type: disc;">
|
||||||
|
<li class="listitem">
|
||||||
|
<p>Partition the data set and create several
|
||||||
|
indexes of reasonable size rather than a huge one.
|
||||||
|
These indexes can then be queried in parallel
|
||||||
|
(using the <span class="application">Recoll</span>
|
||||||
|
external indexes facility), or merged using
|
||||||
|
<span class=
|
||||||
|
"command"><strong>xapian-compact</strong></span>.</p>
|
||||||
|
</li>
|
||||||
|
<li class="listitem">
|
||||||
|
<p>Have a lot of RAM available and set the
|
||||||
|
<code class="literal">idxflushmb</code>
|
||||||
|
<span class="application">Recoll</span>
|
||||||
|
configuration parameter as high as you can without
swapping (experimentation will be needed). A value of 200
(megabytes) would be a minimum in this context.</p>
|
||||||
|
</li>
|
||||||
|
<li class="listitem">
|
||||||
|
<p>Use Xapian 1.4.10 or newer, as this version
significantly reduced the amount of
writes.</p>
|
||||||
|
</li>
|
||||||
|
</ul>
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
</div>
|
</div>
|
||||||
<div class="sect1">
|
<div class="sect1">
|
||||||
<div class="titlepage">
|
<div class="titlepage">
|
||||||
|
|||||||
@ -590,7 +590,19 @@
|
|||||||
configuration directory, typically
|
configuration directory, typically
|
||||||
<filename>$HOME/.recoll/xapiandb/</filename>. This can be
|
<filename>$HOME/.recoll/xapiandb/</filename>. This can be
|
||||||
changed via two different methods (with different purposes):
|
changed via two different methods (with different purposes):
|
||||||
<itemizedlist>
|
<orderedlist>
|
||||||
|
|
||||||
|
<listitem><para>For a given configuration directory, you can
|
||||||
|
specify a non-default storage location for the index by setting
|
||||||
|
the <varname>dbdir</varname> parameter in the configuration file
|
||||||
|
(see the <link
|
||||||
|
linkend="RCL.INSTALL.CONFIG.RECOLLCONF">configuration
|
||||||
|
section</link>). This method would mainly be of use if you wanted
|
||||||
|
to keep the configuration directory in its default location, but
wanted another location for the index, typically because of disk
space or performance concerns.</para>
|
||||||
|
</listitem>
|
||||||
|
|
||||||
<listitem><para>You can specify a different configuration
|
<listitem><para>You can specify a different configuration
|
||||||
directory by setting the <envar>RECOLL_CONFDIR</envar>
|
directory by setting the <envar>RECOLL_CONFDIR</envar>
|
||||||
environment variable, or using the <option>-c</option>
|
environment variable, or using the <option>-c</option>
|
||||||
@ -611,20 +623,9 @@
|
|||||||
options</link> allows you to tailor multiple configurations and
|
options</link> allows you to tailor multiple configurations and
|
||||||
indexes to handle whatever subset of the available data you wish
|
indexes to handle whatever subset of the available data you wish
|
||||||
to make searchable.</para>
|
to make searchable.</para>
|
||||||
|
|
||||||
</listitem>
|
</listitem>
|
||||||
|
|
||||||
<listitem><para>For a given configuration directory, you can
|
</orderedlist>
|
||||||
specify a non-default storage location for the index by setting
|
|
||||||
the <varname>dbdir</varname> parameter in the configuration file
|
|
||||||
(see the <link
|
|
||||||
linkend="RCL.INSTALL.CONFIG.RECOLLCONF">configuration
|
|
||||||
section</link>). This method would mainly be of use if you wanted
|
|
||||||
to keep the configuration directory in its default location, but
|
|
||||||
desired another location for the index, typically out of disk
|
|
||||||
occupation concerns.</para>
|
|
||||||
</listitem>
|
|
||||||
</itemizedlist>
|
|
||||||
</para>
|
</para>
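As an illustration of the first method, a recoll.conf fragment might look like the sketch below. Only the dbdir parameter name comes from the text above; the file location shown in the comment is the usual default, and the target path is a hypothetical example.

```ini
# $HOME/.recoll/recoll.conf (default configuration directory assumed)
# Hypothetical path: keep the configuration where it is, but store the
# index on another volume for disk space or performance reasons.
dbdir = /mnt/bigdisk/recoll/xapiandb
```

The second method leaves dbdir alone and instead selects a whole different configuration directory through RECOLL_CONFDIR or the -c option.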
|
||||||
|
|
||||||
<para>The size of the index is determined by the size of the set
|
<para>The size of the index is determined by the size of the set
|
||||||
@ -638,8 +639,9 @@
|
|||||||
indexed).</para>
|
indexed).</para>
|
||||||
|
|
||||||
<para>Of course, images, sound and video do not increase the index
|
<para>Of course, images, sound and video do not increase the index
|
||||||
size, which means that typically, even a big index will be negligible
|
size, which means that in most cases, the space used by the index
|
||||||
against the total amount of data on the computer.</para>
|
will be negligible compared to the total amount of data on the
|
||||||
|
computer.</para>
|
||||||
|
|
||||||
<para>The index data directory (<filename>xapiandb</filename>)
|
<para>The index data directory (<filename>xapiandb</filename>)
|
||||||
only contains data that can be completely rebuilt by an index run
|
only contains data that can be completely rebuilt by an index run
|
||||||
@ -660,8 +662,8 @@
|
|||||||
its format is not supported any more, you will have to explicitly
|
its format is not supported any more, you will have to explicitly
|
||||||
delete the old index (typically
|
delete the old index (typically
|
||||||
<filename>~/.recoll/xapiandb</filename>), then run a normal
|
<filename>~/.recoll/xapiandb</filename>), then run a normal
|
||||||
indexing command. Using option <option>-z</option> would not work
|
indexing command. Using <command>recollindex</command> option
|
||||||
in this situation.</para>
|
<option>-z</option> would not work in this situation.</para>
|
||||||
|
|
||||||
|
|
||||||
</sect2>
|
</sect2>
|
||||||
@ -684,10 +686,44 @@
|
|||||||
of protection you need for your index, set the directory
|
of protection you need for your index, set the directory
|
||||||
and files access modes appropriately, and also maybe adjust
|
and files access modes appropriately, and also maybe adjust
|
||||||
the <literal>umask</literal> used during index updates.</para>
|
the <literal>umask</literal> used during index updates.</para>
|
||||||
|
|
||||||
|
|
||||||
</sect2>
|
</sect2>
|
||||||
|
|
||||||
|
<sect2 id="RCL.INDEXING.STORAGE.BIG">
|
||||||
|
<title>Special considerations for big indexes</title>
|
||||||
|
|
||||||
|
<para>This need only concern you if your index is going to be
bigger than around 5 GBytes. Beyond 10 GBytes, it becomes a serious
issue. Most people have much smaller indexes. For reference, 5
GBytes would be around 2000 bibles, a lot of text. If you have a
huge text dataset (remember: images don't count, and the text content
of PDFs is typically less than 5% of the file size), read on.</para>
|
||||||
|
|
||||||
|
<para>The amount of writing performed by Xapian during index
creation grows faster than linearly with the index size (it is somewhere
between linear and quadratic). For big indexes this becomes a performance
issue, and may even become an SSD wear issue.</para>
|
||||||
|
|
||||||
|
<para>The problem can be mitigated by observing the following
|
||||||
|
rules:</para>
|
||||||
|
<itemizedlist>
|
||||||
|
<listitem><para>Partition the data set and create several indexes
|
||||||
|
of reasonable size rather than a huge one. These indexes can then
|
||||||
|
be queried in parallel (using the &RCL; external indexes
|
||||||
|
facility), or merged using
|
||||||
|
<command>xapian-compact</command>.</para></listitem>
|
||||||
|
<listitem><para>Have a lot of RAM available and set the
|
||||||
|
<literal>idxflushmb</literal> &RCL; configuration parameter as
|
||||||
|
high as you can without swapping (experimentation will be
needed). A value of 200 (megabytes) would be a minimum in this
context.</para></listitem>
|
||||||
|
<listitem><para>Use Xapian 1.4.10 or newer, as this version
significantly reduced the amount of writes.</para>
|
||||||
|
</listitem>
|
||||||
|
</itemizedlist>
|
||||||
|
|
||||||
|
</sect2>
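The idxflushmb rule above could be expressed in recoll.conf roughly as follows. This is a sketch, not a recommended setting: 200 is simply the minimum the text suggests for big indexes, and the value that works without swapping depends on the RAM available on the indexing machine.

```ini
# recoll.conf fragment for a big index (illustrative values only)
# idxflushmb is expressed in megabytes; raise it as far as memory allows
# and watch for swapping during a test indexing run.
idxflushmb = 200
```

A higher value means fewer flushes to disk during indexing, which is the point of the rule, but it has to be found by experimentation as the text says.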
|
||||||
|
|
||||||
</sect1>
|
</sect1>
|
||||||
|
|
||||||
<sect1 id="RCL.INDEXING.CONFIG">
|
<sect1 id="RCL.INDEXING.CONFIG">
|
||||||
|
|||||||