doc and web perf notes

This commit is contained in:
Jean-Francois Dockes 2016-08-11 12:15:01 +02:00
parent 1fc5e9ccec
commit 8289584aa9
7 changed files with 359 additions and 114 deletions

View File

@ -18,10 +18,10 @@ names. The list in the default configuration does not exclude hidden
directories (names beginning with a dot), which means that it may index
quite a few things that you do not want. On the other hand, email user
agents like Thunderbird usually store messages in hidden directories, and
you probably want this indexed. One possible solution is to have '.*' in
'skippedNames', and add things like '~/.thunderbird' '~/.evolution' to
'topdirs'. Not even the file names are indexed for patterns in this
list, see the 'noContentSuffixes' variable for an alternative approach
you probably want this indexed. One possible solution is to have ".*" in
"skippedNames", and add things like "~/.thunderbird" "~/.evolution" to
"topdirs". Not even the file names are indexed for patterns in this
list, see the "noContentSuffixes" variable for an alternative approach
which indexes the file names. Can be redefined for any
subtree.</para></listitem></varlistentry>
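As a hedged illustration of the approach described above (the directory names and patterns are examples, not shipped defaults), a recoll.conf might contain:

```
# Skip all hidden files and directories by default
skippedNames = .*
# ...but explicitly add the hidden mail stores we still want indexed
topdirs = ~/Documents ~/.thunderbird ~/.evolution
```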
<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.NOCONTENTSUFFIXES">
@ -366,10 +366,11 @@ which lets Xapian perform its own thing, meaning flushing every
$XAPIAN_FLUSH_THRESHOLD documents created, modified or deleted: as memory
usage depends on average document size, not only document count, the
Xapian approach is not very useful, and you should let Recoll manage
the flushes. The default value of idxflushmb is 10 MB, and may be a bit
low. If you are looking for maximum speed, you may want to experiment
with values between 20 and
80. In my experience, values beyond 100 are always counterproductive. If
the flushes. The compiled-in default value is 0. The default
value configured from this file is 10 MB, which will be too low in many
cases (it is chosen to conserve memory). If you are looking
for maximum speed, you may want to experiment with values between 20 and
200. In my experience, values beyond this are always counterproductive. If
you find otherwise, please drop me a note.</para></listitem></varlistentry>
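The size-based flush policy described above can be sketched with a small model (a hypothetical illustration, not Recoll source code): document text sizes accumulate until the total reaches the idxflushmb threshold, at which point the batch is flushed to the Xapian index.

```python
def batch_flushes(doc_sizes_bytes, idxflushmb=10):
    """Model which flush batches a size-based policy would produce.

    doc_sizes_bytes: iterable of per-document text sizes in bytes.
    idxflushmb: flush threshold in megabytes (10 is the configured default).
    """
    threshold = idxflushmb * 1024 * 1024
    batches, current, accumulated = [], [], 0
    for size in doc_sizes_bytes:
        current.append(size)
        accumulated += size
        if accumulated >= threshold:   # threshold reached: flush this batch
            batches.append(current)
            current, accumulated = [], 0
    if current:                        # final partial flush at end of run
        batches.append(current)
    return batches

# e.g. five 4 MB documents with a 10 MB threshold: one flush after the
# third document, then one final flush for the remaining two.
```

A larger threshold means fewer, bigger flushes (less write overhead, more memory), which is the trade-off discussed in the text.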
<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.FILTERMAXSECONDS">
<term><varname>filtermaxseconds</varname></term>

View File

@ -20,8 +20,8 @@ alink="#0000FF">
<div class="titlepage">
<div>
<div>
<h1 class="title"><a name="idp41214976" id=
"idp41214976"></a>Recoll user manual</h1>
<h1 class="title"><a name="idp9509520" id=
"idp9509520"></a>Recoll user manual</h1>
</div>
<div>
@ -109,13 +109,13 @@ alink="#0000FF">
multiple indexes</a></span></dt>
<dt><span class="sect2">2.1.3. <a href=
"#idp46788704">Document types</a></span></dt>
"#idp41562832">Document types</a></span></dt>
<dt><span class="sect2">2.1.4. <a href=
"#idp46808384">Indexing failures</a></span></dt>
"#idp41582512">Indexing failures</a></span></dt>
<dt><span class="sect2">2.1.5. <a href=
"#idp46815840">Recovery</a></span></dt>
"#idp41589968">Recovery</a></span></dt>
</dl>
</dd>
@ -997,8 +997,8 @@ alink="#0000FF">
<div class="titlepage">
<div>
<div>
<h3 class="title"><a name="idp46788704" id=
"idp46788704"></a>2.1.3.&nbsp;Document types</h3>
<h3 class="title"><a name="idp41562832" id=
"idp41562832"></a>2.1.3.&nbsp;Document types</h3>
</div>
</div>
</div>
@ -1091,8 +1091,8 @@ indexedmimetypes = application/pdf
<div class="titlepage">
<div>
<div>
<h3 class="title"><a name="idp46808384" id=
"idp46808384"></a>2.1.4.&nbsp;Indexing
<h3 class="title"><a name="idp41582512" id=
"idp41582512"></a>2.1.4.&nbsp;Indexing
failures</h3>
</div>
</div>
@ -1132,8 +1132,8 @@ indexedmimetypes = application/pdf
<div class="titlepage">
<div>
<div>
<h3 class="title"><a name="idp46815840" id=
"idp46815840"></a>2.1.5.&nbsp;Recovery</h3>
<h3 class="title"><a name="idp41589968" id=
"idp41589968"></a>2.1.5.&nbsp;Recovery</h3>
</div>
</div>
</div>
@ -6571,9 +6571,8 @@ for doc in results:
<div class="variablelist">
<dl class="variablelist">
<dt><a name="RCL.PROGRAM.PYTHONAPI.ELEMENTS.UDI" id=
"RCL.PROGRAM.PYTHONAPI.ELEMENTS.UDI"></a><span class=
"term">ipath</span></dt>
<dt><a name="RCL.PROGRAM.PYTHONAPI.ELEMENTS.IPATH"
id="RCL.PROGRAM.PYTHONAPI.ELEMENTS.IPATH"></a><span class="term">ipath</span></dt>
<dd>
<p>This data value (set as a field in the Doc
@ -8652,10 +8651,10 @@ thesame = "some string with spaces"
email user agents like Thunderbird usually store
messages in hidden directories, and you probably
want this indexed. One possible solution is to have
'.*' in 'skippedNames', and add things like
'~/.thunderbird' '~/.evolution' to 'topdirs'. Not
".*" in "skippedNames", and add things like
"~/.thunderbird" "~/.evolution" to "topdirs". Not
even the file names are indexed for patterns in
this list, see the 'noContentSuffixes' variable for
this list, see the "noContentSuffixes" variable for
an alternative approach which indexes the file
names. Can be redefined for any subtree.</p>
</dd>
@ -9306,11 +9305,13 @@ thesame = "some string with spaces"
modified or deleted: as memory usage depends on
average document size, not only document count, the
Xapian approach is not very useful, and you
should let Recoll manage the flushes. The default
value of idxflushmb is 10 MB, and may be a bit low.
If you are looking for maximum speed, you may want
to experiment with values between 20 and 80. In my
experience, values beyond 100 are always
should let Recoll manage the flushes. The
compiled-in default value is 0. The default value
configured from this file is 10 MB, which will be too low in
many cases (it is chosen to conserve memory). If
you are looking for maximum speed, you may want to
experiment with values between 20 and 200. In my
experience, values beyond this are always
counterproductive. If you find otherwise, please
drop me a note.</p>
</dd>

View File

@ -4489,7 +4489,7 @@ for doc in results:
<variablelist>
<varlistentry id="RCL.PROGRAM.PYTHONAPI.ELEMENTS.UDI">>
<varlistentry id="RCL.PROGRAM.PYTHONAPI.ELEMENTS.IPATH">>
<term>ipath</term>
<listitem><para>This data value (set as a field in the Doc

View File

@ -956,7 +956,7 @@ achieved with this method.</p></div>
</div>
</div>
<div class="sect1">
<h2 id="_the_next_step_multi_stage_parallelism">The next step: multi-stage parallelism</h2>
<h2 id="recoll.idxthreads.multistage">The next step: multi-stage parallelism</h2>
<div class="sectionbody">
<div class="imageblock" style="float:right;">
<div class="content">
@ -1283,7 +1283,8 @@ the executing of ephemeral external commands.</p></div>
<div id="footnotes"><hr /></div>
<div id="footer">
<div id="footer-text">
Last updated 2016-05-08 08:30:29 CEST
Last updated
2016-08-07 15:42:01 CEST
</div>
</div>
</body>

View File

@ -206,6 +206,7 @@ when working on HTML or plain text.
In practice, very modest indexing time improvements from 5% to 15% were
achieved with this method.
[[recoll.idxthreads.multistage]]
== The next step: multi-stage parallelism
image::multipara.png["Multi-stage parallelism", float="right"]

View File

@ -73,6 +73,12 @@ improving the Windows version, the link:recoll-mingw.html[build instructions].
== Known problems:
- Indexing is very slow, especially when using external commands (e.g. for
PDF files). I don't know if this is a case of my doing something stupid,
or if the general architecture is really badly suited to Windows. If
someone with good Windows programming knowledge reads this, I'd be very
interested in a discussion.
- Filtering by directory location ('dir:' clauses) is currently
case-sensitive, including drive letters. This will hopefully be fixed in
a future version.

View File

@ -2,8 +2,7 @@
<html>
<head>
<title>RECOLL: a personal text search system for
Unix/Linux</title>
<title>RECOLL indexing performance and index sizes</title>
<meta name="generator" content="HTML Tidy, see www.w3.org">
<meta name="Author" content="Jean-Francois Dockes">
<meta name="Description" content=
@ -33,20 +32,323 @@
<h1>Recoll: Indexing performance and index sizes</h1>
<p>The time needed to index a given set of documents, and the
resulting index size depend of many factors, such as file size
and proportion of actual text content for the index size, cpu
speed, available memory, average file size and format for the
speed of indexing.</p>
resulting index size depend on many factors.</p>
<p>We try here to give a number of reference points which can
be used to roughly estimate the resources needed to create and
store an index. Obviously, your data set will never exactly match
one of the samples, so the results cannot be predicted exactly.</p>
<p>The index size depends almost entirely on the size of the
uncompressed input text, and you can expect it to be roughly
of the same order of magnitude. Depending on the type of file,
the proportion of text to file size varies very widely, going
from close to 1 for pure text files to a very small factor
for, e.g., metadata tags in mp3 files.</p>
<p>The following very old data was obtained on a machine with a
1800 Mhz
AMD Duron CPU, 768Mb of Ram, and a 7200 RPM 160 GBytes IDE
disk, running Suse 10.1. More recent data follows.</p>
<p>Estimating indexing time is a much more complicated issue,
depending on the type and size of input and on system
performance. There is no general way to determine what part of
the hardware should be optimized. Depending on the type of
input, performance may be bound by I/O read or write
performance, CPU single-processing speed, or combined
multi-processing speed.</p>
<p>It should be noted that Recoll performance will not be an
issue for most people. The indexer can process 1000 typical
PDF files per minute, or 500 Wikipedia HTML pages per second
on medium-range hardware, meaning that the initial indexing of
a typical dataset will need a few dozen minutes at
most. Further incremental index updates will be much faster
because most files will not need to be processed again.</p>
<p>However, there are Recoll installations with
terabyte-sized datasets, on which indexing can take days. For
such operations (or even much smaller ones), it is very
important to know what kind of performance can be expected,
and what aspects of the hardware should be optimized.</p>
<p>In order to provide some reference points, I have run a
number of benchmarks on medium-sized datasets, using typical
mid-range desktop hardware, and varying the indexing
configuration parameters to show how they affect the results.</p>
<p>The following may help you check that you are getting typical
performance for your indexing, and give some indications about
what to adjust to improve it.</p>
<p>From time to time, I receive a report about a system becoming
unusable during indexing. As far as I know, with the default
Recoll configuration, and barring an exceptional issue (bug),
this is always due to a system problem (typically bad hardware
such as a disk doing retries). The tests below were mostly run
while I was using the desktop, which never became
unusable. However, some tests rendered it less responsive and
this is noted with the results.</p>
<p>The following text refers to the indexing parameters without
further explanation. Here are links to more detailed explanations of the
<a href="http://www.lesbonscomptes.com/recoll/idxthreads/threadingRecoll.html#recoll.idxthreads.multistage">processing
model</a> and
<a href="https://www.lesbonscomptes.com/recoll/usermanual/webhelp/docs/RCL.INSTALL.CONFIG.RECOLLCONF.PERFS.html">configuration
parameters</a>.</p>
<p>All tests were run without generating the stemming database or
aspell dictionary. These phases are relatively short and there
is nothing which can be optimized about them.</p>
<h2>Hardware</h2>
<p>The tests were run on what could be considered a mid-range
desktop PC:
<ul>
<li>Intel Core I7-4770T CPU: 2.5 GHz, 4 physical cores, and
hyper-threading for a total of 8 hardware threads</li>
<li>8 GBytes of RAM</li>
<li>Asus H87I-Plus motherboard, Samsung 850 EVO SSD storage</li>
</ul>
</p>
<p>This is usually a fanless PC, but I did run a fan on the
external case fins during some of the tests (esp. PDF
indexing), because the CPU was running a bit too hot.</p>
<h2>Indexing PDF files</h2>
<p>The tests were run on 18000 random PDFs harvested on
Google, with a total size of around 30 GB, using Recoll 1.22.3
and Xapian 1.2.22. The resulting index size was 1.2 GB.</p>
<h3>PDF: storage</h3>
<p>Typical PDF files have a low text to file size ratio, and a
lot of data needs to be read for indexing. With the test
configuration, the indexer needs to read around 45 MB/s
from multiple files. This means that input storage makes a
difference and that you need an SSD or a fast array for
optimal performance.</p>
<table border=1>
<thead>
<tr>
<th>Storage</th>
<th>idxflushmb</th>
<th>thrTCounts</th>
<th>Real Time</th>
</tr>
<tbody>
<tr>
<td>NFS drive (gigabit)</td>
<td>200</td>
<td>6/4/1</td>
<td>24m40</td>
</tr>
<tr>
<td>local SSD</td>
<td>200</td>
<td>6/4/1</td>
<td>11m40</td>
</tr>
</tbody>
</table>
<h3>PDF: threading</h3>
<p>Because PDF files are bulky and complicated to process, the
dominant step for indexing them is input processing. PDF text
extraction is performed by multiple instances of
the <i>pdftotext</i> program, and parallelisation works very
well.</p>
<p>The following table shows the indexing times with a variety
of threading parameters.</p>
<table border=1>
<thead>
<tr>
<th>idxflushmb</th>
<th>thrQSizes</th>
<th>thrTCounts</th>
<th>Time R/U/S</th>
</tr>
<tbody>
<tr>
<td>200</td>
<td>2/2/2</td>
<td>2/1/1</td>
<td>19m21</td>
</tr>
<tr>
<td>200</td>
<td>2/2/2</td>
<td>10/10/1</td>
<td>10m38</td>
</tr>
<tr>
<td>200</td>
<td>2/2/2</td>
<td>100/10/1</td>
<td>11m</td>
</tr>
</tbody>
</table>
<p>10/10/1 was the best value for thrTCounts for this test. The
total CPU time was around 78 minutes.</p>
<p>The last line shows that even a ridiculously high thread
count for the input step has little effect. Using
slightly lower values than the optimum has little impact
either. The only thing which really degrades performance is
configuring fewer threads than the hardware provides.</p>
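For reference, the best-performing combination from the table above would be expressed in recoll.conf roughly as follows (syntax as I understand it from the parameter names; check the configuration documentation before relying on it):

```
# Queue depths for the three pipeline stages
thrQSizes = 2 2 2
# Worker counts: 10 input converters, 10 internal processors, 1 Xapian writer
thrTCounts = 10 10 1
# Flush threshold used during the threading tests
idxflushmb = 200
```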
<p>With the optimal parameters above, the peak recollindex
resident memory size is around 930 MB, to which we should add
ten instances of pdftotext (10MB typical), and of the
rclpdf.py Python input handler (around 15 MB each). This means
that the total resident memory used by indexing is around 1200
MB, quite a modest value in 2016.</p>
<h3>PDF: Xapian flushes</h3>
<p>idxflushmb has practically no influence on the indexing time
(tested from 40 to 1000), which is not too surprising because
the Xapian index size is very small relative to the input
size, so the cost of Xapian flushes to disk is not very
significant. The value of 200 used for the threading tests
could be lowered in practice, which would decrease memory
usage without changing the indexing time significantly.</p>
<h3>PDF: conclusion</h3>
<p>For indexing PDF files, you need many cores and a fast
input storage system. Neither single-thread performance nor
amount of memory will be critical aspects.</p>
<p>Running the PDF indexing tests had no influence on the system
"feel", I could work on it just as if it were quiescent.</p>
<h2>Indexing HTML files</h2>
<p>The tests were run on an (old) French Wikipedia dump: 2.9
million HTML files stored in 42000 directories, for an
approximate total size of 41 GB (average file size
14 KB).</p>
<p>The files are stored on a local SSD. Just reading them with
find+cpio takes close to 8 minutes.</p>
<p>The resulting index has a size of around 30 GB.</p>
<p>I was too lazy to extract the 3-million-entry tar file onto a
spinning disk, so all tests were performed with the data
stored on a local SSD.</p>
<p>For this test, the indexing time is dominated by the Xapian
index updates. As these are single threaded, only the flush
interval has a real influence.</p>
<table border=1>
<thead>
<tr>
<th>idxflushmb</th>
<th>thrQSizes</th>
<th>thrTCounts</th>
<th>Time R/U/S</th>
</tr>
<tbody>
<tr>
<td>200</td>
<td>2/2/2</td>
<td>2/1/1</td>
<td>88m</td>
</tr>
<tr>
<td>200</td>
<td>2/2/2</td>
<td>6/4/1</td>
<td>91m</td>
</tr>
<tr>
<td>200</td>
<td>2/2/2</td>
<td>1/1/1</td>
<td>96m</td>
</tr>
<tr>
<td>100</td>
<td>2/2/2</td>
<td>1/2/1</td>
<td>120m</td>
</tr>
<tr>
<td>100</td>
<td>2/2/2</td>
<td>6/4/1</td>
<td>121m</td>
</tr>
<tr>
<td>40</td>
<td>2/2/2</td>
<td>1/2/1</td>
<td>173m</td>
</tr>
</tbody>
</table>
<p>The indexing process becomes quite big (resident size around
4 GB), and the combination of high I/O load and high memory
usage makes the system less responsive at times (but not
unusable). As this happens principally when switching
applications, my guess would be that some program pages
(e.g. from the window manager and X) get flushed out, and take
time being read in, during which time the display appears
frozen.</p>
<p>For this kind of data, single-threaded CPU performance and
storage write speed can make a difference. Multithreading does
not help.</p>
<h2>Adjusting hardware to improve indexing performance</h2>
<p>I think that the following multi-step approach has a good
chance of improving performance:
<ul>
<li>Check that multithreading is enabled (it is, by default
with recent Recoll versions).</li>
<li>Increase the flush threshold until the machine begins to
have memory issues. Maybe add memory.</li>
<li>Store the index on an SSD. If possible, also store the
data on an SSD. Actually, when using many threads, it is
probably at least as important to have the data on an
SSD.</li>
<li>If you have many files which will need temporary copies
(email attachments, archive members, compressed files): use
a memory temporary directory. Add memory.</li>
<li>More CPUs...</li>
</ul>
</p>
<p>At some point, the index updating and writing may become the
bottleneck (this depends on the data mix; it happens very
quickly with HTML or text files). As far as I can tell, the only
possible approach is then to partition the index. You can then
either query the multiple Xapian indexes by using the Recoll
external index capability, or actually merge the indexes with
xapian-compact.</p>
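As a sketch of the merge option (hypothetical partition paths; xapian-compact takes the source databases followed by the destination):

```
# Merge two partition databases into a single one
xapian-compact --multipass part1/xapiandb part2/xapiandb merged/xapiandb
```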
<h2>Old benchmarks</h2>
<p>To provide a point of comparison for the evolution of
hardware and software...</p>
<p>The following very old data was obtained (around 2007?) on a
machine with a 1800 MHz AMD Duron CPU, 768 MB of RAM, and a
7200 RPM 160 GB IDE disk, running Suse 10.1.</p>
<p><b>recollindex</b> (version 1.8.2 with xapian 1.0.0) is
executed with the default flush threshold value.
@ -108,73 +410,6 @@
the exact reason is not known to me, possibly because of
additional fragmentation </p>
<p>There is more recent performance data (2012) at the end of
the <a href="idxthreads/threadingRecoll.html">article about
converting Recoll indexing to multithreading</a></p>
<p>Update, March 2016: I took another sample of PDF performance
data on a more modern machine, with Recoll multithreading turned
on. The machine has an Intel Core I7-4770T Cpu, which has 4
physical cores, and supports hyper-threading for a total of 8
threads, 8 GBytes of RAM, and SSD storage (incidentally the PC is
fanless, this is not a "beast" computer).</p>
<table border=1>
<thead>
<tr>
<th>Data</th>
<th>Data size</th>
<th>Indexing time</th>
<th>Index size</th>
<th>Peak process memory usage</th>
</tr>
<tbody>
<tr>
<td>Random pdfs harvested on Google<br>
Recoll 1.21.5, <em>idxflushmb</em> set to 200, thread
parameters 6/4/1</td>
<td>11 GB, 5320 files</td>
<td>3 mn 15 S</td>
<td>400 MB</td>
<td>545 MB</td>
</tr>
</tbody>
</table>
<p>The indexing process used 21 mn of CPU during these 3mn15 of
real time, we are not letting these cores stay idle
much... The improvement compared to the numbers above is quite
spectacular (a factor of 11, approximately), mostly due to the
multiprocessing, but also to the faster CPU and the SSD
storage. Note that the peak memory value is for the
recollindex process, and does not take into account the
multiple Python and pdftotext instances (which are relatively
small but things add up...).</p>
<h5>Improving indexing performance with hardware:</h5>
<p>I think
that the following multi-step approach has a good chance to
improve performance:
<ul>
<li>Check that multithreading is enabled (it is, by default
with recent Recoll versions).</li>
<li>Increase the flush threshold until the machine begins to
have memory issues. Maybe add memory.</li>
<li>Store the index on an SSD. If possible, also store the
data on an SSD. Actually, when using many threads, it is
probably almost more important to have the data on an
SSD.</li>
<li>If you have many files which will need temporary copies
(email attachments, archive members, compressed files): use
a memory temporary directory. Add memory.</li>
<li>More CPUs...</li>
</ul>
</p>
<p>At some point, the index writing may become the
bottleneck. As far as I can think, the only possible approach
then is to partition the index.</p>
</div>
</body>
</html>