doc and web perf notes
This commit is contained in:
parent 1fc5e9ccec
commit 8289584aa9
@@ -18,10 +18,10 @@ names. The list in the default configuration does not exclude hidden
directories (names beginning with a dot), which means that it may index
quite a few things that you do not want. On the other hand, email user
agents like Thunderbird usually store messages in hidden directories, and
you probably want this indexed. One possible solution is to have '.*' in
'skippedNames', and add things like '~/.thunderbird' '~/.evolution' to
'topdirs'. Not even the file names are indexed for patterns in this
list, see the 'noContentSuffixes' variable for an alternative approach
you probably want this indexed. One possible solution is to have ".*" in
"skippedNames", and add things like "~/.thunderbird" "~/.evolution" to
"topdirs". Not even the file names are indexed for patterns in this
list, see the "noContentSuffixes" variable for an alternative approach
which indexes the file names. Can be redefined for any
subtree.</para></listitem></varlistentry>
<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.NOCONTENTSUFFIXES">
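For reference, a minimal sketch of the setup this paragraph describes. The parameter names (topdirs, skippedNames) come from the text above; the exact value lists are illustrative, not the shipped defaults:

    # ~/.recoll/recoll.conf (sketch): skip all hidden names globally,
    # but still index the hidden mail stores by listing them in topdirs
    topdirs = ~/ ~/.thunderbird ~/.evolution
    skippedNames = .*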
@@ -366,10 +366,11 @@ which lets Xapian perform its own thing, meaning flushing every
$XAPIAN_FLUSH_THRESHOLD documents created, modified or deleted: as memory
usage depends on average document size, not only document count, the
Xapian approach is not very useful, and you should let Recoll manage
the flushes. The default value of idxflushmb is 10 MB, and may be a bit
low. If you are looking for maximum speed, you may want to experiment
with values between 20 and
80. In my experience, values beyond 100 are always counterproductive. If
the flushes. The program compiled value is 0. The configured default
value (from this file) is 10 MB, and will be too low in many cases (it is
chosen to conserve memory). If you are looking
for maximum speed, you may want to experiment with values between 20 and
200. In my experience, values beyond this are always counterproductive. If
you find otherwise, please drop me a note.</para></listitem></varlistentry>
<varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.FILTERMAXSECONDS">
<term><varname>filtermaxseconds</varname></term>

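As a concrete illustration of the tuning advice above (the variable name is from the text; the value is only an example to experiment with, not a recommendation):

    # ~/.recoll/recoll.conf (sketch): flush the Xapian index every ~100 MB
    # of indexed text instead of the configured 10 MB default
    idxflushmb = 100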
@@ -20,8 +20,8 @@ alink="#0000FF">
<div class="titlepage">
<div>
<div>
<h1 class="title"><a name="idp41214976" id=
"idp41214976"></a>Recoll user manual</h1>
<h1 class="title"><a name="idp9509520" id=
"idp9509520"></a>Recoll user manual</h1>
</div>

<div>
@@ -109,13 +109,13 @@ alink="#0000FF">
multiple indexes</a></span></dt>

<dt><span class="sect2">2.1.3. <a href=
"#idp46788704">Document types</a></span></dt>
"#idp41562832">Document types</a></span></dt>

<dt><span class="sect2">2.1.4. <a href=
"#idp46808384">Indexing failures</a></span></dt>
"#idp41582512">Indexing failures</a></span></dt>

<dt><span class="sect2">2.1.5. <a href=
"#idp46815840">Recovery</a></span></dt>
"#idp41589968">Recovery</a></span></dt>
</dl>
</dd>

@@ -997,8 +997,8 @@ alink="#0000FF">
<div class="titlepage">
<div>
<div>
<h3 class="title"><a name="idp46788704" id=
"idp46788704"></a>2.1.3. Document types</h3>
<h3 class="title"><a name="idp41562832" id=
"idp41562832"></a>2.1.3. Document types</h3>
</div>
</div>
</div>
@@ -1091,8 +1091,8 @@ indexedmimetypes = application/pdf
<div class="titlepage">
<div>
<div>
<h3 class="title"><a name="idp46808384" id=
"idp46808384"></a>2.1.4. Indexing
<h3 class="title"><a name="idp41582512" id=
"idp41582512"></a>2.1.4. Indexing
failures</h3>
</div>
</div>
@@ -1132,8 +1132,8 @@ indexedmimetypes = application/pdf
<div class="titlepage">
<div>
<div>
<h3 class="title"><a name="idp46815840" id=
"idp46815840"></a>2.1.5. Recovery</h3>
<h3 class="title"><a name="idp41589968" id=
"idp41589968"></a>2.1.5. Recovery</h3>
</div>
</div>
</div>
@@ -6571,9 +6571,8 @@ for doc in results:

<div class="variablelist">
<dl class="variablelist">
<dt><a name="RCL.PROGRAM.PYTHONAPI.ELEMENTS.UDI" id=
"RCL.PROGRAM.PYTHONAPI.ELEMENTS.UDI"></a><span class=
"term">ipath</span></dt>
<dt><a name="RCL.PROGRAM.PYTHONAPI.ELEMENTS.IPATH"
id="RCL.PROGRAM.PYTHONAPI.ELEMENTS.IPATH"></a><span class="term">ipath</span></dt>

<dd>
<p>This data value (set as a field in the Doc
@@ -8652,10 +8651,10 @@ thesame = "some string with spaces"
email user agents like Thunderbird usually store
messages in hidden directories, and you probably
want this indexed. One possible solution is to have
'.*' in 'skippedNames', and add things like
'~/.thunderbird' '~/.evolution' to 'topdirs'. Not
".*" in "skippedNames", and add things like
"~/.thunderbird" "~/.evolution" to "topdirs". Not
even the file names are indexed for patterns in
this list, see the 'noContentSuffixes' variable for
this list, see the "noContentSuffixes" variable for
an alternative approach which indexes the file
names. Can be redefined for any subtree.</p>
</dd>
@@ -9306,11 +9305,13 @@ thesame = "some string with spaces"
modified or deleted: as memory usage depends on
average document size, not only document count, the
Xapian approach is not very useful, and you
should let Recoll manage the flushes. The default
value of idxflushmb is 10 MB, and may be a bit low.
If you are looking for maximum speed, you may want
to experiment with values between 20 and 80. In my
experience, values beyond 100 are always
should let Recoll manage the flushes. The program
compiled value is 0. The configured default value
(from this file) is 10 MB, and will be too low in
many cases (it is chosen to conserve memory). If
you are looking for maximum speed, you may want to
experiment with values between 20 and 200. In my
experience, values beyond this are always
counterproductive. If you find otherwise, please
drop me a note.</p>
</dd>

@@ -4489,7 +4489,7 @@ for doc in results:

<variablelist>

<varlistentry id="RCL.PROGRAM.PYTHONAPI.ELEMENTS.UDI">
<varlistentry id="RCL.PROGRAM.PYTHONAPI.ELEMENTS.IPATH">
<term>ipath</term>

<listitem><para>This data value (set as a field in the Doc

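To make the ipath element above concrete, here is a small Python sketch of how a result's ipath might be read through the Recoll Python API. The call names (connect, query, execute, fetchmany) are given from memory and should be checked against the Python API chapter, so treat this as an assumption rather than a verified example:

    from recoll import recoll

    db = recoll.connect()            # default configuration directory
    q = db.query()
    nres = q.execute("some search terms")
    for doc in q.fetchmany(20):
        # ipath is empty for standalone files, and identifies the
        # sub-document (e.g. an email attachment) inside a container
        print(doc.url, doc.ipath)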
@@ -956,7 +956,7 @@ achieved with this method.</p></div>
</div>
</div>
<div class="sect1">
<h2 id="_the_next_step_multi_stage_parallelism">The next step: multi-stage parallelism</h2>
<h2 id="recoll.idxthreads.multistage">The next step: multi-stage parallelism</h2>
<div class="sectionbody">
<div class="imageblock" style="float:right;">
<div class="content">
@@ -1283,7 +1283,8 @@ the executing of ephemeral external commands.</p></div>
<div id="footnotes"><hr /></div>
<div id="footer">
<div id="footer-text">
Last updated 2016-05-08 08:30:29 CEST
Last updated
2016-08-07 15:42:01 CEST
</div>
</div>
</body>

@@ -206,6 +206,7 @@ when working on HTML or plain text.
In practice, very modest indexing time improvements from 5% to 15% were
achieved with this method.

[[recoll.idxthreads.multistage]]
== The next step: multi-stage parallelism

image::multipara.png["Multi-stage parallelism", float="right"]

@@ -73,6 +73,12 @@ improving the Windows version, the link:recoll-mingw.html[build instructions].

== Known problems:

- Indexing is very slow, especially when using external commands (e.g. for
PDF files). I don't know if this is a case of my doing something stupid,
or if the general architecture is really badly suited for Windows. If
someone with good Windows programming knowledge reads this, I'd be very
interested in a discussion.

- Filtering by directory location ('dir:' clauses) is currently
case-sensitive, including drive letters. This will hopefully be fixed in
a future version.

@@ -2,8 +2,7 @@

<html>
<head>
<title>RECOLL: a personal text search system for
Unix/Linux</title>
<title>RECOLL indexing performance and index sizes</title>
<meta name="generator" content="HTML Tidy, see www.w3.org">
<meta name="Author" content="Jean-Francois Dockes">
<meta name="Description" content=
@@ -33,20 +32,323 @@
<h1>Recoll: Indexing performance and index sizes</h1>

<p>The time needed to index a given set of documents, and the
resulting index size depend on many factors, such as file size
and proportion of actual text content for the index size, cpu
speed, available memory, average file size and format for the
speed of indexing.</p>
resulting index size depend on many factors.

<p>We try here to give a number of reference points which can
be used to roughly estimate the resources needed to create and
store an index. Obviously, your data set will never fit one of
the samples, so the results cannot be exactly predicted.</p>
<p>The index size depends almost only on the size of the
uncompressed input text, and you can expect it to be roughly
of the same order of magnitude. Depending on the type of file,
the proportion of text to file size varies very widely, going
from close to 1 for pure text files to a very small factor
for, e.g., metadata tags in mp3 files.</p>

<p>The following very old data was obtained on a machine with a
1800 Mhz
AMD Duron CPU, 768Mb of Ram, and a 7200 RPM 160 GBytes IDE
disk, running Suse 10.1. More recent data follows.</p>
<p>Estimating indexing time is a much more complicated issue,
depending on the type and size of input and on system
performance. There is no general way to determine what part of
the hardware should be optimized. Depending on the type of
input, performance may be bound by I/O read or write
performance, CPU single-processing speed, or combined
multi-processing speed.</p>

<p>It should be noted that Recoll performance will not be an
issue for most people. The indexer can process 1000 typical
PDF files per minute, or 500 Wikipedia HTML pages per second
on medium-range hardware, meaning that the initial indexing of
a typical dataset will need a few dozen minutes at
most. Further incremental index updates will be much faster
because most files will not need to be processed again.</p>

<p>However, there are Recoll installations with
terabyte-sized datasets, on which indexing can take days. For
such operations (or even much smaller ones), it is very
important to know what kind of performance can be expected,
and what aspects of the hardware should be optimized.</p>

<p>In order to provide some reference points, I have run a
number of benchmarks on medium-sized datasets, using typical
mid-range desktop hardware, and varying the indexing
configuration parameters to show how they affect the results.</p>

<p>The following may help you check that you are getting typical
performance for your indexing, and give some indications about
what to adjust to improve it.</p>

<p>From time to time, I receive a report about a system becoming
unusable during indexing. As far as I know, with the default
Recoll configuration, and barring an exceptional issue (bug),
this is always due to a system problem (typically bad hardware
such as a disk doing retries). The tests below were mostly run
while I was using the desktop, which never became
unusable. However, some tests rendered it less responsive and
this is noted with the results.</p>

<p>The following text refers to the indexing parameters without
further explanation. These links give more detail about the
<a href="http://www.lesbonscomptes.com/recoll/idxthreads/threadingRecoll.html#recoll.idxthreads.multistage">processing
model</a> and
<a href="https://www.lesbonscomptes.com/recoll/usermanual/webhelp/docs/RCL.INSTALL.CONFIG.RECOLLCONF.PERFS.html">configuration
parameters</a>.</p>


<p>All tests were run without generating the stemming database or
aspell dictionary. These phases are relatively short and there
is nothing which can be optimized about them.</p>

<h2>Hardware</h2>

<p>The tests were run on what could be considered a mid-range
desktop PC:
<ul>
<li>Intel Core I7-4770T CPU: 2.5 GHz, 4 physical cores, and
hyper-threading for a total of 8 hardware threads</li>
<li>8 GBytes of RAM</li>
<li>Asus H87I-Plus motherboard, Samsung 850 EVO SSD storage</li>
</ul>
</p>

<p>This is usually a fanless PC, but I did run a fan on the
external case fins during some of the tests (esp. PDF
indexing), because the CPU was running a bit too hot.</p>


<h2>Indexing PDF files</h2>


<p>The tests were run on 18000 random PDFs harvested on
Google, with a total size of around 30 GB, using Recoll 1.22.3
and Xapian 1.2.22. The resulting index size was 1.2 GB.</p>

<h3>PDF: storage</h3>

<p>Typical PDF files have a low text to file size ratio, and a
lot of data needs to be read for indexing. With the test
configuration, the indexer needs to read around 45 MBytes/s
from multiple files. This means that input storage makes a
difference and that you need an SSD or a fast array for
optimal performance.</p>

<table border=1>
<thead>
<tr>
<th>Storage</th>
<th>idxflushmb</th>
<th>thrTCounts</th>
<th>Real Time</th>
</tr>
<tbody>
<tr>
<td>NFS drive (gigabit)</td>
<td>200</td>
<td>6/4/1</td>
<td>24m40</td>
</tr>
<tr>
<td>local SSD</td>
<td>200</td>
<td>6/4/1</td>
<td>11m40</td>
</tr>
</tbody>
</table>


<h3>PDF: threading</h3>

<p>Because PDF files are bulky and complicated to process, the
dominant step for indexing them is input processing. PDF text
extraction is performed by multiple instances of
the <i>pdftotext</i> program, and parallelisation works very
well.</p>

<p>The following table shows the indexing times with a variety
of threading parameters.</p>

<table border=1>
<thead>
<tr>
<th>idxflushmb</th>
<th>thrQSizes</th>
<th>thrTCounts</th>
<th>Time R/U/S</th>
</tr>
<tbody>
<tr>
<td>200</td>
<td>2/2/2</td>
<td>2/1/1</td>
<td>19m21</td>
</tr>
<tr>
<td>200</td>
<td>2/2/2</td>
<td>10/10/1</td>
<td>10m38</td>
</tr>
<tr>
<td>200</td>
<td>2/2/2</td>
<td>100/10/1</td>
<td>11m</td>
</tr>
</tbody>
</table>

<p>10/10/1 was the best value for thrTCounts for this test. The
total CPU time was around 78 mn.</p>

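<p>For readers who want to reproduce the fastest run above, this is roughly what the corresponding configuration lines might look like. The parameter names and values are the ones from the table; the exact syntax is an assumption and should be checked against the configuration documentation:</p>

<pre>
# recoll.conf sketch for the fastest PDF run above (assumed syntax)
thrQSizes = 2 2 2
thrTCounts = 10 10 1
idxflushmb = 200
</pre>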
<p>The last line shows the effect of a ridiculously high thread
count value for the input step, which turns out to be small. Using
slightly lower values than the optimum does not have much impact
either. The only thing which really degrades performance is
configuring fewer threads than the hardware can run.</p>

<p>With the optimal parameters above, the peak recollindex
resident memory size is around 930 MB, to which we should add
ten instances of pdftotext (10 MB typical), and of the
rclpdf.py Python input handler (around 15 MB each). This means
that the total resident memory used by indexing is around 1200
MB, quite a modest value in 2016.</p>


<h3>PDF: Xapian flushes</h3>

<p>idxflushmb has practically no influence on the indexing time
(tested from 40 to 1000), which is not too surprising because
the Xapian index size is very small relative to the input
size, so that the cost of Xapian flushes to disk is not very
significant. The value of 200 used for the threading tests
could be lowered in practice, which would decrease memory
usage and not change the indexing time significantly.</p>

<h3>PDF: conclusion</h3>

<p>For indexing PDF files, you need many cores and a fast
input storage system. Neither single-thread performance nor
the amount of memory will be a critical aspect.</p>

<p>Running the PDF indexing tests had no influence on the system
"feel"; I could work on it just as if it were quiescent.</p>


<h2>Indexing HTML files</h2>

<p>The tests were run on an (old) French Wikipedia dump: 2.9
million HTML files stored in 42000 directories, for an
approximate total size of 41 GB (average file size
14 KB).</p>

<p>The files are stored on a local SSD. Just reading them with
find+cpio takes close to 8 mn.</p>

<p>The resulting index has a size of around 30 GB.</p>

<p>I was too lazy to extract a 3-million-entry tar file onto a
spinning disk, so all tests were performed with the data
stored on a local SSD.</p>

<p>For this test, the indexing time is dominated by the Xapian
index updates. As these are single-threaded, only the flush
interval has a real influence.</p>

<table border=1>
<thead>
<tr>
<th>idxflushmb</th>
<th>thrQSizes</th>
<th>thrTCounts</th>
<th>Time R/U/S</th>
</tr>
<tbody>
<tr>
<td>200</td>
<td>2/2/2</td>
<td>2/1/1</td>
<td>88m</td>
</tr>
<tr>
<td>200</td>
<td>2/2/2</td>
<td>6/4/1</td>
<td>91m</td>
</tr>
<tr>
<td>200</td>
<td>2/2/2</td>
<td>1/1/1</td>
<td>96m</td>
</tr>
<tr>
<td>100</td>
<td>2/2/2</td>
<td>1/2/1</td>
<td>120m</td>
</tr>
<tr>
<td>100</td>
<td>2/2/2</td>
<td>6/4/1</td>
<td>121m</td>
</tr>
<tr>
<td>40</td>
<td>2/2/2</td>
<td>1/2/1</td>
<td>173m</td>
</tr>
</tbody>
</table>


<p>The indexing process becomes quite large (resident size around
4 GB), and the combination of high I/O load and high memory
usage makes the system less responsive at times (but not
unusable). As this happens principally when switching
applications, my guess would be that some program pages
(e.g. from the window manager and X) get flushed out, and take
time being read back in, during which time the display appears
frozen.</p>

<p>For this kind of data, single-threaded CPU performance and
storage write speed can make a difference. Multithreading does
not help.</p>

<h2>Adjusting hardware to improve indexing performance</h2>

<p>I think that the following multi-step approach has a good
chance of improving performance:
<ul>
<li>Check that multithreading is enabled (it is, by default
with recent Recoll versions).</li>
<li>Increase the flush threshold until the machine begins to
have memory issues. Maybe add memory.</li>
<li>Store the index on an SSD. If possible, also store the
data on an SSD. Actually, when using many threads, it is
arguably even more important to have the data on an
SSD.</li>
<li>If you have many files which will need temporary copies
(email attachments, archive members, compressed files): use
a memory-based temporary directory (see the sketch after this
list). Add memory.</li>
<li>More CPUs...</li>
</ul>
</p>
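<p>A minimal sketch of the memory temporary directory idea, assuming a Linux tmpfs mount and assuming the indexer honours a temporary-directory environment variable such as RECOLL_TMPDIR (check the manual for the exact variable name; TMPDIR is the generic fallback):</p>

<pre>
# put temporary extraction copies in RAM (paths and variable are assumptions)
mkdir -p /dev/shm/recoll-tmp
RECOLL_TMPDIR=/dev/shm/recoll-tmp recollindex
</pre>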

<p>At some point, the index updating and writing may become the
bottleneck (this depends on the data mix; it happens very quickly with
HTML or text files). As far as I can think, the only possible
approach is then to partition the index. You can query the
multiple Xapian indices either by using the Recoll external
index capability, or by actually merging the indices with
xapian-compact.</p>
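<p>As an illustration of the partitioning idea, here is a hedged sketch of what the merge step could look like. The per-subset configuration directories are made-up names, and the xapiandb locations assume the usual layout of a Recoll configuration directory:</p>

<pre>
# index two subsets under separate configuration directories (illustrative paths)
recollindex -c ~/.recoll-docs
recollindex -c ~/.recoll-mail
# then merge the resulting Xapian databases into a single compacted index
xapian-compact ~/.recoll-docs/xapiandb ~/.recoll-mail/xapiandb ~/.recoll-merged/xapiandb
</pre>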

<h5>Old benchmarks</h5>

<p>To provide a point of comparison for the evolution of
hardware and software...</p>

<p>The following very old data was obtained (around 2007?) on a
machine with a 1800 MHz AMD Duron CPU, 768 MB of RAM, and a
7200 RPM 160 GBytes IDE disk, running Suse 10.1.</p>

<p><b>recollindex</b> (version 1.8.2 with xapian 1.0.0) is
executed with the default flush threshold value.
@@ -108,73 +410,6 @@
the exact reason is not known to me, possibly because of
additional fragmentation.</p>

<p>There is more recent performance data (2012) at the end of
the <a href="idxthreads/threadingRecoll.html">article about
converting Recoll indexing to multithreading</a>.</p>

<p>Update, March 2016: I took another sample of PDF performance
data on a more modern machine, with Recoll multithreading turned
on. The machine has an Intel Core I7-4770T CPU, which has 4
physical cores, and supports hyper-threading for a total of 8
threads, 8 GBytes of RAM, and SSD storage (incidentally the PC is
fanless, this is not a "beast" computer).</p>

<table border=1>
<thead>
<tr>
<th>Data</th>
<th>Data size</th>
<th>Indexing time</th>
<th>Index size</th>
<th>Peak process memory usage</th>
</tr>
<tbody>
<tr>
<td>Random pdfs harvested on Google<br>
Recoll 1.21.5, <em>idxflushmb</em> set to 200, thread
parameters 6/4/1</td>
<td>11 GB, 5320 files</td>
<td>3 mn 15 S</td>
<td>400 MB</td>
<td>545 MB</td>
</tr>
</tbody>
</table>

<p>The indexing process used 21 mn of CPU during these 3mn15 of
real time; we are not letting these cores stay idle
much... The improvement compared to the numbers above is quite
spectacular (a factor of 11, approximately), mostly due to the
multiprocessing, but also to the faster CPU and the SSD
storage. Note that the peak memory value is for the
recollindex process, and does not take into account the
multiple Python and pdftotext instances (which are relatively
small but things add up...).</p>

<h5>Improving indexing performance with hardware:</h5>
<p>I think
that the following multi-step approach has a good chance to
improve performance:
<ul>
<li>Check that multithreading is enabled (it is, by default
with recent Recoll versions).</li>
<li>Increase the flush threshold until the machine begins to
have memory issues. Maybe add memory.</li>
<li>Store the index on an SSD. If possible, also store the
data on an SSD. Actually, when using many threads, it is
probably almost more important to have the data on an
SSD.</li>
<li>If you have many files which will need temporary copies
(email attachments, archive members, compressed files): use
a memory temporary directory. Add memory.</li>
<li>More CPUs...</li>
</ul>
</p>

<p>At some point, the index writing may become the
bottleneck. As far as I can think, the only possible approach
then is to partition the index.</p>

</div>
</body>
</html>