diff --git a/src/doc/user/recoll.conf.xml b/src/doc/user/recoll.conf.xml index a522f5ff..81bd8b02 100644 --- a/src/doc/user/recoll.conf.xml +++ b/src/doc/user/recoll.conf.xml @@ -18,10 +18,10 @@ names. The list in the default configuration does not exclude hidden directories (names beginning with a dot), which means that it may index quite a few things that you do not want. On the other hand, email user agents like Thunderbird usually store messages in hidden directories, and -you probably want this indexed. One possible solution is to have '.*' in -'skippedNames', and add things like '~/.thunderbird' '~/.evolution' to -'topdirs'. Not even the file names are indexed for patterns in this -list, see the 'noContentSuffixes' variable for an alternative approach +you probably want this indexed. One possible solution is to have ".*" in +"skippedNames", and add things like "~/.thunderbird" "~/.evolution" to +"topdirs". Not even the file names are indexed for patterns in this +list, see the "noContentSuffixes" variable for an alternative approach which indexes the file names. Can be redefined for any subtree. @@ -366,10 +366,11 @@ which lets Xapian perform its own thing, meaning flushing every $XAPIAN_FLUSH_THRESHOLD documents created, modified or deleted: as memory usage depends on average document size, not only document count, the Xapian approach is not very useful, and you should let Recoll manage -the flushes. The default value of idxflushmb is 10 MB, and may be a bit -low. If you are looking for maximum speed, you may want to experiment -with values between 20 and -80. In my experience, values beyond 100 are always counterproductive. If +the flushes. The compiled-in default value is 0. The configured default +value (from this file) is 10 MB, and will be too low in many cases (it is +chosen to conserve memory). If you are looking +for maximum speed, you may want to experiment with values between 20 and +200.
In my experience, values beyond this are always counterproductive. If you find otherwise, please drop me a note. filtermaxseconds diff --git a/src/doc/user/usermanual.html b/src/doc/user/usermanual.html index adcda502..0238165c 100644 --- a/src/doc/user/usermanual.html +++ b/src/doc/user/usermanual.html @@ -20,8 +20,8 @@ alink="#0000FF">
-

Recoll user manual

+

Recoll user manual

@@ -109,13 +109,13 @@ alink="#0000FF"> multiple indexes
2.1.3. Document types
+ "#idp41562832">Document types
2.1.4. Indexing failures
+ "#idp41582512">Indexing failures
2.1.5. Recovery
+ "#idp41589968">Recovery @@ -997,8 +997,8 @@ alink="#0000FF">
-

2.1.3. Document types

+

2.1.3. Document types

@@ -1091,8 +1091,8 @@ indexedmimetypes = application/pdf
-

2.1.4. Indexing +

2.1.4. Indexing failures

@@ -1132,8 +1132,8 @@ indexedmimetypes = application/pdf
-

2.1.5. Recovery

+

2.1.5. Recovery

@@ -6571,9 +6571,8 @@ for doc in results:
-
ipath
+
ipath

This data value (set as a field in the Doc @@ -8652,10 +8651,10 @@ thesame = "some string with spaces" email user agents like Thunderbird usually store messages in hidden directories, and you probably want this indexed. One possible solution is to have - '.*' in 'skippedNames', and add things like - '~/.thunderbird' '~/.evolution' to 'topdirs'. Not + ".*" in "skippedNames", and add things like + "~/.thunderbird" "~/.evolution" to "topdirs". Not even the file names are indexed for patterns in - this list, see the 'noContentSuffixes' variable for + this list, see the "noContentSuffixes" variable for an alternative approach which indexes the file names. Can be redefined for any subtree.

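The skippedNames / topdirs trick described in the hunk above can be sketched as a recoll.conf fragment (illustrative values only; note that assigning skippedNames outright replaces the built-in default list, and recent Recoll versions also accept a "skippedNames+" form to append to it instead):

```
# ~/.recoll/recoll.conf -- illustrative sketch
topdirs = ~/ ~/.thunderbird ~/.evolution
skippedNames = .*
```

With these settings, hidden files and directories are skipped everywhere except under the hidden mail directories, which are reachable because they are explicitly listed in topdirs.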
@@ -9306,11 +9305,13 @@ thesame = "some string with spaces" modified or deleted: as memory usage depends on average document size, not only document count, the Xapian approach is not very useful, and you - should let Recoll manage the flushes. The default - value of idxflushmb is 10 MB, and may be a bit low. - If you are looking for maximum speed, you may want - to experiment with values between 20 and 80. In my - experience, values beyond 100 are always + should let Recoll manage the flushes. The compiled-in + default value is 0. The configured default value + (from this file) is 10 MB, and will be too low in + many cases (it is chosen to conserve memory). If + you are looking for maximum speed, you may want to + experiment with values between 20 and 200. In my + experience, values beyond this are always counterproductive. If you find otherwise, please drop me a note.

diff --git a/src/doc/user/usermanual.xml b/src/doc/user/usermanual.xml index 4fb4ffd0..ac1a5fc0 100644 --- a/src/doc/user/usermanual.xml +++ b/src/doc/user/usermanual.xml @@ -4489,7 +4489,7 @@ for doc in results: - > + > ipath This data value (set as a field in the Doc diff --git a/website/idxthreads/threadingRecoll.html b/website/idxthreads/threadingRecoll.html index a89f6fa6..f6c6e45b 100644 --- a/website/idxthreads/threadingRecoll.html +++ b/website/idxthreads/threadingRecoll.html @@ -956,7 +956,7 @@ achieved with this method.

-

The next step: multi-stage parallelism

+

The next step: multi-stage parallelism

@@ -1283,7 +1283,8 @@ the executing of ephemeral external commands.


diff --git a/website/idxthreads/threadingRecoll.txt b/website/idxthreads/threadingRecoll.txt index cfeec2e6..94791345 100644 --- a/website/idxthreads/threadingRecoll.txt +++ b/website/idxthreads/threadingRecoll.txt @@ -206,6 +206,7 @@ when working on HTML or plain text. In practice, very modest indexing time improvements from 5% to 15% were achieved with this method. +[[recoll.idxthreads.multistage]] == The next step: multi-stage parallelism image::multipara.png["Multi-stage parallelism", float="right"] diff --git a/website/pages/recoll-windows.txt b/website/pages/recoll-windows.txt index 50c93fd2..ffea8c93 100644 --- a/website/pages/recoll-windows.txt +++ b/website/pages/recoll-windows.txt @@ -73,6 +73,12 @@ improving the Windows version, the link:recoll-mingw.html[build instructions]. == Known problems: +- Indexing is very slow, especially when using external commands (e.g. for + PDF files). I don't know if this is a case of my doing something stupid, + or if the general architecture is really badly suited for Windows. If + someone with good Windows programming knowledge reads this, I'd be very + interested in a discussion. + - Filtering by directory location ('dir:' clauses) is currently case-sensitive, including drive letters. This will hopefully be fixed in a future version. diff --git a/website/perfs.html b/website/perfs.html index b821cc96..64d2c768 100644 @@ -2,8 +2,7 @@ - RECOLL: a personal text search system for - Unix/Linux + RECOLL indexing performance and index sizes Recoll: Indexing performance and index sizes

The time needed to index a given set of documents, and the - resulting index size depend of many factors, such as file size - and proportion of actual text content for the index size, cpu - speed, available memory, average file size and format for the - speed of indexing.

+ resulting index size depend on many factors. -

We try here to give a number of reference points which can - be used to roughly estimate the resources needed to create and - store an index. Obviously, your data set will never fit one of - the samples, so the results cannot be exactly predicted.

+

The index size depends almost entirely on the size of the + uncompressed input text, and you can expect it to be roughly + of the same order of magnitude. Depending on the type of file, + the proportion of text to file size varies very widely, going + from close to 1 for pure text files to a very small factor + for, e.g., metadata tags in mp3 files.

-

The following very old data was obtained on a machine with a - 1800 Mhz - AMD Duron CPU, 768Mb of Ram, and a 7200 RPM 160 GBytes IDE - disk, running Suse 10.1. More recent data follows.

+

Estimating indexing time is a much more complicated issue, + depending on the type and size of input and on system + performance. There is no general way to determine what part of + the hardware should be optimized. Depending on the type of + input, performance may be bound by I/O read or write + performance, CPU single-processing speed, or combined + multi-processing speed.

+ +

It should be noted that Recoll performance will not be an + issue for most people. The indexer can process 1000 typical + PDF files per minute, or 500 Wikipedia HTML pages per second + on medium-range hardware, meaning that the initial indexing of + a typical dataset will take a few dozen minutes at + most. Further incremental index updates will be much faster + because most files will not need to be processed again.

+ +
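The rates quoted above translate into rough initial indexing times. A back-of-the-envelope sketch (the rates are the article's ballpark figures; the dataset sizes are hypothetical examples, not measurements):

```python
# Ballpark indexing-time estimate from the rates quoted above.
PDF_PER_MIN = 1000   # typical PDF files per minute (article's figure)
HTML_PER_SEC = 500   # Wikipedia-like HTML pages per second (article's figure)

def estimate_minutes(n_pdf: int, n_html: int) -> float:
    """Estimated initial indexing time in minutes for a mixed dataset."""
    return n_pdf / PDF_PER_MIN + n_html / (HTML_PER_SEC * 60)

# Hypothetical personal dataset: 20000 PDFs and 100000 HTML/text pages.
print(round(estimate_minutes(20000, 100000), 1))  # -> 23.3 (minutes)
```

As the article says, incremental updates skip unchanged files, so only the first full run costs this much.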

However, there are Recoll installations with + terabyte-sized datasets, on which indexing can take days. For + such operations (or even much smaller ones), it is very + important to know what kind of performance can be expected, + and what aspects of the hardware should be optimized.

+ +

In order to provide some reference points, I have run a + number of benchmarks on medium-sized datasets, using typical + mid-range desktop hardware, and varying the indexing + configuration parameters to show how they affect the results.

+ +

The following may help you check that you are getting typical + performance for your indexing, and give some indications about + what to adjust to improve it.

+ +

From time to time, I receive a report about a system becoming + unusable during indexing. As far as I know, with the default + Recoll configuration, and barring an exceptional issue (bug), + this is always due to a system problem (typically bad hardware + such as a disk doing retries). The tests below were mostly run + while I was using the desktop, which never became + unusable. However, some tests rendered it less responsive and + this is noted with the results.

+ +

The following text refers to the indexing parameters without + further explanation. Here follow links to more explanation about the + processing + model and + configuration + parameters.

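For reference, the three parameters used throughout the tests below are set in recoll.conf. A minimal sketch, assuming the usual space-separated list syntax from the configuration documentation (the values are the ones used in these tests, not universal recommendations):

```
# recoll.conf fragment -- values from the benchmark runs below
idxflushmb = 200      # index flush threshold, in megabytes
thrQSizes = 2 2 2     # job queue depths for the three pipeline stages
thrTCounts = 6 4 1    # thread counts: input / term generation / index update
```

The slash-separated triples in the tables below (e.g. 6/4/1) correspond to these space-separated lists.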
+ + +

All tests were run without generating the stemming database or + aspell dictionary. These phases are relatively short and there + is nothing which can be optimized about them.

+ +

Hardware

+ +

The tests were run on what could be considered a mid-range + desktop PC: +

    +
• Intel Core i7-4770T CPU: 2.5 GHz, 4 physical cores, and + hyper-threading for a total of 8 hardware threads
  • +
  • 8 GBytes of RAM
  • +
  • Asus H87I-Plus motherboard, Samsung 850 EVO SSD storage
  • +
+

+ +

This is usually a fanless PC, but I did run a fan on the + external case fins during some of the tests (esp. PDF + indexing), because the CPU was running a bit too hot.

+ + +

Indexing PDF files

+ + +

The tests were run on 18000 random PDFs harvested on + Google, with a total size of around 30 GB, using Recoll 1.22.3 + and Xapian 1.2.22. The resulting index size was 1.2 GB.

+ +

PDF: storage

+ +

Typical PDF files have a low text to file size ratio, and a + lot of data needs to be read for indexing. With the test + configuration, the indexer needs to read around 45 MBytes/s + from multiple files. This means that input storage makes a + difference and that you need an SSD or a fast array for + optimal performance.

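The quoted read rate can be sanity-checked against the local-SSD row of the table below: roughly 30 GB of input read in 11m40 of real time (a quick check using the article's own numbers):

```python
# Average read throughput implied by the local-SSD PDF run below.
DATASET_MB = 30_000          # ~30 GB of PDF input
REAL_TIME_S = 11 * 60 + 40   # 11m40 real time on local SSD

throughput = DATASET_MB / REAL_TIME_S  # MB/s
print(round(throughput))  # -> 43, close to the quoted ~45 MB/s
```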
+ + + + + + + + + + + + + + + + + + + + + + + +
StorageidxflushmbthrTCountsReal Time
NFS drive (gigabit)2006/4/124m40
local SSD2006/4/111m40
+ + +

PDF: threading

+ +

Because PDF files are bulky and complicated to process, the + dominant step for indexing them is input processing. PDF text + extraction is performed by multiple instances of + the pdftotext program, and parallelisation works very + well.

+ +

The following table shows the indexing times with a variety + of threading parameters.

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
idxflushmbthrQSizesthrTCountsTime R/U/S
2002/2/22/1/119m21
2002/2/210/10/110m38
2002/2/2100/10/111m
+ +

10/10/1 was the best value for thrTCounts for this test. The + total CPU time was around 78 mn.

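A quick way to judge whether the thread settings are effective is to compare total CPU time to real time, which gives the effective parallelism. Using the article's figures for the best run above:

```python
# Effective parallelism of the best PDF run: total CPU time / real time.
cpu_minutes = 78             # total CPU time reported for the run
real_minutes = 10 + 38 / 60  # 10m38 of real time

effective_threads = cpu_minutes / real_minutes
print(round(effective_threads, 1))  # -> 7.3 of the 8 hardware threads busy
```

A value close to the hardware thread count (here 8) means the pipeline is keeping the cores busy; a much lower value suggests a bottleneck elsewhere.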
+ +

The last line shows the effect of a ridiculously high thread + count value for the input step, which is not much. Using + slightly lower values than the optimum has little impact + either. The only thing which really degrades performance is + configuring fewer threads than available from the hardware.

+ +

With the optimal parameters above, the peak recollindex + resident memory size is around 930 MB, to which we should add + ten instances of pdftotext (typically 10 MB each) and of the + rclpdf.py Python input handler (around 15 MB each). This means + that the total resident memory used by indexing is around 1200 + MB, quite a modest value in 2016.

+ + +

PDF: Xapian flushes

+ +

idxflushmb has practically no influence on the indexing time + (tested from 40 to 1000), which is not too surprising because + the Xapian index size is very small relative to the input + size, so that the cost of Xapian flushes to disk is not very + significant. The value of 200 used for the threading tests + could be lowered in practice, which would decrease memory + usage and not change the indexing time significantly.

+ +

PDF: conclusion

+ +

For indexing PDF files, you need many cores and a fast + input storage system. Neither single-thread performance nor + amount of memory will be critical aspects.

+ +

Running the PDF indexing tests had no influence on the system + "feel", I could work on it just as if it were quiescent.

+ + +

Indexing HTML files

+ +

The tests were run on an (old) French Wikipedia dump: 2.9 + million HTML files stored in 42000 directories, for an + approximate total size of 41 GB (average file size + 14 KB). + +

The files are stored on a local SSD. Just reading them with + find+cpio takes close to 8 mn.

+ +

The resulting index has a size of around 30 GB.

+ +

I was too lazy to extract the 3-million-entry tar file onto a + spinning disk, so all tests were performed with the data + stored on a local SSD.

+ +

For this test, the indexing time is dominated by the Xapian + index updates. As these are single threaded, only the flush + interval has a real influence.

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
idxflushmbthrQSizesthrTCountsTime R/U/S
2002/2/22/1/188m
2002/2/26/4/191m
2002/2/21/1/196m
1002/2/21/2/1120m
1002/2/26/4/1121m
402/2/21/2/1173m
+ + +

The indexing process becomes quite big (resident size around + 4GB), and the combination of high I/O load and high memory + usage makes the system less responsive at times (but not + unusable). As this happens principally when switching + applications, my guess would be that some program pages + (e.g. from the window manager and X) get flushed out, and take + time being read in, during which time the display appears + frozen.

+ +

For this kind of data, single-threaded CPU performance and + storage write speed can make a difference. Multithreading does + not help.

+ +

Adjusting hardware to improve indexing performance

+ +

I think that the following multi-step approach has a good + chance to improve performance: +

    +
  • Check that multithreading is enabled (it is, by default + with recent Recoll versions).
  • +
  • Increase the flush threshold until the machine begins to + have memory issues. Maybe add memory.
  • +
• Store the index on an SSD. If possible, also store the + data on an SSD. Actually, when using many threads, it is + probably even more important to have the data on an + SSD.
  • +
  • If you have many files which will need temporary copies + (email attachments, archive members, compressed files): use + a memory temporary directory. Add memory.
  • +
  • More CPUs...
  • +
+

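For the temporary-copies point above, one possible approach (a sketch: RECOLL_TMPDIR is the environment variable the indexer honors; the paths here are examples, /dev/shm being a common Linux tmpfs location):

```shell
# Point Recoll's temporary files at a fast, ideally RAM-backed, directory.
# We fall back to $TMPDIR here so the sketch works anywhere.
mkdir -p "${TMPDIR:-/tmp}/recolltmp"
export RECOLL_TMPDIR="${TMPDIR:-/tmp}/recolltmp"
# then run the indexer as usual:
# recollindex
```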
+ +

At some point, the index updating and writing may become the + bottleneck (this depends on the data mix, and happens very + quickly with HTML or text files). As far as I can think, the only + possible approach is then to partition the index. You can query + the multiple Xapian indexes either by using the Recoll external + index capability, or by actually merging the indexes into a + single one with xapian-compact.

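The merge step mentioned above could look like this (a sketch: xapian-compact ships with xapian-core and takes source databases followed by a destination; the partition paths are hypothetical examples):

```shell
# Merge per-partition Xapian databases into a single searchable index.
if command -v xapian-compact >/dev/null 2>&1; then
    xapian-compact ~/.recoll-part1/xapiandb ~/.recoll-part2/xapiandb \
                   ~/.recoll-merged/xapiandb
else
    echo "xapian-compact not installed"
fi
```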
+ + + +
Old benchmarks
+ +

To provide a point of comparison for the evolution of + hardware and software...

+ +

The following very old data was obtained (around 2007?) on a + machine with a 1800 MHz AMD Duron CPU, 768 MB of RAM, and a + 7200 RPM 160 GB IDE disk, running Suse 10.1.

recollindex (version 1.8.2 with xapian 1.0.0) is executed with the default flush threshold value. @@ -108,73 +410,6 @@ the exact reason is not known to me, possibly because of additional fragmentation

-

There is more recent performance data (2012) at the end of - the article about - converting Recoll indexing to multithreading

- -

Update, March 2016: I took another sample of PDF performance - data on a more modern machine, with Recoll multithreading turned - on. The machine has an Intel Core I7-4770T Cpu, which has 4 - physical cores, and supports hyper-threading for a total of 8 - threads, 8 GBytes of RAM, and SSD storage (incidentally the PC is - fanless, this is not a "beast" computer).

- - - - - - - - - - - - - - - - - - - -
DataData sizeIndexing timeIndex sizePeak process memory usage
Random pdfs harvested on Google
- Recoll 1.21.5, idxflushmb set to 200, thread - parameters 6/4/1
11 GB, 5320 files3 mn 15 S400 MB545 MB
- -

The indexing process used 21 mn of CPU during these 3mn15 of - real time, we are not letting these cores stay idle - much... The improvement compared to the numbers above is quite - spectacular (a factor of 11, approximately), mostly due to the - multiprocessing, but also to the faster CPU and the SSD - storage. Note that the peak memory value is for the - recollindex process, and does not take into account the - multiple Python and pdftotext instances (which are relatively - small but things add up...).

- -
Improving indexing performance with hardware:
-

I think - that the following multi-step approach has a good chance to - improve performance: -

    -
  • Check that multithreading is enabled (it is, by default - with recent Recoll versions).
  • -
  • Increase the flush threshold until the machine begins to - have memory issues. Maybe add memory.
  • -
  • Store the index on an SSD. If possible, also store the - data on an SSD. Actually, when using many threads, it is - probably almost more important to have the data on an - SSD.
  • -
  • If you have many files which will need temporary copies - (email attachments, archive members, compressed files): use - a memory temporary directory. Add memory.
  • -
  • More CPUs...
  • -
-

- -

At some point, the index writing may become the - bottleneck. As far as I can think, the only possible approach - then is to partition the index.

-