From 8289584aa9f04c34dbf1ffd7947a265928cf1083 Mon Sep 17 00:00:00 2001
From: Jean-Francois Dockes
@@ -8652,10 +8651,10 @@ thesame = "some string with spaces"
email user agents like Thunderbird usually store
messages in hidden directories, and you probably
want this indexed. One possible solution is to have
- '.*' in 'skippedNames', and add things like
- '~/.thunderbird' '~/.evolution' to 'topdirs'. Not
+ ".*" in "skippedNames", and add things like
+ "~/.thunderbird" "~/.evolution" to "topdirs". Not
even the file names are indexed for patterns in
- this list, see the 'noContentSuffixes' variable for
+ this list, see the "noContentSuffixes" variable for
an alternative approach which indexes the file
names. Can be redefined for any subtree.
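For illustration, a minimal sketch of what this could look like in
the configuration file (the first topdirs entry is just a
placeholder, and a real setup would more likely extend the default
skippedNames list than replace it outright):

    topdirs = ~/Documents ~/.thunderbird ~/.evolution
    skippedNames = .*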
diff --git a/website/idxthreads/threadingRecoll.txt b/website/idxthreads/threadingRecoll.txt
index cfeec2e6..94791345 100644
--- a/website/idxthreads/threadingRecoll.txt
+++ b/website/idxthreads/threadingRecoll.txt
@@ -206,6 +206,7 @@
 when working on HTML or plain text. In practice, very modest
 indexing time improvements from 5% to 15% were achieved with
 this method.
 
+[[recoll.idxthreads.multistage]]
 == The next step: multi-stage parallelism
 
 image::multipara.png["Multi-stage parallelism", float="right"]
diff --git a/website/pages/recoll-windows.txt b/website/pages/recoll-windows.txt
index 50c93fd2..ffea8c93 100644
--- a/website/pages/recoll-windows.txt
+++ b/website/pages/recoll-windows.txt
@@ -73,6 +73,12 @@
 improving the Windows version, the
 link:recoll-mingw.html[build instructions].
 
 == Known problems:
 
+- Indexing is very slow, especially when using external commands
+  (e.g. for PDF files). I don't know if this is a case of my doing
+  something stupid, or if the general architecture is really badly
+  fitted for Windows. If someone with good Windows programming
+  knowledge reads this, I'd be very interested in a discussion.
+
 - Filtering by directory location ('dir:' clauses) is currently
   case-sensitive, including drive letters. This will hopefully be
   fixed in a future version.
diff --git a/website/perfs.html b/website/perfs.html
index b821cc96..64d2c768 100644
--- a/website/perfs.html
+++ b/website/perfs.html
@@ -2,8 +2,7 @@
-
+
 Recoll: Indexing performance and index sizes
 
 The time needed to index a given set of documents, and the
-resulting index size depend of many factors, such as file size
-and proportion of actual text content for the index size, cpu
-speed, available memory, average file size and format for the
-speed of indexing.
+resulting index size depend on many factors.
 
-We try here to give a number of reference points which can be
-used to roughly estimate the resources needed to create and
-store an index. Obviously, your data set will never fit one of
-the samples, so the results cannot be exactly predicted.
+The index size depends almost only on the size of the
+uncompressed input text, and you can expect it to be roughly of
+the same order of magnitude. Depending on the type of file, the
+proportion of text to file size varies very widely, going from
+close to 1 for pure text files to a very small factor for, e.g.,
+metadata tags in mp3 files.
 
-The following very old data was obtained on a machine with a
-1800 Mhz AMD Duron CPU, 768Mb of Ram, and a 7200 RPM 160 GBytes
-IDE disk, running Suse 10.1. More recent data follows.
+Estimating indexing time is a much more complicated issue,
+depending on the type and size of the input and on system
+performance. There is no general way to determine which part of
+the hardware should be optimized. Depending on the type of
+input, performance may be bound by I/O read or write
+performance, CPU single-thread speed, or combined multi-thread
+speed.
+
+It should be noted that Recoll performance will not be an issue
+for most people. The indexer can process 1000 typical PDF files
+per minute, or 500 Wikipedia HTML pages per second, on mid-range
+hardware, meaning that the initial indexing of a typical dataset
+will need a few dozen minutes at most. Further incremental index
+updates will be much faster, because most files will not need to
+be processed again.
+
+However, there are Recoll installations with terabyte-sized
+datasets, on which indexing can take days. For such operations
+(or even much smaller ones), it is very important to know what
+kind of performance can be expected, and what aspects of the
+hardware should be optimized.
+
+In order to provide some reference points, I have run a number
+of benchmarks on medium-sized datasets, using typical mid-range
+desktop hardware, and varying the indexing configuration
+parameters to show how they affect the results.
+
+The following may help you check that you are getting typical
+performance for your indexing, and give some indications about
+what to adjust to improve it.
+
+From time to time, I receive a report about a system becoming
+unusable during indexing. As far as I know, with the default
+Recoll configuration, and barring an exceptional issue (bug),
+this is always due to a system problem (typically bad hardware,
+such as a disk doing retries). The tests below were mostly run
+while I was using the desktop, which never became unusable.
+However, some tests rendered it less responsive, and this is
+noted with the results.
+
+The following text refers to the indexing parameters without
+further explanation. See the Recoll documentation for more
+detailed explanations of the processing model and of the
+configuration parameters.
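+
+For reference, here is a recoll.conf sketch of the three
+parameters, set to one of the combinations used in the tests
+below (the stage names in the comments are a rough description,
+not official terminology):
+
+    # MB of indexed text between index flushes to disk
+    idxflushmb = 200
+    # Job queue depths for the pipeline stages
+    thrQSizes = 2 2 2
+    # Worker thread counts per stage (input/terms/index update)
+    thrTCounts = 6 4 1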
+
+All tests were run without generating the stemming database or
+the aspell dictionary. These phases are relatively short, and
+there is nothing much to be optimized about them.
+
+The tests were run on what could be considered a mid-range
+desktop PC:
+
+ - Intel Core i7-4770T CPU (4 physical cores, with
+   hyper-threading for a total of 8 threads)
+ - 8 GB of RAM
+ - local SSD storage
+
+This is usually a fanless PC, but I did run a fan on the
+external case fins during some of the tests (esp. PDF indexing),
+because the CPU was running a bit too hot.
+
+The tests were run on 18000 random PDFs harvested on Google,
+with a total size of around 30 GB, using Recoll 1.22.3 and
+Xapian 1.2.22. The resulting index size was 1.2 GB.
+
+Typical PDF files have a low text-to-file-size ratio, and a lot
+of data needs to be read for indexing. With the test
+configuration, the indexer needs to read around 45 MB/s from
+multiple files. This means that input storage makes a
+difference, and that you need an SSD or a fast array for optimal
+performance.
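+
+As a sanity check of the input side, something like the
+following can measure the aggregate read rate on the PDF tree
+(the path is an example):
+
+    # As root, drop the page cache first for a meaningful figure:
+    #    echo 3 > /proc/sys/vm/drop_caches
+    # Then time a full read of the tree, discarding the data:
+    time find /data/pdfs -type f -name '*.pdf' -exec cat {} + > /dev/null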
+
+| Storage             | idxflushmb | thrTCounts | Real time |
+|---------------------|------------|------------|-----------|
+| NFS drive (gigabit) | 200        | 6/4/1      | 24m40     |
+| local SSD           | 200        | 6/4/1      | 11m40     |
+
+Because PDF files are bulky and complicated to process, the
+dominant step when indexing them is input processing. PDF text
+extraction is performed by multiple instances of the pdftotext
+program, and parallelisation works very well.
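+
+For reference, the extraction command run by the rclpdf.py
+handler for each file is essentially of the following form
+(the exact options depend on the Recoll version):
+
+    # Extract the text layer as UTF-8 HTML to standard output
+    pdftotext -htmlmeta -enc UTF-8 -eol unix -q /path/to/file.pdf -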
+
+The following table shows the indexing times for a variety of
+threading parameters.
+
+| idxflushmb | thrQSizes | thrTCounts | Real time |
+|------------|-----------|------------|-----------|
+| 200        | 2/2/2     | 2/1/1      | 19m21     |
+| 200        | 2/2/2     | 10/10/1    | 10m38     |
+| 200        | 2/2/2     | 100/10/1   | 11m       |
+
+10/10/1 was the best value of thrTCounts for this test. The
+total CPU time was around 78 min.
+
+The last line shows the effect of a ridiculously high thread
+count value for the input step, which is small. Using slightly
+lower values than the optimum does not have much impact either.
+The only thing which really degrades performance is configuring
+fewer threads than the hardware can run.
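+
+Expressed in recoll.conf terms, the best-performing combination
+from the table above would be:
+
+    idxflushmb = 200
+    thrQSizes = 2 2 2
+    # 10 input workers, 10 term-generation workers, 1 index updater
+    thrTCounts = 10 10 1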
+
+With the optimal parameters above, the peak recollindex resident
+memory size is around 930 MB, to which we should add ten
+instances of pdftotext (typically 10 MB each) and ten of the
+rclpdf.py Python input handler (around 15 MB each). This means
+that the total resident memory used by the indexing is around
+1200 MB, quite a modest value in 2016.
+
+idxflushmb has practically no influence on the indexing time
+(tested from 40 to 1000), which is not too surprising because
+the Xapian index size is very small relative to the input size,
+so the cost of Xapian flushes to disk is not very significant.
+The value of 200 used for the threading tests could be lowered
+in practice, which would decrease memory usage without changing
+the indexing time significantly.
+
+For indexing PDF files, you need many cores and a fast input
+storage system. Neither single-thread CPU performance nor the
+amount of memory will be a critical aspect.
+
+Running the PDF indexing tests had no influence on the system
+"feel": I could work on it just as if it were quiescent.
+
+The following tests were run on an (old) French Wikipedia dump:
+2.9 million HTML files stored in 42000 directories, for an
+approximate total size of 41 GB (average file size 14 KB).
+
+The files are stored on a local SSD. Just reading them with
+find+cpio takes close to 8 min.
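+
+The read test was presumably of the following general form (the
+directory name is an example):
+
+    # Read every file once, measuring pure input throughput
+    time find frwiki -depth -print | cpio -o > /dev/null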
+
+The resulting index has a size of around 30 GB.
+
+I was too lazy to extract the 3-million-entry tar file onto a
+spinning disk, so all tests were performed with the data stored
+on a local SSD.
+
+For this test, the indexing time is dominated by the Xapian
+index updates. As these are single-threaded, only the flush
+interval has a real influence.
+
+| idxflushmb | thrQSizes | thrTCounts | Real time |
+|------------|-----------|------------|-----------|
+| 200        | 2/2/2     | 2/1/1      | 88m       |
+| 200        | 2/2/2     | 6/4/1      | 91m       |
+| 200        | 2/2/2     | 1/1/1      | 96m       |
+| 100        | 2/2/2     | 1/2/1      | 120m      |
+| 100        | 2/2/2     | 6/4/1      | 121m      |
+| 40         | 2/2/2     | 1/2/1      | 173m      |
+
+The indexing process becomes quite big (resident size around
+4 GB), and the combination of high I/O load and high memory
+usage makes the system less responsive at times (but not
+unusable). As this happens mostly when switching applications,
+my guess is that some program pages (e.g. from the window
+manager and X) get flushed out, and take time to be read back
+in, during which the display appears frozen.
+
+For this kind of data, single-threaded CPU performance and
+storage write speed can make a difference. Multithreading does
+not help.
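+
+In configuration terms, the best run in the table above simply
+keeps a large flush threshold and near-default thread counts:
+
+    # Large flush interval: fewer Xapian flushes, at a memory cost
+    idxflushmb = 200
+    thrQSizes = 2 2 2
+    thrTCounts = 2 1 1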
+
+I think that the following multi-step approach has a good chance
+to improve performance:
+
+At some point, the index updating and writing may become the
+bottleneck (this depends on the data mix: very quickly so with
+HTML or text files). As far as I can see, the only possible
+approach is then to partition the index. The multiple Xapian
+indexes can then either be queried together, using the Recoll
+external index capability, or actually be merged with
+xapian-compact.
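+
+Schematically, merging partial indexes could then look like this
+(the directory names are examples; each Recoll index lives in
+the xapiandb subdirectory of its configuration directory):
+
+    # Merge several partial Xapian databases into a single one
+    xapian-compact part1/xapiandb part2/xapiandb merged/xapiandb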
+
+To provide a point of comparison for the evolution of hardware
+and software, here follows the historical performance data.
+
+The following very old data was obtained (around 2007?) on a
+machine with a 1800 MHz AMD Duron CPU, 768 MB of RAM, and a
+7200 RPM 160 GB IDE disk, running Suse 10.1.
 
 recollindex (version 1.8.2 with xapian 1.0.0) is executed with
 the default flush threshold value.
@@ -108,73 +410,6 @@
 the exact reason is not known to me, possibly because of
 additional fragmentation
-
-There is more recent performance data (2012) at the end of the
-article about converting Recoll indexing to multithreading.
-
-Update, March 2016: I took another sample of PDF performance
-data on a more modern machine, with Recoll multithreading turned
-on. The machine has an Intel Core I7-4770T Cpu, which has 4
-physical cores, and supports hyper-threading for a total of 8
-threads, 8 GBytes of RAM, and SSD storage (incidentally the PC
-is fanless, this is not a "beast" computer).
-
-| Data | Data size | Indexing time | Index size | Peak process memory usage |
-|------|-----------|---------------|------------|---------------------------|
-| Random pdfs harvested on Google. Recoll 1.21.5, idxflushmb set to 200, thread parameters 6/4/1 | 11 GB, 5320 files | 3 mn 15 S | 400 MB | 545 MB |
-
-The indexing process used 21 mn of CPU during these 3mn15 of
-real time, we are not letting these cores stay idle much... The
-improvement compared to the numbers above is quite spectacular
-(a factor of 11, approximately), mostly due to the
-multiprocessing, but also to the faster CPU and the SSD storage.
-Note that the peak memory value is for the recollindex process,
-and does not take into account the multiple Python and pdftotext
-instances (which are relatively small but things add up...).
-
-I think that the following multi-step approach has a good chance
-to improve performance:
-
-At some point, the index writing may become the bottleneck. As
-far as I can think, the only possible approach then is to
-partition the index.