diff --git a/src/doc/user/usermanual.html b/src/doc/user/usermanual.html
index 26d99aa8..fc22e43e 100644
--- a/src/doc/user/usermanual.html
+++ b/src/doc/user/usermanual.html
@@ -110,6 +110,9 @@ alink="#0000FF">
2.2.2. Security aspects
+
2.2.3. Special considerations + for big indexes
2.3. "filename">$HOME/.recoll/xapiandb/. This can be changed via two different methods (with different purposes):

-
- +

The size of the index is determined by the size of the set of documents, but the ratio can vary a lot. For a
@@ -1154,9 +1157,9 @@ alink="#0000FF">
non-indexed data (an extreme example being a set of mp3 files where only the tags would be indexed).

Of course, images, sound and video do not increase the
- index size, which means that typically, even a big index
- will be negligible against the total amount of data on the
- computer.

+ index size, which means that in most cases, the space used
+ by the index will be negligible against the total amount of
+ data on the computer.

The index data directory (xapiandb) only contains data that can be completely rebuilt by an index run (as long as the original
@@ -1186,8 +1189,10 @@ alink="#0000FF">
because its format is not supported any more, you will have to explicitly delete the old index (typically ~/.recoll/xapiandb), then
- run a normal indexing command. Using option -z would not work in this situation.

+ run a normal indexing command. Using recollindex option
+ -z would not work in this
+ situation.

@@ -1217,6 +1222,59 @@ alink="#0000FF">
adjust the umask used during index updates.

+
+
+
+
+

2.2.3. Special + considerations for big indexes

+
+
+
+

This only needs concern you if your index is going to + be bigger than around 5 GBytes. Beyond 10 GBytes, it + becomes a serious issue. Most people have much smaller + indexes. For reference, 5 GBytes would be around 2000 + bibles, a lot of text. If you have a huge text dataset + (remember: images don't count, the text content of PDFs + is typically less than 5% of the file size), read on.

+

The amount of writing performed by Xapian during index + creation is not linear with the index size (it is + somewhere between linear and quadratic). For big indexes + this becomes a performance issue, and may even be an SSD + disk wear issue.

+

The problem can be mitigated by observing the + following rules:

+
+
    +
  • +

    Partition the data set and create several + indexes of reasonable size rather than a huge one. + These indexes can then be queried in parallel + (using the Recoll + external indexes facility), or merged using + xapian-compact.

    +
  • +
  • +

    Have a lot of RAM available and set the + idxflushmb + Recoll + configuration parameter as high as you can without + swapping (experimentation will be needed). 200 + would be a minimum in this context.

    +
  • +
  • +

    Use Xapian 1.4.10 or newer, as this version + brought a significant improvement in the amount of + writes.

    +
  • +
+
+
diff --git a/src/doc/user/usermanual.xml b/src/doc/user/usermanual.xml index 91e48f00..759e40fe 100644 --- a/src/doc/user/usermanual.xml +++ b/src/doc/user/usermanual.xml @@ -590,7 +590,19 @@ configuration directory, typically $HOME/.recoll/xapiandb/. This can be changed via two different methods (with different purposes): - + + + For a given configuration directory, you can + specify a non-default storage location for the index by setting + the dbdir parameter in the configuration file + (see the configuration + section). This method would mainly be of use if you wanted + to keep the configuration directory in its default location, but + desired another location for the index, typically out of disk + occupation or performance concerns. + + You can specify a different configuration directory by setting the RECOLL_CONFDIR environment variable, or using the @@ -611,20 +623,9 @@ options allows you to tailor multiple configurations and indexes to handle whatever subset of the available data you wish to make searchable. - - For a given configuration directory, you can - specify a non-default storage location for the index by setting - the dbdir parameter in the configuration file - (see the configuration - section). This method would mainly be of use if you wanted - to keep the configuration directory in its default location, but - desired another location for the index, typically out of disk - occupation concerns. - - + The size of the index is determined by the size of the set @@ -638,8 +639,9 @@ indexed). Of course, images, sound and video do not increase the index - size, which means that typically, even a big index will be negligible - against the total amount of data on the computer. + size, which means that in most cases, the space used by the index + will be negligible against the total amount of data on the + computer. 
The index data directory (xapiandb) only contains data that can be completely rebuilt by an index run @@ -660,8 +662,8 @@ its format is not supported any more, you will have to explicitly delete the old index (typically ~/.recoll/xapiandb), then run a normal - indexing command. Using option would not work - in this situation. + indexing command. Using recollindex option + would not work in this situation. @@ -684,10 +686,44 @@ of protection you need for your index, set the directory and files access modes appropriately, and also maybe adjust the umask used during index updates. - + + Special considerations for big indexes + + This only needs concern you if your index is going to be + bigger than around 5 GBytes. Beyond 10 GBytes, it becomes a serious + issue. Most people have much smaller indexes. For reference, 5 + GBytes would be around 2000 bibles, a lot of text. If you have a + huge text dataset (remember: images don't count, the text content + of PDFs is typically less than 5% of the file size), read on. + + The amount of writing performed by Xapian during index + creation is not linear with the index size (it is somewhere between + linear and quadratic). For big indexes this becomes a performance + issue, and may even be an SSD disk wear issue. + + The problem can be mitigated by observing the following + rules: + + Partition the data set and create several indexes + of reasonable size rather than a huge one. These indexes can then + be queried in parallel (using the &RCL; external indexes + facility), or merged using + xapian-compact. + Have a lot of RAM available and set the + idxflushmb &RCL; configuration parameter as + high as you can without swapping (experimentation will be + needed). 200 would be a minimum in this + context. + Use Xapian 1.4.10 or newer, as this version + brought a significant improvement in the amount of writes. + + + + +
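
The partitioning rule added in section 2.2.3 can be sketched as follows. This is a minimal, hypothetical Python helper and not part of Recoll: the function name `partition_plan`, the partition names, and the `merged` directory are illustrative assumptions. It only builds configuration text and command lines (it runs nothing), and it assumes the `topdirs` and `idxflushmb` configuration parameters, the `recollindex -c <confdir>` option, and the `xapian-compact SOURCE... DESTINATION` calling form. Paths containing spaces would need quoting in `topdirs`, which this sketch does not handle.

```python
from pathlib import PurePosixPath


def partition_plan(base, partitions, flush_mb=200):
    """Plan one Recoll configuration per data partition.

    base       -- directory holding one configuration directory per partition
    partitions -- mapping of partition name -> list of top directories to index
    flush_mb   -- value for idxflushmb (200 is the minimum suggested by the text)

    Returns (configs, index_cmds, merge_cmd):
      configs    -- mapping of recoll.conf path -> file content
      index_cmds -- one recollindex command line per partition
      merge_cmd  -- an xapian-compact command merging all partition indexes
    """
    base = PurePosixPath(base)
    configs = {}
    index_cmds = []
    db_dirs = []
    for name, topdirs in partitions.items():
        confdir = base / name
        # Each partition gets its own small configuration file.
        configs[str(confdir / "recoll.conf")] = (
            "topdirs = " + " ".join(topdirs) + "\n"
            f"idxflushmb = {flush_mb}\n"
        )
        # Index each partition separately, using its own configuration.
        index_cmds.append(["recollindex", "-c", str(confdir)])
        # By default the index lives in xapiandb inside the configuration dir.
        db_dirs.append(str(confdir / "xapiandb"))
    # Optional merge step: all source databases first, destination last.
    merge_cmd = ["xapian-compact", *db_dirs, str(base / "merged" / "xapiandb")]
    return configs, index_cmds, merge_cmd


if __name__ == "__main__":
    configs, index_cmds, merge_cmd = partition_plan(
        "/tmp/rcl", {"a": ["/data/a"], "b": ["/data/b1", "/data/b2"]}
    )
    for path, text in configs.items():
        print(path)
        print(text)
    for cmd in index_cmds:
        print(" ".join(cmd))
    print(" ".join(merge_cmd))
```

If the merge step is used, the resulting index would presumably be queried through a configuration whose `dbdir` parameter points at the merged directory. Otherwise the partition indexes can stay separate and be queried together through the external indexes facility.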