<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>RECOLL indexing performance and index sizes</title>
<meta name="generator" content="HTML Tidy, see www.w3.org">
<meta name="Author" content="Jean-Francois Dockes">
<meta name="Description" content=
"recoll is a simple full-text search system for unix and linux based on the powerful and mature xapian engine">
<meta name="Keywords" content=
"full text search,fulltext,desktop search,unix,linux,solaris,open source,free">
<meta http-equiv="Content-language" content="en">
<meta http-equiv="content-type" content="text/html; charset=iso-8859-1">
<meta name="robots" content="All,Index,Follow">
<link type="text/css" rel="stylesheet" href="styles/style.css">
</head>

<body>

<div class="rightlinks">
<ul>
<li><a href="index.html">Home</a></li>
<li><a href="pics/index.html">Screenshots</a></li>
<li><a href="download.html">Downloads</a></li>
<li><a href="doc.html">Documentation</a></li>
</ul>
</div>

<div class="content">

<h1>Recoll: Indexing performance and index sizes</h1>

<p>The time needed to index a given set of documents, and the
resulting index size, depend on many factors.</p>

<p>The index size depends almost entirely on the amount of
uncompressed input text, and you can expect it to be roughly
of the same order of magnitude. Depending on the type of file,
the proportion of text to file size varies widely, going
from close to 1 for pure text files to a very small factor
for, e.g., metadata tags in mp3 files.</p>

<p>Estimating indexing time is a much more complicated issue,
depending on the type and size of the input and on system
performance. There is no general way to determine which part of
the hardware should be optimized. Depending on the type of
input, performance may be bound by I/O read or write
performance, single-threaded CPU speed, or combined
multi-processing speed.</p>

<p>It should be noted that Recoll performance will not be an
issue for most people. The indexer can process 1000 typical
PDF files per minute, or 500 Wikipedia HTML pages per second
on medium-range hardware, meaning that the initial indexing of
a typical dataset will need a few dozen minutes at
most. Further incremental index updates will be much faster
because most files will not need to be processed again.</p>

<p>However, there are Recoll installations with
terabyte-sized datasets, on which indexing can take days. For
such operations (or even much smaller ones), it is very
important to know what kind of performance can be expected,
and which aspects of the hardware should be optimized.</p>

<p>In order to provide some reference points, I have run a
number of benchmarks on medium-sized datasets, using typical
mid-range desktop hardware, and varying the indexing
configuration parameters to show how they affect the results.</p>

<p>The following may help you check that you are getting typical
performance from your indexing, and gives some indications about
what to adjust to improve it.</p>

<p>From time to time, I receive a report about a system becoming
unusable during indexing. As far as I know, with the default
Recoll configuration, and barring an exceptional issue (bug),
this is always due to a system problem (typically bad hardware,
such as a disk doing retries). The tests below were mostly run
while I was using the desktop, which never became
unusable. However, some tests made it less responsive, and
this is noted with the results.</p>

<p>The following text refers to the indexing parameters without
further explanation. See the documentation for more details about the
<a href="http://www.lesbonscomptes.com/recoll/idxthreads/threadingRecoll.html#recoll.idxthreads.multistage">processing
model</a> and the
<a href="https://www.lesbonscomptes.com/recoll/usermanual/webhelp/docs/RCL.INSTALL.CONFIG.RECOLLCONF.PERFS.html">configuration
parameters</a>.</p>
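
<p>For reference, these parameters are set in the index
configuration file (<i>recoll.conf</i>). Here is a minimal
sketch with values similar to those used in the tests below
(the exact defaults may differ between Recoll versions):</p>

<pre>
# ~/.recoll/recoll.conf (sketch)
# Megabytes of indexed text between Xapian flushes to disk
idxflushmb = 200
# Queue depths for the three processing stages
thrQSizes = 2 2 2
# Thread counts for the stages: input handling / term generation / index update
thrTCounts = 6 4 1
</pre>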

<p>All tests were run without generating the stemming database or
the aspell dictionary. These phases are relatively short and there
is nothing that can be optimized about them.</p>
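
<p>If you want to reproduce this, both phases can be skipped from
the configuration. A sketch, assuming the usual parameter names
(check the configuration documentation for your Recoll version):</p>

<pre>
# recoll.conf: skip the post-indexing auxiliary databases
# Empty list: do not build any stemming database
indexstemminglanguages =
# Do not build the aspell spelling approximation dictionary
noaspell = 1
</pre>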
<h2>Hardware</h2>

<p>The tests were run on what could be considered a mid-range
desktop PC:
<ul>
<li>Intel Core i7-4770T CPU: 2.5 GHz, 4 physical cores, and
hyper-threading for a total of 8 hardware threads</li>
<li>8 GB of RAM</li>
<li>Asus H87I-Plus motherboard, Samsung 850 EVO SSD storage</li>
</ul>
</p>

<p>This is normally a fanless PC, but I did run a fan on the
external case fins during some of the tests (especially PDF
indexing), because the CPU was running a bit too hot.</p>

<h2>Indexing PDF files</h2>

<p>The tests were run on 18000 random PDFs harvested on
Google, with a total size of around 30 GB, using Recoll 1.22.3
and Xapian 1.2.22. The resulting index size was 1.2 GB.</p>

<h3>PDF: storage</h3>

<p>Typical PDF files have a low text to file size ratio, and a
lot of data needs to be read for indexing. With the test
configuration, the indexer needs to read around 45 MB/s
from multiple files. This means that input storage makes a
difference and that you need an SSD or a fast array for
optimal performance.</p>

<table border=1>
<thead>
<tr>
<th>Storage</th>
<th>idxflushmb</th>
<th>thrTCounts</th>
<th>Real Time</th>
</tr>
</thead>
<tbody>
<tr>
<td>NFS drive (gigabit)</td>
<td>200</td>
<td>6/4/1</td>
<td>24m40</td>
</tr>
<tr>
<td>local SSD</td>
<td>200</td>
<td>6/4/1</td>
<td>11m40</td>
</tr>
</tbody>
</table>

<h3>PDF: threading</h3>

<p>Because PDF files are bulky and complicated to process, the
dominant step when indexing them is input processing. PDF text
extraction is performed by multiple instances of
the <i>pdftotext</i> program, and parallelisation works very
well.</p>

<p>The following table shows the indexing times with a variety
of threading parameters.</p>

<table border=1>
<thead>
<tr>
<th>idxflushmb</th>
<th>thrQSizes</th>
<th>thrTCounts</th>
<th>Real time</th>
</tr>
</thead>
<tbody>
<tr>
<td>200</td>
<td>2/2/2</td>
<td>2/1/1</td>
<td>19m21</td>
</tr>
<tr>
<td>200</td>
<td>2/2/2</td>
<td>10/10/1</td>
<td>10m38</td>
</tr>
<tr>
<td>200</td>
<td>2/2/2</td>
<td>100/10/1</td>
<td>11m</td>
</tr>
</tbody>
</table>

<p>10/10/1 was the best value of thrTCounts for this test. The
total CPU time was around 78 minutes.</p>

<p>The last line shows that even a ridiculously high thread
count for the input step does not hurt much. Using values
slightly lower than the optimum does not have much impact
either. The only thing which really degrades performance is
configuring fewer threads than the hardware provides.</p>
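
<p>Expressed as configuration settings, the best-performing run
above corresponds to something like the following sketch (adjust
the counts to your own number of hardware threads):</p>

<pre>
# recoll.conf: settings matching the best PDF run above
idxflushmb = 200
thrQSizes = 2 2 2
# Many input-processing threads, several term-generation threads,
# a single index-update thread
thrTCounts = 10 10 1
</pre>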

<p>With the optimal parameters above, the peak recollindex
resident memory size is around 930 MB, to which we should add
ten instances of pdftotext (typically 10 MB each) and of the
rclpdf.py Python input handler (around 15 MB each). This means
that the total resident memory used by indexing is around 1200
MB, quite a modest value in 2016.</p>

<h3>PDF: Xapian flushes</h3>

<p>idxflushmb has practically no influence on the indexing time
(tested from 40 to 1000), which is not too surprising because
the Xapian index size is very small relative to the input
size, so the cost of Xapian flushes to disk is not very
significant. The value of 200 used for the threading tests
could be lowered in practice, which would decrease memory
usage without changing the indexing time significantly.</p>
<h3>PDF: conclusion</h3>

<p>For indexing PDF files, you need many cores and a fast
input storage system. Neither single-threaded CPU performance
nor the amount of memory will be a critical factor.</p>

<p>Running the PDF indexing tests had no influence on the system
"feel": I could work on it just as if it were quiescent.</p>

<h2>Indexing HTML files</h2>

<p>The tests were run on an (old) French Wikipedia dump: 2.9
million HTML files stored in 42000 directories, for an
approximate total size of 41 GB (average file size
14 KB).</p>

<p>The files are stored on a local SSD. Just reading them with
find+cpio takes close to 8 minutes.</p>
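
<p>(For reference, such a raw read baseline can be measured with
something along these lines, which just reads every file once and
discards the data; the path is of course a placeholder:)</p>

<pre>
cd /path/to/wikipedia/dump
time find . -type f | cpio -o > /dev/null
</pre>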

<p>The resulting index has a size of around 30 GB.</p>

<p>I was too lazy to extract the 3-million-file archive onto a
spinning disk, so all tests were performed with the data
stored on a local SSD.</p>

<p>For this test, the indexing time is dominated by the Xapian
index updates. As these are single-threaded, only the flush
interval has a real influence.</p>

<table border=1>
<thead>
<tr>
<th>idxflushmb</th>
<th>thrQSizes</th>
<th>thrTCounts</th>
<th>Real time</th>
</tr>
</thead>
<tbody>
<tr>
<td>200</td>
<td>2/2/2</td>
<td>2/1/1</td>
<td>88m</td>
</tr>
<tr>
<td>200</td>
<td>2/2/2</td>
<td>6/4/1</td>
<td>91m</td>
</tr>
<tr>
<td>200</td>
<td>2/2/2</td>
<td>1/1/1</td>
<td>96m</td>
</tr>
<tr>
<td>100</td>
<td>2/2/2</td>
<td>1/2/1</td>
<td>120m</td>
</tr>
<tr>
<td>100</td>
<td>2/2/2</td>
<td>6/4/1</td>
<td>121m</td>
</tr>
<tr>
<td>40</td>
<td>2/2/2</td>
<td>1/2/1</td>
<td>173m</td>
</tr>
</tbody>
</table>

<p>The indexing process becomes quite big (resident size around
4 GB), and the combination of high I/O load and high memory
usage makes the system less responsive at times (but not
unusable). As this happens mostly when switching
applications, my guess is that some program pages
(e.g. from the window manager and X) get flushed out and take
time to be read back in, during which the display appears
frozen.</p>

<p>For this kind of data, single-threaded CPU performance and
storage write speed can make a difference. Multithreading does
not help.</p>

<h2>Adjusting hardware to improve indexing performance</h2>

<p>I think that the following multi-step approach has a good
chance of improving performance:
<ul>
<li>Check that multithreading is enabled (it is, by default,
with recent Recoll versions).</li>
<li>Increase the flush threshold until the machine begins to
have memory issues. Maybe add memory.</li>
<li>Store the index on an SSD. If possible, also store the
data on an SSD. Actually, when using many threads, it is
probably at least as important to have the data on an
SSD.</li>
<li>If you have many files which will need temporary copies
(email attachments, archive members, compressed files): use
a memory-based temporary directory (see the sketch after this
list). Add memory.</li>
<li>More CPUs...</li>
</ul>
</p>
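
<p>A sketch of the memory temporary directory suggestion,
assuming that your Recoll version honours the RECOLL_TMPDIR (or
TMPDIR) environment variable, and that /dev/shm is a tmpfs
mount on your system:</p>

<pre>
# Point temporary extraction files at a RAM-backed directory
# before starting the indexer
RECOLL_TMPDIR=/dev/shm recollindex
</pre>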

<p>At some point, index updating and writing may become the
bottleneck (when this happens depends on the data mix: very
quickly with HTML or text files). As far as I can see, the only
possible approach is then to partition the index. You can then
query the multiple Xapian indexes either by using the Recoll
external index capability, or by actually merging them with
xapian-compact.</p>
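
<p>For example, assuming two partial indexes built under the
hypothetical configuration directories ~/.recoll-part1 and
~/.recoll-part2 (the Xapian database lives in the xapiandb
subdirectory of each), merging them could look like this:</p>

<pre>
# Merge the two partial Xapian databases into a single one
xapian-compact ~/.recoll-part1/xapiandb ~/.recoll-part2/xapiandb \
    ~/.recoll-merged/xapiandb
</pre>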
<h2>Old benchmarks</h2>

<p>To provide a point of comparison for the evolution of
hardware and software...</p>

<p>The following very old data was obtained (around 2007?) on a
machine with an 1800 MHz AMD Duron CPU, 768 MB of RAM, and a
7200 RPM, 160 GB IDE disk, running Suse 10.1.</p>

<p><b>recollindex</b> (version 1.8.2 with Xapian 1.0.0) was
executed with the default flush threshold value.
The process memory usage is the one reported by <b>ps</b>.</p>

<table border=1>
<thead>
<tr>
<th>Data</th>
<th>Data size</th>
<th>Indexing time</th>
<th>Index size</th>
<th>Peak process memory usage</th>
</tr>
</thead>
<tbody>
<tr>
<td>Random pdfs harvested on Google</td>
<td>1.7 GB, 3564 files</td>
<td>27 min</td>
<td>230 MB</td>
<td>225 MB</td>
</tr>
<tr>
<td>Ietf mailing list archive</td>
<td>211 MB, 44,000 messages</td>
<td>8 min</td>
<td>350 MB</td>
<td>90 MB</td>
</tr>
<tr>
<td>Partial Wikipedia dump</td>
<td>15 GB, one million files</td>
<td>6h30</td>
<td>10 GB</td>
<td>324 MB</td>
</tr>
<tr>
<!-- DB: ndocs 3564 lastdocid 3564 avglength 6460.71 -->
<td>Random pdfs harvested on Google<br>
Recoll 1.9, <em>idxflushmb</em> set to 10</td>
<td>1.7 GB, 3564 files</td>
<td>25 min</td>
<td>262 MB</td>
<td>65 MB</td>
</tr>
</tbody>
</table>

<p>Notice how the index size for the mail archive is bigger than
the data size. Myriads of small pure text documents will do
this. The expansion factor would of course be even worse with
compressed folders (the test was on uncompressed
data).</p>

<p>The last test was performed with Recoll 1.9.0, which has an
adjustable flush threshold (<em>idxflushmb</em> parameter), here
set to 10 MB. Notice the much lower peak memory usage, with no
performance degradation. The resulting index is bigger though;
the exact reason is not known to me, possibly additional
fragmentation.</p>

</div>
</body>
</html>