doc

2015-06-15 18:13:00 +02:00 · 2015-06-15 18:13:00 +02:00 · bab589c7ee
commit bab589c7ee
parent f19f790b36
6 changed files with 268 additions and 131 deletions
--- a/website/doc.html
+++ b/website/doc.html
@@ -44,6 +44,14 @@

      <p><br></p>

+      <h1>Howtos on the Recoll Wiki</h1>
+
+      <p>You will find a number of useful tips for common
+        issues and extensions on the 
+	  <a href="http://bitbucket.org/medoc/recoll/wiki/">
+	    Recoll Wiki</a> on 
+	  <a href="http://bitbucket.org/medoc/recoll">bitbucket.org</a>.</p>
+
      <h1>Other documentation</h1>

      <ul>
--- a/website/download.html
+++ b/website/download.html
@@ -371,7 +371,6 @@ Bitbucket).</p>
 <p>A Danish translation by Morten Langlo:
 <a href="translations/recoll_da.ts">recoll_da.ts</a>
 <a href="translations/recoll_da.qm">recoll_da.qm</a><br/>
-This is in 1.20.6
 </p>

 <p>Note that, if you are running an older release, you may find updated
--- a/website/idxthreads/Makefile
+++ b/website/idxthreads/Makefile
@@ -1,2 +1,7 @@
-threadingRecoll.html : threadingRecoll.txt nothreads.png
-	asciidoc threadingRecoll.txt
+.SUFFIXES: .txt .html
+
+.txt.html:
+	asciidoc $<
+
+all: threadingRecoll.html forkingRecoll.html xapDocCopyCrash.html
+
--- a/website/idxthreads/forkingRecoll.txt
+++ b/website/idxthreads/forkingRecoll.txt
@@ -19,39 +19,40 @@ many texts which address the subject. While researching, though, I found
 out that not so many were accurate and that a lot of questions were left as
 an exercise to the reader.

-This document will list the references I found reliable and interesting and
-describe the solution chosen along the other possible approaches.
-
 == Issues with fork

 The traditional way for a Unix process to start another is the
-fork()/exec() system call pair. The initial fork() duplicates the address
-space and resources (open files etc.) of the first process, then duplicates
-the thread of execution, ending up with 2 mostly identical processes.
-exec() then replaces part of the newly executing process with an address space
-initialized from an executable file, inheriting some of the old assets
++fork()+/+exec()+ system call pair. 
+
++fork()+ duplicates the process address space and resources (open files
+etc.), then duplicates the thread of execution, ending up with 2 mostly
+identical processes.  
+
++exec()+ then replaces part of the newly executing process with an address
+space initialized from an executable file, inheriting some of the resources
 under various conditions.

-As processes became bigger the copying-before-discard operation wasted
+As processes became bigger the copy-before-discard operation wasted
 significant resources, and was optimized using two methods (at very
 different points in time):

- - The first approach was to supplement fork() with the vfork() call, which
+ - The first approach was to supplement +fork()+ with the +vfork()+ call, which
   is similar but does not duplicate the address space: the new process
   thread executes in the old address space. The old thread is blocked
-   until the new one calls exec() and frees up access to the memory
+   until the new one calls +exec()+ and frees up access to the memory
   space. Any modification performed by the child thread persists when
   the old one resumes.

- - The more modern approach, which cohexists with vfork(), was to replace
+ - The more modern approach, which coexists with +vfork()+, was to replace
   the full duplication of the memory space with duplication of the page
   descriptors only. The pages in the new process are marked copy-on-write
   so that the new process has write access to its memory without
-   disturbing its parent. The problem with this approach is that the
-   operation can still be a significant resource consumer for big processes
-   mapping a lot of memory. Many processes can fall in this category not
-   because they have huge data segments, but just because they are linked
-   to many shared libraries.
+   disturbing its parent. This approach was supposed to make +vfork()+
+   obsolete, but the operation can still be a significant resource consumer
+   for big processes mapping a lot of memory, so that +vfork()+ is still
+   around. Programs can have big memory spaces not only because they have
+   huge data segments (rare), but also because they are linked to many
+   shared libraries (more common).

 NOTE: Orders of magnitude: a *recollindex* process will easily grow into a
 few hundred megabytes of virtual space. It executes the small and
@@ -60,7 +61,7 @@ indexing multiple such files, *recollindex* can spend '60% of its CPU time'
 doing `fork()`/`exec()` housekeeping instead of useful work (this is on Linux,
 where `fork()` uses copy-on-write).

-Apart from the performance cost, another issue with fork() is that a big
+Apart from the performance cost, another issue with +fork()+ is that a big
 process can fail executing a small command because of the temporary need to
 allocate twice its address space. This is a much discussed subject which we
 will leave aside because it generally does not concern *recollindex*, which
@@ -68,16 +69,16 @@ in typical conditions uses a small portion of the machine virtual memory,
 so that a temporary doubling is not an issue.

 The Recoll indexer is multithreaded, which may introduce other issues. Here
-is what happens to threads during the fork()/exec() interval:
+is what happens to threads during the +fork()+/+exec()+ interval:

- - fork():
+ - +fork()+:
   * The parent process threads all go on their merry way.
   * The child process is created with only one thread active, duplicated
-     from the one which called fork()
- - vfork()
-   * The parent process thread calling vfork() is suspended, the others
+     from the one which called +fork()+
+ - +vfork()+
+   * The parent process thread calling +vfork()+ is suspended, the others
     are unaffected.
-   * The child is created with only one thread, as for fork(). 
+   * The child is created with only one thread, as for +fork()+. 
     This thread shares the memory space with the parent ones, without
     having any means to synchronize with them (pthread locks are not
     supposed to work across processes): caution needed !
@@ -92,14 +93,14 @@ performed in the child (if no cleanup is performed, pipes may remain open
 at both ends which will prevent seeing EOFs etc.). Thanks to StackExchange
 user Celada for explaining this to me.

-For multithreaded programs, both fork() and vfork() introduce possibilities
+For multithreaded programs, both +fork()+ and +vfork()+ introduce possibilities
 of deadlock, because the resources held by a non-forking thread in the
 parent process can't be released in the child because the thread is not
 duplicated. This used to happen from time to time in *recollindex* because
-of an error logging call performed if the exec() failed after the fork()
+of an error logging call performed if the +exec()+ failed after the +fork()+
 (e.g. command not found).

-With vfork() it is also possible to trigger a deadlock in the parent by
+With +vfork()+ it is also possible to trigger a deadlock in the parent by
 (inadvertently) modifying data in the child. This could happen just
 link:http://www.oracle.com/technetwork/server-storage/solaris10/subprocess-136439.html[because
 of dynamic linker operation] (which, seriously, should be considered a
@@ -110,7 +111,7 @@ In general, the state of program data in the child process is a semi-random
 snapshot of what it was in the parent, and the official word about what you
 can do is that you can only call
 link:http://man7.org/linux/man-pages/man7/signal.7.html[async-safe library
-functions] between 'fork()' and 'exec()'. These are functions which are
+functions] between +fork()+ and +exec()+. These are functions which are
 safe to call from a signal handler because they are either reentrant or
 can't be interrupted by a signal. A notable missing entry in the list is
 `malloc()`.
@@ -120,8 +121,8 @@ another program (but the devil is in the details as demonstrated by the
 logging call issue...).

 One of the approaches often proposed for working around this mine-field is
-to use an auxiliary, small, process to execute any command needed by the
-main one. The small process can just use fork() with no performance
+to use an auxiliary small process to execute any command needed by the main
+one. The small process can just use +fork()+/+exec()+ with no performance
 issues. This has the drawback of complicating communication a lot if
 data needs to be transferred one way or another.

@@ -164,28 +165,54 @@ descriptors bigger than a specified value (closefrom() equivalent). This is
 available on Solaris and quite necessary in fact, because we have no way to
 be sure that all open descriptors have the CLOEXEC flag set.

-12500 small .doc files:
+So, no `posix_spawn()` for us (support was implemented inside
+*recollindex*, but the code is normally not used).

-fork:  real 0m46.025s user 0m26.574s sys 0m39.494s
-vfork: real 0m18.223s user 0m17.753s sys 0m1.736s
-spawn/fork: real 0m45.726s user 0m27.082s sys 0m40.575s
-spawn/vfork: real 0m18.915s user 0m18.681s sys 0m3.828s
+== The chosen solution

-No surprise here, given the implementation of posix_spawn(), it gets the
-same times as the fork/vfork options.
+The previous version of +recollindex+ used +vfork()+ when running a
+single thread, and +fork()+ when running several.

-It is difficult to ignore the 60% reduction in execution time offered by
-using 'vfork()'.
+After another careful look at the code, I could see few issues with
+using +vfork()+ in the multithreaded indexer, so this was committed. 

+The only change necessary was to get rid of an implementation of the
+missing Linux +closefrom()+ call (used to close all open descriptors above a
+given value). The previous Recoll implementation listed the +/proc/self/fd+
+directory to look for open descriptors, but this was unsafe because of
+possible memory allocations in +opendir()+ etc.
+
+== Test results
+
+.Indexing 12500 small .doc files 
+[options="header"]
+|================================
+|call        |real      |user      |sys
+|fork        |0m46.025s |0m26.574s |0m39.494s
+|vfork       |0m18.223s |0m17.753s |0m1.736s
+|spawn/fork  |0m45.726s |0m27.082s |0m40.575s
+|spawn/vfork |0m18.915s |0m18.681s |0m3.828s
+|recoll 1.18 |1m47.589s |0m21.537s |0m29.458s
+|================================
+
+No surprise here: given the implementation of +posix_spawn()+, it gets the
+same times as the +fork()+/+vfork()+ options.
+
+The tests were performed on an Intel Core i5 750 (4 cores, 4 threads).
+
+The last line is just for fun: *recollindex* 1.18 (single-threaded)
+needed almost 6 times as long to process the same files... 
+
+It would be painful to play it safe and discard the 60% reduction in
+execution time offered by using +vfork()+.
+
+To this day, no problems have been discovered, but I am still crossing my fingers...
+
+////
 Objections to vfork: 
-  ld.so locks
  sigaction locks
-
 https://bugzilla.redhat.com/show_bug.cgi?id=193631
-
 Is Linux vfork thread-safe ? Quoting interesting comments from Solaris
-implementation: 
-No answer to the issues cited though.
-
+implementation: No answer to the issues cited though.
 https://sourceware.org/bugzilla/show_bug.cgi?id=378
-Use vfork() in posix_spawn()
+////
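
The constraints described in the forkingRecoll.txt changes above can be illustrated with a minimal C sketch. It is not Recoll code: `run_with_vfork` and its `maxfd` parameter are hypothetical names, and the brute-force `close()` loop is only an assumption about how a `closefrom()` substitute can avoid the allocations made by `opendir()`. Strict POSIX permits only `_exit()` or an `exec` call after `vfork()`, so, like the text, this sketch relies on behaviour observed on Linux.

```c
#define _DEFAULT_SOURCE   /* make vfork() visible under strict -std modes */
#include <sys/types.h>
#include <sys/wait.h>
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper (not taken from Recoll): run the program at `path`
 * with `argv` using vfork()/execv(). Returns the command's exit status,
 * 127 if the exec failed, or -1 on any other error. */
int run_with_vfork(const char *path, char *const argv[], int maxfd)
{
    pid_t pid = vfork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: it runs in the parent's address space, so it makes no
         * malloc() or logging calls and touches no parent data. */
        for (int fd = 3; fd < maxfd; fd++)
            close(fd);          /* assumed closefrom(3) substitute */
        execv(path, argv);
        _exit(127);             /* exec failed: leave without cleanup */
    }

    /* Parent: the calling thread resumes once the child has called
     * exec or _exit. Collect the exit status, retrying on EINTR. */
    int status;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A caller would compute `maxfd` before the `vfork()` call (for example from `sysconf(_SC_OPEN_MAX)`), so that nothing beyond `close()`, `execv()` and `_exit()` runs in the child.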
--- a/website/release-1.21.html
+++ b/website/release-1.21.html
@@ -0,0 +1,88 @@
+<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
+<html>
+  <head>
+    <title>Recoll 1.21 series release notes</title>
+    <meta name="Author" content="Jean-Francois Dockes">
+    <meta name="Description"
+          content="recoll is a simple full-text search system for unix and linux     based on the powerful and mature xapian engine">
+    <meta name="Keywords" content="full text search, desktop search, unix, linux">
+    <meta http-equiv="Content-language" content="en">
+    <meta http-equiv="content-type" content="text/html; charset=UTF-8">
+    <meta name="robots" content="All,Index,Follow">
+    <link type="text/css" rel="stylesheet" href="styles/style.css">
+  </head>
+
+  <body>
+
+    <div class="rightlinks">
+      <ul>
+        <li><a href="index.html">Home</a></li>
+        <li><a href="download.html">Downloads</a></li>
+        <li><a href="doc.html">Documentation</a></li>
+      </ul>
+    </div>
+
+    <div class="content">
+      <h1>Release notes for Recoll 1.21.x</h1>
+
+      <h2>Caveats</h2>
+
+      <p><em>Installing over an older version</em>: 1.19 </p>
+
+      <p>1.20 and 1.21 indexes are fully compatible. Installing 1.21
+        over a 1.19 index is possible, but there have been small
+        changes in the way compound words (e.g. email addresses) are
+        indexed, so it will be best to reset the index. Still, in a
+        pinch, 1.21 search can mostly use a 1.19 index. </p>
+
+      <p>Always reset the index if you do not know by which version it
+        was created (you're not sure it's at least 1.18). The best method
+        is to quit all Recoll programs and delete the index directory 
+        (<span class="literal">
+          rm -rf ~/.recoll/xapiandb</span>), then start <code>recoll</code>
+        or <code>recollindex</code>. <br> 
+
+        <span class="literal">recollindex -z</span> will do the same
+        in most, but not all, cases. It's better to use
+        the <tt>rm</tt> method, which will also ensure that no debris
+        from older releases remain (e.g.: old stemming files which are
+        not used any more).</p>
+
+      <p>Case/diacritics sensitivity is off by default. It can be
+        turned on <em>only</em> by editing
+        recoll.conf (
+        <a href="usermanual/usermanual.html#RCL.INDEXING.CONFIG.SENS">
+          see the manual</a>). If you do so, you must then reset the
+        index.</p> 
+
+
+      <h2>Changes in Recoll 1.21</h2>
+
+      <ul>
+        <li>Allow saving queries to files and reloading them
+          later. Available both for simple and advanced queries, and
+          based on XML files.</li>
+	<li>A Bison-based query parser replaces the old regexp-based
+	  one and allows parenthesized sub-expressions and easier future
+	  expansions.</li>  
+        <li>The GUI gets a "close to system tray" function.</li>
+        <li>Avoid retrying to index previously indexed files if
+          nothing seems to have changed in the filters.</li> 
+        <li>Improve indexing speed by always using vfork() for
+          spawning external commands.</li> 
+        <li>The pdf filter gains the capability to run OCR (tesseract) on
+          image-only files.</li> 
+        <li>Improved check about when we should try to uncompress
+          stuff. Will eliminate some of the most dreadful cases of
+          recollindex having an impact on system performance.</li>
+        <li>Warn if non-existent paths are listed in the configuration
+          file (help with typos).</li>
+        <li>Adjust background color for webkit-based elements (result
+          list and snippets window) according to desktop setup.</li>
+        <li>Listing the results with the KIO slave is now
+          performed with incremental updates. Bumped max entries to
+          10000.</li>
+      </ul>
+    </div>
+  </body>
+</html>
--- a/website/release-history.html
+++ b/website/release-history.html
@@ -25,89 +25,99 @@
    <div class="content">
      <h1>Major Recoll releases at a glance</h1>

-	  <p>A summary of the major releases and the main features which
-		came with them.</p>
+      <p>A summary of the major releases and the main features which
+	came with them.</p>
+
+      <dl>
+	<dt><a href="release-1.21.html">Release 1.21</a> (future): new
+	  query parser</dt> 
+	<dd>
+	  <ul>
+	    <li>A Bison-based query parser replaces the old
+	      regexp-based one and allows parenthesized
+	      sub-expressions and easier future
+	      expansions.</li> 
+            <li>Avoid retrying to index previously
+              indexed files if nothing seems to have
+              changed in the filters.</li>
+            <li>Allow saving queries to files and reloading them
+              later. Available both for simple and advanced queries, and
+              based on XML files.</li>
+            <li>Improve indexing speed by always using
+              vfork() for spawning external commands.</li>
+            <li>GUI gets "close to system tray" function.</li>
+	  </ul>
+	</dd>
+
+	<dt><a href="release-1.20.html">Release 1.20</a>: small
+	  improvements</dt> 
+	<dd>
+	  <ul>
+	    <li><i>Open With</i> results list popup menu entry.</li>
+	    <li><i>fieldname:term1,term2</i>
+	      and <i>fieldname:term1/term2</i> shortcuts for AND/OR
+	      searches inside fields.</li>
+	    <li><i>Query fragments</i> tool.</li>
+	    <li>Better handling of compound terms like mail
+	      addresses.</li>
+	    <li>Selection on source collection type (Web history / File
+	      system).</li>
+	    <li>Configurable GUI geometry.</li>  
+	    <li>Different handling
+	      of container file / subdocuments file name searches.</li>
+	    <li>Simultaneous -e -i options to recollindex.</li>
+	  </ul>
+	</dd>
+
+        <dt><a href="release-1.19.html">Release 1.19</a>: multithreaded
+          indexing</dt> 
+        <dd>
+          <ul>
+            <li>Better indexing performance through
+              multithreading.</li>
+            <li>Display list of subdocuments (e.g. attachments) for a
+              given result.</li>
+            <li>Collapsed duplicate results display link.</li>
+            <li>Path translation facility (for portable indexes).</li>
+            <li>Caches last uncompressed file (e.g. for fast
+              compressed mbox access).</li>
+            <li>Partial recursive reindex option to command line
+              indexer.</li>
+            <li>Can import tags from external application.</li>
+            <li>Extended attributes indexing is on by default.</li>
+            <li>New Python interface for data access. API re-modeled against
+              newer Python Database API 2.0.</li>
+            <li>Shared librecoll.so.</li>
+          </ul>
+        </dd>
+
+        <dt><a href="release-1.18.html">Release 1.18</a>: case and
+          diacritics switchable sensitivity</dt> 
+        <dd>
+          <ul>
+            <li>Index configuration for case and diacritics sensitivity.</li> 
+            <li>Advanced search history.</li>
+            <li>Page-level access when opening PDFs, and snippets
+              window.</li>
+            <li>Use Xapian Synonyms tables for query expansion.</li>
+          </ul>
+        </dd>
+
+        <dt><a href="release-1.17.html">Release 1.17</a>: small
+          improvements</dt> 
+        <dd>
+          <ul>
+            <li>Language-dependent unaccenting.</li>
+            <li>GUI dialogs for indexing schedule setup.</li>
+            <li>Phrase-based <i>dir:</i> filtering, accepting path
+              fragments. Size filtering.</li> 
+            <li>Python module default install and Unity Lens.</li>
+            <li>Result list switched to WebKit: drops qt3 support.</li>
+            <li>Indexing always performed by separate process.</li>
+            <li>Dynamic category filters (defined as language fragments).</li>
+          </ul>
+        </dd>

-	  <dl>
-		<dt><a href="release-1.21.html">Release 1.21</a> (future): new
-		  query parser</dt> 
-		<dd>
-		  <ul>
-			<li>Bison-based query parser replaces old regexp-based
-			  one and allows parenthized sub-expressions and easier
-			  future expansions.</li>
-                        <li>Avoid retrying to index previously
-                          indexed files if nothing seems to have
-                          changed in the filters.</li>
-		  </ul>
-		</dd>
-		<dt><a href="release-1.20.html">Release 1.20</a>: small
-		  improvements</dt> 
-		<dd>
-		  <ul>
-		  <li><i>Open With</i> results list popup menu entry.</li>
-		  <li><i>fieldname:term1,term2</i>
-			and <i>fieldname:term1/term2</i> shortcuts for AND/OR
-			searches inside fields.</li>
-		  <li><i>Query fragments</i> tool.</li>
-		  <li>Better handling of compound terms like mail
-		  addresses.</li>
-		  <li>Selection on source collection type (Web history / File
-			system).</li>
-		  <li>Configurable GUI geometry.</li>  
-		  <li>Different handling
-			of container file / subdocuments file name searches.</li>
-		  <li>Simultaneous -e -i options to recollindex.</li>
-		  </ul>
-		</dd>
-		</dd>
-		<dt><a href="release-1.19.html">Release 1.19</a>: multithreads
-		  indexing</dt> 
-		<dd>
-		  <ul>
-			<li>Better indexing performance through
-			  multithreading.</li>
-			<li>Display list of subdocuments (e.g. attachments) for a
-			  given result.</li>
-			<li>Collapsed duplicate results display link.</li>
-			<li>Path translation facility (for portable indexes).</li>
-			<li>Caches last uncompressed file (e.g. for fast
-			  compressed mbox access).</li>
-			<li>Partial recursive reindex option to command line
-			indexer.</li>
-			<li>Can import tags from external application.</li>
-			<li>Extended attributes indexing is on by default.</li>
-			<li>New Python interface for data access. API re-modeled against
-			  newer Python Database API 2.0.</li>
-			<li>Shared librecoll.so.</li>
-		  </ul>
-		</dd>
-		<dt><a href="release-1.18.html">Release 1.18</a>: case and
-		  diacritics switchable sensitivity</dt> 
-		<dd>
-		  <ul>
-			<li>Index configuration for case and diacritics sensitivity.</li> 
-			<li>Advanced search history.</li>
-			<li>Page-level access when opening PDFs, and snippets
-			  window.</li>
-			<li>Use Xapian Synonyms tables for query expansion.</li>
-		  </ul>
-		</dd>
-		<dt><a href="release-1.17.html">Release 1.17</a>: small
-		  improvements</dt> 
-		<dd>
-		  <ul>
-			<li>Language-dependant unaccenting.</li>
-			<li>GUI dialogs for indexing schedule setup.</li>
-			<li>Phrase-based <i>dir:</i> filtering, accepting path
-			fragments. Size filtering.</li> 
-			<li>Python module default install and Unity Lens.</li>
-			<li>Result list switched to WebKit: drops qt3 support.</li>
-			<li>Indexing always performed by separate process.</li>
-			<li>Dynamic category filters (defined as language fragments).</li>
-		  </ul>
-		</dd>
-		
    </div>
  </body>
 </html>