2015-06-15 17:03:38 +02:00 · 2015-06-15 17:03:38 +02:00 · f19f790b36
commit f19f790b36
parent aa4cf84e60
2 changed files with 329 additions and 0 deletions
--- a/website/idxthreads/forkingRecoll.txt
+++ b/website/idxthreads/forkingRecoll.txt
@@ -0,0 +1,191 @@
+= Recoll command execution performance
+:Author: Jean-François Dockès
+:Email: jfd@recoll.org
+:Date: 2015-05-22
+
+== Abstract
+
+== Introduction
+
+Recoll is a big process which executes many others, mostly for extracting
+text from documents. Some of the executed processes are quite short-lived,
+and the time used by the process execution machinery can actually dominate
+the time used to translate data. This document explores possible approaches
+to improving performance without adding excessive complexity or damaging
+reliability.
+
+Studying fork/exec performance is not exactly a new venture, and there are
+many texts which address the subject. While researching, though, I found
+out that not so many were accurate and that a lot of questions were left as
+an exercise to the reader.
+
+This document lists the references I found reliable and interesting, and
+describes the solution chosen along with the other possible approaches.
+
+== Issues with fork
+
+The traditional way for a Unix process to start another is the
+fork()/exec() system call pair. The initial fork() duplicates the address
+space and resources (open files etc.) of the first process, then duplicates
+the thread of execution, ending up with 2 mostly identical processes.
+exec() then replaces part of the newly executing process with an address space
+initialized from an executable file, inheriting some of the old assets
+under various conditions.
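
As a minimal sketch of the call pair described above (an illustration, not the Recoll code), the hypothetical helper `run_command()` below forks, replaces the child with the requested program, and returns its exit status to the parent. The helper name and the choice of `execvp()` are assumptions made for the example.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <cassert>
#include <cerrno>
#include <string>
#include <vector>

// Hypothetical helper: run args[0] with the given arguments through
// fork()/exec(). Returns the exit status of the command, or -1 if it could
// not be started or did not exit normally.
int run_command(const std::vector<std::string>& args)
{
    assert(!args.empty());  // precondition: at least the command name

    // Build the argv array before forking, so that nothing is left to
    // allocate in the child.
    std::vector<char*> argv;
    for (const std::string& a : args)
        argv.push_back(const_cast<char*>(a.c_str()));
    argv.push_back(nullptr);
    char* const* av = argv.data();

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        // Child: a duplicate of the parent, about to be replaced.
        execvp(av[0], av);
        // Only reached if exec failed. 127 is the shell convention for
        // "command not found".
        _exit(127);
    }

    // Parent: wait for the child and decode its status.
    int status = 0;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A call such as `run_command({"/bin/sh", "-c", "exit 3"})` returns the exit status of the shell.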
+
+As processes became bigger the copying-before-discard operation wasted
+significant resources, and was optimized using two methods (at very
+different points in time):
+
+ - The first approach was to supplement fork() with the vfork() call, which
+   is similar but does not duplicate the address space: the new process
+   thread executes in the old address space. The old thread is blocked
+   until the new one calls exec() and frees up access to the memory
+   space. Any modification performed by the child thread persists when
+   the old one resumes.
+
+ - The more modern approach, which coexists with vfork(), was to replace
+   the full duplication of the memory space with duplication of the page
+   descriptors only. The pages in the new process are marked copy-on-write
+   so that the new process has write access to its memory without
+   disturbing its parent. The problem with this approach is that the
+   operation can still be a significant resource consumer for big processes
+   mapping a lot of memory. Many processes can fall in this category not
+   because they have huge data segments, but just because they are linked
+   to many shared libraries.
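
The first approach in the list above can be sketched as follows. This is an illustration under stated assumptions, not the Recoll implementation: the helper name `run_command_vfork()` is hypothetical, and it takes a full path because it uses `execv()`. The point is that everything is prepared before `vfork()`, and the child does nothing except call exec or `_exit()`.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <cassert>
#include <cerrno>
#include <string>
#include <vector>

// Hypothetical helper: same contract as a fork()-based launcher, but the
// child borrows the parent's address space until it calls exec.
int run_command_vfork(const std::vector<std::string>& args)
{
    assert(!args.empty());  // precondition: args[0] is the program path

    // All memory needed by the child is set up here, in the parent.
    std::vector<char*> argv;
    for (const std::string& a : args)
        argv.push_back(const_cast<char*>(a.c_str()));
    argv.push_back(nullptr);
    char* const* av = argv.data();

    pid_t pid = vfork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        // Child: the calling parent thread is suspended and we run on its
        // memory. No allocation, no return from this function, no exit():
        // only exec or _exit().
        execv(av[0], av);
        _exit(127);
    }

    // Parent: resumes only once the child has called exec or _exit().
    int status = 0;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```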
+
+NOTE: Orders of magnitude: a *recollindex* process will easily grow into a
+few hundred megabytes of virtual space. It executes the small and
+efficient *antiword* command to extract text from *ms-word* files. While
+indexing multiple such files, *recollindex* can spend '60% of its CPU time'
+doing `fork()`/`exec()` housekeeping instead of useful work (this is on Linux,
+where `fork()` uses copy-on-write).
+
+Apart from the performance cost, another issue with fork() is that a big
+process can fail to execute a small command because of the temporary need to
+allocate twice its address space. This is a much discussed subject which we
+will leave aside because it generally does not concern *recollindex*, which
+in typical conditions uses a small portion of the machine virtual memory,
+so that a temporary doubling is not an issue.
+
+The Recoll indexer is multithreaded, which may introduce other issues. Here
+is what happens to threads during the fork()/exec() interval:
+
+ - fork():
+   * The parent process threads all go on their merry way.
+   * The child process is created with only one thread active, duplicated
+     from the one which called fork()
+ - vfork()
+   * The parent process thread calling vfork() is suspended, the others
+     are unaffected.
+   * The child is created with only one thread, as for fork(). 
+     This thread shares the memory space with the parent ones, without
+     having any means to synchronize with them (pthread locks are not
+     supposed to work across processes): caution needed !
+
+NOTE: for a multithreaded program using the classical pipe method to
+communicate with children, the sequence between the `pipe()` call and the
+parent `close()` of the unused side is a candidate for a critical section:
+if several threads can interleave in there, child processes may inherit
+descriptors which 'belong' to other `fork()`/`exec()` operations. This may
+or may not be a problem, depending on how descriptor cleanup is
+performed in the child (if no cleanup is performed, pipes may remain open
+at both ends, which will prevent seeing EOFs etc.). Thanks to StackExchange
+user Celada for explaining this to me.
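
One possible way to treat the `pipe()` to `close()` sequence from the note above as a critical section is sketched below. The process-wide mutex `g_spawn_mutex` and the helper `run_and_capture()` are hypothetical names introduced for the example, and the sketch assumes that every thread which forks goes through the same lock.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <cassert>
#include <cerrno>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical process-wide lock serializing descriptor creation and fork().
static std::mutex g_spawn_mutex;

// Run a command and return what it wrote to its standard output.
std::string run_and_capture(const std::vector<std::string>& args)
{
    assert(!args.empty());
    std::vector<char*> argv;
    for (const std::string& a : args)
        argv.push_back(const_cast<char*>(a.c_str()));
    argv.push_back(nullptr);
    char* const* av = argv.data();

    int fds[2];
    pid_t pid;
    {
        // Critical section: between pipe() and the parent close() of the
        // write side, no other thread using this lock can fork and inherit
        // the write descriptor.
        std::lock_guard<std::mutex> lock(g_spawn_mutex);
        if (pipe(fds) < 0)
            return std::string();
        pid = fork();
        if (pid == 0) {
            // Child: stdout becomes the pipe, then both original
            // descriptors are dropped before exec.
            dup2(fds[1], STDOUT_FILENO);
            close(fds[0]);
            close(fds[1]);
            execvp(av[0], av);
            _exit(127);
        }
        close(fds[1]);
        if (pid < 0) {
            close(fds[0]);
            return std::string();
        }
    }

    // EOF is only seen once every copy of the write side has been closed.
    std::string out;
    char buf[4096];
    for (;;) {
        ssize_t n = read(fds[0], buf, sizeof(buf));
        if (n > 0)
            out.append(buf, static_cast<size_t>(n));
        else if (n == 0 || errno != EINTR)
            break;
    }
    close(fds[0]);

    int status = 0;
    while (waitpid(pid, &status, 0) < 0 && errno == EINTR) {
    }
    return out;
}
```

On Linux, creating the pipe with `pipe2()` and the `O_CLOEXEC` flag is another way to keep stray descriptors out of unrelated children, at the cost of portability.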
+
+For multithreaded programs, both fork() and vfork() introduce possibilities
+of deadlock: resources (such as locks) held by a non-forking thread in the
+parent process can never be released in the child, because that thread is
+not duplicated. This used to happen from time to time in *recollindex* because
+of an error logging call performed if the exec() failed after the fork()
+(e.g. command not found).
+
+With vfork() it is also possible to trigger a deadlock in the parent by
+(inadvertently) modifying data in the child. This could happen just
+link:http://www.oracle.com/technetwork/server-storage/solaris10/subprocess-136439.html[because
+of dynamic linker operation] (which, seriously, should be considered a
+system bug).
+
+
+In general, the state of program data in the child process is a semi-random
+snapshot of what it was in the parent, and the official word about what you
+can do is that you can only call
+link:http://man7.org/linux/man-pages/man7/signal.7.html[async-signal-safe library
+functions] between 'fork()' and 'exec()'. These are functions which are
+safe to call from a signal handler because they are either reentrant or
+can't be interrupted by a signal. A notable missing entry in the list is
+`malloc()`.
+
+These are normally not issues for programs which only fork to execute
+another program (but the devil is in the details as demonstrated by the
+logging call issue...).
+
+One of the approaches often proposed for working around this minefield is
+to use an auxiliary, small, process to execute any command needed by the
+main one. The small process can just use fork() with no performance
+issues. This has the drawback of complicating communication a lot if
+data needs to be transferred one way or another.
+
+////
+Passing descriptors around
+http://stackoverflow.com/questions/909064/portable-way-to-pass-file-descriptor-between-different-processes
+http://www.normalesup.org/~george/comp/libancillary/
+http://stackoverflow.com/questions/28003921/sending-file-descriptor-by-linux-socket/
+
+The process would then be:
+ - Tell slave to fork/exec cmd (issue with cmd + args format)
+ - Get fds
+ - Tell slave to wait, recover status.
+////
+
+== The posix_spawn() Linux non-event
+
+Given the performance issues of `fork()` and tricky behaviour of `vfork()`,
+a "simpler" method for starting a child process was introduced by POSIX:
+`posix_spawn()`.
+
+The `posix_spawn()` function is a black box, externally equivalent to a
+`fork()`/`exec()` sequence, and has parameters to specify the usual
+house-keeping performed at this time (file descriptors and signals
+management etc.). Hiding the internals gives the system a chance to
+optimize the performance and avoid `vfork()` pitfalls like the `ld.so`
+lockup described in the Oracle article.
+
+The Linux posix_spawn() is implemented by a `fork()`/`exec()` pair by default.
+
+`vfork()` is used either if specified by an input flag or if no
+signal/scheduler/process_group changes are requested. There must be a
+reason why signal handling changes would preclude `vfork()` usage, but I
+could not find it (signal handling data is stored in the kernel task_struct).
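
The interface discussed above can be sketched as follows. The helper `spawn_and_wait()` is a hypothetical name, the redirection of standard input from `/dev/null` is only there to show a file action, and `POSIX_SPAWN_USEVFORK` is a GNU extension which other C libraries may lack or ignore (hence the `#ifdef`). What happens when the command cannot be executed differs between implementations, so the sketch reports either -1 or 127 in that case.

```cpp
#include <fcntl.h>
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <cassert>
#include <cerrno>
#include <string>
#include <vector>

extern "C" char** environ;

// Hypothetical helper: start a command with posix_spawnp() and return its
// exit status, or -1 if it could not be started.
int spawn_and_wait(const std::vector<std::string>& args)
{
    assert(!args.empty());
    std::vector<char*> argv;
    for (const std::string& a : args)
        argv.push_back(const_cast<char*>(a.c_str()));
    argv.push_back(nullptr);

    // File actions: housekeeping performed in the child before exec.
    // Here: the child reads its standard input from /dev/null.
    posix_spawn_file_actions_t actions;
    posix_spawn_file_actions_init(&actions);
    posix_spawn_file_actions_addopen(&actions, STDIN_FILENO, "/dev/null",
                                     O_RDONLY, 0);

    posix_spawnattr_t attr;
    posix_spawnattr_init(&attr);
#ifdef POSIX_SPAWN_USEVFORK
    // GNU extension: explicitly ask for the vfork()-based implementation.
    posix_spawnattr_setflags(&attr, POSIX_SPAWN_USEVFORK);
#endif

    pid_t pid;
    // posix_spawnp() returns an error number instead of setting errno.
    int err = posix_spawnp(&pid, argv[0], &actions, &attr, argv.data(),
                           environ);
    posix_spawn_file_actions_destroy(&actions);
    posix_spawnattr_destroy(&attr);
    if (err != 0)
        return -1;

    int status = 0;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Note that the file actions can open, close or duplicate individual descriptors, but the sketch has no portable way to close every descriptor above a given value, which is the limitation discussed in this section.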
+
+The Linux glibc `posix_spawn()` currently does nothing that user code could
+not do. Still, using it would probably be a good future-proofing idea, but
+for a significant problem: there is no way to specify closing all open
+descriptors bigger than a specified value (closefrom() equivalent). This is
+available on Solaris and quite necessary in fact, because we have no way to
+be sure that all open descriptors have the CLOEXEC flag set.
+
+12500 small .doc files:
+
+[options="header"]
+|===
+|Method |real |user |sys
+|fork |0m46.025s |0m26.574s |0m39.494s
+|vfork |0m18.223s |0m17.753s |0m1.736s
+|spawn/fork |0m45.726s |0m27.082s |0m40.575s
+|spawn/vfork |0m18.915s |0m18.681s |0m3.828s
+|===
+
+No surprise here: given the implementation of posix_spawn(), it gets the
+same times as the fork/vfork options.
+
+It is difficult to ignore the 60% reduction in execution time (from about 46
+to about 18 seconds of real time) offered by using 'vfork()'.
+
+Objections to vfork: 
+  ld.so locks
+  sigaction locks
+
+https://bugzilla.redhat.com/show_bug.cgi?id=193631
+
+Is Linux vfork thread-safe ? Quoting interesting comments from Solaris
+implementation: 
+No answer to the issues cited though.
+
+https://sourceware.org/bugzilla/show_bug.cgi?id=378
+Use vfork() in posix_spawn()
--- a/website/idxthreads/xapDocCopyCrash.txt
+++ b/website/idxthreads/xapDocCopyCrash.txt
@@ -0,0 +1,138 @@
+= The case of the bad Xapian::Document copy
+
+== How things were supposed to work
+
+Coming from the link:threadingRecoll.html[threading *Recoll*] page, 
+you may remember that the third stage of the
+processing pipeline breaks up text into terms, producing a *Xapian*
+document (+Xapian::Document+) which is finally processed by the last stage,
+the index updater. 
+
+What happens in practice is that the main routine in this stage has a local
++Xapian::Document+ object, automatically allocated on the stack, which it
+updates appropriately and then copies into a task object which is placed on
+the input queue for the last stage.
+
+The text-splitting routine then returns, and its local +Xapian::Document+
+object is (implicitly) deleted while the stack unwinds.
+
+The idea is that the *copy* of the document which is on the queue should be
+unaffected: it is independent of the original and will further be processed
+by the index update thread, without interaction with the text-splitting one.
+
+At no point do multiple threads access the +Xapian::Document+ data, so
+there should be no problem.
+
+== The problem 
+
+Most *Xapian* objects are reference-counted, which means that the object
+itself is a small block of house-keeping variables. The actual data is
+allocated on the heap through calls to new/malloc as needed, and is shared
+by multiple copies of the object. This is the case for +Xapian::Document+.
+
+This is abundantly documented, and users are encouraged to use copies
+instead of passing pointers around (copies are cheap because only a small
+block of auxiliary data is actually duplicated). This in general makes
+memory management easier.
+
+This is well-known, and it would not appear to be a problem in the above
+case as the +Xapian::Document+ actual data is never accessed by multiple
+threads.
+
+The problem is that the reference counter which keeps track of the object
+usage and triggers actual deletion when it goes to zero is accessed by two
+threads:
+
+ - It is decremented while the first local object is destroyed during the
+   stack unwind in the first thread
+ - It is also updated by the last stage thread, incremented if copies are
+   made, then decremented until it finally goes down to 0 when we are done
+   with the object, at which point the document data is deallocated.
+
+As the counter is not protected in any way against concurrent access, the
+actual sequence of events is undefined and at least two kinds of problems
+may occur: double deletion of the data, or accesses to already freed heap
+data (potentially trashing other threads' allocations, or reading modified
+data).
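
The mechanism can be illustrated with a small stand-in for a reference-counted handle. This is a sketch written for this explanation, not Xapian code: `Payload`, `Handle` and `stress_refcount()` are hypothetical names. The counter is a `std::atomic<int>`, which makes the increments and decrements from the two threads safe; with a plain `int` in its place, the same program has a data race of the kind described above.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical stand-in for the shared document data.
struct Payload {
    // Number of live handles pointing to this payload.
    std::atomic<int> refs{1};
};

// Counts how many times a payload was actually deleted.
static std::atomic<int> g_deletions{0};

// Minimal reference-counted handle: copies are cheap and share the payload.
class Handle {
public:
    Handle() : m_data(new Payload) {}
    Handle(const Handle& other) : m_data(other.m_data) {
        m_data->refs.fetch_add(1);
    }
    Handle& operator=(const Handle&) = delete;
    ~Handle() {
        // fetch_sub() returns the previous value: 1 means this was the
        // last handle, which is the one allowed to delete the payload.
        if (m_data->refs.fetch_sub(1) == 1) {
            delete m_data;
            g_deletions.fetch_add(1);
        }
    }

private:
    Payload* m_data;
};

// Two threads repeatedly copy and destroy handles to the same payload.
// Returns the number of times the payload was deleted.
int stress_refcount(int iterations)
{
    assert(iterations > 0);
    g_deletions = 0;
    {
        Handle original;
        auto worker = [&original, iterations]() {
            for (int i = 0; i < iterations; i++) {
                Handle copy(original);  // incremented here, decremented
                                        // when copy goes out of scope
            }
        };
        std::thread t1(worker);
        std::thread t2(worker);
        t1.join();
        t2.join();
    }  // original is destroyed here: the count finally reaches zero
    return g_deletions.load();
}
```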
+
+A relatively simple fix for this would be to use atomic increment and decrement
+operations for the counter (which is what the GNU +std::string+ does). But
+the choice made by *Xapian* to let the application deal with all
+synchronization issues is legitimate and documented, nothing to complain
+about here. I just goofed.
+
+Because the counter test and update operations are very fast, and occur
+among a lot of processing from the final stage thread, the chances of
+concurrent access are low, which is why the problem manifests itself very
+rarely. Because it depends on thread scheduling and all manner of semi-random
+conditions, it is basically impossible to reproduce reliably.
+
+== The fix
+
+The implemented fix was trivial: the upstream thread allocates the initial
++Xapian::Document+ on the heap, copies the pointer to the queue object, and
+forgets about it. The index-updating thread uses the object then
++delete+'s it. Real easy.
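
The ownership hand-off can be sketched as below. This is an illustration, not the Recoll code: `Doc` is a hypothetical stand-in for the Xapian document, `DocQueue` and `run_pipeline()` are made-up names, and the sketch uses `std::unique_ptr` where the text above describes a raw pointer and an explicit delete. A null pointer serves as the end-of-input marker, which is also an assumption of the example.

```cpp
#include <cassert>
#include <condition_variable>
#include <memory>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

// Hypothetical stand-in for the document built by the text splitter.
struct Doc {
    std::string text;
};

// Minimal queue between the two stages. Only pointers travel through it.
class DocQueue {
public:
    void put(std::unique_ptr<Doc> doc) {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_queue.push(std::move(doc));
        }
        m_cond.notify_one();
    }
    std::unique_ptr<Doc> take() {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_cond.wait(lock, [this] { return !m_queue.empty(); });
        std::unique_ptr<Doc> doc = std::move(m_queue.front());
        m_queue.pop();
        return doc;
    }

private:
    std::mutex m_mutex;
    std::condition_variable m_cond;
    std::queue<std::unique_ptr<Doc>> m_queue;
};

// Push ndocs documents through the queue. Returns the number of documents
// seen by the updating thread.
int run_pipeline(int ndocs)
{
    assert(ndocs >= 0);
    DocQueue queue;
    int indexed = 0;
    std::thread updater([&queue, &indexed] {
        // A null pointer marks the end of input.
        while (std::unique_ptr<Doc> doc = queue.take()) {
            indexed++;
            // doc is deleted at the end of each iteration, by the only
            // thread which still holds it.
        }
    });
    for (int i = 0; i < ndocs; i++) {
        std::unique_ptr<Doc> doc(new Doc);  // allocated on the heap
        doc->text = "document " + std::to_string(i);
        queue.put(std::move(doc));  // the upstream stage forgets about it
    }
    queue.put(nullptr);
    updater.join();
    return indexed;
}
```

Because the upstream stage never keeps a copy of the object, no destructor runs on its side while the stack unwinds, and the object is only ever touched by one thread at a time.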
+
+An alternative solution would have been to try and use locking to protect
+the counter updates. The only place where such locking operations could
+reasonably occur is inside the +Xapian::Document+ refcounted pointer
+object, which we can't modify. Otherwise, we would have to protect the
+_whole scopes of existence_ of the +Xapian::Document+ object in any routine
+which creates/copies or (implicitly) deletes it, which would cause many
+problems and/or contention issues.
+
+== Why did I miss this ?
+
+The mechanism of the crashes is simple enough, quasi-obvious. 
+How on earth could I miss this problem while writing the code ? 
+
+For the sake of anecdote, my first brush with atomicity for updates of
+reference counters was while debugging a System V release 4 kernel VFS file
+system module, at the time when SVR4 got a preemptive kernel with SVR4-MP,
+circa 1990... I ended up replacing a +counter+++ with +atomic_add()+ after
+a set of _interesting_ debugging sessions interspersed with kernel crashes
+and +fsck+ waits. This should have left some memories. So what went wrong ?
+Here follows a list of possible reasons:
+
+- Reasoning by analogy: +std::string+ objects are safe to use in this way. The
+  other objects used in the indexing pipe are also safe. I just used
+  +Xapian::Document+ in the same way without thinking further.
+- Probably not how I would do it: faced with designing +Xapian::Document+
+  (not that I'd be clever enough to do this anyway), I'd probably conclude
+  that not wanting to deal with full-on concurrency is one thing, but that
+  not protecting the reference counters is another, and goes too far.
+- The problem was not so easily visible because the object deletion is
+  implicitly performed during the stack unwind: this provides no clue, no
+  specific operation to think about.
+- Pure laziness.
+
+
+As a conclusion, a humble request to library designers: when an
+interface works counter to the reasonable expectations of at least some of
+the users (for example because it looks like a standard library interface
+but works differently), it is worth being very specific in the
+documentation and header file comments about the gotchas. Saving people
+from their own deficiencies is a worthy goal.
+
+Here, a simple statement that the reference count was not mt-safe
+(admittedly redundant with the general statement that the *Xapian* library
+does not deal with threads), would have got me thinking and avoided the
+error.
+
++++
+      <h2 id="comments">Comments</h2>
+
+      <div id="disqus_thread"></div>
+      <script type="text/javascript">
+        var disqus_shortname = 'lesbonscomptes'; 
+        (function() {
+            var dsq = document.createElement('script'); dsq.type = 'text/javascript'; dsq.async = true;
+            dsq.src = '//' + disqus_shortname + '.disqus.com/embed.js';
+            (document.getElementsByTagName('head')[0] || document.getElementsByTagName('body')[0]).appendChild(dsq);
+        })();
+      </script>
+      <noscript>Please enable JavaScript to view the <a href="http://disqus.com/?ref_noscript">comments powered by Disqus.</a></noscript>
+      <a href="http://disqus.com" class="dsq-brlink">comments powered by <span class="logo-disqus">Disqus</span></a>
+
++++