= Recoll command execution performance
:Author: Jean-François Dockès
:Email: jfd@recoll.org
:Date: 2015-05-22

== Abstract

== Introduction

Recoll is a big process which executes many others, mostly for extracting
text from documents. Some of the executed processes are quite short-lived,
and the time used by the process execution machinery can actually dominate
the time used to translate data. This document explores possible approaches
to improving performance without adding excessive complexity or damaging
reliability.

Studying fork/exec performance is not exactly a new venture, and there are
many texts which address the subject. While researching, though, I found
that not many of them were accurate, and that a lot of questions were left
as an exercise to the reader.

This document lists the references I found reliable and interesting, and
describes the chosen solution along with the other possible approaches.

== Issues with fork

The traditional way for a Unix process to start another is the
fork()/exec() system call pair. The initial fork() duplicates the address
space and resources (open files etc.) of the first process, then duplicates
the thread of execution, ending up with 2 mostly identical processes.
exec() then replaces part of the newly executing process with an address
space initialized from an executable file, inheriting some of the old
assets under various conditions.

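For reference, here is a minimal sketch of the classical pattern (error
handling reduced to the bare minimum; this is an illustration, not the
actual *Recoll* execution code):

[source,cpp]
----
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <stdio.h>

// Classical pattern: duplicate the process, then replace the child
// image with the command to execute, and wait for the result.
int run_command(const char *cmd, char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        return -1;
    }
    if (pid == 0) {
        // Child: replace the address space with the new program.
        execvp(cmd, argv);
        _exit(127);             // Only reached if exec() failed.
    }
    // Parent: wait for the child and return its status.
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return status;
}
----
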
As processes became bigger, the copying-before-discard operation wasted
significant resources, and was optimized using two methods (at very
different points in time):

- The first approach was to supplement fork() with the vfork() call, which
  is similar but does not duplicate the address space: the new process
  thread executes in the old address space. The old thread is blocked
  until the new one calls exec() and frees up access to the memory
  space. Any modification performed by the child thread persists when
  the old one resumes.

- The more modern approach, which coexists with vfork(), was to replace
  the full duplication of the memory space with duplication of the page
  descriptors only. The pages in the new process are marked copy-on-write
  so that the new process has write access to its memory without
  disturbing its parent. The problem with this approach is that the
  operation can still be a significant resource consumer for big processes
  mapping a lot of memory. Many processes can fall in this category not
  because they have huge data segments, but just because they are linked
  to many shared libraries.

NOTE: Orders of magnitude: a *recollindex* process will easily grow to a
few hundred megabytes of virtual space. It executes the small and
efficient *antiword* command to extract text from *ms-word* files. While
indexing multiple such files, *recollindex* can spend '60% of its CPU time'
doing `fork()`/`exec()` housekeeping instead of useful work (this is on Linux,
where `fork()` uses copy-on-write).

Apart from the performance cost, another issue with fork() is that a big
process can fail executing a small command because of the temporary need to
allocate twice its address space. This is a much discussed subject which we
will leave aside because it generally does not concern *recollindex*, which
in typical conditions uses a small portion of the machine virtual memory,
so that a temporary doubling is not an issue.

The Recoll indexer is multithreaded, which may introduce other issues. Here
is what happens to threads during the fork()/exec() interval:

- fork():
  * The parent process threads all go on their merry way.
  * The child process is created with only one thread active, duplicated
    from the one which called fork().
- vfork():
  * The parent process thread calling vfork() is suspended; the others
    are unaffected.
  * The child is created with only one thread, as for fork().
    This thread shares the memory space with the parent ones, without
    having any means to synchronize with them (pthread locks are not
    supposed to work across processes): caution needed!

NOTE: for a multithreaded program using the classical pipe method to
communicate with children, the sequence between the `pipe()` call and the
parent `close()` of the unused side is a candidate for a critical section:
if several threads can interleave in there, child processes may inherit
descriptors which 'belong' to other `fork()`/`exec()` operations, which may
in turn be a problem or not depending on how descriptor cleanup is
performed in the child (if no cleanup is performed, pipes may remain open
at both ends, which will prevent seeing EOFs etc.). Thanks to StackExchange
user Celada for explaining this to me.

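As an illustration, the descriptor housekeeping around the
`pipe()`/`fork()` sequence might look like the following sketch (simplified,
not the actual *Recoll* code): the span between `pipe()` and the
parent-side `close()` is serialized, and the descriptors are created
close-on-exec so that unrelated children cannot keep them open.

[source,cpp]
----
#include <sys/types.h>
#include <unistd.h>
#include <fcntl.h>
#include <mutex>

static std::mutex forkmutex;    // Serializes the pipe()/fork() sequences.

// Start 'cmd' with its stdout connected to a pipe read by the parent.
// Returns the parent's read descriptor, or -1 on error.
int start_filter(const char *cmd, char *const argv[], pid_t *pidp)
{
    int fds[2];
    std::lock_guard<std::mutex> lock(forkmutex);
    // pipe2() with O_CLOEXEC (Linux-specific) ensures that children forked
    // by other threads will not keep these descriptors across their exec().
    if (pipe2(fds, O_CLOEXEC) < 0)
        return -1;
    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        // Child: connect the pipe write end to stdout. dup2() clears the
        // close-on-exec flag on the new descriptor, the originals go away.
        dup2(fds[1], 1);
        execvp(cmd, argv);
        _exit(127);
    }
    close(fds[1]);              // Parent keeps only the read end.
    *pidp = pid;
    return fds[0];
}
----
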
For multithreaded programs, both fork() and vfork() introduce possibilities
of deadlock, because the resources held by a non-forking thread in the
parent process can't be released in the child, where that thread is not
duplicated. This used to happen from time to time in *recollindex* because
of an error logging call performed if the exec() failed after the fork()
(e.g. command not found).

With vfork() it is also possible to trigger a deadlock in the parent by
(inadvertently) modifying data in the child. This could happen just
link:http://www.oracle.com/technetwork/server-storage/solaris10/subprocess-136439.html[because
of dynamic linker operation] (which, seriously, should be considered a
system bug).

In general, the state of program data in the child process is a semi-random
snapshot of what it was in the parent, and the official word about what you
can do is that you can only call
link:http://man7.org/linux/man-pages/man7/signal.7.html[async-safe library
functions] between 'fork()' and 'exec()'. These are functions which are
safe to call from a signal handler because they are either reentrant or
can't be interrupted by a signal. A notable missing entry in the list is
`malloc()`.

These are normally not issues for programs which only fork to execute
another program (but the devil is in the details as demonstrated by the
logging call issue...).

One of the approaches often proposed for working around this minefield is
to use an auxiliary, small, process to execute any command needed by the
main one. The small process can just use fork() with no performance
issues. This has the inconvenience of complicating communication a lot if
data needs to be transferred one way or another.

////
Passing descriptors around
http://stackoverflow.com/questions/909064/portable-way-to-pass-file-descriptor-between-different-processes
http://www.normalesup.org/~george/comp/libancillary/
http://stackoverflow.com/questions/28003921/sending-file-descriptor-by-linux-socket/

The process would then be:
- Tell slave to fork/exec cmd (issue with cmd + args format)
- Get fds
- Tell slave to wait, recover status.
////

== The posix_spawn() Linux non-event

Given the performance issues of `fork()` and tricky behaviour of `vfork()`,
a "simpler" method for starting a child process was introduced by Posix:
`posix_spawn()`.

The `posix_spawn()` function is a black box, externally equivalent to a
`fork()`/`exec()` sequence, and has parameters to specify the usual
house-keeping performed at this time (file descriptors and signals
management etc.). Hiding the internals gives the system a chance to
optimize the performance and avoid `vfork()` pitfalls like the `ld.so`
lockup described in the Oracle article.

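For illustration, here is roughly what executing a command through
`posix_spawn()` looks like (a sketch; the redirections and attribute
choices would of course depend on the actual use case):

[source,cpp]
----
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ;

// Run argv[0] with its stdout redirected to 'outfd', wait for it,
// and return its exit status (-1 on error).
int spawn_command(char *const argv[], int outfd)
{
    posix_spawn_file_actions_t facts;
    posix_spawn_file_actions_init(&facts);
    // Equivalent of the dup2()/close() dance done between fork() and exec().
    posix_spawn_file_actions_adddup2(&facts, outfd, 1);
    posix_spawn_file_actions_addclose(&facts, outfd);

    pid_t pid;
    int err = posix_spawnp(&pid, argv[0], &facts, nullptr, argv, environ);
    posix_spawn_file_actions_destroy(&facts);
    if (err != 0)
        return -1;

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return status;
}
----
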
The Linux posix_spawn() is implemented by a `fork()`/`exec()` pair by
default.

`vfork()` is used either if specified by an input flag, or if no
signal/scheduler/process group changes are requested. There must be a
reason why signal handling changes would preclude `vfork()` usage, but I
could not find it (signal handling data is stored in the kernel
task_struct).

The Linux glibc `posix_spawn()` currently does nothing that user code could
not do. Still, using it would probably be a good future-proofing idea, but
for a significant problem: there is no way to request closing all open
descriptors above a specified value (a closefrom() equivalent). This is
available on Solaris, and quite necessary in fact, because we have no way
to be sure that all open descriptors have the CLOEXEC flag set.

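Lacking such a file action, the usual workaround is to close descriptors in
a loop inside the child, between `fork()` and `exec()`, which is exactly
what `posix_spawn()` does not let us do. A rough Linux sketch of the
operation follows (an assumed approach, not *Recoll* code):

[source,cpp]
----
#include <unistd.h>
#include <dirent.h>
#include <stdlib.h>

// Rough closefrom() equivalent: close every descriptor >= lowfd. Reading
// /proc/self/fd (Linux) avoids looping up to a possibly huge RLIMIT_NOFILE.
// Note that opendir()/readdir() may allocate memory, so this is itself
// questionable between fork() and exec() in a threaded program.
static void closefrom_equiv(int lowfd)
{
    DIR *dirp = opendir("/proc/self/fd");
    if (dirp == nullptr) {
        // Fallback: brute-force loop over a fixed range.
        for (int fd = lowfd; fd < 1024; fd++)
            close(fd);
        return;
    }
    struct dirent *ent;
    while ((ent = readdir(dirp)) != nullptr) {
        int fd = atoi(ent->d_name);
        if (fd >= lowfd && fd != dirfd(dirp))
            close(fd);
    }
    closedir(dirp);
}
----
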
Indexing 12500 small .doc files:

 fork:        real 0m46.025s  user 0m26.574s  sys 0m39.494s
 vfork:       real 0m18.223s  user 0m17.753s  sys 0m1.736s
 spawn/fork:  real 0m45.726s  user 0m27.082s  sys 0m40.575s
 spawn/vfork: real 0m18.915s  user 0m18.681s  sys 0m3.828s

No surprise here: given the implementation of posix_spawn(), it gets the
same times as the plain fork/vfork options.

It is difficult to ignore the 60% reduction in execution time offered by
using 'vfork()'.

Objections to vfork():

- ld.so locks
- sigaction locks

https://bugzilla.redhat.com/show_bug.cgi?id=193631
Is Linux vfork() thread-safe? Quoting interesting comments from the Solaris
implementation. No answer to the issues cited though.

https://sourceware.org/bugzilla/show_bug.cgi?id=378
Use vfork() in posix_spawn()
= The case of the bad Xapian::Document copy

== How things were supposed to work

Coming from the link:threadingRecoll.html[threading *Recoll*] page,
you may remember that the third stage of the
processing pipeline breaks up text into terms, producing a *Xapian*
document (+Xapian::Document+) which is finally processed by the last stage,
the index updater.

What happens in practice is that the main routine in this stage has a local
+Xapian::Document+ object, automatically allocated on the stack, which it
updates appropriately and then copies into a task object which is placed on
the input queue for the last stage.

The text-splitting routine then returns, and its local +Xapian::Document+
object is (implicitly) deleted while the stack unwinds.

The idea is that the *copy* of the document which is on the queue should be
unaffected: it is independent of the original and will be further processed
by the index update thread, without interaction with the text-splitting one.

At no point do multiple threads access the +Xapian::Document+ data, so
there should be no problem.

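In outline, the intent was something like the following sketch (simplified,
with invented task and queue names, not the actual *Recoll* code):

[source,cpp]
----
#include <string>
#include <xapian.h>

// Hypothetical task object placed on the queue between the text-splitting
// stage and the index update stage.
struct IndexUpdateTask {
    Xapian::Document doc;    // A copy of the splitter's local document.
    std::string udi;
};

// The inter-stage queue; assume its own internal locking.
void queue_put(IndexUpdateTask task);

void textsplit_stage(const std::string &text, const std::string &udi)
{
    Xapian::Document newdocument;       // Local, on the stack.
    // ... split 'text' and add terms and postings to 'newdocument' ...

    IndexUpdateTask task;
    task.doc = newdocument;             // Cheap copy: the underlying data is
                                        // shared and reference-counted.
    task.udi = udi;
    queue_put(task);
    // 'newdocument' is destroyed here as the stack unwinds, while the copy
    // on the queue will later be destroyed by the index update thread.
}
----
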
== The problem

Most *Xapian* objects are reference-counted, which means that the object
itself is a small block of house-keeping variables. The actual data is
allocated on the heap through eventual calls to new/malloc, and is shared
by multiple copies of the object. This is the case for +Xapian::Document+.

This is abundantly documented, and users are encouraged to use copies
instead of passing pointers around (copies are cheap because only a small
block of auxiliary data is actually duplicated). This in general makes
memory management easier.

This is well-known, and it would not appear to be a problem in the above
case as the +Xapian::Document+ actual data is never accessed by multiple
threads.

The problem is that the reference counter which keeps track of the object
usage and triggers actual deletion when it goes to zero is accessed by two
threads:

- It is decremented while the first local object is destroyed during the
  stack unwind in the first thread.
- It is also updated by the last stage thread, incremented if copies are
  made, then decremented until it finally goes down to 0 when we are done
  with the object, at which point the document data is deallocated.

As the counter is not protected in any way against concurrent access, the
actual sequence of events is undefined and at least two kinds of problems
may occur: double deletion of the data, or accesses to already freed heap
data (potentially trashing other threads' allocations, or reading modified
data).

A relatively simple fix for this would be to use atomic test-and-set
operations for the counter (which is what the GNU +std::string+ does). But
the choice made by *Xapian* to let the application deal with all
synchronization issues is legitimate and documented, nothing to complain
about here. I just goofed.

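For illustration, the difference amounts to the following (a generic sketch
of an intrusive reference count, not *Xapian* code):

[source,cpp]
----
#include <atomic>

// Plain counter: concurrent increments and decrements can interleave their
// read-modify-write steps, losing an update or freeing the data twice.
struct RefCounted {
    int refs = 1;
};

// Atomic counter: fetch_add()/fetch_sub() make each update indivisible, so
// exactly one thread sees the count drop to zero and frees the data.
struct AtomicRefCounted {
    std::atomic<int> refs{1};

    void ref() { refs.fetch_add(1, std::memory_order_relaxed); }
    bool unref() {  // Returns true when the caller must delete the object.
        return refs.fetch_sub(1, std::memory_order_acq_rel) == 1;
    }
};
----
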
Because the counter test and update operations are very fast, and occur
among a lot of processing from the final stage thread, the chances of
concurrent access are low, which is why the problem manifests itself very
rarely. Depending on thread scheduling and all manner of semi-random
conditions, it is basically impossible to reproduce reliably.

== The fix

The implemented fix was trivial: the upstream thread allocates the initial
+Xapian::Document+ on the heap, copies the pointer to the queue object, and
forgets about it. The index-updating thread peruses the object then
+delete+'s it. Real easy.

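In sketch form (again with invented names, matching the earlier example),
the fix amounts to this:

[source,cpp]
----
#include <string>
#include <xapian.h>

struct IndexUpdateTask {
    Xapian::Document *doc;   // Now a plain pointer: one owner at a time.
    std::string udi;
};

void queue_put(IndexUpdateTask task);

void textsplit_stage(const std::string &text, const std::string &udi)
{
    // Allocate on the heap and hand the pointer over: this thread never
    // touches the object (or its reference counter) again.
    Xapian::Document *newdocument = new Xapian::Document;
    // ... add terms and postings to *newdocument ...
    queue_put(IndexUpdateTask{newdocument, udi});
}

void indexupdate_stage(IndexUpdateTask task)
{
    // ... update the index using *task.doc ...
    delete task.doc;         // Single, unambiguous point of destruction.
}
----
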
An alternative solution would have been to try and use locking to protect
the counter updates. The only place where such locking operations could
reasonably occur is inside the +Xapian::Document+ refcounted pointer
object, which we can't modify. Otherwise, we would have to protect the
_whole scopes of existence_ of the +Xapian::Document+ object in any routine
which creates, copies or (implicitly) deletes it, which would cause many
problems and/or contention issues.

== Why did I miss this?

The mechanism of the crashes is simple enough, quasi-obvious.
How on earth could I miss this problem while writing the code?

For the sake of anecdote, my first brush with atomicity for updates of
reference counters was while debugging a System V release 4 kernel VFS file
system module, at the time when SVR4 got a preemptive kernel with SVR4-MP,
circa 1990... I ended up replacing a +counter+++ with +atomic_add()+ after
a set of _interesting_ debugging sessions interspersed with kernel crashes
and +fsck+ waits. This should have left some memories. So what went wrong?
Here follows a list of possible reasons:

- Reasoning by analogy: +std::string+ objects are safe to use in this
  way. The other objects used in the indexing pipeline are also safe. I
  just used +Xapian::Document+ in the same way without thinking further.
- Probably not how I would do it: faced with designing +Xapian::Document+
  (not that I would be clever enough anyway), I would probably have
  concluded that not wanting to deal with full-on concurrency is one
  thing, but that leaving the reference counters unprotected is another,
  and going too far.
- The problem was not so easily visible because the object deletion is
  implicitly performed during the stack unwind: this provides no clue, no
  specific operation to think about.
- Pure laziness.

As a conclusion, a humble request to library designers: when an
interface works counter to the reasonable expectations of at least some of
the users (for example because it looks like, but works differently from, a
standard library interface), it is worth being very specific in the
documentation and header file comments about the gotchas. Saving people
from their own deficiencies is a worthy goal.

Here, a simple statement that the reference count was not mt-safe
(admittedly redundant with the general statement that the *Xapian* library
does not deal with threads) would have got me thinking and avoided the
error.

++++
<h2 id="comments">Comments</h2>

<div id="disqus_thread"></div>
<script type="text/javascript">
    var disqus_shortname = 'lesbonscomptes';
    (function() {
        var dsq = document.createElement('script');
        dsq.type = 'text/javascript';
        dsq.async = true;
        dsq.src = '//' + disqus_shortname + '.disqus.com/embed.js';
        (document.getElementsByTagName('head')[0] ||
         document.getElementsByTagName('body')[0]).appendChild(dsq);
    })();
</script>
<noscript>Please enable JavaScript to view the <a href="http://disqus.com/?ref_noscript">comments powered by Disqus.</a></noscript>
<a href="http://disqus.com" class="dsq-brlink">comments powered by <span class="logo-disqus">Disqus</span></a>
++++