diff --git a/website/idxthreads/forkingRecoll.txt b/website/idxthreads/forkingRecoll.txt
new file mode 100644
index 00000000..f672f9b3
--- /dev/null
+++ b/website/idxthreads/forkingRecoll.txt
@@ -0,0 +1,191 @@
= Recoll command execution performance
:Author: Jean-François Dockès
:Email: jfd@recoll.org
:Date: 2015-05-22

== Abstract

*recollindex* executes many short-lived external commands to extract text
from documents, and the cost of starting them can dominate indexing time.
This note reviews the available process-creation primitives (fork(),
vfork(), posix_spawn()), their pitfalls in a multithreaded program, and the
approach chosen for Recoll.

== Introduction

Recoll is a big process which executes many others, mostly for extracting
text from documents. Some of the executed processes are quite short-lived,
and the time used by the process execution machinery can actually dominate
the time used to translate the data. This document explores possible
approaches to improving performance without adding excessive complexity or
damaging reliability.

Studying fork/exec performance is not exactly a new venture, and there are
many texts which address the subject. While researching, though, I found
that not so many of them were accurate, and that a lot of questions were
left as an exercise to the reader.

This document lists the references I found reliable and interesting, and
describes the chosen solution alongside the other possible approaches.

== Issues with fork

The traditional way for a Unix process to start another is the
fork()/exec() system call pair. The initial fork() duplicates the address
space and resources (open files etc.) of the first process, then duplicates
the thread of execution, ending up with two mostly identical processes.
exec() then replaces part of the newly executing process with an address
space initialized from an executable file, inheriting some of the old
assets under various conditions.

As processes became bigger, the copy-before-discard operation wasted
significant resources, and it was optimized using two methods (at very
different points in time):

 - The first approach was to supplement fork() with the vfork() call, which
   is similar but does not duplicate the address space: the new process
   thread executes in the old address space. The old thread is blocked
   until the new one calls exec() and frees up access to the memory
   space. Any modification performed by the child thread persists when the
   old one resumes.

 - The more modern approach, which coexists with vfork(), was to replace
   the full duplication of the memory space with duplication of the page
   descriptors only. The pages in the new process are marked copy-on-write,
   so that the new process has write access to its memory without
   disturbing its parent. The problem with this approach is that the
   operation can still consume significant resources for big processes
   mapping a lot of memory. Many processes fall into this category not
   because they have huge data segments, but just because they are linked
   to many shared libraries.

NOTE: Orders of magnitude: a *recollindex* process will easily grow to a
few hundred megabytes of virtual space. It executes the small and efficient
*antiword* command to extract text from *ms-word* files. While indexing
many such files, *recollindex* can spend '60% of its CPU time' doing
`fork()`/`exec()` housekeeping instead of useful work (this is on Linux,
where `fork()` uses copy-on-write).

Apart from the performance cost, another issue with fork() is that a big
process can fail to execute a small command because of the temporary need
to allocate twice its address space. This is a much discussed subject which
we will leave aside because it generally does not concern *recollindex*:
in typical conditions it uses a small portion of the machine's virtual
memory, so a temporary doubling is not an issue.
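As a point of reference for the rest of the discussion, here is a minimal
sketch of the traditional sequence (illustrative only, not Recoll's actual
code; error handling is reduced to the bare minimum):

[source,cpp]
----
// Minimal sketch of the classical fork()/exec()/waitpid() sequence.
// Illustrative only: not Recoll's actual code.
#include <unistd.h>
#include <sys/wait.h>

// Run cmd with argv, wait for it, return its exit status or -1 on error.
int runcmd(const char *cmd, char *const argv[])
{
    pid_t pid = fork();            // duplicates the process (copy-on-write on Linux)
    if (pid < 0)
        return -1;
    if (pid == 0) {
        // Child: replace the copied address space with the new program.
        execvp(cmd, argv);
        _exit(127);                // only reached if exec failed
    }
    // Parent: wait for the child and collect its status.
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
----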
The Recoll indexer is multithreaded, which may introduce other issues. Here
is what happens to threads during the fork()/exec() interval:

 - fork():
   * The parent process threads all go on their merry way.
   * The child process is created with only one active thread, duplicated
     from the one which called fork().
 - vfork():
   * The parent process thread calling vfork() is suspended, the others
     are unaffected.
   * The child is created with only one thread, as for fork(). This thread
     shares the memory space with the parent ones, without having any means
     to synchronize with them (pthread locks are not supposed to work
     across processes): caution is needed!

NOTE: For a multithreaded program using the classical pipe method to
communicate with children, the sequence between the `pipe()` call and the
parent `close()` of the unused side is a candidate for a critical section:
if several threads can interleave in there, child processes may inherit
descriptors which 'belong' to other `fork()`/`exec()` operations. This may
or may not be a problem, depending on how descriptor cleanup is performed
in the child (if no cleanup is performed, pipes may remain open at both
ends, which will prevent EOFs from being seen, etc.). Thanks to
StackExchange user Celada for explaining this to me.

For multithreaded programs, both fork() and vfork() introduce possibilities
of deadlock, because resources held by a non-forking thread in the parent
process can never be released in the child, where that thread is not
duplicated. This used to happen from time to time in *recollindex* because
of an error logging call performed when the exec() failed after the fork()
(e.g. command not found).

With vfork() it is also possible to trigger a deadlock in the parent by
(inadvertently) modifying data in the child. This could happen just
link:http://www.oracle.com/technetwork/server-storage/solaris10/subprocess-136439.html[because
of dynamic linker operation] (which, seriously, should be considered a
system bug).

In general, the state of program data in the child process is a semi-random
snapshot of what it was in the parent, and the official word about what you
can do is that you can only call
link:http://man7.org/linux/man-pages/man7/signal.7.html[async-signal-safe
library functions] between 'fork()' and 'exec()'. These are functions which
are safe to call from a signal handler because they are either reentrant or
cannot be interrupted by a signal. A notable missing entry in the list is
`malloc()`.

These are normally not issues for programs which only fork to execute
another program (but the devil is in the details, as demonstrated by the
logging call issue...).
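As an illustration of working within the async-signal-safe restriction: the
exec() failure which used to trigger the logging deadlock can be reported
to the parent with nothing but `write()` and `_exit()` in the child, by
sending the `errno` value over a close-on-exec pipe. This is a sketch, not
Recoll's actual code, and `pipe2()` is Linux-specific:

[source,cpp]
----
// Sketch: report exec() failure to the parent using only async-signal-safe
// calls in the child (write(), _exit()), instead of e.g. a logging call.
// pipe2() with O_CLOEXEC is Linux-specific. Not Recoll's actual code.
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <sys/wait.h>

// Start cmd, return its pid, or -1 with errno set if the exec failed.
int startcmd(const char *cmd, char *const argv[])
{
    int errpipe[2];
    if (pipe2(errpipe, O_CLOEXEC) < 0)  // both ends closed by a successful exec
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(errpipe[0]);
        close(errpipe[1]);
        return -1;
    }

    if (pid == 0) {
        // Child: only async-signal-safe calls from here on.
        execvp(cmd, argv);
        int err = errno;                // exec failed: pass errno to the parent
        (void)write(errpipe[1], &err, sizeof(err));
        _exit(127);
    }

    // Parent: a successful exec closes the write end, so read() returns 0;
    // it returns sizeof(err) if the exec failed.
    close(errpipe[1]);
    int err = 0;
    ssize_t n = read(errpipe[0], &err, sizeof(err));
    close(errpipe[0]);
    if (n > 0) {
        waitpid(pid, nullptr, 0);       // reap the failed child
        errno = err;
        return -1;
    }
    return pid;
}
----

Opening the pipe with O_CLOEXEC also mitigates the descriptor leakage
discussed in the note above, since any copies inherited through a
concurrent fork() from another thread are closed by that child's exec().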
One of the approaches often proposed for working around this minefield is
to use a small auxiliary process to execute any command needed by the main
one. The auxiliary process can just use fork() with no performance
issues. The inconvenience is that it complicates communication a lot if
data needs to be transferred one way or the other.

////
Passing descriptors around
http://stackoverflow.com/questions/909064/portable-way-to-pass-file-descriptor-between-different-processes
http://www.normalesup.org/~george/comp/libancillary/
http://stackoverflow.com/questions/28003921/sending-file-descriptor-by-linux-socket/

The process would then be:
 - Tell slave to fork/exec cmd (issue with cmd + args format)
 - Get fds
 - Tell slave to wait, recover status.
////

== The posix_spawn() Linux non-event

Given the performance issues of `fork()` and the tricky behaviour of
`vfork()`, a "simpler" method for starting a child process was introduced
by POSIX: `posix_spawn()`.

The `posix_spawn()` function is a black box, externally equivalent to a
`fork()`/`exec()` sequence, with parameters to specify the usual
housekeeping performed at this point (file descriptor and signal
management, etc.). Hiding the internals gives the system a chance to
optimize the performance and to avoid `vfork()` pitfalls like the `ld.so`
lockup described in the Oracle article.

The Linux posix_spawn() is implemented by a `fork()`/`exec()` pair by
default. `vfork()` is used either if requested by an input flag, or if no
signal/scheduler/process group changes are requested. There must be a
reason why signal handling changes would preclude `vfork()` usage, but I
could not find it (signal handling data is stored in the kernel
task_struct).

The Linux glibc `posix_spawn()` currently does nothing that user code could
not do. Still, using it would probably be a good future-proofing idea, were
it not for a significant problem: there is no way to ask for all open
descriptors above a given value to be closed (a `closefrom()`
equivalent). This is available on Solaris, and it is quite necessary in
practice, because we have no way to be sure that all open descriptors have
the CLOEXEC flag set.
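For illustration, here is roughly what the `posix_spawn()` route looks like
(a sketch, not Recoll code; `childinfd` and the descriptor juggling are
just examples). Descriptor housekeeping has to be spelled out one
descriptor at a time through the file actions list, which is exactly where
a `closefrom()`-like action is missing; `POSIX_SPAWN_USEVFORK` is a glibc
extension:

[source,cpp]
----
// Sketch of posix_spawn() usage, not Recoll code. Descriptor housekeeping
// goes through the "file actions" list; each descriptor must be handled
// explicitly, there is no closefrom()-like action.
#include <spawn.h>
#include <unistd.h>

extern char **environ;

// Start cmd with argv, with childinfd connected to its stdin.
// Returns the child pid, or -1 on error.
int spawncmd(const char *cmd, char *const argv[], int childinfd)
{
    posix_spawn_file_actions_t facts;
    posix_spawn_file_actions_init(&facts);
    // Make childinfd the child's stdin, then close the original descriptor.
    posix_spawn_file_actions_adddup2(&facts, childinfd, 0);
    posix_spawn_file_actions_addclose(&facts, childinfd);

    posix_spawnattr_t attrs;
    posix_spawnattr_init(&attrs);
    // glibc extension: ask for the vfork() path explicitly.
    // posix_spawnattr_setflags(&attrs, POSIX_SPAWN_USEVFORK);

    pid_t pid;
    int err = posix_spawnp(&pid, cmd, &facts, &attrs, argv, environ);

    posix_spawn_file_actions_destroy(&facts);
    posix_spawnattr_destroy(&attrs);
    return err ? -1 : pid;    // the caller waitpid()s as with fork()/exec()
}
----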
Timings for indexing 12500 small .doc files:

 fork:        real 0m46.025s  user 0m26.574s  sys 0m39.494s
 vfork:       real 0m18.223s  user 0m17.753s  sys 0m1.736s
 spawn/fork:  real 0m45.726s  user 0m27.082s  sys 0m40.575s
 spawn/vfork: real 0m18.915s  user 0m18.681s  sys 0m3.828s

No surprise here: given the implementation of posix_spawn(), it gets the
same times as the plain fork/vfork options.

It is difficult to ignore the 60% reduction in execution time offered by
using 'vfork()'.

Objections to vfork():

 - ld.so locks
 - sigaction locks

https://bugzilla.redhat.com/show_bug.cgi?id=193631 -- is Linux vfork()
thread-safe? Quotes interesting comments from the Solaris implementation,
but gives no answer to the issues cited.

https://sourceware.org/bugzilla/show_bug.cgi?id=378 -- use vfork() in
posix_spawn().

diff --git a/website/idxthreads/xapDocCopyCrash.txt b/website/idxthreads/xapDocCopyCrash.txt
new file mode 100644
index 00000000..91833651
--- /dev/null
+++ b/website/idxthreads/xapDocCopyCrash.txt
@@ -0,0 +1,138 @@
= The case of the bad Xapian::Document copy

== How things were supposed to work

Coming from the link:threadingRecoll.html[threading *Recoll*] page, you may
remember that the third stage of the processing pipeline breaks up text
into terms, producing a *Xapian* document (+Xapian::Document+) which is
finally processed by the last stage, the index updater.

What happens in practice is that the main routine in this stage has a local
+Xapian::Document+ object, automatically allocated on the stack, which it
updates appropriately and then copies into a task object which is placed on
the input queue for the last stage.

The text-splitting routine then returns, and its local +Xapian::Document+
object is (implicitly) deleted while the stack unwinds.

The idea is that the *copy* of the document which is on the queue should be
unaffected: it is independent of the original and will be further processed
by the index update thread, without interaction with the text-splitting
one.

At no point do multiple threads access the +Xapian::Document+ data, so
there should be no problem.

== The problem

Most *Xapian* objects are reference-counted, which means that the object
itself is a small block of housekeeping variables. The actual data is
allocated on the heap through calls to new/malloc, and is shared by
multiple copies of the object. This is the case for +Xapian::Document+.

This is abundantly documented, and users are encouraged to use copies
instead of passing pointers around (copies are cheap because only the small
block of auxiliary data is actually duplicated). In general this makes
memory management easier.

This is well known, and it would not appear to be a problem in the above
case, as the actual +Xapian::Document+ data is never accessed by multiple
threads.

The problem is that the reference counter, which keeps track of the object
usage and triggers actual deletion when it goes to zero, is accessed by two
threads:

 - It is decremented when the original local object is destroyed during the
   stack unwind in the first thread.
 - It is also updated by the last stage thread: incremented if copies are
   made, then decremented until it finally goes down to 0 when we are done
   with the object, at which point the document data is deallocated.

As the counter is not protected in any way against concurrent access, the
actual sequence of events is undefined and at least two kinds of problems
may occur: double deletion of the data, or accesses to already freed heap
data (potentially trashing other threads' allocations, or reading modified
data).

A relatively simple fix for this would be to use atomic operations for the
counter updates (which is what the GNU +std::string+ does). But the choice
made by *Xapian* to let the application deal with all synchronization
issues is legitimate and documented, nothing to complain about here. I just
goofed.

Because the counter test and update operations are very fast, and occur
amid a lot of other processing in the final stage thread, the chances of
concurrent access are low, which is why the problem manifests itself very
rarely. Depending on thread scheduling and all manner of semi-random
conditions, it is basically impossible to reproduce reliably.

== The fix

The implemented fix was trivial: the upstream thread allocates the initial
+Xapian::Document+ on the heap, copies the pointer to the queue object, and
forgets about it. The index-updating thread uses the object, then
+delete+s it. Real easy.

An alternative solution would have been to try to use locking to protect
the counter updates. The only place where such locking operations could
reasonably occur is inside the +Xapian::Document+ refcounted pointer
object, which we can't modify. Otherwise, we would have to protect the
_whole scopes of existence_ of the +Xapian::Document+ object in any routine
which creates, copies or (implicitly) deletes it, which would cause many
problems and/or contention issues.
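To make the change concrete, here is a condensed sketch of the fixed
pattern. The names (`IndexUpdateTask`, `put_on_queue`, `split_text`) are
hypothetical stand-ins for the actual Recoll pipeline types; the comments
recall what the broken variant did:

[source,cpp]
----
// Condensed sketch of the fix, with hypothetical stand-ins for the Recoll
// pipeline types. Not the actual source.
#include <string>
#include <xapian.h>

// Task handed to the index-updating thread.
struct IndexUpdateTask {
    Xapian::Document *doc;     // after the fix: a plain owned pointer
};

// Placeholder: in Recoll this puts the task on the updater thread's queue.
static void put_on_queue(const IndexUpdateTask&) { /* ... */ }

// Before the fix: a stack-local Xapian::Document was *copied* into the
// task. The copy only duplicates the refcounted handle, so the hidden
// counter was later decremented by this thread (stack unwind) while the
// updater thread was also manipulating it.
//
// After the fix: the document lives on the heap and only the pointer
// travels, so a single thread at a time touches the object and its
// counter. The updater thread deletes it when done.
static void split_text(const std::string& text)
{
    Xapian::Document *doc = new Xapian::Document;
    doc->set_data(text);                  // stands in for the real term generation
    put_on_queue(IndexUpdateTask{doc});   // ownership passes to the updater thread
}
----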
== Why did I miss this?

The mechanism of the crashes is simple enough, quasi-obvious. How on earth
could I miss this problem while writing the code?

For the sake of anecdote, my first brush with atomicity of reference
counter updates was while debugging a System V release 4 kernel VFS file
system module, at the time when SVR4 got a preemptive kernel with SVR4-MP,
circa 1990... I ended up replacing a `counter++` with `atomic_add()` after
a set of _interesting_ debugging sessions interspersed with kernel crashes
and +fsck+ waits. This should have left some memories. So what went wrong?
Here is a list of possible reasons:

- Reasoning by analogy: +std::string+ objects are safe to use in this
  way. The other objects used in the indexing pipeline are also safe. I
  just used +Xapian::Document+ in the same way without thinking further.
- Probably not how I would do it: had I been faced with designing
  +Xapian::Document+ (not that I would be clever enough to do it anyway),
  I would probably have concluded that not wanting to deal with full-on
  concurrency is one thing, but that leaving the reference counters
  unprotected is another, and a step too far.
- The problem was not easily visible because the object deletion is
  performed implicitly during the stack unwind: this provides no clue, no
  specific operation to think about.
- Pure laziness.


As a conclusion, a humble request to library designers: when an interface
works counter to the reasonable expectations of at least some of its users
(for example because it looks like a standard library interface but works
differently), it is worth being very specific in the documentation and
header file comments about the gotchas. Saving people from their own
deficiencies is a worthy goal.

Here, a simple statement that the reference count is not MT-safe
(admittedly redundant with the general statement that the *Xapian* library
does not deal with threads) would have got me thinking and avoided the
error.

++++