= Recoll command execution performance
:Author: Jean-François Dockès
:Email: jfd@recoll.org
:Date: 2015-05-22

== Abstract

== Introduction

Recoll is a big process which executes many others, mostly for extracting
text from documents. Some of the executed processes are quite short-lived,
and the time used by the process execution machinery can actually dominate
the time used to translate data. This document explores possible approaches
to improving performance without adding excessive complexity or damaging
reliability.

Studying fork/exec performance is not exactly a new venture, and there are
many texts which address the subject. While researching, though, I found
that not many of them were accurate, and that a lot of questions were left
as an exercise to the reader.

This document lists the references I found reliable and interesting, and
describes the chosen solution along with the other possible approaches.

== Issues with fork

The traditional way for a Unix process to start another is the
fork()/exec() system call pair. The initial fork() duplicates the address
space and resources (open files etc.) of the first process, then duplicates
the thread of execution, ending up with 2 mostly identical processes.
exec() then replaces part of the newly executing process with an address
space initialized from an executable file, inheriting some of the old
assets under various conditions.

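For reference, here is a minimal sketch of the classical pattern (error
handling reduced to the bare minimum; this is an illustration, not the
actual *Recoll* execution code):

[source,cpp]
----
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <stdio.h>

// Classical pattern: duplicate the process, then replace the child
// image with the command to execute, and wait for the result.
int run_command(const char *cmd, char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        return -1;
    }
    if (pid == 0) {
        // Child: replace the address space with the new program.
        execvp(cmd, argv);
        _exit(127);             // Only reached if exec() failed.
    }
    // Parent: wait for the child and return its status.
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return status;
}
----
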
As processes became bigger, the copying-before-discard operation wasted
significant resources, and was optimized using two methods (at very
different points in time):

- The first approach was to supplement fork() with the vfork() call, which
  is similar but does not duplicate the address space: the new process
  thread executes in the old address space. The old thread is blocked
  until the new one calls exec() and frees up access to the memory
  space. Any modification performed by the child thread persists when
  the old one resumes.

- The more modern approach, which coexists with vfork(), was to replace
  the full duplication of the memory space with duplication of the page
  descriptors only. The pages in the new process are marked copy-on-write
  so that the new process has write access to its memory without
  disturbing its parent. The problem with this approach is that the
  operation can still be a significant resource consumer for big processes
  mapping a lot of memory. Many processes can fall in this category not
  because they have huge data segments, but just because they are linked
  to many shared libraries.

NOTE: Orders of magnitude: a *recollindex* process will easily grow to a
few hundred megabytes of virtual space. It executes the small and
efficient *antiword* command to extract text from *ms-word* files. While
indexing multiple such files, *recollindex* can spend '60% of its CPU time'
doing `fork()`/`exec()` housekeeping instead of useful work (this is on Linux,
where `fork()` uses copy-on-write).

Apart from the performance cost, another issue with fork() is that a big
process can fail executing a small command because of the temporary need to
allocate twice its address space. This is a much discussed subject which we
will leave aside because it generally does not concern *recollindex*, which
in typical conditions uses a small portion of the machine virtual memory,
so that a temporary doubling is not an issue.

The Recoll indexer is multithreaded, which may introduce other issues. Here
is what happens to threads during the fork()/exec() interval:

- fork():
  * The parent process threads all go on their merry way.
  * The child process is created with only one thread active, duplicated
    from the one which called fork().
- vfork():
  * The parent process thread calling vfork() is suspended; the others
    are unaffected.
  * The child is created with only one thread, as for fork().
    This thread shares the memory space with the parent ones, without
    having any means to synchronize with them (pthread locks are not
    supposed to work across processes): caution needed!

NOTE: for a multithreaded program using the classical pipe method to
communicate with children, the sequence between the `pipe()` call and the
parent `close()` of the unused side is a candidate for a critical section:
if several threads can interleave in there, child processes may inherit
descriptors which 'belong' to other `fork()`/`exec()` operations, which may
in turn be a problem or not depending on how descriptor cleanup is
performed in the child (if no cleanup is performed, pipes may remain open
at both ends, which will prevent seeing EOFs etc.). Thanks to StackExchange
user Celada for explaining this to me.

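As an illustration, the descriptor housekeeping around the
`pipe()`/`fork()` sequence might look like the following sketch (simplified,
not the actual *Recoll* code): the span between `pipe()` and the
parent-side `close()` is serialized, and the descriptors are created
close-on-exec so that unrelated children cannot keep them open.

[source,cpp]
----
#include <sys/types.h>
#include <unistd.h>
#include <fcntl.h>
#include <mutex>

static std::mutex forkmutex;    // Serializes the pipe()/fork() sequences.

// Start 'cmd' with its stdout connected to a pipe read by the parent.
// Returns the parent's read descriptor, or -1 on error.
int start_filter(const char *cmd, char *const argv[], pid_t *pidp)
{
    int fds[2];
    std::lock_guard<std::mutex> lock(forkmutex);
    // pipe2() with O_CLOEXEC (Linux-specific) ensures that children forked
    // by other threads will not keep these descriptors across their exec().
    if (pipe2(fds, O_CLOEXEC) < 0)
        return -1;
    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        // Child: connect the pipe write end to stdout. dup2() clears the
        // close-on-exec flag on the new descriptor, the originals go away.
        dup2(fds[1], 1);
        execvp(cmd, argv);
        _exit(127);
    }
    close(fds[1]);              // Parent keeps only the read end.
    *pidp = pid;
    return fds[0];
}
----
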
For multithreaded programs, both fork() and vfork() introduce possibilities
of deadlock, because the resources held by a non-forking thread in the
parent process can't be released in the child, where that thread is not
duplicated. This used to happen from time to time in *recollindex* because
of an error logging call performed if the exec() failed after the fork()
(e.g. command not found).

With vfork() it is also possible to trigger a deadlock in the parent by
(inadvertently) modifying data in the child. This could happen just
link:http://www.oracle.com/technetwork/server-storage/solaris10/subprocess-136439.html[because
of dynamic linker operation] (which, seriously, should be considered a
system bug).

In general, the state of program data in the child process is a semi-random
snapshot of what it was in the parent, and the official word about what you
can do is that you can only call
link:http://man7.org/linux/man-pages/man7/signal.7.html[async-safe library
functions] between 'fork()' and 'exec()'. These are functions which are
safe to call from a signal handler because they are either reentrant or
can't be interrupted by a signal. A notable missing entry in the list is
`malloc()`.

These are normally not issues for programs which only fork to execute
another program (but the devil is in the details as demonstrated by the
logging call issue...).

One of the approaches often proposed for working around this minefield is
to use an auxiliary, small, process to execute any command needed by the
main one. The small process can just use fork() with no performance
issues. This has the inconvenience of complicating communication a lot if
data needs to be transferred one way or another.

////
Passing descriptors around
http://stackoverflow.com/questions/909064/portable-way-to-pass-file-descriptor-between-different-processes
http://www.normalesup.org/~george/comp/libancillary/
http://stackoverflow.com/questions/28003921/sending-file-descriptor-by-linux-socket/

The process would then be:
- Tell slave to fork/exec cmd (issue with cmd + args format)
- Get fds
- Tell slave to wait, recover status.
////

== The posix_spawn() Linux non-event

Given the performance issues of `fork()` and tricky behaviour of `vfork()`,
a "simpler" method for starting a child process was introduced by Posix:
`posix_spawn()`.

The `posix_spawn()` function is a black box, externally equivalent to a
`fork()`/`exec()` sequence, and has parameters to specify the usual
house-keeping performed at this time (file descriptors and signals
management etc.). Hiding the internals gives the system a chance to
optimize the performance and avoid `vfork()` pitfalls like the `ld.so`
lockup described in the Oracle article.

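For illustration, here is roughly what executing a command through
`posix_spawn()` looks like (a sketch; the redirections and attribute
choices would of course depend on the actual use case):

[source,cpp]
----
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ;

// Run argv[0] with its stdout redirected to 'outfd', wait for it,
// and return its exit status (-1 on error).
int spawn_command(char *const argv[], int outfd)
{
    posix_spawn_file_actions_t facts;
    posix_spawn_file_actions_init(&facts);
    // Equivalent of the dup2()/close() dance done between fork() and exec().
    posix_spawn_file_actions_adddup2(&facts, outfd, 1);
    posix_spawn_file_actions_addclose(&facts, outfd);

    pid_t pid;
    int err = posix_spawnp(&pid, argv[0], &facts, nullptr, argv, environ);
    posix_spawn_file_actions_destroy(&facts);
    if (err != 0)
        return -1;

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return status;
}
----
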
The Linux posix_spawn() is implemented by a `fork()`/`exec()` pair by
default.

`vfork()` is used either if specified by an input flag, or if no
signal/scheduler/process group changes are requested. There must be a
reason why signal handling changes would preclude `vfork()` usage, but I
could not find it (signal handling data is stored in the kernel
task_struct).

The Linux glibc `posix_spawn()` currently does nothing that user code could
not do. Still, using it would probably be a good future-proofing idea, but
for a significant problem: there is no way to request closing all open
descriptors above a specified value (a closefrom() equivalent). This is
available on Solaris, and quite necessary in fact, because we have no way
to be sure that all open descriptors have the CLOEXEC flag set.

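Lacking such a file action, the usual workaround is to close descriptors in
a loop inside the child, between `fork()` and `exec()`, which is exactly
what `posix_spawn()` does not let us do. A rough Linux sketch of the
operation follows (an assumed approach, not *Recoll* code):

[source,cpp]
----
#include <unistd.h>
#include <dirent.h>
#include <stdlib.h>

// Rough closefrom() equivalent: close every descriptor >= lowfd. Reading
// /proc/self/fd (Linux) avoids looping up to a possibly huge RLIMIT_NOFILE.
// Note that opendir()/readdir() may allocate memory, so this is itself
// questionable between fork() and exec() in a threaded program.
static void closefrom_equiv(int lowfd)
{
    DIR *dirp = opendir("/proc/self/fd");
    if (dirp == nullptr) {
        // Fallback: brute-force loop over a fixed range.
        for (int fd = lowfd; fd < 1024; fd++)
            close(fd);
        return;
    }
    struct dirent *ent;
    while ((ent = readdir(dirp)) != nullptr) {
        int fd = atoi(ent->d_name);
        if (fd >= lowfd && fd != dirfd(dirp))
            close(fd);
    }
    closedir(dirp);
}
----
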
Indexing 12500 small .doc files:

 fork:        real 0m46.025s  user 0m26.574s  sys 0m39.494s
 vfork:       real 0m18.223s  user 0m17.753s  sys 0m1.736s
 spawn/fork:  real 0m45.726s  user 0m27.082s  sys 0m40.575s
 spawn/vfork: real 0m18.915s  user 0m18.681s  sys 0m3.828s

No surprise here: given the implementation of posix_spawn(), it gets the
same times as the plain fork/vfork options.

It is difficult to ignore the 60% reduction in execution time offered by
using 'vfork()'.

Objections to vfork():

- ld.so locks
- sigaction locks

https://bugzilla.redhat.com/show_bug.cgi?id=193631
Is Linux vfork() thread-safe? Quoting interesting comments from the Solaris
implementation. No answer to the issues cited though.

https://sourceware.org/bugzilla/show_bug.cgi?id=378
Use vfork() in posix_spawn()
= The case of the bad Xapian::Document copy

== How things were supposed to work

Coming from the link:threadingRecoll.html[threading *Recoll*] page,
you may remember that the third stage of the
processing pipeline breaks up text into terms, producing a *Xapian*
document (+Xapian::Document+) which is finally processed by the last stage,
the index updater.

What happens in practice is that the main routine in this stage has a local
+Xapian::Document+ object, automatically allocated on the stack, which it
updates appropriately and then copies into a task object which is placed on
the input queue for the last stage.

The text-splitting routine then returns, and its local +Xapian::Document+
object is (implicitly) deleted while the stack unwinds.

The idea is that the *copy* of the document which is on the queue should be
unaffected: it is independent of the original and will be further processed
by the index update thread, without interaction with the text-splitting one.

At no point do multiple threads access the +Xapian::Document+ data, so
there should be no problem.

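In outline, the intent was something like the following sketch (simplified,
with invented task and queue names, not the actual *Recoll* code):

[source,cpp]
----
#include <string>
#include <xapian.h>

// Hypothetical task object placed on the queue between the text-splitting
// stage and the index update stage.
struct IndexUpdateTask {
    Xapian::Document doc;    // A copy of the splitter's local document.
    std::string udi;
};

// The inter-stage queue; assume its own internal locking.
void queue_put(IndexUpdateTask task);

void textsplit_stage(const std::string &text, const std::string &udi)
{
    Xapian::Document newdocument;       // Local, on the stack.
    // ... split 'text' and add terms and postings to 'newdocument' ...

    IndexUpdateTask task;
    task.doc = newdocument;             // Cheap copy: the underlying data is
                                        // shared and reference-counted.
    task.udi = udi;
    queue_put(task);
    // 'newdocument' is destroyed here as the stack unwinds, while the copy
    // on the queue will later be destroyed by the index update thread.
}
----
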
== The problem

Most *Xapian* objects are reference-counted, which means that the object
itself is a small block of house-keeping variables. The actual data is
allocated on the heap through eventual calls to new/malloc, and is shared
by multiple copies of the object. This is the case for +Xapian::Document+.

This is abundantly documented, and users are encouraged to use copies
instead of passing pointers around (copies are cheap because only a small
block of auxiliary data is actually duplicated). This in general makes
memory management easier.

This is well-known, and it would not appear to be a problem in the above
case as the +Xapian::Document+ actual data is never accessed by multiple
threads.

The problem is that the reference counter which keeps track of the object
usage and triggers actual deletion when it goes to zero is accessed by two
threads:

- It is decremented while the first local object is destroyed during the
  stack unwind in the first thread.
- It is also updated by the last stage thread, incremented if copies are
  made, then decremented until it finally goes down to 0 when we are done
  with the object, at which point the document data is deallocated.

As the counter is not protected in any way against concurrent access, the
actual sequence of events is undefined and at least two kinds of problems
may occur: double deletion of the data, or accesses to already freed heap
data (potentially trashing other threads' allocations, or reading modified
data).

A relatively simple fix for this would be to use atomic test-and-set
operations for the counter (which is what the GNU +std::string+ does). But
the choice made by *Xapian* to let the application deal with all
synchronization issues is legitimate and documented, nothing to complain
about here. I just goofed.

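For illustration, the difference amounts to the following (a generic sketch
of an intrusive reference count, not *Xapian* code):

[source,cpp]
----
#include <atomic>

// Plain counter: concurrent increments and decrements can interleave their
// read-modify-write steps, losing an update or freeing the data twice.
struct RefCounted {
    int refs = 1;
};

// Atomic counter: fetch_add()/fetch_sub() make each update indivisible, so
// exactly one thread sees the count drop to zero and frees the data.
struct AtomicRefCounted {
    std::atomic<int> refs{1};

    void ref() { refs.fetch_add(1, std::memory_order_relaxed); }
    bool unref() {  // Returns true when the caller must delete the object.
        return refs.fetch_sub(1, std::memory_order_acq_rel) == 1;
    }
};
----
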
Because the counter test and update operations are very fast, and occur
among a lot of processing from the final stage thread, the chances of
concurrent access are low, which is why the problem manifests itself very
rarely. Depending on thread scheduling and all manner of semi-random
conditions, it is basically impossible to reproduce reliably.

== The fix

The implemented fix was trivial: the upstream thread allocates the initial
+Xapian::Document+ on the heap, copies the pointer to the queue object, and
forgets about it. The index-updating thread peruses the object then
+delete+'s it. Real easy.

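In sketch form (again with invented names, matching the earlier example),
the fix amounts to this:

[source,cpp]
----
#include <string>
#include <xapian.h>

struct IndexUpdateTask {
    Xapian::Document *doc;   // Now a plain pointer: one owner at a time.
    std::string udi;
};

void queue_put(IndexUpdateTask task);

void textsplit_stage(const std::string &text, const std::string &udi)
{
    // Allocate on the heap and hand the pointer over: this thread never
    // touches the object (or its reference counter) again.
    Xapian::Document *newdocument = new Xapian::Document;
    // ... add terms and postings to *newdocument ...
    queue_put(IndexUpdateTask{newdocument, udi});
}

void indexupdate_stage(IndexUpdateTask task)
{
    // ... update the index using *task.doc ...
    delete task.doc;         // Single, unambiguous point of destruction.
}
----
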
An alternative solution would have been to try and use locking to protect
the counter updates. The only place where such locking operations could
reasonably occur is inside the +Xapian::Document+ refcounted pointer
object, which we can't modify. Otherwise, we would have to protect the
_whole scopes of existence_ of the +Xapian::Document+ object in any routine
which creates, copies or (implicitly) deletes it, which would cause many
problems and/or contention issues.

== Why did I miss this?

The mechanism of the crashes is simple enough, quasi-obvious.
How on earth could I miss this problem while writing the code?

For the sake of anecdote, my first brush with atomicity for updates of
reference counters was while debugging a System V release 4 kernel VFS file
system module, at the time when SVR4 got a preemptive kernel with SVR4-MP,
circa 1990... I ended up replacing a +counter+++ with +atomic_add()+ after
a set of _interesting_ debugging sessions interspersed with kernel crashes
and +fsck+ waits. This should have left some memories. So what went wrong?
Here follows a list of possible reasons:

- Reasoning by analogy: +std::string+ objects are safe to use in this
  way. The other objects used in the indexing pipeline are also safe. I
  just used +Xapian::Document+ in the same way without thinking further.
- Probably not how I would do it: faced with designing +Xapian::Document+
  (not that I would be clever enough anyway), I would probably have
  concluded that not wanting to deal with full-on concurrency is one
  thing, but that leaving the reference counters unprotected is another,
  and going too far.
- The problem was not so easily visible because the object deletion is
  implicitly performed during the stack unwind: this provides no clue, no
  specific operation to think about.
- Pure laziness.

As a conclusion, a humble request to library designers: when an
interface works counter to the reasonable expectations of at least some of
the users (for example because it looks like, but works differently from, a
standard library interface), it is worth being very specific in the
documentation and header file comments about the gotchas. Saving people
from their own deficiencies is a worthy goal.

Here, a simple statement that the reference count was not mt-safe
(admittedly redundant with the general statement that the *Xapian* library
does not deal with threads) would have got me thinking and avoided the
error.

++++
<h2 id="comments">Comments</h2>

<div id="disqus_thread"></div>
<script type="text/javascript">
    var disqus_shortname = 'lesbonscomptes';
    (function() {
        var dsq = document.createElement('script');
        dsq.type = 'text/javascript';
        dsq.async = true;
        dsq.src = '//' + disqus_shortname + '.disqus.com/embed.js';
        (document.getElementsByTagName('head')[0] ||
         document.getElementsByTagName('body')[0]).appendChild(dsq);
    })();
</script>
<noscript>Please enable JavaScript to view the <a href="http://disqus.com/?ref_noscript">comments powered by Disqus.</a></noscript>
<a href="http://disqus.com" class="dsq-brlink">comments powered by <span class="logo-disqus">Disqus</span></a>
++++