doc
parent
f19f790b36
commit
bab589c7ee
@ -44,6 +44,14 @@
|
||||
|
||||
<p><br></p>
|
||||
|
||||
<h1>Howtos on the Recoll Wiki</h1>
|
||||
|
||||
<p>You will find a number of useful tips for common
|
||||
issues and extensions on the
|
||||
<a href="http://bitbucket.org/medoc/recoll/wiki/">
|
||||
Recoll Wiki</a> on
|
||||
<a href="http://bitbucket.org/medoc/recoll">bitbucket.org</a>.</p>
|
||||
|
||||
<h1>Other documentation</h1>
|
||||
|
||||
<ul>
|
||||
|
||||
@ -371,7 +371,6 @@ Bitbucket).</p>
|
||||
<p>A Danish translation by Morten Langlo:
|
||||
<a href="translations/recoll_da.ts">recoll_da.ts</a>
|
||||
<a href="translations/recoll_da.qm">recoll_da.qm</a><br/>
|
||||
This is in 1.20.6
|
||||
</p>
|
||||
|
||||
<p>Note that, if you are running an older release, you may find updated
|
||||
|
||||
@ -1,2 +1,7 @@
|
||||
threadingRecoll.html : threadingRecoll.txt nothreads.png
|
||||
asciidoc threadingRecoll.txt
|
||||
.SUFFIXES: .txt .html
|
||||
|
||||
.txt.html:
|
||||
asciidoc $<
|
||||
|
||||
all: threadingRecoll.html forkingRecoll.html xapDocCopyCrash.html
|
||||
|
||||
|
||||
@ -19,39 +19,40 @@ many texts which address the subject. While researching, though, I found
|
||||
out that not so many were accurate and that a lot of questions were left as
|
||||
an exercise to the reader.
|
||||
|
||||
This document will list the references I found reliable and interesting and
|
||||
describe the solution chosen along the other possible approaches.
|
||||
|
||||
== Issues with fork
|
||||
|
||||
The traditional way for a Unix process to start another is the
|
||||
fork()/exec() system call pair. The initial fork() duplicates the address
|
||||
space and resources (open files etc.) of the first process, then duplicates
|
||||
the thread of execution, ending up with 2 mostly identical processes.
|
||||
exec() then replaces part of the newly executing process with an address space
|
||||
initialized from an executable file, inheriting some of the old assets
|
||||
+fork()+/+exec()+ system call pair.
|
||||
|
||||
+fork()+ duplicates the process address space and resources (open files
|
||||
etc.), then duplicates the thread of execution, ending up with 2 mostly
|
||||
identical processes.
|
||||
|
||||
+exec()+ then replaces part of the newly executing process with an address
|
||||
space initialized from an executable file, inheriting some of the resources
|
||||
under various conditions.
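
As an illustration of the sequence just described, here is a minimal sketch of the classic +fork()+/+exec()+ pair followed by a wait. The helper name +run_with_fork()+ is hypothetical and is not taken from the Recoll sources; error handling is reduced to the bare minimum.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run argv[0] through fork()/exec() and return its
 * exit status, or -1 if no child could be created or it died abnormally. */
int run_with_fork(char *const argv[])
{
    assert(argv != NULL && argv[0] != NULL);

    pid_t pid = fork();          /* duplicate the calling process */
    if (pid < 0)
        return -1;               /* fork failed: no child exists */

    if (pid == 0) {
        /* Child: replace the duplicated image with the new program. */
        execvp(argv[0], argv);
        _exit(127);              /* only reached if exec failed */
    }

    /* Parent: wait for the child and decode its status. */
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The whole address space duplicated by +fork()+ is thrown away a few instructions later by +execvp()+, which is the copy-before-discard cost discussed next.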
|
||||
|
||||
As processes became bigger the copying-before-discard operation wasted
|
||||
As processes became bigger the copy-before-discard operation wasted
|
||||
significant resources, and was optimized using two methods (at very
|
||||
different points in time):
|
||||
|
||||
- The first approach was to supplement fork() with the vfork() call, which
|
||||
- The first approach was to supplement +fork()+ with the +vfork()+ call, which
|
||||
is similar but does not duplicate the address space: the new process
|
||||
thread executes in the old address space. The old thread is blocked
|
||||
until the new one calls exec() and frees up access to the memory
|
||||
until the new one calls +exec()+ and frees up access to the memory
|
||||
space. Any modification performed by the child thread persists when
|
||||
the old one resumes.
|
||||
|
||||
- The more modern approach, which coexists with vfork(), was to replace
|
||||
- The more modern approach, which coexists with +vfork()+, was to replace
|
||||
the full duplication of the memory space with duplication of the page
|
||||
descriptors only. The pages in the new process are marked copy-on-write
|
||||
so that the new process has write access to its memory without
|
||||
disturbing its parent. The problem with this approach is that the
|
||||
operation can still be a significant resource consumer for big processes
|
||||
mapping a lot of memory. Many processes can fall in this category not
|
||||
because they have huge data segments, but just because they are linked
|
||||
to many shared libraries.
|
||||
disturbing its parent. This approach was supposed to make +vfork()+
|
||||
obsolete, but the operation can still be a significant resource consumer
|
||||
for big processes mapping a lot of memory, so that +vfork()+ is still
|
||||
around. Programs can have big memory spaces not only because they have
|
||||
huge data segments (rare), but just because they are linked to many
|
||||
shared libraries (more common).
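
The +vfork()+ variant can be sketched as follows, under the assumption of a hypothetical helper named +run_with_vfork()+ (not the Recoll code). Because the child runs in the parent's address space while the calling thread is blocked, the sketch lets the child do nothing except call an exec function or +_exit()+.

```c
#define _DEFAULT_SOURCE          /* makes vfork() visible in strict modes */
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: same contract as a fork()-based runner, but the
 * address space is borrowed from the parent instead of being duplicated. */
int run_with_vfork(char *const argv[])
{
    assert(argv != NULL && argv[0] != NULL);

    pid_t pid = vfork();         /* calling thread blocks until exec/_exit */
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: touches no variable, never returns from this function. */
        execvp(argv[0], argv);
        _exit(127);              /* _exit(), not exit(): no atexit handlers */
    }

    /* Parent resumes here once the child has called exec or _exit. */
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```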
|
||||
|
||||
NOTE: Orders of magnitude: a *recollindex* process will easily grow into a
|
||||
few hundred megabytes of virtual space. It executes the small and
|
||||
@ -60,7 +61,7 @@ indexing multiple such files, *recollindex* can spend '60% of its CPU time'
|
||||
doing `fork()`/`exec()` housekeeping instead of useful work (this is on Linux,
|
||||
where `fork()` uses copy-on-write).
|
||||
|
||||
Apart from the performance cost, another issue with fork() is that a big
|
||||
Apart from the performance cost, another issue with +fork()+ is that a big
|
||||
process can fail executing a small command because of the temporary need to
|
||||
allocate twice its address space. This is a much discussed subject which we
|
||||
will leave aside because it generally does not concern *recollindex*, which
|
||||
@ -68,16 +69,16 @@ in typical conditions uses a small portion of the machine virtual memory,
|
||||
so that a temporary doubling is not an issue.
|
||||
|
||||
The Recoll indexer is multithreaded, which may introduce other issues. Here
|
||||
is what happens to threads during the fork()/exec() interval:
|
||||
is what happens to threads during the +fork()+/+exec()+ interval:
|
||||
|
||||
- fork():
|
||||
- +fork()+:
|
||||
* The parent process threads all go on their merry way.
|
||||
* The child process is created with only one thread active, duplicated
|
||||
from the one which called fork()
|
||||
- vfork()
|
||||
* The parent process thread calling vfork() is suspended, the others
|
||||
from the one which called +fork()+
|
||||
- +vfork()+
|
||||
* The parent process thread calling +vfork()+ is suspended, the others
|
||||
are unaffected.
|
||||
* The child is created with only one thread, as for fork().
|
||||
* The child is created with only one thread, as for +fork()+.
|
||||
This thread shares the memory space with the parent ones, without
|
||||
having any means to synchronize with them (pthread locks are not
|
||||
supposed to work across processes): caution is needed!
|
||||
@ -92,14 +93,14 @@ performed in the child (if no cleanup is performed, pipes may remain open
|
||||
at both ends, which will prevent seeing EOFs etc.). Thanks to StackExchange
|
||||
user Celada for explaining this to me.
|
||||
|
||||
For multithreaded programs, both fork() and vfork() introduce possibilities
|
||||
For multithreaded programs, both +fork()+ and +vfork()+ introduce possibilities
|
||||
of deadlock, because the resources held by a non-forking thread in the
|
||||
parent process can't be released in the child because the thread is not
|
||||
duplicated. This used to happen from time to time in *recollindex* because
|
||||
of an error logging call performed if the exec() failed after the fork()
|
||||
of an error logging call performed if the +exec()+ failed after the +fork()+
|
||||
(e.g. command not found).
|
||||
|
||||
With vfork() it is also possible to trigger a deadlock in the parent by
|
||||
With +vfork()+ it is also possible to trigger a deadlock in the parent by
|
||||
(inadvertently) modifying data in the child. This could happen just
|
||||
link:http://www.oracle.com/technetwork/server-storage/solaris10/subprocess-136439.html[because
|
||||
of dynamic linker operation] (which, seriously, should be considered a
|
||||
@ -110,7 +111,7 @@ In general, the state of program data in the child process is a semi-random
|
||||
snapshot of what it was in the parent, and the official word about what you
|
||||
can do is that you can only call
|
||||
link:http://man7.org/linux/man-pages/man7/signal.7.html[async-safe library
|
||||
functions] between 'fork()' and 'exec()'. These are functions which are
|
||||
functions] between +fork()+ and +exec()+. These are functions which are
|
||||
safe to call from a signal handler because they are either reentrant or
|
||||
can't be interrupted by a signal. A notable missing entry in the list is
|
||||
`malloc()`.
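
To make the restriction concrete, here is a hedged sketch of a child that reports an exec failure using only calls from the async-signal-safe list. The helper name +run_reporting_failure()+ and the message text are assumptions made for the example. It uses +execv()+ with a full path because +execvp()+ does not appear on that list.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork, exec the program at argv[0] (a full path),
 * and report an exec failure with write() and _exit() only. */
int run_reporting_failure(char *const argv[])
{
    static const char msg[] = "exec failed\n";

    assert(argv != NULL && argv[0] != NULL);

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        execv(argv[0], argv);
        /* Not safe at this point in a multithreaded program: fprintf(),
         * malloc(), or a logger taking a lock that another (now missing)
         * thread may have held at the time of the fork. */
        if (write(STDERR_FILENO, msg, sizeof msg - 1) < 0) {
            /* nothing more can be done safely */
        }
        _exit(127);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The message is a static constant so that no formatting or allocation is needed in the child.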
|
||||
@ -120,8 +121,8 @@ another program (but the devil is in the details as demonstrated by the
|
||||
logging call issue...).
|
||||
|
||||
One of the approaches often proposed for working around this mine-field is
|
||||
to use an auxiliary, small, process to execute any command needed by the
|
||||
main one. The small process can just use fork() with no performance
|
||||
to use an auxiliary small process to execute any command needed by the main
|
||||
one. The small process can just use +fork()+/+exec()+ with no performance
|
||||
issues. This has the drawback of complicating communication a lot if
|
||||
data needs to be transferred one way or another.
|
||||
|
||||
@ -164,28 +165,54 @@ descriptors bigger than a specified value (closefrom() equivalent). This is
|
||||
available on Solaris and quite necessary in fact, because we have no way to
|
||||
be sure that all open descriptors have the CLOEXEC flag set.
|
||||
|
||||
12500 small .doc files:
|
||||
So, no `posix_spawn()` for us (support was implemented inside
|
||||
*recollindex*, but the code is normally not used).
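
For reference, a basic +posix_spawnp()+ call looks like the sketch below. The helper name +run_with_spawn()+ is hypothetical. With null file actions and attributes, the child inherits every descriptor which does not have the close-on-exec flag set, which is the limitation discussed above.

```c
#include <assert.h>
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ;

/* Hypothetical helper: start argv[0] with posix_spawnp() and return its
 * exit status, or -1 if the spawn or the wait failed. */
int run_with_spawn(char *const argv[])
{
    assert(argv != NULL && argv[0] != NULL);

    pid_t pid;
    /* No file actions, no attributes: descriptors are inherited as-is. */
    int err = posix_spawnp(&pid, argv[0], NULL, NULL, argv, environ);
    if (err != 0)
        return -1;               /* err holds an errno-style code */

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

How a missing command is reported (error return from the spawn call, or a child exiting with status 127) depends on the C library, so the sketch does not rely on either behaviour.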
|
||||
|
||||
fork: real 0m46.025s user 0m26.574s sys 0m39.494s
|
||||
vfork: real 0m18.223s user 0m17.753s sys 0m1.736s
|
||||
spawn/fork: real 0m45.726s user 0m27.082s sys 0m40.575s
|
||||
spawn/vfork: real 0m18.915s user 0m18.681s sys 0m3.828s
|
||||
== The chosen solution
|
||||
|
||||
No surprise here, given the implementation of posix_spawn(), it gets the
|
||||
same times as the fork/vfork options.
|
||||
The previous version of +recollindex+ used +vfork()+ if it was running
|
||||
a single thread, and +fork()+ if it ran multiple ones.
|
||||
|
||||
It is difficult to ignore the 60% reduction in execution time offered by
|
||||
using 'vfork()'.
|
||||
After another careful look at the code, I could see few issues with
|
||||
using +vfork()+ in the multithreaded indexer, so this was committed.
|
||||
|
||||
The only change necessary was to get rid of an implementation of the
|
||||
+closefrom()+ call, which Linux lacks (used to close all open descriptors above a
|
||||
given value). The previous Recoll implementation listed the +/proc/self/fd+
|
||||
directory to look for open descriptors but this was unsafe because of
|
||||
possible memory allocations in +opendir()+ etc.
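
One allocation-free way to emulate +closefrom()+ is a brute-force loop over descriptor numbers. The sketch below is an assumption about how this could be done, not a description of the code actually committed; the function name +close_descriptors_from()+ and its explicit upper bound parameter are hypothetical.

```c
#include <assert.h>
#include <unistd.h>

/* Hypothetical closefrom() stand-in: close every descriptor in the range
 * [lowfd, maxfd). It only calls close(), so it needs no memory allocation,
 * at the cost of many useless calls on descriptors which are not open. */
void close_descriptors_from(int lowfd, int maxfd)
{
    assert(lowfd >= 0 && maxfd >= lowfd);

    for (int fd = lowfd; fd < maxfd; fd++) {
        /* EBADF for unused slots is expected and ignored. */
        close(fd);
    }
}
```

The upper bound can be computed once in the parent, for example with +sysconf(_SC_OPEN_MAX)+, before calling +vfork()+, so that the child has nothing to compute. On systems where this limit is very large, the loop becomes expensive, which is the price of not reading +/proc/self/fd+.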
|
||||
|
||||
== Test results
|
||||
|
||||
.Indexing 12500 small .doc files
[options="header"]
|===============================
|call        |real      |user      |sys
|fork        |0m46.025s |0m26.574s |0m39.494s
|vfork       |0m18.223s |0m17.753s |0m1.736s
|spawn/fork  |0m45.726s |0m27.082s |0m40.575s
|spawn/vfork |0m18.915s |0m18.681s |0m3.828s
|recoll 1.18 |1m47.589s |0m21.537s |0m29.458s
|===============================
|
||||
|
||||
No surprise here: given the implementation of +posix_spawn()+, it gets the
|
||||
same times as the +fork()+/+vfork()+ options.
|
||||
|
||||
The tests were performed on an Intel Core i5 750 (4 cores, 4 threads).
|
||||
|
||||
The last line is just for fun: *recollindex* 1.18 (single-threaded)
|
||||
needed almost 6 times as long to process the same files...
|
||||
|
||||
It would be painful to play it safe and discard the 60% reduction in
|
||||
execution time offered by using +vfork()+.
|
||||
|
||||
To this day, no problems have been discovered, but I am still crossing my fingers...
|
||||
|
||||
////
|
||||
Objections to vfork:
|
||||
ld.so locks
|
||||
sigaction locks
|
||||
|
||||
https://bugzilla.redhat.com/show_bug.cgi?id=193631
|
||||
|
||||
Is Linux vfork thread-safe ? Quoting interesting comments from Solaris
|
||||
implementation:
|
||||
No answer to the issues cited though.
|
||||
|
||||
implementation: No answer to the issues cited though.
|
||||
https://sourceware.org/bugzilla/show_bug.cgi?id=378
|
||||
Use vfork() in posix_spawn()
|
||||
////
|
||||
|
||||
88
website/release-1.21.html
Normal file
@ -0,0 +1,88 @@
|
||||
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
|
||||
<html>
|
||||
<head>
|
||||
<title>Recoll 1.21 series release notes</title>
|
||||
<meta name="Author" content="Jean-Francois Dockes">
|
||||
<meta name="Description"
|
||||
content="recoll is a simple full-text search system for unix and linux based on the powerful and mature xapian engine">
|
||||
<meta name="Keywords" content="full text search, desktop search, unix, linux">
|
||||
<meta http-equiv="Content-language" content="en">
|
||||
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
|
||||
<meta name="robots" content="All,Index,Follow">
|
||||
<link type="text/css" rel="stylesheet" href="styles/style.css">
|
||||
</head>
|
||||
|
||||
<body>
|
||||
|
||||
<div class="rightlinks">
|
||||
<ul>
|
||||
<li><a href="index.html">Home</a></li>
|
||||
<li><a href="download.html">Downloads</a></li>
|
||||
<li><a href="doc.html">Documentation</a></li>
|
||||
</ul>
|
||||
</div>
|
||||
|
||||
<div class="content">
|
||||
<h1>Release notes for Recoll 1.21.x</h1>
|
||||
|
||||
<h2>Caveats</h2>
|
||||
|
||||
<p><em>Installing over an older version</em>: 1.19 </p>
|
||||
|
||||
<p>1.20 and 1.21 indexes are fully compatible. Installing 1.21
|
||||
over a 1.19 index is possible, but there have been small
|
||||
changes in the way compound words (e.g. email addresses) are
|
||||
indexed, so it will be best to reset the index. Still, in a
|
||||
pinch, 1.21 search can mostly use a 1.19 index. </p>
|
||||
|
||||
<p>Always reset the index if you do not know by which version it
|
||||
was created (you're not sure it's at least 1.18). The best method
|
||||
is to quit all Recoll programs and delete the index directory
|
||||
(<span class="literal">
|
||||
rm -rf ~/.recoll/xapiandb</span>), then start <code>recoll</code>
|
||||
or <code>recollindex</code>. <br>
|
||||
|
||||
<span class="literal">recollindex -z</span> will do the same
|
||||
in most, but not all, cases. It's better to use
|
||||
the <tt>rm</tt> method, which will also ensure that no debris
|
||||
from older releases remains (e.g. old stemming files which are
|
||||
not used any more).</p>
|
||||
|
||||
<p>Case/diacritics sensitivity is off by default. It can be
|
||||
turned on <em>only</em> by editing
|
||||
recoll.conf (
|
||||
<a href="usermanual/usermanual.html#RCL.INDEXING.CONFIG.SENS">
|
||||
see the manual</a>). If you do so, you must then reset the
|
||||
index.</p>
|
||||
|
||||
|
||||
<h2>Changes in Recoll 1.21</h2>
|
||||
|
||||
<ul>
|
||||
<li>Allow saving queries to files and reloading them
|
||||
later. Available both for simple and advanced queries, and
|
||||
based on XML files.</li>
|
||||
<li>A Bison-based query parser replaces the old regexp-based
|
||||
one and allows parenthesized sub-expressions and easier future
|
||||
expansions.</li>
|
||||
<li>The GUI gets a "close to system tray" function.</li>
|
||||
<li>Avoid retrying to index previously indexed files if
|
||||
nothing seems to have changed in the filters.</li>
|
||||
<li>Improve indexing speed by always using vfork() for
|
||||
spawning external commands.</li>
|
||||
<li>The pdf filter gains the capability to run OCR (tesseract) on
|
||||
image-only files.</li>
|
||||
<li>Improved check about when we should try to uncompress
|
||||
stuff. Will eliminate some of the most dreadful cases of
|
||||
recollindex having an impact on system performance.</li>
|
||||
<li>Warn if non-existent paths are listed in the configuration
|
||||
file (help with typos).</li>
|
||||
<li>Adjust background color for webkit-based elements (result
|
||||
list and snippets window) according to desktop setup.</li>
|
||||
<li>Listing the results with the KIO slave is now
|
||||
performed with incremental updates. Bumped max entries to
|
||||
10000.</li>
|
||||
</ul>
|
||||
</div>
|
||||
</body>
|
||||
</html>
|
||||
@ -25,89 +25,99 @@
|
||||
<div class="content">
|
||||
<h1>Major Recoll releases at a glance</h1>
|
||||
|
||||
<p>A summary of the major releases and the main features which
|
||||
came with them.</p>
|
||||
<p>A summary of the major releases and the main features which
|
||||
came with them.</p>
|
||||
|
||||
<dl>
|
||||
<dt><a href="release-1.21.html">Release 1.21</a> (future): new
|
||||
query parser</dt>
|
||||
<dd>
|
||||
<ul>
|
||||
<li>A Bison-based query parser replaces the old
|
||||
regexp-based one and allows parenthesized
|
||||
sub-expressions and easier future
|
||||
expansions.</li>
|
||||
<li>Avoid retrying to index previously
|
||||
indexed files if nothing seems to have
|
||||
changed in the filters.</li>
|
||||
<li>Allow saving queries to files and reloading them
|
||||
later. Available both for simple and advanced queries, and
|
||||
based on XML files.</li>
|
||||
<li>Improve indexing speed by always using
|
||||
vfork() for spawning external commands.</li>
|
||||
<li>GUI gets "close to system tray" function.</li>
|
||||
</ul>
|
||||
</dd>
|
||||
|
||||
<dt><a href="release-1.20.html">Release 1.20</a>: small
|
||||
improvements</dt>
|
||||
<dd>
|
||||
<ul>
|
||||
<li><i>Open With</i> results list popup menu entry.</li>
|
||||
<li><i>fieldname:term1,term2</i>
|
||||
and <i>fieldname:term1/term2</i> shortcuts for AND/OR
|
||||
searches inside fields.</li>
|
||||
<li><i>Query fragments</i> tool.</li>
|
||||
<li>Better handling of compound terms like mail
|
||||
addresses.</li>
|
||||
<li>Selection on source collection type (Web history / File
|
||||
system).</li>
|
||||
<li>Configurable GUI geometry.</li>
|
||||
<li>Different handling
|
||||
of container file / subdocuments file name searches.</li>
|
||||
<li>Simultaneous -e -i options to recollindex.</li>
|
||||
</ul>
|
||||
</dd>
|
||||
|
||||
<dt><a href="release-1.19.html">Release 1.19</a>: multithreaded
|
||||
indexing</dt>
|
||||
<dd>
|
||||
<ul>
|
||||
<li>Better indexing performance through
|
||||
multithreading.</li>
|
||||
<li>Display list of subdocuments (e.g. attachments) for a
|
||||
given result.</li>
|
||||
<li>Collapsed duplicate results display link.</li>
|
||||
<li>Path translation facility (for portable indexes).</li>
|
||||
<li>Caches last uncompressed file (e.g. for fast
|
||||
compressed mbox access).</li>
|
||||
<li>Partial recursive reindex option to command line
|
||||
indexer.</li>
|
||||
<li>Can import tags from external application.</li>
|
||||
<li>Extended attributes indexing is on by default.</li>
|
||||
<li>New Python interface for data access. API re-modeled against
|
||||
newer Python Database API 2.0.</li>
|
||||
<li>Shared librecoll.so.</li>
|
||||
</ul>
|
||||
</dd>
|
||||
|
||||
<dt><a href="release-1.18.html">Release 1.18</a>: case and
|
||||
diacritics switchable sensitivity</dt>
|
||||
<dd>
|
||||
<ul>
|
||||
<li>Index configuration for case and diacritics sensitivity.</li>
|
||||
<li>Advanced search history.</li>
|
||||
<li>Page-level access when opening PDFs, and snippets
|
||||
window.</li>
|
||||
<li>Use Xapian Synonyms tables for query expansion.</li>
|
||||
</ul>
|
||||
</dd>
|
||||
|
||||
<dt><a href="release-1.17.html">Release 1.17</a>: small
|
||||
improvements</dt>
|
||||
<dd>
|
||||
<ul>
|
||||
<li>Language-dependent unaccenting.</li>
|
||||
<li>GUI dialogs for indexing schedule setup.</li>
|
||||
<li>Phrase-based <i>dir:</i> filtering, accepting path
|
||||
fragments. Size filtering.</li>
|
||||
<li>Python module default install and Unity Lens.</li>
|
||||
<li>Result list switched to WebKit: drops qt3 support.</li>
|
||||
<li>Indexing always performed by separate process.</li>
|
||||
<li>Dynamic category filters (defined as language fragments).</li>
|
||||
</ul>
|
||||
</dd>
|
||||
|
||||
<dl>
|
||||
<dt><a href="release-1.21.html">Release 1.21</a> (future): new
|
||||
query parser</dt>
|
||||
<dd>
|
||||
<ul>
|
||||
<li>Bison-based query parser replaces old regexp-based
|
||||
one and allows parenthesized sub-expressions and easier
|
||||
future expansions.</li>
|
||||
<li>Avoid retrying to index previously
|
||||
indexed files if nothing seems to have
|
||||
changed in the filters.</li>
|
||||
</ul>
|
||||
</dd>
|
||||
<dt><a href="release-1.20.html">Release 1.20</a>: small
|
||||
improvements</dt>
|
||||
<dd>
|
||||
<ul>
|
||||
<li><i>Open With</i> results list popup menu entry.</li>
|
||||
<li><i>fieldname:term1,term2</i>
|
||||
and <i>fieldname:term1/term2</i> shortcuts for AND/OR
|
||||
searches inside fields.</li>
|
||||
<li><i>Query fragments</i> tool.</li>
|
||||
<li>Better handling of compound terms like mail
|
||||
addresses.</li>
|
||||
<li>Selection on source collection type (Web history / File
|
||||
system).</li>
|
||||
<li>Configurable GUI geometry.</li>
|
||||
<li>Different handling
|
||||
of container file / subdocuments file name searches.</li>
|
||||
<li>Simultaneous -e -i options to recollindex.</li>
|
||||
</ul>
|
||||
</dd>
|
||||
</dd>
|
||||
<dt><a href="release-1.19.html">Release 1.19</a>: multithreaded
|
||||
indexing</dt>
|
||||
<dd>
|
||||
<ul>
|
||||
<li>Better indexing performance through
|
||||
multithreading.</li>
|
||||
<li>Display list of subdocuments (e.g. attachments) for a
|
||||
given result.</li>
|
||||
<li>Collapsed duplicate results display link.</li>
|
||||
<li>Path translation facility (for portable indexes).</li>
|
||||
<li>Caches last uncompressed file (e.g. for fast
|
||||
compressed mbox access).</li>
|
||||
<li>Partial recursive reindex option to command line
|
||||
indexer.</li>
|
||||
<li>Can import tags from external application.</li>
|
||||
<li>Extended attributes indexing is on by default.</li>
|
||||
<li>New Python interface for data access. API re-modeled against
|
||||
newer Python Database API 2.0.</li>
|
||||
<li>Shared librecoll.so.</li>
|
||||
</ul>
|
||||
</dd>
|
||||
<dt><a href="release-1.18.html">Release 1.18</a>: case and
|
||||
diacritics switchable sensitivity</dt>
|
||||
<dd>
|
||||
<ul>
|
||||
<li>Index configuration for case and diacritics sensitivity.</li>
|
||||
<li>Advanced search history.</li>
|
||||
<li>Page-level access when opening PDFs, and snippets
|
||||
window.</li>
|
||||
<li>Use Xapian Synonyms tables for query expansion.</li>
|
||||
</ul>
|
||||
</dd>
|
||||
<dt><a href="release-1.17.html">Release 1.17</a>: small
|
||||
improvements</dt>
|
||||
<dd>
|
||||
<ul>
|
||||
<li>Language-dependent unaccenting.</li>
|
||||
<li>GUI dialogs for indexing schedule setup.</li>
|
||||
<li>Phrase-based <i>dir:</i> filtering, accepting path
|
||||
fragments. Size filtering.</li>
|
||||
<li>Python module default install and Unity Lens.</li>
|
||||
<li>Result list switched to WebKit: drops qt3 support.</li>
|
||||
<li>Indexing always performed by separate process.</li>
|
||||
<li>Dynamic category filters (defined as language fragments).</li>
|
||||
</ul>
|
||||
</dd>
|
||||
|
||||
</div>
|
||||
</body>
|
||||
</html>
|
||||
|
||||