Jean-Francois Dockes 2012-03-06 07:26:59 +01:00
parent 7fc09a201e
commit de9a06f04c
2 changed files with 105 additions and 50 deletions


@@ -372,6 +372,12 @@ I now use the OpenSUSE build service to create Recoll OpenSUSE packages.
<h3>Updated 1.16 translations that became available after the <h3>Updated 1.16 translations that became available after the
release:</h3> release:</h3>
<p>A new Spanish translation for 1.16.2, thanks to JCP.
<a href="translations/recoll_es.ts">recoll_es.ts</a>
<a href="translations/recoll_es.qm">recoll_es.qm</a>
</p>
<p>The following are up to date in 1.16.2, but may be useful if you <p>The following are up to date in 1.16.2, but may be useful if you
are running 1.16.1.</p> are running 1.16.1.</p>
<p>Czech, thanks to Pavel. <p>Czech, thanks to Pavel.


@@ -19,20 +19,23 @@
<div class="content"> <div class="content">
<h1>Recoll index format details</h1> <h1>Recoll index format details</h1>
<p>A comparison of index formats for recoll 1.8 and omega <p>A comparison of index formats for recoll 1.17 and omega
1.0.1</p> 1.0.1</p>
<p>Recoll terms are not stemmed before being stored. They are turned to <p>Recoll terms are not stemmed before being stored. They are turned to
all minuscule letters with no accents. An auxiliary database all minuscule letters with no accents. An auxiliary database
handles stem expansion. Omega stores both raw handles stem expansion. Omega stores both raw
terms and stemmed versions (with prefix Z)</p> terms (with prefix R) and stemmed versions (with prefix Z).
The Xapian-side information here comes from the relevant
xapian-omega <a
href="http://xapian.org/docs/omega/termprefixes.html">documentation
page</a>.
</p>
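As a rough illustration of the term folding described above (lowercase, no accents), here is a minimal Python sketch. It is not Recoll's actual unaccenting code: the function name `fold_term` is hypothetical, and Unicode decomposition is only an assumed approximation of what the indexer does.

```python
import unicodedata


def fold_term(term: str) -> str:
    """Lowercase a term and strip accents (approximation, see lead-in)."""
    # Decompose accented characters into base character + combining marks.
    decomposed = unicodedata.normalize("NFD", term)
    # Drop the combining marks, keeping only the base characters.
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.lower()
```

Under these assumptions, `fold_term("Élève")` yields `eleve`; stemming would then happen at query time through the auxiliary database rather than at indexing time.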
<h2>Special prefixed terms:</h2> <h2>Special prefixed terms:</h2>
<p>A comparison of prefixed term usage between Recoll and <p>A comparison of prefixed term usage between Recoll and
omega/xapian. <em>xapian-core</em> in the Omega column means omega/xapian.</p>
that the prefix is not used by Omega, but mentionned as
allocated in the xapian prefix definition document.</p>
<table border=1 cellspacing=0 width="90%"> <table border=1 cellspacing=0 width="90%">
<thead> <thead>
@@ -40,63 +43,109 @@
</tr> </tr>
</thead> </thead>
<tbody> <tbody>
<tr><td>T</td><td>mime type</td><td>Same</td> <tr><td>A</td><td>Author</td><td>Same</td></tr>
</tr>
<tr><td>P</td><td>Truncated/hashed version of file path. For <tr><td>B</td><td>Unused</td><td>Reserved</td></tr>
single-document files, and for the file part of a <tr><td>C</td><td>Unused</td><td>Reserved</td></tr>
multi-document file. Used for up-to-date checks and for
retrieving a document by path. </td><td>Path part of URL (no
hashing). Uses U for the equivalent
term used for up to date checks.</td>
</tr>
<tr><td>Q</td><td>pathhash+ipath same + internal path for
documents inside multi-document files. Used to set the
existence flag for subdocs when a multi-document file is found
to be up to date, or for deleting all subdocs for a file, or
for retrieving a document by path+ipath. Compatible
with Q definition in xapian/termprefixes.txt: unique
identifier.</td><td>None</td>
</tr>
<tr><td>D</td><td>date: modification date of file, like <tr><td>D</td><td>date: modification date of file, like
YYYYMMDD</td><td>Same</td> YYYYMMDD</td><td>Same</td></tr>
<tr><td>E</td><td>Unused. Recoll uses XE</td>
<td>file name extension folded to lowercase</td></tr>
<tr><td>F</td><td>Unused</td><td>Reserved</td></tr>
<tr><td>G</td><td>Unused</td><td>Newsgroup / forum name</td></tr>
<tr><td>H</td><td>Unused</td><td>host name</td></tr>
<tr><td>I</td><td>Unused</td><td>"Can see"</td></tr>
<tr><td>J</td><td>Unused</td><td>Reserved</td></tr>
<tr><td>K</td><td>Keyword</td><td>Same</td></tr>
<tr><td>L</td><td>Unused</td><td>ISO language code</td></tr>
<tr><td>M</td><td>month: YYYYMM</td><td>Same</td></tr>
<tr><td>N</td><td>Unused</td><td>ISO country code</td></tr>
<tr><td>O</td><td>Unused</td><td>Owner</td></tr>
<tr><td>P</td><td>Unused</td><td>Path part of URL</td></tr>
<tr><td>Q</td><td>Unique Id. fs backend: trunc-hashed path+ipath
Other backends may use a different unique id.
</td><td>Unique Id</td></tr>
<tr><td>R</td><td>Unused</td><td>Raw (unstemmed) term</td></tr>
<tr><td>S</td><td>Subject/title</td><td>Same</td></tr>
<tr><td>T</td><td>mime type</td><td>Same</td></tr>
<tr><td>U</td><td>Unused</td><td>Full Url of indexed
document. Truncated/hashed version of URL. Used for
duplicate checks.</td></tr>
<tr><td>V</td><td>Unused</td><td>"Can't see"</td></tr>
<tr><td>W</td><td>Unused</td><td>Owner</td></tr>
<tr><td>X</td><td>Leading character for multi-character prefixes</td>
<td>Same</td></tr>
<tr><td>Y</td><td>year YYYY</td><td>Same</td></tr>
<tr><td>Z</td><td>Unused</td><td>Stemmed term</td></tr>
<tr><td>XE</td><td>File name extension folded as lowercase
(omega uses E)</td><td>Unused</td></tr>
<tr><td>XP</td><td>Path elements (for phrase-based directory filtering)
</td><td>Unused</td></tr>
<tr><td>XSFN</td><td>utf8 lowercased/unaccented version of
file name. Used for specific file name searches. NOT SPLIT
(spaces as normal chars).</td><td>None</td>
<tr><td>XTO</td><td>Recipient</td><td>None</td>
<tr><td>XXST</td><td>Not really a prefix: start of field
marker (for anchored phrase searches)</td><td>None</td>
<tr><td>XXND</td><td>Not really a prefix: end of field
marker (for anchored phrase searches)</td><td>None</td>
</tr> </tr>
<tr><td>M</td><td>month: YYYYMM</td><td>Same</td>
</tr>
<tr><td>Y</td><td>year YYYY</td><td>Same</td>
</tr>
<tr><td>XSFN</td><td>utf8 version of file name. Used for specific
file name searches</td><td>None</td>
</tr>
<tr><td>U</td><td>None</td><td>Url term. Truncated/hashed version
of URL. Used for duplicate checks.</td>
</tr>
<tr><td>S</td><td>Subject/title</td><td>xapian-core</td>
</tr>
<tr><td>A</td><td>Author</td><td>xapian-core</td>
</tr>
<tr><td>K</td><td>Keyword</td><td>xapian-core</td>
</tr>
</tbody> </tbody>
</table> </table>
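The D, M and Y rows in the table share one pattern: a one-letter prefix followed by a truncated form of the modification date. A minimal Python sketch of how such terms could be derived from a Unix modification time follows. The function name `date_terms` is hypothetical, and using UTC is an assumption made only to keep the example deterministic; an indexer may well use local time.

```python
from datetime import datetime, timezone


def date_terms(mtime: float) -> list[str]:
    """Build the D (YYYYMMDD), M (YYYYMM) and Y (YYYY) prefixed terms."""
    # Assumption: interpret the timestamp in UTC.
    dt = datetime.fromtimestamp(mtime, tz=timezone.utc)
    return [
        "D" + dt.strftime("%Y%m%d"),
        "M" + dt.strftime("%Y%m"),
        "Y" + dt.strftime("%Y"),
    ]
```

For a timestamp of 0 this gives `D19700101`, `M197001` and `Y1970`.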
<p>None of the "date" terms are currently used by recoll queries</p>
<h2>Values</h2> <h2>Values</h2>
<p>Recoll currently stores no document values.</p>
<p>Omega stores 2 values, for the md5 hash of the file, and the <table border=1 cellspacing=0 width="90%">
last modification date (as unix time). The md5 value doesn't <thead>
appear to be currently used ?</p> <tr><th>Value slot</th><th>Recoll use</th><th>Omega use</th>
</tr>
</thead>
<tbody>
<tr><td>0</td><td>Unused</td><td>Unix modification time</td></tr>
<tr><td>1</td><td>MD5</td><td>Same</td></tr>
<tr><td>2</td><td>Unused</td><td>Size</td></tr>
<tr><td>10</td><td>Signature: value checked to decide whether a
document is up to date, i.e. mtime|size for the fs
backend</td><td>Unused</td></tr>
</tbody>
</table>
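The signature in slot 10 supports a simple up-to-date check: recompute the signature for the file and compare it with the stored value. The Python sketch below illustrates the idea only. The exact encoding of the signature is not given here, so joining mtime and size with a literal `|` is an assumption, and both function names are hypothetical.

```python
def make_signature(mtime: int, size: int) -> str:
    """Combine mtime and size into one string (assumed encoding)."""
    # Assumption: a literal "|" separator; the real format may differ.
    return f"{mtime}|{size}"


def needs_reindex(stored_signature: str, mtime: int, size: int) -> bool:
    """Return True when the file no longer matches the stored signature."""
    return stored_signature != make_signature(mtime, size)
```

A document whose stored signature still equals the recomputed one can be skipped; any change in mtime or size triggers reindexing.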
<h2>Document data record format</h2> <h2>Document data record format</h2>
<p>Recoll has the same line based / prefixed data record format <p>Recoll has the same line based / prefixed data record format
as omega (name=value\n).</p> as omega (name=value\n). The Omega data below is quite out of
date.</p>
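A line-based `name=value\n` record like the one described above can be read with a few lines of Python. This is a minimal sketch, not code from Recoll or Omega: the function name `parse_record` and the field names in the usage note are made up for illustration.

```python
def parse_record(data: str) -> dict[str, str]:
    """Parse a name=value-per-line data record into a dictionary."""
    fields: dict[str, str] = {}
    for line in data.splitlines():
        # Split on the first "=" only, so values may contain "=".
        name, sep, value = line.partition("=")
        if sep:  # skip lines that have no "=" at all
            fields[name] = value
    return fields
```

For example, `parse_record("url=file:///tmp/a.txt\nmtype=text/plain\n")` returns a dictionary with the two hypothetical fields `url` and `mtype`.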
<table border=1 cellspacing=0 width="90%"> <table border=1 cellspacing=0 width="90%">
<thead> <thead>
@@ -141,7 +190,7 @@
<address><a href="mailto:jfd@recoll.org">Jean-Francois Dockes</a></address> <address><a href="mailto:jfd@recoll.org">Jean-Francois Dockes</a></address>
<!-- Created: Thu Dec 7 13:07:40 CET 2006 --> <!-- Created: Thu Dec 7 13:07:40 CET 2006 -->
<!-- hhmts start --> <!-- hhmts start -->
Last modified: Thu Jun 14 11:14:38 CEST 2007 Last modified: Sat Feb 25 09:14:38 CEST 2012
<!-- hhmts end --> <!-- hhmts end -->
</body> </body>
</html> </html>