Jean-Francois Dockes 2012-03-06 07:26:59 +01:00
parent 7fc09a201e
commit de9a06f04c
2 changed files with 105 additions and 50 deletions


@@ -372,6 +372,12 @@ I now use the OpenSUSE build service to create Recoll OpenSUSE packages.
<h3>Updated 1.16 translations that became available after the
release:</h3>
<p>A new Spanish translation for 1.16.2, thanks to JCP.
<a href="translations/recoll_es.ts">recoll_es.ts</a>
<a href="translations/recoll_es.qm">recoll_es.qm</a>
</p>
<p>The following are up to date in 1.16.2, but may be useful if you
are running 1.16.1.</p>
<p>Czech, thanks to Pavel.


@@ -19,20 +19,23 @@
<div class="content">
<h1>Recoll index format details</h1>
<p>A comparison of index formats for recoll 1.8 and omega
1.0.1</p>
<p>A comparison of index formats for recoll 1.17 and omega
1.0.1</p>
<p>Recoll terms are not stemmed before being stored. They are converted
to lowercase with accents removed. An auxiliary database
handles stem expansion. Omega stores both raw
terms and stemmed versions (with prefix Z)</p>
terms (with prefix R) and stemmed versions (with prefix Z).
The Xapian side of the information here comes from the relevant
xapian-omega <a
href="http://xapian.org/docs/omega/termprefixes.html">documentation
page</a>.
</p>
<h2>Special prefixed terms:</h2>
<p>A comparison of prefixed term usage between Recoll and
omega/xapian. <em>xapian-core</em> in the Omega column means
that the prefix is not used by Omega, but mentioned as
allocated in the xapian prefix definition document.</p>
omega/xapian.</p>
<table border=1 cellspacing=0 width="90%">
<thead>
@@ -40,63 +43,109 @@
</tr>
</thead>
<tbody>
<tr><td>T</td><td>mime type</td><td>Same</td>
</tr>
<tr><td>A</td><td>Author</td><td>Same</td></tr>
<tr><td>P</td><td>Truncated/hashed version of file path. For
single-document files, and for the file part of a
multi-document file. Used for up-to-date checks and for
retrieving a document by path. </td><td>Path part of URL (no
hashing). Uses U for the equivalent
term used for up to date checks.</td>
</tr>
<tr><td>Q</td><td>pathhash+ipath same + internal path for
documents inside multi-document files. Used to set the
existence flag for subdocs when a multi-document file is found
to be up to date, or for deleting all subdocs for a file, or
for retrieving a document by path+ipath. Compatible
with Q definition in xapian/termprefixes.txt: unique
identifier.</td><td>None</td>
</tr>
<tr><td>B</td><td>Unused</td><td>Reserved</td></tr>
<tr><td>C</td><td>Unused</td><td>Reserved</td></tr>
<tr><td>D</td><td>date: modification date of file, like
YYYYMMDD</td><td>Same</td>
YYYYMMDD</td><td>Same</td></tr>
<tr><td>E</td><td>Unused. Recoll uses XE</td>
<td>file name extension folded to lowercase</td></tr>
<tr><td>F</td><td>Unused</td><td>Reserved</td></tr>
<tr><td>G</td><td>Unused</td><td>Newsgroup / forum name</td></tr>
<tr><td>H</td><td>Unused</td><td>host name</td></tr>
<tr><td>I</td><td>Unused</td><td>"Can see"</td></tr>
<tr><td>J</td><td>Unused</td><td>Reserved</td></tr>
<tr><td>K</td><td>Keyword</td><td>Same</td></tr>
<tr><td>L</td><td>Unused</td><td>ISO language code</td></tr>
<tr><td>M</td><td>month: YYYYMM</td><td>Same</td></tr>
<tr><td>N</td><td>Unused</td><td>ISO country code</td></tr>
<tr><td>O</td><td>Unused</td><td>Owner</td></tr>
<tr><td>P</td><td>Unused</td><td>Path part of URL</td></tr>
<tr><td>Q</td><td>Unique Id. fs backend: truncated/hashed path+ipath.
Other backends may use a different unique id.
</td><td>Unique Id</td></tr>
<tr><td>R</td><td>Unused</td><td>Raw (unstemmed) term</td></tr>
<tr><td>S</td><td>Subject/title</td><td>Same</td></tr>
<tr><td>T</td><td>mime type</td><td>Same</td></tr>
<tr><td>U</td><td>Unused</td><td>Full URL of indexed
document, truncated/hashed when too long. Used for
duplicate checks.</td></tr>
<tr><td>V</td><td>Unused</td><td>"Can't see"</td></tr>
<tr><td>W</td><td>Unused</td><td>Owner</td></tr>
<tr><td>X</td><td>Introducer for multi-character prefixes</td>
<td>Same</td></tr>
<tr><td>Y</td><td>year YYYY</td><td>Same</td></tr>
<tr><td>Z</td><td>Unused</td><td>Stemmed term</td></tr>
<tr><td>XE</td><td>File name extension folded to lowercase
(omega uses E)</td><td>Unused</td></tr>
<tr><td>XP</td><td>Path elements (for phrase-based directory filtering)
</td><td>Unused</td></tr>
<tr><td>XSFN</td><td>utf8 lowercased/unaccented version of
file name. Used for specific file name searches. NOT SPLIT
(spaces as normal chars).</td><td>None</td></tr>
<tr><td>XTO</td><td>Recipient</td><td>None</td></tr>
<tr><td>XXST</td><td>Not really a prefix: start of field
marker (for anchored phrase searches)</td><td>None</td></tr>
<tr><td>XXND</td><td>Not really a prefix: end of field
marker (for anchored phrase searches)</td><td>None</td>
</tr>
<tr><td>M</td><td>month: YYYYMM</td><td>Same</td>
</tr>
<tr><td>Y</td><td>year YYYY</td><td>Same</td>
</tr>
<tr><td>XSFN</td><td>utf8 version of file name. Used for specific
file name searches</td><td>None</td>
</tr>
<tr><td>U</td><td>None</td><td>Url term. Truncated/hashed version
of URL. Used for duplicate checks.</td>
</tr>
<tr><td>S</td><td>Subject/title</td><td>xapian-core</td>
</tr>
<tr><td>A</td><td>Author</td><td>xapian-core</td>
</tr>
<tr><td>K</td><td>Keyword</td><td>xapian-core</td>
</tr>
</tbody>
</table>
<p>None of the "date" terms are currently used by recoll queries.</p>
<h2>Values</h2>
<p>Recoll currently stores no document values.</p>
<p>Omega stores 2 values, for the md5 hash of the file, and the
last modification date (as unix time). The md5 value does not
appear to be used currently.</p>
<table border=1 cellspacing=0 width="90%">
<thead>
<tr><th>Value slot</th><th>Recoll use</th><th>Omega use</th>
</tr>
</thead>
<tbody>
<tr><td>0</td><td>Unused</td><td>Unix modification time</td></tr>
<tr><td>1</td><td>MD5</td><td>Same</td></tr>
<tr><td>2</td><td>Unused</td><td>Size</td></tr>
<tr><td>10</td><td>Signature: value to be checked for
up-to-dateness, i.e. mtime|size for the fs
backend</td><td>Unused</td></tr>
</tbody>
</table>
<h2>Document data record format</h2>
<p>Recoll has the same line based / prefixed data record format
as omega (name=value\n).</p>
as omega (name=value\n). The Omega data below is quite out of
date.</p>
<table border=1 cellspacing=0 width="90%">
<thead>
@@ -141,7 +190,7 @@
<address><a href="mailto:jfd@recoll.org">Jean-Francois Dockes</a></address>
<!-- Created: Thu Dec 7 13:07:40 CET 2006 -->
<!-- hhmts start -->
Last modified: Thu Jun 14 11:14:38 CEST 2007
Last modified: Sat Feb 25 09:14:38 CEST 2012
<!-- hhmts end -->
</body>
</html>