*** empty log message ***

dockes 2007-06-26 16:58:26 +00:00
parent 348b4bc717
commit 2674e45f29
16 changed files with 756 additions and 109 deletions


@ -0,0 +1,88 @@
Summary: Desktop full text search tool with a qt gui
Name: recoll
Version: 1.8.1
Release: %mkrel 1
License: GPL
Group: Databases
URL: http://www.recoll.org/
Source0: http://www.lesbonscomptes.com/recoll/%{name}-%{version}.tar.bz2
Patch1: %{name}-configure.patch
BuildRequires: libxapian-devel
BuildRequires: libfam-devel
BuildRequires: libqt-devel >= 3.3.7
BuildRequires: libaspell-devel
Requires: xapian
BuildRoot: %{_tmppath}/%{name}-%{version}--buildroot
%description
Recoll is a personal full text search tool for Unix/Linux.
It is based on the robust Xapian backend, for which
it provides a feature-rich Qt graphical interface that is
easy to use and administer.
%prep
%setup -q
%patch1 -p0
%build
%configure2_5x \
--with-fam \
--with-aspell
%make
%install
[ "%{buildroot}" != "/" ] && rm -rf %{buildroot}
%makeinstall_std
desktop-file-install --vendor="" \
--add-category="X-MandrivaLinux-MoreApplications-Databases" \
--dir %{buildroot}%{_datadir}/applications %{buildroot}%{_datadir}/applications/*
%clean
[ "%{buildroot}" != "/" ] && rm -rf %{buildroot}
%files
%defattr(644,root,root,755)
%doc %{_datadir}/%{name}/doc
%attr(755,root,root) %{_bindir}/%{name}*
%{_datadir}/applications/recoll-searchgui.desktop
%{_datadir}/icons/hicolor/48x48/apps/recoll-searchgui.png
%dir %{_datadir}/%{name}
%dir %{_datadir}/%{name}/examples
%dir %{_datadir}/%{name}/filters
%dir %{_datadir}/%{name}/images
%dir %{_datadir}/%{name}/translations
%{_datadir}/%{name}/examples/mime*
%{_datadir}/%{name}/examples/*.conf
%attr(755,root,root) %{_datadir}/%{name}/examples/rclmon.sh
%attr(755,root,root) %{_datadir}/%{name}/filters/rc*
%{_datadir}/%{name}/filters/xdg-open
%{_datadir}/%{name}/images/*png
%{_mandir}/man1/recoll*
%{_mandir}/man5/recoll*
%{_datadir}/%{name}/translations/*.qm
%changelog
* Fri Apr 20 2007 Tomasz Pawel Gajc <tpg@mandriva.org> 1.8.1-1mdv2008.0
+ Revision: 16093
- new version
- drop P0
+ Mandriva <devel@mandriva.com>
* Tue Mar 06 2007 Tomasz Pawel Gajc <tpg@mandriva.org> 1.7.5-2mdv2007.0
+ Revision: 134128
- rebuild
* Tue Jan 30 2007 Tomasz Pawel Gajc <tpg@mandriva.org> 1.7.5-1mdv2007.1
+ Revision: 115423
- add patch 1 - fix build on x86_64
- add patch 0 - fix menu entry
- fix group
- add buildrequires
- set correct bits on files
- Import recoll


@ -24,11 +24,12 @@
Dockes</holder>
</copyright>
<releaseinfo>$Id: usermanual.sgml,v 1.44 2007-06-08 16:46:53 dockes Exp $</releaseinfo>
<releaseinfo>$Id: usermanual.sgml,v 1.45 2007-06-26 16:58:25 dockes Exp $</releaseinfo>
<abstract>
<para>This document introduces full text search notions
and describes the installation and use of the &RCL; application.</para>
and describes the installation and use of the &RCL;
application. It currently describes &RCL; 1.9.</para>
</abstract>
@ -771,30 +772,6 @@ fvwm
<replaceable>unplugged</replaceable> but not
<replaceable>potatoes</replaceable> (in any part of the document).</para>
<para>The first element <literal>author:"john doe"</literal> is
a phrase search limited to a specific field. Phrase searches are
specified as usual by enclosing the words in double quotes. The
field specification appears before the colon (of course this is
not limited to phrases, <literal>author:Balzac</literal> would
be ok too). &RCL; currently manages the following fields:</para>
<itemizedlist>
<listitem><para><literal>title</literal>,
<literal>subject</literal> or <literal>caption</literal> are
synonyms which specify data to be searched for in the
document title or subject.</para>
</listitem>
<listitem><para><literal>author</literal> or
<literal>from</literal> for searching the documents originators.</para>
</listitem>
<listitem><para><literal>keyword</literal> for searching the
document specified keywords (few documents actually have any).</para>
</listitem>
</itemizedlist>
<para>The query language is currently the only way to use the
&RCL; field search capability.</para>
<para>All elements in the search entry are normally combined
with an implicit AND. It is possible to specify that elements be
OR'ed instead, as in <replaceable>Beatles</replaceable>
@ -817,8 +794,54 @@ fvwm
<para>An entry preceded by a <literal>-</literal> specifies a
term that should <emphasis>not</emphasis> appear.</para>
<para>The first element in the above example,
<literal>author:"john doe"</literal> is a phrase search limited
to a specific field. Phrase searches are specified as usual by
enclosing the words in double quotes. The field specification
appears before the colon (of course this is not limited to
phrases, <literal>author:Balzac</literal> would be ok
too). &RCL; currently manages the following fields:</para>
<itemizedlist>
<listitem><para><literal>title</literal>,
<literal>subject</literal> or <literal>caption</literal> are
synonyms which specify data to be searched for in the
document title or subject.</para>
</listitem>
<listitem><para><literal>author</literal> or
<literal>from</literal> for searching the documents originators.</para>
</listitem>
<listitem><para><literal>keyword</literal> for searching the
document specified keywords (few documents actually have any).</para>
</listitem>
</itemizedlist>
<para>As of release 1.9, the filters have the possibility to
create other fields with arbitrary names. No standard filters
use this possibility yet.</para>
<para>There are two other elements which may be specified
through the field syntax, but are somewhat special:</para>
<itemizedlist>
<listitem><para><literal>ext</literal> for specifying the file
name extension (Ex: <literal>ext:html</literal>)</para>
</listitem>
<listitem><para><literal>mime</literal> for specifying the
mime type. This one is quite special because you can specify
several values which will be OR'ed (the normal default for the
language is AND). Ex: <literal>mime:text/plain
mime:text/html</literal>. Specifying an explicit boolean
operator or negation (<literal>-</literal>) before a
<literal>mime</literal> specification is not supported and
will produce strange results.</para>
</listitem>
</itemizedlist>
<para>The query language is currently the only way to use the
&RCL; field search capability.</para>
<para>Words inside phrases and capitalized words are not
stem-expanded. Wildcards may be used anywhere.</para>
stem-expanded. Wildcards may be used anywhere inside a term.
Specifying a wild-card on the left of a term can produce a very
slow search.</para>
<para>You can use the <literal>show query</literal> link at the
top of the result list to check the exact query which was
@ -2089,36 +2112,91 @@ skippedPaths = ~/somedir/*.txt
will be given a file name as argument and should output the
text contents in html format on the standard output.</para>
<para>The html could be very minimal like the following
example:</para>
<programlisting>&lt;html>&lt;head>
&lt;meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
&lt/head>
&lt;body>some text content&lt;/body>&lt;/html>
</programlisting>
<para>You should take care to escape some characters inside
the text by transforming them into appropriate
entities. "<literal>&amp;</literal>" should be transformed into
"<literal>&amp;amp;</literal>", "<literal>&lt;</literal>"
should be transformed into "<literal>&amp;lt;</literal>".</para>
<para>The character set needs to be specified in the
header. It does not need to be UTF-8 (&RCL; will take care
of translating it), but it must be accurate for good
results.</para>
<para>&RCL; will also make use of other header fields if
they are present: <literal>title</literal>,
<literal>description</literal>, <literal>keywords</literal>.
<para>
<para>The easiest way to write a new filter is probably to start
from an existing one.</para>
<para>You can find more details about writing a &RCL; filter
in the <link linkend="rcl.extending.filters">section about
writing filters</link></para>
</sect3>
</sect2>
</sect1>
<sect1 id="rcl.extending">
<title>Extending &RCL;</title>
<sect2 id="rcl.extending.filters">
<title>Writing a document filter</title>
<para>&RCL; filters are executable programs which
translate from a specific format (ie:
<application>openoffice</application>,
<application>acrobat</application>, etc.) to the &RCL;
indexing input format, which was chosen to be HTML.</para>
<para>&RCL; filters are usually shell-scripts, but this is in
no way necessary. These programs are extremely simple and most
of the difficulty lies in extracting the text from the native
format, not outputting what is expected by &RCL;. Happily
enough, most document formats already have translators or text
extractors which handle the difficult part and can be called
from the filter.</para>
<para>Filters are called with a single argument which is the
source file name. They should output the result to stdout.</para>
<para>The <literal>RECOLL_FILTER_FORPREVIEW</literal>
environment variable (values <literal>yes</literal>,
<literal>no</literal>) tells the filter if the operation is
for indexing or previewing. Some filters use this to output a
slightly different format. This is not essential.</para>
<para>The output HTML could be very minimal like the following
example:</para>
<programlisting>&lt;html>&lt;head>
&lt;meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
&lt;/head>
&lt;body>some text content&lt;/body>&lt;/html>
</programlisting>
<para>You should take care to escape some characters inside
the text by transforming them into appropriate
entities. "<literal>&amp;</literal>" should be transformed into
"<literal>&amp;amp;</literal>", "<literal>&lt;</literal>"
should be transformed into "<literal>&amp;lt;</literal>".</para>
<para>The character set needs to be specified in the
header. It does not need to be UTF-8 (&RCL; will take care
of translating it), but it must be accurate for good
results.</para>
<para>&RCL; will also make use of other header fields if
they are present: <literal>title</literal>,
<literal>description</literal>,
<literal>keywords</literal>.</para>
<para>As of &RCL; release 1.9, filters also have the
possibility to "invent" field names. This should be output as
meta tags:</para>
<programlisting>
&lt;meta name="somefield" content="Some textual data" /&gt;
</programlisting>
<para>In this case, a correspondence between field name and
&XAP; prefix should also be added to the
<filename>mimeconf</filename> file. See the existing entries
for inspiration. The field can then be used inside the query
language to narrow searches.</para>
<para>The easiest way to write a new filter is probably to start
from an existing one.</para>
</sect2>
</sect1>
</chapter>
</book>
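Note on the "Writing a document filter" section added above: the filter contract it describes (one file name argument, HTML on stdout, escaped text, an accurate charset, optional meta fields) can be sketched in a few lines. This is a minimal illustration in Python, not one of the filters shipped with Recoll; the helper name `to_filter_html` is hypothetical, the `somefield` value simply repeats the manual's own example, and the `RECOLL_FILTER_FORPREVIEW` variable is ignored, which the text says is acceptable.

```python
import html
import sys


def to_filter_html(text, fields=None):
    """Wrap plain text in the minimal HTML page described in the manual.

    Hypothetical helper: `fields` maps extra field names to values, each
    emitted as a <meta name=... content=...> tag.
    """
    metas = ['<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">']
    for name, content in (fields or {}).items():
        metas.append('<meta name="%s" content="%s" />'
                     % (html.escape(name), html.escape(content)))
    # Escape '&' and '<' (and '>') so the text cannot be read as markup.
    body = html.escape(text, quote=False)
    return ("<html><head>\n%s\n</head>\n<body>%s</body></html>\n"
            % ("\n".join(metas), body))


def main(argv):
    # Filters are called with a single argument: the source file name.
    with open(argv[1], encoding="utf-8", errors="replace") as source:
        text = source.read()
    page = to_filter_html(text, {"somefield": "Some textual data"})
    # The header declares UTF-8, so the bytes written must be UTF-8.
    sys.stdout.buffer.write(page.encode("utf-8"))


if __name__ == "__main__" and len(sys.argv) == 2:
    main(sys.argv)
```

A real filter would replace the plain `open()` call with whatever extractor handles the native format. As the manual notes, a new field also needs a matching prefix entry in `mimeconf` before it can be searched.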

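Note on the query language hunk in the same manual: the combination rules it states (elements are ANDed by default, while several `mime:` values are ORed together) can be illustrated with a small Python sketch. The function `group_query` and its return shape are hypothetical and are not Recoll's parser; phrases, explicit OR and negation are deliberately left out.

```python
def group_query(query):
    """Split a simple query into ANDed clauses and ORed mime types.

    Hypothetical helper, not Recoll code: it only understands
    space-separated 'word' and 'field:value' elements.
    """
    and_clauses = []   # (field, value) pairs that must all match
    mime_types = []    # alternatives: any one of them may match
    for element in query.split():
        field, sep, value = element.partition(":")
        if not sep:
            # Plain word: searched in any part of the document.
            and_clauses.append((None, element))
        elif field == "mime":
            mime_types.append(value)
        else:
            and_clauses.append((field, value))
    return and_clauses, mime_types


clauses, mimes = group_query(
    "author:Balzac ext:html mime:text/plain mime:text/html")
print(clauses)
print(mimes)
```

Keeping the `mime:` values in a separate list mirrors the special status the manual gives them. It also makes clear why a boolean operator or `-` placed in front of a `mime:` element has no defined meaning.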

@ -4,10 +4,21 @@ Bugs that are listed in an older version section are supposedly fixed in
later versions. Bugs listed in the topmost section may also exist in older
versions.
Latest (1.8.1):
Latest (1.8.2):
- There are a few problems in the qt4 version of recoll: some accelerators
(esc-spc, ctl-arrow) do not work, neither do copy/paste between the
result list and preview windows and x11 applications.
- The dates shown for email attachments in a result list are the email
folder modification date. This should be inherited from the parent
message instead.
- There are sometimes problems with document deletions: the index can
get in a state where deleted or moved documents are not purged from the
index (the log file says that the docs are deleted, but they aren't
actually). When this happens, the only solution currently is to reindex
from scratch (recollindex -z). This is due to a xapian bug, which will be
fixed in a future release. You can apply the following patch to xapian
1.0.1 to fix it:
http://www.lesbonscomptes.com/recoll/xapian/xapian-delete-document.patch
- NEAR crashes: 1.6 has added NEAR searches. Unlike what recoll did
with PHRASES, stemming expansion is performed on terms inside NEAR
clauses (except if prevented by a capitalized entry of course). There is
@ -39,9 +50,9 @@ Latest (1.8.1):
compressed (ie: xxx.txt.gz), recoll will try to start the external viewer
on the compressed file, which will not work in most cases.
- There are problems which have been reported indexing big mailstores
(several hundreds of thousands of messages): resulting in a very big
database and even crashes during indexation.
- Problems have been reported indexing big mailstores (several hundreds of
thousands of messages): resulting in a very big database and even
crashes.
- Under some versions of KDE (ie: Fedora FC5 KDE 3.5.4-0.5.fc5), there is a
problem with the window stacking order. Opening the "browse" file


@ -1,5 +1,31 @@
CHANGES
1.9.0
- Add option to remember sort tool state between program invocations (it is
reset to inactive by default)
- Improve qt4 build: no more need for --enable-qt4
- Fixed a number of qt4 glitches: selection and keyboard shortcuts.
- When searching for an empty string inside the preview window, position
the window to the next occurrence of the primary search terms.
- Have email attachments inherit date and author from their parent message
- Added an adjustable flush threshold during indexing: should help control
memory usage. See the idxflushmb configuration parameter.
- Added a check for file system free space. Indexing will stop if the
threshold is reached. See the maxfsoccuppc configuration parameter.
- Fix bus error on rclmon exit
- Better handle aspell errors inside rclmon
- Added File menu entry to erase document history.
- Added ext: and mime: selectors to the query language.
- Added support for arbitrary fields. Filters can now produce any number of
fields which will be selectively searchable through the query language.
- Added abiword and kword support.
- Contributed filter: rcljpeg. This should be extended to use the new field
support.
- Changed the icon to an ugly one. The previous one was nicer but looked
too much like Xapian's.
- Added some kind of support for a stopword list.
- Bound space and backspace to PgUp/PgDown in preview.
1.8.2 2007-05-19
- Fixed method name for compatibility with xapian 1.0.0
- Add .beagle to default list of skipped names (avoids indexing beagle

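Note on the `maxfsoccuppc` entry in the 1.9.0 changes: the file system check it mentions could look roughly like the Python sketch below. The function names are hypothetical. Reading the parameter as a maximum occupancy percentage, and treating 0 as "check disabled", are assumptions based on the parameter name and are not taken from the Recoll source.

```python
import shutil


def occupancy_percent(total, used):
    """Percentage of the file system that is in use."""
    return 100.0 * used / total


def should_stop_indexing(path, max_occupancy_pc, usage=shutil.disk_usage):
    """True once the file system holding `path` reaches the threshold.

    `usage` is injectable so the check can be exercised without a real
    disk. A threshold of 0 disables the check (assumption).
    """
    if max_occupancy_pc <= 0:
        return False
    total, used, _free = usage(path)
    return occupancy_percent(total, used) >= max_occupancy_pc
```

An indexer would call `should_stop_indexing()` periodically, for instance at each index flush, and stop cleanly rather than fill the disk.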

@ -38,7 +38,7 @@
<p>First of all, many thanks to the users who provided criticism
and ideas to make <span class="application">Recoll</span> go
forward ! Please
<a href="mailto:jean-francois.dockes@wanadoo.fr>
<a href="mailto:jean-francois.dockes@wanadoo.fr">
contact me</a> if you have something to suggest.</p>
<p><span class="application">Recoll</span> borrows


@ -30,16 +30,24 @@
<div class="content">
<h1>Recoll user manuals</h1>
<h1>Recoll user manual</h1>
<blockquote>
<ul>
<li><a href="usermanual/index.html">English</a></li>
<li><a href="http://mcz.altervista.org/Pagine/usermanual-italian.html">
Italian</a></li>
</ul>
</blockquote>
<p><br></p>
<h1>Other documentation</h1>
<ul>
<li><a href="perfs.html">Index size and indexing performance
data.</a></li>
</ul>
</div>
</body>
</html>


@ -24,7 +24,7 @@
<ul>
<li><a href="index.html">Home</a></li>
<li><b>Downloads</b></li>
<li><a href="usermanual/index.html">User manual</a></li>
<li><a href="doc.html">Documentation</a></li>
<li><a href="usermanual/rcl.install.html">Installation</a></li>
<li><a href="index.html#support">Support</a></li>
</ul>
@ -47,6 +47,8 @@
</table>
</p>
<h2><a name="source">General information</a></h2>
<p>You will probably need to have a look at the
<a href="usermanual/rcl.install.html">installation manual</a> for
building and/or installation instructions.</p>
@ -68,12 +70,17 @@
<a href="usermanual/index.html#RCL.INSTALL.EXTERNAL">list</a> to
decide what you may want to install.</p>
<p>In addition, optional functionality in Recoll (the term explorer
tool in phonetic mode) uses the <b>aspell</b> package. The
installed version should be at least 0.60 (utf-8 support) for
this to run smoothly. This function is far from essential.</p>
<p>If you find problems with the package or its
installation, <em>please</em>
<a href="mailto:jean-francois.dockes@wanadoo.fr">
report them</a>.</p>
<h4>What do the release numbers mean?</h4>
<h3>What do the release numbers mean?</h3>
<p>The Recoll releases are numbered X.Y.Z. </p>
@ -110,7 +117,16 @@
1.8.2 was released purely for fixing a small issue of
compatibility with xapian 1.0.0 and small config/install
glitches. There is no functional reason to upgrade from
1.8.1, (or update packages).
1.8.1, (or update packages).</p>
<p>Recoll 1.8.2 is the first release that will let you take
advantage of the new Xapian 1.0, the main user-visible change
of which is the new default index format. In order to take
advantage of the new format (which is not mandatory) Recoll
users updating from an older release need to delete their old
index. There are <a
href="usermanual/usermanual.html#RCL.INDEXING.STORAGE.FORMAT">more
details in the user manual</a>.</p>
<p>Older recoll releases:
<a href="recoll-1.8.1.tar.gz">1.8.1</a>
@ -128,8 +144,8 @@
<h2><a name="rpms">Packages</a></h2>
<p>The executables inside the binary rpms have a static link to
xapian, there is no dependency except Qt 3.3. Of course you need
xapian-core installed to use the source rpm. </p>
xapian 0.9.x, there is no dependency except Qt 3.3. Of course
you need xapian-core installed to use the source rpm. </p>
<p><b>Fedora Core</b>
FC6 RPM:
@ -168,10 +184,16 @@
<a href="debian/edgy/">debian/edgy</a>
</p>
<p><b>Ubuntu 6.06 dapper</b> (the feisty version does not work
on dapper). This has a static link on xapian 0.9.10:
<a href="debian/dapper/recoll_1.8.2-0ubuntu1_i386.deb">
recoll_1.8.2-0ubuntu1_i386.deb</a> </p>
<p><b>Debian unstable</b> Recoll is in the package repository,
you can install it with the usual <em>apt-get install
recoll</em>. <a
href="http://packages.qa.debian.org/r/recoll.html">Package page</a></p>
you can install it with the usual <em>apt-get install
recoll</em>. <a
href="http://packages.qa.debian.org/r/recoll.html">
Package page</a></p>
<p><b>Debian 3.1</b> Thanks to Mario (<img align="top" src="mario.png">)
for these: i386:


@ -142,6 +142,7 @@
</dd>
</ul>
<h2><a name="#stemming"></a>Stemming</h2>
<p>Stemming is a process which transforms inflected words into

205
website/fr/features.html Normal file

@ -0,0 +1,205 @@
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>RECOLL: un outil personnel de recherche textuelle pour
Unix et Linux</title>
<meta name="generator" content="HTML Tidy, see www.w3.org">
<meta name="Author" content="Jean-Francois Dockes">
<meta name="Description" content=
"recoll est un logiciel personnel de recherche textuelle pour unix et linux basé sur Xapian, un moteur d'indexation puissant et mature.">
<meta name="Keywords" content=
"recherche textuelle,desktop,unix,linux,solaris,open source,free">
<meta http-equiv="Content-language" content="fr">
<meta http-equiv="content-type" content=
"text/html; charset=iso-8859-1">
<meta name="robots" content="All,Index,Follow">
<link type="text/css" rel="stylesheet" href="../styles/style.css">
</head>
<body>
<div class="rightlinks">
<ul>
<li><a href="../index.html">Base</a></li>
<li><a href="../pics/index.html">Copies d'écrans</a></li>
<li><a href="../download.html">Téléchargements</a></li>
<li><a href="../manuals.html">Documentation</a></li>
<li><a href="../index.html#support">Support</a></li>
<li><a href="../devel.html">Développement</a></li>
</ul>
</div>
<div class="content">
<h1 class="intro">Caractéristiques de Recoll</h1>
<dl>
<dt><a name="systems">Systèmes</a></dt>
<dd><span class="application">Recoll</span> a été compilé et
testé sur FreeBSD, Linux, Darwin, Solaris (versions
FreeBSD 5.5, Fedora Core 5, Suse 10.1, Gentoo,
Debian 3.1, Ubuntu Edgy, Solaris 8/9, mais d'autres versions
récentes conviennent sans doute également).</dd>
<dd>Versions de QT: 3.2, 3.3 et 4.2</dd>
<dt><a name="doctypes">Types de documents</a></dt>
<dd>Recoll peut traiter les types de documents suivants, ainsi
que des fichiers compressés du même type:
<dl>
<dt>En interne</dt>
<dd>
<ul>
<li><var class="literal">text</var>.</li>
<li><var class="literal">html</var>.</li>
<li><span class="application">OpenOffice</span>
(avec l'aide de la commande <b>unzip</b>).</li>
<li><var class="literal">maildir</var> et <var
class="literal">mailbox</var> (<span class=
"application">Mozilla</span>, <span class=
"application">Thunderbird</span>, <span class=
"application">Evolution</span> et sans doute
d'autres).</li>
<li>Fichiers de conversation <span class="application">
gaim</span>.</li>
<li><span class="application">Scribus</span>.</li>
</ul>
</dd>
<dt>With external helpers</dt>
<dd>
<ul>
<li><var class="literal">pdf</var> avec <a href=
"http://www.foolabs.com/xpdf/">xpdf</a>.</li>
<li><var class="literal">postscript</var> avec
<a href="http://www.gnu.org/software/ghostscript/ghostscript.html">
ghostscript</a> et
<a href="http://www.cs.wisc.edu/~ghost/doc/pstotext.htm">
pstotext</a>.</li>
<li>Fichiers <span class="application">Lyx</span>
(nécessite l'application
<span class="application">Lyx</span>).</li>
<li><span class="application">msword</span> avec <a href=
"http://www.winfield.demon.nl/">antiword</a>.</li>
<li><span class="application">Powerpoint</span> et
<span class="application">Excel</span> avec les utilitaires
<a href="http://www.45.free.net/~vitus/software/catdoc/">
catdoc</a>.</li>
<li><var class="literal">rtf</var> avec <a href=
"http://www.gnu.org/software/unrtf/unrtf.html">unrtf</a>.</li>
<li><var class="literal">dvi</var> avec
<a href="http://www.radicaleye.com/dvips.html">dvips</a>.
</li>
<li><var class="literal">djvu</var> avec
<a href="http://djvulibre.djvuzone.org/doc/index.html">
DjVuLibre</a>. </li>
<li>Tags <var class="literal">mp3</var> avec
<a href="http://id3lib.sourceforge.net/">
id3info (id3lib)</a>. </li>
</ul>
</dd>
</dl>
</dd>
<dt>Autres caractéristiques</dt>
<dd>
<ul>
<li>Index multiples interrogeables ensemble ou séparément.</li>
<li>Fonctions de recherche puissantes, avec expressions
booléennes, phrases et proximité, caractères jokers,
filtrage sur les types de fichiers ou l'emplacement.</li>
<li>Fonction spécifique de recherche de noms de fichiers.</li>
<li>Support de jeux de caractères multiples. Les traitements
internes et l'index utilisent l'encodage Unicode UTF-8.</li>
<li>L'extraction des racines de mots <a href="#Stemming">
Stemming</a> est effectuée au moment de la recherche
(permet de changer de langue après l'indexation).</li>
<li>Installation facile. Pas de processus permanent, de
serveur web ou environnement exotique.</li>
<li>Un indexeur qui peut fonctionner soit comme un
processus léger dans l'interface de consultation, comme un
programme batch externe intégrable par
<span class="application">cron</span>, ou comme un processus
permanent pour l'indexation au fil de l'eau.</li>
</ul>
</dd>
</dl>
<h2><a name="#stemming"></a>Lemmatisation</h2>
<p><em>Note: je serais preneur d'une traduction française
agréable pour "stemming".</em></p>
<p>La lemmatisation transforme un mot dérivé vers sa racine.
Par exemple, <i>aimer</i>, <i>aimerai</i>, <i>aimait</i>,
<i>aimez</i> etc. seraient transformés en <i>aim</i> en
français. Une recherche de l'un quelconque des dérivés peut
automatiquement être étendue vers tous les autres</p>
<p>Certains moteurs de recherche appliquent la transformation
pendant l'indexation. L'index ne stocke que les racines des
mots, avec des exceptions pour les termes qui sont reconnus
comme des noms propres (capitalisation). Au moment de la
recherche, les termes de la requête sont également transformés
avant comparaison à l'index.</p>
<p>Cette approche permet un index plus petit, mais elle perd
irrévocablement de l'information pendant l'indexation.</p>
<p>Recoll fonctionne différemment. Les termes sont indexés sans
transformation. L'index résultant est plus gros, ce qui n'a
probablement pas beaucoup d'importance à une époque de disques
de 100 Go principalement remplis d'information multimédia
<em>non indexée</em>.</p>
<p>À la fin de l'indexation, Recoll construit un ou plusieurs
dictionnaires de transformation (pour différents langages), où
toutes les racines sont listées avec leurs transformations
possibles.</p>
<p>Au moment de la recherche, par défaut, les termes de
l'utilisateur sont transformés, et étendus aux dérivés par
utilisation du dictionnaire.
Les résultats obtenus sont analogues à ceux de
l'autre méthode. L'avantage est que l'expansion peut être
contrôlée au moment de la recherche:
<ul>
<li>On peut la supprimer pour n'importe quel terme de la
requête, (en le faisant débuter par une capitale:
<em>Aime</em> par exemple pour chercher la ville d'Aime la
Plagne). </li>
<li>Le langage de transformation peut également être changé,
en supposant que plusieurs dictionnaires de transformation
aient été construits lors de l'indexation.</li>
</ul>
</div>
</body>
</html>
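Note on the stemming section of this page: the approach it describes (index raw terms, build a stem dictionary after indexing, expand at query time unless the term is capitalized) can be sketched as follows in Python. The `toy_stem` function is a crude stand-in invented for this example, not the stemmer Recoll or Xapian use, and the other function names are hypothetical as well.

```python
from collections import defaultdict


def toy_stem(word):
    """Crude stand-in for a real stemmer: strips a few French endings."""
    for suffix in ("erai", "ait", "er", "ez", "e"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def build_stem_dictionary(indexed_terms):
    """Map each stem to the set of unstemmed terms found in the index."""
    expansions = defaultdict(set)
    for term in indexed_terms:
        expansions[toy_stem(term)].add(term)
    return expansions


def expand_query_term(term, expansions):
    """Expand a query term to its variants, unless it is capitalized."""
    if term[:1].isupper():
        # Capitalized entry: no expansion, only the lowercased term is used.
        return {term.lower()}
    return expansions.get(toy_stem(term), set()) | {term}


terms = ["aimer", "aimerai", "aimait", "aimez", "aime"]
dictionary = build_stem_dictionary(terms)
print(sorted(expand_query_term("aimer", dictionary)))
print(sorted(expand_query_term("Aime", dictionary)))
```

Because the index keeps the raw terms, the dictionary can be rebuilt for another language without reindexing, which is the advantage the page claims for this design.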


@ -81,6 +81,16 @@
<li><a class="weak" href="features.html">(more detail)</a></li>
</ul>
<h2>News: </h2>
<p>There are new filters for
<span class="application">kword</span> and
<span class="application">abiword</span> files in the
<a href="filters/filters.html">new filters section</a>. These
are usable with an existing <span
class="application">Recoll</span> 1.8 installation.</p>
<h2><a name="support">Support</a></h2>
<p>If you have any problem with Recoll, its


@ -97,6 +97,15 @@
</ul>
<h2>Nouvelles: </h2>
<p>Il y a de nouveaux filtres d'indexation pour les fichiers
<span class="application">kword</span> et
<span class="application">abiword</span>. Ils sont téléchargeables
dans la <a href="filters/filters.html">zone des nouveaux
filtres</a>, et sont utilisables avec une installation existante de
<span class="application">Recoll</span> 1.8.</p>
<h2><a name="support">Support</a></h2>
<p>Si vous avez un problème quelconque avec le logiciel ou son

BIN
website/mario.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 1.8 KiB

114
website/perfs.html Normal file
View File

@ -0,0 +1,114 @@
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>RECOLL: a personal text search system for
Unix/Linux</title>
<meta name="generator" content="HTML Tidy, see www.w3.org">
<meta name="Author" content="Jean-Francois Dockes">
<meta name="Description" content=
"recoll is a simple full-text search system for unix and linux based on the powerful and mature xapian engine">
<meta name="Keywords" content=
"full text search,fulltext,desktop search,unix,linux,solaris,open source,free">
<meta http-equiv="Content-language" content="en">
<meta http-equiv="content-type" content=
"text/html; charset=iso-8859-1">
<meta name="robots" content="All,Index,Follow">
<link type="text/css" rel="stylesheet" href="styles/style.css">
</head>
<body>
<div class="rightlinks">
<ul>
<li><a href="index.html">Home</a></li>
<li><a href="pics/index.html">Screenshots</a></li>
<li><a href="download.html">Downloads</a></li>
<li><a href="doc.html">Documentation</a></li>
</ul>
</div>
<div class="content">
<h1 class="intro">Recoll: Indexing performance and index sizes</h1>
<p>The time needed to index a given set of documents, and the
resulting index size depend on many factors, such as file size
and proportion of actual text content for the index size, cpu
speed, available memory, average file size and format for the
speed of indexing.</p>
<p>We try here to give a number of reference points which can
be used to roughly estimate the resources needed to create and
store an index. Obviously, your data set will never fit one of
the samples, so the results cannot be exactly predicted.</p>
<p>The following data was obtained on a machine with a 1800 MHz
AMD Duron CPU, 768 MB of RAM, and a 7200 RPM 160 GB IDE
disk, running Suse 10.1.</p>
<p><b>recollindex</b> (version 1.8.2 with xapian 1.0.0) is
executed with the default flush threshold value.
The process memory usage is the one given by <b>ps</b></p>
<table border=1>
<thead>
<tr>
<th>Data</th>
<th>Data size</th>
<th>Indexing time</th>
<th>Index size</th>
<th>Peak process memory usage</th>
</tr>
</thead>
<tbody>
<tr>
<td>Random pdfs harvested on Google</td>
<td>1.7 GB, 3564 files</td>
<td>27 mn</td>
<td>230 MB</td>
<td>225 MB</td>
</tr>
<tr>
<td>Ietf mailing list archive</td>
<td>211 MB, 44,000 messages</td>
<td>8 mn</td>
<td>350 MB</td>
<td>90 MB</td>
</tr>
<tr>
<td>Partial Wikipedia dump</td>
<td>15 GB, one million files</td>
<td>6H30</td>
<td>10 GB</td>
<td>324 MB</td>
</tr>
<tr>
<!-- DB: ndocs 3564 lastdocid 3564 avglength 6460.71 -->
<td>Random pdfs harvested on Google<br>
Recoll 1.9, <em>idxflushmb</em> set to 10</td>
<td>1.7 GB, 3564 files</td>
<td>25 mn</td>
<td>262 MB</td>
<td>65 MB</td>
</tr>
</tbody>
</table>
<p>Notice how the index size for the mail archive is bigger than
the data size. Myriads of small pure text documents will do
this. The factor of expansion would of course be even worse with
compressed folders (the test was on uncompressed
data).</p>
<p>The last test was performed with Recoll 1.9.0 which has an
adjustable flush threshold (<em>idxflushmb</em> parameter), here
set to 10 MB. Notice the much lower peak memory usage, with no
performance degradation. The resulting index is bigger though.
The exact reason is not known to me; it is possibly due to
additional fragmentation.</p>
</div>
</body>
</html>
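Note on the performance table in this page: to turn the reference points into rough estimates for another data set, the index/data ratio and the indexing throughput can be derived from the published figures. The Python sketch below only restates numbers from the table; the conversion 1 GB = 1000 MB is an assumption about the units used, and the helper names are hypothetical.

```python
# (label, data size in MB, index size in MB, indexing time in minutes),
# copied from the table; 6H30 is entered as 390 minutes.
SAMPLES = [
    ("pdfs", 1700, 230, 27),
    ("ietf mail archive", 211, 350, 8),
    ("wikipedia dump", 15000, 10000, 390),
    ("pdfs, idxflushmb=10", 1700, 262, 25),
]


def index_ratio(data_mb, index_mb):
    """Index size as a fraction of the original data size."""
    return index_mb / data_mb


def throughput_mb_per_min(data_mb, minutes):
    """Amount of source data indexed per minute."""
    return data_mb / minutes


for label, data_mb, index_mb, minutes in SAMPLES:
    print("%-22s index/data = %.2f, %.1f MB/min"
          % (label, index_ratio(data_mb, index_mb),
             throughput_mb_per_min(data_mb, minutes)))
```

The ratios make the page's remark concrete: the mail archive index is roughly 1.7 times the size of its data, while the pdf index is well under a fifth of it.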


@ -2,72 +2,146 @@
<html>
<head>
<title>Recoll Index format</title>
<meta name="generator" content="HTML Tidy, see www.w3.org">
<meta name="Author" content="Jean-Francois Dockes">
<meta name="Description" content=
"recoll est un logiciel personnel de recherche textuelle pour unix et linux basé sur Xapian, un moteur d'indexation puissant et mature.">
<meta name="Keywords" content=
"recherche textuelle,desktop,unix,linux,solaris,open source,free">
<meta http-equiv="Content-language" content="fr">
<meta http-equiv="content-type" content=
"text/html; charset=iso-8859-1">
<meta name="robots" content="All,Index,Follow">
<link type="text/css" rel="stylesheet" href="styles/style.css">
</head>
<body>
<div class="content">
<h1>Recoll index format details</h1>
<p>Terms are not stemmed before being stored. They are turned to
all minuscule letters with no accents.</p>
<p>A comparison of index formats for recoll 1.8 and omega
1.0.1</p>
<p>Special prefixed terms:</p>
<ul>
<li>Ddate: modification date of file, like YYYYMMDD</li>
<p>Recoll terms are not stemmed before being stored. They are turned to
all minuscule letters with no accents. An auxiliary database
handles stem expansion. Omega stores both raw
terms and stemmed versions (with prefix Z)</p>
<li>Mmonth: YYYYMM</li>
<h2>Special prefixed terms:</h2>
<li>Ppathhash truncated/hashed version of file path. For
<p>A comparison of prefixed term usage between Recoll and
omega/xapian. <em>xapian-core</em> in the Omega column means
that the prefix is not used by Omega, but mentioned as
allocated in the xapian prefix definition document.</p>
<table border=1 cellspacing=0 width="90%">
<thead>
<tr><th>Pref.</th><th>Recoll use</th><th>Omega use</th>
</tr>
</thead>
<tbody>
<tr><td>T</td><td>mime type</td><td>Same</td>
</tr>
<tr><td>P</td><td>Truncated/hashed version of file path. For
single-document files, and for the file part of a
multi-document file. Used for up-to-date checks and for
retrieving a document by path. omega uses U for the equivalent
term used for up to date checks.</li>
retrieving a document by path. </td><td>Path part of URL (no
hashing). Uses U for the equivalent
term used for up to date checks.</td>
</tr>
<li>Qpathhash+ipath same + internal path for documents inside
multi-document files. Used to set the existence flag for
subdocs when a multi-document file is found to be up to date,
or for deleting all subdocs for a file, or for retrieving a
document by path+ipath. No real omega equivalent. Compatible
with Q definition in termprefixes.txt: unique identifier.</li>
<tr><td>Q</td><td>pathhash+ipath same + internal path for
documents inside multi-document files. Used to set the
existence flag for subdocs when a multi-document file is found
to be up to date, or for deleting all subdocs for a file, or
for retrieving a document by path+ipath. Compatible
with Q definition in xapian/termprefixes.txt: unique
identifier.</td><td>None</td>
</tr>
<li>Tmimetype: document mime type.</li>
<tr><td>D</td><td>date: modification date of file, like
YYYYMMDD</td><td>Same</td>
</tr>
<li>Wweak: 10 days period (not used any more by omega)</li>
<tr><td>M</td><td>month: YYYYMM</td><td>Same</td>
</tr>
<tr><td>Y</td><td>year YYYY</td><td>Same</td>
</tr>
<li>Yyear YYYY</li>
<tr><td>XSFN</td><td>utf8 version of file name. Used for specific
file name searches</td><td>None</td>
</tr>
<tr><td>U</td><td>None</td><td>Url term. Truncated/hashed version
of URL. Used for duplicate checks.</td>
</tr>
<li>XSFNfilename utf8 version of file name. Used for specific
file name searches</li>
<tr><td>S</td><td>Subject/title</td><td>xapian-core</td>
</tr>
<tr><td>A</td><td>Author</td><td>xapian-core</td>
</tr>
<tr><td>K</td><td>Keyword</td><td>xapian-core</td>
</tr>
</tbody>
</table>
</ul>
<p>Omega prefixes with no equivalents in Recoll: P, R, U</p>
<p>None of the "date" terms are currently used by recoll queries</p>
<p>Values: Recoll currently stores no document values.</p>
<h2>Values</h2>
<p>Recoll currently stores no document values.</p>
<p>Omega stores 2 values, for the md5 hash of the file, and the
last modification date (as unix time). The md5 value doesn't
appear to be currently used ?</p>
<p>Document data record format<p>
<ul>
<li>url= Full url. Always file://abspath. The path is not
<h2>Document data record format</h2>
<p>Recoll has the same line based / prefixed data record format
as omega (name=value\n).</p>
<table border=1 cellspacing=0 width="90%">
<thead>
<tr><th>Prefix</th><th>Recoll use</th><th>Omega use</th>
</tr>
</thead>
<tbody>
<tr><td>url=</td><td>Full url. Always file://abspath. The path is not
encoded to utf-8, this is the system file name, usable as an
argument to open(). (omega: sort of same)</li>
<li>mtype= mime type (omega: type)</li>
<li>fmtime= file modification date (omega: modtime)</li>
<li>dmtime= document modification date (omega: none)</li>
<li>origcharset= character set the text was converted from
(omega: none)</li>
<li>fbytes= file size in bytes (omega: size)</li>
<li>dbytes= document size in bytes (omega: none)</li>
<li>ipath= internal path for docs in multidoc files. (omega: none)</li>
<li>caption= title of document, utf8 (omega: same)</li>
<li>keywords= key words, utf8 (omega: none)</li>
<li>abstract= document abstract, utf8 (omega: sample)</li>
</ul>
argument to open()</td><td>Same</td>
</tr>
<tr><td>mtype=</td><td>mime type (omega: type)</td><td>type=</td>
</tr>
<tr><td>fmtime=</td><td>file modification date</td><td>modtime=</td>
</tr>
<tr><td>dmtime=</td><td> document modification date</td><td>None</td>
</tr>
<tr><td>origcharset=</td><td> character set the text was
converted from</td><td>None</td>
</tr>
<tr><td>fbytes=</td><td> file size in bytes</td><td>size=</td>
</tr>
<tr><td>dbytes=</td><td>document size in bytes</td><td>None</td>
</tr>
<tr><td>ipath=</td><td>internal path for docs in multidoc
files</td><td>None</td>
</tr>
<tr><td>caption=</td><td>title of document, utf8</td><td>Same</td>
</tr>
<tr><td>keywords=</td><td>key words, utf8</td><td>None</td>
</tr>
<tr><td>abstract=</td><td>document abstract, utf8</td><td>sample=</td>
</tr>
</tbody>
</table>
</div>
<hr>
<address><a href="mailto:jean-francois.dockes@wanadoo.fr">Jean-Francois Dockes</a></address>
<!-- Created: Thu Dec 7 13:07:40 CET 2006 -->
<!-- hhmts start -->
Last modified: Thu Dec 7 14:19:02 CET 2006
Last modified: Thu Jun 14 11:14:38 CEST 2007
<!-- hhmts end -->
</body>
</html>
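Note on the "Document data record format" section of this page: since the record is described as line based with `name=value\n` entries, reading one back is straightforward. The Python sketch below is an illustration only. The function name is hypothetical, the sample record is made up from field names listed in the table, and no Xapian API is involved.

```python
def parse_data_record(record):
    """Parse 'name=value' lines into a dict; values may contain '='."""
    fields = {}
    for line in record.splitlines():
        name, sep, value = line.partition("=")
        if sep:
            fields[name] = value
    return fields


# Hypothetical record built from the field names in the table.
sample = (
    "url=file:///home/me/doc.txt\n"
    "mtype=text/plain\n"
    "fbytes=1234\n"
    "caption=A title\n"
)
record = parse_data_record(sample)
print(record["mtype"])
print(record["caption"])
```

Splitting on the first `=` only keeps values such as URLs with query strings intact. Since the page states that Omega uses the same line format, the same function would apply to its records, with `type=` and `size=` in place of `mtype=` and `fbytes=`.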

BIN
website/smile.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 1.4 KiB


@ -92,3 +92,4 @@ a.weak {
color: #aaaaaa;
}
table { empty-cells:show; }