*** empty log message ***

2007-06-26 16:58:26 +00:00 · 2007-06-26 16:58:26 +00:00 · 2674e45f29
commit 2674e45f29
parent 348b4bc717
16 changed files with 756 additions and 109 deletions
--- a/packaging/rpm/recollCooker.spec
+++ b/packaging/rpm/recollCooker.spec
@ -0,0 +1,88 @@
+Summary:	Desktop full text search tool with a qt gui
+Name:           recoll
+Version:        1.8.1
+Release:        %mkrel 1
+License:	GPL
+Group:          Databases
+URL:            http://www.recoll.org/
+Source0:	http://www.lesbonscomptes.com/recoll/%{name}-%{version}.tar.bz2
+Patch1:		%{name}-configure.patch
+BuildRequires:	libxapian-devel
+BuildRequires:	libfam-devel
+BuildRequires:	libqt-devel	>= 3.3.7
+BuildRequires:	libaspell-devel
+Requires:	xapian
+BuildRoot:      %{_tmppath}/%{name}-%{version}--buildroot
+
+%description
+Recoll is a personal full text search tool for Unix/Linux.
+It is based on the very strong Xapian backend, for which 
+it provides an easy to use, feature-rich, easy administration, 
+QT graphical interface.
+
+%prep
+%setup -q 
+%patch1 -p0
+
+%build
+%configure2_5x \
+	--with-fam \
+	--with-aspell
+
+%make
+
+%install
+[ "%{buildroot}" != "/" ] && rm -rf %{buildroot}
+
+%makeinstall_std
+desktop-file-install --vendor="" \
+	--add-category="X-MandrivaLinux-MoreApplications-Databases" \
+	--dir %{buildroot}%{_datadir}/applications %{buildroot}%{_datadir}/applications/*
+
+%clean
+[ "%{buildroot}" != "/" ] && rm -rf %{buildroot}
+
+%files
+%defattr(644,root,root,755)
+%doc %{_datadir}/%{name}/doc
+%attr(755,root,root) %{_bindir}/%{name}*
+%{_datadir}/applications/recoll-searchgui.desktop
+%{_datadir}/icons/hicolor/48x48/apps/recoll-searchgui.png
+%dir %{_datadir}/%{name}
+%dir %{_datadir}/%{name}/examples
+%dir %{_datadir}/%{name}/filters
+%dir %{_datadir}/%{name}/images
+%dir %{_datadir}/%{name}/translations
+%{_datadir}/%{name}/examples/mime*
+%{_datadir}/%{name}/examples/*.conf
+%attr(755,root,root) %{_datadir}/%{name}/examples/rclmon.sh
+%attr(755,root,root) %{_datadir}/%{name}/filters/rc*
+%{_datadir}/%{name}/filters/xdg-open
+%{_datadir}/%{name}/images/*png
+%{_mandir}/man1/recoll*
+%{_mandir}/man5/recoll*
+%{_datadir}/%{name}/translations/*.qm
+
+
+%changelog
+* Fri Apr 20 2007 Tomasz Pawel Gajc <tpg@mandriva.org> 1.8.1-1mdv2008.0
+ Revision: 16093
+- new version
+- drop P0
+
+  + Mandriva <devel@mandriva.com>
+
+
+* Tue Mar 06 2007 Tomasz Pawel Gajc <tpg@mandriva.org> 1.7.5-2mdv2007.0
+ Revision: 134128
+- rebuild
+
+* Tue Jan 30 2007 Tomasz Pawel Gajc <tpg@mandriva.org> 1.7.5-1mdv2007.1
+ Revision: 115423
+- add patch 1 - fix build on x86_64
+- add patch 0 - fix menu entry
+- fix group
+- add buildrequires
+- set correct bits on files
+- Import recoll
+
--- a/src/doc/user/usermanual.sgml
+++ b/src/doc/user/usermanual.sgml
@ -24,11 +24,12 @@
      Dockes</holder>
    </copyright>

-    <releaseinfo>$Id: usermanual.sgml,v 1.44 2007-06-08 16:46:53 dockes Exp $</releaseinfo>
+    <releaseinfo>$Id: usermanual.sgml,v 1.45 2007-06-26 16:58:25 dockes Exp $</releaseinfo>

    <abstract>
      <para>This document introduces full text search notions
-      and describes the installation and use of the &RCL; application.</para>
+      and describes the installation and use of the &RCL;
+      application. It currently describes &RCL; 1.9.</para>
    </abstract>


@ -771,30 +772,6 @@ fvwm
      <replaceable>unplugged</replaceable> but not
      <replaceable>potatoes</replaceable> (in any part of the document).</para>

-      <para>The first element <literal>author:"john doe"</literal> is
-      a phrase search limited to a specific field. Phrase searches are
-      specified as usual by enclosing the words in double quotes. The
-      field specification appears before the colon (of course this is
-      not limited to phrases, <literal>author:Balzac</literal> would
-      be ok too). &RCL; currently manages the following fields:</para>
-
-      <itemizedlist>
-	<listitem><para><literal>title</literal>,
-	<literal>subject</literal> or <literal>caption</literal> are
-	synonyms which specify data to be searched for in the
-	document title or subject.</para>
-	</listitem>
-	<listitem><para><literal>author</literal> or
-	<literal>from</literal> for searching the documents originators.</para>
-	</listitem>
-	<listitem><para><literal>keyword</literal> for searching the
-	document specified keywords (few documents actually have any).</para>
-	</listitem>
-      </itemizedlist>
-
-      <para>The query language is currently the only way to use the
-      &RCL; field search capability.</para>
-
      <para>All elements in the search entry are normally combined
      with an implicit AND. It is possible to specify that elements be
      OR'ed instead, as in <replaceable>Beatles</replaceable>
@ -817,8 +794,54 @@ fvwm
      <para>An entry preceded by a <literal>-</literal> specifies a
      term that should <emphasis>not</emphasis> appear.</para>

+      <para>The first element in the above example,
+      <literal>author:"john doe"</literal>, is a phrase search limited
+      to a specific field. Phrase searches are specified as usual by
+      enclosing the words in double quotes. The field specification
+      appears before the colon (of course this is not limited to
+      phrases, <literal>author:Balzac</literal> would be ok
+      too). &RCL; currently manages the following fields:</para>
+      <itemizedlist>
+	<listitem><para><literal>title</literal>,
+	<literal>subject</literal> or <literal>caption</literal> are
+	synonyms which specify data to be searched for in the
+	document title or subject.</para>
+	</listitem>
+	<listitem><para><literal>author</literal> or
+	<literal>from</literal> for searching the documents originators.</para>
+	</listitem>
+	<listitem><para><literal>keyword</literal> for searching the
+	document specified keywords (few documents actually have any).</para>
+	</listitem>
+      </itemizedlist>
+
+      <para>As of release 1.9, filters can create other fields
+      with arbitrary names. No standard filter uses this
+      possibility yet.</para>
+
+      <para>There are two other elements which may be specified
+      through the field syntax, but are somewhat special:</para>
+      <itemizedlist>
+	<listitem><para><literal>ext</literal> for specifying the file
+	name extension (Ex: <literal>ext:html</literal>)</para>
+	</listitem>
+	<listitem><para><literal>mime</literal> for specifying the
+	mime type. This one is quite special because you can specify
+	several values which will be OR'ed (the normal default for the
+	language is AND). Ex: <literal>mime:text/plain
+	mime:text/html</literal>. Specifying an explicit boolean
+	operator or negation (<literal>-</literal>) before a
+	<literal>mime</literal> specification is not supported and
+	will produce strange results.</para>
+	</listitem>
+      </itemizedlist>
+      <para>The query language is currently the only way to use the
+      &RCL; field search capability.</para>
+
      <para>Words inside phrases and capitalized words are not
-      stem-expanded. Wildcards may be used anywhere.</para>
+      stem-expanded. Wildcards may be used anywhere inside a term.
+      Specifying a wildcard at the start of a term can produce a very
+      slow search.</para>

      <para>You can use the <literal>show query</literal> link at the
      top of the result list to check the exact query which was
@ -2089,36 +2112,91 @@ skippedPaths = ~/somedir/*.txt
 	  will be given a file name as argument and should output the
 	  text contents in html format on the standard output.</para>

-	  <para>The html could be very minimal like the following
-	  example:</para>
-	  <programlisting>&lt;html>&lt;head>
-&lt;meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
-&lt/head>
-&lt;body>some text content&lt;/body>&lt;/html>
-          </programlisting>
-
-	  <para>You should take care to escape some characters inside
-	  the text by transforming them into appropriate
-	  entities. "<literal>&amp;</literal>" should be transformed into
-	  "<literal>&amp;amp;</literal>", "<literal>&lt;</literal>"
-	  should be transformed into "<literal>&amp;lt;</literal>".</para>
-
-	  <para>The character set needs to be specified in the
-	  header. It does not need to be UTF-8 (&RCL; will take care
-	  of translating it), but it must be accurate for good
-	  results.</para>
-
-	  <para>&RCL; will also make use of other header fields if
-	  they are present: <literal>title</literal>,
-	  <literal>description</literal>, <literal>keywords</literal>.
-          <para>
-          <para>The easiest way to write a new filter is probably to start
-          from an existing one.</para>
+	  <para>You can find more details about writing a &RCL; filter
+	  in the <link linkend="rcl.extending.filters">section about
+	  writing filters</link>.</para>
 	</sect3>

      </sect2>

    </sect1>
+
+    <sect1 id="rcl.extending">
+      <title>Extending &RCL;</title>
+      
+      <sect2 id="rcl.extending.filters">
+	<title>Writing a document filter</title>
+
+	<para>&RCL; filters are executable programs which 
+	translate from a specific format (e.g.
+	<application>openoffice</application>,
+	<application>acrobat</application>, etc.) to the &RCL;
+	indexing input format, which was chosen to be HTML.</para>
+
+	<para>&RCL; filters are usually shell-scripts, but this is in
+	no way necessary. These programs are extremely simple and most
+	of the difficulty lies in extracting the text from the native
+	format, not outputting what is expected by &RCL;. Happily
+	enough, most document formats already have translators or text
+	extractors which handle the difficult part and can be called
+	from the filter.</para>
+
+	<para>Filters are called with a single argument which is the
+	source file name. They should output the result to stdout.</para>
+
+	<para>The <literal>RECOLL_FILTER_FORPREVIEW</literal>
+	environment variable (values <literal>yes</literal>,
+	<literal>no</literal>) tells the filter if the operation is
+	for indexing or previewing. Some filters use this to output a
+	slightly different format. This is not essential.</para>
+
+	<para>The output HTML could be very minimal like the following
+	example:</para>
+
+	<programlisting>&lt;html>&lt;head>
+&lt;meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
+&lt;/head>
+&lt;body>some text content&lt;/body>&lt;/html>
+          </programlisting>
+
+	<para>You should take care to escape some characters inside
+	  the text by transforming them into appropriate
+	  entities. "<literal>&amp;</literal>" should be transformed into
+	  "<literal>&amp;amp;</literal>", "<literal>&lt;</literal>"
+	  should be transformed into "<literal>&amp;lt;</literal>".</para>
+
+	<para>The character set needs to be specified in the
+	  header. It does not need to be UTF-8 (&RCL; will take care
+	  of translating it), but it must be accurate for good
+	  results.</para>
+
+	<para>&RCL; will also make use of other header fields if
+	  they are present: <literal>title</literal>,
+	  <literal>description</literal>,
+	  <literal>keywords</literal>.</para>
+
+	<para>As of &RCL; release 1.9, filters can also define their
+	own field names. Such fields should be output as
+	meta tags:</para>
+
+	<programlisting>
+&lt;meta name="somefield" content="Some textual data" /&gt;
+</programlisting>
+	
+	<para>In this case, a correspondence between field name and
+	&XAP; prefix should also be added to the
+	<filename>mimeconf</filename> file. See the existing entries
+	for inspiration. The field can then be used inside the query
+	language to narrow searches.</para>
+
+	<para>The easiest way to write a new filter is probably to start
+          from an existing one.</para>
+
+	
+      </sect2>
+
+    </sect1>
+
  </chapter>

 </book>
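
Note on the "Writing a document filter" section added above: the following is a minimal sketch of such a filter, written in Python rather than shell (the manual text says shell scripts are usual but not required). It only illustrates the output format the section describes: a charset declaration, escaped text, and one extra meta field. The function names, the use of the file name as title, and the direct reading of the input as UTF-8 text are assumptions made for illustration; a real filter would call a format-specific text extractor, and a new field would still need its mimeconf entry.

```python
#!/usr/bin/env python3
"""Hypothetical Recoll-style filter sketch: plain text in, minimal HTML out."""
import html
import os
import sys


def build_html(text, title="", extra_fields=None):
    """Wrap extracted text in the minimal HTML described in the manual."""
    # Escape "&" and "<" (html.escape also handles ">") in the body text.
    body = html.escape(text, quote=False)
    head = ['<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">']
    if title:
        head.append("<title>%s</title>" % html.escape(title, quote=False))
    # Extra fields are emitted as meta tags; "somefield" below is the
    # placeholder name used in the manual, not a field Recoll defines.
    for name, value in (extra_fields or {}).items():
        head.append('<meta name="%s" content="%s">'
                    % (html.escape(name), html.escape(value)))
    return "<html><head>\n%s\n</head>\n<body>%s</body></html>\n" % (
        "\n".join(head), body)


def main(argv):
    # The filter is called with a single argument: the source file name.
    if len(argv) != 2:
        return 1
    with open(argv[1], encoding="utf-8", errors="replace") as f:
        text = f.read()
    # RECOLL_FILTER_FORPREVIEW is "yes" or "no"; this sketch reads it but
    # produces the same output in both cases.
    _for_preview = os.environ.get("RECOLL_FILTER_FORPREVIEW", "no") == "yes"
    doc = build_html(text, title=os.path.basename(argv[1]),
                     extra_fields={"somefield": "Some textual data"})
    # Write bytes so the output really is UTF-8, as declared in the header.
    sys.stdout.buffer.write(doc.encode("utf-8"))
    return 0


if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv))
```

Keeping the HTML construction in a separate function makes it easy to check the escaping without touching the file system.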
--- a/website/BUGS.txt
+++ b/website/BUGS.txt
@ -4,10 +4,21 @@ Bugs that are listed in an older version section are supposedly fixed in
 later versions. Bugs listed in the topmost section may also exist in older
 versions. 

-Latest (1.8.1):
+Latest (1.8.2):
+- There are a few problems in the qt4 version of recoll: some accelerators
+  (esc-spc, ctl-arrow) do not work, neither do copy/paste between the
+  result list and preview windows and x11 applications.
 - The dates shown for email attachments in a result list are the email
  folder modification date. This should be inherited from the parent
  message instead.
+- There are sometimes problems with document deletions: the index can
+  get in a state where deleted or moved documents are not purged from the
+  index (the log file says that the docs are deleted, but they actually
+  aren't). When this happens, the only solution currently is to reindex
+  from scratch (recollindex -z). This is due to a Xapian bug, which will be
+  fixed in a future release. You can apply the following patch to xapian
+  1.0.1 to fix it:
+      http://www.lesbonscomptes.com/recoll/xapian/xapian-delete-document.patch 
 - NEAR crashes: 1.6 has added NEAR searches. Unlike what recoll did
  with PHRASES, stemming expansion is performed on terms inside NEAR
  clauses (except if prevented by a capitalized entry of course). There is
@ -39,9 +50,9 @@ Latest (1.8.1):
  compressed (ie: xxx.txt.gz), recoll will try to start the external viewer
  on the compressed file, which will not work in most cases.

- There are problems which have been reported indexing big mailstores
-  (several hundreds of thousands of messages): resulting in a very big
-  database and even crashes during indexation.
+- Problems have been reported when indexing big mailstores (several
+  hundred thousand messages), resulting in a very big database and even
+  crashes.

 - Under some versions of KDE (ie: Fedora FC5 KDE 3.5.4-0.5.fc5), there is a
  problem with the window stacking order. Opening the "browse" file
--- a/website/CHANGES.txt
+++ b/website/CHANGES.txt
@ -1,5 +1,31 @@
 CHANGES 

+1.9.0
+- Add option to remember sort tool state between program invocations (it is
+  reset to inactive by default)
+- Improve qt4 build: no more need for --enable-qt4
+- Fixed a number of qt4 glitches: selection and keyboard shortcuts.
+- When searching for an empty string inside the preview window, position
+  the window to the next occurrence of the primary search terms.
+- Have email attachments inherit date and author from their parent message
+- Added an adjustable flush threshold during indexing: should help control
+  memory usage. See the idxflushmb configuration parameter.
+- Added a check for file system free space. Indexing will stop if the
+  threshold is reached. See the maxfsoccuppc configuration parameter.
+- Fix bus error on rclmon exit
+- Better handle aspell errors inside rclmon
+- Added File menu entry to erase document history.
+- Added ext: and mime: selectors to the query language.
+- Added support for arbitrary fields. Filters can now produce any number of
+  fields which will be selectively searchable through the query language.
+- Added abiword and kword support. 
+- Contributed filter: rcljpeg. This should be extended to use the new field
+  support.
+- Changed the icon to an ugly one. The previous one was nicer but looked
+  too much like Xapian's.
+- Added some kind of support for a stopword list.
+- Bound space and backspace to PgUp/PgDown in preview.
+
 1.8.2 2007-05-19
 - Fixed method name for compatibility with xapian 1.0.0
 - Add .beagle to default list of skipped names (avoids indexing beagle
--- a/website/credits.html
+++ b/website/credits.html
@ -38,7 +38,7 @@
      <p>First of all, many thanks to the users who provided criticism
 	and ideas to make <span class="application">Recoll</span> go
 	forward ! Please 
-	<a href="mailto:jean-francois.dockes@wanadoo.fr>
+	<a href="mailto:jean-francois.dockes@wanadoo.fr">
 	  contact me</a> if you have something to suggest.</p>

      <p><span class="application">Recoll</span> borrows
--- a/website/doc.html
+++ b/website/doc.html
@ -30,16 +30,24 @@
    
    <div class="content">

-      <h1>Recoll user manuals</h1>
+      <h1>Recoll user manual</h1>
      
-      <blockquote>
      <ul>
      <li><a href="usermanual/index.html">English</a></li>
      <li><a href="http://mcz.altervista.org/Pagine/usermanual-italian.html">
 	  Italian</a></li>
      </ul>
-      </blockquote>

+      <p><br></p>
+
+      <h1>Other documentation</h1>
+
+      <ul>
+      <li><a href="perfs.html">Index size and indexing performance
+	      data.</a></li> 
+      </ul>
+
+      
    </div>
  </body>
 </html>
--- a/website/download.html
+++ b/website/download.html
@ -24,7 +24,7 @@
      <ul>
 	<li><a href="index.html">Home</a></li>
 	<li><b>Downloads</b></li>
-	<li><a href="usermanual/index.html">User manual</a></li>
+	<li><a href="doc.html">Documentation</a></li>
 	<li><a href="usermanual/rcl.install.html">Installation</a></li>
 	<li><a href="index.html#support">Support</a></li>
      </ul>
@ -47,6 +47,8 @@
      </table>
      </p>

+      <h2><a name="source">General information</a></h2>
+
      <p>You will probably need to have a look at the
 	<a href="usermanual/rcl.install.html">installation manual</a> for
 	building and/or installation instructions.</p>
@ -68,12 +70,17 @@
 	<a href="usermanual/index.html#RCL.INSTALL.EXTERNAL">list</a> to
 	decide what you may want to install.</p>

+      <p>In addition, optional functionality in Recoll (the term explorer
+	tool in phonetic mode) uses the <b>aspell</b> package. The
+	installed version should be at least 0.60 (utf-8 support) for
+	this to run smoothly. This function is far from essential.</p>
+
      <p>If you find problems with the package or its
 	installation, <em>please</em> 
 	<a href="mailto:jean-francois.dockes@wanadoo.fr">
 	  report them</a>.</p>

-      <h4>What do the release numbers mean?</h4>
+      <h3>What do the release numbers mean?</h3>

      <p>The Recoll releases are numbered X.Y.Z. </p>

@ -110,7 +117,16 @@
 	1.8.2 was released purely for fixing a small issue of
 	compatibility with xapian 1.0.0 and small config/install
 	glitches.  There is no functional reason to upgrade from
-	1.8.1, (or update packages).
+	1.8.1, (or update packages).</p>
+
+      <p>Recoll 1.8.2 is the first release that will let you take
+	advantage of the new Xapian 1.0, the main user-visible change
+	of which is the new default index format. In order to take
+	advantage of the new format (which is not mandatory) Recoll
+	users updating from an older release need to delete their old
+	index. There are <a
+	href="usermanual/usermanual.html#RCL.INDEXING.STORAGE.FORMAT">more
+	details in the user manual</a>.</p>

      <p>Older recoll releases:
 	<a href="recoll-1.8.1.tar.gz">1.8.1</a>
@ -128,8 +144,8 @@
      <h2><a name="rpms">Packages</a></h2>

      <p>The executables inside the binary rpms have a static link to
-	xapian, there is no dependency except Qt 3.3. Of course you need
-	xapian-core installed to use the source rpm. </p>
+	xapian 0.9.x, there is no dependency except Qt 3.3. Of course
+	you need xapian-core installed to use the source rpm. </p>

      <p><b>Fedora Core</b>
 	FC6 RPM: 
@ -168,10 +184,16 @@
 	<a href="debian/edgy/">debian/edgy</a>
      </p>

+      <p><b>Ubuntu 6.06 dapper</b> (the feisty version does not work
+      on dapper). This is statically linked with xapian 0.9.10:
+	<a href="debian/dapper/recoll_1.8.2-0ubuntu1_i386.deb">
+	  recoll_1.8.2-0ubuntu1_i386.deb</a> </p>
+
      <p><b>Debian unstable</b> Recoll is in the package repository,
-      you can install it with the usual <em>apt-get install
-      recoll</em>. <a
-      href="http://packages.qa.debian.org/r/recoll.html">Package page</a></p>
+	you can install it with the usual <em>apt-get install
+	  recoll</em>. <a
+	  href="http://packages.qa.debian.org/r/recoll.html">
+	  Package page</a></p>

      <p><b>Debian 3.1</b> Thanks to Mario (<img align="top" src="mario.png">)
      for these: i386: 
--- a/website/features.html
+++ b/website/features.html
@ -142,6 +142,7 @@
 	</dd>
      </ul>

+
      <h2><a name="#stemming"></a>Stemming</h2>

      <p>Stemming is a process which transforms inflected words into
--- a/website/fr/features.html
+++ b/website/fr/features.html
@ -0,0 +1,205 @@
+<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
+
+<html>
+  <head>
+    <title>RECOLL: un outil personnel de recherche textuelle pour 
+    Unix et Linux</title>
+    <meta name="generator" content="HTML Tidy, see www.w3.org">
+    <meta name="Author" content="Jean-Francois Dockes">
+    <meta name="Description" content=
+    "recoll est un logiciel personnel de recherche textuelle pour unix et linux basé sur Xapian, un moteur d'indexation puissant et mature.">
+    <meta name="Keywords" content=
+      "recherche textuelle,desktop,unix,linux,solaris,open source,free">
+    <meta http-equiv="Content-language" content="fr">
+    <meta http-equiv="content-type" content=
+    "text/html; charset=iso-8859-1">
+    <meta name="robots" content="All,Index,Follow">
+    <link type="text/css" rel="stylesheet" href="../styles/style.css">
+  </head>
+
+  <body>
+
+    <div class="rightlinks">
+      <ul>
+	<li><a href="../index.html">Base</a></li>
+	<li><a href="../pics/index.html">Copies d'écrans</a></li>
+	<li><a href="../download.html">Téléchargements</a></li>
+	<li><a href="../manuals.html">Documentation</a></li>
+	<li><a href="../index.html#support">Support</a></li>
+	<li><a href="../devel.html">Développement</a></li>
+      </ul>
+    </div>
+
+    <div class="content">
+
+      <h1 class="intro">Caractéristiques de Recoll</h1>
+
+      <dl>
+	<dt><a name="systems">Systèmes</a></dt>
+	<dd><span class="application">Recoll</span> a été compilé et
+	testé sur FreeBSD, Linux, Darwin, Solaris (versions
+	  FreeBSD 5.5, Fedora Core 5, Suse 10.1, Gentoo,
+	  Debian 3.1, Ubuntu Edgy, Solaris 8/9, mais d'autres versions
+	  récentes conviennent sans doute également).</dd>
+
+	<dd>Versions de QT: 3.2, 3.3 et 4.2</dd>
+
+        <dt><a name="doctypes">Types de documents</a></dt>
+	<dd>Recoll peut traiter les types de documents suivants, ainsi
+	que des fichiers compressés du même type: 
+
+          <dl>
+            <dt>En interne</dt>
+
+            <dd>
+              <ul>
+                <li><var class="literal">text</var>.</li>
+
+                <li><var class="literal">html</var>.</li>
+
+                <li><span class="application">OpenOffice</span>
+                (avec l'aide de la commande <b>unzip</b>).</li>
+
+                <li><var class="literal">maildir</var> et <var
+		    class="literal">mailbox</var> (<span class=
+		    "application">Mozilla</span>, <span class=
+		    "application">Thunderbird</span>, <span class=
+		    "application">Evolution</span> et sans doute
+		    d'autres).</li> 
+
+                <li>Fichiers de conversation <span class="application">
+		    gaim</span>.</li>
+
+                <li><span class="application">Scribus</span>.</li>
+
+              </ul>
+            </dd>
+
+            <dt>With external helpers</dt>
+
+            <dd>
+              <ul>
+                <li><var class="literal">pdf</var> avec <a href=
+                "http://www.foolabs.com/xpdf/">xpdf</a>.</li>
+
+                <li><var class="literal">postscript</var> avec 
+           <a href="http://www.gnu.org/software/ghostscript/ghostscript.html">
+                ghostscript</a> et 
+           <a href="http://www.cs.wisc.edu/~ghost/doc/pstotext.htm">
+		    pstotext</a>.</li>
+
+                <li>Fichiers <span class="application">Lyx</span>
+                (nécessite l'application 
+		  <span class="application">Lyx</span>).</li>
+
+                <li><span class="application">msword</span> avec <a href=
+                "http://www.winfield.demon.nl/">antiword</a>.</li>
+
+                <li><span class="application">Powerpoint</span> et 
+		  <span class="application">Excel</span> avec les utilitaires
+		  <a href="http://www.45.free.net/~vitus/software/catdoc/">
+		    catdoc</a>.</li>
+
+                <li><var class="literal">rtf</var> avec <a href=
+                "http://www.gnu.org/software/unrtf/unrtf.html">unrtf</a>.</li>
+
+		<li><var class="literal">dvi</var> avec 
+		  <a href="http://www.radicaleye.com/dvips.html">dvips</a>.
+		</li>
+
+		<li><var class="literal">djvu</var> avec 
+		  <a href="http://djvulibre.djvuzone.org/doc/index.html">
+		    DjVuLibre</a>. </li>
+
+		<li>Tags <var class="literal">mp3</var> avec 
+		  <a href="http://id3lib.sourceforge.net/">
+		    id3info (id3lib)</a>. </li>
+
+              </ul>
+            </dd>
+          </dl>
+	</dd>
+
+	<dt>Autres caractéristiques</dt>
+	<dd>
+	  <ul>
+	    <li>Index multiples interrogeables ensemble ou séparément.</li>
+
+	    <li>Fonctions de recherche puissantes, avec expressions
+	    booléennes, phrases et proximité, caractères jokers,
+	    filtrage sur les types de fichiers où l'emplacement.</li>
+
+	    <li>Fonction spécifique de recherche de noms de fichiers.</li>
+
+	    <li>Support de jeux de caractères multiples. Les traitements
+	      internes et l'index utilisent l'encodage Unicode UTF-8.</li>
+
+	    <li>L'extraction des racines de mots <a href="#Stemming">
+		Stemming</a> est effectuée au moment de la recherche
+		(permet de changer de langue après l'indexation).</li>
+
+	    <li>Installation facile. Pas de processus permanent, de
+	      serveur web ou environnement exotique.</li>
+
+	    <li>Un indexeur qui peut fonctionner soit comme un
+	      processus léger dans l'interface de consultation, comme un
+	      programme batch externe intégrable par 
+	      <span class="application">cron</span>, ou comme un processus
+	      permanent pour l'indexation au fil de l'eau.</li>
+
+	  </ul>
+	</dd>
+      </dl>
+
+      <h2><a name="#stemming"></a>Lemmatisation</h2>
+
+      <p><em>Note: je serais preneur d'une traduction française
+	agréable pour "stemming".</em></p>
+      <p>La lemmatisation transforme un mot dérivé vers sa racine.
+       Par exemple, <i>aimer</i>, <i>aimerai</i>, <i>aimait</i>,
+	<i>aimez</i> etc. seraient transformés en <i>aim</i> en
+	français. Une recherche de l'un quelconque des dérivés peut
+	automatiquement être étendue vers tous les autres</p>
+
+      <p>Certains moteurs de recherche appliquent la transformation
+      pendant l'indexation. L'index ne stocke que les racines des
+      mots, avec des exceptions pour les termes qui sont reconnus
+      comme des noms propres (capitalisation). Au moment de la
+      recherche, les termes de la requête sont également transformés
+      avant comparaison à l'index.</p>
+      
+      <p>Cette approche permet un index plus petit, mais elle perd
+	irrévocablement de l'information pendant l'indexation.</p>
+
+      <p>Recoll fonctionne différemment. Les termes sont indexés sans
+	transformation. L'index résultant est plus gros, ce qui n'a
+	probablement pas beaucoup d'importance à une époque de disques
+	de 100 Go principalement remplis d'information multimédia
+	<em>non indexée</em>.</p>
+
+      <p>À la fin de l'indexation, Recoll construit un ou plusieurs
+      dictionnaires de transformation (pour différents langages), où
+      toutes les racines sont listées avec leurs transformations
+      possibles.</p>
+
+
+      <p>Au moment de la recherche, par défaut, les termes de
+      l'utilisateurs sont transformés, et étendus aux dérivés par
+      utilisation du dictionnaire.
+	Les résultats obtenus sont analogues à ceux de
+	l'autre méthode. L'avantage est que l'expansion peut être
+	contrôlée au moment de la recherche:
+	<ul>
+	<li>On peut la supprimer pour n'importe quel terme de la
+	  requête, (en le faisant débuter par une capitale:
+	  <em>Aime</em> par exemple pour chercher la ville d'Aime la
+	  Plagne). </li>
+	<li>Le langage de transformation peut également être changé,
+	en supposant que plusieurs dictionnaires de transformation
+	aient été construits lors de l'indexation.</li>
+      </ul>
+	
+    </div>
+  </body>
+</html>
+
--- a/website/index.html.en
+++ b/website/index.html.en
@ -81,6 +81,16 @@
 	<li><a class="weak" href="features.html">(more detail)</a></li>
      </ul>

+
+      <h2>News: </h2>
+      <p>There are new filters for 
+	<span class="application">kword</span> and 
+	<span class="application">abiword</span> files in the 
+	<a href="filters/filters.html">new filters section</a>. These
+	are usable with an existing <span
+	class="application">Recoll</span> 1.8 installation.</p>
+
+	
      <h2><a name="support">Support</a></h3>

      <p>If you have any problem with Recoll, its
--- a/website/index.html.fr
+++ b/website/index.html.fr
@ -97,6 +97,15 @@

      </ul>

+      <h2>Nouvelles: </h2>
+      <p>Il y a de nouveaux filtres d'indexation pour les fichiers
+	<span class="application">kword</span> et 
+	<span class="application">abiword</span>. Ils sont téléchargeables
+	dans la   <a href="filters/filters.html">zone des nouveaux
+	filtres</a>, et sont utilisable avec une installation existante de 
+	<span class="application">Recoll</span> 1.8.</p>
+
+
      <h2><a name="support">Support</a></h3>

      <p>Si vous avez un problème quelconque avec le logiciel ou son
--- a/website/mario.png
+++ b/website/mario.png
--- a/website/perfs.html
+++ b/website/perfs.html
@ -0,0 +1,114 @@
+<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
+
+<html>
+  <head>
+    <title>RECOLL: a personal text search system for
+    Unix/Linux</title>
+    <meta name="generator" content="HTML Tidy, see www.w3.org">
+    <meta name="Author" content="Jean-Francois Dockes">
+    <meta name="Description" content=
+    "recoll is a simple full-text search system for unix and linux based on the powerful and mature xapian engine">
+    <meta name="Keywords" content=
+      "full text search,fulltext,desktop search,unix,linux,solaris,open source,free">
+    <meta http-equiv="Content-language" content="en">
+    <meta http-equiv="content-type" content=
+    "text/html; charset=iso-8859-1">
+    <meta name="robots" content="All,Index,Follow">
+    <link type="text/css" rel="stylesheet" href="styles/style.css">
+  </head>
+
+  <body>
+
+    <div class="rightlinks">
+      <ul>
+	<li><a href="index.html">Home</a></li>
+	<li><a href="pics/index.html">Screenshots</a></li>
+	<li><a href="download.html">Downloads</a></li>
+	<li><a href="doc.html">Documentation</a></li>
+      </ul>
+    </div>
+
+    <div class="content">
+
+      <h1 class="intro">Recoll: Indexing performance and index sizes</h1>
+
+      <p>The time needed to index a given set of documents and the
+	resulting index size depend on many factors: file size and
+	proportion of actual text content for the index size; CPU
+	speed, available memory, average file size and format for the
+	speed of indexing.</p>
+
+      <p>We try here to give a number of reference points which can
+	be used to roughly estimate the resources needed to create and
+	store an index. Obviously, your data set will never fit one of
+	the samples, so the results cannot be exactly predicted.</p>
+
+      <p>The following data was obtained on a machine with a 1800 MHz
+	AMD Duron CPU, 768 MB of RAM, and a 7200 RPM 160 GB IDE
+	disk, running Suse 10.1.</p>
+
+      <p><b>recollindex</b> (version 1.8.2 with xapian 1.0.0) is
+	executed with the default flush threshold value. 
+	The process memory usage is the one reported by <b>ps</b>.</p>
+
+      <table border=1>
+	<thead>
+	  <tr>
+	    <th>Data</th>
+	    <th>Data size</th>
+	    <th>Indexing time</th>
+	    <th>Index size</th>
+	    <th>Peak process memory usage</th>
+	  </tr>
+	</thead>
+	<tbody>
+	  <tr>
+	    <td>Random pdfs harvested on Google</td>
+	    <td>1.7 GB, 3564 files</td>
+	    <td>27 min</td>
+	    <td>230 MB</td>
+	    <td>225 MB</td>
+	  </tr>
+	  <tr>
+	    <td>Ietf mailing list archive</td>
+	    <td>211 MB, 44,000 messages</td>
+	    <td>8 min</td>
+	    <td>350 MB</td>
+	    <td>90 MB</td>
+	  </tr>
+	  <tr>
+	    <td>Partial Wikipedia dump</td>
+	    <td>15 GB, one million files</td>
+	    <td>6H30</td>
+	    <td>10 GB</td>
+	    <td>324 MB</td>
+	  </tr>
+	  <tr>
+	    <!-- DB: ndocs 3564 lastdocid 3564 avglength 6460.71 -->
+	    <td>Random pdfs harvested on Google<br>
+	    Recoll 1.9, <em>idxflushmb</em> set to 10</td>
+	    <td>1.7 GB, 3564 files</td>
+	    <td>25 mn</td>
+	    <td>262 MB</td>
+	    <td>65 MB</td>
+	  </tr>
+	</tbody>
+      </table>
+
+      <p>Notice how the index size for the mail archive is bigger than
+	the data size. Myriads of small pure text documents will do
+	this. The expansion factor would of course be even worse with
+	compressed folders (the test was on uncompressed
+	data).</p>
+
+      <p>The last test was performed with Recoll 1.9.0, which has an
+	adjustable flush threshold (<em>idxflushmb</em> parameter), here
+	set to 10 MB. Notice the much lower peak memory usage, with no
+	performance degradation. The resulting index is bigger, though.
+	The exact reason is not known to me; it is possibly due to
+	additional fragmentation.</p>
+
+    </div>
+  </body>
+</html>
+
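As a reading aid for the performance table in the page above, the following Python sketch derives the index-to-data size ratio and the indexing throughput for each test. The figures are copied from the table; the names `SAMPLES`, `size_ratio`, `throughput` and `summarize` are illustrative only, and treating 1 GB as 1024 MB is an assumption, since the page does not say which convention it uses.

```python
# Figures copied from the table above:
# (data size in MB, indexing time in minutes, index size in MB).
# Assumption: 1 GB = 1024 MB.
GB = 1024

SAMPLES = {
    "pdfs": (1.7 * GB, 27, 230),
    "ietf_mail": (211, 8, 350),
    "wikipedia": (15 * GB, 6 * 60 + 30, 10 * GB),
    "pdfs_recoll_1.9": (1.7 * GB, 25, 262),
}


def size_ratio(data_mb, index_mb):
    """Index size divided by data size (above 1 means the index is bigger)."""
    return index_mb / data_mb


def throughput(data_mb, minutes):
    """Amount of source data indexed per minute, in MB."""
    return data_mb / minutes


def summarize(samples):
    """Return {name: (index/data ratio, MB per minute)}, rounded for display."""
    return {
        name: (round(size_ratio(data, index), 2), round(throughput(data, mins), 1))
        for name, (data, mins, index) in samples.items()
    }


if __name__ == "__main__":
    for name, (ratio, rate) in summarize(SAMPLES).items():
        print(f"{name}: index/data = {ratio}, {rate} MB/min")
```

Only the mail archive yields a ratio above 1, which matches the remark about small pure text documents. These ratios are rough guides for the test machine described above, not predictions for other data sets.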
--- a/website/rclidxfmt.html
+++ b/website/rclidxfmt.html
@ -2,72 +2,146 @@
 <html>
  <head>
    <title>Recoll Index format</title>
+    <meta name="generator" content="HTML Tidy, see www.w3.org">
+    <meta name="Author" content="Jean-Francois Dockes">
+    <meta name="Description" content=
+    "Recoll is a personal full text search tool for Unix and Linux, based on Xapian, a powerful and mature indexing engine.">
+    <meta name="Keywords" content=
+      "full text search,desktop,unix,linux,solaris,open source,free">
+    <meta http-equiv="Content-language" content="en">
+    <meta http-equiv="content-type" content=
+    "text/html; charset=iso-8859-1">
+    <meta name="robots" content="All,Index,Follow">
+    <link type="text/css" rel="stylesheet" href="styles/style.css">
  </head>

  <body>
+    <div class="content">
    <h1>Recoll index format details</h1>

-    <p>Terms are not stemmed before being stored. They are turned to
-      all minuscule letters with no accents.</p>
+    <p>A comparison of index formats for Recoll 1.8 and Omega
+    1.0.1.</p>

-    <p>Special prefixed terms:</p>
-    <ul>
-      <li>Ddate: modification date of file, like YYYYMMDD</li>
+    <p>Recoll terms are not stemmed before being stored. They are
+      converted to lowercase and stripped of accents. An auxiliary database
+      handles stem expansion. Omega stores both raw
+      terms and stemmed versions (with prefix Z).</p>

-      <li>Mmonth: YYYYMM</li>
+    <h2>Special prefixed terms:</h2>

-      <li>Ppathhash truncated/hashed version of file path. For
+    <p>A comparison of prefixed term usage between Recoll and
+      Omega/Xapian. <em>xapian-core</em> in the Omega column means
+      that the prefix is not used by Omega, but is mentioned as
+      allocated in the Xapian prefix definition document.</p>
+
+    <table border=1 cellspacing=0 width="90%">
+	<thead>
+	<tr><th>Pref.</th><th>Recoll use</th><th>Omega use</th>
+	</tr>
+      </thead>
+      <tbody>
+	<tr><td>T</td><td>MIME type</td><td>Same</td>
+	</tr>
+
+	<tr><td>P</td><td>Truncated/hashed version of file path. For
 	single-document files, and for the file part of a
 	multi-document file. Used for up-to-date checks and for
-	retrieving a document by path. omega uses U for the equivalent
-	term used for up to date checks.</li>
+	retrieving a document by path. </td><td>Path part of URL (no
+	hashing). Uses U for the equivalent
+	term used for up to date checks.</td> 
+	</tr>

-      <li>Qpathhash+ipath same + internal path for documents inside
-	multi-document files. Used to set the existence flag for
-	subdocs when a multi-document file is found to be up to date,
-	or for deleting all subdocs for a file, or for retrieving a
-	document by path+ipath. No real omega equivalent. Compatible
-	with Q definition in termprefixes.txt: unique identifier.</li>
+	<tr><td>Q</td><td>pathhash+ipath: same as P, plus the internal
+	path for documents inside multi-document files. Used to set the
+	existence flag for subdocs when a multi-document file is found
+	to be up to date, for deleting all subdocs for a file, or
+	for retrieving a document by path+ipath. Compatible
+	with the Q definition in xapian/termprefixes.txt: unique
+	identifier.</td><td>None</td>
+	</tr>

-      <li>Tmimetype: document mime type.</li>
+	<tr><td>D</td><td>date: modification date of file, like
+	YYYYMMDD</td><td>Same</td>
+	</tr>

-      <li>Wweak: 10 days period (not used any more by omega)</li>
+	<tr><td>M</td><td>month: YYYYMM</td><td>Same</td>
+	</tr>
+	<tr><td>Y</td><td>year YYYY</td><td>Same</td>
+	</tr>

-      <li>Yyear YYYY</li>
+	<tr><td>XSFN</td><td>UTF-8 version of the file name. Used for specific
+	file name searches.</td><td>None</td>
+	</tr>
+	<tr><td>U</td><td>None</td><td>URL term. Truncated/hashed version
+	    of the URL. Used for duplicate checks.</td>
+	</tr>

-      <li>XSFNfilename utf8 version of file name. Used for specific
-	file name searches</li>
+	<tr><td>S</td><td>Subject/title</td><td>xapian-core</td>
+	</tr>
+	<tr><td>A</td><td>Author</td><td>xapian-core</td>
+	</tr>
+	<tr><td>K</td><td>Keyword</td><td>xapian-core</td>
+	</tr>
+	
+      </tbody>
+    </table>

-    </ul>
-
-    <p>Omega prefixes with no equivalents in Recoll: P, R, U</p>
    <p>None of the "date" terms are currently used by recoll queries</p>

-    <p>Values: Recoll currently stores no document values.</p>
+    <h2>Values</h2>
+    <p>Recoll currently stores no document values.</p>
+    <p>Omega stores 2 values: the md5 hash of the file, and the
+      last modification date (as Unix time). The md5 value does not
+      appear to be used currently.</p>

-    <p>Document data record format<p>
-    <ul>
-      <li>url= Full url. Always file://abspath. The path is not
+    <h2>Document data record format</h2>
+      <p>Recoll uses the same line-based, prefixed data record format
+      as Omega (name=value\n).</p>
+
+    <table border=1 cellspacing=0 width="90%">
+	<thead>
+	<tr><th>Prefix</th><th>Recoll use</th><th>Omega use</th>
+	</tr>
+      </thead>
+      <tbody>
+	
+      <tr><td>url=</td><td>Full url. Always file://abspath. The path is not
 	encoded to UTF-8; this is the system file name, usable as an
-	argument to open(). (omega: sort of same)</li>
-      <li>mtype= mime type (omega: type)</li>
-      <li>fmtime= file modification date (omega: modtime)</li>
-      <li>dmtime= document modification date (omega: none)</li>
-      <li>origcharset= character set the text was converted from
-	(omega: none)</li>
-      <li>fbytes= file size in bytes (omega: size)</li>
-      <li>dbytes= document size in bytes (omega: none)</li>
-      <li>ipath= internal path for docs in multidoc files. (omega: none)</li>
-      <li>caption= title of document, utf8 (omega: same)</li>
-      <li>keywords= key words, utf8 (omega: none)</li>
-      <li>abstract= document abstract, utf8 (omega: sample)</li>
-    </ul>
+	argument to open()</td><td>Same</td>
+	</tr>
+
+	<tr><td>mtype=</td><td>MIME type</td><td>type=</td>
+	</tr>
+	<tr><td>fmtime=</td><td>file modification date</td><td>modtime=</td>
+	</tr>
+	<tr><td>dmtime=</td><td> document modification date</td><td>None</td>
+	</tr>
+	<tr><td>origcharset=</td><td> character set the text was
+	    converted from</td><td>None</td>
+	</tr>
+	<tr><td>fbytes=</td><td> file size in bytes</td><td>size=</td>
+	</tr>
+	<tr><td>dbytes=</td><td>document size in bytes</td><td>None</td>
+	</tr>
+	<tr><td>ipath=</td><td>internal path for docs in multidoc
+	    files</td><td>None</td>
+	</tr>
+
+	<tr><td>caption=</td><td>title of document, UTF-8</td><td>Same</td>
+	</tr>
+	<tr><td>keywords=</td><td>key words, UTF-8</td><td>None</td>
+	</tr>
+	<tr><td>abstract=</td><td>document abstract, UTF-8</td><td>sample=</td>
+	</tr>
+      </tbody>
+    </table>
+    </div>

    <hr>
    <address><a href="mailto:jean-francois.dockes@wanadoo.fr">Jean-Francois Dockes</a></address>
 <!-- Created: Thu Dec  7 13:07:40 CET 2006 -->
 <!-- hhmts start -->
-Last modified: Thu Dec  7 14:19:02 CET 2006
+Last modified: Thu Jun 14 11:14:38 CEST 2007
 <!-- hhmts end -->
  </body>
 </html>
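The `name=value\n` data record format and the D/M/Y date terms described in the page above can be illustrated with a short Python sketch. This is not Recoll or Omega code: `parse_record`, `date_terms` and the sample record are hypothetical, and splitting each line at its first `=` is an assumption about how such a record would be read.

```python
from datetime import date


def parse_record(data):
    """Parse a line-based 'name=value' data record into a dict.

    Assumption: each line is split at its first '=', so values may
    themselves contain '='. Lines without '=' (including the empty
    string after the final newline) are skipped.
    """
    fields = {}
    for line in data.split("\n"):
        name, sep, value = line.partition("=")
        if sep:
            fields[name] = value
    return fields


def date_terms(day):
    """Build the D (YYYYMMDD), M (YYYYMM) and Y (YYYY) prefixed terms."""
    return [
        "D" + day.strftime("%Y%m%d"),
        "M" + day.strftime("%Y%m"),
        "Y" + day.strftime("%Y"),
    ]


if __name__ == "__main__":
    # Hypothetical record using field names from the table above.
    sample = "url=file:///home/me/doc.txt\nmtype=text/plain\ncaption=A title\n"
    print(parse_record(sample))
    print(date_terms(date(2007, 6, 14)))
```

The field names in the sample (`url`, `mtype`, `caption`) come from the Recoll column of the record format table; an Omega record would use `type=` where Recoll uses `mtype=`.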
--- a/website/smile.png
+++ b/website/smile.png
--- a/website/styles/style.css
+++ b/website/styles/style.css
@ -92,3 +92,4 @@ a.weak {
    color: #aaaaaa;
 }

+table { empty-cells:show; }