<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>RECOLL: a personal text search system for
Unix/Linux</title>
<meta name="generator" content="HTML Tidy, see www.w3.org">
<meta name="Author" content="Jean-Francois Dockes">
<meta name="Description" content=
"recoll is a simple full-text search system for unix and linux based on the powerful and mature xapian engine">
<meta name="Keywords" content=
"full text search,fulltext,desktop search,unix,linux,solaris,open source,free">
<meta http-equiv="Content-language" content="en">
<meta http-equiv="content-type" content=
"text/html; charset=iso-8859-1">
<meta name="robots" content="All,Index,Follow">
<link type="text/css" rel="stylesheet" href="styles/style.css">
</head>

<body>

<div class="rightlinks">
<ul>
<li><a href="index.html">Home</a></li>
<li><a href="pics/index.html">Screenshots</a></li>
<li><a href="download.html">Downloads</a></li>
<li><a href="usermanual/index.html">User manual</a></li>
<li><a href="index.html#support">Support</a></li>
<li><a href="devel.html">Development</a></li>
</ul>
</div>

<div class="content">

<h1 class="intro">Recoll features</h1>

<dl>
<dt><a name="systems">Supported systems</a></dt>
<dd><span class="application">Recoll</span> has been compiled and
tested on FreeBSD, Linux, Darwin and Solaris (versions
FreeBSD 5-7, Redhat 7/8/9, Fedora Core 5-10, Suse 10/11,
Gentoo, Debian 3.1, Solaris 8/9/10; other reasonably close
releases should work too).</dd>

<dd>Works with Qt versions from 3.1 to 4.5.</dd>

<dt><a name="doctypes">Document types</a></dt>
<dd>Supports the following document types (along with their
compressed versions):

<dl>
<dt>Natively</dt>

<dd>
<ul>
<li><span class="literal">text</span>.</li>

<li><span class="literal">html</span>.</li>

<li><span class="literal">maildir</span> and <span
class="literal">mailbox</span> (<span class=
"literal">Mozilla</span>, <span class=
"literal">Thunderbird</span> and <span class=
"literal">Evolution</span> mail is supported).</li>

<li><span class="literal">OpenOffice</span>
files (needs the <span class="command">unzip</span> command).</li>

<li><span class="literal">Microsoft Office Open XML</span>
files (needs the <span class="command">unzip</span> command).</li>

<li><span class="literal">Abiword</span>
files.</li>

<li><span class="literal">Kword</span>
files.</li>

<li><span class="literal">gaim</span> log files.</li>

<li><span class="literal">Lyx</span> files (needs
<span class="literal">Lyx</span> to be installed).</li>

<li><span class="literal">Scribus</span> files.</li>

</ul>
</dd>

<dt>With external helpers</dt>

<dd>
<ul>
<li><span class="literal">pdf</span> with the <span
class="command">pdftotext</span> command, which can be
installed as part of <a href=
"http://www.foolabs.com/xpdf/">xpdf</a> or <a
href="http://poppler.freedesktop.org/">poppler</a>,
depending on your distribution.</li>

<li><span class="literal">msword</span> with <a href=
"http://www.winfield.demon.nl/">antiword</a>.</li>

<li><span class="literal">Powerpoint</span> and
<span class="literal">Excel</span> with the
<a href="http://catdoc.klik.atekon.de">
catdoc</a> utilities.</li>

<li><span class="literal">CHM (Microsoft help)</span>
files (needs <span class="command">Python, pychm,
chmlib</span>).</li>

<li><span class="literal">Zip</span>
archives (needs <span class="command">Python</span>).</li>

<li><span class="literal">iCalendar</span> (.ics) files
(needs <span class="command">Python,
<a href="http://pypi.python.org/pypi/icalendar/2.1">icalendar</a></span>).</li>

<li><span class="literal">Wordperfect</span> with <a href=
"http://libwpd.sourceforge.net">libwpd</a>.</li>

<li><span class="literal">postscript</span> with <a
href=
"http://www.gnu.org/software/ghostscript/ghostscript.html">
ghostscript</a> and <a href=
"http://www.cs.wisc.edu/~ghost/doc/pstotext.htm">pstotext</a>.</li>

<li><span class="literal">rtf</span> with <a href=
"http://www.gnu.org/software/unrtf/unrtf.html">unrtf</a>.</li>

<li><span class="literal">TeX</span> with
<span class="command">untex</span>. If there is no untex
package for your distribution,
<a href="untex/untex-1.3.jf.tar.gz">a source package is
stored on this site</a> (as untex has no obvious
home).
This also works
with <a
href="http://www.cs.purdue.edu/homes/trinkle/detex/">detex</a>
if it is installed.
</li>

<li><span class="literal">dvi</span> with
<a href="http://www.radicaleye.com/dvips.html">dvips</a>.
</li>

<li><span class="literal">djvu</span> with
<a href="http://djvu.sourceforge.net">DjVuLibre</a>.
</li>
<li><span class="literal">mp3</span> tags support with
<a href="http://id3lib.sourceforge.net/">id3info (id3lib)</a>.
</li>
<li>Image file tags support with
<a href="http://www.sno.phy.queensu.ca/~phil/exiftool/">
exiftool</a>. This is a perl program, so you also
need perl on the system. This works with almost any
image file and tag format (jpg, png, tiff,
gif etc.).
</li>

</ul>
</dd>
</dl>
</dd>

<dt>Other features</dt>
<dd>
<ul>
<li>Processes all email attachments.</li>

<li>Multiple selectable databases.</li>

<li>Powerful query facilities, with boolean searches,
phrases, and filtering on file types and directory tree.</li>

<li>Xesam-compatible query language.</li>

<li>Specific file name searches with wildcards.</li>

<li>Support for multiple charsets. Internal processing and
storage use Unicode UTF-8.</li>

<li><a href="#stemming">Stemming</a> performed at query
time (can switch stemming language after indexing).</li>

<li>Easy installation. No database daemon, web server or
exotic language necessary.</li>

<li>An indexer which runs either as a thread inside the GUI,
as an external, batch, cron'able program, or as a
real-time indexing daemon.</li>
</ul>
</dd>
</dl>

<h2><a name="stemming"></a>Stemming</h2>

<p>Stemming is a process which transforms inflected words into
their most basic form. For example, <i>flooring</i>,
<i>floors</i> and <i>floored</i> would probably all be transformed
to <i>floor</i> by a stemmer for the English language.</p>

<p>In many search engines, the stemming process occurs during
indexing. The index will only contain the stemmed form of words,
with exceptions for terms which are detected as being probably
proper nouns (i.e. capitalized). At query time, the terms entered
by the user are stemmed, then matched against the index.</p>

<p>This process results in a smaller index, but it has the
serious drawback of irrevocably losing information during
indexing.</p>
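<p>The information loss can be illustrated with a minimal Python
sketch. The <i>toy_stem</i> function and the sample documents are
assumptions made up for this illustration: they are not part of
Recoll or Xapian, and a real stemmer is far more elaborate.</p>

```python
def toy_stem(word):
    """Very crude English suffix stripper, for illustration only."""
    for suffix in ("ing", "ed", "s"):
        root = word[: -len(suffix)]
        if word.endswith(suffix) and len(root) >= 3:
            return root
    return word


# Hypothetical sample documents.
documents = {1: "new floors installed", 2: "flooring samples"}

# Classical approach: only the stemmed form of each word is stored.
stemmed_index = {}
for doc_id, text in documents.items():
    for word in text.split():
        stemmed_index.setdefault(toy_stem(word), set()).add(doc_id)

# Both documents are now filed under the same stem...
print(stemmed_index["floor"])
# ...and the original word forms can no longer be told apart.
print("floors" in stemmed_index, "flooring" in stemmed_index)
```

<p>Once the index is built this way, a search for the exact word
<i>flooring</i> cannot be distinguished from a search for
<i>floors</i> without reindexing.</p>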

<p>Recoll works in a different way. No stemming is performed at
indexing time, so that all information gets into the index. The
resulting index is bigger, but most people probably don't care
much about this nowadays, because they have a 100 GB disk 95%
full of binary data <em>which does not get indexed</em>.</p>

<p>At the end of an indexing pass, Recoll builds one or several
stemming dictionaries, which map each word stem to the list of
its derivatives found in the index.</p>
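<p>The idea of such a dictionary can be sketched as follows. This is
only an illustration under stated assumptions: <i>toy_stem</i>,
<i>build_stem_dictionary</i> and the sample terms are hypothetical
names and data, and do not reflect how Recoll actually stores its
stemming dictionaries.</p>

```python
from collections import defaultdict


def toy_stem(word):
    """Very crude English suffix stripper, for illustration only."""
    for suffix in ("ing", "ed", "s"):
        root = word[: -len(suffix)]
        if word.endswith(suffix) and len(root) >= 3:
            return root
    return word


def build_stem_dictionary(index_terms):
    """Map each stem to the sorted list of unstemmed index terms sharing it."""
    stems = defaultdict(set)
    for term in index_terms:
        stems[toy_stem(term)].add(term)
    return {stem: sorted(words) for stem, words in stems.items()}


# Hypothetical unstemmed terms, as they would be found in the index.
index_terms = ["floor", "floors", "floored", "flooring", "door", "doors"]
stem_dictionary = build_stem_dictionary(index_terms)
print(stem_dictionary["floor"])
```

<p>Because the index itself keeps the unstemmed terms, a dictionary
like this can be rebuilt, or built for another language, without
reading the documents again.</p>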

<p>At query time, by default, user-entered terms are stemmed,
then matched against the stem database, and the query is
expanded to include all derivatives. This yields search
results analogous to those obtained by a classical engine.
The benefit of this approach is that stem expansion can be
controlled instantly at query time in several ways:</p>
<ul>
<li>It can be selectively turned off for any query term by
capitalizing it (<i>Floor</i>).</li>
<li>The stemming language (e.g. English, French...) can be
selected (this supposes that several stemming databases have
been built, which can be configured as part of the indexing,
or done later, in a reasonably fast way).</li>
</ul>
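<p>Query-time expansion with these two controls might look like the
following sketch. All names here (<i>toy_stem</i>,
<i>STEM_DATABASES</i>, <i>expand_term</i>) are hypothetical, the stem
database contents are made up, and the sketch assumes that index
terms are stored in lower case. It is not Recoll's actual
implementation.</p>

```python
def toy_stem(word):
    """Very crude English suffix stripper, for illustration only."""
    for suffix in ("ing", "ed", "s"):
        root = word[: -len(suffix)]
        if word.endswith(suffix) and len(root) >= 3:
            return root
    return word


# Hypothetical per-language stem databases: stem -> derivatives in the index.
STEM_DATABASES = {
    "english": {"floor": ["floor", "floored", "flooring", "floors"]},
}


def expand_term(term, language="english"):
    """Return the index terms to search for one user-entered term."""
    # A capitalized term turns stem expansion off for that term only.
    if term[:1].isupper():
        return [term.lower()]
    stem = toy_stem(term.lower())
    # Fall back to the term itself if no database or no entry exists.
    derivatives = STEM_DATABASES.get(language, {}).get(stem)
    return list(derivatives) if derivatives else [term.lower()]


print(expand_term("floors"))            # expanded to all derivatives
print(expand_term("Floor"))             # capitalized: no expansion
print(expand_term("floors", "french"))  # no such database in this sketch
```

<p>Since the expansion happens only when the query is run, changing
the language argument or capitalizing a term takes effect
immediately, with no reindexing.</p>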
</div>
</body>
</html>