<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>
<head>
<title>RECOLL: a personal text search system for
Unix/Linux</title>
<meta name="generator" content="HTML Tidy, see www.w3.org">
<meta name="Author" content="Jean-Francois Dockes">
<meta name="Description" content=
"recoll is a simple full-text search system for unix and linux based on the powerful and mature xapian engine">
<meta name="Keywords" content=
"full text search,fulltext,desktop search,unix,linux,solaris,open source,free">
<meta http-equiv="Content-language" content="en">
<meta http-equiv="content-type" content=
"text/html; charset=iso-8859-1">
<meta name="robots" content="All,Index,Follow">
<link type="text/css" rel="stylesheet" href="styles/style.css">
</head>

<body>

<div class="rightlinks">
<ul>
<li><a href="index.html">Home</a></li>
<li><a href="pics/index.html">Screenshots</a></li>
<li><a href="download.html">Downloads</a></li>
<li><a href="usermanual/index.html">User manual</a></li>
<li><a href="index.html#support">Support</a></li>
<li><a href="devel.html">Development</a></li>
</ul>
</div>

<div class="content">

<h1 class="intro">Recoll features</h1>

<dl>
<dt><a name="systems">Supported systems</a></dt>
<dd><span class="application">Recoll</span> has been compiled and
tested on FreeBSD, Linux, Darwin and Solaris (specifically
FreeBSD 5.5, Red Hat 7.3, Fedora Core 5, SUSE 10.1, Gentoo,
Debian 3.1 and Solaris 8/9; other reasonably close releases
should work too). You can download the source code and some
precompiled packages <a href="download.html">here</a>.</dd>

<dd>Qt versions 3.1 and later are supported.</dd>

<dt><a name="doctypes">Document types</a></dt>
<dd>Supports the following document types (along with their
compressed versions):

<dl>
<dt>Natively</dt>

<dd>
<ul>
<li><var class="literal">text</var>.</li>

<li><var class="literal">html</var>.</li>

<li><span class="application">OpenOffice</span>
files (needs <b>unzip</b> command).</li>

<li><var class="literal">maildir</var> and <var
class="literal">mailbox</var> (<span class=
"application">Mozilla</span>, <span class=
"application">Thunderbird</span> and <span class=
"application">Evolution</span> mail ok).</li>

<li><span class="application">gaim</span> log files.</li>
</ul>
</dd>
<dt>With external helpers</dt>

<dd>
<ul>
<li><var class="literal">pdf</var> with <a href=
"http://www.foolabs.com/xpdf/">xpdf</a>.</li>

<li><var class="literal">postscript</var> with <a
href=
"http://www.gnu.org/software/ghostscript/ghostscript.html">
ghostscript</a> and <a href=
"http://www.cs.wisc.edu/~ghost/doc/pstotext.htm">pstotext</a>.</li>

<li><var class="literal">msword</var> with <a href=
"http://www.winfield.demon.nl/">antiword</a>.</li>
<li><var class="literal">Powerpoint</var> and
<var class="literal">Excel</var> with the
<a href="http://www.45.free.net/~vitus/software/catdoc/">
catdoc</a> utilities.</li>

<li><var class="literal">rtf</var> with <a href=
"http://www.gnu.org/software/unrtf/unrtf.html">unrtf</a>.</li>

<li><var class="literal">dvi</var> with
<a href="http://www.radicaleye.com/dvips.html">dvips</a>.
</li>

<li><var class="literal">djvu</var> with
<a href="http://djvulibre.djvuzone.org/doc/index.html">DjVuLibre</a>.
</li>
<li><var class="literal">mp3</var> tags support with
<a href="http://id3lib.sourceforge.net/">id3info (id3lib)</a>.
</li>

</ul>
</dd>
</dl>
</dd>

<dt>Other features</dt>
<dd>
<ul>
<li>Multiple selectable databases.</li>

<li>Powerful query facilities, with boolean searches,
phrase searches, and filtering on file type and directory tree.</li>

<li>Specific file name searches with wildcards.</li>

<li>Support for multiple charsets. Internal processing and
storage use Unicode UTF-8.</li>

<li><a href="#Stemming">Stemming</a> performed at query
time (can switch stemming language after indexing).</li>

<li>Easy installation. No database daemon, web server or
exotic language necessary.</li>

<li>An indexer which runs either as a thread inside the GUI
or as an external, cron'able program.</li>
</ul>
</dd>
</dl>

<h2><a name="Stemming"></a>Stemming</h2>

<p>Stemming is a process which transforms inflected words into
their most basic form. For example, <i>flooring</i>,
<i>floors</i> and <i>floored</i> would probably all be transformed
to <i>floor</i> by a stemmer for the English language.</p>

<p>In many search engines, the stemming process occurs during
indexing. The index then contains only the stemmed form of words,
with exceptions for terms which are detected as probably being
proper nouns (i.e. capitalized). At query time, the terms entered
by the user are stemmed, then matched against the index.</p>

<p>This process results in a smaller index, but it has the
serious drawback of irrevocably losing information during
indexing.</p>

<p>Recoll works in a different way. No stemming is performed at
indexing time, so that all information gets into the index. The
resulting index is bigger, but most people probably don't care
much about this nowadays, because they have a 100 GB disk 95%
full of binary data <em>which does not get indexed</em>.</p>
<p>At the end of an indexing pass, Recoll builds one or several
stemming dictionaries, which map each word stem to the list of
its derivatives.</p>
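<p>As a rough illustration of what such a dictionary could look like,
the Python sketch below groups unstemmed index terms by stem. This is
not Recoll's actual code: <var class="literal">toy_stem</var> is a
made-up suffix stripper standing in for a real stemmer, and
<var class="literal">build_stem_dictionary</var> and the sample terms
are hypothetical names chosen for this example.</p>

```python
from collections import defaultdict


def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a few English suffixes."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_stem_dictionary(index_terms, stem=toy_stem):
    """Map each stem to the sorted list of unstemmed index terms sharing it."""
    stem_dict = defaultdict(set)
    for term in index_terms:
        stem_dict[stem(term)].add(term)
    return {s: sorted(terms) for s, terms in stem_dict.items()}


# Made-up sample of terms as they would appear, unstemmed, in the index.
index_terms = ["floor", "floors", "floored", "flooring", "door"]
stem_dictionary = build_stem_dictionary(index_terms)
```

<p>Because the index itself keeps the unstemmed terms, a dictionary
like this can be rebuilt for another language without reindexing the
documents.</p>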

<p>At query time, by default, user-entered terms are stemmed,
then matched against the stem database, and the query is
expanded to include all derivatives. This yields search
results analogous to those obtained by a classical engine.
The benefit of this approach is that stem expansion can be
controlled instantly at query time in several ways:</p>
<ul>
<li>It can be selectively turned off for any query term by
capitalizing it (<i>Floor</i>).</li>
<li>The stemming language (e.g. English, French) can be
selected (this assumes that several stemming databases have
been built, which can be configured as part of the indexing,
or done later, in a reasonably fast way).</li>
</ul>
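<p>The query-time behaviour described above can be sketched in Python
as follows. This is only an illustration under stated assumptions, not
Recoll's implementation: <var class="literal">STEM_DICTIONARIES</var>,
<var class="literal">STEMMERS</var>,
<var class="literal">toy_english_stem</var> and
<var class="literal">expand_term</var> are hypothetical names, the
dictionary data is made up, and the sketch assumes that index terms are
stored in lowercase.</p>

```python
def toy_english_stem(word):
    """Hypothetical stand-in for a real English stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Assumption: one stemmer and one stem dictionary per stemming language.
STEMMERS = {"english": toy_english_stem}
STEM_DICTIONARIES = {
    "english": {"floor": ["floor", "floored", "flooring", "floors"]},
}


def expand_term(term, language="english"):
    """Return the index terms to search for one user-entered term."""
    if term[:1].isupper():
        # Capitalized term: stem expansion is turned off for this term.
        # Assumption: the index stores lowercase terms, so we lowercase it.
        return [term.lower()]
    stemmer = STEMMERS.get(language)
    if stemmer is None:
        # No stemming database for this language: search the term as typed.
        return [term]
    stem = stemmer(term)
    # Expand to all known derivatives, or fall back to the term itself.
    return STEM_DICTIONARIES[language].get(stem, [term])
```

<p>In this sketch, <var class="literal">expand_term("floors")</var>
returns every derivative listed for the stem <i>floor</i>, while
<var class="literal">expand_term("Floor")</var> returns only
<i>floor</i>.</p>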

</div>
</body>
</html>