<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>
<head>
<title>RECOLL: a personal text search system for
Unix/Linux</title>
<meta name="generator" content="HTML Tidy, see www.w3.org">
<meta name="Author" content="Jean-Francois Dockes">
<meta name="Description" content=
"recoll is a simple full-text search system for unix and linux based on the powerful and mature xapian engine">
<meta name="Keywords" content=
"full text search,fulltext,desktop search,unix,linux,solaris,open source,free">
<meta http-equiv="Content-language" content="en">
<meta http-equiv="content-type" content=
"text/html; charset=iso-8859-1">
<meta name="robots" content="All,Index,Follow">
<link type="text/css" rel="stylesheet" href="styles/style.css">
</head>

<body>
<div class="rightlinks">
<ul>
<li><a href="index.html">Home</a></li>

<li><a href="pics/index.html">Screenshots</a></li>

<li><a href="download.html">Downloads</a></li>

<li><a href="usermanual/index.html">User manual</a></li>

<li><a href="index.html#support">Support</a></li>

<li><a href="devel.html">Development</a></li>
</ul>
</div>

<div class="content">
<h1 class="intro">Recoll features</h1>

<h2><a name="systems">Supported systems</a></h2>

<p><span class="application">Recoll</span> has been compiled
and tested on FreeBSD, Linux, Darwin and Solaris (initial
versions FreeBSD 5, Red Hat 7, Fedora Core 5, SUSE 10, Gentoo,
Debian 3.1, Solaris 8). It should compile and run on all
subsequent releases of these systems and probably a few
others too.</p>

<p>Supported Qt versions range from 3.1 to 4.7.</p>

<h2><a name="doctypes">Document types</a></h2>

<p>Recoll can index many document types (along with their
compressed versions). Some types are handled internally (no
external application needed). Other types need a separate
application to be installed to extract the text. Types that
only need very common utilities (awk/sed/groff etc.) are
listed in the native section.</p>

<h4>File types indexed natively</h4>

<ul>
<li><span class="literal">text</span>.</li>

<li><span class="literal">html</span>.</li>

<li><span class="literal">maildir</span> and <span class=
"literal">mailbox</span> (<span class=
"literal">Mozilla</span>, <span class=
"literal">Thunderbird</span> and <span class=
"literal">Evolution</span> mail ok).</li>

<li><span class="literal">gaim</span> and <span class=
"literal">purple</span> log files.</li>

<li><span class="literal">Lyx</span> files (needs <span
class="literal">Lyx</span> to be installed).</li>

<li><span class="literal">Scribus</span> files.</li>

<li><span class="literal">Man pages</span> (need <span
class="command">groff</span>).</li>
</ul>

<h4>File types indexed with external helpers</h4>

<p>Many document types need the <span class="command">iconv</span>
command in addition to the applications specifically listed.</p>

<h5>The XML ones</h5>
<p>The following types need <span class=
"command">xsltproc</span> from the <b>libxslt</b> package.
Quite a few also need <span class="command">unzip</span>:</p>

<ul>
<li><span class="literal">Abiword</span> files.</li>

<li><span class="literal">Fb2</span> ebooks.</li>

<li><span class="literal">Kword</span> files.</li>

<li><span class="literal">Microsoft Office Open XML</span>
files.</li>

<li><span class="literal">OpenOffice</span> files.</li>

<li><span class="literal">SVG</span> files.</li>
</ul>
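
<p>To illustrate why these types need both an unzip step and an
XML processing step, here is a minimal Python sketch. It is not
Recoll's filter code: it swaps in the Python standard library for
<span class="command">unzip</span> and <span class=
"command">xsltproc</span>, and it assumes a zipped format whose text
lives in an archive member named <tt>content.xml</tt> (as in
OpenOffice documents). The function name and the default member name
are hypothetical choices made for this example.</p>

```python
import io
import xml.etree.ElementTree as ET
import zipfile


def extract_zipped_xml_text(data, member="content.xml"):
    """Return the plain text found in one XML member of a zip archive.

    Assumption: the document is a zip container and its text is stored
    in the XML member named by `member`. Real filters do considerably
    more (metadata, paragraph breaks, character set handling).
    """
    # Step 1: the "unzip" part, pull the XML member out of the archive.
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        xml_bytes = archive.read(member)

    # Step 2: the "XML processing" part, drop the markup and keep the text.
    root = ET.fromstring(xml_bytes)
    pieces = (text.strip() for text in root.itertext())
    return " ".join(piece for piece in pieces if piece)
```

<p>Calling <tt>extract_zipped_xml_text(open(path, "rb").read())</tt>
on such a file would return its text as a single string, which is
the kind of output an indexer needs from a helper.</p>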

<h5>Other formats</h5>

<ul>
<li><span class="literal">pdf</span> with the <span class=
"command">pdftotext</span> command, which can be installed
as part of <a href="http://www.foolabs.com/xpdf/">xpdf</a>
or <a href="http://poppler.freedesktop.org/">poppler</a>,
depending on your distribution.</li>

<li><span class="literal">msword</span> with <a href=
"http://www.winfield.demon.nl/">antiword</a>. It is also useful to
have <a href="http://wvware.sourceforge.net/">wvWare</a> installed
as it may be used as a fallback for some files which antiword
does not handle.</li>

<li><span class="literal">Powerpoint</span> and <span
class="literal">Excel</span> with the <a href=
"http://catdoc.klik.atekon.de">catdoc</a> utilities.</li>

<li><span class="literal">CHM (Microsoft help)</span> files
with <span class="command">Python, <a href="http://gnochm.sourceforge.net/pychm.html">pychm</a>
and <a href="http://www.jedrea.com/chmlib/">chmlib</a></span>.</li>

<li><span class="literal">GNU info</span> files
with <span class="command">Python</span> and the
<span class="command">info</span> command.</li>

<li><span class="literal">Zip</span> archives (needs <span
class="command">Python</span>).</li>

<li><span class="literal">Rar</span> archives (needs <span
class="command">Python</span>, the
<a href="http://pypi.python.org/pypi/rarfile/">rarfile</a> Python
module and the <a
href="http://www.rarlab.com/rar_add.htm">unrar</a> utility).</li>

<li><span class="literal">iCalendar</span> (.ics) files
(needs <span class="command">Python, <a href=
"http://pypi.python.org/pypi/icalendar/2.1">icalendar</a></span>).</li>

<li><span class="literal">Mozilla calendar data</span>. See
<a href=
"http://bitbucket.org/medoc/recoll/wiki/IndexMozillaCalendari">
the wiki</a> about this.</li>

<li><span class="literal">Wordperfect</span> with the
<span class="command">wpd2html</span> command from <a href=
"http://libwpd.sourceforge.net">libwpd</a>. On some distributions,
the command may come with a package named <span
class="literal">libwpd-tools</span> or similar, not the base <span
class="literal">libwpd</span> package.</li>

<li><span class="literal">postscript</span> with <a href=
"http://www.gnu.org/software/ghostscript/ghostscript.html">
ghostscript</a> and <a href=
"http://www.cs.wisc.edu/~ghost/doc/pstotext.htm">pstotext</a>.
Note that the pstotext 1.9 found at the latter link has a
problem with file names containing special shell characters.
You should either use the version packaged for your system,
which is probably patched, or apply the Debian patch, which
is stored <a href=
"files/pstotext-1.9_4-debian.patch">here</a> for
convenience. See
http://packages.debian.org/squeeze/pstotext and
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=356988 for
references/explanations.</li>

<li><span class="literal">RTF</span> files with <a href=
"http://www.gnu.org/software/unrtf/unrtf.html">unrtf</a>. Please
note that up to version
0.21, <span class="command">unrtf</span> mostly does not work
with non-Western-European character sets. If you need to
index, for example, Russian or Chinese RTF files, I have
produced a modified version which works much better (as
indicated by my tests and a few external ones). You can
download the <a href="unrtf/unrtf-0.22.2beta.tar.gz">source
here</a>. The development is hosted
on <a href="http://www.bitbucket.org/medoc/unrtf-int">
bitbucket.org</a>.</li>

<li><span class="literal">TeX</span> with <span class=
"command">untex</span>. If there is no untex package for
your distribution, <a href="untex/untex-1.3.jf.tar.gz">a
source package is stored on this site</a> (as untex has no
obvious home). Indexing will also work with <a href=
"http://www.cs.purdue.edu/homes/trinkle/detex/">detex</a>
if it is installed.</li>

<li><span class="literal">dvi</span> with <a href=
"http://www.radicaleye.com/dvips.html">dvips</a>.</li>

<li><span class="literal">djvu</span> with <a href=
"http://djvu.sourceforge.net">DjVuLibre</a>.</li>

<li>Audio file tags: Recoll releases 1.13 and older use <a
href="http://id3lib.sourceforge.net/">id3info (id3lib)</a>
(compiling id3lib on recent systems may need a small patch,
see <a href="id3lib.html">here</a>) or the ogg and flac
tools.<br>
Recoll releases 1.14 and later use a Python filter based
on <a href="http://code.google.com/p/mutagen/">mutagen</a>
for all audio types.</li>

<li>Image file tags with <a href=
"http://www.sno.phy.queensu.ca/~phil/exiftool/">exiftool</a>.
This is a perl program, so you also need perl on the
system. This works with almost any image file and
tag format (jpg, png, tiff, gif etc.).</li>

<li>Midi karaoke files with Python, the
<a href="http://pypi.python.org/pypi/midi/0.2.1">
midi module</a>, and some help
from <a href="http://chardet.feedparser.org/">chardet</a>. There
is probably a <tt>chardet</tt> package for your distribution,
but you will quite probably need to build the midi
package. This is easy, but see the
<a href="helpernotes.html#midi">notes here</a>.
</li>

<li>Konqueror webarchive format with Python (uses the tarfile
module).</li>

<li>mimehtml web archive format (support is based on the mail
filter, which introduces some mild weirdness, but is still
usable).</li>

</ul>
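
<p>Because so many types depend on external commands, it can be
handy to check which helpers are present before indexing. The
following is a small Python sketch, not something shipped with
Recoll: the <tt>HELPERS</tt> table and the function name are
assumptions made for this example, and the table only covers a few
of the commands named in the lists above.</p>

```python
import shutil

# Hypothetical table for this example: document type -> helper commands.
# Only commands explicitly named on this page are listed.
HELPERS = {
    "pdf": ["pdftotext"],
    "msword": ["antiword"],
    "rtf": ["unrtf"],
    "tex": ["untex"],
    "wordperfect": ["wpd2html"],
    "postscript": ["pstotext"],
    "xml-based formats": ["xsltproc", "unzip"],
    "charset conversion": ["iconv"],
}


def missing_helpers(helpers=HELPERS, which=shutil.which):
    """Return {document type: [commands not found in PATH]}.

    `which` is injectable so the lookup can be replaced in tests;
    by default it is shutil.which, which returns None when the
    command cannot be found.
    """
    missing = {}
    for doc_type, commands in helpers.items():
        absent = [cmd for cmd in commands if which(cmd) is None]
        if absent:
            missing[doc_type] = absent
    return missing


if __name__ == "__main__":
    for doc_type, commands in sorted(missing_helpers().items()):
        print(doc_type + ": missing " + ", ".join(commands))
```

<p>Run as a script, this prints one line per document type that
lacks a helper, and nothing when every listed command is found.</p>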

<h2>Other features</h2>

<ul>
<li>Can use <b>Beagle</b> browser plug-ins to index web
history. See <a href=
"http://bitbucket.org/medoc/recoll/wiki/IndexBeagleWeb">the
Wiki</a> for more detail.</li>

<li>Processes all email attachments, and more generally any
realistic level of container nesting (for example, an msword
attachment to a message inside a mailbox inside a zip archive).</li>

<li>Multiple selectable databases.</li>

<li>Powerful query facilities, with boolean searches,
phrases, and filtering on file types and directory tree.</li>

<li>Xesam-compatible query language.</li>

<li>Wildcard searches (with a specific and faster function
for file names).</li>

<li>Support for multiple charsets. Internal processing and
storage uses Unicode UTF-8.</li>

<li><a href="#Stemming">Stemming</a> performed at query
time (can switch stemming language after indexing).</li>

<li>Easy installation. No database daemon, web server or
exotic language necessary.</li>

<li>An indexer which runs either as a thread inside the
GUI, as an external, batch, cron'able program, or as a
real-time indexing daemon.</li>
</ul>

<h2><a name="Stemming">Stemming</a></h2>

<p>Stemming is a process which transforms inflected words
into their most basic form. For example, <i>flooring</i>,
<i>floors</i>, <i>floored</i> would probably all be
transformed to <i>floor</i> by a stemmer for the English
language.</p>

<p>In many search engines, the stemming process occurs during
indexing. The index will only contain the stemmed form of
words, with exceptions for terms which are detected as being
probably proper nouns (i.e. capitalized). At query time, the
terms entered by the user are stemmed, then matched against
the index.</p>

<p>This process results in a smaller index, but it has the
serious drawback of irrevocably losing information during
indexing.</p>

<p>Recoll works in a different way. No stemming is performed
at indexing time, so that all information gets into the index.
The resulting index is bigger, but most people probably don't
care much about this nowadays, because they have a 100 GB disk
95% full of binary data <em>which does not get
indexed</em>.</p>

<p>At the end of an indexing pass, Recoll builds one or
several stemming dictionaries, in which each word stem is
listed along with all of its
derivatives.</p>

<p>At query time, by default, user-entered terms are stemmed,
then matched against the stem database, and the query is
expanded to include all derivatives. This will yield search
results analogous to those obtained by a classical engine.
The benefit of this approach is that stem expansion can be
controlled instantly at query time in several ways:</p>

<ul>
<li>It can be selectively turned off for any query term by
capitalizing it (<i>Floor</i>).</li>

<li>The stemming language (e.g. English, French...) can be
selected (this supposes that several stemming databases
have been built, which can be configured as part of the
indexing, or done later, in a reasonably fast way).</li>

</ul>
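
<p>The query-time expansion described above can be sketched in a
few lines of Python. This is only an illustration of the idea, not
Recoll's implementation: <tt>toy_stem</tt> is a deliberately crude
stand-in for a real English stemmer, the function names are
hypothetical, and the sketch assumes that index terms are stored in
lower case.</p>

```python
from collections import defaultdict


def toy_stem(word):
    """Crude suffix stripper standing in for a real English stemmer."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_stem_dictionary(index_terms):
    """Map each stem to the set of unstemmed index terms sharing it.

    This mirrors the dictionary built at the end of an indexing pass:
    the index itself keeps the full, unstemmed terms.
    """
    stem_db = defaultdict(set)
    for term in index_terms:
        stem_db[toy_stem(term)].add(term)
    return stem_db


def expand_query_term(term, stem_db):
    """Expand one query term to the index terms it should match."""
    # A capitalized term turns stem expansion off for that term.
    if term[:1].isupper():
        return {term.lower()}
    # Otherwise stem the term and return every derivative of that stem;
    # a term with no known stem is searched as entered.
    return set(stem_db.get(toy_stem(term), {term}))
```

<p>With an index containing <i>floor</i>, <i>floors</i>,
<i>floored</i> and <i>flooring</i>, a query for <i>floors</i>
expands to all four terms, while <i>Floor</i> matches only
<i>floor</i>. Switching stemming language amounts to using a
different stem dictionary, with no need to rebuild the index.</p>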
</div>
</body>
</html>