doc
parent
fe108af875
commit
9d89fc2061
@ -1,7 +1,8 @@
|
||||
<!DOCTYPE BOOK PUBLIC "-//FreeBSD//DTD DocBook V4.1-Based Extension//EN" [
|
||||
|
||||
<!ENTITY RCL "<application>Recoll</application>">
|
||||
<!ENTITY RCLVERSION "1.12-1.13">
|
||||
<!ENTITY RCLAPPS "<ulink url='http://www.recoll.org/features.html'>Recoll helper applications page</ulink>">
|
||||
<!ENTITY RCLVERSION "1.14">
|
||||
<!ENTITY XAP "<application>Xapian</application>">
|
||||
]>
|
||||
|
||||
@ -2620,138 +2621,119 @@ while query.next >= 0 and query.next < nres:
|
||||
specific file type).</para>
|
||||
|
||||
<para>After an indexing pass, the commands that were found
|
||||
missing can be displayed from the <command>recoll</command>
|
||||
<guilabel>File</guilabel> menu. The list is stored in the
|
||||
<filename>missing</filename> text file inside the configuration
|
||||
directory.</para>
|
||||
missing can be displayed from the <command>recoll</command>
|
||||
<guilabel>File</guilabel> menu. The list is stored in the
|
||||
<filename>missing</filename> text file inside the configuration
|
||||
directory.</para>
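<para>As a rough illustration of the paragraph above, the following Python sketch reads the <filename>missing</filename> file from the configuration directory. The default location <filename>~/.recoll</filename>, the use of the <literal>RECOLL_CONFDIR</literal> environment variable, and the function names are assumptions made for this example; the file is simply returned line by line, with no assumption about its internal format.</para>

```python
import os
from pathlib import Path


def missing_helpers_file(confdir=None):
    """Return the path of the 'missing' file inside the configuration directory.

    Assumption: the configuration directory is ~/.recoll unless the
    RECOLL_CONFDIR environment variable (or the confdir argument) says otherwise.
    """
    if confdir is None:
        confdir = os.environ.get("RECOLL_CONFDIR", "~/.recoll")
    return Path(confdir).expanduser() / "missing"


def read_missing(confdir=None):
    """Return the non-empty lines of the 'missing' file, or [] if it is absent."""
    path = missing_helpers_file(confdir)
    if not path.exists():
        return []
    text = path.read_text(errors="replace")
    return [line for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    for line in read_missing():
        print(line)
```

<para>Run after an indexing pass, this should print the same information as the <guilabel>File</guilabel> menu entry, provided the assumptions about the directory hold on your system.</para>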
|
||||
|
||||
<para>A list of common file types which need external
|
||||
commands follows. Many of the filters need the
|
||||
<command>iconv</command> command, which is not always listed as a
|
||||
dependency.</para>
|
||||
|
||||
<para>As of &RCL; release 1.14, a number of XML-based formats that
|
||||
were handled by ad hoc filter code now use
|
||||
<command>xsltproc</command>, which usually comes with
|
||||
<ulink
|
||||
url="http://xmlsoft.org/XSLT/index.html">libxslt</ulink>. These
|
||||
are: abiword, fb2 (ebooks), kword, openoffice, svg.</para>
|
||||
<para>Please note that, due to the relatively dynamic nature of this
|
||||
information, the most up to date version is now kept on the &RCLAPPS;
|
||||
along with links to the home pages or best source/patches download
|
||||
links. The list below is not updated often and may be quite
|
||||
stale.</para>
|
||||
|
||||
<para>For many Linux distributions, most of the commands listed can
|
||||
be installed from the package repositories. However, the packages
|
||||
are sometimes outdated, or not the best version for &RCL;, so you
|
||||
should take a look at the &RCLAPPS; if a file
|
||||
type is important to you.</para>
|
||||
|
||||
<para>As of &RCL; release 1.14, a number of XML-based formats that
|
||||
were handled by ad hoc filter code now use the
|
||||
<command>xsltproc</command> command, which usually comes with
|
||||
<application>libxslt</application>. These are: abiword, fb2
|
||||
(ebooks), kword, openoffice, svg.</para>
|
||||
|
||||
<para>Now for the list:</para>
|
||||
<itemizedlist>
|
||||
|
||||
<listitem><para>Openoffice: supported natively, but needs the
|
||||
<command>unzip</command> command to be installed.</para>
|
||||
<listitem><para>Openoffice files need <command>unzip</command> and
|
||||
<command>xsltproc</command>.</para></listitem>
|
||||
|
||||
<listitem><para>PDF files need <command>pdftotext</command> which
|
||||
is part of the <application>Xpdf</application> or
|
||||
<application>Poppler</application> packages.</para></listitem>
|
||||
|
||||
<listitem><para>Postscript files need <command>pstotext</command>.
|
||||
The original version has an issue with shell
|
||||
characters in file names, which is corrected in recent
|
||||
packages. See the &RCLAPPS; for more detail.
|
||||
</listitem>
|
||||
|
||||
<listitem><para>PDF: pdftotext is part of the <ulink
|
||||
url="http://www.foolabs.com/xpdf/">Xpdf</ulink> or <ulink
|
||||
url="http://poppler.freedesktop.org/">Poppler</ulink> packages.</para>
|
||||
</listitem>
|
||||
<listitem><para>MS Word needs
|
||||
<command>antiword</command>. It is also useful to have
|
||||
<command>wvWare</command> installed as it may be
|
||||
used as a fallback for some files which
|
||||
<command>antiword</command> does not handle.</para></listitem>
|
||||
|
||||
<listitem><para>Postscript: <ulink
|
||||
url="http://www.cs.wisc.edu/~ghost/doc/pstotext.htm">
|
||||
pstotext</ulink>. The original version has an issue with shell
|
||||
characters in file names. Most recent package repositories /
|
||||
ports systems use a patched version (e.g. FreeBSD, Debian). If
|
||||
compiling from source, it would be better to apply the patch
|
||||
found
|
||||
<ulink url="http://www.recoll.org/files/pstotext-1.9_4-debian.patch">
|
||||
here</ulink>.</para>
|
||||
</listitem>
|
||||
<listitem><para>MS Excel and PowerPoint need <command>
|
||||
catdoc</command>.</para></listitem>
|
||||
|
||||
<listitem><para>MS Word: <ulink url="http://www.winfield.demon.nl">
|
||||
antiword</ulink>.</para>
|
||||
</listitem>
|
||||
<listitem><para>MS Open XML (docx) needs <command>
|
||||
xsltproc</command>.</para></listitem>
|
||||
|
||||
<listitem><para>MS Excel and PowerPoint:
|
||||
<ulink url="http://catdoc.klik.atekon.de/">
|
||||
catdoc</ulink>.</para>
|
||||
</listitem>
|
||||
<listitem><para>Wordperfect files need <command>wpd2html</command>
|
||||
from the <application>libwpd</application> package.</para></listitem>
|
||||
|
||||
<listitem><para>MS Open XML (docx): needs
|
||||
<command>xsltproc</command>.</para>
|
||||
</listitem>
|
||||
<listitem><para>RTF files need <command>unrtf</command>, which, in
|
||||
its standard version, has much trouble with non-western character
|
||||
sets. Check the &RCLAPPS;.</para></listitem>
|
||||
|
||||
<listitem><para>Wordperfect files:
|
||||
<ulink url="http://libwpd.sourceforge.net/download.html">
|
||||
libwpd</ulink>.</para>
|
||||
</listitem>
|
||||
<listitem><para>TeX files need <command>untex</command> or
|
||||
<command>detex</command>. Check the &RCLAPPS; for sources if it's not
|
||||
packaged for your distribution.</para></listitem>
|
||||
|
||||
<listitem>
|
||||
<para>RTF: <ulink
|
||||
url="http://www.gnu.org/software/unrtf/unrtf.html">unrtf</ulink>
|
||||
</para>
|
||||
<listitem><para>dvi files need <command>dvips</command>.</para>
|
||||
</listitem>
|
||||
|
||||
<listitem>
|
||||
<para>TeX: &RCL; uses the <application>untex</application>
|
||||
program. Your distribution may have a package for it. If it doesn't,
|
||||
<ulink url="http://www.recoll.org/untex/untex-1.3.jf.tar.gz">
|
||||
there is a copy of the source on the &RCL; web site</ulink>,
|
||||
because the program has no obvious home. The filter can
|
||||
also work with
|
||||
<ulink url="http://www.cs.purdue.edu/homes/trinkle/detex/">
|
||||
detex</ulink> and will use it if it is installed.</para>
|
||||
</listitem>
|
||||
|
||||
<listitem>
|
||||
<para>dvi: <ulink
|
||||
url="http://www.radicaleye.com/dvips.html">dvips</ulink></para>
|
||||
</listitem>
|
||||
|
||||
<listitem>
|
||||
<para>djvu:
|
||||
<ulink
|
||||
url="http://djvu.sourceforge.net">DjVuLibre
|
||||
</ulink></para>
|
||||
</listitem>
|
||||
<listitem><para>djvu files need <command>djvutxt</command> and
|
||||
<command>djvused</command> from the
|
||||
<application>DjVuLibre</application> package.</para></listitem>
|
||||
|
||||
<listitem><para>mp3, flac, ogg vorbis: &RCL; releases before 1.13
|
||||
use the <command>id3info</command> command from the <ulink
|
||||
url="http://id3lib.sourceforge.net/">id3lib</ulink> package to
|
||||
extract mp3 tag information. (Some gcc versions after 4.4 may have
|
||||
trouble compiling <application>id3lib</application>. <ulink
|
||||
url="http://www.recoll.org/id3lib.html">You can find a
|
||||
workaround here</ulink>), metaflac (standard flac tools) for flac
|
||||
files, and ogginfo (vorbis tools) for ogg files. Releases 1.14
|
||||
and later use a single Python filter based on
|
||||
<ulink url="http://code.google.com/p/mutagen/">mutagen</ulink>
|
||||
for all audio file types.</para>
|
||||
<listitem><para>Audio files: &RCL; releases before 1.13
|
||||
used the <command>id3info</command> command from the <application>
|
||||
id3lib</application> package to extract mp3 tag information,
|
||||
<command>metaflac</command> (standard flac tools) for flac files,
|
||||
and <command>ogginfo</command> (vorbis tools) for ogg
|
||||
files. Releases 1.14 and later use a single
|
||||
<application>Python</application> filter based
|
||||
on <application>mutagen</application> for all audio file
|
||||
types.</para>
|
||||
</listitem>
|
||||
|
||||
<listitem>
|
||||
<para>Pictures: &RCL; uses the
|
||||
<ulink url="http://www.sno.phy.queensu.ca/~phil/exiftool/">
|
||||
Exiftool</ulink> <application>Perl</application> package to
|
||||
extract tag information. Most image file formats are
|
||||
supported. Note that there may not be much interest in indexing
|
||||
the technical tags (image size, aperture, etc.). This is only of
|
||||
interest if you store personal tags or textual descriptions inside
|
||||
the image files.</para>
|
||||
</listitem>
|
||||
<listitem><para>Pictures: &RCL; uses the
|
||||
<application>Exiftool</application>
|
||||
<application>Perl</application> package to extract tag
|
||||
information. Most image file formats are supported. Note that
|
||||
there may not be much interest in indexing the technical tags
|
||||
(image size, aperture, etc.). This is only of interest if you
|
||||
store personal tags or textual descriptions inside the image
|
||||
files.</para></listitem>
|
||||
|
||||
<listitem><para>chm: files in Microsoft help format need Python and
|
||||
the <ulink
|
||||
url="http://gnochm.sourceforge.net/pychm.html">pychm</ulink>
|
||||
module (which needs <ulink
|
||||
url="http://www.jedrea.com/chmlib/">chmlib</ulink>).</para>
|
||||
</listitem>
|
||||
the <application>pychm</application> module (which needs
|
||||
<application>chmlib</application>).</para></listitem>
|
||||
|
||||
<listitem><para>ics: up to &RCL; 1.13, iCalendar files need Python
|
||||
and the <application>icalendar</application> module. For newer
|
||||
versions, <application>icalendar</application> is not needed
|
||||
</para></listitem>
|
||||
<listitem><para>ICS: up to &RCL; 1.13, iCalendar files need
|
||||
<application>Python</application>
|
||||
and the <application>icalendar</application>
|
||||
module. <application>icalendar</application> is not needed for newer
|
||||
versions, which use internal code.</para></listitem>
|
||||
|
||||
<listitem><para>zip: Zip archives need Python (and the standard
|
||||
zipfile module).</para>
|
||||
</listitem>
|
||||
<listitem><para>Zip archives need <application>Python</application>
|
||||
(and the standard zipfile module).</para></listitem>
|
||||
|
||||
</itemizedlist>
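<para>The list above can be turned into a quick check of which helper commands are present on a system. The following Python sketch is only an illustration: the command names are taken from the list, while the function name and the idea of probing the <literal>PATH</literal> with <literal>shutil.which</literal> are choices made for this example and not part of &RCL; itself. Note that <command>untex</command> and <command>detex</command> are alternatives, so only one of them is needed.</para>

```python
import shutil

# Helper commands named in the list above (assumed to be installed under
# these exact names; some distributions may package them differently).
HELPERS = [
    "unzip", "xsltproc", "pdftotext", "pstotext", "antiword", "catdoc",
    "wpd2html", "unrtf", "untex", "detex", "dvips", "djvutxt", "djvused",
    "iconv",
]


def find_missing(commands, which=shutil.which):
    """Return the commands from 'commands' that cannot be found on the PATH."""
    return [cmd for cmd in commands if which(cmd) is None]


if __name__ == "__main__":
    for cmd in find_missing(HELPERS):
        print("missing:", cmd)
```

<para>The <literal>which</literal> parameter exists only to make the function easy to test; in normal use the default lookup on the <literal>PATH</literal> is what you want.</para>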
|
||||
|
||||
<para>Text, HTML, mail folders, Openoffice and Scribus files
|
||||
are processed internally. Lyx is used to index Lyx files. Many
|
||||
filters need <command>iconv</command> and the standard
|
||||
<command>sed</command> and <command>awk</command>.
|
||||
<para>Text, HTML, mail folders, and Scribus files are
|
||||
processed internally. <application>Lyx</application> is used to
|
||||
index Lyx files. Many filters need <command>iconv</command> and the
|
||||
standard <command>sed</command> and <command>awk</command>.
|
||||
</para>
|
||||
|
||||
</sect1>
|
||||
|
||||
@ -46,20 +46,13 @@
|
||||
<li><a href="perfs.html">Index size and indexing performance
|
||||
data.</a></li>
|
||||
|
||||
<li>Faqs and Howtos are now kept in the
|
||||
<a href="http://bitbucket.org/medoc/recoll/wiki/FaqsAndHowTos">
|
||||
Recoll Wiki</a> on
|
||||
<a href="http://bitbucket.org/medoc/recoll">bitbucket.org</a>.</li>
|
||||
|
||||
<p>Current list of HowTos:</p>
|
||||
<ul>
|
||||
<li><a href="http://bitbucket.org/medoc/recoll/wiki/PreventIndexingDir">Prevent indexing of a directory</a></li>
|
||||
<li><a href="http://bitbucket.org/medoc/recoll/wiki/MultipleIndexes">Creating and using multiple indexes</a></li>
|
||||
<li><a href="http://bitbucket.org/medoc/recoll/wiki/SavingConfig.wiki">Recoll configuration backup</a></li>
|
||||
<li><a href="http://bitbucket.org/medoc/recoll/wiki/IndexMozillaCalendari">Indexing Mozilla Sunbird / Lightning calendar data</a></li>
|
||||
</ul>
|
||||
<li><a href="http://bitbucket.org/medoc/recoll/wiki/FaqsAndHowTos">
|
||||
Faqs and Howtos</a> are now kept in the
|
||||
<a href="http://bitbucket.org/medoc/recoll/wiki/">
|
||||
Recoll Wiki</a> on
|
||||
<a href="http://bitbucket.org/medoc/recoll">bitbucket.org</a>.</li>
|
||||
</ul>
|
||||
|
||||
|
||||
</div>
|
||||
</body>
|
||||
</html>
|
||||
|
||||
@ -384,7 +384,8 @@ sudo add-apt-repository ppa:recoll-backports/ppa
|
||||
|
||||
<h2><a name="translations">Translations</a></h2>
|
||||
|
||||
<p>Most of the translations for 1.13 are incomplete. The source
|
||||
<p>Most of the translations for 1.13 are incomplete (and I
|
||||
forgot to update the message files for 1.14, ugh). The source
|
||||
translation files are included in the source release. If your
|
||||
language has some English messages left and you want to take a
|
||||
shot at fixing the problem, you can send the results to
|
||||
@ -400,17 +401,17 @@ sudo add-apt-repository ppa:recoll-backports/ppa
|
||||
</p>
|
||||
|
||||
<p><a href="translations/recoll_xx.ts">recoll_xx.ts</a> is a blank
|
||||
Recoll 1.13 message file, handy to work on a new translation.</p>
|
||||
Recoll 1.14 message file, handy to work on a new translation.</p>
|
||||
|
||||
<h3>Updated 1.13 translations that became available after the
|
||||
<h3>Updated 1.13/1.14 translations that became available after the
|
||||
release:</h3>
|
||||
|
||||
<p>None for now :(</p>
|
||||
<!--
|
||||
<p>German.
|
||||
<a href="translations/recoll_de.ts">recoll_de.ts</a>
|
||||
<a href="translations/recoll_de.qm">recoll_de.qm</a>
|
||||
<!-- <p>None for now :(</p> -->
|
||||
<p>Lithuanian.
|
||||
<a href="translations/recoll_lt.ts">recoll_lt.ts</a>
|
||||
<a href="translations/recoll_lt.qm">recoll_lt.qm</a>
|
||||
</p>
|
||||
<!--
|
||||
<p>Ukrainian.
|
||||
<a href="translations/recoll_uk.ts">recoll_uk.ts</a>
|
||||
<a href="translations/recoll_uk.qm">recoll_uk.qm</a>
|
||||
|
||||
@ -9,7 +9,7 @@
|
||||
<meta name="Description" content=
|
||||
"recoll is a simple full-text search system for unix and linux based on the powerful and mature xapian engine">
|
||||
<meta name="Keywords" content=
|
||||
"full text search,fulltext,desktop search,unix,linux,solaris,open source,free">
|
||||
"full text search,fulltext,desktop search,unix,linux,solaris,open source,free">
|
||||
<meta http-equiv="Content-language" content="en">
|
||||
<meta http-equiv="content-type" content=
|
||||
"text/html; charset=iso-8859-1">
|
||||
@ -18,260 +18,268 @@
|
||||
</head>
|
||||
|
||||
<body>
|
||||
|
||||
<div class="rightlinks">
|
||||
<ul>
|
||||
<li><a href="index.html">Home</a></li>
|
||||
<li><a href="pics/index.html">Screenshots</a></li>
|
||||
<li><a href="download.html">Downloads</a></li>
|
||||
<li><a href="usermanual/index.html">User manual</a></li>
|
||||
<li><a href="index.html#support">Support</a></li>
|
||||
<li><a href="devel.html">Development</a></li>
|
||||
<li><a href="index.html">Home</a></li>
|
||||
|
||||
<li><a href="pics/index.html">Screenshots</a></li>
|
||||
|
||||
<li><a href="download.html">Downloads</a></li>
|
||||
|
||||
<li><a href="usermanual/index.html">User manual</a></li>
|
||||
|
||||
<li><a href="index.html#support">Support</a></li>
|
||||
|
||||
<li><a href="devel.html">Development</a></li>
|
||||
</ul>
|
||||
</div>
|
||||
|
||||
<div class="content">
|
||||
|
||||
<h1 class="intro">Recoll features</h1>
|
||||
|
||||
<dl>
|
||||
<dt><a name="systems">Supported systems</a></dt>
|
||||
<dd><span class="application">Recoll</span> has been compiled and
|
||||
tested on FreeBSD, Linux, Darwin and Solaris (versions
|
||||
FreeBSD 5-7, Redhat 7/8/9, Fedora Core 5-13, Suse 10/11,
|
||||
Gentoo, Debian 3.1, Solaris 8/9/10. Other not too distant
|
||||
releases should be ok too).</dd>
|
||||
<h2><a name="systems">Supported systems</a></h2>
|
||||
|
||||
<dd>Qt versions from 3.1 to 4.5</dd>
|
||||
<p><span class="application">Recoll</span> has been compiled
|
||||
and tested on FreeBSD, Linux, Darwin and Solaris (initial
|
||||
versions FreeBSD 5, Redhat 7, Fedora Core 5, Suse 10, Gentoo,
|
||||
Debian 3.1, Solaris 8). It should compile and run on all
|
||||
subsequent releases of these systems and probably a few
|
||||
others too.</p>
|
||||
|
||||
<dt><a name="doctypes">Document types</a></dt>
|
||||
<dd>Recoll can index many document types (along with their
|
||||
compressed versions). Some types are handled internally (no
|
||||
external application needed). Other types need some application to
|
||||
be installed to extract the text. Types that only need
|
||||
very common utilities (awk/sed/groff etc.) are listed in the
|
||||
native section.</dd>
|
||||
<p>Qt versions from 3.1 to 4.7</p>
|
||||
|
||||
<dl>
|
||||
<dt>Natively</dt>
|
||||
<h2><a name="doctypes">Document types</a></h2>
|
||||
|
||||
<dd>
|
||||
<ul>
|
||||
<li><span class="literal">text</span>.</li>
|
||||
<p>Recoll can index many document types (along with their
|
||||
compressed versions). Some types are handled internally (no
|
||||
external application needed). Other types need a separate
|
||||
application to be installed to extract the text. Types that
|
||||
only need very common utilities (awk/sed/groff etc.) are
|
||||
listed in the native section.</p>
|
||||
|
||||
<li><span class="literal">html</span>.</li>
|
||||
<h4>File types indexed natively</h4>
|
||||
|
||||
<li><span class="literal">maildir</span> and <span
|
||||
class="literal">mailbox</span> (<span class=
|
||||
"literal">Mozilla</span>, <span class=
|
||||
"literal">Thunderbird</span> and <span class=
|
||||
"literal">Evolution</span> mail ok).</li>
|
||||
<ul>
|
||||
<li><span class="literal">text</span>.</li>
|
||||
|
||||
<li><span class="literal">OpenOffice</span>
|
||||
files (needs <span class="command">unzip</span> command).</li>
|
||||
<li><span class="literal">html</span>.</li>
|
||||
|
||||
<li><span class="literal">Abiword</span> files.</li>
|
||||
<li><span class="literal">maildir</span> and <span class=
|
||||
"literal">mailbox</span> (<span class=
|
||||
"literal">Mozilla</span>, <span class=
|
||||
"literal">Thunderbird</span> and <span class=
|
||||
"literal">Evolution</span> mail ok).</li>
|
||||
|
||||
<li><span class="literal">Kword</span> files.</li>
|
||||
<li><span class="literal">gaim</span> and <span class=
|
||||
"literal">purple</span> log files.</li>
|
||||
|
||||
<li><span class="literal">gaim</span> and <span
|
||||
class="literal">purple</span> log files.</li>
|
||||
<li><span class="literal">Lyx</span> files (needs <span
|
||||
class="literal">Lyx</span> to be installed).</li>
|
||||
|
||||
<li><span class="literal">Lyx</span> files (needs
|
||||
<span class="literal">Lyx</span> to be installed).</li>
|
||||
<li><span class="literal">Scribus</span> files.</li>
|
||||
|
||||
<li><span class="literal">Scribus</span> files.</li>
|
||||
|
||||
<li><span class="literal">Man pages</span> (need <span
|
||||
class="command">groff</span>).</li>
|
||||
|
||||
</ul>
|
||||
</dd>
|
||||
|
||||
<dt>With external helpers</dt>
|
||||
|
||||
<dd>
|
||||
<para>In addition to the applications listed below, many
|
||||
document types need the <span
|
||||
class="command">iconv</span> command.</para>
|
||||
|
||||
<ul>
|
||||
<li><span class="literal">Microsoft Office Open XML</span>
|
||||
files with the <span class="command">unzip</span>
|
||||
and <span class="command">xsltproc</span> commands.</li>
|
||||
|
||||
<li><span class="literal">pdf</span> with the <span
|
||||
class="command">pdftotext</span> command, which can be
|
||||
installed as part of <a href=
|
||||
"http://www.foolabs.com/xpdf/">xpdf</a> or <a
|
||||
href="http://poppler.freedesktop.org/">poppler</a>,
|
||||
depending on your distribution.</li>
|
||||
|
||||
<li><span class="literal">msword</span> with <a href=
|
||||
"http://www.winfield.demon.nl/">antiword</a>.</li>
|
||||
|
||||
<li><span class="literal">Powerpoint</span> and
|
||||
<span class="literal">Excel</span> with the
|
||||
<a href="http://catdoc.klik.atekon.de">
|
||||
catdoc</a> utilities.</li>
|
||||
|
||||
<li><span class="literal">CHM (Microsoft help)</span>
|
||||
files (needs <span class="command">Python, pychm or
|
||||
chmlib</span>).</li>
|
||||
|
||||
<li><span class="literal">Zip</span>
|
||||
archives (needs <span class="command">Python</span>).</li>
|
||||
|
||||
<li><span class="literal">iCalendar</span>(.ics) files
|
||||
(needs <span class="command">Python,
|
||||
<a href="http://pypi.python.org/pypi/icalendar/2.1">icalendar</a></span>).</li>
|
||||
|
||||
<li><span class="literal">Mozilla calendar data</span>
|
||||
See <a href="http://bitbucket.org/medoc/recoll/wiki/IndexMozillaCalendari">
|
||||
the wiki</a> about this.</li>
|
||||
|
||||
<li><span class="literal">Wordperfect</span> with <a href=
|
||||
"http://libwpd.sourceforge.net">libwpd</a>.</li>
|
||||
|
||||
<li><span class="literal">postscript</span> with
|
||||
<a href="http://www.gnu.org/software/ghostscript/ghostscript.html">
|
||||
ghostscript</a> and
|
||||
<a href="http://www.cs.wisc.edu/~ghost/doc/pstotext.htm">
|
||||
pstotext</a>.
|
||||
Actually the pstotext 1.9 found at the latter link
|
||||
has a problem with file names using special shell
|
||||
characters, and you should either use the version
|
||||
packaged for your system which is probably patched,
|
||||
or apply the Debian patch which is
|
||||
stored <a href="files/pstotext-1.9_4-debian.patch">here</a>
|
||||
for convenience. See
|
||||
http://packages.debian.org/squeeze/pstotext and
|
||||
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=356988
|
||||
for references/explanations.</li>
|
||||
|
||||
<li><span class="literal">rtf</span> with <a href=
|
||||
"http://www.gnu.org/software/unrtf/unrtf.html">unrtf</a>.</li>
|
||||
|
||||
<li><span class="literal">TeX</span> with
|
||||
<span class="command">untex</span>. If there is no untex
|
||||
package for your distribution,
|
||||
<a href="untex/untex-1.3.jf.tar.gz">a source package is
|
||||
stored on this site</a> (as untex has no obvious
|
||||
home).
|
||||
Will also work
|
||||
with <a
|
||||
href="http://www.cs.purdue.edu/homes/trinkle/detex/">detex</a>
|
||||
if this is installed.
|
||||
</li>
|
||||
|
||||
<li><span class="literal">dvi</span> with
|
||||
<a href="http://www.radicaleye.com/dvips.html">dvips</a>.
|
||||
</li>
|
||||
|
||||
<li><span class="literal">djvu</span> with
|
||||
<a href="http://djvu.sourceforge.net">DjVuLibre</a>.
|
||||
</li>
|
||||
<li><span class="literal">mp3/flac/ogg vorbis</span>
|
||||
tags support with
|
||||
<a href="http://id3lib.sourceforge.net/">id3info (id3lib)
|
||||
</a> (compiling id3lib on recent systems may need
|
||||
a small patch, see <a href="id3lib.html">here.</a>) or
|
||||
the ogg and flac tools. Release 1.14 and later use a
|
||||
Python filter based on
|
||||
<a href="http://code.google.com/p/mutagen/">mutagen</a>
|
||||
for all audio tags.
|
||||
</li>
|
||||
<li>Image file tags support with
|
||||
<a href="http://www.sno.phy.queensu.ca/~phil/exiftool/">
|
||||
exiftool</a>. This is a Perl program, so you also
|
||||
need Perl on the system. This works with almost any
|
||||
possible image file and tag format (jpg, png, tiff,
|
||||
gif etc.).
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
</dd>
|
||||
</dl>
|
||||
</dd>
|
||||
|
||||
<dt>Other features</dt>
|
||||
<dd>
|
||||
<ul>
|
||||
<li>Can use <b>Beagle</b> browser plug-ins to index web
|
||||
history. See
|
||||
<a href="http://bitbucket.org/medoc/recoll/wiki/IndexBeagleWeb">
|
||||
the Wiki</a> for more detail.</li>
|
||||
|
||||
<li>Processes all email attachments.</li>
|
||||
|
||||
<li>Multiple selectable databases.</li>
|
||||
|
||||
<li>Powerful query facilities, with boolean searches,
|
||||
phrases, filter on file types and directory tree.</li>
|
||||
|
||||
<li>Xesam-compatible query language.</li>
|
||||
|
||||
<li>Wildcard searches (with a specific and faster function for
|
||||
file names).</li>
|
||||
|
||||
<li>Support for multiple charsets. Internal processing and
|
||||
storage uses Unicode UTF-8.</li>
|
||||
|
||||
<li><a href="#Stemming">Stemming</a> performed at query
|
||||
time (can switch stemming language after indexing).</li>
|
||||
|
||||
<li>Easy installation. No database daemon, web server or
|
||||
exotic language necessary.</li>
|
||||
|
||||
<li>An indexer which runs either as a thread inside the GUI,
|
||||
as an external, batch, cron'able program, or as a
|
||||
real-time indexing daemon.</li>
|
||||
</ul>
|
||||
</dd>
|
||||
<li><span class="literal">Man pages</span> (need <span
|
||||
class="command">groff</span>).</li>
|
||||
</ul>
|
||||
|
||||
<h4>File types indexed with external helpers</h4>
|
||||
|
||||
<p>Many document types need the <span class="command">iconv</span>
|
||||
command in addition to the applications specifically listed.</p>
|
||||
|
||||
<p>The following types need <span class=
|
||||
"command">xsltproc</span> from the <b>libxslt</b> package.
|
||||
Quite a few also need <span class="command">unzip</span>:</p>
|
||||
|
||||
<ul>
|
||||
<li><span class="literal">Abiword</span> files.</li>
|
||||
|
||||
<li><span class="literal">Fb2</span> ebooks.</li>
|
||||
|
||||
<li><span class="literal">Kword</span> files.</li>
|
||||
|
||||
<li><span class="literal">Microsoft Office Open XML</span>
|
||||
files.</li>
|
||||
|
||||
<li><span class="literal">OpenOffice</span> files.</li>
|
||||
|
||||
<li><span class="literal">SVG</span> files.</li>
|
||||
</ul>
|
||||
|
||||
<p>Others:</p>
|
||||
|
||||
<ul>
|
||||
<li><span class="literal">pdf</span> with the <span class=
|
||||
"command">pdftotext</span> command, which can be installed
|
||||
as part of <a href="http://www.foolabs.com/xpdf/">xpdf</a>
|
||||
or <a href="http://poppler.freedesktop.org/">poppler</a>,
|
||||
depending on your distribution.</li>
|
||||
|
||||
<li><span class="literal">msword</span> with <a href=
|
||||
"http://www.winfield.demon.nl/">antiword</a>.</li>
|
||||
|
||||
<li><span class="literal">Powerpoint</span> and <span
|
||||
class="literal">Excel</span> with the <a href=
|
||||
"http://catdoc.klik.atekon.de">catdoc</a> utilities.</li>
|
||||
|
||||
<li><span class="literal">CHM (Microsoft help)</span> files
|
||||
(needs <span class="command">Python, pychm and
|
||||
chmlib</span>).</li>
|
||||
|
||||
<li><span class="literal">Zip</span> archives (needs <span
|
||||
class="command">Python</span>).</li>
|
||||
|
||||
<li><span class="literal">iCalendar</span> (.ics) files
|
||||
(needs <span class="command">Python, <a href=
|
||||
"http://pypi.python.org/pypi/icalendar/2.1">icalendar</a></span>).</li>
|
||||
|
||||
<li><span class="literal">Mozilla calendar data</span>. See
|
||||
<a href=
|
||||
"http://bitbucket.org/medoc/recoll/wiki/IndexMozillaCalendari">
|
||||
the wiki</a> about this.</li>
|
||||
|
||||
<li><span class="literal">Wordperfect</span> with <a href=
|
||||
"http://libwpd.sourceforge.net">libwpd</a>.</li>
|
||||
|
||||
<li><span class="literal">postscript</span> with <a href=
|
||||
"http://www.gnu.org/software/ghostscript/ghostscript.html">ghostscript</a>
|
||||
and <a href=
|
||||
"http://www.cs.wisc.edu/~ghost/doc/pstotext.htm">pstotext</a>.
|
||||
Actually the pstotext 1.9 found at the latter link has a
|
||||
problem with file names using special shell characters, and
|
||||
you should either use the version packaged for your system
|
||||
which is probably patched, or apply the Debian patch which
|
||||
is stored <a href=
|
||||
"files/pstotext-1.9_4-debian.patch">here</a> for
|
||||
convenience. See
|
||||
http://packages.debian.org/squeeze/pstotext and
|
||||
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=356988 for
|
||||
references/explanations.</li>
|
||||
|
||||
<li><span class="literal">RTF</span> files with <a href=
|
||||
"http://www.gnu.org/software/unrtf/unrtf.html">unrtf</a>. Please
|
||||
note that up to version
|
||||
0.21, <span class="command">unrtf</span> mostly does not work
|
||||
with non western-european character sets. If you have a need
|
||||
for indexing, e.g., Russian or Chinese RTF files, I have
|
||||
produced a modified version which works much better (as
|
||||
indicated by my tests and a few external ones). You can
|
||||
download the <a href="unrtf/unrtf-0.22.0beta.tar.gz">source
|
||||
here</a>. The development is hosted
|
||||
on <a href="http://www.bitbucket.org/medoc/unrtf-int">
|
||||
bitbucket.org</a>.</li>
|
||||
|
||||
<li><span class="literal">TeX</span> with <span class=
|
||||
"command">untex</span>. If there is no untex package for
|
||||
your distribution, <a href="untex/untex-1.3.jf.tar.gz">a
|
||||
source package is stored on this site</a> (as untex has no
|
||||
obvious home). Will also work with <a href=
|
||||
"http://www.cs.purdue.edu/homes/trinkle/detex/">detex</a>
|
||||
if this is installed.</li>
|
||||
|
||||
<li><span class="literal">dvi</span> with <a href=
|
||||
"http://www.radicaleye.com/dvips.html">dvips</a>.</li>
|
||||
|
||||
<li><span class="literal">djvu</span> with <a href=
|
||||
"http://djvu.sourceforge.net">DjVuLibre</a>.</li>
|
||||
|
||||
<li>Audio file tags: Recoll releases 1.13 and older use <a
|
||||
href="http://id3lib.sourceforge.net/">id3info (id3lib)</a>
|
||||
(compiling id3lib on recent systems may need a small patch,
|
||||
see <a href="id3lib.html">here.</a>) or the ogg and flac
|
||||
tools.<br>
|
||||
Recoll releases 1.14 and later use a Python filter based
|
||||
on <a href="http://code.google.com/p/mutagen/">mutagen</a>
|
||||
for all audio types.</li>
|
||||
|
||||
<li>Image file tags support with <a href=
|
||||
"http://www.sno.phy.queensu.ca/~phil/exiftool/">exiftool</a>.
|
||||
This is a Perl program, so you also need Perl on the
|
||||
system. This works with almost any image file and
|
||||
tag format (jpg, png, tiff, gif etc.).</li>
|
||||
</ul>
|
||||
|
||||
<h2>Other features</h2>
|
||||
|
||||
<ul>
|
||||
<li>Can use <b>Beagle</b> browser plug-ins to index web
|
||||
history. See <a href=
|
||||
"http://bitbucket.org/medoc/recoll/wiki/IndexBeagleWeb">the
|
||||
Wiki</a> for more detail.</li>
|
||||
|
||||
<li>Processes all email attachments.</li>
|
||||
|
||||
<li>Multiple selectable databases.</li>
|
||||
|
||||
<li>Powerful query facilities, with boolean searches,
|
||||
phrases, filter on file types and directory tree.</li>
|
||||
|
||||
<li>Xesam-compatible query language.</li>
|
||||
|
||||
<li>Wildcard searches (with a specific and faster function
|
||||
for file names).</li>
|
||||
|
||||
<li>Support for multiple charsets. Internal processing and
|
||||
storage uses Unicode UTF-8.</li>
|
||||
|
||||
<li><a href="#Stemming">Stemming</a> performed at query
|
||||
time (can switch stemming language after indexing).</li>
|
||||
|
||||
<li>Easy installation. No database daemon, web server or
|
||||
exotic language necessary.</li>
|
||||
|
||||
<li>An indexer which runs either as a thread inside the
|
||||
GUI, as an external, batch, cron'able program, or as a
|
||||
real-time indexing daemon.</li>
|
||||
</ul>
|
||||
|
||||
<h2><a name="Stemming"></a>Stemming</h2>
|
||||
|
||||
<p>Stemming is a process which transforms inflected words into
|
||||
their most basic form. For example, <i>flooring</i>,
|
||||
<i>floors</i>, <i>floored</i> would probably all be transformed
|
||||
to <i>floor</i> by a stemmer for the English language.</p>
|
||||
<p>Stemming is a process which transforms inflected words
|
||||
into their most basic form. For example, <i>flooring</i>,
|
||||
<i>floors</i>, <i>floored</i> would probably all be
|
||||
transformed to <i>floor</i> by a stemmer for the English
|
||||
language.</p>
|
||||
|
||||
<p>In many search engines, the stemming process occurs during
|
||||
indexing. The index will only contain the stemmed form of words,
|
||||
with exceptions for terms which are detected as being probably
|
||||
proper nouns (i.e. capitalized). At query time, the terms entered
|
||||
by the user are stemmed, then matched against the index.</p>
|
||||
indexing. The index will only contain the stemmed form of
|
||||
words, with exceptions for terms which are detected as being
|
||||
probably proper nouns (i.e. capitalized). At query time, the
|
||||
terms entered by the user are stemmed, then matched against
|
||||
the index.</p>
|
||||
|
||||
<p>This process results in a smaller index, but it has the
|
||||
serious drawback of irrevocably losing information during
|
||||
indexing.</p>
|
||||
serious drawback of irrevocably losing information during
|
||||
indexing.</p>
|
||||
|
||||
<p>Recoll works in a different way. No stemming is performed at
|
||||
indexing time, so that all information gets into the index. The
|
||||
resulting index is bigger, but most people probably don't care
|
||||
much about this nowadays, because they have a 100 GB disk 95%
|
||||
full of binary data <em>which does not get indexed</em>.</p>
|
||||
<p>At the end of an indexing pass, Recoll builds one or several
|
||||
stemming dictionaries, where all word stems are listed in
|
||||
correspondence to the list of their derivatives.</p>
|
||||
<p>Recoll works in a different way. No stemming is performed
|
||||
at indexing time, so that all information gets into the index.
|
||||
The resulting index is bigger, but most people probably don't
|
||||
care much about this nowadays, because they have a 100 GB disk
|
||||
95% full of binary data <em>which does not get
|
||||
indexed</em>.</p>
|
||||
|
||||
<p>At the end of an indexing pass, Recoll builds one or
|
||||
several stemming dictionaries, where all word stems are
|
||||
listed in correspondence to the list of their
|
||||
derivatives.</p>
|
||||
|
||||
<p>At query time, by default, user-entered terms are stemmed,
|
||||
then matched against the stem database, and the query is
|
||||
expanded to include all derivatives. This will yield search
|
||||
results analogous to those obtained by a classical engine.
|
||||
The benefit of this approach is that stem expansion can be
|
||||
controlled instantly at query time in several ways:
|
||||
<ul>
|
||||
<li>It can be selectively turned off for any query term by
|
||||
capitalizing it (<i>Floor</i>).</li>
|
||||
<li>The stemming language (e.g. English, French...) can be
|
||||
selected (this supposes that several stemming databases have
|
||||
been built, which can be configured as part of the indexing,
|
||||
or done later, in a reasonably fast way).</li>
|
||||
then matched against the stem database, and the query is
|
||||
expanded to include all derivatives. This will yield search
|
||||
results analogous to those obtained by a classical engine.
|
||||
The benefit of this approach is that stem expansion can be
|
||||
controlled instantly at query time in several ways:</p>
|
||||
|
||||
<ul>
|
||||
<li>It can be selectively turned off for any query term by
|
||||
capitalizing it (<i>Floor</i>).</li>
|
||||
|
||||
<li>The stemming language (e.g. English, French...) can be
|
||||
selected (this supposes that several stemming databases
|
||||
have been built, which can be configured as part of the
|
||||
indexing, or done later, in a reasonably fast way).</li>
|
||||
</ul>
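<p>The query-time expansion described above can be sketched in a
few lines of Python. This is only an illustration, not Recoll's
actual code: <code>toy_stem</code> is a hypothetical, deliberately
naive suffix stripper standing in for a real stemmer, the function
names are made up for the example, and index terms are assumed to
be stored in lowercase.</p>

```python
from collections import defaultdict


def toy_stem(word):
    """Hypothetical, deliberately naive English stemmer (suffix stripping)."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Only strip when a reasonably long stem remains.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_stem_dictionary(index_terms):
    """Built after indexing: map each stem to the unstemmed terms derived from it."""
    stem_db = defaultdict(set)
    for term in index_terms:
        stem_db[toy_stem(term)].add(term)
    return stem_db


def expand_query_term(term, stem_db):
    """Expand a user term at query time; a capitalized term is matched as-is."""
    if term[:1].isupper():
        # Capitalization turns stem expansion off for this term
        # (assumption: the index itself holds lowercase terms).
        return {term.lower()}
    # Fall back to the term itself when its stem is unknown.
    return set(stem_db.get(toy_stem(term), {term}))


# The index keeps every unstemmed form; only the dictionary knows about stems.
index_terms = ["floor", "floors", "floored", "flooring", "door"]
stem_db = build_stem_dictionary(index_terms)

expanded = expand_query_term("floors", stem_db)
exact = expand_query_term("Floor", stem_db)
```

<p>Because the index itself is never stemmed, switching to another
stemming language only requires building another dictionary of this
kind, not reindexing the documents.</p>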
|
||||
|
||||
</div>
|
||||
</body>
|
||||
</html>
|
||||
|
||||
@ -104,16 +104,14 @@
|
||||
</ul>
|
||||
</li>
|
||||
|
||||
<li>2010-04-14 :
|
||||
Recoll <a href="download.html#source">1.13.04</a> is out. It
|
||||
fixes a nasty bug (broken stemming) in 1.13.02.</li>
|
||||
|
||||
<li>2010-01-29 : the full Recoll source repository is now
|
||||
hosted on
|
||||
<a href="http://bitbucket.org/medoc/recoll">Bitbucket</a>, along
|
||||
with a Wiki and an
|
||||
<a href="http://bitbucket.org/medoc/recoll/issues">issues tracking
|
||||
system</a>. Hopefully, this
|
||||
hosted on
|
||||
<a href="http://bitbucket.org/medoc/recoll">Bitbucket</a>,
|
||||
along with a Wiki
|
||||
(<a href="http://bitbucket.org/medoc/recoll/wiki/FaqsAndHowTos">
|
||||
Faqs and Howtos</a>) and an
|
||||
<a href="http://bitbucket.org/medoc/recoll/issues">
|
||||
issues tracking system</a>. Hopefully, this
|
||||
new channel for reporting bugs and making suggestions will
|
||||
increase the feedback rate...</li>
|
||||
|
||||
|
||||
@ -135,6 +135,10 @@
|
||||
contributions in code or suggestions, see the
|
||||
<a class="important" href="credits.html">Credits</a> page.</p>
|
||||
|
||||
<h2>Other</h2>
|
||||
<p>I rent out a
|
||||
<a href="http://www.metairie-enbor.com/index.html">
|
||||
big, nice house in the Aude</a> :)</p>
|
||||
|
||||
</div>
|
||||
</body>
|
||||
|
||||