doc
parent
fe108af875
commit
9d89fc2061
@ -1,7 +1,8 @@
|
|||||||
<!DOCTYPE BOOK PUBLIC "-//FreeBSD//DTD DocBook V4.1-Based Extension//EN" [
|
<!DOCTYPE BOOK PUBLIC "-//FreeBSD//DTD DocBook V4.1-Based Extension//EN" [
|
||||||
|
|
||||||
<!ENTITY RCL "<application>Recoll</application>">
|
<!ENTITY RCL "<application>Recoll</application>">
|
||||||
<!ENTITY RCLVERSION "1.12-1.13">
|
<!ENTITY RCLAPPS "<ulink url='http://www.recoll.org/features.html'>Recoll helper applications page</ulink>">
|
||||||
|
<!ENTITY RCLVERSION "1.14">
|
||||||
<!ENTITY XAP "<application>Xapian</application>">
|
<!ENTITY XAP "<application>Xapian</application>">
|
||||||
]>
|
]>
|
||||||
|
|
||||||
@ -2620,138 +2621,119 @@ while query.next >= 0 and query.next < nres:
|
|||||||
specific file type).</para>
|
specific file type).</para>
|
||||||
|
|
||||||
<para>After an indexing pass, the commands that were found
|
<para>After an indexing pass, the commands that were found
|
||||||
missing can be displayed from the <command>recoll</command>
|
missing can be displayed from the <command>recoll</command>
|
||||||
<guilabel>File</guilabel> menu. The list is stored in the
|
<guilabel>File</guilabel> menu. The list is stored in the
|
||||||
<filename>missing</filename> text file inside the configuration
|
<filename>missing</filename> text file inside the configuration
|
||||||
directory.</para>
|
directory.</para>
|
||||||
|
|
||||||
<para>A list of common file types which need external
|
<para>A list of common file types which need external
|
||||||
commands follows. Many of the filters need the
|
commands follows. Many of the filters need the
|
||||||
<command>iconv</command> command, which is not always listed as a
|
<command>iconv</command> command, which is not always listed as a
|
||||||
dependency.</para>
|
dependency.</para>
|
||||||
|
|
||||||
<para>As of &RCL; release 1.14, a number of XML-based formats that
|
<para>Please note that, due to the relatively dynamic nature of this
|
||||||
were handled by ad hoc filter code now use
|
information, the most up to date version is now kept on the &RCLAPPS;
|
||||||
<command>xsltproc</command>, which usually comes with
|
along with links to the home pages or best source/patches download
|
||||||
<ulink
|
links. The list below is not updated often and may be quite
|
||||||
url="http://xmlsoft.org/XSLT/index.html">libxslt</ulink>. These
|
stale.</para>
|
||||||
are: abiword, fb2 (ebooks), kword, openoffice, svg.</para>
|
|
||||||
|
|
||||||
|
<para>For many Linux distributions, most of the commands listed can
|
||||||
|
be installed from the package repositories. However, the packages
|
||||||
|
are sometimes outdated, or not the best version for &RCL;, so you
|
||||||
|
should take a look at the &RCLAPPS; if a file
|
||||||
|
type is important to you.</para>
|
||||||
|
|
||||||
|
<para>As of &RCL; release 1.14, a number of XML-based formats that
|
||||||
|
were handled by ad hoc filter code now use the
|
||||||
|
<command>xsltproc</command> command, which usually comes with
|
||||||
|
<application>libxslt</application>. These are: abiword, fb2
|
||||||
|
(ebooks), kword, openoffice, svg.</para>
|
||||||
|
|
||||||
|
<para>Now for the list:</para>
|
||||||
<itemizedlist>
|
<itemizedlist>
|
||||||
|
|
||||||
<listitem><para>Openoffice: supported natively, but needs the
|
<listitem><para>Openoffice files need <command>unzip</command> and
|
||||||
<command>unzip</command> command to be installed.</para>
|
<command>xsltproc</command>.</para></listitem>
|
||||||
|
|
||||||
|
<listitem><para>PDF files need <command>pdftotext</command> which
|
||||||
|
is part of the <application>Xpdf</application> or
|
||||||
|
<application>Poppler</application> packages.</para></listitem>
|
||||||
|
|
||||||
|
<listitem><para>Postscript files need <command>pstotext</command>.
|
||||||
|
The original version has an issue with shell
|
||||||
|
characters in file names, which is corrected in recent
|
||||||
|
packages. See the &RCLAPPS; for more detail.</para>
|
||||||
</listitem>
|
</listitem>
|
||||||
|
|
||||||
<listitem><para>PDF: pdftotext is part of the <ulink
|
<listitem><para>MS Word needs
|
||||||
url="http://www.foolabs.com/xpdf/">Xpdf</ulink> or <ulink
|
<command>antiword</command>. It is also useful to have
|
||||||
url="http://poppler.freedesktop.org/">Poppler</ulink> packages.</para>
|
<command>wvWare</command> installed as it may be
|
||||||
</listitem>
|
used as a fallback for some files which
|
||||||
|
<command>antiword</command> does not handle.</para></listitem>
|
||||||
|
|
||||||
<listitem><para>Postscript: <ulink
|
<listitem><para>MS Excel and PowerPoint need <command>
|
||||||
url="http://www.cs.wisc.edu/~ghost/doc/pstotext.htm">
|
catdoc</command>.</para></listitem>
|
||||||
pstotext</ulink>. The original version has an issue with shell
|
|
||||||
character in file names. Most recent package repositories /
|
|
||||||
ports system use a patched version (ie FreeBSD, Debian). If
|
|
||||||
compiling from source, it would be better to apply the patch
|
|
||||||
found
|
|
||||||
<ulink url="http://www.recoll.org/files/pstotext-1.9_4-debian.patch">
|
|
||||||
here</ulink>.</para>
|
|
||||||
</listitem>
|
|
||||||
|
|
||||||
<listitem><para>MS Word: <ulink url="http://www.winfield.demon.nl">
|
<listitem><para>MS Open XML (docx) needs <command>
|
||||||
antiword</ulink>.</para>
|
xsltproc</command>.</para></listitem>
|
||||||
</listitem>
|
|
||||||
|
|
||||||
<listitem><para>MS Excel and PowerPoint:
|
<listitem><para>Wordperfect files need <command>wpd2html</command>
|
||||||
<ulink url="http://catdoc.klik.atekon.de/">
|
from the <application>libwpd</application> package.</para></listitem>
|
||||||
catdoc</ulink>.</para>
|
|
||||||
</listitem>
|
|
||||||
|
|
||||||
<listitem><para>MS Open XML (docx): needs
|
<listitem><para>RTF files need <command>unrtf</command>, which, in
|
||||||
<command>xsltproc</command>.</para>
|
its standard version, has much trouble with non-western character
|
||||||
</listitem>
|
sets. Check the &RCLAPPS;.</para></listitem>
|
||||||
|
|
||||||
<listitem><para>Wordperfect files:
|
<listitem><para>TeX files need <command>untex</command> or
|
||||||
<ulink url="http://libwpd.sourceforge.net/download.html">
|
<command>detex</command>. Check the &RCLAPPS; for sources if it's not
|
||||||
libwpd</ulink>.</para>
|
packaged for your distribution.</para></listitem>
|
||||||
</listitem>
|
|
||||||
|
|
||||||
<listitem>
|
<listitem><para>dvi files need <command>dvips</command>.</para>
|
||||||
<para>RTF: <ulink
|
|
||||||
url="http://www.gnu.org/software/unrtf/unrtf.html">unrtf</ulink>
|
|
||||||
</para>
|
|
||||||
</listitem>
|
</listitem>
|
||||||
|
|
||||||
<listitem>
|
<listitem><para>djvu files need <command>djvutxt</command> and
|
||||||
<para>TeX: &RCL; uses the <application>untex</application>
|
<command>djvused</command> from the
|
||||||
program. Your distribution may have a package for it. If it doesn't,
|
<application>DjVuLibre</application> package.</para></listitem>
|
||||||
<ulink url="http://www.recoll.org/untex/untex-1.3.jf.tar.gz">
|
|
||||||
there is a copy of the source on the &RCL; web site</ulink>,
|
|
||||||
because the program has no obvious home. The filter can
|
|
||||||
also work with
|
|
||||||
<ulink url="http://www.cs.purdue.edu/homes/trinkle/detex/">
|
|
||||||
detex</ulink> and will use it if it is installed.</para>
|
|
||||||
</listitem>
|
|
||||||
|
|
||||||
<listitem>
|
<listitem><para>Audio files: &RCL; releases up to 1.13
|
||||||
<para>dvi: <ulink
|
used the <command>id3info</command> command from the <application>
|
||||||
url="http://www.radicaleye.com/dvips.html">dvips</ulink></para>
|
id3lib</application> package to extract mp3 tag information,
|
||||||
</listitem>
|
<command>metaflac</command> (standard flac tools) for flac files,
|
||||||
|
and <command>ogginfo</command> (vorbis tools) for ogg
|
||||||
<listitem>
|
files. Releases 1.14 and later use a single
|
||||||
<para>djvu:
|
<application>Python</application> filter based
|
||||||
<ulink
|
on <application>mutagen</application> for all audio file
|
||||||
url="http://djvu.sourceforge.net">DjVuLibre
|
types.</para>
|
||||||
</ulink></para>
|
|
||||||
</listitem>
|
|
||||||
|
|
||||||
<listitem><para>mp3, flac, ogg vorbis: &RCL; releases before 1.13
|
|
||||||
use the <command>id3info</command> command from the <ulink
|
|
||||||
url="http://id3lib.sourceforge.net/">id3lib</ulink> package to
|
|
||||||
extract mp3 tag information. (Some gcc versions after 4.4 may have
|
|
||||||
trouble compiling <application>id3lib</application>. <ulink
|
|
||||||
url="http://www.recoll.org/id3lib.html">You can find a
|
|
||||||
workaround here</ulink>), metaflac (standard flac tools) for flac
|
|
||||||
files, and ogginfo (vorbis tools) for ogg files. Releases 1.14
|
|
||||||
and later use a single Python filter based on
|
|
||||||
<ulink url="http://code.google.com/p/mutagen/">mutagen</ulink>
|
|
||||||
for all audio file types.</para>
|
|
||||||
</listitem>
|
</listitem>
|
||||||
|
|
||||||
<listitem>
|
<listitem><para>Pictures: &RCL; uses the
|
||||||
<para>Pictures: &RCL; uses the
|
<application>Exiftool</application>
|
||||||
<ulink url="http://www.sno.phy.queensu.ca/~phil/exiftool/">
|
<application>Perl</application> package to extract tag
|
||||||
Exiftool</ulink> <application>Perl</application> package to
|
information. Most image file formats are supported. Note that
|
||||||
extract tag information. Most image file formats are
|
there may not be much interest in indexing the technical tags
|
||||||
supported. Note that there may not be much interest in indexing
|
(image size, aperture, etc.). This is only of interest if you
|
||||||
the technical tags (image size, aperture, etc.). This is only of
|
store personal tags or textual descriptions inside the image
|
||||||
interest if you store personal tags or textual descriptions inside
|
files.</para></listitem>
|
||||||
the image files.</para>
|
|
||||||
</listitem>
|
|
||||||
|
|
||||||
<listitem><para>chm: files in Microsoft help format need Python and
|
<listitem><para>chm: files in Microsoft help format need Python and
|
||||||
the <ulink
|
the <application>pychm</application> module (which needs
|
||||||
url="http://gnochm.sourceforge.net/pychm.html">pychm</ulink>
|
<application>chmlib</application>).</para></listitem>
|
||||||
module (which needs <ulink
|
|
||||||
url="http://www.jedrea.com/chmlib/">chmlib</ulink>).</para>
|
|
||||||
</listitem>
|
|
||||||
|
|
||||||
<listitem><para>ics: up to &RCL; 1.13, iCalendar files need Python
|
<listitem><para>ICS: up to &RCL; 1.13, iCalendar files need
|
||||||
and the <application>icalendar</application> module. For newer
|
<application>Python</application>
|
||||||
versions, <application>icalendar</application> is not needed
|
and the <application>icalendar</application>
|
||||||
</para></listitem>
|
module. <application>icalendar</application> is not needed for newer
|
||||||
|
versions, which use internal code.</para></listitem>
|
||||||
|
|
||||||
<listitem><para>zip: Zip archives need Python (and the standard
|
<listitem><para>Zip archives need <application>Python</application>
|
||||||
zipfile module).</para>
|
(and the standard zipfile module).</para></listitem>
|
||||||
</listitem>
|
|
||||||
|
|
||||||
</itemizedlist>
|
</itemizedlist>
|
||||||
|
|
||||||
<para>Text, HTML, mail folders, Openoffice and Scribus files
|
<para>Text, HTML, mail folders, and Scribus files are
|
||||||
are processed internally. Lyx is used to index Lyx files. Many
|
processed internally. <application>Lyx</application> is used to
|
||||||
filters need <command>iconv</command> and the standard
|
index Lyx files. Many filters need <command>iconv</command> and the
|
||||||
<command>sed</command> and <command>awk</command>.
|
standard <command>sed</command> and <command>awk</command>.
|
||||||
</para>
|
</para>
|
||||||
|
|
||||||
</sect1>
|
</sect1>
|
||||||
|
|||||||
@ -46,18 +46,11 @@
|
|||||||
<li><a href="perfs.html">Index size and indexing performance
|
<li><a href="perfs.html">Index size and indexing performance
|
||||||
data.</a></li>
|
data.</a></li>
|
||||||
|
|
||||||
<li>Faqs and Howtos are now kept in the
|
<li><a href="http://bitbucket.org/medoc/recoll/wiki/FaqsAndHowTos">
|
||||||
<a href="http://bitbucket.org/medoc/recoll/wiki/FaqsAndHowTos">
|
Faqs and Howtos</a> are now kept in the
|
||||||
Recoll Wiki</a> on
|
<a href="http://bitbucket.org/medoc/recoll/wiki/">
|
||||||
<a href="http://bitbucket.org/medoc/recoll">bitbucket.org</a>.</li>
|
Recoll Wiki</a> on
|
||||||
|
<a href="http://bitbucket.org/medoc/recoll">bitbucket.org</a>.</li>
|
||||||
<p>Current list of HowTos:</p>
|
|
||||||
<ul>
|
|
||||||
<li><a href="http://bitbucket.org/medoc/recoll/wiki/PreventIndexingDir">Prevent indexing of a directory</a></li>
|
|
||||||
<li><a href="http://bitbucket.org/medoc/recoll/wiki/MultipleIndexes">Creating and using multiple indexes</a></li>
|
|
||||||
<li><a href="http://bitbucket.org/medoc/recoll/wiki/SavingConfig.wiki">Recoll configuration backup</a></li>
|
|
||||||
<li><a href="http://bitbucket.org/medoc/recoll/wiki/IndexMozillaCalendari">Indexing Mozilla Sunbird / Lightning calendar data</a></li>
|
|
||||||
</ul>
|
|
||||||
</ul>
|
</ul>
|
||||||
|
|
||||||
</div>
|
</div>
|
||||||
|
|||||||
@ -384,7 +384,8 @@ sudo add-apt-repository ppa:recoll-backports/ppa
|
|||||||
|
|
||||||
<h2><a name="translations">Translations</a></h2>
|
<h2><a name="translations">Translations</a></h2>
|
||||||
|
|
||||||
<p>Most of the translations for 1.13 are incomplete. The source
|
<p>Most of the translations for 1.13 are incomplete (and I
|
||||||
|
forgot to update the message files for 1.14, ugh). The source
|
||||||
translation files are included in the source release. If your
|
translation files are included in the source release. If your
|
||||||
language has some English messages left and you want to take a
|
language has some English messages left and you want to take a
|
||||||
shot at fixing the problem, you can send the results to
|
shot at fixing the problem, you can send the results to
|
||||||
@ -400,17 +401,17 @@ sudo add-apt-repository ppa:recoll-backports/ppa
|
|||||||
</p>
|
</p>
|
||||||
|
|
||||||
<p><a href="translations/recoll_xx.ts">recoll_xx.ts</a> is a blank
|
<p><a href="translations/recoll_xx.ts">recoll_xx.ts</a> is a blank
|
||||||
Recoll 1.13 message file, handy to work on a new translation.</p>
|
Recoll 1.14 message file, handy to work on a new translation.</p>
|
||||||
|
|
||||||
<h3>Updated 1.13 translations that became available after the
|
<h3>Updated 1.13/1.14 translations that became available after the
|
||||||
release:</h3>
|
release:</h3>
|
||||||
|
|
||||||
<p>None for now :(</p>
|
<!-- <p>None for now :(</p> -->
|
||||||
<!--
|
<p>Lithuanian.
|
||||||
<p>German.
|
<a href="translations/recoll_lt.ts">recoll_lt.ts</a>
|
||||||
<a href="translations/recoll_de.ts">recoll_de.ts</a>
|
<a href="translations/recoll_lt.qm">recoll_lt.qm</a>
|
||||||
<a href="translations/recoll_de.qm">recoll_de.qm</a>
|
|
||||||
</p>
|
</p>
|
||||||
|
<!--
|
||||||
<p>Ukrainian.
|
<p>Ukrainian.
|
||||||
<a href="translations/recoll_uk.ts">recoll_uk.ts</a>
|
<a href="translations/recoll_uk.ts">recoll_uk.ts</a>
|
||||||
<a href="translations/recoll_uk.qm">recoll_uk.qm</a>
|
<a href="translations/recoll_uk.qm">recoll_uk.qm</a>
|
||||||
|
|||||||
@ -9,7 +9,7 @@
|
|||||||
<meta name="Description" content=
|
<meta name="Description" content=
|
||||||
"recoll is a simple full-text search system for unix and linux based on the powerful and mature xapian engine">
|
"recoll is a simple full-text search system for unix and linux based on the powerful and mature xapian engine">
|
||||||
<meta name="Keywords" content=
|
<meta name="Keywords" content=
|
||||||
"full text search,fulltext,desktop search,unix,linux,solaris,open source,free">
|
"full text search,fulltext,desktop search,unix,linux,solaris,open source,free">
|
||||||
<meta http-equiv="Content-language" content="en">
|
<meta http-equiv="Content-language" content="en">
|
||||||
<meta http-equiv="content-type" content=
|
<meta http-equiv="content-type" content=
|
||||||
"text/html; charset=iso-8859-1">
|
"text/html; charset=iso-8859-1">
|
||||||
@ -18,260 +18,268 @@
|
|||||||
</head>
|
</head>
|
||||||
|
|
||||||
<body>
|
<body>
|
||||||
|
|
||||||
<div class="rightlinks">
|
<div class="rightlinks">
|
||||||
<ul>
|
<ul>
|
||||||
<li><a href="index.html">Home</a></li>
|
<li><a href="index.html">Home</a></li>
|
||||||
<li><a href="pics/index.html">Screenshots</a></li>
|
|
||||||
<li><a href="download.html">Downloads</a></li>
|
<li><a href="pics/index.html">Screenshots</a></li>
|
||||||
<li><a href="usermanual/index.html">User manual</a></li>
|
|
||||||
<li><a href="index.html#support">Support</a></li>
|
<li><a href="download.html">Downloads</a></li>
|
||||||
<li><a href="devel.html">Development</a></li>
|
|
||||||
|
<li><a href="usermanual/index.html">User manual</a></li>
|
||||||
|
|
||||||
|
<li><a href="index.html#support">Support</a></li>
|
||||||
|
|
||||||
|
<li><a href="devel.html">Development</a></li>
|
||||||
</ul>
|
</ul>
|
||||||
</div>
|
</div>
|
||||||
|
|
||||||
<div class="content">
|
<div class="content">
|
||||||
|
|
||||||
<h1 class="intro">Recoll features</h1>
|
<h1 class="intro">Recoll features</h1>
|
||||||
|
|
||||||
<dl>
|
<h2><a name="systems">Supported systems</a></h2>
|
||||||
<dt><a name="systems">Supported systems</a></dt>
|
|
||||||
<dd><span class="application">Recoll</span> has been compiled and
|
|
||||||
tested on FreeBSD, Linux, Darwin and Solaris (versions
|
|
||||||
FreeBSD 5-7, Redhat 7/8/9, Fedora Core 5-13, Suse 10/11,
|
|
||||||
Gentoo, Debian 3.1, Solaris 8/9/10. Other not too distant
|
|
||||||
releases should be ok too).</dd>
|
|
||||||
|
|
||||||
<dd>Qt versions from 3.1 to 4.5</dd>
|
<p><span class="application">Recoll</span> has been compiled
|
||||||
|
and tested on FreeBSD, Linux, Darwin and Solaris (initial
|
||||||
|
versions FreeBSD 5, Redhat 7, Fedora Core 5, Suse 10, Gentoo,
|
||||||
|
Debian 3.1, Solaris 8). It should compile and run on all
|
||||||
|
subsequent releases of these systems and probably a few
|
||||||
|
others too.</p>
|
||||||
|
|
||||||
<dt><a name="doctypes">Document types</a></dt>
|
<p>Qt versions from 3.1 to 4.7</p>
|
||||||
<dd>Recoll can index many document types (along with their
|
|
||||||
compressed versions). Some types are handled internally (no
|
|
||||||
external application needed). Other types need some application to
|
|
||||||
be installed to extract the text. Types that only need common
|
|
||||||
very common utilities (awk/sed/groff etc.) are listed in the
|
|
||||||
native section.</dd>
|
|
||||||
|
|
||||||
<dl>
|
<h2><a name="doctypes">Document types</a></h2>
|
||||||
<dt>Natively</dt>
|
|
||||||
|
|
||||||
<dd>
|
<p>Recoll can index many document types (along with their
|
||||||
<ul>
|
compressed versions). Some types are handled internally (no
|
||||||
<li><span class="literal">text</span>.</li>
|
external application needed). Other types need a separate
|
||||||
|
application to be installed to extract the text. Types that
|
||||||
|
only need very common utilities (awk/sed/groff etc.) are
|
||||||
|
listed in the native section.</p>
|
||||||
|
|
||||||
<li><span class="literal">html</span>.</li>
|
<h4>File types indexed natively</h4>
|
||||||
|
|
||||||
<li><span class="literal">maildir</span> and <span
|
<ul>
|
||||||
class="literal">mailbox</span> (<span class=
|
<li><span class="literal">text</span>.</li>
|
||||||
"literal">Mozilla</span>, <span class=
|
|
||||||
"literal">Thunderbird</span> and <span class=
|
|
||||||
"literal">Evolution</span> mail ok).</li>
|
|
||||||
|
|
||||||
<li><span class="literal">OpenOffice</span>
|
<li><span class="literal">html</span>.</li>
|
||||||
files (needs <span class="command">unzip</span> command).</li>
|
|
||||||
|
|
||||||
<li><span class="literal">Abiword</span> files.</li>
|
<li><span class="literal">maildir</span> and <span class=
|
||||||
|
"literal">mailbox</span> (<span class=
|
||||||
|
"literal">Mozilla</span>, <span class=
|
||||||
|
"literal">Thunderbird</span> and <span class=
|
||||||
|
"literal">Evolution</span> mail ok).</li>
|
||||||
|
|
||||||
<li><span class="literal">Kword</span> files.</li>
|
<li><span class="literal">gaim</span> and <span class=
|
||||||
|
"literal">purple</span> log files.</li>
|
||||||
|
|
||||||
<li><span class="literal">gaim</span> and <span
|
<li><span class="literal">Lyx</span> files (needs <span
|
||||||
class="literal">purple</span> log files.</li>
|
class="literal">Lyx</span> to be installed).</li>
|
||||||
|
|
||||||
<li><span class="literal">Lyx</span> files (needs
|
<li><span class="literal">Scribus</span> files.</li>
|
||||||
<span class="literal">Lyx</span> to be installed).</li>
|
|
||||||
|
|
||||||
<li><span class="literal">Scribus</span> files.</li>
|
<li><span class="literal">Man pages</span> (need <span
|
||||||
|
class="command">groff</span>).</li>
|
||||||
<li><span class="literal">Man pages</span> (need <span
|
|
||||||
class="command">groff</span>).</li>
|
|
||||||
|
|
||||||
</ul>
|
|
||||||
</dd>
|
|
||||||
|
|
||||||
<dt>With external helpers</dt>
|
|
||||||
|
|
||||||
<dd>
|
|
||||||
<para>In addition to the applications listed below, many
|
|
||||||
document types need the <span
|
|
||||||
class="command">iconv</span> command.</para>
|
|
||||||
|
|
||||||
<ul>
|
|
||||||
<li><span class="literal">Microsoft Office Open XML</span>
|
|
||||||
files with the <span class="command">unzip</span>
|
|
||||||
and <span class="command">xsltproc</span> commands.</li>
|
|
||||||
|
|
||||||
<li><span class="literal">pdf</span> with the <span
|
|
||||||
class="command">pdftotext</span> command, which can be
|
|
||||||
installed as part of <a href=
|
|
||||||
"http://www.foolabs.com/xpdf/">xpdf</a> or <a
|
|
||||||
href="http://poppler.freedesktop.org/">poppler</a>,
|
|
||||||
depending on your distribution.</li>
|
|
||||||
|
|
||||||
<li><span class="literal">msword</span> with <a href=
|
|
||||||
"http://www.winfield.demon.nl/">antiword</a>.</li>
|
|
||||||
|
|
||||||
<li><span class="literal">Powerpoint</span> and
|
|
||||||
<span class="literal">Excel</span> with the
|
|
||||||
<a href="http://catdoc.klik.atekon.de">
|
|
||||||
catdoc</a> utilities.</li>
|
|
||||||
|
|
||||||
<li><span class="literal">CHM (Microsoft help)</span>
|
|
||||||
files (needs <span class="command">Python, pychm or
|
|
||||||
chmlib</span>).</li>
|
|
||||||
|
|
||||||
<li><span class="literal">Zip</span>
|
|
||||||
archives (needs <span class="command">Python</span>).</li>
|
|
||||||
|
|
||||||
<li><span class="literal">iCalendar</span>(.ics) files
|
|
||||||
(needs <span class="command">Python,
|
|
||||||
<a href="http://pypi.python.org/pypi/icalendar/2.1">icalendar</a></span>).</li>
|
|
||||||
|
|
||||||
<li><span class="literal">Mozilla calendar data</span>
|
|
||||||
See <a href="http://bitbucket.org/medoc/recoll/wiki/IndexMozillaCalendari">
|
|
||||||
the wiki</a> about this.</li>
|
|
||||||
|
|
||||||
<li><span class="literal">Wordperfect</span> with <a href=
|
|
||||||
"http://libwpd.sourceforge.net">libwpd</a>.</li>
|
|
||||||
|
|
||||||
<li><span class="literal">postscript</span> with
|
|
||||||
<a href="http://www.gnu.org/software/ghostscript/ghostscript.html">
|
|
||||||
ghostscript</a> and
|
|
||||||
<a href="http://www.cs.wisc.edu/~ghost/doc/pstotext.htm">
|
|
||||||
pstotext</a>.
|
|
||||||
Actually the pstotext 1.9 found at the latter link
|
|
||||||
has a problem with file names using special shell
|
|
||||||
characters, and you should either use the version
|
|
||||||
packaged for your system which is probably patched,
|
|
||||||
or apply the Debian patch which is
|
|
||||||
stored <a href="files/pstotext-1.9_4-debian.patch">here</a>
|
|
||||||
for convenience. See
|
|
||||||
http://packages.debian.org/squeeze/pstotext and
|
|
||||||
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=356988
|
|
||||||
for references/explanations.</li>
|
|
||||||
|
|
||||||
<li><span class="literal">rtf</span> with <a href=
|
|
||||||
"http://www.gnu.org/software/unrtf/unrtf.html">unrtf</a>.</li>
|
|
||||||
|
|
||||||
<li><span class="literal">TeX</span> with
|
|
||||||
<span class="command">untex</span>. If there is no untex
|
|
||||||
package for your distribution,
|
|
||||||
<a href="untex/untex-1.3.jf.tar.gz">a source package is
|
|
||||||
stored on this site</a> (as untex has no obvious
|
|
||||||
home).
|
|
||||||
Will also work
|
|
||||||
with <a
|
|
||||||
href="http://www.cs.purdue.edu/homes/trinkle/detex/">detex</a>
|
|
||||||
if this is installed.
|
|
||||||
</li>
|
|
||||||
|
|
||||||
<li><span class="literal">dvi</span> with
|
|
||||||
<a href="http://www.radicaleye.com/dvips.html">dvips</a>.
|
|
||||||
</li>
|
|
||||||
|
|
||||||
<li><span class="literal">djvu</span> with
|
|
||||||
<a href="http://djvu.sourceforge.net">DjVuLibre</a>.
|
|
||||||
</li>
|
|
||||||
<li><span class="literal">mp3/flac/ogg vorbis</span>
|
|
||||||
tags support with
|
|
||||||
<a href="http://id3lib.sourceforge.net/">id3info (id3lib)
|
|
||||||
</a> (compiling id3lib on recent systems may need
|
|
||||||
a small patch, see <a href="id3lib.html">here.</a>) or
|
|
||||||
the ogg and flac tools. Release 1.14 and later use a
|
|
||||||
python filter based on
|
|
||||||
<a href="http://code.google.com/p/mutagen/">mutagen</a>
|
|
||||||
for all audio tags.
|
|
||||||
</li>
|
|
||||||
<li>Image file tags support with
|
|
||||||
<a href="http://www.sno.phy.queensu.ca/~phil/exiftool/">
|
|
||||||
exiftool</a>. This is a perl program, so you also
|
|
||||||
need perl on the system. This works with about any
|
|
||||||
possible image file and tag format (jpg, png, tiff,
|
|
||||||
gif etc.).
|
|
||||||
</li>
|
|
||||||
|
|
||||||
</ul>
|
|
||||||
</dd>
|
|
||||||
</dl>
|
|
||||||
</dd>
|
|
||||||
|
|
||||||
<dt>Other features</dt>
|
|
||||||
<dd>
|
|
||||||
<ul>
|
|
||||||
<li>Can use <b>Beagle</b> browser plug-ins to index web
|
|
||||||
history. See the
|
|
||||||
<a href="http://bitbucket.org/medoc/recoll/wiki/IndexBeagleWeb">
|
|
||||||
the Wiki</a> for more detail.</li>
|
|
||||||
|
|
||||||
<li>Processes all email attachments.</li>
|
|
||||||
|
|
||||||
<li>Multiple selectable databases.</li>
|
|
||||||
|
|
||||||
<li>Powerful query facilities, with boolean searches,
|
|
||||||
phrases, filter on file types and directory tree.</li>
|
|
||||||
|
|
||||||
<li>Xesam-compatible query language.</li>
|
|
||||||
|
|
||||||
<li>Wildcard searches (with a specific and faster function for
|
|
||||||
file names).</li>
|
|
||||||
|
|
||||||
<li>Support for multiple charsets. Internal processing and
|
|
||||||
storage uses Unicode UTF-8.</li>
|
|
||||||
|
|
||||||
<li><a href="#Stemming">Stemming</a> performed at query
|
|
||||||
time (can switch stemming language after indexing).</li>
|
|
||||||
|
|
||||||
<li>Easy installation. No database daemon, web server or
|
|
||||||
exotic language necessary.</li>
|
|
||||||
|
|
||||||
<li>An indexer which runs either as a thread inside the GUI,
|
|
||||||
as an external, batch, cron'able program, or as a
|
|
||||||
real-time indexing daemon.</li>
|
|
||||||
</ul>
|
|
||||||
</dd>
|
|
||||||
</ul>
|
</ul>
|
||||||
|
|
||||||
|
<h4>File types indexed with external helpers</h4>
|
||||||
|
|
||||||
|
<p>Many document types need the <span class="command">iconv</span>
|
||||||
|
command in addition to the applications specifically listed.</p>
|
||||||
|
|
||||||
|
<p>The following types need <span class=
|
||||||
|
"command">xsltproc</span> from the <b>libxslt</b> package.
|
||||||
|
Quite a few also need <span class="command">unzip</span>:</p>
|
||||||
|
|
||||||
|
<ul>
|
||||||
|
<li><span class="literal">Abiword</span> files.</li>
|
||||||
|
|
||||||
|
<li><span class="literal">Fb2</span> ebooks.</li>
|
||||||
|
|
||||||
|
<li><span class="literal">Kword</span> files.</li>
|
||||||
|
|
||||||
|
<li><span class="literal">Microsoft Office Open XML</span>
|
||||||
|
files.</li>
|
||||||
|
|
||||||
|
<li><span class="literal">OpenOffice</span> files.</li>
|
||||||
|
|
||||||
|
<li><span class="literal">SVG</span> files.</li>
|
||||||
|
</ul>
|
||||||
|
|
||||||
|
<p>Others:</p>
|
||||||
|
|
||||||
|
<ul>
|
||||||
|
<li><span class="literal">pdf</span> with the <span class=
|
||||||
|
"command">pdftotext</span> command, which can be installed
|
||||||
|
as part of <a href="http://www.foolabs.com/xpdf/">xpdf</a>
|
||||||
|
or <a href="http://poppler.freedesktop.org/">poppler</a>,
|
||||||
|
depending on your distribution.</li>
|
||||||
|
|
||||||
|
<li><span class="literal">msword</span> with <a href=
|
||||||
|
"http://www.winfield.demon.nl/">antiword</a>.</li>
|
||||||
|
|
||||||
|
<li><span class="literal">Powerpoint</span> and <span
|
||||||
|
class="literal">Excel</span> with the <a href=
|
||||||
|
"http://catdoc.klik.atekon.de">catdoc</a> utilities.</li>
|
||||||
|
|
||||||
|
<li><span class="literal">CHM (Microsoft help)</span> files
|
||||||
|
(needs <span class="command">Python, pychm and
|
||||||
|
chmlib</span>).</li>
|
||||||
|
|
||||||
|
<li><span class="literal">Zip</span> archives (needs <span
|
||||||
|
class="command">Python</span>).</li>
|
||||||
|
|
||||||
|
<li><span class="literal">iCalendar</span> (.ics) files
|
||||||
|
(needs <span class="command">Python, <a href=
|
||||||
|
"http://pypi.python.org/pypi/icalendar/2.1">icalendar</a></span>).</li>
|
||||||
|
|
||||||
|
<li><span class="literal">Mozilla calendar data</span>. See
|
||||||
|
<a href=
|
||||||
|
"http://bitbucket.org/medoc/recoll/wiki/IndexMozillaCalendari">
|
||||||
|
the wiki</a> about this.</li>
|
||||||
|
|
||||||
|
<li><span class="literal">Wordperfect</span> with <a href=
|
||||||
|
"http://libwpd.sourceforge.net">libwpd</a>.</li>
|
||||||
|
|
||||||
|
<li><span class="literal">postscript</span> with <a href=
|
||||||
|
"http://www.gnu.org/software/ghostscript/ghostscript.html">ghostscript</a>
|
||||||
|
and <a href=
|
||||||
|
"http://www.cs.wisc.edu/~ghost/doc/pstotext.htm">pstotext</a>.
|
||||||
|
Actually the pstotext 1.9 found at the latter link has a
|
||||||
|
problem with file names using special shell characters, and
|
||||||
|
you should either use the version packaged for your system
|
||||||
|
which is probably patched, or apply the Debian patch which
|
||||||
|
is stored <a href=
|
||||||
|
"files/pstotext-1.9_4-debian.patch">here</a> for
|
||||||
|
convenience. See
|
||||||
|
http://packages.debian.org/squeeze/pstotext and
|
||||||
|
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=356988 for
|
||||||
|
references/explanations.</li>
|
||||||
|
|
||||||
|
<li><span class="literal">RTF</span> files with <a href=
|
||||||
|
"http://www.gnu.org/software/unrtf/unrtf.html">unrtf</a>. Please
|
||||||
|
note that up to version
|
||||||
|
0.21, <span class="command">unrtf</span> mostly does not work
|
||||||
|
with non-Western-European character sets. If you have a need
|
||||||
|
for indexing, e.g., Russian or Chinese RTF files, I have
|
||||||
|
produced a modified version which works much better (as
|
||||||
|
indicated by my tests and a few external ones). You can
|
||||||
|
download the <a href="unrtf/unrtf-0.22.0beta.tar.gz">source
|
||||||
|
here</a>. The development is hosted
|
||||||
|
on <a href="http://www.bitbucket.org/medoc/unrtf-int">
|
||||||
|
bitbucket.org</a>.</li>
|
||||||
|
|
||||||
|
<li><span class="literal">TeX</span> with <span class=
|
||||||
|
"command">untex</span>. If there is no untex package for
|
||||||
|
your distribution, <a href="untex/untex-1.3.jf.tar.gz">a
|
||||||
|
source package is stored on this site</a> (as untex has no
|
||||||
|
obvious home). Will also work with <a href=
|
||||||
|
"http://www.cs.purdue.edu/homes/trinkle/detex/">detex</a>
|
||||||
|
if this is installed.</li>
|
||||||
|
|
||||||
|
<li><span class="literal">dvi</span> with <a href=
|
||||||
|
"http://www.radicaleye.com/dvips.html">dvips</a>.</li>
|
||||||
|
|
||||||
|
<li><span class="literal">djvu</span> with <a href=
|
||||||
|
"http://djvu.sourceforge.net">DjVuLibre</a>.</li>
|
||||||
|
|
||||||
|
<li>Audio file tags: Recoll releases 1.13 and older use <a
|
||||||
|
href="http://id3lib.sourceforge.net/">id3info (id3lib)</a>
|
||||||
|
(compiling id3lib on recent systems may need a small patch,
|
||||||
|
see <a href="id3lib.html">here.</a>) or the ogg and flac
|
||||||
|
tools.<br>
|
||||||
|
Recoll releases 1.14 and later use a Python filter based
|
||||||
|
on <a href="http://code.google.com/p/mutagen/">mutagen</a>
|
||||||
|
for all audio types.</li>
|
||||||
|
|
||||||
|
<li>Image file tags support with <a href=
|
||||||
|
"http://www.sno.phy.queensu.ca/~phil/exiftool/">exiftool</a>.
|
||||||
|
This is a Perl program, so you also need Perl on the
|
||||||
|
system. This works with almost any image file and
|
||||||
|
tag format (jpg, png, tiff, gif etc.).</li>
|
||||||
|
</ul>
|
||||||
|
|
||||||
|
<h2>Other features</h2>
|
||||||
|
|
||||||
|
<ul>
|
||||||
|
<li>Can use <b>Beagle</b> browser plug-ins to index web
|
||||||
|
history. See <a href=
|
||||||
|
"http://bitbucket.org/medoc/recoll/wiki/IndexBeagleWeb">the
|
||||||
|
Wiki</a> for more detail.</li>
|
||||||
|
|
||||||
|
<li>Processes all email attachments.</li>
|
||||||
|
|
||||||
|
<li>Multiple selectable databases.</li>
|
||||||
|
|
||||||
|
<li>Powerful query facilities, with boolean searches,
|
||||||
|
phrases, filter on file types and directory tree.</li>
|
||||||
|
|
||||||
|
<li>Xesam-compatible query language.</li>
|
||||||
|
|
||||||
|
<li>Wildcard searches (with a specific and faster function
|
||||||
|
for file names).</li>
|
||||||
|
|
||||||
|
<li>Support for multiple charsets. Internal processing and
|
||||||
|
storage uses Unicode UTF-8.</li>
|
||||||
|
|
||||||
|
<li><a href="#Stemming">Stemming</a> performed at query
|
||||||
|
time (can switch stemming language after indexing).</li>
|
||||||
|
|
||||||
|
<li>Easy installation. No database daemon, web server or
|
||||||
|
exotic language necessary.</li>
|
||||||
|
|
||||||
|
<li>An indexer which runs either as a thread inside the
|
||||||
|
GUI, as an external, batch, cron'able program, or as a
|
||||||
|
real-time indexing daemon.</li>
|
||||||
|
</ul>
|
||||||
|
|
||||||
<h2><a name="#stemming"></a>Stemming</h2>
|
<h2><a name="#stemming"></a>Stemming</h2>
|
||||||
|
|
||||||
<p>Stemming is a process which transforms inflected words into
|
<p>Stemming is a process which transforms inflected words
|
||||||
their most basic form. For example, <i>flooring</i>,
|
into their most basic form. For example, <i>flooring</i>,
|
||||||
<i>floors</i>, <i>floored</i> would probably all be transformed
|
<i>floors</i>, <i>floored</i> would probably all be
|
||||||
to <i>floor</i> by a stemmer for the English language.</p>
|
transformed to <i>floor</i> by a stemmer for the English
|
||||||
|
language.</p>
|
||||||
|
|
||||||
<p>In many search engines, the stemming process occurs during
|
<p>In many search engines, the stemming process occurs during
|
||||||
indexing. The index will only contain the stemmed form of words,
|
indexing. The index will only contain the stemmed form of
|
||||||
with exceptions for terms which are detected as being probably
|
words, with exceptions for terms which are detected as being
|
||||||
proper nouns (i.e., capitalized). At query time, the terms entered
|
probably proper nouns (i.e., capitalized). At query time, the
|
||||||
by the user are stemmed, then matched against the index.</p>
|
terms entered by the user are stemmed, then matched against
|
||||||
|
the index.</p>
|
||||||
|
|
||||||
<p>This process results in a smaller index, but it has the
|
<p>This process results in a smaller index, but it has the
|
||||||
serious drawback of irrevocably losing information during
|
serious drawback of irrevocably losing information during
|
||||||
indexing.</p>
|
indexing.</p>
|
||||||
|
|
||||||
<p>Recoll works in a different way. No stemming is performed at
|
<p>Recoll works in a different way. No stemming is performed
|
||||||
indexing time, so that all information gets into the index. The
|
at indexing time, so that all information gets into the index.
|
||||||
resulting index is bigger, but most people probably don't care
|
The resulting index is bigger, but most people probably don't
|
||||||
much about this nowadays, because they have a 100 GB disk 95%
|
care much about this nowadays, because they have a 100 GB disk
|
||||||
full of binary data <em>which does not get indexed</em>.</p>
|
95% full of binary data <em>which does not get
|
||||||
<p>At the end of an indexing pass, Recoll builds one or several
|
indexed</em>.</p>
|
||||||
stemming dictionaries, where all word stems are listed in
|
|
||||||
correspondence to the list of their derivatives.</p>
|
<p>At the end of an indexing pass, Recoll builds one or
|
||||||
|
several stemming dictionaries, where all word stems are
|
||||||
|
listed in correspondence to the list of their
|
||||||
|
derivatives.</p>
|
||||||
|
|
||||||
<p>At query time, by default, user-entered terms are stemmed,
|
<p>At query time, by default, user-entered terms are stemmed,
|
||||||
then matched against the stem database, and the query is
|
then matched against the stem database, and the query is
|
||||||
expanded to include all derivatives. This will yield search
|
expanded to include all derivatives. This will yield search
|
||||||
results analogous to those obtained by a classical engine.
|
results analogous to those obtained by a classical engine.
|
||||||
The benefit of this approach is that stem expansion can be
|
The benefit of this approach is that stem expansion can be
|
||||||
controlled instantly at query time in several ways:
|
controlled instantly at query time in several ways:</p>
|
||||||
<ul>
|
|
||||||
<li>It can be selectively turned off for any query term by
|
|
||||||
capitalizing it (<i>Floor</i>).</li>
|
|
||||||
<li>The stemming language (e.g., English, French...) can be
|
|
||||||
selected (this assumes that several stemming databases have
|
|
||||||
been built, which can be configured as part of the indexing,
|
|
||||||
or done later, in a reasonably fast way).</li>
|
|
||||||
</ul>
|
|
||||||
|
|
||||||
|
<ul>
|
||||||
|
<li>It can be selectively turned off for any query term by
|
||||||
|
capitalizing it (<i>Floor</i>).</li>
|
||||||
|
|
||||||
|
<li>The stemming language (e.g., English, French...) can be
|
||||||
|
selected (this assumes that several stemming databases
|
||||||
|
have been built, which can be configured as part of the
|
||||||
|
indexing, or done later, in a reasonably fast way).</li>
|
||||||
|
</ul>
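<p>The query-time expansion described above can be sketched roughly as
follows. This is only an illustration, not Recoll's actual code: the
<span class="literal">naive_stem</span> function is a toy stand-in for a real
stemmer, and the names <span class="literal">build_stem_db</span> and
<span class="literal">expand_query_term</span> are hypothetical.</p>

```python
from collections import defaultdict


def naive_stem(word):
    """Toy stemmer (assumption): strips a few common English suffixes."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def build_stem_db(index_terms):
    """Built after indexing: maps each stem to the unstemmed index terms."""
    stem_db = defaultdict(set)
    for term in index_terms:
        stem_db[naive_stem(term)].add(term)
    return stem_db


def expand_query_term(term, stem_db):
    """Expand a user term to all its derivatives, unless it is capitalized."""
    if term[:1].isupper():
        # A capitalized term turns stem expansion off for that term.
        return {term.lower()}
    return stem_db.get(naive_stem(term), set()) | {term}


db = build_stem_db(["floor", "floors", "flooring", "floored", "door"])
print(sorted(expand_query_term("floors", db)))
print(sorted(expand_query_term("Floor", db)))
```

<p>Because the index itself keeps the unstemmed terms, switching to another
stemming language in this sketch would only mean building another such
mapping, without reindexing the documents.</p>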
|
||||||
</div>
|
</div>
|
||||||
</body>
|
</body>
|
||||||
</html>
|
</html>
|
||||||
|
|||||||
@ -104,16 +104,14 @@
|
|||||||
</ul>
|
</ul>
|
||||||
</li>
|
</li>
|
||||||
|
|
||||||
<li>2010-04-14 :
|
|
||||||
Recoll <a href="download.html#source">1.13.04</a> is out. It
|
|
||||||
fixes a nasty bug (broken stemming) in 1.13.02.</li>
|
|
||||||
|
|
||||||
<li>2010-01-29 : the full Recoll source repository is now
|
<li>2010-01-29 : the full Recoll source repository is now
|
||||||
hosted on
|
hosted on
|
||||||
<a href="http://bitbucket.org/medoc/recoll">Bitbucket</a>, along
|
<a href="http://bitbucket.org/medoc/recoll">Bitbucket</a>,
|
||||||
with a Wiki and an
|
along with a Wiki
|
||||||
<a href="http://bitbucket.org/medoc/recoll/issues">issues tracking
|
(<a href="http://bitbucket.org/medoc/recoll/wiki/FaqsAndHowTos">
|
||||||
system</a>. Hopefully, this
|
Faqs and Howtos</a>) and an
|
||||||
|
<a href="http://bitbucket.org/medoc/recoll/issues">
|
||||||
|
issues tracking system</a>. Hopefully, this
|
||||||
new channel for reporting bugs and making suggestions will
|
new channel for reporting bugs and making suggestions will
|
||||||
increase the feedback rate...</li>
|
increase the feedback rate...</li>
|
||||||
|
|
||||||
|
|||||||
@ -135,6 +135,10 @@
|
|||||||
code contributions or suggestions, see the
|
code contributions or suggestions, see the
|
||||||
<a class="important" href="credits.html">Credits</a>.</p>
|
<a class="important" href="credits.html">Credits</a>.</p>
|
||||||
|
|
||||||
|
<h2>Other</h2>
|
||||||
|
<p>I rent out a
|
||||||
|
<a href="http://www.metairie-enbor.com/index.html">
|
||||||
|
big, friendly house in the Aude</a> :)</p>
|
||||||
|
|
||||||
</div>
|
</div>
|
||||||
</body>
|
</body>
|
||||||
|
|||||||