release 3787

Jean-Francois Dockes 2015-01-28 11:22:49 +01:00
parent 9176cdd147
commit 88bccb47b3
2 changed files with 72 additions and 50 deletions

@@ -103,8 +103,8 @@ Chapter 5. Installation and configuration
o Openoffice files need unzip and xsltproc. o Openoffice files need unzip and xsltproc.
o PDF files need pdftotext which is part of the Xpdf or Poppler o PDF files need pdftotext which is part of Poppler (usually comes with
packages. the poppler-utils package). Avoid the original one from Xpdf.
o Postscript files need pstotext. The original version has an issue with o Postscript files need pstotext. The original version has an issue with
shell character in file names, which is corrected in recent packages. shell character in file names, which is corrected in recent packages.
@@ -121,9 +121,10 @@ Chapter 5. Installation and configuration
o Wordperfect files need wpd2html from the libwpd (or libwpd-tools on o Wordperfect files need wpd2html from the libwpd (or libwpd-tools on
Ubuntu) package. Ubuntu) package.
o RTF files need unrtf, which, in its standard version, has much trouble o RTF files need unrtf, which, in its older versions, has much trouble
with non-western character sets. Check with non-western character sets. Many Linux distributions carry
http://www.recoll.org/features.html. outdated unrtf versions. Check http://www.recoll.org/features.html for
details.
o TeX files need untex or detex. Check o TeX files need untex or detex. Check
http://www.recoll.org/features.html for sources if it's not packaged http://www.recoll.org/features.html for sources if it's not packaged

View File

@@ -197,15 +197,17 @@ Chapter 1. Introduction
1.1. Giving it a try 1.1. Giving it a try
If you do not like reading manuals (who does?) and would like to give If you do not like reading manuals (who does?) but wish to give Recoll a
Recoll a try, just install the application and start the recoll graphical try, just install the application and start the recoll graphical user
user interface (GUI), which will ask to index your home directory by interface (GUI), which will ask permission to index your home directory by
default, allowing you to search immediately after indexing completes. default, allowing you to search immediately after indexing completes.
Do not do this if your home directory contains a huge number of documents Do not do this if your home directory contains a huge number of documents
and you do not want to wait or are very short on disk space. In this case, and you do not want to wait or are very short on disk space. In this case,
you may first want to customize the configuration to restrict the indexed you may first want to customize the configuration to restrict the indexed
area. area (for the very impatient with a completed package install, from the
recoll GUI: Preferences -> Indexing configuration, then adjust the Top
directories section).
Also be aware that you may need to install the appropriate supporting Also be aware that you may need to install the appropriate supporting
applications for document types that need them (for example antiword for applications for document types that need them (for example antiword for
@@ -213,49 +215,58 @@ Chapter 1. Introduction
1.2. Full text search 1.2. Full text search
Recoll is a full text search application. Full text search applications Recoll is a full text search application. Full text search finds your data
let you find your data by content rather than by external attributes (like by content rather than by external attributes (like a file name). You
a file name). More specifically, they will let you specify words (terms) specify words (terms) which should or should not appear in the text you
that should or should not appear in the text you are looking for, and are looking for, and receive in return a list of matching documents,
return a list of matching documents, ordered so that the most relevant ordered so that the most relevant documents will appear first.
documents will appear first.
You do not need to remember in what file or email message you stored a You do not need to remember in what file or email message you stored a
given piece of information. You just ask for related terms, and the tool given piece of information. You just ask for related terms, and the tool
will return a list of documents where these terms are prominent, in a will return a list of documents where these terms are prominent, in a
similar way to Internet search engines. similar way to Internet search engines.
A search application tries to determine which documents are most relevant Full text search applications try to determine which documents are most
to the search terms you provide. Computer algorithms for determining relevant to the search terms you provide. Computer algorithms for
relevance can be very complex, and in general are inferior to the power of determining relevance can be very complex, and in general are inferior to
the human mind to rapidly determine relevance. The quality of relevance the power of the human mind to rapidly determine relevance. The quality of
guessing is probably the most important aspect when evaluating a search relevance guessing is probably the most important aspect when evaluating a
application. search application.
In many cases, you are looking for all the forms of a word, not for a In many cases, you are looking for all the forms of a word, including
specific form or spelling. These different forms may include plurals, plurals, different tenses for a verb, or terms derived from the same root
different tenses for a verb, or terms derived from the same root or stem or stem (example: floor, floors, floored, flooring...). Queries are
(example: floor, floors, floored, flooring...). Search applications usually automatically expanded to all such related terms (words that
usually expand queries to all such related terms (words that reduce to the reduce to the same stem). This can be prevented for searching for a
same stem) and also provide a way to disable this expansion if you are specific form.
actually searching for a specific form.
Stemming, by itself, does not accommodate for misspellings or phonetic Stemming, by itself, does not accommodate for misspellings or phonetic
searches. Recoll supports these features through a specific tool (the term searches. A full text search application may also support this form of
explorer) which will let you explore the set of index terms along approximation. For example, a search for aliterattion returning no result
different modes. may propose, depending on index contents, alliteration alteration
alterations altercation as possible replacement terms.
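As a rough illustration of the spelling approximation described above, here is a minimal Python sketch. It assumes the index terms are available as a plain list and uses the standard difflib module as a stand-in for a real speller (Recoll itself relies on aspell for this). The suggest_terms name and the sample term list are hypothetical.

```python
import difflib


def suggest_terms(query, index_terms, max_suggestions=4):
    """Return index terms that look like plausible respellings of query.

    Hypothetical helper: difflib stands in for a real speller here.
    """
    if query in index_terms:
        # The term exists in the index, so no replacement is proposed.
        return []
    return difflib.get_close_matches(
        query, index_terms, n=max_suggestions, cutoff=0.6
    )


# Sample index contents, made up for this example.
terms = ["alliteration", "alteration", "alterations", "altercation", "floor"]
print(suggest_terms("aliterattion", terms))
```

Which terms come back depends entirely on the index contents and on the similarity cutoff, which is an arbitrary choice in this sketch.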
1.3. Recoll overview 1.3. Recoll overview
Recoll uses the Xapian information retrieval library as its storage and Recoll uses the Xapian information retrieval library as its storage and
retrieval engine. Xapian is a very mature package using a sophisticated retrieval engine. Xapian is a very mature package using a sophisticated
probabilistic ranking model. Recoll provides the mechanisms and interface probabilistic ranking model.
to get data into and out of the system.
In practice, Xapian works by remembering where terms appear in your The Xapian library manages an index database which describes where terms
document files. The acquisition process is called indexing. appear in your document files. It efficiently processes the complex
queries which are produced by the Recoll query expansion mechanism, and is
in charge of the all-important relevance computation task.
The resulting index can be big (roughly the size of the original document Recoll provides the mechanisms and interface to get data into and out of
the index. This includes translating the many possible document formats
into pure text, handling term variations (using Xapian stemmers), and
spelling approximations (using the aspell speller), interpreting user
queries and presenting results.
In a shorter way, Recoll does the dirty footwork, Xapian deals with the
intelligent parts of the process.
The Xapian index can be big (roughly the size of the original document
set), but it is not a document archive. Recoll can only display documents set), but it is not a document archive. Recoll can only display documents
that still exist at the place from which they were indexed. (Actually, that still exist at the place from which they were indexed. (Actually,
there is a way to reconstruct a document from the information in the there is a way to reconstruct a document from the information in the
@@ -263,8 +274,10 @@ Chapter 1. Introduction
capitalization are lost). capitalization are lost).
Recoll stores all internal data in Unicode UTF-8 format, and it can index Recoll stores all internal data in Unicode UTF-8 format, and it can index
files with different character sets, encodings, and languages into the files of many types with different character sets, encodings, and
same index. It has can process many document types. languages into the same index. It can process documents embedded inside
other documents (for example a pdf document stored inside a Zip archive
sent as an email attachment...), down to an arbitrary depth.
Stemming is the process by which Recoll reduces words to their radicals so Stemming is the process by which Recoll reduces words to their radicals so
that searching does not depend, for example, on a word being singular or that searching does not depend, for example, on a word being singular or
@@ -318,13 +331,15 @@ Chapter 1. Introduction
The indexing process is started automatically the first time you execute The indexing process is started automatically the first time you execute
the recoll GUI. Indexing can also be performed by executing the the recoll GUI. Indexing can also be performed by executing the
recollindex command. recollindex command. Recoll indexing is multithreaded by default when
appropriate hardware resources are available, and can perform in parallel
multiple tasks among text extraction, segmentation and index updates.
Searches are usually performed inside the recoll GUI, which has many Searches are usually performed inside the recoll GUI, which has many
options to help you find what you are looking for. However, there are options to help you find what you are looking for. However, there are
other ways to perform Recoll searches: mostly a command line interface, a other ways to perform Recoll searches: mostly a command line interface, a
Python programming interface, a KDE KIO slave module, and a Ubuntu Unity Python programming interface, a KDE KIO slave module, and Ubuntu Unity
Lens module. Lens (for older versions) or Scope (for current versions) modules.
Chapter 2. Indexing Chapter 2. Indexing
@@ -332,10 +347,10 @@ Chapter 2. Indexing
Indexing is the process by which the set of documents is analyzed and the Indexing is the process by which the set of documents is analyzed and the
data entered into the database. Recoll indexing is normally incremental: data entered into the database. Recoll indexing is normally incremental:
documents will only be processed if they have been modified. On the first documents will only be processed if they have been modified since the last
execution, all documents will need processing. A full index build can be run. On the first execution, all documents will need processing. A full
forced later by specifying an option to the indexing command (recollindex index build can be forced later by specifying an option to the indexing
-z or -Z). command (recollindex -z or -Z).
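The incremental behaviour described above can be sketched as follows. This is only an illustration under stated assumptions, not Recoll's actual change detection: it assumes a file counts as modified when its modification time or size differs from what was recorded on the previous run, and the files_to_process name and the in-memory state dictionary are hypothetical.

```python
import os


def files_to_process(paths, recorded):
    """Return the paths that changed since the previous run.

    recorded maps each path to the (mtime, size) signature seen last time
    and is updated in place. Hypothetical sketch, not Recoll's own logic.
    """
    changed = []
    for path in paths:
        st = os.stat(path)
        signature = (st.st_mtime_ns, st.st_size)
        if recorded.get(path) != signature:
            # New or modified document: it needs (re)processing.
            changed.append(path)
            recorded[path] = signature
    return changed
```

With an empty state every document is returned, which matches the first execution. Clearing the state plays the role of forcing a full index build.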
The following sections give an overview of different aspects of the The following sections give an overview of different aspects of the
indexing processes and configuration, with links to detailed sections. indexing processes and configuration, with links to detailed sections.
@@ -1463,6 +1478,11 @@ Chapter 3. Searching
cases where the exact search term is not known. For example, you may not cases where the exact search term is not known. For example, you may not
remember the exact spelling, or only know the beginning of the name. remember the exact spelling, or only know the beginning of the name.
The search will only propose replacement terms with spelling variations
when no matching documents were found. In some cases, both proper spellings
and misspellings are present in the index, and it may be interesting to
look for them explicitly.
The term explorer tool (started from the toolbar icon or from the Term The term explorer tool (started from the toolbar icon or from the Term
explorer entry of the Tools menu) can be used to search the full index explorer entry of the Tools menu) can be used to search the full index
terms list. It has three modes of operations: terms list. It has three modes of operations:
@@ -3302,8 +3322,8 @@ Chapter 5. Installation and configuration
o Openoffice files need unzip and xsltproc. o Openoffice files need unzip and xsltproc.
o PDF files need pdftotext which is part of the Xpdf or Poppler o PDF files need pdftotext which is part of Poppler (usually comes with
packages. the poppler-utils package). Avoid the original one from Xpdf.
o Postscript files need pstotext. The original version has an issue with o Postscript files need pstotext. The original version has an issue with
shell character in file names, which is corrected in recent packages. shell character in file names, which is corrected in recent packages.
@@ -3320,9 +3340,10 @@ Chapter 5. Installation and configuration
o Wordperfect files need wpd2html from the libwpd (or libwpd-tools on o Wordperfect files need wpd2html from the libwpd (or libwpd-tools on
Ubuntu) package. Ubuntu) package.
o RTF files need unrtf, which, in its standard version, has much trouble o RTF files need unrtf, which, in its older versions, has much trouble
with non-western character sets. Check with non-western character sets. Many Linux distributions carry
http://www.recoll.org/features.html. outdated unrtf versions. Check http://www.recoll.org/features.html for
details.
o TeX files need untex or detex. Check o TeX files need untex or detex. Check
http://www.recoll.org/features.html for sources if it's not packaged http://www.recoll.org/features.html for sources if it's not packaged