release 3787
parent 9176cdd147
commit 88bccb47b3
11
src/INSTALL
@@ -103,8 +103,8 @@ Chapter 5. Installation and configuration
 
 o Openoffice files need unzip and xsltproc.
 
-o PDF files need pdftotext which is part of the Xpdf or Poppler
-packages.
+o PDF files need pdftotext which is part of Poppler (usually comes with
+the poppler-utils package). Avoid the original one from Xpdf.
 
 o Postscript files need pstotext. The original version has an issue with
 shell character in file names, which is corrected in recent packages.
@@ -121,9 +121,10 @@ Chapter 5. Installation and configuration
 o Wordperfect files need wpd2html from the libwpd (or libwpd-tools on
 Ubuntu) package.
 
-o RTF files need unrtf, which, in its standard version, has much trouble
-with non-western character sets. Check
-http://www.recoll.org/features.html.
+o RTF files need unrtf, which, in its older versions, has much trouble
+with non-western character sets. Many Linux distributions carry
+outdated unrtf versions. Check http://www.recoll.org/features.html for
+details.
 
 o TeX files need untex or detex. Check
 http://www.recoll.org/features.html for sources if it's not packaged
111
src/README
@@ -197,15 +197,17 @@ Chapter 1. Introduction
 
 1.1. Giving it a try
 
-If you do not like reading manuals (who does?) and would like to give
-Recoll a try, just install the application and start the recoll graphical
-user interface (GUI), which will ask to index your home directory by
+If you do not like reading manuals (who does?) but wish to give Recoll a
+try, just install the application and start the recoll graphical user
+interface (GUI), which will ask permission to index your home directory by
 default, allowing you to search immediately after indexing completes.
 
 Do not do this if your home directory contains a huge number of documents
 and you do not want to wait or are very short on disk space. In this case,
 you may first want to customize the configuration to restrict the indexed
-area.
+area (for the very impatient with a completed package install, from the
+recoll GUI: Preferences -> Indexing configuration, then adjust the Top
+directories section).
 
 Also be aware that you may need to install the appropriate supporting
 applications for document types that need them (for example antiword for
@@ -213,49 +215,58 @@ Chapter 1. Introduction
 
 1.2. Full text search
 
-Recoll is a full text search application. Full text search applications
-let you find your data by content rather than by external attributes (like
-a file name). More specifically, they will let you specify words (terms)
-that should or should not appear in the text you are looking for, and
-return a list of matching documents, ordered so that the most relevant
-documents will appear first.
+Recoll is a full text search application. Full text search finds your data
+by content rather than by external attributes (like a file name). You
+specify words (terms) which should or should not appear in the text you
+are looking for, and receive in return a list of matching documents,
+ordered so that the most relevant documents will appear first.
 
 You do not need to remember in what file or email message you stored a
 given piece of information. You just ask for related terms, and the tool
 will return a list of documents where these terms are prominent, in a
 similar way to Internet search engines.
 
-A search application tries to determine which documents are most relevant
-to the search terms you provide. Computer algorithms for determining
-relevance can be very complex, and in general are inferior to the power of
-the human mind to rapidly determine relevance. The quality of relevance
-guessing is probably the most important aspect when evaluating a search
-application.
+Full text search applications try to determine which documents are most
+relevant to the search terms you provide. Computer algorithms for
+determining relevance can be very complex, and in general are inferior to
+the power of the human mind to rapidly determine relevance. The quality of
+relevance guessing is probably the most important aspect when evaluating a
+search application.
 
-In many cases, you are looking for all the forms of a word, not for a
-specific form or spelling. These different forms may include plurals,
-different tenses for a verb, or terms derived from the same root or stem
-(example: floor, floors, floored, flooring...). Search applications
-usually expand queries to all such related terms (words that reduce to the
-same stem) and also provide a way to disable this expansion if you are
-actually searching for a specific form.
+In many cases, you are looking for all the forms of a word, including
+plurals, different tenses for a verb, or terms derived from the same root
+or stem (example: floor, floors, floored, flooring...). Queries are
+usually automatically expanded to all such related terms (words that
+reduce to the same stem). This can be prevented for searching for a
+specific form.
 
 Stemming, by itself, does not accommodate for misspellings or phonetic
-searches. Recoll supports these features through a specific tool (the term
-explorer) which will let you explore the set of index terms along
-different modes.
+searches. A full text search application may also support this form of
+approximation. For example, a search for aliterattion returning no result
+may propose, depending on index contents, alliteration alteration
+alterations altercation as possible replacement terms.
 
 1.3. Recoll overview
 
 Recoll uses the Xapian information retrieval library as its storage and
 retrieval engine. Xapian is a very mature package using a sophisticated
-probabilistic ranking model. Recoll provides the mechanisms and interface
-to get data into and out of the system.
+probabilistic ranking model.
 
-In practice, Xapian works by remembering where terms appear in your
-document files. The acquisition process is called indexing.
+The Xapian library manages an index database which describes where terms
+appear in your document files. It efficiently processes the complex
+queries which are produced by the Recoll query expansion mechanism, and is
+in charge of the all-important relevance computation task.
 
-The resulting index can be big (roughly the size of the original document
+Recoll provides the mechanisms and interface to get data into and out of
+the index. This includes translating the many possible document formats
+into pure text, handling term variations (using Xapian stemmers), and
+spelling approximations (using the aspell speller), interpreting user
+queries and presenting results.
+
+In a shorter way, Recoll does the dirty footwork, Xapian deals with the
+intelligent parts of the process.
+
+The Xapian index can be big (roughly the size of the original document
 set), but it is not a document archive. Recoll can only display documents
 that still exist at the place from which they were indexed. (Actually,
 there is a way to reconstruct a document from the information in the
@@ -263,8 +274,10 @@ Chapter 1. Introduction
 capitalization are lost).
 
 Recoll stores all internal data in Unicode UTF-8 format, and it can index
-files with different character sets, encodings, and languages into the
-same index. It has can process many document types.
+files of many types with different character sets, encodings, and
+languages into the same index. It can process documents embedded inside
+other documents (for example a pdf document stored inside a Zip archive
+sent as an email attachment...), down to an arbitrary depth.
 
 Stemming is the process by which Recoll reduces words to their radicals so
 that searching does not depend, for example, on a word being singular or
@@ -318,13 +331,15 @@ Chapter 1. Introduction
 
 The indexing process is started automatically the first time you execute
 the recoll GUI. Indexing can also be performed by executing the
-recollindex command.
+recollindex command. Recoll indexing is multithreaded by default when
+appropriate hardware resources are available, and can perform in parallel
+multiple tasks among text extraction, segmentation and index updates.
 
 Searches are usually performed inside the recoll GUI, which has many
 options to help you find what you are looking for. However, there are
 other ways to perform Recoll searches: mostly a command line interface, a
-Python programming interface, a KDE KIO slave module, and a Ubuntu Unity
-Lens module.
+Python programming interface, a KDE KIO slave module, and Ubuntu Unity
+Lens (for older versions) or Scope (for current versions) modules.
 
 Chapter 2. Indexing
 
@@ -332,10 +347,10 @@ Chapter 2. Indexing
 
 Indexing is the process by which the set of documents is analyzed and the
 data entered into the database. Recoll indexing is normally incremental:
-documents will only be processed if they have been modified. On the first
-execution, all documents will need processing. A full index build can be
-forced later by specifying an option to the indexing command (recollindex
--z or -Z).
+documents will only be processed if they have been modified since the last
+run. On the first execution, all documents will need processing. A full
+index build can be forced later by specifying an option to the indexing
+command (recollindex -z or -Z).
 
 The following sections give an overview of different aspects of the
 indexing processes and configuration, with links to detailed sections.
@@ -1463,6 +1478,11 @@ Chapter 3. Searching
 cases where the exact search term is not known. For example, you may not
 remember the exact spelling, or only know the beginning of the name.
 
+The search will only propose replacement terms with spelling variations
+when no matching documents were found. In some cases, both proper spellings
+and misspellings are present in the index, and it may be interesting to
+look for them explicitly.
+
 The term explorer tool (started from the toolbar icon or from the Term
 explorer entry of the Tools menu) can be used to search the full index
 terms list. It has three modes of operations:
@@ -3302,8 +3322,8 @@ Chapter 5. Installation and configuration
 
 o Openoffice files need unzip and xsltproc.
 
-o PDF files need pdftotext which is part of the Xpdf or Poppler
-packages.
+o PDF files need pdftotext which is part of Poppler (usually comes with
+the poppler-utils package). Avoid the original one from Xpdf.
 
 o Postscript files need pstotext. The original version has an issue with
 shell character in file names, which is corrected in recent packages.
@@ -3320,9 +3340,10 @@ Chapter 5. Installation and configuration
 o Wordperfect files need wpd2html from the libwpd (or libwpd-tools on
 Ubuntu) package.
 
-o RTF files need unrtf, which, in its standard version, has much trouble
-with non-western character sets. Check
-http://www.recoll.org/features.html.
+o RTF files need unrtf, which, in its older versions, has much trouble
+with non-western character sets. Many Linux distributions carry
+outdated unrtf versions. Check http://www.recoll.org/features.html for
+details.
 
 o TeX files need untex or detex. Check
 http://www.recoll.org/features.html for sources if it's not packaged
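
As a rough companion to the helper lists touched in src/INSTALL and src/README, the sketch below checks whether the commands named there can be found on PATH. The `REQUIRED_HELPERS` table and the `missing_helpers` function are hypothetical names made up for this illustration and are not part of Recoll; only the command names come from the text. It is a presence check only: it cannot tell the Poppler pdftotext from the Xpdf one, nor detect an outdated unrtf.

```python
import shutil

# Helper commands named in the hunks above, keyed by document type.
# Each entry is a list of requirement groups; a group is satisfied when
# at least one of its alternative commands is found.
# (Hypothetical table for illustration; not a Recoll data structure.)
REQUIRED_HELPERS = {
    "OpenOffice": [("unzip",), ("xsltproc",)],
    "PDF": [("pdftotext",)],
    "PostScript": [("pstotext",)],
    "WordPerfect": [("wpd2html",)],
    "RTF": [("unrtf",)],
    "TeX": [("untex", "detex")],
}


def missing_helpers(which=shutil.which):
    """Return {document type: [unsatisfied requirement groups]}.

    `which` is injectable so the lookup can be replaced in tests.
    """
    missing = {}
    for doc_type, groups in REQUIRED_HELPERS.items():
        unmet = [g for g in groups if not any(which(cmd) for cmd in g)]
        if unmet:
            missing[doc_type] = unmet
    return missing


if __name__ == "__main__":
    for doc_type, groups in missing_helpers().items():
        wanted = ", ".join(" or ".join(g) for g in groups)
        print(f"{doc_type}: missing {wanted}")
```

Run as a script, it prints one line per document type whose helpers were not found, and prints nothing when every listed command is available.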