release 3787
parent 9176cdd147
commit 88bccb47b3
11
src/INSTALL
@ -103,8 +103,8 @@ Chapter 5. Installation and configuration
|
|||||||
|
|
||||||
o Openoffice files need unzip and xsltproc.
|
o Openoffice files need unzip and xsltproc.
|
||||||
|
|
||||||
o PDF files need pdftotext which is part of the Xpdf or Poppler
|
o PDF files need pdftotext which is part of Poppler (usually comes with
|
||||||
packages.
|
the poppler-utils package). Avoid the original one from Xpdf.
|
||||||
|
|
||||||
o Postscript files need pstotext. The original version has an issue with
|
o Postscript files need pstotext. The original version has an issue with
|
||||||
shell character in file names, which is corrected in recent packages.
|
shell character in file names, which is corrected in recent packages.
|
||||||
@ -121,9 +121,10 @@ Chapter 5. Installation and configuration
|
|||||||
o Wordperfect files need wpd2html from the libwpd (or libwpd-tools on
|
o Wordperfect files need wpd2html from the libwpd (or libwpd-tools on
|
||||||
Ubuntu) package.
|
Ubuntu) package.
|
||||||
|
|
||||||
o RTF files need unrtf, which, in its standard version, has much trouble
|
o RTF files need unrtf, which, in its older versions, has much trouble
|
||||||
with non-western character sets. Check
|
with non-western character sets. Many Linux distributions carry
|
||||||
http://www.recoll.org/features.html.
|
outdated unrtf versions. Check http://www.recoll.org/features.html for
|
||||||
|
details.
|
||||||
|
|
||||||
o TeX files need untex or detex. Check
|
o TeX files need untex or detex. Check
|
||||||
http://www.recoll.org/features.html for sources if it's not packaged
|
http://www.recoll.org/features.html for sources if it's not packaged
|
||||||
|
|||||||
111
src/README
@ -197,15 +197,17 @@ Chapter 1. Introduction
|
|||||||
|
|
||||||
1.1. Giving it a try
|
1.1. Giving it a try
|
||||||
|
|
||||||
If you do not like reading manuals (who does?) and would like to give
|
If you do not like reading manuals (who does?) but wish to give Recoll a
|
||||||
Recoll a try, just install the application and start the recoll graphical
|
try, just install the application and start the recoll graphical user
|
||||||
user interface (GUI), which will ask to index your home directory by
|
interface (GUI), which will ask permission to index your home directory by
|
||||||
default, allowing you to search immediately after indexing completes.
|
default, allowing you to search immediately after indexing completes.
|
||||||
|
|
||||||
Do not do this if your home directory contains a huge number of documents
|
Do not do this if your home directory contains a huge number of documents
|
||||||
and you do not want to wait or are very short on disk space. In this case,
|
and you do not want to wait or are very short on disk space. In this case,
|
||||||
you may first want to customize the configuration to restrict the indexed
|
you may first want to customize the configuration to restrict the indexed
|
||||||
area.
|
area (for the very impatient with a completed package install, from the
|
||||||
|
recoll GUI: Preferences -> Indexing configuration, then adjust the Top
|
||||||
|
directories section).
|
||||||
|
|
||||||
Also be aware that you may need to install the appropriate supporting
|
Also be aware that you may need to install the appropriate supporting
|
||||||
applications for document types that need them (for example antiword for
|
applications for document types that need them (for example antiword for
|
||||||
@ -213,49 +215,58 @@ Chapter 1. Introduction
|
|||||||
|
|
||||||
1.2. Full text search
|
1.2. Full text search
|
||||||
|
|
||||||
Recoll is a full text search application. Full text search applications
|
Recoll is a full text search application. Full text search finds your data
|
||||||
let you find your data by content rather than by external attributes (like
|
by content rather than by external attributes (like a file name). You
|
||||||
a file name). More specifically, they will let you specify words (terms)
|
specify words (terms) which should or should not appear in the text you
|
||||||
that should or should not appear in the text you are looking for, and
|
are looking for, and receive in return a list of matching documents,
|
||||||
return a list of matching documents, ordered so that the most relevant
|
ordered so that the most relevant documents will appear first.
|
||||||
documents will appear first.
|
|
||||||
|
|
||||||
You do not need to remember in what file or email message you stored a
|
You do not need to remember in what file or email message you stored a
|
||||||
given piece of information. You just ask for related terms, and the tool
|
given piece of information. You just ask for related terms, and the tool
|
||||||
will return a list of documents where these terms are prominent, in a
|
will return a list of documents where these terms are prominent, in a
|
||||||
similar way to Internet search engines.
|
similar way to Internet search engines.
|
||||||
|
|
||||||
A search application tries to determine which documents are most relevant
|
Full text search applications try to determine which documents are most
|
||||||
to the search terms you provide. Computer algorithms for determining
|
relevant to the search terms you provide. Computer algorithms for
|
||||||
relevance can be very complex, and in general are inferior to the power of
|
determining relevance can be very complex, and in general are inferior to
|
||||||
the human mind to rapidly determine relevance. The quality of relevance
|
the power of the human mind to rapidly determine relevance. The quality of
|
||||||
guessing is probably the most important aspect when evaluating a search
|
relevance guessing is probably the most important aspect when evaluating a
|
||||||
application.
|
search application.
|
||||||
|
|
||||||
In many cases, you are looking for all the forms of a word, not for a
|
In many cases, you are looking for all the forms of a word, including
|
||||||
specific form or spelling. These different forms may include plurals,
|
plurals, different tenses for a verb, or terms derived from the same root
|
||||||
different tenses for a verb, or terms derived from the same root or stem
|
or stem (example: floor, floors, floored, flooring...). Queries are
|
||||||
(example: floor, floors, floored, flooring...). Search applications
|
usually automatically expanded to all such related terms (words that
|
||||||
usually expand queries to all such related terms (words that reduce to the
|
reduce to the same stem). This can be prevented for searching for a
|
||||||
same stem) and also provide a way to disable this expansion if you are
|
specific form.
|
||||||
actually searching for a specific form.
|
|
||||||
|
|
||||||
Stemming, by itself, does not accommodate for misspellings or phonetic
|
Stemming, by itself, does not accommodate for misspellings or phonetic
|
||||||
searches. Recoll supports these features through a specific tool (the term
|
searches. A full text search application may also support this form of
|
||||||
explorer) which will let you explore the set of index terms along
|
approximation. For example, a search for aliterattion returning no result
|
||||||
different modes.
|
may propose, depending on index contents, alliteration alteration
|
||||||
|
alterations altercation as possible replacement terms.
|
||||||
|
|
||||||
1.3. Recoll overview
|
1.3. Recoll overview
|
||||||
|
|
||||||
Recoll uses the Xapian information retrieval library as its storage and
|
Recoll uses the Xapian information retrieval library as its storage and
|
||||||
retrieval engine. Xapian is a very mature package using a sophisticated
|
retrieval engine. Xapian is a very mature package using a sophisticated
|
||||||
probabilistic ranking model. Recoll provides the mechanisms and interface
|
probabilistic ranking model.
|
||||||
to get data into and out of the system.
|
|
||||||
|
|
||||||
In practice, Xapian works by remembering where terms appear in your
|
The Xapian library manages an index database which describes where terms
|
||||||
document files. The acquisition process is called indexing.
|
appear in your document files. It efficiently processes the complex
|
||||||
|
queries which are produced by the Recoll query expansion mechanism, and is
|
||||||
|
in charge of the all-important relevance computation task.
|
||||||
|
|
||||||
The resulting index can be big (roughly the size of the original document
|
Recoll provides the mechanisms and interface to get data into and out of
|
||||||
|
the index. This includes translating the many possible document formats
|
||||||
|
into pure text, handling term variations (using Xapian stemmers), and
|
||||||
|
spelling approximations (using the aspell speller), interpreting user
|
||||||
|
queries and presenting results.
|
||||||
|
|
||||||
|
In a shorter way, Recoll does the dirty footwork, Xapian deals with the
|
||||||
|
intelligent parts of the process.
|
||||||
|
|
||||||
|
The Xapian index can be big (roughly the size of the original document
|
||||||
set), but it is not a document archive. Recoll can only display documents
|
set), but it is not a document archive. Recoll can only display documents
|
||||||
that still exist at the place from which they were indexed. (Actually,
|
that still exist at the place from which they were indexed. (Actually,
|
||||||
there is a way to reconstruct a document from the information in the
|
there is a way to reconstruct a document from the information in the
|
||||||
@ -263,8 +274,10 @@ Chapter 1. Introduction
|
|||||||
capitalization are lost).
|
capitalization are lost).
|
||||||
|
|
||||||
Recoll stores all internal data in Unicode UTF-8 format, and it can index
|
Recoll stores all internal data in Unicode UTF-8 format, and it can index
|
||||||
files with different character sets, encodings, and languages into the
|
files of many types with different character sets, encodings, and
|
||||||
same index. It has can process many document types.
|
languages into the same index. It can process documents embedded inside
|
||||||
|
other documents (for example a pdf document stored inside a Zip archive
|
||||||
|
sent as an email attachment...), down to an arbitrary depth.
|
||||||
|
|
||||||
Stemming is the process by which Recoll reduces words to their radicals so
|
Stemming is the process by which Recoll reduces words to their radicals so
|
||||||
that searching does not depend, for example, on a word being singular or
|
that searching does not depend, for example, on a word being singular or
|
||||||
@ -318,13 +331,15 @@ Chapter 1. Introduction
|
|||||||
|
|
||||||
The indexing process is started automatically the first time you execute
|
The indexing process is started automatically the first time you execute
|
||||||
the recoll GUI. Indexing can also be performed by executing the
|
the recoll GUI. Indexing can also be performed by executing the
|
||||||
recollindex command.
|
recollindex command. Recoll indexing is multithreaded by default when
|
||||||
|
appropriate hardware resources are available, and can perform in parallel
|
||||||
|
multiple tasks among text extraction, segmentation and index updates.
|
||||||
|
|
||||||
Searches are usually performed inside the recoll GUI, which has many
|
Searches are usually performed inside the recoll GUI, which has many
|
||||||
options to help you find what you are looking for. However, there are
|
options to help you find what you are looking for. However, there are
|
||||||
other ways to perform Recoll searches: mostly a command line interface, a
|
other ways to perform Recoll searches: mostly a command line interface, a
|
||||||
Python programming interface, a KDE KIO slave module, and a Ubuntu Unity
|
Python programming interface, a KDE KIO slave module, and Ubuntu Unity
|
||||||
Lens module.
|
Lens (for older versions) or Scope (for current versions) modules.
|
||||||
|
|
||||||
Chapter 2. Indexing
|
Chapter 2. Indexing
|
||||||
|
|
||||||
@ -332,10 +347,10 @@ Chapter 2. Indexing
|
|||||||
|
|
||||||
Indexing is the process by which the set of documents is analyzed and the
|
Indexing is the process by which the set of documents is analyzed and the
|
||||||
data entered into the database. Recoll indexing is normally incremental:
|
data entered into the database. Recoll indexing is normally incremental:
|
||||||
documents will only be processed if they have been modified. On the first
|
documents will only be processed if they have been modified since the last
|
||||||
execution, all documents will need processing. A full index build can be
|
run. On the first execution, all documents will need processing. A full
|
||||||
forced later by specifying an option to the indexing command (recollindex
|
index build can be forced later by specifying an option to the indexing
|
||||||
-z or -Z).
|
command (recollindex -z or -Z).
|
||||||
|
|
||||||
The following sections give an overview of different aspects of the
|
The following sections give an overview of different aspects of the
|
||||||
indexing processes and configuration, with links to detailed sections.
|
indexing processes and configuration, with links to detailed sections.
|
||||||
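The incremental behaviour described in this hunk (only documents modified since the last run are processed) can be sketched as follows. This is a minimal illustration and not Recoll's implementation: the function name files_to_reindex and the last_seen dictionary are hypothetical, and comparing modification times alone is an assumption made for the example.

```python
import os


def files_to_reindex(paths, last_seen):
    """Return the paths whose modification time differs from the recorded one.

    last_seen is a hypothetical store mapping path -> mtime seen at the
    previous run. It is updated in place so the next call sees current values.
    """
    changed = []
    for path in paths:
        mtime = os.stat(path).st_mtime
        if last_seen.get(path) != mtime:
            # New file, or modified since the last run: needs processing.
            changed.append(path)
            last_seen[path] = mtime
    return changed
```

On a first run last_seen is empty, so every path is returned, which matches the statement that all documents need processing on the first execution. Clearing the store would be the analogue of forcing a full index build.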
@ -1463,6 +1478,11 @@ Chapter 3. Searching
|
|||||||
cases where the exact search term is not known. For example, you may not
|
cases where the exact search term is not known. For example, you may not
|
||||||
remember the exact spelling, or only know the beginning of the name.
|
remember the exact spelling, or only know the beginning of the name.
|
||||||
|
|
||||||
|
The search will only propose replacement terms with spelling variations
|
||||||
|
when no matching documents were found. In some cases, both proper spellings
|
||||||
|
and misspellings are present in the index, and it may be interesting to
|
||||||
|
look for them explicitly.
|
||||||
|
|
||||||
The term explorer tool (started from the toolbar icon or from the Term
|
The term explorer tool (started from the toolbar icon or from the Term
|
||||||
explorer entry of the Tools menu) can be used to search the full index
|
explorer entry of the Tools menu) can be used to search the full index
|
||||||
terms list. It has three modes of operations:
|
terms list. It has three modes of operations:
|
||||||
@ -3302,8 +3322,8 @@ Chapter 5. Installation and configuration
|
|||||||
|
|
||||||
o Openoffice files need unzip and xsltproc.
|
o Openoffice files need unzip and xsltproc.
|
||||||
|
|
||||||
o PDF files need pdftotext which is part of the Xpdf or Poppler
|
o PDF files need pdftotext which is part of Poppler (usually comes with
|
||||||
packages.
|
the poppler-utils package). Avoid the original one from Xpdf.
|
||||||
|
|
||||||
o Postscript files need pstotext. The original version has an issue with
|
o Postscript files need pstotext. The original version has an issue with
|
||||||
shell character in file names, which is corrected in recent packages.
|
shell character in file names, which is corrected in recent packages.
|
||||||
@ -3320,9 +3340,10 @@ Chapter 5. Installation and configuration
|
|||||||
o Wordperfect files need wpd2html from the libwpd (or libwpd-tools on
|
o Wordperfect files need wpd2html from the libwpd (or libwpd-tools on
|
||||||
Ubuntu) package.
|
Ubuntu) package.
|
||||||
|
|
||||||
o RTF files need unrtf, which, in its standard version, has much trouble
|
o RTF files need unrtf, which, in its older versions, has much trouble
|
||||||
with non-western character sets. Check
|
with non-western character sets. Many Linux distributions carry
|
||||||
http://www.recoll.org/features.html.
|
outdated unrtf versions. Check http://www.recoll.org/features.html for
|
||||||
|
details.
|
||||||
|
|
||||||
o TeX files need untex or detex. Check
|
o TeX files need untex or detex. Check
|
||||||
http://www.recoll.org/features.html for sources if it's not packaged
|
http://www.recoll.org/features.html for sources if it's not packaged
|
||||||
|
|||||||
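Since the INSTALL hunks above list the external commands needed per document type, a quick way to see which ones are absent from a machine might look like the sketch below. The HELPERS table is an illustrative assumption assembled from the commands named in the text, not Recoll's own configuration, and the function name missing_helpers is hypothetical. TeX is left out because either untex or detex is acceptable.

```python
import shutil

# Commands named in the INSTALL text, keyed by document type.
# This mapping is an assumption for illustration only.
HELPERS = {
    "OpenOffice": ["unzip", "xsltproc"],
    "PDF": ["pdftotext"],
    "PostScript": ["pstotext"],
    "WordPerfect": ["wpd2html"],
    "RTF": ["unrtf"],
}


def missing_helpers(helpers=HELPERS, which=shutil.which):
    """Return {doc_type: [commands not found on PATH]} for incomplete types."""
    missing = {}
    for doc_type, commands in helpers.items():
        absent = [cmd for cmd in commands if which(cmd) is None]
        if absent:
            missing[doc_type] = absent
    return missing


if __name__ == "__main__":
    for doc_type, commands in missing_helpers().items():
        print(f"{doc_type}: missing {', '.join(commands)}")
```

Note that finding pdftotext or unrtf on the PATH does not tell you whether it is the Poppler build or a recent unrtf version, which is what the updated text actually recommends.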