diff --git a/src/doc/user/usermanual.xml b/src/doc/user/usermanual.xml
index 4b9e3688..3dcf8452 100644
--- a/src/doc/user/usermanual.xml
+++ b/src/doc/user/usermanual.xml
@@ -50,18 +50,23 @@
Giving it a try
- If you do not like reading manuals (who does?) and would like
- to give &RCL; a try, just install the application and
- start the recoll graphical user interface (GUI),
- which will ask to index your home directory by default, allowing
- you to search immediately after indexing completes.
+ If you do not like reading manuals (who does?) but
+ wish to give &RCL; a try, just install the application
+ and start the recoll graphical user
+ interface (GUI), which will ask permission to index your home
+ directory by default, allowing you to search immediately after
+ indexing completes.
Do not do this if your home directory contains a huge
number of documents and you do not want to wait or are very
short on disk space. In this case, you may first want to customize
the configuration
- to restrict the indexed area.
+ to restrict the indexed area (for the very impatient with a completed package install, from the recoll GUI:
+ Preferences
+ Indexing configuration
+ , then adjust the Top
+ directories section).
Also be aware that you may need to install the
appropriate supporting
@@ -74,12 +79,12 @@
Full text search
&RCL; is a full text search application. Full text search
- applications let you find your data by content rather
- than by external attributes (like a file name). More
- specifically, they will let you specify words (terms) that
- should or should not appear in the text you are looking for,
- and return a list of matching documents, ordered so that the
- most relevant documents will appear
+ finds your data by content rather than by external attributes
+ (like a file name). You specify words
+ (terms) which should or should not appear in the text you are
+ looking for, and receive in return a list of matching
+ documents, ordered so that the most
+ relevant documents will appear
first.
You do not need to remember in what file or email message you
@@ -88,27 +93,30 @@
these terms are prominent, in a similar way to Internet search
engines.
- A search application tries to determine which documents are
- most relevant to the search terms you provide. Computer algorithms
- for determining relevance can be very complex, and in general are
- inferior to the power of the human mind to rapidly determine
- relevance. The quality of relevance guessing is probably the most
- important aspect when evaluating a search application.
+ Full text search applications try to determine which
+ documents are most relevant to the search terms you
+ provide. Computer algorithms for determining relevance can be
+ very complex, and in general are inferior to the power of the
+ human mind to rapidly determine relevance. The quality of
+ relevance guessing is probably the most important aspect when
+ evaluating a search application.
- In many cases, you are looking for all the forms of a
- word, not for a specific form or spelling. These different forms
- may include plurals, different tenses for a verb, or terms derived
- from the same root or stem (example: floor,
- floors, floored, flooring...). Search applications usually expand
- queries to all such related terms (words that reduce to the same
- stem) and also provide a way to disable this expansion if you are
- actually searching for a specific form.
-
- Stemming, by itself, does not accommodate for misspellings or
- phonetic searches. &RCL; supports these features through a specific
- tool (the term explorer) which will let you
- explore the set of index terms along different modes.
+ In many cases, you are looking for all the forms of a
+ word, including plurals, different tenses for a verb, or terms
+ derived from the same root or stem
+ (example: floor, floors, floored,
+ flooring...). Queries are usually automatically
+ expanded to all such related terms (words that reduce to the
+ same stem). This expansion can be disabled when searching for a
+ specific form.
+ Stemming, by itself, does not accommodate misspellings
+ or phonetic searches. A full text search application may also
+ support this form of approximation. For example, a search for
+ aliterattion returning no result may
+ propose, depending on index contents, alliteration,
+ alteration, alterations, altercation as possible
+ replacement terms.
@@ -120,14 +128,25 @@
library as its storage and retrieval engine. &XAP; is a very
mature package using a sophisticated
- probabilistic ranking model. &RCL; provides the mechanisms
- and interface to get data into and out of the system.
+ probabilistic ranking model.
+
+ The &XAP; library manages an index database which
+ describes where terms appear in your document files. It
+ efficiently processes the complex queries which are produced by
+ the &RCL; query expansion mechanism, and is in charge of the
+ all-important relevance computation task.
- In practice, &XAP; works by remembering where terms appear
- in your document files. The acquisition process is called
- indexing.
+ &RCL; provides the mechanisms and interface to get data
+ into and out of the index. This includes translating the many
+ possible document formats into pure text, handling term
+ variations (using &XAP; stemmers), and spelling approximations
+ (using the aspell speller),
+ interpreting user queries and presenting results.
- The resulting index can be big (roughly the size of the
+ In short, &RCL; does the dirty footwork, while &XAP;
+ deals with the intelligent parts of the process.
+
+ The &XAP; index can be big (roughly the size of the
original document set), but it is not a document
archive. &RCL; can only display documents that still exist at
the place from which they were indexed. (Actually, there is a
@@ -136,9 +155,12 @@
punctuation and capitalization are lost).
&RCL; stores all internal data in Unicode
- UTF-8 format, and it can index files with
- different character sets, encodings, and languages into the same
- index. It has can process many document types.
+ UTF-8 format, and it can index files of many types
+ with different character sets, encodings, and languages into the
+ same index. It can process documents embedded inside other
+ documents (for example a PDF document stored inside a Zip
+ archive sent as an email attachment...), down to an arbitrary
+ depth.
Stemming is the process by which &RCL; reduces words to
their radicals so that searching does not depend, for example, on a
@@ -206,9 +228,12 @@
The indexing
process is started automatically the first time you
- execute the recoll GUI. Indexing can also be
- performed by executing the recollindex
- command.
+ execute the recoll GUI. Indexing can also
+ be performed by executing the recollindex
+ command. &RCL; indexing is multithreaded by default when
+ appropriate hardware resources are available, and can perform
+ several tasks in parallel among text extraction, segmentation
+ and index updates.
Searches are usually
performed inside the recoll GUI, which has many
@@ -220,7 +245,10 @@
Python
programming interface, a
KDE KIO slave module, and
- a Ubuntu Unity Lens module.
+ Ubuntu Unity
+ Lens (for older versions) or
+
+ Scope (for current versions) modules.
@@ -236,11 +264,11 @@
Indexing is the process by which the set of documents is
analyzed and the data entered into the database. &RCL;
indexing is normally incremental: documents will only be
- processed if they have been modified. On the first execution,
- all documents will need processing. A full index build can be
- forced later by specifying an option to the indexing command
- (recollindex
- or ).
+ processed if they have been modified since the last run. On
+ the first execution, all documents will need processing. A
+ full index build can be forced later by specifying an option
+ to the indexing command (recollindex
+ or ).
The following sections give an overview of different
aspects of the indexing processes and configuration, with links
@@ -1853,6 +1881,11 @@ MimeType=*/*
term is not known. For example, you may not remember the exact
spelling, or only know the beginning of the name.
+ The search will only propose replacement terms with
+ spelling variations when no matching documents were found. In some
+ cases, both proper spellings and misspellings are present in the
+ index, and it may be interesting to look for them explicitly.
+
The term explorer tool (started from the toolbar icon or
from the Term explorer entry of the
Tools menu) can be used to search the full index
@@ -4636,9 +4669,11 @@ except:
Openoffice files need unzip and
xsltproc.
- PDF files need pdftotext which
- is part of the Xpdf or
- Poppler packages.
+ PDF files need pdftotext
+ which is part of Poppler (it usually
+ comes with the poppler-utils
+ package). Avoid the original one from
+ Xpdf.
Postscript files need pstotext.
The original version has an issue with shell
@@ -4663,9 +4698,11 @@ except:
libwpd-tools on Ubuntu)
package.
- RTF files need unrtf, which, in
- its standard version, has much trouble with non-western character
- sets. Check &RCLAPPS;.
+ RTF files need unrtf,
+ which, in its older versions, has much trouble with
+ non-western character sets. Many Linux distributions carry
+ outdated unrtf versions. Check
+ &RCLAPPS; for details.
TeX files need untex or
detex. Check &RCLAPPS; for sources if it's not