diff --git a/src/doc/user/usermanual.xml b/src/doc/user/usermanual.xml index 4b9e3688..3dcf8452 100644 --- a/src/doc/user/usermanual.xml +++ b/src/doc/user/usermanual.xml @@ -50,18 +50,23 @@ Giving it a try - If you do not like reading manuals (who does?) and would like - to give &RCL; a try, just install the application and - start the recoll graphical user interface (GUI), - which will ask to index your home directory by default, allowing - you to search immediately after indexing completes. + If you do not like reading manuals (who does?) but + wish to give &RCL; a try, just install the application + and start the recoll graphical user + interface (GUI), which will ask permission to index your home + directory by default, allowing you to search immediately after + indexing completes. Do not do this if your home directory contains a huge number of documents and you do not want to wait or are very short on disk space. In this case, you may first want to customize the configuration - to restrict the indexed area. + to restrict the indexed area (for the very impatient with a completed package install, from the recoll GUI: + Preferences + Indexing configuration + , then adjust the Top + directories section). Also be aware that you may need to install the appropriate supporting @@ -74,12 +79,12 @@ Full text search &RCL; is a full text search application. Full text search - applications let you find your data by content rather - than by external attributes (like a file name). More - specifically, they will let you specify words (terms) that - should or should not appear in the text you are looking for, - and return a list of matching documents, ordered so that the - most relevant documents will appear + finds your data by content rather than by external attributes + (like a file name). 
You specify words + (terms) which should or should not appear in the text you are + looking for, and receive in return a list of matching + documents, ordered so that the most + relevant documents will appear first. You do not need to remember in what file or email message you @@ -88,27 +93,30 @@ these terms are prominent, in a similar way to Internet search engines. - A search application tries to determine which documents are - most relevant to the search terms you provide. Computer algorithms - for determining relevance can be very complex, and in general are - inferior to the power of the human mind to rapidly determine - relevance. The quality of relevance guessing is probably the most - important aspect when evaluating a search application. + Full text search applications try to determine which + documents are most relevant to the search terms you + provide. Computer algorithms for determining relevance can be + very complex, and in general are inferior to the power of the + human mind to rapidly determine relevance. The quality of + relevance guessing is probably the most important aspect when + evaluating a search application. - In many cases, you are looking for all the forms of a - word, not for a specific form or spelling. These different forms - may include plurals, different tenses for a verb, or terms derived - from the same root or stem (example: floor, - floors, floored, flooring...). Search applications usually expand - queries to all such related terms (words that reduce to the same - stem) and also provide a way to disable this expansion if you are - actually searching for a specific form. - - Stemming, by itself, does not accommodate for misspellings or - phonetic searches. &RCL; supports these features through a specific - tool (the term explorer) which will let you - explore the set of index terms along different modes. 
+ In many cases, you are looking for all the forms of a + word, including plurals, different tenses for a verb, or terms + derived from the same root or stem + (example: floor, floors, floored, + flooring...). Queries are usually automatically + expanded to all such related terms (words that reduce to the + same stem). This can be disabled when searching for a specific + form. + Stemming, by itself, does not accommodate misspellings + or phonetic searches. A full text search application may also + support this form of approximation. For example, a search for + aliterattion returning no result may + propose, depending on index contents, alliteration + alteration alterations altercation as possible + replacement terms. @@ -120,14 +128,25 @@ library as its storage and retrieval engine. &XAP; is a very mature package using a sophisticated - probabilistic ranking model. &RCL; provides the mechanisms - and interface to get data into and out of the system. + probabilistic ranking model. + + The &XAP; library manages an index database which + describes where terms appear in your document files. It + efficiently processes the complex queries which are produced by + the &RCL; query expansion mechanism, and is in charge of the + all-important relevance computation task. - In practice, &XAP; works by remembering where terms appear - in your document files. The acquisition process is called - indexing. + &RCL; provides the mechanisms and interface to get data + into and out of the index. This includes translating the many + possible document formats into pure text, handling term + variations (using &XAP; stemmers), and spelling approximations + (using the aspell speller), + interpreting user queries and presenting results. - The resulting index can be big (roughly the size of the + In short, &RCL; does the dirty footwork, and &XAP; + deals with the intelligent parts of the process. 
+ + The &XAP; index can be big (roughly the size of the original document set), but it is not a document archive. &RCL; can only display documents that still exist at the place from which they were indexed. (Actually, there is a @@ -136,9 +155,12 @@ punctuation and capitalization are lost). &RCL; stores all internal data in Unicode - UTF-8 format, and it can index files with - different character sets, encodings, and languages into the same - index. It has can process many document types. + UTF-8 format, and it can index files of many types + with different character sets, encodings, and languages into the + same index. It can process documents embedded inside other + documents (for example a pdf document stored inside a Zip + archive sent as an email attachment...), down to an arbitrary + depth. Stemming is the process by which &RCL; reduces words to their radicals so that searching does not depend, for example, on a @@ -206,9 +228,12 @@ The indexing process is started automatically the first time you - execute the recoll GUI. Indexing can also be - performed by executing the recollindex - command. + execute the recoll GUI. Indexing can also + be performed by executing the recollindex + command. &RCL; indexing is multithreaded by default when + appropriate hardware resources are available, and can perform + in parallel multiple tasks among text extraction, segmentation + and index updates. Searches are usually performed inside the recoll GUI, which has many @@ -220,7 +245,10 @@ Python programming interface, a KDE KIO slave module, and - a Ubuntu Unity Lens module. + Ubuntu Unity + Lens (for older versions) or + + Scope (for current versions) modules. @@ -236,11 +264,11 @@ Indexing is the process by which the set of documents is analyzed and the data entered into the database. &RCL; indexing is normally incremental: documents will only be - processed if they have been modified. On the first execution, - all documents will need processing. 
A full index build can be - forced later by specifying an option to the indexing command - (recollindex - or ). + processed if they have been modified since the last run. On + the first execution, all documents will need processing. A + full index build can be forced later by specifying an option + to the indexing command (recollindex + or ). The following sections give an overview of different aspects of the indexing processes and configuration, with links @@ -1853,6 +1881,11 @@ MimeType=*/* term is not known. For example, you may not remember the exact spelling, or only know the beginning of the name. + The search will only propose replacement terms with + spelling variations when no matching documents were found. In some + cases, both proper spellings and misspellings are present in the + index, and it may be useful to look for them explicitly. + The term explorer tool (started from the toolbar icon or from the Term explorer entry of the Tools menu) can be used to search the full index @@ -4636,9 +4669,11 @@ except: Openoffice files need unzip and xsltproc. - PDF files need pdftotext which - is part of the Xpdf or - Poppler packages. + PDF files need pdftotext + which is part of Poppler (usually + comes with the poppler-utils + package). Avoid the original one from + Xpdf. Postscript files need pstotext. The original version has an issue with shell @@ -4663,9 +4698,11 @@ except: libwpd-tools on Ubuntu) package. - RTF files need unrtf, which, in - its standard version, has much trouble with non-western character - sets. Check &RCLAPPS;. + RTF files need unrtf, + which, in its older versions, has much trouble with + non-western character sets. Many Linux distributions carry + outdated unrtf versions. Check + &RCLAPPS; for details. TeX files need untex or detex. Check &RCLAPPS; for sources if it's not