From 6bd88ca32fc0bd0356247b50a9921bd9b919c9b7 Mon Sep 17 00:00:00 2001 From: Jean-Francois Dockes Date: Mon, 15 Oct 2012 08:40:50 +0200 Subject: [PATCH] doc --- src/doc/user/usermanual.sgml | 447 ++++++++++++++++++++--------------- 1 file changed, 258 insertions(+), 189 deletions(-) diff --git a/src/doc/user/usermanual.sgml b/src/doc/user/usermanual.sgml index 93dcde7d..87a89b9f 100644 --- a/src/doc/user/usermanual.sgml +++ b/src/doc/user/usermanual.sgml @@ -64,8 +64,8 @@ Also be aware that you may need to install the appropriate supporting applications for document types that need them (for - example antiword for ms-word - files). + example antiword for + Microsoft Word files). @@ -83,7 +83,7 @@ You do not need to remember in what file or email message you stored a given piece of information. You just ask for related terms, and the tool will return a list of documents where - those terms are prominent, in a similar way to Internet search + these terms are prominent, in a similar way to Internet search engines. A search application tries to determine which documents are @@ -143,7 +143,7 @@ word being singular or plural (floor, floors), or on a verb tense (flooring, floored). Because the mechanisms used for stemming depend on the specific grammatical rules for each language, there - is a separate stemmer module for most common languages where + is a separate &XAP; stemmer module for most common languages where stemming makes sense. &RCL; stores the unstemmed versions of terms in the main index @@ -160,26 +160,27 @@ recognition, which means that the stemmer will sometimes be applied to terms from other languages with potentially strange results. 
In practise, even if this introduces possibilities of confusion, this - approach has been proven quite useful, and, awaiting the addition - of an automatic language recognition module to &RCL;, it is much - less cumbersome than separating your documents according to what + approach has been proven quite useful, and it is much less + cumbersome than separating your documents according to what language they are written in. - Before version 1.18, &RCL; always stripped most accents and + Before version 1.18, &RCL; stripped most accents and diacritics from terms, and converted them to lower case before - storing them in the index. As a consequence, it was impossible to - search for a particular capitalization of a term - (US / us), or to - discriminate two terms based on diacritics (sake - / saké, mate / - maté). + either storing them in the index or searching for them. As a + consequence, it was impossible to search for a particular + capitalization of a term (US / + us), or to discriminate two terms based on + diacritics (sake / saké, + mate / maté). As of version 1.18, &RCL; can optionally store the raw terms, - without accent stripping or case conversion. Expansions necessary - for searches insensitive to case and/or diacritics are then - performed when searching. This is described in more detail in the - section about index case - and diacritics sensitivity. + without accent stripping or case conversion. In this configuration, + it is still possible (and most common) for a query to be + insensitive to case and/or diacritics. Appropriate term expansions + are performed before actually accessing the main index. This is + described in more detail in the section about index case and + diacritics sensitivity. 
&RCL; has many parameters which define exactly what to index, and how to classify and decode the source @@ -197,7 +198,9 @@ sufficient for giving &RCL; a try, but you may want to adjust it later, which can be done either by editing the text files or by using configuration menus in the - recoll GUI + recoll GUI. Some other parameters affecting only + the recoll GUI are stored in the standard + location defined by Qt. The indexing process is started automatically the first time you @@ -241,7 +244,7 @@ aspects of the indexing processes and configuration, with links to detailed sections. - + Indexing modes &RCL; indexing can be performed along two different modes: @@ -279,20 +282,30 @@ directory). Monitoring a big file system tree can consume significant system resources. + The choice of method and the parameters used can be + configured from the recoll GUI: + + Preferences + Indexing schedule + + - + Configurations, multiple indexes The parameters describing what is to be indexed and local preferences are defined in text files contained in a configuration directory. + All parameters have defaults, defined in system-wide files. + Without further configuration, &RCL; will index all appropriate files from your home directory, with a reasonable set of defaults. + A default personal configuration directory ($HOME/.recoll/) is created when a &RCL; program is first executed. It is possible to @@ -308,14 +321,14 @@ would be done to separate personal and shared indexes, or to take advantage of the organization of your data to improve search precision. + The generated indexes can - be queried - concurrently in a transparent manner. + be queried concurrently in a transparent manner. For index generation, multiple configurations are totally independant from each other. When multiple indexes need to be used for a single search, - some parameters + some parameters should be consistent among the configurations. @@ -331,8 +344,8 @@ one document. 
Some file types, like email folders or zip archives, can hold many individually indexed documents, which may themselves be compound ones. Such hierarchies can go quite - deep, and &RCL; can process, for example, an - ms-word + deep, and &RCL; can process, for example, a + LibreOffice document stored as an attachment to an email message inside an email folder archived in a zip file... @@ -395,22 +408,23 @@ recoll the index in ~/.indexes-email/xapiandb/. - Using multiple configuration directories and - configuration - options allows you to tailor multiple configurations - and indexes to handle whatever subset of the available data - that you wish to make searchable. + Using multiple configuration directories and configuration + options allows you to tailor multiple configurations and + indexes to handle whatever subset of the available data you wish + to make searchable. - You can also specify a different storage - location for the index by setting the dbdir - parameter in the configuration file - (see the configuration - section). This method would mainly be of use if you - wanted to keep the configuration directory in its default location, - but desired another location for the index, typically out of - disk occupation concerns. + For a given configuration directory, you can + specify a non-default storage location for the index by setting + the dbdir parameter in the configuration file + (see the configuration + section). This method would mainly be of use if you wanted + to keep the configuration directory in its default location, but + desired another location for the index, typically out of disk + occupation concerns. @@ -437,7 +451,7 @@ recoll destroyed safely. - Xapian index formats + &XAP; index formats &XAP; versions usually support several formats for index storage. A given major &XAP; version will have a current format, @@ -490,8 +504,9 @@ recoll &RCL; configuration files control which areas of the file system are indexed, and how files are processed. 
These variables can be set either by - editing the text files or using the dialogs in the - recoll GUI. + editing the text files or by using the + dialogs in the + recoll GUI. The first time you start recoll, you will be asked whether or not you would like it to build the @@ -522,6 +537,61 @@ recoll described in the external packages section. + As of Recoll 1.18 there are two incompatible types of Recoll + indexes, depending on the treatment of character case and + diacritics. The next section describes the two types in more + detail. + + + Multiple indexes + + Multiple &RCL; indexes can be created by + using several configuration directories which are usually set to + index different areas of the file system. A specific index can + be selected for updating or searching, using the + RECOLL_CONFDIR environment variable or the + option to recoll and + recollindex. + + A typical usage scenario for the multiple index feature + would be for a system administrator to set up a central index + for shared data, that you choose to search or not in addition to + your personal data. Of course, there are other + possibilities. There are many cases where you know the subset of + files that should be searched, and where narrowing the search + can improve the results. You can achieve approximately the same + effect with the directory filter in advanced search, but + multiple indexes will have much better performance and may be + worth the trouble. + + A recollindex program instance can only + update one specific index. + + The main index (defined by + RECOLL_CONFDIR or ) is + always active. If this is undesirable, you can set up your + base configuration to index an empty directory. + + The different search interfaces (GUI, command line, ...) + have different methods to define the set of indexes to be + used, see the appropriate section. + + If a set of multiple indexes are to be used together for + searches, some configuration parameters must be consistent + among the set. 
These are parameters which need to be the same + when indexing and searching. As the parameters come from the + main configuration when searching, they need to be compatible + with what was set when creating the other indexes (which came + from their respective configuration directories). + + Most importantly, all indexes to be queried concurrently must + have the same option concerning character case and diacritics + stripping, but there are other constraints. Most of the + relevant parameters are described in the + linked + section. + + @@ -562,7 +632,7 @@ recoll As a cost for added capability, a raw index will be slightly bigger than a stripped one (around 10%). Also, searches will be more complex, so probably slightly slower, and the feature is - still young, and a certain amount of weirdness cannot be + still young, so that a certain amount of weirdness cannot be excluded. @@ -709,7 +779,7 @@ recoll described here. Option will reset the index when starting. This is almost the same as destroying the index - files (the nuance is that the Xapian format version will not + files (the nuance is that the &XAP; format version will not be changed). Option will force the update of all documents without resetting the index first. This will not @@ -905,8 +975,8 @@ fvwm Advanced search (a panel accessed through the Tools menu or the toolbox bar icon) has multiple entry fields, which you may use to build a logical - condition, with additional filtering on file type and location - in the file system. + condition, with additional filtering on file type, location + in the file system, modification date, and size. @@ -955,60 +1025,53 @@ fvwm described in a separate section. - File name will specifically look for file - names. 
The entry will be split at white space characters, - and each fragment will be separately expanded, then the search will - be for file names matching all fragments (this is new in 1.15, - older releases did an OR of the whole thing which did not make - sense). Things to know: - - The search is case- and accent-insensitive. - - Fragments without any wild card - character and not capitalized will be prepended and appended - with '*' (ie: etc -> - *etc*, but - Etc -> - etc). Of course it does not make - sense to have multiple fragments if one of them is capitalized - (as this one will require an exact match). - - If you want to search for a pattern including - white space, use double quotes (ie: "admin - note*"). - - If you have a big index (many files), - excessively generic fragments may result in inefficient - searches. - - As an example, inst - recoll would match - recollinstall.in (and quite a few - others...). - - - The point of having a separate file name - search is that wild card expansion can be performed more - efficiently on a relatively small subset of the index (allowing - wild cards on the left of terms without excessive penality). - All search modes allow wildcards inside terms (*, ?, []). You may want to have a look at the section about wildcards for more information about this. + File name will specifically look for file + names. The point of having a separate file name + search is that wild card expansion can be performed more + efficiently on a small subset of the index (allowing + wild cards on the left of terms without excessive penality). + Things to know: + + White space in the entry should match white + space in the file name, and is not treated specially. + + The search is insensitive to character case and + accents, independantly of the type of index. + + An entry without any wild card + character and not capitalized will be prepended and appended + with '*' (ie: etc -> + *etc*, but + Etc -> + etc). 
+ + If you have a big index (many files), + excessively generic fragments may result in inefficient + searches. + + + + You can search for exact phrases (adjacent words in a given order) by enclosing the input inside double quotes. Ex: "virtual reality". - Character case has no influence on search, except that you - can disable stem expansion for any term by capitalizing it. Ie: - a search for floor will also normally look for - flooring, floored, etc., but - a search for Floor will only look for - floor, in any character case. Stemming can - also be disabled globally in the preferences. + When using a stripped index, character case has no influence on + search, except that you can disable stem expansion for any term by + capitalizing it. Ie: a search for floor will also + normally look for flooring, + floored, etc., but a search for + Floor will only look for floor, + in any character case. Stemming can also be disabled globally in the + preferences. When using a raw index, the rules are a bit more + complicated. &RCL; remembers the last few searches that you performed. You can use the simple search text entry widget (a @@ -1050,10 +1113,7 @@ fvwm By default, the document list is presented in order of relevance (how well the system estimates that the document matches the query). You can sort the result by ascending or - descending date by using the vertical arrows in the toolbar (the old - sort tool is gone after release 1.15, because the new result table has much better - capability). + descending date by using the vertical arrows in the toolbar. Clicking on the Preview link for an entry will open an @@ -1520,7 +1580,7 @@ fvwm of the string to search for (ie a wildcard expression like *coll), the expansion can take quite a long time because the full index term list will have to be - processed. The expansion is currently limited at 200 results for + processed. The expansion is currently limited at 10000 results for wildcards and regular expressions. 
Double-clicking on a term in the result list will insert @@ -1531,9 +1591,9 @@ fvwm - Multiple databases + Multiple indexes - See the section + See the section describing the use of multiple indexes for generalities. Only the aspects concerning the recoll GUI are described here. @@ -1627,7 +1687,7 @@ fvwm of the document container, not only of the text contents (so that ie, a text document with an image added will not be a duplicate of the text only). Duplicates hiding is controlled - by an entry in the Query configuration + by an entry in the GUI configuration dialog, and is off by default. @@ -1821,7 +1881,7 @@ fvwm Customizing the search interface You can customize some aspects of the search interface by using - the Query configuration entry in the + the GUI configuration entry in the Preferences menu. There are several tabs in the dialog, dealing with the @@ -1868,8 +1928,7 @@ fvwm version instead. - Use <PRE> tags instead of - <BR> to display plain text as HTML in preview: + Plain text to HTML line style: when displaying plain text inside the preview window, &RCL; tries to preserve some of the original text line breaks and indentation. It can either use PRE HTML tags, which will @@ -1877,7 +1936,9 @@ fvwm scrolling for long lines, or use BR tags to break at the original line breaks, which will let the editor introduce other line breaks according to the window width, but will - lose some of the original indentation. + lose some of the original indentation. The third option has + been available in recent releases and is probably now the best + one: use PRE tags with line wrapping. Use desktop preferences to choose @@ -1895,7 +1956,9 @@ fvwm that will still be opened according to &RCL; preferences. This is useful for passing parameters like page numbers or search strings to applications that support them - (e.g. evince). + (e.g. evince). This cannot be done + with xdg-open which only supports passing + one parameter. 
Choose editor applications @@ -1917,9 +1980,8 @@ fvwm Start with advanced search dialog open - and Start with sort dialog - open: If you use these dialogs all the time, checking - these entries will get them to open when recoll starts. + : If you use this dialog frequently, checking + the entries will get it to open when recoll starts. Remember sort activation @@ -1957,9 +2019,9 @@ fvwm - Edit result page html header insert: + Edit result page HTML header insert: allows you to define text inserted at the end of the result - page html header. + page HTML header. More detail in the result list customisation section. @@ -2026,11 +2088,10 @@ fvwm Dynamically build abstracts: this decides if &RCL; tries to build - document abstracts when displaying the result list. Abstracts - are constructed by taking context from the document - information, around the search terms. This can slow down - result list display significantly for big documents, and you - may want to turn it off. + document abstracts (lists of snippets) + when displaying the result list. Abstracts are constructed by + taking context from the document information, around the search + terms. Synthetic abstract size: @@ -2081,12 +2142,12 @@ fvwm by adjusting two elements: The paragraph format - Html code inside the header + HTML code inside the header section These can be edited from the Result list - tab of the Query configuration. + tab of the GUI configuration. Newer versions of Recoll (from 1.17) use a WebKit HTML object by default (this may be disabled at build time), and @@ -2115,10 +2176,6 @@ fvwm %DDate - %EPrecooked Snippets - link (will only appear for documents indexed with page - numbers) - %IIcon image name. This is normally determined from the mime type. 
The associations are defined inside the @@ -2131,8 +2188,8 @@ fvwm %KKeywords (if any) - %LPrecooked Preview and - Edit links + %LPrecooked Preview, + Edit, and possibly Snippets links %MMime type @@ -2156,10 +2213,11 @@ fvwm - The format of the Preview and Edit links is - <a href="P%N"> - and + The format of the Preview, Edit, and Snippets links is + <a href="P%N">, <a href="E%N"> + and + <a href="A%N"> where docnum (%N) expands to the document number inside the result page). @@ -2377,7 +2435,7 @@ text/html [file:///Users/uncrypted-dockes/projets/bateaux/ilur/factEtCie/r capabilities as the complex search interface in the GUI. - The language is roughly based on the (seemingly defunct) + The language is based on the (seemingly defunct) Xesam user search language specification. @@ -2405,13 +2463,15 @@ text/html [file:///Users/uncrypted-dockes/projets/bateaux/ilur/factEtCie/r potatoes (in any part of the document). An element is composed of an optional field specification, - and a value, separated by a colon. Example: - Beatles, + and a value, separated by a colon (the field separator is the last + colon in the element). Example: + Eugenie, author:balzac, dc:title:grandet The colon, if present, means "contains". Xesam defines other - relations, which are not supported for now. + relations, which are mostly supported for now (except in special + cases, described further down). All elements in the search entry are normally combined with an implicit AND. It is possible to specify that elements be @@ -2429,8 +2489,8 @@ text/html [file:///Users/uncrypted-dockes/projets/bateaux/ilur/factEtCie/r not (word1 AND word2) OR - word3. Do not enter explicit - parenthesis, they are not supported for now. + word3. Explicit + parenthesis are not supported. An element preceded by a - specifies a term that should not appear. 
Pure negative @@ -2777,6 +2837,11 @@ dir:recoll dir:src -dir:utils -dir:common a word can make for a slow search because &RCL; will have to scan the whole index term list to find the matches. + When working with a raw index (preserving + character case and diacritics), the literal part of a wildcard + expression will be matched exactly for case and + diacritics. + Using a * at the end of a word can produce more matches than you would think, and strange search results. You can use the at the beginning of the text would be a match for "^my term"o5. - + Anchored searches can be very useful for searches inside + somewhat structured documents like scientific articles, in case + explicit metadata has not been supplied (a most frequent case), for + example for looking for matches inside the abstract or the list of + authors (which occur at the top of the document). + + + @@ -2892,61 +2964,13 @@ dir:recoll dir:src -dir:utils -dir:common - - - Multiple databases - - Multiple &RCL; databases or indexes can be created by - using several configuration directories which are usually set to - index different areas of the file system. A specific index can - be selected for updating or searching, using the - RECOLL_CONFDIR environment variable or the - option to recoll and - recollindex. - - A typical usage scenario for the multiple index feature - would be for a system administrator to set up a central index - for shared data, that you choose to search or not in addition to - your personal data. Of course, there are other - possibilities. There are many cases where you know the subset of - files that should be searched, and where narrowing the search - can improve the results. You can achieve approximately the same - effect with the directory filter in advanced search, but - multiple indexes will have much better performance and may be - worth the trouble. - - A recollindex program instance can only - update one specific index. 
- - The main index (defined by - RECOLL_CONFDIR or ) is - always active. If this is undesirable, you can set up your - base configuration to index an empty directory. - - The different search interfaces (GUI, command line, ...) - have different methods to define the set of indexes to be - used, see the appropriate section. - - If a set of multiple indexes are to be used together for - searches, some configuration parameters must be consistent - among the set. These are parameters which need to be the same - when indexing and searching. As the parameters come from the - main configuration when searching, they need to be compatible - with what was set when creating the other indexes (which came - from their respective configuration directories. Most of the - relevant parameters are described in the following - linked - section. - - - Programming interface - &RCL; has an Application programming Interface, usable both + &RCL; has an Application Programming Interface, usable both for indexing and searching, currently accessible from the Python language. @@ -2972,8 +2996,8 @@ dir:recoll dir:src -dir:utils -dir:common Simple filters (the old ones) run once and exit. They can be bare programs like antiword, or shell-scripts using other - programs. They are very simple to write, just having to write the - text to the standard output. + programs. They are very simple to write, because they just need + to output the converted text to the standard output. Multiple filters, new in 1.13, run as long as their master process (ie: recollindex) is active. They can @@ -3008,12 +3032,12 @@ dir:recoll dir:src -dir:utils -dir:common source file name. They should output the result to stdout. When writing a filter, you should decide if it will output - plain text or html. Plain text is simpler, but you will not be able + plain text or HTML. Plain text is simpler, but you will not be able to add metadata or vary the output character encoding (this will be defined in a configuration file). 
Additionally, some formatting may - easier to preserve when previewing html. Actually the deciding factor + be easier to preserve when previewing HTML. Actually the deciding factor is metadata: &RCL; has a way to - extract metadata from the html header and use it for field + extract metadata from the HTML header and use it for field searches.. The RECOLL_FILTER_FORPREVIEW environment @@ -3121,7 +3145,7 @@ application/x-chm = execm rclchm should be transformed into "&lt;". This is not always properly done by translating programs which output HTML, and of - course nerver by those which output plain text. + course never by those which output plain text. The character set needs to be specified in the header. It does not need to be UTF-8 (&RCL; will take care @@ -3197,11 +3221,51 @@ application/x-chm = execm rclchm other aspects of fields handling is defined inside the fields configuration file. + The sequence of events for field processing is as follows: + + During indexing, + recollindex scans all meta + fields in HTML documents (most document types are transformed + into HTML at some point). It compares the name for each element + to the configuration defining what should be done with fields + (the fields file) + + If the name for the meta + element matches one for a field that should be indexed, the + contents are processed and the terms are entered into the index + with the prefix defined in the fields + file. + + If the name for the meta element + matches one for a field that should be stored, the content of the + element is stored with the document data record, from which it + can be extracted and displayed at query time. + + At query time, if a field search is performed, the + index prefix is computed and the match is only performed against + appropriately prefixed terms in the index. + + At query time, the field can be displayed inside + the result list by using the appropriate directive in the + definition of the result list paragraph + format. 
All fields are displayed on the fields screen of + the preview window (which you can reach through the right-click + menu). This is independant of the fact that the search which + produced the results used the field or not. + + + You can find more information in the section about the fields file, or in comments inside the file. + You can also have a look at the example + on the Wiki, detailing + how one could add a page count field to pdf + documents for displaying inside result lists. @@ -3276,8 +3340,7 @@ application/x-chm = execm rclchm &RCL; versions after 1.11 define a Python programming interface, both for searching and indexing. - The Python interface is not built by default and can be - found in the source package, + The Python interface can be found in the source package, under python/recoll. In order to build the module, you should first build or re-build the Recoll library using position-independant @@ -4389,6 +4452,12 @@ unac_except_trans = character, you could very well have something like üue in the list. + The default value set for + unac_except_trans can't be listed here + because I have trouble with SGML and UTF-8, but it only + contains ligature decompositions: german ss, oe, ae, fi, + fl. + This parameter can't be defined for subdirectories, it is global, because there is no way to do otherwise when querying. If you have document sets which would need different