From c469dd559516b830a51afc9e22c7b08002257dc9 Mon Sep 17 00:00:00 2001 From: dockes Date: Sat, 28 Nov 2009 08:11:28 +0000 Subject: [PATCH] clean-up + documented 1.13 new features --- src/doc/user/usermanual.sgml | 2205 +++++++++++++++++++--------------- 1 file changed, 1258 insertions(+), 947 deletions(-) diff --git a/src/doc/user/usermanual.sgml b/src/doc/user/usermanual.sgml index 0436c5a0..b4f29481 100644 --- a/src/doc/user/usermanual.sgml +++ b/src/doc/user/usermanual.sgml @@ -1,6 +1,6 @@ Recoll"> - + Xapian"> ]> @@ -188,28 +188,28 @@ - - Periodic indexing: - indexing takes place at discrete - times, by executing the recollindex - command. The typical usage is to have a nightly indexing run + + Periodic indexing: + indexing takes place at discrete + times, by executing the recollindex + command. The typical usage is to have a nightly indexing run programmed into your cron file. - - + + - - Real time indexing: - indexing takes place as soon as a file is created or - changed. recollindex runs as a daemon - and uses a file system alteration monitor such as - Fam, - Gamin or - inotify do detect file changes. + + Real time indexing: + indexing takes place as soon as a file is created or + changed. recollindex runs as a daemon + and uses a file system alteration monitor such as + Fam, + Gamin or + inotify do detect file changes. Monitoring a big directory tree can consume significant system resources. - - + + The choice between the two methods is mostly a matter of @@ -224,12 +224,12 @@ processing are set in configuration files Most file types, like HTML or word processing files, only hold - one document. Some file types, like mail folder files can hold + one document. Some file types, like mail folder files, can hold many individually indexed documents. &RCL; indexing processes plain text, HTML, openoffice - and e-mail files internally. + and e-mail files internally (a few more actually). Other file types (ie: postscript, pdf, ms-word, rtf ...) 
need external applications for preprocessing. The list is in the @@ -246,12 +246,23 @@ set of defaults. In some cases, it may be interesting to index different - areas of the file system to separate databases. You can do this - by using multiple configuration directories, each indexing a - file system area to a specific database. See the section about using multiple - databases for more information on multiple configurations - and indexes. + areas of the file system to separate databases. You can do this + by using multiple configuration directories, each indexing a + file system area to a specific database. See the + section about using multiple + databases for more information on multiple configurations + and indexes. + + In the rare case where the index becomes corrupted (which can + signal itself by weird search results or crashes), the index files + need to be erased before restarting a clean indexing pass. Just delete + the xapiandb directory (see + next section), or, + alternatively, start the next recollindex with the + -z option, which will reset the database before + indexing. + + @@ -265,8 +276,8 @@ - You can specify a different configuration - directory by setting the RECOLL_CONFDIR + You can specify a different configuration + directory by setting the RECOLL_CONFDIR environment variable, or using the -c option to the &RCL; commands. This method would typically be used to index different areas of the file system to @@ -287,21 +298,21 @@ recoll and indexes to handle whatever subset of the available data that you wish to make searchable. - - You can also specify a different storage - location for the index by setting the dbdir - parameter in the configuration file + + You can also specify a different storage + location for the index by setting the dbdir + parameter in the configuration file (see the configuration section). 
This method would mainly be of use if you wanted to keep the configuration directory in its default location, but desired another location for the index, typically out of disk occupation concerns. - + - + - The size of the index is determined by the size of the set - of documents, but the ratio can vary a lot. For a typical mixed + The size of the index is determined by the document set size, + but the ratio can vary a lot. For a typical mixed set of documents, the index size will often be close to the data set size. In specific cases (a set of compressed mbox files for example), the index can become much bigger than @@ -316,68 +327,68 @@ recoll total amount of data on the computer. The index data directory (xapiandb) - only contains data that can be completely rebuilt by an index - run, and it can always be destroyed safely. + only contains data that can be completely rebuilt by an index + run, and it can always be destroyed safely. - Xapian index formats + Xapian index formats - If your first installation of &RCL; was 1.9.0 or more - recent, you can skip this section. + If your first installation of &RCL; was 1.9.0 or more + recent, you can skip this section. - &XAP; has had two possible index formats for quite some - time. The "old" one named Quartz, and the - new one named Flint. &XAP; 0.9 used - Quartz by default, but could use - Flint if a specific environment variable - (XAPIAN_PREFER_FLINT) was set. &XAP; 1.0 - still supports Quartz but will use - Flint by default for new index - creations. + &XAP; has had two possible index formats for quite some + time. The "old" one named Quartz, and the + new one named Flint. &XAP; 0.9 used + Quartz by default, but could use + Flint if a specific environment variable + (XAPIAN_PREFER_FLINT) was set. &XAP; 1.0 + still supports Quartz but will use + Flint by default for new index + creations. 
- The number of disk accesses performed during indexing - has been much optimized in the new Flint - engine and you may see indexing times improved by 50% in some - cases (compared to Quartz), typically for - big indexes where disk accesses dominate the indexing - time. There is also a more modest improvement of index - size. + The number of disk accesses performed during indexing + has been much optimized in the new Flint + engine and you may see indexing times improved by 50% in some + cases (compared to Quartz), typically for + big indexes where disk accesses dominate the indexing + time. There is also a more modest improvement of index + size. - &XAP; will not convert automatically an existing index - from the Quartz to the - Flint format. If you have an older index - and want to take advantage of the new format (which can be - done without setting the environment variable as of &RCL; - 1.8.2 and &XAP; 1.0.0), you will have to explicitly delete - the old index, then run a normal indexing process. + &XAP; will not convert automatically an existing index + from the Quartz to the + Flint format. If you have an older index + and want to take advantage of the new format (which can be + done without setting the environment variable as of &RCL; + 1.8.2 and &XAP; 1.0.0), you will have to explicitly delete + the old index, then run a normal indexing process. - Unfortunately, using the -z option to - recollindex is not sufficient to change the - format, you have to delete all files inside the index - directory (typically ~/.recoll/xapiandb) - before starting indexing. + Unfortunately, using the -z option to + recollindex is not sufficient to change the + format, you have to delete all files inside the index + directory (typically ~/.recoll/xapiandb) + before starting indexing. - Security aspects + Security aspects - The &RCL; index does not hold copies of the indexed - documents. But it does hold enough data to allow for an almost - complete reconstruction. 
If confidential data is indexed, - access to the database directory should be restricted. + The &RCL; index does not hold copies of the indexed + documents. But it does hold enough data to allow for an almost + complete reconstruction. If confidential data is indexed, + access to the database directory should be restricted. - As of version 1.4, &RCL; will create the configuration - directory with a mode of 0700 (access by owner only). As the - index data directory is by default a sub-directory of the - configuration directory, this should result in appropriate - protection. + As of version 1.4, &RCL; will create the configuration + directory with a mode of 0700 (access by owner only). As the + index data directory is by default a sub-directory of the + configuration directory, this should result in appropriate + protection. - If you use another setup, you should think of the kind - of protection you need for your index, set the directory - and files access modes appropriately, and also maybe adjust - the umask used during index updates. - + If you use another setup, you should think of the kind + of protection you need for your index, set the directory + and files access modes appropriately, and also maybe adjust + the umask used during index updates. + @@ -399,11 +410,12 @@ recoll the organization of your data to improve search precision. The first time you start recoll, you - will be asked whether or not you would like recoll to build the + will be asked whether or not you would like it to build the index. If you want to adjust the configuration before indexing, - just click Cancel at this point. That way, - recoll will have created a ~/.recoll directory containing empty - configuration files. + just click Cancel at this point, which will get + you into the configuration interface. If you exit, + recoll will have created a ~/.recoll directory + containing empty configuration files, which you can edit by hand. 
The configuration is documented inside the installation chapter of this @@ -420,89 +432,115 @@ recoll packages section - The indexing configuration GUI + The indexing configuration GUI - Most parameters for a given indexing configuration can - be set from a recoll GUI running on this - configuration (either as default, or by setting - RECOLL_CONFDIR or the -c - option.) + Most parameters for a given indexing configuration can + be set from a recoll GUI running on this + configuration (either as default, or by setting + RECOLL_CONFDIR or the -c + option.) - The interface is started from the - Preferences menu. It has two main - panels. The first panel allows setting global variables, like - the list of top directories or the list of skipped paths. The - second panel allows setting variables that can be redefined - for subdirectories. This second panel has an initially empty list of - customisation directories, to which you can add. The variables - are then set for the currently selected directory (or at the top - level if the empty line is selected). + The interface is started from the + Preferences menu. It has two main + panels. The first panel allows setting global variables, like + the list of top directories or the list of skipped paths. The + second panel allows setting variables that can be redefined + for subdirectories. This second panel has an initially empty list of + customisation directories, to which you can add. The variables + are then set for the currently selected directory (or at the top + level if the empty line is selected). - The meaning for most entries in the interface is - self-evident and documented by a ToolTip - popup on the text label. For more detail, you will need to - refer to the configuration - section of this guide. + The meaning for most entries in the interface is + self-evident and documented by a ToolTip + popup on the text label. For more detail, you will need to + refer to the configuration + section of this guide. 
- The configuration tool normally respects the comments - and most of the formatting inside the configuration file, so - that it is quite possible to use it on hand-edited files, - which you might nevertheless want to backup first... + The configuration tool normally respects the comments + and most of the formatting inside the configuration file, so + that it is quite possible to use it on hand-edited files, + which you might nevertheless want to backup first... - + + Using Beagle WEB browser plugins + + Beagle is a concurrent desktop + indexer, built on Lucene and the Mono project (C#), for which a + number of add-on browser plugins were written. These work by + copying visited web pages to an indexing queue directory, which the + indexer then processes. + + If, for any reason, you so happen to prefer &RCL; to + Beagle, you can still use + the browser plugins (they are written in Javascript and completely + independant of C#, Beagle, Lucene...). &RCL; can process the + Beagle queue directory. Of course, this + supposes that Beagle is not running, + else both programs will fight for the same files. + + This feature can be enabled in the GUI indexing configuration + panel, or by editing the configuration file (set + processbeaglequeue to 1). + + Periodic indexing - Starting indexing + Starting indexing - Indexing is performed either by the - recollindex program, or by the - indexing thread inside the recoll - program (use the File menu). Both programs - will use the RECOLL_CONFDIR - variable or accept a -c - confdir option to specify a non-default - configuration directory. + Indexing is performed either by the + recollindex program, or by the + indexing thread inside the recoll + program (use the File menu). Both programs + will use the RECOLL_CONFDIR + variable or accept a -c + confdir option to specify a non-default + configuration directory. - If the recoll program finds no index - when it starts, it will automatically start indexing (except - if canceled). 
+ If the recoll program finds no index + when it starts, it will automatically start indexing (except + if canceled). - It is best to avoid interrupting the indexing process, as - this may sometimes leave the index in a bad state. This is - not a serious problem, as you then just need to delete - the index files and restart the indexing. The index files are - normally stored in the $HOME/.recoll/xapiandb - directory, which you can just delete if needed. Alternatively, - you can start recollindex with option - -z, which will reset the database before - indexing. + The indexing process can be interrupted by sending an + interrupt (^C, SIGINT) or terminate (SIGTERM) signal. Some time may + elapse before the process exits, because it needs to properly flush + and close the index. The indexing will restart at the + interruption point the next time (the full file tree will still be + traversed, but files that were indexed up to the interruption and + are still up to date will not need to be reindexed). + + After such an interruption, the index will be somewhat + inconsistent because some operations which are normally performed + at the end of the indexing pass will have been skipped (for + exemple, the stemming and spelling databases will be inexistant + or out of date). You just need to restart indexing at a later + time to restore consistency. - Using <command>cron</command> to automate - indexing + Using <command>cron</command> to automate + indexing - The most common way to set up indexing is to have a cron - task execute it every night. For example the following - crontab entry would do it every day at - 3:30AM (supposing recollindex is in your - PATH): + The most common way to set up indexing is to have a cron + task execute it every night. 
For example the following + crontab entry would do it every day at + 3:30AM (supposing recollindex is in your + PATH): - 30 3 * * * recollindex > /tmp/recolltrace 2>&1 + 30 3 * * * recollindex > /some/tmp/dir/recolltrace 2>&1 - The usual command to edit your - crontab is + The usual command to edit your + crontab is crontab -e (which will usually start - the vi editor to edit the file). You may - have more sophisticated tools available on your - system. + the vi editor to edit the file). You may + have more sophisticated tools available on your + system. @@ -557,22 +595,21 @@ fvwm There is a similar mechanism under Gnome (find the session control tool in the menus and use the "Startup programs" tab). - By default, the indexing daemon will write its messages to - a file inside the configuration directory (this is controlled - by the daemlogfilename - and daemloglevel configuration - parameters). You may want to change this. Also the log file - will only be truncated when the daemon starts. If the daemon - runs permanently, the log file may grow quite big, depending - on the log level. + By default, the messages from the indexing daemon will be + discarded. You may want to change this by setting the + daemlogfilename and + daemloglevel configuration parameters. Also the + log file will only be truncated when the daemon starts. If the + daemon runs permanently, the log file may grow quite big, depending + on the log level. While it is convenient that data is indexed in real time, - repeated indexing can generate a significant load on the - system when files such as email folders change. Also, - monitoring large file trees by itself significantly taxes - system resources. You probably do not want to enable it if - your system is short on resources. Periodic indexing is - adequate in most cases. + repeated indexing can generate a significant load on the + system when files such as email folders change. 
Also, + monitoring large file trees by itself significantly taxes + system resources. You probably do not want to enable it if + your system is short on resources. Periodic indexing is + adequate in most cases. @@ -588,22 +625,22 @@ fvwm recoll has two search modes: Simple search (the default, on the main screen) has - a single entry field where you can enter multiple words. + a single entry field where you can enter multiple words. Advanced search (a panel accessed through the - Tools menu or the toolbox bar icon) shas - multiple entry fields, which you may use to build a logical - condition, with additional filtering on file type and location - in the file system. + Tools menu or the toolbox bar icon) shas + multiple entry fields, which you may use to build a logical + condition, with additional filtering on file type and location + in the file system. In most cases, you can enter the terms as you - think them, even if they contain embedded punctuation or other - non-textual characters. For - exemple, &RCL; can handle things like e-mail addresses, or - arbitrary cut and paste from another text window, punctation - and all. + think them, even if they contain embedded punctuation or other + non-textual characters. For + exemple, &RCL; can handle things like e-mail addresses, or + arbitrary cut and paste from another text window, punctation + and all. The main case where you should enter text differently from how it is printed is for east-asian languages (Chinese, @@ -616,19 +653,19 @@ fvwm Simple search - Start the recoll program. - - Possibly choose a search mode: Any - term, All terms, + Start the recoll program. + + Possibly choose a search mode: Any + term, All terms, File name or - Query language. - - Enter search term(s) in the text field at the top of the + Query language. + + Enter search term(s) in the text field at the top of the window. - - Click the Search button or + + Click the Search button or hit the Enter key to start the search. 
- + The initial default search mode is All @@ -640,8 +677,8 @@ fvwm File name will specifically look for file names. The entry will be split at white space characters, and each pattern will be separately expanded. If you want - to search for a pattern including white space, you need - to use double quotes. The point of having a separate file name + to search for a pattern including white space, use + double quotes. The point of having a separate file name search is that wild card expansion can be performed more efficiently on a relatively small subset of the index. @@ -664,7 +701,7 @@ fvwm a search for floor will also normally look for flooring, floored, etc., but a search for Floor will only look for - floor, in any character case. Sstemming can + floor, in any character case. Stemming can also be disabled globally in the preferences. &RCL; remembers the last few searches that you @@ -681,14 +718,13 @@ fvwm Double-clicking on a word in the result list or a preview window will insert it into the simple search entry field. - Note that, apart from wildcard characters (single - ? characters are ok), you can cut and paste - any text into an All terms or - Any term search field, punctuation, - newlines and all. &RCL; will process it and produce a meaningful - search. This is what most differentiates this mode from the - Query Language mode, where you have to care - about the syntax. + You can cut and paste any text into an All + terms or Any term search field, + punctuation, newlines and all - except for wildcard characters + (single ? characters are ok). &RCL; will process + it and produce a meaningful search. This is what most differentiates + this mode from the Query Language mode, where + you have to care about the syntax. You can use the Tools / Advanced search @@ -719,12 +755,16 @@ fvwm Shift+ArrowUp/Down in the window). - Clicking the Edit link will attempt to - start an external editor. 
The editors can be configured through - the user preferences dialog, or by editing the - mimeview configuration file. + Clicking the Open link will attempt to + start an external viewer. The viewer for each document type can be + configured through the user preferences dialog, or by editing the + mimeview configuration file. You can also check + the Use desktop preferences option in the user + preferences dialog to use the desktop defaults for all + documents. This is probably the best option if you are using a well + configured Gnome or KDE desktop. - The Preview and Edit + The Preview and Open edit links may not be present for all entries, meaning that &RCL; has no configured way to preview a given file type (which was indexed by name only), or no configured external editor for @@ -737,7 +777,7 @@ fvwm The format of the result list entries is entirely configurable by using the preference dialog to - edit an HTML + edit an HTML fragment. You can click on the Query details link @@ -754,44 +794,47 @@ fvwm - The result list right-click menu + The result list right-click menu - Apart from the preview and edit links, you can display a + Apart from the preview and edit links, you can display a pop-up menu by right-clicking over a paragraph in the result list. This menu has the following entries: - - Preview - Edit - Copy File Name - Copy Url - Save to File - Find similar - Parent document - + + Preview + Edit + Copy File Name + Copy Url + Save to File + Find similar + Preview Parent + document + Open Parent + document + - The Preview and + The Preview and Edit entries do the same thing as the corresponding links. - The Copy File Name and - Copy Url copy the relevant data to the - clipboard, for later pasting. + The Copy File Name and + Copy Url copy the relevant data to the + clipboard, for later pasting. - Save to File allows saving the - contents of a result document to a chosen file. 
This entry - will only appear if the document does not correspond to an - existing file, but is a subdocument inside such a file (ie: an - email attachment). It is especially useful to extract attachments - with no associated editor. + Save to File allows saving the + contents of a result document to a chosen file. This entry + will only appear if the document does not correspond to an + existing file, but is a subdocument inside such a file (ie: an + email attachment). It is especially useful to extract attachments + with no associated editor. The Find similar entry will select a number of relevant term from the current document and enter them into the simple search field. You can then start a simple search, with a good chance of finding documents related to the - current result. + current result. - The Parent document entry will + The Parent document entries will appear for documents which are not actually files but are part of, or attached to, a higher level document. This entry is mainly useful for email attachments and permits viewing @@ -800,7 +843,9 @@ fvwm folder file, but that you can't actually visualize the folder (there will be an error dialog if you try). &RCL; is unfortunately not yet smart enough to disable the entry in - this case. + this case. In other cases, the Open option makes sense, for + exemple to start a chm viewer on the parent document for a help + page. @@ -854,11 +899,11 @@ fvwm associated to the document (ie: author, abtract, etc.). This is especially useful in cases where the term match did not occur in the main text but in one of the fields. - + You can print the current preview window contents by typing - ^P (Ctrl + P) in - the window text. - + ^P (Ctrl + P) in + the window text. + @@ -931,58 +976,58 @@ fvwm &RCL; currently manages the following default fields: - title, - subject or caption are - synonyms which specify data to be searched for in the - document title or subject. 
- - author or - from for searching the documents originators. - - recipient or - to for searching the documents recipients. - - keyword for searching the - document-specified keywords (few documents actually have any). - - filename for the document's - file name. - ext specifies the file - name extension (Ex: ext:html) - + title, + subject or caption are + synonyms which specify data to be searched for in the + document title or subject. + + author or + from for searching the documents originators. + + recipient or + to for searching the documents recipients. + + keyword for searching the + document-specified keywords (few documents actually have any). + + filename for the document's + file name. + ext specifies the file + name extension (Ex: ext:html) + The field syntax also supports a few field-like, but special, criteria: - dir for filtering the - results on file location (Ex: - dir:/home/me/somedir). Please note - that this is quite inefficient, that it may produce very - slow searches, and that it may be worth in some - cases to set up separate databases instead. - + dir for filtering the + results on file location (Ex: + dir:/home/me/somedir). Please note + that this is quite inefficient, that it may produce very + slow searches, and that it may be worth in some + cases to set up separate databases instead. + - mime or - format for specifying the - mime type. This one is quite special because you can specify - several values which will be OR'ed (the normal default for the - language is AND). Ex: mime:text/plain - mime:text/html. Specifying an explicit boolean - operator or negation (-) before a - mime specification is not supported and - will produce strange results. - + mime or + format for specifying the + mime type. This one is quite special because you can specify + several values which will be OR'ed (the normal default for the + language is AND). Ex: mime:text/plain + mime:text/html. 
Specifying an explicit boolean + operator or negation (-) before a + mime specification is not supported and + will produce strange results. + - type or - rclcat for specifying the category (as in - text/media/presentation/etc.). The classification of mime - types in categories is defined in the &RCL; configuration - (mimeconf), and can be modified or - extended. The default category names are those which permit - filtering results in the main GUI screen. Categories are OR'ed - like mime types above. - + type or + rclcat for specifying the category (as in + text/media/presentation/etc.). The classification of mime + types in categories is defined in the &RCL; configuration + (mimeconf), and can be modified or + extended. The default category names are those which permit + filtering results in the main GUI screen. Categories are OR'ed + like mime types above. + @@ -1007,7 +1052,7 @@ fvwm Most Xesam phrase modifiers are unsupported, except for l (small ell) to disable stemming, and - p to turn an phrase into a NEAR (unordered) + p to turn a phrase into a NEAR (unordered) search. Exemple: "prejudice pride"p @@ -1022,26 +1067,26 @@ fvwm The dialog has three parts: - The top part allows constructing a query by + The top part allows constructing a query by combining multiple clauses of different types. Each entry field is configurable for the following modes: - All terms. - - Any term. - - None of the terms. - - Phrase (exact terms in order within an - adjustable window). - - Proximity (terms in any order within an - adjustable window). - - Filename search. - - + All terms. + + Any term. + + None of the terms. + + Phrase (exact terms in order within an + adjustable window). + + Proximity (terms in any order within an + adjustable window). + + Filename search. + + Additional entry fields can be created by clicking the Add clause button. @@ -1055,22 +1100,22 @@ fvwm a mix of single words and phrases enclosed in double quotes. 
Stemming and wildcard expansion will be performed as for simple search. - + - The next part allows filtering the + The next part allows filtering the results by their mime types. - The state of the file type selection can be saved as - the default (the file type filter will not be activated at - program start-up, but the lists will be in the restored - state). - + The state of the file type selection can be saved as + the default (the file type filter will not be activated at + program start-up, but the lists will be in the restored + state). + - + The bottom part allows restricting the search results to a sub-tree of the indexed area. If you need to do this often, you may think of setting up multiple indexes instead, as the performance will be much better. - + @@ -1117,7 +1162,7 @@ fvwm - Wildcard + Wildcard In this mode of operation, you can enter a search string with shell-like wildcards (*, ?, []). ie: xapi* would display all index terms @@ -1127,8 +1172,8 @@ fvwm - Regular expression - This mode will accept a regular expression + Regular expression + This mode will accept a regular expression as input. Example: word[0-9]+. The expression is implicitely anchored at the beginning. Ie: @@ -1138,19 +1183,19 @@ fvwm .*press to match the latter, but be aware that this will cause a full index term list scan, which can be quite long. - + - Stem expansion - This mode will perform the usual stem expansion - normally done as part user input processing. As such it is - probably mostly useful to demonstrate the process. - + Stem expansion + This mode will perform the usual stem expansion + normally done as part user input processing. As such it is + probably mostly useful to demonstrate the process. + - Spelling/Phonetic In this + Spelling/Phonetic In this mode, you enter the term as you think it is spelled, and &RCL; will do its best to find index terms that sound like your entry. This mode uses the @@ -1192,38 +1237,38 @@ fvwm * which matches 0 or more characters. - - ? 
which matches + + ? which matches a single character. - + [] which allow defining sets of characters to be matched (ex: [abc] matches a single character which may be 'a' or 'b' or 'c', [0-9] matches any number. - + You should be aware of a few things before using - wildcards. + wildcards. - Using a wildcard character at the beginning of - a word can make for a slow search because &RCL; will have to - scan the whole index term list to find the matches. - - Using a * at the end of a - word can produce more matches than you would think, and - strange search results. You can use the term explorer tool to - check what completions exist for a given term. You can also - see exactly what search was performed by clicking on the link - at the top of the result list. In general, for natural - language terms, stem expansion will produce better results - than an ending * (stem expansion is turned - off when any wildcard character appears in the term). - + Using a wildcard character at the beginning of + a word can make for a slow search because &RCL; will have to + scan the whole index term list to find the matches. + + Using a * at the end of a + word can produce more matches than you would think, and + strange search results. You can use the term explorer tool to + check what completions exist for a given term. You can also + see exactly what search was performed by clicking on the link + at the top of the result list. In general, for natural + language terms, stem expansion will produce better results + than an ending * (stem expansion is turned + off when any wildcard character appears in the term). + @@ -1344,22 +1389,22 @@ fvwm Terms and search expansion Term completion - Typing Esc Space in - the simple search entry field while entering a word will - either complete the current word if its beginning matches a - unique term in the index, or open a window to propose a list - of completions. 
+ Typing Esc Space in + the simple search entry field while entering a word will + either complete the current word if its beginning matches a + unique term in the index, or open a window to propose a list + of completions. Picking up new terms from result or preview text - Double-clicking on a word in the result list or in a - preview window will copy it to the simple search entry field. + Double-clicking on a word in the result list or in a + preview window will copy it to the simple search entry field. Wildcards - Wildcards can be used inside search terms in all forms - of searches. + Wildcards can be used inside search terms in all forms + of searches. More about wildcards. @@ -1376,12 +1421,12 @@ fvwm Finding related documents - Selecting the Find similar documents entry - in the result list paragraph right-click menu will select a - set of "interesting" terms from the current result, and insert - them into the simple search entry field. You can then possibly - edit the list and start a search to find documents which may - be apparented to the current result. + Selecting the Find similar documents entry + in the result list paragraph right-click menu will select a + set of "interesting" terms from the current result, and insert + them into the simple search entry field. You can then possibly + edit the list and start a search to find documents which may + be apparented to the current result. File names @@ -1428,7 +1473,7 @@ fvwm Others - + Using fields You can use the query language and field specifications @@ -1454,6 +1499,13 @@ fvwm the new document. + Scrolling the result list from the keyboard + You can use PageUp and PageDown + to scroll the result list, Shift+Home to go back + to the first page. These work even while the focus is in the + search entry. 
+ + Forced opening of a preview window You can use Shift+Click on a result list Preview link to force the creation of a @@ -1469,7 +1521,7 @@ fvwm Printing previews Entering ^P in a preview window will print - the currently displayed text. + the currently displayed text. Quitting @@ -1482,102 +1534,257 @@ fvwm Customizing the search interface - It is possible to customize some aspects of the search - interface by using Query configuration entry - in the Preferences menu. + You can customize some aspects of the search interface by using + the Query configuration entry in the + Preferences menu. - There are two tabs in the dialog, dealing with the - interface itself, and with the parameters used for searching and - returning results. + There are several tabs in the dialog, dealing with the + interface itself, the parameters used for searching and + returning results, and what indexes are searched. User interface parameters: - + - Number of results in a result - page: - + Number of results in a result + page: + - Hide duplicate results: - decides if result list entries are shown for identical - documents found in different places. - + Hide duplicate results: + decides if result list entries are shown for identical + documents found in different places. + - Highlight color for query - terms: - Terms from the user query are highlighted in the result - list samples and the preview window. The color can be - chosen here. Any QT color string should work - (ie red, #ff0000). The - default is blue. - + Highlight color for query + terms: Terms from the user query are highlighted in + the result list samples and the preview window. The color can + be chosen here. Any QT color string should work (ie + red, #ff0000). The + default is blue. + - Result list font: There - is quite a lot of information shown in the result list, and - you may want to customize the font and/or font size. 
The rest - of the fonts used by &RCL; are determined by your generic QT - config (try the qtconfig command). - + Result list font: There is + quite a lot of information shown in the result list, and you + may want to customize the font and/or font size. The rest of + the fonts used by &RCL; are determined by your generic QT + config (try the qtconfig command). + + + + Result paragraph format string: + allows you to change the presentation of each result list + entry. This is + described in its own section. + + + Maximum text size highlighted for + preview: Inserting highlights on search terms inside + the text before inserting it in the preview window involves + quite a lot of processing, and can be disabled over the given + text size to speed up loading. + + + Use desktop preferences to choose + document editor: if this is checked, the + xdg-open utility will be used to open files + when you click the Edit link in the result + list, instead of the application defined in + mimeview. xdg-open will + in turn use your desktop preferences to choose an appropriate + application. + + + Choose editor applications: + this will let you choose the command started by the + Edit links inside the result list, for + specific document types. + + + Display category filter as + toolbar... this will let you choose if the document + categories are displayed as a list or a set of buttons. + + + Auto-start simple search on white + space entry: if this is checked, a search will be + executed each time you enter a space in the simple search input + field. This lets you look at the result list as you enter new + terms. This is off by default, you may like it or not... + + + Start with advanced search dialog open + and Start with sort dialog + open: If you use these dialogs all the time, checking + these entries will get them to open when recoll starts. + + + Remember sort activation + state: if set, Recoll will remember the sort tool + state between invocations. 
It normally starts with sorting + disabled. + + Prefer HTML to plain text for preview + + if set, Recoll will display HTML as such inside the + preview window. If this causes problems with the Qt HTML + display, you can uncheck it to display the plain text version + instead. + + + + + + + + Search parameters: + + + + Stemming language: + stemming obviously depends on the document's language. This + listbox will let you choose among the stemming databases which + were built during indexing (this is set in the main configuration + file), or later added with recollindex + -s (see the recollindex manual). Stemming languages + which are dynamically added will be deleted at the next + indexing pass unless they are also added in the configuration + file. + + + Dynamically add phrase to simple + searches: a phrase will be automatically built and + added to simple searches when looking for Any + terms. This will give a relevance boost to the + results where the search terms appear as a phrase (consecutive + and in order). + + + Replace abstracts from + documents: this decides if we should synthesize and + display an abstract in place of an explicit abstract found + within the document itself. + + + Dynamically build + abstracts: this decides if &RCL; tries to build + document abstracts when displaying the result list. Abstracts + are constructed by taking context from the document + information, around the search terms. This can slow down + result list display significantly for big documents, and you + may want to turn it off. + + + Replace abstracts from + documents: this decides if we should synthesize and + display an abstract in place of an explicit abstract found + within the document itself. + + + Synthetic abstract size: + adjust to taste... + + + Synthetic abstract context + words: how many words should be displayed around + each term occurrence. + + + + + + + + External indexes: + This panel will let you browse for additional indexes + that you may want to search. 
External indexes are designated by + their database directory (ie: + /home/someothergui/.recoll/xapiandb, + /usr/local/recollglobal/xapiandb). + + Once entered, the indexes will appear in the + External indexes list, and you can + choose which ones you want to use at any moment by checking or + unchecking their entries. + + Your main database (the one the current configuration + indexes to), is always implicitly active. If this is not + desirable, you can set up your configuration so that it indexes, + for example, an empty directory. An alternative indexer may also + need to implement a way of purging the index from stale data, + + + + The result list paragraph format + + The presentation of each result inside the result list can be + customized by setting the result list paragraph format inside the + User Interface tab of the Query + configuration. + + This is a Qt HTML string where the following printf-like + % substitutions will be performed: - - Result paragraph format string: - allows you to change the presentation of - each result list entry. This is a qt-html string where the - following printf-like % substitutions will - be performed: - %A Abstract - - %D Date - - %I Icon image name - - - %K Keywords (if - any) - - %L Preview and - Edit links - - %M Mime - type - - %N result Number - - - %R Relevance - percentage - - %S Size - information - - %T Title - - - %U Url - - + %A Abstract + + %D Date + + %I Icon image name + + + %K Keywords (if + any) + + %L Preview and + Edit links + + %M Mime + type + + %N result Number + + + %R Relevance + percentage + + %S Size + information + + %T Title + + + %U Url + + - In addition to the predefined values above, all strings like - %(fieldname) will be replaced by the value - of the field named fieldname for this - document. Only stored fields can be accessed in this way, the - value of indexed but not stored fields is not known at this - point (see field - configuration). 
There are currently very few fields - stored by default, apart from the values above (only - author), so this feature will need some - custom local configuration to be useful. For example, you - could look at the fields for the document types of interest - (use the right-click menu inside the preview window), and add - what you want to the list of stored fields. A candidate - example would be the recipient field - which is generated by the message filters. + The format of the Preview and Edit links is + <a href="P%N"> + and + <a href="E%N"> + where %N expands to the document + number inside the result list. + + In addition to the predefined values above, all strings like + %(fieldname) will be replaced by the value of + the field named fieldname for this + document. Only stored fields can be accessed in this way, the value + of indexed but not stored fields is not known at this point in the + search process (see field + configuration). There are currently very few fields stored + by default, apart from the values above (only + author), so this feature will need some custom + local configuration to be useful. For example, you could look at + the fields for the document types of interest (use the right-click + menu inside the preview window), and add what you want to the list + of stored fields. A candidate example would be the + recipient field which is generated by the + message filters. The default value for the paragraph format string is: <img src="%I" align="left">%R %S %L &nbsp;&nbsp;<b>%T</b><br> -%M&nbsp;%D&nbsp;&nbsp;&nbsp;<i>%U</i><br> +%M&nbsp;%D&nbsp;&nbsp;&nbsp;<i>%U</i>&nbsp;%i<br> %A %K You may, for example, try the following for a more web-like @@ -1593,122 +1800,17 @@ fvwm <tr><td><div>%A</div></td></tr> </table>%K - The format of the Preview and Edit links is - <a href="Pdocnum"> - and - <a href="Edocnum"> - where docnum is what %N would - print. This makes the title a preview link in the above format. 
- - Please note that, due to the way the program - handles right mouse clicks in the result list, if the custom - formatting results in multiple paragraphs per result, right - clicks will only work inside the first one. + Note that the P%N link in the above paragraph makes the title a + preview link. + - - - HTML help browser: this - will let you chose your preferred browser which will be - started from the Help menu to read the user - manual. You can enter a simple name if the command is in your - PATH, or browse for a full pathname. - - - Auto-start simple search on - white space entry: if this is checked, a search will - be executed each time you enter a space in the simple search - input field. This lets you look at the result list as you - enter new terms. This is off by default, you may like it or - not... - - - Start with advanced search dialog open - and Start with sort dialog open: - If you use these dialogs all the time, checking these - entries will get them to open when recoll starts. - - - Use desktop preferences to choose - document editor: if this is checked, the - xdg-open - utility will be used to open files when you click the - Edit link in the result list, instead of - the application defined in - mimeview. xdg-open - will in term use your desktop preferences to choose an - appropriate application. - + Due to the way the program handles right mouse clicks in the + result list, if the custom formatting results in multiple + paragraphs per result, right clicks will only work inside the first + one. - - - - - - Search parameters: - - - - Stemming language: - stemming obviously depends on the document's language. This - listbox will let you chose among the stemming databases which - were built during indexing (this is set in the main configuration - file), or later added with - recollindex -s (See the recollindex - manual). 
Stemming languages which are dynamically added will be - deleted at the next indexing pass unless they are also added in - the configuration file. - - - Dynamically build - abstracts: this decides if &RCL; tries to build - document abstracts when displaying the result list. Abstracts - are constructed by taking context from the document - information, around the search terms. This can slow down - result list display significantly for big documents, and you - may want to turn it off. - - - Replace abstracts from - documents: this decides if we should synthesize and - display an abstract in place of an explicit abstract found - within the document itself. - - - Synthetic abstract size: - adjust to taste... - - - Synthetic abstract context - words: how many words should be displayed around - each term occurrence. - - - - - - - - External indexes: - This panel will let you browse for additional indexes - that you may want to search. External indexes are designated by - their database directory (ie: - /home/someothergui/.recoll/xapiandb, - /usr/local/recollglobal/xapiandb). - - Once entered, the indexes will appear in the - External indexes list, and you can - chose which ones you want to use at any moment by checking or - unchecking their entries. - - Your main database (the one the current configuration - indexes to), is always implicitly active. If this is not - desirable, you can set up your configuration so that it indexes, - for example, an empty directory. An alternative indexer may also - need to implement a way of purging the index from stale data, - - + @@ -1834,10 +1936,10 @@ Common options: Recoll query: ((((ilur:(wqf=11) OR ilurs) AND_NOT (nautique:(wqf=11) OR nautiques OR nautiqu OR nautiquement)) FILTER Ttext/html)) 4 results -text/html [file:///Users/uncrypted-dockes/projets/bateaux/ilur/comptes.html] [comptes.html] 18593 bytes -text/html [file:///Users/uncrypted-dockes/projets/nautique/webnautique/articles/ilur1/index.html] [Constructio... 
-text/html [file:///Users/uncrypted-dockes/projets/pagepers/index.html] [psxtcl/writemime/recoll]... -text/html [file:///Users/uncrypted-dockes/projets/bateaux/ilur/factEtCie/recu-chasse-maree.... +text/html [file:///Users/uncrypted-dockes/projets/bateaux/ilur/comptes.html] [comptes.html] 18593 bytes +text/html [file:///Users/uncrypted-dockes/projets/nautique/webnautique/articles/ilur1/index.html] [Constructio... +text/html [file:///Users/uncrypted-dockes/projets/pagepers/index.html] [psxtcl/writemime/recoll]... +text/html [file:///Users/uncrypted-dockes/projets/bateaux/ilur/factEtCie/recu-chasse-maree.... @@ -1856,34 +1958,58 @@ text/html [file:///Users/uncrypted-dockes/projets/bateaux/ilur/factEtCie/recu-ch (fields) is highly configurable. - Writing a document filter + Writing a document filter - &RCL; filters are executable programs which - translate from a specific format (ie: - openoffice, - acrobat, etc.) to the &RCL; - indexing input format, which may be - text/plain or - text/html. + &RCL; filters are executable programs which + translate from a specific format (ie: + openoffice, + acrobat, etc.) to the &RCL; + indexing input format, which may be + text/plain or + text/html. - &RCL; filters are usually shell-scripts, but this is in - no way necessary. These programs are extremely simple and most - of the difficulty lies in extracting the text from the native - format, not outputting what is expected by &RCL;. Happily - enough, most document formats already have translators or text - extractors which handle the difficult part and can be called - from the filter. In some case the output of the translating - program is appropriate, and no intermediate shell-script is - needed. + As of &RCL; 1.13, there are two kinds of filters: + + Simple filters (the old ones) run once and + exit. They can be bare programs like + antiword, or shell-scripts using other + programs. They are very simple to write, just having to write the + text to the standard output. 
+ + Multiple filters, new in 1.13, run as long as + their master process (ie: recollindex) is active. They can + process multiple files (sparing the process startup time which + can be very significant), or multiple documents per file (ie: for + zip or chm files). They communicate with the indexer through a + simple protocol, but are nevertheless a bit more complicated than + the older kind. Most of these new filters are written in + Python, using a common module to + handle the protocol. + + + The following will just describe the simple filters, if you are + programmer enough to write one of the other kind, it shouldn't be too + difficult to make sense of one of the existing modules (ie: + rclzip). - Filters are called with a single argument which is the - source file name. They should output the result to stdout. + &RCL; simple filters are usually shell-scripts, but this is in + no way necessary. These programs are extremely simple and most + of the difficulty lies in extracting the text from the native + format, not outputting what is expected by &RCL;. Happily + enough, most document formats already have translators or text + extractors which handle the difficult part and can be called + from the filter. In some case the output of the translating + program is appropriate, and no intermediate shell-script is + needed. - The RECOLL_FILTER_FORPREVIEW - environment variable (values yes, - no) tells the filter if the operation is - for indexing or previewing. Some filters use this to output a - slightly different format. This is not essential. + Filters are called with a single argument which is the + source file name. They should output the result to stdout. + + The RECOLL_FILTER_FORPREVIEW + environment variable (values yes, + no) tells the filter if the operation is + for indexing or previewing. Some filters use this to output a + slightly different format. This is not essential. The association of file types to filters is performed in the mimeconf file. 
A sample: @@ -1891,42 +2017,46 @@ text/html [file:///Users/uncrypted-dockes/projets/bateaux/ilur/factEtCie/recu-ch [index] application/msword = exec antiword -t -i 1 -m UTF-8;\ - mimetype=text/plain;charset=utf-8 + mimetype = text/plain ; charset=utf-8 application/ogg = exec rclogg text/rtf = exec unrtf --nopict --html; charset=iso-8859-1; mimetype=text/html + +application/x-chm = execm rclchm - The fragment specifies that: + The fragment specifies that: - - application/msword files - are processed by executing the antiword - program, which outputs - text/plain encoded in - iso-8859-1. - - - application/ogg files are - processed by the rclogg script, with - default output type (text/html, with - encoding specified in the header, or utf-8 - by default). - - - text/rtf is processed by - unrtf, which outputs - text/html. The - iso-8859-1 encoding is specified because it - is not the utf-8 default, and not output by - unrtf in the HTML header section. - - - - The easiest way to write a new filter is probably to start - from an existing one. + application/msword files + are processed by executing the antiword + program, which outputs + text/plain encoded in + utf-8. + + + application/ogg files are + processed by the rclogg script, with + default output type (text/html, with + encoding specified in the header, or utf-8 + by default). + + + text/rtf is processed by + unrtf, which outputs + text/html. The + iso-8859-1 encoding is specified because it + is not the utf-8 default, and not output by + unrtf in the HTML header section. + + application/x-chm is processed + by a persistent filter. This is determined by the + execm keyword. + + + The easiest way to write a new filter is probably to start from an + existing one. 
Filters which output text/plain text are generally simpler, but they cannot specify the character set @@ -1935,41 +2065,41 @@ text/rtf = exec unrtf --nopict --html; charset=iso-8859-1; mimetype=text/html - Filter HTML output + Filter HTML output - The output HTML could be very minimal like the following - example: + The output HTML could be very minimal like the following + example: - <html><head> + <html><head> <meta http-equiv="Content-Type" content="text/html;charset=UTF-8"> </head> <body>some text content</body></html> - You should take care to escape some - characters inside - the text by transforming them into appropriate - entities. "&" should be transformed into - "&amp;", "<" - should be transformed into - "&lt;". This is not always properly - done by translating programs which output HTML, and of - course nerver by those which output plain text. + You should take care to escape some + characters inside + the text by transforming them into appropriate + entities. "&" should be transformed into + "&amp;", "<" + should be transformed into + "&lt;". This is not always properly + done by translating programs which output HTML, and of + course never by those which output plain text. - The character set needs to be specified in the - header. It does not need to be UTF-8 (&RCL; will take care - of translating it), but it must be accurate for good - results. + The character set needs to be specified in the + header. It does not need to be UTF-8 (&RCL; will take care + of translating it), but it must be accurate for good + results. - &RCL; will also make use of other header fields if - they are present: title, - description, - keywords. + &RCL; will also make use of other header fields if + they are present: title, + description, + keywords. - Filters also have the possibility to "invent" field - names. This should be output as meta tags: + Filters also have the possibility to "invent" field + names. 
This should be output as meta tags: - + <meta name="somefield" content="Some textual data" /> @@ -1981,7 +2111,7 @@ text/rtf = exec unrtf --nopict --html; charset=iso-8859-1; mimetype=text/html - Field data processing configuration + Field data processing configuration Fields are named pieces of information in or about documents, like title, @@ -2003,15 +2133,15 @@ text/rtf = exec unrtf --nopict --html; charset=iso-8859-1; mimetype=text/html Fields can be: - indexed, meaning that their - terms are separately stored in inverted lists (with a specific - prefix), and that a field-specific search is possible. - + indexed, meaning that their + terms are separately stored in inverted lists (with a specific + prefix), and that a field-specific search is possible. + - stored, meaning that their + stored, meaning that their value is recorded in the index data record for the document, and can be returned and displayed with search results. - + @@ -2042,8 +2172,8 @@ text/rtf = exec unrtf --nopict --html; charset=iso-8859-1; mimetype=text/html - - udi An udi (unique document + + udi An udi (unique document identifier) identifies a document. Because of limitations inside the index engine, it is restricted in length (to 200 bytes), which is why a regular URI cannot be used. The @@ -2053,34 +2183,34 @@ text/rtf = exec unrtf --nopict --html; charset=iso-8859-1; mimetype=text/html document path (file path + internal path), truncated to length, the suppressed part being replaced by a hash value. - + - - ipath - - This data value (set as a field in the Doc - object) is stored, along with the URL, but not indexed by - &RCL;. Its contents are not interpreted, and its use is up - to the application. For example, the &RCL; internal file - system indexer stores the part of the document access path - internal to the container file (ipath in - this case is a list of subdocument sequential numbers). 
url - and ipath are returned in every search result and permit - access to the original document. - - + + ipath + + This data value (set as a field in the Doc + object) is stored, along with the URL, but not indexed by + &RCL;. Its contents are not interpreted, and its use is up + to the application. For example, the &RCL; internal file + system indexer stores the part of the document access path + internal to the container file (ipath in + this case is a list of subdocument sequential numbers). url + and ipath are returned in every search result and permit + access to the original document. + + - - Stored and indexed fields - - The fields file inside - the &RCL; configuration defines which document fields are - either "indexed" (searchable), "stored" (retrievable with - search results), or both. - - + + Stored and indexed fields + + The fields file inside + the &RCL; configuration defines which document fields are + either "indexed" (searchable), "stored" (retrievable with + search results), or both. + + - + Data for an external indexer, should be stored in a separate index, not the one for the &RCL; internal file system @@ -2096,18 +2226,18 @@ text/rtf = exec unrtf --nopict --html; charset=iso-8859-1; mimetype=text/html Python interface - Introduction + Introduction - &RCL; versions after 1.11 define a Python programming - interface, both for searching and indexing. + &RCL; versions after 1.11 define a Python programming + interface, both for searching and indexing. - The python interface is not built by default and can be - found in the source package, under python/recoll. The - directory contains the usual setup.py - script which you can use to build and install the - module: + The python interface is not built by default and can be + found in the source package, under python/recoll. 
The + directory contains the usual setup.py + script which you can use to build and install the + module: - + cd recoll-xxx/python/recoll python setup.py build python setup.py install @@ -2118,7 +2248,7 @@ text/rtf = exec unrtf --nopict --html; charset=iso-8859-1; mimetype=text/html - Interface manual + Interface manual NAME @@ -2307,16 +2437,16 @@ FUNCTIONS - + - Example code + Example code - The following sample would query the index with a user - language string. See the python/samples - directory inside the &RCL; source for other examples. + The following sample would query the index with a user + language string. See the python/samples + directory inside the &RCL; source for other examples. - + #!/usr/bin/env python import recoll @@ -2353,44 +2483,66 @@ while query.next >= 0 and query.next < nres: Installation - Installing a prebuilt copy + Installing a binary copy - &RCL; binary packages from the &RCL; web site are always - linked statically to the &XAP; libraries, and have no other - dependencies. You will only have to check or install There are three types of binary &RCL; installations: + + Through your system's normal software distribution + framework (ie, Debian/Ubuntu apt, + FreeBSD ports, etc.). + + + From a package downloaded from the + &RCL; web site. + + + From a prebuilt tree downloaded from the &RCL; + web site. + + + + In all cases, the strict software dependencies (ie on &XAP; or + iconv) will be automatically satisfied, + you should not have to worry about them. + + You will only have to check or install supporting applications - for the file types that you want to index beyond text, HTML and - mail files, and maybe have a look at the + for the file types that you want to index beyond those that are + natively processed by &RCL; (text, HTML, mail files, and a few + others). + + You should also maybe have a look at the configuration section (but this may not be necessary for a quick test with default - parameters). 
Most parameters can be more conveniently set from the + GUI interface. Installing through a package system - If you use a BSD-type port system or a prebuilt package - (RPM or other), just follow the usual procedure for your - system. + If you use a BSD-type port system or a prebuilt package (DEB, + RPM, manually or through the system software configuration + utility), just follow the usual procedure for your system. Installing a prebuilt &RCL; - The unpackaged binary versions on the &RCL; web site are + The unpackaged binary versions on the &RCL; web site are just compressed tar files of a build tree, where only the useful parts were kept (executables and sample configuration). - The executable binary files are built with a static link to + The executable binary files are built with a static link to libxapian and libiconv, to make installation easier (no dependencies). - After extracting the tar file, you can proceed with + After extracting the tar file, you can proceed with installation as if you had built the package from source (that is, just type make install). The binary trees are built for - installation to /usr/local. + installation to /usr/local. @@ -2400,8 +2552,9 @@ while query.next >= 0 and query.next < nres: &RCL; uses external applications to index some file types. You need to install them for the file types that you wish to - have indexed (these are run-time dependencies. None is needed for - building &RCL;). + have indexed (these are run-time optional dependencies. None is + needed for building or running &RCL; except for indexing their + specific file type). After an indexing pass, the commands that were found missing can be displayed from the recoll @@ -2445,11 +2598,11 @@ while query.next >= 0 and query.next < nres: RTF: unrtf - + - - TeX: &RCL; uses the untex - program. Your distribution may have a package for it. If it doesn't, + + TeX: &RCL; uses the untex + program. Your distribution may have a package for it. 
If it doesn't, there is a copy of the source on the &RCL; web site, because the program has no obvious home. The filter can @@ -2458,39 +2611,61 @@ while query.next >= 0 and query.next < nres: detex and will use it if it is installed. - + dvi: dvips + url="http://www.radicaleye.com/dvips.html">dvips - - djvu: - DjVuLibre - - - - MP3: &RCL; will use the - id3info command from the id3lib package to - extract tag information. Without it, only the file names will - be indexed. - - - Pictures: &RCL; uses the - - Exiftool Perl package to - extract tag information. Most image file formats are - supported. + djvu: + DjVuLibre + + + + mp3: &RCL; will use the + id3info command from the id3lib package to + extract tag information. Without it, only the file names will + be indexed. + + flac files need metaflac. + + ogg files need ogginfo. + + + Pictures: &RCL; uses the + + Exiftool Perl package to + extract tag information. Most image file formats are + supported. Note that there may not be much interest in indexing + the technical tags (image size, aperture, etc.). This is only of + interest if you store personal tags or textual descriptions inside + the image files. + chm: files in microsoft help format need Python and + the pychm + module (which needs chmlib). + + + ics: iCalendar files need Python and the + icalendar + module. + + + zip: Zip archives need Python (and the standard + zipfile module). + + - Text, HTML, mail folders Openoffice and Scribus files - are processed internally. Lyx is used to index Lyx files. Many - filters need sed and awk. - + Text, HTML, mail folders, Openoffice and Scribus files + are processed internally. Lyx is used to index Lyx files. Many + filters need sed and awk. 
+ @@ -2503,12 +2678,12 @@ while query.next >= 0 and query.next < nres: At the very least, you will need to download and install the xapian core package - (&RCL; 1.9 normally uses version 1.0.2, but any 0.9 or 1.0.x - version will work too), and the qt - run-time and development packages (&RCL; development - currently uses version 3.3.5, but any 3.3 version is - probably OK). + and the qt + run-time and development packages. + Check the + &RCL; download page for up to date version + information. You will most probably be able to find a binary package for qt for your system. You may have to @@ -2526,9 +2701,9 @@ while query.next >= 0 and query.next < nres: Building - &RCL; has been built on - Linux (redhat7.3, mandriva 2005/6, Fedora Core 3/4/5/6), - FreeBSD 5/6, macosx, and Solaris 8. If you build on another system, and + &RCL; has been built on Linux, FreeBSD, macosx, and Solaris, + most versions after 2005 should be ok, maybe some older ones too + (Solaris 8 is ok). If you build on another system, and need to modify things, I would very much welcome patches. @@ -2554,15 +2729,40 @@ while query.next >= 0 and query.next < nres: On many Linux systems, QTDIR is set by the login scripts, and QMAKESPECS is not needed because there is a default link in - mkspecs/. + mkspecs/. Neither should be needed with + Qt 4. - Configure - options:--without-aspell - will disable the code for phonetic matching of search - terms. --with-fam or - --with-inotify will enable the code for - real time indexing. Inotify support is enabled by default on - recent Linux systems. + Configure options: + + + --without-aspell + will disable the code for phonetic matching of search + terms. + + --with-fam or + --with-inotify will enable the code for + real time indexing. Inotify support is enabled by default on + recent Linux systems. + + --enable-xattr will enable + code to fetch data from file extended attributes. 
This is only + useful if some application stores data in there, and also needs + some simple configuration (see comments in the + fields configuration file). + + --with-file-command Specify + the version of the 'file' command to use (ie: + --with-file-command=/usr/local/bin/file). Can be useful to + enable the gnu version on systems where the native one is + bad. + + --without-gui Disable the Qt + interface, and auxiliary uses of X11, and compile the command + line version. + + + + Normal procedure: @@ -2573,7 +2773,7 @@ while query.next >= 0 and query.next < nres: - There little auto-configuration. The + There is little auto-configuration. The configure script will mainly link one of the system-specific files in the mk directory to mk/sysconf. If your system @@ -2593,14 +2793,14 @@ while query.next >= 0 and query.next < nres: and the sample configuration files, scripts and other shared data to prefix/share/recoll. - If the installation prefix given to - recollinstall is different from what was - specified when executing configure, you - will have to set the RECOLL_DATADIR - environment variable to indicate where the shared data is to - be found. + If the installation prefix given to + recollinstall is different from what was + specified when executing configure, you + will have to set the RECOLL_DATADIR + environment variable to indicate where the shared data is to + be found. - You can then proceed to You can then proceed to configuration. @@ -2717,12 +2917,12 @@ while query.next >= 0 and query.next < nres: the configuration file before restarting the command. This will start the initial indexing, which may take some time. - Paramers: + Parameters affecting what we index: - topdirs + topdirs Specifies the list of directories or files to index (recursively for directories). 
The indexer will not follow symbolic links inside the indexed trees by default @@ -2730,16 +2930,6 @@ while query.next >= 0 and query.next < nres: - dbdir - The name of the Xapian data directory.
It - will be created if needed when the index is - initialized. If this is not an absolute path, it will be - interpreted relative to the configuration directory. The - value can have embedded spaces but starting or trailing - spaces will be trimmed. You cannot use quotes here. - - - skippedNames A space-separated list of patterns for @@ -2747,12 +2937,12 @@ while query.next >= 0 and query.next < nres: ignored. The list defined in the default file is: skippedNames = #* bin CVS Cache cache* caughtspam tmp .thumbnails .svn \ - *~ recollrc + *~ .beagle .git .hg .bzr loop.ps .xsession-errors \ + .recoll* xapiandb recollrc recoll.conf - The list can be redefined for sub-directories, but is only - actually changed for the top level ones in - topdirs. - The top-level directories are not affected by this + The list can be redefined at any sub-directory in the + indexed area. + The top-level directories are not affected by this list (that is, a directory in topdirs might match and would still be indexed). The list in the default configuration does not @@ -2784,21 +2974,21 @@ skippedNames = #* bin CVS Cache cache* caughtspam tmp .thumbnails .svn \ There is no default in the sample configuration file, but the code always adds the configuration and database directories in there. - skippedPaths is used both by - batch and real time - indexing. daemSkippedPaths can be - used to specify things that should be indexed at - startup, but not monitored. - Example of use for skipping text files only in a - specific directory: - + skippedPaths is used both by + batch and real time + indexing. daemSkippedPaths can be + used to specify things that should be indexed at + startup, but not monitored. + Example of use for skipping text files only in a + specific directory: + skippedPaths = ~/somedir/∗.txt - followLinks + followLinks Specifies if the indexer should follow symbolic links while walking the file tree. 
The default is to ignore symbolic links to avoid multiple indexing of @@ -2810,6 +3000,151 @@ skippedPaths = ~/somedir/∗.txt + indexedmimetypes + &RCL; normally indexes any file which it + knows how to read. This list lets you restrict the indexed + mime types to what you specify. If the variable is + unspecified or the list empty (the default), all supported + types are processed. + + + + compressedfilemaxkbs + Size limit for compressed (.gz or .bz2) + files. These need to be decompressed in a temporary + directory for identification, which can be very wasteful + if 'uninteresting' big compressed files are present. + Negative means no limit, 0 means no processing of any + compressed file. Defaults to -1. + + + + textfilemaxmbs + Maximum size for text files. Very big text + files are often uninteresting logs. Set to -1 to disable + (default). + + + + textfilepagekbs + If set to other than -1 (the default), text + files will be indexed as multiple documents of the given page + size. This may be useful if you do want to index very big text + files as it will both reduce memory usage at index time and + help with loading data to the preview window. A size of a few + megabytes would seem reasonable. + + + + indexallfilenames + &RCL; indexes file names in a special + section of the database to allow specific file names + searches using wild cards. This parameter decides if + file name indexing is performed only for files with mime + types that would qualify them for full text indexing, or + for all files inside the selected subtrees, independently of + mime type. + + + + usesystemfilecommand + Decide if we use the file -i + system command as a final step for determining the mime + type for a file (the main procedure uses suffix + associations as defined in the mimemap + file). This can be useful for files with suffix-less names, + but it will also cause the indexing of many bogus "text" + files. 
+ + + + processbeaglequeue + If this is set, process the directory where + Beagle Web browser plugins copy visited pages for indexing. Of + course, Beagle MUST NOT be running, else things will behave + strangely. + + + + beaglequeuedir + The path to the Beagle indexing queue. This is + hard-coded in the Beagle plugin as + ~/.beagle/ToIndex so there should be no + need to change it. + + + + + + + Parameters affecting where and how we store things: + + + dbdir + The name of the Xapian data directory. It + will be created if needed when the index is + initialized. If this is not an absolute path, it will be + interpreted relative to the configuration directory. The + value can have embedded spaces but starting or trailing + spaces will be trimmed. You cannot use quotes here. + + + + maxfsoccuppc + Maximum file system occupation before we + stop indexing. The value is a percentage, corresponding to + what the "Capacity" df output column shows. The default + value is 0, meaning no checking. + + + + mboxcachedir + The directory where mbox message offsets cache + files are held. This is normally $RECOLL_CONFDIR/mboxcache, but + it may be useful to share a directory between different + configurations. + + + + mboxcacheminmbs + The minimum mbox file size over which we + cache the offsets. There is really no sense in caching + offsets for small files. The default is 5 MB. + + + + webcachedir + This is only used by the Beagle web browser + plugin indexing code, and defines where the cache for visited + pages will live. Default: + $RECOLL_CONFDIR/webcache + + + + webcachemaxmbs + This is only used by the Beagle web browser + plugin indexing code, and defines the maximum size for the web + page cache. Default: 40 MB. + + + + + idxflushmb + Threshold (megabytes of new text data) + where we flush from memory to disk index. Setting this can + help control memory usage. 
A value of 0 means no explicit + flushing, letting Xapian use its own default, which is + flushing every 10000 documents (memory usage depends on + average document size). The default value is 10. + + + + + + Miscellaneous: + + + loglevel,daemloglevel Verbosity level for recoll and recollindex. A value of 4 lists quite a lot of @@ -2820,7 +3155,7 @@ skippedPaths = ~/somedir/∗.txt logfilename, - daemlogfilename + daemlogfilename Where the messages should go. 'stderr' can be used as a special value, and is the default. The daemversion is specific to the indexing monitor @@ -2847,24 +3182,34 @@ skippedPaths = ~/somedir/∗.txt sub-directory. If it is not set at all, the character set used is the one defined by the nls environment (LC_ALL, LC_CTYPE, LANG), or iso8859-1 if nothing is set. + + - maxfsoccuppc - Maximum file system occupation before we - stop indexing. The value is a percentage, corresponding to - what the "Capacity" df output column shows. The default - value is 0, meaning no checking. + filtermaxseconds + Maximum filter execution time, after which it + is aborted. Some postscript programs just loop... + + + + maildefcharset + This can be used to define the default + character set specifically for mail messages which don't + specify it. This is mainly useful for readpst (libpst) dumps, + which are utf-8 but do not say so. - idxflushmb - Threshold (megabytes of new text data) - where we flush from memory to disk index. Setting this can - help control memory usage. A value of 0 means no explicit - flushing, letting Xapian use its own default, which is - flushing every 10000 documents (memory usage depends on - average document size). The default value is 10. 
+ localfields + This allows setting fields for all documents + under a given directory. Typical usage would be to set an + "rclaptg" field, to be used in mimeview to + select a specific viewer. Ie: + localfields=rclaptg=gnus;other=val, then + select the specific viewer with + mimetype|tag=... in + mimeview.
- + filtersdir A directory to search for the external @@ -2883,72 +3228,23 @@ skippedPaths = ~/somedir/∗.txt - guesscharset - Decide if we try to guess the character - set of files if no internal value is available (ie: for - plain text files). This does not work well in general, and - should probably not be used. - - - - usesystemfilecommand - Decide if we use the file -i - system command as a final step for determining the mime - type for a file (the main procedure uses suffix - associations as defined in the mimemap - file). This can be useful for files with suffix-less names, - but it will also cause the indexing of many bogus "text" - files. - - - - indexedmimetypes - &RCL; normally indexes any file which it - knows how to read. This list lets you restrict the indexed - mime types to what you specify. If the variable is - unspecified or the list empty (the default), all supported - types are processed. - - - - compressedfilemaxkbs - Size limit for compressed (.gz or .bz2) - files. These need to be decompressed in a temporary - directory for identification, which can be very wasteful - if 'uninteresting' big compressed files are present. - Negative means no limit, 0 means no processing of any - compressed file. Defaults to -1. - - - - indexallfilenames - &RCL; indexes file names in a special - section of the database to allow specific file names - searches using wild cards. This parameter decides if - file name indexing is performed only for files with mime - types that would qualify them for full text indexing, or - for all files inside the selected subtrees, independently of - mime type. - - - - idxabsmlen - &RCL; stores an abstract for each indexed - file inside the database. The text can come from an actual - 'abstract' section in the document or will just be the - beginning of the document. It is stored in the index so - that it can be displayed inside the result lists without - decoding the original - file. 
The idxabsmlen parameter defines - the size of the stored abstract. The default value is 250 bytes. - The search interface gives you the choice to display this - stored text or a synthetic abstract built by extracting - text around the search terms. If you always - prefer the synthetic abstract, you can reduce this value - and save a little space. + idxabsmlen + &RCL; stores an abstract for each indexed + file inside the database. The text can come from an actual + 'abstract' section in the document or will just be the + beginning of the document. It is stored in the index so + that it can be displayed inside the result lists without + decoding the original + file. The idxabsmlen parameter defines + the size of the stored abstract. The default value is 250 bytes. + The search interface gives you the choice to display this + stored text or a synthetic abstract built by extracting + text around the search terms. If you always + prefer the synthetic abstract, you can reduce this value + and save a little space. - - + + aspellLanguage Language definitions to use when creating @@ -2969,24 +3265,33 @@ skippedPaths = ~/somedir/∗.txt - nocjk - If this set to true, specific east asian - (Chinese Korean Japanese) characters/word splitting is - turned off. This will save a small amount of cpu if you - have no CJK documents. If your document base does include - such text but you are not interested in searching it, - setting nocjk may be a significant time - and space saver. - - - cjkngramlen - This lets you adjust the size of n-grams - used for indexing CJK text. The default value of 2 is - probably appropriate in most cases. 
A value of 3 would - allow more precision and efficiency on longer words, but - the index will be approximately twice as large. - - + nocjk + If this is set to true, specific east asian + (Chinese Korean Japanese) characters/word splitting is + turned off. This will save a small amount of cpu if you + have no CJK documents.
If your document base does include + such text but you are not interested in searching it, + setting nocjk may be a significant time + and space saver. + + + + cjkngramlen + This lets you adjust the size of n-grams + used for indexing CJK text. The default value of 2 is + probably appropriate in most cases. A value of 3 would + allow more precision and efficiency on longer words, but + the index will be approximately twice as large. + + + + guesscharset + Decide if we try to guess the character + set of files if no internal value is available (ie: for + plain text files). This does not work well in general, and + should probably not be used. + + @@ -2998,7 +3303,7 @@ skippedPaths = ~/somedir/∗.txt mimemap specifies the file name extension to mime type mappings. - For file names without an extension, or with an unknown + For file names without an extension, or with an unknown one, the system's file -i command will be executed to determine the mime type (this can be switched off inside the main configuration file). @@ -3033,7 +3338,7 @@ skippedPaths = ~/somedir/∗.txt mimeconf specifies how the different mime types are handled for indexing, and which icons - are displayed in the recoll result lists. + are displayed in the recoll result lists. Changing the parameters in the [index] section is probably not a good idea except if you are a &RCL; @@ -3062,16 +3367,22 @@ skippedPaths = ~/somedir/∗.txt Changes to this file can be done by direct editing, or through the recoll user preferences dialog. - As for the other configuration files, the normal usage - is to have a mimeview inside your own - configuration directory, with just the non-default entries, - which will override those from the central configuration - file. - Please note that these entries must be placed under a - [view] section. 
+ As for the other configuration files, the normal usage + is to have a mimeview inside your own + configuration directory, with just the non-default entries, + which will override those from the central configuration + file. + Please note that these entries must be placed under a + [view] section. - If Use desktop preferences to choose - document editor is checked in the user preferences, + The keys in the file are normally mime types. You can add an + application tag to specialize the choice for an area of the + filesystem (using a localfields specification + in mimeconf). The syntax for the key is +mimetype|tag + + If Use desktop preferences to choose + document editor is checked in the user preferences, all mimeview entries will be ignored except the one labelled application/x-all (which is set to use xdg-open by default). @@ -3080,98 +3391,98 @@ skippedPaths = ~/somedir/∗.txt Examples of configuration adjustments - - Adding an external viewer for an non-indexed type + + Adding an external viewer for a non-indexed type - Imagine that you have some kind of file which does not - have indexable content, but for which you would like to have a - functional Edit link in the result list - (when found by file name). The file names end in - .blob and can be displayed by - application blobviewer. + Imagine that you have some kind of file which does not - have indexable content, but for which you would like to have a + functional Edit link in the result list + (when found by file name). The file names end in + .blob and can be displayed by + application blobviewer.
- You need two entries in the configuration files for this - to work: - - In $RECOLL_CONFDIR/mimemap - (typically ~/.recoll/mimemap), add the - following line: - + You need two entries in the configuration files for this + to work: + + In $RECOLL_CONFDIR/mimemap + (typically ~/.recoll/mimemap), add the + following line: + application/x-blobapp = .blob - Note that the mime type is made up here, and you could - call it diesel/oil just the - same. - - In - $RECOLL_CONFDIR/mimeview under the - [view] section: - + Note that the mime type is made up here, and you could + call it diesel/oil just the + same. + + In + $RECOLL_CONFDIR/mimeview under the + [view] section: + application/x-blobapp = blobviewer %f - We are supposing that - blobviewer wants a file name - parameter here, you would use %u if - it liked URLs better. - - + We are supposing that + blobviewer wants a file name + parameter here, you would use %u if + it liked URLs better. + + - If you just wanted to change the application used by - &RCL; to display a mime type which it already knows, you - would just need to edit mimeview. The - entries you add in your personal file override those in the - central configuration, which you do not need to alter + If you just wanted to change the application used by + &RCL; to display a mime type which it already knows, you + would just need to edit mimeview. The + entries you add in your personal file override those in the + central configuration, which you do not need to alter - + - - Adding indexing support for a new file type + + Adding indexing support for a new file type - Let us now imagine that the above - .blob files actually contain - indexable text and that you know how to extract it with a - command line program. Getting &RCL; to index the files is - easy. 
You need to perform the above alteration, and also to - add data to the mimeconf file - (typically in ~/.recoll/mimeconf): + Let us now imagine that the above + .blob files actually contain + indexable text and that you know how to extract it with a + command line program. Getting &RCL; to index the files is + easy. You need to perform the above alteration, and also to + add data to the mimeconf file + (typically in ~/.recoll/mimeconf): - - Under the [index] - section, add the following line (more about the - rclblob indexing script later): - + + Under the [index] + section, add the following line (more about the + rclblob indexing script later): + application/x-blobapp = exec rclblob - - + + - Under the [icons] - section, you should choose an icon to be displayed for the - files inside the result lists. Icons are normally 64x64 - pixels PNG files which live in - /usr/[local/]share/recoll/images. + Under the [icons] + section, you should choose an icon to be displayed for the + files inside the result lists. Icons are normally 64x64 + pixels PNG files which live in + /usr/[local/]share/recoll/images. - + - Under the [categories] - section, you should add the mime type where it makes sense - (you can also create a category). Categories may be used - for filtering in advanced search. - + Under the [categories] + section, you should add the mime type where it makes sense + (you can also create a category). Categories may be used + for filtering in advanced search. + - + - The rclblob filter should - be an executable program or script which exists inside - /usr/[local/]share/recoll/filters. It - will be given a file name as argument and should output the - text contents on the standard output. + The rclblob filter should + be an executable program or script which exists inside + /usr/[local/]share/recoll/filters. It + will be given a file name as argument and should output the + text contents on the standard output. 
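As an illustration of the contract just described (file name in, text on standard output), here is a minimal sketch of what an rclblob script could look like. The .blob format is made up, so the extraction rule is an assumption too: the script keeps runs of at least four printable ASCII characters and drops everything else. The names extract_text and main are hypothetical, and a real filter may have further requirements.

```python
#!/usr/bin/env python3
"""Hypothetical rclblob filter: print the text found in a .blob file."""
import re
import sys

# Assumption: the text sits in the file as runs of printable ASCII
# between binary data. Keep runs of at least 4 characters.
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{4,}")


def extract_text(data: bytes) -> str:
    """Return the printable runs found in data, one per line."""
    runs = PRINTABLE_RUN.findall(data)
    return "\n".join(run.decode("ascii") for run in runs)


def main(path: str) -> None:
    """Read the file named on the command line, write its text to stdout."""
    with open(path, "rb") as blob:
        print(extract_text(blob.read()))


if __name__ == "__main__" and len(sys.argv) == 2:
    main(sys.argv[1])
```

To try it by hand, make the script executable and run it on a sample file (rclblob sample.blob); the extracted text should appear on the terminal.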
- The filter - programming section describes in more detail how to - write a filter. - + The filter + programming section describes in more detail how to + write a filter. + @@ -3181,9 +3492,9 @@ skippedPaths = ~/somedir/∗.txt The KDE Kicker Recoll applet The &RCL; source tree contains the source code to the - recoll_applet, a small application derived - from the find_applet. This can be used to - add a small &RCL; launcher to the KDE panel. + recoll_applet, a small application derived + from the find_applet. This can be used to + add a small &RCL; launcher to the KDE panel. The applet is not automatically built with the main &RCL; programs, nor is it included with the main source distribution