diff --git a/src/doc/user/Makefile b/src/doc/user/Makefile index 0efec86d..0956cf2a 100644 --- a/src/doc/user/Makefile +++ b/src/doc/user/Makefile @@ -49,7 +49,7 @@ usermanual.pdf: usermanual.xml recoll.conf.xml dblatex --xslt-opts="--xinclude" -tpdf $< UTILBUILDS=/home/dockes/tmp/builds/medocutils/ -recoll-conf-xml: +recoll.conf.xml: ../../sampleconf/recoll.conf $(UTILBUILDS)/confxml --docbook \ --idprefix=RCL.INSTALL.CONFIG.RECOLLCONF \ ../../sampleconf/recoll.conf > recoll.conf.xml @@ -65,7 +65,7 @@ recoll-conf-xml: # script. # Also could not get readthedocs to generate the left pane TOC? could # probably be fixed... -#usermanual-rst: recoll-conf-xml +#usermanual-rst: recoll.conf.xml # tail -n +2 recoll.conf.xml > rcl-conf-tail.xml # sed -e '/xi:include/r rcl-conf-tail.xml' \ # < usermanual.xml > full-man.xml diff --git a/src/doc/user/recoll.conf.xml b/src/doc/user/recoll.conf.xml index aa7d4a79..de6cca4c 100644 --- a/src/doc/user/recoll.conf.xml +++ b/src/doc/user/recoll.conf.xml @@ -8,26 +8,34 @@ Space-separated list of files or directories to recursively index. Default to ~ (indexes $HOME). You can use symbolic links in the list, they will be followed, -independently of the value of the followLinks variable. +independently of the value of the followLinks variable. + monitordirs Space-separated list of files or directories to monitor for updates. When running the real-time indexer, this allows monitoring only a subset of the whole indexed area. The elements must be included in the -tree defined by the 'topdirs' members. +tree defined by the 'topdirs' members. + skippedNames Files and directories which should be ignored. White space separated list of wildcard patterns (simple ones, not paths, must contain no -'/' characters), which will be tested against file and directory names. Have a look at the default -configuration for the initial value, some entries may not suit your situation. 
The easiest way to -see it is through the GUI Index configuration "local parameters" panel. The list in the default -configuration does not exclude hidden directories (names beginning with a dot), which means that -it may index quite a few things that you do not want. On the other hand, email user agents like -Thunderbird usually store messages in hidden directories, and you probably want this indexed. One -possible solution is to have ".*" in "skippedNames", and add things like "~/.thunderbird" -"~/.evolution" to "topdirs". Not even the file names are indexed for patterns in this list, see -the "noContentSuffixes" variable for an alternative approach which indexes the file names. Can be -redefined for any subtree. +'/' characters), which will be tested against file and directory names. + +Have a look at the default configuration for the initial value, some entries may not suit your +situation. The easiest way to see it is through the GUI Index configuration "local parameters" +panel. + +The list in the default configuration does not exclude hidden directories (names beginning with a +dot), which means that it may index quite a few things that you do not want. On the other hand, +email user agents like Thunderbird usually store messages in hidden directories, and you probably +want this indexed. One possible solution is to have ".*" in "skippedNames", and add things like +"~/.thunderbird" "~/.evolution" to "topdirs". + +Not even the file names are indexed for patterns in this list, see the "noContentSuffixes" +variable for an alternative approach which indexes the file names. Can be redefined for any +subtree. + skippedNames- List of name endings to remove from the default skippedNames @@ -40,7 +48,8 @@ list. onlyNames Regular file name filter patterns If this is set, only the file names not in skippedNames and matching one of the patterns will be considered for indexing. Can be -redefined per subtree. Does not apply to directories. +redefined per subtree. 
Does not apply to directories. + noContentSuffixes List of name endings (not necessarily dot-separated suffixes) for @@ -51,7 +60,8 @@ which will go away in a future release (the move from mimemap to recoll.conf allows editing the list through the GUI). This is different from skippedNames because these are name ending matches only (not wildcard patterns), and the file name itself gets indexed normally. This -can be redefined for subdirectories. +can be redefined for subdirectories. + noContentSuffixes- List of name endings to remove from the default noContentSuffixes @@ -62,19 +72,26 @@ list. list. skippedPaths -Absolute paths we should not go into. Space-separated list of wildcard expressions for absolute -filesystem paths. Must be defined at the top level of the configuration -file, not in a subsection. Can contain files and directories. The database and -configuration directories will automatically be added. The expressions -are matched using 'fnmatch(3)' with the FNM_PATHNAME flag set by -default. This means that '/' characters must be matched explicitly. You -can set 'skippedPathsFnmPathname' to 0 to disable the use of FNM_PATHNAME -(meaning that '/*/dir3' will match '/dir1/dir2/dir3'). The default value -contains the usual mount point for removable media to remind you that it -is a bad idea to have Recoll work on these (esp. with the monitor: media -gets indexed on mount, all data gets erased on unmount). Explicitly -adding '/media/xxx' to the 'topdirs' variable will override -this. +Absolute paths we should not go into. Space-separated list of wildcard expressions for absolute filesystem paths (for files or +directories). The variable must be defined at the top level of the configuration file, not in a +subsection. + +Any value in the list must be textually consistent with the values in topdirs, no attempts are +made to resolve symbolic links. 
In practise, if, as is frequently the case, /home is a link to +/usr/home, your default topdirs will have a single entry '~' which will be translated to +'/home/yourlogin'. In this case, any skippedPaths entry should start with '/home/yourlogin' *not* +with '/usr/home/yourlogin'. + +The index and configuration directories will automatically be added to the list. + +The expressions are matched using 'fnmatch(3)' with the FNM_PATHNAME flag set by default. This +means that '/' characters must be matched explicitly. You can set 'skippedPathsFnmPathname' to 0 +to disable the use of FNM_PATHNAME (meaning that '/*/dir3' will match '/dir1/dir2/dir3'). + +The default value contains the usual mount point for removable media to remind you that it is in +most cases a bad idea to have Recoll work on these Explicitly adding '/media/xxx' to the 'topdirs' +variable will override this. + skippedPathsFnmPathname Set to 0 to @@ -83,13 +100,15 @@ paths. nowalkfn File name which will cause its parent directory to be skipped. Any directory containing a file with this name will be skipped as -if it was part of the skippedPaths list. Ex: .recoll-noindex +if it was part of the skippedPaths list. Ex: .recoll-noindex + daemSkippedPaths skippedPaths equivalent specific to real time indexing. This enables having parts of the tree which are initially indexed but not monitored. If daemSkippedPaths is -not set, the daemon uses skippedPaths. +not set, the daemon uses skippedPaths. + zipUseSkippedNames Use skippedNames inside Zip archives. Fetched @@ -115,7 +134,8 @@ multiple indexing of linked files. No effort is made to avoid duplication when this option is set to true. This option can be set individually for each of the 'topdirs' members by using sections. It can not be changed below the 'topdirs' level. Links in the 'topdirs' list itself are always -followed. +followed. + indexedmimetypes Restrictive list of @@ -124,14 +144,16 @@ supported types are indexed). 
If it is set, only the types from the list will have their contents indexed. The names will be indexed anyway if indexallfilenames is set (default). MIME type names should be taken from the mimemap file (the values may be different from xdg-mime or file -i -output in some cases). Can be redefined for subtrees. +output in some cases). Can be redefined for subtrees. + excludedmimetypes List of excluded MIME types. Lets you exclude some types from indexing. MIME type names should be taken from the mimemap file (the values may be different from xdg-mime or file -i output in some cases) Can be redefined for -subtrees. +subtrees. + nomd5types Don't compute md5 for these types. md5 checksums are used only for deduplicating results, and can be @@ -140,32 +162,37 @@ lets you turn off md5 computation for selected types. It is global (no redefinition for subtrees). At the moment, it only has an effect for external handlers (exec and execm). The file types can be specified by listing either MIME types (e.g. audio/mpeg) or handler names -(e.g. rclaudio). +(e.g. rclaudio). + compressedfilemaxkbs Size limit for compressed files. We need to decompress these in a temporary directory for identification, which can be wasteful in some cases. Limit the waste. Negative means no limit. 0 results in no -processing of any compressed file. Default 50 MB. +processing of any compressed file. Default 50 MB. + textfilemaxmbs Size limit for text files. Mostly for skipping monster -logs. Default 20 MB. +logs. Default 20 MB. + indexallfilenames Index the file names of unprocessed files Index the names of files the contents of which we don't index because of an excluded or unsupported MIME -type. +type. + usesystemfilecommand Use a system command for file MIME type guessing as a final step in file type identification This is generally useful, but will usually cause the indexing of many bogus 'text' files. See 'systemfilecommand' -for the command used. +for the command used. 
+ systemfilecommand Command used to guess @@ -173,12 +200,14 @@ MIME types if the internal methods fails This should be a "file -i" workalike. The file path will be added as a last parameter to the command line. "xdg-mime" works better than the traditional "file" command, and is now the configured default (with a hard-coded fallback to -"file") +"file") + processwebqueue Decide if we process the Web queue. The queue is a directory where the Recoll Web -browser plugins create the copies of visited pages. +browser plugins create the copies of visited pages. + textfilepagekbs Page size for text @@ -187,12 +216,14 @@ into documents of approximately this size. Will reduce memory usage at index time and help with loading data in the preview window at query time. Particularly useful with very big files, such as application or system logs. Also see textfilemaxmbs and -compressedfilemaxkbs. +compressedfilemaxkbs. + membermaxkbs Size limit for archive members. This is passed to the filters in the environment -as RECOLL_FILTER_MAXMEMBERKB. +as RECOLL_FILTER_MAXMEMBERKB. + Parameters affecting how we generate terms and organize the index @@ -204,28 +235,34 @@ searches sensitive to case and diacritics can be performed, but the index will be bigger, and some marginal weirdness may sometimes occur. The default is a stripped index. When using multiple indexes for a search, this parameter must be defined identically for all. Changing the value -implies an index reset. +implies an index reset. + indexStoreDocText Decide if we store the documents' text content in the index. Storing the text allows extracting snippets from it at query time, instead of building them from index position data. + Newer Xapian index formats have rendered our use of positions list unacceptably slow in some cases. The last Xapian index format with good performance for the old method is Chert, which is default for 1.2, still supported but not default in 1.4 and will be dropped in 1.6. 
+ The stored document text is translated from its original format to UTF-8 plain text, but not stripped of upper-case, diacritics, or punctuation signs. Storing it increases the index size by 10-20% typically, but also allows for nicer snippets, so it may be worth enabling it even if not strictly needed for performance if you can afford the space. + The variable only has an effect when creating an index, meaning that the xapiandb directory must not exist yet. Its exact effect depends on the Xapian version. + For Xapian 1.4, if the variable is set to 0, the Chert format will be used, and the text will not be stored. If the variable is 1, Glass will be used, and the text stored. + For Xapian 1.2, and for versions after 1.5 and newer, the index format is always the default, but the variable controls if the text is stored or not, and the abstract generation method. With Xapian 1.5 and later, and @@ -242,26 +279,31 @@ still be). Numbers are often quite interesting to search for, and this should probably not be set except for special situations, ie, scientific documents with huge amounts of numbers in them, where setting nonumbers will reduce the index size. This can only be set for a whole index, not -for a subtree. +for a subtree. + dehyphenate Determines if we index 'coworker' also when the input is 'co-worker'. This is new in version 1.22, and on by default. Setting the variable to off allows -restoring the previous behaviour. +restoring the previous behaviour. + backslashasletter Process backslash as normal letter. This may make sense for people wanting to index TeX commands as -such but is not of much general use. +such but is not of much general use. + underscoreasletter Process underscore as normal letter. This makes sense in so many cases that one wonders if it should -not be the default. +not be the default. + maxtermlength Maximum term length. Words longer than this will be discarded. The default is 40 and used to be hard-coded, but it can now be -adjusted. 
You need an index reset if you change the value. +adjusted. You need an index reset if you change the value. + nocjk Decides if specific East Asian @@ -269,20 +311,23 @@ adjusted. You need an index reset if you change the value. +significant time and space saver. + cjkngramlen This lets you adjust the size of n-grams used for indexing CJK text. The default value of 2 is probably appropriate in most cases. A value of 3 would allow more precision and efficiency on longer words, but the index will be approximately twice -as large. +as large. + indexstemminglanguages Languages for which to create stemming expansion data. Stemmer names can be found by executing 'recollindex -l', or this can also be set from a list in the GUI. The values are full -language names, e.g. english, french... +language names, e.g. english, french... + defaultcharset Default character @@ -293,37 +338,39 @@ set, the default character set is the one defined by the NLS environment ($LC_ALL, $LC_CTYPE, $LANG), or ultimately iso-8859-1 (cp-1252 in fact). If for some reason you want a general default which does not match your LANG and is not 8859-1, use this variable. This can be redefined for any -sub-directory. +sub-directory. + unac_except_trans -A list of characters, -encoded in UTF-8, which should be handled specially -when converting text to unaccented lowercase. For -example, in Swedish, the letter a with diaeresis has full alphabet -citizenship and should not be turned into an a. -Each element in the space-separated list has the special character as -first element and the translation following. The handling of both the -lowercase and upper-case versions of a character should be specified, as -appartenance to the list will turn-off both standard accent and case -processing. The value is global and affects both indexing and querying. +A list of characters, encoded in UTF-8, which should be handled specially when converting +text to unaccented lowercase. 
For example, in Swedish, the letter a with diaeresis has full alphabet citizenship and +should not be turned into an a. Each element in the space-separated list has the special +character as first element and the translation following. The handling of both the lowercase and +upper-case versions of a character should be specified, as appartenance to the list will turn-off +both standard accent and case processing. The value is global and affects both indexing and +querying. We also convert a few confusing Unicode characters (quotes, hyphen) to their ASCII +equivalent to avoid "invisible" search failures. + Examples: Swedish: -unac_except_trans = ää Ää öö Öö üü Üü ßss œoe Œoe æae Æae ffff fifi flfl åå Åå +unac_except_trans = ää Ää öö Öö üü Üü ßss œoe Œoe æae Æae ffff fifi flfl åå Åå ’' ❜' ʼ' ‐- . German: -unac_except_trans = ää Ää öö Öö üü Üü ßss œoe Œoe æae Æae ffff fifi flfl +unac_except_trans = ää Ää öö Öö üü Üü ßss œoe Œoe æae Æae ffff fifi flfl ’' ❜' ʼ' ‐- . French: you probably want to decompose oe and ae and nobody would type a German ß -unac_except_trans = ßss œoe Œoe æae Æae ffff fifi flfl +unac_except_trans = ßss œoe Œoe æae Æae ffff fifi flfl ’' ❜' ʼ' ‐- . The default for all until someone protests follows. These decompositions are not performed by unac, but it is unlikely that someone would type the composed forms in a search. -unac_except_trans = ßss œoe Œoe æae Æae ffff fifi flfl +unac_except_trans = ßss œoe Œoe æae Æae ffff fifi flfl ’' ❜' ʼ' ‐- + maildefcharset Overrides the default character set for email messages which don't specify one. This is mainly useful for readpst (libpst) dumps, -which are utf-8 but do not say so. +which are utf-8 but do not say so. + localfields Set fields on all files @@ -331,7 +378,8 @@ which are utf-8 but do not say so. name = value ; attr1 = val1 ; [...] value is empty so this needs an initial semi-colon. This is useful, e.g., for setting the rclaptg field for application selection inside -mimeview. +mimeview. 
+ testmodifusemtime Use mtime instead of @@ -353,12 +401,12 @@ undetected). Perform a full index reset after changing this. noxattrfields Disable extended attributes conversion to metadata fields. This probably needs to be -set if testmodifusemtime is set. +set if testmodifusemtime is set. + metadatacmds Define commands to -gather external metadata, e.g. tmsu tags. -There can be several entries, separated by semi-colons, each defining +gather external metadata, e.g. tmsu tags. There can be several entries, separated by semi-colons, each defining which field name the data goes into and the command to use. Don't forget the initial semi-colon. All the field names must be different. You can use aliases in the "field" file if necessary. @@ -383,13 +431,15 @@ cachedir is ~/.cache/recoll, the default dbdir would be mboxcachedir, aspellDicDir, which can still be individually specified to override cachedir. Note that if you have multiple configurations, each must have a different cachedir, there is no automatic computation of a -subpath under cachedir. +subpath under cachedir. + maxfsoccuppc Maximum file system occupation over which we stop indexing. The value is a percentage, corresponding to what the "Capacity" df output column shows. The default -value is 0, meaning no checking. +value is 0, meaning no checking. + dbdir Xapian database directory @@ -397,36 +447,43 @@ location. This will be created on first indexing. If the value is not an absolute path, it will be interpreted as relative to cachedir if set, or the configuration directory (-c argument or $RECOLL_CONFDIR). If nothing is specified, the default is then -~/.recoll/xapiandb/ +~/.recoll/xapiandb/ + idxstatusfile Name of the scratch file where the indexer process updates its status. Default: idxstatus.txt inside the configuration -directory. +directory. + mboxcachedir Directory location for storing mbox message offsets cache files. 
This is normally 'mboxcache' under cachedir if set, or else under the configuration directory, but it may be useful to share -a directory between different configurations. +a directory between different configurations. + mboxcacheminmbs Minimum mbox file size over which we cache the offsets. There is really no sense in caching offsets for small files. The -default is 5 MB. +default is 5 MB. + mboxmaxmsgmbs Maximum mbox member message size in megabytes. Size over which we assume that the mbox format is bad or we -misinterpreted it, at which point we just stop processing the file. +misinterpreted it, at which point we just stop processing the file. + webcachedir Directory where we store the archived web pages. This is only used by the web history indexing code Default: cachedir/webcache if cachedir is set, else -$RECOLL_CONFDIR/webcache +$RECOLL_CONFDIR/webcache + webcachemaxmbs Maximum size in MB of the Web archive. This is only used by the web history indexing code. Default: 40 MB. -Reducing the size will not physically truncate the file. +Reducing the size will not physically truncate the file. + webqueuedir The path to the Web indexing queue. This used to be @@ -434,36 +491,42 @@ hard-coded in the old plugin as ~/.recollweb/ToIndex so there would be no need or possibility to change it, but the WebExtensions plugin now downloads the files to the user Downloads directory, and a script moves them to webqueuedir. The script reads this value from the config so it has become -possible to change it. +possible to change it. + webdownloadsdir The path to browser downloads directory. This is where the new browser add-on extension has to create the files. They are -then moved by a script to webqueuedir. +then moved by a script to webqueuedir. + webcachekeepinterval Page recycle interval By default, only one instance of an URL is kept in the cache. 
This can be changed by setting this to a value determining at what frequency we keep multiple instances ('day', 'week', 'month', 'year'). Note that increasing the interval will not erase existing -entries. +entries. + aspellDicDir Aspell dictionary storage directory location. The aspell dictionary (aspdict.(lang).rws) is normally stored in the directory specified by cachedir if set, or under the configuration -directory. +directory. + filtersdir Directory location for executable input handlers. If RECOLL_FILTERSDIR is set in the environment, we use it instead. Defaults to $prefix/share/recoll/filters. Can be redefined for -subdirectories. +subdirectories. + iconsdir Directory location for icons. The only reason to change this would be if you want to change the icons displayed in the -result list. Defaults to $prefix/share/recoll/images +result list. Defaults to $prefix/share/recoll/images + Parameters affecting indexing performance and resource usage @@ -481,13 +544,15 @@ value (from this file) is now 50 MB, and should be ok in many cases. You can set it as low as 10 to conserve memory, but if you are looking for maximum speed, you may want to experiment with values between 20 and 200. In my experience, values beyond this are always counterproductive. If -you find otherwise, please drop me a note. +you find otherwise, please drop me a note. + filtermaxseconds Maximum external filter execution time in seconds. Default 1200 (20mn). Set to 0 for no limit. This is mainly to avoid infinite loops in postscript files -(loop.ps) +(loop.ps) + filtermaxmbytes Maximum virtual memory space for filter processes @@ -495,7 +560,8 @@ is mainly to avoid infinite loops in postscript files Linux way to limit the data space only), so we need to be a bit generous here. Anything over 2000 will be ignored on 32 bits machines. The previous default value of 2000 would prevent java pdftk to work when -executed from Python rclpdf.py. +executed from Python rclpdf.py. 
+ thrQSizes Stage input queues configuration. There are three @@ -507,7 +573,8 @@ next stage. In practise, deep queues have not been shown to increase performance. Default: a value of 0 for the first queue tells Recoll to perform autoconfiguration based on the detected number of CPUs (no need for the two other values in this case). Use thrQSizes = -1 -1 -1 to -disable multithreading entirely. +disable multithreading entirely. + thrTCounts Number of threads used for each indexing stage. The @@ -517,7 +584,8 @@ in thrQSizes: if the first queue depth is 0, all counts are ignored (autoconfigured); if a value of -1 is used for a queue depth, the corresponding thread count is ignored. It makes no sense to use a value other than 1 for the last stage because updating the Xapian index is -necessarily single-threaded (and protected by a mutex). +necessarily single-threaded (and protected by a mutex). + Miscellaneous parameters @@ -525,7 +593,8 @@ necessarily single-threaded (and protected by a mutex).loglevel Log file verbosity 1-6. A value of 2 will print only errors and warnings. 3 will print information like document updates, -4 is quite verbose and 6 very verbose. +4 is quite verbose and 6 very verbose. + logfilename Log file destination. Use 'stderr' (default) to write to the @@ -541,17 +610,20 @@ console. Destination file for external helpers standard error output. The external program error output is left alone by default, e.g. going to the terminal when the recoll[index] program is executed from the command line. Use /dev/null or a file inside a non-existent -directory to completely suppress the output. +directory to completely suppress the output. + daemloglevel Override loglevel for the indexer in real time mode. The default is to use the idx... values if set, else -the log... values. +the log... values. + daemlogfilename Override logfilename for the indexer in real time mode. The default is to use the idx... values if set, else -the log... values. +the log... 
values. + pyloglevel Override loglevel for the python module. @@ -564,7 +636,8 @@ the log... values. configuration directory inside the directory tree makes it possible to provide automatic query time path translations once the data set has moved (for example, because it has been mounted on another -location). +location). + curidxconfdir Current location of the configuration directory. Complement orgidxconfdir for movable datasets. This should be used @@ -576,7 +649,8 @@ example if a dataset originally indexed as '/home/me/mydata/config' has been mounted to '/media/me/mydata', and the GUI is running from a copied configuration, orgidxconfdir would be '/home/me/mydata/config', and curidxconfdir (as set in the copied configuration) would be -'/media/me/mydata/config'. +'/media/me/mydata/config'. + idxrundir Indexing process current directory. The input @@ -585,19 +659,22 @@ makes sense to have recollindex chdir to some temporary directory. If the value is empty, the current directory is not changed. If the value is (literal) tmp, we use the temporary directory as set by the environment (RECOLL_TMPDIR else TMPDIR else /tmp). If the value is an -absolute path to a directory, we go there. +absolute path to a directory, we go there. + checkneedretryindexscript Script used to heuristically check if we need to retry indexing files which previously failed. The default script checks the modified dates on /usr/bin and /usr/local/bin. A relative path will be looked up in the filters dirs, then in the path. Use an absolute path -to do otherwise. +to do otherwise. + recollhelperpath Additional places to search for helper executables. This is used, e.g., on Windows by the Python code, and on Mac OS by the bundled recoll.app (because I could find no reliable way to tell launchd to set the PATH). The example below is for -Windows. Use ':' as entry separator for Mac and Ux-like systems, ';' is for Windows only. +Windows. 
Use ':' as entry separator for Mac and Ux-like systems, ';' is for Windows only. + idxabsmlen Length of abstracts we store while indexing. Recoll stores an abstract for each indexed file. @@ -609,62 +686,72 @@ defines the size of the stored abstract. The default value is 250 bytes. The search interface gives you the choice to display this stored text or a synthetic abstract built by extracting text around the search terms. If you always prefer the synthetic abstract, you can reduce this -value and save a little space. +value and save a little space. + idxmetastoredlen Truncation length of stored metadata fields. This does not affect indexing (the whole field is processed anyway), just the amount of data stored in the index for the purpose of displaying fields inside result lists or previews. The default value is 150 bytes which -may be too low if you have custom fields. +may be too low if you have custom fields. + idxtexttruncatelen Truncation length for all document texts. Only index the beginning of documents. This is not recommended except if you are sure that the interesting keywords are at the top and have severe disk -space issues. +space issues. + idxsynonyms Name of the index-time synonyms file. This is used for indexing multiword synonyms as single terms, which in turn is only useful if you want to perform proximity searches -with such terms. +with such terms. + aspellLanguage Language definitions to use when creating the aspell dictionary. The value must match a set of aspell language definition files. You can type "aspell dicts" to see a list The default if this is not set is to use the NLS environment to guess the value. The -values are the 2-letter language codes (e.g. 'en', 'fr'...) +values are the 2-letter language codes (e.g. 'en', 'fr'...) + aspellAddCreateParam Additional option and parameter to aspell dictionary creation command. Some aspell packages may need an additional option (e.g. on Debian Jessie: --local-data-dir=/usr/lib/aspell). 
See Debian bug -772415. +772415. + aspellKeepStderr Set this to have a look at aspell dictionary creation errors. There are always many, so this is mostly for -debugging. +debugging. + noaspell Disable aspell use. The aspell dictionary generation takes time, and some combinations of aspell version, language, and local terms, result in aspell crashing, so it sometimes makes sense to just -disable the thing. +disable the thing. + monauxinterval Auxiliary database update interval. The real time indexer only updates the auxiliary databases (stemdb, aspell) periodically, because it would be too costly to do it for every document -change. The default period is one hour. +change. The default period is one hour. + monixinterval Minimum interval (seconds) between processings of the indexing queue. The real time indexer does not process each event when it comes in, but lets the queue accumulate, to diminish overhead and to aggregate multiple events affecting the same file. Default 30 -S. +S. + mondelaypatterns Timing parameters for the real time indexing. Definitions for files which get a longer delay before reindexing @@ -673,21 +760,25 @@ reindexed once in a while. A list of wildcardPattern:seconds pairs. The patterns are matched with fnmatch(pattern, path, 0) You can quote entries containing white space with double quotes (quote the whole entry, not the pattern). The default is empty. -Example: mondelaypatterns = *.log:20 "*with spaces.*:30" +Example: mondelaypatterns = *.log:20 "*with spaces.*:30" + idxniceprio "nice" process priority for the indexing processes. Default: 19 -(lowest) Appeared with 1.26.5. Prior versions were fixed at 19. +(lowest) Appeared with 1.26.5. Prior versions were fixed at 19. + monioniceclass ionice class for the indexing process. Despite the misleading name, and on platforms where this is supported, this affects all indexing processes, not only the real time/monitoring ones. The default value is 3 (use -lowest "Idle" priority). 
+lowest "Idle" priority). + monioniceclassdata ionice class level parameter if the class supports it. The default is empty, as the default "Idle" class has no -levels. +levels. + Query-time parameters (no impact on the index) @@ -696,7 +787,8 @@ levels. auto-trigger diacritics sensitivity (raw index only). IF the index is not stripped, decide if we automatically trigger diacritics sensitivity if the search term has accented characters (not in unac_except_trans). Else you need to use the query language and the "D" -modifier to specify diacritics sensitivity. Default is no. +modifier to specify diacritics sensitivity. Default is no. + autocasesens auto-trigger case sensitivity (raw index only). IF @@ -704,40 +796,46 @@ the index is not stripped (see indexStripChars), decide if we automatically trigger character case sensitivity if the search term has upper-case characters in any but the first position. Else you need to use the query language and the "C" modifier to specify character-case -sensitivity. Default is yes. +sensitivity. Default is yes. + maxTermExpand Maximum query expansion count for a single term (e.g.: when using wildcards). This only affects queries, not indexing. We used to not limit this at all (except for filenames where the limit was too low at 1000), but it is -unreasonable with a big index. Default 10000. +unreasonable with a big index. Default 10000. + maxXapianClauses Maximum number of clauses we add to a single Xapian query. This only affects queries, not indexing. In some cases, the result of term expansion can be multiplicative, and we want to avoid eating all the memory. Default -50000. +50000. + snippetMaxPosWalk Maximum number of positions we walk while populating a snippet for the result list. The default of 1,000,000 may be insufficient for very big documents, the consequence would be snippets -with possibly meaning-altering missing words. +with possibly meaning-altering missing words. 
+ Parameters for the PDF input script pdfocr Attempt OCR of PDF files with no text content. This can be defined in subdirectories. The default is off because -OCR is so very slow. +OCR is so very slow. + pdfattach Enable PDF attachment extraction by executing pdftk (if available). This is normally disabled, because it does slow down PDF indexing a bit even if -not one attachment is ever found. +not one attachment is ever found. + pdfextrameta Extract text from selected XMP metadata tags. This @@ -745,7 +843,8 @@ is a space-separated list of qualified XMP tag names. Each element can also include a translation to a Recoll field name, separated by a '|' character. If the second element is absent, the tag name is used as the Recoll field names. You will also need to add specifications to the -"fields" file to direct processing of the extracted data. +"fields" file to direct processing of the extracted data. + pdfextrametafix Define name of XMP field editing script. This @@ -754,7 +853,8 @@ values. The script should define a 'MetaFixer' class with a metafix() method which will be called with the qualified tag name and value of each selected field, for editing or erasing. A new instance is created for each document, so that the object can keep state for, e.g. eliminating -duplicate values. +duplicate values. + Parameters for OCR processing @@ -766,17 +866,20 @@ the input file. Modules for tesseract (tesseract) and ABBYY FineReader (abbyy) are present in the standard distribution. For compatibility with the previous version, if this is not defined at all, the default value is "tesseract". Use an explicit empty value if needed. A value of "abbyy -tesseract" will try everything. +tesseract" will try everything. + ocrcachedir Location for caching OCR data. The default if this is empty or undefined is to store the cached -OCR data under $RECOLL_CONFDIR/ocrcache. +OCR data under $RECOLL_CONFDIR/ocrcache. + tesseractlang Language to assume for tesseract OCR. 
Important for improving the OCR accuracy. This can also be set through the contents of a file in the currently processed directory. See the rclocrtesseract.py -script. Example values: eng, fra... See the tesseract documentation. +script. Example values: eng, fra... See the tesseract documentation. + tesseractcmd Path for the tesseract command. Do not quote. This is mostly useful on Windows, or for specifying a non-default @@ -800,6 +903,7 @@ script. Typical values: English, French... See the ABBYY documentation. mhmboxquirks Enable thunderbird/mozilla-seamonkey mbox format quirks Set this for the directory where the email mbox files are -stored. +stored. + diff --git a/src/doc/user/usermanual.html b/src/doc/user/usermanual.html index 582d2b2a..7cee7dcb 100644 --- a/src/doc/user/usermanual.html +++ b/src/doc/user/usermanual.html @@ -8929,24 +8929,26 @@ hasextract = False White space separated list of wildcard patterns (simple ones, not paths, must contain no '/' characters), which will be tested against file - and directory names. Have a look at the default - configuration for the initial value, some entries - may not suit your situation. The easiest way to - see it is through the GUI Index configuration - "local parameters" panel. The list in the default - configuration does not exclude hidden directories - (names beginning with a dot), which means that it - may index quite a few things that you do not - want. On the other hand, email user agents like - Thunderbird usually store messages in hidden - directories, and you probably want this indexed. - One possible solution is to have ".*" in - "skippedNames", and add things like - "~/.thunderbird" "~/.evolution" to "topdirs". Not - even the file names are indexed for patterns in - this list, see the "noContentSuffixes" variable - for an alternative approach which indexes the - file names. Can be redefined for any subtree.

+ and directory names.

+

Have a look at the default configuration for + the initial value, some entries may not suit your + situation. The easiest way to see it is through + the GUI Index configuration "local parameters" + panel.

+

The list in the default configuration does not + exclude hidden directories (names beginning with + a dot), which means that it may index quite a few + things that you do not want. On the other hand, + email user agents like Thunderbird usually store + messages in hidden directories, and you probably + want this indexed. One possible solution is to + have ".*" in "skippedNames", and add things like + "~/.thunderbird" "~/.evolution" to "topdirs".

+

Not even the file names are indexed for + patterns in this list, see the + "noContentSuffixes" variable for an alternative + approach which indexes the file names. Can be + redefined for any subtree.

Absolute paths we should not go into. Space-separated list of wildcard expressions for - absolute filesystem paths. Must be defined at the + absolute filesystem paths (for files or + directories). The variable must be defined at the top level of the configuration file, not in a - subsection. Can contain files and directories. - The database and configuration directories will - automatically be added. The expressions are - matched using 'fnmatch(3)' with the FNM_PATHNAME - flag set by default. This means that '/' - characters must be matched explicitly. You can - set 'skippedPathsFnmPathname' to 0 to disable the - use of FNM_PATHNAME (meaning that '/*/dir3' will - match '/dir1/dir2/dir3'). The default value - contains the usual mount point for removable - media to remind you that it is a bad idea to have - Recoll work on these (esp. with the monitor: - media gets indexed on mount, all data gets erased - on unmount). Explicitly adding '/media/xxx' to - the 'topdirs' variable will override this.

+ subsection.

+

Any value in the list must be textually + consistent with the values in topdirs, no + attempts are made to resolve symbolic links. In + practice, if, as is frequently the case, /home is + a link to /usr/home, your default topdirs will + have a single entry '~' which will be translated + to '/home/yourlogin'. In this case, any + skippedPaths entry should start with + '/home/yourlogin' *not* with + '/usr/home/yourlogin'.
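The textual-consistency rule can be illustrated with a minimal Python sketch. The helper name `textually_under` is hypothetical and is not part of Recoll; it only compares path strings, the way the paragraph describes, and never resolves symbolic links.

```python
def textually_under(path, top):
    """Return True if 'path' is 'top' or lies below it, comparing strings only.

    No symbolic link is resolved, so /usr/home/x and /home/x are treated as
    unrelated even when /home is a link to /usr/home.
    """
    top = top.rstrip("/")
    return path == top or path.startswith(top + "/")


# '~' is assumed to have been expanded to '/home/yourlogin', as in the text.
topdir = "/home/yourlogin"
print(textually_under("/home/yourlogin/tmp", topdir))      # consistent entry
print(textually_under("/usr/home/yourlogin/tmp", topdir))  # inconsistent entry
```

Under this assumption, a skippedPaths entry written with the '/usr/home/yourlogin' prefix would never line up with the paths derived from topdirs.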

+

The index and configuration directories will + automatically be added to the list.

+

The expressions are matched using 'fnmatch(3)' + with the FNM_PATHNAME flag set by default. This + means that '/' characters must be matched + explicitly. You can set 'skippedPathsFnmPathname' + to 0 to disable the use of FNM_PATHNAME (meaning + that '/*/dir3' will match '/dir1/dir2/dir3').
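The effect of FNM_PATHNAME can be sketched in Python. This is an emulation written for illustration, not the code Recoll uses (which calls 'fnmatch(3)'); the function names are hypothetical, and only the '*' and '?' wildcards are handled, not bracket expressions.

```python
import re


def wildcard_to_regex(pattern, pathname=True):
    """Translate a simple wildcard pattern to a compiled regular expression.

    With pathname=True (emulating FNM_PATHNAME), '*' and '?' never match '/'.
    """
    any_char = "[^/]" if pathname else "."
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(any_char + "*")
        elif ch == "?":
            parts.append(any_char)
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts) + r"\Z", re.DOTALL)


def path_matches(pattern, path, pathname=True):
    """Return True if the whole path matches the wildcard pattern."""
    return wildcard_to_regex(pattern, pathname).match(path) is not None


# Default behaviour: '/' must be matched explicitly.
print(path_matches("/*/dir3", "/dir1/dir2/dir3", pathname=True))
# Equivalent of skippedPathsFnmPathname = 0: '*' may span '/'.
print(path_matches("/*/dir3", "/dir1/dir2/dir3", pathname=False))
```

With the flag emulated, '/*/dir3' only matches paths with exactly one component before 'dir3'; without it, the same pattern also matches '/dir1/dir2/dir3', as stated above.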

+

The default value contains the usual mount + point for removable media to remind you that it + is in most cases a bad idea to have Recoll work + on these. Explicitly adding '/media/xxx' to the + 'topdirs' variable will override this.

Decide if we store the documents' text content in the index. Storing the text allows extracting snippets from it at query time, instead of - building them from index position data. Newer - Xapian index formats have rendered our use of - positions list unacceptably slow in some cases. - The last Xapian index format with good + building them from index position data.

+

Newer Xapian index formats have rendered our + use of positions list unacceptably slow in some + cases. The last Xapian index format with good performance for the old method is Chert, which is default for 1.2, still supported but not default - in 1.4 and will be dropped in 1.6. The stored - document text is translated from its original - format to UTF-8 plain text, but not stripped of - upper-case, diacritics, or punctuation signs. - Storing it increases the index size by 10-20% - typically, but also allows for nicer snippets, so - it may be worth enabling it even if not strictly - needed for performance if you can afford the - space. The variable only has an effect when - creating an index, meaning that the xapiandb - directory must not exist yet. Its exact effect - depends on the Xapian version. For Xapian 1.4, if - the variable is set to 0, the Chert format will - be used, and the text will not be stored. If the - variable is 1, Glass will be used, and the text - stored. For Xapian 1.2, and for versions after - 1.5 and newer, the index format is always the - default, but the variable controls if the text is - stored or not, and the abstract generation - method. With Xapian 1.5 and later, and the - variable set to 0, abstract generation may be - very slow, but this setting may still be useful - to save space if you do not use abstract - generation at all.

+ in 1.4 and will be dropped in 1.6.

+

The stored document text is translated from + its original format to UTF-8 plain text, but not + stripped of upper-case, diacritics, or + punctuation signs. Storing it increases the index + size by 10-20% typically, but also allows for + nicer snippets, so it may be worth enabling it + even if not strictly needed for performance if + you can afford the space.

+

The variable only has an effect when creating + an index, meaning that the xapiandb directory + must not exist yet. Its exact effect depends on + the Xapian version.

+

For Xapian 1.4, if the variable is set to 0, + the Chert format will be used, and the text will + not be stored. If the variable is 1, Glass will + be used, and the text stored.

+

For Xapian 1.2, and for versions 1.5 and + newer, the index format is always the default, + but the variable controls if the text is stored + or not, and the abstract generation method. With + Xapian 1.5 and later, and the variable set to 0, + abstract generation may be very slow, but this + setting may still be useful to save space if you + do not use abstract generation at all.
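The version-dependent behaviour described in the last two paragraphs can be summarised as a small decision function. This is only a restatement of the text as a Python sketch: `index_layout` and its `store_text` parameter are hypothetical names, and "default" stands for whatever format the installed Xapian version uses by default.

```python
def index_layout(xapian_version, store_text):
    """Return (index_format, text_stored) for a new index.

    xapian_version is a (major, minor) tuple; store_text is the 0/1 value of
    the configuration variable discussed above.
    """
    if tuple(xapian_version[:2]) == (1, 4):
        # Xapian 1.4: the variable also selects the index format.
        return ("glass", True) if store_text else ("chert", False)
    # Xapian 1.2, and 1.5 and newer: default format, the variable only
    # controls whether the text is stored.
    return ("default", bool(store_text))


print(index_layout((1, 4), 0))
print(index_layout((1, 4), 1))
print(index_layout((1, 5), 0))
```

As the text notes, the choice only matters when the index is created, so such a function would be consulted once, before the xapiandb directory exists.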

+

Examples: Swedish: unac_except_trans = ää Ää + öö Öö üü Üü ßss œoe Œoe æae Æae ffff fifi flfl åå Åå + ’' ❜' ʼ' ‐- . German: unac_except_trans = ää Ää + öö Öö üü Üü ßss œoe Œoe æae Æae ffff fifi flfl ’' ❜' + ʼ' ‐- . French: you probably want to decompose oe + and ae and nobody would type a German ß + unac_except_trans = ßss œoe Œoe æae Æae ffff fifi + flfl ’' ❜' ʼ' ‐- . The default for all until + someone protests follows. These decompositions + are not performed by unac, but it is unlikely + that someone would type the composed forms in a search. unac_except_trans = ßss œoe Œoe æae Æae - ffff fifi flfl

+ ffff fifi flfl ’' ❜' ʼ' ‐-

Files and directories which should be ignored. # # White space separated list of wildcard patterns (simple ones, not paths, must contain no -# '/' characters), which will be tested against file and directory names. Have a look at the default -# configuration for the initial value, some entries may not suit your situation. The easiest way to -# see it is through the GUI Index configuration "local parameters" panel. The list in the default -# configuration does not exclude hidden directories (names beginning with a dot), which means that -# it may index quite a few things that you do not want. On the other hand, email user agents like -# Thunderbird usually store messages in hidden directories, and you probably want this indexed. One -# possible solution is to have ".*" in "skippedNames", and add things like "~/.thunderbird" -# "~/.evolution" to "topdirs". Not even the file names are indexed for patterns in this list, see -# the "noContentSuffixes" variable for an alternative approach which indexes the file names. Can be -# redefined for any subtree. +# '/' characters), which will be tested against file and directory names. +# +# Have a look at the default configuration for the initial value, some entries may not suit your +# situation. The easiest way to see it is through the GUI Index configuration "local parameters" +# panel. +# +# The list in the default configuration does not exclude hidden directories (names beginning with a +# dot), which means that it may index quite a few things that you do not want. On the other hand, +# email user agents like Thunderbird usually store messages in hidden directories, and you probably +# want this indexed. One possible solution is to have ".*" in "skippedNames", and add things like +# "~/.thunderbird" "~/.evolution" to "topdirs". +# +# Not even the file names are indexed for patterns in this list, see the "noContentSuffixes" +# variable for an alternative approach which indexes the file names. 
Can be redefined for any +# subtree. # # skippedNames = #* CVS Cache cache* .cache caughtspam tmp \ @@ -104,19 +109,26 @@ noContentSuffixes+ = # # # Absolute paths we should not go into. -# Space-separated list of wildcard expressions for absolute -# filesystem paths. Must be defined at the top level of the configuration -# file, not in a subsection. Can contain files and directories. The database and -# configuration directories will automatically be added. The expressions -# are matched using 'fnmatch(3)' with the FNM_PATHNAME flag set by -# default. This means that '/' characters must be matched explicitly. You -# can set 'skippedPathsFnmPathname' to 0 to disable the use of FNM_PATHNAME -# (meaning that '/*/dir3' will match '/dir1/dir2/dir3'). The default value -# contains the usual mount point for removable media to remind you that it -# is a bad idea to have Recoll work on these (esp. with the monitor: media -# gets indexed on mount, all data gets erased on unmount). Explicitly -# adding '/media/xxx' to the 'topdirs' variable will override -# this. +# +# Space-separated list of wildcard expressions for absolute filesystem paths (for files or +# directories). The variable must be defined at the top level of the configuration file, not in a +# subsection. +# +# Any value in the list must be textually consistent with the values in topdirs, no attempts are +# made to resolve symbolic links. In practise, if, as is frequently the case, /home is a link to +# /usr/home, your default topdirs will have a single entry '~' which will be translated to +# '/home/yourlogin'. In this case, any skippedPaths entry should start with '/home/yourlogin' *not* +# with '/usr/home/yourlogin'. +# +# The index and configuration directories will automatically be added to the list. +# +# The expressions are matched using 'fnmatch(3)' with the FNM_PATHNAME flag set by default. This +# means that '/' characters must be matched explicitly. 
You can set 'skippedPathsFnmPathname' to 0 +# to disable the use of FNM_PATHNAME (meaning that '/*/dir3' will match '/dir1/dir2/dir3'). +# +# The default value contains the usual mount point for removable media to remind you that it is in +# most cases a bad idea to have Recoll work on these. Explicitly adding '/media/xxx' to the 'topdirs' +# variable will override this. skippedPaths = /media # Set to 0 to