Jean-Francois Dockes 2019-04-14 16:18:39 +02:00
parent 48bc71da70
commit 567aaa2035
4 changed files with 875 additions and 1037 deletions

@@ -54,12 +54,20 @@ home directory.
Where values are lists, white space is used for separation, and elements with Where values are lists, white space is used for separation, and elements with
embedded spaces can be quoted with double-quotes. embedded spaces can be quoted with double-quotes.
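The list syntax described here (white space separation, double quotes around elements with embedded spaces) can be illustrated with a short Python sketch. It uses `shlex.split`, which only approximates Recoll's own list parser, and the example value is hypothetical.

```python
import shlex

# Hypothetical topdirs value: the second element has an embedded space,
# so it is protected with double quotes.
value = '~/docs "/home/me/My Documents" /srv/data'

# shlex.split() separates on white space and honours double quotes.
# It only approximates Recoll's own list parsing.
elements = shlex.split(value)
print(elements)
```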
.SH OPTIONS .SH OPTIONS
.TP .TP
.BI "topdirs = "string .BI "topdirs = "string
Space-separated list of files or Space-separated list of files or
directories to recursively index. Default to ~ (indexes directories to recursively index. Default to ~ (indexes
$HOME). You can use symbolic links in the list, they will be followed, $HOME). You can use symbolic links in the list, they will be followed,
independently of the value of the followLinks variable. independently of the value of the followLinks variable.
.TP
.BI "monitordirs = "string
Space-separated list of files or directories to monitor for
updates. When running the real-time indexer, this allows monitoring only a
subset of the whole indexed area. The elements must be included in the
tree defined by the 'topdirs' members.
.TP .TP
.BI "skippedNames = "string .BI "skippedNames = "string
Files and directories which should be ignored. Files and directories which should be ignored.
@@ -69,13 +77,21 @@ names. The list in the default configuration does not exclude hidden
directories (names beginning with a dot), which means that it may index directories (names beginning with a dot), which means that it may index
quite a few things that you do not want. On the other hand, email user quite a few things that you do not want. On the other hand, email user
agents like Thunderbird usually store messages in hidden directories, and agents like Thunderbird usually store messages in hidden directories, and
you probably want this indexed. One possible solution is to have '.*' you probably want this indexed. One possible solution is to have ".*" in
in 'skippedNames', and add things like '~/.thunderbird' '~/.evolution' "skippedNames", and add things like "~/.thunderbird" "~/.evolution" to
to 'topdirs'. Not even the file names are indexed for patterns in this "topdirs". Not even the file names are indexed for patterns in this
list, see the 'noContentSuffixes' variable for an alternative approach list, see the "noContentSuffixes" variable for an alternative approach
which indexes the file names. Can be redefined for any which indexes the file names. Can be redefined for any
subtree. subtree.
.TP .TP
.BI "skippedNames- = "string
List of name endings to remove from the default skippedNames
list.
.TP
.BI "skippedNames+ = "string
List of name endings to add to the default skippedNames
list.
.TP
.BI "noContentSuffixes = "string .BI "noContentSuffixes = "string
List of name endings (not necessarily dot-separated suffixes) for List of name endings (not necessarily dot-separated suffixes) for
which we don't try MIME type identification, and don't uncompress or which we don't try MIME type identification, and don't uncompress or
@@ -87,38 +103,59 @@ from skippedNames because these are name ending matches only (not
wildcard patterns), and the file name itself gets indexed normally. This wildcard patterns), and the file name itself gets indexed normally. This
can be redefined for subdirectories. can be redefined for subdirectories.
.TP .TP
.BI "noContentSuffixes- = "string
List of name endings to remove from the default noContentSuffixes
list.
.TP
.BI "noContentSuffixes+ = "string
List of name endings to add to the default noContentSuffixes
list.
.TP
.BI "skippedPaths = "string .BI "skippedPaths = "string
Paths we should not go into. Space-separated list of Absolute paths we should not go into. Space-separated list of wildcard expressions for absolute
wildcard expressions for filesystem paths. Can contain files and filesystem paths. Must be defined at the top level of the configuration
directories. The database and configuration directories will file, not in a subsection. Can contain files and directories. The database and
automatically be added. The expressions are matched using 'fnmatch(3)' configuration directories will automatically be added. The expressions
with the FNM_PATHNAME flag set by default. This means that '/' characters are matched using 'fnmatch(3)' with the FNM_PATHNAME flag set by
must be matched explicitly. You can set 'skippedPathsFnmPathname' to 0 default. This means that '/' characters must be matched explicitly. You
to disable the use of FNM_PATHNAME (meaning that '/*/dir3' will can set 'skippedPathsFnmPathname' to 0 to disable the use of FNM_PATHNAME
match '/dir1/dir2/dir3'). The default value contains the usual mount point (meaning that '/*/dir3' will match '/dir1/dir2/dir3'). The default value
for removable media to remind you that it is a bad idea to have Recoll work contains the usual mount point for removable media to remind you that it
on these (esp. with the monitor: media gets indexed on mount, all data is a bad idea to have Recoll work on these (esp. with the monitor: media
gets erased on unmount). Explicitly adding '/media/xxx' to the topdirs gets indexed on mount, all data gets erased on unmount). Explicitly
will override this. adding '/media/xxx' to the 'topdirs' variable will override
this.
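The effect of FNM_PATHNAME on the '/*/dir3' example can be sketched in Python. The standard `fnmatch` module has no FNM_PATHNAME flag, so the hypothetical helper `path_matches` below approximates it by matching one path component at a time. It is an illustration of the matching rule, not Recoll's actual code.

```python
import fnmatch


def path_matches(pattern, path, fnm_pathname=True):
    """Approximate fnmatch(3) with or without FNM_PATHNAME (sketch only)."""
    if not fnm_pathname:
        # Without FNM_PATHNAME, '*' is allowed to match '/' characters.
        return fnmatch.fnmatchcase(path, pattern)
    # With FNM_PATHNAME, '/' must be matched explicitly, so the pattern
    # and the path need the same number of '/'-separated components.
    pattern_parts = pattern.split("/")
    path_parts = path.split("/")
    if len(pattern_parts) != len(path_parts):
        return False
    return all(
        fnmatch.fnmatchcase(part, pat)
        for part, pat in zip(path_parts, pattern_parts)
    )


# skippedPathsFnmPathname = 1 (default): '*' stops at '/'.
print(path_matches("/*/dir3", "/dir1/dir2/dir3", fnm_pathname=True))
# skippedPathsFnmPathname = 0: '*' may span several directories.
print(path_matches("/*/dir3", "/dir1/dir2/dir3", fnm_pathname=False))
```

The component-wise comparison ignores corner cases such as bracket expressions containing '/', which is acceptable for a demonstration of the flag.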
.TP .TP
.BI "skippedPathsFnmPathname = "bool .BI "skippedPathsFnmPathname = "bool
Set to 0 to Set to 0 to
override use of FNM_PATHNAME for matching skipped override use of FNM_PATHNAME for matching skipped
paths. paths.
.TP .TP
.BI "nowalkfn = "string
File name which will cause its parent directory to be skipped. Any directory containing a file with this name will be skipped as
if it was part of the skippedPaths list. Ex: .recoll-noindex
.TP
.BI "daemSkippedPaths = "string .BI "daemSkippedPaths = "string
skippedPaths equivalent specific to skippedPaths equivalent specific to
real time indexing. This enables having parts of the tree real time indexing. This enables having parts of the tree
which are initially indexed but not monitored. If daemSkippedPaths is which are initially indexed but not monitored. If daemSkippedPaths is
not set, the daemon uses skippedPaths. not set, the daemon uses skippedPaths.
.TP
.BI "zipUseSkippedNames = "bool
Use skippedNames inside Zip archives. Fetched
directly by the rclzip handler. Skip the patterns defined by skippedNames
inside Zip archives. Can be redefined for subdirectories.
See https://www.lesbonscomptes.com/recoll/faqsandhowtos/FilteringOutZipArchiveMembers.html
.TP .TP
.BI "zipSkippedNames = "string .BI "zipSkippedNames = "string
Space-separated list of wildcard expressions for names that should Space-separated list of wildcard expressions for names that should
be ignored inside zip archives. This is used directly by be ignored inside zip archives. This is used directly by
the zip handler, and has a function similar to skippedNames, but works the zip handler. If zipUseSkippedNames is not set, zipSkippedNames
independently. Can be redefined for subdirectories. Supported by recoll defines the patterns to be skipped inside archives. If zipUseSkippedNames
1.20 and newer. See is set, the two lists are concatenated and used. Can be redefined for
https://bitbucket.org/medoc/recoll/wiki/Filtering%20out%20Zip%20archive%20members subdirectories.
See https://www.lesbonscomptes.com/recoll/faqsandhowtos/FilteringOutZipArchiveMembers.html
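The interaction between zipUseSkippedNames and zipSkippedNames can be summarized by the following Python sketch. The function name and the concatenation order are assumptions made for illustration; the text only states that the two lists are concatenated.

```python
def zip_skip_patterns(zip_skipped_names, skipped_names, zip_use_skipped_names):
    """Return the patterns to skip inside Zip archives (illustrative sketch)."""
    patterns = list(zip_skipped_names)
    if zip_use_skipped_names:
        # Both lists are used. The order is an assumption and should not
        # matter for a simple "does any pattern match" test.
        patterns = list(skipped_names) + patterns
    return patterns


# Hypothetical configuration values.
skipped = ["*~", "#*"]
zip_skipped = ["*.class"]
print(zip_skip_patterns(zip_skipped, skipped, zip_use_skipped_names=False))
print(zip_skip_patterns(zip_skipped, skipped, zip_use_skipped_names=True))
```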
.TP .TP
.BI "followLinks = "bool .BI "followLinks = "bool
@@ -133,16 +170,27 @@ followed.
.BI "indexedmimetypes = "string .BI "indexedmimetypes = "string
Restrictive list of Restrictive list of
indexed mime types. Normally not set (in which case all indexed mime types. Normally not set (in which case all
supported types are indexed). If it is set, supported types are indexed). If it is set, only the types from the list
only the types from the list will have their contents indexed. The names will have their contents indexed. The names will be indexed anyway if
will be indexed anyway if indexallfilenames is set (default). MIME indexallfilenames is set (default). MIME type names should be taken from
type names should be taken from the mimemap file. Can be redefined for the mimemap file (the values may be different from xdg-mime or file -i
subtrees. output in some cases). Can be redefined for subtrees.
.TP .TP
.BI "excludedmimetypes = "string .BI "excludedmimetypes = "string
List of excluded MIME List of excluded MIME
types. Lets you exclude some types from indexing. Can be types. Lets you exclude some types from indexing. MIME type
redefined for subtrees. names should be taken from the mimemap file (the values may be different
from xdg-mime or file -i output in some cases). Can be redefined for
subtrees.
.TP
.BI "nomd5types = "string
Don't compute md5 for these types. md5 checksums are used only for deduplicating results, and can be
very expensive to compute on multimedia or other big files. This list
lets you turn off md5 computation for selected types. It is global (no
redefinition for subtrees). At the moment, it only has an effect for
external handlers (exec and execm). The file types can be specified by
listing either MIME types (e.g. audio/mpeg) or handler names
(e.g. rclaudio).
.TP .TP
.BI "compressedfilemaxkbs = "int .BI "compressedfilemaxkbs = "int
Size limit for compressed Size limit for compressed
@@ -173,9 +221,9 @@ for the command used.
Command used to guess Command used to guess
MIME types if the internal methods fail. This should be a MIME types if the internal methods fail. This should be a
"file -i" workalike. The file path will be added as a last parameter to "file -i" workalike. The file path will be added as a last parameter to
the command line. 'xdg-mime' works better than the traditional 'file' the command line. "xdg-mime" works better than the traditional "file"
command, and is now the configured default (with a hard-coded fallback command, and is now the configured default (with a hard-coded fallback to
to 'file') "file")
.TP .TP
.BI "processwebqueue = "bool .BI "processwebqueue = "bool
Decide if we process the Decide if we process the
@@ -204,6 +252,34 @@ will be bigger, and some marginal weirdness may sometimes occur. The
default is a stripped index. When using multiple indexes for a search, default is a stripped index. When using multiple indexes for a search,
this parameter must be defined identically for all. Changing the value this parameter must be defined identically for all. Changing the value
implies an index reset. implies an index reset.
.TP
.BI "indexStoreDocText = "bool
Decide if we store the
documents' text content in the index. Storing the text
allows extracting snippets from it at query time, instead of building
them from index position data.
Newer Xapian index formats have rendered our use of position lists
unacceptably slow in some cases. The last Xapian index format with good
performance for the old method is Chert, which is the default for 1.2, still
supported but not the default in 1.4, and will be dropped in 1.6.
The stored document text is translated from its original format to UTF-8
plain text, but not stripped of upper-case, diacritics, or punctuation
signs. Storing it increases the index size by 10-20% typically, but also
allows for nicer snippets, so it may be worth enabling it even if not
strictly needed for performance if you can afford the space.
The variable only has an effect when creating an index, meaning that the
xapiandb directory must not exist yet. Its exact effect depends on the
Xapian version.
For Xapian 1.4, if the variable is set to 0, the Chert format will be
used, and the text will not be stored. If the variable is 1, Glass will
be used, and the text stored.
For Xapian 1.2, and for versions 1.5 and newer, the index format is
always the default, but the variable controls if the text is stored or
not, and the abstract generation method. With Xapian 1.5 and later, and
the variable set to 0, abstract generation may be very slow, but this
setting may still be useful to save space if you do not use abstract
generation at all.
.TP .TP
.BI "nonumbers = "bool .BI "nonumbers = "bool
Decides if terms will be Decides if terms will be
@@ -216,9 +292,19 @@ will reduce the index size. This can only be set for a whole index, not
for a subtree. for a subtree.
.TP .TP
.BI "dehyphenate = "bool .BI "dehyphenate = "bool
Determines if we index 'coworker' also when the input is 'co-worker'. Determines if we index
This is new in version 1.22, and on by default. Setting the variable to off 'coworker' also when the input is 'co-worker'. This is new
allows restoring the previous behaviour. in version 1.22, and on by default. Setting the variable to off allows
restoring the previous behaviour.
.TP
.BI "backslashasletter = "bool
Process backslash as a normal letter. This may make sense for people wanting to index TeX commands as
such but is not of much general use.
.TP
.BI "maxtermlength = "int
Maximum term length. Words longer than this will be discarded.
The default is 40 and used to be hard-coded, but it can now be
adjusted. You need an index reset if you change the value.
.TP .TP
.BI "nocjk = "bool .BI "nocjk = "bool
Decides if specific East Asian Decides if specific East Asian
@@ -263,24 +349,16 @@ lowercase and upper-case versions of a character should be specified, as
membership in the list will turn off both standard accent and case membership in the list will turn off both standard accent and case
processing. The value is global and affects both indexing and querying. processing. The value is global and affects both indexing and querying.
Examples: Examples:
Swedish: Swedish:
unac_except_trans = ää Ää öö Öö üü Üü ßss œoe Œoe æae Æae ffff fifi flfl åå Åå unac_except_trans = ää Ää öö Öö üü Üü ßss œoe Œoe æae Æae ffff fifi flfl åå Åå
. German:
German:
unac_except_trans = ää Ää öö Öö üü Üü ßss œoe Œoe æae Æae ffff fifi flfl unac_except_trans = ää Ää öö Öö üü Üü ßss œoe Œoe æae Æae ffff fifi flfl
In French, you probably want to decompose oe and ae and nobody would type In French, you probably want to decompose oe and ae and nobody would type
a German ß a German ß
unac_except_trans = ßss œoe Œoe æae Æae ffff fifi flfl unac_except_trans = ßss œoe Œoe æae Æae ffff fifi flfl
. The default for all until someone protests follows. These decompositions
The default for all until someone protests follows. These decompositions
are not performed by unac, but it is unlikely that someone would type the are not performed by unac, but it is unlikely that someone would type the
composed forms in a search. composed forms in a search.
unac_except_trans = ßss œoe Œoe æae Æae ffff fifi flfl unac_except_trans = ßss œoe Œoe æae Æae ffff fifi flfl
.TP .TP
.BI "maildefcharset = "string .BI "maildefcharset = "string
@@ -352,7 +430,7 @@ over which we stop indexing. The value is a percentage,
corresponding to what the "Capacity" df output column shows. The default corresponding to what the "Capacity" df output column shows. The default
value is 0, meaning no checking. value is 0, meaning no checking.
.TP .TP
.BI "xapiandb = "dfn .BI "dbdir = "dfn
Xapian database directory Xapian database directory
location. This will be created on first indexing. If the location. This will be created on first indexing. If the
value is not an absolute path, it will be interpreted as relative to value is not an absolute path, it will be interpreted as relative to
@@ -386,9 +464,17 @@ Default: 40 MB.
Reducing the size will not physically truncate the file. Reducing the size will not physically truncate the file.
.TP .TP
.BI "webqueuedir = "fn .BI "webqueuedir = "fn
The path to the Web indexing queue. This is The path to the Web indexing queue. This used to be
hard-coded in the plugin as ~/.recollweb/ToIndex so there should be no hard-coded in the old plugin as ~/.recollweb/ToIndex so there would be no
need or possibility to change it. need or possibility to change it, but the WebExtensions plugin now downloads
the files to the user Downloads directory, and a script moves them to
webqueuedir. The script reads this value from the config so it has become
possible to change it.
.TP
.BI "webdownloadsdir = "fn
The path to the browser downloads directory. This is
where the new browser add-on extension has to create the files. They are
then moved by a script to webqueuedir.
.TP .TP
.BI "aspellDicDir = "dfn .BI "aspellDicDir = "dfn
Aspell dictionary storage directory location. The Aspell dictionary storage directory location. The
@@ -415,10 +501,11 @@ which lets Xapian perform its own thing, meaning flushing every
$XAPIAN_FLUSH_THRESHOLD documents created, modified or deleted: as memory $XAPIAN_FLUSH_THRESHOLD documents created, modified or deleted: as memory
usage depends on average document size, not only document count, the usage depends on average document size, not only document count, the
Xapian approach is not very useful, and you should let Recoll manage Xapian approach is not very useful, and you should let Recoll manage
the flushes. The default value of idxflushmb is 10 MB, and may be a bit the flushes. The program compiled value is 0. The configured default
low. If you are looking for maximum speed, you may want to experiment value (from this file) is now 50 MB, and should be ok in many cases.
with values between 20 and You can set it as low as 10 to conserve memory, but if you are looking
80. In my experience, values beyond 100 are always counterproductive. If for maximum speed, you may want to experiment with values between 20 and
200. In my experience, values beyond this are always counterproductive. If
you find otherwise, please drop me a note. you find otherwise, please drop me a note.
.TP .TP
.BI "filtermaxseconds = "int .BI "filtermaxseconds = "int
@@ -481,6 +568,25 @@ Override logfilename for the indexer in real time
mode. The default is to use the idx... values if set, else mode. The default is to use the idx... values if set, else
the log... values. the log... values.
.TP .TP
.BI "orgidxconfdir = "dfn
Original location of the configuration directory. This is used exclusively for movable datasets. Locating the
configuration directory inside the directory tree makes it possible to
provide automatic query time path translations once the data set has
moved (for example, because it has been mounted on another
location).
.TP
.BI "curidxconfdir = "dfn
Current location of the configuration directory. Complements orgidxconfdir for movable datasets. This should be used
if the configuration directory has been copied from the dataset to
another location, either because the dataset is readonly and an r/w copy
is desired, or for performance reasons. This records the original moved
location before copy, to allow path translation computations. For
example if a dataset originally indexed as '/home/me/mydata/config' has
been mounted to '/media/me/mydata', and the GUI is running from a copied
configuration, orgidxconfdir would be '/home/me/mydata/config', and
curidxconfdir (as set in the copied configuration) would be
'/media/me/mydata/config'.
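The path translation idea behind orgidxconfdir and curidxconfdir can be sketched as follows, using the example values above. This is an assumption-laden illustration and not Recoll's actual algorithm: the hypothetical `translate_path` helper treats the parent of each configuration directory as the dataset root and swaps one root for the other.

```python
import posixpath


def translate_path(path, orgidxconfdir, curidxconfdir):
    """Map a path recorded at indexing time to its current location (sketch)."""
    # Assumption: the dataset root is the parent of the configuration directory.
    org_root = posixpath.dirname(orgidxconfdir.rstrip("/"))
    cur_root = posixpath.dirname(curidxconfdir.rstrip("/"))
    if path == org_root or path.startswith(org_root + "/"):
        return cur_root + path[len(org_root):]
    # Paths outside the moved dataset are left unchanged.
    return path


# Hypothetical document path inside the dataset from the example.
print(translate_path(
    "/home/me/mydata/docs/report.pdf",
    "/home/me/mydata/config",
    "/media/me/mydata/config",
))
```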
.TP
.BI "idxrundir = "dfn .BI "idxrundir = "dfn
Indexing process current directory. The input Indexing process current directory. The input
handlers sometimes leave temporary files in the current directory, so it handlers sometimes leave temporary files in the current directory, so it
@@ -519,6 +625,12 @@ amount of data stored in the index for the purpose of displaying fields
inside result lists or previews. The default value is 150 bytes which inside result lists or previews. The default value is 150 bytes which
may be too low if you have custom fields. may be too low if you have custom fields.
.TP .TP
.BI "idxtexttruncatelen = "int
Truncation length for all document texts. Only index
the beginning of documents. This is not recommended except if you are
sure that the interesting keywords are at the top and have severe disk
space issues.
.TP
.BI "aspellLanguage = "string .BI "aspellLanguage = "string
Language definitions to use when creating the aspell Language definitions to use when creating the aspell
dictionary. The value must match a set of aspell language dictionary. The value must match a set of aspell language
@@ -612,16 +724,39 @@ Attempt OCR of PDF files with no text content if both tesseract and
pdftoppm are installed. The default is off because OCR is so pdftoppm are installed. The default is off because OCR is so
very slow. very slow.
.TP .TP
.BI "pdfocrlang = "string
Language to assume for PDF OCR. This is very important for having a reasonable rate of errors
with tesseract. This can also be set through a configuration variable
or directory-local parameters. See the rclpdf.py script.
.TP
.BI "pdfattach = "bool .BI "pdfattach = "bool
Enable PDF attachment extraction by executing pdftk (if Enable PDF attachment extraction by executing pdftk (if
available). This is available). This is
normally disabled, because it does slow down PDF indexing a bit even if normally disabled, because it does slow down PDF indexing a bit even if
not one attachment is ever found. not one attachment is ever found.
.TP .TP
.BI "pdfextrameta = "string
Extract text from selected XMP metadata tags. This
is a space-separated list of qualified XMP tag names. Each element can also
include a translation to a Recoll field name, separated by a '|'
character. If the second element is absent, the tag name is used as the
Recoll field name. You will also need to add specifications to the
"fields" file to direct processing of the extracted data.
.TP
.BI "pdfextrametafix = "fn
Define name of XMP field editing script. This
defines the name of a script to be loaded for editing XMP field
values. The script should define a 'MetaFixer' class with a metafix()
method which will be called with the qualified tag name and value of each
selected field, for editing or erasing. A new instance is created for
each document, so that the object can keep state for, e.g. eliminating
duplicate values.
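The two PDF metadata settings can be illustrated with a minimal Python sketch. The first part parses a pdfextrameta value (the commented sample from recoll.conf) into a tag-to-field mapping. The second part shows what a pdfextrametafix script might look like. Only the 'MetaFixer' class name and the metafix() method come from the description above; the argument names, the return convention (an empty string to erase a value) and the helper `parse_pdfextrameta` are assumptions.

```python
def parse_pdfextrameta(value):
    """Map each qualified XMP tag name to a Recoll field name (sketch)."""
    mapping = {}
    for element in value.split():
        tag, sep, field = element.partition("|")
        # Without a '|' translation, the tag name itself is the field name.
        mapping[tag] = field if sep and field else tag
    return mapping


class MetaFixer(object):
    """Hypothetical field editing script: drops duplicate values."""

    def __init__(self):
        # A new instance is created for each document, so this set only
        # remembers values seen in the current document.
        self.seen = set()

    def metafix(self, nm, txt):
        # nm is the qualified tag name, txt the field value.
        txt = txt.strip()
        key = (nm, txt)
        if not txt or key in self.seen:
            # Assumption: returning an empty string erases the value.
            return ""
        self.seen.add(key)
        return txt


mapping = parse_pdfextrameta(
    "bibtex:location|location bibtex:booktitle bibtex:pages")
print(mapping)
```

Keeping the "seen" state on the instance relies on the per-document instantiation described above, so duplicates are only suppressed within a single document.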
.TP
.BI "mhmboxquirks = "string .BI "mhmboxquirks = "string
Enable thunderbird/mozilla-seamonkey mbox format quirks. Set this for the directory where the email mbox files are Enable thunderbird/mozilla-seamonkey mbox format quirks. Set this for the directory where the email mbox files are
stored. stored.
.SH SEE ALSO .SH SEE ALSO
.PP .PP
recollindex(1) recoll(1) recollindex(1) recoll(1)

File diff suppressed because it is too large

File diff suppressed because it is too large

@@ -216,9 +216,9 @@ usesystemfilecommand = 1
# <var name="systemfilecommand" type="string"><brief>Command used to guess # <var name="systemfilecommand" type="string"><brief>Command used to guess
# MIME types if the internal methods fails</brief><descr>This should be a # MIME types if the internal methods fails</brief><descr>This should be a
# "file -i" workalike. The file path will be added as a last parameter to # "file -i" workalike. The file path will be added as a last parameter to
# the command line. 'xdg-mime' works better than the traditional 'file' # the command line. "xdg-mime" works better than the traditional "file"
# command, and is now the configured default (with a hard-coded fallback to # command, and is now the configured default (with a hard-coded fallback to
# 'file')</descr></var> # "file")</descr></var>
systemfilecommand = xdg-mime query filetype systemfilecommand = xdg-mime query filetype
# <var name="processwebqueue" type="bool"><brief>Decide if we process the # <var name="processwebqueue" type="bool"><brief>Decide if we process the
@@ -885,7 +885,7 @@ snippetMaxPosWalk = 1000000
# include a translation to a Recoll field name, separated by a '|' # include a translation to a Recoll field name, separated by a '|'
# character. If the second element is absent, the tag name is used as the # character. If the second element is absent, the tag name is used as the
# Recoll field name. You will also need to add specifications to the # Recoll field name. You will also need to add specifications to the
# 'fields' file to direct processing of the extracted data.</descr></var> # "fields" file to direct processing of the extracted data.</descr></var>
#pdfextrameta = bibtex:location|location bibtex:booktitle bibtex:pages #pdfextrameta = bibtex:location|location bibtex:booktitle bibtex:pages
# <var name="pdfextrametafix" type="fn"> # <var name="pdfextrametafix" type="fn">