doc

2019-04-14 16:18:39 +02:00 · 567aaa2035
commit 567aaa2035
parent 48bc71da70
4 changed files with 875 additions and 1037 deletions
--- a/src/doc/man/recoll.conf.5
+++ b/src/doc/man/recoll.conf.5
@@ -54,12 +54,20 @@ home directory.
 Where values are lists, white space is used for separation, and elements with
 embedded spaces can be quoted with double-quotes.
 .SH OPTIONS
 .TP
 .BI "topdirs = "string
 Space-separated list of files or
 directories to recursively index. Default to ~ (indexes
 $HOME). You can use symbolic links in the list, they will be followed,
-independently of the value of the followLinks variable.
+independently of the value of the followLinks variable.
 .TP
 .BI "monitordirs = "string
 Space-separated list of files or directories to monitor for
 updates. When running the real-time indexer, this allows monitoring only a
 subset of the whole indexed area. The elements must be included in the
 tree defined by the 'topdirs' members.
 .TP
 .BI "skippedNames = "string
 Files and directories which should be ignored. 
@@ -69,13 +77,21 @@ names.  The list in the default configuration does not exclude hidden
 directories (names beginning with a dot), which means that it may index
 quite a few things that you do not want. On the other hand, email user
 agents like Thunderbird usually store messages in hidden directories, and
-you probably want this indexed. One possible solution is to have '.*'
+you probably want this indexed. One possible solution is to have ".*" in
-in 'skippedNames', and add things like '~/.thunderbird' '~/.evolution'
+"skippedNames", and add things like "~/.thunderbird" "~/.evolution" to
-to 'topdirs'.  Not even the file names are indexed for patterns in this
+"topdirs".  Not even the file names are indexed for patterns in this
-list, see the 'noContentSuffixes' variable for an alternative approach
+list, see the "noContentSuffixes" variable for an alternative approach
 which indexes the file names. Can be redefined for any
 subtree.
 .TP
 .BI "skippedNames- = "string
 List of name endings to remove from the default skippedNames
 list. 
 .TP
 .BI "skippedNames+ = "string
 List of name endings to add to the default skippedNames
 list. 
 .TP
 .BI "noContentSuffixes = "string
 List of name endings (not necessarily dot-separated suffixes) for
 which we don't try MIME type identification, and don't uncompress or
@@ -87,38 +103,59 @@ from skippedNames because these are name ending matches only (not
 wildcard patterns), and the file name itself gets indexed normally. This
 can be redefined for subdirectories.
 .TP
 .BI "noContentSuffixes- = "string
 List of name endings to remove from the default noContentSuffixes
 list. 
 .TP
 .BI "noContentSuffixes+ = "string
 List of name endings to add to the default noContentSuffixes
 list. 
 .TP
 .BI "skippedPaths = "string
-Paths we should not go into. Space-separated list of
+Absolute paths we should not go into. Space-separated list of wildcard expressions for absolute
-wildcard expressions for filesystem paths. Can contain files and
+filesystem paths. Must be defined at the top level of the configuration
-directories. The database and configuration directories will
+file, not in a subsection. Can contain files and directories. The database and
-automatically be added. The expressions are matched using 'fnmatch(3)'
+configuration directories will automatically be added. The expressions
-with the FNM_PATHNAME flag set by default. This means that '/' characters
+are matched using 'fnmatch(3)' with the FNM_PATHNAME flag set by
-must be matched explicitly. You can set 'skippedPathsFnmPathname' to 0
+default. This means that '/' characters must be matched explicitly. You
-to disable the use of FNM_PATHNAME (meaning that '/*/dir3' will
+can set 'skippedPathsFnmPathname' to 0 to disable the use of FNM_PATHNAME
-match '/dir1/dir2/dir3').  The default value contains the usual mount point
+(meaning that '/*/dir3' will match '/dir1/dir2/dir3'). The default value
-for removable media to remind you that it is a bad idea to have Recoll work
+contains the usual mount point for removable media to remind you that it
-on these (esp. with the monitor: media gets indexed on mount, all data
+is a bad idea to have Recoll work on these (esp. with the monitor: media
-gets erased on unmount).  Explicitly adding '/media/xxx' to the topdirs
+gets indexed on mount, all data gets erased on unmount). Explicitly
-will override this.
+adding '/media/xxx' to the 'topdirs' variable will override
 this.
 .TP
 .BI "skippedPathsFnmPathname = "bool
 Set to 0 to
 override use of FNM_PATHNAME for matching skipped
 paths. 
 .TP
 .BI "nowalkfn = "string
 File name which will cause its parent directory to be skipped. Any directory containing a file with this name will be skipped as
 if it was part of the skippedPaths list. Ex: .recoll-noindex
 .TP
 .BI "daemSkippedPaths = "string
 skippedPaths equivalent specific to
 real time indexing. This enables having parts of the tree
 which are initially indexed but not monitored. If daemSkippedPaths is
 not set, the daemon uses skippedPaths.
 .TP
 .BI "zipUseSkippedNames = "bool
 Use skippedNames inside Zip archives. Fetched
 directly by the rclzip handler. Skip the patterns defined by skippedNames
 inside Zip archives. Can be redefined for subdirectories.
 See https://www.lesbonscomptes.com/recoll/faqsandhowtos/FilteringOutZipArchiveMembers.html
 .TP
 .BI "zipSkippedNames = "string
 Space-separated list of wildcard expressions for names that should
 be ignored inside zip archives. This is used directly by
-the zip handler, and has a function similar to skippedNames, but works
+the zip handler. If zipUseSkippedNames is not set, zipSkippedNames
-independently. Can be redefined for subdirectories. Supported by recoll
+defines the patterns to be skipped inside archives. If zipUseSkippedNames
-1.20 and newer. See
+is set, the two lists are concatenated and used. Can be redefined for
-https://bitbucket.org/medoc/recoll/wiki/Filtering%20out%20Zip%20archive%20members
+subdirectories.
 See https://www.lesbonscomptes.com/recoll/faqsandhowtos/FilteringOutZipArchiveMembers.html
 .TP
 .BI "followLinks = "bool
@@ -133,16 +170,27 @@ followed.
 .BI "indexedmimetypes = "string
 Restrictive list of
 indexed mime types. Normally not set (in which case all
-supported types are indexed). If it is set,
+supported types are indexed). If it is set, only the types from the list
-only the types from the list will have their contents indexed. The names
+will have their contents indexed. The names will be indexed anyway if
-will be indexed anyway if indexallfilenames is set (default). MIME
+indexallfilenames is set (default). MIME type names should be taken from
-type names should be taken from the mimemap file. Can be redefined for
+the mimemap file (the values may be different from xdg-mime or file -i
-subtrees.
+output in some cases). Can be redefined for subtrees.
 .TP
 .BI "excludedmimetypes = "string
 List of excluded MIME
-types. Lets you exclude some types from indexing. Can be
+types. Lets you exclude some types from indexing. MIME type
-redefined for subtrees.
+names should be taken from the mimemap file (the values may be different
 from xdg-mime or file -i output in some cases). Can be redefined for
 subtrees.
 .TP
 .BI "nomd5types = "string
 Don't compute md5 for these types. md5 checksums are used only for deduplicating results, and can be
 very expensive to compute on multimedia or other big files. This list
 lets you turn off md5 computation for selected types. It is global (no
 redefinition for subtrees). At the moment, it only has an effect for
 external handlers (exec and execm). The file types can be specified by
 listing either MIME types (e.g. audio/mpeg) or handler names
 (e.g. rclaudio).
 .TP
 .BI "compressedfilemaxkbs = "int
 Size limit for compressed
@@ -173,9 +221,9 @@ for the command used.
 Command used to guess
 MIME types if the internal methods fail. This should be a
 "file -i" workalike.  The file path will be added as a last parameter to
-the command line. 'xdg-mime' works better than the traditional 'file'
+the command line. "xdg-mime" works better than the traditional "file"
-command, and is now the configured default (with a hard-coded fallback
+command, and is now the configured default (with a hard-coded fallback to
-to 'file')
+"file")
 .TP
 .BI "processwebqueue = "bool
 Decide if we process the
@@ -204,6 +252,34 @@ will be bigger, and some marginal weirdness may sometimes occur. The
 default is a stripped index. When using multiple indexes for a search,
 this parameter must be defined identically for all. Changing the value
 implies an index reset.
 .TP
 .BI "indexStoreDocText = "bool
 Decide if we store the
 documents' text content in the index. Storing the text
 allows extracting snippets from it at query time, instead of building
 them from index position data.
 Newer Xapian index formats have rendered our use of positions list
 unacceptably slow in some cases. The last Xapian index format with good
 performance for the old method is Chert, which is default for 1.2, still
 supported but not default in 1.4 and will be dropped in 1.6.
 The stored document text is translated from its original format to UTF-8
 plain text, but not stripped of upper-case, diacritics, or punctuation
 signs. Storing it increases the index size by 10-20% typically, but also
 allows for nicer snippets, so it may be worth enabling it even if not
 strictly needed for performance if you can afford the space.
 The variable only has an effect when creating an index, meaning that the
 xapiandb directory must not exist yet. Its exact effect depends on the
 Xapian version.
 For Xapian 1.4, if the variable is set to 0, the Chert format will be
 used, and the text will not be stored. If the variable is 1, Glass will
 be used, and the text stored.
 For Xapian 1.2, and for versions 1.5 and newer, the index format is
 always the default, but the variable controls if the text is stored or
 not, and the abstract generation method. With Xapian 1.5 and later, and
 the variable set to 0, abstract generation may be very slow, but this
 setting may still be useful to save space if you do not use abstract
 generation at all.
 .TP
 .BI "nonumbers = "bool
 Decides if terms will be
@@ -216,9 +292,19 @@ will reduce the index size. This can only be set for a whole index, not
 for a subtree.
 .TP
 .BI "dehyphenate = "bool
-Determines if we index 'coworker' also when the input is 'co-worker'.
+Determines if we index
-This is new in version 1.22, and on by default. Setting the variable to off
+'coworker' also when the input is 'co-worker'. This is new
-allows restoring the previous behaviour.
+in version 1.22, and on by default. Setting the variable to off allows
 restoring the previous behaviour.
 .TP
 .BI "backslashasletter = "bool
 Process backslash as a normal letter. This may make sense for people wanting to index TeX commands as
 such but is not of much general use.
 .TP
 .BI "maxtermlength = "int
 Maximum term length. Words longer than this will be discarded.
 The default is 40 and used to be hard-coded, but it can now be
 adjusted. You need an index reset if you change the value.
 .TP
 .BI "nocjk = "bool
 Decides if specific East Asian
@@ -263,24 +349,16 @@ lowercase and upper-case versions of a character should be specified, as
 membership in the list will turn off both standard accent and case
 processing. The value is global and affects both indexing and querying.
 Examples:
 Swedish:
 unac_except_trans = ää Ää öö Öö üü Üü ßss œoe Œoe æae Æae ﬀff ﬁfi ﬂfl åå Åå
-
+.
 German:
 unac_except_trans = ää Ää öö Öö üü Üü ßss œoe Œoe æae Æae ﬀff ﬁfi ﬂfl
 In French, you probably want to decompose oe and ae and nobody would type
 a German ß
 unac_except_trans = ßss œoe Œoe æae Æae ﬀff ﬁfi ﬂfl
-
+.
 The default for all until someone protests follows. These decompositions
 are not performed by unac, but it is unlikely that someone would type the
 composed forms in a search.
 unac_except_trans = ßss œoe Œoe æae Æae ﬀff ﬁfi ﬂfl
 .TP
 .BI "maildefcharset = "string
@@ -352,7 +430,7 @@ over which we stop indexing. The value is a percentage,
 corresponding to what the "Capacity" df output column shows. The default
 value is 0, meaning no checking.
 .TP
-.BI "xapiandb = "dfn
+.BI "dbdir = "dfn
 Xapian database directory
 location. This will be created on first indexing. If the
 value is not an absolute path, it will be interpreted as relative to
@@ -386,9 +464,17 @@ Default: 40 MB.
 Reducing the size will not physically truncate the file.
 .TP
 .BI "webqueuedir = "fn
-The path to the Web indexing queue. This is
+The path to the Web indexing queue. This used to be
-hard-coded in the plugin as ~/.recollweb/ToIndex so there should be no
+hard-coded in the old plugin as ~/.recollweb/ToIndex so there would be no
-need or possibility to change it.
+need or possibility to change it, but the WebExtensions plugin now downloads
 the files to the user Downloads directory, and a script moves them to
 webqueuedir. The script reads this value from the config so it has become
 possible to change it.
 .TP
 .BI "webdownloadsdir = "fn
 The path to browser downloads directory. This is
 where the new browser add-on extension has to create the files. They are
 then moved by a script to webqueuedir.
 .TP
 .BI "aspellDicDir = "dfn
 Aspell dictionary storage directory location. The
@@ -415,10 +501,11 @@ which lets Xapian perform its own thing, meaning flushing every
 $XAPIAN_FLUSH_THRESHOLD documents created, modified or deleted: as memory
 usage depends on average document size, not only document count, the
 Xapian approach is not very useful, and you should let Recoll manage
-the flushes.  The default value of idxflushmb is 10 MB, and may be a bit
+the flushes. The program compiled value is 0. The configured default
-low. If you are looking for maximum speed, you may want to experiment
+value (from this file) is now 50 MB, and should be ok in many cases.
-with values between 20 and
+You can set it as low as 10 to conserve memory, but if you are looking
-80. In my experience, values beyond 100 are always counterproductive. If
+for maximum speed, you may want to experiment with values between 20 and
 200. In my experience, values beyond this are always counterproductive. If
 you find otherwise, please drop me a note.
 .TP
 .BI "filtermaxseconds = "int
@@ -481,6 +568,25 @@ Override logfilename for the indexer in real time
 mode. The default is to use the idx... values if set, else
 the log... values.
 .TP
 .BI "orgidxconfdir = "dfn
 Original location of the configuration directory. This is used exclusively for movable datasets. Locating the
 configuration directory inside the directory tree makes it possible to
 provide automatic query time path translations once the data set has
 moved (for example, because it has been mounted on another
 location).
 .TP
 .BI "curidxconfdir = "dfn
 Current location of the configuration directory. Complement orgidxconfdir for movable datasets. This should be used
 if the configuration directory has been copied from the dataset to
 another location, either because the dataset is readonly and an r/w copy
 is desired, or for performance reasons. This records the original moved
 location before copy, to allow path translation computations.  For
 example if a dataset originally indexed as '/home/me/mydata/config' has
 been mounted to '/media/me/mydata', and the GUI is running from a copied
 configuration, orgidxconfdir would be '/home/me/mydata/config', and
 curidxconfdir (as set in the copied configuration) would be
 '/media/me/mydata/config'.
 .TP
 .BI "idxrundir = "dfn
 Indexing process current directory. The input
 handlers sometimes leave temporary files in the current directory, so it
@@ -519,6 +625,12 @@ amount of data stored in the index for the purpose of displaying fields
 inside result lists or previews. The default value is 150 bytes which
 may be too low if you have custom fields.
 .TP
 .BI "idxtexttruncatelen = "int
 Truncation length for all document texts. Only index
 the beginning of documents. This is not recommended except if you are
 sure that the interesting keywords are at the top and have severe disk
 space issues.
 .TP
 .BI "aspellLanguage = "string
 Language definitions to use when creating the aspell
 dictionary. The value must match a set of aspell language
@@ -612,16 +724,39 @@ Attempt OCR of PDF files with no text content if both tesseract and
 pdftoppm are installed. The default is off because OCR is so
 very slow.
 .TP
 .BI "pdfocrlang = "string
 Language to assume for PDF OCR. This is very important for having a reasonable rate of errors
 with tesseract. This can also be set through a configuration variable
 or directory-local parameters. See the rclpdf.py script.
 .TP
 .BI "pdfattach = "bool
 Enable PDF attachment extraction by executing pdftk (if
 available). This is
 normally disabled, because it does slow down PDF indexing a bit even if
 not one attachment is ever found.
 .TP
 .BI "pdfextrameta = "string
 Extract text from selected XMP metadata tags. This
 is a space-separated list of qualified XMP tag names. Each element can also
 include a translation to a Recoll field name, separated by a '|'
 character. If the second element is absent, the tag name is used as the
 Recoll field name. You will also need to add specifications to the
 "fields" file to direct processing of the extracted data.
 .TP
 .BI "pdfextrametafix = "fn
 Define name of XMP field editing script. This
 defines the name of a script to be loaded for editing XMP field
 values. The script should define a 'MetaFixer' class with a metafix()
 method which will be called with the qualified tag name and value of each
 selected field, for editing or erasing. A new instance is created for
 each document, so that the object can keep state for, e.g. eliminating
 duplicate values.
 .TP
 .BI "mhmboxquirks = "string
 Enable thunderbird/mozilla-seamonkey mbox format quirks. Set this for the directory where the email mbox files are
 stored.
 .SH SEE ALSO
 .PP
 recollindex(1) recoll(1)
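
The skippedPaths text in the man page diff above says that with FNM_PATHNAME set, '/' must be matched explicitly, and that disabling it lets '/*/dir3' match '/dir1/dir2/dir3'. The sketch below illustrates that difference only. It is not Recoll code and does not call fnmatch(3): it is a simplified emulation in which only '*' and '?' are treated as wildcards (bracket expressions are not handled), and the names `glob_to_regex` and `path_is_skipped` are hypothetical.

```python
import re


def glob_to_regex(pattern, pathname):
    """Translate a wildcard pattern into a compiled regular expression.

    Simplified emulation (an assumption, not the fnmatch(3) implementation):
    only '*' and '?' are wildcards. When `pathname` is true, wildcards never
    match '/', which mimics the effect of FNM_PATHNAME.
    """
    any_char = "[^/]" if pathname else "."
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(any_char + "*")
        elif ch == "?":
            parts.append(any_char)
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def path_is_skipped(path, patterns, pathname=True):
    """Return True if `path` fully matches any of the wildcard patterns."""
    return any(glob_to_regex(p, pathname).fullmatch(path) for p in patterns)


patterns = ["/*/dir3"]
# With FNM_PATHNAME-like behaviour, '*' cannot span the '/' in 'dir1/dir2'.
print(path_is_skipped("/dir1/dir2/dir3", patterns, pathname=True))   # False
# Without it (skippedPathsFnmPathname = 0), '*' may match across '/'.
print(path_is_skipped("/dir1/dir2/dir3", patterns, pathname=False))  # True
# A single directory level matches in both modes.
print(path_is_skipped("/dir1/dir3", patterns, pathname=True))        # True
```

The design choice here is to make the separator handling visible in one place (`any_char`), since Python's own `fnmatch` module has no FNM_PATHNAME equivalent.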
--- a/src/doc/user/usermanual.html
+++ b/src/doc/user/usermanual.html
--- a/src/doc/user/usermanual.xml
+++ b/src/doc/user/usermanual.xml
--- a/src/sampleconf/recoll.conf
+++ b/src/sampleconf/recoll.conf
@@ -216,9 +216,9 @@ usesystemfilecommand = 1
 # <var name="systemfilecommand" type="string"><brief>Command used to guess
 # MIME types if the internal methods fail</brief><descr>This should be a
 # "file -i" workalike.  The file path will be added as a last parameter to
-# the command line. 'xdg-mime' works better than the traditional 'file'
+# the command line. "xdg-mime" works better than the traditional "file"
 # command, and is now the configured default (with a hard-coded fallback to
-# 'file')</descr></var>
+# "file")</descr></var>
 systemfilecommand = xdg-mime query filetype
 # <var name="processwebqueue" type="bool"><brief>Decide if we process the
@@ -885,7 +885,7 @@ snippetMaxPosWalk = 1000000
 # include a translation to a Recoll field name, separated by a '|'
 # character. If the second element is absent, the tag name is used as the
 # Recoll field name. You will also need to add specifications to the
-# 'fields' file to direct processing of the extracted data.</descr></var>
+# "fields" file to direct processing of the extracted data.</descr></var>
 #pdfextrameta =  bibtex:location|location bibtex:booktitle bibtex:pages
 # <var name="pdfextrametafix" type="fn">
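
The pdfextrameta description above defines the value as a space-separated list of qualified XMP tag names, each optionally followed by '|' and a Recoll field name. A minimal sketch of how such a value could be split is shown below. It is an illustration, not the rclpdf.py implementation: the function name `parse_pdfextrameta` is hypothetical, and keeping the full qualified tag as the field name when no translation is given is an assumption about what "the tag name" means.

```python
def parse_pdfextrameta(value):
    """Map each XMP tag in a pdfextrameta value to a Recoll field name.

    Assumption: when no '|field' translation is present, the element as
    written (including its qualifier prefix) is used as the field name.
    """
    mapping = {}
    for element in value.split():
        tag, sep, field = element.partition("|")
        mapping[tag] = field if sep and field else tag
    return mapping


# Sample value taken from the commented-out line in the sample configuration.
sample = "bibtex:location|location bibtex:booktitle bibtex:pages"
for tag, field in parse_pdfextrameta(sample).items():
    print(tag, "->", field)
```

Plain whitespace splitting is used because the sample elements contain no embedded spaces. Values quoted with double-quotes, which the configuration format allows for list elements, would need a quote-aware splitter.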