== Character case and diacritic marks (2), user interface

In a link:ZDevCaseAndDiacritics1.html[previous document], we discussed some
of the problems which arise when mixing case/diacritics sensitivity and
stemming.

As of version 1.18, Recoll can create two types of indexes:

* _Dumb_ indexes contain terms which are lowercased and stripped of
  diacritics. Searches using such an index are naturally case- and
  diacritics-insensitive: search terms are stripped before processing.
* _Raw_ indexes contain terms exactly as they were found in the
  source document. Searching such an index is naturally sensitive to case
  and diacritics, and can be made insensitive by further processing.

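To make the notion of a "stripped" term concrete, here is a minimal Python sketch of lowercasing a term and removing its diacritics. It is only an illustration built on the standard `unicodedata` module: the function name `strip_term` is hypothetical, and Recoll's own stripping code may treat some characters differently (for example letters such as "ø" that do not decompose into a base letter plus a combining mark).

```python
import unicodedata


def strip_term(term: str) -> str:
    """Lowercase a term and drop its combining diacritical marks.

    Illustrative only: this is not Recoll's actual stripping code.
    """
    # NFD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", term)
    # Keep everything except the combining marks (Unicode category Mn).
    base = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return base.lower()


if __name__ == "__main__":
    for raw in ["Résumé", "resume", "ÉCOLE"]:
        print(raw, "->", strip_term(raw))
```

With this sketch, a dumb index would store the output of `strip_term`, while a raw index would store the input unchanged.
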
The following explains how users can control these Recoll features.

=== Controlling the type of index we create: stripped or raw

The kind of index that recoll creates is determined by:

* A build-time *configure* switch: _--enable-stripchars_. If this is
  set, the code for case and diacritics sensitivity is not compiled in and
  recoll will work like the previous versions: unaccented and casefolded
  index, no runtime options for case or diacritics sensitivity.

* An indexing configuration switch (in recoll.conf): if Recoll was built
  with _--disable-stripchars_, this will provide a dynamic way to return
  to the "traditional" index. The case and diacritics code will be present
  but inactive. Normally, a recoll installation with this switch set
  should behave exactly like one built with _--enable-stripchars_. When
  using multiple indexes, this switch MUST be consistent between
  indexes. There is no support whatsoever for mixing raw and dumb indexes.
  The option is named _indexStripChars_ and, to avoid errors, it cannot be
  set from the GUI. This is something that would typically be set once
  and for all for a given installation. We need to decide what the default
  value will be for 1.18.

* A number of query time switches. Using these it is also possible to
  perform a search insensitive to case and diacritics on a raw index. Note,
  however, that given the complexity of the issues involved, I give no
  guarantee at this time that this will yield exactly the same results as
  searching a dumb index. Details about query time behaviour follow.

=== Controlling stem, case and diacritics expansion: user query interface

Recoll versions up to 1.17 were insensitive to case and diacritics. We only
needed to give the user a way to control stem expansion. This was done in
three ways:

* Globally, by setting a menu option.
* Globally, by setting the stemming language value to empty.
* On a term by term basis, by capitalizing the term or, in query language
  mode only, by using an 'l' clause modifier (_"term"l_).

After switching to an unstripped index, capable of case and diacritic
sensitivity, we need ways to control which of the following processing
steps are performed:

* Case expansion.
* Diacritics expansion.
* Stem expansion.

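One plausible reading of case and diacritics expansion on a raw index is sketched below in Python: the user's term is expanded to every index term that matches it once the selected properties are folded away. The names `fold` and `expand`, and the idea of scanning a plain list of index terms, are assumptions made for illustration; they do not describe how Recoll actually stores or expands terms, and stem expansion is left out.

```python
import unicodedata


def fold(term: str, fold_case: bool, fold_diacritics: bool) -> str:
    """Reduce a term to the form used for comparison (hypothetical helper)."""
    if fold_diacritics:
        # Decompose, then drop combining marks (Unicode category Mn).
        decomposed = unicodedata.normalize("NFD", term)
        term = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    if fold_case:
        term = term.lower()
    return term


def expand(term, index_terms, fold_case=True, fold_diacritics=True):
    """Return the raw index terms that match `term` under the chosen folding."""
    key = fold(term, fold_case, fold_diacritics)
    return sorted(
        t for t in index_terms if fold(t, fold_case, fold_diacritics) == key
    )


if __name__ == "__main__":
    raw_index = ["resume", "Resume", "résumé", "Résumé", "RESUME"]
    # Insensitive to both: every variant is a candidate.
    print(expand("resume", raw_index))
    # Case expansion only: accented variants are excluded.
    print(expand("resume", raw_index, fold_diacritics=False))
    # Diacritics expansion only: capitalized variants are excluded.
    print(expand("résumé", raw_index, fold_case=False))
```

Turning both foldings off leaves only an exact match, which is the naturally sensitive behaviour of a raw index.
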
The default mode will be compatible with the previous version, because
this is most generally what we want to do: ignore case and diacritics,
expand stems.

There are two easy approaches for controlling the parameters:

* Global options set in the GUI menus or as *recollq* command line
  switches.
* Per-clause options set by modifiers in the query language.

We would like, however, to let what the user types automatically override
the defaults in a sensible way. For example:

* If a term is entered with diacritics, diacritic sensitivity is turned on
  (for this term only).
* If a term is entered with upper-case characters, case sensitivity is
  turned on. In this case, we turn off stem expansion, because it
  really makes no sense with case sensitivity.

With this method we are stuck with three problems (only if the global mode is
set to insensitive, and we're not using the query language):

* Turning off stemming without turning on case sensitivity.
* Searching for an all lower-case term in case-sensitive mode.
* Searching for a term without diacritics in diacritic-sensitive mode.

The latter two issues are relatively marginal and can be worked around easily
by switching to query language mode or using negative clauses in the
advanced search.

However, we need to be able to turn stemming off while remaining
insensitive to case, and we need to stay reasonably compatible with the
previous versions. This means that a term which has a capital first letter
but is otherwise lowercase will turn stemming off, but not case sensitivity
on.

This leaves the question of how to search for such a term in a
case-sensitive way: for this, you'll have to use global options or the
query language.

The modified method is:

* If a term is entered with diacritics, diacritic sensitivity is turned on
  (for this term only).
* If the first letter in a term is upper-case and the rest is lower-case,
  we turn stem expansion off, but we do not become case-sensitive.
* If any letter in a term except the first is upper-case, case sensitivity
  is turned on. Stem expansion is also turned off (even if the first
  letter is lower-case), because it really makes no sense with case
  sensitivity.
* To search for an all lower-case or capitalized term in a case-sensitive
  way, use the query language: "Capitalized"C, "lowercase"C
* Use the query language and the "D" modifier to turn on diacritics
  sensitivity.

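The query language forms mentioned in this document ("term"l, "Capitalized"C and the "D" modifier) can be produced mechanically. The small Python helper below is a hypothetical convenience, not part of Recoll. It assumes that several modifiers may be appended to the same quoted clause and that the term contains no double quote; check both points against the query language documentation before relying on them.

```python
def clause(term: str, *, case_sensitive: bool = False,
           diac_sensitive: bool = False, no_stem: bool = False) -> str:
    """Format a quoted query-language clause with optional modifiers.

    Hypothetical helper: it only concatenates the modifier letters named
    in this document after the closing quote.
    """
    modifiers = ""
    if case_sensitive:
        modifiers += "C"  # case sensitivity
    if diac_sensitive:
        modifiers += "D"  # diacritics sensitivity
    if no_stem:
        modifiers += "l"  # no stem expansion
    return f'"{term}"{modifiers}'


if __name__ == "__main__":
    print(clause("lowercase", case_sensitive=True))
    print(clause("Capitalized", case_sensitive=True))
    print(clause("term", no_stem=True))
```

The resulting strings are meant to be typed in query language mode, or passed as the query to a tool that accepts that language.
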
Note that some combinations of choices do not make sense and are not
allowed by Recoll: for example, diacritics or case sensitivity
do not make sense with stem expansion (which cannot preserve diacritics in
any meaningful general way).

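The modified method can be summarized as a small decision function. The Python sketch below assumes the global mode is insensitive with stemming on, and it interprets the restriction above as "any sensitivity turns stem expansion off". The names `TermOptions` and `term_options` are hypothetical, and this is not the code used by Recoll.

```python
import unicodedata
from typing import NamedTuple


class TermOptions(NamedTuple):
    case_sensitive: bool
    diac_sensitive: bool
    stem_expand: bool


def has_diacritics(term: str) -> bool:
    """True if the term contains a letter carrying a combining mark."""
    decomposed = unicodedata.normalize("NFD", term)
    return any(unicodedata.category(c) == "Mn" for c in decomposed)


def term_options(term: str) -> TermOptions:
    """Derive per-term options from the way the term was typed."""
    # Diacritics in the term turn diacritic sensitivity on for this term.
    diac_sensitive = has_diacritics(term)
    # Any upper-case letter after the first one turns case sensitivity on.
    case_sensitive = any(ch.isupper() for ch in term[1:])
    # A leading capital alone only turns stem expansion off.
    capitalized = term[:1].isupper()
    # Assumption: stem expansion is dropped whenever a sensitivity is on.
    stem_expand = not (capitalized or case_sensitive or diac_sensitive)
    return TermOptions(case_sensitive, diac_sensitive, stem_expand)


if __name__ == "__main__":
    for t in ["resume", "Resume", "reSume", "RESUME", "résumé"]:
        print(t, term_options(t))
```

As the list above states, a case-sensitive search for "resume" or "Resume" cannot be expressed through typing alone under these rules, which is why the "C" modifier or a global option is needed for those cases.
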
The link:ZDevCaseAndDiacritics3.html[next page] describes the actual
implementation in Recoll 1.18.