== Character case and diacritic marks (1), issues with stemming

=== Case and diacritics in Recoll

Recoll versions up to 1.17 almost fully ignore character case and diacritic
marks.

All terms are converted to lower case and unaccented before they are
written to the index. There are only two exceptions:

* File paths (as used in _dir:_ clauses) are not converted. This might
be a bug or a feature, but the main reason is that we don't know how they
are encoded.
* It is possible to specify that some characters will keep their diacritic
marks, because the entity formed by the character and the diacritic mark
is considered to be a different letter, not a modified one. This is
highly dependent on the language. For example, in Swedish, +å+ should
be preserved, not turned into +a+.

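The conversion described above can be sketched in a few lines of Python. This is only an illustration built on the standard +unicodedata+ module, not Recoll's actual code: the +unaccent+ function name and its +keep+ parameter are made up for the example, and letters which have no Unicode decomposition (such as +ø+) are left untouched.

```python
import unicodedata

def unaccent(term, keep=frozenset()):
    """Lower-case `term` and strip diacritics, except for characters in `keep`.

    `keep` plays the role of the per-language exception list (hypothetical name).
    """
    out = []
    # Compose first so that each accented letter is a single character.
    for ch in unicodedata.normalize("NFC", term.lower()):
        if ch in keep:
            out.append(ch)
            continue
        # Decompose the letter and drop the combining marks.
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)

print(unaccent("Café"))                  # cafe
print(unaccent("Ångström", keep={"å"}))  # ångstrom
```

With an empty exception set, +Ångström+ becomes +angstrom+. With +å+ in the set, the Swedish letter survives while +ö+ is still stripped.
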
As a necessary consequence, the same transformations are applied to search
terms, and it is impossible to search for a specific capitalization of a
word (+US+ is looked for as +us+), or a specific accented form
(+café+ will be looked for as +cafe+).

However, there are some cases where you would like to be more specific:

* Searching for +US+ or +us+ should probably return different results.
* Diacritics are seldom significant in English, but we can find a
few examples anyway: +sake+ and +saké+, +mate+ and +maté+. Of
course, there are many more cases in languages which use more diacritics.

On the other hand, accents are often mistyped or forgotten (résumé, résume,
resume?), and capitalization is most often insignificant. It is therefore
very important to retain the ability to ignore accent and character
case differences, and to make it easy to switch the discrimination on or
off for each search (or even for specific terms).

This text and the pages that follow discuss the issues in adding
character case and diacritics sensitivity to Recoll, under the assumption
that the main index will contain the raw source terms instead of
case-folded and unaccented ones.

The following will use the _unaccent_ neologism to mean _remove
diacritic marks_ (and not only accents).

English examples are used when possible, but given the limited use of
diacritics in English, some French will probably creep in.

=== Diacritics and stemming

Stemming is the process by which we extend a search to terms related by
grammatical inflexion, such as singular/plural forms or verb tenses. For
example, a search for +floor+ is normally expanded by Recoll to
+floors, floored, flooring, ...+

In practice, Recoll has a separate data structure that has stemmed terms
(stems) as keys, each pointing to a list of expansion terms:

 floor -> (floor,floors,floorings,...)

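As a rough picture of such a structure, here is a minimal Python sketch. The +toy_stem+ function is a deliberately naive suffix stripper invented for this example (Recoll relies on the Xapian stemmers, which are far more careful), and +build_stem_table+ and +expand+ are hypothetical names, not Recoll functions.

```python
from collections import defaultdict

def toy_stem(term):
    """Naive English suffix stripper, for illustration only."""
    for suffix in ("ings", "ing", "ed", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

def build_stem_table(index_terms, stem=toy_stem):
    """Map each stem to the set of index terms which share it."""
    table = defaultdict(set)
    for term in index_terms:
        table[stem(term)].add(term)
    return table

def expand(term, table, stem=toy_stem):
    """Expand a search term to its stem siblings, or to itself if none are known."""
    return sorted(table.get(stem(term), {term}))

table = build_stem_table(["floor", "floors", "floored", "flooring", "table"])
print(expand("floors", table))  # ['floor', 'floored', 'flooring', 'floors']
```

The table is computed from the terms actually present in the index, so a term which is absent from all documents never appears as an expansion.
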
Stemming should be applied to terms before they are stripped of
diacritics. Accents may have a grammatical significance, and the accent may
change how the term is stemmed. For example, in French the +âmes+ suffix
generally marks a past conjugation but +ames+ does not. The standard
Xapian French stemmer will turn +évitâmes+ (we avoided) into an +évit+ stem,
but +évitames+ will be turned into +évitam+ (stripping only the
plural and feminine suffixes).

When the search is set to ignore diacritics, this poses a specific problem:
if the user enters the search term without accents (which is correct
because the system is supposed to ignore them), there is no guarantee that
the term will be correctly expanded by stemming.

The diacritic mismatch breaks the family relationship between the stem
siblings, and this is independent of the type of index: it will happen with
an index where diacritics are stripped just as with a raw one.

The simpler case, where diacritics in the original term only affect
diacritics in the stem, also requires specific processing, but it is
easier to work around.

Two examples illustrating these issues follow.

==== The simple case: diacritics in the term only affect diacritics in the stem

Let's imagine that the document set contains the term +éviter+
(infinitive of +to avoid+), but not +évite+ (present). The only term in
the actual index is then +éviter+.

The user enters an unaccented +evite+, counting on the
diacritics-insensitive search mode to deal with the accents. As +évite+
is not present in the index, we have no way to guess that +evite+ is
really +évite+.

The stemmer will turn +evite+ into +evit+. The stem expansion table is
keyed by the accented stem +évit+, so there is no way that +evit+
can be related to +éviter+, and this legitimate result can't be found.

There is a way around this: we can compute a separate
stem expansion dictionary for unaccented terms. This dictionary, to be used
with diacritics-insensitive searches only, contains the relationship
between +evit+ and +eviter+ (as +éviter+ is in the index). We can
then relate +eviter+ and +éviter+ because they differ only by accents,
and the search will find the document with +éviter+.

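The workaround can be sketched as follows. This is a toy model under loud assumptions: +toy_french_stem+ is a crude suffix stripper written for this example, chosen only so that it behaves like the stemmer on the few words used here, and the two tables and all function names are hypothetical.

```python
import unicodedata
from collections import defaultdict

def unaccent(term):
    """Lower-case the term and remove all combining diacritic marks."""
    decomposed = unicodedata.normalize("NFD", term.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def toy_french_stem(term):
    """Crude stand-in for a real French stemmer (assumption, not Xapian's)."""
    for suffix in ("âmes", "er", "es", "e", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

def build_tables(index_terms):
    """Build the unaccented stem table and the unaccented -> raw term table."""
    stem_to_unaccented = defaultdict(set)
    unaccented_to_raw = defaultdict(set)
    for term in index_terms:
        bare = unaccent(term)
        stem_to_unaccented[toy_french_stem(bare)].add(bare)
        unaccented_to_raw[bare].add(term)
    return stem_to_unaccented, unaccented_to_raw

def expand_insensitive(user_term, stem_to_unaccented, unaccented_to_raw):
    """Expand a diacritics-insensitive search term to raw index terms."""
    bare = unaccent(user_term)
    candidates = stem_to_unaccented.get(toy_french_stem(bare), set()) | {bare}
    found = set()
    for cand in candidates:
        found |= unaccented_to_raw.get(cand, set())
    return sorted(found)

tables = build_tables(["éviter"])
print(expand_insensitive("evite", *tables))  # ['éviter']
```

The first table relates +evit+ to +eviter+, and the second relates +eviter+ back to the raw +éviter+ stored in the index, which is the two-step lookup described above.
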
==== The bad case: diacritics in the term change the stem beyond diacritics

Some grammatically significant accents will cause unexpectedly missing
search results when using a supposedly diacritics-insensitive search mode.

Let's imagine that the document set contains the term +éviter+
(infinitive of +to avoid+), but not +évitâmes+ (past). So the stemming
expansion table has an entry for +évit+ -> +éviter+.

If the user enters an unaccented +evitames+, she would expect to find the
documents containing +éviter+ in the results, because the latter term is
a stemming sibling of +évitâmes+ and the search is supposedly not
influenced by diacritics, so that +evitames+ and +évitâmes+ should be
equivalent.

However, our search is now in trouble, because +évitâmes+ is not in any
document, so that there is no data in the index which would inform us about
how to transform the input term into something that differs only by accents
but would yield a correct input for the stemmer.

If we try to feed the raw user input to the stemmer, it will propose
an +evitam+ stem, which will not work, because the stem that actually
exists is +évit+, and +evitam+ cannot be related to +éviter+.

The only palliative approach I can think of would be a spelling correction
of the input, performed independently of the actual index contents, which
would notice that +evitames+ is not a French word and propose a change or an
expansion to +évitâmes+, which would correctly stem to +évit+ and allow
us to find +éviter+.

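To make the idea concrete, here is a small sketch of such an accent-restoring correction. Everything in it is an assumption made for illustration: +FRENCH_WORDS+ stands in for a real dictionary which would be independent of the index, +toy_french_stem+ is a crude suffix stripper which only mimics the stemmer on these few words, and the function names are hypothetical.

```python
import unicodedata
from collections import defaultdict

# Hypothetical stand-in for a real French word list, independent of the index.
FRENCH_WORDS = ["éviter", "évite", "évitâmes"]

def unaccent(term):
    """Lower-case the term and remove all combining diacritic marks."""
    decomposed = unicodedata.normalize("NFD", term.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def toy_french_stem(term):
    """Crude stand-in for a real French stemmer (assumption, not Xapian's)."""
    term = unicodedata.normalize("NFC", term)
    for suffix in ("âmes", "er", "es", "e", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

def build_accent_lookup(lexicon):
    """Map each unaccented form to the dictionary words it may stand for."""
    lookup = defaultdict(set)
    for word in lexicon:
        lookup[unaccent(word)].add(unicodedata.normalize("NFC", word))
    return lookup

def stems_for_query(user_term, lookup):
    """Restore accents from the dictionary, then stem the corrected forms."""
    corrected = lookup.get(unaccent(user_term), {user_term})
    return sorted({toy_french_stem(word) for word in corrected})

lookup = build_accent_lookup(FRENCH_WORDS)
print(toy_french_stem("evitames"))          # evitam
print(stems_for_query("evitames", lookup))  # ['évit']
```

Stemming the raw input yields the useless +evitam+, while going through the dictionary first yields +évit+, which is the key that leads to +éviter+. A real implementation would also have to decide what to do when several dictionary words share the same unaccented form.
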
This issue is not specific to Recoll, nor does it depend on whether the index
retains accents or not. As far as I can see, it is an intrinsic bad
interaction between diacritics insensitivity and stemming.

It is also interesting to note that this case becomes less probable when
the data set becomes bigger, because more term inflexions will then be
present in the index.

We'll next think about an link:ZDevCaseAndDiacritics2.html[appropriate interface].