
== Character case and diacritic marks (1), issues with stemming

=== Case and diacritics in Recoll

Recoll versions up to 1.17 almost fully ignore character case and diacritic
marks.

All terms are converted to lower case and unaccented before they are
written to the index. There are only two exceptions:

 * File paths (as used in _dir:_ clauses) are not converted. This might
   be a bug or a feature, but the main reason is that we don't know how they
   are encoded.
 * It is possible to specify that some characters will keep their diacritic
   marks, because the entity formed by the character and the diacritic mark
   is considered to be a different letter, not a modified one. This is
   highly dependent on the language. For example, in Swedish, +å+ should
   be preserved, not turned into +a+.

As a necessary consequence, the same transformations are applied to search
terms, and it is impossible to search for a specific capitalization of a
word (+US+ is looked for as +us+), or a specific accented form
(+café+ will be looked for as +cafe+).
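
The folding described above can be sketched in a few lines of Python. This is only an illustration built on the standard +unicodedata+ module, not Recoll's actual code: the +unaccent+ and +fold+ names and the +keep+ parameter (the set of characters that retain their diacritics) are assumptions made for this example.

```python
import unicodedata

def unaccent(term, keep=frozenset()):
    """Remove diacritic marks, except on characters listed in keep."""
    out = []
    # Compose first, so that each accented letter is a single code point
    # that can be compared with the members of keep.
    for ch in unicodedata.normalize("NFC", term):
        if ch in keep:
            out.append(ch)
        else:
            # Decompose the character and drop the combining marks.
            decomposed = unicodedata.normalize("NFD", ch)
            out.append("".join(c for c in decomposed
                               if not unicodedata.combining(c)))
    return "".join(out)

def fold(term, keep=frozenset()):
    """Lower-case and unaccent a term, for indexing and searching alike."""
    return unaccent(term.lower(), keep)

print(fold("US"))                 # us
print(fold("Café"))               # cafe
print(fold("Språk", keep={"å"}))  # språk
```

The same +fold+ call has to be applied to index terms and to search terms, which is exactly why +US+ and +us+ become indistinguishable.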

However, there are some cases where you would like to be more specific:

 * Searching for +US+ or +us+ should probably return different results.
 * Diacritics are seldom significant in English, but we can find a
   few examples anyway: +sake+ and +saké+, +mate+ and +maté+. Of
   course, there are many more cases in languages which use more diacritics.

On the other hand, accents are often mistyped or forgotten (résumé, résume,
resume?), and capitalization is most often insignificant. It is therefore
very important to retain the ability to ignore accent and character case
differences, and to make it easy to switch the discrimination on or off
for each search (or even for specific terms).

This text and the pages that follow discuss the issues involved in adding
character case and diacritics sensitivity to Recoll, under the assumption
that the main index will contain the raw source terms instead of
case-folded and unaccented ones.

The following will use the _unaccent_ neologism to mean _remove
diacritic marks_ (and not only accents).

English examples are used when possible, but given the limited use of
diacritics in English, some French will probably creep in.

=== Diacritics and stemming

Stemming is the process by which we extend a search to terms related by
grammatical inflexion, for example singular/plural, verb tenses, etc. For
example a search for +floor+ is normally expanded by Recoll to +floors,
floored, flooring, ...+

In practice, Recoll has a separate data structure that has stemmed terms
(stems) as keys, each pointing to a list of expansion terms:

 floor -> (floor, floors, floorings, ...)
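
As a rough illustration of such a structure, the sketch below builds a stem-to-terms table with a plain dictionary. The +toy_stem+ function is a hypothetical stand-in that only strips a few English suffixes. It is not the Xapian stemmer, and +build_expansions+ is a name invented for this example.

```python
from collections import defaultdict

def toy_stem(term):
    """Hypothetical stand-in for a real stemmer: strips a few suffixes."""
    for suffix in ("ings", "ing", "ed", "s"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[:-len(suffix)]
    return term

def build_expansions(terms, stem=toy_stem):
    """Map each stem to the set of index terms that share it."""
    table = defaultdict(set)
    for term in terms:
        table[stem(term)].add(term)
    return table

index_terms = ["floor", "floors", "floored", "flooring", "door"]
expansions = build_expansions(index_terms)
print(sorted(expansions["floor"]))
# ['floor', 'floored', 'flooring', 'floors']
```

At query time, the user term is stemmed and the resulting stem is looked up in the table to obtain the expansion.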

Stemming should be applied to terms before they are stripped of
diacritics. Accents may have a grammatical significance, and the accent may
change how the term is stemmed. For example, in French the +âmes+ suffix
generally marks a past conjugation but +ames+ does not. The standard
Xapian French stemmer will turn +évitâmes+ (avoided) into an +évit+ stem,
but +évitames+ will be turned into +évitam+ (stripping
plural and feminine suffixes).
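
The effect of the processing order can be shown with a small sketch. The +toy_french_stem+ function below is a hypothetical suffix stripper written only to reproduce the stems quoted in this text. It is not the Xapian French stemmer and would give wrong results on most other words.

```python
import unicodedata

def unaccent(term):
    """Remove all diacritic marks from a term."""
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def toy_french_stem(term):
    """Hypothetical stand-in: strips the first matching suffix only."""
    term = unicodedata.normalize("NFC", term)
    for suffix in ("âmes", "er", "es", "e", "s"):
        if term.endswith(suffix):
            return term[:-len(suffix)]
    return term

# Stemming first, then unaccenting: both forms end up in the same family.
print(unaccent(toy_french_stem("évitâmes")))  # evit
print(unaccent(toy_french_stem("éviter")))    # evit

# Unaccenting first loses the accent that identified the suffix.
print(toy_french_stem(unaccent("évitâmes")))  # evitam
```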

When the search is set to ignore diacritics, this poses a specific problem:
if the user enters the search term without accents (which is correct
because the system is supposed to ignore them), there is no guarantee that
the term will be correctly expanded by stemming.

The diacritic mismatch breaks the family relationship between the stem
siblings, and this is independent of the type of index: it will happen with
an index where diacritics are stripped just as with a raw one.

The simpler case, where diacritics in the original term only affect
diacritics in the stem, also requires specific processing, but it is
easier to work around.

Two examples illustrating these issues follow.

==== The simple case: diacritics in the term only affect diacritics in the stem

Let's imagine that the document set contains the term +éviter+
(infinitive of +to avoid+), but not +évite+ (present). The only term in
the actual index is then +éviter+.

The user enters an unaccented +evite+, counting on the
diacritics-insensitive search mode to deal with the accents. As +évite+
is not present in the index, we have no way to guess that +evite+ is
really +évite+.

The stemmer will turn +evite+ into +evit+. There is no way that this
can be related to +éviter+, and this legitimate result can't be found.

There is a way around this: we can compute a separate
stem expansion dictionary for unaccented terms. This dictionary, to be used
with diacritics-insensitive searches only, contains the relationship
between +evit+ and +eviter+ (as +éviter+ is in the index). We can
then relate +eviter+ and +éviter+ because they differ only by accents,
and the search will find the document with +éviter+.
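
A minimal sketch of this workaround follows. All names are invented for the example, and +toy_stem+ is a hypothetical stand-in for a real stemmer that only handles the two endings needed here. The two tables mirror the two steps described above: from the unaccented stem to unaccented terms, then from unaccented terms back to the raw index terms.

```python
import unicodedata
from collections import defaultdict

def unaccent(term):
    """Remove all diacritic marks from a term."""
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def toy_stem(term):
    """Hypothetical stand-in for a real French stemmer."""
    for suffix in ("er", "e"):
        if term.endswith(suffix):
            return term[:-len(suffix)]
    return term

def build_tables(index_terms, stem=toy_stem):
    """Build the unaccented stem dictionary and the accent mapping."""
    stem_to_unaccented = defaultdict(set)  # evit   -> {eviter}
    unaccented_to_raw = defaultdict(set)   # eviter -> {éviter}
    for raw in index_terms:
        plain = unaccent(raw)
        stem_to_unaccented[stem(plain)].add(plain)
        unaccented_to_raw[plain].add(raw)
    return stem_to_unaccented, unaccented_to_raw

def expand(user_term, tables, stem=toy_stem):
    """Diacritics-insensitive stem expansion of a user term."""
    stem_to_unaccented, unaccented_to_raw = tables
    result = set()
    for plain in stem_to_unaccented.get(stem(unaccent(user_term)), ()):
        result |= unaccented_to_raw[plain]
    return result

tables = build_tables(["éviter"])
print(expand("evite", tables))  # {'éviter'}
```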

==== The bad case: diacritics in the term change the stem beyond diacritics

Some grammatically significant accents will cause unexpectedly missing
search results when using a supposedly diacritics-insensitive search mode.

Let's imagine that the document set contains the term +éviter+
(infinitive of +to avoid+), but not +évitâmes+ (past). So the stemming
expansion table has an entry for +évit+ -> +éviter+.

If the user enters an unaccented +evitames+, she would expect to find the
documents containing +éviter+ in the results, because the latter term is
a stemming sibling of +évitâmes+ and the search is supposedly not
influenced by diacritics, so that +evitames+ and +évitâmes+ should be
equivalent.

However, our search is now in trouble, because +évitâmes+ is not in any
document, so that there is no data in the index which would inform us about
how to transform the input term into something that differs only by accents
but would yield a correct input for the stemmer.

If we try to feed the raw user input to the stemmer, it will propose
an +evitam+ stem, which will not work, because the stem that actually
exists is +évit+, and +evitam+ cannot be related to +éviter+.

The only palliative approach I can think of would be a spelling correction
of the input, performed independently of the actual index contents, which
would notice that +evitames+ is not a French word and propose a change or an
expansion to +évitâmes+, which would correctly stem to +évit+ and allow
us to find +éviter+.
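
The idea can be sketched as follows, under strong assumptions: +FRENCH_WORDS+ is a hypothetical language word list that exists independently of the index, and the stemmer is replaced by a lookup table, +STEMS+, holding only the stems quoted in this text. Nothing here reflects an existing Recoll feature.

```python
import unicodedata

def unaccent(term):
    """Remove all diacritic marks from a term."""
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

# Hypothetical word list for the language, not derived from the index.
FRENCH_WORDS = {"évitâmes", "éviter", "évite"}

# Stand-in for the stemmer: only the stems quoted in the text.
STEMS = {"évitâmes": "évit", "evitames": "evitam"}

def stem(term):
    return STEMS.get(term, term)

# Stem expansion table computed from an index that only holds 'éviter'.
EXPANSIONS = {"évit": {"éviter"}}

def accent_candidates(user_term, lexicon):
    """Words of the language that differ from the input only by accents."""
    key = unaccent(user_term)
    return {word for word in lexicon if unaccent(word) == key}

def expand(user_term, lexicon, expansions):
    """Restore plausible accents before stemming, then expand."""
    candidates = accent_candidates(user_term, lexicon) or {user_term}
    result = set()
    for candidate in candidates:
        result |= expansions.get(stem(candidate), set())
    return result

# Raw input: the stem 'evitam' matches nothing.
print(EXPANSIONS.get(stem("evitames"), set()))         # set()
# With accent restoration: 'évitâmes' stems to 'évit'.
print(expand("evitames", FRENCH_WORDS, EXPANSIONS))    # {'éviter'}
```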

This issue is not specific to Recoll, nor does it depend on whether the
index retains accents. As far as I can see, it is an intrinsically bad
interaction between diacritics insensitivity and stemming.

It is also interesting to note that this case becomes less probable when
the data set becomes bigger, because more term inflexions will then be
present in the index.

We'll next think about an link:ZDevCaseAndDiacritics2.html[appropriate
interface].