== Character case and diacritic marks (3), implementation

In previous pages, we discussed link:ZDevCaseAndDiacritics1.html[diacritics
and stemming], and an link:ZDevCaseAndDiacritics2.html[appropriate
interface] for switchable search sensitivity to diacritics and character
case.

So you are in this mood again and you don't want to type accents (maybe you're
stuck with a QWERTY American English keyboard), or conversely you
want to resume looking for your résumé, and you've told Recoll as much,
using the appropriate interface. What happens then?

The second case (a sensitive search) is easy if the index is raw, and mostly
impossible if it is stripped. So we'll concentrate on the first case: how to
achieve case and diacritics insensitivity on a raw index?

Recoll uses three expansion tables:

* The first table has stripped and lowercased terms as keys and raw terms as
data: +mate -> (mate, maté, MATE, ...)+.

* The second table has lowercased stems as keys and original lowercase terms
as data (when using multiple languages, there are several such tables):
+évit -> (éviter, évite, évitâmes, ...)+.

* The third table has stripped and lowercased stems as keys and stripped
lowercased terms as data:
+evit -> (eviter, evite, evitons)+ and +evitam -> (evitames, ...)+.

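As a rough illustration, the three tables could be built from the raw index terms along the lines of the Python sketch below. This is not Recoll's actual code: the function names, the use of +unicodedata+ for stripping diacritics, and especially +toy_stem+ (a crude, hypothetical stand-in for a real stemmer) are assumptions made for the example.

```python
import unicodedata
from collections import defaultdict


def strip_diacritics(term):
    """Decompose the term, drop combining marks, recompose."""
    decomposed = unicodedata.normalize("NFD", term)
    kept = "".join(c for c in decomposed if not unicodedata.combining(c))
    return unicodedata.normalize("NFC", kept)


def fold(term):
    """Stripped and lowercased form, used as the key of the first table."""
    return strip_diacritics(term).lower()


def toy_stem(term):
    """Hypothetical stemmer: drops a few endings. NOT a real French stemmer."""
    for suffix in ("er", "ons", "es", "e"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term


def build_tables(raw_terms, stem=toy_stem):
    table1 = defaultdict(set)  # stripped+lowercased term -> raw terms
    table2 = defaultdict(set)  # lowercased stem -> lowercased terms
    table3 = defaultdict(set)  # stripped+lowercased stem -> stripped terms
    for raw in raw_terms:
        lowered = raw.lower()
        folded = fold(raw)
        table1[folded].add(raw)
        table2[stem(lowered)].add(lowered)
        table3[stem(folded)].add(folded)
    return table1, table2, table3


terms = ["mate", "maté", "MATE", "éviter", "évite", "évitâmes", "évitons"]
table1, table2, table3 = build_tables(terms)
```

A real stemmer can treat accented and stripped input differently, as the +évit+ and +evitam+ entries above suggest. The toy stemmer does not model that: it only shows how the keys and data of each table relate to the raw terms.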
The first table can be used for full case and diacritics expansion, or for
only one of the two, by post-filtering the results of full expansion. For
example, if we only want diacritics expansion, we strip diacritics from
each result term and keep the term only if the result is identical to the
input. Conversely, if we have +mate -> (mate, maté, MATE, MATÉ)+ in the
table and only want case expansion for an input of +maté+, we apply case
folding to each result term and keep only those which fold to +maté+
(+maté+ and +MATÉ+), as +mate+ differs from the input.

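The post-filtering described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Recoll's implementation: +expand+, its two flags and the dictionary-based +table1+ are hypothetical names, and diacritics are stripped with Python's +unicodedata+ module. When filtering for diacritics expansion, the sketch compares each stripped candidate to the stripped input, which amounts to the same test when the input carries no accents.

```python
import unicodedata


def strip_diacritics(term):
    """Remove combining marks, e.g. 'maté' becomes 'mate'."""
    decomposed = unicodedata.normalize("NFD", term)
    kept = "".join(c for c in decomposed if not unicodedata.combining(c))
    return unicodedata.normalize("NFC", kept)


def expand(term, table, case_sensitive=False, diac_sensitive=False):
    """Full expansion through the first table, then optional post-filtering."""
    key = strip_diacritics(term).lower()
    results = list(table.get(key, []))
    if diac_sensitive:
        # Case expansion only: a candidate must equal the input once both
        # are case-folded, so its diacritics must match the input's.
        results = [t for t in results if t.lower() == term.lower()]
    if case_sensitive:
        # Diacritics expansion only: a candidate must equal the input once
        # both are stripped, so its case must match the input's.
        wanted = strip_diacritics(term)
        results = [t for t in results if strip_diacritics(t) == wanted]
    return results


# First table from the example in the text.
table1 = {"mate": ["mate", "maté", "MATE", "MATÉ"]}

case_only = expand("maté", table1, diac_sensitive=True)
diacritics_only = expand("mate", table1, case_sensitive=True)
```

Here +case_only+ holds +maté+ and +MATÉ+, and +diacritics_only+ holds +mate+ and +maté+. Filtering after a full expansion keeps a single table for all three modes, at the cost of a little extra work per query.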
We only perform stemming expansion when case and diacritics sensitivity is
off. It is performed using the second and third tables, on both the
lowercased and the lowercased/stripped output of the first step, and each
term in the stemming output is then expanded again for case and diacritics
(using the first table).

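The three steps (case and diacritics expansion, stem expansion, then case and diacritics expansion again) can be sketched in Python as below. Everything here is an assumption made for illustration: the function names, the in-memory dictionaries standing in for the tables, and +toy_stem+, which simply truncates terms and is in no way a real stemmer.

```python
import unicodedata
from collections import defaultdict


def strip_diacritics(term):
    decomposed = unicodedata.normalize("NFD", term)
    kept = "".join(c for c in decomposed if not unicodedata.combining(c))
    return unicodedata.normalize("NFC", kept)


def fold(term):
    return strip_diacritics(term).lower()


def toy_stem(term):
    """Hypothetical stemmer stand-in: keep the first five characters."""
    return term[:5]


def build_tables(raw_terms, stem):
    table1, table2, table3 = defaultdict(set), defaultdict(set), defaultdict(set)
    for raw in raw_terms:
        lowered, folded = raw.lower(), fold(raw)
        table1[folded].add(raw)              # folded term -> raw terms
        table2[stem(lowered)].add(lowered)   # lowercased stem -> lowercased terms
        table3[stem(folded)].add(folded)     # folded stem -> folded terms
    return table1, table2, table3


def insensitive_expand(term, tables, stem):
    table1, table2, table3 = tables
    # Step 1: full case and diacritics expansion of the input.
    step1 = table1.get(fold(term), set())
    # Step 2: stem expansion of the lowercased and lowercased/stripped forms.
    stemmed = set()
    for raw in step1:
        lowered, folded = raw.lower(), fold(raw)
        stemmed |= {lowered, folded}
        stemmed |= table2.get(stem(lowered), set())
        stemmed |= table3.get(stem(folded), set())
    # Step 3: case and diacritics expansion of every stem expansion result.
    final = set()
    for candidate in stemmed:
        final |= table1.get(fold(candidate), set())
    return final


index_terms = ["RESUME", "Resume", "resume", "résumé", "Résumé",
               "résumer", "résumés", "resumes", "mate"]
tables = build_tables(index_terms, toy_stem)
expanded = insensitive_expand("resume", tables, toy_stem)
```

Because every lookup goes through tables built from the index, +expanded+ can only contain terms which actually occur in it. In this toy index that is every term except +mate+.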
A full example follows of the expansion occurring during an insensitive
search for +resume+, using French stemming on a mixed English/French index.
An important thing to remember is that the result of each expansion is a
function of the terms actually present in the index, not some arbitrary
computation (and so, of course, many possible variations are missing
because they do not occur in the index).

. The case and diacritics expansion of +resume+ yields +RESUME Resume
Résumé resumé résume résumé resume+.

. The stem expansion input list (lowercased) is
+resume resumé résume résumé+, and the output is
+resum resume resumenes resumer resumes resumé resumée résum résumait
résumant résume résumer résumerai résumerait résumes résumez résumé résumée
résumées résumés+.

. Each of the above terms is then fed to case and diacritics expansion (first
table), for the final output
+resume résumé Résumé résumer résume Resume résumés RESUME resumes
resumer résumant resúmenes resumé résumait résumes résumée resumee
résumerait Résumez résumerai RÉSUMÉES Resumée Resumes résumées+.

A Xapian OR query is finally constructed from the expanded term list.