doc
parent 0ded457258
commit b70de3130f
@@ -73,7 +73,6 @@ src/doc/user/usermanual.pdf
src/doc/user/usermanual.tex-pdf
src/doc/user/usermanual.tex-pdf-tmp
src/doc/user/usermanual.txt
src/doc/user/usermanual.xml
src/filters/rclexecm.pyc
src/filters/rcllatinclass.pyc
src/index/alldeps
@@ -1 +1 @@
1.19.12p2
1.20.0
58
src/doc/notes/minus-hyphen-dash.txt
Normal file
@@ -0,0 +1,58 @@
= 2014-04-30: Notes about the hyphen-minus character '-':

The ASCII hyphen-minus used to be glue, stopped being glue around version
1.18, then was reinstated in 1.20.

Having - as glue avoids generating phrase searches with bad performance.

== Dashes

A variety of Unicode characters (hyphen, n-dash, em-dash, etc.) are used
more or less interchangeably, and independently of their intended purpose,
as dash/minus/hyphen in real-world texts.

The Unicode dashes are properly treated as word-breaking by the splitter,
but this means that there will sometimes be a discrepancy between the
character in the search (usually an ASCII hyphen-minus) and the character
in the text (which could be anything because of misuse).

It does happen that a dash is (incorrectly) used in a text instead of a
hyphen to join a compound word, so that no span is constructed at indexing
time, while the hyphen-minus in the query generates a span search,
resulting in a missed match.
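
The mismatch described above can be reproduced with a toy splitter. This is only a sketch under stated assumptions, not recoll's actual splitter: the names `split_terms`, `GLUE` and `WORD_BREAKING_DASHES` are hypothetical, and only the n-dash and em-dash are treated as word-breaking here.

```python
# Toy model only: NOT recoll's real splitter. All names are hypothetical.
GLUE = "-"                                   # ASCII hyphen-minus joins words into a span
WORD_BREAKING_DASHES = ("\u2013", "\u2014")  # n-dash and em-dash act like white space


def split_terms(text):
    """Return (single_terms, spans) for a piece of text."""
    for dash in WORD_BREAKING_DASHES:
        text = text.replace(dash, " ")
    terms, spans = [], []
    for chunk in text.lower().split():
        parts = [p for p in chunk.split(GLUE) if p]
        terms.extend(parts)                  # single terms are always generated
        if len(parts) > 1:
            spans.append(GLUE.join(parts))   # a span exists only when glue was present
    return terms, spans


# The text misuses an n-dash inside a compound word, the query uses hyphen-minus.
indexed_terms, indexed_spans = split_terms("a well\u2013known author")
query_terms, query_spans = split_terms("well-known")

# The query asks for a span that was never built at indexing time.
span_found = all(span in indexed_spans for span in query_spans)
```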

A possible solution, changing all dash characters into hyphen-minus
characters at indexing time, has been dismissed because it would introduce
problems with *correct* uses of dashes (which should be treated as space).
This would not be a major issue though, because a matching search would
probably use white space in this case, and single terms are also generated
for the span.

There are auxiliary arguments:

- Treating all dash/hyphen/minus as whitespace (except at eol) makes for a
  smaller index.
- This is especially significant for raw indexes because of
  multiplicative effects ("jean francois" "Jean francois" "jean Francois"
  ...)
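
To give a rough idea of the multiplicative effect, the following sketch counts the distinct entries produced for one compound name. It assumes, for illustration only, that a raw index keeps a lowercase and a capitalized form of each word and stores one span per case combination; `case_variants` and `raw_index_terms` are hypothetical names, not recoll functions.

```python
from itertools import product


def case_variants(word):
    # Assumption: both the lowercase and the capitalized spelling occur in the texts.
    return sorted({word.lower(), word.capitalize()})


def raw_index_terms(words, glue):
    """Distinct index entries for a compound seen with every case combination."""
    singles = {v for w in words for v in case_variants(w)}
    if not glue:
        return singles                       # dash treated as white space: no spans
    spans = {"-".join(combo)
             for combo in product(*(case_variants(w) for w in words))}
    return singles | spans                   # single terms grow additively, spans multiply


without_glue = len(raw_index_terms(["jean", "francois"], glue=False))
with_glue = len(raw_index_terms(["jean", "francois"], glue=True))
three_words = len(raw_index_terms(["jean", "francois", "dupont"], glue=True))
```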

== Hyphens

Hyphens have several distinct uses which should be treated differently:

- Use with prefixes and suffixes: co-worker should probably be transformed
  into or supplemented by coworker
- Use in compound words: American-football in "American-football player"
  should certainly not be collapsed.

If a hyphen-minus is present in the text in the first case, as is common
in practice, there is no way we can get it right anyway, except by using a
language dictionary.

So, given that even a real hyphen calls for ambiguous treatment, we don't
try, and we just replace a Unicode hyphen (U+2010) with an ASCII
hyphen-minus while indexing. This has the best chance of matching what a
user would type.
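
The replacement itself is a one-character mapping. A minimal sketch in Python, assuming it is applied to text before splitting (the function name `normalize_hyphens` is hypothetical and this is not recoll's code); other dashes are deliberately left alone so that they still break words:

```python
UNICODE_HYPHEN = "\u2010"   # HYPHEN
HYPHEN_MINUS = "-"          # U+002D HYPHEN-MINUS, what a user normally types


def normalize_hyphens(text):
    """Map the Unicode hyphen to ASCII hyphen-minus; leave other dashes untouched."""
    return text.replace(UNICODE_HYPHEN, HYPHEN_MINUS)


hyphenated = normalize_hyphens("co\u2010worker")      # becomes "co-worker"
date_range = normalize_hyphens("1914\u20131918")      # n-dash is kept as is
```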

The current (1.20) recoll is unable to match coworker and co-worker. The
best treatment for this would probably be synonym expansion at search time.
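
What synonym expansion at search time could look like is sketched below. This is an illustration under assumptions, not an existing recoll feature: the synonym table and the function `expand_query_term` are hypothetical. An explicit table is used rather than blindly removing hyphens, because compounds such as American-football should not be collapsed.

```python
# Hypothetical synonym groups; a real table would come from a dictionary or user file.
SYNONYM_GROUPS = [
    ("coworker", "co-worker"),
]


def expand_query_term(term, groups=SYNONYM_GROUPS):
    """Return the term followed by its listed variants, without duplicates."""
    variants = [term]
    for group in groups:
        if term in group:
            variants.extend(v for v in group if v not in variants)
    return variants


expanded = expand_query_term("co-worker")          # searched as an OR of both spellings
untouched = expand_query_term("american-football") # no entry in the table: unchanged
```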