diff --git a/.hgignore b/.hgignore
index 1b8209e4..5cd67428 100644
--- a/.hgignore
+++ b/.hgignore
@@ -73,7 +73,6 @@ src/doc/user/usermanual.pdf
 src/doc/user/usermanual.tex-pdf
 src/doc/user/usermanual.tex-pdf-tmp
 src/doc/user/usermanual.txt
-src/doc/user/usermanual.xml
 src/filters/rclexecm.pyc
 src/filters/rcllatinclass.pyc
 src/index/alldeps
diff --git a/src/VERSION b/src/VERSION
index 35e9b24b..39893559 100644
--- a/src/VERSION
+++ b/src/VERSION
@@ -1 +1 @@
-1.19.12p2
+1.20.0
diff --git a/src/doc/notes/minus-hyphen-dash.txt b/src/doc/notes/minus-hyphen-dash.txt
new file mode 100644
index 00000000..01970577
--- /dev/null
+++ b/src/doc/notes/minus-hyphen-dash.txt
@@ -0,0 +1,58 @@
+= 2014-04-30: Notes about the hyphen-minus character '-':
+
+ASCII hyphen-minus used to be glue, but stopped around version 1.18, then
+was reinstated in 1.20.
+
+Having - as glue avoids generating phrase searches with bad performance.
+
+== Dashes
+
+There is a diversity of Unicode characters used mostly indistinctly (and
+independent of their correct intent) as dash/minus/hyphen (hyphen, n-dash,
+em-dash, etc.) in real-world texts.
+
+The Unicode dashes are properly treated as word-breaking by the splitter,
+but it means that there will sometimes be a discrepancy between the
+character in the search (usually an ASCII hyphen-minus), and the character
+in the text (which could be anything because of misuse).
+
+It does happen (incorrectly) that a dash is used in a text instead of a
+hyphen to join a compound word, resulting in no span constructed, and a
+minus in the query, generating a span search, resulting in a missed
+match.
+
+A possible solution, changing all dash signs into minus signs at indexing
+time, has been dismissed because this would introduce problems
+with *correct* uses of dashes (which should be treated as space). This
+would not be a major issue though because a matching search would probably
+use white space in this case, and single terms are also generated for the
+span.
+
+There are auxiliary arguments:
+
+ - Treating all dash/hyphen/minus as whitespace (except at eol) makes for a
+   smaller index.
+ - Which is especially significant for raw indexes because of
+   multiplicative effects ("jean francois" "Jean francois" "jean Francois"
+   ...)
+
+== Hyphens
+
+Hyphens have several distinct uses which should yield different treatment:
+
+ - Use with prefixes and suffixes: co-worker should probably be transformed
+   into or supplemented by coworker
+ - Use in compound words: American-football in "American-football player"
+   should certainly not be collapsed.
+
+If a hyphen-minus is present in the text in the first case, as will be
+common in practice, there is no way we can get it right anyway, except by
+using a language dictionary.
+
+So, given that even a real hyphen needs an ambiguous treatment, we don't
+try and we just replace a Unicode hyphen (0x2010) with an ASCII
+hyphen-minus while indexing. This has the best chance of matching what a
+user would type.
+
+The current (1.20) recoll is unable to match coworker and co-worker. The
+best treatment for this would probably be synonym expansion at search time.
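
The indexing-time rule described in the notes (Unicode hyphen 0x2010 becomes an ASCII hyphen-minus, other Unicode dashes break words, ASCII hyphen-minus stays as glue) can be illustrated with a small Python sketch. This is not Recoll's splitter: the function names are hypothetical, the set of characters counted as "dashes" is an assumption made for the example, and whitespace splitting stands in for the real tokenizer.

```python
# Illustrative sketch only; names and the dash set are assumptions,
# not taken from the Recoll sources.

UNICODE_HYPHEN = "\u2010"  # the "real" hyphen mentioned in the notes

# Assumed word-breaking dashes: figure dash, en dash, em dash, horizontal bar.
WORD_BREAKING_DASHES = "\u2012\u2013\u2014\u2015"

# Translation table: Unicode hyphen -> ASCII hyphen-minus, dashes -> space.
_TABLE = {ord(UNICODE_HYPHEN): "-"}
_TABLE.update({ord(c): " " for c in WORD_BREAKING_DASHES})


def normalize_dashes(text):
    """Apply the hyphen/dash mapping to a piece of text."""
    return text.translate(_TABLE)


def split_terms(text):
    """Split on whitespace after normalization; '-' is kept as glue."""
    return normalize_dashes(text).split()
```

With this mapping, `split_terms("co\u2010worker")` yields the single glued term `co-worker`, which matches what a user would type, while `split_terms("jean\u2013francois")` yields two separate terms. A text that misuses an en dash inside a compound word is therefore still missed by a query typed with a hyphen-minus, which is the discrepancy the notes describe.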