40 Commits

Author SHA1 Message Date
Jean-Francois Dockes
15ea565e9f m_words_in_span was not always properly reset between invocations (if discardspan() was not called for some reason), resulting in crashes 2017-05-15 10:26:38 +02:00
Jean-Francois Dockes
9661a4431e wen 2017-04-18 14:39:12 +02:00
Jean-Francois Dockes
c1c73573d8 more int fixups
--HG--
branch : WINDOWSPORT
2015-09-02 07:34:59 +02:00
Jean-Francois Dockes
1cbf02f713 Suppressed many integer size warnings by a mix of type adjustments and casts,
none of which should have a real effect.

--HG--
branch : WINDOWSPORT
2015-09-01 19:39:20 +02:00
Jean-Francois Dockes
e1bb1a3022 Make dehyphenate (co-worker->coworker) optional 2015-08-19 11:34:26 +02:00
Jean-Francois Dockes
17d0a6cbba namespace std 2015-08-13 18:12:00 +02:00
Jean-Francois Dockes
0755f4f4e2 removed unused method 2015-06-09 19:16:10 +02:00
Jean-Francois Dockes
077aed3018 fix term byte offsets produced by new textsplit: for highlighting 2014-04-24 12:42:10 +02:00
Jean-Francois Dockes
ece15318ab New text splitter with word accumulator and full partial span generation. Search/Index seem ok. Still a pb with use for highlighting (preview) 2014-04-24 10:13:19 +02:00
Jean-Francois Dockes
b4c7efe490 Added (unifdefd) code to detect garbage data like undecoded base64 by looking at word length stats 2013-04-27 08:29:55 +02:00
Jean-Francois Dockes
0ebfc496d8 add capability to remember page breaks generated by, e.g. pdftotext, and use them to start an external viewer on a match page 2012-08-21 15:03:02 +02:00
Jean-Francois Dockes
4eaf12fb9c more delistification 2012-04-12 08:15:50 +02:00
Jean-Francois Dockes
5fd31172f5 New text to terms processing pipelines: results identical to 1.16 when used with empty stopfile 2011-10-07 07:53:49 +02:00
Jean-Francois Dockes
cb0794e92c textsplit: eliminate some garbage terms (ie long sequences of dashes) 2011-07-06 16:20:32 +02:00
Jean-Francois Dockes
b28eaf23fb Got rid of all the old RCS id strings 2011-04-27 08:22:17 +02:00
Jean-Francois Dockes
48358c8252 Added option nonumbers not to generate terms for numbers. closes #16 2010-05-05 10:18:56 +02:00
Jean-Francois Dockes
8b2b00bc72 cosmetics: use derived class for actual splitter instead of callback 2010-02-02 15:33:52 +01:00
dockes
6169fdec4b Emit a_b intermediary span when splitting a_b.c 2009-01-27 10:25:26 +00:00
dockes
64ef8d0b81 dont insert space in cjk abstracts 2008-12-12 11:53:45 +00:00
dockes
3414963810 take care of splitting user string with respect to unicode white space, not only ascii 2008-12-05 11:09:31 +00:00
dockes
90e378333e make cjk ngramlen configurable 2007-10-04 12:21:52 +00:00
dockes
4adb351ca4 add flag to disable cjk processing 2007-10-02 11:39:08 +00:00
dockes
069d71ea8f initial cjk support 2007-09-20 08:45:05 +00:00
dockes
ba295fae4f use m_ prefix for members 2007-09-18 20:35:31 +00:00
dockes
d12021b22c handle wildcards in search terms 2007-01-18 12:09:58 +00:00
dockes
554f75c99c only autophrase if query has several terms 2006-12-08 07:11:17 +00:00
dockes
9d6963c95a improved textsplit speed (needs utf8iter modifs too) 2006-11-20 11:17:53 +00:00
dockes
b3ab39522b optim ckpt 2006-11-19 18:37:37 +00:00
dockes
31b348b736 phrase queries with both spans and words must be split as words only 2006-11-12 08:35:11 +00:00
dockes
3872f8cf38 *** empty log message *** 2006-01-30 11:15:28 +00:00
dockes
3c78938565 *** empty log message *** 2006-01-28 15:36:59 +00:00
dockes
8c9eb8c6d3 more textsplit tweaking 2006-01-28 10:23:55 +00:00
dockes
ce740a26ad most of adv search working. Still need subtree/filename filters 2005-10-19 10:21:48 +00:00
dockes
8493933aef comments 2005-10-10 13:25:23 +00:00
dockes
4588803281 phrases ok except for preview position 2005-02-08 10:56:13 +00:00
dockes
4c54a8478f fixes in textsplit 2005-02-08 09:34:47 +00:00
dockes
2a020407da simple term highlighting in query preview 2005-02-07 13:17:47 +00:00
dockes
5210139b85 *** empty log message *** 2005-01-24 13:17:59 +00:00
dockes
869b57eb8c *** empty log message *** 2004-12-17 13:01:01 +00:00
dockes
5ca462cdff *** empty log message *** 2004-12-14 17:54:16 +00:00