60 Commits

Author SHA1 Message Date
Jean-Francois Dockes
d86f74a9e8 missing include 2012-10-16 16:10:14 +02:00
Jean-Francois Dockes
9801f0389f fixed bug that would erase search term made of single wildcard 2012-10-05 09:15:09 +02:00
Jean-Francois Dockes
d3a26706b5 add a class for skipped characters 2012-10-03 09:07:59 +02:00
Jean-Francois Dockes
3f331ebb3e fix glitch caused by udi prefix change 2012-10-03 08:05:39 +02:00
Jean-Francois Dockes
efd319025d attempt to eliminate more unicode uninteresting characters 2012-10-02 17:45:16 +02:00
Jean-Francois Dockes
63d97e597b added a bunch of graphic characters to the word breakers list and changed the container used from set to unordered_set for speed 2012-09-19 19:50:45 +02:00
Jean-Francois Dockes
0ebfc496d8 add capability to remember page breaks generated by, e.g. pdftotext, and use them to start an external viewer on a match page 2012-08-21 15:03:02 +02:00
Jean-Francois Dockes
4eaf12fb9c more delistification 2012-04-12 08:15:50 +02:00
Jean-Francois Dockes
6c72454396 generate acronyms for dotted abbrevs. ie O.E.C.D -> OECD 2011-10-20 13:24:29 +02:00
Jean-Francois Dockes
0860b559ee get rid of a few garbage terms during indexing. Set a threshold for conversion errors after which we discard the doc. Stabilize the new termproc pipeline but no commongrams for now 2011-10-12 17:55:58 +02:00
Jean-Francois Dockes
36516b091b textsplit: discard - in front of words. Handle cjk punctuation characters 2011-07-16 11:51:38 +02:00
Jean-Francois Dockes
cb0794e92c textsplit: eliminate some garbage terms (ie long sequences of dashes) 2011-07-06 16:20:32 +02:00
Jean-Francois Dockes
55f124725f Fix problems that occurred when multiple threads were trying to read/convert files at the same time (ie: indexing and previewing threads in the GUI calling internfile()). Either get rid of or lock-protect all shared data, eliminate misc initialization possible conflicts by using static initializers. Hopefuly closes issue #51 2011-04-28 10:58:33 +02:00
Jean-Francois Dockes
b28eaf23fb Got rid of all the old RCS id strings 2011-04-27 08:22:17 +02:00
Jean-Francois Dockes
8520ec668a recognize more numbers: 1e-10, 1.e3 2010-05-17 09:20:09 +02:00
Jean-Francois Dockes
48358c8252 Added option nonumbers not to generate terms for numbers. closes #16 2010-05-05 10:18:56 +02:00
Jean-Francois Dockes
8b2b00bc72 cosmetics: use derived class for actual splitter instead of callback 2010-02-02 15:33:52 +01:00
dockes
69c27db46a add --enable-camelcase option to configure 2009-12-14 10:10:01 +00:00
dockes
bf3ac8e053 small amd64 fixes: 64 bits size_type, signed chars 2009-12-13 16:13:59 +00:00
dockes
3223d1245a process camelCase 2009-10-09 13:57:33 +00:00
dockes
6169fdec4b Emit a_b intermediary span when splitting a_b.c 2009-01-27 10:25:26 +00:00
dockes
7a22709cab add _ to wordsep/spanglue chars. Add non-ascii test to isCJK for optimization 2009-01-13 16:03:13 +00:00
dockes
64ef8d0b81 dont insert space in cjk abstracts 2008-12-12 11:53:45 +00:00
dockes
3414963810 take care of splitting user string with respect to unicode white space, not only ascii 2008-12-05 11:09:31 +00:00
dockes
46a7f05cbc gcc 4 compat, thanks to Kartik Mistry 2007-12-13 06:58:22 +00:00
dockes
90e378333e make cjk ngramlen configurable 2007-10-04 12:21:52 +00:00
dockes
4adb351ca4 add flag to disable cjk processing 2007-10-02 11:39:08 +00:00
dockes
ea7d3cd26e include assert.h when needed 2007-09-22 08:51:29 +00:00
dockes
645018d574 logs 2007-09-20 12:22:26 +00:00
dockes
069d71ea8f initial cjk support 2007-09-20 08:45:05 +00:00
dockes
ba295fae4f use m_ prefix for members 2007-09-18 20:35:31 +00:00
dockes
25a9b93635 [] are also wildcard chars 2007-01-25 15:40:55 +00:00
dockes
d12021b22c handle wildcards in search terms 2007-01-18 12:09:58 +00:00
dockes
554f75c99c only autophrase if query has several terms 2006-12-08 07:11:17 +00:00
dockes
9d6963c95a improved textsplit speed (needs utf8iter modifs too 2006-11-20 11:17:53 +00:00
dockes
b3ab39522b optim ckpt 2006-11-19 18:37:37 +00:00
dockes
31b348b736 phrase queries with bot spans and words must be splitted as words only 2006-11-12 08:35:11 +00:00
dockes
507ee32fdb 132.jpg was not split 2006-09-21 05:59:02 +00:00
dockes
4928503f60 fixed small glitch in abstract text splitting 2006-04-25 08:17:36 +00:00
dockes
930bdc870d comments and moving some util routines out of rcldb.cpp 2006-04-11 06:49:45 +00:00
dockes
346506a31d use string::erase() not clear() 2006-02-01 14:18:20 +00:00
dockes
91ac7b7885 moved span cleanup where it belonged 2006-01-30 09:28:16 +00:00
dockes
3c78938565 *** empty log message *** 2006-01-28 15:36:59 +00:00
dockes
8c9eb8c6d3 more textsplit tweaking 2006-01-28 10:23:55 +00:00
dockes
2a3075d6a6 reference to GPL in all .cpp files 2006-01-23 13:32:29 +00:00
dockes
36fe342ffd split stdin 2005-12-04 17:10:22 +00:00
dockes
ae8ff5abb3 *** empty log message *** 2005-11-24 07:16:16 +00:00
dockes
ce740a26ad most of adv search working. Still need subtree/filename filters 2005-10-19 10:21:48 +00:00
dockes
e1c3dbfeb3 adjust start/end of word when trimming 2005-09-22 14:09:04 +00:00
dockes
d8297680b1 fix problems with word followed by . 2005-09-22 11:10:11 +00:00