Jean-Francois Dockes
|
96d99ad6e5
|
textsplit: check for underflow while trimming the span
|
2014-05-19 18:52:51 +02:00 |
|
Jean-Francois Dockes
|
0145234b60
|
translate unicode hyphen (0x2010) into ascii minus
|
2014-04-30 09:59:51 +02:00 |
|
Jean-Francois Dockes
|
077aed3018
|
fix term byte offsets produced by new textsplit: for highlighting
|
2014-04-24 12:42:10 +02:00 |
|
Jean-Francois Dockes
|
ece15318ab
|
New text splitter with word accumulator and full partial span generation. Search/Index seem ok. Still a problem with use for highlighting (preview)
|
2014-04-24 10:13:19 +02:00 |
|
Jean-Francois Dockes
|
e12d66865e
|
Deal with tr1 being gone in c0x11 compilers
|
2013-10-18 13:02:48 +02:00 |
|
Jean-Francois Dockes
|
b4c7efe490
|
Added (unifdefd) code to detect garbage data like undecoded base64 by looking at word length stats
|
2013-04-27 08:29:55 +02:00 |
|
Jean-Francois Dockes
|
d06e45946a
|
Handle wildcards as normal chars everywhere when splitting for query
|
2013-03-30 12:49:31 +01:00 |
|
Jean-Francois Dockes
|
0ae8ec99f6
|
more utf-8 err checking prevents bogus terms in index
|
2013-03-30 10:24:10 +01:00 |
|
Jean-Francois Dockes
|
df49598a8d
|
make comma a normal wordsplit char
|
2013-03-22 10:06:02 +01:00 |
|
Jean-Francois Dockes
|
dcf937d650
|
remove use of - as span-building character.
|
2013-03-04 12:16:11 +01:00 |
|
Jean-Francois Dockes
|
d2f7f11715
|
Use dynamic lib for shared recoll code
|
2012-12-29 14:27:01 +01:00 |
|
Jean-Francois Dockes
|
4544571490
|
term generation: only keep @ when not at start of term
|
2012-11-18 08:25:19 +01:00 |
|
Jean-Francois Dockes
|
d86f74a9e8
|
missing include
|
2012-10-16 16:10:14 +02:00 |
|
Jean-Francois Dockes
|
9801f0389f
|
fixed bug that would erase search term made of single wildcard
|
2012-10-05 09:15:09 +02:00 |
|
Jean-Francois Dockes
|
d3a26706b5
|
add a class for skipped characters
|
2012-10-03 09:07:59 +02:00 |
|
Jean-Francois Dockes
|
3f331ebb3e
|
fix glitch caused by udi prefix change
|
2012-10-03 08:05:39 +02:00 |
|
Jean-Francois Dockes
|
efd319025d
|
attempt to eliminate more unicode uninteresting characters
|
2012-10-02 17:45:16 +02:00 |
|
Jean-Francois Dockes
|
63d97e597b
|
added a bunch of graphic characters to the word breakers list and changed the container used from set to unordered_set for speed
|
2012-09-19 19:50:45 +02:00 |
|
Jean-Francois Dockes
|
0ebfc496d8
|
add capability to remember page breaks generated by, e.g. pdftotext, and use them to start an external viewer on a match page
|
2012-08-21 15:03:02 +02:00 |
|
Jean-Francois Dockes
|
4eaf12fb9c
|
more delistification
|
2012-04-12 08:15:50 +02:00 |
|
Jean-Francois Dockes
|
6c72454396
|
generate acronyms for dotted abbrevs. ie O.E.C.D -> OECD
|
2011-10-20 13:24:29 +02:00 |
|
Jean-Francois Dockes
|
0860b559ee
|
get rid of a few garbage terms during indexing. Set a threshold for conversion errors after which we discard the doc. Stabilize the new termproc pipeline but no commongrams for now
|
2011-10-12 17:55:58 +02:00 |
|
Jean-Francois Dockes
|
36516b091b
|
textsplit: discard - in front of words. Handle cjk punctuation characters
|
2011-07-16 11:51:38 +02:00 |
|
Jean-Francois Dockes
|
cb0794e92c
|
textsplit: eliminate some garbage terms (ie long sequences of dashes)
|
2011-07-06 16:20:32 +02:00 |
|
Jean-Francois Dockes
|
55f124725f
|
Fix problems that occurred when multiple threads were trying to read/convert files at the same time (ie: indexing and previewing threads in the GUI calling internfile()). Either get rid of or lock-protect all shared data, eliminate miscellaneous possible initialization conflicts by using static initializers. Hopefully closes issue #51
|
2011-04-28 10:58:33 +02:00 |
|
Jean-Francois Dockes
|
b28eaf23fb
|
Got rid of all the old RCS id strings
|
2011-04-27 08:22:17 +02:00 |
|
Jean-Francois Dockes
|
8520ec668a
|
recognize more numbers: 1e-10, 1.e3
|
2010-05-17 09:20:09 +02:00 |
|
Jean-Francois Dockes
|
48358c8252
|
Added option nonumbers not to generate terms for numbers. closes #16
|
2010-05-05 10:18:56 +02:00 |
|
Jean-Francois Dockes
|
8b2b00bc72
|
cosmetics: use derived class for actual splitter instead of callback
|
2010-02-02 15:33:52 +01:00 |
|
dockes
|
69c27db46a
|
add --enable-camelcase option to configure
|
2009-12-14 10:10:01 +00:00 |
|
dockes
|
bf3ac8e053
|
small amd64 fixes: 64 bits size_type, signed chars
|
2009-12-13 16:13:59 +00:00 |
|
dockes
|
3223d1245a
|
process camelCase
|
2009-10-09 13:57:33 +00:00 |
|
dockes
|
6169fdec4b
|
Emit a_b intermediary span when splitting a_b.c
|
2009-01-27 10:25:26 +00:00 |
|
dockes
|
7a22709cab
|
add _ to wordsep/spanglue chars. Add non-ascii test to isCJK for optimization
|
2009-01-13 16:03:13 +00:00 |
|
dockes
|
64ef8d0b81
|
don't insert space in cjk abstracts
|
2008-12-12 11:53:45 +00:00 |
|
dockes
|
3414963810
|
take care of splitting user string with respect to unicode white space, not only ascii
|
2008-12-05 11:09:31 +00:00 |
|
dockes
|
46a7f05cbc
|
gcc 4 compat, thanks to Kartik Mistry
|
2007-12-13 06:58:22 +00:00 |
|
dockes
|
90e378333e
|
make cjk ngramlen configurable
|
2007-10-04 12:21:52 +00:00 |
|
dockes
|
4adb351ca4
|
add flag to disable cjk processing
|
2007-10-02 11:39:08 +00:00 |
|
dockes
|
ea7d3cd26e
|
include assert.h when needed
|
2007-09-22 08:51:29 +00:00 |
|
dockes
|
645018d574
|
logs
|
2007-09-20 12:22:26 +00:00 |
|
dockes
|
069d71ea8f
|
initial cjk support
|
2007-09-20 08:45:05 +00:00 |
|
dockes
|
ba295fae4f
|
use m_ prefix for members
|
2007-09-18 20:35:31 +00:00 |
|
dockes
|
25a9b93635
|
[] are also wildcard chars
|
2007-01-25 15:40:55 +00:00 |
|
dockes
|
d12021b22c
|
handle wildcards in search terms
|
2007-01-18 12:09:58 +00:00 |
|
dockes
|
554f75c99c
|
only autophrase if query has several terms
|
2006-12-08 07:11:17 +00:00 |
|
dockes
|
9d6963c95a
|
improved textsplit speed (needs utf8iter modifs too)
|
2006-11-20 11:17:53 +00:00 |
|
dockes
|
b3ab39522b
|
optim ckpt
|
2006-11-19 18:37:37 +00:00 |
|
dockes
|
31b348b736
|
phrase queries with both spans and words must be split as words only
|
2006-11-12 08:35:11 +00:00 |
|
dockes
|
507ee32fdb
|
132.jpg was not split
|
2006-09-21 05:59:02 +00:00 |
|