88 Commits

Author SHA1 Message Date
Jean-Francois Dockes
9661a4431e wen 2017-04-18 14:39:12 +02:00
Jean-Francois Dockes
0b0385e459 got rid of the STD_SHARED_XX std/tr1 defines 2016-07-13 15:12:25 +02:00
Jean-Francois Dockes
d8f4500f90 fix debuglog ref in test driver + std=c++11 2016-07-12 19:32:02 +02:00
Jean-Francois Dockes
f6a999de84 logging now uses c++ streams 2016-07-12 09:41:04 +02:00
Jean-Francois Dockes
a905a92328 arrange so that ' .net' is split as .net and net. Previously it only produced .net, which meant that matching filename extensions, like in fn:pdf$ did not work well because of cases where a special char or a space occurred before the . 2016-06-20 17:25:25 +02:00
Jean-Francois Dockes
0a9d55e790 Suppressed a couple warnings (unsigned issues) + small windows release fixes 2016-01-29 17:30:50 +01:00
Jean-Francois Dockes
c1c73573d8 more int fixups
--HG--
branch : WINDOWSPORT
2015-09-02 07:34:59 +02:00
Jean-Francois Dockes
1cbf02f713 Suppressed many integer size warnings by a mix of type adjustments and casts,
none of which should have a real effect.

--HG--
branch : WINDOWSPORT
2015-09-01 19:39:20 +02:00
Jean-Francois Dockes
82295328cc Test for end() after lower_bound call before dereferencing!
--HG--
branch : WINDOWSPORT
2015-09-01 14:44:30 +02:00
Jean-Francois Dockes
e1bb1a3022 Make dehyphenate (co-worker->coworker) optional 2015-08-19 11:34:26 +02:00
Jean-Francois Dockes
94eb3119ce Generate an additional unhyphenated term for singly hyphenated words: co-worker will index as [co worker], [co-worker] and [coworker]. Only produce terms for alphanumeric hashtags (discard #,xyz) 2015-08-13 18:18:49 +02:00
Jean-Francois Dockes
abe9fb671f clean up autoconf of unordered_xx, prepare change to shared_ptr 2015-08-09 10:21:46 +02:00
Jean-Francois Dockes
f70ec1cab7 comment 2015-07-31 11:23:17 +02:00
Jean-Francois Dockes
0be78cfe48 index #hashtags as such 2014-07-29 09:56:00 +02:00
Jean-Francois Dockes
efaa1fb3a3 fix textsplit core dump caused by interaction of new 1.20 code with little-tested camelcase splitting section 2014-07-28 22:12:35 +02:00
Jean-Francois Dockes
8e7bac08c1 test driver 2014-06-10 17:41:46 +02:00
Jean-Francois Dockes
96d99ad6e5 textsplit: check for underflow while trimming the span 2014-05-19 18:52:51 +02:00
Jean-Francois Dockes
0145234b60 translate unicode hyphen (0x2010) in to ascii minus 2014-04-30 09:59:51 +02:00
Jean-Francois Dockes
077aed3018 fix term byte offsets produced by new textsplit: for highlighting 2014-04-24 12:42:10 +02:00
Jean-Francois Dockes
ece15318ab New text splitter with word accumulator and full partial span generation. Search/Index seem ok. Still a pb with use for highlighting (preview) 2014-04-24 10:13:19 +02:00
Jean-Francois Dockes
e12d66865e Deal with tr1 being gone in c0x11 compilers 2013-10-18 13:02:48 +02:00
Jean-Francois Dockes
b4c7efe490 Added (unifdefd) code to detect garbage data like undecoded base64 by looking at word length stats 2013-04-27 08:29:55 +02:00
Jean-Francois Dockes
d06e45946a Handle wildcards as normal chars everywhere when splitting for query 2013-03-30 12:49:31 +01:00
Jean-Francois Dockes
0ae8ec99f6 more utf-8 err checking prevents bogus terms in index 2013-03-30 10:24:10 +01:00
Jean-Francois Dockes
df49598a8d make comma a normal wordsplit char 2013-03-22 10:06:02 +01:00
Jean-Francois Dockes
dcf937d650 remove use of - as span-building character. 2013-03-04 12:16:11 +01:00
Jean-Francois Dockes
d2f7f11715 Use dynamic lib for shared recoll code 2012-12-29 14:27:01 +01:00
Jean-Francois Dockes
4544571490 term generation: only keep @ when not at start of term 2012-11-18 08:25:19 +01:00
Jean-Francois Dockes
d86f74a9e8 missing include 2012-10-16 16:10:14 +02:00
Jean-Francois Dockes
9801f0389f fixed bug that would erase search term made of single wildcard 2012-10-05 09:15:09 +02:00
Jean-Francois Dockes
d3a26706b5 add a class for skipped characters 2012-10-03 09:07:59 +02:00
Jean-Francois Dockes
3f331ebb3e fix glitch caused by udi prefix change 2012-10-03 08:05:39 +02:00
Jean-Francois Dockes
efd319025d attempt to eliminate more unicode uninteresting characters 2012-10-02 17:45:16 +02:00
Jean-Francois Dockes
63d97e597b added a bunch of graphic characters to the word breakers list and changed the container used from set to unordered_set for speed 2012-09-19 19:50:45 +02:00
Jean-Francois Dockes
0ebfc496d8 add capability to remember page breaks generated by, e.g. pdftotext, and use them to start an external viewer on a match page 2012-08-21 15:03:02 +02:00
Jean-Francois Dockes
4eaf12fb9c more delistification 2012-04-12 08:15:50 +02:00
Jean-Francois Dockes
6c72454396 generate acronyms for dotted abbrevs. ie O.E.C.D -> OECD 2011-10-20 13:24:29 +02:00
Jean-Francois Dockes
0860b559ee get rid of a few garbage terms during indexing. Set a threshold for conversion errors after which we discard the doc. Stabilize the new termproc pipeline but no commongrams for now 2011-10-12 17:55:58 +02:00
Jean-Francois Dockes
36516b091b textsplit: discard - in front of words. Handle cjk punctuation characters 2011-07-16 11:51:38 +02:00
Jean-Francois Dockes
cb0794e92c textsplit: eliminate some garbage terms (ie long sequences of dashes) 2011-07-06 16:20:32 +02:00
Jean-Francois Dockes
55f124725f Fix problems that occurred when multiple threads were trying to read/convert files at the same time (ie: indexing and previewing threads in the GUI calling internfile()). Either get rid of or lock-protect all shared data, eliminate misc initialization possible conflicts by using static initializers. Hopefully closes issue #51 2011-04-28 10:58:33 +02:00
Jean-Francois Dockes
b28eaf23fb Got rid of all the old RCS id strings 2011-04-27 08:22:17 +02:00
Jean-Francois Dockes
8520ec668a recognize more numbers: 1e-10, 1.e3 2010-05-17 09:20:09 +02:00
Jean-Francois Dockes
48358c8252 Added option nonumbers not to generate terms for numbers. closes #16 2010-05-05 10:18:56 +02:00
Jean-Francois Dockes
8b2b00bc72 cosmetics: use derived class for actual splitter instead of callback 2010-02-02 15:33:52 +01:00
dockes
69c27db46a add --enable-camelcase option to configure 2009-12-14 10:10:01 +00:00
dockes
bf3ac8e053 small amd64 fixes: 64 bits size_type, signed chars 2009-12-13 16:13:59 +00:00
dockes
3223d1245a process camelCase 2009-10-09 13:57:33 +00:00
dockes
6169fdec4b Emit a_b intermediary span when splitting a_b.c 2009-01-27 10:25:26 +00:00
dockes
7a22709cab add _ to wordsep/spanglue chars. Add non-ascii test to isCJK for optimization 2009-01-13 16:03:13 +00:00