Jean-Francois Dockes
|
bdc8d3eb38
|
Add config variable to process backslashes as letters
|
2019-01-29 18:32:19 +01:00 |
|
Jean-Francois Dockes
|
55e2fe5d27
|
Prevent a bad array access and STL assertion crash in the text splitter (Fedora rpmbuild) in a marginal case. There was probably no real consequence beyond triggering the assertion
|
2018-11-15 18:19:39 +01:00 |
|
Jean-Francois Dockes
|
9244e31574
|
fixed a few spelling errors, mostly in comments and debug messages
|
2018-05-03 16:20:36 +02:00 |
|
Jean-Francois Dockes
|
5b35ecfe36
|
Windows warning suppression (no real changes)
|
2018-01-19 17:26:43 +01:00 |
|
Jean-Francois Dockes
|
15ea565e9f
|
m_words_in_span was not always properly reset between invocations (if discardspan() was not called for some reason), resulting in crashes
|
2017-05-15 10:26:38 +02:00 |
|
Jean-Francois Dockes
|
f853f39ef3
|
Partially revert change treating Katakana as words, going back to n-grams. Did not work well because of separator-less compounds mostly
|
2017-04-25 10:20:38 +02:00 |
|
Jean-Francois Dockes
|
adaf7c77f9
|
Process katakana-western transitions as word breaks
|
2017-04-21 12:08:43 +02:00 |
|
Jean-Francois Dockes
|
9661a4431e
|
wen
|
2017-04-18 14:39:12 +02:00 |
|
Jean-Francois Dockes
|
0b0385e459
|
got rid of the STD_SHARED_XX std/tr1 defines
|
2016-07-13 15:12:25 +02:00 |
|
Jean-Francois Dockes
|
d8f4500f90
|
fix debuglog ref in test driver + std=c++11
|
2016-07-12 19:32:02 +02:00 |
|
Jean-Francois Dockes
|
f6a999de84
|
logging now uses c++ streams
|
2016-07-12 09:41:04 +02:00 |
|
Jean-Francois Dockes
|
a905a92328
|
arrange so that ' .net' is split as .net and net. Previously it only produced .net, which meant that matching filename extensions, like in fn:pdf$ did not work well because of cases where a special char or a space occurred before the .
|
2016-06-20 17:25:25 +02:00 |
|
Jean-Francois Dockes
|
0a9d55e790
|
Suppressed a couple warnings (unsigned issues) + small windows release fixes
|
2016-01-29 17:30:50 +01:00 |
|
Jean-Francois Dockes
|
c1c73573d8
|
more int fixups
--HG--
branch : WINDOWSPORT
|
2015-09-02 07:34:59 +02:00 |
|
Jean-Francois Dockes
|
1cbf02f713
|
Suppressed many integer size warnings by a mix of type adjustments and casts,
none of which should have a real effect.
--HG--
branch : WINDOWSPORT
|
2015-09-01 19:39:20 +02:00 |
|
Jean-Francois Dockes
|
82295328cc
|
Test for end() after lower_bound call before dereferencing!
--HG--
branch : WINDOWSPORT
|
2015-09-01 14:44:30 +02:00 |
|
Jean-Francois Dockes
|
e1bb1a3022
|
Make dehyphenate (co-worker->coworker) optional
|
2015-08-19 11:34:26 +02:00 |
|
Jean-Francois Dockes
|
94eb3119ce
|
Generate an additional unhyphenated term for singly hyphenated words: co-worker will index as [co worker], [co-worker] and [coworker]. Only produce terms for alphanumeric hashtags (discard #,xyz)
|
2015-08-13 18:18:49 +02:00 |
|
Jean-Francois Dockes
|
abe9fb671f
|
clean up autoconf of unordered_xx, prepare change to shared_ptr
|
2015-08-09 10:21:46 +02:00 |
|
Jean-Francois Dockes
|
f70ec1cab7
|
comment
|
2015-07-31 11:23:17 +02:00 |
|
Jean-Francois Dockes
|
0be78cfe48
|
index #hashtags as such
|
2014-07-29 09:56:00 +02:00 |
|
Jean-Francois Dockes
|
efaa1fb3a3
|
fix textsplit core dump caused by interaction of new 1.20 code with little-tested camelcase splitting section
|
2014-07-28 22:12:35 +02:00 |
|
Jean-Francois Dockes
|
8e7bac08c1
|
test driver
|
2014-06-10 17:41:46 +02:00 |
|
Jean-Francois Dockes
|
96d99ad6e5
|
textsplit: check for underflow while trimming the span
|
2014-05-19 18:52:51 +02:00 |
|
Jean-Francois Dockes
|
0145234b60
|
translate unicode hyphen (0x2010) into ascii minus
|
2014-04-30 09:59:51 +02:00 |
|
Jean-Francois Dockes
|
077aed3018
|
fix term byte offsets produced by new textsplit: for highlighting
|
2014-04-24 12:42:10 +02:00 |
|
Jean-Francois Dockes
|
ece15318ab
|
New text splitter with word accumulator and full partial span generation. Search/Index seem ok. Still a problem with use for highlighting (preview)
|
2014-04-24 10:13:19 +02:00 |
|
Jean-Francois Dockes
|
e12d66865e
|
Deal with tr1 being gone in C++0x/C++11 compilers
|
2013-10-18 13:02:48 +02:00 |
|
Jean-Francois Dockes
|
b4c7efe490
|
Added (unifdefd) code to detect garbage data like undecoded base64 by looking at word length stats
|
2013-04-27 08:29:55 +02:00 |
|
Jean-Francois Dockes
|
d06e45946a
|
Handle wildcards as normal chars everywhere when splitting for query
|
2013-03-30 12:49:31 +01:00 |
|
Jean-Francois Dockes
|
0ae8ec99f6
|
more utf-8 err checking prevents bogus terms in index
|
2013-03-30 10:24:10 +01:00 |
|
Jean-Francois Dockes
|
df49598a8d
|
make comma a normal wordsplit char
|
2013-03-22 10:06:02 +01:00 |
|
Jean-Francois Dockes
|
dcf937d650
|
remove use of - as span-building character.
|
2013-03-04 12:16:11 +01:00 |
|
Jean-Francois Dockes
|
d2f7f11715
|
Use dynamic lib for shared recoll code
|
2012-12-29 14:27:01 +01:00 |
|
Jean-Francois Dockes
|
4544571490
|
term generation: only keep @ when not at start of term
|
2012-11-18 08:25:19 +01:00 |
|
Jean-Francois Dockes
|
d86f74a9e8
|
missing include
|
2012-10-16 16:10:14 +02:00 |
|
Jean-Francois Dockes
|
9801f0389f
|
fixed bug that would erase search term made of single wildcard
|
2012-10-05 09:15:09 +02:00 |
|
Jean-Francois Dockes
|
d3a26706b5
|
add a class for skipped characters
|
2012-10-03 09:07:59 +02:00 |
|
Jean-Francois Dockes
|
3f331ebb3e
|
fix glitch caused by udi prefix change
|
2012-10-03 08:05:39 +02:00 |
|
Jean-Francois Dockes
|
efd319025d
|
attempt to eliminate more unicode uninteresting characters
|
2012-10-02 17:45:16 +02:00 |
|
Jean-Francois Dockes
|
63d97e597b
|
added a bunch of graphic characters to the word breakers list and changed the container used from set to unordered_set for speed
|
2012-09-19 19:50:45 +02:00 |
|
Jean-Francois Dockes
|
0ebfc496d8
|
add capability to remember page breaks generated by, e.g. pdftotext, and use them to start an external viewer on a match page
|
2012-08-21 15:03:02 +02:00 |
|
Jean-Francois Dockes
|
4eaf12fb9c
|
more delistification
|
2012-04-12 08:15:50 +02:00 |
|
Jean-Francois Dockes
|
6c72454396
|
generate acronyms for dotted abbrevs, e.g. O.E.C.D -> OECD
|
2011-10-20 13:24:29 +02:00 |
|
Jean-Francois Dockes
|
0860b559ee
|
get rid of a few garbage terms during indexing. Set a threshold for conversion errors after which we discard the doc. Stabilize the new termproc pipeline but no commongrams for now
|
2011-10-12 17:55:58 +02:00 |
|
Jean-Francois Dockes
|
36516b091b
|
textsplit: discard - in front of words. Handle cjk punctuation characters
|
2011-07-16 11:51:38 +02:00 |
|
Jean-Francois Dockes
|
cb0794e92c
|
textsplit: eliminate some garbage terms (e.g. long sequences of dashes)
|
2011-07-06 16:20:32 +02:00 |
|
Jean-Francois Dockes
|
55f124725f
|
Fix problems that occurred when multiple threads were trying to read/convert files at the same time (e.g. indexing and previewing threads in the GUI calling internfile()). Either get rid of or lock-protect all shared data, eliminate miscellaneous possible initialization conflicts by using static initializers. Hopefully closes issue #51
|
2011-04-28 10:58:33 +02:00 |
|
Jean-Francois Dockes
|
b28eaf23fb
|
Got rid of all the old RCS id strings
|
2011-04-27 08:22:17 +02:00 |
|
Jean-Francois Dockes
|
8520ec668a
|
recognize more numbers: 1e-10, 1.e3
|
2010-05-17 09:20:09 +02:00 |
|