116 Commits

Author SHA1 Message Date
Jean-Francois Dockes
728129e5ce Text splitter: move apos and dash character conversions to unac_except_trans.
This was complicated and caused problems with highlight areas position computations in
plaintorich. Also, simplify the code for processing some dangling characters.
2021-11-02 14:32:38 +01:00
Jean-Francois Dockes
8285e18039 CJK indexing: return to western word indexing if encountering numeric after punctuation 2020-11-25 17:56:32 +01:00
Jean-Francois Dockes
16a9d8eba8 fix span trimming loop when underscoreasletter is set 2020-09-13 17:53:59 +02:00
Jean-Francois Dockes
df09d65a4e add underscoreasletter config variable to process _ as a letter 2020-09-13 15:40:28 +02:00
Jean-Francois Dockes
3f1dfa564c Restore nonumbers number indexing exclusion function 2020-08-22 10:07:58 +02:00
Jean-Francois Dockes
07e3387fc1 Avoid calling isalpha() with big ints, may crash, depending on version 2020-04-25 11:19:52 +02:00
Jean-Francois Dockes
39c152bada Fixed MSVC warnings, all innocuous 2020-04-17 14:26:40 +01:00
Jean-Francois Dockes
9565663f09 textsplit: create isNGRAMMED() method to replace isCJK() and let the latter actually return what it says 2020-04-14 09:27:26 +02:00
Jean-Francois Dockes
eb53b598d6 Textsplit: lost char at korean->ascii transition 2020-04-10 14:54:13 +01:00
Jean-Francois Dockes
de246349da textsplit: use more regular test for ISHANGUL. CJK: do not ignore whitespace, break on alphabetic non cjk character 2020-04-10 14:28:14 +02:00
Jean-Francois Dockes
1afc606718 textsplit: break on it.error() not only it.eof(). Seems to make a difference in rare cases? Add Komoran support but this one often fails 2020-03-26 09:31:19 +01:00
Jean-Francois Dockes
9719177c82 Korean external splitter: add some support for Mecab 2020-03-23 16:20:32 +01:00
Jean-Francois Dockes
384e3a1087 korean textsplit with extern help from konlpy, first step 2020-03-22 10:09:50 +01:00
Jean-Francois Dockes
5be3ed89c5 comments 2020-03-21 10:16:44 +01:00
Jean-Francois Dockes
bbf8c90185 experiment: ignore all ascii whitespace when generating cjk ngrams 2019-07-21 19:13:24 +02:00
Jean-Francois Dockes
baa6062de1 Do not process hangul as words, but as ngrams. Same issues as with Katakana: word separation too hard 2019-07-21 19:09:51 +02:00
Jean-Francois Dockes
6b058e9758 Regularise processing of hangul characters (there was a mixup of cjk/regular processing), and add a build-time option to either use cjk/ngram or regular term splitting for them 2019-07-21 19:09:51 +02:00
Jean-Francois Dockes
34bb62a8d9 got rid of a few unused variable warnings 2019-04-11 15:31:27 +02:00
Jean-Francois Dockes
0cbc46732f Fixed the FSF address 2019-03-04 11:19:14 +01:00
Jean-Francois Dockes
bbeaebf632 textsplit: process unicode apostrophes and right quotation mark as ascii single quote 2019-02-01 16:10:51 +01:00
Jean-Francois Dockes
b1ff34407d Simplify initialization by moving static config textsplit init from rclconfig to textsplit 2019-02-01 09:09:15 +01:00
Jean-Francois Dockes
bdc8d3eb38 Add config variable to process backslashes as letters 2019-01-29 18:32:19 +01:00
Jean-Francois Dockes
55e2fe5d27 Prevent text splitter bad array access and stl assertion crash (fedora rpmbuild) in marginal case. There was probably no real consequence beyond triggering the assertion 2018-11-15 18:19:39 +01:00
Jean-Francois Dockes
9244e31574 fixed a few spelling errors, mostly in comments and debug messages 2018-05-03 16:20:36 +02:00
Jean-Francois Dockes
5b35ecfe36 Windows warning suppression (no real changes) 2018-01-19 17:26:43 +01:00
Jean-Francois Dockes
15ea565e9f m_words_in_span was not always properly reset between invocations (if discardspan() was not called for some reason), resulting in crashes 2017-05-15 10:26:38 +02:00
Jean-Francois Dockes
f853f39ef3 Partially revert change treating Katakana as words, going back to n-grams. Did not work well because of separator-less compounds mostly 2017-04-25 10:20:38 +02:00
Jean-Francois Dockes
adaf7c77f9 Process katakana-western transitions as word breaks 2017-04-21 12:08:43 +02:00
Jean-Francois Dockes
9661a4431e wen 2017-04-18 14:39:12 +02:00
Jean-Francois Dockes
0b0385e459 got rid of the STD_SHARED_XX std/tr1 defines 2016-07-13 15:12:25 +02:00
Jean-Francois Dockes
d8f4500f90 fix debuglog ref in test driver + std=c++11 2016-07-12 19:32:02 +02:00
Jean-Francois Dockes
f6a999de84 logging now uses c++ streams 2016-07-12 09:41:04 +02:00
Jean-Francois Dockes
a905a92328 arrange so that ' .net' is split as .net and net. Previously it only produced .net, which meant that matching filename extensions, like in fn:pdf$ did not work well because of cases where a special char or a space occurred before the . 2016-06-20 17:25:25 +02:00
Jean-Francois Dockes
0a9d55e790 Suppressed a couple warnings (unsigned issues) + small windows release fixes 2016-01-29 17:30:50 +01:00
Jean-Francois Dockes
c1c73573d8 more int fixups
--HG--
branch : WINDOWSPORT
2015-09-02 07:34:59 +02:00
Jean-Francois Dockes
1cbf02f713 Suppressed many integer size warnings by a mix of type adjustments and casts,
none of which should have a real effect.

--HG--
branch : WINDOWSPORT
2015-09-01 19:39:20 +02:00
Jean-Francois Dockes
82295328cc Test for end() after lower_bound call before dereferencing!
--HG--
branch : WINDOWSPORT
2015-09-01 14:44:30 +02:00
Jean-Francois Dockes
e1bb1a3022 Make dehyphenate (co-worker->coworker) optional 2015-08-19 11:34:26 +02:00
Jean-Francois Dockes
94eb3119ce Generate an additional unhyphenated term for singly hyphenated words: co-worker will index as [co worker], [co-worker] and [coworker]. Only produce terms for alphanumeric hashtags (discard #,xyz) 2015-08-13 18:18:49 +02:00
Jean-Francois Dockes
abe9fb671f clean up autoconf of unordered_xx, prepare change to shared_ptr 2015-08-09 10:21:46 +02:00
Jean-Francois Dockes
f70ec1cab7 comment 2015-07-31 11:23:17 +02:00
Jean-Francois Dockes
0be78cfe48 index #hashtags as such 2014-07-29 09:56:00 +02:00
Jean-Francois Dockes
efaa1fb3a3 fix textsplit core dump caused by interaction of new 1.20 code with little-tested camelcase splitting section 2014-07-28 22:12:35 +02:00
Jean-Francois Dockes
8e7bac08c1 test driver 2014-06-10 17:41:46 +02:00
Jean-Francois Dockes
96d99ad6e5 textsplit: check for underflow while trimming the span 2014-05-19 18:52:51 +02:00
Jean-Francois Dockes
0145234b60 translate unicode hyphen (0x2010) into ascii minus 2014-04-30 09:59:51 +02:00
Jean-Francois Dockes
077aed3018 fix term byte offsets produced by new textsplit: for highlighting 2014-04-24 12:42:10 +02:00
Jean-Francois Dockes
ece15318ab New text splitter with word accumulator and full partial span generation. Search/Index seem ok. Still a pb with use for highlighting (preview) 2014-04-24 10:13:19 +02:00
Jean-Francois Dockes
e12d66865e Deal with tr1 being gone in C++11 compilers 2013-10-18 13:02:48 +02:00
Jean-Francois Dockes
b4c7efe490 Added (unifdefd) code to detect garbage data like undecoded base64 by looking at word length stats 2013-04-27 08:29:55 +02:00