21 Commits

Author SHA1 Message Date
Jean-Francois Dockes
c1ef2187d3 Fixed LOG calls obsolescence issues preventing build with staticverbosity 7 2020-09-06 14:59:00 +01:00
Jean-Francois Dockes
f3858a7e3a limit max size of korean single-word span 2020-05-31 09:57:58 +02:00
Jean-Francois Dockes
560041cab9 cleared out errant tabs 2020-05-30 15:54:49 +02:00
Jean-Francois Dockes
8ac74ca8f5 log levels 2020-05-24 14:39:06 +02:00
Jean-Francois Dockes
a5bab94ae3 korean splitter: break on digits 2020-05-24 14:02:23 +02:00
Jean-Francois Dockes
fc981e3733 new variation on the korean splitter. Index both the space-less spans whole and the mecab split output 2020-05-22 16:48:05 +02:00
Jean-Francois Dockes
ea2db676ed korean: reactivate option to generate both noun,jx and noun+jx 2020-05-19 09:23:03 +02:00
Jean-Francois Dockes
97f3212f80 korean splitter: disable the noun+jx emitting thing 2020-05-14 09:23:09 +02:00
Jean-Francois Dockes
d58fec0b81 korean: for now don't filter tags, until it is better understood what should be done 2020-05-11 07:33:54 +02:00
Jean-Francois Dockes
48d4678770 experiment: Korean when Noun then JX emit both Noun and Noun+JX 2020-04-25 14:19:54 +02:00
Jean-Francois Dockes
ec7379f837 textsplitko: start cmd as python kosplitter.py 2020-04-10 14:34:50 +01:00
Jean-Francois Dockes
afcacf63c0 Fix page handling in Korean splitter, bug would shift the byte positions, with bad consequences for snippets 2020-03-31 16:11:37 +02:00
Jean-Francois Dockes
7de66aae60 Korean splitter: suppress some ctl chars from Komoran input. Better compute pages 2020-03-26 18:44:59 +01:00
Jean-Francois Dockes
1afc606718 textsplit: break on it.error() not only it.eof(). Seems to make a difference in rare cases? Add Komoran support but this one often fails 2020-03-26 09:31:19 +01:00
Jean-Francois Dockes
b677171fa8 GUI: Experimental: create a list of MIME types (compiled in for now: hwp) for which we prefer to use stored text for preview because extraction is slow 2020-03-25 18:13:00 +01:00
Jean-Francois Dockes
97e89c408a korean splitter: only break korean stretch on non-korean alphabetic (e.g. not numbers or punctuation) 2020-03-25 16:57:42 +01:00
Jean-Francois Dockes
207bfec93e korean splitter: restart the python/java splitter from time to time because it leaks memory 2020-03-24 11:27:10 +01:00
Jean-Francois Dockes
a323472876 typo in textsplitko would prevent use of Mecab 2020-03-24 08:50:24 +01:00
Jean-Francois Dockes
9719177c82 Korean external splitter: add some support for Mecab 2020-03-23 16:20:32 +01:00
Jean-Francois Dockes
c9667b5ba7 Korean text: sort-of-working version, in need of validation 2020-03-22 15:49:24 +01:00
Jean-Francois Dockes
384e3a1087 korean textsplit with extern help from konlpy, first step 2020-03-22 10:09:50 +01:00