12 Commits

Author SHA1 Message Date
Jean-Francois Dockes
48d4678770 experiment: Korean when Noun then JX emit both Noun and Noun+JX 2020-04-25 14:19:54 +02:00
Jean-Francois Dockes
ec7379f837 textsplitko: start cmd as python kosplitter.py 2020-04-10 14:34:50 +01:00
Jean-Francois Dockes
afcacf63c0 Fix page handling in Korean splitter, bug would shift the byte positions, with bad consequences for snippets 2020-03-31 16:11:37 +02:00
Jean-Francois Dockes
7de66aae60 Korean splitter: suppress some ctl chars from Komoran input. Better compute pages 2020-03-26 18:44:59 +01:00
Jean-Francois Dockes
1afc606718 textsplit: break on it.error() not only it.eof(). Seems to make a difference in rare cases? Add Komoran support but this one often fails 2020-03-26 09:31:19 +01:00
Jean-Francois Dockes
b677171fa8 GUI: Experimental: create a list of MIME types (compiled in for now: hwp) for which we prefer to use stored text for preview because extraction is slow 2020-03-25 18:13:00 +01:00
Jean-Francois Dockes
97e89c408a korean splitter: only break korean stretch on non-korean alphabetic (e.g. not numbers or punctuation) 2020-03-25 16:57:42 +01:00
Jean-Francois Dockes
207bfec93e korean splitter: restart the python/java splitter from time to time because it leaks memory 2020-03-24 11:27:10 +01:00
Jean-Francois Dockes
a323472876 typo in textsplitko would prevent use of Mecab 2020-03-24 08:50:24 +01:00
Jean-Francois Dockes
9719177c82 Korean external splitter: add some support for Mecab 2020-03-23 16:20:32 +01:00
Jean-Francois Dockes
c9667b5ba7 Korean text: sort-of-working version, in need of validation 2020-03-22 15:49:24 +01:00
Jean-Francois Dockes
384e3a1087 korean textsplit with extern help from konlpy, first step 2020-03-22 10:09:50 +01:00