399 Commits

Author SHA1 Message Date
Jean-Francois Dockes
b63cc1b712 Korean splitter script: use python-mecab-ko if possible, else konlpy 2020-04-10 14:27:06 +02:00
Jean-Francois Dockes
e8194dea9d comment 2020-04-08 09:51:37 +02:00
Jean-Francois Dockes
d3de1f0d6f add common execPythonScript method to rclexecm 2020-04-07 10:09:09 +02:00
Jean-Francois Dockes
32ebd65ba8 Windows: small changes for porting back from msvc to mingw 2020-04-07 09:40:00 +02:00
Jean-Francois Dockes
a88c0114b1 python filters: htmlescape need not be an RclExecM member 2020-03-27 17:19:40 +01:00
Jean-Francois Dockes
90dd64fc61 Have RclExecM inherit the shared CmdTalk now that the latter is used anyway for the korean splitter. Main diff: cmdtalk strips the colon from param names and does not lowercase them 2020-03-27 11:07:51 +01:00
Jean-Francois Dockes
1afc606718 textsplit: break on it.error() not only it.eof(). Seems to make a difference in rare cases? Add Komoran support but this one often fails 2020-03-26 09:31:19 +01:00
Jean-Francois Dockes
207bfec93e korean splitter: restart the python/java splitter from time to time because it leaks memory 2020-03-24 11:27:10 +01:00
Jean-Francois Dockes
9719177c82 Korean external splitter: add some support for Mecab 2020-03-23 16:20:32 +01:00
Jean-Francois Dockes
c9667b5ba7 Korean text: sort-of-working version, in need of validation 2020-03-22 15:49:24 +01:00
Jean-Francois Dockes
384e3a1087 korean textsplit with extern help from konlpy, first step 2020-03-22 10:09:50 +01:00
Jean-Francois Dockes
03cbc203e1 Hanword: use the html converter, the text one drops data from tables 2020-03-21 10:16:16 +01:00
Jean-Francois Dockes
2cbd9ad79c Added handler for Hancom .hwp format 2020-03-10 14:38:52 +01:00
Jean-Francois Dockes
0f6b5911d5 rclpython: only python3 now 2020-03-03 18:54:41 +01:00
Jean-Francois Dockes
8c816f50cf doc 2020-03-03 18:53:31 +01:00
Jean-Francois Dockes
fe86fa9e1f ocr: compat: make a non-existent ocrprogs config variable equivalent to "tesseract" 2020-02-28 14:38:02 +01:00
Jean-Francois Dockes
1fb9421163 OCR: small adjustments for Windows 2020-02-28 09:22:03 +01:00
Jean-Francois Dockes
8560467e4a pdf/ocr scripts: no need to look for rclocr if pdfocr is not set. comments. 2020-02-27 18:16:28 +01:00
Jean-Francois Dockes
e520176a2a OCR: small adjustments for Windows. Works with Tesseract. 2020-02-27 14:10:55 +01:00
Jean-Francois Dockes
abb7ef8803 added ocr module for abbyy 2020-02-27 11:35:23 +01:00
Jean-Francois Dockes
7bc70a30ae ocrcache: implemented purge functions/script 2020-02-27 09:25:52 +01:00
Jean-Francois Dockes
747e37a980 rclocr ckpt: cache+tesseract indexing working 2020-02-26 17:30:12 +01:00
Jean-Francois Dockes
38dfa5f841 1st version of the cached ocr mechanism 2020-02-15 21:19:13 +01:00
Jean-Francois Dockes
e7e37b9233 openxml: extract more metadata fields (e.g. description, keywords) 2020-01-30 08:38:30 +01:00
Jean-Francois Dockes
a1122c4e8a Fix format string used to generate/scan circache headers. Use _ not . as prefix for webqueue metadata files. Fix log messages and indent 2019-11-24 15:02:30 +01:00
Jean-Francois Dockes
83e29a9b01 Windows: enable the firefox recent history indexer. 2019-11-24 10:46:23 +01:00
Jean-Francois Dockes
b43d1b3287 pdf xmp: pdfextrametafix: add method which takes the xml elt as arg instead of the text content 2019-11-14 18:19:33 +01:00
Jean-Francois Dockes
6d2454aedb rclpdf.py: fixed typo in processing xmp field names 2019-10-14 19:46:46 +02:00
Jean-Francois Dockes
20ebeec7fc handler verbosity 2019-10-14 09:03:55 +02:00
Jean-Francois Dockes
a96ee950b1 missing import in ppt msodumper 2019-10-11 15:23:58 +02:00
Jean-Francois Dockes
2491388e9e pst handler: improved charset processing 2019-10-11 14:18:27 +02:00
Jean-Francois Dockes
f66b5d1ef9 pdf: fix test on pdfocr config value 2019-10-11 12:05:26 +02:00
Jean-Francois Dockes
239297d3de For zip-bundled modules: prepend zip in path instead of append to make sure that our version is used 2019-10-10 14:15:05 +02:00
Jean-Francois Dockes
5210088e8f Epub: failed with python3 when epubcatenate was set 2019-10-10 09:02:00 +02:00
Jean-Francois Dockes
0436b80956 windows: avoid picking up a default pdftotext: we want ours 2019-10-07 11:45:14 +02:00
Jean-Francois Dockes
af42fe8f5e rclconfig.py, rclexecm.py: implement part of mimetype identification for rclexecm test mode 2019-10-06 07:44:50 +02:00
Jean-Francois Dockes
2e801812fe rclpdf: restore pdfextrametafix function and add test 2019-09-04 09:38:11 +02:00
Jean-Francois Dockes
e4576fc12f rcltex: try to detect character encoding 2019-08-27 08:32:50 +02:00
Jean-Francois Dockes
af664e7768 Input handlers: more closing to help with windows temp files 2019-07-21 10:03:03 +02:00
Jean-Francois Dockes
a1daa8de55 Epub: close file (windows temp file cleanup) 2019-07-20 19:17:29 +02:00
Jean-Francois Dockes
16a051c3b6 rcltext.py: make sure to close file (windows temp file removal) 2019-07-20 19:09:07 +02:00
Jean-Francois Dockes
7d168dc198 rclchm: close file (windows temp file removal) 2019-07-20 19:08:33 +02:00
Jean-Francois Dockes
2c454b92a6 rclimg: explicitly close file handle (windows temp file removal) 2019-07-20 15:14:32 +02:00
Jean-Francois Dockes
703caf2ee4 rclzip: close file when done (windows temp file cleanup) 2019-07-20 14:45:11 +02:00
Jean-Francois Dockes
4c2fd82d4e pst: wait for pffexport and generate error if exit code is not 0 2019-06-24 11:47:17 +02:00
Jean-Francois Dockes
db9fd248f3 7z: properly list the needed package as pylzma 2019-06-21 16:57:58 +02:00
Jean-Francois Dockes
628da0e454 pst: new file name was appended to pffexport command instead of replacing old 2019-06-17 10:30:02 +02:00
Jean-Francois Dockes
e38e58c37a In case the self-doc was not sent first by the handler, its udi was not recalculated, and it clobbered the last subdoc 2019-06-16 13:46:00 +02:00
Jean-Francois Dockes
5d25094107 pst: pass the command line ipath as base64 as there is no msw way to pass utf-8 2019-06-14 14:33:49 +02:00
Jean-Francois Dockes
6c73a0d666 pst: reset generator for new file 2019-06-13 16:16:32 +02:00