Jean-Francois Dockes
a4b3aff5c4
rclaudio: if mutagen.File() fails, try with mutagen.ID3()
This allows extracting the tags, e.g. from ADTS files
mistaken for MP3 during initial identification, for which
the later full MP3 init fails because of the wrong kind of frame.
2021-03-03 12:53:59 +01:00
Jean-Francois Dockes
31f6793495
rclaudio: catch exception when parsing bad date, set date to the epoch
2021-02-25 19:27:24 +01:00
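A minimal sketch of the fallback described in the commit above, assuming a plain date string and a single strptime format. The names parse_tag_date and EPOCH are hypothetical and are not taken from rclaudio:

```python
from datetime import datetime, timezone

# Hypothetical constant: the Unix epoch, used as the fallback value.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_tag_date(value, fmt="%Y-%m-%d"):
    """Parse a tag date; return the epoch if the value cannot be parsed.

    The format string is an assumption made for this sketch. Real audio
    tags use several date layouts.
    """
    try:
        return datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)
    except (ValueError, TypeError):
        # ValueError: the string does not match the format.
        # TypeError: the value is not a string at all (e.g. None).
        return EPOCH
```

Falling back to a fixed value keeps one bad tag from aborting the processing of the whole file.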
Jean-Francois Dockes
dc934b7ddc
comment
2021-02-10 14:57:40 +01:00
freddii
89c7efe682
fixed typos
2021-02-04 17:12:22 +01:00
Jean-Francois Dockes
50b64caf5e
rclaudio: process the Group tag
2021-01-27 09:32:55 +01:00
Jean-Francois Dockes
2998486d54
revert wrong change in rclaudio
2021-01-19 19:27:48 +01:00
Jean-Francois Dockes
baf2ee8d6b
Don't make date a field alias for dmtime: it does not make sense because the formats differ in general
2021-01-16 19:19:29 +01:00
Jean-Francois Dockes
cb13b8b6df
"print fields" change in rclexecm options had broken -s
2021-01-15 14:06:52 +01:00
Jean-Francois Dockes
72a9548c88
fix warning from rclaudio regexp
2021-01-06 12:01:42 +01:00
Jean-Francois Dockes
e00767d98c
rclexecm test/debug: add option -f to dump fields
2020-12-29 15:04:49 +01:00
Jean-Francois Dockes
ee1e84b2f3
comments
2020-12-25 17:35:08 +01:00
Jean-Francois Dockes
53edd7b213
rcl7z: use py7zr if available, rather than pylzma, which does not work on some archives
2020-12-25 17:34:15 +01:00
Jean-Francois Dockes
824e305bb0
Add option to limit tesseract threads
2020-12-17 11:08:31 +01:00
Jean-Francois Dockes
b2f0e2e657
Add handler for emacs org-mode files
2020-11-30 09:50:44 +01:00
Jean-Francois Dockes
33725fd02c
simplify stdout redirection for pdftk
2020-11-25 17:54:06 +01:00
Jean-Francois Dockes
8b6082a89f
shared
2020-11-09 12:13:30 +01:00
Jean-Francois Dockes
f0abc1df68
pdf: discard pdftk stdout message "Error occurred during initialization of VM", it breaks pdf indexing when it occurs
2020-11-04 14:33:55 +01:00
Jean-Francois Dockes
f50a4e54b1
rclpython: renamed rclpython.py. Use rclexecm. Only colorize for preview, not indexing
2020-11-04 10:32:18 +01:00
Jean-Francois Dockes
e10cb959b3
add test for python program (different handler)
2020-10-18 18:38:44 +02:00
Jean-Francois Dockes
25eda37bc9
Index pdf annotations separately under field name annotation. Add annot, pdfannot and pa aliases.
2020-10-12 10:05:38 +02:00
Jean-Francois Dockes
694d0f155d
pdf annot: guard against possible exception while formatting results
2020-10-10 12:48:18 +02:00
Jean-Francois Dockes
96104e7d67
fix rclocrtesseract fix
2020-09-28 11:05:12 +02:00
Jean-Francois Dockes
8accec9b88
rclocrtesseract: unquote tesseractcmd parameter and check existence.
2020-09-24 07:13:21 +02:00
Jean-Francois Dockes
0dd609cf1a
python filters: replace misc message printing with single method in rclexecm
2020-09-23 18:38:22 +02:00
Jean-Francois Dockes
10bdf2a0c8
comments
2020-09-05 09:19:10 +02:00
Jean-Francois Dockes
d62bb9016a
pdf: try to extract annotation text if the python3 poppler-glib binding is available
2020-09-03 16:16:54 +02:00
Jean-Francois Dockes
2c0fd8502a
PDF: pdftk as snap (ubuntu): print warning about pdf attachments if TMPDIR does not belong to user
2020-08-20 11:27:12 +02:00
Jean-Francois Dockes
b305c86041
recoll-we-move-files: apply expanduser to the webdownloadsdir config value
2020-08-17 11:02:46 +02:00
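A minimal sketch of what applying expanduser to a configured directory means, using only the standard library. The function name resolve_downloads_dir is hypothetical; only the use of expanduser comes from the commit message above:

```python
import os


def resolve_downloads_dir(configured):
    """Expand a leading ~ in a configured directory value.

    Paths without a leading ~ are returned unchanged.
    """
    return os.path.expanduser(configured)
```

With this, a configuration value such as ~/Downloads resolves to a directory under the current user's home instead of being treated as a literal path starting with a tilde.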
Jean-Francois Dockes
d932d19562
epub handler: extract the opf metadata subjects fields as dc:subject tags. Share more code between rclepub and the now redundant rclepub1 (no more lynx usage in rclepub)
2020-08-09 09:49:08 +02:00
Jean-Francois Dockes
19fe03af62
Support visio .vsdx format
2020-08-04 10:57:13 +02:00
Jean-Francois Dockes
b2e68740ba
PDF: attachment extraction was broken since python3 (wrong open mode r instead of rb for the extracted file)
2020-07-27 09:03:58 +02:00
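A minimal sketch of the open-mode issue named in the commit above. In Python 3, mode r decodes the file as text, which can fail or alter the data for binary content, while rb returns the raw bytes. The function name read_extracted is hypothetical and is not taken from the PDF handler:

```python
def read_extracted(path):
    """Return the raw bytes of an extracted attachment.

    Binary mode ("rb") avoids text decoding and newline translation,
    both of which would corrupt or reject non-text data.
    """
    with open(path, "rb") as f:
        return f.read()
```

Opening the same file with mode r would raise UnicodeDecodeError as soon as the content is not valid text in the default encoding.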
Jean-Francois Dockes
b4306b71c0
openxml word: be more specific for extracting text, avoids treating some image parameters as text
2020-07-15 10:49:06 +02:00
Jean-Francois Dockes
4508b6b064
rclpdf: avoid crash when external metadata filter cannot be imported
2020-07-13 10:13:59 +02:00
Jean-Francois Dockes
73f2836317
korean splitter: add inactive option to split on white space before calling the tagger
2020-05-19 09:22:16 +02:00
Jean-Francois Dockes
c6dac9347f
cmdtalk: catch param decoding exceptions
2020-05-14 09:23:46 +02:00
Jean-Francois Dockes
dce3bff5d7
comment
2020-04-19 09:19:28 +02:00
Jean-Francois Dockes
c38db0f160
comment
2020-04-18 09:15:45 +02:00
Jean-Francois Dockes
b63cc1b712
Korean splitter script: use python-mecab-ko if possible, else konlpy
2020-04-10 14:27:06 +02:00
Jean-Francois Dockes
e8194dea9d
comment
2020-04-08 09:51:37 +02:00
Jean-Francois Dockes
d3de1f0d6f
add common execPythonScript method to rclexecm
2020-04-07 10:09:09 +02:00
Jean-Francois Dockes
32ebd65ba8
Windows: small changes for porting back from msvc to mingw
2020-04-07 09:40:00 +02:00
Jean-Francois Dockes
a88c0114b1
python filters: htmlescape need not be an RclExecM member
2020-03-27 17:19:40 +01:00
Jean-Francois Dockes
90dd64fc61
Have RclExecM inherit the shared CmdTalk now that the latter is used anyway for the korean splitter. Main diff: cmdtalk strips the colon from param names and does not lowercase them
2020-03-27 11:07:51 +01:00
Jean-Francois Dockes
1afc606718
textsplit: break on it.error() not only it.eof(). Seems to make a difference in rare cases? Add Komoran support but this one often fails
2020-03-26 09:31:19 +01:00
Jean-Francois Dockes
207bfec93e
korean splitter: restart the python/java splitter from time to time because it leaks memory
2020-03-24 11:27:10 +01:00
Jean-Francois Dockes
9719177c82
Korean external splitter: add some support for Mecab
2020-03-23 16:20:32 +01:00
Jean-Francois Dockes
c9667b5ba7
Korean text: sort-of-working version, in need of validation
2020-03-22 15:49:24 +01:00
Jean-Francois Dockes
384e3a1087
korean textsplit with external help from konlpy, first step
2020-03-22 10:09:50 +01:00
Jean-Francois Dockes
03cbc203e1
Hanword: use the html converter, the text one drops data from tables
2020-03-21 10:16:16 +01:00
Jean-Francois Dockes
2cbd9ad79c
Added handler for Hancom .hwp format
2020-03-10 14:38:52 +01:00