41 Commits

Author SHA1 Message Date
Jean-Francois Dockes
dc934b7ddc comment 2021-02-10 14:57:40 +01:00
Jean-Francois Dockes
33725fd02c simplify stdout redirection for pdftk 2020-11-25 17:54:06 +01:00
Jean-Francois Dockes
f0abc1df68 pdf: discard pdftk stdout message "Error occurred during initialization of VM", it breaks pdf indexing when it occurs 2020-11-04 14:33:55 +01:00
Jean-Francois Dockes
25eda37bc9 Index pdf annotations separately under field name annotation. Add annot, pdfannot and pa aliases. 2020-10-12 10:05:38 +02:00
Jean-Francois Dockes
694d0f155d pdf annot: guard against possible exception while formatting results 2020-10-10 12:48:18 +02:00
Jean-Francois Dockes
10bdf2a0c8 comments 2020-09-05 09:19:10 +02:00
Jean-Francois Dockes
d62bb9016a pdf: try to extract annotation text if the python3 poppler-glib binding is available 2020-09-03 16:16:54 +02:00
Jean-Francois Dockes
2c0fd8502a PDF: pdftk as snap (ubuntu): print warning about pdf attachments if TMPDIR does not belong to user 2020-08-20 11:27:12 +02:00
Jean-Francois Dockes
b2e68740ba PDF: attachment extraction was broken since python3 (wrong open mode r instead of rb for the extracted file) 2020-07-27 09:03:58 +02:00
Jean-Francois Dockes
4508b6b064 rclpdf: avoid crash when external metadata filter can't be imported 2020-07-13 10:13:59 +02:00
Jean-Francois Dockes
a88c0114b1 python filters: htmlescape need not be an RclExecM member 2020-03-27 17:19:40 +01:00
Jean-Francois Dockes
90dd64fc61 Have RclExecM inherit the shared CmdTalk now that the latter is used anyway for the korean splitter. Main diff: cmdtalk strips the colon from param names and does not lowercase them 2020-03-27 11:07:51 +01:00
Jean-Francois Dockes
2cbd9ad79c Added handler for Hancom .hwp format 2020-03-10 14:38:52 +01:00
Jean-Francois Dockes
1fb9421163 OCR: small adjustments for Windows 2020-02-28 09:22:03 +01:00
Jean-Francois Dockes
8560467e4a pdf/ocr scripts: no need to look for rclocr if pdfocr is not set. comments. 2020-02-27 18:16:28 +01:00
Jean-Francois Dockes
e520176a2a OCR: small adjustments for Windows. Works with Tesseract. 2020-02-27 14:10:55 +01:00
Jean-Francois Dockes
38dfa5f841 1st version of the cached ocr mechanism 2020-02-15 21:19:13 +01:00
Jean-Francois Dockes
b43d1b3287 pdf xmp: pdfextrametafix: add method which takes the xml elt as arg instead of the text content 2019-11-14 18:19:33 +01:00
Jean-Francois Dockes
6d2454aedb rclpdf.py: fixed typo in processing xmp field names 2019-10-14 19:46:46 +02:00
Jean-Francois Dockes
f66b5d1ef9 pdf: fix test on pdfocr config value 2019-10-11 12:05:26 +02:00
Jean-Francois Dockes
0436b80956 windows: avoid picking up a default pdftotext: we want ours 2019-10-07 11:45:14 +02:00
Jean-Francois Dockes
2e801812fe rclpdf: restore pdfextrametafix function and add test 2019-09-04 09:38:11 +02:00
Jean-Francois Dockes
5ff1a92a51 pdf: ocr: small fixes, plus make pdfocr redefinable in subdirs 2019-06-13 09:47:25 +02:00
Jean-Francois Dockes
9dcdb6e9a6 pdf: ocr function was broken for python3 in some cases (depending on how the ocr language was specified) 2019-06-13 08:33:55 +02:00
Jean-Francois Dockes
b895980e95 PDF: fix the XMP metadata extraction code for python3 and other issues. Also get metadata from XML attributes 2019-06-12 19:21:37 +02:00
Jean-Francois Dockes
e71d7f183f Python filters: using list append + join instead of string append improves performance hugely for big (book-sized) documents. Impact on a typical pdf mix is moderate though 2019-03-25 11:30:50 +01:00
Jean-Francois Dockes
f482df9707 reset wrong mode change 2019-03-04 11:22:46 +01:00
Jean-Francois Dockes
0cbc46732f Fixed the FSF address 2019-03-04 11:19:14 +01:00
Jean-Francois Dockes
7ea3936420 Windows: use wide char interfaces
Exchange file names and command line parameters with the system using
wchar_t interfaces: allows preserving values which can be reversibly
transcoded in the current multibyte charset (which can't be UTF-8). Store
all file paths internally in UTF-8
2019-01-25 15:28:24 +01:00
Jean-Francois Dockes
a457b6c68e rclpdf ocr: fix python3 issue. Add pdfocrlang config variable 2018-07-18 18:05:42 +02:00
Jean-Francois Dockes
52d3bfa54f Change the shebang line from python2 to python3 for all scripts 2018-06-01 14:55:10 +02:00
Jean-Francois Dockes
0b8988cd64 Fix Windows PDF indexing. The successful test for poppler/pdftotext was not acknowledged and pdf indexing always failed 2018-01-19 13:15:51 +01:00
Jean-Francois Dockes
123d5b36ad pdf: add and document MetaFixer::wrapup() method 2017-05-17 08:32:23 +02:00
Jean-Francois Dockes
ef9e7a935b PDF XMP: move field editing code to external script, document 2017-05-17 06:57:52 +02:00
Jean-Francois Dockes
9e046187da pdf xmp metadata: handle the case where the x:xmpmeta node is omitted and the XML root is rdf:RDF 2017-05-16 03:20:57 +02:00
Jean-Francois Dockes
6f44dce466 pdf: Added field-fixing method for Xml metadata 2017-05-15 14:04:55 +02:00
Jean-Francois Dockes
ccc0398155 Handle a unicode conversion issue. Avoid returning None as document for an empty document 2017-05-15 12:35:59 +02:00
Jean-Francois Dockes
d87d410f11 pdf: added capability to extract metadata from XML packet 2017-05-12 10:27:12 +02:00
Jean-Francois Dockes
06e8424048 Changed input handler shebang lines to use explicit python2 instead of python. Can't switch to python3 because of msodump anyway 2017-04-09 04:09:02 +02:00
Jean-Francois Dockes
d6b230043c Check for newer pdftotext version to avoid double HTML escaping. fixes issue #318 2016-08-05 08:51:34 +02:00
Jean-Francois Dockes
b421f86f72 renamed rclmpdf.py to more normal rclpdf.py 2016-04-11 13:59:07 +02:00