Jean-Francois Dockes
dc934b7ddc
comment
2021-02-10 14:57:40 +01:00
Jean-Francois Dockes
33725fd02c
simplify stdout redirection for pdftk
2020-11-25 17:54:06 +01:00
Jean-Francois Dockes
f0abc1df68
pdf: discard pdftk stdout message "Error occurred during initialization of VM", it breaks pdf indexing when it occurs
2020-11-04 14:33:55 +01:00
Jean-Francois Dockes
25eda37bc9
Index pdf annotations separately under field name annotation. Add annot, pdfannot and pa aliases.
2020-10-12 10:05:38 +02:00
Jean-Francois Dockes
694d0f155d
pdf annot: guard against possible exception while formatting results
2020-10-10 12:48:18 +02:00
Jean-Francois Dockes
10bdf2a0c8
comments
2020-09-05 09:19:10 +02:00
Jean-Francois Dockes
d62bb9016a
pdf: try to extract annotation text if the python3 poppler-glib binding is available
2020-09-03 16:16:54 +02:00
Jean-Francois Dockes
2c0fd8502a
PDF: pdftk as snap (ubuntu): print warning about pdf attachments if TMPDIR does not belong to user
2020-08-20 11:27:12 +02:00
Jean-Francois Dockes
b2e68740ba
PDF: attachment extraction was broken since python3 (wrong open mode r instead of rb for the extracted file)
2020-07-27 09:03:58 +02:00
Jean-Francois Dockes
4508b6b064
rclpdf: avoid crash when external metadata filter can't be imported
2020-07-13 10:13:59 +02:00
Jean-Francois Dockes
a88c0114b1
python filters: htmlescape need not be an RclExecM member
2020-03-27 17:19:40 +01:00
Jean-Francois Dockes
90dd64fc61
Have RclExecM inherit the shared CmdTalk now that the latter is used anyway for the korean splitter. Main diff: cmdtalk strips the colon from param names and does not lowercase them
2020-03-27 11:07:51 +01:00
Jean-Francois Dockes
2cbd9ad79c
Added handler for Hancom .hwp format
2020-03-10 14:38:52 +01:00
Jean-Francois Dockes
1fb9421163
OCR: small adjustments for Windows
2020-02-28 09:22:03 +01:00
Jean-Francois Dockes
8560467e4a
pdf/ocr scripts: no need to look for rclocr if pdfocr is not set. comments.
2020-02-27 18:16:28 +01:00
Jean-Francois Dockes
e520176a2a
OCR: small adjustments for Windows. Works with Tesseract.
2020-02-27 14:10:55 +01:00
Jean-Francois Dockes
38dfa5f841
1st version of the cached ocr mechanism
2020-02-15 21:19:13 +01:00
Jean-Francois Dockes
b43d1b3287
pdf xmp: pdfextrametafix: add method which takes the xml elt as arg instead of the text content
2019-11-14 18:19:33 +01:00
Jean-Francois Dockes
6d2454aedb
rclpdf.py: fixed typo in processing xmp field names
2019-10-14 19:46:46 +02:00
Jean-Francois Dockes
f66b5d1ef9
pdf: fix test on pdfocr config value
2019-10-11 12:05:26 +02:00
Jean-Francois Dockes
0436b80956
windows: avoid picking up a default pdftotext: we want ours
2019-10-07 11:45:14 +02:00
Jean-Francois Dockes
2e801812fe
rclpdf: restore pdfextrametafix function and add test
2019-09-04 09:38:11 +02:00
Jean-Francois Dockes
5ff1a92a51
pdf: ocr: small fixes, plus make pdfocr redefinable in subdirs
2019-06-13 09:47:25 +02:00
Jean-Francois Dockes
9dcdb6e9a6
pdf: ocr function was broken for python3 in some cases (depending on how the ocr language was specified)
2019-06-13 08:33:55 +02:00
Jean-Francois Dockes
b895980e95
PDF: fix the XMP metadata extraction code for python3 and other issues. Also get metadata from XML attributes
2019-06-12 19:21:37 +02:00
Jean-Francois Dockes
e71d7f183f
Python filters: using list append + join instead of string append improves performance hugely for big (book-sized) documents. Impact on a typical pdf mix is moderate though
2019-03-25 11:30:50 +01:00
Jean-Francois Dockes
f482df9707
reset wrong mode change
2019-03-04 11:22:46 +01:00
Jean-Francois Dockes
0cbc46732f
Fixed the FSF address
2019-03-04 11:19:14 +01:00
Jean-Francois Dockes
7ea3936420
Windows: use wide char interfaces
Exchange file names and command line parameters with the system using
wchar_t interfaces: allows preserving values which can be reversibly
transcoded in the current multibyte charset (which can't be UTF-8). Store
all file paths internally in UTF-8
2019-01-25 15:28:24 +01:00
Jean-Francois Dockes
a457b6c68e
rclpdf ocr: fix python3 issue. Add pdfocrlang config variable
2018-07-18 18:05:42 +02:00
Jean-Francois Dockes
52d3bfa54f
Change the shebang line from python2 to python3 for all scripts
2018-06-01 14:55:10 +02:00
Jean-Francois Dockes
0b8988cd64
Fix Windows PDF indexing. The successful test for poppler/pdftotext was not acknowledged and pdf indexing always failed
2018-01-19 13:15:51 +01:00
Jean-Francois Dockes
123d5b36ad
pdf: add and document MetaFixer::wrapup() method
2017-05-17 08:32:23 +02:00
Jean-Francois Dockes
ef9e7a935b
PDF XMP: move field editing code to external script, document
2017-05-17 06:57:52 +02:00
Jean-Francois Dockes
9e046187da
pdf xmp metadata: handle the case where the x:xmpmeta node is omitted and the XML root is rdf:RDF
2017-05-16 03:20:57 +02:00
Jean-Francois Dockes
6f44dce466
pdf: Added field-fixing method for Xml metadata
2017-05-15 14:04:55 +02:00
Jean-Francois Dockes
ccc0398155
Handle a unicode conversion issue. Avoid returning None as document for an empty document
2017-05-15 12:35:59 +02:00
Jean-Francois Dockes
d87d410f11
pdf: added capability to extract metadata from XML packet
2017-05-12 10:27:12 +02:00
Jean-Francois Dockes
06e8424048
Changed input handler shebang lines to use explicit python2 instead of python. Can't switch to python3 because of msodump anyway
2017-04-09 04:09:02 +02:00
Jean-Francois Dockes
d6b230043c
Check for newer pdftotext version to avoid double HTML escaping. fixes issue #318
2016-08-05 08:51:34 +02:00
Jean-Francois Dockes
b421f86f72
renamed rclmpdf.py to more normal rclpdf.py
2016-04-11 13:59:07 +02:00