- In consequence, the rclpdf.py PDF input handler has more
- complex capabilities than most others, and it is also more
- configurable. Specifically, rclpdf.py can automatically use
- tesseract to perform OCR
- if the document text is empty, it can be configured to
- extract specific metadata tags from an XMP packet, and to
- extract PDF attachments.
- The PDF handler can execute an external program to run
- OCR if no text is found in the document. This is now
- described in a separate section.
+In consequence, the rclpdf.py PDF input
+ handler has more complex capabilities than most others, and
+ it is also more configurable. Specifically, rclpdf.py has the
+ following features:
+It can be configured to extract specific metadata
+ tags from an XMP packet.
+It can extract PDF attachments.
+It can automatically perform OCR if the document
+ text is empty. This is done by executing an external
+ program and is now described in a separate section,
+ because the OCR framework can also be used with
+ non-PDF image files.
+Configuration. To enable this feature, you need to install one of the
+ supported OCR applications (tesseract or ABBYY), enable OCR in the PDF handler,
+ and tell Recoll where the
+ appropriate command resides. The last parts are done by
+ setting configuration variables. See the relevant section. All parameters can be localized in subdirectories through the usual main configuration mechanism (path sections).
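Reviewer note: the "path sections" sentence above may be easier to follow with a concrete case. The sketch below is a toy resolver, not Recoll's own configuration parser. It assumes that a `[/some/dir]` section overrides global values for files under that directory and that the deepest matching section wins. The variable names (`pdfocr`, `ocrprogs`, `tesseractcmd`) and all paths are assumptions for illustration and should be checked against the configuration section of the manual.

```python
# Toy resolver for Recoll-style "path sections"; NOT Recoll's own parser.
# Variable names and paths below are illustrative assumptions.
from pathlib import PurePosixPath

SAMPLE_CONF = """\
# Global values apply everywhere unless a path section overrides them.
ocrprogs = tesseract
tesseractcmd = /usr/bin/tesseract
pdfocr = 0

[/home/me/scans]
# Only PDFs under this directory fall back to OCR when they have no text.
pdfocr = 1
"""


def parse_conf(text):
    """Return {section: {name: value}}; the global section is keyed by ''."""
    sections = {"": {}}
    current = ""
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1].strip()  # start of a path section
            sections.setdefault(current, {})
            continue
        name, sep, value = line.partition("=")
        if sep:
            sections[current][name.strip()] = value.strip()
    return sections


def lookup(sections, name, file_path):
    """Value of `name` for file_path: the deepest enclosing section wins."""
    path = PurePosixPath(file_path)
    best_value = sections[""].get(name)  # global value is the fallback
    best_depth = -1
    for section, values in sections.items():
        if not section or name not in values:
            continue
        sec_path = PurePosixPath(section)
        if path == sec_path or sec_path in path.parents:
            depth = len(sec_path.parts)
            if depth > best_depth:
                best_value, best_depth = values[name], depth
    return best_value


conf = parse_conf(SAMPLE_CONF)
print(lookup(conf, "pdfocr", "/home/me/scans/2019/invoice.pdf"))  # prints 1
print(lookup(conf, "pdfocr", "/home/me/docs/report.pdf"))         # prints 0
```

In this sketch OCR stays off globally and is switched on only for one directory tree, which is the kind of localization the added paragraph describes. Variables not set in the section (here `ocrprogs`) fall back to the global value.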
diff --git a/src/doc/user/usermanual.xml b/src/doc/user/usermanual.xml
index 5533e342..a235be2b 100644
--- a/src/doc/user/usermanual.xml
+++ b/src/doc/user/usermanual.xml
@@ -1402,21 +1402,27 @@ metadatacmds = ;