diff --git a/src/doc/user/usermanual.html b/src/doc/user/usermanual.html
index e97e382d..cb4b293a 100644
--- a/src/doc/user/usermanual.html
+++ b/src/doc/user/usermanual.html
@@ -2131,19 +2131,32 @@ metadatacmds = ;
-

In consequence, the rclpdf.py PDF input handler has more
- complex capabilities than most others, and it is also more
- configurable. Specifically, rclpdf.py can automatically use
- tesseract to perform OCR
- if the document text is empty, it can be configured to
- extract specific metadata tags from an XMP packet, and to
- extract PDF attachments.

-

The PDF handler can execute an external program to run
- OCR if no text is found in the document. This is now
- described in a separate section.

+

In consequence, the rclpdf.py PDF input
+ handler has more complex capabilities than most others, and
+ it is also more configurable. Specifically, rclpdf.py has the
+ following features:

+
+ +
@@ -2270,8 +2283,14 @@ metadatacmds = ;
-

Configuration. See the
+ To enable this feature, you need to install one of the
+ supported OCR applications (tesseract or ABBYY), enable OCR in the PDF handler,
+ and tell Recoll where the
+ appropriate command resides. The last parts are done by
+ setting configuration variables. See the relevant section.
All parameters can be localized in subdirectories through the usual main configuration mechanism (path sections).

diff --git a/src/doc/user/usermanual.xml b/src/doc/user/usermanual.xml
index 5533e342..a235be2b 100644
--- a/src/doc/user/usermanual.xml
+++ b/src/doc/user/usermanual.xml
@@ -1402,21 +1402,27 @@ metadatacmds = ; tags = tmsu tags %f
The PDF input handler
The PDF format is very important for scientific and technical
- documentation, and document archival. It has extensive
- facilities for storing metadata along with the document, and these
- facilities are actually used in the real world.
+ documentation, and document archival. It has extensive
+ facilities for storing metadata along with the document, and these
+ facilities are actually used in the real world.
- In consequence, the rclpdf.py PDF input
- handler has more complex capabilities than most others, and it is
- also more configurable. Specifically, rclpdf.py
- can automatically use tesseract to perform
- OCR if the document text is empty, it can be configured to extract
- specific metadata tags from an XMP packet, and to extract PDF
- attachments.
-
- The PDF handler can execute an external program to run OCR if
- no text is found in the document. This is now described in a
- separate section.
+ In consequence, the rclpdf.py PDF input
+ handler has more complex capabilities than most others, and it is
+ also more configurable. Specifically, rclpdf.py
+ has the following features:
+
+ It can be configured to extract
+ specific metadata tags from an XMP packet.
+ It can extract PDF
+ attachments.
+ It can automatically perform
+ OCR if the document text is empty. This is done by
+ executing an external program and is now described in a
+ separate
+ section, because the OCR framework can also be used
+ with non-PDF image files.
+
+
XMP fields extraction
@@ -1477,7 +1483,7 @@ metadatacmds = ; tags = tmsu tags %f
PDF attachment indexing
If pdftk is installed, and if the
- the
+ the
pdfattach configuration variable is set, the PDF input handler will try to extract PDF attachements for indexing as sub-documents of the PDF
@@ -1489,6 +1495,7 @@ metadatacmds = ; tags = tmsu tags %f
+
Recoll and OCR
@@ -1521,8 +1528,13 @@ metadatacmds = ; tags = tmsu tags %f
- Configuration. See the
-
+ To enable this feature, you need to install one of
+ the supported OCR applications
+ (tesseract
+ or ABBYY), enable OCR in the PDF
+ handler, and tell &RCL; where the appropriate command resides. The
+ last parts are done by setting configuration variables. See the
+ relevant section.
All parameters can be localized in subdirectories through the usual main configuration mechanism (path sections).
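
Reviewer note, not part of the patch: the settings that the new OCR paragraph alludes to might look roughly like the recoll.conf fragment below. It is only a sketch under stated assumptions. The variable names pdfocr, ocrprogs and tesseractcmd do not appear in this diff; they are assumed from the Recoll configuration reference and should be checked against the installed version. Only pdfattach is named in the patched text. The tesseract path and the directory used for the path section are placeholders.

```ini
# Sketch only: variable names other than pdfattach are assumptions,
# to be verified against the recoll.conf reference for your version.

# Attempt OCR on PDF files in which no text is found.
pdfocr = 1
# OCR modules to try, in order (tesseract assumed to be installed).
ocrprogs = tesseract
# Placeholder location of the tesseract command.
tesseractcmd = /usr/bin/tesseract
# Extract PDF attachments as sub-documents (requires pdftk).
pdfattach = 1

# Path section: hypothetical subdirectory where OCR is switched off again.
[/home/me/docs/already-searchable]
pdfocr = 0
```

The path section illustrates the "can be localized in subdirectories" remark: a value set under a bracketed directory name overrides the global one for files below that directory.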