From c110b94738f95fc0277e9435f7856d333cfccf5c Mon Sep 17 00:00:00 2001
From: Jean-Francois Dockes
Date: Sun, 1 Mar 2020 16:08:15 +0100
Subject: [PATCH] doc

---
 src/doc/user/usermanual.html | 49 +++++++++++++++++++++++++------------
 src/doc/user/usermanual.xml  | 46 ++++++++++++++++++++-------------
 2 files changed, 63 insertions(+), 32 deletions(-)

diff --git a/src/doc/user/usermanual.html b/src/doc/user/usermanual.html
index e97e382d..cb4b293a 100644
--- a/src/doc/user/usermanual.html
+++ b/src/doc/user/usermanual.html
@@ -2131,19 +2131,32 @@ metadatacmds = ;
- In consequence, the rclpdf.py PDF input handler has more
- complex capabilities than most others, and it is also more
- configurable. Specifically, rclpdf.py can automatically use
- tesseract to perform OCR
- if the document text is empty, it can be configured to
- extract specific metadata tags from an XMP packet, and to
- extract PDF attachments.

- The PDF handler can execute an external program to run
- OCR if no text is found in the document. This is now
- described in a separate section.

+ In consequence, the rclpdf.py PDF input
+ handler has more complex capabilities than most others, and
+ it is also more configurable. Specifically, rclpdf.py has the
+ following features:

+   • It can be configured to extract specific metadata
+     tags from an XMP packet.
+   • It can extract PDF attachments.
+   • It can automatically perform OCR if the document
+     text is empty. This is done by executing an external
+     program and is now described in a separate section,
+     because the OCR framework can also be used with
+     non-PDF image files.
@@ -2270,8 +2283,14 @@ metadatacmds = ;
- Configuration. See the
+ To enable this feature, you need to install one of the
+ supported OCR applications (tesseract or ABBYY), enable OCR in the PDF handler,
+ and tell Recoll where the
+ appropriate command resides. The last parts are done by
+ setting configuration variables. See the
  relevant section. All parameters can be localized in subdirectories through the usual main configuration mechanism (path sections).

diff --git a/src/doc/user/usermanual.xml b/src/doc/user/usermanual.xml
index 5533e342..a235be2b 100644
--- a/src/doc/user/usermanual.xml
+++ b/src/doc/user/usermanual.xml
@@ -1402,21 +1402,27 @@ metadatacmds = ; tags = tmsu tags %f
  The PDF input handler
  The PDF format is very important for scientific and technical
- documentation, and document archival. It has extensive
- facilities for storing metadata along with the document, and these
- facilities are actually used in the real world.
+ documentation, and document archival. It has extensive
+ facilities for storing metadata along with the document, and these
+ facilities are actually used in the real world.
- In consequence, the rclpdf.py PDF input
- handler has more complex capabilities than most others, and it is
- also more configurable. Specifically, rclpdf.py
- can automatically use tesseract to perform
- OCR if the document text is empty, it can be configured to extract
- specific metadata tags from an XMP packet, and to extract PDF
- attachments.
-
- The PDF handler can execute an external program to run OCR if
- no text is found in the document. This is now described in a
- separate section.
+ In consequence, the rclpdf.py PDF input
+ handler has more complex capabilities than most others, and it is
+ also more configurable. Specifically, rclpdf.py
+ has the following features:
+
+ It can be configured to extract
+ specific metadata tags from an XMP packet.
+ It can extract PDF
+ attachments.
+ It can automatically perform
+ OCR if the document text is empty. This is done by
+ executing an external program and is now described in a
+ separate
+ section, because the OCR framework can also be used
+ with non-PDF image files.
+
+
  XMP fields extraction
@@ -1477,7 +1483,7 @@ metadatacmds = ; tags = tmsu tags %f
  PDF attachment indexing
  If pdftk is installed, and if the
- the
+ the
  pdfattach configuration variable is set, the PDF input handler will try to extract PDF attachements for indexing as sub-documents of the PDF
@@ -1489,6 +1495,7 @@ metadatacmds = ; tags = tmsu tags %f
+
  Recoll and OCR
@@ -1521,8 +1528,13 @@ metadatacmds = ; tags = tmsu tags %f
- Configuration. See the
-
+ To enable this feature, you need to install one of
+ the supported OCR applications
+ (tesseract
+ or ABBYY), enable OCR in the PDF
+ handler, and tell &RCL; where the appropriate command resides. The
+ last parts are done by setting configuration variables. See the
+ relevant section. All parameters can be localized in subdirectories through the usual main configuration mechanism (path sections).