The indexing process is started automatically the first time you execute the recoll GUI. Indexing can also be performed by executing the [...]
Periodic (or batch) indexing: indexing takes place at discrete times, by executing the recollindex command. The typical usage is to have a nightly indexing run programmed into your cron file.
Real time indexing: indexing takes place as soon as a file is created or changed. recollindex runs [...]
indexedmimetypes = application/pdf
The PDF format is very important for scientific and technical documentation, and document archival. It has extensive facilities for storing metadata along with the document, and these facilities are actually used in the real world.
As a consequence, the rclpdf.py PDF input handler has more complex capabilities than most others, and it is also more configurable. Specifically, rclpdf.py can automatically use tesseract to perform OCR if the document text is empty, and it can be configured to extract specific metadata tags from an XMP packet and to extract PDF attachments.
If both tesseract and pdftoppm (generally from the poppler-utils package) are installed, the PDF handler may attempt OCR on PDF files with no text content. This is controlled by the pdfocr configuration variable, which is false by default because OCR is very slow.
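The conditions above can be sketched as a small check. This is only an illustration of the stated rule (OCR enabled, and both helper programs installed), not the handler's actual code; the function name `ocr_possible` and its `which` parameter are hypothetical.

```python
import shutil


def ocr_possible(pdfocr_enabled, which=shutil.which):
    """Return True only if OCR is enabled and both helper programs are found.

    pdfocr_enabled mirrors the value of the pdfocr configuration variable.
    which is injectable so the check can be exercised without the tools
    being installed (hypothetical parameter, for illustration only).
    """
    if not pdfocr_enabled:
        # pdfocr is false by default, so OCR is normally skipped.
        return False
    # Both tesseract and pdftoppm must be reachable through PATH.
    return all(which(cmd) is not None for cmd in ("tesseract", "pdftoppm"))
```

Calling `ocr_possible(True)` on a given machine is a quick way to see whether setting pdfocr could have any effect there.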
The choice of language is very important for successful OCR. Recoll currently has no way to determine this from the document itself. You can set the language to use through the contents of a .ocrpdflang text file in the same directory as the PDF document, or through the RECOLL_TESSERACT_LANG environment variable, or through the contents of an ocrpdf text file inside the configuration directory. If none of the above are used, Recoll will try to guess the language from the NLS environment.
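The lookup described above could be sketched as follows. This assumes the three sources are consulted in the order the text lists them, which the text does not state explicitly; the function `pick_ocr_lang` and its parameters are hypothetical, and the final NLS-based guess is left out (the sketch returns None in that case).

```python
import os


def pick_ocr_lang(pdf_path, confdir, environ=os.environ):
    """Return the OCR language for pdf_path, or None if nothing is configured.

    Assumed order: .ocrpdflang next to the PDF, then the
    RECOLL_TESSERACT_LANG environment variable, then the ocrpdf file
    in the configuration directory.
    """
    def read_text_or_none(path):
        # A missing or unreadable file simply means "not set here".
        try:
            with open(path, "r", encoding="utf-8") as f:
                content = f.read().strip()
        except OSError:
            return None
        return content or None

    pdf_dir = os.path.dirname(os.path.abspath(pdf_path))
    lang = read_text_or_none(os.path.join(pdf_dir, ".ocrpdflang"))
    if lang:
        return lang

    lang = environ.get("RECOLL_TESSERACT_LANG", "").strip()
    if lang:
        return lang

    # None here means the caller would fall back to guessing from the
    # NLS environment, which this sketch does not attempt.
    return read_text_or_none(os.path.join(confdir, "ocrpdf"))
```

A per-directory .ocrpdflang file is convenient when a collection mixes languages, since the environment variable and the configuration file apply to every document.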
The rclpdf.py script in Recoll version 1.23.2 and later can extract XMP metadata fields by executing the pdfinfo command (usually found with poppler-utils). This is controlled by the pdfextrameta configuration variable, which specifies which tags to extract and, possibly, how to rename them.
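To make the value format concrete, here is a minimal sketch of how a pdfextrameta value could be split into tag-to-field pairs, following the description of the variable (a space-separated list of qualified tag names, each optionally followed by '|' and a Recoll field name). The function `parse_pdfextrameta` is hypothetical and is not taken from the handler; the tag names in the usage note are illustrative.

```python
def parse_pdfextrameta(value):
    """Map each qualified XMP tag name to the Recoll field name it feeds.

    Elements look like 'tag' or 'tag|fieldname'. When no field name is
    given, the tag name itself is used as the field name.
    """
    mapping = {}
    for element in value.split():
        tag, sep, field = element.partition("|")
        # Fall back to the tag name if there is no '|' or nothing after it.
        mapping[tag] = field if sep and field else tag
    return mapping
```

With a value such as `bibtex:pages|pages bibtex:booktitle`, the first tag would be stored under the field `pages` and the second under its own name. Either way, the resulting field names still need matching entries in the 'fields' file.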
The pdfextrametafix variable can be used to designate a file with Python code to edit the metadata fields (available for Recoll 1.23.3 and later; 1.23.2 has equivalent code inside the handler script). Example:
    import sys
    import re

    class MetaFixer(object):
        def __init__(self):
            pass

        def metafix(self, nm, txt):
            if nm == 'bibtex:pages':
                txt = re.sub(r'--', '-', txt)
            elif nm == 'someothername':
                # do something else
                pass
            elif nm == 'stillanother':
                # etc.
                pass

            return txt
If pdftk is installed, and if the pdfattach configuration variable is set, the PDF input handler will try to extract PDF attachments for indexing as sub-documents of the PDF file. This is disabled by default, because it slows down PDF indexing a bit even if not one attachment is ever found (PDF attachments are uncommon in my experience).
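As an illustration of the two conditions above, the sketch below builds a pdftk command line for unpacking attachments only when pdfattach is set and pdftk is found. It relies on pdftk's `unpack_files` operation; whether the handler invokes pdftk in exactly this way is an assumption, and the function name and parameters are hypothetical.

```python
import shutil


def attachment_unpack_command(pdf_path, out_dir, pdfattach_enabled,
                              which=shutil.which):
    """Return a pdftk argument list for extracting attachments, or None.

    None means extraction is skipped: either pdfattach is not set
    (the default) or pdftk is not installed.
    """
    if not pdfattach_enabled or which("pdftk") is None:
        return None
    # 'unpack_files output <dir>' copies the attachments into out_dir.
    return ["pdftk", pdf_path, "unpack_files", "output", out_dir]
```

The returned list could be passed to `subprocess.run`, after which any files found in the output directory would be the candidates for indexing as sub-documents.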
pdfextrameta: Extract text from selected XMP metadata tags. This is a space-separated list of qualified XMP tag names. Each element can also include a translation to a Recoll field name, separated by a '|' character. If the second element is absent, the tag name is used as the Recoll field name. You will also need to add specifications to the 'fields' file to direct processing of the extracted data.
pdfextrametafix: Define the name of the XMP field editing script. This defines the name of a script to be loaded for editing XMP field values. The script should define a 'MetaFixer' class with a metafix() method, which will be called with the qualified tag name and value of each selected field, for editing or erasing. A new instance is created for each document, so that the object can keep state, for example to eliminate duplicate values.
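Because a new instance is created for each document, a fixer can remember what it has already seen. The following is a minimal sketch of duplicate elimination, not code shipped with Recoll. It assumes that returning an empty string is how a value gets erased, which the description above does not spell out; the tag name in the usage note is illustrative.

```python
class MetaFixer(object):
    def __init__(self):
        # One instance per document, so this set starts empty for each PDF.
        self.seen = set()

    def metafix(self, nm, txt):
        # nm is the qualified tag name, txt the field value.
        key = (nm, txt.strip())
        if key in self.seen:
            # Assumption: an empty string erases the duplicate value.
            return ""
        self.seen.add(key)
        return txt
```

For example, two successive calls with `("dc:subject", "physics")` on the same instance would keep the first value and blank the second, while a fresh instance for the next document starts with no memory of earlier values.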