diff --git a/src/doc/user/recoll.conf.xml b/src/doc/user/recoll.conf.xml index f40b3c51..640532d2 100644 --- a/src/doc/user/recoll.conf.xml +++ b/src/doc/user/recoll.conf.xml @@ -606,6 +606,23 @@ very slow. available). This is normally disabled, because it does slow down PDF indexing a bit even if not one attachment is ever found. + +pdfextrameta +Extract text from selected XMP metadata tags. This +is a space-separated list of qualified XMP tag names. Each element can also +include a translation to a Recoll field name, separated by a '|' +character. If the second element is absent, the tag name is used as the +Recoll field names. You will also need to add specifications to the +'fields' file to direct processing of the extracted data. + +pdfextrametafix +Define name of XMP field editing script. This +defines the name of a script to be loaded for editing XMP field +values. The script should define a 'MetaFixer' class with a metafix() +method which will be called with the qualified tag name and value of each +selected field, for editing or erasing. A new instance is created for +each document, so that the object can keep state for, e.g. eliminating +duplicate values. Parameters set for specific locations diff --git a/src/doc/user/usermanual.html b/src/doc/user/usermanual.html index c372ac63..3abce386 100644 --- a/src/doc/user/usermanual.html +++ b/src/doc/user/usermanual.html @@ -20,8 +20,8 @@ alink="#0000FF">
-

Recoll user manual

+

Recoll user manual

@@ -109,13 +109,13 @@ alink="#0000FF"> multiple indexes
2.1.3. Document types
+ "#idp40818624">Document types
2.1.4. Indexing failures
+ "#idp40843200">Indexing failures
2.1.5. Recovery
+ "#idp40850208">Recovery @@ -172,29 +172,49 @@ alink="#0000FF"> tags
2.7. The PDF input + handler
+ +
+
+
2.7.1. OCR with + Tesseract
+ +
2.7.2. XMP fields + extraction
+ +
2.7.3. PDF attachment + indexing
+
+
+ +
2.8. Periodic indexing
-
2.7.1. 2.8.1. Running indexing
-
2.7.2. 2.8.2. Using cron to automate indexing
-
2.8. 2.9. Real time indexing
-
2.8.1. 2.9.1. Slowing down the reindexing rate for fast changing files
@@ -768,7 +788,7 @@ alink="#0000FF"> "application">Qt.

The indexing process + title="2.8.1. Running indexing">indexing process is started automatically the first time you execute the recoll GUI. Indexing can also be performed by executing the @@ -879,21 +899,21 @@ alink="#0000FF"> "list-style-type: disc;">

  • Periodic (or + title="2.8. Periodic indexing">Periodic (or batch) indexing: indexing takes place at discrete times, by executing the recollindex command. The typical usage is to have a nightly indexing run programmed + "2.8.2. Using cron to automate indexing">programmed into your cron file.

  • Real time + title="2.9. Real time indexing">Real time indexing: indexing takes place as soon as a file is created or changed. recollindex runs @@ -997,8 +1017,8 @@ alink="#0000FF">

    -

    2.1.3. Document types

    +

    2.1.3. Document types

    @@ -1111,8 +1131,8 @@ indexedmimetypes = application/pdf
    -

    2.1.4. Indexing +

    2.1.4. Indexing failures

    @@ -1152,8 +1172,8 @@ indexedmimetypes = application/pdf
    -

    2.1.5. Recovery

    +

    2.1.5. Recovery

    @@ -1911,13 +1931,151 @@ metadatacmds = ; tags = tmsu tags %f filename.

    +
    +
    +
    +
    +

    2.7. The PDF input + handler

    +
    +
    +
    + +

    The PDF format is very important for scientific and + technical documentation, and document archival. It has + extensive facilities for storing metadata along with the + document, and these facilities are actually used in the + real world.

    + +

    In consequence, the rclpdf.py PDF input handler has more + complex capabilities than most others, and it is also more + configurable. Specifically, rclpdf.py can automatically use + tesseract to perform OCR + if the document text is empty, it can be configured to + extract specific metadata tags from an XMP packet, and to + extract PDF attachments.

    + +
    +
    +
    +
    +

    2.7.1. OCR with + Tesseract

    +
    +
    +
    + +

    If both tesseract and + pdftoppm + (generally from the poppler-utils package) are + installed, the PDF handler may attempt OCR on PDF files + with no text content. This is controlled by the pdfocr + configuration variable, which is false by default because + OCR is very slow.

    + +

    The choice of language is very important for + successful OCR. Recoll currently has no way to determine + this from the document itself. You can set the language + to use through the contents of a .ocrpdflang text file in the same + directory as the PDF document, or through the + RECOLL_TESSERACT_LANG + environment variable, or through the contents of an + ocrpdf text file inside the + configuration directory. If none of the above are used, + Recoll will try to guess + the language from the NLS environment.

    +
    + +
    +
    +
    +
    +

    2.7.2. XMP + fields extraction

    +
    +
    +
    + +

    The rclpdf.py script in + Recoll version 1.23.2 + and later can extract XMP metadata fields by executing + the pdfinfo + command (usually found with poppler-utils). This is controlled + by the pdfextrameta + configuration variable, which specifies which tags to + extract and, possibly, how to rename them.

    + +

    The pdfextrametafix + variable can be used to designate a file with Python code + to edit the metadata fields (available in Recoll 1.23.3 and later; 1.23.2 has + equivalent code inside the handler script). Example:

    +
    +import sys
    +import re
    +
    +class MetaFixer(object):
    +    def __init__(self):
    +        pass
    +
    +    def metafix(self, nm, txt):
    +        if nm == 'bibtex:pages':
    +            txt = re.sub(r'--', '-', txt)
    +        elif nm == 'someothername':
    +            # do something else
    +            pass
    +        elif nm == 'stillanother':
    +            # etc.
    +            pass
    +    
    +        return txt
    +        
    +
    +
    + +
    +
    +
    +
    +

    2.7.3. PDF + attachment indexing

    +
    +
    +
    + +

    If pdftk is + installed, and if the pdfattach + configuration variable is set, the PDF input handler will + try to extract PDF attachments for indexing as + sub-documents of the PDF file. This is disabled by + default, because it slows down PDF indexing a bit even if + not one attachment is ever found (PDF attachments are + uncommon in my experience).

    +
    +
    +

    2.7. Periodic + "RCL.INDEXING.PERIODIC">2.8. Periodic indexing

    @@ -1929,7 +2087,7 @@ metadatacmds = ; tags = tmsu tags %f

    2.7.1. Running + "RCL.INDEXING.PERIODIC.EXEC">2.8.1. Running indexing

    @@ -2037,7 +2195,7 @@ metadatacmds = ; tags = tmsu tags %f

    2.7.2. Using + "RCL.INDEXING.PERIODIC.AUTOMAT">2.8.2. Using cron to automate indexing

    @@ -2095,7 +2253,7 @@ metadatacmds = ; tags = tmsu tags %f

    2.8. Real time + "RCL.INDEXING.MONITOR">2.9. Real time indexing

    @@ -2225,7 +2383,7 @@ fs.inotify.max_user_watches=32768

    2.8.1. Slowing + "RCL.INDEXING.MONITOR.FASTFILES">2.9.1. Slowing down the reindexing rate for fast changing files

    @@ -9848,6 +10006,38 @@ thesame = "some string with spaces" because it does slow down PDF indexing a bit even if not one attachment is ever found.

  • + +
    pdfextrameta
    + +
    +

    Extract text from selected XMP metadata tags. + This is a space-separated list of qualified XMP tag + names. Each element can also include a translation + to a Recoll field name, separated by a '|' + character. If the second element is absent, the tag + name is used as the Recoll field name. You will + also need to add specifications to the 'fields' + file to direct processing of the extracted + data.

    +
    + +
    pdfextrametafix
    + +
    +

    Define name of XMP field editing script. This + defines the name of a script to be loaded for + editing XMP field values. The script should define + a 'MetaFixer' class with a metafix() method which + will be called with the qualified tag name and + value of each selected field, for editing or + erasing. A new instance is created for each + document, so that the object can keep state for, + e.g. eliminating duplicate values.

    +
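The pdfextrametafix description above says that a new MetaFixer instance is created for each document, so that the object can keep state. A minimal sketch of a fixer script that uses this to drop repeated values could look like the following. It is an illustration, not part of the patch: the `seen` attribute is a hypothetical name, and the code assumes that values arrive as text strings (the handler in this diff passes UTF-8 encoded values, so under Python 3 the patterns and literals would need to be bytes). Returning an empty value erases the field, because the handler only keeps non-empty text.

```python
import re


class MetaFixer(object):
    """Illustrative fixer: normalizes page ranges and drops duplicates."""

    def __init__(self):
        # One instance exists per document, so this set only holds
        # (tag, value) pairs seen in the current document.
        # 'seen' is a hypothetical attribute name.
        self.seen = set()

    def metafix(self, nm, txt):
        # Same edit as the example in the manual: 12--34 becomes 12-34.
        if nm == 'bibtex:pages':
            txt = re.sub(r'--', '-', txt)

        # Erase a value that was already emitted for this tag.
        key = (nm, txt)
        if key in self.seen:
            return ''
        self.seen.add(key)
        return txt
```

Called the way the handler calls it, the first occurrence of a value is returned (possibly edited), and a later identical value for the same tag comes back empty and is skipped.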
    diff --git a/src/doc/user/usermanual.xml b/src/doc/user/usermanual.xml index 8dbddecb..ce56e47d 100644 --- a/src/doc/user/usermanual.xml +++ b/src/doc/user/usermanual.xml @@ -1098,6 +1098,108 @@ metadatacmds = ; tags = tmsu tags %f + + The PDF input handler + + The PDF format is very important for scientific and technical + documentation, and document archival. It has extensive + facilities for storing metadata along with the document, and these + facilities are actually used in the real world. + + In consequence, the rclpdf.py PDF input + handler has more complex capabilities than most others, and it is + also more configurable. Specifically, rclpdf.py + can automatically use tesseract to perform + OCR if the document text is empty, it can be configured to extract + specific metadata tags from an XMP packet, and to extract PDF + attachments. + + + OCR with Tesseract + + If both tesseract and + pdftoppm (generally from the + poppler-utils package) are installed, + the PDF handler may attempt OCR on PDF files with no text + content. This is controlled by the pdfocr + configuration variable, which is false by default because + OCR is very slow. + + The choice of language is very important for successfull + OCR. Recoll has currently no way to determine this from the + document itself. You can set the language to use through the + contents of a .ocrpdflang text file in the + same directory as the PDF document, or through the + RECOLL_TESSERACT_LANG environment variable, or + through the contents of an ocrpdf text file + inside the configuration directory. If none of the above are used, + &RCL; will try to guess the language from the NLS + environment. + + + + + XMP fields extraction + + The rclpdf.py script in &RCL; version + 1.23.2 and later can extract XMP metadata fields by executing the + pdfinfo command (usually found with + poppler-utils). 
This is controlled by + the pdfextrameta + configuration variable, which specifies which tags to extract and, + possibly, how to rename them. + + The pdfextrametafix + variable can be used to designate a file with Python code to edit + the metadata fields (available in &RCL; 1.23.3 and later; 1.23.2 + has equivalent code inside the handler script). Example: + import sys +import re + +class MetaFixer(object): + def __init__(self): + pass + + def metafix(self, nm, txt): + if nm == 'bibtex:pages': + txt = re.sub(r'--', '-', txt) + elif nm == 'someothername': + # do something else + pass + elif nm == 'stillanother': + # etc. + pass + + return txt + + + + + + + + + + PDF attachment indexing + + If pdftk is installed, and if the + pdfattach + configuration variable is set, the PDF input handler will try to + extract PDF attachments for indexing as sub-documents of the PDF + file. This is disabled by default, because it slows down PDF + indexing a bit even if not one attachment is ever found (PDF + attachments are uncommon in my experience). 
+ + + + + Periodic indexing diff --git a/src/filters/rclpdf.py b/src/filters/rclpdf.py index 04007d0a..abdf096b 100755 --- a/src/filters/rclpdf.py +++ b/src/filters/rclpdf.py @@ -98,6 +98,7 @@ class PDFExtractor: # (xmltag,rcltag) pairs self.extrameta = cf.getConfParam("pdfextrameta") if self.extrameta: + self.extrametafix = cf.getConfParam("pdfextrametafix") self._initextrameta() # Check if we need to escape portions of text where old @@ -178,7 +179,16 @@ class PDFExtractor: self.re_xmlpacket = re.compile(r'<\?xpacket[ ]+begin.*\?>' + r'(.*)' + r'<\?xpacket[ ]+end', flags = re.DOTALL) - + global EMF + EMF = None + if self.extrametafix: + try: + import imp + EMF = imp.load_source('pdfextrametafix', self.extrametafix) + except Exception as err: + self.em.rclog("Import extrametafix failed: %s" % err) + pass + # Extract all attachments if any into temporary directory def extractAttach(self): if self.attextractdone: @@ -384,27 +394,12 @@ class PDFExtractor: # [e.text for e in elt.iter() if e.text]).strip() - # This can be used for local field editing. For now you need to - # change the program source. maybe we'll make it more dynamic one - # day. The method receives an (original) field name, and the text - # value, and should return the possibly modified text. - def _extrametafix(self, nm, txt): - if nm == 'bibtex:pages': - txt = re.sub(r'--', '-', txt) - elif nm == 'someothername': - # do something else - pass - elif nm == 'stillanother': - # etc. - pass - - return txt - - def _setextrameta(self, html): if not self.pdfinfo: return html + emf = EMF.MetaFixer() if EMF else None + all = subprocess.check_output([self.pdfinfo, "-meta", self.filename]) # Extract the XML packet @@ -445,9 +440,10 @@ class PDFExtractor: continue if elt is not None: text = self._xmltreetext(elt).encode('UTF-8') + if emf: + text = emf.metafix(metanm, text) # Should we set empty values ? 
if text: - text = self._extrametafix(metanm, text) # Can't use setfield as it only works for # text/plain output at the moment. metaheaders.append((rclnm, text)) diff --git a/src/sampleconf/recoll.conf b/src/sampleconf/recoll.conf index a733032e..c20cc295 100644 --- a/src/sampleconf/recoll.conf +++ b/src/sampleconf/recoll.conf @@ -750,6 +750,27 @@ snippetMaxPosWalk = 1000000 # not one attachment is ever found. #pdfattach = 0 +# +# +# Extract text from selected XMP metadata tags.This +# is a space-separated list of qualified XMP tag names. Each element can also +# include a translation to a Recoll field name, separated by a '|' +# character. If the second element is absent, the tag name is used as the +# Recoll field names. You will also need to add specifications to the +# 'fields' file to direct processing of the extracted data. +#pdfextrameta = bibtex:location|location bibtex:booktitle bibtex:pages + +# +# +# Define name of XMP field editing script.This +# defines the name of a script to be loaded for editing XMP field +# values. The script should define a 'MetaFixer' class with a metafix() +# method which will be called with the qualified tag name and value of each +# selected field, for editing or erasing. A new instance is created for +# each document, so that the object can keep state for, e.g. eliminating +# duplicate values. +#pdfextrametafix = /path/to/fixerscript.py + # Parameters set for specific # locations
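
To make the pdfextrameta syntax documented above concrete, here is a small sketch of how a value such as the sample `bibtex:location|location bibtex:booktitle bibtex:pages` splits into (XMP tag, Recoll field) pairs. The `parse_extrameta` function is a hypothetical helper written for illustration. It is not the code used by rclpdf.py, whose `_initextrameta` method is not shown in this diff.

```python
def parse_extrameta(spec):
    """Split a pdfextrameta value into (xmptag, rclfield) pairs.

    Hypothetical helper: elements are separated by white space, and an
    optional '|' separates the XMP tag from the Recoll field name.
    """
    pairs = []
    for element in spec.split():
        xmptag, sep, rclfield = element.partition('|')
        if not sep or not rclfield:
            # No translation given: the tag name doubles as the field name.
            rclfield = xmptag
        pairs.append((xmptag, rclfield))
    return pairs


pairs = parse_extrameta(
    'bibtex:location|location bibtex:booktitle bibtex:pages')
```

With the sample value, only `bibtex:location` is renamed (to `location`), and the other two tags keep their qualified names as field names. Each resulting field still needs an entry in the 'fields' file, as the parameter description notes.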