diff --git a/website/recoll_XMP/index.html b/website/recoll_XMP/index.html index 0f4bd2d1..9a68bffc 100644 --- a/website/recoll_XMP/index.html +++ b/website/recoll_XMP/index.html @@ -745,10 +745,9 @@ handler, which differs a lot from doing something equivalent with the current Python-based one (for which XMP capability is available from recoll 1.23.2, but the new handler can be used with previous Recoll versions).

-

This page was adapted from the text by Jeffrey Dick, using input from -Johannes Menzel, (especially the result list paragraph format), -adapting things for the new handler. The discussion which led to the -updated handler is a +

I based this page on the text by Jeffrey Dick, using input from Johannes +Menzel for all examples about the new features. The discussion which led to +the updated handler is a Bitbucket Recoll issue.

@@ -787,46 +786,49 @@ to describe genre, topic, etc.

-

Custom indexing (fields file)

+

Custom indexing short example (fields file)

-

Let’s create two fields named "year" and "journal". The prefixes -starting with "XY" are extension prefixes that are added to the terms -in the Xapian database (Recoll internally does not use prefixes -starting with XY). Additionally, the year and journal are stored so -they can be displayed in the results list. Some other types of -metadata, such as title, author and keywords, are already indexed by -Recoll (the default rclpdf finds them using the pdftotext -command) so there is no need to add those to the [prefixes] section.

-

Add this text to the fields file in your Recoll configuration -directory (~/.recoll/fields).

+

The following example (an extract from the complete configuration shown later) +creates two fields named "refjournal" and "refpages", which are both stored +(so they can be displayed in result list entries), and indexed (so you can +search them specifically).

+

Some other types of metadata, such as title, author and keywords, are +already indexed by Recoll (the default rclpdf finds them using the +pdftotext command) so there is no need to add those to the [prefixes] +section.

+

This is taken from the fields file inside the configuration +(e.g. ~/.recoll/fields).

[prefixes]
-year = XYEAR
-journal = XYJOUR
+refjournal=RFJOURNAL
+refpages=RFPAGES
 
 [stored]
-bibtex:year =
-bibtex:journal =
+refjournal = +refpages = + +[aliases] +refjournal = bibtex:journal bibtex:journaltitle +refpages = bibtex:pages

Telling the handler what fields to extract

-

As of Recoll 1.23.2, the PDF handler has the capability to use -pdfinfo for extracting XMP metadata. The switch for executing pdfinfo -is the pdfextrameta configuration parameter, and the value of the -parameter is a list of XMP tags to extract, with optional conversion -to Recoll field names (the XMP qualified tag name is kept by -default). Example:

+

As of Recoll 1.23.2, the PDF handler has the capability to use pdfinfo +for extracting XMP metadata. The switch for executing pdfinfo is the +pdfextrameta configuration parameter, and the value of the parameter is a +list of XMP tags to extract, with optional conversion to Recoll field names +(the XMP qualified tag name is kept by default; a translation, if present, is +separated from the tag name by a | character). Example (without translations):

-
pdfextrameta =  bibtex:year bibtex:journal bibtex:booktitle|title
+
pdfextrameta =  bibtex:year bibtex:journal bibtex:journaltitle
-

Here, bibtex:year and bibtex:journal are used directly, and -bibtex:booktitle is translated to title (the example is not -supposed to make sense)

+

Note that it is quite equivalent to translate a field name inside +pdfextrameta or to use aliases inside the fields file.
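To make the tag|field syntax concrete, here is a minimal Python sketch of how a pdfextrameta value could be read. The helper name parse_pdfextrameta is hypothetical and this is not Recoll's own parser; it only illustrates the rule stated above, namely that a tag keeps its qualified XMP name unless a translation follows the | character.

```python
def parse_pdfextrameta(value):
    """Map each XMP tag to the field name it would be stored under.

    Hypothetical helper for illustration only, not part of Recoll.
    """
    mapping = {}
    for item in value.split():
        # "tag|field" requests a translation; a bare "tag" keeps its name.
        tag, sep, field = item.partition('|')
        mapping[tag] = field if sep and field else tag
    return mapping


if __name__ == '__main__':
    print(parse_pdfextrameta('bibtex:year bibtex:journal bibtex:booktitle|title'))
```

With aliases in the fields file, the same renaming is instead declared once, on the field side.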

@@ -871,6 +873,11 @@ class MetaFixer(object): return txt
+

The metadata-editing script can be modified to fill in the "journal" field for +BibTeX entries that aren’t journal articles (e.g. bibtex:booktitle +for "InCollection" entries), by defining a wrapup() method which will +be called with the whole metadata array (an array of (nm,value) +pairs) for global editing, removal, or addition of entries.
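A minimal sketch of such a wrapup() method follows. It assumes that the handler passes a mutable list of (nm, value) pairs and uses the list as modified in place; that calling convention is an assumption here and should be checked against the handler version in use.

```python
class MetaFixer(object):
    def __init__(self):
        pass

    def metafix(self, nm, txt):
        # Per-field editing is left unchanged in this sketch.
        return txt

    def wrapup(self, metadata):
        # Assumption: metadata is a list of (nm, value) pairs that the
        # handler reads back after this call, so it is edited in place.
        names = set(nm for nm, _ in metadata)
        if 'bibtex:journal' in names:
            # A journal is already present: nothing to do.
            return
        for nm, value in list(metadata):
            if nm == 'bibtex:booktitle':
                # Reuse the book title as the journal for non-article entries.
                metadata.append(('bibtex:journal', value))
                break
```

Appending a new pair, rather than renaming the existing one, keeps bibtex:booktitle available for fields that alias it.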

@@ -886,11 +893,8 @@ HTML meta elements, and the <body> contains the text of the PDF.

Result paragraph format

-

Here, the result is formatted to show the title, which is a link -to open the document, in blue with underlining turned off. The next -two lines contain the authors, then the journal title in green -italicized text followed by year (in parentheses). The keywords are -listed in red after the abstract/text snippet.

+

The result paragraph format defines what fields are displayed inside the Recoll +result list, and how they are formatted.

Edit this using the Recoll GUI: Preferences > GUI configuration > Result List > Edit result paragraph format string.

@@ -922,26 +926,17 @@ listed in red after the abstract/text snippet.

</table>
-

The screenshot below also has the Highlight color for query terms -set to black; font-weight:bold; for bold, black text (instead -of the blue default). There -are linkhttps://bitbucket.org/medoc/recoll/wiki/ResultsThumbnails[various -methods for creating the thumbnails]; the ones here were made by -opening the directory containing the PDFs in the Dolphin file manager -(part of KDE) and selecting the Preview option.

+

There are +various +methods for creating the thumbnails; the ones here were made by opening +the directory containing the PDFs in the Dolphin file manager (part of KDE) +and selecting the Preview option.

+

And the result:

+
+
+Result list display
-
-

A search example

-
-

The simple query is cerevisiae keyword:protein. This -returns only PDFs that have the text "cerevisiae" and have been -tagged with the "protein" keyword. The LaTeX-style formatting from -the BibTeX database is displayed as HTML (note the italicized words -in article title, and umlaut in author’s name). Other queries could -be made based on the PDF metadata, e.g. journal:plos -r year:2013.

-

image::recoll_query.png

@@ -967,15 +962,194 @@ The sort buttons (up- and down-arrows) in Recoll sort the the result list using the stored date of the file (using "%D" in the result paragraph format, and date format "%Y") instead of having to add the year to the index as shown above.

-
+ + +
+

Complete example

+
+

This was designed by Johannes Menzel, who kindly provided the data when we +worked on improving PDF XMP data extraction. The originals are listed in +this +BitBucket issue.

+

The paragraph format is listed above.

+
+

recoll.conf additions:

+
+
+
pdfextrameta = bibtex:journal bibtex:journaltitle bibtex:pages \
+  bibtex:volume bibtex:number bibtex:booktitle bibtex:year bibtex:author \
+  bibtex:title bibtex:isbn bibtex:issn bibtex:editor bibtex:address \
+  bibtex:location bibtex:doi bibtex:chapter bibtex:url bibtex:entrytype \
+  bibtex:bibtexkey bibtex:abstract bibtex:date bibtex:keywords \
+  bibtex:comment bibtex:language bibtex:edition bibtex:totalpages \
+  dc:creator dc:relation dc:publisher dc:title dc:type dc:identifier
+
+defaultcharset = UTF-8//
+
+pdfextrametafix = /home/hannes/.recoll/metafix.py
+
+
+
+

metafix.py script:

+
+
+
import sys
+import re
+
+# This can be used for local XMP field editing.
+#
+# A new instance is created for each PDF document (so the object could
+# keep state to avoid, e.g. duplicate values)
+#
+# The metafix method receives an (original) field name, and the text
+# value, and should return the possibly modified text.
+class MetaFixer(object):
+    def __init__(self):
+        pass
+
+    def metafix(self, nm, txt):
+        if nm == 'bibtex:pages':
+            txt = re.sub(r'--', '-', txt)
+            txt = re.sub(r'^', ', p. ', txt)
+        elif nm == 'bibtex:author':
+            txt = re.sub(r'$', ':\ ', txt)
+            pass
+        elif nm == 'bibtex:chapter':
+            txt = re.sub(r'^', ', in: id.: ', txt)
+            pass
+        elif nm == 'bibtex:editor':
+            txt = re.sub(r'^', ', in: ', txt)
+            txt = re.sub(r'$', ' (ed.):\ ', txt)
+            pass
+        elif nm == 'bibtex:year':
+            txt = re.sub(r'^', ', ', txt)
+            pass
+        elif nm == 'bibtex:date':
+            txt = re.sub(r'^', ', ', txt)
+            pass
+        elif nm == 'bibtex:volume':
+            txt = re.sub(r'^', ', vol. ', txt)
+            pass
+        elif nm == 'bibtex:number':
+            txt = re.sub(r'^', ', no. ', txt)
+            pass
+        elif nm == 'bibtex:journaltitle':
+            txt = re.sub(r'^', ', in: ', txt)
+            pass
+        elif nm == 'bibtex:journal':
+            txt = re.sub(r'^', ', in: ', txt)
+            pass
+        elif nm == 'bibtex:title':
+            txt = re.sub(r'^', '"', txt)
+            txt = re.sub(r'$', '"', txt)
+            pass
+        elif nm == 'bibtex:location':
+            txt = re.sub(r'^', ', ', txt)
+            txt = re.sub(r'$', ':\ ', txt)
+            pass
+        elif nm == 'bibtex:address':
+            txt = re.sub(r'^', ', ', txt)
+            txt = re.sub(r'$', ':\ ', txt)
+            pass
+        elif nm == 'bibtex:isbn':
+            txt = re.sub(r'^', 'ISBN: ', txt)
+            pass
+        elif nm == 'bibtex:issn':
+            txt = re.sub(r'^', 'ISSN: ', txt)
+            pass
+        elif nm == 'bibtex:doi':
+            txt = re.sub(r'^', 'DOI: ', txt)
+            pass
+        elif nm == 'bibtex:bibtexkey':
+            txt = re.sub(r'^', 'Key: ', txt)
+            pass
+
+        return txt
+
+
+
+
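To see what these substitutions produce, two of the rules can be tried outside Recoll. The sketch below copies the bibtex:pages and bibtex:title branches into standalone functions; the function names and sample values are made up for illustration.

```python
import re


def fix_pages(txt):
    # Same two substitutions as the 'bibtex:pages' branch of metafix():
    # collapse the BibTeX "--" range dash, then prepend ", p. ".
    txt = re.sub(r'--', '-', txt)
    return re.sub(r'^', ', p. ', txt)


def fix_title(txt):
    # Same as the 'bibtex:title' branch: wrap the value in double quotes.
    txt = re.sub(r'^', '"', txt)
    return re.sub(r'$', '"', txt)


if __name__ == '__main__':
    print(fix_pages('12--34'))      # , p. 12-34
    print(fix_title('Some title'))  # "Some title"
```

Because each value carries its own leading separator, the result paragraph format can concatenate the fields directly, and no stray punctuation appears for fields that are empty.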

fields file:

+
+
+
[prefixes]
+
+refjournal=RFJOURNAL
+refpages=RFPAGES
+reftitle=RFTTITLE
+refvolume=RFVOLUME
+refauthor=RFAUTHOR
+refyear=RFYYEAR
+refisbn=RFISBN
+refissn=RFISSN
+refdoi=RFDOI
+refeditor=RFEDITOR
+refpublisher=RFPUBLISHER
+refaddress=RFADDRESS
+reflocation=RFLOCATION
+refbooktitle=RFBOOKTITLE
+refurl=RFURL
+reftype=RFTYPE
+refkey=RFKEY
+refabstract=RFABSTRACT
+refkeywords=RFKEYWORDS
+refcomment=RFCOMMENT
+refedition=RFEDITION
+reflanguage=RFLANGUAGE
+
+[stored]
+
+refjournal=
+refpages=
+reftitle=
+refvolume=
+refauthor=
+refyear=
+refisbn=
+refissn=
+refdoi=
+refeditor=
+refpublisher=
+refaddress=
+reflocation=
+refbooktitle=
+refurl=
+reftype=
+refkey=
+refabstract=
+refkeywords=
+refcomment=
+refedition=
+reflanguage=
+refid=
+
+[aliases]
+
+refjournal = bibtex:journal bibtex:journaltitle
+refpages = bibtex:pages
+reftitle = bibtex:title
+refvolume = bibtex:volume
+refauthor = bibtex:author
+refyear = bibtex:year bibtex:date
+refid = dc:identifier bibtex:isbn bibtex:issn
+refisbn = bibtex:isbn
+refissn = bibtex:issn
+refdoi = bibtex:doi
+refeditor = bibtex:editor
+refpublisher = bibtex:publisher
+refaddress = bibtex:address
+reflocation = bibtex:location
+refbooktitle = bibtex:booktitle
+refurl = bibtex:url
+reftype = bibtex:entrytype bibtex:type
+refkey = bibtex:bibtexkey
+refabstract = bibtex:abstract
+refkeywords = bibtex:keywords
+refcomment = bibtex:comment
+refedition = bibtex:edition
+reflanguage = bibtex:language
+author = xesam:author
+
+
@@ -983,7 +1157,7 @@ The filter can be modified to fill in the "journal" field for diff --git a/website/recoll_XMP/index.txt b/website/recoll_XMP/index.txt index aa10560b..14a92399 100644 --- a/website/recoll_XMP/index.txt +++ b/website/recoll_XMP/index.txt @@ -8,10 +8,9 @@ current Python-based one (for which XMP capability is available from recoll 1.23.2, but the new handler can be used with previous Recoll versions). -This page was adapted from the text by Jeffrey Dick, using input from -Johannes Menzel, (especially the result list paragraph format), -adapting things for the new handler. The discussion which led to the -updated handler is a +I based this page on the text by Jeffrey Dick, using input from Johannes +Menzel for all examples about the new features. The discussion which led to +the updated handler is a link:https://bitbucket.org/medoc/recoll/issues/300/extracting-xmp-metadata-and-tmsu-tags[Bitbucket Recoll issue]. @@ -42,46 +41,51 @@ to describe genre, topic, etc. image::jabref_metadata.png[Editing metadata with jabref] -== Custom indexing (fields file) +== Custom indexing short example (fields file) -Let's create two fields named "year" and "journal". The prefixes -starting with "XY" are extension prefixes that are added to the terms -in the Xapian database (Recoll internally does not use prefixes -starting with XY). Additionally, the year and journal are stored so -they can be displayed in the results list. Some other types of -metadata, such as title, author and keywords, are already indexed by -Recoll (the default rclpdf finds them using the *pdftotext* -command) so there is no need to add those to the [prefixes] section. +The following example (extract from a complete configuration shown later) +creates two fields named "refjournal" and "refpages", which are both stored +(so they can be displayed in result list entries), and indexed (you can +specifically search them). 
+ +Some other types of metadata, such as title, author and keywords, are +already indexed by Recoll (the default rclpdf finds them using the +*pdftotext* command) so there is no need to add those to the [prefixes] +section. + +This is taken from the `fields` file inside the configuration +(e.g. '~/.recoll/fields'). -Add this text to the fields file in your Recoll configuration -directory ('~/.recoll/fields'). ---- [prefixes] -year = XYEAR -journal = XYJOUR +refjournal=RFJOURNAL +refpages=RFPAGES [stored] -bibtex:year = -bibtex:journal = +refjournal = +refpages = + +[aliases] +refjournal = bibtex:journal bibtex:journaltitle +refpages = bibtex:pages ---- == Telling the handler what fields to extract -As of Recoll 1.23.2, the PDF handler has the capability to use -*pdfinfo* for extracting XMP metadata. The switch for executing *pdfinfo* -is the 'pdfextrameta' configuration parameter, and the value of the -parameter is a list of XMP tags to extract, with optional conversion -to Recoll field names (the XMP qualified tag name is kept by -default). Example: +As of Recoll 1.23.2, the PDF handler has the capability to use *pdfinfo* +for extracting XMP metadata. The switch for executing *pdfinfo* is the +'pdfextrameta' configuration parameter, and the value of the parameter is a +list of XMP tags to extract, with optional conversion to Recoll field names +(the XMP qualified tag name is kept by default, the translation is +separated by a '|' character). Example (without translations): ---- -pdfextrameta = bibtex:year bibtex:journal bibtex:booktitle|title +pdfextrameta = bibtex:year bibtex:journal bibtex:journaltitle ---- -Here, 'bibtex:year' and 'bibtex:journal' are used directly, and -'bibtex:booktitle' is translated to 'title' (the example is not -supposed to make sense) +Note that it is quite equivalent to translate a field name inside +'pdfextrameta' or to use aliases inside the 'fields' file.
== Editing the field values @@ -127,6 +131,13 @@ class MetaFixer(object): return txt ---- + +The metadata-editing script can be modified to fill in the "journal" field for +BibTex entries that aren't journal articles (e.g. bibtex:booktitle +for "InCollection" entries), by defining a 'wrapup()' method which will +be called with the whole metadata array (an array of '(nm,value)' +pairs) for global editing/removing/addition. + == Indexing Then index away! @@ -138,12 +149,9 @@ HTML meta elements, and the contains the text of the PDF. == Result paragraph format -Here, the result is formatted to show the title, which is a link -to open the document, in blue with underlining turned off. The next -two lines contain the authors, then the journal title in green -italicized text followed by year (in parentheses). The keywords are -listed in red after the abstract/text snippet. - +The result paragraph format defines what fields are displayed inside Recoll +result list, and how they are formatted. + Edit this using the Recoll GUI: Preferences > GUI configuration > Result List > Edit result paragraph format string. @@ -177,26 +185,15 @@ Edit this using the Recoll GUI: Preferences > GUI configuration > ---- -The screenshot below also has the 'Highlight color for query terms' -set to `black; font-weight:bold;` for bold, black text (instead -of the blue default). There -are linkhttps://bitbucket.org/medoc/recoll/wiki/ResultsThumbnails[various -methods for creating the thumbnails]; the ones here were made by -opening the directory containing the PDFs in the Dolphin file manager -(part of KDE) and selecting the Preview option. +There are +link:https://bitbucket.org/medoc/recoll/wiki/ResultsThumbnails[various +methods for creating the thumbnails]; the ones here were made by opening +the directory containing the PDFs in the Dolphin file manager (part of KDE) +and selecting the Preview option. +And the result: -== A search example - -The simple query is `cerevisiae keyword:protein`. 
This -returns only PDFs that have the text "cerevisiae" and have been -tagged with the "protein" keyword. The LaTeX-style formatting from -the BibTeX database is displayed as HTML (note the italicized words -in article title, and umlaut in author's name). Other queries could -be made based on the PDF metadata, e.g. 'journal:plos' -r 'year:2013'. - -image::recoll_query.png +image::recoll_query.png[Result list display] == More possibilities @@ -216,6 +213,190 @@ the result list using the stored date of the file (using "%D" in the result paragraph format, and date format "%Y") instead of having to add the year to the index as shown above. -- The filter can be modified to fill in the "journal" field for - BibTex entries that aren't journal articles (e.g. bibtex:booktitle - for "InCollection" entries). + +== Complete example + +This was designed by Johannes Menzel, who kindly provided the data when we +worked on improving PDF XMP data extraction. The originals are listed in +this +link:https://bitbucket.org/medoc/recoll/issues/300/extracting-xmp-metadata-and-tmsu-tags[BitBucket issue] + +The paragraph format is listed above. + +=== 'recoll.conf' additions: + +---- +pdfextrameta = bibtex:journal bibtex:journaltitle bibtex:pages \ + bibtex:volume bibtex:number bibtex:booktitle bibtex:year bibtex:author \ + bibtex:title bibtex:isbn bibtex:issn bibtex:editor bibtex:address \ + bibtex:location bibtex:doi bibtex:chapter bibtex:url bibtex:entrytype \ + bibtex:bibtexkey bibtex:abstract bibtex:date bibtex:keywords \ + bibtex:comment bibtex:language bibtex:edition bibtex:totalpages \ + dc:creator dc:relation dc:publisher dc:title dc:type dc:identifier + +defaultcharset = UTF-8// + +pdfextrametafix = /home/hannes/.recoll/metafix.py +---- + + +=== 'metafix.py' script: + +---- +import sys +import re + +# This can be used for local XMP field editing. +# +# A new instance is created for each PDF document (so the object could +# keep state to avoid, e.g. 
duplicate values) +# +# The metafix method receives an (original) field name, and the text +# value, and should return the possibly modified text. +class MetaFixer(object): + def __init__(self): + pass + + def metafix(self, nm, txt): + if nm == 'bibtex:pages': + txt = re.sub(r'--', '-', txt) + txt = re.sub(r'^', ', p. ', txt) + elif nm == 'bibtex:author': + txt = re.sub(r'$', ':\ ', txt) + pass + elif nm == 'bibtex:chapter': + txt = re.sub(r'^', ', in: id.: ', txt) + pass + elif nm == 'bibtex:editor': + txt = re.sub(r'^', ', in: ', txt) + txt = re.sub(r'$', ' (ed.):\ ', txt) + pass + elif nm == 'bibtex:year': + txt = re.sub(r'^', ', ', txt) + pass + elif nm == 'bibtex:date': + txt = re.sub(r'^', ', ', txt) + pass + elif nm == 'bibtex:volume': + txt = re.sub(r'^', ', vol. ', txt) + pass + elif nm == 'bibtex:number': + txt = re.sub(r'^', ', no. ', txt) + pass + elif nm == 'bibtex:journaltitle': + txt = re.sub(r'^', ', in: ', txt) + pass + elif nm == 'bibtex:journal': + txt = re.sub(r'^', ', in: ', txt) + pass + elif nm == 'bibtex:title': + txt = re.sub(r'^', '"', txt) + txt = re.sub(r'$', '"', txt) + pass + elif nm == 'bibtex:location': + txt = re.sub(r'^', ', ', txt) + txt = re.sub(r'$', ':\ ', txt) + pass + elif nm == 'bibtex:address': + txt = re.sub(r'^', ', ', txt) + txt = re.sub(r'$', ':\ ', txt) + pass + elif nm == 'bibtex:isbn': + txt = re.sub(r'^', 'ISBN: ', txt) + pass + elif nm == 'bibtex:issn': + txt = re.sub(r'^', 'ISSN: ', txt) + pass + elif nm == 'bibtex:doi': + txt = re.sub(r'^', 'DOI: ', txt) + pass + elif nm == 'bibtex:bibtexkey': + txt = re.sub(r'^', 'Key: ', txt) + pass + + return txt +---- + + +=== 'fields' file: + +---- +[prefixes] + +refjournal=RFJOURNAL +refpages=RFPAGES +reftitle=RFTTITLE +refvolume=RFVOLUME +refauthor=RFAUTHOR +refyear=RFYYEAR +refisbn=RFISBN +refissn=RFISSN +refdoi=RFDOI +refeditor=RFEDITOR +refpublisher=RFPUBLISHER +refaddress=RFADDRESS +reflocation=RFLOCATION +refbooktitle=RFBOOKTITLE +refurl=RFURL +reftype=RFTYPE 
+refkey=RFKEY +refabstract=RFABSTRACT +refkeywords=RFKEYWORDS +refcomment=RFCOMMENT +refedition=RFEDITION +reflanguage=RFLANGUAGE + +[stored] + +refjournal= +refpages= +reftitle= +refvolume= +refauthor= +refyear= +refisbn= +refissn= +refdoi= +refeditor= +refpublisher= +refaddress= +reflocation= +refbooktitle= +refurl= +reftype= +refkey= +refabstract= +refkeywords= +refcomment= +refedition= +reflanguage= +refid= + +[aliases] + +refjournal = bibtex:journal bibtex:journaltitle +refpages = bibtex:pages +reftitle = bibtex:title +refvolume = bibtex:volume +refauthor = bibtex:author +refyear = bibtex:year bibtex:date +refid = dc:identifier bibtex:isbn bibtex:issn +refisbn = bibtex:isbn +refissn = bibtex:issn +refdoi = bibtex:doi +refeditor = bibtex:editor +refpublisher = bibtex:publisher +refaddress = bibtex:address +reflocation = bibtex:location +refbooktitle = bibtex:booktitle +refurl = bibtex:url +reftype = bibtex:entrytype bibtex:type +refkey = bibtex:bibtexkey +refabstract = bibtex:abstract +refkeywords = bibtex:keywords +refcomment = bibtex:comment +refedition = bibtex:edition +reflanguage = bibtex:language +author = xesam:author +---- + diff --git a/website/recoll_XMP/recoll_query.png b/website/recoll_XMP/recoll_query.png index 23b371d3..d8da01b4 100644 Binary files a/website/recoll_XMP/recoll_query.png and b/website/recoll_XMP/recoll_query.png differ