web: new xmp article
This commit is contained in:
parent
7cefd893cd
commit
6bf210a0c3
@ -745,10 +745,9 @@ handler, which differs a lot from doing something equivalent with the
|
|||||||
current Python-based one (for which XMP capability is available from
|
current Python-based one (for which XMP capability is available from
|
||||||
recoll 1.23.2, but the new handler can be used with previous Recoll
|
recoll 1.23.2, but the new handler can be used with previous Recoll
|
||||||
versions).</p></div>
|
versions).</p></div>
|
||||||
<div class="paragraph"><p>This page was adapted from the text by Jeffrey Dick, using input from
|
<div class="paragraph"><p>I based this page on the text by Jeffrey Dick, using input from Johannes
|
||||||
Johannes Menzel, (especially the result list paragraph format),
|
Menzel for all examples about the new features. The discussion which led to
|
||||||
adapting things for the new handler. The discussion which led to the
|
the updated handler is a
|
||||||
updated handler is a
|
|
||||||
<a href="https://bitbucket.org/medoc/recoll/issues/300/extracting-xmp-metadata-and-tmsu-tags">Bitbucket
|
<a href="https://bitbucket.org/medoc/recoll/issues/300/extracting-xmp-metadata-and-tmsu-tags">Bitbucket
|
||||||
Recoll issue</a>.</p></div>
|
Recoll issue</a>.</p></div>
|
||||||
</div>
|
</div>
|
||||||
@ -787,46 +786,49 @@ to describe genre, topic, etc.</p></div>
|
|||||||
</div>
|
</div>
|
||||||
</div>
|
</div>
|
||||||
<div class="sect1">
|
<div class="sect1">
|
||||||
<h2 id="_custom_indexing_fields_file">Custom indexing (fields file)</h2>
|
<h2 id="_custom_indexing_short_example_fields_file">Custom indexing short example (fields file)</h2>
|
||||||
<div class="sectionbody">
|
<div class="sectionbody">
|
||||||
<div class="paragraph"><p>Let’s create two fields named "year" and "journal". The prefixes
|
<div class="paragraph"><p>The following example (an extract from the complete configuration shown later)
|
||||||
starting with "XY" are extension prefixes that are added to the terms
|
creates two fields named "refjournal" and "refpages", which are both stored
|
||||||
in the Xapian database (Recoll internally does not use prefixes
|
(so they can be displayed in result list entries), and indexed (you can
|
||||||
starting with XY). Additionally, the year and journal are stored so
|
specifically search them).</p></div>
|
||||||
they can be displayed in the results list. Some other types of
|
<div class="paragraph"><p>Some other types of metadata, such as title, author and keywords, are
|
||||||
metadata, such as title, author and keywords, are already indexed by
|
already indexed by Recoll (the default rclpdf finds them using the
|
||||||
Recoll (the default rclpdf finds them using the <strong>pdftotext</strong>
|
<strong>pdftotext</strong> command) so there is no need to add those to the [prefixes]
|
||||||
command) so there is no need to add those to the [prefixes] section.</p></div>
|
section.</p></div>
|
||||||
<div class="paragraph"><p>Add this text to the fields file in your Recoll configuration
|
<div class="paragraph"><p>This is taken from the <code>fields</code> file inside the configuration
|
||||||
directory (<em>~/.recoll/fields</em>).</p></div>
|
(e.g. <em>~/.recoll/fields</em>).</p></div>
|
||||||
<div class="listingblock">
|
<div class="listingblock">
|
||||||
<div class="content">
|
<div class="content">
|
||||||
<pre><code>[prefixes]
|
<pre><code>[prefixes]
|
||||||
year = XYEAR
|
refjournal=RFJOURNAL
|
||||||
journal = XYJOUR
|
refpages=RFPAGES
|
||||||
|
|
||||||
[stored]
|
[stored]
|
||||||
bibtex:year =
|
refjournal =
|
||||||
bibtex:journal =</code></pre>
|
refpages =
|
||||||
|
|
||||||
|
[aliases]
|
||||||
|
refjournal = bibtex:journal bibtex:journaltitle
|
||||||
|
refpages = bibtex:pages</code></pre>
|
||||||
</div></div>
|
</div></div>
|
||||||
</div>
|
</div>
|
||||||
</div>
|
</div>
|
||||||
<div class="sect1">
|
<div class="sect1">
|
||||||
<h2 id="_telling_the_handler_what_fields_to_extract">Telling the handler what fields to extract</h2>
|
<h2 id="_telling_the_handler_what_fields_to_extract">Telling the handler what fields to extract</h2>
|
||||||
<div class="sectionbody">
|
<div class="sectionbody">
|
||||||
<div class="paragraph"><p>As of Recoll 1.23.2, the PDF handler has the capability to use
|
<div class="paragraph"><p>As of Recoll 1.23.2, the PDF handler has the capability to use <strong>pdfinfo</strong>
|
||||||
<strong>pdfinfo</strong> for extracting XMP metadata. The switch for executing <strong>pdfinfo</strong>
|
for extracting XMP metadata. The switch for executing <strong>pdfinfo</strong> is the
|
||||||
is the <em>pdfextrameta</em> configuration parameter, and the value of the
|
<em>pdfextrameta</em> configuration parameter, and the value of the parameter is a
|
||||||
parameter is a list of XMP tags to extract, with optional conversion
|
list of XMP tags to extract, with optional conversion to Recoll field names
|
||||||
to Recoll field names (the XMP qualified tag name is kept by
|
(the XMP qualified tag name is kept by default, the translation is
|
||||||
default). Example:</p></div>
|
separated by a <em>|</em> character). Example (without translations):</p></div>
|
||||||
<div class="listingblock">
|
<div class="listingblock">
|
||||||
<div class="content">
|
<div class="content">
|
||||||
<pre><code>pdfextrameta = bibtex:year bibtex:journal bibtex:booktitle|title</code></pre>
|
<pre><code>pdfextrameta = bibtex:year bibtex:journal bibtex:journaltitle</code></pre>
|
||||||
</div></div>
|
</div></div>
|
||||||
<div class="paragraph"><p>Here, <em>bibtex:year</em> and <em>bibtex:journal</em> are used directly, and
|
<div class="paragraph"><p>Note that it is quite equivalent to translate a field name inside
|
||||||
<em>bibtex:booktitle</em> is translated to <em>title</em> (the example is not
|
<em>pdfextrameta</em> or to use aliases inside the <em>fields</em> file.</p></div>
|
||||||
supposed to make sense)</p></div>
|
|
||||||
</div>
|
</div>
|
||||||
</div>
|
</div>
|
||||||
<div class="sect1">
|
<div class="sect1">
|
||||||
@ -871,6 +873,11 @@ class MetaFixer(object):
|
|||||||
|
|
||||||
return txt</code></pre>
|
return txt</code></pre>
|
||||||
</div></div>
|
</div></div>
|
||||||
|
<div class="paragraph"><p>The metadata-editing script can be modified to fill in the "journal" field for
|
||||||
|
BibTeX entries that aren’t journal articles (e.g. bibtex:booktitle
|
||||||
|
for "InCollection" entries), by defining a <em>wrapup()</em> method which will
|
||||||
|
be called with the whole metadata array (an array of <em>(nm,value)</em>
|
||||||
|
pairs) for global editing/removing/addition.</p></div>
|
||||||
</div>
|
</div>
|
||||||
</div>
|
</div>
|
||||||
<div class="sect1">
|
<div class="sect1">
|
||||||
@ -886,11 +893,8 @@ HTML meta elements, and the <body> contains the text of the PDF.</p></div>
|
|||||||
<div class="sect1">
|
<div class="sect1">
|
||||||
<h2 id="_result_paragraph_format">Result paragraph format</h2>
|
<h2 id="_result_paragraph_format">Result paragraph format</h2>
|
||||||
<div class="sectionbody">
|
<div class="sectionbody">
|
||||||
<div class="paragraph"><p>Here, the result is formatted to show the title, which is a link
|
<div class="paragraph"><p>The result paragraph format defines which fields are displayed inside the Recoll
|
||||||
to open the document, in blue with underlining turned off. The next
|
result list, and how they are formatted.</p></div>
|
||||||
two lines contain the authors, then the journal title in green
|
|
||||||
italicized text followed by year (in parentheses). The keywords are
|
|
||||||
listed in red after the abstract/text snippet.</p></div>
|
|
||||||
<div class="paragraph"><p>Edit this using the Recoll GUI: Preferences > GUI configuration >
|
<div class="paragraph"><p>Edit this using the Recoll GUI: Preferences > GUI configuration >
|
||||||
Result List > Edit result paragraph format string.</p></div>
|
Result List > Edit result paragraph format string.</p></div>
|
||||||
<div class="listingblock">
|
<div class="listingblock">
|
||||||
@ -922,26 +926,17 @@ listed in red after the abstract/text snippet.</p></div>
|
|||||||
|
|
||||||
</table></code></pre>
|
</table></code></pre>
|
||||||
</div></div>
|
</div></div>
|
||||||
<div class="paragraph"><p>The screenshot below also has the <em>Highlight color for query terms</em>
|
<div class="paragraph"><p>There are
|
||||||
set to <code>black; font-weight:bold;</code> for bold, black text (instead
|
<a href="https://bitbucket.org/medoc/recoll/wiki/ResultsThumbnails">various
|
||||||
of the blue default). There
|
methods for creating the thumbnails</a>; the ones here were made by opening
|
||||||
are linkhttps://bitbucket.org/medoc/recoll/wiki/ResultsThumbnails[various
|
the directory containing the PDFs in the Dolphin file manager (part of KDE)
|
||||||
methods for creating the thumbnails]; the ones here were made by
|
and selecting the Preview option.</p></div>
|
||||||
opening the directory containing the PDFs in the Dolphin file manager
|
<div class="paragraph"><p>And the result:</p></div>
|
||||||
(part of KDE) and selecting the Preview option.</p></div>
|
<div class="imageblock">
|
||||||
|
<div class="content">
|
||||||
|
<img src="recoll_query.png" alt="Result list display" />
|
||||||
</div>
|
</div>
|
||||||
</div>
|
</div>
|
||||||
<div class="sect1">
|
|
||||||
<h2 id="_a_search_example">A search example</h2>
|
|
||||||
<div class="sectionbody">
|
|
||||||
<div class="paragraph"><p>The simple query is <code>cerevisiae keyword:protein</code>. This
|
|
||||||
returns only PDFs that have the text "cerevisiae" and have been
|
|
||||||
tagged with the "protein" keyword. The LaTeX-style formatting from
|
|
||||||
the BibTeX database is displayed as HTML (note the italicized words
|
|
||||||
in article title, and umlaut in author’s name). Other queries could
|
|
||||||
be made based on the PDF metadata, e.g. <em>journal:plos</em>
|
|
||||||
r <em>year:2013</em>.</p></div>
|
|
||||||
<div class="paragraph"><p>image::recoll_query.png</p></div>
|
|
||||||
</div>
|
</div>
|
||||||
</div>
|
</div>
|
||||||
<div class="sect1">
|
<div class="sect1">
|
||||||
@ -967,15 +962,194 @@ The sort buttons (up- and down-arrows) in Recoll sort the
|
|||||||
the result list using the stored date of the file (using "%D" in the
|
the result list using the stored date of the file (using "%D" in the
|
||||||
result paragraph format, and date format "%Y") instead of having to
|
result paragraph format, and date format "%Y") instead of having to
|
||||||
add the year to the index as shown above.</p></div>
|
add the year to the index as shown above.</p></div>
|
||||||
<div class="ulist"><ul>
|
</div>
|
||||||
<li>
|
</div>
|
||||||
<p>
|
<div class="sect1">
|
||||||
The filter can be modified to fill in the "journal" field for
|
<h2 id="_complete_example">Complete example</h2>
|
||||||
BibTex entries that aren’t journal articles (e.g. bibtex:booktitle
|
<div class="sectionbody">
|
||||||
for "InCollection" entries).
|
<div class="paragraph"><p>This was designed by Johannes Menzel, who kindly provided the data when we
|
||||||
</p>
|
worked on improving PDF XMP data extraction. The originals are listed in
|
||||||
</li>
|
this
|
||||||
</ul></div>
|
<a href="https://bitbucket.org/medoc/recoll/issues/300/extracting-xmp-metadata-and-tmsu-tags">Bitbucket issue</a>.</p></div>
|
||||||
|
<div class="paragraph"><p>The paragraph format is listed above.</p></div>
|
||||||
|
<div class="sect2">
|
||||||
|
<h3 id="_em_recoll_conf_em_additions"><em>recoll.conf</em> additions:</h3>
|
||||||
|
<div class="listingblock">
|
||||||
|
<div class="content">
|
||||||
|
<pre><code>pdfextrameta = bibtex:journal bibtex:journaltitle bibtex:pages \
|
||||||
|
bibtex:volume bibtex:number bibtex:booktitle bibtex:year bibtex:author \
|
||||||
|
bibtex:title bibtex:isbn bibtex:issn bibtex:editor bibtex:address \
|
||||||
|
bibtex:location bibtex:doi bibtex:chapter bibtex:url bibtex:entrytype \
|
||||||
|
bibtex:bibtexkey bibtex:abstract bibtex:date bibtex:keywords \
|
||||||
|
bibtex:comment bibtex:language bibtex:edition bibtex:totalpages \
|
||||||
|
dc:creator dc:relation dc:publisher dc:title dc:type dc:identifier
|
||||||
|
|
||||||
|
defaultcharset = UTF-8//
|
||||||
|
|
||||||
|
pdfextrametafix = /home/hannes/.recoll/metafix.py</code></pre>
|
||||||
|
</div></div>
|
||||||
|
</div>
|
||||||
|
<div class="sect2">
|
||||||
|
<h3 id="_em_metafix_py_em_script"><em>metafix.py</em> script:</h3>
|
||||||
|
<div class="listingblock">
|
||||||
|
<div class="content">
|
||||||
|
<pre><code>import sys
|
||||||
|
import re
|
||||||
|
|
||||||
|
# This can be used for local XMP field editing.
|
||||||
|
#
|
||||||
|
# A new instance is created for each PDF document (so the object could
|
||||||
|
# keep state to avoid, e.g. duplicate values)
|
||||||
|
#
|
||||||
|
# The metafix method receives an (original) field name, and the text
|
||||||
|
# value, and should return the possibly modified text.
|
||||||
|
class MetaFixer(object):
|
||||||
|
def __init__(self):
|
||||||
|
pass
|
||||||
|
|
||||||
|
def metafix(self, nm, txt):
|
||||||
|
if nm == 'bibtex:pages':
|
||||||
|
txt = re.sub(r'--', '-', txt)
|
||||||
|
txt = re.sub(r'^', ', p. ', txt)
|
||||||
|
elif nm == 'bibtex:author':
|
||||||
|
txt = re.sub(r'$', ':\ ', txt)
|
||||||
|
pass
|
||||||
|
elif nm == 'bibtex:chapter':
|
||||||
|
txt = re.sub(r'^', ', in: id.: ', txt)
|
||||||
|
pass
|
||||||
|
elif nm == 'bibtex:editor':
|
||||||
|
txt = re.sub(r'^', ', in: ', txt)
|
||||||
|
txt = re.sub(r'$', ' (ed.):\ ', txt)
|
||||||
|
pass
|
||||||
|
elif nm == 'bibtex:year':
|
||||||
|
txt = re.sub(r'^', ', ', txt)
|
||||||
|
pass
|
||||||
|
elif nm == 'bibtex:date':
|
||||||
|
txt = re.sub(r'^', ', ', txt)
|
||||||
|
pass
|
||||||
|
elif nm == 'bibtex:volume':
|
||||||
|
txt = re.sub(r'^', ', vol. ', txt)
|
||||||
|
pass
|
||||||
|
elif nm == 'bibtex:number':
|
||||||
|
txt = re.sub(r'^', ', no. ', txt)
|
||||||
|
pass
|
||||||
|
elif nm == 'bibtex:journaltitle':
|
||||||
|
txt = re.sub(r'^', ', in: ', txt)
|
||||||
|
pass
|
||||||
|
elif nm == 'bibtex:journal':
|
||||||
|
txt = re.sub(r'^', ', in: ', txt)
|
||||||
|
pass
|
||||||
|
elif nm == 'bibtex:title':
|
||||||
|
txt = re.sub(r'^', '"', txt)
|
||||||
|
txt = re.sub(r'$', '"', txt)
|
||||||
|
pass
|
||||||
|
elif nm == 'bibtex:location':
|
||||||
|
txt = re.sub(r'^', ', ', txt)
|
||||||
|
txt = re.sub(r'$', ':\ ', txt)
|
||||||
|
pass
|
||||||
|
elif nm == 'bibtex:address':
|
||||||
|
txt = re.sub(r'^', ', ', txt)
|
||||||
|
txt = re.sub(r'$', ':\ ', txt)
|
||||||
|
pass
|
||||||
|
elif nm == 'bibtex:isbn':
|
||||||
|
txt = re.sub(r'^', 'ISBN: ', txt)
|
||||||
|
pass
|
||||||
|
elif nm == 'bibtex:issn':
|
||||||
|
txt = re.sub(r'^', 'ISSN: ', txt)
|
||||||
|
pass
|
||||||
|
elif nm == 'bibtex:doi':
|
||||||
|
txt = re.sub(r'^', 'DOI: ', txt)
|
||||||
|
pass
|
||||||
|
elif nm == 'bibtex:bibtexkey':
|
||||||
|
txt = re.sub(r'^', 'Key: ', txt)
|
||||||
|
pass
|
||||||
|
|
||||||
|
return txt</code></pre>
|
||||||
|
</div></div>
|
||||||
|
</div>
|
||||||
|
<div class="sect2">
|
||||||
|
<h3 id="_em_fields_em_file"><em>fields</em> file:</h3>
|
||||||
|
<div class="listingblock">
|
||||||
|
<div class="content">
|
||||||
|
<pre><code>[prefixes]
|
||||||
|
|
||||||
|
refjournal=RFJOURNAL
|
||||||
|
refpages=RFPAGES
|
||||||
|
reftitle=RFTTITLE
|
||||||
|
refvolume=RFVOLUME
|
||||||
|
refauthor=RFAUTHOR
|
||||||
|
refyear=RFYYEAR
|
||||||
|
refisbn=RFISBN
|
||||||
|
refissn=RFISSN
|
||||||
|
refdoi=RFDOI
|
||||||
|
refeditor=RFEDITOR
|
||||||
|
refpublisher=RFPUBLISHER
|
||||||
|
refaddress=RFADDRESS
|
||||||
|
reflocation=RFLOCATION
|
||||||
|
refbooktitle=RFBOOKTITLE
|
||||||
|
refurl=RFURL
|
||||||
|
reftype=RFTYPE
|
||||||
|
refkey=RFKEY
|
||||||
|
refabstract=RFABSTRACT
|
||||||
|
refkeywords=RFKEYWORDS
|
||||||
|
refcomment=RFCOMMENT
|
||||||
|
refedition=RFEDITION
|
||||||
|
reflanguage=RFLANGUAGE
|
||||||
|
|
||||||
|
[stored]
|
||||||
|
|
||||||
|
refjournal=
|
||||||
|
refpages=
|
||||||
|
reftitle=
|
||||||
|
refvolume=
|
||||||
|
refauthor=
|
||||||
|
refyear=
|
||||||
|
refisbn=
|
||||||
|
refissn=
|
||||||
|
refdoi=
|
||||||
|
refeditor=
|
||||||
|
refpublisher=
|
||||||
|
refaddress=
|
||||||
|
reflocation=
|
||||||
|
refbooktitle=
|
||||||
|
refurl=
|
||||||
|
reftype=
|
||||||
|
refkey=
|
||||||
|
refabstract=
|
||||||
|
refkeywords=
|
||||||
|
refcomment=
|
||||||
|
refedition=
|
||||||
|
reflanguage=
|
||||||
|
refid=
|
||||||
|
|
||||||
|
[aliases]
|
||||||
|
|
||||||
|
refjournal = bibtex:journal bibtex:journaltitle
|
||||||
|
refpages = bibtex:pages
|
||||||
|
reftitle = bibtex:title
|
||||||
|
refvolume = bibtex:volume
|
||||||
|
refauthor = bibtex:author
|
||||||
|
refyear = bibtex:year bibtex:date
|
||||||
|
refid = dc:identifier bibtex:isbn bibtex:issn
|
||||||
|
refisbn = bibtex:isbn
|
||||||
|
refissn = bibtex:issn
|
||||||
|
refdoi = bibtex:doi
|
||||||
|
refeditor = bibtex:editor
|
||||||
|
refpublisher = bibtex:publisher
|
||||||
|
refaddress = bibtex:address
|
||||||
|
reflocation = bibtex:location
|
||||||
|
refbooktitle = bibtex:booktitle
|
||||||
|
refurl = bibtex:url
|
||||||
|
reftype = bibtex:entrytype bibtex:type
|
||||||
|
refkey = bibtex:bibtexkey
|
||||||
|
refabstract = bibtex:abstract
|
||||||
|
refkeywords = bibtex:keywords
|
||||||
|
refcomment = bibtex:comment
|
||||||
|
refedition = bibtex:edition
|
||||||
|
reflanguage = bibtex:language
|
||||||
|
author = xesam:author</code></pre>
|
||||||
|
</div></div>
|
||||||
|
</div>
|
||||||
</div>
|
</div>
|
||||||
</div>
|
</div>
|
||||||
</div>
|
</div>
|
||||||
@ -983,7 +1157,7 @@ The filter can be modified to fill in the "journal" field for
|
|||||||
<div id="footer">
|
<div id="footer">
|
||||||
<div id="footer-text">
|
<div id="footer-text">
|
||||||
Last updated
|
Last updated
|
||||||
2017-05-17 07:27:42 CEST
|
2017-05-23 09:26:52 CEST
|
||||||
</div>
|
</div>
|
||||||
</div>
|
</div>
|
||||||
</body>
|
</body>
|
||||||
|
|||||||
@ -8,10 +8,9 @@ current Python-based one (for which XMP capability is available from
|
|||||||
recoll 1.23.2, but the new handler can be used with previous Recoll
|
recoll 1.23.2, but the new handler can be used with previous Recoll
|
||||||
versions).
|
versions).
|
||||||
|
|
||||||
This page was adapted from the text by Jeffrey Dick, using input from
|
I based this page on the text by Jeffrey Dick, using input from Johannes
|
||||||
Johannes Menzel, (especially the result list paragraph format),
|
Menzel for all examples about the new features. The discussion which led to
|
||||||
adapting things for the new handler. The discussion which led to the
|
the updated handler is a
|
||||||
updated handler is a
|
|
||||||
link:https://bitbucket.org/medoc/recoll/issues/300/extracting-xmp-metadata-and-tmsu-tags[Bitbucket
|
link:https://bitbucket.org/medoc/recoll/issues/300/extracting-xmp-metadata-and-tmsu-tags[Bitbucket
|
||||||
Recoll issue].
|
Recoll issue].
|
||||||
|
|
||||||
@ -42,46 +41,51 @@ to describe genre, topic, etc.
|
|||||||
|
|
||||||
image::jabref_metadata.png[Editing metadata with jabref]
|
image::jabref_metadata.png[Editing metadata with jabref]
|
||||||
|
|
||||||
== Custom indexing (fields file)
|
== Custom indexing short example (fields file)
|
||||||
|
|
||||||
Let's create two fields named "year" and "journal". The prefixes
|
The following example (an extract from the complete configuration shown later)
|
||||||
starting with "XY" are extension prefixes that are added to the terms
|
creates two fields named "refjournal" and "refpages", which are both stored
|
||||||
in the Xapian database (Recoll internally does not use prefixes
|
(so they can be displayed in result list entries), and indexed (you can
|
||||||
starting with XY). Additionally, the year and journal are stored so
|
specifically search them).
|
||||||
they can be displayed in the results list. Some other types of
|
|
||||||
metadata, such as title, author and keywords, are already indexed by
|
Some other types of metadata, such as title, author and keywords, are
|
||||||
Recoll (the default rclpdf finds them using the *pdftotext*
|
already indexed by Recoll (the default rclpdf finds them using the
|
||||||
command) so there is no need to add those to the [prefixes] section.
|
*pdftotext* command) so there is no need to add those to the [prefixes]
|
||||||
|
section.
|
||||||
|
|
||||||
|
This is taken from the `fields` file inside the configuration
|
||||||
|
(e.g. '~/.recoll/fields').
|
||||||
|
|
||||||
Add this text to the fields file in your Recoll configuration
|
|
||||||
directory ('~/.recoll/fields').
|
|
||||||
|
|
||||||
----
|
----
|
||||||
[prefixes]
|
[prefixes]
|
||||||
year = XYEAR
|
refjournal=RFJOURNAL
|
||||||
journal = XYJOUR
|
refpages=RFPAGES
|
||||||
|
|
||||||
[stored]
|
[stored]
|
||||||
bibtex:year =
|
refjournal =
|
||||||
bibtex:journal =
|
refpages =
|
||||||
|
|
||||||
|
[aliases]
|
||||||
|
refjournal = bibtex:journal bibtex:journaltitle
|
||||||
|
refpages = bibtex:pages
|
||||||
----
|
----
|
||||||
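The relationship between the three sections of the short `fields` example can be sketched in Python. This is only a standalone illustration: Recoll reads the file with its own configuration code, and the standard `configparser` module is used here as a stand-in. The `load_fields` helper and the `FIELDS_TEXT` constant are hypothetical names.

```python
import configparser

# Sample text copied from the short fields example.
FIELDS_TEXT = """
[prefixes]
refjournal=RFJOURNAL
refpages=RFPAGES

[stored]
refjournal =
refpages =

[aliases]
refjournal = bibtex:journal bibtex:journaltitle
refpages = bibtex:pages
"""


def load_fields(text):
    """Return (prefixes, stored, alias_to_field) for a fields-style text.

    Stand-in parser for illustration only, not Recoll's own reader.
    """
    # Only '=' separates keys from values, so ':' inside names is preserved.
    parser = configparser.ConfigParser(delimiters=("=",), interpolation=None)
    parser.optionxform = str  # keep names exactly as written
    parser.read_string(text)

    prefixes = dict(parser["prefixes"])   # field name -> index term prefix
    stored = set(parser["stored"])        # fields kept for display
    alias_to_field = {}
    for field, names in parser["aliases"].items():
        for alias in names.split():
            alias_to_field[alias] = field  # XMP tag -> Recoll field name
    return prefixes, stored, alias_to_field


if __name__ == "__main__":
    prefixes, stored, alias_to_field = load_fields(FIELDS_TEXT)
    for alias, field in sorted(alias_to_field.items()):
        print(alias, "->", field, "prefix", prefixes[field],
              "stored" if field in stored else "not stored")
```

Read this way, both `bibtex:journal` and `bibtex:journaltitle` end up in the single `refjournal` field, which is both searchable (it has a prefix) and displayable (it is stored).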
|
|
||||||
== Telling the handler what fields to extract
|
== Telling the handler what fields to extract
|
||||||
|
|
||||||
As of Recoll 1.23.2, the PDF handler has the capability to use
|
As of Recoll 1.23.2, the PDF handler has the capability to use *pdfinfo*
|
||||||
*pdfinfo* for extracting XMP metadata. The switch for executing *pdfinfo*
|
for extracting XMP metadata. The switch for executing *pdfinfo* is the
|
||||||
is the 'pdfextrameta' configuration parameter, and the value of the
|
'pdfextrameta' configuration parameter, and the value of the parameter is a
|
||||||
parameter is a list of XMP tags to extract, with optional conversion
|
list of XMP tags to extract, with optional conversion to Recoll field names
|
||||||
to Recoll field names (the XMP qualified tag name is kept by
|
(the XMP qualified tag name is kept by default, the translation is
|
||||||
default). Example:
|
separated by a '|' character). Example (without translations):
|
||||||
|
|
||||||
----
|
----
|
||||||
pdfextrameta = bibtex:year bibtex:journal bibtex:booktitle|title
|
pdfextrameta = bibtex:year bibtex:journal bibtex:journaltitle
|
||||||
----
|
----
|
||||||
|
|
||||||
Here, 'bibtex:year' and 'bibtex:journal' are used directly, and
|
Note that it is quite equivalent to translate a field name inside
|
||||||
'bibtex:booktitle' is translated to 'title' (the example is not
|
'pdfextrameta' or to use aliases inside the 'fields' file.
|
||||||
supposed to make sense)
|
|
||||||
|
|
||||||
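To make the 'pdfextrameta' value format concrete, here is a minimal Python sketch of how such a value could be split into XMP tag and Recoll field name pairs. It only mirrors the syntax described above (whitespace-separated entries, optional '|' translation) and is not the handler's actual code. The function name `parse_pdfextrameta` is hypothetical.

```python
def parse_pdfextrameta(value):
    """Split a pdfextrameta value into (xmp_tag, recoll_field) pairs.

    Illustration of the documented syntax only, not Recoll's implementation.
    """
    pairs = []
    for item in value.split():
        tag, sep, field = item.partition("|")
        # Without a translation, the qualified XMP tag name is kept.
        pairs.append((tag, field if sep and field else tag))
    return pairs


if __name__ == "__main__":
    sample = "bibtex:year bibtex:journal bibtex:booktitle|title"
    for tag, field in parse_pdfextrameta(sample):
        print(tag, "->", field)
```

With the sample value, 'bibtex:year' and 'bibtex:journal' keep their names and 'bibtex:booktitle' is translated to 'title'.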
== Editing the field values
|
== Editing the field values
|
||||||
|
|
||||||
@ -127,6 +131,13 @@ class MetaFixer(object):
|
|||||||
return txt
|
return txt
|
||||||
----
|
----
|
||||||
|
|
||||||
|
|
||||||
|
The metadata-editing script can be modified to fill in the "journal" field for
|
||||||
|
BibTeX entries that aren't journal articles (e.g. bibtex:booktitle
|
||||||
|
for "InCollection" entries), by defining a 'wrapup()' method which will
|
||||||
|
be called with the whole metadata array (an array of '(nm,value)'
|
||||||
|
pairs) for global editing/removing/addition.
|
||||||
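A 'wrapup()' method of this kind might look like the sketch below. It rests on two assumptions that the text does not confirm: that the list of '(nm,value)' pairs is edited in place (no return value is used), and that the names seen by 'wrapup()' are the original XMP tag names. Check both against the handler version you run.

```python
class MetaFixer(object):
    def metafix(self, nm, txt):
        # Per-field editing is left untouched in this sketch.
        return txt

    def wrapup(self, metaheaders):
        """Fill in 'bibtex:journal' from 'bibtex:booktitle' when it is missing.

        Assumption: metaheaders is a mutable list of (nm, value) pairs
        which the caller reads again after this method returns.
        """
        names = [nm for nm, _ in metaheaders]
        if "bibtex:journal" in names:
            return  # a real journal entry: nothing to do
        for nm, value in metaheaders:
            if nm == "bibtex:booktitle":
                metaheaders.append(("bibtex:journal", value))
                break


if __name__ == "__main__":
    meta = [("bibtex:title", "Some chapter"),
            ("bibtex:booktitle", "Some collection")]
    MetaFixer().wrapup(meta)
    print(meta)
```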
|
|
||||||
== Indexing
|
== Indexing
|
||||||
|
|
||||||
Then index away!
|
Then index away!
|
||||||
@ -138,11 +149,8 @@ HTML meta elements, and the <body> contains the text of the PDF.
|
|||||||
|
|
||||||
== Result paragraph format
|
== Result paragraph format
|
||||||
|
|
||||||
Here, the result is formatted to show the title, which is a link
|
The result paragraph format defines which fields are displayed inside the Recoll
|
||||||
to open the document, in blue with underlining turned off. The next
|
result list, and how they are formatted.
|
||||||
two lines contain the authors, then the journal title in green
|
|
||||||
italicized text followed by year (in parentheses). The keywords are
|
|
||||||
listed in red after the abstract/text snippet.
|
|
||||||
|
|
||||||
Edit this using the Recoll GUI: Preferences > GUI configuration >
|
Edit this using the Recoll GUI: Preferences > GUI configuration >
|
||||||
Result List > Edit result paragraph format string.
|
Result List > Edit result paragraph format string.
|
||||||
@ -177,26 +185,15 @@ Edit this using the Recoll GUI: Preferences > GUI configuration >
|
|||||||
|
|
||||||
----
|
----
|
||||||
|
|
||||||
The screenshot below also has the 'Highlight color for query terms'
|
There are
|
||||||
set to `black; font-weight:bold;` for bold, black text (instead
|
link:https://bitbucket.org/medoc/recoll/wiki/ResultsThumbnails[various
|
||||||
of the blue default). There
|
methods for creating the thumbnails]; the ones here were made by opening
|
||||||
are linkhttps://bitbucket.org/medoc/recoll/wiki/ResultsThumbnails[various
|
the directory containing the PDFs in the Dolphin file manager (part of KDE)
|
||||||
methods for creating the thumbnails]; the ones here were made by
|
and selecting the Preview option.
|
||||||
opening the directory containing the PDFs in the Dolphin file manager
|
|
||||||
(part of KDE) and selecting the Preview option.
|
|
||||||
|
|
||||||
|
And the result:
|
||||||
|
|
||||||
== A search example
|
image::recoll_query.png[Result list display]
|
||||||
|
|
||||||
The simple query is `cerevisiae keyword:protein`. This
|
|
||||||
returns only PDFs that have the text "cerevisiae" and have been
|
|
||||||
tagged with the "protein" keyword. The LaTeX-style formatting from
|
|
||||||
the BibTeX database is displayed as HTML (note the italicized words
|
|
||||||
in article title, and umlaut in author's name). Other queries could
|
|
||||||
be made based on the PDF metadata, e.g. 'journal:plos'
|
|
||||||
r 'year:2013'.
|
|
||||||
|
|
||||||
image::recoll_query.png
|
|
||||||
|
|
||||||
== More possibilities
|
== More possibilities
|
||||||
|
|
||||||
@ -216,6 +213,190 @@ the result list using the stored date of the file (using "%D" in the
|
|||||||
result paragraph format, and date format "%Y") instead of having to
|
result paragraph format, and date format "%Y") instead of having to
|
||||||
add the year to the index as shown above.
|
add the year to the index as shown above.
|
||||||
|
|
||||||
- The filter can be modified to fill in the "journal" field for
|
|
||||||
BibTex entries that aren't journal articles (e.g. bibtex:booktitle
|
== Complete example
|
||||||
for "InCollection" entries).
|
|
||||||
|
This was designed by Johannes Menzel, who kindly provided the data when we
|
||||||
|
worked on improving PDF XMP data extraction. The originals are listed in
|
||||||
|
this
|
||||||
|
link:https://bitbucket.org/medoc/recoll/issues/300/extracting-xmp-metadata-and-tmsu-tags[Bitbucket issue].
|
||||||
|
|
||||||
|
The paragraph format is listed above.
|
||||||
|
|
||||||
|
=== 'recoll.conf' additions:
|
||||||
|
|
||||||
|
----
|
||||||
|
pdfextrameta = bibtex:journal bibtex:journaltitle bibtex:pages \
|
||||||
|
bibtex:volume bibtex:number bibtex:booktitle bibtex:year bibtex:author \
|
||||||
|
bibtex:title bibtex:isbn bibtex:issn bibtex:editor bibtex:address \
|
||||||
|
bibtex:location bibtex:doi bibtex:chapter bibtex:url bibtex:entrytype \
|
||||||
|
bibtex:bibtexkey bibtex:abstract bibtex:date bibtex:keywords \
|
||||||
|
bibtex:comment bibtex:language bibtex:edition bibtex:totalpages \
|
||||||
|
dc:creator dc:relation dc:publisher dc:title dc:type dc:identifier
|
||||||
|
|
||||||
|
defaultcharset = UTF-8//
|
||||||
|
|
||||||
|
pdfextrametafix = /home/hannes/.recoll/metafix.py
|
||||||
|
----
|
||||||
|
|
||||||
|
|
||||||
|
=== 'metafix.py' script:
|
||||||
|
|
||||||
|
----
|
||||||
|
import sys
|
||||||
|
import re
|
||||||
|
|
||||||
|
# This can be used for local XMP field editing.
|
||||||
|
#
|
||||||
|
# A new instance is created for each PDF document (so the object could
|
||||||
|
# keep state to avoid, e.g. duplicate values)
|
||||||
|
#
|
||||||
|
# The metafix method receives an (original) field name, and the text
|
||||||
|
# value, and should return the possibly modified text.
|
||||||
|
class MetaFixer(object):
|
||||||
|
def __init__(self):
|
||||||
|
pass
|
||||||
|
|
||||||
|
def metafix(self, nm, txt):
|
||||||
|
if nm == 'bibtex:pages':
|
||||||
|
txt = re.sub(r'--', '-', txt)
|
||||||
|
txt = re.sub(r'^', ', p. ', txt)
|
||||||
|
elif nm == 'bibtex:author':
|
||||||
|
txt = re.sub(r'$', ':\ ', txt)
|
||||||
|
pass
|
||||||
|
elif nm == 'bibtex:chapter':
|
||||||
|
txt = re.sub(r'^', ', in: id.: ', txt)
|
||||||
|
pass
|
||||||
|
elif nm == 'bibtex:editor':
|
||||||
|
txt = re.sub(r'^', ', in: ', txt)
|
||||||
|
txt = re.sub(r'$', ' (ed.):\ ', txt)
|
||||||
|
pass
|
||||||
|
elif nm == 'bibtex:year':
|
||||||
|
txt = re.sub(r'^', ', ', txt)
|
||||||
|
pass
|
||||||
|
elif nm == 'bibtex:date':
|
||||||
|
txt = re.sub(r'^', ', ', txt)
|
||||||
|
pass
|
||||||
|
elif nm == 'bibtex:volume':
|
||||||
|
txt = re.sub(r'^', ', vol. ', txt)
|
||||||
|
pass
|
||||||
|
elif nm == 'bibtex:number':
|
||||||
|
txt = re.sub(r'^', ', no. ', txt)
|
||||||
|
pass
|
||||||
|
elif nm == 'bibtex:journaltitle':
|
||||||
|
txt = re.sub(r'^', ', in: ', txt)
|
||||||
|
pass
|
||||||
|
elif nm == 'bibtex:journal':
|
||||||
|
txt = re.sub(r'^', ', in: ', txt)
|
||||||
|
pass
|
||||||
|
elif nm == 'bibtex:title':
|
||||||
|
txt = re.sub(r'^', '"', txt)
|
||||||
|
txt = re.sub(r'$', '"', txt)
|
||||||
|
pass
|
||||||
|
elif nm == 'bibtex:location':
|
||||||
|
txt = re.sub(r'^', ', ', txt)
|
||||||
|
txt = re.sub(r'$', ':\ ', txt)
|
||||||
|
pass
|
||||||
|
elif nm == 'bibtex:address':
|
||||||
|
txt = re.sub(r'^', ', ', txt)
|
||||||
|
txt = re.sub(r'$', ':\ ', txt)
|
||||||
|
pass
|
||||||
|
elif nm == 'bibtex:isbn':
|
||||||
|
txt = re.sub(r'^', 'ISBN: ', txt)
|
||||||
|
pass
|
||||||
|
elif nm == 'bibtex:issn':
|
||||||
|
txt = re.sub(r'^', 'ISSN: ', txt)
|
||||||
|
pass
|
||||||
|
elif nm == 'bibtex:doi':
|
||||||
|
txt = re.sub(r'^', 'DOI: ', txt)
|
||||||
|
pass
|
||||||
|
elif nm == 'bibtex:bibtexkey':
|
||||||
|
txt = re.sub(r'^', 'Key: ', txt)
|
||||||
|
pass
|
||||||
|
|
||||||
|
return txt
|
||||||
|
----
|
||||||
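Most branches of the script above only add a fixed prefix or suffix, so the same idea can be written as a lookup table. The sketch below is an alternative formulation covering a few of the fields, not Johannes Menzel's script. The `AFFIXES` table is a hypothetical name, and plain string concatenation is assumed to be equivalent to the `re.sub(r'^', ...)` and `re.sub(r'$', ...)` calls for values without a trailing newline.

```python
# Field name -> (prefix, suffix), taken from the corresponding branches above.
AFFIXES = {
    "bibtex:pages": (", p. ", ""),
    "bibtex:volume": (", vol. ", ""),
    "bibtex:number": (", no. ", ""),
    "bibtex:title": ('"', '"'),
    "bibtex:isbn": ("ISBN: ", ""),
    "bibtex:issn": ("ISSN: ", ""),
    "bibtex:doi": ("DOI: ", ""),
    "bibtex:bibtexkey": ("Key: ", ""),
}


class MetaFixer(object):
    def metafix(self, nm, txt):
        if nm == "bibtex:pages":
            # BibTeX page ranges use a double dash: show a single one.
            txt = txt.replace("--", "-")
        # Fields missing from the table are returned unchanged.
        prefix, suffix = AFFIXES.get(nm, ("", ""))
        return prefix + txt + suffix


if __name__ == "__main__":
    fixer = MetaFixer()
    print(fixer.metafix("bibtex:pages", "12--34"))
    print(fixer.metafix("bibtex:title", "A title"))
```

Adding a field then means adding one table entry instead of another `elif` branch.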
|
|
||||||
|
|
||||||
|
=== 'fields' file:
|
||||||
|
|
||||||
|
----
|
||||||
|
[prefixes]
|
||||||
|
|
||||||
|
refjournal=RFJOURNAL
|
||||||
|
refpages=RFPAGES
|
||||||
|
reftitle=RFTTITLE
|
||||||
|
refvolume=RFVOLUME
|
||||||
|
refauthor=RFAUTHOR
|
||||||
|
refyear=RFYYEAR
|
||||||
|
refisbn=RFISBN
|
||||||
|
refissn=RFISSN
|
||||||
|
refdoi=RFDOI
|
||||||
|
refeditor=RFEDITOR
|
||||||
|
refpublisher=RFPUBLISHER
|
||||||
|
refaddress=RFADDRESS
|
||||||
|
reflocation=RFLOCATION
|
||||||
|
refbooktitle=RFBOOKTITLE
|
||||||
|
refurl=RFURL
|
||||||
|
reftype=RFTYPE
|
||||||
|
refkey=RFKEY
|
||||||
|
refabstract=RFABSTRACT
|
||||||
|
refkeywords=RFKEYWORDS
|
||||||
|
refcomment=RFCOMMENT
|
||||||
|
refedition=RFEDITION
|
||||||
|
reflanguage=RFLANGUAGE
|
||||||
|
|
||||||
|
[stored]
|
||||||
|
|
||||||
|
refjournal=
|
||||||
|
refpages=
|
||||||
|
reftitle=
|
||||||
|
refvolume=
|
||||||
|
refauthor=
|
||||||
|
refyear=
|
||||||
|
refisbn=
|
||||||
|
refissn=
|
||||||
|
refdoi=
|
||||||
|
refeditor=
|
||||||
|
refpublisher=
|
||||||
|
refaddress=
|
||||||
|
reflocation=
|
||||||
|
refbooktitle=
|
||||||
|
refurl=
|
||||||
|
reftype=
|
||||||
|
refkey=
|
||||||
|
refabstract=
|
||||||
|
refkeywords=
|
||||||
|
refcomment=
|
||||||
|
refedition=
|
||||||
|
reflanguage=
|
||||||
|
refid=
|
||||||
|
|
||||||
|
[aliases]
|
||||||
|
|
||||||
|
refjournal = bibtex:journal bibtex:journaltitle
|
||||||
|
refpages = bibtex:pages
|
||||||
|
reftitle = bibtex:title
|
||||||
|
refvolume = bibtex:volume
|
||||||
|
refauthor = bibtex:author
|
||||||
|
refyear = bibtex:year bibtex:date
|
||||||
|
refid = dc:identifier bibtex:isbn bibtex:issn
|
||||||
|
refisbn = bibtex:isbn
|
||||||
|
refissn = bibtex:issn
|
||||||
|
refdoi = bibtex:doi
|
||||||
|
refeditor = bibtex:editor
|
||||||
|
refpublisher = bibtex:publisher
|
||||||
|
refaddress = bibtex:address
|
||||||
|
reflocation = bibtex:location
|
||||||
|
refbooktitle = bibtex:booktitle
|
||||||
|
refurl = bibtex:url
|
||||||
|
reftype = bibtex:entrytype bibtex:type
|
||||||
|
refkey = bibtex:bibtexkey
|
||||||
|
refabstract = bibtex:abstract
|
||||||
|
refkeywords = bibtex:keywords
|
||||||
|
refcomment = bibtex:comment
|
||||||
|
refedition = bibtex:edition
|
||||||
|
reflanguage = bibtex:language
|
||||||
|
author = xesam:author
|
||||||
|
----
|
||||||
|
|
||||||
|
|||||||
Binary file not shown.
|
Before Width: | Height: | Size: 285 KiB After Width: | Height: | Size: 284 KiB |