== Generating a custom field and using it to sort results

We are going to show how to generate a custom field from a Recoll filter,
and use it for sorting results. The example chosen comes from an actual
user request: sorting results on PDF page counts.

The details here are obsolete, as the +pdf+ input handler is now a quite
different Python program, but the general idea is still relevant.

The page count of a PDF file can be displayed by the pdfinfo command
(xpdf or poppler tools).
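As a rough illustration of this step, the following shell sketch reads pdfinfo output on standard input and prints the page count. The function name +page_count+ is made up for this example, and the zero padding to four digits is a choice that keeps text ordering consistent with numeric ordering up to 9999 pages.

```shell
# Read pdfinfo-style output on stdin and print the page count,
# zero-padded to 4 digits. Returns a non-zero status if no
# "Pages:" line is found.
# page_count is a hypothetical helper name, not part of Recoll.
page_count() {
    awk '$1 == "Pages:" { printf "%04d\n", $2; found = 1; exit }
         END { if (!found) exit 1 }'
}

# Usage (assumes pdfinfo from xpdf or poppler is installed,
# and that $infile holds the path of a PDF file):
#   pdfinfo "$infile" | page_count
```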

We first modify a copy of the rclpdf filter
('/usr/[local/]share/recoll/filters/rclpdf') to compute the PDF page count
and output the value as an HTML meta field. This takes a not very interesting
bit of shell/awk magic. Another approach would be to rewrite the
rclpdf filter in your favorite scripting language (e.g. Perl, Python...), as
all it does is execute pdftotext and pdfinfo and output HTML, nothing
complicated. The rclpdf modification follows, as a pseudo-patch:

----
# compute the page count and format it so that it's alphabetically sortable
+set `pdfinfo "$infile" | egrep ^Pages:`
+pages=`printf "%04d" $2`
[skip...]
# Pass the page count value to awk
-awk 'BEGIN'\
+awk -v Pages="$pages" 'BEGIN'\
[skip...]
# Inside the awk program startup section: compute the "meta" field line
+ pagemeta = "<meta name=\"pdfpages\" content=\"" Pages "\">\n"
[skip...]
# Then print it as part of the header:
+ $0 = part1 charsetmeta pagemeta part2
[skip...]
----
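The awk part of the patch can be tried in isolation. The sketch below is not the rclpdf code: it is a simplified stand-in that inserts the +pdfpages+ meta line just before the first line containing a lowercase closing head tag in whatever HTML it reads. The function name +add_pages_meta+ is hypothetical, and the placement of the meta line is an assumption made for the example.

```shell
# Copy the HTML read on stdin to stdout, inserting a pdfpages meta
# line just before the first line that contains </head>.
# $1 is the (already zero-padded) page count.
# add_pages_meta is a hypothetical helper name, not part of Recoll.
add_pages_meta() {
    awk -v Pages="$1" '
        /<\/head>/ && !inserted {
            print "<meta name=\"pdfpages\" content=\"" Pages "\">"
            inserted = 1
        }
        { print }'
}

# Usage (assumes some command producing HTML with a head section):
#   some_html_producer | add_pages_meta 0012
```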

You can execute your own version of rclpdf by modifying '~/.recoll/mimeconf':

----
[index]
application/pdf = exec /path/to/my/own/rclpdf
----
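Before reindexing, it may be worth checking by hand that the modified filter really emits the field. The following sketch assumes that the filter takes the file path as its argument and writes its HTML to standard output; +has_pages_meta+ is a made-up helper name, and the PDF path in the usage comment is a placeholder.

```shell
# Succeed if the HTML read on stdin declares a pdfpages meta field
# whose content is made of digits only.
# has_pages_meta is a hypothetical helper name, not part of Recoll.
has_pages_meta() {
    grep -Eq '<meta name="pdfpages" content="[0-9]+">'
}

# Usage (assumption: the filter is called with the file path as argument):
#   /path/to/my/own/rclpdf /some/file.pdf | has_pages_meta && echo "field found"
```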

At this point, recollindex would receive and extract a +pdfpages+ field,
but it would not know what to do with it. We are going to tell it to store
the value inside the document data record so that it can be displayed in
the results and used for sorting. To do this, we modify the '~/.recoll/fields' file:

----
[stored]
pdfpages=
----

That's it! After reindexing, you can display +pdfpages+ inside the
result list (add a +%(pdfpages)+ value to the paragraph format), show it as
a column in the result table (right-click the table header), and sort
the results on page count (click the column header).
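The zero padding applied in the filter matters here, assuming (as the comment in the patch suggests) that the stored values are compared as text. The following sketch uses the standard sort command rather than Recoll itself to show the difference between unpadded and padded values; +text_sort+ is a name made up for the example.

```shell
# Sort the lines read on stdin as plain text.
# LC_ALL=C forces byte-wise comparison, independent of the locale.
# text_sort is a hypothetical helper name, not part of Recoll.
text_sort() {
    LC_ALL=C sort
}

# Unpadded values come out in the wrong order: 100, 12, 7.
printf '%s\n' 7 100 12 | text_sort

# Padded to four digits, text order and numeric order agree:
# 0007, 0012, 0100.
printf '%s\n' 0007 0100 0012 | text_sort
```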

Note that +pdfpages+ has not been defined as searchable (this would not make
much sense). To make it searchable, you would have to define a prefix and add
it to the [prefixes] section of the 'fields' file:

----
[prefixes]
pdfpages = XYPDFP
----

Have a look at the comments inside the 'fields' file for more information.