This commit is contained in:
Jean-Francois Dockes 2017-06-05 11:57:26 +02:00
parent 06b414cfc6
commit 821fb780d2
35 changed files with 2078 additions and 0 deletions

@ -0,0 +1,35 @@
== Extending the Recoll Firefox visited web page indexing mechanism to other browsers
The *Recoll* _Web Queue_ function allows using WEB browser plug-ins
originally designed for indexing visited WEB pages with *Beagle* (rip). The
browser plug-ins work very simply by creating copies of the visited pages
in a designated directory. Two files are created for each page, one for the
contents, the other for the metadata.
When activated, *Recoll* will visit the queue directory and index each HTML
page and its associated metadata. There is more detail about the mechanism
on the link:IndexWebHistory.html[page about the Recoll Web queue], but mostly, you
just need to go to the _Indexing Preferences_ in the *recoll* GUI, open the
_Web history_ panel and check the top button.
Franck, a *Recoll* and *Elinks* user from New Zealand, designed a method
and wrote a script to index the *Elinks* WEB history in this fashion.
The script works by using *wget* to fetch the visited page into the queue
directory. This means that it would be reusable to index arbitrary WEB
pages in contexts other than *Elinks* visits.
Recipe for *Elinks* and Recoll 1.18 and later:
* Retrieve the
link:https://www.recoll.org/files/elinks_recoll.sh[elinks_recoll.sh] shell
script and make it executable (`chmod a+x elinks_recoll.sh`).
* In the Elinks Keyboard shortcut manager (k)/Main, add a shortcut to pass
the current URL to an external command, e.g. _Ctrl-P_.
* In the Options manager (o) /Document/Uri Passing, add an action named for
example _ToIndex_
* Modify the ToIndex action to execute `/path/to/the/script/elinks_recoll.sh %c`
* Save, you are done
For Recoll 1.17, the method is analogous, but the script is named
link:https://www.recoll.org/files/elinks_recoll.sh[elinks_beagle.sh].

@ -0,0 +1,37 @@
== Faqs and Howtos
=== Indexing
* link:WhyIsMyFileNotIndexed.html[Why is this file not indexed ? Investigating indexing issues]
* link:PreventIndexingDir.html[Preventing the indexing of a directory]
* link:IndexOnAc.html[Starting/stopping the indexer depending on power/battery status]
* link:IndexMozillaCalendari.html[Indexing Mozilla Sunbird / Lightning calendar data]
* link:MultipleIndexes.html[Creating and using multiple indexes]
* link:IndexWebHistory.html[Indexing Web history with the Firefox browser extension]
* link:ElinksWeb.html[Extending the Web queue mechanism to other browsers and general WEB indexing]
* link:IndexMailHeader.html[Indexing arbitrary mail headers]
* link:IndexOutlook.html[Indexing Outlook archives]
* link:HandleCustomField.html[Generating a custom field and using it to sort results]
* link:http://www.recoll.org/recoll_XMP/index.html.html[An example of filter/field customisation, using XMP metadata with PDFs]
* link:FilteringOutZipArchiveMembers.html[Filtering out Zip archive members]
=== Searching
* link:GUIKeyboard.html[Recoll GUI keyboard navigation]
* link:HotRecoll.html[On the desktop: using a keyboard shortcut for starting/hiding recoll]
* link:OpenHelperScript.html[Handling issues for starting native apps, esp. email clients - getting Thunderbird to open message files]
* link:QpdfviewHelperScript.html[Another example open helper script - using qpdfview to open pdf and postscript files, with support for page and search options]
* link:UsingOpenWith.html[Using the new Open With menu in recoll 1.20 with a custom app]
* link:ReplaceCategories.html[Replacing the document category filters]
* link:ResultsThumbnails.html[Result list thumbnails and how to create them]
* link:MuttAndRecoll.html[Interfacing Recoll and Mutt]
* link:QueryFromC.html[Querying from a C program]
=== Administration and miscellaneous
* link:http://www.recoll.org/pages/recoll-webui-install-wsgi.html.html[Installation of the Recoll WebUI with Apache]
* link:FilterRetrofit.wiki.html[Installing a filter for a new document type]
* link:UnityLens.html[Building and Installing the Ubuntu Unity Recoll Lens]
* link:SavingConfig.wiki.html[Recoll configuration backup]
* link:XDGBase.wiki.html[Tidying Recoll data storage]
* link:ProblemSolvingData.html[Collecting diagnostic information]
* link:NonAsciiFileNames.html[Unix and non-ascii file names]
* link:FilterArch.html[Recoll filters]

@ -0,0 +1,82 @@
== Recoll input handlers
In the end, Recoll indexes plain UTF-8 text, remembering where it came
from.
But of course, this is not what the source data looks like.
The text content of the original documents is encoded in many fashions
(e.g. pdf, ms-word, html, etc.), and it can also be stored in quite
involved ways (inside archives, email attachments ...).
For getting to the data and converting it to plain text, Recoll uses a set
of modules which it calls input handlers (or filters), which either operate
on the storage structure (e.g. a zip handler), or the storage format (e.g. a
pdf to text translator), or both. In addition, there is a tentative notion
of a higher level storage backend which we will ignore for now (for
reference there are currently two of those: the file system and the web
history cache).
The basic task of filters is to take a document as input and produce a
series of subdocuments as output. The subdocument's format is defined
either dynamically (as part of the output data), or statically, in the
filter definition.
=== Simple filters
These are executed by the **mh_exec** recoll module. They are the vast
majority.
These filters are very simple. They are designed to perform a simple task
with minimal interface, they mostly don't know anything about each other,
and they don't know much about their context. This makes writing a filter
quite easy as there is not much to learn about their environment.
Only one output document is produced and the format is fixed.
In practice the filter, which is most generally a shell-script (but could
be any executable program), takes a file name on the command line and
outputs an html or plain text document on standard output, then exits.
For example, the pdf filter takes one pdf file name as input on the command
line and produces one html document on stdout. The fact that the output is
html is statically defined in a configuration file.
For filters which produce plain text, the output character set information
is in general defined in the configuration file. Else it will be obtained
from the locale (hoping that it makes sense).
Filters that output html can produce metadata information in the html
header (e.g. the author). Filters that output plain text can only output
main text data, no metadata fields.
Besides the file name, there is one other piece of input information, which
is in the form of an environment variable, and can be safely ignored:
+RECOLL_FILTER_FORPREVIEW+. This indicates if the filter is being used
for previewing or for indexing data. Some filters will elect to suppress
repetitive parts of the output text when indexing to avoid distorting the
term statistics. For example, the man filter suppresses the section
headers (NAME, SYNOPSIS...) when indexing.
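As an illustration, a simple filter written in Python could be sketched as
follows. This is not one of the filters shipped with Recoll: the
`text_to_html` helper, the `author` value and the rule used to drop header
lines are hypothetical, and the sketch assumes that
+RECOLL_FILTER_FORPREVIEW+ is set to `yes` when previewing.

```python
#!/usr/bin/env python3
"""Hypothetical simple filter: file name in argv[1], HTML on stdout."""
import html
import os
import sys


def text_to_html(text, author="", for_preview=False):
    """Wrap plain text in a minimal HTML document with an author meta field."""
    lines = text.splitlines()
    if not for_preview:
        # When indexing, drop repetitive all-uppercase header lines so that
        # they do not distort term statistics (illustrative rule only).
        lines = [l for l in lines if not (l.strip() and l.strip().isupper())]
    body = html.escape("\n".join(lines))
    return (
        "<html><head>"
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
        f'<meta name="author" content="{html.escape(author, quote=True)}">'
        f"</head><body><pre>{body}</pre></body></html>"
    )


def main(argv):
    # Assumption: the variable is set to "yes" when previewing.
    for_preview = os.environ.get("RECOLL_FILTER_FORPREVIEW") == "yes"
    with open(argv[1], encoding="utf-8", errors="replace") as f:
        sys.stdout.write(text_to_html(f.read(), for_preview=for_preview))


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv)
```

Because the output is html, the metadata travels in the header and the
filter itself needs no knowledge of the indexer.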
=== Multiple input filters
These filters are more complex, but still quite easy to write, especially
if you can use Python, because they can then use a common module which
manages the communication with the indexer.
Newer Recoll versions have converted many previously 'simple' filters to
this kind as part of the port to Windows.
These filters are executed by the *mh_execm* Recoll module.
They are persistent (one instance will persist through a whole indexing
pass), and will index successive multiple input files (the point being to
avoid startup performance penalty), and possibly multiple documents per
input file if this makes sense for their input format (e.g. zip archive, chm
help file).
They use a simple communication protocol over a pipe with the main recoll
or recollindex process, with file names and a few other parameters being
sent as input, and decoded data and attributes being sent in return.
The shared Python module is 'filters/rclexecm.py'. You can look at 'rclzip'
or 'rclaudio' for reasonably straightforward examples.

@ -0,0 +1,62 @@
== Installing a filter for a new document type
It will sometimes happen that a newer Recoll release has support for a
document type which would be useful to you, but which your older release
does not support.
It is in general easy to import support from the newer to the older
release: the Recoll input handler interface is very stable, so things should just
work.
Input Handler updates are generally described on the Recoll web site
link:https://www.recoll.org/filters/filters.html[new filters pages]. They
may include notes about which versions need the new input handler, or specifics
about installing it.
An up to date copy of input handlers and configuration files is also kept
link:https://www.recoll.org/filters/[at the same location].
We will take an example to make things more concrete: Tomboy and Gnote
files are directly supported by Recoll 1.19, but not in older Recoll
releases. The *rclxml* handler is needed to process them.
The following procedure will allow you to retrofit support:
- Retrieve the *rclxml* input handler from:
link:https://www.lesbonscomptes.com/recoll/filters/rclxml[]
- Copy it to '/usr/share/recoll/filters' and make it executable:
`chmod +x rclxml`
The input handler needs *xsltproc*, but this is probably already on your
system (else get it with the package manager).
- Edit '~/.recoll/mimemap', add the following line:
`.note = application/x-gnote`
- Edit '~/.recoll/mimeconf', add the following lines:
+
----
[index]
application/x-gnote = exec rclxml
----
- Edit '~/.recoll/mimeview', add the following lines:
+
----
[view]
application/x-gnote = tomboy %f
----
- The easiest way to make sure the files are indexed with the new input
handlers may then be to just run a full indexing pass (`recollindex -z`).
Notes:
- The MIME type which is used is not crucial, you could prefer to use,
e.g., +application/x-tomboy+ instead, it just has to be consistent. To
avoid future trouble, it's better to use the type used by newer Recoll
releases though.
- The 'mimeview' entry is necessary even if you are using the desktop
preferences to open files. The value will not be used, but it has to be
there.

@ -0,0 +1,34 @@
== Filtering out Zip archive members ==
The *rclzip* Zip archive extraction input handler does not use the general
configuration variables which define what file system objects should be
skipped, but it has an equivalent internal function.
The name-skipping code depends on a recent member of the Recoll Python
package. This will become standard for release 1.20, but for earlier
releases, you need to do two things to use this function:
- Fetch 'python/recoll/recoll/rclconfig.py' and 'filters/rclzip' from the
source repository.
- Copy both to '/usr/share/recoll/filters' and make 'rclzip' executable.
You can then set a variable named +zipSkippedNames+ inside
'recoll.conf'. +zipSkippedNames+ should be a space-separated list of
patterns which will be passed to the Python fnmatch() function. The +/+
characters are not special (matched as any character).
You can't use embedded spaces in patterns (no double-quote quoting for now).
This can be redefined for file system directories using the usual section
indicators (Zip archives in different file-system directories can have
different skip lists).
Example:
----
zipSkippedNames = *.txt
[/path/to/the/dir]
zipSkippedNames = somedir/*/*.html
----
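The matching rule can be tried out directly in Python. The sketch below is
not the *rclzip* code: the `is_skipped` helper is hypothetical, and it uses
`fnmatch.fnmatchcase()` (the case-sensitive variant of `fnmatch()`) only so
that the result does not depend on the platform.

```python
import fnmatch


def is_skipped(member_name, skipped_names):
    """Return True if member_name matches any pattern of the space-separated list."""
    patterns = skipped_names.split()
    return any(fnmatch.fnmatchcase(member_name, pat) for pat in patterns)


# "*" also matches "/" because fnmatch does not treat it specially.
print(is_skipped("docs/readme.txt", "*.txt"))                   # True
print(is_skipped("somedir/a/page.html", "somedir/*/*.html"))    # True
print(is_skipped("somedir/a/b/page.html", "somedir/*/*.html"))  # True
print(is_skipped("other/page.html", "somedir/*/*.html"))        # False
```

The third call shows the consequence of +/+ not being special: a single
`*` can span several directory levels.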

@ -0,0 +1,60 @@
== Recoll GUI keyboard navigation
Using Recoll without the mouse is not completely straightforward, but it is
mostly feasible. Here follows a description of the usable shortcuts.
=== Anywhere
`Ctrl+q` should exit Recoll from anywhere.
=== Main window and result list ===
When Recoll starts up, the focus is in the simple search entry. The main
window tab order is as follows:
* Clear
* Search
* Search type combo
* Search entry (Initial focus)
* Result list (scrolling etc)
* Result list 1st link
* Result list next links...
* Back to Clear
Each result list entry has 3 links: the icon link is not active, but its
value is the URL, so that it can be dragged and dropped to another
application. The 2 other links are _Preview_ and _Open_ and can be
activated by typing _Enter_.
Typing _Ctrl+Shift+s_ anywhere in the main window should return the focus to the search entry. So will _Ctrl+l_ in future versions (for compatibility with WEB browser usage).
For pure keyboard usage, you can improve this by:
- Disabling the icon link: use _Preferences->GUI configuration->Result
List->Edit result paragraph_ and remove the `<a href='%U'>` and `</a>`
around the `<img...>` tag.
- Making the active link more visible by adding the following code to the
result page HTML header insert (same preferences tab). Feel free to
adjust the color :=) :
----
<style type="text/css">
a:focus {background-color: red;}
</style>
----
=== Result table
The same _Ctrl+Shift+s_ will return the focus to the search entry when
working with the result table.
_Ctrl+r_ will move the focus from the entry to the spreadsheet. When in
there the arrow keys will navigate the lines.
When a line is selected:
* _Ctrl+o_ will _Open_ the document.
* _Ctrl+Shift+o_ will _Open_ the document and exit Recoll.
* _Ctrl+d_ (detail) will start a _Preview_
_Esc_ will deselect the current line so that mouse hovering will work again.

@ -0,0 +1,69 @@
== Generating a custom field and using it to sort results
We are going to show how to generate a custom field from a Recoll filter,
and use it for sorting results. The example chosen comes from an actual
user request: sorting results on pdf page counts.
The details here are obsolete, as the +pdf+ input handler is now a quite
different python program, but the general idea is still relevant.
The page count from a pdf file can be displayed by the pdfinfo command
(xpdf or poppler tools).
We first modify a copy of the rclpdf filter
('/usr/[local/]share/recoll/filters/rclpdf'), to compute the pdf page count,
and output the value as an html meta field. This is a not very interesting
bit of shell/awk magic. Another approach would be to just rewrite the
rclpdf filter in your favorite scripting language (e.g. perl, python...), as
all it does is execute pdftotext and pdfinfo and output html, nothing
complicated. Here follows the rclpdf modification as a pseudo patch:
----
# compute the page count and format it so that it's alphabetically sortable
+set `pdfinfo "$infile" | egrep ^Pages:`
+pages=`printf "%04d" $2`
[skip...]
# Pass the page count value to awk
-awk 'BEGIN'\
+awk -v Pages="$pages" 'BEGIN'\
[skip...]
# Inside the awk program startup section: compute the "meta" field line
+ pagemeta = "<meta name=\"pdfpages\" content=\"" Pages "\">\n"
[skip...]
# Then print it as part of the header:
+ $0 = part1 charsetmeta pagemeta part2
[skip...]
----
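For the rewrite route mentioned above, the page count part could be
sketched in Python as follows. The `page_meta` function and its `width`
parameter are hypothetical; the sketch works on the text printed by
*pdfinfo* and assumes that it contains a line starting with `Pages:`, as
the *egrep* in the patch does.

```python
import re


def page_meta(pdfinfo_output, width=4):
    """Build the pdfpages meta line from the text printed by pdfinfo."""
    match = re.search(r"^Pages:\s+(\d+)", pdfinfo_output, re.MULTILINE)
    if match is None:
        return ""
    # Zero-pad so that string ordering matches numeric ordering.
    pages = f"{int(match.group(1)):0{width}d}"
    return f'<meta name="pdfpages" content="{pages}">'


sample = "Title:          Some document\nPages:          12\nEncrypted:      no\n"
print(page_meta(sample))  # <meta name="pdfpages" content="0012">
```

The zero padding matters because the stored value is compared as a string:
without it, a 100-page document would sort before a 12-page one.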
You can execute your own version of rclpdf by modifying '~/.recoll/mimeconf':
----
[index]
application/pdf = exec /path/to/my/own/rclpdf
----
At this point, recollindex would receive and extract a +pdfpages+ field,
but it would not know what to do with it. We are going to tell it to store
the value inside the document data record so that it can be displayed in
the results, and sorted on. For this we modify the '~/.recoll/fields' file:
----
[stored]
pdfpages=
----
That's it ! After reindexing, you can now display +pdfpages+ inside the
result list (add a +%(pdfpages)+ value to the paragraph format), and display
+pdfpages+ inside the result table (right-click the table header), and sort
the results on page count (click the column header).
Note that +pdfpages+ has not been defined as searchable (this would not make
much sense). To make it searchable, you would have to define a prefix and
add it to the [prefixes] section of the 'fields' file:
----
[prefixes]
pdfpages = XYPDFP
----
Have a look at the comments inside the 'fields' file for more information.

@ -0,0 +1,13 @@
== Welcome to the Recoll Faqs and Recipes
link:FaqsAndHowTos.html[FAQs and Howtos] are stored here, but
the main source for Recoll user documentation is
link:https://www.recoll.org/doc.html[the _Recoll user manual_] on the
link:https://www.recoll.org/[Recoll Web site] where you will also find a
lot of other Recoll information, source code tarballs and contact
information.
If you want to make your problem report as useful as possible, you may want
to take a look at link:ProblemSolvingData.html[this page].
link:WikiIndex.html[Full file index]

@ -0,0 +1,79 @@
== Recoll hotkey: starting / hiding recoll with a keyboard shortcut
Type a key (e.g. F12) and have recoll appear or disappear. On the first
occurrence, recoll is started if it's not already running. Further
occurrences toggle recoll between visible and minimized states. Never
thought this would be useful until someone asked for it. Can't do without
it anymore :)
This works well with both Gnome and KDE, but is implemented using a gnome
library (*libwnck*) and its python interface, which you may have to install
on your system if you are a pure KDE user. The library most probably exists
in the package repositories for your distribution, so this should not be
too complicated.
This should also work with other window managers, because it is based on a
standard window manager interface extension (EWMH) that most modern window
managers implement.
=== Installing the script (all desktops):
- You will need the libwnck library and its python interface. These are
usually part of a gnome installation, otherwise check and possibly
install them. For OpenSuse, the library should already be there but you
need to install gnome-python-desktop.
- Download the
link:https://www.recoll.org/files/hotrecoll.py[http://www.recoll.org/files/hotrecoll.py
script]. If you have a recent recoll installation (1.14.3 or
later), it's already in the recoll filters directory
('/usr/[local/]share/recoll/filters')
- Copy the script to some permanent place (e.g. '~/bin') and make it
executable (you can leave it in the filters directory if it's there). In a
shell window: `chmod +x hotrecoll.py`.
- You can check that the script works (or not) by executing it on the
command line. It does not need an argument. Recoll should appear or
disappear every time you execute the script. A few warning messages may
be considered normal. If the script says that it does not find the wnck
library or some other module, you'll have to install them.
=== Installing the keyboard shortcut (Gnome):
- _System->Preferences->Keyboard shortcuts_, or execute
*gnome-keybinding-properties*
- Click add, Name, e.g. StartRecoll, Action: /path/to/hotrecoll.py
- This will add the shortcut to the "Custom shortcuts" section. You can
then click in the "Shortcut" column for "StartRecoll", and type any key
combination (e.g. push F12) to assign a key shortcut.
=== Installing the keyboard shortcut (KDE):
Under KDE installing a global custom keyboard shortcut like we need is most
helpfully not under "Keyboard Shortcuts" but under "Input Actions".
- _Kmenu -> Configure Desktop -> Input Actions -> Edit -> New -> Global
Shortcut -> Command/Url_
- A new Action appears, named _New Action_. You can rename it something
like +hotrecoll+ for clarity.
- Click the _Trigger_ tab, click the input area and press your preferred
key combination (e.g. F12)
- Click the _Action_ tab, and enter +hotrecoll.py+ (if it's in your PATH),
or else the full path to the command (e.g.:
'/usr/share/recoll/filters/hotrecoll.py').
- Click _Apply_.
=== Installing the keyboard shortcut (XFCE):
Open the settings manager, and add the shortcut in the
_Application Shortcuts_ panel inside the _Keyboard_ tool.
=== Other environments
Many window managers have a way to set up a keyboard shortcut for running
an arbitrary command. You'll need to look at the documentation for yours,
or search the web for a solution.
An alternative independent of the environment would be to use the XBindKeys
utility. See this link:http://www.linux.com/archive/feed/59494[linux.com
article] for helpful instructions.

@ -0,0 +1,33 @@
== Indexing arbitrary mail headers
By default the Recoll mail handler only processes a subset of email headers
(+From+, +To+, +Cc+, +Date+, +Subject+). It is possible to index additional
headers by specifying them inside the 'fields' configuration file, inside
the configuration directory (typically '~/.recoll/').
Lengthy explanations are not really needed here, and I'll just show an
example (duplicated from the configuration section of the manual):
----
[prefixes]
# Index mailmytag contents (with the given prefix)
mailmytag = XMTAG
[stored]
# Store mailmytag inside the document data record (so that it can be
# displayed - as %(mailmytag) - in result lists).
mailmytag =
[mail]
# Extract the X-My-Tag mail header, and use it internally with the
# mailmytag field name
x-my-tag = mailmytag
----
Limitations:
- The mail filter will only process the first instance for a header
occurring several times.
- No decoding will take place (i.e. for non-ASCII headers which would have
some kind of encoding).
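What these two limitations mean in practice can be shown with the Python
standard `email` module. This is only an illustration of the behaviour
described above, not the code of the Recoll mail handler: the
`first_header` helper and the sample message are hypothetical.

```python
import email


def first_header(raw_message, name):
    """Return the undecoded value of the first occurrence of a header, or None."""
    msg = email.message_from_string(raw_message)
    values = msg.get_all(name)
    return values[0] if values else None


RAW = (
    "From: a@example.org\n"
    "X-My-Tag: first\n"
    "X-My-Tag: second\n"
    "Subject: =?utf-8?q?caf=C3=A9?=\n"
    "\n"
    "body\n"
)

print(first_header(RAW, "x-my-tag"))  # first
print(first_header(RAW, "Subject"))   # =?utf-8?q?caf=C3=A9?=
```

The second `X-My-Tag` value is ignored, and the encoded header is returned
as is, which is what would end up in the field.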

@ -0,0 +1,32 @@
== Indexing Mozilla calendar data
Mozilla calendar programs (*Sunbird*, *Lightning*) do not store their
data in +ics+ files natively. They use an *SQLite* database (the
'storage.sdb' file inside the profile). This means that calendar data
cannot be indexed directly.
To get Recoll to index calendar data, you need to export it to an +ics+
file. This can be done manually, from the application menus, or, by
installing the
link:https://addons.mozilla.org/en-US/sunbird/addon/3740[Automatic Export
extension].
The extension can be configured to export the data when exiting the
program, or at regular time intervals. You can even set up a command to be
executed after the export. If you are not using real time indexing, this
can usefully be *recollindex*.
In _Tools->Add Ons->Automatic Export preferences_, in the _Start an
application after export_ subpanel, set _Path of application_ to
'/usr/[local/]bin/recollindex' and _Parameters of application_ to
something like _-i;/home/me/path/to/nameofexportedcal.ics_
This will ensure that the calendar is indexed every time it is exported
(this is not necessary though, you can let the next batch indexing pass
take care of it).
It may happen that the exported data has some syntax errors which will
prevent indexing with the *rclics* filter which was distributed up to
Recoll 1.13.04 (inclusive). You may get an updated filter from the
link:https://www.recoll.org/download.html[Recoll download page].

@ -0,0 +1,24 @@
== Laptops: starting or stopping indexing according to AC power status
For people using real time indexing on a laptop, kind user "The Doctor"
contributed a script to automatically start and stop indexing according to
power status. The script can be found here:
link:https://bitbucket.org/medoc/recoll/src/tip/src/desktop/recoll_index_on_ac.sh[recoll_index_on_ac.sh]
To use it, you need to copy it somewhere (e.g.: '/usr/bin', but any place
will do), make it executable (`chmod a+x recoll_index_on_ac.sh`), and edit
'~/.config/autostart/recollindex.desktop'
Change the following line:
Exec=recollindex -w 60 -m
to something like the following (depending where you copied the script):
Exec=/usr/bin/recoll_index_on_ac.sh
You may also want to change
'/usr/share/recoll/examples/recollindex.desktop', otherwise your change
will be reverted the next time you toggle real time indexing through the
GUI. And, yes, sorry about it, _this_ change will be lost on the next
Recoll update, so save a copy.
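For readers who want to write their own variant, one way to detect the
power status on Linux is to read the entries under
'/sys/class/power_supply'. The sketch below is not the contributed script
and makes no claim about how it works; the `on_ac_power` function is
hypothetical and assumes that mains adapters expose a `type` file
containing `Mains` and an `online` file containing `1` when plugged in.

```python
from pathlib import Path


def on_ac_power(base="/sys/class/power_supply"):
    """Return True if any mains supply under base reports being online."""
    for supply in Path(base).glob("*"):
        try:
            kind = (supply / "type").read_text().strip()
            online = (supply / "online").read_text().strip()
        except OSError:
            continue  # attribute missing (e.g. a battery has no "online" file)
        if kind == "Mains" and online == "1":
            return True
    return False
```

A wrapper could poll this function and start or stop `recollindex -w 60 -m`
accordingly.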

@ -0,0 +1,11 @@
== Indexing Outlook archives ==
Recoll has no direct support for indexing Microsoft Outlook data, because,
if you are a Windows user, you probably are not a good customer for Linux
desktop indexing...
However, if you have a need to index Outlook data at some point, I can
recommend the excellent link:http://www.five-ten-sg.com/libpst/[libpst]
library and its link:http://www.five-ten-sg.com/libpst/rn01re01.html[readpst]
utility. Using this you can very easily convert the Outlook data into MH or
mbox format, and then index the result with Recoll.

@ -0,0 +1,29 @@
== Indexing Web history with the Firefox extension ==
Note: this document is valid for Recoll versions from 1.18.
The link:http://sourceforge.net/projects/recollfirefox/[Recoll Firefox
extension]
works together with Recoll to index the Web pages that you visit. The
extension is based on an older one which was initially written for the
Beagle indexer.
The extension works by copying the data for the visited pages to a queue
directory ('~/.recollweb/ToIndex' by default), from which they are
indexed and removed by Recoll, and then stored in a local cache.
The extension is now hosted on the Mozilla add-ons site, so you can install
it very simply in Firefox: link:https://addons.mozilla.org/fr/firefox/addon/recoll-indexer-1/[Recoll Firefox add-on page].
This feature can be enabled in the Recoll GUI index configuration panel
(Web history section), or by editing the configuration file (set
+processwebqueue+ to 1).
Please remember that Recoll only stores a limited amount of cached web data
(adjustable from the GUI Index Configuration section), and that old pages
will be purged from the index. Pages that you want to archive permanently
need to be saved elsewhere, as they will otherwise eventually disappear
from the Recoll results.
Recoll will index +.maff+ files, which may be a better choice for archival
usage.

@ -0,0 +1,9 @@
# Convert every AsciiDoc source (*.txt) in this directory to HTML.
.SUFFIXES: .txt .html
.txt.html:
	asciidoc $<
all: $(addsuffix .html,$(basename $(wildcard *.txt)))
clean:
	rm -f *.html
.PHONY: all clean

@ -0,0 +1,96 @@
== Creating and using multiple indexes
=== Why would you want to do this ?
- Easy adjustment of search areas: you can filter results by using the
directory filter in the advanced search panel, but, if you have
separate well defined places where you store different kinds of data,
it is easier to maintain separate indexes and use the External indexes
dialog to switch them on or off, and it will also yield much better
search performance.
- Shared indexes: it may be useful to maintain one or several indexes
for shared data, and separate personal indexes for each user. Indexes
can be shared over the network.
- Creating separate indexes for removable volumes.
=== How to do it
As an example we'll suppose that you have Recoll installed and indexing
your home directory, and that you would like to have a separate index for
/usr/share/doc.
You need to create a separate configuration for the new index, then add it
to the external indexes list in the user interface, and activate it as
needed.
. Create a directory for the new index, and create an empty configuration
file
+
----
cd
mkdir .recoll-sharedoc
touch .recoll-sharedoc/recoll.conf
----
. Either edit the new configuration by hand or start recoll to use the GUI
configuration editor.
+
----
cd .recoll-sharedoc
echo "topdirs = /usr/share/doc" > recoll.conf
# OR
recoll -c ~/.recoll-sharedoc
----
+
If using the GUI, click _Cancel_ when asked, to start the configuration
editor.
. Perform initial indexing. If you chose the GUI route, indexing will
start as soon as you leave the configuration editor. Else, on the
command line:
+
----
recollindex -c ~/.recoll-sharedoc
----
. Optionally set up *cron* to perform nightly indexing, use +crontab -e+
and insert a line like the following:
+
----
45 20 * * * recollindex -c ~/.recoll-sharedoc
----
+
This would start the indexing at 20:45. `crontab -e` will use the *vi*
editor by default. You can change this by using the EDITOR
environment variable. Example: `EDITOR=kate crontab -e`
Your favorite desktop may also have a dedicated tool to add crontab entries.
. Start recoll and choose the _Preferences->External index dialog_ menu
entry, then click the Browse button (near the bottom), and select the
new index Xapian database directory '~/.recoll-sharedoc/xapiandb'.
Then click _Add index_.
. You can then activate or deactivate the new index by clicking the box
in front of the directory name in the list.
When adding an index shared by multiple users, it may be helpful to use the
RECOLL_EXTRA_DBS environment variable instead of editing individual
configurations, see the manual for more details.
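The first two steps of the procedure can also be scripted. The following is
a minimal sketch, assuming that a 'recoll.conf' holding only a `topdirs`
line is enough for the new index, as in the shell example above; the
`create_index_config` function is hypothetical.

```python
from pathlib import Path


def create_index_config(home, confname, topdirs):
    """Create <home>/<confname>/recoll.conf with a topdirs line; return its path."""
    confdir = Path(home) / confname
    confdir.mkdir(parents=True, exist_ok=True)
    conffile = confdir / "recoll.conf"
    conffile.write_text("topdirs = " + " ".join(topdirs) + "\n")
    return conffile
```

For example, `create_index_config(Path.home(), ".recoll-sharedoc",
["/usr/share/doc"])` prepares the configuration used in this page, after
which `recollindex -c ~/.recoll-sharedoc` performs the initial indexing.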
=== Path adjustments
When sharing indexes over a network, in most cases, the indexed data will
be accessible through different paths on the different hosts. This will
prevent the Preview and Open functions from working because the paths they get
from the index do not match the ones which are usable from the local
host.
For example my home directory is accessed as '/home/me' on my home
machine, and as '/net/myhost/home/me' on other hosts. By default, trying
to access a result from a remote host would use the first path, when the
second is the one that would work.
As of release 1.19 **Recoll** has a facility to perform index-dependent
path translations. This facility is accessible from the _external index
dialog_ in the GUI preferences. Path translations can be set for the main
index if no index is selected (rarely useful), or for the selected
additional index.
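The idea behind such a translation is a simple prefix replacement, which
can be sketched as follows. This only illustrates the principle and is not
how Recoll implements it: the `translate_path` function and the
dictionary-based translation table are hypothetical.

```python
def translate_path(path, translations):
    """Replace the longest matching directory prefix of path, if any."""
    for old in sorted(translations, key=len, reverse=True):
        prefix = old.rstrip("/")
        # Only match on whole path components: "/home/me" must not
        # match "/home/meta".
        if path == prefix or path.startswith(prefix + "/"):
            return translations[old].rstrip("/") + path[len(prefix):]
    return path


table = {"/home/me": "/net/myhost/home/me"}
print(translate_path("/home/me/doc/report.pdf", table))
# /net/myhost/home/me/doc/report.pdf
```

With the example above, a result stored as '/home/me/...' in the shared
index becomes usable from a remote host.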

@ -0,0 +1,77 @@
== Interfacing Recoll and Mutt
It is possible to either use Mutt as a Recoll search result viewer, or
start Recoll from the Mutt search.
=== Starting Mutt to view Recoll search results
This method and the associated
link:http://www.recoll.org/files/recoll2mutt[recoll2mutt script] were kindly
contributed by Morten Langlo.
This allows finding mail messages in recoll and then calling *mutt*
or *mutt-kz* to read or process the mail.
Installation:
- Copy the link:http://www.recoll.org/files/recoll2mutt[recoll2mutt script]
somewhere in your PATH, and make it executable.
- In the **recoll** GUI menus:
_Preferences->GUI configuration->User interface->Choose editor applications_
change the entry for "message/rfc822" to: +recoll2mutt %f+
The script has options for setting a number of parameters, you may not need
to set any of them, the defaults are:
- -c mutt
- -F .muttrc
- -m Mail
- -x "-fn 10*20 -geometry 115x40"
Example:
----
recoll2mutt -c mutt-kz -F .mutt_kzrc -m Mail -x "-fn 10*20 -geometry 115x40" %f
----
The option +-x+ is passed to *xterm*, which is used to call *mutt* or
*mutt-kz*.
The script works for both _mbox_ and _maildir_ mail boxes, and it
expects the configuration file for mutt and the mail directory to reside in
your $HOME and the spool file to be '/var/spool/mail/$USER' if it is
not in your mail directory. But it is easy to change the values in the
script if you need to.
*mutt* is opened with the right mailbox and limit set to _Date_ and
_Sender_. In theory you could set limit to _Message-Id_, but very often
*mutt* reports that there are invalid patterns in _Message-Id_, so the
script plays it safe, even though all emails in the opened mail box with
the same date from the sender are shown.
=== Starting Recoll from the Mutt search
This will work only when using maildir storage (messages in individual
files). It will not work with mailbox files. The latter would probably be
possible by extracting the individual result messages using the Python
interface, but I did not try.
The classic way to interface Mutt and a search application is to create a
shortcut to an external command which creates a temporary Maildir
containing the search results.
There is such a script for Recoll, you will find it link:https://bitbucket.org/medoc/recoll/raw/41d41799dbac4c69a34db985b3ab9f1597c9c742/src/python/samples/mutt-recoll.py[here].
Copy the script somewhere in your PATH, and make it executable, then add
the following line to your '.muttrc':
----
macro index S "<enter-command>unset wait_key<enter><shell-escape>mutt-recoll.py -G<enter><change-folder-readonly>~/.cache/mutt_results<enter>" \
"search mail (using recoll)"
----
Obviously, you can replace the 'S' letter with whatever will suit you (e.g. '/').

View File

@ -0,0 +1,85 @@
== Unix and non-ASCII file names, a summary of issues
Unix/Linux file and directory names are binary byte C strings. Only the
null byte and the slash character (/) are forbidden inside a name. Nowhere
does the kernel interpret the strings as meaningful or printable.
In the old times, all utilities that would display to the user were
ASCII-based, and people would use pure printable ASCII file names (even
using space characters inside names was a cause for trouble). Non
alphanumeric characters were exclusively used for playing tricks on
colleagues. And all was well.
Then the devil came under the guise of accented 8 bit characters. The
system has no problem with them, file names are still binary C strings, but
the utilities have to display them or take them as input, and, because
there is no encoding specification stored with the file names, they can
only do this according to the character encoding taken from the user's
current locale.
For example fr_FR.UTF-8 and fr_FR.ISO8859-1 could be used simultaneously
on the same system (by different users), but they are completely
incompatible: ISO-8859-1 strings containing accented characters are illegal
when viewed in a UTF-8 locale (they will display as question marks or some
other conventional error marker). UTF-8 strings will display as gibberish in
an ISO-8859-1 locale. This means that the file names created by a UTF-8 user
are displayed as garbage to the ISO-8859 one...
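The incompatibility is easy to observe from a shell. The following is only a
minimal sketch, assuming that *iconv* is installed; the helper name
`is_valid_utf8` is made up for the illustration. It tests the two encodings
of 'é' (the single byte 0xE9 in ISO-8859-1, the two bytes 0xC3 0xA9 in
UTF-8) for validity as UTF-8:

```shell
#!/bin/sh
# Hypothetical helper (not part of Recoll): succeeds if the argument is a
# valid UTF-8 byte sequence. iconv fails on input which is illegal in the
# source encoding.
is_valid_utf8() {
    printf '%s' "$1" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# The same name encoded both ways, written with octal escapes so that the
# script itself stays pure ASCII.
name_utf8=$(printf 'caf\303\251')
name_latin1=$(printf 'caf\351')

for name in "$name_utf8" "$name_latin1"; do
    if is_valid_utf8 "$name"; then
        echo "valid UTF-8"
    else
        echo "not valid UTF-8"
    fi
done
```

The kernel stores both names without complaint. Only the tools that have to
interpret the bytes can tell them apart.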
If you ever change your locale, your old files are still there and named
the same (in the binary sense), but the names display badly and you have
great trouble inputting them. If you add distributed (NFS) file system
issues, things become totally unmanageable. Also think about archives sent
from another system with a different encoding.
As far as Recoll is concerned:
- The file names inside recoll.conf are not transcoded, they are taken as
binary strings (mostly, only +\n+ and +space+ are a bit special), and
passed as is to the system. So if you edit 'recoll.conf' with a text
editor, inside the same locale that is or has been used for file names,
you'll be fine.
- There was a bug in the GUI configuration tool, up to 1.12: it should have
transcoded between the internal Qt format and locale-dependent strings,
but it did not, or did it badly.
- There is also an exception for the +unac_except_trans+ variable, this
*has* to be UTF-8, so if the rest of the file uses another encoding,
you'll need to edit two separate files and concatenate them.
As of version 1.13, Recoll uses local8Bit()/fromLocal8Bit() to convert
recoll.conf file names from/to QStrings (it uses UTF-8 for all string
values which are not file names).
The Qt file dialog is broken (at least was, I have not checked this on
recent versions). It should consider file paths as almost-binary data, not
QStrings, but doesn't. In consequence, things are even more broken than
necessary as seen from there:
With LANG="C", non-ASCII paths can't be used at all:
- Strings read from recoll.conf are stripped of 8bit characters before display.
- Directory entries with 8bit characters are not displayed at all in the
selection dialog.
With LANG="fr_FR.UTF-8", only UTF-8 paths can be used:
- Strings read from recoll.conf are damaged when converted to QString
(except those that were actually UTF-8)
- Only the UTF-8 directory entries are displayed in the selection dialog.
With LANG="fr_FR.iso8859-1", everything works ok.
- Strings read from recoll.conf are displayed with weird characters if
they use another encoding such as UTF-8, but are correctly maintained
and can be read back from the dialogs and rewritten without damage.
- Directory entries with 8 bit characters are displayed weirdly (normal),
but can be manipulated without trouble (this includes utf-8 names of
course).
In conclusion, only the iso-8859 locales can be used for handling mixed
encoding situations. This is a possible workaround for people who need it.
More data about path encoding issues:
http://www.dwheeler.com/essays/fixing-unix-linux-filenames.html

View File

@ -0,0 +1,71 @@
== Starting native applications
It is sometimes difficult to start a native application on a result
document, especially when the result comes from a container file (e.g. an
email folder file or a chm file).
The problem is that native applications usually expect at most a file name
on the command line, and sometimes not even that (emailers).
The _Open parent documents_ link in the result list right click menu is
sometimes useful in this situation (e.g.: +chm+ files).
In some other cases it may help that Recoll does make a lot of data
available to the application. This data may have to be pre-processed in a
script before calling the actual application.
Details about configuring how the native application or script is called
are given with the
link:http://www.recoll.org/usermanual/usermanual.html#RCL.INSTALL.CONFIG.MIMEVIEW[description of the mimeview configuration file].
Information about
link:http://www.recoll.org/usermanual/usermanual.html#RCL.INSTALL.CONFIG.FIELDS[configuring
customised fields] may also be useful in combination.
=== Example
This is a simple example, because it does not need to use special
fields. It just shows how to solve a simple issue by using an intermediary
script. The problem is that thunderbird's +-file+ option won't open a file
if the extension is not '.eml'. Jorge, the kind Recoll user who supplied the
example, stores his email in Maildir++ format, where the file names have no
extension, so an intermediary script is necessary to get thunderbird to open
them.
Note that this only works with messages stored in Maildir or MH format (one
message per file). As far as I know, there is no way to get Thunderbird to
open an arbitrary mbox file.
The 'recoll-thunderbird-open-file' script:
----
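#!/bin/sh
# Thunderbird's -file option only opens files with a .eml extension, so work
# on a temporary copy which has one. $1 is quoted to survive spaces in paths.
cp "$1" /tmp/$$.eml
thunderbird -file /tmp/$$.eml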
----
Create the file in an editor, save it somewhere, and make it executable
(`chmod +x recoll-thunderbird-open-file`).
The line for mail messages in the '~/.recoll/mimeview' file:
----
[view]
message/rfc822 = recoll-thunderbird-open-file %f
----
If the place where you saved the script is not in your PATH, you will need
to use the full path instead of just the script name, as in
----
[view]
message/rfc822 = /home/me/somewhere/recoll-thunderbird-open-file %f
----
You should then be able to open the messages in Thunderbird, which is
useful, for example, to handle the attachments.
With recent Recoll versions, if using the normal option of letting the
Desktop choose the _Open_ application to use (_Use Desktop default_),
you should also add +message/rfc822+ to the exceptions, and the whole
thing is probably more easily done from the Recoll GUI.

View File

@ -0,0 +1,27 @@
== Preventing indexing in a directory
=== Why would you want to do this ?
By default, recollindex (or the indexing thread inside the recoll Qt user
interface) will process your home directory and most of its subdirectories,
with the exception of some well known places (thumbnails, beagle and web
browser caches, etc.).
You may want to prevent indexing in some directories where you don't expect
interesting search results. This will avoid polluting the search result
lists, speed up indexing times and make the index smaller.
=== How to do it
There are two ways to block indexing at certain points: either by listing
specific paths, or by directory name pattern matches.
- Blocking specific paths: this is controlled by the skippedPaths variable
in the main configuration file. You can adjust the value either by
editing the file or by using the indexing configuration dialog:
_Preferences->Indexing configuration->Global parameters->Skipped paths_
- Using pattern matches: these are listed in the skippedNames variable in
the main configuration file. You can adjust the value either by editing
the file or by using the GUI: _Preferences->Indexing configuration->Local
parameters->Skipped names_
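When editing the file by hand, the two variables could be set as in the
following sketch. The helper name `add_skip_settings` and all the paths and
patterns are invented for the illustration. Only the +skippedPaths+ and
+skippedNames+ variable names come from the text above.

```shell
#!/bin/sh
# Hypothetical helper: append example skip settings to the recoll.conf
# found in the configuration directory given as $1.
add_skip_settings() {
    cat >> "$1/recoll.conf" <<'EOF'
# Illustrative values only: adapt the paths and patterns to your own tree.
# skippedPaths lists specific locations, skippedNames lists name patterns.
skippedPaths = /home/me/tmp /home/me/VirtualMachines
skippedNames = *.o *.bak .git node_modules
EOF
}
```

It would be called as `add_skip_settings ~/.recoll`. A value set in your own
'recoll.conf' overrides the default one, so it is probably wise to start
from the current +skippedNames+ value shown in the GUI rather than from an
empty list.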

View File

@ -0,0 +1,157 @@
== Gathering useful data for asking help about or reporting a Recoll issue
Once in a while it will happen that a Recoll program will either signal an
error, or even crash (either the *recoll* graphical interface or the
*recollindex* command line indexing command).
Reporting errors and crashes is very useful. It can help others, and it can
get your own problem solved.
Any problem report should include the exact Recoll and system versions.
If at all possible, reading the following and performing part of the
suggested steps will be useful. This is not a condition for obtaining help
though! If you have a problem and find the following difficult,
just contact the mailing list or the developers (see contacts on
link:https://www.recoll.org/support.html[the Recoll site support page]).
If the problem concerns indexing, and was initially found using the
*recoll* GUI, you should try to reproduce it using the
*recollindex* command-line indexer, which is much simpler and easier to
debug.
There are then two sources of useful information to diagnose the issue: the
debug log file and, possibly, in case of a crash, a stack trace.
Crash and other problem reports are of very high value to me, and I am
willing to help you with any of the steps described below if it is not
familiar to you. I do realize that not everybody is a programmer or a
system administrator.
=== Obtaining information from the log file
All Recoll commands write a varying amount of information to a common log file.
_All commands use the same log, and the file is reset every time a command
is started: so it is important to make a copy right after the problem
occurs (for example, do not start *recoll* after a *recollindex*
crash, this would reset the log). A workaround for this issue is to let the
messages go to the default +stderr+, and redirect this._
By default, the messages are output to +stderr+, and you probably don't even
see them if Recoll is started from the desktop. In this case, you need to
set the parameters so that output goes to a file, and the appropriate
verbosity level is set. When using the command-line, you may actually
prefer to redirect stderr to avoid the log-truncating issue described
above.
You can set the log parameters from the GUI _Indexing parameters_
section or by editing the '~/.recoll/recoll.conf' file: set the
+loglevel+ and +logfilename+ parameters. E.g.:
----
loglevel = 6
logfilename = /tmp/recolltrace
----
The log file can become very big if you need a big indexing run to
reproduce the problem. Choose a file system with enough space available
(possibly a few gigabytes).
Then run the sequence that leads to the problem, and make a copy of the log
file just after. If the log is too big, it will usually be sufficient to
use the last 500 lines or so (`tail -n 500`).
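Because the next Recoll command resets the log, it can help to make the copy
systematic. This is a small sketch and not part of Recoll: the
`save_log_tail` name is hypothetical, and the 500 line count is just the
value suggested above.

```shell
#!/bin/sh
# Hypothetical helper: copy the last 500 lines of the log file given as $1
# to the file given as $2, so that a later Recoll command cannot reset them.
save_log_tail() {
    tail -n 500 -- "$1" > "$2"
}
```

With the example settings used above, this could be run as
`save_log_tail /tmp/recolltrace /tmp/recolltrace-saved.txt` right after the
problem occurs.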
==== Single file indexing issues
When the problem concerns, or can be reproduced with, a single file it is
very cumbersome to have to run a full indexing pass to reproduce it. There
are two ways around this:
- Set up an ad hoc configuration with only the file of interest, or its
parent directory:
----
cd
mkdir recoll-test
cd recoll-test
echo 'topdirs = /path/to/my/file/or/its/parent/dir' > recoll.conf
echo 'loglevel = 6' >> recoll.conf
echo 'logfilename = /tmp/recolltrace' >> recoll.conf
recollindex -z -c .
----
- Use the -e and -i options to recollindex to erase/reindex a single
file. Set up the log, then:
----
recollindex -e /path/to/my/file
recollindex -i /path/to/my/file
----
When using the second approach, you must take care that the path used is
consistent with the paths listed/used in the configuration. For example, if
'/home' is a link to '/usr/home', and '/usr/home/me' is used in the
configuration +topdirs+, `recollindex -i /home/me/myfile` will not work: you
need to use `recollindex -i /usr/home/me/myfile`.
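If you are not sure which symbolic links are involved, the shell can print
the path with the links resolved. The following is only a sketch, and
`physical_path` is a made-up helper name. It is useful when +topdirs+ lists
the physical location, as in the example above. If the configuration uses
the link name instead, use that form.

```shell
#!/bin/sh
# Hypothetical helper: print the path of the file $1 with the symbolic links
# in its directory part resolved.
physical_path() {
    dir=$(dirname -- "$1") || return 1
    base=$(basename -- "$1") || return 1
    # cd -P follows symbolic links, pwd -P prints the resulting physical path.
    # The subshell keeps the caller's current directory unchanged.
    ( cd -P -- "$dir" && printf '%s/%s\n' "$(pwd -P)" "$base" )
}
```

Usage would then look like `recollindex -i "$(physical_path /home/me/myfile)"`.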
=== Obtaining a stack trace
If the program actually crashes, and in order to maximize usefulness, a
crash report should also include a so-called stack trace, something that
indicates what the program was doing when it crashed. Getting a useful
stack trace is not very difficult, but it may need a little work on your
part (which will then enable me to do my part of the work).
If your distribution includes a separate package for Recoll debugging
symbols, it probably also has a page on its web site explaining how to use
them to get a stack trace. You should follow these instructions. If there
is no debugging package, you should follow the instructions below. A little
familiarity with the command line will be necessary.
==== Compiling and installing a debugging version
- Obtain the recoll source for the version you are using (www.recoll.org),
and extract the source tree.
- Follow the
link:http://www.lesbonscomptes.com/recoll/usermanual/rcl.install.building.html[instructions
for building Recoll from source] with the following modifications:
- Before running configure, edit the mk/localdefs.in file and remove the
-O2 option(s).
- When running configure, specify the standard installation location for
your system as a prefix (to avoid ending up with two installed versions,
which would almost certainly end in confusion). On Linux this would
typically be: `configure --prefix=/usr`
- When installing, arrange for the installed executables not to be stripped
of debugging symbols by specifying a value for the STRIP environment
variable (e.g. *echo* or *ls*): `sudo make install STRIP=ls`
==== Getting a core dump
You will need to run the operation that caused the crash inside a writable
directory, and tell the system that you accept core dumps. The commands
need to be run in a shell inside a terminal window. E.g.:
----
cd
ulimit -c unlimited
recoll #(or recollindex or whatever you want to run).
----
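Before trying to reproduce the crash, you can check that the limit is really
in effect in the current shell. This is a sketch with a hypothetical helper
name. It only looks at the shell's soft limit.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the current shell allows core files to be
# written (the core file size limit is not 0).
core_dumps_enabled() {
    [ "$(ulimit -c)" != "0" ]
}

if core_dumps_enabled; then
    echo "core dumps are enabled in this shell"
else
    echo "core dumps are disabled: run 'ulimit -c unlimited' first"
fi
```

On some Linux systems the core file may be handed to a collecting service
instead of being written to the current directory, depending on the value
found in '/proc/sys/kernel/core_pattern'.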
Hopefully, you will succeed in getting the command to crash, and you will
get a core file. A possible approach then would be to make both the
executable and the core file available to me by uploading them to a file
sharing site (the core file may be quite big). You should be aware though
that the core file may contain some of the data that was being indexed,
which may be a privacy issue. Another approach is to generate the stack
trace yourself.
==== Using gdb to get a stack trace
- Install gdb if it is not already on the system.
- Run gdb on the command that crashed and the core file (depending on the
system, the core file may be named "core" or something else, like
recollindex.core, or core.pid), e.g.: `gdb /usr/bin/recollindex core`
- Inside gdb, you need to use different commands to get a stack trace for
recoll and recollindex. For recollindex you can use the bt command. For
recoll use `thread apply all bt full`
- Copy/paste the output to your report email :), and quit gdb ("q").
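The interactive steps can also be run in one go with gdb's batch mode. The
sketch below assumes the same executable and core file names as above. The
`save_stack_trace` helper name and the output file are made up.

```shell
#!/bin/sh
# Hypothetical helper: write a full stack trace for all threads to a file.
# $1 = executable that crashed, $2 = core file, $3 = output text file
save_stack_trace() {
    # -batch exits after running the commands given with -ex
    gdb -batch -ex "thread apply all bt full" "$1" "$2" > "$3" 2>&1
}
```

For example: `save_stack_trace /usr/bin/recollindex core /tmp/recoll-trace.txt`.
The resulting text file can then be attached to the report.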

View File

@ -0,0 +1,61 @@
== Starting native applications ==
Another example of using an intermediary script for an application with a
command line syntax which can't be directly defined in mimeview.
We use a script to preprocess and adapt the options before calling the
actual command.
Details about configuring how the native application or script are called
are given with the
link:http://www.recoll.org/usermanual/usermanual.html#RCL.INSTALL.CONFIG.MIMEVIEW[description
of the mimeview configuration file].
*qpdfview* (link:http://launchpad.net/qpdfview[web site]) is a very
lightweight tabbed PDF viewer with great search performance and result
highlighting.
It does support parsing the search term and page number from the command
line with the following syntax:
----
qpdfview --unique "%f"#%p --search "%s"
----
However, qpdfview will not launch if either %p or %s is empty in the
command above. To work around this, Recoll user Florian has written a
small wrapper shell script:
----
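#!/bin/bash
# Wrapper for qpdfview: $1 = file, $2 = page number (may be empty),
# $3 = search term (may be empty).
qpdfviewpath=qpdfview
target="$1"
# Append "#page" only when a page number was supplied
if [ -n "$2" ]; then
    target="$target#$2"
fi
# Build the argument list in an array so that a multi-word search term is
# passed as a single argument
args=(--unique "$target")
if [ -n "$3" ]; then
    args+=(--search "$3")
fi
"$qpdfviewpath" "${args[@]}" >&0 2>&0 &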
----
The corresponding handler line for Recoll would be (depending on how you
name the script and where you store it):
----
qpdfviewwrapper %f %p %s
----

View File

@ -0,0 +1,18 @@
== Querying Recoll from a C program
The easiest way to query Recoll from a C or C++ program is to execute an
external search command (`recollq` or `recoll -t`).
I have written a simple C module which deals with the related housekeeping
and presents an easy to use API to the rest of the code. You will find it
here:
https://bitbucket.org/medoc/recoll-capi
It is a bit experimental and will only work with recoll 1.20 for now
(because it uses a new option for recollq). However, it would be trivial to
modify for working with 1.19: get in touch with me if you need this.
The other approach is to link with the Recoll library. This has no official
API, but in practice, the internal one is fairly stable, and if you want to
choose this approach, you should start from the code in recollq.cpp.

View File

@ -0,0 +1,58 @@
== Replacing the Category filter controls
The document category filter controls normally appear at the top of the
*recoll* GUI, either as checkboxes just above the result list, or as a
drop-down list in the tool area.
By default, they are labeled _Media_, _Message_, _Spreadsheet_, _Text_,
etc. and each map to a document category.
The mapping used to be fixed. You could change the number and composition
of categories by redefining them inside the 'mimeconf' configuration
file (you still can), but the filters always used document categories.
Categories can also be selected from the query language by using an
+rclcat:+ selector. E.g.: _rclcat:message_.
As of Recoll release 1.17, the filters are not hard-wired any more. They
map to query language fragments. This means that you can freely redefine
what they do.
The associations are configured inside the 'mimeconf' file, in the
+[guifilters]+ section. Most GUI parameters are stored in the *Qt*
configuration file, so this is not entirely consistent, and you will have
to bear with my laziness here.
A simple example will hopefully make things clearer. If you add the
following to your '~/.recoll/mimeconf' file:
----
[guifilters]
Big Books = dir:"~/My Books" size>10K
My Docs = dir:"~/My Documents"
Small Books = dir:"~/My Books" size<10K
System Docs = dir:/usr/share/doc
----
You will have four filter checkboxes, labelled _Big Books_, _My Docs_, etc.
The text after the equal sign must be a valid query language fragment, and
will be translated to a *Recoll* query and combined with the rest of the
query with an AND conjunction.
Any name text before a colon character will be erased in the display, but
used for sorting. You can use this to display the checkboxes in any order
you like. For example, the following would do exactly the same as above,
but would display the checkboxes in the reverse order.
----
[guifilters]
d:Big Books = dir:"~/My Books" size>10K
c:My Docs = dir:"~/My Documents"
b:Small Books = dir:"~/My Books" size<10K
a:System Docs = dir:/usr/share/doc
----

View File

@ -0,0 +1,23 @@
== Result list thumbnails and how to create them
Recoll will display thumbnails for the results if the images exist in the
standard location ('$HOME/.thumbnails' or '$HOME/.cache/thumbnails' depending
on the xdg version).
But it will not create thumbnails, mainly because it is very hard to do
portably.
Thumbnails are most commonly created when you visit a directory with your
file manager, but visiting the whole file tree just to create thumbnails is
a bit tedious.
One simple trick to create thumbnails from the recoll GUI is to visit the
parent directory for a result by using the _Open parent document/folder_
entry in the right-click menu.
You can also find tools for the systematic creation of thumbnails for a
directory tree. Three such tools are discussed in this
link:http://askubuntu.com/questions/199110/how-can-i-instruct-nautilus-to-pre-generate-pdf-thumbnails[askubuntu.com discussion].
Also please note that no thumbnails can currently be generated or displayed
for embedded documents (attachments, archive members, etc.).

View File

@ -0,0 +1,61 @@
== User configuration backup
=== Why you would want to do this
If you are going to reinstall your system, and have some custom
configuration, you may save some time by making a backup of your
configuration and restoring it on the new system, rather than going through
the menus to recreate it.
=== How to do it
==== Index/search configuration
The main recoll configuration data is normally kept inside '~/.recoll' or
whatever *$RECOLL_CONFDIR* is set to.
This directory contains both configuration files and generated index
data. In a standard configuration, the following files and directories
contain generated data:
- 'xapiandb' contains the Xapian index, which normally consumes most of the
total space.
- 'aspdict.en.rws' contains the aspell dictionary used for spelling
corrections.
- 'mboxcache' contains cached offset data for email messages inside mbox
folders.
- 'webcache' contains saved web pages. This is more than a cache as
destroying it will purge the corresponding data during the next
indexing.
The other files are either very small or contain configuration data.
If you want to only save configuration, using minimum space, you can
destroy the above files and directories (with the possible exception of
'webcache'). Then taking a copy of the '.recoll' directory and adding the
GUI configuration data described in the next section will get you a full
configuration data backup.
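If you would rather not destroy anything, an alternative is to exclude the
generated data while archiving. This is a sketch under assumptions: it
relies on a *tar* which supports `--exclude` (GNU tar and bsdtar do), the
`backup_recoll_config` name is invented, and the excluded names are the ones
listed above for a standard configuration.

```shell
#!/bin/sh
# Hypothetical helper: archive a Recoll configuration directory without the
# generated index data. $1 = configuration directory, $2 = archive to create.
backup_recoll_config() {
    parent=$(dirname -- "$1") || return 1
    name=$(basename -- "$1") || return 1
    # webcache is kept on purpose: it is more than a cache
    tar -czf "$2" \
        --exclude='xapiandb' \
        --exclude='mboxcache' \
        --exclude='aspdict.*.rws' \
        -C "$parent" "$name"
}
```

For example: `backup_recoll_config ~/.recoll /tmp/recoll-config.tar.gz`. The
GUI configuration file lives elsewhere and still has to be copied
separately.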
==== GUI configuration
The parameters set from the _Query configuration_ Qt menus are stored in
Qt standard places:
- '~/.qt/recollrc' for Qt 3.x
- '~/.config/Recoll.org/recoll.conf' for Qt 4 and later
==== Other data
If you wish to save index data in addition to the customisation files,
which only makes sense if the document access paths do not change after
reinstallation, you can just take a backup of the full '.recoll'
directory. Take care that the storage locations for some data elements
can be changed, so that they may not be inside '.recoll':
- The index data is normally kept inside '~/.recoll/xapiandb', but the
location of this directory can be modified by the +dbdir+
configuration parameter if it is set (check 'recoll.conf').
- If you use the Firefox Recoll plugin, the WEB history cache is normally
kept inside '~/.recoll/webcache', but the location can be modified by
the +webcachedir+ configuration parameter.

View File

@ -0,0 +1,109 @@
== Building and Installing the Ubuntu Unity Recoll Lens
Important preliminary notes:
- This only makes sense for Ubuntu versions using the Unity environment:
Natty (11.04), Oneiric (11.10), Precise (12.04), and later.
- _Remember that you still need to use the recoll GUI (or the recollindex
command) to get the indexing going!_
- The Lens is artificially limited to showing at most 20 results. Use the
recoll GUI for more complete capabilities (or edit rclsearch.py, change
the "if actual_results >= 20:" line).
=== The Lens with Recoll 1.17 and later
If you are willing to install or upgrade to Recoll version 1.17, all
necessary packages are on the Recoll PPA. You just need to add the
repository to your system sources and add or upgrade the packages. *_This
is the recommended approach!_*
----
sudo add-apt-repository ppa:recoll-backports/recoll-1.15-on
sudo apt-get update
sudo apt-get install recoll-lens recoll
----
This document may still be useful if you want to modify the lens source
code.
=== The Lens with older Recoll versions
If, for some reason, you wish to test the Lens with an older Recoll
version, read the following.
Please note that such an installation is somewhat crippled: you will not be
able to display results for embedded documents (emails inside an mbox,
attachments etc.). This requires a recoll command line option which is only
available in 1.17.
The Lens is based on the Recoll Python module, which is not built by default
for versions prior to 1.17, so you will first need to pull the Recoll
source code (for your version), then untar it and proceed with the
configure/build instructions below.
The following uses --prefix=/usr. I have no real reason to believe
that this would not work with /usr/local (lenses are also searched there by
default). If you confirm that things work with another prefix, please drop
me a line.
When doing this over a previous Recoll compilation, run a "make clean" to
get rid of the non-PIC objects.
Note that the following instructions change nothing in your existing Recoll
installation. They only install the Python module and the Unity Lens;
recoll, recollindex etc. are unaffected.
'/TOP/OF/RECOLL/SRC' designates the top of the recoll source tree.
=== Configure and build the recoll library and python module, install the module
The following needs the development packages for Xapian, Python and zlib.
----
cd /TOP/OF/RECOLL/SRC
# May fail if no previous build was performed
make clean
# the gui/x11 disabling is just here to avoid having to install the
# development libraries for Qt.
configure --prefix=/usr --enable-pic --without-x --disable-qtgui
make
cd python/recoll
python setup.py build
sudo python setup.py install
----
=== Build and install the Unity Lens
----
cd /TOP/OF/RECOLL/SRC
cd desktop/unity-lens-recoll
configure --prefix=/usr --sysconfdir=/etc
sudo make install
----
Voilà, it should work...
Try to start the Dash, you should see the Recoll checkerboard (or
whatever...) in the Lens list.
The Recoll Lens expects a Recoll query language string, so you can use
field searches, directory, size, and date filtering (see the
link:http://www.lesbonscomptes.com/recoll/usermanual/rcl.search.lang.html[Recoll
manual] for a description of the query language).
If you want to disable the Lens, I think that you just have to delete
'/usr/share/unity/lenses/recoll'.
Other installed files:
----
/usr/libexec/unity-recoll-daemon
/usr/share/dbus-1/services/unity-lens-recoll.service
/usr/share/doc/unity-lens-recoll
/usr/share/unity-lens-recoll
----

View File

@ -0,0 +1,68 @@
== Using the _Open With_ context menu in recoll 1.20 and newer
Recoll versions 1.20 and newer have an _Open With_ entry in the result list
context menu (the thing which pops up on a right click).
This allows choosing the application used to edit the document, instead of
using the default one.
The list of applications is built from the desktop files found inside
'/usr/share/applications'. For each application on the system, these
files list the mime types that the application can process.
If the application which you would want listed does not appear, the most
probable cause is that it has no desktop file, which could happen due to a
number of reasons.
This can be fixed very easily: just add a +.desktop+ file to
'/usr/share/applications', starting from an existing one as a template.
As an example, based on an original idea from Recoll user +florianbw+,
the following describes setting up a script for editing a PDF document
title found in the recoll result list.
The script uses the *zenity* shell script dialog box tool to let you
enter the new title, and then executes *exiftool* to actually change
the document.
----
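#!/bin/sh
# $1: path of the PDF document whose title is to be edited
PDF=$1
# Read the current title (-s3 prints the bare value only)
TITLE=`exiftool -Title -s3 "$PDF"`
# Ask for the new title, proposing the current one as the default text
RES=`zenity --entry \
    --title="Change PDF Title" \
    --text="Enter the Title:" \
    --entry-text "$TITLE"`
if [ "$RES" != "" ]; then
    # Update the document, then reindex it so that Recoll sees the new title
    printf '%s' "Changing title to $RES ... " && \
    exiftool -Title="$RES" "$PDF" && \
    recollindex -i "$PDF" && echo "Done!"
else
    echo "No title entered"
fi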
----
Name it, for example, 'pdf-edit-title.sh', and make it executable
(`chmod a+x pdf-edit-title.sh`).
Then create a file named 'pdf-edit-title.desktop' inside
'/usr/share/applications'. The file name does not need to be the same as the
script's, this is just to make things clearer:
----
[Desktop Entry]
Name=PDF Title Editor
Comment=Small script based on exiftool used to edit a pdf document title
Exec=/home/dockes/bin/pdf-edit-title.sh %F
Type=Application
MimeType=application/pdf;
----
You're done ! Restart Recoll, perform a search and right-click on a PDF
result: you should see an entry named _PDF Title Editor_ in the _Open
With_ list. Click on it, and you will be able to edit the title.

View File

@ -0,0 +1,99 @@
== Using the log file to investigate indexing issues
All *Recoll* processes print trace messages. By default these go to the
standard error output, and you may not ever see them (in the case, for
example, of the *recoll* GUI started from the desktop interface).
There are a number of potential issues with indexing that may need
investigation, such as:
- A file can't be found by searching even if it appears that it should have
been indexed (this could happen because the file is not selected at all or
because a filter program crashes).
- The indexing process gets stuck and never finishes.
- The indexing process ends up with an error.
- The indexing process seems to be using too much system capacity.
The right way to approach these problems is to use the *recollindex*
command line tool (instead of the *recoll* GUI), and to set up the
trace log to provide information about what indexing is actually doing.
Trace log parameters can be set either from the GUI _Preferences->Indexing
Configuration->Global Parameters_ panel, or by editing the configuration
file '~/.recoll/recoll.conf'. You should set the following parameters:
----
loglevel = 6
logfilename = stderr
thrQSizes = -1 -1 -1
----
We use _stderr_ instead of an actual file in order to capture direct filter
messages (such as a *python* stack trace) along with normal
*recollindex* messages.
The last line sets recollindex for single-threaded operation, which will
make the log much more readable.
You should then check that no *recoll* or *recollindex* process is
currently running, and kill any you find.
Then, if this is an issue about an identified file, try indexing it only:
----
recollindex -i myunfindablefile.xxx > /tmp/myindexlog 2>&1
----
If this is a general issue with indexing (process not finishing properly),
just start it:
----
recollindex > /tmp/myindexlog 2>&1
----
Usually, having a look at the trace will let you see what is wrong (e.g.:
a configuration issue or missing filter), and solve the problem.
In case of indexer misbehaviour (e.g. using too much memory), you should run
_tail -f_ on the log to see what is going on.
If this is not enough, please
link:http://bitbucket.org/medoc/recoll/issues/new[open a tracker issue] and
attach or link to the log data, or just email me (jfd at recoll.org).
*recollindex* and *recollindex -i* usually have the same criteria to
include a file or not (but see the _Path gotcha_ note below). It may
happen that they behave differently, so it may sometimes be useful to run a
full *recollindex* even for a specific file, but this will produce a
big log file.
When you are done, it is better to reset the verbosity to a reasonable
level (e.g.: +2+ : just errors, +4+ : basic traces).
=== Note: the path gotcha
*recollindex -i* will only index files under the directories defined by the
+topdirs+ configuration variable (your home directory by
default). Unfortunately, the test is done on the file path text, ignoring
possible symbolic links. If you give a simple file name as a parameter to
*recollindex -i* and there are symbolic links inside the +topdirs+
entries, the comparison may fail. For example, if your home directory is
'/home/me/' and '/home/' is a link to '/usr/home/', *recollindex -i
somefilename* will actually try to index '/usr/home/me/somefilename', and
fail (because '/usr/home/me/' is not a subdirectory of '/home/me/'). This
will manifest itself in the log by a message like the following:
----
:4:../index/fsindexer.cpp:149:FsIndexer::indexFiles: skipping [/usr/home/me/somefile] (ntd)
----
If this happens, give a full path consistent with what is found in the
configuration file (e.g.: _recollindex -i /home/me/somefile_).
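To find out quickly whether this is what happened, the skipped paths can be
extracted from a saved log. The following sketch relies on the
"skipping [path]" message format shown above, which may differ between
Recoll versions. The `skipped_paths` name is made up.

```shell
#!/bin/sh
# Hypothetical helper: list the paths which the indexer reported as skipped
# in the log file given as $1.
skipped_paths() {
    # Keep only the text between the brackets of the "skipping [...]" lines
    sed -n 's/.*skipping \[\(.*\)\].*/\1/p' "$1"
}
```

With the redirection used earlier, this would be run as
`skipped_paths /tmp/myindexlog`. Any path which is printed in a form
different from the one used in +topdirs+ is a candidate for the symbolic
link problem.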
=== File system occupation
One of the possible reasons for failed indexing is a +maxfsoccup+
parameter set too low. This is the value of file system occupation, not
free space, where indexing will stop. It is set from the GUI indexing
configuration or by editing 'recoll.conf'. A value of 0 implies no
checking, but a very low, non-zero, value will just prevent indexing.
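The difference between occupation and free space can be checked with a short Python sketch. It assumes that the value is a percentage of the file system size and that indexing stops once the occupation reaches the threshold; both points are assumptions of the example, as are the helper names. The figure computed here may also differ slightly from what *df* reports.

```python
import shutil

def occupation_percent(path):
    """Return the occupied share of the file system holding 'path', in percent."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total

def indexing_allowed(occupation, maxfsoccup):
    """Hypothetical check: 0 disables the test, else compare to the threshold."""
    if maxfsoccup == 0:
        return True
    return occupation < maxfsoccup

current = occupation_percent("/")
print("occupation: %.1f%%" % current)
print("indexing allowed with a threshold of 90:", indexing_allowed(current, 90))
```

With a threshold of 1, almost any file system in actual use would be considered too full, which is how a very low value prevents indexing altogether.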

View File

@ -0,0 +1,65 @@
== Recoll Wiki file index
link:ElinksWeb.html[Extending the Recoll Firefox visited web page indexing mechanism to other browsers]
link:FaqsAndHowTos.html[Faqs and Howtos]
link:FilterArch.html[Recoll input filters ]
link:FilterRetrofit.html[Installing a filter for a new document type]
link:FilteringOutZipArchiveMembers.html[Filtering out Zip archive members]
link:GUIKeyboard.html[# Recoll GUI keyboard navigation]
link:HandleCustomField.html[Generating a custom field and using it to sort results]
link:Home.html[Welcome to the Recoll Wiki]
link:HotRecoll.html[Recoll hotkey: starting / hiding recoll with a keyboard shortcut]
link:IndexMailHeader.html[Indexing arbitrary mail headers ]
link:IndexMozillaCalendari.html[Indexing Mozilla calendar data ]
link:IndexOnAc.html[Laptops: automatically starting or stopping indexing according to AC power status]
link:IndexOutlook.html[Indexing Outlook archives]
link:IndexWebHistory.html[Indexing Web history with the Firefox extension ]
link:MultipleIndexes.html[Creating and using multiple indexes]
link:MuttAndRecoll.html[Interfacing Recoll and Mutt]
link:NonAsciiFileNames.html[Unix and non-ASCII file names, a summary of issues]
link:OpenHelperScript.html[Starting native applications ]
link:PreventIndexingDir.html[Preventing indexing in a directory]
link:ProblemSolvingData.html[Gathering useful data for asking help about or reporting a Recoll issue]
link:QpdfviewHelperScript.html[Starting native applications ]
link:QueryFromC.html[Querying Recoll from a C program]
link:ReplaceCategories.html[Replacing the Category filter controls]
link:ResultsThumbnails.html[Result list thumbnails and how to create them]
link:SavingConfig.html[User configuration backup]
link:UnityLens.html[Building and Installing the Ubuntu Unity Recoll Lens]
link:UsingOpenWith.html[Using the Open With context menu in recoll 1.20 and newer]
link:WhyIsMyFileNotIndexed.html[Using the log file to investigate indexing issues]
link:XDGBase.html[XDG: Tidying Recoll data storage]
link:ZDevCaseAndDiacritics1.html[Character case and diacritic marks (1), issues with stemming]
link:ZDevCaseAndDiacritics2.html[Character case and diacritic marks (2), user interface]
link:ZDevCaseAndDiacritics3.html[Character case and diacritic marks (3), implementation]

View File

@ -0,0 +1,42 @@
== XDG: Tidying Recoll data storage ==
The default storage structure of Recoll configuration and index data is
quite at odds with the recommendations of the
link:http://standards.freedesktop.org/basedir-spec/basedir-spec-latest.html[XDG
Base Directory Specification], the reason being that it predates said spec.
By default, Recoll stores all its data in a single directory: '$HOME/.recoll'.
This is not going to change, because it would be quite disturbing for
current users.
However, the location of this directory can be modified using the
+$RECOLL_CONFDIR+ environment variable.
Furthermore all significant Recoll data categories can be moved away from
the configuration directory (maybe to '$HOME/.cache'), by setting
configuration variables:
* _dbdir_ defines the location for storing the Xapian
index. This could be set to, e.g., '$HOME/.cache/recoll/xapiandb'. It is
quite recommended that
this directory be dedicated to Xapian (don't store other things in
there).
* _mboxcachedir_ defines the location for caching access speedup information
about mail folders in mbox format. e.g. '$HOME/.cache/recoll/mboxcache'
* New in 1.22: you can use _aspellDictDir_ to define the storage
location for the aspell spelling approximation
dictionary. E.g. '$HOME/.cache/recoll'
* _webcachedir_ may be used to define where the visited web pages
archive is stored. E.g. '$HOME/.cache/recoll/webcache'. This is only used
if you activate the Firefox plugin and web history indexing. You may
want to think a bit more about where to store it, because, contrary to
the above, this is not discardable data: your Recoll Web history goes
away if you delete it.
If you use multiple Recoll configurations, each will have to be customized.
Once these are put away, there are still a few modifiable files in the
configuration directory, for example the 'recoll.pid' and 'history'
files, but these are small files. Moving 'recoll.pid' away would be a
serious headache because it is used by scripts.
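As an illustration, the following Python sketch prints candidate 'recoll.conf' lines for the discardable data, using the cache directory defined by the XDG specification (`$XDG_CACHE_HOME`, or '$HOME/.cache' if it is unset). The helper names are made up for the example, and _webcachedir_ is left out on purpose because it does not hold discardable data.

```python
import os

def xdg_cache_home(environ=os.environ):
    """Return $XDG_CACHE_HOME, or $HOME/.cache when it is unset or empty."""
    value = environ.get("XDG_CACHE_HOME", "")
    if value:
        return value
    home = environ.get("HOME") or os.path.expanduser("~")
    return os.path.join(home, ".cache")

def recoll_cache_settings(environ=os.environ):
    """Map the configuration variables named above to locations under the cache."""
    base = os.path.join(xdg_cache_home(environ), "recoll")
    return {
        "dbdir": os.path.join(base, "xapiandb"),
        "mboxcachedir": os.path.join(base, "mboxcache"),
        "aspellDictDir": base,
    }

# Print the values as "name = value" lines, ready to be reviewed and pasted
for name, value in recoll_cache_settings().items():
    print(f"{name} = {value}")
```

The output should be reviewed before being added to each configuration in use.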

View File

@ -0,0 +1,143 @@
== Character case and diacritic marks (1), issues with stemming
=== Case and diacritics in Recoll
Recoll versions up to 1.17 almost fully ignore character case and diacritic
marks.
All terms are converted to lower case and unaccented before they are
written to the index. There are only two exceptions:
* File paths (as used in _dir:_ clauses) are not converted. This might
be a bug or a feature, but the main reason is that we don't know how they
are encoded.
* It is possible to specify that some characters will keep their diacritic
marks, because the entity formed by the character and the diacritic mark
is considered to be a different letter, not a modified one. This is
highly dependent on the language. For example, in Swedish, +å+ should
be preserved, not turned into +a+.
As a necessary consequence, the same transformations are applied to search
terms, and it is impossible to search for a specific capitalization of a
word (+US+ is looked for as +us+), or a specific accented form
(+café+ will be looked for as +cafe+).
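The transformation can be approximated in Python with the standard `unicodedata` module. This is a minimal sketch and not Recoll's own code: the `fold` and `unaccent` names and the `keep` parameter (the set of letters which must retain their diacritic mark) are assumptions of the example.

```python
import unicodedata

def unaccent(term, keep=frozenset()):
    """Remove diacritic marks, except on the characters listed in 'keep'."""
    out = []
    for ch in unicodedata.normalize("NFC", term):
        if ch in keep:
            out.append(ch)
            continue
        # Decompose the character and drop the combining marks
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)

def fold(term, keep=frozenset()):
    """Lowercase, then unaccent: the form used for both indexing and searching."""
    return unaccent(term.lower(), keep)

print(fold("US"))
print(fold("café"))
print(fold("Skåne"))
print(fold("Skåne", keep={"å"}))
```

Because the same function is applied to the indexed terms and to the search terms, both sides of the comparison end up in the same reduced form.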
However, there are some cases where you would like to be more specific:
* Searching for +US+ or +us+ should probably return different results.
* Diacritics are seldom significant in English, but we can find a
few examples anyway: +sake+ and +saké+, +mate+ and +maté+. Of
course, there are many more cases in languages which use more diacritics.
On the other hand, accents are often mistyped or forgotten (résumé, résume,
resume?), and capitalization is most often insignificant, so it is
very important to retain the capability to ignore accent and character
case differences, and to make the discrimination easy to switch on or
off for each search (or even for specific terms).
This text and other pages which will follow will discuss issues in adding
character case and diacritics sensitivity to Recoll, under the assumption
that the main index will contain the raw source terms instead of
case-folded and unaccented ones.
The following will use the _unaccent_ neologism to mean _remove
diacritic marks_ (and not only accents).
English examples are used when possible, but given the limited use of
diacritics in English, some French will probably creep in.
=== Diacritics and stemming
Stemming is the process by which we extend a search to terms related by
grammatical inflexion, for example singular/plural, verb tenses, etc. For
example a search for +floor+ is normally expanded by Recoll to +floors,
floored, flooring, ...+
In practice Recoll has a separate data structure that has stemmed terms
(stems) as keys pointing to a list of expansion terms:
+floor -> (floor, floors, floorings, ...)+
Stemming should be applied to terms before they are stripped of
diacritics. Accents may have a grammatical significance, and the accent may
change how the term is stemmed. For example, in French the +âmes+ suffix
generally marks a past conjugation but +ames+ does not. The standard
Xapian French stemmer will turn +évitâmes+ (avoided) into an +évit+ stem,
but +évitames+ will be turned into +évitam+ (stripping
plural and feminine suffixes).
When the search is set to ignore diacritics, this poses a specific problem:
if the user enters the search term without accents (which is correct
because the system is supposed to ignore them), there is no guarantee that
the term will be correctly expanded by stemming.
The diacritic mismatch breaks the family relationship between the stem
siblings, and this is independent of the type of index: it will happen with
an index where diacritics are stripped just as with a raw one.
The simpler case where diacritics in the original term only affect
diacritics in the stem also necessitates specific processing, but it is
easier to work around.
Two examples illustrating these issues follow.
==== The simple case: diacritics in the term only affect diacritics in the stem
Let's imagine that the document set contains the term +éviter+
(infinitive of +to avoid+), but not +évite+ (present). The only term in
the actual index is then +éviter+.
The user enters an unaccented +evite+, counting on the
diacritics-insensitive search mode to deal with the accents. As +évite+
is not present in the index, we have no way to guess that +evite+ is
really +évite+.
The stemmer will turn +evite+ into +evit+. There is no way that this
can be related to +éviter+, and this legitimate result can't be found.
There is a way around this: we can compute a separate
stem expansion dictionary for unaccented terms. This dictionary, to be used
with diacritics-insensitive searches only, contains the relationship
between +evit+ and +eviter+ (as +éviter+ is in the index). We can
then relate +eviter+ and +éviter+ because they differ only by accents,
and the search will find the document with +éviter+.
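The workaround can be sketched in Python as follows. The `TOY_STEMS` mapping is a hypothetical stand-in for the real stemmer, limited to the forms quoted above, and the function names are assumptions of the example.

```python
import unicodedata
from collections import defaultdict

def strip_term(term):
    """Lowercase a term and remove its diacritic marks."""
    decomposed = unicodedata.normalize("NFD", term.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))

# Hypothetical stand-in for the stemmer: only the forms used in the text
TOY_STEMS = {"eviter": "evit", "evite": "evit"}

def toy_stem(term):
    return TOY_STEMS.get(term, term)

def build_tables(index_terms):
    """Build the unaccented stem dictionary and the stripped -> raw table."""
    unaccented_stems = defaultdict(set)  # evit -> {eviter}
    raw_forms = defaultdict(set)         # eviter -> {éviter}
    for term in index_terms:
        stripped = strip_term(term)
        unaccented_stems[toy_stem(stripped)].add(stripped)
        raw_forms[stripped].add(term)
    return unaccented_stems, raw_forms

def insensitive_search_terms(user_term, unaccented_stems, raw_forms):
    """Expand a user term through the unaccented stems, back to raw index terms."""
    stem = toy_stem(strip_term(user_term))
    found = set()
    for stripped in unaccented_stems.get(stem, ()):
        found |= raw_forms[stripped]
    return found

stems, raws = build_tables(["éviter"])
print(insensitive_search_terms("evite", stems, raws))
```

The unaccented +evite+ reaches +éviter+ in two hops: through the +evit+ stem to +eviter+, then from +eviter+ to the raw term which only differs by accents.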
==== The bad case: diacritics in the term change the stem beyond diacritics
Some grammatically significant accents will cause unexpectedly missing
search results when using a supposedly diacritics-insensitive search mode.
Let's imagine that the document set contains the term +éviter+
(infinitive of +to avoid+), but not +évitâmes+ (past). So the stemming
expansion table has an entry for +évit+ -> +éviter+.
If the user enters an unaccented +evitames+, she would expect to find the
documents containing +éviter+ in the results, because the latter term is
a stemming sibling of +évitâmes+ and the search is supposedly not
influenced by diacritics, so that +evitames+ and +évitâmes+ should be
equivalent.
However, our search is now in trouble, because +évitâmes+ is not in any
document, so that there is no data in the index which would inform us about
how to transform the input term into something that differs only by accents
but would yield a correct input for the stemmer.
If we try to feed the raw user input to the stemmer, it will propose
an +evitam+ stem, which will not work, because the stem that actually
exists is +évit+, and +evitam+ can not be related to +éviter+.
The only palliative approach I can think of would be a spelling correction
of the input, performed independently of the actual index contents, which
would notice that +evitames+ is not a French word and propose a change or an
expansion to +évitâmes+, which would correctly stem to +évit+ and allow
us to find +éviter+.
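The failure and the palliative can be reproduced with toy data in Python. Everything here is hypothetical scaffolding for the example: `TOY_STEMS` stands in for the French stemmer (limited to the forms quoted above) and `SPELLING_FIXES` stands in for the output of a spelling corrector.

```python
# Hypothetical stand-in for the French stemmer: only the forms used in the text
TOY_STEMS = {
    "éviter": "évit",
    "évitâmes": "évit",
    "evitames": "evitam",
}

# Stem expansion table for an index which only contains "éviter"
STEM_EXPANSION = {"évit": {"éviter"}}

# Hypothetical spelling corrector output, independent of the index contents
SPELLING_FIXES = {"evitames": "évitâmes"}

def expand(term):
    """Return the index terms sharing a stem with 'term' (possibly none)."""
    stem = TOY_STEMS.get(term, term)
    return STEM_EXPANSION.get(stem, set())

# Raw user input: the stem is "evitam", which matches nothing
print(expand("evitames"))
# Corrected input: the stem is "évit", which leads to "éviter"
print(expand(SPELLING_FIXES["evitames"]))
```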
This issue is not specific to Recoll or indeed to the fact that the index
retains accents or not. As far as I can see, it is an intrinsic bad
interaction between diacritics insensitivity and stemming.
It is also interesting to note that this case becomes less probable when
the data set becomes bigger, because more term inflexions will then be
present in the index.
We'll next think about an link:ZDevCaseAndDiacritics2.html[appropriate
interface].

View File

@ -0,0 +1,122 @@
== Character case and diacritic marks (2), user interface
In a link:ZDevCaseAndDiacritics1.html[previous document], we discussed some
of the problems which arise when mixing case/diacritics sensitivity and
stemming.
As of version 1.18, Recoll can create two types of indexes:
* _Dumb_ indexes contain terms which are lowercased and stripped of
diacritics. Searches using such an index are naturally case- and
diacritics- insensitive: search terms are stripped before processing.
* _Raw_ indexes contain terms which are just like they were found in the
source document. Searching such an index is naturally sensitive to case
and diacritics, and can be made insensitive by further processing.
The following explains how users can control these Recoll features.
=== Controlling the type of index we create: stripped or raw
The kind of index that recoll creates is determined by:
* A build-time *configure* switch: _--enable-stripchars_. If this is
set, the code for case and diacritics sensitivity is not compiled in and
recoll will work like the previous versions: unaccented and casefolded
index, no runtime options for case or diacritics sensitivity
* An indexing configuration switch (in recoll.conf): if Recoll was built
with _--disable-stripchars_, this will provide a dynamic way to return
to the "traditional" index. The case and diacritics code will be present
but inactive. Normally, a recoll installation with this switch set
should behave exactly like one built with _--enable-stripchars_. When
using multiple indexes, this switch MUST be consistent between
indexes. There is no support whatsoever for mixing raw and dumb indexes.
The option is named _indexStripChars_, and it is not settable from the
GUI to avoid errors. This is something that would typically be set once
and for all for a given installation. We need to decide what the default
value will be for 1.18
* A number of query time switches. Using these it is also possible to
perform a search insensitive to case and diacritics on a raw index. Note
however, that, given the complexity of the issues involved, I give no
guarantee at this time that this will yield exactly the same results as
searching a dumb index. Details about query time behaviour follow.
=== Controlling stem, case and diacritics expansion: user query interface
Recoll versions up to 1.17 were insensitive to case and diacritics. We only
needed to give the user a way to control stem expansion. This was done in
three ways:
* Globally, by setting a menu option.
* Globally, by setting the stemming language value to empty.
* On a term by term basis by Capitalizing the term, or, in query language
mode only, by using an 'l' clause modifier (_"term"l_).
After switching to an unstripped index, capable of case and diacritic
sensitivity, we need ways to control what processing is performed among:
* Case expansion.
* Diacritics expansion.
* Stem expansion.
The default mode will be compatible with the previous version, because
this is most generally what we want to do: ignore case and diacritics,
expand stems.
There are two easy approaches for controlling the parameters:
* Global options set in the GUI menus or as *recollq* command line
switches.
* Per-clause options set by modifiers in the query language.
We would like, however, to let the user entry automatically override the
defaults in a sensible way. For example:
* If a term is entered with diacritics, diacritic sensitivity is turned on
(for this term only).
* If a term is entered with upper-case characters, case sensitivity is
turned on. In this case, we turn off stem expansion, because it makes
really no sense with case sensitivity.
With this method we are stuck with 3 problems (only if the global mode is
set to insensitive, and we're not using the query language):
* Turning off stemming without turning on case sensitivity.
* Searching for an all lower-case term in case-sensitive mode.
* Searching for a term without diacritics in diacritic-sensitive mode.
The two latter issues are relatively marginal and can be worked around easily
by switching to query language mode or using negative clauses in the
advanced search.
However, we need to be able to turn stemming off while remaining
insensitive to case, and we need to stay reasonably compatible with the
previous versions. This means that a term which has a capital first letter
but is otherwise lowercase will turn stemming off, but not case sensitivity
on.
So we're left with how to search for such a term in a case-sensitive way,
and for this, you'll have to use global options or the query language.
The modified method is:
* If a term is entered with diacritics, diacritic sensitivity is turned on
(for this term only).
* If the first letter in a term is upper-case and the rest is lower-case,
we turn stem expansion off, but we do not become case-sensitive.
* If any letter in a term except the first is upper-case, case sensitivity
is turned on. Stem expansion is also turned off (even if the first
letter is lower-case), because it makes really no sense with case
sensitivity.
* To search for an all lower-case or capitalized term in a case-sensitive
way, use the query language: "Capitalized"C, "lowercase"C
* Use the query language and the "D" modifier to turn on diacritics
sensitivity.
It can be noted that some combinations of choices do not make sense and
they are not allowed by Recoll: for example, diacritics or case sensitivity
do not make sense with stem expansion (which cannot preserve diacritics in
any meaningful general way).
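The automatic part of these rules (what is deduced from the way a term is typed) can be sketched in Python. This is one reading of the rules above, not the actual Recoll code: the `TermOptions` type and the `options_for` function are invented for the example, and stem expansion is turned off for diacritics-sensitive terms on the basis of the previous paragraph.

```python
import unicodedata
from typing import NamedTuple

class TermOptions(NamedTuple):
    diacritics_sensitive: bool
    case_sensitive: bool
    stem_expansion: bool

def has_diacritics(term):
    """True if any character of the term carries a combining mark."""
    decomposed = unicodedata.normalize("NFD", term)
    return any(unicodedata.combining(c) for c in decomposed)

def options_for(term):
    """Deduce the search options from the way the term was typed."""
    diacritics_sensitive = has_diacritics(term)
    # A capital first letter only turns stemming off
    capitalized = term[:1].isupper()
    # A capital anywhere else turns case sensitivity on
    case_sensitive = any(c.isupper() for c in term[1:])
    # Assumption: stemming is never combined with any kind of sensitivity
    stem_expansion = not (capitalized or case_sensitive or diacritics_sensitive)
    return TermOptions(diacritics_sensitive, case_sensitive, stem_expansion)

for term in ("floor", "Floor", "fLoor", "US", "résumé"):
    print(term, options_for(term))
```

The explicit "C" and "D" query language modifiers are not covered here, as they override what is deduced from the typed form.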
The link:ZDevCaseAndDiacritics3.html[next page] describes the actual
implementation in Recoll 1.18.

View File

@ -0,0 +1,67 @@
== Character case and diacritic marks (3), implementation
In previous pages, we discussed link:ZDevCaseAndDiacritics1.html[diacritics
and stemming], and an link:ZDevCaseAndDiacritics2.html[appropriate
interface] for switchable search sensitivity to diacritics and character
case.
So you are in this mood again and you don't want to type accents (maybe you're
stuck with a QWERTY American English keyboard), or conversely you
want to resume looking for your résumé, and you've told Recoll as much,
using the appropriate interface. What happens then?
The second case is easy if the index is raw, and mostly impossible if it is
stripped. So we'll concentrate on the first case: how to achieve case and
diacritics insensitivity on a raw index?
Recoll uses three expansion tables:
* The first table has stripped and lowercased terms as keys and raw terms as
data: +mate -> (mate, maté, MATE,...)+.
* The second table has lowercased stems as keys and original lowercase terms
as data (when using multiple languages, there are several such tables):
+évit -> (éviter, évite, évitâmes, ...)+.
* The third table has stripped and lowercased stems as keys and stripped
lowercased terms as data:
+evit -> (eviter, evite, evitons)+ and +evitam -> (evitames, ...)+
The first table can be used for full case and diacritics expansion or for
only one of those, by post-filtering the results of full expansion (e.g. if
we only want diacritics expansion, we filter by stripping diacritics from
each result term and check that it's identical to the input). For example
if we have +mate -> (mate, maté, MATE, MATÉ)+ in the table and want to
only perform case expansion for an input of +maté+, we apply case folding
to the initial output and keep only the terms which fold to +maté+
(+maté+ and +MATÉ+), as +mate+ differs from the input.
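The post-filtering of the first table can be sketched in Python. The table contents come from the example above; the `expand` function and its `case` and `diacritics` flags are assumptions of the sketch, not Recoll's interface.

```python
import unicodedata

def nfc(term):
    return unicodedata.normalize("NFC", term)

def unaccent(term):
    """Remove diacritic marks, keeping the character case."""
    decomposed = unicodedata.normalize("NFD", term)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return nfc(stripped)

# First table: stripped and lowercased term -> raw terms found in the index
TABLE1 = {"mate": ["mate", "maté", "MATE", "MATÉ"]}

def expand(term, case=True, diacritics=True):
    """Full expansion through the table, then optional post-filtering."""
    term = nfc(term)
    candidates = [nfc(t) for t in TABLE1.get(unaccent(term).lower(), [term])]
    if case and diacritics:
        return candidates
    if case:
        # Case expansion only: the case-folded forms must be identical
        return [t for t in candidates if t.lower() == term.lower()]
    if diacritics:
        # Diacritics expansion only: the unaccented forms must be identical
        return [t for t in candidates if unaccent(t) == unaccent(term)]
    return [t for t in candidates if t == term]

print(expand("maté", diacritics=False))
print(expand("mate", case=False))
```

The first call keeps the terms whose case-folded form is +maté+, the second one keeps the terms whose unaccented form is +mate+.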
We only perform stemming expansion when case and diacritics sensitivity is
off. It is performed using the second and third tables, both on the
lowercased and lowercased/stripped output of the first step, and each term
in the stemming output is expanded again for case (using the first table).
A full example of the expansion occurring during an insensitive search
for +resume+ using French stemming on a mixed English/French index
follows. An important thing to remember is that the result of each
expansion is a function of the terms actually present in the index, not
some arbitrary computation (and so, of course, many of the possible but
absent variations are missing).
# The case and diacritics expansion of +resume+ yields +RESUME Resume
Résumé resumé résume résumé resume+
# The Stem expansion input list (lower-cased) is:
+resume resumé résume résumé+, and the output is:
+resum resume resumenes resumer resumes resumé resumée résum résumait
résumant résume résumer résumerai résumerait résumes résumez résumé résumée
résumées résumés+
# Each of the above terms is then fed to case and diacritics expansion (first
table), for the final output:
+resume résumé Résumé résumer résume Resume résumés RESUME resumes
resumer résumant resúmenes resumé résumait résumes résumée resumee
résumerait Résumez résumerai RÉSUMÉES Resumée Resumes résumées+.
A Xapian OR query is finally constructed from the expanded term list.
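The three steps can be mimicked in Python on a much reduced data set. The tables below only contain a few of the terms listed above, `TOY_STEMS` is a hypothetical stand-in for the stemmer, and the final list is only printed in an OR-like form instead of being turned into an actual Xapian query.

```python
import unicodedata

def strip_term(term):
    """Lowercase a term and remove its diacritic marks."""
    decomposed = unicodedata.normalize("NFD", term.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))

# Table 1: stripped, lowercased term -> raw terms present in the index
TABLE1 = {
    "resume": {"resume", "Resume", "RESUME", "résumé", "Résumé"},
    "resumes": {"resumes", "résumés"},
    "resumer": {"résumer"},
}
# Table 2: lowercased stem -> lowercased terms (diacritics kept)
TABLE2 = {
    "résum": {"résumé", "résumés", "résumer"},
    "resum": {"resume", "resumes"},
}
# Table 3: stripped, lowercased stem -> stripped, lowercased terms
TABLE3 = {"resum": {"resume", "resumes", "resumer"}}

# Hypothetical stand-in for the stemmer
TOY_STEMS = {"resume": "resum", "résumé": "résum"}

def toy_stem(term):
    return TOY_STEMS.get(term, term)

def expand_insensitive(user_term):
    # Step 1: case and diacritics expansion of the user term
    step1 = TABLE1.get(strip_term(user_term), {user_term})
    # Step 2: stem expansion of the lowercased and of the stripped forms
    step2 = set()
    for term in {t.lower() for t in step1}:
        step2 |= TABLE2.get(toy_stem(term), {term})
        step2 |= TABLE3.get(toy_stem(strip_term(term)), set())
    # Step 3: case and diacritics expansion of each stemming result
    step3 = set()
    for term in step2:
        step3 |= TABLE1.get(strip_term(term), {term})
    return sorted(step3)

print(" OR ".join(expand_insensitive("resume")))
```

As in the real case, every term in the result comes from one of the tables, so only the variations actually present in the (toy) index appear.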

View File

@ -0,0 +1,20 @@
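#!/bin/sh
# Generate the wiki index: one link per .txt page, labelled with the page title
WIDX=WikiIndex.txt
echo "== Recoll Wiki file index" > "$WIDX"
for f in *.txt; do
    # Do not list the index file itself
    if test "$f" = "$WIDX" ; then continue; fi
    h="$(basename "$f" .txt).html"
    # The title is the first line of the page, without the heading markers
    title=$(head -n 1 "$f" | sed -e 's/=//g' -e 's/^ *//' -e 's/ *$//' -e 's/ //g')
    echo "link:${h}[${title}]" >> "$WIDX"
    echo >> "$WIDX"
done
exit 0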
# Check and display what files are in the index but not in the contents table:
grep \| FaqsAndHowTos.txt | awk -F\| '{print $1}' | sed -e 's/\* \[\[//' -e 's/.wiki//' |sort > ctfiles.tmp
grep '\[\[' WikiIndex.txt | awk -F\| '{print $1}' | sed -e 's/\[\[//' -e 's/.wiki//' -e 's/.md//' | sort > ixfiles.tmp
echo 'diff ContentFiles IndexFiles:'
diff ctfiles.tmp ixfiles.tmp
rm ctfiles.tmp ixfiles.tmp