web
parent
06b414cfc6
commit
821fb780d2
35
website/faqsandhowtos/ElinksWeb.txt
Normal file
@ -0,0 +1,35 @@
|
|||||||
|
== Extending the Recoll Firefox visited web page indexing mechanism to other browsers
|
||||||
|
|
||||||
|
The *Recoll* _Web Queue_ function allows using WEB browser plug-ins
originally designed for indexing visited WEB pages with *Beagle* (rip). The
browser plug-ins work very simply, by creating copies of the visited pages
in a designated directory. Two files are created for each page, one for the
contents, the other for the metadata.
|
||||||
|
|
||||||
|
When activated, *Recoll* will visit the queue directory and index each HTML
|
||||||
|
page and its associated metadata. There is more detail about the mechanism
|
||||||
|
on the [[IndexWebHistory|page about the Recoll Web queue]], but mostly, you
|
||||||
|
just need to go to the _Indexing Preferences_ in the *recoll* GUI, open the
|
||||||
|
_Web history_ panel and check the top button.
|
||||||
|
|
||||||
|
Franck, a *Recoll* and *Elinks* user from New Zealand, designed a method
|
||||||
|
and wrote a script to index the *Elinks* WEB history in this fashion.
|
||||||
|
|
||||||
|
The script works by using *wget* to fetch the visited page into the queue
|
||||||
|
directory. This means that it would be reusable to index arbitrary WEB
|
||||||
|
pages in contexts other than *Elinks* visits.
|
||||||
|
|
||||||
|
Recipe for *Elinks* and Recoll 1.18 and later:
|
||||||
|
|
||||||
|
* Retrieve the
|
||||||
|
link:https://www.recoll.org/files/elinks_recoll.sh[elinks_recoll.sh] shell
|
||||||
|
script and make it executable (`chmod a+x elinks_recoll.sh`).
|
||||||
|
* In the Elinks Keyboard shortcut manager (k)/Main, add a shortcut to pass
the current URL to an external command, e.g. _Ctrl-P_.
|
||||||
|
* In the Options manager (o) /Document/Uri Passing, add an action named for
|
||||||
|
example _ToIndex_
|
||||||
|
* Modify the ToIndex action to execute `/path/to/the/script/elinks_recoll.sh %c`
|
||||||
|
* Save, you are done
|
||||||
|
|
||||||
|
For Recoll 1.17, the method is analogous, but the script is named
|
||||||
|
link:https://www.recoll.org/files/elinks_recoll.sh[elinks_beagle.sh].
|
||||||
37
website/faqsandhowtos/FaqsAndHowTos.txt
Normal file
@ -0,0 +1,37 @@
|
|||||||
|
== Faqs and Howtos
|
||||||
|
|
||||||
|
=== Indexing
|
||||||
|
* link:WhyIsMyFileNotIndexed.html[Why is this file not indexed ? Investigating indexing issues]
|
||||||
|
* link:PreventIndexingDir.html[Preventing the indexing of a directory]
|
||||||
|
* link:IndexOnAc.html[Starting/stopping the indexer depending on power/battery status]
|
||||||
|
* link:IndexMozillaCalendari.html[Indexing Mozilla Sunbird / Lightning calendar data]
|
||||||
|
* link:MultipleIndexes.html[Creating and using multiple indexes]
|
||||||
|
* link:IndexWebHistory.html[Indexing Web history with the Firefox browser extension]
|
||||||
|
* link:ElinksWeb.html[Extending the Web queue mechanism to other browsers and general WEB indexing]
|
||||||
|
* link:IndexMailHeader.html[Indexing arbitrary mail headers]
|
||||||
|
* link:IndexOutlook.html[Indexing Outlook archives]
|
||||||
|
* link:HandleCustomField.html[Generating a custom field and using it to sort results]
|
||||||
|
* link:http://www.recoll.org/recoll_XMP/index.html.html[An example of filter/field customisation, using XMP metadata with PDFs]
|
||||||
|
* link:FilteringOutZipArchiveMembers.html[Filtering out Zip archive members]
|
||||||
|
|
||||||
|
=== Searching
|
||||||
|
* link:GUIKeyboard.html[Recoll GUI keyboard navigation]
|
||||||
|
* link:HotRecoll.html[On the desktop: using a keyboard shortcut for starting/hiding recoll]
|
||||||
|
* link:OpenHelperScript.html[Handling issues for starting native apps, esp. email clients - getting Thunderbird to open message files]
|
||||||
|
* link:QpdfviewHelperScript.html[Another example open helper script - using qpdfview to open pdf and postscript files, with support for page and search options]
|
||||||
|
* link:UsingOpenWith.html[Using the new Open With menu in recoll 1.20 with a custom app]
|
||||||
|
* link:ReplaceCategories.html[Replacing the document category filters]
|
||||||
|
* link:ResultsThumbnails.html[Result list thumbnails and how to create them]
|
||||||
|
* link:MuttAndRecoll.html[Interfacing Recoll and Mutt]
|
||||||
|
* link:QueryFromC.html[Querying from a C program]
|
||||||
|
|
||||||
|
=== Administration and miscellaneous
|
||||||
|
* link:http://www.recoll.org/pages/recoll-webui-install-wsgi.html.html[Installation of the Recoll WebUI with Apache]
|
||||||
|
* link:FilterRetrofit.wiki.html[Installing a filter for a new document type]
|
||||||
|
* link:UnityLens.html[Building and Installing the Ubuntu Unity Recoll Lens]
|
||||||
|
* link:SavingConfig.wiki.html[Recoll configuration backup]
|
||||||
|
* link:XDGBase.wiki.html[Tidying Recoll data storage]
|
||||||
|
* link:ProblemSolvingData.html[Collecting diagnostic information]
|
||||||
|
* link:NonAsciiFileNames.html[Unix and non-ascii file names]
|
||||||
|
* link:FilterArch.html[Recoll filters]
|
||||||
82
website/faqsandhowtos/FilterArch.txt
Normal file
@ -0,0 +1,82 @@
|
|||||||
|
== Recoll input handlers
|
||||||
|
|
||||||
|
In the end, Recoll indexes plain UTF-8 text, remembering where it came
from.

But of course, this is not what the source data looks like.
The text content of the original documents is encoded in many fashions
(e.g. pdf, ms-word, html, etc.), and it can also be stored in quite
involved ways (inside archives, email attachments ...).
|
||||||
|
|
||||||
|
For getting to the data and converting it to plain text, Recoll uses a set
of modules which it calls input handlers (or filters), which either operate
on the storage structure (e.g. a zip handler), or the storage format (e.g. a
pdf to text translator), or both. In addition, there is a tentative notion
of a higher level storage backend which we will ignore for now (for
reference there are currently two of those: the file system and the web
history cache).
|
||||||
|
|
||||||
|
The basic task of filters is to take a document as input and produce a
|
||||||
|
series of subdocuments as output. The subdocument's format is defined
|
||||||
|
either dynamically (as part of the output data), or statically, in the
|
||||||
|
filter definition.
|
||||||
|
|
||||||
|
=== Simple filters
|
||||||
|
|
||||||
|
These are executed by the *mh_exec* Recoll module. They are the vast
majority.
|
||||||
|
|
||||||
|
These filters are very simple. They are designed to perform a simple task
|
||||||
|
with minimal interface, they mostly don't know anything about each other,
|
||||||
|
and they don't know much about their context. This makes writing a filter
|
||||||
|
quite easy as there is not much to learn about their environment.
|
||||||
|
|
||||||
|
Only one output document is produced and the format is fixed.
|
||||||
|
|
||||||
|
In practice the filter, which is most generally a shell script (but could
|
||||||
|
be any executable program), takes a file name on the command line and
|
||||||
|
outputs an html or plain text document on standard output, then exits.
|
||||||
|
|
||||||
|
For example, the pdf filter takes one pdf file name as input on the command
|
||||||
|
line and produces one html document on stdout. The fact that the output is
|
||||||
|
html is statically defined in a configuration file.
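As an illustration of this interface, here is a minimal sketch of a simple filter written in Python. It is not one of the filters shipped with Recoll: the helper name `text_to_html` and the choice of a plain text input file are assumptions made for the example. It takes one file name on the command line and writes one html document to standard output, with the title carried in the html header.

```python
#!/usr/bin/env python3
"""Sketch of a 'simple' filter: one file name in, one html document out."""
import html
import sys


def text_to_html(text, title):
    """Wrap plain text in a minimal html document (hypothetical helper)."""
    return (
        "<html><head>\n"
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">\n'
        "<title>" + html.escape(title) + "</title>\n"
        "</head><body>\n"
        "<pre>" + html.escape(text) + "</pre>\n"
        "</body></html>\n"
    )


def main(path):
    """Read the input file and write the html document to standard output."""
    with open(path, encoding="utf-8", errors="replace") as f:
        sys.stdout.write(text_to_html(f.read(), path))


# Only run when called as a program with exactly one file name.
if __name__ == "__main__" and len(sys.argv) == 2:
    main(sys.argv[1])
```

The program keeps no state and knows nothing about the indexer, which is what makes this kind of filter easy to write.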
|
||||||
|
|
||||||
|
For filters which produce plain text, the output character set information
|
||||||
|
is in general defined in the configuration file. Else it will be obtained
|
||||||
|
from the locale (hoping that it makes sense).
|
||||||
|
|
||||||
|
Filters that output html can produce metadata information in the html
header (e.g. author). Filters that output plain text can only output
main text data, no metadata fields.
|
||||||
|
|
||||||
|
Besides the file name, there is one other piece of input information, which
|
||||||
|
is in the form of an environment variable, and can be safely ignored:
|
||||||
|
+RECOLL_FILTER_FORPREVIEW+. This indicates if the filter is being used
|
||||||
|
for previewing or for indexing data. Some filters will elect to suppress
|
||||||
|
repetitive parts of the output text when indexing to avoid distorting the
|
||||||
|
term statistics. For example, the man filter suppresses the section
|
||||||
|
headers (NAME, SYNOPSIS...) when indexing.
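A filter could test the variable along the lines of the following sketch. The value `yes` for previewing is an assumption here (check the filters shipped with your Recoll version), and `strip_section_headers` is a hypothetical helper loosely modelled on what is said above about the man filter.

```python
import os

# Section headers which would distort term statistics (illustrative list).
SECTION_HEADERS = {"NAME", "SYNOPSIS", "DESCRIPTION"}


def for_preview(environ=os.environ):
    """True when the filter is run to build a preview rather than to index.

    Assumption: the variable is set to "yes" for previews.
    """
    return environ.get("RECOLL_FILTER_FORPREVIEW", "no") == "yes"


def strip_section_headers(lines):
    """Drop lines which consist only of a known section header."""
    return [line for line in lines if line.strip() not in SECTION_HEADERS]


def filter_output(lines, environ=os.environ):
    """Keep everything for previews, drop repetitive headers for indexing."""
    return list(lines) if for_preview(environ) else strip_section_headers(lines)
```

A filter which ignores the variable simply produces the same output in both cases.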
|
||||||
|
|
||||||
|
=== Multiple input filters
|
||||||
|
|
||||||
|
These filters are more complex, but still quite easy to write, especially
|
||||||
|
if you can use Python, because they can then use a common module which
|
||||||
|
manages the communication with the indexer.
|
||||||
|
|
||||||
|
Newer Recoll versions have converted many previously 'simple' filters to
|
||||||
|
this kind as part of the port to Windows.
|
||||||
|
|
||||||
|
These filters are executed by the *mh_execm* Recoll module.
|
||||||
|
|
||||||
|
They are persistent (one instance will persist through a whole indexing
pass), and will index multiple successive input files (the point being to
avoid the startup performance penalty), and possibly multiple documents per
input file if this makes sense for their input format (e.g. zip archive, chm
help file).
|
||||||
|
|
||||||
|
They use a simple communication protocol over a pipe with the main recoll
|
||||||
|
or recollindex process, with file names and a few other parameters being
|
||||||
|
sent as input, and decoded data and attributes being sent in return.
|
||||||
|
|
||||||
|
The shared Python module is 'filters/rclexecm.py'. You can look at 'rclzip'
|
||||||
|
or 'rclaudio' for reasonably straightforward examples.
|
||||||
62
website/faqsandhowtos/FilterRetrofit.txt
Normal file
@ -0,0 +1,62 @@
|
|||||||
|
== Installing a filter for a new document type
|
||||||
|
|
||||||
|
It will sometimes happen that a newer Recoll release has support for a
|
||||||
|
document type which would be useful to you, but which your older release
|
||||||
|
does not support.
|
||||||
|
|
||||||
|
It is in general easy to import support from the newer to the older
|
||||||
|
release: the Recoll input handler interface is very stable, so things should just
|
||||||
|
work.
|
||||||
|
|
||||||
|
Input Handler updates are generally described on the Recoll web site
|
||||||
|
link:https://www.recoll.org/filters/filters.html[new filters pages]. They
|
||||||
|
may include notes about which versions need the new input handler, or specifics
|
||||||
|
about installing it.
|
||||||
|
|
||||||
|
An up to date copy of input handlers and configuration files is also kept
|
||||||
|
link:https://www.recoll.org/filters/[at the same location].
|
||||||
|
|
||||||
|
We will take an example to make things more concrete: Tomboy and Gnote
|
||||||
|
files are directly supported by Recoll 1.19, but not in older Recoll
|
||||||
|
releases. The *rclxml* handler is needed to process them.
|
||||||
|
|
||||||
|
The following procedure will allow you to retrofit support:
|
||||||
|
|
||||||
|
- Retrieve the *rclxml* input handler from:
|
||||||
|
link:https://www.lesbonscomptes.com/recoll/filters/rclxml[]
|
||||||
|
|
||||||
|
- Copy it to '/usr/share/recoll/filters' and make it executable:
|
||||||
|
`chmod +x rclxml`
|
||||||
|
The input handler needs *xsltproc*, but this is probably already on your
|
||||||
|
system (else get it with the package manager).
|
||||||
|
|
||||||
|
- Edit '~/.recoll/mimemap', add the following line:
`.note = application/x-gnote`
- Edit '~/.recoll/mimeconf', add the following lines:
+
----
[index]
application/x-gnote = exec rclxml
----
- Edit '~/.recoll/mimeview', add the following lines:
+
----
[view]
application/x-gnote = tomboy %f
----
|
||||||
|
|
||||||
|
- The easiest way to make sure the files are indexed with the new input
|
||||||
|
handlers may then be to just run a full indexing pass (`recollindex -z`).
|
||||||
|
|
||||||
|
Notes:
|
||||||
|
|
||||||
|
- The MIME type which is used is not crucial, you could prefer to use,
|
||||||
|
e.g., +application/x-tomboy+ instead, it just has to be consistent. To
|
||||||
|
avoid future trouble, it's better to use the type used by newer Recoll
|
||||||
|
releases though.
|
||||||
|
- The 'mimeview' entry is necessary even if you are using the desktop
|
||||||
|
preferences to open files. The value will not be used, but it has to be
|
||||||
|
there.
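To see how the three files fit together, the following sketch mimics the lookup chain with plain Python dictionaries: file name suffix to MIME type ('mimemap'), MIME type to input handler ('mimeconf'), MIME type to viewer ('mimeview'). It only illustrates the relationship between the entries given in the procedure; it is not how Recoll parses its configuration, and the function name `resolve` is hypothetical.

```python
import os

# Entries copied from the configuration fragments of the procedure.
MIMEMAP = {".note": "application/x-gnote"}
MIMECONF_INDEX = {"application/x-gnote": "exec rclxml"}
MIMEVIEW_VIEW = {"application/x-gnote": "tomboy %f"}


def resolve(filename):
    """Return (mime type, handler, viewer) for a file name, or None if unknown."""
    suffix = os.path.splitext(filename)[1]
    mime = MIMEMAP.get(suffix)
    if mime is None:
        return None
    return mime, MIMECONF_INDEX.get(mime), MIMEVIEW_VIEW.get(mime)


print(resolve("shopping.note"))
print(resolve("readme.txt"))
```

The MIME type is the key shared by the three tables, which is why it has to be spelled the same way in all of them.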
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
34
website/faqsandhowtos/FilteringOutZipArchiveMembers.txt
Normal file
@ -0,0 +1,34 @@
|
|||||||
|
== Filtering out Zip archive members ==
|
||||||
|
|
||||||
|
The *rclzip* Zip archive extraction input handler does not use the general
|
||||||
|
configuration variables which define what file system objects should be
|
||||||
|
skipped, but it has an equivalent internal function.
|
||||||
|
|
||||||
|
The name-skipping code depends on a recent member of the Recoll Python
|
||||||
|
package. This will become standard for release 1.20, but for earlier
|
||||||
|
releases, you need to do two things to use this function:
|
||||||
|
|
||||||
|
- Fetch 'python/recoll/recoll/rclconfig.py' and 'filters/rclzip' from the
|
||||||
|
source repository.
|
||||||
|
- Copy both to '/usr/share/recoll/filters' and make 'rclzip' executable.
|
||||||
|
|
||||||
|
You can then set a variable named +zipSkippedNames+ inside
|
||||||
|
'recoll.conf'. +zipSkippedNames+ should be a space-separated list of
|
||||||
|
patterns which will be passed to the Python fnmatch() function. The +/+
|
||||||
|
characters are not special (matched as any character).
|
||||||
|
|
||||||
|
You can't use embedded spaces in patterns (no double-quote quoting for now).
|
||||||
|
|
||||||
|
This can be redefined for file system directories using the usual section
|
||||||
|
indicators (Zip archives in different file-system directories can have
|
||||||
|
different skip lists).
|
||||||
|
|
||||||
|
Example:
|
||||||
|
|
||||||
|
----
zipSkippedNames = *.txt
[/path/to/the/dir]
zipSkippedNames = somedir/*/*.html
----
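The matching rules can be checked with Python's standard `fnmatch` module. The sketch below is not the *rclzip* code: `is_skipped` is a hypothetical helper, and it uses `fnmatchcase()` so that the result does not depend on the platform's case rules. It shows that `*` also matches `/`, so that a pattern applies to the whole member path.

```python
from fnmatch import fnmatchcase


def is_skipped(member_name, skipped_names):
    """True if the archive member name matches one of the patterns.

    skipped_names is the raw, space-separated value of zipSkippedNames.
    """
    return any(fnmatchcase(member_name, pat) for pat in skipped_names.split())


# '*' is not stopped by '/', so '*.txt' matches at any depth.
print(is_skipped("docs/notes/readme.txt", "*.txt"))               # True
print(is_skipped("somedir/a/index.html", "somedir/*/*.html"))     # True
print(is_skipped("somedir/a/b/index.html", "somedir/*/*.html"))   # True
print(is_skipped("other/index.html", "somedir/*/*.html *.txt"))   # False
```

Splitting the value on white space is also why a pattern cannot contain an embedded space.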
|
||||||
|
|
||||||
|
|
||||||
60
website/faqsandhowtos/GUIKeyboard.txt
Normal file
@ -0,0 +1,60 @@
|
|||||||
|
== Recoll GUI keyboard navigation
|
||||||
|
|
||||||
|
Using Recoll without the mouse is not completely straightforward, but it is
|
||||||
|
mostly feasible. Here follows a description of the usable shortcuts.
|
||||||
|
|
||||||
|
=== Anywhere
|
||||||
|
|
||||||
|
`Ctrl+q` should exit Recoll from anywhere.
|
||||||
|
|
||||||
|
=== Main window and result list ===
|
||||||
|
|
||||||
|
When Recoll starts up, the focus is in the simple search entry. The main
|
||||||
|
window tab order is as follows:
|
||||||
|
|
||||||
|
* Clear
|
||||||
|
* Search
|
||||||
|
* Search type combo
|
||||||
|
* Search entry (Initial focus)
|
||||||
|
* Result list (scrolling etc)
|
||||||
|
* Result list 1st link
|
||||||
|
* Result list next links...
|
||||||
|
* Back to Clear
|
||||||
|
|
||||||
|
Each result list entry has 3 links: the icon link is not active, but its
|
||||||
|
value is the URL, so that it can be dragged and dropped to another
|
||||||
|
application. The 2 other links are _Preview_ and _Open_ and can be
|
||||||
|
activated by typing _Enter_.
|
||||||
|
|
||||||
|
Typing _Ctrl+Shift+s_ anywhere in the main window should return the focus to the search entry. So will _Ctrl+l_ in future versions (for compatibility with WEB browser usage).
|
||||||
|
|
||||||
|
For pure keyboard usage, you can improve this by:
|
||||||
|
|
||||||
|
- Disabling the icon link: use _Preferences->GUI configuration->Result
|
||||||
|
List->Edit result paragraph_ and remove the `<a href='%U'>` and `</a>`
|
||||||
|
around the `<img...>` tag.
|
||||||
|
- Making the active link more visible by adding the following code to the
|
||||||
|
result page HTML header insert (same preferences tab). Feel free to
|
||||||
|
adjust the color :=) :
|
||||||
|
|
||||||
|
----
|
||||||
|
<style type="text/css">
|
||||||
|
a:focus {background-color: red;}
|
||||||
|
</style>
|
||||||
|
----
|
||||||
|
|
||||||
|
=== Result table
|
||||||
|
|
||||||
|
The same _Ctrl+Shift+s_ will return the focus to the search entry when
|
||||||
|
working with the result table.
|
||||||
|
|
||||||
|
_Ctrl+r_ will move the focus from the entry to the spreadsheet. When in
|
||||||
|
there the arrow keys will navigate the lines.
|
||||||
|
|
||||||
|
When a line is selected:
|
||||||
|
|
||||||
|
* _Ctrl+o_ will _Open_ the document.
|
||||||
|
* _Ctrl+Shift+o_ will _Open_ the document and exit Recoll.
|
||||||
|
* _Ctrl+d_ (detail) will start a _Preview_
|
||||||
|
|
||||||
|
_Esc_ will deselect the current line so that mouse hovering will work again.
|
||||||
69
website/faqsandhowtos/HandleCustomField.txt
Normal file
@ -0,0 +1,69 @@
|
|||||||
|
== Generating a custom field and using it to sort results
|
||||||
|
|
||||||
|
We are going to show how to generate a custom field from a Recoll filter,
|
||||||
|
and use it for sorting results. The example chosen comes from an actual
|
||||||
|
user request: sorting results on pdf page counts.
|
||||||
|
|
||||||
|
The details here are obsolete, as the +pdf+ input handler is now a quite
|
||||||
|
different python program, but the general idea is still relevant.
|
||||||
|
|
||||||
|
The page count from a pdf file can be displayed by the pdfinfo command
|
||||||
|
(xpdf or poppler tools).
|
||||||
|
|
||||||
|
We first modify a copy of the rclpdf filter
('/usr/[local/]share/recoll/filters/rclpdf'), to compute the pdf page count,
and output the value as an html meta field. This is a not very interesting
bit of shell/awk magic. Another approach would be to just rewrite the
rclpdf filter in your favorite scripting language (e.g. perl, python...), as
all it does is execute pdftotext and pdfinfo and output html, nothing
complicated. Here follows the rclpdf modification as a pseudo patch:
|
||||||
|
|
||||||
|
----
# compute the page count and format it so that it's alphabetically sortable
+set `pdfinfo "$infile" | egrep ^Pages:`
+pages=`printf "%04d" $2`
[skip...]
# Pass the page count value to awk
-awk 'BEGIN'\
+awk -v Pages="$pages" 'BEGIN'\
[skip...]
# Inside the awk program startup section: compute the "meta" field line
+ pagemeta = "<meta name=\"pdfpages\" content=\"" Pages "\">\n"
[skip...]
# Then print it as part of the header:
+ $0 = part1 charsetmeta pagemeta part2
[skip...]
----
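The reason for the `printf "%04d"` step is that the value is sorted as a string. The short Python sketch below is not part of the filter; it only illustrates why the padding matters, with a hypothetical helper `pdfpages_meta` which builds the same kind of meta line as the awk code.

```python
def pad_pages(count):
    """Format a page count like printf "%04d" does."""
    return "%04d" % count


def pdfpages_meta(count):
    """Build the html meta line carrying the padded page count."""
    return '<meta name="pdfpages" content="%s">' % pad_pages(count)


counts = [9, 120, 35]
# Unpadded strings sort alphabetically, which is not the numeric order.
print(sorted(str(c) for c in counts))        # ['120', '35', '9']
# Padded strings sort in numeric order.
print(sorted(pad_pages(c) for c in counts))  # ['0009', '0035', '0120']
print(pdfpages_meta(35))
```

With four digits, the ordering stays correct up to 9999 pages; a wider format would be needed for longer documents.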
|
||||||
|
|
||||||
|
You can execute your own version of rclpdf by modifying '~/.recoll/mimeconf':
|
||||||
|
|
||||||
|
----
|
||||||
|
[index]
|
||||||
|
application/pdf = exec /path/to/my/own/rclpdf
|
||||||
|
----
|
||||||
|
|
||||||
|
At this point, recollindex would receive and extract a +pdfpages+ field,
|
||||||
|
but it would not know what to do with it. We are going to tell it to store
|
||||||
|
the value inside the document data record so that it can be displayed in
|
||||||
|
the results, and sorted on. For this we modify the '~/.recoll/fields' file:
|
||||||
|
|
||||||
|
----
|
||||||
|
[stored]
|
||||||
|
pdfpages=
|
||||||
|
----
|
||||||
|
|
||||||
|
That's it ! After reindexing, you can now display +pdfpages+ inside the
|
||||||
|
result list (add a +%(pdfpages)+ value to the paragraph format), and display
|
||||||
|
+pdfpages+ inside the result table (right-click the table header), and sort
|
||||||
|
the results on page count (click the column header).
|
||||||
|
|
||||||
|
Note that +pdfpages+ has not been defined as searchable (this would not make
|
||||||
|
much sense). For this, you'd have to define a prefix and add it to the
|
||||||
|
[prefixes] fields file section:
|
||||||
|
|
||||||
|
----
|
||||||
|
[prefixes]
|
||||||
|
pdfpages = XYPDFP
|
||||||
|
----
|
||||||
|
|
||||||
|
Have a look at the comments inside the 'fields' file for more information.
|
||||||
13
website/faqsandhowtos/Home.txt
Normal file
@ -0,0 +1,13 @@
|
|||||||
|
== Welcome to the Recoll Faqs and Recipes
|
||||||
|
|
||||||
|
link:FaqsAndHowTos.html[FAQs and Howtos] are stored here, but
|
||||||
|
the main source for Recoll user documentation is
|
||||||
|
link:https://www.recoll.org/doc.html[the _Recoll user manual_] on the
|
||||||
|
link:https://www.recoll.org/[Recoll Web site] where you will also find a
|
||||||
|
lot of other Recoll information, source code tarballs and contact
|
||||||
|
information.
|
||||||
|
|
||||||
|
If you want to make your problem report as useful as possible, you may want
|
||||||
|
to take a look at link:ProblemSolvingData.html[this page].
|
||||||
|
|
||||||
|
link:WikiIndex.html[Full file index]
|
||||||
79
website/faqsandhowtos/HotRecoll.txt
Normal file
@ -0,0 +1,79 @@
|
|||||||
|
== Recoll hotkey: starting / hiding recoll with a keyboard shortcut
|
||||||
|
|
||||||
|
Type a key (e.g. F12) and have recoll appear or disappear. On the first
|
||||||
|
occurrence, recoll is started if it's not already running. Further
|
||||||
|
occurrences toggle recoll between visible and minimized states. Never
|
||||||
|
thought this would be useful until someone asked for it. Can't do without
|
||||||
|
it anymore :)
|
||||||
|
|
||||||
|
This works well with both Gnome and KDE, but is implemented using a gnome
|
||||||
|
library (*libwnck*) and its python interface, which you may have to install
|
||||||
|
on your system if you are a pure KDE user. The library most probably exists
|
||||||
|
in the package repositories for your distribution, so this should not be
|
||||||
|
too complicated.
|
||||||
|
|
||||||
|
This should also work with other window managers, because it is based on a
|
||||||
|
standard window manager interface extension (EWMH) that most modern window
|
||||||
|
managers implement.
|
||||||
|
|
||||||
|
=== Installing the script (all desktops):
|
||||||
|
|
||||||
|
- You will need the libwnck library and its python interface. These are
|
||||||
|
usually part of a gnome installation, otherwise check and possibly
|
||||||
|
install them. For OpenSuse, the library should already be there but you
|
||||||
|
need to install gnome-python-desktop.
|
||||||
|
- Download the
|
||||||
|
link:https://www.recoll.org/files/hotrecoll.py[http://www.recoll.org/files/hotrecoll.py
|
||||||
|
script]. If you have a recent recoll installation (1.14.3 and
|
||||||
|
further), it's already in the recoll filters directory
|
||||||
|
('/usr/[local/]share/recoll/filters')
|
||||||
|
- Copy the script to some permanent place (e.g. '~/bin') and make it
|
||||||
|
executable (you can leave it in the filters dirs if it's there). In a
|
||||||
|
shell window: `chmod +x hotrecoll.py`.
|
||||||
|
- You can check that the script works (or not) by executing it on the
|
||||||
|
command line. It does not need an argument. Recoll should appear or
|
||||||
|
disappear every time you execute the script. A few warning messages may
|
||||||
|
be considered normal. If the script says that it does not find the wnck
|
||||||
|
library or some other module, you'll have to install them.
|
||||||
|
|
||||||
|
=== Installing the keyboard shortcut (Gnome):
|
||||||
|
|
||||||
|
- _System->Preferences->Keyboard shortcuts_, or execute
|
||||||
|
*gnome-keybinding-properties*
|
||||||
|
- Click add, Name, e.g. StartRecoll, Action: /path/to/hotrecoll.py
- This will add the shortcut to the "Custom shortcuts" section. You can
then click in the "Shortcut" column for "StartRecoll", and type any key
combination (e.g. push F12) to assign a key shortcut.
|
||||||
|
|
||||||
|
=== Installing the keyboard shortcut (KDE):
|
||||||
|
|
||||||
|
Under KDE installing a global custom keyboard shortcut like we need is most
|
||||||
|
helpfully not under "Keyboard Shortcuts" but under "Input Actions".
|
||||||
|
|
||||||
|
- _Kmenu -> Configure Desktop -> Input Actions -> Edit -> New -> Global
|
||||||
|
Shortcut -> Command/Url_
|
||||||
|
- A new Action appears, named _New Action_. You can rename it something
|
||||||
|
like +hotrecoll+ for clarity.
|
||||||
|
- Click the _Trigger_ tab, click the input area and press your preferred
|
||||||
|
key combination (e.g. F12)
|
||||||
|
- Click the _Action_ tab, and enter +hotrecoll.py+ (if it's in your PATH),
|
||||||
|
or else the full path to the command (e.g.:
|
||||||
|
'/usr/share/recoll/filters/hotrecoll.py').
|
||||||
|
- Click _Apply_.
|
||||||
|
|
||||||
|
=== Installing the keyboard shortcut (XFCE):
|
||||||
|
|
||||||
|
Open the settings manager, and add the shortcut in the
|
||||||
|
_Application Shortcuts_ panel inside the _Keyboard_ tool.
|
||||||
|
|
||||||
|
|
||||||
|
=== Other environments
|
||||||
|
|
||||||
|
Many window managers have a way to set up a keyboard shortcut for running
|
||||||
|
an arbitrary command. You'll need to look at the documentation for yours,
|
||||||
|
or search the web for a solution.
|
||||||
|
|
||||||
|
An alternative independent of the environment would be to use the XBindKeys
|
||||||
|
utility. See this link:http://www.linux.com/archive/feed/59494[linux.com
|
||||||
|
article] for helpful instructions.
|
||||||
|
|
||||||
33
website/faqsandhowtos/IndexMailHeader.txt
Normal file
@ -0,0 +1,33 @@
|
|||||||
|
== Indexing arbitrary mail headers
|
||||||
|
|
||||||
|
By default the Recoll mail handler only processes a subset of email headers
|
||||||
|
(+From+, +To+, +Cc+, +Date+, +Subject+). It is possible to index additional
|
||||||
|
headers by specifying them inside the 'fields' configuration file, inside
|
||||||
|
the configuration directory (typically '~/.recoll/').
|
||||||
|
|
||||||
|
Lengthy explanations are not really needed here, and I'll just show an
|
||||||
|
example (duplicated from the configuration section of the manual):
|
||||||
|
|
||||||
|
----
[prefixes]
# Index mailmytag contents (with the given prefix)
mailmytag = XMTAG

[stored]
# Store mailmytag inside the document data record (so that it can be
# displayed - as %(mailmytag) - in result lists).
mailmytag =

[mail]
# Extract the X-My-Tag mail header, and use it internally with the
# mailmytag field name
x-my-tag = mailmytag

----
|
||||||
|
|
||||||
|
Limitations:
|
||||||
|
|
||||||
|
- The mail filter will only process the first instance for a header
|
||||||
|
occurring several times.
|
||||||
|
- No decoding will take place (e.g. for non-ascii headers which would have
some kind of encoding).
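Both limitations can be pictured with Python's standard `email` module. This is only an analogy, not the Recoll mail handler code, and the message below is made up for the example: reading a repeated header returns its first instance, and an RFC 2047 encoded value stays in its raw form unless it is decoded explicitly.

```python
import email
from email.header import decode_header, make_header

RAW = (
    "From: someone@example.com\n"
    "Subject: test\n"
    "X-My-Tag: first\n"
    "X-My-Tag: second\n"
    "X-Other-Tag: =?utf-8?q?caf=C3=A9?=\n"
    "\n"
    "body\n"
)

msg = email.message_from_string(RAW)

# Header names are case-insensitive; only the first instance is returned.
print(msg["x-my-tag"])          # first
# All instances are still available when asked for explicitly.
print(msg.get_all("x-my-tag"))  # ['first', 'second']
# Without decoding, the encoded form is what a consumer sees.
print(msg["x-other-tag"])       # =?utf-8?q?caf=C3=A9?=
# Explicit decoding recovers the readable text ("caf" followed by e-acute).
decoded = str(make_header(decode_header(msg["x-other-tag"])))
```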
|
||||||
32
website/faqsandhowtos/IndexMozillaCalendari.txt
Normal file
@ -0,0 +1,32 @@
|
|||||||
|
== Indexing Mozilla calendar data
|
||||||
|
|
||||||
|
Mozilla calendar programs (*Sunbird*, *Lightning*) do not store their
|
||||||
|
data in +ics+ files natively. They use an *SQLite* database (the
|
||||||
|
'storage.sdb' file inside the profile). This means that calendar data
|
||||||
|
cannot be indexed directly.
|
||||||
|
|
||||||
|
To get Recoll to index calendar data, you need to export it to an +ics+
|
||||||
|
file. This can be done manually, from the application menus, or, by
|
||||||
|
installing the
|
||||||
|
link:https://addons.mozilla.org/en-US/sunbird/addon/3740[Automatic Export
|
||||||
|
extension].
|
||||||
|
|
||||||
|
The extension can be configured to export the data when exiting the
|
||||||
|
program, or at regular time intervals. You can even set up a command to be
|
||||||
|
executed after the export. If you are not using real time indexing, this
|
||||||
|
can usefully be *recollindex*.
|
||||||
|
|
||||||
|
In _Tools->Add Ons->Automatic Export preferences_, in the _Start an
|
||||||
|
application after export_ subpanel, set _Path of application_ to
|
||||||
|
'/usr/[local/]bin/recollindex' and _Parameters of application_ to
|
||||||
|
something like _-i;/home/me/path/to/nameofexportedcal.ics_
|
||||||
|
|
||||||
|
This will ensure that the calendar is indexed every time it is exported
|
||||||
|
(this is not necessary though, you can let the next batch indexing pass
|
||||||
|
take care of it).
|
||||||
|
|
||||||
|
It may happen that the exported data has some syntax errors which will
|
||||||
|
prevent indexing with the *rclics* filter which was distributed up to
|
||||||
|
Recoll 1.13.04 (included). You may get an updated filter from the
|
||||||
|
link:https://www.recoll.org/download.html[Recoll download page].
|
||||||
|
|
||||||
24
website/faqsandhowtos/IndexOnAc.txt
Normal file
@ -0,0 +1,24 @@
|
|||||||
|
== Laptops: starting or stopping indexing according to AC power status
|
||||||
|
|
||||||
|
For people using real time indexing on a laptop, kind user "The Doctor"
|
||||||
|
contributed a script to automatically start and stop indexing according to
|
||||||
|
power status. The script can be found here:
|
||||||
|
link:https://bitbucket.org/medoc/recoll/src/tip/src/desktop/recoll_index_on_ac.sh[recoll_index_on_ac.sh]
|
||||||
|
|
||||||
|
To use it, you need to copy it somewhere (e.g.: '/usr/bin', but any place
|
||||||
|
will do), make it executable (`chmod a+x recoll_index_on_ac.sh`), and edit
|
||||||
|
'~/.config/autostart/recollindex.desktop'
|
||||||
|
|
||||||
|
Change the following line:
|
||||||
|
|
||||||
|
Exec=recollindex -w 60 -m
|
||||||
|
|
||||||
|
to something like the following (depending where you copied the script):
|
||||||
|
|
||||||
|
Exec=/usr/bin/recoll_index_on_ac.sh
|
||||||
|
|
||||||
|
You may also want to change
|
||||||
|
'/usr/share/recoll/examples/recollindex.desktop', otherwise your change
|
||||||
|
will be reverted the next time you toggle real time indexing through the
|
||||||
|
GUI. And, yes, sorry about it, _this_ change will be lost on the next
|
||||||
|
Recoll update, so save a copy.
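For readers curious about the principle, the power status test at the heart of such a script can be sketched as follows. This is not the contributed script: it assumes a Linux system exposing power supplies under '/sys/class/power_supply', and the function name `on_ac_power` is hypothetical.

```python
import os


def on_ac_power(supply_dir="/sys/class/power_supply"):
    """Return True if a mains power supply reports being online.

    Assumption: each supply is a subdirectory with a 'type' file
    (e.g. "Mains" or "Battery") and, for mains supplies, an 'online'
    file containing "1" when plugged in.
    """
    def read(path):
        try:
            with open(path) as f:
                return f.read().strip()
        except OSError:
            return ""

    try:
        names = os.listdir(supply_dir)
    except OSError:
        return False
    for name in names:
        base = os.path.join(supply_dir, name)
        if (read(os.path.join(base, "type")) == "Mains"
                and read(os.path.join(base, "online")) == "1"):
            return True
    return False
```

A wrapper would then start `recollindex -w 60 -m` when this returns true, and stop it otherwise.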
|
||||||
11
website/faqsandhowtos/IndexOutlook.txt
Normal file
@ -0,0 +1,11 @@
|
|||||||
|
== Indexing Outlook archives ==
|
||||||
|
|
||||||
|
Recoll has no direct support for indexing Microsoft Outlook data, because,
|
||||||
|
if you are a Windows user, you probably are not a good customer for Linux
|
||||||
|
desktop indexing...
|
||||||
|
|
||||||
|
However, if you have a need to index Outlook data at some point, I can
|
||||||
|
recommend the excellent link:http://www.five-ten-sg.com/libpst/[libpst]
|
||||||
|
library and its link:http://www.five-ten-sg.com/libpst/rn01re01.html[readpst]
|
||||||
|
utility. Using this you can very easily convert the Outlook data into MH or
|
||||||
|
mbox format, and then index the result with Recoll.
|
||||||
29
website/faqsandhowtos/IndexWebHistory.txt
Normal file
@ -0,0 +1,29 @@
|
|||||||
|
== Indexing Web history with the Firefox extension ==
|
||||||
|
|
||||||
|
Note: this document is valid for Recoll versions from 1.18.
|
||||||
|
|
||||||
|
The link:http://sourceforge.net/projects/recollfirefox/[Recoll Firefox
|
||||||
|
extension]
|
||||||
|
works together with Recoll to index the Web pages that you visit. The
|
||||||
|
extension is based on an older one which was initially written for the
|
||||||
|
Beagle indexer.
|
||||||
|
|
||||||
|
The extension works by copying the data for the visited pages to a queue
|
||||||
|
directory ('~/.recollweb/ToIndex' by default), from which they are
|
||||||
|
indexed and removed by Recoll, and then stored in a local cache.
|
||||||
|
|
||||||
|
The extension is now hosted on the Mozilla add-ons site, so you can install
|
||||||
|
it very simply in Firefox: link:https://addons.mozilla.org/fr/firefox/addon/recoll-indexer-1/[Recoll Firefox add-on page].
|
||||||
|
|
||||||
|
This feature can be enabled in the Recoll GUI index configuration panel
|
||||||
|
(Web history section), or by editing the configuration file (set
|
||||||
|
+processwebqueue+ to 1).
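
If you prefer to script the configuration change, something like the following could be used. This is only a sketch: the +set_recoll_param+ name is made up for the example, and it assumes a simple 'recoll.conf' without per-directory sections setting the same parameter.

```shell
#!/bin/sh
# Set "name = value" at the top of a Recoll configuration file, replacing
# any existing assignment of the same name.
#   $1: configuration file, $2: parameter name, $3: value
set_recoll_param() {
    file=$1 name=$2 value=$3
    touch "$file" || return 1
    tmp=$(mktemp) || return 1
    {
        echo "$name = $value"
        # Keep everything except previous assignments of this parameter
        grep -v "^[[:space:]]*$name[[:space:]]*=" "$file" || true
    } > "$tmp"
    cat "$tmp" > "$file" && rm -f "$tmp"
}

# set_recoll_param ~/.recoll/recoll.conf processwebqueue 1
```

The new assignment is written first so that it lands in the global part of the file, before any section header.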
|
||||||
|
|
||||||
|
Please remember that Recoll only stores a limited amount of cached web data
|
||||||
|
(adjustable from the GUI Index Configuration section), and that old pages
|
||||||
|
will be purged from the index. Pages that you want to archive permanently
|
||||||
|
need to be saved elsewhere, as they will otherwise eventually disappear
|
||||||
|
from the Recoll results.
|
||||||
|
|
||||||
|
Recoll will index +.maff+ files, which may be a better choice for archival
|
||||||
|
usage.
|
||||||
9
website/faqsandhowtos/Makefile
Normal file
@ -0,0 +1,9 @@
|
|||||||
|
.SUFFIXES: .txt .html
|
||||||
|
|
||||||
|
.txt.html:
|
||||||
|
	asciidoc $<
|
||||||
|
|
||||||
|
all: $(addsuffix .html,$(basename $(wildcard *.txt)))
|
||||||
|
|
||||||
|
clean:
|
||||||
|
	rm -f *.html
|
||||||
96
website/faqsandhowtos/MultipleIndexes.txt
Normal file
@ -0,0 +1,96 @@
|
|||||||
|
== Creating and using multiple indexes
|
||||||
|
|
||||||
|
=== Why would you want to do this?
|
||||||
|
|
||||||
|
- Easy adjustment of search areas: you can filter results by using the
|
||||||
|
directory filter in the advanced search panel, but, if you have
|
||||||
|
separate well defined places where you store different kinds of data,
it is easier to maintain separate indexes and use the External indexes
|
||||||
|
dialog to switch them on or off, and it will also yield much better
|
||||||
|
search performance.
|
||||||
|
- Shared indexes: it may be useful to maintain one or several indexes
|
||||||
|
for shared data, and separate personal indexes for each user. Indexes
|
||||||
|
can be shared over the network.
|
||||||
|
- Creating separate indexes for removable volumes.
|
||||||
|
|
||||||
|
=== How to do it
|
||||||
|
|
||||||
|
As an example we'll suppose that you have Recoll installed and indexing
|
||||||
|
your home directory, and that you would like to have a separate index for
|
||||||
|
'/usr/share/doc'.
|
||||||
|
|
||||||
|
You need to create a separate configuration for the new index, then add it
|
||||||
|
to the external indexes list in the user interface, and activate it as
|
||||||
|
needed.
|
||||||
|
|
||||||
|
. Create a directory for the new index, and create an empty configuration
|
||||||
|
file
|
||||||
|
+
|
||||||
|
----
|
||||||
|
cd
|
||||||
|
mkdir .recoll-sharedoc
|
||||||
|
touch .recoll-sharedoc/recoll.conf
|
||||||
|
----
|
||||||
|
. Either edit the new configuration by hand or start recoll to use the GUI
|
||||||
|
configuration editor.
|
||||||
|
+
|
||||||
|
----
|
||||||
|
cd .recoll-sharedoc
|
||||||
|
echo "topdirs = /usr/share/doc" > recoll.conf
|
||||||
|
# OR
|
||||||
|
recoll -c ~/.recoll-sharedoc
|
||||||
|
----
|
||||||
|
+
|
||||||
|
If using the GUI, click _Cancel_ when asked, to start the configuration
|
||||||
|
editor.
|
||||||
|
|
||||||
|
. Perform initial indexing. If you chose the GUI route, indexing will
|
||||||
|
start as soon as you leave the configuration editor. Else, on the
|
||||||
|
command line:
|
||||||
|
+
|
||||||
|
----
|
||||||
|
recollindex -c ~/.recoll-sharedoc
|
||||||
|
----
|
||||||
|
. Optionally set up *cron* to perform nightly indexing: use +crontab -e+
|
||||||
|
and insert a line like the following:
|
||||||
|
+
|
||||||
|
----
|
||||||
|
45 20 * * * recollindex -c ~/.recoll-sharedoc
|
||||||
|
----
|
||||||
|
+
|
||||||
|
This would start the indexing at 20:45. `crontab -e` will use the *vi*
editor by default; you can change this by setting the EDITOR
environment variable. Example: `EDITOR=kate crontab -e`
|
||||||
|
Your favorite desktop may also have a dedicated tool to add crontab entries.
|
||||||
|
|
||||||
|
. Start recoll and choose the _Preferences->External index dialog_ menu
entry, then click the Browse button (near the bottom), and select the
new index Xapian database directory '~/.recoll-sharedoc/xapiandb'.
Then click _Add index_.
|
||||||
|
|
||||||
|
. You can then activate or deactivate the new index by clicking the box
|
||||||
|
in front of the directory name in the list.
|
||||||
|
|
||||||
|
When adding an index shared by multiple users, it may be helpful to use the
|
||||||
|
RECOLL_EXTRA_DBS environment variable instead of editing individual
|
||||||
|
configurations, see the manual for more details.
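
For illustration, the variable could be set from a shared login script as sketched below. This assumes that RECOLL_EXTRA_DBS takes a colon-separated list of Xapian database directories (check the manual for your version). The +join_dbs+ helper and the '/srv/recoll' paths are hypothetical.

```shell
#!/bin/sh
# Join Xapian database directories into a colon-separated list.
join_dbs() {
    list=
    for db in "$@"; do
        list="${list:+$list:}$db"
    done
    echo "$list"
}

# Hypothetical shared index locations
RECOLL_EXTRA_DBS=$(join_dbs /srv/recoll/docs/xapiandb /srv/recoll/projects/xapiandb)
export RECOLL_EXTRA_DBS
echo "$RECOLL_EXTRA_DBS"
```

Setting the variable in one place means that a new shared index only has to be declared once, instead of in every user's configuration.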
|
||||||
|
|
||||||
|
=== Path adjustments
|
||||||
|
|
||||||
|
When sharing indexes over a network, in most cases, the indexed data will
|
||||||
|
be accessible through different paths on the different hosts. This will
|
||||||
|
prevent the Preview and Open functions from working because the paths they get
|
||||||
|
from the index do not match the ones which are usable from the local
|
||||||
|
host.
|
||||||
|
|
||||||
|
For example my home directory is accessed as '/home/me' on my home
|
||||||
|
machine, and as '/net/myhost/home/me' on other hosts. By default, trying
|
||||||
|
to access a result from a remote host would use the first path, when the
|
||||||
|
second is the one that would work.
|
||||||
|
|
||||||
|
As of release 1.19 **Recoll** has a facility to perform index-dependent
path translations. This facility is accessible from the _external index
dialog_ in the GUI preferences. Path translations can be set for the main
|
||||||
|
index if no index is selected (rarely useful), or for the selected
|
||||||
|
additional index.
|
||||||
|
|
||||||
77
website/faqsandhowtos/MuttAndRecoll.txt
Normal file
@ -0,0 +1,77 @@
|
|||||||
|
== Interfacing Recoll and Mutt
|
||||||
|
|
||||||
|
It is possible to either use Mutt as a Recoll search result viewer, or
|
||||||
|
start Recoll from the Mutt search.
|
||||||
|
|
||||||
|
=== Starting Mutt to view Recoll search results
|
||||||
|
|
||||||
|
This method and the associated
|
||||||
|
link:http://www.recoll.org/files/recoll2mutt[recoll2mutt script] were kindly
|
||||||
|
contributed by Morten Langlo.
|
||||||
|
|
||||||
|
This allows finding mail messages in recoll and then calling *mutt*
|
||||||
|
or *mutt-kz* to read or process the mail.
|
||||||
|
|
||||||
|
Installation:
|
||||||
|
|
||||||
|
- Copy the link:http://www.recoll.org/files/recoll2mutt[recoll2mutt script]
|
||||||
|
somewhere in your PATH, and make it executable.
|
||||||
|
- In the **recoll** GUI menus:
|
||||||
|
_Preferences->GUI configuration->User interface->Choose editor applications_
|
||||||
|
change the entry for "message/rfc822" to: +recoll2mutt %f+
|
||||||
|
|
||||||
|
The script has options for setting a number of parameters. You may not need
to set any of them. The defaults are:
|
||||||
|
|
||||||
|
- -c mutt
|
||||||
|
- -F .muttrc
|
||||||
|
- -m Mail
|
||||||
|
- -x "-fn 10*20 -geometry 115x40"
|
||||||
|
|
||||||
|
Example:
|
||||||
|
|
||||||
|
----
|
||||||
|
recoll2mutt -c mutt-kz -F .mutt_kzrc -m Mail -x "-fn 10*20 -geometry 115x40" %f
|
||||||
|
----
|
||||||
|
|
||||||
|
The option +-x+ is passed to *xterm*, which is used to call *mutt* or
|
||||||
|
*mutt-kz*.
|
||||||
|
|
||||||
|
The script works for both _mbox_ and _maildir_ mail boxes, and it
|
||||||
|
expects the configuration file for mutt and the mail directory to reside in
|
||||||
|
your $HOME and the spool file to be '/var/spool/mail/$USER' if it is
|
||||||
|
not in your mail directory. But it is easy to change the values in the
|
||||||
|
script if you need to.
|
||||||
|
|
||||||
|
*mutt* is opened with the right mailbox and the limit set to _Date_ and
_Sender_. In theory you could set the limit to _Message-Id_, but *mutt* very
often reports invalid patterns in _Message-Id_, so the script plays it
safe, even though this shows all emails in the opened mailbox with the same
date from the sender.
|
||||||
|
|
||||||
|
|
||||||
|
=== Starting Recoll from the Mutt search
|
||||||
|
|
||||||
|
This will work only when using maildir storage (messages in individual
|
||||||
|
files). It will not work with mailbox files. The latter would probably be
|
||||||
|
possible by extracting the individual result messages using the Python
|
||||||
|
interface, but I did not try.
|
||||||
|
|
||||||
|
The classic way to interface Mutt and a search application is to create a
|
||||||
|
shortcut to an external command which creates a temporary Maildir
|
||||||
|
containing the search results.
|
||||||
|
|
||||||
|
Such a script exists for Recoll. You will find it link:https://bitbucket.org/medoc/recoll/raw/41d41799dbac4c69a34db985b3ab9f1597c9c742/src/python/samples/mutt-recoll.py[here].
|
||||||
|
|
||||||
|
Copy the script somewhere in your PATH, and make it executable, then add
|
||||||
|
the following line to your '.muttrc':
|
||||||
|
|
||||||
|
|
||||||
|
----
|
||||||
|
|
||||||
|
macro index S "<enter-command>unset wait_key<enter><shell-escape>mutt-recoll.py -G<enter><change-folder-readonly>~/.cache/mutt_results<enter>" \
|
||||||
|
"search mail (using recoll)"
|
||||||
|
|
||||||
|
----
|
||||||
|
|
||||||
|
Obviously, you can replace the 'S' letter with whatever will suit you (e.g. '/').
|
||||||
85
website/faqsandhowtos/NonAsciiFileNames.txt
Normal file
@ -0,0 +1,85 @@
|
|||||||
|
== Unix and non-ASCII file names, a summary of issues
|
||||||
|
|
||||||
|
Unix/Linux file and directory names are binary byte C strings. Only the
|
||||||
|
null byte and the slash character (/) are forbidden inside a name;
|
||||||
|
nowhere does the kernel interpret the strings as meaningful or
|
||||||
|
printable.
|
||||||
|
|
||||||
|
In the old days, all utilities that would display to the user were
ASCII-based, and people would use pure printable ASCII file names (even
using space characters inside names was a cause for trouble).
Non-alphanumeric characters were exclusively used for playing tricks on
colleagues. And all was well.
|
||||||
|
|
||||||
|
Then the devil came under the guise of accented 8 bit characters. The
|
||||||
|
system has no problem with them, file names are still binary C strings, but
|
||||||
|
the utilities have to display them or take them as input, and, because
|
||||||
|
there is no encoding specification stored with the file names, they can
|
||||||
|
only do this according to the character encoding taken from the user's
|
||||||
|
current locale.
|
||||||
|
|
||||||
|
For example fr_FR.UTF-8, and fr_FR.ISO8859-1 could be used simultaneously
|
||||||
|
on the same system (by different users), but they are completely
|
||||||
|
incompatible: ISO-8859-1 strings are illegal when viewed in a UTF-8 locale
(will display as question marks or some other conventional error
|
||||||
|
marker). UTF-8 strings will display as gibberish in an ISO-8859-1 locale.
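
The mismatch is easy to see at the byte level. The following sketch prints the bytes of the same four-letter name in both encodings. The +hexbytes+ helper is introduced here for the example only, and *iconv* is assumed to be installed.

```shell
#!/bin/sh
# Print the bytes read on stdin as one hexadecimal string.
hexbytes() {
    od -An -tx1 | tr -d ' \n'
    echo
}

# "café" encoded in ISO-8859-1: the accented letter is the single byte 0xE9
printf 'caf\351' | hexbytes

# The same name converted to UTF-8: the accented letter becomes 0xC3 0xA9
printf 'caf\351' | iconv -f ISO-8859-1 -t UTF-8 | hexbytes
```

The first command prints `636166e9` and the second `636166c3a9`. As far as the kernel is concerned these are two different file names, and neither carries any indication of which encoding produced it.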
|
||||||
|
|
||||||
|
This means that the file names created by a UTF-8 user are displayed as
garbage to the ISO-8859-1 one...
|
||||||
|
|
||||||
|
If you ever change your locale, your old files are still there and named
|
||||||
|
the same (in the binary sense), but the names display badly and you have
|
||||||
|
great trouble inputting them. If you add distributed (NFS) file system
|
||||||
|
issues, things become totally unmanageable. Also think about archives sent
|
||||||
|
from another system with a different encoding.
|
||||||
|
|
||||||
|
As far as Recoll is concerned:
|
||||||
|
|
||||||
|
- The file names inside recoll.conf are not transcoded, they are taken as
|
||||||
|
binary strings (mostly, only +\n+ and +space+ are a bit special), and
|
||||||
|
passed as is to the system. So if you edit 'recoll.conf' with a text
|
||||||
|
editor, inside the same locale that is or has been used for file names,
|
||||||
|
you'll be fine.
|
||||||
|
- There was a bug in the GUI configuration tool up to 1.12: it should have
transcoded between the internal Qt format and locale-dependent strings,
but it did not, or did it badly.
|
||||||
|
- There is also an exception for the +unac_except_trans+ variable, this
|
||||||
|
*has* to be UTF-8, so if the rest of the file uses another encoding,
|
||||||
|
you'll need to edit two separate files and concatenate them.
|
||||||
|
|
||||||
|
As of version 1.13, Recoll uses local8Bit()/fromLocal8Bit() to convert
|
||||||
|
recoll.conf file names from/to QStrings (it uses UTF-8 for all string
|
||||||
|
values which are not file names).
|
||||||
|
|
||||||
|
The Qt file dialog is broken (at least was, I have not checked this on
|
||||||
|
recent versions). It should consider file paths as almost-binary data, not
|
||||||
|
QStrings, but doesn't. In consequence, things are even more broken than
|
||||||
|
necessary as seen from there:
|
||||||
|
|
||||||
|
With LANG="C", non-ASCII paths can't be used at all:
|
||||||
|
|
||||||
|
- Strings read from recoll.conf are stripped of 8bit characters before display.
|
||||||
|
- Directory entries with 8bit characters are not displayed at all in the
|
||||||
|
selection dialog.
|
||||||
|
|
||||||
|
With LANG="fr_FR.UTF-8", only UTF-8 paths can be used:
|
||||||
|
|
||||||
|
- Strings read from recoll.conf are damaged when converted to QString
|
||||||
|
(except those that were actually UTF-8)
|
||||||
|
- Only the UTF-8 directory entries are displayed in the selection dialog.
|
||||||
|
|
||||||
|
|
||||||
|
With LANG="fr_FR.iso8859-1", everything works:
|
||||||
|
|
||||||
|
- Strings read from recoll.conf are displayed with weird characters if
|
||||||
|
they use another encoding such as UTF-8, but are correctly maintained
|
||||||
|
and can be read back from the dialogs and rewritten without damage.
|
||||||
|
- Directory entries with 8 bit characters are displayed weirdly (normal),
|
||||||
|
but can be manipulated without trouble (this includes utf-8 names of
|
||||||
|
course).
|
||||||
|
|
||||||
|
In conclusion, only the iso-8859 locales can be used for handling mixed
|
||||||
|
encoding situations. This is a possible workaround for people who need it.
|
||||||
|
|
||||||
|
More data about path encoding issues:
|
||||||
|
http://www.dwheeler.com/essays/fixing-unix-linux-filenames.html
|
||||||
71
website/faqsandhowtos/OpenHelperScript.txt
Normal file
@ -0,0 +1,71 @@
|
|||||||
|
== Starting native applications
|
||||||
|
|
||||||
|
It is sometimes difficult to start a native application on a result
|
||||||
|
document, especially when the result comes from a container file (e.g. an email
folder file or a chm file).
|
||||||
|
|
||||||
|
The problem is that native applications usually expect at most a file name
|
||||||
|
on the command line, and sometimes not even that (emailers).
|
||||||
|
|
||||||
|
The _Open parent documents_ link in the result list right click menu is
|
||||||
|
sometimes useful in this situation (e.g.: +chm+ files).
|
||||||
|
|
||||||
|
In some other cases it may help that Recoll does make a lot of data
|
||||||
|
available to the application. This data may have to be pre-processed in a
|
||||||
|
script before calling the actual application.
|
||||||
|
|
||||||
|
Details about configuring how the native application or script is called
|
||||||
|
are given with the
|
||||||
|
link:http://www.recoll.org/usermanual/usermanual.html#RCL.INSTALL.CONFIG.MIMEVIEW[description of the mimeview configuration file]
|
||||||
|
|
||||||
|
Information about
|
||||||
|
link:http://www.recoll.org/usermanual/usermanual.html#RCL.INSTALL.CONFIG.FIELDS[configuring
|
||||||
|
customised fields] may also be useful in combination.
|
||||||
|
|
||||||
|
=== Example
|
||||||
|
|
||||||
|
This is a simple example, because it does not need to use special
|
||||||
|
fields. It just shows how to solve a simple issue by using an intermediary
|
||||||
|
script. The problem is that thunderbird's +-file+ option
won't open a file if the extension is not '.eml'. Jorge, the kind Recoll
user who supplied the example, stores his email in Maildir++ format, where
file names have no extension, so an intermediary script is necessary to get
thunderbird to open them.
|
||||||
|
|
||||||
|
Note that this only works with messages stored in Maildir or MH format (one
|
||||||
|
message per file). As far as I know, there is no way to get Thunderbird to
|
||||||
|
open an arbitrary mbox file.
|
||||||
|
|
||||||
|
The 'recoll-thunderbird-open-file' script:
|
||||||
|
|
||||||
|
----
|
||||||
|
#!/bin/sh
|
||||||
|
cp "$1" /tmp/$$.eml
|
||||||
|
thunderbird -file /tmp/$$.eml
|
||||||
|
----
|
||||||
|
|
||||||
|
Create the file in an editor, save it somewhere, and make it executable
|
||||||
|
(`chmod +x recoll-thunderbird-open-file`).
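
A possible variant, shown here only as a sketch and not as the script Jorge supplied, copies the message into a private temporary directory instead of using a predictable name in '/tmp'. The +make_eml_copy+ function name is an invention of this example.

```shell
#!/bin/sh
# Copy a message file to a private temporary directory, with the ".eml"
# extension that Thunderbird expects, and print the path of the copy.
make_eml_copy() {
    dir=$(mktemp -d) || return 1
    cp -- "$1" "$dir/message.eml" || return 1
    echo "$dir/message.eml"
}

# Only start Thunderbird when called with an existing file as argument
if [ $# -ge 1 ] && [ -f "$1" ]; then
    eml=$(make_eml_copy "$1") && thunderbird -file "$eml"
fi
```

The temporary copies are not cleaned up automatically, in this variant or in the original script, so you may want to purge them from time to time.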
|
||||||
|
|
||||||
|
The mail line in the '~/.recoll/mimeview' file:
|
||||||
|
|
||||||
|
----
|
||||||
|
[view]
|
||||||
|
message/rfc822 = recoll-thunderbird-open-file %f
|
||||||
|
----
|
||||||
|
|
||||||
|
If the place where you saved the script is not in your PATH, you will need
|
||||||
|
to use the full path instead of just the script name, as in
|
||||||
|
|
||||||
|
----
|
||||||
|
[view]
|
||||||
|
message/rfc822 = /home/me/somewhere/recoll-thunderbird-open-file %f
|
||||||
|
----
|
||||||
|
|
||||||
|
You should then be able to open the messages in Thunderbird, which is
|
||||||
|
useful, for example, to handle the attachments.
|
||||||
|
|
||||||
|
With recent Recoll versions, if using the normal option of letting the
|
||||||
|
Desktop choose the _Open_ application to use (_Use Desktop default_),
|
||||||
|
you should also add +message/rfc822+ to the exceptions, and the whole
|
||||||
|
thing is probably more easily done from the Recoll GUI.
|
||||||
27
website/faqsandhowtos/PreventIndexingDir.txt
Normal file
@ -0,0 +1,27 @@
|
|||||||
|
== Preventing indexing in a directory
|
||||||
|
|
||||||
|
=== Why would you want to do this?
|
||||||
|
|
||||||
|
By default, recollindex (or the indexing thread inside the recoll Qt user
interface) will process your home directory and most of its subdirectories,
with the exception of some well known places (thumbnails, beagle and web
browser caches, etc.).
|
||||||
|
|
||||||
|
You may want to prevent indexing in some directories where you don't expect
|
||||||
|
interesting search results. This will avoid polluting the search result
|
||||||
|
lists, speed up indexing times and make the index smaller.
|
||||||
|
|
||||||
|
=== How to do it
|
||||||
|
|
||||||
|
There are two ways to block indexing at certain points: either by listing
|
||||||
|
specific paths, or by directory name pattern matches.
|
||||||
|
|
||||||
|
- Blocking specific paths: this is controlled by the skippedPaths variable
|
||||||
|
in the main configuration file. You can adjust the value either by
|
||||||
|
editing the file or by using the indexing configuration dialog:
|
||||||
|
_Preferences->Indexing configuration->Global parameters->Skipped paths_
|
||||||
|
- Using pattern matches: these are listed in the skippedNames variable in
|
||||||
|
the main configuration file. You can adjust the value either by editing
|
||||||
|
the file or by using the GUI: _Preferences->Indexing configuration->Local
|
||||||
|
parameters->Skipped names_
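
When editing the file by hand, the two variables could look like the fragment written by the sketch below. The paths and patterns are hypothetical, and the example assumes a configuration file without per-directory sections, since the lines are simply appended. Also note that a +skippedNames+ value set in your own file is expected to replace the default list rather than extend it, so you may want to copy the defaults first.

```shell
#!/bin/sh
# Append example exclusions to a Recoll configuration file.
#   $1: configuration file (e.g. ~/.recoll/recoll.conf)
add_example_skips() {
    cat >> "$1" <<'EOF'
# Do not descend into these specific locations (hypothetical paths)
skippedPaths = /home/me/tmp /home/me/Downloads/iso
# Skip any file or directory whose name matches one of these patterns
skippedNames = *~ *.o .git node_modules
EOF
}

# add_example_skips ~/.recoll/recoll.conf
```

Use +skippedPaths+ for a few well identified directories, and +skippedNames+ for names which can appear anywhere in the tree.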
|
||||||
|
|
||||||
157
website/faqsandhowtos/ProblemSolvingData.txt
Normal file
@ -0,0 +1,157 @@
|
|||||||
|
== Gathering useful data for asking help about or reporting a Recoll issue
|
||||||
|
|
||||||
|
Once in a while it will happen that a Recoll program will either signal an
|
||||||
|
error, or even crash (either the *recoll* graphical interface or the
|
||||||
|
*recollindex* command line indexing command).
|
||||||
|
|
||||||
|
Reporting errors and crashes is very useful. It can help others, and it can
|
||||||
|
get your own problem solved.
|
||||||
|
|
||||||
|
Any problem report should include the exact Recoll and system versions.
|
||||||
|
|
||||||
|
If at all possible, reading the following and performing part of the
|
||||||
|
suggested steps will be useful. This is not a condition for obtaining help
|
||||||
|
though! If you have any problem and have difficulty with the following,
|
||||||
|
just contact the mailing list or the developers (see contacts on
|
||||||
|
link:https://www.recoll.org/support.html[the Recoll site support page]).
|
||||||
|
|
||||||
|
If the problem concerns indexing, and was initially found using the
|
||||||
|
*recoll* GUI, you should try to reproduce it using the
|
||||||
|
*recollindex* command-line indexer, which is much simpler and easier to
|
||||||
|
debug.
|
||||||
|
|
||||||
|
There are then two sources of useful information to diagnose the issue: the
|
||||||
|
debug log file and, possibly, in case of a crash, a stack trace.
|
||||||
|
|
||||||
|
Crash and other problem reports are of very high value to me, and I am
|
||||||
|
willing to help you with any of the steps described below if it is not
|
||||||
|
familiar to you. I do realize that not everybody is a programmer or a
|
||||||
|
system administrator.
|
||||||
|
|
||||||
|
=== Obtaining information from the log file
|
||||||
|
|
||||||
|
All Recoll commands write a varying amount of information to a common log file.
|
||||||
|
|
||||||
|
_All commands use the same log, and the file is reset every time a command
|
||||||
|
is started: so it is important to make a copy right after the problem
|
||||||
|
occurs (for example, do not start *recoll* after a *recollindex*
|
||||||
|
crash, this would reset the log). A workaround for this issue is to let the
|
||||||
|
messages go to the default +stderr+, and redirect this._
|
||||||
|
|
||||||
|
By default, the messages are output to +stderr+, and you probably don't even
|
||||||
|
see them if Recoll is started from the desktop. In this case, you need to
|
||||||
|
set the parameters so that output goes to a file, and the appropriate
|
||||||
|
verbosity level is set. When using the command-line, you may actually
|
||||||
|
prefer to redirect stderr to avoid the log-truncating issue described
|
||||||
|
above.
|
||||||
|
|
||||||
|
You can set the log parameters from the GUI _Indexing parameters_
|
||||||
|
section or by editing the '~/.recoll/recoll.conf' file: set the
|
||||||
|
+loglevel+ and +logfilename+ parameters. E.g.:
|
||||||
|
|
||||||
|
----
|
||||||
|
loglevel = 6
|
||||||
|
logfilename = /tmp/recolltrace
|
||||||
|
----
|
||||||
|
|
||||||
|
The log file can become very big if you need a big indexing run to
|
||||||
|
reproduce the problem. Choose a file system with enough space available
|
||||||
|
(possibly a few gigabytes).
|
||||||
|
|
||||||
|
Then run the sequence that leads to the problem, and make a copy of the log
|
||||||
|
file just after. If the log is too big, it will usually be sufficient to
|
||||||
|
use the last 500 lines or so (`tail -n 500`).
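
Since the log is reset whenever another Recoll command starts, it can help to save the interesting part right away. A minimal sketch follows; the +save_log_tail+ name and the file names are only examples.

```shell
#!/bin/sh
# Keep the last lines of a log file in a copy which later Recoll
# commands will not overwrite.
#   $1: log file, $2: destination, $3: number of lines (default 500)
save_log_tail() {
    tail -n "${3:-500}" "$1" > "$2"
}

# Typical use after reproducing the problem:
# save_log_tail /tmp/recolltrace /tmp/recolltrace-report.txt
#
# Alternative mentioned above: keep the messages on stderr and redirect them
# recollindex 2> /tmp/recollindex-stderr.log
```

The saved copy can then be attached to the problem report.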
|
||||||
|
|
||||||
|
==== Single file indexing issues
|
||||||
|
|
||||||
|
When the problem concerns, or can be reproduced with, a single file it is
|
||||||
|
very cumbersome to have to run a full indexing pass to reproduce it. There
|
||||||
|
are two ways around this:
|
||||||
|
|
||||||
|
- Set up an ad hoc configuration with only the file of interest, or its
|
||||||
|
parent directory:
|
||||||
|
----
|
||||||
|
cd
|
||||||
|
mkdir recoll-test
|
||||||
|
cd recoll-test
|
||||||
|
echo 'topdirs = /path/to/my/file/or/its/parent/dir' > recoll.conf
|
||||||
|
echo 'loglevel = 6' >> recoll.conf
|
||||||
|
echo 'logfilename = /tmp/recolltrace' >> recoll.conf
|
||||||
|
recollindex -z -c .
|
||||||
|
----
|
||||||
|
- Use the -e and -i options to recollindex to erase/reindex a single
|
||||||
|
file. Set up the log, then:
|
||||||
|
----
|
||||||
|
recollindex -e /path/to/my/file
|
||||||
|
recollindex -i /path/to/my/file
|
||||||
|
----
|
||||||
|
|
||||||
|
When using the second approach, you must take care that the path used is
consistent with the paths listed/used in the configuration. For example, if '/home' is
a link to '/usr/home', and '/usr/home/me' is used in the configuration
+topdirs+, `recollindex -i /home/me/myfile` will not work: you need
to use `recollindex -i /usr/home/me/myfile`.
|
||||||
|
|
||||||
|
|
||||||
|
=== Obtaining a stack trace
|
||||||
|
|
||||||
|
If the program actually crashes, and in order to maximize usefulness, a
|
||||||
|
crash report should also include a so-called stack trace, something that
|
||||||
|
indicates what the program was doing when it crashed. Getting a useful
|
||||||
|
stack trace is not very difficult, but it may need a little work on your
|
||||||
|
part (which will then enable me to do my part of the work).
|
||||||
|
|
||||||
|
If your distribution includes a separate package for Recoll debugging
|
||||||
|
symbols, it probably also has a page on its web site explaining how to use
|
||||||
|
them to get a stack trace. You should follow these instructions. If there
|
||||||
|
is no debugging package, you should follow the instructions below. A little
|
||||||
|
familiarity with the command line will be necessary.
|
||||||
|
|
||||||
|
==== Compiling and installing a debugging version
|
||||||
|
|
||||||
|
- Obtain the recoll source for the version you are using (www.recoll.org),
|
||||||
|
and extract the source tree.
|
||||||
|
- Follow the
|
||||||
|
link:http://www.lesbonscomptes.com/recoll/usermanual/rcl.install.building.html[instructions
|
||||||
|
for building Recoll from source] with the following modifications:
|
||||||
|
- Before running configure, edit the mk/localdefs.in file and remove the
|
||||||
|
-O2 option(s).
|
||||||
|
- When running configure, specify the standard installation location for
|
||||||
|
your system as a prefix (to avoid ending up with two installed versions,
|
||||||
|
which would almost certainly end in confusion). On Linux this would
|
||||||
|
typically be: `configure --prefix=/usr`
|
||||||
|
- When installing, arrange for the installed executables not to be stripped
|
||||||
|
of debugging symbols by specifying a value for the STRIP environment
|
||||||
|
variable (e.g. *echo* or *ls*): `sudo make install STRIP=ls`
|
||||||
|
|
||||||
|
==== Getting a core dump
|
||||||
|
|
||||||
|
You will need to run the operation that caused the crash inside a writable
|
||||||
|
directory, and tell the system that you accept core dumps. The commands
|
||||||
|
need to be run in a shell inside a terminal window. E.g.:
|
||||||
|
|
||||||
|
----
|
||||||
|
cd
|
||||||
|
ulimit -c unlimited
|
||||||
|
recoll #(or recollindex or whatever you want to run).
|
||||||
|
----
|
||||||
|
|
||||||
|
Hopefully, you will succeed in getting the command to crash, and you will
get a core file. A possible approach then would be to make both the
executable and the core file available to me by uploading them to a file
|
||||||
|
sharing site (the core file may be quite big). You should be aware though
|
||||||
|
that the core file may contain some of the data that was being indexed,
|
||||||
|
which may be a privacy issue. Another approach is to generate the stack
|
||||||
|
trace yourself.
|
||||||
|
|
||||||
|
==== Using gdb to get a stack trace
|
||||||
|
|
||||||
|
- Install gdb if it is not already on the system.
|
||||||
|
- Run gdb on the command that crashed and the core file (depending on the
|
||||||
|
system, the core file may be named "core" or something else, like
|
||||||
|
recollindex.core, or core.pid), e.g.: `gdb /usr/bin/recollindex core`
|
||||||
|
- Inside gdb, you need to use different commands to get a stack trace for
|
||||||
|
recoll and recollindex. For recollindex you can use the bt command. For
|
||||||
|
recoll use `thread apply all bt full`
|
||||||
|
- Copy/paste the output to your report email :), and quit gdb ("q").
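
The interactive steps above can also be run non-interactively, which makes it easier to save the output to a file. The following is a sketch only: +get_stack_trace+ is a name chosen for this example, while +-batch+ and +-ex+ are standard *gdb* options.

```shell
#!/bin/sh
# Write a full stack trace for all threads to a text file.
#   $1: executable that crashed, $2: core file, $3: output file
get_stack_trace() {
    exe=$1 core=$2 out=$3
    if [ ! -r "$core" ]; then
        echo "core file not found: $core" >&2
        return 1
    fi
    # -batch: run the given commands, then exit without prompting
    gdb -batch -ex 'thread apply all bt full' "$exe" "$core" > "$out" 2>&1
}

# get_stack_trace /usr/bin/recollindex core /tmp/recoll-trace.txt
```

Using `thread apply all bt full` for both programs is harmless: for a single-threaded crash it simply prints the one stack.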
|
||||||
|
|
||||||
61
website/faqsandhowtos/QpdfviewHelperScript.txt
Normal file
@ -0,0 +1,61 @@
|
|||||||
|
== Starting native applications ==
|
||||||
|
|
||||||
|
Another example of using an intermediary script for an application with a
|
||||||
|
command line syntax which can't be directly defined in mimeview.
|
||||||
|
|
||||||
|
We use a script to preprocess and adapt the options before calling the
|
||||||
|
actual command.
|
||||||
|
|
||||||
|
Details about configuring how the native application or script is called
|
||||||
|
are given with the
|
||||||
|
link:http://www.recoll.org/usermanual/usermanual.html#RCL.INSTALL.CONFIG.MIMEVIEW[description
|
||||||
|
of the mimeview configuration file].
|
||||||
|
|
||||||
|
*qpdfview* (link:http://launchpad.net/qpdfview[web site]) is a very
|
||||||
|
lightweight tabbed PDF viewer with great search performance and result
|
||||||
|
highlighting.
|
||||||
|
|
||||||
|
It does support parsing the search term and page number from the command
|
||||||
|
line with the following syntax:
|
||||||
|
|
||||||
|
----
|
||||||
|
qpdfview --unique "%f"#%p --search "%s"
|
||||||
|
----
|
||||||
|
|
||||||
|
However, qpdfview will not launch if either %p or %s are empty in the
|
||||||
|
command above. To work around this, Recoll user Florian has written a
|
||||||
|
small wrapper shell script:
|
||||||
|
|
||||||
|
----
|
||||||
|
#!/bin/bash

qpdfviewpath=qpdfview

# $1: file name, $2: page number (may be empty), $3: search term (may be empty)

# Append "#page" to the file name only when a page number was given
if [ -z "$2" ]
then
    page=""
else
    page="#$2"
fi

# Keep the option and its value as separate array elements, so that a
# multi-word search term is passed as a single argument
search=()
if [ -n "$3" ]
then
    search=(--search "$3")
fi

"$qpdfviewpath" --unique "$1$page" "${search[@]}" >&0 2>&0 &
|
||||||
|
----
|
||||||
|
|
||||||
|
|
||||||
|
The corresponding handler line for Recoll would be (depending on how you
|
||||||
|
name the script and where you store it):
|
||||||
|
|
||||||
|
----
|
||||||
|
qpdfviewwrapper %f %p %s
|
||||||
|
----
|
||||||
|
|
||||||
|
|
||||||
18
website/faqsandhowtos/QueryFromC.txt
Normal file
@ -0,0 +1,18 @@
|
|||||||
|
== Querying Recoll from a C program
|
||||||
|
|
||||||
|
The easiest way to query Recoll from a C or C++ program is to execute an
|
||||||
|
external search command (`recollq` or `recoll -t`).
|
||||||
|
|
||||||
|
I have written a simple C module which deals with the related housekeeping
|
||||||
|
and presents an easy to use API to the rest of the code. You will find it
|
||||||
|
here:
|
||||||
|
|
||||||
|
https://bitbucket.org/medoc/recoll-capi
|
||||||
|
|
||||||
|
It is a bit experimental and will only work with recoll 1.20 for now
|
||||||
|
(because it uses a new option for recollq). However it would be trivial to
|
||||||
|
modify to work with 1.19; get in touch with me if you need this.
|
||||||
|
|
||||||
|
The other approach is to link with the Recoll library. This has no official
|
||||||
|
API, but in practice, the internal one is fairly stable, and if you want to
|
||||||
|
choose this approach, you should start from the code in recollq.cpp.
|
||||||
58
website/faqsandhowtos/ReplaceCategories.txt
Normal file
@ -0,0 +1,58 @@
|
|||||||
|
== Replacing the Category filter controls
|
||||||
|
|
||||||
|
The document category filter controls normally appear at the top of the
|
||||||
|
*recoll* GUI, either as checkboxes just above the result list, or as a
|
||||||
|
drop-down list in the tool area.
|
||||||
|
|
||||||
|
By default, they are labeled _Media_, _Message_, _Spreadsheet_, _Text_,
|
||||||
|
etc., and each maps to a document category.
|
||||||
|
|
||||||
|
The mapping used to be fixed. You could change the number and composition
|
||||||
|
of categories by redefining them inside the 'mimeconf' configuration
|
||||||
|
file (you still can), but the filters always used document categories.
|
||||||
|
|
||||||
|
Categories can also be selected from the query language by using an
|
||||||
|
+rclcat:+ selector. E.g.: _rclcat:message_.
|
||||||
|
|
||||||
|
As of Recoll release 1.17, the filters are not hard-wired any more. They
|
||||||
|
map to query language fragments. This means that you can freely redefine
|
||||||
|
what they do.
|
||||||
|
|
||||||
|
The associations are configured inside the 'mimeconf' file, in the
|
||||||
|
+[guifilters]+ section. Most GUI parameters are stored in the *Qt*
|
||||||
|
configuration file, so this is not entirely consistent, and you will have
|
||||||
|
to bear with my laziness here.
|
||||||
|
|
||||||
|
A simple example will hopefully make things clearer. If you add the
|
||||||
|
following to your '~/.recoll/mimeconf' file:
|
||||||
|
|
||||||
|
----
|
||||||
|
[guifilters]
|
||||||
|
|
||||||
|
Big Books = dir:"~/My Books" size>10K
|
||||||
|
My Docs = dir:"~/My Documents"
|
||||||
|
Small Books = dir:"~/My Books" size<10K
|
||||||
|
System Docs = dir:/usr/share/doc
|
||||||
|
|
||||||
|
----
|
||||||
|
|
||||||
|
You will have four filter checkboxes, labelled _Big Books_, _My Docs_, etc.
|
||||||
|
|
||||||
|
The text after the equal sign must be a valid query language fragment, and
|
||||||
|
will be translated to a *Recoll* query and combined with the rest of the
|
||||||
|
query with an AND conjunction.
|
||||||
|
|
||||||
|
Any name text before a colon character will be erased in the display, but
|
||||||
|
used for sorting. You can use this to display the checkboxes in any order
|
||||||
|
you like. For example, the following would do exactly the same as above,
|
||||||
|
but ordering the checkboxes in the reverse order.
|
||||||
|
|
||||||
|
----
|
||||||
|
[guifilters]
|
||||||
|
|
||||||
|
d:Big Books = dir:"~/My Books" size>10K
|
||||||
|
c:My Docs = dir:"~/My Documents"
|
||||||
|
b:Small Books = dir:"~/My Books" size<10K
|
||||||
|
a:System Docs = dir:/usr/share/doc
|
||||||
|
|
||||||
|
----
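The effect of the sort prefixes can be mimicked with standard tools. The sketch below is only an illustration of the rule described above (sort on the full name, then erase everything up to the colon); it is not taken from the recoll source, and the `display_order` function name is made up for the example.

```shell
#!/bin/sh
# Filter names as written in the second [guifilters] example, one per line.
labels='d:Big Books
c:My Docs
b:Small Books
a:System Docs'

# Sort on the complete name (prefix included), then strip the prefix
# and its colon, leaving only the text that would be displayed.
display_order() {
    printf '%s\n' "$1" | LC_ALL=C sort | sed 's/^[^:]*://'
}

display_order "$labels"
```

Run as is, this prints the four labels starting with _System Docs_ and ending with _Big Books_, which is the reverse of the order obtained without prefixes.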
|
||||||
23
website/faqsandhowtos/ResultsThumbnails.txt
Normal file
@ -0,0 +1,23 @@
|
|||||||
|
== Result list thumbnails and how to create them
|
||||||
|
|
||||||
|
Recoll will display thumbnails for the results if the images exist in the
|
||||||
|
standard location ('$HOME/.thumbnails' or '$HOME/.cache/thumbnails' depending
|
||||||
|
on the xdg version).
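To check whether a thumbnail already exists for a given file, you can compute its expected name by hand. The sketch below assumes that the standard location follows the freedesktop.org thumbnail specification, where the image is named after the MD5 hash of the file URI and stored in a 'normal' or 'large' subdirectory. The function names are made up for this example, and the URI is built naively: paths containing spaces or non-ASCII characters would need percent-encoding, which is not done here.

```shell
#!/bin/sh
# Print the thumbnail file name for a URI: MD5 of the URI text, plus ".png".
thumbnail_name() {
    printf '%s' "$1" | md5sum | cut -d ' ' -f 1 | sed 's/$/.png/'
}

# Print the paths of existing thumbnails for an absolute file path.
# Both the newer (~/.cache/thumbnails) and older (~/.thumbnails)
# locations are checked.
find_thumbnails() {
    name=$(thumbnail_name "file://$1")
    for base in "${XDG_CACHE_HOME:-$HOME/.cache}/thumbnails" "$HOME/.thumbnails"; do
        for size in normal large; do
            if [ -f "$base/$size/$name" ]; then
                printf '%s\n' "$base/$size/$name"
            fi
        done
    done
}

# Example call (hypothetical path):
#   find_thumbnails /home/me/Documents/report.pdf
```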
|
||||||
|
|
||||||
|
But it will not create thumbnails, mainly because it is very hard to do
|
||||||
|
portably.
|
||||||
|
|
||||||
|
Thumbnails are most commonly created when you visit a directory with your
|
||||||
|
file manager, but visiting the whole file tree just to create thumbnails is
|
||||||
|
a bit tedious.
|
||||||
|
|
||||||
|
One simple trick to create thumbnails from the recoll GUI is to visit the
|
||||||
|
parent directory for a result by using the _Open parent document/folder_
|
||||||
|
entry in the right-click menu.
|
||||||
|
|
||||||
|
You can also find tools for the systematic creation of thumbnails for a
|
||||||
|
directory tree. Three such tools are discussed on this
|
||||||
|
link:http://askubuntu.com/questions/199110/how-can-i-instruct-nautilus-to-pre-generate-pdf-thumbnails[askubuntu.com discussion]
|
||||||
|
|
||||||
|
Also please note that no thumbnails can currently be generated or displayed
|
||||||
|
for embedded documents (attachments, archive members, etc.).
|
||||||
61
website/faqsandhowtos/SavingConfig.txt
Normal file
@ -0,0 +1,61 @@
|
|||||||
|
== User configuration backup
|
||||||
|
|
||||||
|
=== Why you would want to do this
|
||||||
|
|
||||||
|
If you are going to reinstall your system, and have some custom
|
||||||
|
configuration, you may save some time by making a backup of your
|
||||||
|
configuration and restoring it on the new system, rather than going through
|
||||||
|
the menus to recreate it.
|
||||||
|
|
||||||
|
=== How to do it
|
||||||
|
|
||||||
|
==== Index/search configuration
|
||||||
|
|
||||||
|
The main recoll configuration data is normally kept inside '~/.recoll' or
|
||||||
|
whatever *$RECOLL_CONFDIR* is set to.
|
||||||
|
|
||||||
|
This directory contains both configuration files and generated index
|
||||||
|
data. In a standard configuration, the following files and directories
|
||||||
|
contain generated data:
|
||||||
|
|
||||||
|
- 'xapiandb' contains the Xapian index, which normally consumes most of the
|
||||||
|
total space.
|
||||||
|
- 'aspdict.en.rws' contains the aspell dictionary used for spelling
|
||||||
|
corrections.
|
||||||
|
- 'mboxcache' contains cached offset data for email messages inside mbox
|
||||||
|
folders.
|
||||||
|
- 'webcache' contains saved web pages. This is more than a cache as
|
||||||
|
destroying it will purge the corresponding data during the next
|
||||||
|
indexing.
|
||||||
|
|
||||||
|
The other files are either very small or contain configuration data.
|
||||||
|
|
||||||
|
If you want to only save configuration, using minimum space, you can
|
||||||
|
destroy the above files and directories (with the possible exception of
|
||||||
|
'webcache'). Then taking a copy of the '.recoll' directory and adding the
|
||||||
|
GUI configuration data described in the next section will get you a full
|
||||||
|
configuration data backup.
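As an alternative to destroying the generated data first, the same result can be approximated by leaving it out of the archive. The following is a minimal sketch, assuming a *tar* which supports `--exclude` (GNU tar and bsdtar do). The function name is made up for this example, and the `aspdict.*.rws` pattern is an assumption meant to cover dictionaries for languages other than English. 'webcache' is kept, as it is not really discardable.

```shell
#!/bin/sh
# Archive a Recoll configuration directory without the generated index data.
# $1: configuration directory (e.g. ~/.recoll), $2: archive to create.
backup_recoll_config() {
    confdir=$1
    archive=$2
    # The --exclude options come before the directory name so that
    # they apply to it.
    tar -czf "$archive" \
        --exclude='xapiandb' \
        --exclude='aspdict.*.rws' \
        --exclude='mboxcache' \
        -C "$(dirname "$confdir")" "$(basename "$confdir")"
}

# Example call (hypothetical archive name):
#   backup_recoll_config "$HOME/.recoll" "$HOME/recoll-config.tar.gz"
```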
|
||||||
|
|
||||||
|
==== GUI configuration
|
||||||
|
|
||||||
|
The parameters set from the _Query configuration_ Qt menus are stored in
|
||||||
|
Qt standard places:
|
||||||
|
|
||||||
|
- '~/.qt/recollrc' for Qt 3.x
|
||||||
|
- '~/.config/Recoll.org/recoll.conf' for Qt 4 and later
|
||||||
|
|
||||||
|
|
||||||
|
==== Other data
|
||||||
|
|
||||||
|
If you wish to save index data in addition to the customisation files,
|
||||||
|
which only makes sense if the document access paths do not change after
|
||||||
|
reinstallation, you can just take a backup of the full '.recoll'
|
||||||
|
directory, keeping in mind that the storage locations of some data elements
can be changed, so that they may not be inside '.recoll':
|
||||||
|
|
||||||
|
- The index data is normally kept inside '~/.recoll/xapiandb', but the
|
||||||
|
location of this directory can be modified by the +dbdir+
|
||||||
|
configuration parameter if it is set (check 'recoll.conf').
|
||||||
|
- If you use the Firefox Recoll plugin, the WEB history cache is normally
|
||||||
|
kept inside '~/.recoll/webcache', but the location can be modified by
|
||||||
|
the +webcachedir+ configuration parameter.
|
||||||
109
website/faqsandhowtos/UnityLens.txt
Normal file
@ -0,0 +1,109 @@
|
|||||||
|
== Building and Installing the Ubuntu Unity Recoll Lens
|
||||||
|
|
||||||
|
Important preliminary notes:
|
||||||
|
|
||||||
|
- This only makes sense for Ubuntu versions using the Unity environment:
|
||||||
|
Natty (11.04), Oneiric (11.10), Precise (12.04), and later.
|
||||||
|
- _Remember that you still need to use the recoll GUI (or the recollindex
command) to get the indexing going!_
|
||||||
|
- The Lens is artificially limited to showing at most 20 results. Use the
|
||||||
|
recoll GUI for more complete capabilities (or edit rclsearch.py, change
|
||||||
|
the "if actual_results >= 20:" line).
|
||||||
|
|
||||||
|
|
||||||
|
=== The Lens with Recoll 1.17 and later
|
||||||
|
|
||||||
|
If you are willing to install or upgrade to Recoll version 1.17, all
necessary packages are on the Recoll PPA: you just need to add the
repository to your system sources and add or upgrade the packages. *_This
is the recommended approach!_*
|
||||||
|
|
||||||
|
----
|
||||||
|
sudo add-apt-repository ppa:recoll-backports/recoll-1.15-on
|
||||||
|
sudo apt-get update
|
||||||
|
sudo apt-get install recoll-lens recoll
|
||||||
|
----
|
||||||
|
|
||||||
|
This document may still be useful if you want to modify the lens source
|
||||||
|
code.
|
||||||
|
|
||||||
|
=== The Lens with older Recoll versions
|
||||||
|
|
||||||
|
If, for some reason, you wish to test the Lens with an older Recoll
|
||||||
|
version, read the following.
|
||||||
|
|
||||||
|
Please note that such an installation is somewhat crippled: you will not be
|
||||||
|
able to display results for embedded documents (emails inside an mbox,
|
||||||
|
attachments etc.). This requires a recoll command line option which is only
|
||||||
|
available in 1.17 and later.
|
||||||
|
|
||||||
|
The Lens is based on the Recoll Python module which is not built by default
|
||||||
|
for versions prior to 1.17, so you will first need to pull the Recoll
|
||||||
|
source code (for your version), then untar and proceed with the
|
||||||
|
configure/build instructions below.
|
||||||
|
|
||||||
|
The following uses --prefix=/usr. I have no real reason to believe
|
||||||
|
that this would not work with /usr/local (lenses are also searched there by
|
||||||
|
default). If you confirm that things work with another prefix, please drop
|
||||||
|
me a line.
|
||||||
|
|
||||||
|
When doing this over a previous Recoll compilation, run a "make clean" to
|
||||||
|
get rid of the non-PIC objects.
|
||||||
|
|
||||||
|
Note that the following instructions change nothing in your existing Recoll
|
||||||
|
installation, they only install the Python module and the Unity Lens,
|
||||||
|
recoll, recollindex etc. are unaffected.
|
||||||
|
|
||||||
|
'/TOP/OF/RECOLL/SRC' designates the top of the recoll source tree.
|
||||||
|
|
||||||
|
=== Configure and build the recoll library and python module, install the module
|
||||||
|
|
||||||
|
The following needs the development packages for Xapian, Python and zlib.
|
||||||
|
|
||||||
|
----
|
||||||
|
cd /TOP/OF/RECOLL/SRC
|
||||||
|
# May fail if no previous build was performed
|
||||||
|
make clean
|
||||||
|
|
||||||
|
# the gui/x11 disabling is just here to avoid having to install the
|
||||||
|
# development libraries for Qt.
|
||||||
|
./configure --prefix=/usr --enable-pic --without-x --disable-qtgui
|
||||||
|
make
|
||||||
|
|
||||||
|
cd python/recoll
|
||||||
|
python setup.py build
|
||||||
|
sudo python setup.py install
|
||||||
|
----
|
||||||
|
|
||||||
|
=== Build and install the Unity Lens
|
||||||
|
|
||||||
|
----
|
||||||
|
cd /TOP/OF/RECOLL/SRC
|
||||||
|
cd desktop/unity-lens-recoll
|
||||||
|
./configure --prefix=/usr --sysconfdir=/etc
|
||||||
|
sudo make install
|
||||||
|
|
||||||
|
----
|
||||||
|
|
||||||
|
Voilà, it should work...
|
||||||
|
|
||||||
|
Try to start the Dash, you should see the Recoll checkerboard (or
|
||||||
|
whatever...) in the Lens list.
|
||||||
|
|
||||||
|
The Recoll Lens expects a Recoll query language string, so you can use
|
||||||
|
field searches, directory, size, and date filtering (see the
|
||||||
|
link:http://www.lesbonscomptes.com/recoll/usermanual/rcl.search.lang.html[Recoll
|
||||||
|
manual] for a description of the query language).
|
||||||
|
|
||||||
|
If you want to disable the Lens, I think that you just have to delete
|
||||||
|
'/usr/share/unity/lenses/recoll'
|
||||||
|
|
||||||
|
Other installed files:
|
||||||
|
|
||||||
|
----
|
||||||
|
/usr/libexec/unity-recoll-daemon
|
||||||
|
/usr/share/dbus-1/services/unity-lens-recoll.service
|
||||||
|
/usr/share/doc/unity-lens-recoll
|
||||||
|
/usr/share/unity-lens-recoll
|
||||||
|
----
|
||||||
|
|
||||||
68
website/faqsandhowtos/UsingOpenWith.txt
Normal file
@ -0,0 +1,68 @@
|
|||||||
|
== Using the _Open With_ context menu in recoll 1.20 and newer
|
||||||
|
|
||||||
|
Recoll versions 1.20 and newer have an _Open With_ entry in the result list
|
||||||
|
context menu (the thing which pops up on a right click).
|
||||||
|
|
||||||
|
This allows choosing the application used to edit the document, instead of
|
||||||
|
using the default one.
|
||||||
|
|
||||||
|
The list of applications is built from the desktop files found inside
|
||||||
|
'/usr/share/applications'. For each application on the system, these
|
||||||
|
files list the MIME types that the application can process.
|
||||||
|
|
||||||
|
If the application which you would want listed does not appear, the most
|
||||||
|
probable cause is that it has no desktop file, which could happen due to a
|
||||||
|
number of reasons.
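Before adding anything, it can be useful to check which desktop files already declare the MIME type you are interested in. The following sketch looks at the `MimeType=` line of each file in '/usr/share/applications' (or in another directory given as second argument). The function name is made up for this example, and other locations where desktop files may live are not searched.

```shell
#!/bin/sh
# List the desktop files which declare a given MIME type.
# $1: MIME type, $2: directory to scan (default: /usr/share/applications).
desktop_files_for_mime() {
    mime=$1
    appdir=${2:-/usr/share/applications}
    for f in "$appdir"/*.desktop; do
        # Skip the unexpanded pattern when the directory has no desktop file.
        [ -f "$f" ] || continue
        # Split the MimeType line on '=' and ';' and look for an exact match.
        if grep '^MimeType=' "$f" | tr '=;' '\n\n' | grep -qxF "$mime"; then
            printf '%s\n' "$f"
        fi
    done
}

# Example call:
#   desktop_files_for_mime application/pdf
```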
|
||||||
|
|
||||||
|
This can be fixed very easily: just add a +.desktop+ file to
|
||||||
|
'/usr/share/applications', starting from an existing one as a template.
|
||||||
|
|
||||||
|
As an example, based on an original idea from Recoll user +florianbw+,
|
||||||
|
the following describes setting up a script for editing a PDF document
|
||||||
|
title found in the recoll result list.
|
||||||
|
|
||||||
|
The script uses the *zenity* shell script dialog box tool to let you
|
||||||
|
enter the new title, and then executes *exiftool* to actually change
|
||||||
|
the document.
|
||||||
|
|
||||||
|
----
|
||||||
|
#!/bin/sh
|
||||||
|
|
||||||
|
PDF=$1
|
||||||
|
TITLE=`exiftool -Title -s3 "$PDF"`
|
||||||
|
|
||||||
|
RES=`zenity --entry \
|
||||||
|
--title="Change PDF Title" \
|
||||||
|
--text="Enter the Title:" \
|
||||||
|
--entry-text "$TITLE"`
|
||||||
|
|
||||||
|
if [ "$RES" != "" ]; then
|
||||||
|
echo -n "Changing title to $RES ... " && \
|
||||||
|
exiftool -Title="$RES" "$PDF" && \
|
||||||
|
recollindex -i "$PDF" && echo "Done!"
|
||||||
|
else
|
||||||
|
echo "No title entered"
|
||||||
|
fi
|
||||||
|
----
|
||||||
|
|
||||||
|
Name it, for example, 'pdf-edit-title.sh', and make it executable
|
||||||
|
(`chmod a+x pdf-edit-title.sh`).
|
||||||
|
|
||||||
|
Then create a file named 'pdf-edit-title.desktop' inside
|
||||||
|
'/usr/share/applications'. The file name does not need to be the same as the
|
||||||
|
script's, this is just to make things clearer:
|
||||||
|
|
||||||
|
----
|
||||||
|
[Desktop Entry]
|
||||||
|
Name=PDF Title Editor
|
||||||
|
Comment=Small script based on exiftool used to edit a pdf document title
|
||||||
|
Exec=/home/dockes/bin/pdf-edit-title.sh %F
|
||||||
|
Type=Application
|
||||||
|
MimeType=application/pdf;
|
||||||
|
----
|
||||||
|
|
||||||
|
You're done! Restart Recoll, perform a search and right-click on a PDF
|
||||||
|
result: you should see an entry named _PDF Title Editor_ in the _Open
|
||||||
|
With_ list. Click on it, and you will be able to edit the title.
|
||||||
|
|
||||||
|
|
||||||
99
website/faqsandhowtos/WhyIsMyFileNotIndexed.txt
Normal file
@ -0,0 +1,99 @@
|
|||||||
|
== Using the log file to investigate indexing issues
|
||||||
|
|
||||||
|
All *Recoll* processes print trace messages. By default these go to the
|
||||||
|
standard error output, and you may not ever see them (in the case, for
|
||||||
|
example, of the *recoll* GUI started from the desktop interface).
|
||||||
|
|
||||||
|
There are a number of potential issues with indexing that may need
|
||||||
|
investigation, such as:
|
||||||
|
|
||||||
|
- A file can't be found by searching even if it appears that it should have
|
||||||
|
been indexed (this could happen because the file is not selected at all or
|
||||||
|
because a filter program crashes).
|
||||||
|
- The indexing process gets stuck and never finishes.
|
||||||
|
- The indexing process ends up with an error.
|
||||||
|
- The indexing process seems to be using too much system capacity.
|
||||||
|
|
||||||
|
The right way to approach these problems is to use the *recollindex*
|
||||||
|
command line tool (instead of the *recoll* GUI), and to set up the
|
||||||
|
trace log to provide information about what indexing is actually doing.
|
||||||
|
|
||||||
|
Trace log parameters can be set either from the GUI _Preferences->Indexing
|
||||||
|
Configuration->Global Parameters_ panel, or by editing the configuration
|
||||||
|
file '~/.recoll/recoll.conf'. You should set the following parameters:
|
||||||
|
|
||||||
|
----
|
||||||
|
loglevel = 6
|
||||||
|
logfilename = stderr
|
||||||
|
thrQSizes = -1 -1 -1
|
||||||
|
----
|
||||||
|
|
||||||
|
We use _stderr_ instead of an actual file in order to capture direct filter
|
||||||
|
messages (such as a *python* stack trace) along with normal
|
||||||
|
*recollindex* messages.
|
||||||
|
|
||||||
|
The last line sets recollindex for single-threaded operation, which will
|
||||||
|
make the log much more readable.
|
||||||
|
|
||||||
|
You should then check that no *recoll* or *recollindex* process is
|
||||||
|
currently running, and kill any you find.
|
||||||
|
|
||||||
|
Then, if this is an issue about an identified file, try indexing it only:
|
||||||
|
|
||||||
|
----
|
||||||
|
recollindex -i myunfindablefile.xxx > /tmp/myindexlog 2>&1
|
||||||
|
----
|
||||||
|
|
||||||
|
If this is a general issue with indexing (process not finishing properly),
|
||||||
|
just start it:
|
||||||
|
|
||||||
|
----
|
||||||
|
recollindex > /tmp/myindexlog 2>&1
|
||||||
|
----
|
||||||
|
|
||||||
|
Usually, having a look at the trace will let you see what is wrong (e.g.:
|
||||||
|
a configuration issue or missing filter), and solve the problem.
|
||||||
|
|
||||||
|
In case of indexer misbehaviour (e.g. using too much memory), you should run
|
||||||
|
_tail -f_ on the log to see what is going on.
|
||||||
|
|
||||||
|
If this is not enough, please
|
||||||
|
link:http://bitbucket.org/medoc/recoll/issues/new[open a tracker issue] and
|
||||||
|
attach or link to the log data, or just email me (jfd at recoll.org).
|
||||||
|
|
||||||
|
*recollindex* and *recollindex -i* usually have the same criteria to
|
||||||
|
include a file or not (but see the _Path gotcha_ note below). It may
|
||||||
|
happen that they behave differently, so it may sometimes be useful to run a
|
||||||
|
full *recollindex* even for a specific file, but this will produce a
|
||||||
|
big log file.
|
||||||
|
|
||||||
|
When you are done, it is better to reset the verbosity to a reasonable
|
||||||
|
level (e.g.: +2+ : just errors, +4+ : basic traces).
|
||||||
|
|
||||||
|
=== Note: the path gotcha
|
||||||
|
|
||||||
|
*recollindex -i* will only index files under the directories defined by the
|
||||||
|
+topdirs+ configuration variable (your home directory by
|
||||||
|
default). Unfortunately, the test is done on the file path text, ignoring
|
||||||
|
possible symbolic links. If you give a simple file name as a parameter to
|
||||||
|
*recollindex -i* and there are symbolic links inside the +topdirs+
|
||||||
|
entries, the comparison may fail. For example, if your home directory is
|
||||||
|
'/home/me/' and '/home/' is a link to '/usr/home/', *recollindex -i
|
||||||
|
somefilename* will actually try to index '/usr/home/me/somefilename', and
|
||||||
|
fail (because '/usr/home/me/' is not a subdirectory of '/home/me/'). This
|
||||||
|
will manifest itself in the log by a message like the following.
|
||||||
|
|
||||||
|
----
|
||||||
|
:4:../index/fsindexer.cpp:149:FsIndexer::indexFiles: skipping [/usr/home/me/somefile] (ntd)
|
||||||
|
----
|
||||||
|
|
||||||
|
If this happens, give a full path consistent with what is found in the
|
||||||
|
configuration file (e.g.: _recollindex -i /home/me/somefile_).
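The comparison can be pictured as a plain prefix test on the path text. The sketch below is only an illustration of that idea, not the actual recoll code, and the `under_topdir` function name is made up. It shows why the resolved path fails while the path as written in the configuration succeeds.

```shell
#!/bin/sh
# Succeed if the text of path $1 starts with the text of directory $2.
# No symbolic link is resolved: this is a pure string comparison.
under_topdir() {
    case "$1" in
        "${2%/}"/*) return 0 ;;
        *) return 1 ;;
    esac
}

# The two spellings of the same file from the example above.
for path in /home/me/somefile /usr/home/me/somefile; do
    if under_topdir "$path" /home/me; then
        echo "$path: under the topdirs entry"
    else
        echo "$path: skipped"
    fi
done
```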
|
||||||
|
|
||||||
|
=== File system occupation
|
||||||
|
|
||||||
|
One of the possible reasons for failed indexing is a +maxfsoccup+
|
||||||
|
parameter set too low. This is the value of file system occupation, not
|
||||||
|
free space, where indexing will stop. It is set from the GUI indexing
|
||||||
|
configuration or by editing 'recoll.conf'. A value of 0 implies no
|
||||||
|
checking, but a very low, non-zero, value will just prevent indexing.
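To see how the current situation compares with the configured limit, you can look at the present occupation of the file system. The sketch below assumes that the parameter is expressed as a percentage, and that the index lives in the default '~/.recoll/xapiandb' location. The function name is made up for this example.

```shell
#!/bin/sh
# Print the occupation of the file system holding path $1, as a bare
# percentage. "df -P" selects the portable output format, where the
# fifth column of the second line is the capacity in use (e.g. "42%").
fs_occupation_percent() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Example call:
#   fs_occupation_percent "$HOME/.recoll/xapiandb"
```

If the printed value is above the configured limit, indexing will stop.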
|
||||||
65
website/faqsandhowtos/WikiIndex.txt
Normal file
@ -0,0 +1,65 @@
|
|||||||
|
== Recoll Wiki file index
|
||||||
|
link:ElinksWeb.html[Extending the Recoll Firefox visited web page indexing mechanism to other browsers]
|
||||||
|
|
||||||
|
link:FaqsAndHowTos.html[Faqs and Howtos]
|
||||||
|
|
||||||
|
link:FilterArch.html[Recoll input filters ]
|
||||||
|
|
||||||
|
link:FilterRetrofit.html[Installing a filter for a new document type]
|
||||||
|
|
||||||
|
link:FilteringOutZipArchiveMembers.html[Filtering out Zip archive members]
|
||||||
|
|
||||||
|
link:GUIKeyboard.html[Recoll GUI keyboard navigation]
|
||||||
|
|
||||||
|
link:HandleCustomField.html[Generating a custom field and using it to sort results]
|
||||||
|
|
||||||
|
link:Home.html[Welcome to the Recoll Wiki]
|
||||||
|
|
||||||
|
link:HotRecoll.html[Recoll hotkey: starting / hiding recoll with a keyboard shortcut]
|
||||||
|
|
||||||
|
link:IndexMailHeader.html[Indexing arbitrary mail headers ]
|
||||||
|
|
||||||
|
link:IndexMozillaCalendari.html[Indexing Mozilla calendar data ]
|
||||||
|
|
||||||
|
link:IndexOnAc.html[Laptops: automatically starting or stopping indexing according to AC power status]
|
||||||
|
|
||||||
|
link:IndexOutlook.html[Indexing Outlook archives]
|
||||||
|
|
||||||
|
link:IndexWebHistory.html[Indexing Web history with the Firefox extension ]
|
||||||
|
|
||||||
|
link:MultipleIndexes.html[Creating and using multiple indexes]
|
||||||
|
|
||||||
|
link:MuttAndRecoll.html[Interfacing Recoll and Mutt]
|
||||||
|
|
||||||
|
link:NonAsciiFileNames.html[Unix and non-ASCII file names, a summary of issues]
|
||||||
|
|
||||||
|
link:OpenHelperScript.html[Starting native applications ]
|
||||||
|
|
||||||
|
link:PreventIndexingDir.html[Preventing indexing in a directory]
|
||||||
|
|
||||||
|
link:ProblemSolvingData.html[Gathering useful data for asking help about or reporting a Recoll issue]
|
||||||
|
|
||||||
|
link:QpdfviewHelperScript.html[Starting native applications ]
|
||||||
|
|
||||||
|
link:QueryFromC.html[Querying Recoll from a C program]
|
||||||
|
|
||||||
|
link:ReplaceCategories.html[Replacing the Category filter controls]
|
||||||
|
|
||||||
|
link:ResultsThumbnails.html[Result list thumbnails and how to create them]
|
||||||
|
|
||||||
|
link:SavingConfig.html[User configuration backup]
|
||||||
|
|
||||||
|
link:UnityLens.html[Building and Installing the Ubuntu Unity Recoll Lens]
|
||||||
|
|
||||||
|
link:UsingOpenWith.html[Using the Open With context menu in recoll 1.20 and newer]
|
||||||
|
|
||||||
|
link:WhyIsMyFileNotIndexed.html[Using the log file to investigate indexing issues]
|
||||||
|
|
||||||
|
link:XDGBase.html[XDG: Tidying Recoll data storage]
|
||||||
|
|
||||||
|
link:ZDevCaseAndDiacritics1.html[Character case and diacritic marks (1), issues with stemming]
|
||||||
|
|
||||||
|
link:ZDevCaseAndDiacritics2.html[Character case and diacritic marks (2), user interface]
|
||||||
|
|
||||||
|
link:ZDevCaseAndDiacritics3.html[Character case and diacritic marks (3), implementation]
|
||||||
|
|
||||||
42
website/faqsandhowtos/XDGBase.txt
Normal file
@ -0,0 +1,42 @@
|
|||||||
|
== XDG: Tidying Recoll data storage ==
|
||||||
|
|
||||||
|
The default storage structure of Recoll configuration and index data is
|
||||||
|
quite at odds with the recommendations of the
|
||||||
|
link:http://standards.freedesktop.org/basedir-spec/basedir-spec-latest.html[XDG
|
||||||
|
Base Directory Specification], the reason being that it predates said spec.
|
||||||
|
|
||||||
|
By default, Recoll stores all its data in a single directory: '$HOME/.recoll'
|
||||||
|
|
||||||
|
This is not going to change, because it would be quite disturbing for
|
||||||
|
current users.
|
||||||
|
|
||||||
|
However, the location of this directory can be modified using the
|
||||||
|
+$RECOLL_CONFDIR+ environment variable.
|
||||||
|
|
||||||
|
Furthermore all significant Recoll data categories can be moved away from
|
||||||
|
the configuration directory (maybe to '$HOME/.cache'), by setting
|
||||||
|
configuration variables:
|
||||||
|
|
||||||
|
* _dbdir_ defines the location for storing the Xapian
|
||||||
|
index. This could be set to, e.g., '$HOME/.cache/recoll/xapiandb'. It is
|
||||||
|
quite recommended that
|
||||||
|
this directory be dedicated to Xapian (don't store other things in
|
||||||
|
there).
|
||||||
|
* _mboxcachedir_ defines the location for caching access speedup information
|
||||||
|
about mail folders in mbox format. e.g. '$HOME/.cache/recoll/mboxcache'
|
||||||
|
* New in 1.22: you can use _aspellDictDir_ to define the storage
|
||||||
|
location for the aspell spelling approximation
|
||||||
|
dictionary. E.g. '$HOME/.cache/recoll'
|
||||||
|
* _webcachedir_ may be used to define where the visited web pages
|
||||||
|
archive is stored. E.g. '$HOME/.cache/recoll/webcache'. This is only used
|
||||||
|
if you activate the Firefox plugin and web history indexing. You may
|
||||||
|
want to think a bit more about where to store it, because, contrary to
|
||||||
|
the above, this is not discardable data: your Recoll Web history goes
|
||||||
|
away if you delete it.
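The four variables can be generated in one go for a given cache directory. The sketch below prints the corresponding 'recoll.conf' lines, using the example locations listed above. The function name is made up, and the default root ('$XDG_CACHE_HOME/recoll', falling back to '$HOME/.cache/recoll') is an assumption based on the XDG specification. Given the warning above, you may prefer to edit the _webcachedir_ line and point it at a safer place.

```shell
#!/bin/sh
# Print recoll.conf settings which move generated data under a cache root.
# $1: cache root (default: $XDG_CACHE_HOME/recoll or ~/.cache/recoll).
recoll_cache_settings() {
    cache_root=${1:-${XDG_CACHE_HOME:-$HOME/.cache}/recoll}
    cat <<EOF
dbdir = $cache_root/xapiandb
mboxcachedir = $cache_root/mboxcache
aspellDictDir = $cache_root
webcachedir = $cache_root/webcache
EOF
}

# Example: review the output, then append it to the configuration file.
#   recoll_cache_settings >> "$HOME/.recoll/recoll.conf"
```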
|
||||||
|
|
||||||
|
If you use multiple Recoll configurations, each will have to be customized.
|
||||||
|
|
||||||
|
Once these are put away, there are still a few modifiable files in the
|
||||||
|
configuration directory, for example the 'recoll.pid' and 'history'
|
||||||
|
files, but these are small files. Moving 'recoll.pid' away would be a
|
||||||
|
serious headache because it is used by scripts.
|
||||||
143
website/faqsandhowtos/ZDevCaseAndDiacritics1.txt
Normal file
@ -0,0 +1,143 @@
|
|||||||
|
== Character case and diacritic marks (1), issues with stemming
|
||||||
|
|
||||||
|
=== Case and diacritics in Recoll
|
||||||
|
|
||||||
|
Recoll versions up to 1.17 almost fully ignore character case and diacritic
|
||||||
|
marks.
|
||||||
|
|
||||||
|
All terms are converted to lower case and unaccented before they are
|
||||||
|
written to the index. There are only two exceptions:
|
||||||
|
|
||||||
|
* File paths (as used in _dir:_ clauses) are not converted. This might
|
||||||
|
be a bug or a feature, but the main reason is that we don't know how they
|
||||||
|
are encoded.
|
||||||
|
* It is possible to specify that some characters will keep their diacritic
|
||||||
|
marks, because the entity formed by the character and the diacritic mark
|
||||||
|
is considered to be a different letter, not a modified one. This is
|
||||||
|
highly dependent on the language. For example, in Swedish, +å+ should
|
||||||
|
be preserved, not turned into +a+.
|
||||||
|
|
||||||
|
As a necessary consequence, the same transformations are applied to search
|
||||||
|
terms, and it is impossible to search for a specific capitalization of a
|
||||||
|
word (+US+ is looked for as +us+), or a specific accented form
|
||||||
|
(+café+ will be looked for as +cafe+).
|
||||||
|
|
||||||
|
However, there are some cases where you would like to be more specific:
|
||||||
|
|
||||||
|
* Searching for +US+ or +us+ should probably return different results.
|
||||||
|
* Diacritics are seldom significant in English, but we can find a
|
||||||
|
few examples anyway: +sake+ and +saké+, +mate+ and +maté+. Of
|
||||||
|
course, there are many more cases in languages which use more diacritics.
|
||||||
|
|
||||||
|
On the other hand, accents are often mistyped or forgotten (résumé, résume,
|
||||||
|
resume?), and capitalization is most often insignificant, so that it is
|
||||||
|
very important to retain the capability to ignore accent and character
|
||||||
|
case differences, and that the discrimination can be easily switched on or
|
||||||
|
off for each search (or even for specific terms).
|
||||||
|
|
||||||
|
This text and other pages which will follow will discuss issues in adding
|
||||||
|
character case and diacritics sensitivity to Recoll, under the assumption
|
||||||
|
that the main index will contain the raw source terms instead of
|
||||||
|
case-folded and unaccented ones.
|
||||||
|
|
||||||
|
The following will use the _unaccent_ neologism to mean _remove
|
||||||
|
diacritic marks_ (and not only accents).
|
||||||
|
|
||||||
|
English examples are used when possible, but given the limited use of
|
||||||
|
diacritics in English, some French will probably creep in.
|
||||||
|
|
||||||
|
=== Diacritics and stemming
|
||||||
|
|
||||||
|
Stemming is the process by which we extend a search to terms related by
|
||||||
|
grammatical inflexion, for example singular/plural, verb tenses, etc. For
|
||||||
|
example a search for +floor+ is normally expanded by Recoll to +floors,
|
||||||
|
floored, flooring, ...+
|
||||||
|
|
||||||
|
In practice Recoll has a separate data structure that has stemmed terms
|
||||||
|
(stems) as keys pointing to a list of expansion terms
|
||||||
|
+floor -> (floor, floors, floorings, ...)+
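As a toy illustration of this structure (a hard-coded table with a single entry, nothing like the real index data), a lookup from a stem to its expansion terms could be pictured as follows. The `expand_stem` function name is made up for the example.

```shell
#!/bin/sh
# Toy stem expansion table: the stem is the key, the value is the list
# of terms sharing this stem. Unknown stems expand to themselves only.
expand_stem() {
    case "$1" in
        floor) echo "floor floors floored flooring" ;;
        *)     echo "$1" ;;
    esac
}

# A search for "floor" is extended to every term in the list.
for term in $(expand_stem floor); do
    echo "search term: $term"
done
```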
|
||||||
|
|
||||||
|
Stemming should be applied to terms before they are stripped of
|
||||||
|
diacritics. Accents may have a grammatical significance, and the accent may
|
||||||
|
change how the term is stemmed. For example, in French the +âmes+ suffix
|
||||||
|
generally marks a past conjugation but +ames+ does not. The standard
|
||||||
|
Xapian French stemmer will turn +évitâmes+ (avoided) into an +évit+ stem,
|
||||||
|
but +évitames+ will be turned into +évitam+ (stripping
|
||||||
|
plural and feminine suffixes).
|
||||||
|
|
||||||
|
When the search is set to ignore diacritics, this poses a specific problem:
|
||||||
|
if the user enters the search term without accents (which is correct
|
||||||
|
because the system is supposed to ignore them), there is no guarantee that
|
||||||
|
the term will be correctly expanded by stemming.
|
||||||
|
|
||||||
|
The diacritic mismatch breaks the family relationship between the stem
|
||||||
|
siblings, and this is independent of the type of index: it will happen with
|
||||||
|
an index where diacritics are stripped just as with a raw one.
|
||||||
|
|
||||||
|
The simpler case where diacritics in the original term only affect
|
||||||
|
diacritics in the stem also necessitates specific processing, but it is
|
||||||
|
easier to work around.
|
||||||
|
|
||||||
|
Two examples illustrating these issues follow.
|
||||||
|
|
||||||
|
==== The simple case: diacritics in the term only affect diacritics in the stem
|
||||||
|
|
||||||
|
Let's imagine that the document set contains the term +éviter+
|
||||||
|
(infinitive of +to avoid+), but not +évite+ (present). The only term in
|
||||||
|
the actual index is then +éviter+.
|
||||||
|
|
||||||
|
The user enters an unaccented +evite+, counting on the
|
||||||
|
diacritics-insensitive search mode to deal with the accents. As +évite+
|
||||||
|
is not present in the index, we have no way to guess that +evite+ is
|
||||||
|
really +évite+.
|
||||||
|
|
||||||
|
The stemmer will turn +evite+ into +evit+. There is no way that this
|
||||||
|
can be related to +éviter+, and this legitimate result can't be found.
|
||||||
|
|
||||||
|
There is a way around this: we can compute a separate
|
||||||
|
stem expansion dictionary for unaccented terms. This dictionary, to be used
|
||||||
|
with diacritics-insensitive searches only, contains the relationship
|
||||||
|
between +evit+ and +eviter+ (as +éviter+ is in the index). We can
|
||||||
|
then relate +eviter+ and +éviter+ because they differ only by accents,
|
||||||
|
and the search will find the document with +éviter+.
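
A minimal sketch of this workaround follows, assuming a hypothetical +toy_stem+ suffix stripper in place of the real French stemmer and a one-term index. The table names are illustrative and are not those used by Recoll.

```python
import unicodedata
from collections import defaultdict

# Hypothetical French-like suffix list, longest first (not the real stemmer).
SUFFIXES = ("âmes", "er", "es", "e", "s")


def toy_stem(term):
    """Strip the first matching suffix (illustration only)."""
    for suffix in SUFFIXES:
        if term.endswith(suffix):
            return term[: -len(suffix)]
    return term


def unaccent(term):
    """Remove combining marks, e.g. 'éviter' becomes 'eviter'."""
    decomposed = unicodedata.normalize("NFD", term)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return unicodedata.normalize("NFC", stripped)


index_terms = ["éviter"]

# Ordinary table: accented stem -> index terms.
accented_stems = defaultdict(set)
# Extra tables for insensitive searches:
# unaccented stem -> unaccented terms, and unaccented term -> index terms.
unaccented_stems = defaultdict(set)
unaccented_to_raw = defaultdict(set)
for term in index_terms:
    plain = unaccent(term)
    accented_stems[toy_stem(term)].add(term)
    unaccented_stems[toy_stem(plain)].add(plain)
    unaccented_to_raw[plain].add(term)


def expand_insensitive(user_input):
    """Expand through the unaccented tables, then map back to index terms."""
    stem = toy_stem(unaccent(user_input))
    result = set()
    for plain in unaccented_stems.get(stem, ()):
        result |= unaccented_to_raw[plain]
    return sorted(result)


# The accented table has no entry for the stem of the unaccented input ...
print(toy_stem("evite") in accented_stems)
# ... but the unaccented tables lead back to the indexed term.
print(expand_insensitive("evite"))
```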
|
||||||
|
|
||||||
|
==== The bad case: diacritics in the term change the stem beyond diacritics
|
||||||
|
|
||||||
|
Some grammatically significant accents will cause unexpectedly missing
|
||||||
|
search results when using a supposedly diacritics-insensitive search mode.
|
||||||
|
|
||||||
|
Let's imagine that the document set contains the term +éviter+
|
||||||
|
(infinitive of +to avoid+), but not +évitâmes+ (past). So the stemming
|
||||||
|
expansion table has an entry for +évit+ -> +éviter+.
|
||||||
|
|
||||||
|
If the user enters an unaccented +evitames+, she would expect to find the
|
||||||
|
documents containing +éviter+ in the results, because the latter term is
|
||||||
|
a stemming sibling of +évitâmes+ and the search is supposedly not
|
||||||
|
influenced by diacritics, so that +evitames+ and +évitâmes+ should be
|
||||||
|
equivalent.
|
||||||
|
|
||||||
|
However, our search is now in trouble, because +évitâmes+ is not in any
|
||||||
|
document, so that there is no data in the index which would inform us about
|
||||||
|
how to transform the input term into something that differs only by accents
|
||||||
|
but would yield a correct input for the stemmer.
|
||||||
|
|
||||||
|
If we try to feed the raw user input to the stemmer, it will propose
|
||||||
|
an +evitam+ stem, which will not work, because the stem that actually
|
||||||
|
exists is +évit+, and +evitam+ cannot be related to +éviter+.
|
||||||
|
|
||||||
|
The only palliative approach I can think of would be a spelling correction
|
||||||
|
of the input, performed independently of the actual index contents, which
|
||||||
|
would notice that +evitames+ is not a French word and propose a change or an
|
||||||
|
expansion to +évitâmes+, which would correctly stem to +évit+ and allow
|
||||||
|
us to find +éviter+.
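
The failure and the palliative can be reproduced with a small Python sketch. Everything here is an assumption made for illustration: +toy_stem+ only mimics the two stemmer results quoted above, and +SPELLING_FIXES+ is a hypothetical stand-in for a real spelling corrector.

```python
import unicodedata

# Hypothetical suffix list: 'âmes' is stripped as a whole, 'ames' is not.
SUFFIXES = ("âmes", "er", "es", "e", "s")


def toy_stem(term):
    """Strip the first matching suffix (illustration only)."""
    for suffix in SUFFIXES:
        if term.endswith(suffix):
            return term[: -len(suffix)]
    return term


def unaccent(term):
    """Remove combining marks from a term."""
    decomposed = unicodedata.normalize("NFD", term)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return unicodedata.normalize("NFC", stripped)


index_terms = ["éviter"]
accented_stems = {}
unaccented_stems = {}
for term in index_terms:
    accented_stems.setdefault(toy_stem(term), set()).add(term)
    unaccented_stems.setdefault(toy_stem(unaccent(term)), set()).add(term)


def lookup(user_input):
    """Try the accented stem table, then the unaccented one."""
    found = set(accented_stems.get(toy_stem(user_input), ()))
    found |= unaccented_stems.get(toy_stem(unaccent(user_input)), set())
    return sorted(found)


# Hypothetical correction list, independent of the index contents.
SPELLING_FIXES = {"evitames": "évitâmes"}


def lookup_with_correction(user_input):
    """Correct the spelling first, then expand as usual."""
    return lookup(SPELLING_FIXES.get(user_input, user_input))


print(toy_stem("evitames"))            # stem of the raw input
print(lookup("evitames"))              # no sibling found in either table
print(lookup_with_correction("evitames"))
```

Both tables miss for the raw input because its stem differs from the indexed stem by more than accents. Only the corrected spelling reaches the indexed term.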
|
||||||
|
|
||||||
|
This issue is not specific to Recoll or indeed to the fact that the index
|
||||||
|
retains accents or not. As far as I can see, it is an intrinsic bad
|
||||||
|
interaction between diacritics insensitivity and stemming.
|
||||||
|
|
||||||
|
It is also interesting to note that this case becomes less probable when
|
||||||
|
the data set becomes bigger, because more term inflexions will then be
|
||||||
|
present in the index.
|
||||||
|
|
||||||
|
We'll next think about an link:ZDevCaseAndDiacritics2.html[appropriate
|
||||||
|
interface].
|
||||||
122
website/faqsandhowtos/ZDevCaseAndDiacritics2.txt
Normal file
@ -0,0 +1,122 @@
|
|||||||
|
== Character case and diacritic marks (2), user interface
|
||||||
|
|
||||||
|
In a link:ZDevCaseAndDiacritics1.html[previous document], we discussed some
|
||||||
|
of the problems which arise when mixing case/diacritics sensitivity and
|
||||||
|
stemming.
|
||||||
|
|
||||||
|
As of version 1.18, Recoll can create two types of indexes:
|
||||||
|
* _Dumb_ indexes contain terms which are lowercased and stripped of
|
||||||
|
diacritics. Searches using such an index are naturally case- and
|
||||||
|
diacritics-insensitive: search terms are stripped before processing.
|
||||||
|
* _Raw_ indexes contain terms which are just like they were found in the
|
||||||
|
source document. Searching such an index is naturally sensitive to case
|
||||||
|
and diacritics, and can be made insensitive by further processing.
|
||||||
|
|
||||||
|
The following explains how users can control these Recoll features.
|
||||||
|
|
||||||
|
=== Controlling the type of index we create: stripped or raw
|
||||||
|
|
||||||
|
The kind of index that recoll creates is determined by:
|
||||||
|
|
||||||
|
* A build-time *configure* switch: _--enable-stripchars_. If this is
|
||||||
|
set, the code for case and diacritics sensitivity is not compiled in and
|
||||||
|
recoll will work like the previous versions: unaccented and casefolded
|
||||||
|
index, no runtime options for case or diacritics sensitivity.
|
||||||
|
|
||||||
|
* An indexing configuration switch (in recoll.conf): if Recoll was built
|
||||||
|
with _--disable-stripchars_, this will provide a dynamic way to return
|
||||||
|
to the "traditional" index. The case and diacritics code will be present
|
||||||
|
but inactive. Normally, a recoll installation with this switch set
|
||||||
|
should behave exactly like one built with _--enable-stripchars_. When
|
||||||
|
using multiple indexes, this switch MUST be consistent between
|
||||||
|
indexes. There is no support whatsoever for mixing raw and dumb indexes.
|
||||||
|
The option is named _indexStripChars_, and it is not settable from the
|
||||||
|
GUI to avoid errors. This is something that would typically be set once
|
||||||
|
and for all for a given installation. We need to decide what the default
|
||||||
|
value will be for 1.18.
|
||||||
|
|
||||||
|
* A number of query time switches. Using these it is also possible to
|
||||||
|
perform a search insensitive to case and diacritics on a raw index. Note
|
||||||
|
however, that, given the complexity of the issues involved, I give no
|
||||||
|
guarantee at this time that this will yield exactly the same results as
|
||||||
|
searching a dumb index. Details about query time behaviour follow.
|
||||||
|
|
||||||
|
|
||||||
|
=== Controlling stem, case and diacritics expansion: user query interface
|
||||||
|
|
||||||
|
Recoll versions up to 1.17 were insensitive to case and diacritics. We only
|
||||||
|
needed to give the user a way to control stem expansion. This was done in
|
||||||
|
three ways:
|
||||||
|
|
||||||
|
* Globally, by setting a menu option.
|
||||||
|
* Globally, by setting the stemming language value to empty.
|
||||||
|
* On a term by term basis by Capitalizing the term, or, in query language
|
||||||
|
mode only, by using an 'l' clause modifier (_"term"l_).
|
||||||
|
|
||||||
|
After switching to an unstripped index, capable of case and diacritic
|
||||||
|
sensitivity, we need ways to control what processing is performed among:
|
||||||
|
|
||||||
|
* Case expansion.
|
||||||
|
* Diacritics expansion.
|
||||||
|
* Stem expansion.
|
||||||
|
|
||||||
|
The default mode will be compatible with the previous version, because
|
||||||
|
this is most generally what we want to do: ignore case and diacritics,
|
||||||
|
expand stems.
|
||||||
|
|
||||||
|
There are two easy approaches for controlling the parameters:
|
||||||
|
* Global options set in the GUI menus or as *recollq* command line
|
||||||
|
switches.
|
||||||
|
* Per-clause options set by modifiers in the query language.
|
||||||
|
|
||||||
|
We would like, however, to let the user's input automatically override the
|
||||||
|
defaults in a sensible way. For example:
|
||||||
|
|
||||||
|
* If a term is entered with diacritics, diacritic sensitivity is turned on
|
||||||
|
(for this term only).
|
||||||
|
* If a term is entered with upper-case characters, case sensitivity is
|
||||||
|
turned on. In this case, we turn off stem expansion, because it really
|
||||||
|
makes no sense with case sensitivity.
|
||||||
|
|
||||||
|
With this method we are stuck with 3 problems (only if the global mode is
|
||||||
|
set to insensitive, and we're not using the query language):
|
||||||
|
|
||||||
|
* Turning off stemming without turning on case sensitivity.
|
||||||
|
* Searching for an all lower-case term in case-sensitive mode.
|
||||||
|
* Searching for a term without diacritics in diacritic-sensitive mode.
|
||||||
|
|
||||||
|
The two latter issues are relatively marginal and can be worked around easily
|
||||||
|
by switching to query language mode or using negative clauses in the
|
||||||
|
advanced search.
|
||||||
|
|
||||||
|
However, we need to be able to turn stemming off while remaining
|
||||||
|
insensitive to case, and we need to stay reasonably compatible with the
|
||||||
|
previous versions. This means that a term which has a capital first letter
|
||||||
|
but is otherwise lowercase will turn stemming off, but not case sensitivity
|
||||||
|
on.
|
||||||
|
|
||||||
|
So we're left with how to search for such a term in a case-sensitive way,
|
||||||
|
and for this, you'll have to use global options or the query language.
|
||||||
|
|
||||||
|
The modified method is:
|
||||||
|
|
||||||
|
* If a term is entered with diacritics, diacritic sensitivity is turned on
|
||||||
|
(for this term only).
|
||||||
|
* If the first letter in a term is upper-case and the rest is lower-case,
|
||||||
|
we turn stem expansion off, but we do not become case-sensitive.
|
||||||
|
* If any letter in a term except the first is upper-case, case sensitivity
|
||||||
|
is turned on. Stem expansion is also turned off (even if the first
|
||||||
|
letter is lower-case), because it really makes no sense with case
|
||||||
|
sensitivity.
|
||||||
|
* To search for an all lower-case or capitalized term in a case-sensitive
|
||||||
|
way, use the query language: "Capitalized"C, "lowercase"C.
|
||||||
|
* Use the query language and the "D" modifier to turn on diacritics
|
||||||
|
sensitivity.
|
||||||
|
|
||||||
|
It can be noted that some combinations of choices do not make sense and
|
||||||
|
they are not allowed by Recoll: for example, diacritics or case sensitivity
|
||||||
|
do not make sense with stem expansion (which cannot preserve diacritics in
|
||||||
|
any meaningful general way).
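
The automatic rules of the modified method can be summarised in a short Python sketch. The function and key names are hypothetical and do not come from the Recoll sources. The sketch covers only what is deduced from the way a term is typed, not the explicit query language modifiers.

```python
import unicodedata


def has_diacritics(term):
    """True if any character carries a combining mark once decomposed."""
    decomposed = unicodedata.normalize("NFD", term)
    return any(unicodedata.combining(c) for c in decomposed)


def search_modes(term):
    """Deduce the processing applied to one search term from its spelling."""
    diac_sensitive = has_diacritics(term)
    # Upper-case anywhere after the first letter turns case sensitivity on.
    case_sensitive = any(c.isupper() for c in term[1:])
    # A capital first letter alone only turns stem expansion off.
    capitalized = term[:1].isupper()
    # Stem expansion is not combined with case or diacritics sensitivity.
    stem_expand = not (diac_sensitive or case_sensitive or capitalized)
    return {
        "case_sensitive": case_sensitive,
        "diac_sensitive": diac_sensitive,
        "stem_expand": stem_expand,
    }


for word in ("floor", "Floor", "iPhone", "évite"):
    print(word, search_modes(word))
```

A plain lower-case, unaccented term keeps the default behaviour: insensitive to case and diacritics, with stem expansion.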
|
||||||
|
|
||||||
|
The link:ZDevCaseAndDiacritics3.html[next page] describes the actual
|
||||||
|
implementation in Recoll 1.18.
|
||||||
67
website/faqsandhowtos/ZDevCaseAndDiacritics3.txt
Normal file
@ -0,0 +1,67 @@
|
|||||||
|
== Character case and diacritic marks (3), implementation
|
||||||
|
|
||||||
|
In previous pages, we discussed link:ZDevCaseAndDiacritics1.html[diacritics
|
||||||
|
and stemming], and an link:ZDevCaseAndDiacritics2.html[appropriate
|
||||||
|
interface] for switchable search sensitivity to diacritics and character
|
||||||
|
case.
|
||||||
|
|
||||||
|
So you are in this mood again and you don't want to type accents (maybe you're
|
||||||
|
stuck with a QWERTY American English keyboard), or conversely you
|
||||||
|
want to resume looking for your résumé, and you've told Recoll as much,
|
||||||
|
using the appropriate interface. What happens then?
|
||||||
|
|
||||||
|
The second case is easy if the index is raw, and mostly impossible if it is
|
||||||
|
stripped. So we'll concentrate on the first case: how to achieve case and
|
||||||
|
diacritics insensitivity on a raw index?
|
||||||
|
|
||||||
|
Recoll uses three expansion tables:
|
||||||
|
|
||||||
|
* The first table has stripped and lowercased terms as keys and raw terms as
|
||||||
|
data: +mate -> (mate, maté, MATE,...)+.
|
||||||
|
|
||||||
|
* The second table has lowercased stems as keys and original lowercase terms
|
||||||
|
as data (when using multiple languages, there are several such tables):
|
||||||
|
+évit -> (éviter, évite, évitâmes, ...)+.
|
||||||
|
|
||||||
|
* The third table has stripped and lowercased stems as keys and stripped
|
||||||
|
lowercased terms as data:
|
||||||
|
+evit -> (eviter, evite, evitons)+ and +evitam -> (evitames, ...)+
|
||||||
|
|
||||||
|
The first table can be used for full case and diacritics expansion or for
|
||||||
|
only one of those, by post-filtering the results of full expansion (e.g. if
|
||||||
|
we only want diacritics expansion, we filter by stripping diacritics from
|
||||||
|
each result term and check that it's identical to the input). For example
|
||||||
|
if we have +mate -> (mate, maté, MATE, MATÉ)+ in the table and want to
|
||||||
|
only perform case expansion for an input of +maté+, we apply case folding
|
||||||
|
to the initial output and keep only +maté+, as +mate+ differs from the
|
||||||
|
input.
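
The post-filtering described above might look like the following Python sketch. The table content is the +mate+ example from the text. The function names and the +case+ and +diacritics+ parameters are assumptions made for the illustration.

```python
import unicodedata


def unaccent(term):
    """Remove combining marks from a term."""
    decomposed = unicodedata.normalize("NFD", term)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return unicodedata.normalize("NFC", stripped)


# First table: stripped and lowercased term -> raw terms found in the index.
CASE_DIAC_TABLE = {"mate": ["mate", "maté", "MATE", "MATÉ"]}


def expand(term, case=True, diacritics=True):
    """Full expansion, post-filtered when only one expansion is wanted."""
    candidates = CASE_DIAC_TABLE.get(unaccent(term).lower(), [])
    kept = []
    for cand in candidates:
        if case and diacritics:
            ok = True                                  # full expansion
        elif case:
            ok = cand.lower() == term.lower()          # accents must match
        elif diacritics:
            ok = unaccent(cand) == unaccent(term)      # case must match
        else:
            ok = cand == term                          # exact term only
        if ok:
            kept.append(cand)
    return kept


print(expand("maté", case=True, diacritics=False))
print(expand("mate", case=False, diacritics=True))
```

For case expansion only, each candidate is case-folded and compared with the folded input, so the unaccented variants are dropped.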
|
||||||
|
|
||||||
|
We only perform stemming expansion when case and diacritics sensitivity is
|
||||||
|
off. It is performed using the second and third tables, both on the
|
||||||
|
lowercased and lowercased/stripped output of the first step, and each term
|
||||||
|
in the stemming output is expanded again for case (using the first table).
|
||||||
|
|
||||||
|
A full example of the expansion occurring during an insensitive search
|
||||||
|
for +resume+ using French stemming on a mixed English/French index
|
||||||
|
follows. An important thing to remember is that the result of each
|
||||||
|
expansion is a function of the terms actually present in the index, not
|
||||||
|
some arbitrary computation (and so, of course, many of the possible but
|
||||||
|
absent variations are missing).
|
||||||
|
|
||||||
|
# The case and diacritics expansion of +resume+ yields +RESUME Resume
|
||||||
|
Résumé resumé résume résumé resume+
|
||||||
|
|
||||||
|
# The stem expansion input list (lower-cased) is:
|
||||||
|
+resume resumé résume résumé+, and the output is:
|
||||||
|
+resum resume resumenes resumer resumes resumé resumée résum résumait
|
||||||
|
résumant résume résumer résumerai résumerait résumes résumez résumé résumée
|
||||||
|
résumées résumés+
|
||||||
|
|
||||||
|
# Each of the above terms is then fed to case and diacritics expansion (first
|
||||||
|
table), for the final output:
|
||||||
|
+resume résumé Résumé résumer résume Resume résumés RESUME resumes
|
||||||
|
resumer résumant resúmenes resumé résumait résumes résumée resumee
|
||||||
|
résumerait Résumez résumerai RÉSUMÉES Resumée Resumes résumées+.
|
||||||
|
|
||||||
|
A Xapian OR query is finally constructed from the expanded term list.
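
The three steps can be put together in a rough Python sketch, run here on a tiny made-up index. The +toy_stem+ function, the index contents and all names are hypothetical, and the final query is shown as a plain string rather than an actual Xapian query object.

```python
import unicodedata
from collections import defaultdict

# Hypothetical suffix list standing in for a real stemmer.
SUFFIXES = ("er", "és", "es", "é", "e", "s")


def toy_stem(term):
    """Strip the first matching suffix (illustration only)."""
    for suffix in SUFFIXES:
        if term.endswith(suffix):
            return term[: -len(suffix)]
    return term


def unaccent(term):
    """Remove combining marks from a term."""
    decomposed = unicodedata.normalize("NFD", term)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return unicodedata.normalize("NFC", stripped)


# Made-up raw index content.
INDEX_TERMS = ["Résumé", "résumé", "resume", "RESUME",
               "résumer", "résumés", "resumes"]

case_diac = defaultdict(set)    # table 1: stripped/lowercased -> raw terms
stem_raw = defaultdict(set)     # table 2: lowercased stem -> lowercased terms
stem_plain = defaultdict(set)   # table 3: stripped stem -> stripped terms
for term in INDEX_TERMS:
    lower = term.lower()
    plain = unaccent(lower)
    case_diac[plain].add(term)
    stem_raw[toy_stem(lower)].add(lower)
    stem_plain[toy_stem(plain)].add(plain)


def expand_case_diac(term):
    """Step 1 and step 3: full case and diacritics expansion."""
    return set(case_diac.get(unaccent(term.lower()), ()))


def expand_insensitive(term):
    """Case/diacritics expansion, stem expansion, then expansion again."""
    step1 = expand_case_diac(term)
    step2 = set()
    for t in step1:
        lower = t.lower()
        step2 |= stem_raw.get(toy_stem(lower), set())
        step2 |= stem_plain.get(toy_stem(unaccent(lower)), set())
    step3 = set()
    for t in step2:
        step3 |= expand_case_diac(t)
    return sorted(step3)


terms = expand_insensitive("resume")
print(" OR ".join(terms))
```

As in the real example, every term in the result comes from the index tables. Variants that no document contains never appear in the query.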
|
||||||
|
|
||||||
20
website/faqsandhowtos/makeindex.sh
Normal file
@ -0,0 +1,20 @@
|
|||||||
|
#!/bin/sh
|
||||||
|
WIDX=WikiIndex.txt
|
||||||
|
|
||||||
|
echo "== Recoll Wiki file index" > $WIDX
|
||||||
|
for f in *.txt; do
|
||||||
|
if test "$f" = $WIDX ; then continue; fi
|
||||||
|
h="$(basename "$f" .txt).html"
|
||||||
|
# Take the first line; strip '=' heading markers, surrounding blanks and any carriage return.
title=$(head -1 "$f" | tr -d '\r' | sed -e 's/=//g' -e 's/^ *//' -e 's/ *$//')
|
||||||
|
echo 'link:'$h'['$title']' >> $WIDX
|
||||||
|
echo >> $WIDX
|
||||||
|
done
|
||||||
|
|
||||||
|
exit 0
|
||||||
|
# Check and display what files are in the index but not in the contents table:
|
||||||
|
|
||||||
|
grep \| FaqsAndHowTos.txt | awk -F\| '{print $1}' | sed -e 's/\* \[\[//' -e 's/.wiki//' |sort > ctfiles.tmp
|
||||||
|
grep '\[\[' WikiIndex.txt | awk -F\| '{print $1}' | sed -e 's/\[\[//' -e 's/.wiki//' -e 's/.md//' | sort > ixfiles.tmp
|
||||||
|
echo 'diff ContentFiles IndexFiles:'
|
||||||
|
diff ctfiles.tmp ixfiles.tmp
|
||||||
|
rm ctfiles.tmp ixfiles.tmp
|
||||||