web
parent
06b414cfc6
commit
821fb780d2
35
website/faqsandhowtos/ElinksWeb.txt
Normal file
@ -0,0 +1,35 @@
|
||||
== Extending the Recoll Firefox visited web page indexing mechanism to other browsers
|
||||
|
||||
The *Recoll* _Web Queue_ function allows using WEB browser plug-ins
|
||||
originally designed for indexing visited WEB pages with *Beagle* (rip). The
|
||||
browser plug-ins work very simply by creating copies of the visited pages
|
||||
in a designated directory. Two files are created for each page, one for the
|
||||
contents, the other for the metadata.
|
||||
|
||||
When activated, *Recoll* will visit the queue directory and index each HTML
|
||||
page and its associated metadata. There is more detail about the mechanism
|
||||
on the [[IndexWebHistory|page about the Recoll Web queue]], but mostly, you
|
||||
just need to go to the _Indexing Preferences_ in the *recoll* GUI, open the
|
||||
_Web history_ panel and check the top button.
|
||||
|
||||
Franck, a *Recoll* and *Elinks* user from New Zealand, designed a method
|
||||
and wrote a script to index the *Elinks* WEB history in this fashion.
|
||||
|
||||
The script works by using *wget* to fetch the visited page into the queue
|
||||
directory. This means that it would be reusable to index arbitrary WEB
|
||||
pages in contexts other than *Elinks* visits.
|
||||
|
||||
Recipe for *Elinks* and Recoll 1.18 and later:
|
||||
|
||||
* Retrieve the
|
||||
link:https://www.recoll.org/files/elinks_recoll.sh[elinks_recoll.sh] shell
|
||||
script and make it executable (`chmod a+x elinks_recoll.sh`).
|
||||
* In the Elinks Keyboard shortcut manager (k)/Main, add a shortcut to pass
|
||||
the current URL to an external command, e.g. _Ctrl-P_.
|
||||
* In the Options manager (o) /Document/Uri Passing, add an action named for
|
||||
example _ToIndex_
|
||||
* Modify the ToIndex action to execute `/path/to/the/script/elinks_recoll.sh %c`
|
||||
* Save, you are done
|
||||
|
||||
For Recoll 1.17, the method is analogous, but the script is named
|
||||
link:https://www.recoll.org/files/elinks_recoll.sh[elinks_beagle.sh].
|
||||
37
website/faqsandhowtos/FaqsAndHowTos.txt
Normal file
@ -0,0 +1,37 @@
|
||||
== Faqs and Howtos
|
||||
|
||||
=== Indexing
|
||||
* link:WhyIsMyFileNotIndexed.html[Why is this file not indexed ? Investigating indexing issues]
|
||||
* link:PreventIndexingDir.html[Preventing the indexing of a directory]
|
||||
* link:IndexOnAc.html[Starting/stopping the indexer depending on power/battery status]
|
||||
* link:IndexMozillaCalendari.html[Indexing Mozilla Sunbird / Lightning calendar data]
|
||||
* link:MultipleIndexes.html[Creating and using multiple indexes]
|
||||
* link:IndexWebHistory.html[Indexing Web history with the Firefox browser extension]
|
||||
* link:ElinksWeb.html[Extending the Web queue mechanism to other browsers and general WEB indexing]
|
||||
* link:IndexMailHeader.html[Indexing arbitrary mail headers]
|
||||
* link:IndexOutlook.html[Indexing Outlook archives]
|
||||
* link:HandleCustomField.html[Generating a custom field and using it to sort results]
|
||||
* link:http://www.recoll.org/recoll_XMP/index.html.html[An example of filter/field customisation, using XMP metadata with PDFs]
|
||||
* link:FilteringOutZipArchiveMembers.html[Filtering out Zip archive members]
|
||||
|
||||
=== Searching
|
||||
* link:GUIKeyboard.html[Recoll GUI keyboard navigation]
|
||||
* link:HotRecoll.html[On the desktop: using a keyboard shortcut for starting/hiding recoll]
|
||||
* link:OpenHelperScript.html[Handling issues for starting native apps, esp. email clients - getting Thunderbird to open message files]
|
||||
* link:QpdfviewHelperScript.html[Another example open helper script - using qpdfview to open pdf and postscript files, with support for page and search options]
|
||||
* link:UsingOpenWith.html[Using the new Open With menu in recoll 1.20 with a custom app]
|
||||
* link:ReplaceCategories.html[Replacing the document category filters]
|
||||
* link:ResultsThumbnails.html[Result list thumbnails and how to create them]
|
||||
* link:MuttAndRecoll.html[Interfacing Recoll and Mutt]
|
||||
* link:QueryFromC.html[Querying from a C program]
|
||||
|
||||
=== Administration and miscellaneous
|
||||
* link:http://www.recoll.org/pages/recoll-webui-install-wsgi.html.html[Installation of the Recoll WebUI with Apache]
|
||||
* link:FilterRetrofit.wiki.html[Installing a filter for a new document type]
|
||||
* link:UnityLens.html[Building and Installing the Ubuntu Unity Recoll Lens]
|
||||
* link:SavingConfig.wiki.html[Recoll configuration backup]
|
||||
* link:XDGBase.wiki.html[Tidying Recoll data storage]
|
||||
* link:ProblemSolvingData.html[Collecting diagnostic information]
|
||||
* link:NonAsciiFileNames.html[Unix and non-ascii file names]
|
||||
* link:FilterArch.html[Recoll filters]
|
||||
82
website/faqsandhowtos/FilterArch.txt
Normal file
@ -0,0 +1,82 @@
|
||||
== Recoll input handlers
|
||||
|
||||
In the end, Recoll indexes plain UTF-8 text, remembering where it came
from.
|
||||
|
||||
But of course, this is not what the source data looks like.
The text content of the original documents is encoded in many fashions
(pdf, ms-word, html, etc.), and it can also be stored in quite
involved ways (inside archives, email attachments ...).
|
||||
|
||||
For getting to the data and converting it to plain text, Recoll uses a set
|
||||
of modules which it calls input handlers (or filters), which either operate
|
||||
on the storage structure (e.g. a zip handler), or the storage format (e.g. a
|
||||
pdf to text translator), or both. In addition, there is a tentative notion
|
||||
of a higher level storage backend which we will ignore for now (for
|
||||
reference there are currently two of those: the file system and the web
|
||||
history cache).
|
||||
|
||||
The basic task of filters is to take a document as input and produce a
|
||||
series of subdocuments as output. The subdocument's format is defined
|
||||
either dynamically (as part of the output data), or statically, in the
|
||||
filter definition.
|
||||
|
||||
=== Simple filters
|
||||
|
||||
These are executed by the *mh_exec* Recoll module. They are the vast
|
||||
majority.
|
||||
|
||||
These filters are very simple. They are designed to perform a simple task
|
||||
with minimal interface, they mostly don't know anything about each other,
|
||||
and they don't know much about their context. This makes writing a filter
|
||||
quite easy as there is not much to learn about their environment.
|
||||
|
||||
Only one output document is produced and the format is fixed.
|
||||
|
||||
In practice the filter, which is most often a shell script (but could
|
||||
be any executable program), takes a file name on the command line and
|
||||
outputs an html or plain text document on standard output, then exits.
|
||||
|
||||
For example, the pdf filter takes one pdf file name as input on the command
|
||||
line and produces one html document on stdout. The fact that the output is
|
||||
html is statically defined in a configuration file.
|
||||
|
||||
For filters which produce plain text, the output character set information
|
||||
is in general defined in the configuration file. Else it will be obtained
|
||||
from the locale (hoping that it makes sense).
|
||||
|
||||
Filters that output html can produce metadata information in the html
|
||||
header (author etc.). Filters that output plain text can only output
|
||||
main text data, no metadata fields.
|
||||
|
||||
Besides the file name, there is one other piece of input information, which
|
||||
is in the form of an environment variable, and can be safely ignored:
|
||||
+RECOLL_FILTER_FORPREVIEW+. This indicates if the filter is being used
|
||||
for previewing or for indexing data. Some filters will elect to suppress
|
||||
repetitive parts of the output text when indexing to avoid distorting the
|
||||
term statistics. For example, the man filter suppresses the section
|
||||
headers (NAME, SYNOPSIS...) when indexing.
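
As a rough illustration of the interface described above, here is a minimal sketch of a simple handler written in Python. It is not one of the handlers shipped with Recoll: the function names (`render_html`, `strip_boilerplate`, `main`) are made up for this example, treating all-upper-case lines as the "repetitive" parts is an arbitrary choice, and the assumption that +RECOLL_FILTER_FORPREVIEW+ holds `yes` when previewing should be checked against your Recoll version.

```python
#!/usr/bin/env python3
"""Sketch of a 'simple' input handler: one file name in, one HTML document out."""
import html
import os
import sys


def render_html(text, title, author=""):
    """Wrap extracted text in a minimal HTML document, metadata in the header."""
    meta = ""
    if author:
        meta = '<meta name="author" content="%s">\n' % html.escape(author, quote=True)
    return (
        "<html><head>\n"
        '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n'
        "<title>%s</title>\n%s"
        "</head><body>\n<pre>%s</pre>\n</body></html>\n"
        % (html.escape(title), meta, html.escape(text))
    )


def strip_boilerplate(text, for_preview):
    """When indexing, drop lines that would only distort term statistics.

    Hypothetical rule for this sketch: all-upper-case lines are section headers.
    """
    if for_preview:
        return text
    kept = [line for line in text.splitlines()
            if not (line.strip() and line.isupper())]
    return "\n".join(kept)


def main(argv, environ):
    """Read the file named on the command line, write HTML to standard output."""
    if len(argv) != 2:
        sys.stderr.write("usage: %s <file>\n" % argv[0])
        return 1
    # Assumption: the variable is set to "yes" when the output is for a preview.
    for_preview = environ.get("RECOLL_FILTER_FORPREVIEW", "no") == "yes"
    with open(argv[1], encoding="utf-8", errors="replace") as f:
        text = strip_boilerplate(f.read(), for_preview)
    sys.stdout.write(render_html(text, os.path.basename(argv[1])))
    return 0

# To use this as a script, end the file with:
#   sys.exit(main(sys.argv, os.environ))
```

Because the handler only reads its argument and writes to standard output, it can be tried from a shell on any text file before being declared in the configuration.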
|
||||
|
||||
=== Multiple input filters
|
||||
|
||||
These filters are more complex, but still quite easy to write, especially
|
||||
if you can use Python, because they can then use a common module which
|
||||
manages the communication with the indexer.
|
||||
|
||||
Newer Recoll versions have converted many previously 'simple' filters to
|
||||
this kind as part of the port to Windows.
|
||||
|
||||
These filters are executed by the *mh_execm* Recoll module.
|
||||
|
||||
They are persistent (one instance will persist through a whole indexing
|
||||
pass), and will index successive multiple input files (the point being to
|
||||
avoid startup performance penalty), and possibly multiple documents per
|
||||
input file if this makes sense for their input format (e.g. zip archive, chm
|
||||
help file).
|
||||
|
||||
They use a simple communication protocol over a pipe with the main recoll
|
||||
or recollindex process, with file names and a few other parameters being
|
||||
sent as input, and decoded data and attributes being sent in return.
|
||||
|
||||
The shared Python module is 'filters/rclexecm.py'. You can look at 'rclzip'
|
||||
or 'rclaudio' for reasonably straightforward examples.
|
||||
62
website/faqsandhowtos/FilterRetrofit.txt
Normal file
@ -0,0 +1,62 @@
|
||||
== Installing a filter for a new document type
|
||||
|
||||
It will sometimes happen that a newer Recoll release has support for a
|
||||
document type which would be useful to you, but which your older release
|
||||
does not support.
|
||||
|
||||
It is in general easy to import support from the newer to the older
|
||||
release: the Recoll input handler interface is very stable, so things should just
|
||||
work.
|
||||
|
||||
Input Handler updates are generally described on the Recoll web site
|
||||
link:https://www.recoll.org/filters/filters.html[new filters pages]. They
|
||||
may include notes about which versions need the new input handler, or specifics
|
||||
about installing it.
|
||||
|
||||
An up to date copy of input handlers and configuration files is also kept
|
||||
link:https://www.recoll.org/filters/[at the same location].
|
||||
|
||||
We will take an example to make things more concrete: Tomboy and Gnote
|
||||
files are directly supported by Recoll 1.19, but not in older Recoll
|
||||
releases. The *rclxml* handler is needed to process them.
|
||||
|
||||
The following procedure will allow you to retrofit support:
|
||||
|
||||
- Retrieve the *rclxml* input handler from:
|
||||
link:https://www.lesbonscomptes.com/recoll/filters/rclxml[]
|
||||
|
||||
- Copy it to '/usr/share/recoll/filters' and make it executable:
|
||||
`chmod +x rclxml`
|
||||
The input handler needs *xsltproc*, but this is probably already on your
|
||||
system (else get it with the package manager).
|
||||
|
||||
- Edit '~/.recoll/mimemap', add the following line:
|
||||
`.note = application/x-gnote`
|
||||
- Edit '~/.recoll/mimeconf', add the following lines:
|
||||
+
|
||||
----
|
||||
[index]
|
||||
application/x-gnote = exec rclxml
|
||||
----
|
||||
- Edit '~/.recoll/mimeview', add the following lines:
|
||||
+
|
||||
----
|
||||
[view]
|
||||
application/x-gnote = tomboy %f
|
||||
----
|
||||
|
||||
- The easiest way to make sure the files are indexed with the new input
|
||||
handlers may then be to just run a full indexing pass (`recollindex -z`).
|
||||
|
||||
Notes:
|
||||
|
||||
- The MIME type which is used is not crucial, you could prefer to use,
|
||||
e.g., +application/x-tomboy+ instead, it just has to be consistent. To
|
||||
avoid future trouble, it's better to use the type used by newer Recoll
|
||||
releases though.
|
||||
- The 'mimeview' entry is necessary even if you are using the desktop
|
||||
preferences to open files. The value will not be used, but it has to be
|
||||
there.
|
||||
|
||||
|
||||
|
||||
34
website/faqsandhowtos/FilteringOutZipArchiveMembers.txt
Normal file
@ -0,0 +1,34 @@
|
||||
== Filtering out Zip archive members ==
|
||||
|
||||
The *rclzip* Zip archive extraction input handler does not use the general
|
||||
configuration variables which define what file system objects should be
|
||||
skipped, but it has an equivalent internal function.
|
||||
|
||||
The name-skipping code depends on a recent member of the Recoll Python
|
||||
package. This will become standard for release 1.20, but for earlier
|
||||
releases, you need to do two things to use this function:
|
||||
|
||||
- Fetch 'python/recoll/recoll/rclconfig.py' and 'filters/rclzip' from the
|
||||
source repository.
|
||||
- Copy both to '/usr/share/recoll/filters' and make 'rclzip' executable.
|
||||
|
||||
You can then set a variable named +zipSkippedNames+ inside
|
||||
'recoll.conf'. +zipSkippedNames+ should be a space-separated list of
|
||||
patterns which will be passed to the Python fnmatch() function. The +/+
|
||||
characters are not special (matched as any character).
|
||||
|
||||
You can't use embedded spaces in patterns (no double-quote quoting for now).
|
||||
|
||||
This can be redefined for file system directories using the usual section
|
||||
indicators (Zip archives in different file-system directories can have
|
||||
different skip lists).
|
||||
|
||||
Example:
|
||||
|
||||
----
|
||||
zipSkippedNames = *.txt
|
||||
[/path/to/the/dir]
|
||||
zipSkippedNames = somedir/*/*.html
|
||||
----
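
To get a feel for how such patterns behave, the matching can be tried directly with the Python `fnmatch` module. The sketch below is only an illustration: `is_skipped` is a hypothetical helper, not the code used inside *rclzip*, and the member names are invented.

```python
import fnmatch


def is_skipped(member_name, skipped_names):
    """Return True if an archive member name matches any skip pattern.

    skipped_names is the raw, space-separated value of zipSkippedNames.
    """
    patterns = skipped_names.split()
    return any(fnmatch.fnmatch(member_name, pat) for pat in patterns)


# '*' also matches '/', so a single '*' can span several directory levels.
print(is_skipped("docs/readme.txt", "*.txt"))                   # True
print(is_skipped("somedir/a/b/page.html", "somedir/*/*.html"))  # True
print(is_skipped("somedir/page.html", "somedir/*/*.html"))      # False
```

The last call returns False because the pattern requires at least one more `/` after `somedir/`.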
|
||||
|
||||
|
||||
60
website/faqsandhowtos/GUIKeyboard.txt
Normal file
@ -0,0 +1,60 @@
|
||||
== Recoll GUI keyboard navigation
|
||||
|
||||
Using Recoll without the mouse is not completely straightforward, but it is
|
||||
mostly feasible. Here follows a description of the usable shortcuts.
|
||||
|
||||
=== Anywhere
|
||||
|
||||
`Ctrl+q` should exit Recoll from anywhere.
|
||||
|
||||
=== Main window and result list ===
|
||||
|
||||
When Recoll starts up, the focus is in the simple search entry. The main
|
||||
window tab order is as follows:
|
||||
|
||||
* Clear
|
||||
* Search
|
||||
* Search type combo
|
||||
* Search entry (Initial focus)
|
||||
* Result list (scrolling etc)
|
||||
* Result list 1st link
|
||||
* Result list next links...
|
||||
* Back to Clear
|
||||
|
||||
Each result list entry has 3 links: the icon link is not active, but its
|
||||
value is the URL, so that it can be dragged and dropped to another
|
||||
application. The 2 other links are _Preview_ and _Open_ and can be
|
||||
activated by typing _Enter_.
|
||||
|
||||
Typing _Ctrl+Shift+s_ anywhere in the main window should return the focus to the search entry. So will _Ctrl+l_ in future versions (for compatibility with WEB browser usage).
|
||||
|
||||
For pure keyboard usage, you can improve this by:
|
||||
|
||||
- Disabling the icon link: use _Preferences->GUI configuration->Result
|
||||
List->Edit result paragraph_ and remove the `<a href='%U'>` and `</a>`
|
||||
around the `<img...>` tag.
|
||||
- Making the active link more visible by adding the following code to the
|
||||
result page HTML header insert (same preferences tab). Feel free to
|
||||
adjust the color :=) :
|
||||
|
||||
----
|
||||
<style type="text/css">
|
||||
a:focus {background-color: red;}
|
||||
</style>
|
||||
----
|
||||
|
||||
=== Result table
|
||||
|
||||
The same _Ctrl+Shift+s_ will return the focus to the search entry when
|
||||
working with the result table.
|
||||
|
||||
_Ctrl+r_ will move the focus from the entry to the spreadsheet. When in
|
||||
there the arrow keys will navigate the lines.
|
||||
|
||||
When a line is selected:
|
||||
|
||||
* _Ctrl+o_ will _Open_ the document.
|
||||
* _Ctrl+Shift+o_ will _Open_ the document and exit Recoll.
|
||||
* _Ctrl+d_ (detail) will start a _Preview_
|
||||
|
||||
_Esc_ will deselect the current line so that mouse hovering will work again.
|
||||
69
website/faqsandhowtos/HandleCustomField.txt
Normal file
@ -0,0 +1,69 @@
|
||||
== Generating a custom field and using it to sort results
|
||||
|
||||
We are going to show how to generate a custom field from a Recoll filter,
|
||||
and use it for sorting results. The example chosen comes from an actual
|
||||
user request: sorting results on pdf page counts.
|
||||
|
||||
The details here are obsolete, as the +pdf+ input handler is now a quite
|
||||
different python program, but the general idea is still relevant.
|
||||
|
||||
The page count from a pdf file can be displayed by the pdfinfo command
|
||||
(xpdf or poppler tools).
|
||||
|
||||
We first modify a copy of the rclpdf filter
|
||||
('/usr/[local/]share/recoll/filters/rclpdf'), to compute the pdf page count,
|
||||
and output the value as an html meta field. This is a not very interesting
|
||||
bit of shell/awk magic. Another approach would be to just rewrite the
|
||||
rclpdf filter in your favorite scripting language (e.g. perl, python...), as
|
||||
all it does is execute pdftotext and pdfinfo and output html, nothing
|
||||
complicated. Here follows the rclpdf modification as a pseudo patch:
|
||||
|
||||
----
|
||||
# compute the page count and format it so that it's alphabetically sortable
|
||||
+set `pdfinfo "$infile" | egrep ^Pages:`
|
||||
+pages=`printf "%04d" $2`
|
||||
[skip...]
|
||||
# Pass the page count value to awk
|
||||
-awk 'BEGIN'\
|
||||
+awk -v Pages="$pages" 'BEGIN'\
|
||||
[skip...]
|
||||
# Inside the awk program startup section: compute the "meta" field line
|
||||
+ pagemeta = "<meta name=\"pdfpages\" content=\"" Pages "\">\n"
|
||||
[skip...]
|
||||
# Then print it as part of the header:
|
||||
+ $0 = part1 charsetmeta pagemeta part2
|
||||
[skip...]
|
||||
----
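
For readers who would rather follow the rewrite route, the page-count part could look like the Python sketch below. It is an illustration under assumptions, not the actual handler code: the function names are hypothetical, and the input is the text printed by `pdfinfo`, which is assumed to contain a `Pages:` line as in the shell version above.

```python
import html
import re


def page_count(pdfinfo_output):
    """Extract the integer page count from the text printed by pdfinfo."""
    match = re.search(r"^Pages:\s+(\d+)", pdfinfo_output, re.MULTILINE)
    if match is None:
        raise ValueError("no 'Pages:' line in pdfinfo output")
    return int(match.group(1))


def pages_meta(pdfinfo_output):
    """Build the HTML meta line for the pdfpages field.

    The value is zero-padded to 4 digits so that sorting the strings
    alphabetically gives the same order as sorting the numbers.
    """
    value = "%04d" % page_count(pdfinfo_output)
    return '<meta name="pdfpages" content="%s">' % html.escape(value, quote=True)


sample = "Title:          Example\nPages:          12\nEncrypted:      no\n"
print(pages_meta(sample))  # <meta name="pdfpages" content="0012">
```

The padding matters because the stored field is compared as text: without it, "100" would sort before "12".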
|
||||
|
||||
You can execute your own version of rclpdf by modifying '~/.recoll/mimeconf':
|
||||
|
||||
----
|
||||
[index]
|
||||
application/pdf = exec /path/to/my/own/rclpdf
|
||||
----
|
||||
|
||||
At this point, recollindex would receive and extract a +pdfpages+ field,
|
||||
but it would not know what to do with it. We are going to tell it to store
|
||||
the value inside the document data record so that it can be displayed in
|
||||
the results, and sorted on. For this we modify the '~/.recoll/fields' file:
|
||||
|
||||
----
|
||||
[stored]
|
||||
pdfpages=
|
||||
----
|
||||
|
||||
That's it ! After reindexing, you can now display +pdfpages+ inside the
|
||||
result list (add a +%(pdfpages)+ value to the paragraph format), and display
|
||||
+pdfpages+ inside the result table (right-click the table header), and sort
|
||||
the results on page count (click the column header).
|
||||
|
||||
Note that +pdfpages+ has not been defined as searchable (this would not make
|
||||
much sense). For this, you'd have to define a prefix and add it to the
|
||||
[prefixes] fields file section:
|
||||
|
||||
----
|
||||
[prefixes]
|
||||
pdfpages = XYPDFP
|
||||
----
|
||||
|
||||
Have a look at the comments inside the 'fields' file for more information.
|
||||
13
website/faqsandhowtos/Home.txt
Normal file
@ -0,0 +1,13 @@
|
||||
== Welcome to the Recoll Faqs and Recipes
|
||||
|
||||
link:FaqsAndHowTos.html[FAQs and Howtos] are stored here, but
|
||||
the main source for Recoll user documentation is
|
||||
link:https://www.recoll.org/doc.html[the _Recoll user manual_] on the
|
||||
link:https://www.recoll.org/[Recoll Web site] where you will also find a
|
||||
lot of other Recoll information, source code tarballs and contact
|
||||
information.
|
||||
|
||||
If you want to make your problem report as useful as possible, you may want
|
||||
to take a look at link:ProblemSolvingData.html[this page].
|
||||
|
||||
link:WikiIndex.html[Full file index]
|
||||
79
website/faqsandhowtos/HotRecoll.txt
Normal file
@ -0,0 +1,79 @@
|
||||
== Recoll hotkey: starting / hiding recoll with a keyboard shortcut
|
||||
|
||||
Type a key (e.g. F12) and have recoll appear or disappear. On the first
|
||||
occurrence, recoll is started if it's not already running. Further
|
||||
occurrences toggle recoll between visible and minimized states. Never
|
||||
thought this would be useful until someone asked for it. Can't do without
|
||||
it anymore :)
|
||||
|
||||
This works well with both Gnome and KDE, but is implemented using a gnome
|
||||
library (*libwnck*) and its python interface, which you may have to install
|
||||
on your system if you are a pure KDE user. The library most probably exists
|
||||
in the package repositories for your distribution, so this should not be
|
||||
too complicated.
|
||||
|
||||
This should also work with other window managers, because it is based on a
|
||||
standard window manager interface extension (EWMH) that most modern window
|
||||
managers implement.
|
||||
|
||||
=== Installing the script (all desktops):
|
||||
|
||||
- You will need the libwnck library and its python interface. These are
|
||||
usually part of a gnome installation, otherwise check and possibly
|
||||
install them. For OpenSuse, the library should already be there but you
|
||||
need to install gnome-python-desktop.
|
||||
- Download the
|
||||
link:https://www.recoll.org/files/hotrecoll.py[http://www.recoll.org/files/hotrecoll.py
|
||||
script]. If you have a recent recoll installation (1.14.3 and
|
||||
further), it's already in the recoll filters directory
|
||||
('/usr/[local/]share/recoll/filters')
|
||||
- Copy the script to some permanent place (e.g. '~/bin') and make it
|
||||
executable (you can leave it in the filters dirs if it's there). In a
|
||||
shell window: `chmod +x hotrecoll.py`.
|
||||
- You can check that the script works (or not) by executing it on the
|
||||
command line. It does not need an argument. Recoll should appear or
|
||||
disappear every time you execute the script. A few warning messages may
|
||||
be considered normal. If the script says that it does not find the wnck
|
||||
library or some other module, you'll have to install them.
|
||||
|
||||
=== Installing the keyboard shortcut (Gnome):
|
||||
|
||||
- _System->Preferences->Keyboard shortcuts_, or execute
|
||||
*gnome-keybinding-properties*
|
||||
- Click add, Name, e.g. StartRecoll, Action: /path/to/hotrecoll.py
|
||||
- This will add the shortcut to the "Custom shortcuts" section. You can
|
||||
then click in the "Shortcut" column for "StartRecoll", and type any key
|
||||
combination (e.g. push F12) to assign a key shortcut.
|
||||
|
||||
=== Installing the keyboard shortcut (KDE):
|
||||
|
||||
Under KDE installing a global custom keyboard shortcut like we need is most
|
||||
helpfully not under "Keyboard Shortcuts" but under "Input Actions".
|
||||
|
||||
- _Kmenu -> Configure Desktop -> Input Actions -> Edit -> New -> Global
|
||||
Shortcut -> Command/Url_
|
||||
- A new Action appears, named _New Action_. You can rename it something
|
||||
like +hotrecoll+ for clarity.
|
||||
- Click the _Trigger_ tab, click the input area and press your preferred
|
||||
key combination (e.g. F12)
|
||||
- Click the _Action_ tab, and enter +hotrecoll.py+ (if it's in your PATH),
|
||||
or else the full path to the command (e.g.:
|
||||
'/usr/share/recoll/filters/hotrecoll.py').
|
||||
- Click _Apply_.
|
||||
|
||||
=== Installing the keyboard shortcut (XFCE):
|
||||
|
||||
Open the settings manager, and add the shortcut in the
|
||||
_Application Shortcuts_ panel inside the _Keyboard_ tool.
|
||||
|
||||
|
||||
=== Other environments
|
||||
|
||||
Many window managers have a way to set up a keyboard shortcut for running
|
||||
an arbitrary command. You'll need to look at the documentation for yours,
|
||||
or search the web for a solution.
|
||||
|
||||
An alternative independent of the environment would be to use the XBindKeys
|
||||
utility. See this link:http://www.linux.com/archive/feed/59494[linux.com
|
||||
article] for helpful instructions.
|
||||
|
||||
33
website/faqsandhowtos/IndexMailHeader.txt
Normal file
@ -0,0 +1,33 @@
|
||||
== Indexing arbitrary mail headers
|
||||
|
||||
By default the Recoll mail handler only processes a subset of email headers
|
||||
(+From+, +To+, +Cc+, +Date+, +Subject+). It is possible to index additional
|
||||
headers by specifying them inside the 'fields' configuration file, inside
|
||||
the configuration directory (typically '~/.recoll/').
|
||||
|
||||
Lengthy explanations are not really needed here, and I'll just show an
|
||||
example (duplicated from the configuration section of the manual):
|
||||
|
||||
----
|
||||
[prefixes]
|
||||
# Index mailmytag contents (with the given prefix)
|
||||
mailmytag = XMTAG
|
||||
|
||||
[stored]
|
||||
# Store mailmytag inside the document data record (so that it can be
|
||||
# displayed - as %(mailmytag) - in result lists).
|
||||
mailmytag =
|
||||
|
||||
[mail]
|
||||
# Extract the X-My-Tag mail header, and use it internally with the
|
||||
# mailmytag field name
|
||||
x-my-tag = mailmytag
|
||||
|
||||
----
|
||||
|
||||
Limitations:
|
||||
|
||||
- The mail filter will only process the first instance for a header
|
||||
occurring several times.
|
||||
- No decoding will take place (ie for non-ascii headers which would have
|
||||
some kind of encoding).
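
What these two limitations mean in practice can be seen with a small experiment. The sketch below uses the Python standard `email` module, not the Recoll mail handler, and the message is invented; it only shows what "first instance" and "undecoded" values look like.

```python
import email

raw = (
    "From: someone@example.com\n"
    "X-My-Tag: first\n"
    "X-My-Tag: second\n"
    "X-Other: =?utf-8?q?caf=C3=A9?=\n"
    "\n"
    "body\n"
)
msg = email.message_from_string(raw)

# Looking a header up by name returns only its first occurrence.
first_only = msg["X-My-Tag"]
all_values = msg.get_all("X-My-Tag")

# Without decoding, an RFC 2047 encoded header keeps its raw form.
undecoded = msg["X-Other"]

print(first_only)   # first
print(all_values)   # ['first', 'second']
print(undecoded)    # =?utf-8?q?caf=C3=A9?=
```

With the configuration above, only a value like `first` would be indexed for the header, and an encoded value would be indexed in its raw form.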
|
||||
32
website/faqsandhowtos/IndexMozillaCalendari.txt
Normal file
@ -0,0 +1,32 @@
|
||||
== Indexing Mozilla calendar data
|
||||
|
||||
Mozilla calendar programs (*Sunbird*, *Lightning*) do not store their
|
||||
data in +ics+ files natively. They use an *SQLite* database (the
|
||||
'storage.sdb' file inside the profile). This means that calendar data
|
||||
cannot be indexed directly.
|
||||
|
||||
To get Recoll to index calendar data, you need to export it to an +ics+
|
||||
file. This can be done manually, from the application menus, or by
|
||||
installing the
|
||||
link:https://addons.mozilla.org/en-US/sunbird/addon/3740[Automatic Export
|
||||
extension].
|
||||
|
||||
The extension can be configured to export the data when exiting the
|
||||
program, or at regular time intervals. You can even set up a command to be
|
||||
executed after the export. If you are not using real time indexing, this
|
||||
can usefully be *recollindex*.
|
||||
|
||||
In _Tools->Add Ons->Automatic Export preferences_, in the _Start an
|
||||
application after export_ subpanel, set _Path of application_ to
|
||||
'/usr/[local/]bin/recollindex' and _Parameters of application_ to
|
||||
something like _-i;/home/me/path/to/nameofexportedcal.ics_
|
||||
|
||||
This will ensure that the calendar is indexed every time it is exported
|
||||
(this is not necessary though, you can let the next batch indexing pass
|
||||
take care of it).
|
||||
|
||||
It may happen that the exported data has some syntax errors which will
|
||||
prevent indexing with the *rclics* filter which was distributed up to
|
||||
Recoll 1.13.04 (included). You may get an updated filter from the
|
||||
link:https://www.recoll.org/download.html[Recoll download page].
|
||||
|
||||
24
website/faqsandhowtos/IndexOnAc.txt
Normal file
24
website/faqsandhowtos/IndexOnAc.txt
Normal file
@ -0,0 +1,24 @@
|
||||
== Laptops: starting or stopping indexing according to AC power status
|
||||
|
||||
For people using real time indexing on a laptop, kind user "The Doctor"
|
||||
contributed a script to automatically start and stop indexing according to
|
||||
power status. The script can be found here:
|
||||
link:https://bitbucket.org/medoc/recoll/src/tip/src/desktop/recoll_index_on_ac.sh[recoll_index_on_ac.sh]
|
||||
|
||||
To use it, you need to copy it somewhere (e.g.: '/usr/bin', but any place
|
||||
will do), make it executable (`chmod a+x recoll_index_on_ac.sh`), and edit
|
||||
'~/.config/autostart/recollindex.desktop'
|
||||
|
||||
Change the following line:
|
||||
|
||||
Exec=recollindex -w 60 -m
|
||||
|
||||
to something like the following (depending where you copied the script):
|
||||
|
||||
Exec=/usr/bin/recoll_index_on_ac.sh
|
||||
|
||||
You may also want to change
|
||||
'/usr/share/recoll/examples/recollindex.desktop', otherwise your change
|
||||
will be reverted the next time you toggle real time indexing through the
|
||||
GUI. And, yes, sorry about it, _this_ change will be lost on the next
|
||||
Recoll update, so save a copy.
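
For readers curious about how such a power check can work, here is a minimal Python sketch. It is not the contributed script: it assumes the Linux sysfs layout under '/sys/class/power_supply' (a `type` file containing `Mains` and an `online` file containing `1` for a connected adapter), and the function names are hypothetical.

```python
import os


def on_ac_power(root="/sys/class/power_supply"):
    """Return True if any mains power supply under root reports being online."""
    try:
        names = sorted(os.listdir(root))
    except OSError:
        return False
    for name in names:
        try:
            with open(os.path.join(root, name, "type")) as f:
                kind = f.read().strip()
            with open(os.path.join(root, name, "online")) as f:
                online = f.read().strip()
        except OSError:
            continue  # e.g. batteries, which have no 'online' attribute
        if kind == "Mains" and online == "1":
            return True
    return False


def indexer_command(ac_online):
    """Return the indexer command to start on AC power, or None on battery."""
    # Same command line as the original Exec= entry shown above.
    return ["recollindex", "-w", "60", "-m"] if ac_online else None
```

A wrapper would poll `on_ac_power()` at intervals, start the returned command when it becomes true and stop the indexer when it becomes false; that part is left out here.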
|
||||
11
website/faqsandhowtos/IndexOutlook.txt
Normal file
@ -0,0 +1,11 @@
|
||||
== Indexing Outlook archives ==
|
||||
|
||||
Recoll has no direct support for indexing Microsoft Outlook data, because,
|
||||
if you are a Windows user, you probably are not a good customer for Linux
|
||||
desktop indexing...
|
||||
|
||||
However, if you have a need to index Outlook data at some point, I can
|
||||
recommend the excellent link:http://www.five-ten-sg.com/libpst/[libpst]
|
||||
library and its link:http://www.five-ten-sg.com/libpst/rn01re01.html[readpst]
|
||||
utility. Using this you can very easily convert the Outlook data into MH or
|
||||
mbox format, and then index the result with Recoll.
|
||||
29
website/faqsandhowtos/IndexWebHistory.txt
Normal file
@ -0,0 +1,29 @@
|
||||
== Indexing Web history with the Firefox extension ==
|
||||
|
||||
Note: this document is valid for Recoll versions from 1.18.
|
||||
|
||||
The link:http://sourceforge.net/projects/recollfirefox/[Recoll Firefox
|
||||
extension]
|
||||
works together with Recoll to index the Web pages that you visit. The
|
||||
extension is based on an older one which was initially written for the
|
||||
Beagle indexer.
|
||||
|
||||
The extension works by copying the data for the visited pages to a queue
|
||||
directory ('~/.recollweb/ToIndex' by default), from which they are
|
||||
indexed and removed by Recoll, and then stored in a local cache.
|
||||
|
||||
The extension is now hosted on the Mozilla add-ons site, so you can install
|
||||
it very simply in Firefox: link:https://addons.mozilla.org/fr/firefox/addon/recoll-indexer-1/[Recoll Firefox add-on page].
|
||||
|
||||
This feature can be enabled in the Recoll GUI index configuration panel
|
||||
(Web history section), or by editing the configuration file (set
|
||||
+processwebqueue+ to 1).
|
||||
|
||||
Please remember that Recoll only stores a limited amount of cached web data
|
||||
(adjustable from the GUI Index Configuration section), and that old pages
|
||||
will be purged from the index. Pages that you want to archive permanently
|
||||
need to be saved elsewhere, as they will otherwise eventually disappear
|
||||
from the Recoll results.
|
||||
|
||||
Recoll will index +.maff+ files, which may be a better choice for archival
|
||||
usage.
|
||||
9
website/faqsandhowtos/Makefile
Normal file
@ -0,0 +1,9 @@
|
||||
.SUFFIXES: .txt .html
|
||||
|
||||
.txt.html:
|
||||
	asciidoc $<
|
||||
|
||||
all: $(addsuffix .html,$(basename $(wildcard *.txt)))
|
||||
|
||||
clean:
|
||||
	rm *.html
|
||||
96
website/faqsandhowtos/MultipleIndexes.txt
Normal file
@ -0,0 +1,96 @@
|
||||
== Creating and using multiple indexes
|
||||
|
||||
=== Why would you want to do this ?
|
||||
|
||||
- Easy adjustment of search areas: you can filter results by using the
|
||||
directory filter in the advanced search panel, but, if you have
|
||||
separate well defined places where you store different kinds of data,
it is easier to maintain separate indexes and use the External indexes
|
||||
dialog to switch them on or off, and it will also yield much better
|
||||
search performance.
|
||||
- Shared indexes: it may be useful to maintain one or several indexes
|
||||
for shared data, and separate personal indexes for each user. Indexes
|
||||
can be shared over the network.
|
||||
- Creating separate indexes for removable volumes.
|
||||
|
||||
=== How to do it
|
||||
|
||||
As an example we'll suppose that you have Recoll installed and indexing
|
||||
your home directory, and that you would like to have a separate index for
|
||||
/usr/share/doc.
|
||||
|
||||
You need to create a separate configuration for the new index, then add it
|
||||
to the external indexes list in the user interface, and activate it as
|
||||
needed.
|
||||
|
||||
. Create a directory for the new index, and create an empty configuration
|
||||
file
|
||||
+
|
||||
----
|
||||
cd
|
||||
mkdir .recoll-sharedoc
|
||||
touch .recoll-sharedoc/recoll.conf
|
||||
----
|
||||
. Either edit the new configuration by hand or start recoll to use the GUI
|
||||
configuration editor.
|
||||
+
|
||||
----
|
||||
cd .recoll-sharedoc
|
||||
echo "topdirs = /usr/share/doc" > recoll.conf
|
||||
# OR
|
||||
recoll -c ~/.recoll-sharedoc
|
||||
----
|
||||
+
|
||||
If using the GUI, click _Cancel_ when asked, to start the configuration
|
||||
editor.
|
||||
|
||||
. Perform initial indexing. If you chose the GUI route, indexing will
|
||||
start as soon as you leave the configuration editor. Else, on the
|
||||
command line:
|
||||
+
|
||||
----
|
||||
recollindex -c ~/.recoll-sharedoc
|
||||
----
|
||||
. Optionally set up *cron* to perform nightly indexing, use +crontab -e+
|
||||
and insert a line like the following:
|
||||
+
|
||||
----
|
||||
45 20 * * * recollindex -c ~/.recoll-sharedoc
|
||||
----
|
||||
+
|
||||
This would start the indexing at 20:45. `crontab -e` will use the *vi*
|
||||
editor by default; you can change this by setting the EDITOR
|
||||
environment variable. Example: `EDITOR=kate crontab -e`
|
||||
Your favorite desktop may also have a dedicated tool to add crontab entries.
|
||||
|
||||
. Start recoll and choose the _Preferences->External index dialog_ menu
|
||||
entry, then click the Browse button (near the bottom), and select the
|
||||
new index Xapian database directory '~/.recoll-sharedoc/xapiandb'.
|
||||
Then click _Add index_.
|
||||
|
||||
. You can then activate or deactivate the new index by clicking the box
|
||||
in front of the directory name in the list.
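
The first steps of the list above (creating the configuration directory and setting +topdirs+) can be scripted when several indexes have to be set up. The following is only a minimal sketch: +make_recoll_config+ is a hypothetical helper name, not a Recoll command, and the directory names are the ones used in the example.

```shell
# Sketch: create a minimal Recoll configuration directory for one tree.
# make_recoll_config is a hypothetical helper, not part of Recoll.
make_recoll_config() {
    confdir=$1   # e.g. "$HOME/.recoll-sharedoc"
    topdir=$2    # e.g. /usr/share/doc
    mkdir -p "$confdir" || return 1
    # topdirs lists the directory tree(s) to index
    printf 'topdirs = %s\n' "$topdir" > "$confdir/recoll.conf"
}

# Usage (the second command needs Recoll to be installed):
#   make_recoll_config "$HOME/.recoll-sharedoc" /usr/share/doc
#   recollindex -c "$HOME/.recoll-sharedoc"
```

The helper overwrites any existing 'recoll.conf' in the target directory, so it is only suitable for creating a new configuration.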
|
||||
|
||||
When adding an index shared by multiple users, it may be helpful to use the
|
||||
RECOLL_EXTRA_DBS environment variable instead of editing individual
|
||||
configurations, see the manual for more details.
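
As a hedged illustration, and assuming (as I understand the manual) that RECOLL_EXTRA_DBS holds a colon-separated list of Xapian index directories, the value could be built as follows. +join_index_dirs+ and the paths in the usage comment are hypothetical.

```shell
# Sketch: build a colon-separated list of index directories.
# join_index_dirs is a hypothetical helper, not part of Recoll.
join_index_dirs() {
    result=
    for dir in "$@"; do
        # Add a ':' separator before every entry except the first
        result=${result:+$result:}$dir
    done
    printf '%s\n' "$result"
}

# Usage, for example in a login script shared by all users:
#   RECOLL_EXTRA_DBS=$(join_index_dirs /srv/shared/xapiandb "$HOME/.recoll-sharedoc/xapiandb")
#   export RECOLL_EXTRA_DBS
```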
|
||||
|
||||
=== Path adjustments
|
||||
|
||||
When sharing indexes over a network, in most cases, the indexed data will
|
||||
be accessible through different paths on the different hosts. This will
|
||||
prevent the Preview and Open functions from working because the paths they get
|
||||
from the index do not match the ones which are usable from the local
|
||||
host.
|
||||
|
||||
For example my home directory is accessed as '/home/me' on my home
|
||||
machine, and as '/net/myhost/home/me' on other hosts. By default, trying
|
||||
to access a result from a remote host would use the first path, when the
|
||||
second is the one that would work.
|
||||
|
||||
As of release 1.19 **Recoll** has a facility to perform index-dependent
|
||||
path translations. This facility is accessible from the _external index
|
||||
dialog_ in the GUI preferences. Path translations can be set for the main
|
||||
index if no index is selected (rarely useful), or for the selected
|
||||
additional index.
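
To make the idea concrete, here is a small sketch of what a prefix translation does to a result path. It only illustrates the concept with the example paths used above; it is not Recoll's implementation, and +translate_path+ is a hypothetical name.

```shell
# Sketch: replace a leading path prefix, leave other paths untouched.
# translate_path is a hypothetical helper used for illustration only.
translate_path() {
    path=$1   # path as stored in the index
    from=$2   # prefix used on the indexing host
    to=$3     # prefix usable on the local host
    case $path in
        "$from"|"$from"/*)
            # Strip the old prefix and prepend the new one
            printf '%s\n' "$to${path#"$from"}" ;;
        *)
            printf '%s\n' "$path" ;;
    esac
}

# Usage:
#   translate_path /home/me/doc/report.pdf /home/me /net/myhost/home/me
```

The match is done on whole path components, so '/home/me2' is not affected by a translation defined for '/home/me'.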
|
||||
|
||||
77
website/faqsandhowtos/MuttAndRecoll.txt
Normal file
@ -0,0 +1,77 @@
|
||||
== Interfacing Recoll and Mutt
|
||||
|
||||
It is possible to either use Mutt as a Recoll search result viewer, or
|
||||
start Recoll from the Mutt search.
|
||||
|
||||
=== Starting Mutt to view Recoll search results
|
||||
|
||||
This method and the associated
|
||||
link:http://www.recoll.org/files/recoll2mutt[recoll2mutt script] were kindly
|
||||
contributed by Morten Langlo.
|
||||
|
||||
This allows finding mail messages in recoll and then calling *mutt*
|
||||
or *mutt-kz* to read or process the mail.
|
||||
|
||||
Installation:
|
||||
|
||||
- Copy the link:http://www.recoll.org/files/recoll2mutt[recoll2mutt script]
|
||||
somewhere in your PATH, and make it executable.
|
||||
- In the **recoll** GUI menus:
|
||||
_Preferences->GUI configuration->User interface->Choose editor applications_
|
||||
change the entry for "message/rfc822" to: +recoll2mutt %f+
|
||||
|
||||
The script has options for setting a number of parameters. You may not need
to set any of them. The defaults are:
|
||||
|
||||
- -c mutt
|
||||
- -F .muttrc
|
||||
- -m Mail
|
||||
- -x "-fn 10*20 -geometry 115x40"
|
||||
|
||||
Example:
|
||||
|
||||
----
|
||||
recoll2mutt -c mutt-kz -F .mutt_kzrc -m Mail -x "-fn 10*20 -geometry 115x40" %f
|
||||
----
|
||||
|
||||
The option +-x+ is passed to *xterm*, which is used to call *mutt* or
|
||||
*mutt-kz*.
|
||||
|
||||
The script works for both _mbox_ and _maildir_ mail boxes, and it
|
||||
expects the configuration file for mutt and the mail directory to reside in
|
||||
your $HOME and the spool file to be '/var/spool/mail/$USER' if it is
|
||||
not in your mail directory. But it is easy to change the values in the
|
||||
script if you need to.
|
||||
|
||||
*mutt* is opened with the right mailbox, and the limit is set to _Date_ and
_Sender_. In theory you could set the limit to _Message-Id_, but *mutt* very
often reports invalid patterns in _Message-Id_, so the script plays it safe,
even though this shows all emails in the opened mailbox which have the same
date and sender.
|
||||
|
||||
|
||||
=== Starting Recoll from the Mutt search
|
||||
|
||||
This will work only when using maildir storage (messages in individual
|
||||
files). It will not work with mailbox files. The latter would probably be
|
||||
possible by extracting the individual result messages using the Python
|
||||
interface, but I did not try.
|
||||
|
||||
The classic way to interface Mutt and a search application is to create a
|
||||
shortcut to an external command which creates a temporary Maildir
|
||||
containing the search results.
|
||||
|
||||
There is such a script for Recoll, you will find it link:https://bitbucket.org/medoc/recoll/raw/41d41799dbac4c69a34db985b3ab9f1597c9c742/src/python/samples/mutt-recoll.py[here].
|
||||
|
||||
Copy the script somewhere in your PATH, and make it executable, then add
|
||||
the following line to your '.muttrc':
|
||||
|
||||
|
||||
----
|
||||
|
||||
macro index S "<enter-command>unset wait_key<enter><shell-escape>mutt-recoll.py -G<enter><change-folder-readonly>~/.cache/mutt_results<enter>" \
|
||||
"search mail (using recoll)"
|
||||
|
||||
----
|
||||
|
||||
Obviously, you can replace the 'S' letter with whatever suits you (e.g. '/').
|
||||
85
website/faqsandhowtos/NonAsciiFileNames.txt
Normal file
@ -0,0 +1,85 @@
|
||||
== Unix and non-ASCII file names, a summary of issues
|
||||
|
||||
Unix/Linux file and directory names are binary byte C strings. Only the
|
||||
null byte and the slash character (/) are forbidden inside a name,
|
||||
nowhere does the kernel interpret the strings as meaningful or
|
||||
printable.
|
||||
|
||||
In the old times, all utilities that would display to the user were
|
||||
ASCII-based, and people would use pure printable ASCII file names (even
|
||||
using space characters inside names was a cause for trouble). Non
|
||||
alphanumeric characters were exclusively used for playing tricks on
|
||||
colleagues. And all was well.
|
||||
|
||||
Then the devil came under the guise of accented 8 bit characters. The
|
||||
system has no problem with them, file names are still binary C strings, but
|
||||
the utilities have to display them or take them as input, and, because
|
||||
there is no encoding specification stored with the file names, they can
|
||||
only do this according to the character encoding taken from the user's
|
||||
current locale.
|
||||
|
||||
For example fr_FR.UTF-8, and fr_FR.ISO8859-1 could be used simultaneously
|
||||
on the same system (by different users), but they are completely
|
||||
incompatible: ISO-8859-1 strings are illegal when viewed in a UTF-8 locale
|
||||
(will display as question marks or some other conventional error
|
||||
marker). UTF-8 strings will display as gibberish in an ISO-8859-1 locale.
|
||||
|
||||
This means that the file names created by a UTF-8 user are displayed as
|
||||
garbage to the ISO-8859 one...
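
A quick way to find names which will cause this kind of trouble in a UTF-8 locale is to check whether they are valid UTF-8. The following sketch relies on *iconv* returning an error for invalid input; +is_valid_utf8+ is a hypothetical helper name.

```shell
# Sketch: succeed if the argument is a valid UTF-8 byte sequence.
# is_valid_utf8 is a hypothetical helper, not a standard command.
is_valid_utf8() {
    # iconv exits with an error when the input is not valid UTF-8
    printf '%s' "$1" | iconv -f UTF-8 -t UTF-8 > /dev/null 2>&1
}

# Usage: list the entries of the current directory which are not UTF-8
#   for f in *; do
#       is_valid_utf8 "$f" || printf 'not UTF-8: %s\n' "$f"
#   done
```

Pure ASCII names always pass the check, because ASCII is a subset of UTF-8.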
|
||||
|
||||
If you ever change your locale, your old files are still there and named
|
||||
the same (in the binary sense), but the names display badly and you have
|
||||
great trouble inputting them. If you add distributed (NFS) file system
|
||||
issues, things become totally unmanageable. Also think about archives sent
|
||||
from another system with a different encoding.
|
||||
|
||||
As far as Recoll is concerned:
|
||||
|
||||
- The file names inside recoll.conf are not transcoded, they are taken as
|
||||
binary strings (mostly, only +\n+ and +space+ are a bit special), and
|
||||
passed as is to the system. So if you edit 'recoll.conf' with a text
|
||||
editor, inside the same locale that is or has been used for file names,
|
||||
you'll be fine.
|
||||
- There was a bug in the GUI configuration tool up to version 1.12: it should
have transcoded between the internal Qt format and locale-dependent strings,
but it did not, or did so badly.
|
||||
- There is also an exception for the +unac_except_trans+ variable, this
|
||||
*has* to be UTF-8, so if the rest of the file uses another encoding,
|
||||
you'll need to edit two separate files and concatenate them.
|
||||
|
||||
As of version 1.13, Recoll uses local8Bit()/fromLocal8Bit() to convert
|
||||
recoll.conf file names from/to QStrings (it uses UTF-8 for all string
|
||||
values which are not file names).
|
||||
|
||||
The Qt file dialog is broken (at least was, I have not checked this on
|
||||
recent versions). It should consider file paths as almost-binary data, not
|
||||
QStrings, but doesn't. In consequence, things are even more broken than
|
||||
necessary as seen from there:
|
||||
|
||||
With LANG="C", non-ASCII paths can't be used at all:
|
||||
|
||||
- Strings read from recoll.conf are stripped of 8bit characters before display.
|
||||
- Directory entries with 8bit characters are not displayed at all in the
|
||||
selection dialog.
|
||||
|
||||
With LANG="fr_FR.UTF-8", only UTF-8 paths can be used:
|
||||
|
||||
- Strings read from recoll.conf are damaged when converted to QString
|
||||
(except those that were actually UTF-8)
|
||||
- Only the UTF-8 directory entries are displayed in the selection dialog.
|
||||
|
||||
|
||||
With LANG="fr_FR.iso8859-1", everything works ok.
|
||||
|
||||
- Strings read from recoll.conf are displayed with weird characters if
|
||||
they use another encoding such as UTF-8, but are correctly maintained
|
||||
and can be read back from the dialogs and rewritten without damage.
|
||||
- Directory entries with 8 bit characters are displayed weirdly (normal),
|
||||
but can be manipulated without trouble (this includes utf-8 names of
|
||||
course).
|
||||
|
||||
In conclusion, only the iso-8859 locales can be used for handling mixed
|
||||
encoding situations. This is a possible workaround for people who need it.
|
||||
|
||||
More data about path encoding issues:
|
||||
http://www.dwheeler.com/essays/fixing-unix-linux-filenames.html
|
||||
71
website/faqsandhowtos/OpenHelperScript.txt
Normal file
@ -0,0 +1,71 @@
|
||||
== Starting native applications
|
||||
|
||||
It is sometimes difficult to start a native application on a result
|
||||
document, especially when the result comes from a container file (ie: email
|
||||
folder file, chm file).
|
||||
|
||||
The problem is that native applications usually expect at most a file name
|
||||
on the command line, and sometimes not even that (emailers).
|
||||
|
||||
The _Open parent documents_ link in the result list right click menu is
|
||||
sometimes useful in this situation (e.g.: +chm+ files).
|
||||
|
||||
In some other cases it may help that Recoll does make a lot of data
|
||||
available to the application. This data may have to be pre-processed in a
|
||||
script before calling the actual application.
|
||||
|
||||
Details about configuring how the native application or script is called
|
||||
are given with the
|
||||
link:http://www.recoll.org/usermanual/usermanual.html#RCL.INSTALL.CONFIG.MIMEVIEW[description of the mimeview configuration file]
|
||||
|
||||
Information about
|
||||
link:http://www.recoll.org/usermanual/usermanual.html#RCL.INSTALL.CONFIG.FIELDS[configuring
|
||||
customised fields] may also be useful in combination.
|
||||
|
||||
=== Example
|
||||
|
||||
This is a simple example, because it does not need to use special
|
||||
fields. It just shows how to solve a simple issue by using an intermediary
|
||||
script. The problem is due to the fact that thunderbird's +-file+ option
|
||||
won't open a file if the extension is not '.eml'. Jorge, the kind Recoll
|
||||
user who supplied the example stores his email in Maildir++ format, the
|
||||
file names have no extension, so an intermediary script is necessary to get
|
||||
thunderbird to open them:
|
||||
|
||||
Note that this only works with messages stored in Maildir or MH format (one
|
||||
message per file). As far as I know, there is no way to get Thunderbird to
|
||||
open an arbitrary mbox file.
|
||||
|
||||
The 'recoll-thunderbird-open-file' script:
|
||||
|
||||
----
|
||||
#!/bin/sh
|
||||
cp "$1" /tmp/$$.eml
|
||||
thunderbird -file /tmp/$$.eml
|
||||
----
|
||||
|
||||
Create the file in an editor, save it somewhere, and make it executable
|
||||
(`chmod +x recoll-thunderbird-open-file`).
|
||||
|
||||
The mail line in the '~/.recoll/mimeview' file:
|
||||
|
||||
----
|
||||
[view]
|
||||
message/rfc822 = recoll-thunderbird-open-file %f
|
||||
----
|
||||
|
||||
If the place where you saved the script is not in your PATH, you will need
|
||||
to use the full path instead of just the script name, as in
|
||||
|
||||
----
|
||||
[view]
|
||||
message/rfc822 = /home/me/somewhere/recoll-thunderbird-open-file %f
|
||||
----
|
||||
|
||||
You should then be able to open the messages in Thunderbird, which is
|
||||
useful, for example, to handle the attachments.
|
||||
|
||||
With recent Recoll versions, if using the normal option of letting the
|
||||
Desktop choose the _Open_ application to use (_Use Desktop default_),
|
||||
you should also add +message/rfc822+ to the exceptions, and the whole
|
||||
thing is probably more easily done from the Recoll GUI.
|
||||
27
website/faqsandhowtos/PreventIndexingDir.txt
Normal file
@ -0,0 +1,27 @@
|
||||
== Preventing indexing in a directory
|
||||
|
||||
=== Why would you want to do this ?
|
||||
|
||||
By default, recollindex (or the indexing thread inside the recoll Qt user
interface) will process your home directory and most of its subdirectories,
with the exception of some well known places (thumbnails, beagle and web
|
||||
browser caches, etc.)
|
||||
|
||||
You may want to prevent indexing in some directories where you don't expect
|
||||
interesting search results. This will avoid polluting the search result
|
||||
lists, speed up indexing times and make the index smaller.
|
||||
|
||||
=== How to do it
|
||||
|
||||
There are two ways to block indexing at certain points: either by listing
|
||||
specific paths, or by directory name pattern matches.
|
||||
|
||||
- Blocking specific paths: this is controlled by the skippedPaths variable
|
||||
in the main configuration file. You can adjust the value either by
|
||||
editing the file or by using the indexing configuration dialog:
|
||||
_Preferences->Indexing configuration->Global parameters->Skipped paths_
|
||||
- Using pattern matches: these are listed in the skippedNames variable in
|
||||
the main configuration file. You can adjust the value either by editing
|
||||
the file or by using the GUI: _Preferences->Indexing configuration->Local
|
||||
parameters->Skipped names_
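
For the first method, editing the file by hand amounts to adding a +skippedPaths+ line listing the paths, separated by spaces. The sketch below appends such a line from the shell. It assumes that the file does not already contain a +skippedPaths+ line and that none of the paths contain spaces; +add_skipped_paths+ and the paths in the usage comment are hypothetical.

```shell
# Sketch: append a skippedPaths line to a Recoll configuration file.
# add_skipped_paths is a hypothetical helper, not part of Recoll.
add_skipped_paths() {
    conf=$1
    shift
    # "$*" joins the remaining arguments with spaces
    printf 'skippedPaths = %s\n' "$*" >> "$conf"
}

# Usage:
#   add_skipped_paths "$HOME/.recoll/recoll.conf" /home/me/tmp /home/me/build
```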
|
||||
|
||||
157
website/faqsandhowtos/ProblemSolvingData.txt
Normal file
@ -0,0 +1,157 @@
|
||||
== Gathering useful data for asking for help about or reporting a Recoll issue
|
||||
|
||||
Once in a while it will happen that a Recoll program will either signal an
|
||||
error, or even crash (either the *recoll* graphical interface or the
|
||||
*recollindex* command line indexing command).
|
||||
|
||||
Reporting errors and crashes is very useful. It can help others, and it can
|
||||
get your own problem solved.
|
||||
|
||||
Any problem report should include the exact Recoll and system versions.
|
||||
|
||||
If at all possible, reading the following and performing part of the
|
||||
suggested steps will be useful. This is not a condition for obtaining help
|
||||
though! If you have any problem and have difficulty with the following,
|
||||
just contact the mailing list or the developers (see contacts on
|
||||
link:https://www.recoll.org/support.html[the Recoll site support page]).
|
||||
|
||||
If the problem concerns indexing, and was initially found using the
|
||||
*recoll* GUI, you should try to reproduce it using the
|
||||
*recollindex* command-line indexer, which is much simpler and easier to
|
||||
debug.
|
||||
|
||||
There are then two sources of useful information to diagnose the issue: the
|
||||
debug log file and, possibly, in case of a crash, a stack trace.
|
||||
|
||||
Crash and other problem reports are of very high value to me, and I am
|
||||
willing to help you with any of the steps described below if it is not
|
||||
familiar to you. I do realize that not everybody is a programmer or a
|
||||
system administrator.
|
||||
|
||||
=== Obtaining information from the log file
|
||||
|
||||
All Recoll commands write a varying amount of information to a common log file.
|
||||
|
||||
_All commands use the same log, and the file is reset every time a command
|
||||
is started: so it is important to make a copy right after the problem
|
||||
occurs (for example, do not start *recoll* after a *recollindex*
|
||||
crash, this would reset the log). A workaround for this issue is to let the
|
||||
messages go to the default +stderr+, and redirect this._
|
||||
|
||||
By default, the messages are output to +stderr+, and you probably don't even
|
||||
see them if Recoll is started from the desktop. In this case, you need to
|
||||
set the parameters so that output goes to a file, and the appropriate
|
||||
verbosity level is set. When using the command-line, you may actually
|
||||
prefer to redirect stderr to avoid the log-truncating issue described
|
||||
above.
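
When working from the command line, the redirection could look like the following sketch. It assumes that +logfilename+ is left at its +stderr+ default, so that the messages actually go to the standard error output. +run_with_stderr_log+ is a hypothetical helper name.

```shell
# Sketch: run a command and append its stderr output to a log file.
# run_with_stderr_log is a hypothetical helper, not part of Recoll.
run_with_stderr_log() {
    logfile=$1
    shift
    # The remaining arguments are the command to run
    "$@" 2>> "$logfile"
}

# Usage: one log file per run, so that nothing gets overwritten
#   run_with_stderr_log "/tmp/recollindex-$$.log" recollindex
```

Because the output is appended to a file which you name yourself, starting another Recoll command afterwards does not reset it.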
|
||||
|
||||
You can set the log parameters from the GUI _Indexing parameters_
|
||||
section or by editing the '~/.recoll/recoll.conf' file: set the
|
||||
+loglevel+ and +logfilename+ parameters. E.g.:
|
||||
|
||||
----
|
||||
loglevel = 6
|
||||
logfilename = /tmp/recolltrace
|
||||
----
|
||||
|
||||
The log file can become very big if you need a big indexing run to
|
||||
reproduce the problem. Choose a file system with enough space available
|
||||
(possibly a few gigabytes).
|
||||
|
||||
Then run the sequence that leads to the problem, and make a copy of the log
|
||||
file just after. If the log is too big, it will usually be sufficient to
|
||||
use the last 500 lines or so (tail -500).
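
As a small sketch, the copy and the truncation can be done in one step. The log file name is the one from the example above; +save_log_tail+ and the output file name are hypothetical.

```shell
# Sketch: save the end of a log file to another file.
# save_log_tail is a hypothetical helper, not part of Recoll.
save_log_tail() {
    logfile=$1
    out=$2
    lines=${3:-500}   # number of lines to keep, 500 by default
    tail -n "$lines" "$logfile" > "$out"
}

# Usage, right after the problem occurred:
#   save_log_tail /tmp/recolltrace "$HOME/recoll-problem.log"
```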
|
||||
|
||||
==== Single file indexing issues
|
||||
|
||||
When the problem concerns, or can be reproduced with, a single file it is
|
||||
very cumbersome to have to run a full indexing pass to reproduce it. There
|
||||
are two ways around this:
|
||||
|
||||
- Set up an ad hoc configuration with only the file of interest, or its
|
||||
parent directory:
|
||||
----
|
||||
cd
|
||||
mkdir recoll-test
|
||||
cd recoll-test
|
||||
echo /path/to/my/file/or/its/parent/dir > recoll.conf
|
||||
echo 'loglevel = 6' >> recoll.conf
|
||||
echo 'logfilename = /tmp/recolltrace' >> recoll.conf
|
||||
recollindex -z -c .
|
||||
----
|
||||
- Use the -e and -i options to recollindex to erase/reindex a single
|
||||
file. Set up the log, then:
|
||||
----
|
||||
recollindex -e /path/to/my/file
|
||||
recollindex -i /path/to/my/file
|
||||
----
|
||||
|
||||
When using the second approach, you must take care that the path used is
|
||||
consistent with the paths listed/used in the configuration (ie: if '/home' is
|
||||
a link to '/usr/home', and '/usr/home/me' is used in the configuration
|
||||
+topdirs+, `recollindex -i /home/me/myfile` will not work, you need
|
||||
to use `recollindex -i /usr/home/me/myfile`).
|
||||
|
||||
|
||||
=== Obtaining a stack trace
|
||||
|
||||
If the program actually crashes, and in order to maximize usefulness, a
|
||||
crash report should also include a so-called stack trace, something that
|
||||
indicates what the program was doing when it crashed. Getting a useful
|
||||
stack trace is not very difficult, but it may need a little work on your
|
||||
part (which will then enable me to do my part of the work).
|
||||
|
||||
If your distribution includes a separate package for Recoll debugging
|
||||
symbols, it probably also has a page on its web site explaining how to use
|
||||
them to get a stack trace. You should follow these instructions. If there
|
||||
is no debugging package, you should follow the instructions below. A little
|
||||
familiarity with the command line will be necessary.
|
||||
|
||||
==== Compiling and installing a debugging version
|
||||
|
||||
- Obtain the recoll source for the version you are using (www.recoll.org),
|
||||
and extract the source tree.
|
||||
- Follow the
|
||||
link:http://www.lesbonscomptes.com/recoll/usermanual/rcl.install.building.html[instructions
|
||||
for building Recoll from source] with the following modifications:
|
||||
- Before running configure, edit the mk/localdefs.in file and remove the
|
||||
-O2 option(s).
|
||||
- When running configure, specify the standard installation location for
|
||||
your system as a prefix (to avoid ending up with two installed versions,
|
||||
which would almost certainly end in confusion). On Linux this would
|
||||
typically be: `configure --prefix=/usr`
|
||||
- When installing, arrange for the installed executables not to be stripped
|
||||
of debugging symbols by specifying a value for the STRIP environment
|
||||
variable (ie: *echo* or *ls*): `sudo make install STRIP=ls`
|
||||
|
||||
==== Getting a core dump
|
||||
|
||||
You will need to run the operation that caused the crash inside a writable
|
||||
directory, and tell the system that you accept core dumps. The commands
|
||||
need to be run in a shell inside a terminal window. E.g.:
|
||||
|
||||
----
|
||||
cd
|
||||
ulimit -c unlimited
|
||||
recoll #(or recollindex or whatever you want to run).
|
||||
----
|
||||
|
||||
Hopefully, you will succeed in getting the command to crash, and you will
|
||||
get a core file. A possible approach then would be to make both the
|
||||
executable and the core files available to me by uploading them to a file
|
||||
sharing site (the core file may be quite big). You should be aware though
|
||||
that the core file may contain some of the data that was being indexed,
|
||||
which may be a privacy issue. Another approach is to generate the stack
|
||||
trace yourself.
|
||||
|
||||
==== Using gdb to get a stack trace
|
||||
|
||||
- Install gdb if it is not already on the system.
|
||||
- Run gdb on the command that crashed and the core file (depending on the
|
||||
system, the core file may be named "core" or something else, like
|
||||
recollindex.core, or core.pid), ie: `gdb /usr/bin/recollindex core`
|
||||
- Inside gdb, you need to use different commands to get a stack trace for
|
||||
recoll and recollindex. For recollindex you can use the bt command. For
|
||||
recoll use `thread apply all bt full`
|
||||
- Copy/paste the output to your report email :), and quit gdb ("q").
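
If you prefer to avoid the interactive session, the steps above can probably be run in one go with the gdb batch mode. This is only a sketch: it uses the standard gdb +-batch+ and +-ex+ options, the helper names are hypothetical, and the executable and core file paths will differ on your system.

```shell
# Sketch: choose the backtrace command according to the program,
# following the steps above (bt for recollindex, full thread dump
# for recoll). Both helper names are hypothetical.
bt_command_for() {
    case $(basename "$1") in
        recoll) printf '%s\n' 'thread apply all bt full' ;;
        *)      printf '%s\n' 'bt' ;;
    esac
}

# Run gdb non-interactively and save the stack trace to a file.
gdb_trace() {
    exe=$1    # e.g. /usr/bin/recollindex
    core=$2   # e.g. core
    out=$3    # e.g. trace.txt
    gdb -batch -ex "$(bt_command_for "$exe")" "$exe" "$core" > "$out" 2>&1
}

# Usage:
#   gdb_trace /usr/bin/recollindex core trace.txt
```

The resulting text file can then be attached to the report email.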
|
||||
|
||||
61
website/faqsandhowtos/QpdfviewHelperScript.txt
Normal file
@ -0,0 +1,61 @@
|
||||
== Starting native applications ==
|
||||
|
||||
Another example of using an intermediary script for an application with a
|
||||
command line syntax which can't be directly defined in mimeview.
|
||||
|
||||
We use a script to preprocess and adapt the options before calling the
|
||||
actual command.
|
||||
|
||||
Details about configuring how the native application or script is called
|
||||
are given with the
|
||||
link:http://www.recoll.org/usermanual/usermanual.html#RCL.INSTALL.CONFIG.MIMEVIEW[description
|
||||
of the mimeview configuration file].
|
||||
|
||||
*qpdfview* (link:http://launchpad.net/qpdfview[web site]) is a very
|
||||
lightweight tabbed PDF viewer with great search performance and result
|
||||
highlighting.
|
||||
|
||||
It does support parsing the search term and page number from the command
|
||||
line with the following syntax:
|
||||
|
||||
----
|
||||
qpdfview --unique "%f"#%p --search "%s"
|
||||
----
|
||||
|
||||
However, qpdfview will not launch if either %p or %s are empty in the
|
||||
command above. To accommodate that, Recoll user Florian has written a
|
||||
small wrapper shell script:
|
||||
|
||||
----
|
||||
#!/bin/bash

qpdfviewpath=qpdfview

# $1: file name, $2: page number (may be empty), $3: search term (may be empty)
if [ -z "$2" ]
then
    page=""
else
    page="#$2"
fi

# Keep the search option in an array so that a multi-word search term
# is passed to qpdfview as a single argument
search=()
if [ -n "$3" ]
then
    search=(--search "$3")
fi

"$qpdfviewpath" --unique "$1$page" "${search[@]}" >&0 2>&0 &
|
||||
----
|
||||
|
||||
|
||||
The corresponding handler line for Recoll would be (depending on how you
|
||||
name the script and where you store it):
|
||||
|
||||
----
|
||||
qpdfviewwrapper %f %p %s
|
||||
----
|
||||
|
||||
|
||||
18
website/faqsandhowtos/QueryFromC.txt
Normal file
@ -0,0 +1,18 @@
|
||||
== Querying Recoll from a C program
|
||||
|
||||
The easiest way to query Recoll from a C or C++ program is to execute an
|
||||
external search command (`recollq` or `recoll -t`).
|
||||
|
||||
I have written a simple C module which deals with the related housekeeping
|
||||
and presents an easy to use API to the rest of the code. You will find it
|
||||
here:
|
||||
|
||||
https://bitbucket.org/medoc/recoll-capi
|
||||
|
||||
It is a bit experimental and will only work with recoll 1.20 for now
|
||||
(because it uses a new option for recollq). However it would be trivial to
|
||||
modify for working with 1.19, get in touch with me if you need this.
|
||||
|
||||
The other approach is to link with the Recoll library. This has no official
|
||||
API, but in practice, the internal one is fairly stable, and if you want to
|
||||
choose this approach, you should start from the code in recollq.cpp
|
||||
58
website/faqsandhowtos/ReplaceCategories.txt
Normal file
@ -0,0 +1,58 @@
|
||||
== Replacing the Category filter controls
|
||||
|
||||
The document category filter controls normally appear at the top of the
|
||||
*recoll* GUI, either as checkboxes just above the result list, or as a
|
||||
dropbox in the tool area.
|
||||
|
||||
By default, they are labeled _Media_, _Message_, _Spreadsheet_, _Text_,
|
||||
etc. and each map to a document category.
|
||||
|
||||
The mapping used to be fixed. You could change the number and composition
|
||||
of categories by redefining them inside the 'mimeconf' configuration
|
||||
file (you still can), but the filters always used document categories.
|
||||
|
||||
Categories can also be selected from the query language by using an
|
||||
+rclcat:+ selector. E.g.: _rclcat:message_.
|
||||
|
||||
As of Recoll release 1.17, the filters are not hard-wired any more. They
|
||||
map to query language fragments. This means that you can freely redefine
|
||||
what they do.
|
||||
|
||||
The associations are configured inside the 'mimeconf' file, in the
|
||||
+[guifilters]+ section. Most GUI parameters are stored in the *Qt*
|
||||
configuration file, so this is not entirely consistent, and you will have
|
||||
to bear with my laziness here.
|
||||
|
||||
A simple example will hopefully make things clearer. If you add the
|
||||
following to your '~/.recoll/mimeconf' file:
|
||||
|
||||
----
|
||||
[guifilters]
|
||||
|
||||
Big Books = dir:"~/My Books" size>10K
|
||||
My Docs = dir:"~/My Documents"
|
||||
Small Books = dir:"~/My Books" size<10K
|
||||
System Docs = dir:/usr/share/doc
|
||||
|
||||
----
|
||||
|
||||
You will have four filter checkboxes, labelled _Big Books_, _My Docs_, etc.
|
||||
|
||||
The text after the equal sign must be a valid query language fragment, and
|
||||
will be translated to a *Recoll* query and combined with the rest of the
|
||||
query with an AND conjunction.
|
||||
|
||||
Any name text before a colon character will be erased in the display, but
|
||||
used for sorting. You can use this to display the checkboxes in any order
|
||||
you like. For example, the following would do exactly the same as above,
|
||||
but ordering the checkboxes in the reverse order.
|
||||
|
||||
----
|
||||
[guifilters]
|
||||
|
||||
d:Big Books = dir:"~/My Books" size>10K
|
||||
c:My Docs = dir:"~/My Documents"
|
||||
b:Small Books = dir:"~/My Books" size<10K
|
||||
a:System Docs = dir:/usr/share/doc
|
||||
|
||||
----
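
The effect of the prefixes can be simulated from the shell. The sketch below is not how Recoll does it internally; it only mimics the behaviour described above (sort on the full name, then hide everything up to the colon). +show_filter_labels+ is a hypothetical name.

```shell
# Sketch: sort the filter names, then strip the sort prefix for display.
# show_filter_labels is a hypothetical helper used for illustration only.
show_filter_labels() {
    # Sort on the full name, then remove everything up to the first colon
    LC_ALL=C sort | sed 's/^[^:]*://'
}

# Usage:
#   printf '%s\n' 'd:Big Books' 'c:My Docs' 'b:Small Books' 'a:System Docs' |
#       show_filter_labels
# prints System Docs, Small Books, My Docs, Big Books (one per line)
```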
|
||||
23
website/faqsandhowtos/ResultsThumbnails.txt
Normal file
@ -0,0 +1,23 @@
|
||||
== Result list thumbnails and how to create them
|
||||
|
||||
Recoll will display thumbnails for the results if the images exist in the
|
||||
standard location ('$HOME/.thumbnails' or '$HOME/.cache/thumbnails' depending
|
||||
on the xdg version).
|
||||
|
||||
But it will not create thumbnails, mainly because it is very hard to do
|
||||
portably.
|
||||
|
||||
Thumbnails are most commonly created when you visit a directory with your
|
||||
file manager, but visiting the whole file tree just to create thumbnails is
|
||||
a bit tedious.
|
||||
|
||||
One simple trick to create thumbnails from the recoll GUI is to visit the
|
||||
parent directory for a result by using the _Open parent document/folder_
|
||||
entry in the right-click menu.
|
||||
|
||||
You can also find tools for the systematic creation of thumbnails for a
|
||||
directory tree. Three such tools are discussed on this
|
||||
link:http://askubuntu.com/questions/199110/how-can-i-instruct-nautilus-to-pre-generate-pdf-thumbnails[askubuntu.com discussion]
|
||||
|
||||
Also please note that no thumbnails can currently be generated or displayed
|
||||
for embedded documents (attachments, archive members, etc.).
|
||||
61
website/faqsandhowtos/SavingConfig.txt
Normal file
@ -0,0 +1,61 @@
|
||||
== User configuration backup
|
||||
|
||||
=== Why you would want to do this
|
||||
|
||||
If you are going to reinstall your system, and have some custom
|
||||
configuration, you may save some time by making a backup of your
|
||||
configuration and restoring it on the new system, rather than going through
|
||||
the menus to recreate it.
|
||||
|
||||
=== How to do it
|
||||
|
||||
==== Index/search configuration
|
||||
|
||||
The main recoll configuration data is normally kept inside '~/.recoll' or
|
||||
whatever *$RECOLL_CONFDIR* is set to.
|
||||
|
||||
This directory contains both configuration files and generated index
|
||||
data. In a standard configuration, the following files and directories
|
||||
contain generated data:
|
||||
|
||||
- 'xapiandb' contains the Xapian index, which normally consumes most of the
|
||||
total space.
|
||||
- 'aspdict.en.rws' contains the aspell dictionary used for spelling
|
||||
corrections.
|
||||
- 'mboxcache' contains cached offset data for email messages inside mbox
|
||||
folders.
|
||||
- 'webcache' contains saved web pages. This is more than a cache as
|
||||
destroying it will purge the corresponding data during the next
|
||||
indexing.
|
||||
|
||||
The other files are either very small or contain configuration data.
|
||||
|
||||
If you want to only save configuration, using minimum space, you can
|
||||
destroy the above files and directories (with the possible exception of
|
||||
'webcache'). Then taking a copy of the '.recoll' directory and adding the
|
||||
GUI configuration data described in the next section will get you a full
|
||||
configuration data backup.
|
||||
|
||||
==== GUI configuration
|
||||
|
||||
The parameters set from the _Query configuration_ Qt menus are stored in
|
||||
Qt standard places:
|
||||
|
||||
- '~/.qt/recollrc' for Qt 3.x
|
||||
- '~/.config/Recoll.org/recoll.conf' for Qt 4 and later
|
||||
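The configuration-only backup can be sketched as a small shell function. This is an illustration under stated assumptions, not an official tool: the `backup_recoll_config` name is hypothetical, *tar* is assumed to support `--exclude` (GNU tar and bsdtar do), and the excluded names are the generated data entries listed above. 'webcache' is deliberately kept.

```shell
#!/bin/sh
# Sketch: archive a Recoll configuration directory without generated data.
# $1: configuration directory (e.g. ~/.recoll), $2: output archive.
backup_recoll_config() {
    confdir=$1
    archive=$2
    # Archive relative to the parent so that the directory name is preserved.
    tar -czf "$archive" \
        --exclude='xapiandb' \
        --exclude='aspdict.*.rws' \
        --exclude='mboxcache' \
        -C "$(dirname "$confdir")" "$(basename "$confdir")"
}
```

The GUI settings file ('~/.config/Recoll.org/recoll.conf' for Qt 4 and later) lives outside the configuration directory and would have to be copied separately.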
|
||||
|
||||
==== Other data
|
||||
|
||||
If you wish to save index data in addition to the customisation files,
|
||||
which only makes sense if the document access paths do not change after
|
||||
reinstallation, you can just take a backup of the full '.recoll'
|
||||
directory, taking care that the storage locations for some data elements
|
||||
can be changed (not be inside '.recoll'):
|
||||
|
||||
- The index data is normally kept inside '~/.recoll/xapiandb', but the
|
||||
location of this directory can be modified by the +dbdir+
|
||||
configuration parameter if it is set (check 'recoll.conf').
|
||||
- If you use the Firefox Recoll plugin, the WEB history cache is normally
|
||||
kept inside '~/.recoll/webcache', but the location can be modified by
|
||||
the +webcachedir+ configuration parameter.
|
||||
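To find out whether one of these locations was changed, you can look the parameter up in 'recoll.conf'. The function below is a minimal sketch, assuming simple `name = value` lines in the top-level section of the file. The `conf_value` name is hypothetical and this is not a Recoll command.

```shell
#!/bin/sh
# Sketch: print the value of a "name = value" parameter from a Recoll
# configuration file, or nothing if the parameter is not set.
# Assumption: the value sits on a single line in the top-level section.
conf_value() {
    name=$1
    file=$2
    awk -F= -v key="$name" '
        /^[ \t]*#/ { next }                      # skip comment lines
        NF < 2     { next }                      # skip lines without "="
        {
            k = $1
            gsub(/^[ \t]+|[ \t]+$/, "", k)       # trim the parameter name
            if (k == key) {
                v = substr($0, index($0, "=") + 1)
                gsub(/^[ \t]+|[ \t]+$/, "", v)   # trim the value
                print v
                exit
            }
        }' "$file"
}
```

For example, `conf_value dbdir ~/.recoll/recoll.conf` prints nothing if the index is in its default place, and the custom location otherwise.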
109
website/faqsandhowtos/UnityLens.txt
Normal file
@ -0,0 +1,109 @@
|
||||
== Building and Installing the Ubuntu Unity Recoll Lens
|
||||
|
||||
Important preliminary notes:
|
||||
|
||||
- This only makes sense for Ubuntu versions using the Unity environment:
|
||||
Natty (11.04), Oneiric (11.10), Precise (12.04), and later.
|
||||
- _Remember that you still need to use the recoll GUI (or the recollindex
|
||||
command) to get the indexing going!_
|
||||
- The Lens is artificially limited to showing at most 20 results. Use the
|
||||
recoll GUI for more complete capabilities (or edit rclsearch.py, change
|
||||
the "if actual_results >= 20:" line).
|
||||
|
||||
|
||||
=== The Lens with Recoll 1.17 and later
|
||||
|
||||
If you are willing to install or upgrade to Recoll version 1.17, all
|
||||
necessary packages are on the Recoll PPA: you just need to add the
|
||||
repository to your system sources and add or upgrade the packages. *_This
|
||||
is the recommended approach!_*
|
||||
|
||||
----
|
||||
sudo add-apt-repository ppa:recoll-backports/recoll-1.15-on
|
||||
sudo apt-get update
|
||||
sudo apt-get install recoll-lens recoll
|
||||
----
|
||||
|
||||
This document may still be useful if you want to modify the lens source
|
||||
code.
|
||||
|
||||
=== The Lens with older Recoll versions
|
||||
|
||||
If, for some reason, you wish to test the Lens with an older Recoll
|
||||
version, read the following.
|
||||
|
||||
Please note that such an installation is somewhat crippled: you will not be
|
||||
able to display results for embedded documents (emails inside an mbox,
|
||||
attachments etc.). This requires a recoll command line option which is only
|
||||
available in 1.17 and later.
|
||||
|
||||
The Lens is based on the Recoll Python module which is not built by default
|
||||
for versions prior to 1.17, so you will first need to pull the Recoll
|
||||
source code (for your version), then untar and proceed with the
|
||||
configure/build instructions below.
|
||||
|
||||
The following uses --prefix=/usr. I have no real reason to believe
|
||||
that this would not work with /usr/local (lenses are also searched there by
|
||||
default). If you confirm that things work with another prefix, please drop
|
||||
me a line.
|
||||
|
||||
When doing this over a previous Recoll compilation, run a "make clean" to
|
||||
get rid of the non-PIC objects.
|
||||
|
||||
Note that the following instructions change nothing in your existing Recoll
|
||||
installation: they only install the Python module and the Unity Lens;
|
||||
recoll, recollindex etc. are unaffected.
|
||||
|
||||
'/TOP/OF/RECOLL/SRC' designates the top of the recoll source tree.
|
||||
|
||||
=== Configure and build the recoll library and python module, install the module
|
||||
|
||||
The following needs the development packages for Xapian, Python and zlib.
|
||||
|
||||
----
|
||||
cd /TOP/OF/RECOLL/SRC
|
||||
# May fail if no previous build was performed
|
||||
make clean
|
||||
|
||||
# the gui/x11 disabling is just here to avoid having to install the
|
||||
# development libraries for Qt.
|
||||
./configure --prefix=/usr --enable-pic --without-x --disable-qtgui
|
||||
make
|
||||
|
||||
cd python/recoll
|
||||
python setup.py build
|
||||
sudo python setup.py install
|
||||
----
|
||||
|
||||
=== Build and install the Unity Lens
|
||||
|
||||
----
|
||||
cd /TOP/OF/RECOLL/SRC
|
||||
cd desktop/unity-lens-recoll
|
||||
./configure --prefix=/usr --sysconfdir=/etc
|
||||
sudo make install
|
||||
|
||||
----
|
||||
|
||||
Voilà, it should work...
|
||||
|
||||
Try to start the Dash, you should see the Recoll checkerboard (or
|
||||
whatever...) in the Lens list.
|
||||
|
||||
The Recoll Lens expects a Recoll query language string, so you can use
|
||||
field searches, directory, size, and date filtering (see the
|
||||
link:http://www.lesbonscomptes.com/recoll/usermanual/rcl.search.lang.html[Recoll
|
||||
manual] for a description of the query language).
|
||||
|
||||
If you want to disable the Lens, I think that you just have to delete
|
||||
'/usr/share/unity/lenses/recoll'
|
||||
|
||||
Other installed files:
|
||||
|
||||
----
|
||||
/usr/libexec/unity-recoll-daemon
|
||||
/usr/share/dbus-1/services/unity-lens-recoll.service
|
||||
/usr/share/doc/unity-lens-recoll
|
||||
/usr/share/unity-lens-recoll
|
||||
----
|
||||
|
||||
68
website/faqsandhowtos/UsingOpenWith.txt
Normal file
@ -0,0 +1,68 @@
|
||||
== Using the _Open With_ context menu in recoll 1.20 and newer
|
||||
|
||||
Recoll versions 1.20 and newer have an _Open With_ entry in the result list
|
||||
context menu (the thing which pops up on a right click).
|
||||
|
||||
This allows choosing the application used to edit the document, instead of
|
||||
using the default one.
|
||||
|
||||
The list of applications is built from the desktop files found inside
|
||||
'/usr/share/applications'. For each application on the system, these
|
||||
files list the MIME types that the application can process.
|
||||
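To see which applications declare a given MIME type, you can scan the desktop files yourself. The following is a minimal sketch, not the code Recoll uses: the `desktop_files_for_mime` name is hypothetical, and it only looks at the `MimeType` key, which holds a semicolon-separated list.

```shell
#!/bin/sh
# Sketch: list the desktop files in a directory that declare a MIME type.
# $1: MIME type (e.g. application/pdf), $2: directory, defaulting to
# /usr/share/applications.
desktop_files_for_mime() {
    mime=$1
    dir=${2:-/usr/share/applications}
    for f in "$dir"/*.desktop; do
        [ -f "$f" ] || continue
        # Split the MimeType value on ";" and compare each entry exactly.
        if awk -F= -v want="$mime" '
            $1 == "MimeType" {
                n = split($2, types, ";")
                for (i = 1; i <= n; i++)
                    if (types[i] == want) found = 1
            }
            END { exit !found }' "$f"
        then
            printf '%s\n' "$f"
        fi
    done
}
```

Running `desktop_files_for_mime application/pdf` lists the candidates for PDF documents. If the application you expect is missing from the output, its desktop file is probably absent or does not list the type.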
|
||||
If the application which you would want listed does not appear, the most
|
||||
probable cause is that it has no desktop file, which could happen due to a
|
||||
number of reasons.
|
||||
|
||||
This can be fixed very easily: just add a +.desktop+ file to
|
||||
'/usr/share/applications', starting from an existing one as a template.
|
||||
|
||||
As an example, based on an original idea from Recoll user +florianbw+,
|
||||
the following describes setting up a script for editing a PDF document
|
||||
title found in the recoll result list.
|
||||
|
||||
The script uses the *zenity* shell script dialog box tool to let you
|
||||
enter the new title, and then executes *exiftool* to actually change
|
||||
the document.
|
||||
|
||||
----
|
||||
#!/bin/sh
|
||||
|
||||
PDF=$1
|
||||
TITLE=`exiftool -Title -s3 "$PDF"`
|
||||
|
||||
RES=`zenity --entry \
|
||||
--title="Change PDF Title" \
|
||||
--text="Enter the Title:" \
|
||||
--entry-text "$TITLE"`
|
||||
|
||||
if [ "$RES" != "" ]; then
|
||||
echo -n "Changing title to $RES ... " && \
|
||||
exiftool -Title="$RES" "$PDF" && \
|
||||
recollindex -i "$PDF" && echo "Done!"
|
||||
else
|
||||
echo "No title entered"
|
||||
fi
|
||||
----
|
||||
|
||||
Name it, for example, 'pdf-edit-title.sh', and make it executable
|
||||
(`chmod a+x pdf-edit-title.sh`).
|
||||
|
||||
Then create a file named 'pdf-edit-title.desktop' inside
|
||||
'/usr/share/applications'. The file name does not need to be the same as the
|
||||
script's, this is just to make things clearer:
|
||||
|
||||
----
|
||||
[Desktop Entry]
|
||||
Name=PDF Title Editor
|
||||
Comment=Small script based on exiftool used to edit a pdf document title
|
||||
Exec=/home/dockes/bin/pdf-edit-title.sh %F
|
||||
Type=Application
|
||||
MimeType=application/pdf;
|
||||
----
|
||||
|
||||
You're done! Restart Recoll, perform a search and right-click on a PDF
|
||||
result: you should see an entry named _PDF Title Editor_ in the _Open
|
||||
With_ list. Click on it, and you will be able to edit the title.
|
||||
|
||||
|
||||
99
website/faqsandhowtos/WhyIsMyFileNotIndexed.txt
Normal file
@ -0,0 +1,99 @@
|
||||
== Using the log file to investigate indexing issues
|
||||
|
||||
All *Recoll* processes print trace messages. By default these go to the
|
||||
standard error output, and you may not ever see them (in the case, for
|
||||
example, of the *recoll* GUI started from the desktop interface).
|
||||
|
||||
There are a number of potential issues with indexing that may need
|
||||
investigation, such as:
|
||||
|
||||
- A file can't be found by searching even if it appears that it should have
|
||||
been indexed (this could happen because the file is not selected at all or
|
||||
because a filter program crashes).
|
||||
- The indexing process gets stuck and never finishes.
|
||||
- The indexing process ends up with an error.
|
||||
- The indexing process seems to be using too much system capacity.
|
||||
|
||||
The right way to approach these problems is to use the *recollindex*
|
||||
command line tool (instead of the *recoll* GUI), and to set up the
|
||||
trace log to provide information about what indexing is actually doing.
|
||||
|
||||
Trace log parameters can be set either from the GUI _Preferences->Indexing
|
||||
Configuration->Global Parameters_ panel, or by editing the configuration
|
||||
file '~/.recoll/recoll.conf'. You should set the following parameters:
|
||||
|
||||
----
|
||||
loglevel = 6
|
||||
logfilename = stderr
|
||||
thrQSizes = -1 -1 -1
|
||||
----
|
||||
|
||||
We use _stderr_ instead of an actual file in order to capture direct filter
|
||||
messages (such as a *python* stack trace) along with normal
|
||||
*recollindex* messages.
|
||||
|
||||
The last line sets recollindex for single-threaded operation, which will
|
||||
make the log much more readable.
|
||||
|
||||
You should then check that no *recoll* or *recollindex* process is
|
||||
currently running, and kill any you find.
|
||||
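One possible way to perform this check from a terminal is sketched below. It assumes that the *pgrep* command (from procps or an equivalent package) is installed, and the `running_instances` name is hypothetical.

```shell
#!/bin/sh
# Sketch: report running processes whose name is exactly $1.
# Assumption: pgrep is installed.
# Returns 0 and prints "PID name" lines if any are found, non-zero otherwise.
running_instances() {
    pgrep -l -x "$1"
}
```

For example, `running_instances recollindex` prints the process identifiers of any running indexer, which you can then stop before starting the test run.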
|
||||
Then, if this is an issue about an identified file, try indexing it only:
|
||||
|
||||
----
|
||||
recollindex -i myunfindablefile.xxx > /tmp/myindexlog 2>&1
|
||||
----
|
||||
|
||||
If this is a general issue with indexing (process not finishing properly),
|
||||
just start it:
|
||||
|
||||
----
|
||||
recollindex > /tmp/myindexlog 2>&1
|
||||
----
|
||||
|
||||
Usually, having a look at the trace will let you see what is wrong (e.g.:
|
||||
a configuration issue or missing filter), and solve the problem.
|
||||
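With a log level of 6 the trace can be large, so it is usually easier to start from the lines which mention the file you are interested in. A minimal sketch follows; the `log_lines_for` name is hypothetical, and the file name is matched literally.

```shell
#!/bin/sh
# Sketch: print, with line numbers, the log lines that mention a file name.
# $1: file name or path fragment (matched literally), $2: log file.
log_lines_for() {
    grep -n -F -- "$1" "$2"
}
```

For example, `log_lines_for myunfindablefile.xxx /tmp/myindexlog` shows where the file appears in the trace, and the line numbers tell you where to read the surrounding messages.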
|
||||
In case of indexer misbehaviour (e.g. using too much memory), you should run
|
||||
_tail -f_ on the log to see what is going on.
|
||||
|
||||
If this is not enough, please
|
||||
link:http://bitbucket.org/medoc/recoll/issues/new[open a tracker issue] and
|
||||
attach or link to the log data, or just email me (jfd at recoll.org).
|
||||
|
||||
*recollindex* and *recollindex -i* usually have the same criteria to
|
||||
include a file or not (but see the _Path gotcha_ note below). It may
|
||||
happen that they behave differently, so it may sometimes be useful to run a
|
||||
full *recollindex* even for a specific file, but this will produce a
|
||||
big log file.
|
||||
|
||||
When you are done, it is better to reset the verbosity to a reasonable
|
||||
level (e.g.: +2+ : just errors, +4+ : basic traces).
|
||||
|
||||
=== Note: the path gotcha
|
||||
|
||||
*recollindex -i* will only index files under the directories defined by the
|
||||
+topdirs+ configuration variable (your home directory by
|
||||
default). Unfortunately, the test is done on the file path text, ignoring
|
||||
possible symbolic links. If you give a simple file name as a parameter to
|
||||
*recollindex -i* and there are symbolic links inside the +topdirs+
|
||||
entries, the comparison may fail. For example, if your home directory is
|
||||
'/home/me/' and '/home/' is a link to '/usr/home/', *recollindex -i
|
||||
somefilename* will actually try to index '/usr/home/me/somefilename', and
|
||||
fail (because '/usr/home/me/' is not a subdirectory of '/home/me/'). This
|
||||
will manifest itself in the log by a message like the following.
|
||||
|
||||
----
|
||||
:4:../index/fsindexer.cpp:149:FsIndexer::indexFiles: skipping [/usr/home/me/somefile] (ntd)
|
||||
----
|
||||
|
||||
If this happens, give a full path consistent with what is found in the
|
||||
configuration file (e.g.: _recollindex -i /home/me/somefile_).
|
||||
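The kind of purely textual comparison described above can be illustrated with a few lines of shell. This is a sketch for understanding the problem, not the actual Recoll code, and the `is_under_topdir` name is hypothetical.

```shell
#!/bin/sh
# Sketch: textual test of whether a path lies under a top directory,
# comparing path strings only (symbolic links are not resolved).
# Returns 0 if $1 is $2 itself or is below it, 1 otherwise.
is_under_topdir() {
    path=${1%/}
    top=${2%/}
    case "$path" in
        "$top"|"$top"/*) return 0 ;;
        *) return 1 ;;
    esac
}
```

With the example above, `is_under_topdir /usr/home/me/somefile /home/me` fails although both names designate the same directory tree, while `is_under_topdir /home/me/somefile /home/me` succeeds.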
|
||||
=== File system occupation
|
||||
|
||||
One of the possible reasons for failed indexing is a +maxfsoccup+
|
||||
parameter set too low. This is the value of file system occupation, not
|
||||
free space, where indexing will stop. It is set from the GUI indexing
|
||||
configuration or by editing 'recoll.conf'. A value of 0 implies no
|
||||
checking, but a very low, non-zero, value will just prevent indexing.
|
||||
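To compare the current situation with the configured limit, you can ask *df* for the occupation of the file system which holds the index. The following is a minimal sketch, assuming that +maxfsoccup+ is expressed as a percentage of occupied space and that the file system name contains no spaces; the `fs_occupation` name is hypothetical.

```shell
#!/bin/sh
# Sketch: print the occupation percentage (0-100) of the file system
# holding directory $1, as reported by the POSIX output format of df.
fs_occupation() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}
```

For example, if `fs_occupation ~/.recoll` prints a number above the +maxfsoccup+ value, this is a likely reason for indexing to stop.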
65
website/faqsandhowtos/WikiIndex.txt
Normal file
@ -0,0 +1,65 @@
|
||||
== Recoll Wiki file index
|
||||
link:ElinksWeb.html[Extending the Recoll Firefox visited web page indexing mechanism to other browsers]
|
||||
|
||||
link:FaqsAndHowTos.html[Faqs and Howtos]
|
||||
|
||||
link:FilterArch.html[Recoll input filters ]
|
||||
|
||||
link:FilterRetrofit.html[Installing a filter for a new document type]
|
||||
|
||||
link:FilteringOutZipArchiveMembers.html[Filtering out Zip archive members]
|
||||
|
||||
link:GUIKeyboard.html[Recoll GUI keyboard navigation]
|
||||
|
||||
link:HandleCustomField.html[Generating a custom field and using it to sort results]
|
||||
|
||||
link:Home.html[Welcome to the Recoll Wiki]
|
||||
|
||||
link:HotRecoll.html[Recoll hotkey: starting / hiding recoll with a keyboard shortcut]
|
||||
|
||||
link:IndexMailHeader.html[Indexing arbitrary mail headers ]
|
||||
|
||||
link:IndexMozillaCalendari.html[Indexing Mozilla calendar data ]
|
||||
|
||||
link:IndexOnAc.html[Laptops: automatically starting or stopping indexing according to AC power status]
|
||||
|
||||
link:IndexOutlook.html[Indexing Outlook archives]
|
||||
|
||||
link:IndexWebHistory.html[Indexing Web history with the Firefox extension ]
|
||||
|
||||
link:MultipleIndexes.html[Creating and using multiple indexes]
|
||||
|
||||
link:MuttAndRecoll.html[Interfacing Recoll and Mutt]
|
||||
|
||||
link:NonAsciiFileNames.html[Unix and non-ASCII file names, a summary of issues]
|
||||
|
||||
link:OpenHelperScript.html[Starting native applications ]
|
||||
|
||||
link:PreventIndexingDir.html[Preventing indexing in a directory]
|
||||
|
||||
link:ProblemSolvingData.html[Gathering useful data for asking help about or reporting a Recoll issue]
|
||||
|
||||
link:QpdfviewHelperScript.html[Starting native applications ]
|
||||
|
||||
link:QueryFromC.html[Querying Recoll from a C program]
|
||||
|
||||
link:ReplaceCategories.html[Replacing the Category filter controls]
|
||||
|
||||
link:ResultsThumbnails.html[Result list thumbnails and how to create them]
|
||||
|
||||
link:SavingConfig.html[User configuration backup]
|
||||
|
||||
link:UnityLens.html[Building and Installing the Ubuntu Unity Recoll Lens]
|
||||
|
||||
link:UsingOpenWith.html[Using the Open With context menu in recoll 1.20 and newer]
|
||||
|
||||
link:WhyIsMyFileNotIndexed.html[Using the log file to investigate indexing issues]
|
||||
|
||||
link:XDGBase.html[XDG: Tidying Recoll data storage]
|
||||
|
||||
link:ZDevCaseAndDiacritics1.html[Character case and diacritic marks (1), issues with stemming]
|
||||
|
||||
link:ZDevCaseAndDiacritics2.html[Character case and diacritic marks (2), user interface]
|
||||
|
||||
link:ZDevCaseAndDiacritics3.html[Character case and diacritic marks (3), implementation]
|
||||
|
||||
42
website/faqsandhowtos/XDGBase.txt
Normal file
@ -0,0 +1,42 @@
|
||||
== XDG: Tidying Recoll data storage ==
|
||||
|
||||
The default storage structure of Recoll configuration and index data is
|
||||
quite at odds with what is recommended by the
|
||||
link:http://standards.freedesktop.org/basedir-spec/basedir-spec-latest.html[XDG
|
||||
Base Directory Specification], the reason being that it predates said spec.
|
||||
|
||||
By default, Recoll stores all its data in a single directory: '$HOME/.recoll'
|
||||
|
||||
This is not going to change, because it would be quite disturbing for
|
||||
current users.
|
||||
|
||||
However, the location of this directory can be modified using the
|
||||
+$RECOLL_CONFDIR+ environment variable.
|
||||
|
||||
Furthermore all significant Recoll data categories can be moved away from
|
||||
the configuration directory (maybe to '$HOME/.cache'), by setting
|
||||
configuration variables:
|
||||
|
||||
* _dbdir_ defines the location for storing the Xapian
|
||||
index. This could be set to, e.g., '$HOME/.cache/recoll/xapiandb'. It is
|
||||
quite recommended that
|
||||
this directory be dedicated to Xapian (don't store other things in
|
||||
there).
|
||||
* _mboxcachedir_ defines the location for caching access speedup information
|
||||
about mail folders in mbox format. e.g. '$HOME/.cache/recoll/mboxcache'
|
||||
* New in 1.22: you can use _aspellDictDir_ to define the storage
|
||||
location for the aspell spelling approximation
|
||||
dictionary. E.g. '$HOME/.cache/recoll'
|
||||
* _webcachedir_ may be used to define where the visited web pages
|
||||
archive is stored. E.g. '$HOME/.cache/recoll/webcache'. This is only used
|
||||
if you activate the Firefox plugin and web history indexing. You may
|
||||
want to think a bit more about where to store it, because, contrary to
|
||||
the above, this is not discardable data: your Recoll Web history goes
|
||||
away if you delete it.
|
||||
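As an illustration, the settings above could be generated by a small shell function and then reviewed before being added to 'recoll.conf'. This is a sketch under stated assumptions: the `xdg_recoll_fragment` name is hypothetical, the directory layout is just the one suggested in the list, and the paths are written out in full so that nothing depends on variable expansion inside the configuration file.

```shell
#!/bin/sh
# Sketch: print recoll.conf lines relocating generated data under a cache
# directory. $1: base cache directory (default: $HOME/.cache).
xdg_recoll_fragment() {
    base=${1:-$HOME/.cache}/recoll
    printf 'dbdir = %s/xapiandb\n' "$base"
    printf 'mboxcachedir = %s/mboxcache\n' "$base"
    printf 'aspellDictDir = %s\n' "$base"
    printf 'webcachedir = %s/webcache\n' "$base"
}
```

Since the web cache is not discardable, you may prefer to drop or change the last line before using the output.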
|
||||
If you use multiple Recoll configurations, each will have to be customized.
|
||||
|
||||
Once these are put away, there are still a few modifiable files in the
|
||||
configuration directory, for example the 'recoll.pid' and 'history'
|
||||
files, but these are small files. Moving 'recoll.pid' away would be a
|
||||
serious headache because it is used by scripts.
|
||||
143
website/faqsandhowtos/ZDevCaseAndDiacritics1.txt
Normal file
@ -0,0 +1,143 @@
|
||||
== Character case and diacritic marks (1), issues with stemming
|
||||
|
||||
=== Case and diacritics in Recoll
|
||||
|
||||
Recoll versions up to 1.17 almost fully ignore character case and diacritic
|
||||
marks.
|
||||
|
||||
All terms are converted to lower case and unaccented before they are
|
||||
written to the index. There are only two exceptions:
|
||||
|
||||
* File paths (as used in _dir:_ clauses) are not converted. This might
|
||||
be a bug or a feature, but the main reason is that we don't know how they
|
||||
are encoded.
|
||||
* It is possible to specify that some characters will keep their diacritic
|
||||
marks, because the entity formed by the character and the diacritic mark
|
||||
is considered to be a different letter, not a modified one. This is
|
||||
highly dependent on the language. For example, in Swedish, +å+ should
|
||||
be preserved, not turned into +a+.
|
||||
|
||||
As a necessary consequence, the same transformations are applied to search
|
||||
terms, and it is impossible to search for a specific capitalization of a
|
||||
word (+US+ is looked for as +us+), or a specific accented form
|
||||
(+café+ will be looked for as +cafe+).
|
||||
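The transformation applied to both indexed and searched terms can be illustrated by a toy shell function. This is not Recoll's implementation, only a sketch: the `fold_term` name is hypothetical, only ASCII letters are lowercased, and only the few accented letters listed in the code are handled. Any other character, such as the Swedish +å+, is left alone, which mimics the exception described above.

```shell
#!/bin/sh
# Sketch: fold a term the way a stripped index would, for a handful of
# characters only. The listed accented letters are replaced by their base
# letter, then ASCII letters are lowercased.
fold_term() {
    printf '%s\n' "$1" |
        sed -e 's/é/e/g' -e 's/è/e/g' -e 's/ê/e/g' \
            -e 's/à/a/g' -e 's/â/a/g' -e 's/É/e/g' |
        tr 'A-Z' 'a-z'
}
```

With this function, `fold_term Café` and `fold_term cafe` produce the same output, which is why the two spellings cannot be told apart once the index has been built this way.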
|
||||
However, there are some cases where you would like to be more specific:
|
||||
|
||||
* Searching for +US+ or +us+ should probably return different results.
|
||||
* Diacritics are seldom significant in English, but we can find a
|
||||
few examples anyway: +sake+ and +saké+, +mate+ and +maté+. Of
|
||||
course, there are many more cases in languages which use more diacritics.
|
||||
|
||||
On the other hand, accents are often mistyped or forgotten (résumé, résume,
|
||||
resume?), and capitalization is most often insignificant, so that it is
|
||||
very important to retain the capability to ignore accent and character
|
||||
case differences, and that the discrimination can be easily switched on or
|
||||
off for each search (or even for specific terms).
|
||||
|
||||
This text and other pages which will follow will discuss issues in adding
|
||||
character case and diacritics sensitivity to Recoll, under the assumption
|
||||
that the main index will contain the raw source terms instead of
|
||||
case-folded and unaccented ones.
|
||||
|
||||
The following will use the _unaccent_ neologism to mean _remove
|
||||
diacritic marks_ (and not only accents).
|
||||
|
||||
English examples are used when possible, but given the limited use of
|
||||
diacritics in English, some French will probably creep in.
|
||||
|
||||
=== Diacritics and stemming
|
||||
|
||||
Stemming is the process by which we extend a search to terms related by
|
||||
grammatical inflexion, for example singular/plural, verb tenses, etc. For
|
||||
example a search for +floor+ is normally expanded by Recoll to +floors,
|
||||
floored, flooring, ...+
|
||||
|
||||
In practice Recoll has a separate data structure that has stemmed terms
|
||||
(stems) as keys pointing to a list of expansion terms
|
||||
`floor -> (floor,floors,floorings,...)`
|
||||
|
||||
Stemming should be applied to terms before they are stripped of
|
||||
diacritics. Accents may have a grammatical significance, and the accent may
|
||||
change how the term is stemmed. For example, in French the +âmes+ suffix
|
||||
generally marks a past conjugation but +ames+ does not. The standard
|
||||
Xapian French stemmer will turn +évitâmes+ (avoided) into an +évit+ stem,
|
||||
but +évitames+ will be turned into +évitam+ (stripping
|
||||
plural and feminine suffixes).
|
||||
|
||||
When the search is set to ignore diacritics, this poses a specific problem:
|
||||
if the user enters the search term without accents (which is correct
|
||||
because the system is supposed to ignore them), there is no guarantee that
|
||||
the term will be correctly expanded by stemming.
|
||||
|
||||
The diacritic mismatch breaks the family relationship between the stem
|
||||
siblings, and this is independent of the type of index: it will happen with
|
||||
an index where diacritics are stripped just as with a raw one.
|
||||
|
||||
The simpler case where diacritics in the original term only affect
|
||||
diacritics in the stem also necessitates specific processing, but it is
|
||||
easier to work around.
|
||||
|
||||
Two examples illustrating these issues follow.
|
||||
|
||||
==== The simple case: diacritics in the term only affect diacritics in the stem
|
||||
|
||||
Let's imagine that the document set contains the term +éviter+
|
||||
(infinitive of +to avoid+), but not +évite+ (present). The only term in
|
||||
the actual index is then +éviter+.
|
||||
|
||||
The user enters an unaccented +evite+, counting on the
|
||||
diacritics-insensitive search mode to deal with the accents. As +évite+
|
||||
is not present in the index, we have no way to guess that +evite+ is
|
||||
really +évite+.
|
||||
|
||||
The stemmer will turn +evite+ into +evit+. There is no way that this
|
||||
can be related to +éviter+, and this legitimate result can't be found.
|
||||
|
||||
There is a way around this: we can compute a separate
|
||||
stem expansion dictionary for unaccented terms. This dictionary, to be used
|
||||
with diacritic-insensitive searches only, contains the relationship
|
||||
between +evit+ and +eviter+ (as +éviter+ is in the index). We can
|
||||
then relate +eviter+ and +éviter+ because they differ only by accents,
|
||||
and the search will find the document with +éviter+.
|
||||
|
||||
==== The bad case: diacritics in the term change the stem beyond diacritics
|
||||
|
||||
Some grammatically significant accents will cause unexpectedly missing
|
||||
search results when using a supposedly diacritics-insensitive search mode.
|
||||
|
||||
Let's imagine that the document set contains the term +éviter+
|
||||
(infinitive of +to avoid+), but not +évitâmes+ (past). So the stemming
|
||||
expansion table has an entry for +évit+ -> +éviter+.
|
||||
|
||||
If the user enters an unaccented +evitames+, she would expect to find the
|
||||
documents containing +éviter+ in the results, because the latter term is
|
||||
a stemming sibling of +évitâmes+ and the search is supposedly not
|
||||
influenced by diacritics, so that +evitames+ and +évitâmes+ should be
|
||||
equivalent.
|
||||
|
||||
However, our search is now in trouble, because +évitâmes+ is not in any
|
||||
document, so that there is no data in the index which would inform us about
|
||||
how to transform the input term into something that differs only by accents
|
||||
but would yield a correct input for the stemmer.
|
||||
|
||||
If we try to feed the raw user input to the stemmer, it will propose
|
||||
an +evitam+ stem, which will not work, because the stem that actually
|
||||
exists is +évit+, and +evitam+ can not be related to +éviter+.
|
||||
|
||||
The only palliative approach I can think of would be a spelling correction
|
||||
of the input, performed independently of the actual index contents, which
|
||||
would notice that +évitames+ is not a French word and propose a change or an
|
||||
expansion to +évitâmes+, which would correctly stem to +évit+ and allow
|
||||
us to find +éviter+.
|
||||
|
||||
This issue is not specific to Recoll or indeed to the fact that the index
|
||||
retains accents or not. As far as I can see, it is an intrinsic bad
|
||||
interaction between diacritics insensitivity and stemming.
|
||||
|
||||
It is also interesting to note that this case becomes less probable when
|
||||
the data set becomes bigger, because more term inflexions will then be
|
||||
present in the index.
|
||||
|
||||
We'll next think about an link:ZDevCaseAndDiacritics2.html[appropriate
|
||||
interface].
|
||||
122
website/faqsandhowtos/ZDevCaseAndDiacritics2.txt
Normal file
@ -0,0 +1,122 @@
|
||||
== Character case and diacritic marks (2), user interface
|
||||
|
||||
In a link:ZDevCaseAndDiacritics1.html[previous document], we discussed some
|
||||
of the problems which arise when mixing case/diacritics sensitivity and
|
||||
stemming.
|
||||
|
||||
As of version 1.18, Recoll can create two types of indexes:
|
||||
* _Dumb_ indexes contain terms which are lowercased and stripped of
|
||||
diacritics. Searches using such an index are naturally case- and
|
||||
diacritics- insensitive: search terms are stripped before processing.
|
||||
* _Raw_ indexes contain terms which are just like they were found in the
|
||||
source document. Searching such an index is naturally sensitive to case
|
||||
and diacritics, and can be made insensitive by further processing.
|
||||
|
||||
The following explains how users can control these Recoll features.
|
||||
|
||||
=== Controlling the type of index we create: stripped or raw
|
||||
|
||||
The kind of index that recoll creates is determined by:
|
||||
|
||||
* A build-time *configure* switch: _--enable-stripchars_. If this is
|
||||
set, the code for case and diacritics sensitivity is not compiled in and
|
||||
recoll will work like the previous versions: unaccented and casefolded
|
||||
index, no runtime options for case or diacritics sensitivity
|
||||
|
||||
* An indexing configuration switch (in recoll.conf): if Recoll was built
|
||||
with _--disable-stripchars_, this will provide a dynamic way to return
|
||||
to the "traditional" index. The case and diacritics code will be present
|
||||
but inactive. Normally, a recoll installation with this switch set
|
||||
should behave exactly like one built with _--enable-stripchars_. When
|
||||
using multiple indexes, this switch MUST be consistent between
|
||||
indexes. There is no support whatsoever for mixing raw and dumb indexes.
|
||||
The option is named _indexStripChars_, and it is not settable from the
|
||||
GUI to avoid errors. This is something that would typically be set once
|
||||
and for all for a given installation. We need to decide what the default
|
||||
value will be for 1.18.
|
||||
|
||||
* A number of query time switches. Using these it is also possible to
|
||||
perform a search insensitive to case and diacritics on a raw index. Note
|
||||
however, that, given the complexity of the issues involved, I give no
|
||||
guarantee at this time that this will yield exactly the same results as
|
||||
searching a dumb index. Details about query time behaviour follow.
|
||||
|
||||
|
||||
=== Controlling stem, case and diacritics expansion: user query interface
|
||||
|
||||
Recoll versions up to 1.17 were insensitive to case and diacritics. We only
|
||||
needed to give the user a way to control stem expansion. This was done in
|
||||
three ways:
|
||||
|
||||
* Globally, by setting a menu option.
|
||||
* Globally, by setting the stemming language value to empty.
|
||||
* On a term by term basis by capitalizing the term, or, in query language
|
||||
mode only, by using an 'l' clause modifier (_"term"l_).
|
||||
|
||||
After switching to an unstripped index, capable of case and diacritic
|
||||
sensitivity, we need ways to control what processing is performed among:
|
||||
|
||||
* Case expansion.
|
||||
* Diacritics expansion.
|
||||
* Stem expansion.
|
||||
|
||||
The default mode will be compatible with the previous version, because
|
||||
this is most generally what we want to do: ignore case and diacritics,
|
||||
expand stems.
|
||||
|
||||
There are two easy approaches for controlling the parameters:
|
||||
* Global options set in the GUI menus or as *recollq* command line
|
||||
switches.
|
||||
* Per-clause options set by modifiers in the query language.
|
||||
|
||||
We would like, however, to let the user's input automatically override the
|
||||
defaults in a sensible way. For example:
|
||||
|
||||
* If a term is entered with diacritics, diacritic sensitivity is turned on
|
||||
(for this term only).
|
||||
* If a term is entered with upper-case characters, case sensitivity is
|
||||
turned on. In this case, we turn off stem expansion, because it really
|
||||
makes no sense with case sensitivity.
|
||||
|
||||
With this method we are stuck with 3 problems (only if the global mode is
|
||||
set to insensitive, and we're not using the query language):
|
||||
|
||||
* Turning off stemming without turning on case sensitivity.
|
||||
* Searching for an all lower-case term in case-sensitive mode.
|
||||
* Searching for a term without diacritics in diacritic-sensitive mode.
|
||||
|
||||
The two latter issues are relatively marginal and can be worked around easily
|
||||
by switching to query language mode or using negative clauses in the
|
||||
advanced search.
|
||||
|
||||
However, we need to be able to turn stemming off while remaining
|
||||
insensitive to case, and we need to stay reasonably compatible with the
|
||||
previous versions. This means that a term which has a capital first letter
|
||||
but is otherwise lowercase will turn stemming off, but not case sensitivity
|
||||
on.
|
||||
|
||||
So we're left with how to search for such a term in a case-sensitive way,
|
||||
and for this, you'll have to use global options or the query language.
|
||||
|
||||
The modified method is:
|
||||
|
||||
* If a term is entered with diacritics, diacritic sensitivity is turned on
|
||||
(for this term only).
|
||||
* If the first letter in a term is upper-case and the rest is lower-case,
|
||||
we turn stem expansion off, but we do not become case-sensitive.
|
||||
* If any letter in a term except the first is upper-case, case sensitivity
|
||||
is turned on. Stem expansion is also turned off (even if the first
|
||||
letter is lower-case), because it really makes no sense with case
|
||||
sensitivity.
|
||||
* To search for an all lower-case or capitalized term in a case-sensitive
|
||||
way, use the query language: _"Capitalized"C_, _"lowercase"C_.
|
||||
* Use the query language and the "D" modifier to turn on diacritics
|
||||
sensitivity.
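
The rules above can be sketched as a small function which derives per-term options from the way the term was typed. This is only an illustration, not Recoll's actual code: the names `TermOptions` and `options_for_term` are hypothetical, and the sketch assumes that diacritics sensitivity also disables stem expansion, in line with the remark that these do not combine.

```python
import unicodedata
from typing import NamedTuple


class TermOptions(NamedTuple):
    """Hypothetical container for the per-term search options."""
    case_sensitive: bool
    diac_sensitive: bool
    stem_expand: bool


def has_diacritics(term: str) -> bool:
    # Decompose the term and look for combining marks (accents, etc.).
    decomposed = unicodedata.normalize("NFD", term)
    return any(unicodedata.combining(c) for c in decomposed)


def options_for_term(term: str) -> TermOptions:
    # Diacritics in the entry turn on diacritics sensitivity for this term.
    diac = has_diacritics(term)
    # Any upper-case letter after the first one turns on case sensitivity.
    case = any(c.isupper() for c in term[1:])
    # A capital first letter alone only turns stem expansion off.
    capitalized = term[:1].isupper()
    # Assumption: stemming is kept only for plain lower-case, unaccented input.
    stem = not (case or capitalized or diac)
    return TermOptions(case, diac, stem)


if __name__ == "__main__":
    for entry in ("resume", "Resume", "reSume", "résumé"):
        print(entry, options_for_term(entry))
```

Searching for an all lower-case or capitalized term in a case-sensitive way is not reachable through this function, which is why the query language modifiers remain necessary.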
|
||||
|
||||
It can be noted that some combinations of choices do not make sense and
|
||||
they are not allowed by Recoll: for example, diacritics or case sensitivity
|
||||
do not make sense with stem expansion (which cannot preserve diacritics in
|
||||
any meaningful general way).
|
||||
|
||||
The [[ZDevCaseAndDiacritics3.wiki|next page]] describes the actual
|
||||
implementation in Recoll 1.18.
|
||||
67
website/faqsandhowtos/ZDevCaseAndDiacritics3.txt
Normal file
@ -0,0 +1,67 @@
|
||||
== Character case and diacritic marks (3), implementation
|
||||
|
||||
In previous pages, we discussed link:ZDevCaseAndDiacritics1.html[diacritics
|
||||
and stemming], and an link:ZDevCaseAndDiacritics2.html[appropriate
|
||||
interface] for switchable search sensitivity to diacritics and character
|
||||
case.
|
||||
|
||||
So you are in this mood again and you don't want to type accents (maybe you're
|
||||
stuck with a QWERTY American English keyboard), or conversely you
|
||||
want to resume looking for your résumé, and you've told Recoll as much,
|
||||
using the appropriate interface. What happens then?
|
||||
|
||||
The second case is easy if the index is raw, and mostly impossible if it is
|
||||
stripped. So we'll concentrate on the first case: how to achieve case and
|
||||
diacritics insensitivity on a raw index?
|
||||
|
||||
Recoll uses three expansion tables:
|
||||
|
||||
* The first table has stripped and lowercased terms as keys and raw terms as
|
||||
data: +mate -> (mate, maté, MATE,...)+.
|
||||
|
||||
* The second table has lowercased stems as keys and original lowercase terms
|
||||
as data (when using multiple languages, there are several such tables):
|
||||
+évit -> (éviter, évite, évitâmes, ...)+.
|
||||
|
||||
* The third table has stripped and lowercased stems as keys and stripped
|
||||
lowercased terms as data:
|
||||
+evit -> (eviter, evite, evitons)+ and +evitam -> (evitames, ...)+
|
||||
|
||||
The first table can be used for full case and diacritics expansion or for
|
||||
only one of those, by post-filtering the results of full expansion (e.g. if
|
||||
we only want diacritics expansion, we filter by stripping diacritics from
|
||||
each result term and checking that it's identical to the input). For example
|
||||
if we have +mate -> (mate, maté, MATE, MATÉ)+ in the table and want to
|
||||
only perform case expansion for an input of +maté+, we apply case folding
|
||||
to each term of the initial output and keep only those which fold to +maté+
|
||||
(+maté+ and +MATÉ+), as the others fold to +mate+, which differs from the input.
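
As an illustration of the post-filtering idea, here is a minimal sketch using a toy version of the first table. The function names and the table contents are assumptions made for the example, not Recoll's implementation, and the input is assumed to be in composed (NFC) form.

```python
import unicodedata


def strip_diacritics(s: str) -> str:
    # Decompose, then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def fold(s: str) -> str:
    # Key used by the first table: stripped and lowercased.
    return strip_diacritics(s).lower()


# Toy first table: stripped/lowercased key -> raw terms found in the index.
CASE_DIAC_TABLE = {"mate": ["mate", "maté", "MATE", "MATÉ"]}


def expand(term: str, case_sensitive: bool, diac_sensitive: bool) -> list:
    candidates = CASE_DIAC_TABLE.get(fold(term), [term])
    kept = []
    for cand in candidates:
        # Case-sensitive: only diacritics may differ from the input.
        if case_sensitive and strip_diacritics(cand) != strip_diacritics(term):
            continue
        # Diacritics-sensitive: only character case may differ from the input.
        if diac_sensitive and cand.lower() != term.lower():
            continue
        kept.append(cand)
    return kept


if __name__ == "__main__":
    print(expand("mate", case_sensitive=False, diac_sensitive=False))
    print(expand("mate", case_sensitive=True, diac_sensitive=False))
    print(expand("maté", case_sensitive=False, diac_sensitive=True))
```

Using a single table with post-filtering trades a little query-time work for a simpler index, since no separate case-only or diacritics-only table has to be maintained.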
|
||||
|
||||
We only perform stemming expansion when case and diacritics sensitivity is
|
||||
off. It is performed using the second and third tables, both on the
|
||||
lowercased and lowercased/stripped output of the first step, and each term
|
||||
in the stemming output is expanded again for case and diacritics (using the first table).
|
||||
|
||||
A full example of the expansion occurring during an insensitive search
|
||||
for +resume+ using French stemming on a mixed English/French index
|
||||
follows. An important thing to remember is that the result of each
|
||||
expansion is a function of the terms actually present in the index, not
|
||||
some arbitrary computation (and so, of course, many of the possible but
|
||||
absent variations are missing).
|
||||
|
||||
# The case and diacritics expansion of +resume+ yields +RESUME Resume
|
||||
Résumé resumé résume résumé resume+
|
||||
|
||||
# The stem expansion input list (lower-cased) is:
|
||||
+resume resumé résume résumé+, and the output is:
|
||||
+resum resume resumenes resumer resumes resumé resumée résum résumait
|
||||
résumant résume résumer résumerai résumerait résumes résumez résumé résumée
|
||||
résumées résumés+
|
||||
|
||||
# Each of the above terms is then fed to case and diacritics expansion (first
|
||||
table), for the final output:
|
||||
+resume résumé Résumé résumer résume Resume résumés RESUME resumes
|
||||
resumer résumant resúmenes resumé résumait résumes résumée resumee
|
||||
résumerait Résumez résumerai RÉSUMÉES Resumée Resumes résumées+.
|
||||
|
||||
A Xapian OR query is finally constructed from the expanded term list.
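
The three steps can be sketched end to end on a tiny, made-up index. Everything here is an assumption chosen for illustration: `toy_stem` is a crude stand-in for a real stemmer, the three dictionaries only mimic the expansion tables, and the final string is a plain textual OR list rather than an actual Xapian query object.

```python
import unicodedata
from collections import defaultdict


def strip_diacritics(s: str) -> str:
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def fold(s: str) -> str:
    return strip_diacritics(s).lower()


def toy_stem(s: str) -> str:
    # Hypothetical stemmer: removes one common suffix, nothing more.
    for suffix in ("er", "es", "e", "é", "s"):
        if s.endswith(suffix) and len(s) > len(suffix) + 2:
            return s[: -len(suffix)]
    return s


# Raw terms assumed to be present in the index.
INDEX_TERMS = ["resume", "RESUME", "résumé", "Résumé", "résumer", "resumes"]

case_diac = defaultdict(set)       # table 1: folded term -> raw terms
stems_raw = defaultdict(set)       # table 2: stem -> lowercased terms
stems_stripped = defaultdict(set)  # table 3: stripped stem -> stripped terms
for term in INDEX_TERMS:
    case_diac[fold(term)].add(term)
    stems_raw[toy_stem(term.lower())].add(term.lower())
    stems_stripped[toy_stem(fold(term))].add(fold(term))


def expand_insensitive(user_term: str) -> list:
    # Step 1: case and diacritics expansion of the user entry.
    step1 = case_diac.get(fold(user_term), {user_term})
    # Step 2: stem expansion of the lowercased terms, using tables 2 and 3.
    step2 = set()
    for t in {x.lower() for x in step1}:
        step2 |= stems_raw.get(toy_stem(t), {t})
        step2 |= stems_stripped.get(toy_stem(fold(t)), set())
    # Step 3: case and diacritics expansion of each stem-expanded term.
    final = set()
    for t in step2:
        final |= case_diac.get(fold(t), {t})
    return sorted(final)


if __name__ == "__main__":
    terms = expand_insensitive("resume")
    print(" OR ".join(terms))
```

As in the real example, only terms actually present in the index show up in the output, because every step is a table lookup and not a computation of possible variants.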
|
||||
|
||||
20
website/faqsandhowtos/makeindex.sh
Normal file
@ -0,0 +1,20 @@
|
||||
#!/bin/sh
|
||||
WIDX=WikiIndex.txt
|
||||
|
||||
echo "== Recoll Wiki file index" > $WIDX
|
||||
for f in *.txt; do
|
||||
if test "$f" = $WIDX ; then continue; fi
|
||||
h="`basename $f .txt`.html"
|
||||
title=`head -1 "$f" | tr -d '\r' | sed -e 's/=//g' -e 's/^ *//' -e 's/ *$//'`
|
||||
echo 'link:'$h'['$title']' >> $WIDX
|
||||
echo >> $WIDX
|
||||
done
|
||||
|
||||
exit 0
|
||||
# Check and display what files are in the index but not in the contents table:
|
||||
|
||||
grep \| FaqsAndHowTos.txt | awk -F\| '{print $1}' | sed -e 's/\* \[\[//' -e 's/.wiki//' |sort > ctfiles.tmp
|
||||
grep '\[\[' WikiIndex.txt | awk -F\| '{print $1}' | sed -e 's/\[\[//' -e 's/.wiki//' -e 's/.md//' | sort > ixfiles.tmp
|
||||
echo 'diff ContentFiles IndexFiles:'
|
||||
diff ctfiles.tmp ixfiles.tmp
|
||||
rm ctfiles.tmp ixfiles.tmp
|
||||