diff --git a/website/faqsandhowtos/ElinksWeb.txt b/website/faqsandhowtos/ElinksWeb.txt new file mode 100644 index 00000000..497bc139 --- /dev/null +++ b/website/faqsandhowtos/ElinksWeb.txt @@ -0,0 +1,35 @@ +== Extending the Recoll Firefox visited web page indexing mechanism to other browsers + +The *Recoll* _Web Queue_ function allows using WEB browser plug-ins +originally designed for indexing visited WEB pages with *Beagle* (rip). The +browser plug-ins work very simply by creating copies of the visited pages +in a designated directory. Two files are created for each page, one for the +contents, the other for the metadata. + +When activated, *Recoll* will visit the queue directory and index each HTML +page and its associated metadata. There is more detail about the mechanism +on the link:IndexWebHistory.html[page about the Recoll Web queue], but mostly, you +just need to go to the _Indexing Preferences_ in the *recoll* GUI, open the +_Web history_ panel and check the top button. + +Franck, a *Recoll* and *Elinks* user from New Zealand, designed a method +and wrote a script to index the *Elinks* WEB history in this fashion. + +The script works by using *wget* to fetch the visited page into the queue +directory. This means that it could be reused to index arbitrary WEB +pages in contexts other than *Elinks* visits. + +Recipe for *Elinks* and Recoll 1.18 and later: + +* Retrieve the + link:https://www.recoll.org/files/elinks_recoll.sh[elinks_recoll.sh] shell + script and make it executable (`chmod a+x elinks_recoll.sh`). +* In the Elinks Keyboard shortcut manager (k)/Main, add a shortcut to pass + the current URL to an external command, e.g. _Ctrl-P_. 
+* In the Options manager (o) /Document/Uri Passing, add an action named for + example _ToIndex_ +* Modify the ToIndex action to execute `/path/to/the/script/elinks_recoll.sh %c` +* Save, and you are done + +For Recoll 1.17, the method is analogous, but the script is named +link:https://www.recoll.org/files/elinks_recoll.sh[elinks_beagle.sh]. diff --git a/website/faqsandhowtos/FaqsAndHowTos.txt b/website/faqsandhowtos/FaqsAndHowTos.txt new file mode 100644 index 00000000..6c0feed7 --- /dev/null +++ b/website/faqsandhowtos/FaqsAndHowTos.txt @@ -0,0 +1,37 @@ +== Faqs and Howtos + +=== Indexing +* link:WhyIsMyFileNotIndexed.html[Why is this file not indexed ? Investigating indexing issues] +* link:PreventIndexingDir.html[Preventing the indexing of a directory] +* link:IndexOnAc.html[Starting/stopping the indexer depending on power/battery status] +* link:IndexMozillaCalendari.html[Indexing Mozilla Sunbird / Lightning calendar data] +* link:MultipleIndexes.html[Creating and using multiple indexes] +* link:IndexWebHistory.html[Indexing Web history with the Firefox browser extension] +* link:ElinksWeb.html[Extending the Web queue mechanism to other browsers and general WEB indexing] +* link:IndexMailHeader.html[Indexing arbitrary mail headers] +* link:IndexOutlook.html[Indexing Outlook archives] +* link:HandleCustomField.html[Generating a custom field and using it to sort results] +* link:http://www.recoll.org/recoll_XMP/index.html.html[An example of filter/field customisation, using XMP metadata with PDFs] +* link:FilteringOutZipArchiveMembers.html[Filtering out Zip archive members] + +=== Searching +* link:GUIKeyboard.html[Recoll GUI keyboard navigation] +* link:HotRecoll.html[On the desktop: using a keyboard shortcut for starting/hiding recoll] +* link:OpenHelperScript.html[Handling issues for starting native apps, esp. 
email clients - getting Thunderbird to open message files] +* link:QpdfviewHelperScript.html[Another example open helper script - using qpdfview to open pdf and postscript files, with support for page and search options] +* link:UsingOpenWith.html[Using the new Open With menu in recoll 1.20 with a custom + app] +* link:ReplaceCategories.html[Replacing the document category filters] +* link:ResultsThumbnails.html[Result list thumbnails and how to create them] +* link:MuttAndRecoll.html[Interfacing Recoll and Mutt] +* link:QueryFromC.html[Querying from a C program] + +=== Administration and miscellaneous +* link:http://www.recoll.org/pages/recoll-webui-install-wsgi.html.html[Installation of the Recoll WebUI with Apache] +* link:FilterRetrofit.wiki.html[Installing a filter for a new document type] +* link:UnityLens.html[Building and Installing the Ubuntu Unity Recoll Lens] +* link:SavingConfig.wiki.html[Recoll configuration backup] +* link:XDGBase.wiki.html[Tidying Recoll data storage] +* link:ProblemSolvingData.html[Collecting diagnostic information] +* link:NonAsciiFileNames.html[Unix and non-ascii file names] +* link:FilterArch.html[Recoll filters] diff --git a/website/faqsandhowtos/FilterArch.txt b/website/faqsandhowtos/FilterArch.txt new file mode 100644 index 00000000..456cb37c --- /dev/null +++ b/website/faqsandhowtos/FilterArch.txt @@ -0,0 +1,82 @@ +== Recoll input handlers + +In the end, Recoll indexes plain UTF-8 text, remembering where it came +from. + +But of course, this is not what the source data looks like. +The text content of the original documents is encoded in many fashions +(e.g. pdf, ms-word, html, etc.), and it can also be stored in quite +involved ways (inside archives, email attachments ...). 
+ +For getting to the data and converting it to plain text, Recoll uses a set +of modules which it calls input handlers (or filters), which either operate +on the storage structure (e.g. a zip handler), or the storage format (e.g. a +pdf to text translator), or both. In addition, there is a tentative notion +of a higher level storage backend which we will ignore for now (for +reference there are currently two of those: the file system and the web +history cache). + +The basic task of filters is to take a document as input and produce a +series of subdocuments as output. The subdocument's format is defined +either dynamically (as part of the output data), or statically, in the +filter definition. + +=== Simple filters + +These are executed by the **mh_exec** recoll module. They are the vast +majority. + +These filters are very simple. They are designed to perform a simple task +with a minimal interface, they mostly don't know anything about each other, +and they don't know much about their context. This makes writing a filter +quite easy as there is not much to learn about their environment. + +Only one output document is produced and the format is fixed. + +In practice the filter, which is most often a shell script (but could +be any executable program), takes a file name on the command line and +outputs an html or plain text document on standard output, then exits. + +For example, the pdf filter takes one pdf file name as input on the command +line and produces one html document on stdout. The fact that the output is +html is statically defined in a configuration file. + +For filters which produce plain text, the output character set information +is in general defined in the configuration file. Otherwise it will be obtained +from the locale (hoping that it makes sense). + +Filters that output html can produce metadata information in the html +header (e.g. author). Filters that output plain text can only output +main text data, no metadata fields. 
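As a rough illustration of the description above, here is a minimal sketch of what a simple filter could look like. It is not one of the handlers shipped with Recoll: the `sketch_filter` and `html_escape` names are made up for this example, and it assumes that the input is UTF-8 plain text which only needs to be wrapped into html. It takes one file name on the command line and writes a single html document, with a title in the header, to standard output.

```shell
#!/bin/sh
# Hypothetical minimal "simple filter" sketch (not a real Recoll handler).
# Assumption: the input file is UTF-8 plain text.

# Escape the characters which are special in HTML (the ampersand first).
html_escape() {
    sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}

# Convert one input file into a single HTML document on standard output.
sketch_filter() {
    infile="$1"
    if [ ! -r "$infile" ]; then
        echo "sketch_filter: cannot read $infile" >&2
        return 1
    fi
    # Metadata goes into the HTML header: here, just a title.
    title=$(basename "$infile" | html_escape)
    echo '<html><head>'
    printf '<title>%s</title>\n' "$title"
    echo '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'
    echo '</head><body><pre>'
    # The main text is the escaped file content.
    html_escape < "$infile"
    echo '</pre></body></html>'
}

# Behave as a filter only when a file name is given on the command line.
if [ $# -ge 1 ]; then
    sketch_filter "$1"
fi
```

Saved under a hypothetical name such as 'rclsketch', it could be tried by hand with `sh rclsketch /path/to/some/file`, which should print one html document and then exit.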
+ +Besides the file name, there is one other piece of input information, which +is in the form of an environment variable, and can be safely ignored: ++RECOLL_FILTER_FORPREVIEW+. This indicates whether the filter is being used +for previewing or for indexing data. Some filters will elect to suppress +repetitive parts of the output text when indexing to avoid distorting the +term statistics. For example, the man filter suppresses the section +headers (NAME, SYNOPSIS...) when indexing. + +=== Multiple input filters + +These filters are more complex, but still quite easy to write, especially +if you can use Python, because they can then use a common module which +manages the communication with the indexer. + +Newer Recoll versions have converted many previously 'simple' filters to +this kind as part of the port to Windows. + +These filters are executed by the *mh_execm* Recoll module. + +They are persistent (one instance will persist through a whole indexing +pass), and will index multiple successive input files (the point being to +avoid the startup performance penalty), and possibly multiple documents per +input file if this makes sense for their input format (e.g. zip archive, chm +help file). + +They use a simple communication protocol over a pipe with the main recoll +or recollindex process, with file names and a few other parameters being +sent as input, and decoded data and attributes being sent in return. + +The shared Python module is 'filters/rclexecm.py'. You can look at 'rclzip' +or 'rclaudio' for reasonably straightforward examples. diff --git a/website/faqsandhowtos/FilterRetrofit.txt b/website/faqsandhowtos/FilterRetrofit.txt new file mode 100644 index 00000000..9db4203e --- /dev/null +++ b/website/faqsandhowtos/FilterRetrofit.txt @@ -0,0 +1,62 @@ +== Installing a filter for a new document type + +It will sometimes happen that a newer Recoll release has support for a +document type which would be useful to you, but which your older release +does not support. 
+ +It is in general easy to import support from the newer to the older +release: the Recoll input handler interface is very stable, so things should just +work. + +Input Handler updates are generally described on the Recoll web site +link:https://www.recoll.org/filters/filters.html[new filters pages]. They +may include notes about which versions need the new input handler, or specifics +about installing it. + +An up to date copy of input handlers and configuration files is also kept +link:https://www.recoll.org/filters/[at the same location]. + +We will take an example to make things more concrete: Tomboy and Gnote +files are directly supported by Recoll 1.19, but not in older Recoll +releases. The *rclxml* handler is needed to process them. + +The following procedure will allow you to retrofit support: + +- Retrieve the *rclxml* input handler from: + link:https://www.lesbonscomptes.com/recoll/filters/rclxml[] + +- Copy it to '/usr/share/recoll/filters' and make it executable: + `chmod +x rclxml` + The input handler needs *xsltproc*, but this is probably already on your + system (else get it with the package manager). + +- Edit '~/.recoll/mimemap', add the following line: + `.note = application/x-gnote` +- Edit '~/.recoll/mimeconf', add the following lines: ++ +---- +[index] +application/x-gnote = exec rclxml +---- +- Edit '~/.recoll/mimeview', add the following lines: ++ +---- +[view] +application/x-gnote = tomboy %f +---- + +- The easiest way to make sure the files are indexed with the new input + handlers may then be to just run a full indexing pass (`recollindex -z`). + +Notes: + +- The MIME type which is used is not crucial, you could prefer to use, + e.g., +application/x-tomboy+ instead, it just has to be consistent. To + avoid future trouble, it's better to use the type used by newer Recoll + releases though. +- The 'mimeview' entry is necessary even if you are using the desktop + preferences to open files. 
The value will not be used, but it has to be + there. + + + diff --git a/website/faqsandhowtos/FilteringOutZipArchiveMembers.txt b/website/faqsandhowtos/FilteringOutZipArchiveMembers.txt new file mode 100644 index 00000000..dc1b2af9 --- /dev/null +++ b/website/faqsandhowtos/FilteringOutZipArchiveMembers.txt @@ -0,0 +1,34 @@ +== Filtering out Zip archive members == + +The *rclzip* Zip archive extraction input handler does not use the general +configuration variables which define what file system objects should be +skipped, but it has an equivalent internal function. + +The name-skipping code depends on a recent member of the Recoll Python +package. This will become standard for release 1.20, but for earlier +releases, you need to do two things to use this function: + +- Fetch 'python/recoll/recoll/rclconfig.py' and 'filters/rclzip' from the + source repository. +- Copy both to '/usr/share/recoll/filters' and make 'rclzip' executable. + +You can then set a variable named +zipSkippedNames+ inside +'recoll.conf'. +zipSkippedNames+ should be a space-separated list of +patterns which will be passed to the Python fnmatch() function. The +/+ +characters are not special (they are matched like any other character). + +You can't use embedded spaces in patterns (no double-quote quoting for now). + +This can be redefined for file system directories using the usual section +indicators (Zip archives in different file-system directories can have +different skip lists). + +Example: + +---- +zipSkippedNames = *.txt +[/path/to/the/dir] +zipSkippedNames = somedir/*/*.html +---- + + diff --git a/website/faqsandhowtos/GUIKeyboard.txt b/website/faqsandhowtos/GUIKeyboard.txt new file mode 100644 index 00000000..c54e1844 --- /dev/null +++ b/website/faqsandhowtos/GUIKeyboard.txt @@ -0,0 +1,60 @@ +== Recoll GUI keyboard navigation + +Using Recoll without the mouse is not completely straightforward, but it is +mostly feasible. Here follows a description of the usable shortcuts. 
+ +=== Anywhere + +`Ctrl+q` should exit Recoll from anywhere. + +=== Main window and result list === + +When Recoll starts up, the focus is in the simple search entry. The main +window tab order is as follows: + +* Clear +* Search +* Search type combo +* Search entry (Initial focus) +* Result list (scrolling etc) +* Result list 1st link +* Result list next links... +* Back to Clear + +Each result list entry has 3 links: the icon link is not active, but its +value is the URL, so that it can be dragged and dropped to another +application. The 2 other links are _Preview_ and _Open_ and can be +activated by typing _Enter_. + +Typing _Ctrl+Shift+s_ anywhere in the main window should return the focus to the search entry. So will _Ctrl+l_ in future versions (for compatibility with WEB browser usage). + +For pure keyboard usage, you can improve this by: + +- Disabling the icon link: use _Preferences->GUI configuration->Result + List->Edit result paragraph_ and remove the `` and `` + around the `` tag. +- Making the active link more visible by adding the following code to the + result page HTML header insert (same preferences tab). Feel free to + adjust the color :=) : + +---- + +---- + +=== Result table + +The same _Ctrl+Shift+s_ will return the focus to the search entry when +working with the result table. + +_Ctrl+r_ will move the focus from the entry to the spreadsheet. When in +there the arrow keys will navigate the lines. + +When a line is selected: + +* _Ctrl+o_ will _Open_ the document. +* _Ctrl+Shift+o_ will _Open_ the document and exit Recoll. +* _Ctrl+d_ (detail) will start a _Preview_ + +_Esc_ will deselect the current line so that mouse hovering will work again. 
diff --git a/website/faqsandhowtos/HandleCustomField.txt b/website/faqsandhowtos/HandleCustomField.txt new file mode 100644 index 00000000..bc107800 --- /dev/null +++ b/website/faqsandhowtos/HandleCustomField.txt @@ -0,0 +1,69 @@ +== Generating a custom field and using it to sort results + +We are going to show how to generate a custom field from a Recoll filter, +and use it for sorting results. The example chosen comes from an actual +user request: sorting results on pdf page counts. + +The details here are obsolete, as the +pdf+ input handler is now a quite +different python program, but the general idea is still relevant. + +The page count from a pdf file can be displayed by the pdfinfo command +(xpdf or poppler tools). + +We first modify a copy of the rclpdf filter +('/usr/[local/]share/recoll/filters/rclpdf'), to compute the pdf page count, +and output the value as an html meta field. This is a not very interesting +bit of shell/awk magic. Another approach would be to just rewrite the +rclpdf filter in your favorite scripting language (ie: perl, python...), as +all it does is execute pdftotext and pdfinfo and output html, nothing +complicated. Here follows the rclpdf modification as a pseudo patch: + +---- +# compute the page count and format it so that it's alphabetically sortable ++set `pdfinfo "$infile" | egrep ^Pages:` ++pages=`printf "%04d" $2` +[skip...] +# Pass the page count value to awk +-awk 'BEGIN'\ ++awk -v Pages="$pages" 'BEGIN'\ +[skip...] +# Inside the awk program startup section: compute the "meta" field line ++ pagemeta = "\n" +[skip...] +# Then print it as part of the header: ++ $0 = part1 charsetmeta pagemeta part2 +[skip...] +---- + +You can execute your own version of rclpdf by modifying '~/.recoll/mimeconf': + +---- +[index] +application/pdf = exec /path/to/my/own/rclpdf +---- + +At this point, recollindex would receive and extract a +pdfpages+ field, +but it would not know what to do with it. 
We are going to tell it to store +the value inside the document data record so that it can be displayed in +the results, and sorted on. For this we modify the '~/.recoll/fields' file: + +---- +[stored] +pdfpages= +---- + +That's it ! After reindexing, you can now display +pdfpages+ inside the +result list (add a +%(pdfpages)+ value to the paragraph format), and display ++pdfpages+ inside the result table (right-click the table header), and sort +the results on page count (click the column header). + +Note that +pdfpages+ has not been defined as searchable (this would not make +much sense). To make it searchable, you would have to define a prefix and add it to the +[prefixes] fields file section: + +---- +[prefixes] +pdfpages = XYPDFP +---- + +Have a look at the comments inside the 'fields' file for more information. diff --git a/website/faqsandhowtos/Home.txt b/website/faqsandhowtos/Home.txt new file mode 100644 index 00000000..46020746 --- /dev/null +++ b/website/faqsandhowtos/Home.txt @@ -0,0 +1,13 @@ +== Welcome to the Recoll Faqs and Recipes + +link:FaqsAndHowTos.html[FAQs and Howtos] are stored here, but +the main source for Recoll user documentation is +link:https://www.recoll.org/doc.html[the _Recoll user manual_] on the +link:https://www.recoll.org/[Recoll Web site] where you will also find a +lot of other Recoll information, source code tarballs and contact +information. + +If you want to make your problem report as useful as possible, you may want +to take a look at link:ProblemSolvingData.html[this page]. + +link:WikiIndex.html[Full file index] diff --git a/website/faqsandhowtos/HotRecoll.txt b/website/faqsandhowtos/HotRecoll.txt new file mode 100644 index 00000000..f0acc6c7 --- /dev/null +++ b/website/faqsandhowtos/HotRecoll.txt @@ -0,0 +1,79 @@ +== Recoll hotkey: starting / hiding recoll with a keyboard shortcut + +Type a key (e.g. F12) and have recoll appear or disappear. On the first +occurrence, recoll is started if it's not already running. 
Further +occurrences toggle recoll between visible and minimized states. Never +thought this would be useful until someone asked for it. Can't do without +it anymore :) + +This works well with both Gnome and KDE, but is implemented using a gnome +library (*libwnck*) and its python interface, which you may have to install +on your system if you are a pure KDE user. The library most probably exists +in the package repositories for your distribution, so this should not be +too complicated. + +This should also work with other window managers, because it is based on a +standard window manager interface extension (EWMH) that most modern window +managers implement. + +=== Installing the script (all desktops): + +- You will need the libwnck library and its python interface. These are + usually part of a gnome installation, otherwise check and possibly + install them. For OpenSuse, the library should already be there but you + need to install gnome-python-desktop. +- Download the + link:https://www.recoll.org/files/hotrecoll.py[http://www.recoll.org/files/hotrecoll.py + script]. If you have a recent recoll installation (1.14.3 and + further), it's already in the recoll filters directory + ('/usr/[local/]share/recoll/filters') +- Copy the script to some permanent place (ie: '~/bin') and make it + executable (you can leave it in the filters dirs if it's there). In a + shell window: `chmod +x hotrecoll.py`. +- You can check that the script works (or not) by executing it on the + command line. It does not need an argument. Recoll should appear or + disappear every time you execute the script. A few warning messages may + be considered normal. If the script says that it does not find the wnck + library or some other module, you'll have to install them. 
+ +=== Installing the keyboard shortcut (Gnome): + +- _System->Preferences->Keyboard shortcuts_, or execute + *gnome-keybinding-properties* +- Click add, then set Name (e.g. StartRecoll) and Action (/path/to/hotrecoll.py) +- This will add the shortcut to the "Custom shortcuts" section. You can + then click in the "Shortcut" column for "StartRecoll", and type any key + combination (e.g. push F12) to assign a key shortcut. + +=== Installing the keyboard shortcut (KDE): + +Under KDE, the setting for a global custom keyboard shortcut like the one we need is, most +helpfully, not under "Keyboard Shortcuts" but under "Input Actions". + +- _Kmenu -> Configure Desktop -> Input Actions -> Edit -> New -> Global + Shortcut -> Command/Url_ +- A new Action appears, named _New Action_. You can rename it to something + like +hotrecoll+ for clarity. +- Click the _Trigger_ tab, click the input area and press your preferred + key combination (e.g. F12) +- Click the _Action_ tab, and enter +hotrecoll.py+ (if it's in your PATH), + or else the full path to the command (e.g.: + '/usr/share/recoll/filters/hotrecoll.py'). +- Click _Apply_. + +=== Installing the keyboard shortcut (XFCE): + +Open the settings manager, and add the shortcut in the +_Application Shortcuts_ panel inside the _Keyboard_ tool. + + +=== Other environments + +Many window managers have a way to set up a keyboard shortcut for running +an arbitrary command. You'll need to look at the documentation for yours, +or search the web for a solution. + +An alternative independent of the environment would be to use the XBindKeys +utility. See this link:http://www.linux.com/archive/feed/59494[linux.com +article] for helpful instructions. 
+ diff --git a/website/faqsandhowtos/IndexMailHeader.txt b/website/faqsandhowtos/IndexMailHeader.txt new file mode 100644 index 00000000..3e1627be --- /dev/null +++ b/website/faqsandhowtos/IndexMailHeader.txt @@ -0,0 +1,33 @@ +== Indexing arbitrary mail headers + +By default the Recoll mail handler only processes a subset of email headers +(+From+, +To+, +Cc+, +Date+, +Subject+). It is possible to index additional +headers by specifying them inside the 'fields' configuration file, inside +the configuration directory (typically '~/.recoll/'). + +Lengthy explanations are not really needed here, and I'll just show an +example (duplicated from the configuration section of the manual): + +---- +[prefixes] +# Index mailmytag contents (with the given prefix) +mailmytag = XMTAG + +[stored] +# Store mailmytag inside the document data record (so that it can be +# displayed - as %(mailmytag) - in result lists). +mailmytag = + +[mail] +# Extract the X-My-Tag mail header, and use it internally with the +# mailmytag field name +x-my-tag = mailmytag + +---- + +Limitations: + +- The mail filter will only process the first instance for a header + occurring several times. +- No decoding will take place (ie for non-ascii headers which would have + some kind of encoding). diff --git a/website/faqsandhowtos/IndexMozillaCalendari.txt b/website/faqsandhowtos/IndexMozillaCalendari.txt new file mode 100644 index 00000000..2b6ba901 --- /dev/null +++ b/website/faqsandhowtos/IndexMozillaCalendari.txt @@ -0,0 +1,32 @@ +== Indexing Mozilla calendar data + +Mozilla calendar programs (*Sunbird*, *Lightning*) do not store their +data in +ics+ files natively. They use an *SQLite* database (the +'storage.sdb' file inside the profile). This means that calendar data +cannot be indexed directly. + +To get Recoll to index calendar data, you need to export it to an +ics+ +file. 
This can be done manually, from the application menus, or, by +installing the +link:https://addons.mozilla.org/en-US/sunbird/addon/3740[Automatic Export +extension]. + +The extension can be configured to export the data when exiting the +program, or at regular time intervals. You can even set up a command to be +executed after the export. If you are not using real time indexing, this +can usefully be *recollindex*. + +In _Tools->Add Ons->Automatic Export preferences_, in the _Start an +application after export_ subpanel, set _Path of application_ to +'/usr/[local/]bin/recollindex' and _Parameters of application_ to +something like _-i;/home/me/path/to/nameofexportedcal.ics_ + +This will ensure that the calendar is indexed every time it is exported +(this is not necessary though, you can let the next batch indexing pass +take care of it). + +It may happen that the exported data has some syntax errors which will +prevent indexing with the *rclics* filter which was distributed up to +Recoll 1.13.04 (included). You may get an updated filter from the +link:https://www.recoll.org/download.html[Recoll download page]. + diff --git a/website/faqsandhowtos/IndexOnAc.txt b/website/faqsandhowtos/IndexOnAc.txt new file mode 100644 index 00000000..850556bd --- /dev/null +++ b/website/faqsandhowtos/IndexOnAc.txt @@ -0,0 +1,24 @@ +== Laptops: starting or stopping indexing according to AC power status + +For people using real time indexing on a laptop, kind user "The Doctor" +contributed a script to automatically start and stop indexing according to +power status. 
The script can be found here: +link:https://bitbucket.org/medoc/recoll/src/tip/src/desktop/recoll_index_on_ac.sh[recoll_index_on_ac.sh] + +To use it, you need to copy it somewhere (e.g.: '/usr/bin', but any place +will do), make it executable (`chmod a+x recoll_index_on_ac.sh`), and edit +'~/.config/autostart/recollindex.desktop' + +Change the following line: + + Exec=recollindex -w 60 -m + +to something like the following (depending where you copied the script): + + Exec=/usr/bin/recoll_index_on_ac.sh + +You may also want to change +'/usr/share/recoll/examples/recollindex.desktop', otherwise your change +will be reverted the next time you toggle real time indexing through the +GUI. And, yes, sorry about it, _this_ change will be lost on the next +Recoll update, so save a copy. diff --git a/website/faqsandhowtos/IndexOutlook.txt b/website/faqsandhowtos/IndexOutlook.txt new file mode 100644 index 00000000..12f48cd0 --- /dev/null +++ b/website/faqsandhowtos/IndexOutlook.txt @@ -0,0 +1,11 @@ +== Indexing Outlook archives == + +Recoll has no direct support for indexing Microsoft Outlook data, because, +if you are a Windows user, you probably are not a good customer for Linux +desktop indexing... + +However, if you have a need to index Outlook data at some point, I can +recommend the excellent link:http://www.five-ten-sg.com/libpst/[libpst] +library and its link:http://www.five-ten-sg.com/libpst/rn01re01.html[readpst] +utility. Using this you can very easily convert the Outlook data into MH or +mbox format, and then index the result with Recoll. diff --git a/website/faqsandhowtos/IndexWebHistory.txt b/website/faqsandhowtos/IndexWebHistory.txt new file mode 100644 index 00000000..5f7364b5 --- /dev/null +++ b/website/faqsandhowtos/IndexWebHistory.txt @@ -0,0 +1,29 @@ +== Indexing Web history with the Firefox extension == + +Note: this document is valid for Recoll versions from 1.18. 
+ +The link:http://sourceforge.net/projects/recollfirefox/[Recoll Firefox +extension] +works together with Recoll to index the Web pages that you visit. The +extension is based on an older one which was initially written for the +Beagle indexer. + +The extension works by copying the data for the visited pages to a queue +directory ('~/.recollweb/ToIndex' by default), from which they are +indexed and removed by Recoll, and then stored in a local cache. + +The extension is now hosted on the Mozilla add-ons site, so you can install +it very simply in Firefox: link:https://addons.mozilla.org/fr/firefox/addon/recoll-indexer-1/[Recoll Firefox add-on page]. + +This feature can be enabled in the Recoll GUI index configuration panel +(Web history section), or by editing the configuration file (set ++processwebqueue+ to 1). + +Please remember that Recoll only stores a limited amount of cached web data +(adjustable from the GUI Index Configuration section), and that old pages +will be purged from the index. Pages that you want to archive permanently +need to be saved elsewhere, as they will otherwise eventually disappear +from the Recoll results. + +Recoll will index +.maff+ files, which may be a better choice for archival +usage. diff --git a/website/faqsandhowtos/Makefile b/website/faqsandhowtos/Makefile new file mode 100644 index 00000000..4c5d8aa0 --- /dev/null +++ b/website/faqsandhowtos/Makefile @@ -0,0 +1,9 @@ +.SUFFIXES: .txt .html + +.txt.html: + asciidoc $< + +all: $(addsuffix .html,$(basename $(wildcard *.txt))) + +clean: + rm *.html diff --git a/website/faqsandhowtos/MultipleIndexes.txt b/website/faqsandhowtos/MultipleIndexes.txt new file mode 100644 index 00000000..9b149003 --- /dev/null +++ b/website/faqsandhowtos/MultipleIndexes.txt @@ -0,0 +1,96 @@ +== Creating and using multiple indexes + +=== Why would you want to do this ? 
+ +- Easy adjustment of search areas: you can filter results by using the + directory filter in the advanced search panel, but, if you have + separate well defined places where you store different kinds of data, + it is easier to maintain separate indexes and use the External indexes + dialog to switch them on or off, and it will also yield much better + search performance. +- Shared indexes: it may be useful to maintain one or several indexes + for shared data, and separate personal indexes for each user. Indexes + can be shared over the network. +- Creating separate indexes for removable volumes. + +=== How to do it + +As an example we'll suppose that you have Recoll installed and indexing +your home directory, and that you would like to have a separate index for +/usr/share/doc. + +You need to create a separate configuration for the new index, then add it +to the external indexes list in the user interface, and activate it as +needed. + +. Create a directory for the new index, and create an empty configuration + file ++ +---- +cd +mkdir .recoll-sharedoc +touch .recoll-sharedoc/recoll.conf +---- +. Either edit the new configuration by hand or start recoll to use the GUI + configuration editor. ++ +---- +cd ~/.recoll-sharedoc +echo "topdirs = /usr/share/doc" > recoll.conf +# OR +recoll -c ~/.recoll-sharedoc +---- ++ +If using the GUI, click _Cancel_ when asked, to start the configuration +editor. + +. Perform initial indexing. If you chose the GUI route, indexing will + start as soon as you leave the configuration editor. Otherwise, on the + command line: ++ +---- +recollindex -c ~/.recoll-sharedoc +---- +. Optionally set up *cron* to perform nightly indexing: use +crontab -e+ + and insert a line like the following: ++ +---- +45 20 * * * recollindex -c ~/.recoll-sharedoc +---- ++ +This would start the indexing at 20:45. `crontab -e` will use the *vi* +editor by default; you can change this by using the EDITOR +environment variable. 
Example: `EDITOR=kate crontab -e` +Your favorite desktop may also have a dedicated tool to add crontab entries. + +. Start recoll and choose the _Preferences->External index dialog_ menu + entry, then click the Browse button (near the bottom), and select the + new index Xapian database directory '~/.recoll-sharedoc/xapiandb'. + Then click _Add index_. + +. You can then activate or deactivate the new index by clicking the box + in front of the directory name in the list. + +When adding an index shared by multiple users, it may be helpful to use the +RECOLL_EXTRA_DBS environment variable instead of editing individual +configurations, see the manual for more details. + +=== Path adjustments + +When sharing indexes over a network, in most cases, the indexed data will +be accessible through different paths on the different hosts. This will +prevent the Preview and Open functions from working because the paths they get +from the index do not match the ones which are usable from the local +host. + +For example my home directory is accessed as '/home/me' on my home +machine, and as '/net/myhost/home/me' on other hosts. By default, trying +to access a result from a remote host would use the first path, when the +second is the one that would work. + +As of release 1.19 **Recoll** has a facility to perform index-dependent +path translations. This facility is accessible from the _external index +dialog_ in the GUI preferences. Path translations can be set for the main +index if no index is selected (rarely useful), or for the selected +additional index. + diff --git a/website/faqsandhowtos/MuttAndRecoll.txt b/website/faqsandhowtos/MuttAndRecoll.txt new file mode 100644 index 00000000..cc6cf681 --- /dev/null +++ b/website/faqsandhowtos/MuttAndRecoll.txt @@ -0,0 +1,77 @@ +== Interfacing Recoll and Mutt + +It is possible to either use Mutt as a Recoll search result viewer, or +start Recoll from the Mutt search. 
+ +=== Starting Mutt to view Recoll search results + +This method and the associated +link:http://www.recoll.org/files/recoll2mutt[recoll2mutt script] were kindly +contributed by Morten Langlo. + +This allows finding mail messages in recoll and then calling *mutt* +or *mutt-kz* to read or process the mail. + +Installation: + +- Copy the [[http://www.recoll.org/files/recoll2mutt|recoll2mutt script]] + somewhere in your PATH, and make it executable. +- In the **recoll** GUI menus: +_Preferences->GUI configuration->User interface->Choose editor applications_ +change the entry for "message/rfc822" to: +recoll2mutt %f+ + +The script has options for setting a number of parameters, you may not need +to set any of them, the defaults are: + +- -c mutt +- -F .muttrc +- -m Mail +- -x "-fn 10*20 -geometry 115x40" + +Example: + +---- +recoll2mutt -c mutt-kz -F .mutt_kzrc -m Mail -x "-fn 10*20 -geometry 115x40" %f +---- + +The option +-x+ is passed to *xterm*, which is used to call *mutt* or +*mutt-kz*. + +The script works for both _mbox_ and _maildir_ mail boxes, and it +expects the configuration file for mutt and the mail directory to reside in +your $HOME and the spool file to be '/var/spool/mail/$USER' if it is +not in your mail directory. But it is easy to change the values in the +script if you need to. + +*mutt* is opened with the right mailbox and limit set to _Date_ and +_Sender_. In theory you could set limit to _Message-Id_, but very often +*mutt* reports, that there are invalid patterns in _Message-Id_, so do it +safe, even though all emails in the opened mail box with the same date from +the sender are shown. + + +=== Starting Recoll from the Mutt search + +This will work only when using maildir storage (messages in individual +files). It will not work with mailbox files. The latter would probably be +possible by extracting the individual result messages using the Python +interface, but I did not try. 
+ +The classic way to interface Mutt and a search application is to create a +shortcut to an external command which creates a temporary Maildir +containing the search results. + +There is such a script for Recoll; you will find it link:https://bitbucket.org/medoc/recoll/raw/41d41799dbac4c69a34db985b3ab9f1597c9c742/src/python/samples/mutt-recoll.py[here]. + +Copy the script somewhere in your PATH, and make it executable, then add +the following line to your '.muttrc': + + +---- + +macro index S "<enter-command>unset wait_key<enter><shell-escape>mutt-recoll.py -G<enter><change-folder-readonly>~/.cache/mutt_results<enter>" \ + "search mail (using recoll)" + +---- + +Obviously, you can replace the 'S' letter with whatever will suit you (e.g.: /). diff --git a/website/faqsandhowtos/NonAsciiFileNames.txt b/website/faqsandhowtos/NonAsciiFileNames.txt new file mode 100644 index 00000000..e60cb256 --- /dev/null +++ b/website/faqsandhowtos/NonAsciiFileNames.txt @@ -0,0 +1,85 @@ +== Unix and non-ASCII file names, a summary of issues + +Unix/Linux file and directory names are binary byte C strings. Only the +null byte and the slash character (/) are forbidden inside a name, +nowhere does the kernel interpret the strings as meaningful or +printable. + +In the old times, all utilities that would display to the user were +ASCII-based, and people would use pure printable ASCII file names (even +using space characters inside names was a cause for trouble). Non +alphanumeric characters were exclusively used for playing tricks on +colleagues. And all was well. + +Then the devil came under the guise of accented 8 bit characters. The +system has no problem with them, file names are still binary C strings, but +the utilities have to display them or take them as input, and, because +there is no encoding specification stored with the file names, they can +only do this according to the character encoding taken from the user's +current locale. 
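To make the point about byte strings and locales more concrete, here is a minimal shell sketch. The function names `name_bytes` and `latin1_bytes` are made up for this illustration, and the second one assumes that the `iconv` command is installed.

```shell
#!/bin/sh
# Hypothetical helpers, for illustration only.

# Print the bytes of a string in hexadecimal, i.e. what the kernel
# actually stores as the file name.
name_bytes() {
    printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
    echo
}

# Same, after converting the string from UTF-8 to ISO-8859-1
# (assumes that iconv is available).
latin1_bytes() {
    printf '%s' "$1" | iconv -f UTF-8 -t ISO-8859-1 | od -An -tx1 | tr -d ' \n'
    echo
}
```

In a UTF-8 terminal, `name_bytes é` should print `c3a9`, and `latin1_bytes é` should print `e9`: the same visible name, but two different file names as far as the kernel is concerned.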
+ +For example fr_FR.UTF-8, and fr_FR.ISO8859-1 could be used simultaneously +on the same system (by different users), but they are completely +incompatible: ISO-8859-1 strings are illegal when viewed in a UTF-8 locale +(will display as question marks or some other conventional error +marker). UTF-8 strings will display as gibberish in an ISO-8859-1 locale. + +This means that the file names created by a UTF-8 user are displayed as +garbage to the ISO-8859 one... + +If you ever change your locale, your old files are still there and named +the same (in the binary sense), but the names display badly and you have +great trouble inputting them. If you add distributed (NFS) file system +issues, things become totally unmanageable. Also think about archives sent +from another system with a different encoding. + +As far as Recoll is concerned: + +- The file names inside recoll.conf are not transcoded, they are taken as + binary strings (mostly, only +\n+ and +space+ are a bit special), and + passed as is to the system. So if you edit 'recoll.conf' with a text + editor, inside the same locale that is or has been used for file names, + you'll be fine. +- There was a bug in the GUI configuration tool up to 1.12: it should + have transcoded between the internal Qt format and locale-dependent strings, + but it did not, or did it badly. +- There is also an exception for the +unac_except_trans+ variable, this + *has* to be UTF-8, so if the rest of the file uses another encoding, + you'll need to edit two separate files and concatenate them. + +As of version 1.13, Recoll uses local8Bit()/fromLocal8Bit() to convert +recoll.conf file names from/to QStrings (it uses UTF-8 for all string +values which are not file names). + +The Qt file dialog is broken (at least was, I have not checked this on +recent versions). It should consider file paths as almost-binary data, not +QStrings, but doesn't. 
In consequence, things are even more broken than +necessary as seen from there: + +With LANG="C", non-ASCII paths can't be used at all: + +- Strings read from recoll.conf are stripped of 8bit characters before display. +- Directory entries with 8bit characters are not displayed at all in the + selection dialog. + +With LANG="fr_FR.UTF-8", only UTF-8 paths can be used: + +- Strings read from recoll.conf are damaged when converted to QString + (except those that were actually UTF-8) +- Only the UTF-8 directory entries are displayed in the selection dialog. + + +With LANG="fr_FR.iso8859-1", everything works ok. + +- Strings read from recoll.conf are displayed with weird characters if + they use another encoding such as UTF-8, but are correctly maintained + and can be read back from the dialogs and rewritten without damage. +- Directory entries with 8 bit characters are displayed weirdly (normal), + but can be manipulated without trouble (this includes utf-8 names of + course). + +In conclusion, only the iso-8859 locales can be used for handling mixed +encoding situations. This is a possible workaround for people who need it. + +More data about path encoding issues: +http://www.dwheeler.com/essays/fixing-unix-linux-filenames.html diff --git a/website/faqsandhowtos/OpenHelperScript.txt b/website/faqsandhowtos/OpenHelperScript.txt new file mode 100644 index 00000000..5e4d4b1c --- /dev/null +++ b/website/faqsandhowtos/OpenHelperScript.txt @@ -0,0 +1,71 @@ +== Starting native applications + +It is sometimes difficult to start a native application on a result +document, especially when the result comes from a container file (e.g.: email +folder file, chm file). + +The problem is that native applications usually expect at most a file name +on the command line, and sometimes not even that (emailers). + +The _Open parent documents_ link in the result list right click menu is +sometimes useful in this situation (e.g.: +chm+ files). 
+ +In some other cases it may help that Recoll does make a lot of data +available to the application. This data may have to be pre-processed in a +script before calling the actual application. + +Details about configuring how the native application or script are called +are given with the +link:http://www.recoll.org/usermanual/usermanual.html#RCL.INSTALL.CONFIG.MIMEVIEW[description of the mimeview configuration file] + +Information about +link:http://www.recoll.org/usermanual/usermanual.html#RCL.INSTALL.CONFIG.FIELDS[configuring +customised fields] may also be useful in combination. + +=== Example + +This is a simple example, because it does not need to use special +fields. It just shows how to solve a simple issue by using an intermediary +script. The problem is due to the fact that thunderbird's +-file+ option +won't open a file if the extension is not '.eml'. Jorge, the kind Recoll +user who supplied the example stores his email in Maildir++ format, the +file names have no extension, so an intermediary script is necessary to get +thunderbird to open them: + +Note that this only works with messages stored in Maildir or MH format (one +message per file). As far as I know, there is no way to get Thunderbird to +open an arbitrary mbox file. + +The 'recoll-thunderbird-open-file' script: + +---- +#!/bin/sh +cp $1 /tmp/$$.eml +thunderbird -file /tmp/$$.eml +---- + +Create the file in an editor, save it somewhere, and make it executable +(`chmod +x recoll-thunderbird-open-file`). + +The mail line in the '~/.recoll/mimeview' file: + +---- +[view] +message/rfc822 = recoll-thunderbird-open-file %f +---- + +If the place where you saved the script is not in your PATH, you will need +to use the full path instead of just the script name, as in + +---- +[view] +message/rfc822 = /home/me/somewhere/recoll-thunderbird-open-file %f +---- + +You should then be able to open the messages in Thunderbird, which is +useful, for example, to handle the attachments. 
+ +With recent Recoll versions, if using the normal option of letting the +Desktop choose the _Open_ application to use (_Use Desktop default_), +you should also add +message/rfc822+ to the exceptions, and the whole +thing is probably more easily done from the Recoll GUI. diff --git a/website/faqsandhowtos/PreventIndexingDir.txt b/website/faqsandhowtos/PreventIndexingDir.txt new file mode 100644 index 00000000..edef1880 --- /dev/null +++ b/website/faqsandhowtos/PreventIndexingDir.txt @@ -0,0 +1,27 @@ +== Preventing indexing in a directory + +=== Why would you want to do this ? + +By default, recollindex (or the indexing thread inside the recoll Qt user +interface) will process your home directory and most of its subdirectories, +with the exception of some well known places (thumbnails, beagle and web +browser caches, etc.). + +You may want to prevent indexing in some directories where you don't expect +interesting search results. This will avoid polluting the search result +lists, speed up indexing times and make the index smaller. + +=== How to do it + +There are two ways to block indexing at certain points: either by listing +specific paths, or by directory name pattern matches. + +- Blocking specific paths: this is controlled by the skippedPaths variable + in the main configuration file. You can adjust the value either by + editing the file or by using the indexing configuration dialog: + _Preferences->Indexing configuration->Global parameters->Skipped paths_ +- Using pattern matches: these are listed in the skippedNames variable in + the main configuration file. 
You can adjust the value either by editing + the file or by using the GUI: _Preferences->Indexing configuration->Local + parameters->Skipped names_ + diff --git a/website/faqsandhowtos/ProblemSolvingData.txt b/website/faqsandhowtos/ProblemSolvingData.txt new file mode 100644 index 00000000..28a936bd --- /dev/null +++ b/website/faqsandhowtos/ProblemSolvingData.txt @@ -0,0 +1,157 @@ +== Gathering useful data for asking help about or reporting a Recoll issue + +Once in a while it will happen that a Recoll program will either signal an +error, or even crash (either the *recoll* graphical interface or the +*recollindex* command line indexing command). + +Reporting errors and crashes is very useful. It can help others, and it can +get your own problem solved. + +Any problem report should include the exact Recoll and system versions. + +If at all possible, reading the following and performing part of the +suggested steps will be useful. This is not a condition for obtaining help +though ! If you have any problem and have a difficulty with the following, +just contact the mailing list or the developers (see contacts on +link:https://www.recoll.org/support.html[the Recoll site support page]). + +If the problem concerns indexing, and was initially found using the +*recoll* GUI, you should try to reproduce it using the +*recollindex* command-line indexer, which is much simpler and easier to +debug. + +There are then two sources of useful information to diagnose the issue: the +debug log file and, possibly, in case of a crash, a stack trace. + +Crash and other problem reports are of very high value to me, and I am +willing to help you with any of the steps described below if it is not +familiar to you. I do realize that not everybody is a programmer or a +system administrator. + +=== Obtaining information from the log file + +All Recoll commands write a varying amount of information to a common log file. 
+ +_All commands use the same log, and the file is reset every time a command +is started: so it is important to make a copy right after the problem +occurs (for example, do not start *recoll* after a *recollindex* +crash, this would reset the log). A workaround for this issue is to let the +messages go to the default +stderr+, and redirect this._ + +By default, the messages are output to +stderr+, and you probably don't even +see them if Recoll is started from the desktop. In this case, you need to +set the parameters so that output goes to a file, and the appropriate +verbosity level is set. When using the command-line, you may actually +prefer to redirect stderr to avoid the log-truncating issue described +above. + +You can set the log parameters from the GUI _Indexing parameters_ +section or by editing the '~/.recoll/recoll.conf' file: set the ++loglevel+ and +logfilename+ parameters. E.g.: + +---- +loglevel = 6 +logfilename = /tmp/recolltrace +---- + +The log file can become very big if you need a big indexing run to +reproduce the problem. Choose a file system with enough space available +(possibly a few gigabytes). + +Then run the sequence that leads to the problem, and make a copy of the log +file just after. If the log is too big, it will usually be sufficient to +use the last 500 lines or so (`tail -n 500`). + +==== Single file indexing issues + +When the problem concerns, or can be reproduced with, a single file it is +very cumbersome to have to run a full indexing pass to reproduce it. There +are two ways around this: + +- Set up an ad hoc configuration with only the file of interest, or its + parent directory: +---- +cd +mkdir recoll-test +cd recoll-test +echo 'topdirs = /path/to/my/file/or/its/parent/dir' > recoll.conf +echo 'loglevel = 6' >> recoll.conf +echo 'logfilename = /tmp/recolltrace' >> recoll.conf +recollindex -z -c . +---- +- Use the -e and -i options to recollindex to erase/reindex a single + file. 
Set up the log, then: +---- +recollindex -e /path/to/my/file +recollindex -i /path/to/my/file +---- + +When using the second approach, you must take care that the path used is +consistent with the paths listed/used in the configuration. For example, if '/home' is +a link to '/usr/home', and '/usr/home/me' is used in the configuration ++topdirs+, `recollindex -i /home/me/myfile` will not work: you need +to use `recollindex -i /usr/home/me/myfile`. + + +=== Obtaining a stack trace + +If the program actually crashes, the report will be most useful if it +also includes a so-called stack trace, something that +indicates what the program was doing when it crashed. Getting a useful +stack trace is not very difficult, but it may need a little work on your +part (which will then enable me to do my part of the work). + +If your distribution includes a separate package for Recoll debugging +symbols, it probably also has a page on its web site explaining how to use +them to get a stack trace. You should follow these instructions. If there +is no debugging package, you should follow the instructions below. A little +familiarity with the command line will be necessary. + +==== Compiling and installing a debugging version + +- Obtain the recoll source for the version you are using (www.recoll.org), + and extract the source tree. +- Follow the + link:http://www.lesbonscomptes.com/recoll/usermanual/rcl.install.building.html[instructions + for building Recoll from source] with the following modifications: +- Before running configure, edit the mk/localdefs.in file and remove the + -O2 option(s). +- When running configure, specify the standard installation location for + your system as a prefix (to avoid ending up with two installed versions, + which would almost certainly end in confusion). 
On Linux this would + typically be: `./configure --prefix=/usr` +- When installing, arrange for the installed executables not to be stripped + of debugging symbols by specifying a value for the STRIP environment + variable (e.g.: *echo* or *ls*): `sudo make install STRIP=ls` + +==== Getting a core dump + +You will need to run the operation that caused the crash inside a writable +directory, and tell the system that you accept core dumps. The commands +need to be run in a shell inside a terminal window. E.g.: + +---- +cd +ulimit -c unlimited +recoll #(or recollindex or whatever you want to run). +---- + +Hopefully, you will succeed in getting the command to crash, and you will +get a core file. A possible approach then would be to make both the +executable and the core files available to me by uploading them to a file +sharing site (the core file may be quite big). You should be aware though +that the core file may contain some of the data that was being indexed, +which may be a privacy issue. Another approach is to generate the stack +trace yourself. + +==== Using gdb to get a stack trace + +- Install gdb if it is not already on the system. +- Run gdb on the command that crashed and the core file (depending on the + system, the core file may be named "core" or something else, like + recollindex.core, or core.pid), e.g.: `gdb /usr/bin/recollindex core` +- Inside gdb, you need to use different commands to get a stack trace for + recoll and recollindex. For recollindex you can use the bt command. For + recoll use `thread apply all bt full` +- Copy/paste the output to your report email :), and quit gdb ("q"). 
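The gdb steps above can also be run non-interactively. The following is only a sketch: the function names `bt_command` and `save_stack_trace` are made up for this example, and it relies on the gdb `-batch` and `-ex` options to run one command and exit.

```shell
#!/bin/sh
# Hypothetical helpers, for illustration only.

# Pick the gdb command suggested above for a given executable:
# plain "bt" for recollindex, "thread apply all bt full" otherwise
# (i.e. for the recoll GUI).
bt_command() {
    case "$(basename "$1")" in
        recollindex) echo "bt" ;;
        *) echo "thread apply all bt full" ;;
    esac
}

# Run gdb in batch mode on an executable and its core file, and save
# the trace. $1: executable, $2: core file, $3: output file.
save_stack_trace() {
    gdb -batch -ex "$(bt_command "$1")" "$1" "$2" > "$3" 2>&1
}
```

For example, `save_stack_trace /usr/bin/recollindex core /tmp/recoll-trace.txt` would leave the trace in '/tmp/recoll-trace.txt', ready to be attached to the report.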
+ diff --git a/website/faqsandhowtos/QpdfviewHelperScript.txt b/website/faqsandhowtos/QpdfviewHelperScript.txt new file mode 100644 index 00000000..3695234c --- /dev/null +++ b/website/faqsandhowtos/QpdfviewHelperScript.txt @@ -0,0 +1,61 @@ +== Starting native applications == + +Another example of using an intermediary script for an application with a +command line syntax which can't be directly defined in mimeview. + +We use a script to preprocess and adapt the options before calling the +actual command. + +Details about configuring how the native application or script are called +are given with the +link:http://www.recoll.org/usermanual/usermanual.html#RCL.INSTALL.CONFIG.MIMEVIEW[description +of the mimeview configuration file]. + +*qpdfview* (link:http://launchpad.net/qpdfview[web site]) is a very +lightweight tabbed PDF viewer with great search performance and result +highlighting. + +It does support parsing the search term and page number from the command +line with the following syntax: + +---- +qpdfview --unique "%f"#%p --search "%s" +---- + +However, qpdfview will not launch if either %p or %s are empty in the +command above. 
To work around this, Recoll user Florian has written a +small wrapper shell script: + +---- +#!/bin/bash + +qpdfviewpath=qpdfview + +# Only append the page anchor if a page number was given +if [ -z "$2" ] +then + page="" + +else + page="#$2" +fi + +# Only pass --search if a search term was given. An array keeps a +# multi-word term as a single argument. +if [ -z "$3" ] +then + search=() + +else + search=(--search "$3") +fi + +"$qpdfviewpath" --unique "$1$page" "${search[@]}" >&0 2>&0 & +---- + + +The corresponding handler line for Recoll would be (depending on how you +name the script and where you store it): + +---- + qpdfviewwrapper %f %p %s +---- + + diff --git a/website/faqsandhowtos/QueryFromC.txt b/website/faqsandhowtos/QueryFromC.txt new file mode 100644 index 00000000..82898c54 --- /dev/null +++ b/website/faqsandhowtos/QueryFromC.txt @@ -0,0 +1,18 @@ +== Querying Recoll from a C program + +The easiest way to query Recoll from a C or C++ program is to execute an +external search command (`recollq` or `recoll -t`). + +I have written a simple C module which deals with the related housekeeping +and presents an easy to use API to the rest of the code. You will find it +here: + + https://bitbucket.org/medoc/recoll-capi + +It is a bit experimental and will only work with recoll 1.20 for now +(because it uses a new option for recollq). However it would be trivial to +modify for working with 1.19, get in touch with me if you need this. + +The other approach is to link with the Recoll library. This has no official +API, but in practice, the internal one is fairly stable, and if you want to +choose this approach, you should start from the code in recollq.cpp. diff --git a/website/faqsandhowtos/ReplaceCategories.txt b/website/faqsandhowtos/ReplaceCategories.txt new file mode 100644 index 00000000..5dcd25d8 --- /dev/null +++ b/website/faqsandhowtos/ReplaceCategories.txt @@ -0,0 +1,58 @@ +== Replacing the Category filter controls + +The document category filter controls normally appear at the top of the +*recoll* GUI, either as checkboxes just above the result list, or as a +dropbox in the tool area. 
+ +By default, they are labeled _Media_, _Message_, _Spreadsheet_, _Text_, +etc. and each maps to a document category. + +The mapping used to be fixed. You could change the number and composition +of categories by redefining them inside the 'mimeconf' configuration +file (you still can), but the filters always used document categories. + +Categories can also be selected from the query language by using an ++rclcat:+ selector. E.g.: _rclcat:message_. + +As of Recoll release 1.17, the filters are not hard-wired any more. They +map to query language fragments. This means that you can freely redefine +what they do. + +The associations are configured inside the 'mimeconf' file, in the ++[guifilters]+ section. Most GUI parameters are stored in the *Qt* +configuration file, so this is not entirely consistent, and you will have +to bear with my laziness here. + +A simple example will hopefully make things clearer. If you add the +following to your '~/.recoll/mimeconf' file: + +---- +[guifilters] + +Big Books = dir:"~/My Books" size>10K +My Docs = dir:"~/My Documents" +Small Books = dir:"~/My Books" size<10K +System Docs = dir:/usr/share/doc + +---- + +You will have four filter checkboxes, labelled _Big Books_, _My Docs_, etc. + +The text after the equal sign must be a valid query language fragment, and +will be translated to a *Recoll* query and combined with the rest of the +query with an AND conjunction. + +Any name text before a colon character will be erased in the display, but +used for sorting. You can use this to display the checkboxes in any order +you like. For example, the following would do exactly the same as above, +but with the checkboxes in reverse order. 
+ +---- +[guifilters] + +d:Big Books = dir:"~/My Books" size>10K +c:My Docs = dir:"~/My Documents" +b:Small Books = dir:"~/My Books" size<10K +a:System Docs = dir:/usr/share/doc + +---- diff --git a/website/faqsandhowtos/ResultsThumbnails.txt b/website/faqsandhowtos/ResultsThumbnails.txt new file mode 100644 index 00000000..40c325c3 --- /dev/null +++ b/website/faqsandhowtos/ResultsThumbnails.txt @@ -0,0 +1,23 @@ +== Result list thumbnails and how to create them + +Recoll will display thumbnails for the results if the images exist in the +standard location ('$HOME/.thumbnails' or '$HOME/.cache/thumbnails' depending +on the xdg version). + +But it will not create thumbnails, mainly because it is very hard to do +portably. + +Thumbnails are most commonly created when you visit a directory with your +file manager, but visiting the whole file tree just to create thumbnails is +a bit tedious. + +One simple trick to create thumbnails from the recoll GUI is to visit the +parent directory for a result by using the _Open parent document/folder_ +entry in the right-click menu. + +You can also find tools for the systematic creation of thumbnails for a +directory tree. Three such tools are discussed in this +link:http://askubuntu.com/questions/199110/how-can-i-instruct-nautilus-to-pre-generate-pdf-thumbnails[askubuntu.com discussion]. + +Also please note that no thumbnails can currently be generated or displayed +for embedded documents (attachments, archive members, etc.). 
diff --git a/website/faqsandhowtos/SavingConfig.txt b/website/faqsandhowtos/SavingConfig.txt new file mode 100644 index 00000000..c95a8792 --- /dev/null +++ b/website/faqsandhowtos/SavingConfig.txt @@ -0,0 +1,61 @@ +== User configuration backup + +=== Why you would want to do this + +If you are going to reinstall your system, and have some custom +configuration, you may save some time by making a backup of your +configuration and restoring it on the new system, rather than going through +the menus to recreate it. + +=== How to do it + +==== Index/search configuration + +The main recoll configuration data is normally kept inside '~/.recoll' or +whatever *$RECOLL_CONFDIR* is set to. + +This directory contains both configuration files and generated index +data. In a standard configuration, the following files and directories +contain generated data: + +- 'xapiandb' contains the Xapian index, which normally consumes most of the + total space. +- 'aspdict.en.rws' contains the aspell dictionary used for spelling + corrections. +- 'mboxcache' contains cached offset data for email messages inside mbox + folders. +- 'webcache' contains saved web pages. This is more than a cache as + destroying it will purge the corresponding data during the next + indexing. + +The other files are either very small or contain configuration data. + +If you want to only save configuration, using minimum space, you can +destroy the above files and directories (with the possible exception of +'webcache'). Then taking a copy of the '.recoll' directory and adding the +GUI configuration data described in the next section will get you a full +configuration data backup. 
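The backup described above could be sketched as follows. This is only an illustration: the function name `backup_recoll_conf` is made up here, it leaves the generated data out of the archive instead of destroying it, it keeps 'webcache', and it assumes a *tar* which supports `--exclude` (GNU and BSD tar do).

```shell
#!/bin/sh
# Hypothetical helper, for illustration only.
# Archive a Recoll configuration directory without the generated data.
# $1: configuration directory (e.g. ~/.recoll)
# $2: output tar file (should be located outside of the directory)
backup_recoll_conf() {
    confdir=$1
    out=$2
    # -C makes the archived paths relative to the parent directory.
    # webcache is kept on purpose, as it is more than a cache.
    tar -cf "$out" \
        --exclude=xapiandb \
        --exclude='aspdict.*.rws' \
        --exclude=mboxcache \
        -C "$(dirname "$confdir")" "$(basename "$confdir")"
}
```

Typical use would be `backup_recoll_conf ~/.recoll ~/recoll-conf.tar`. The GUI configuration file lives elsewhere and has to be copied separately.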
+ +==== GUI configuration + +The parameters set from the _Query configuration_ Qt menus are stored in +Qt standard places: + +- '~/.qt/recollrc' for Qt 3.x +- '~/.config/Recoll.org/recoll.conf' for Qt 4 and later + + +==== Other data + +If you wish to save index data in addition to the customisation files, +which only makes sense if the document access paths do not change after +reinstallation, you can just take a backup of the full '.recoll' +directory, keeping in mind that some data elements may be stored +elsewhere (outside '.recoll'): + +- The index data is normally kept inside '~/.recoll/xapiandb', but the + location of this directory can be modified by the +dbdir+ + configuration parameter if it is set (check 'recoll.conf'). +- If you use the Firefox Recoll plugin, the WEB history cache is normally + kept inside '~/.recoll/webcache', but the location can be modified by + the +webcachedir+ configuration parameter. diff --git a/website/faqsandhowtos/UnityLens.txt b/website/faqsandhowtos/UnityLens.txt new file mode 100644 index 00000000..bcd4c44b --- /dev/null +++ b/website/faqsandhowtos/UnityLens.txt @@ -0,0 +1,109 @@ +== Building and Installing the Ubuntu Unity Recoll Lens + +Important preliminary notes: + +- This only makes sense for Ubuntu versions using the Unity environment: + Natty (11.04), Oneiric (11.10), Precise (12.04), and later. +- _Remember that you still need to use the recoll GUI (or the recollindex + command) to get the indexing going !_ +- The Lens is artificially limited to showing at most 20 results. Use the + recoll GUI for more complete capabilities (or edit rclsearch.py and change + the "if actual_results >= 20:" line). 
+ + +=== The Lens with Recoll 1.17 and later + +If you are willing to install or upgrade to Recoll version 1.17, all +necessary packages are on the Recoll PPA, you just need to add the +repository to your system sources and add or upgrade the packages: *_This +is the recommended approach!_* + +---- +sudo add-apt-repository ppa:recoll-backports/recoll-1.15-on +sudo apt-get update +sudo apt-get install recoll-lens recoll +---- + +This document may still be useful if you want to modify the lens source +code. + +=== The Lens with older Recoll versions + +If, for some reason, you wish to test the Lens with an older Recoll +version, read the following. + +Please note that such an installation is somewhat crippled: you will not be +able to display results for embedded documents (emails inside an mbox, +attachments etc.). This requires a recoll command line option which is only +available in 1.17. + +The Lens is based on the Recoll Python module which is not built by default +for versions prior to 1.17, so you will first need to pull the Recoll +source code (for your version), then untar and proceed with the +configure/build instructions below. + +The following uses --prefix=/usr. I have no real reason to believe +that this would not work with /usr/local (lenses are also searched there by +default). If you confirm that things work with another prefix, please drop +me a line. + +When doing this over a previous Recoll compilation, run a "make clean" to +get rid of the non-PIC objects. + +Note that the following instructions change nothing in your existing Recoll +installation: they only install the Python module and the Unity Lens; +recoll, recollindex etc. are unaffected. + +'/TOP/OF/RECOLL/SRC' designates the top of the recoll source tree. + +=== Configure and build the recoll library and python module, install the module + +The following needs the development packages for Xapian, Python and zlib. 
+ +---- +cd /TOP/OF/RECOLL/SRC +# May fail if no previous build was performed +make clean + +# the gui/x11 disabling is just here to avoid having to install the +# development libraries for Qt. +./configure --prefix=/usr --enable-pic --without-x --disable-qtgui +make + +cd python/recoll +python setup.py build +sudo python setup.py install +---- + +=== Build and install the Unity Lens + +---- +cd /TOP/OF/RECOLL/SRC +cd desktop/unity-lens-recoll +./configure --prefix=/usr --sysconfdir=/etc +sudo make install + +---- + +Voilà, it should work... + +Try to start the Dash, you should see the Recoll checkerboard (or +whatever...) in the Lens list. + +The Recoll Lens expects a Recoll query language string, so you can use +field searches, directory, size, and date filtering (see the +link:http://www.lesbonscomptes.com/recoll/usermanual/rcl.search.lang.html[Recoll +manual] for a description of the query language). + +If you want to disable the Lens, I think that you just have to delete +'/usr/share/unity/lenses/recoll'. + +Other installed files: + +---- +/usr/libexec/unity-recoll-daemon +/usr/share/dbus-1/services/unity-lens-recoll.service +/usr/share/doc/unity-lens-recoll +/usr/share/unity-lens-recoll +---- + diff --git a/website/faqsandhowtos/UsingOpenWith.txt b/website/faqsandhowtos/UsingOpenWith.txt new file mode 100644 index 00000000..c9d681f1 --- /dev/null +++ b/website/faqsandhowtos/UsingOpenWith.txt @@ -0,0 +1,68 @@ +== Using the _Open With_ context menu in recoll 1.20 and newer + +Recoll versions 1.20 and newer have an _Open With_ entry in the result list +context menu (the thing which pops up on a right click). + +This allows choosing the application used to edit the document, instead of +using the default one. + +The list of applications is built from the desktop files found inside +'/usr/share/applications'. For each application on the system, these +files list the mime types that the application can process. 
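To check by hand which desktop files declare a given mime type, something like the following sketch can be used. It is not how Recoll itself builds the list, only a way to inspect the same files; the function name `apps_for_mime` is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper, for illustration only.
# Print the desktop files which declare support for a MIME type.
# $1: MIME type, e.g. application/pdf
# $2: directory to scan (defaults to /usr/share/applications)
apps_for_mime() {
    mime=$1
    dir=${2:-/usr/share/applications}
    for f in "$dir"/*.desktop; do
        [ -f "$f" ] || continue
        # The MimeType key holds a semicolon-separated list: split it
        # and look for an exact, literal match of the requested type.
        if sed -n 's/^MimeType=//p' "$f" | tr ';' '\n' \
            | grep -q -x -F -- "$mime"; then
            echo "$f"
        fi
    done
}
```

For example, `apps_for_mime application/pdf` lists the desktop files of the applications which can be expected to be offered for PDF documents.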
+ +If the application which you would want listed does not appear, the most +probable cause is that it has no desktop file, which can happen for a +number of reasons. + +This can be fixed very easily: just add a +.desktop+ file to +'/usr/share/applications', starting from an existing one as a template. + +As an example, based on an original idea from Recoll user +florianbw+, +the following describes setting up a script for editing a PDF document +title found in the recoll result list. + +The script uses the *zenity* shell script dialog box tool to let you +enter the new title, and then executes *exiftool* to actually change +the document. + +---- +#!/bin/sh + +PDF=$1 +TITLE=$(exiftool -Title -s3 "$PDF") + +RES=$(zenity --entry \ + --title="Change PDF Title" \ + --text="Enter the Title:" \ + --entry-text "$TITLE") + +if [ -n "$RES" ]; then + printf 'Changing title to %s ... ' "$RES" && \ + exiftool -Title="$RES" "$PDF" && \ + recollindex -i "$PDF" && echo "Done!" +else + echo "No title entered" +fi +---- + +Name it, for example, 'pdf-edit-title.sh', and make it executable +(`chmod a+x pdf-edit-title.sh`). + +Then create a file named 'pdf-edit-title.desktop' inside +'/usr/share/applications'. The file name does not need to be the same as the +script's; this is just to make things clearer: + +---- +[Desktop Entry] +Name=PDF Title Editor +Comment=Small script based on exiftool used to edit a pdf document title +Exec=/home/dockes/bin/pdf-edit-title.sh %F +Type=Application +MimeType=application/pdf; +---- + +You're done! Restart Recoll, perform a search and right-click on a PDF +result: you should see an entry named _PDF Title Editor_ in the _Open +With_ list. Click on it, and you will be able to edit the title.
+ + diff --git a/website/faqsandhowtos/WhyIsMyFileNotIndexed.txt b/website/faqsandhowtos/WhyIsMyFileNotIndexed.txt new file mode 100644 index 00000000..15970e6d --- /dev/null +++ b/website/faqsandhowtos/WhyIsMyFileNotIndexed.txt @@ -0,0 +1,99 @@ +== Using the log file to investigate indexing issues + +All *Recoll* processes print trace messages. By default these go to the +standard error output, and you may never see them (in the case, for +example, of the *recoll* GUI started from the desktop interface). + +There are a number of potential issues with indexing that may need +investigation, such as: + +- A file can't be found by searching even if it appears that it should have + been indexed (this could happen because the file is not selected at all or + because a filter program crashes). +- The indexing process gets stuck and never finishes. +- The indexing process ends with an error. +- The indexing process seems to be using too much system capacity. + +The right way to approach these problems is to use the *recollindex* +command line tool (instead of the *recoll* GUI), and to set up the +trace log to provide information about what indexing is actually doing. + +Trace log parameters can be set either from the GUI _Preferences->Indexing +Configuration->Global Parameters_ panel, or by editing the configuration +file '~/.recoll/recoll.conf'. You should set the following parameters: + +---- +loglevel = 6 +logfilename = stderr +thrQSizes = -1 -1 -1 +---- + +We use _stderr_ instead of an actual file in order to capture direct filter +messages (such as a *python* stack trace) along with normal +*recollindex* messages. + +The last line sets recollindex for single-threaded operation, which will +make the log much more readable. + +You should then check that no *recoll* or *recollindex* process is +currently running, and kill any you find.
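One way to do this check from a terminal is sketched below. It assumes that the `pgrep` and `pkill` tools (part of the procps package on most Linux systems) are installed; the `running_recoll` function name is made up for this example.

```shell
#!/bin/sh
# running_recoll: hypothetical helper, not part of Recoll.
# Prints "PID name" for every recoll or recollindex process belonging to
# the current user, and nothing if none is running.
running_recoll() {
    for name in recoll recollindex; do
        # -x: exact name match, -l: print the name, -u: only this user.
        # pgrep exits non-zero when nothing matches, which is fine here.
        pgrep -l -x -u "$(id -u)" "$name" || true
    done
}

if command -v pgrep >/dev/null 2>&1; then
    running_recoll
fi
```

If this prints anything, stop the listed processes before starting the test run, for example with `pkill -x -u "$(id -u)" recollindex`.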
+ +Then, if this is an issue about an identified file, try indexing it only: + +---- +recollindex -i myunfindablefile.xxx > /tmp/myindexlog 2>&1 +---- + +If this is a general issue with indexing (process not finishing properly), +just start it: + +---- +recollindex > /tmp/myindexlog 2>&1 +---- + +Usually, having a look at the trace will let you see what is wrong (e.g.: +a configuration issue or missing filter), and solve the problem. + +In case of indexer misbehaviour (e.g. using too much memory), you should run +_tail -f_ on the log to see what is going on. + +If this is not enough, please +link:http://bitbucket.org/medoc/recoll/issues/new[open a tracker issue] and +attach or link to the log data, or just email me (jfd at recoll.org). + +*recollindex* and *recollindex -i* usually use the same criteria to +include a file or not (but see the _Path gotcha_ note below). It may +happen that they behave differently, so it may sometimes be useful to run a +full *recollindex* even for a specific file, but this will produce a +big log file. + +When you are done, it is better to reset the verbosity to a reasonable +level (e.g.: +2+ : just errors, +4+ : basic traces). + +=== Note: the path gotcha + +*recollindex -i* will only index files under the directories defined by the ++topdirs+ configuration variable (your home directory by +default). Unfortunately, the test is done on the file path text, ignoring +possible symbolic links. If you give a simple file name as a parameter to +*recollindex -i* and there are symbolic links inside the +topdirs+ +entries, the comparison may fail. For example, if your home directory is +'/home/me/' and '/home/' is a link to '/usr/home/', *recollindex -i +somefilename* will actually try to index '/usr/home/me/somefilename', and +fail (because '/usr/home/me/' is not a subdirectory of '/home/me/'). This +will manifest itself in the log by a message like the following.
+ +---- +:4:../index/fsindexer.cpp:149:FsIndexer::indexFiles: skipping [/usr/home/me/somefile] (ntd) +---- + +If this happens, give a full path consistent with what is found in the +configuration file (e.g.: _recollindex -i /home/me/somefile_). + +=== File system occupation + +One of the possible reasons for failed indexing is a +maxfsoccup+ +parameter set too low. This is the value of file system occupation, not +free space, where indexing will stop. It is set from the GUI indexing +configuration or by editing 'recoll.conf'. A value of 0 implies no +checking, but a very low, non-zero, value will just prevent indexing. diff --git a/website/faqsandhowtos/WikiIndex.txt b/website/faqsandhowtos/WikiIndex.txt new file mode 100644 index 00000000..3b168337 --- /dev/null +++ b/website/faqsandhowtos/WikiIndex.txt @@ -0,0 +1,65 @@ +== Recoll Wiki file index +link:ElinksWeb.html[Extending the Recoll Firefox visited web page indexing mechanism to other browsers] + +link:FaqsAndHowTos.html[Faqs and Howtos] + +link:FilterArch.html[Recoll input filters ] + +link:FilterRetrofit.html[Installing a filter for a new document type] + +link:FilteringOutZipArchiveMembers.html[Filtering out Zip archive members] + +link:GUIKeyboard.html[# Recoll GUI keyboard navigation] + +link:HandleCustomField.html[Generating a custom field and using it to sort results] + +link:Home.html[Welcome to the Recoll Wiki] + +link:HotRecoll.html[Recoll hotkey: starting / hiding recoll with a keyboard shortcut] + +link:IndexMailHeader.html[Indexing arbitrary mail headers ] + +link:IndexMozillaCalendari.html[Indexing Mozilla calendar data ] + +link:IndexOnAc.html[Laptops: automatically starting or stopping indexing according to AC power status] + +link:IndexOutlook.html[Indexing Outlook archives] + +link:IndexWebHistory.html[Indexing Web history with the Firefox extension ] + +link:MultipleIndexes.html[Creating and using multiple indexes] + +link:MuttAndRecoll.html[Interfacing Recoll and Mutt] + 
+ +link:NonAsciiFileNames.html[Unix and non-ASCII file names, a summary of issues] + +link:OpenHelperScript.html[Starting native applications ] + +link:PreventIndexingDir.html[Preventing indexing in a directory] + +link:ProblemSolvingData.html[Gathering useful data for asking help about or reporting a Recoll issue] + +link:QpdfviewHelperScript.html[Starting native applications ] + +link:QueryFromC.html[Querying Recoll from a C program] + +link:ReplaceCategories.html[Replacing the Category filter controls] + +link:ResultsThumbnails.html[Result list thumbnails and how to create them] + +link:SavingConfig.html[User configuration backup] + +link:UnityLens.html[Building and Installing the Ubuntu Unity Recoll Lens] + +link:UsingOpenWith.html[Using the Open With context menu in recoll 1.20 and newer] + +link:WhyIsMyFileNotIndexed.html[Using the log file to investigate indexing issues] + +link:XDGBase.html[XDG: Tidying Recoll data storage] + +link:ZDevCaseAndDiacritics1.html[Character case and diacritic marks (1), issues with stemming] + +link:ZDevCaseAndDiacritics2.html[Character case and diacritic marks (2), user interface] + +link:ZDevCaseAndDiacritics3.html[Character case and diacritic marks (3), implementation] + diff --git a/website/faqsandhowtos/XDGBase.txt b/website/faqsandhowtos/XDGBase.txt new file mode 100644 index 00000000..478cac7e --- /dev/null +++ b/website/faqsandhowtos/XDGBase.txt @@ -0,0 +1,42 @@ +== XDG: Tidying Recoll data storage == + +The default storage structure of Recoll configuration and index data is +quite at odds with the recommendations of the +link:http://standards.freedesktop.org/basedir-spec/basedir-spec-latest.html[XDG +Base Directory Specification], the reason being that it predates said spec. + +By default, Recoll stores all its data in a single directory: '$HOME/.recoll'. + +This is not going to change, because it would be quite disruptive for +current users.
+ +However, the location of this directory can be modified using the ++$RECOLL_CONFDIR+ environment variable. + +Furthermore, all significant Recoll data categories can be moved away from +the configuration directory (maybe to '$HOME/.cache'), by setting +configuration variables: + +* _dbdir_ defines the location for storing the Xapian + index. This could be set to, e.g., '$HOME/.cache/recoll/xapiandb'. It is + strongly recommended that + this directory be dedicated to Xapian (don't store other things in + there). +* _mboxcachedir_ defines the location for caching access speedup information + about mail folders in mbox format, e.g. '$HOME/.cache/recoll/mboxcache'. +* New in 1.22: you can use _aspellDictDir_ to define the storage + location for the aspell spelling approximation + dictionary, e.g. '$HOME/.cache/recoll'. +* _webcachedir_ may be used to define where the visited web pages + archive is stored, e.g. '$HOME/.cache/recoll/webcache'. This is only used + if you activate the Firefox plugin and web history indexing. You may + want to think a bit more about where to store it, because, contrary to + the above, this is not discardable data: your Recoll Web history goes + away if you delete it. + +If you use multiple Recoll configurations, each will have to be customized. + +Once these are put away, there are still a few modifiable files in the +configuration directory, for example the 'recoll.pid' and 'history' +files, but these are small files. Moving 'recoll.pid' away would be a +serious headache because it is used by scripts. diff --git a/website/faqsandhowtos/ZDevCaseAndDiacritics1.txt b/website/faqsandhowtos/ZDevCaseAndDiacritics1.txt new file mode 100644 index 00000000..bdd5425e --- /dev/null +++ b/website/faqsandhowtos/ZDevCaseAndDiacritics1.txt @@ -0,0 +1,143 @@ +== Character case and diacritic marks (1), issues with stemming + +=== Case and diacritics in Recoll + +Recoll versions up to 1.17 almost fully ignore character case and diacritic +marks.
+ +All terms are converted to lower case and unaccented before they are +written to the index. There are only two exceptions: + + * File paths (as used in _dir:_ clauses) are not converted. This might + be a bug or a feature, but the main reason is that we don't know how they + are encoded. + * It is possible to specify that some characters will keep their diacritic + marks, because the entity formed by the character and the diacritic mark + is considered to be a different letter, not a modified one. This is + highly dependent on the language. For example, in Swedish, +å+ should + be preserved, not turned into +a+. + +As a necessary consequence, the same transformations are applied to search +terms, and it is impossible to search for a specific capitalization of a +word (+US+ is looked for as +us+), or a specific accented form +(+café+ will be looked for as +cafe+). + +However, there are some cases where you would like to be more specific: + + * Searching for +US+ or +us+ should probably return different results. + * Diacritics are seldom significant in English, but we can find a + few examples anyway: +sake+ and +saké+, +mate+ and +maté+. Of + course, there are many more cases in languages which use more diacritics. + +On the other hand, accents are often mistyped or forgotten (résumé, résume, +resume?), and capitalization is most often insignificant, so that it is +very important to retain the capability to ignore accent and character +case differences, and to make sure that the discrimination can be easily switched on or +off for each search (or even for specific terms). + +This text and the pages which follow discuss issues in adding +character case and diacritics sensitivity to Recoll, under the assumption +that the main index will contain the raw source terms instead of +case-folded and unaccented ones. + +The following will use the _unaccent_ neologism to mean _remove +diacritic marks_ (and not only accents).
+ +English examples are used when possible, but given the limited use of +diacritics in English, some French will probably creep in. + +=== Diacritics and stemming + +Stemming is the process by which we extend a search to terms related by +grammatical inflexion, for example singular/plural, verb tenses, etc. For +example a search for +floor+ is normally expanded by Recoll to +floors, +floored, flooring, ...+ + +In practice Recoll has a separate data structure that has stemmed terms +(stems) as keys pointing to a list of expansion terms: ++floor -> (floor, floors, flooring, ...)+ + +Stemming should be applied to terms before they are stripped of +diacritics. Accents may have a grammatical significance, and the accent may +change how the term is stemmed. For example, in French the +âmes+ suffix +generally marks a past conjugation but +ames+ does not. The standard +Xapian French stemmer will turn +évitâmes+ (avoided) into an +évit+ stem, +but +évitames+ will be turned into +évitam+ (stripping +plural and feminine suffixes). + +When the search is set to ignore diacritics, this poses a specific problem: +if the user enters the search term without accents (which is correct +because the system is supposed to ignore them), there is no guarantee that +the term will be correctly expanded by stemming. + +The diacritic mismatch breaks the family relationship between the stem +siblings, and this is independent of the type of index: it will happen with +an index where diacritics are stripped just as with a raw one. + +The simpler case where diacritics in the original term only affect +diacritics in the stem also necessitates specific processing, but it is +easier to work around. + +Two examples illustrating these issues follow. + +==== The simple case: diacritics in the term only affect diacritics in the stem + +Let's imagine that the document set contains the term +éviter+ +(infinitive of +to avoid+), but not +évite+ (present).
The only term in +the actual index is then +éviter+. + +The user enters an unaccented +evite+, counting on the +diacritics-insensitive search mode to deal with the accents. As +évite+ +is not present in the index, we have no way to guess that +evite+ is +really +évite+. + +The stemmer will turn +evite+ into +evit+. There is no way that this +can be related to +éviter+, and this legitimate result can't be found. + +There is a way around this: we can compute a separate +stem expansion dictionary for unaccented terms. This dictionary, to be used +with diacritics-insensitive searches only, contains the relationship +between +evit+ and +eviter+ (as +éviter+ is in the index). We can +then relate +eviter+ and +éviter+ because they differ only by accents, +and the search will find the document with +éviter+. + +==== The bad case: diacritics in the term change the stem beyond diacritics + +Some grammatically significant accents will cause unexpectedly missing +search results when using a supposedly diacritics-insensitive search mode. + +Let's imagine that the document set contains the term +éviter+ +(infinitive of +to avoid+), but not +évitâmes+ (past). So the stemming +expansion table has an entry for +évit+ -> +éviter+. + +If the user enters an unaccented +evitames+, she would expect to find the +documents containing +éviter+ in the results, because the latter term is +a stemming sibling of +évitâmes+ and the search is supposedly not +influenced by diacritics, so that +evitames+ and +évitâmes+ should be +equivalent. + +However, our search is now in trouble, because +évitâmes+ is not in any +document, so that there is no data in the index which would inform us about +how to transform the input term into something that differs only by accents +but would yield a correct input for the stemmer.
+ +If we try to feed the raw user input to the stemmer, it will propose +an +evitam+ stem, which will not work, because the stem that actually +exists is +évit+, and +evitam+ cannot be related to +éviter+. + +The only palliative approach I can think of would be a spelling correction +of the input, performed independently of the actual index contents, which +would notice that +evitames+ is not a French word and propose a change or an +expansion to +évitâmes+, which would correctly stem to +évit+ and allow +us to find +éviter+. + +This issue is not specific to Recoll or indeed to whether the index +retains accents or not. As far as I can see, it is an intrinsic bad +interaction between diacritics insensitivity and stemming. + +It is also interesting to note that this case becomes less probable when +the data set becomes bigger, because more term inflexions will then be +present in the index. + +We'll next think about an link:ZDevCaseAndDiacritics2.html[appropriate +interface]. diff --git a/website/faqsandhowtos/ZDevCaseAndDiacritics2.txt b/website/faqsandhowtos/ZDevCaseAndDiacritics2.txt new file mode 100644 index 00000000..6e2744ea --- /dev/null +++ b/website/faqsandhowtos/ZDevCaseAndDiacritics2.txt @@ -0,0 +1,122 @@ +== Character case and diacritic marks (2), user interface + +In a link:ZDevCaseAndDiacritics1.html[previous document], we discussed some +of the problems which arise when mixing case/diacritics sensitivity and +stemming. + +As of version 1.18, Recoll can create two types of indexes: +* _Dumb_ indexes contain terms which are lowercased and stripped of + diacritics. Searches using such an index are naturally case- and + diacritics-insensitive: search terms are stripped before processing. +* _Raw_ indexes contain terms which are just like they were found in the + source document. Searching such an index is naturally sensitive to case + and diacritics, and can be made insensitive by further processing.
+ +The following explains how users can control these Recoll features. + +=== Controlling the type of index we create: stripped or raw + +The kind of index that recoll creates is determined by: + + * A build-time *configure* switch: _--enable-stripchars_. If this is + set, the code for case and diacritics sensitivity is not compiled in and + recoll will work like the previous versions: unaccented and casefolded + index, no runtime options for case or diacritics sensitivity. + + * An indexing configuration switch (in recoll.conf): if Recoll was built + with _--disable-stripchars_, this will provide a dynamic way to return + to the "traditional" index. The case and diacritics code will be present + but inactive. Normally, a recoll installation with this switch set + should behave exactly like one built with _--enable-stripchars_. When + using multiple indexes, this switch MUST be consistent between + indexes. There is no support whatsoever for mixing raw and dumb indexes. + The option is named _indexStripChars_, and it is not settable from the + GUI to avoid errors. This is something that would typically be set once + and for all for a given installation. We need to decide what the default + value will be for 1.18. + + * A number of query time switches. Using these it is also possible to + perform a search insensitive to case and diacritics on a raw index. Note, + however, that given the complexity of the issues involved, I give no + guarantee at this time that this will yield exactly the same results as + searching a dumb index. Details about query time behaviour follow. + + +=== Controlling stem, case and diacritics expansion: user query interface + +Recoll versions up to 1.17 were insensitive to case and diacritics. We only +needed to give the user a way to control stem expansion. This was done in +three ways: + + * Globally, by setting a menu option. + * Globally, by setting the stemming language value to empty.
+ * On a term by term basis by capitalizing the term, or, in query language + mode only, by using an 'l' clause modifier (_"term"l_). + +After switching to an unstripped index, capable of case and diacritic +sensitivity, we need ways to control what processing is performed among: + + * Case expansion. + * Diacritics expansion. + * Stem expansion. + +The default mode will be compatible with the previous version, because +this is most generally what we want to do: ignore case and diacritics, +expand stems. + +There are two easy approaches for controlling the parameters: + * Global options set in the GUI menus or as *recollq* command line + switches. + * Per-clause options set by modifiers in the query language. + +We would like, however, to let the user's input automatically override the +defaults in a sensible way. For example: + + * If a term is entered with diacritics, diacritic sensitivity is turned on + (for this term only). + * If a term is entered with upper-case characters, case sensitivity is + turned on. In this case, we turn off stem expansion, because it really + makes no sense with case sensitivity. + +With this method we are stuck with 3 problems (only if the global mode is +set to insensitive, and we're not using the query language): + + * Turning off stemming without turning on case sensitivity. + * Searching for an all lower-case term in case-sensitive mode. + * Searching for a term without diacritics in diacritic-sensitive mode. + +The last two issues are relatively marginal and can be worked around easily +by switching to query language mode or using negative clauses in the +advanced search. + +However, we need to be able to turn stemming off while remaining +insensitive to case, and we need to stay reasonably compatible with the +previous versions. This means that a term which has a capital first letter +but is otherwise lowercase will turn stemming off, but not case sensitivity +on.
+ +So we're left with how to search for such a term in a case-sensitive way, +and for this, you'll have to use global options or the query language. + +The modified method is: + + * If a term is entered with diacritics, diacritic sensitivity is turned on + (for this term only). + * If the first letter in a term is upper-case and the rest is lower-case, + we turn stem expansion off, but we do not become case-sensitive. + * If any letter in a term except the first is upper-case, case sensitivity + is turned on. Stem expansion is also turned off (even if the first + letter is lower-case), because it really makes no sense with case + sensitivity. + * To search for an all lower-case or capitalized term in a case-sensitive + way, use the query language: "Capitalized"C, "lowercase"C. + * Use the query language and the "D" modifier to turn on diacritics + sensitivity. + +It can be noted that some combinations of choices do not make sense and +they are not allowed by Recoll: for example, diacritics or case sensitivity +do not make sense with stem expansion (which cannot preserve diacritics in +any meaningful general way). + +The link:ZDevCaseAndDiacritics3.html[next page] describes the actual +implementation in Recoll 1.18. diff --git a/website/faqsandhowtos/ZDevCaseAndDiacritics3.txt b/website/faqsandhowtos/ZDevCaseAndDiacritics3.txt new file mode 100644 index 00000000..32e0f664 --- /dev/null +++ b/website/faqsandhowtos/ZDevCaseAndDiacritics3.txt @@ -0,0 +1,67 @@ +== Character case and diacritic marks (3), implementation + +In previous pages, we discussed link:ZDevCaseAndDiacritics1.html[diacritics +and stemming], and an link:ZDevCaseAndDiacritics2.html[appropriate +interface] for switchable search sensitivity to diacritics and character +case.
+ +So you are in this mood again and you don't want to type accents (maybe you're +stuck with a QWERTY American English keyboard), or conversely you +want to resume looking for your résumé, and you've told Recoll as much, +using the appropriate interface. What happens then? + +The second case is easy if the index is raw, and mostly impossible if it is +stripped. So we'll concentrate on the first case: how to achieve case and +diacritics insensitivity on a raw index? + +Recoll uses three expansion tables: + +* The first table has stripped and lowercased terms as keys and raw terms as + data: +mate -> (mate, maté, MATE,...)+. + +* The second table has lowercased stems as keys and original lowercase terms + as data (when using multiple languages, there are several such tables): + +évit -> (éviter, évite, évitâmes, ...)+. + +* The third table has stripped and lowercased stems as keys and stripped + lowercased terms as data: + +evit -> (eviter, evite, evitons)+ and +evitam -> (evitames, ...)+. + +The first table can be used for full case and diacritics expansion or for +only one of those, by post-filtering the results of full expansion (e.g. if +we only want diacritics expansion, we filter by stripping diacritics from +each result term and check that it's identical to the input). For example, +if we have +mate -> (mate, maté, MATE, MATÉ)+ in the table and want to +only perform case expansion for an input of +maté+, we apply case folding +to the initial output and keep only +maté+, as +mate+ differs from the +input. + +We only perform stemming expansion when case and diacritics sensitivity is +off. It is performed using the second and third tables, both on the +lowercased and lowercased/stripped output of the first step, and each term +in the stemming output is expanded again for case (using the first table). + +A full example of the expansion occurring during an insensitive search +for +resume+ using French stemming on a mixed English/French index +follows.
An important thing to remember is that the result of each +expansion is a function of the terms actually present in the index, not +some arbitrary computation (and so, of course, many of the possible but +absent variations are missing). + +# The case and diacritics expansion of +resume+ yields +RESUME Resume + Résumé resumé résume résumé resume+ + +# The Stem expansion input list (lower-cased) is: + +resume resumé résume résumé+, and the output is: + +resum resume resumenes resumer resumes resumé resumée résum résumait + résumant résume résumer résumerai résumerait résumes résumez résumé résumée + résumées résumés+ + +# Each of the above terms is then fed to case and diacritics expansion (first + table), for the final output: + +resume résumé Résumé résumer résume Resume résumés RESUME resumes + resumer résumant resúmenes resumé résumait résumes résumée resumee + résumerait Résumez résumerai RÉSUMÉES Resumée Resumes résumées+. + +A Xapian OR query is finally constructed from the expanded term list. + diff --git a/website/faqsandhowtos/makeindex.sh b/website/faqsandhowtos/makeindex.sh new file mode 100644 index 00000000..f293b0c7 --- /dev/null +++ b/website/faqsandhowtos/makeindex.sh @@ -0,0 +1,20 @@ +#!/bin/sh +WIDX=WikiIndex.txt + +echo "== Recoll Wiki file index" > $WIDX +for f in *.txt; do + if test "$f" = $WIDX ; then continue; fi + h="`basename $f .txt`.html" + title=`head -1 "$f" | sed -e 's/=//g' -e 's/^ *//' -e 's/ *$//' -e 's/ //g'` + echo 'link:'$h'['$title']' >> $WIDX + echo >> $WIDX +done + +exit 0 +# Check and display what files are in the index but not in the contents table: + +grep \| FaqsAndHowTos.txt | awk -F\| '{print $1}' | sed -e 's/\* \[\[//' -e 's/.wiki//' |sort > ctfiles.tmp +grep '\[\[' WikiIndex.txt | awk -F\| '{print $1}' | sed -e 's/\[\[//' -e 's/.wiki//' -e 's/.md//' | sort > ixfiles.tmp +echo 'diff ContentFiles IndexFiles:' +diff ctfiles.tmp ixfiles.tmp +rm ctfiles.tmp ixfiles.tmp