<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta name="generator" content=
"HTML Tidy for HTML5 for Linux version 5.6.0">
<meta http-equiv="Content-Type" content=
"text/html; charset=utf-8">
<title>Recoll user manual</title>
<link rel="stylesheet" type="text/css" href="docbook-xsl.css">
<meta name="generator" content="DocBook XSL Stylesheets V1.79.1">
<meta name="description" content=
"Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.3 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license can be found at the following location: GNU web site. This document introduces full text search notions and describes the installation and use of the Recoll application. This version describes Recoll 1.29.">
</head>
|
|
<body bgcolor="white" text="black" link="#0000FF" vlink="#840084"
|
|
alink="#0000FF">
|
|
<div lang="en" class="book">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h1 class="title"><a name="idm1" id="idm1"></a>Recoll
|
|
user manual</h1>
|
|
</div>
|
|
<div>
|
|
<div class="author">
|
|
<h3 class="author"><span class=
|
|
"firstname">Jean-Francois</span> <span class=
|
|
"surname">Dockes</span></h3>
|
|
<div class="affiliation">
|
|
<div class="address">
|
|
<p><code class="email">&lt;<a class="email" href=
"mailto:jfd@recoll.org">jfd@recoll.org</a>&gt;</code></p>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<div>
|
|
<p class="copyright">Copyright © 2005-2020 Jean-Francois
|
|
Dockes</p>
|
|
</div>
|
|
<div>
|
|
<div class="abstract">
|
|
<p><code class="literal">Permission is granted to copy,
|
|
distribute and/or modify this document under the terms
|
|
of the GNU Free Documentation License, Version 1.3 or
|
|
any later version published by the Free Software
|
|
Foundation; with no Invariant Sections, no Front-Cover
|
|
Texts, and no Back-Cover Texts. A copy of the license
|
|
can be found at the following location: <a class=
|
|
"ulink" href="http://www.gnu.org/licenses/fdl.html"
|
|
target="_top">GNU web site</a>.</code></p>
|
|
<p>This document introduces full text search notions
|
|
and describes the installation and use of the
|
|
<span class="application">Recoll</span> application.
|
|
This version describes <span class=
|
|
"application">Recoll</span> 1.29.</p>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<hr>
|
|
</div>
|
|
<div class="toc">
|
|
<p><b>Table of Contents</b></p>
|
|
<dl class="toc">
|
|
<dt><span class="chapter">1. <a href=
|
|
"#RCL.INTRODUCTION">Introduction</a></span></dt>
|
|
<dd>
|
|
<dl>
|
|
<dt><span class="sect1">1.1. <a href=
|
|
"#RCL.INTRODUCTION.TRYIT">Giving it a
|
|
try</a></span></dt>
|
|
<dt><span class="sect1">1.2. <a href=
|
|
"#RCL.INTRODUCTION.SEARCH">Full text
|
|
search</a></span></dt>
|
|
<dt><span class="sect1">1.3. <a href=
|
|
"#RCL.INTRODUCTION.RECOLL">Recoll
|
|
overview</a></span></dt>
|
|
</dl>
|
|
</dd>
|
|
<dt><span class="chapter">2. <a href=
|
|
"#RCL.INDEXING">Indexing</a></span></dt>
|
|
<dd>
|
|
<dl>
|
|
<dt><span class="sect1">2.1. <a href=
|
|
"#RCL.INDEXING.INTRODUCTION">Introduction</a></span></dt>
|
|
<dd>
|
|
<dl>
|
|
<dt><span class="sect2">2.1.1. <a href=
|
|
"#RCL.INDEXING.INTRODUCTION.MODES">Indexing
|
|
modes</a></span></dt>
|
|
<dt><span class="sect2">2.1.2. <a href=
|
|
"#RCL.INDEXING.INTRODUCTION.CONFIG">Configurations,
|
|
multiple indexes</a></span></dt>
|
|
<dt><span class="sect2">2.1.3. <a href=
|
|
"#idm235">Document types</a></span></dt>
|
|
<dt><span class="sect2">2.1.4. <a href=
|
|
"#idm284">Indexing failures</a></span></dt>
|
|
<dt><span class="sect2">2.1.5. <a href=
|
|
"#idm294">Recovery</a></span></dt>
|
|
</dl>
|
|
</dd>
|
|
<dt><span class="sect1">2.2. <a href=
|
|
"#RCL.INDEXING.STORAGE">Index storage</a></span></dt>
|
|
<dd>
|
|
<dl>
|
|
<dt><span class="sect2">2.2.1. <a href=
|
|
"#RCL.INDEXING.STORAGE.FORMAT"><span class=
|
|
"application">Xapian</span> index
|
|
formats</a></span></dt>
|
|
<dt><span class="sect2">2.2.2. <a href=
|
|
"#RCL.INDEXING.STORAGE.SECURITY">Security
|
|
aspects</a></span></dt>
|
|
<dt><span class="sect2">2.2.3. <a href=
|
|
"#RCL.INDEXING.STORAGE.BIG">Special considerations
|
|
for big indexes</a></span></dt>
|
|
</dl>
|
|
</dd>
|
|
<dt><span class="sect1">2.3. <a href=
|
|
"#RCL.INDEXING.CONFIG">Index
|
|
configuration</a></span></dt>
|
|
<dd>
|
|
<dl>
|
|
<dt><span class="sect2">2.3.1. <a href=
|
|
"#RCL.INDEXING.CONFIG.MULTIPLE">Multiple
|
|
indexes</a></span></dt>
|
|
<dt><span class="sect2">2.3.2. <a href=
|
|
"#RCL.INDEXING.CONFIG.SENS">Index case and
|
|
diacritics sensitivity</a></span></dt>
|
|
<dt><span class="sect2">2.3.3. <a href=
|
|
"#RCL.INDEXING.CONFIG.THREADS">Indexing threads
|
|
configuration (<span class=
|
|
"application">Unix</span>-like
|
|
systems)</a></span></dt>
|
|
<dt><span class="sect2">2.3.4. <a href=
|
|
"#RCL.INDEXING.CONFIG.GUI">The index configuration
|
|
GUI</a></span></dt>
|
|
</dl>
|
|
</dd>
|
|
<dt><span class="sect1">2.4. <a href=
|
|
"#RCL.INDEXING.REMOVABLE">Removable
|
|
volumes</a></span></dt>
|
|
<dt><span class="sect1">2.5. <a href=
|
|
"#RCL.INDEXING.WebQUEUE"><span class=
|
|
"application">Unix</span>-like systems: indexing
|
|
visited Web pages</a></span></dt>
|
|
<dt><span class="sect1">2.6. <a href=
|
|
"#RCL.INDEXING.EXTATTR"><span class=
|
|
"application">Unix</span>-like systems: using extended
|
|
attributes</a></span></dt>
|
|
<dt><span class="sect1">2.7. <a href=
|
|
"#RCL.INDEXING.EXTTAGS"><span class=
|
|
"application">Unix</span>-like systems: importing
|
|
external tags</a></span></dt>
|
|
<dt><span class="sect1">2.8. <a href=
|
|
"#RCL.INDEXING.PDF">The PDF input
|
|
handler</a></span></dt>
|
|
<dd>
|
|
<dl>
|
|
<dt><span class="sect2">2.8.1. <a href=
|
|
"#RCL.INDEXING.PDF.XMP">XMP fields
|
|
extraction</a></span></dt>
|
|
<dt><span class="sect2">2.8.2. <a href=
|
|
"#RCL.INDEXING.PDF.ATTACH">PDF attachment
|
|
indexing</a></span></dt>
|
|
</dl>
|
|
</dd>
|
|
<dt><span class="sect1">2.9. <a href=
|
|
"#RCL.INDEXING.OCR">Recoll and OCR</a></span></dt>
|
|
<dt><span class="sect1">2.10. <a href=
|
|
"#RCL.INDEXING.PERIODIC">Periodic
|
|
indexing</a></span></dt>
|
|
<dt><span class="sect1">2.11. <a href=
|
|
"#RCL.INDEXING.MONITOR"><span class=
|
|
"application">Unix</span>-like systems: real time
|
|
indexing</a></span></dt>
|
|
</dl>
|
|
</dd>
|
|
<dt><span class="chapter">3. <a href=
|
|
"#RCL.SEARCH">Searching</a></span></dt>
|
|
<dd>
|
|
<dl>
|
|
<dt><span class="sect1">3.1. <a href=
|
|
"#RCL.SEARCH.INTRODUCTION">Introduction</a></span></dt>
|
|
<dt><span class="sect1">3.2. <a href=
|
|
"#RCL.SEARCH.GUI">Searching with the Qt graphical user
|
|
interface</a></span></dt>
|
|
<dd>
|
|
<dl>
|
|
<dt><span class="sect2">3.2.1. <a href=
|
|
"#RCL.SEARCH.GUI.SIMPLE">Simple
|
|
search</a></span></dt>
|
|
<dt><span class="sect2">3.2.2. <a href=
|
|
"#RCL.SEARCH.GUI.RESLIST">The result
|
|
list</a></span></dt>
|
|
<dt><span class="sect2">3.2.3. <a href=
|
|
"#RCL.SEARCH.GUI.RESTABLE">The result
|
|
table</a></span></dt>
|
|
<dt><span class="sect2">3.2.4. <a href=
|
|
"#RCL.SEARCH.GUI.RUNSCRIPT"><span class=
|
|
"application">Unix</span>-like systems: running
|
|
arbitrary commands on result files</a></span></dt>
|
|
<dt><span class="sect2">3.2.5. <a href=
|
|
"#RCL.SEARCH.GUI.THUMBNAILS"><span class=
|
|
"application">Unix</span>-like systems: displaying
|
|
thumbnails</a></span></dt>
|
|
<dt><span class="sect2">3.2.6. <a href=
|
|
"#RCL.SEARCH.GUI.PREVIEW">The preview
|
|
window</a></span></dt>
|
|
<dt><span class="sect2">3.2.7. <a href=
|
|
"#RCL.SEARCH.GUI.FRAGBUTS">The Query Fragments
|
|
window</a></span></dt>
|
|
<dt><span class="sect2">3.2.8. <a href=
|
|
"#RCL.SEARCH.GUI.COMPLEX">Complex/advanced
|
|
search</a></span></dt>
|
|
<dt><span class="sect2">3.2.9. <a href=
|
|
"#RCL.SEARCH.GUI.TERMEXPLORER">The term explorer
|
|
tool</a></span></dt>
|
|
<dt><span class="sect2">3.2.10. <a href=
|
|
"#RCL.SEARCH.GUI.MULTIDB">Multiple
|
|
indexes</a></span></dt>
|
|
<dt><span class="sect2">3.2.11. <a href=
|
|
"#RCL.SEARCH.GUI.HISTORY">Document
|
|
history</a></span></dt>
|
|
<dt><span class="sect2">3.2.12. <a href=
|
|
"#RCL.SEARCH.GUI.SORT">Sorting search results and
|
|
collapsing duplicates</a></span></dt>
|
|
<dt><span class="sect2">3.2.13. <a href=
|
|
"#RCL.SEARCH.GUI.SHORTCUTS">Keyboard
|
|
shortcuts</a></span></dt>
|
|
<dt><span class="sect2">3.2.14. <a href=
|
|
"#RCL.SEARCH.GUI.TIPS">Search tips</a></span></dt>
|
|
<dt><span class="sect2">3.2.15. <a href=
|
|
"#RCL.SEARCH.SAVING">Saving and restoring queries
|
|
(1.21 and later)</a></span></dt>
|
|
<dt><span class="sect2">3.2.16. <a href=
|
|
"#RCL.SEARCH.GUI.CUSTOM">Customizing the search
|
|
interface</a></span></dt>
|
|
</dl>
|
|
</dd>
|
|
<dt><span class="sect1">3.3. <a href=
|
|
"#RCL.SEARCH.KIO">Searching with the KDE KIO
|
|
slave</a></span></dt>
|
|
<dd>
|
|
<dl>
|
|
<dt><span class="sect2">3.3.1. <a href=
|
|
"#RCL.SEARCH.KIO.INTRO">What's this</a></span></dt>
|
|
<dt><span class="sect2">3.3.2. <a href=
|
|
"#RCL.SEARCH.KIO.SEARCHABLEDOCS">Searchable
|
|
documents</a></span></dt>
|
|
</dl>
|
|
</dd>
|
|
<dt><span class="sect1">3.4. <a href=
|
|
"#RCL.SEARCH.COMMANDLINE">Searching on the command
|
|
line</a></span></dt>
|
|
<dt><span class="sect1">3.5. <a href=
|
|
"#RCL.SEARCH.LANG">The query language</a></span></dt>
|
|
<dd>
|
|
<dl>
|
|
<dt><span class="sect2">3.5.1. <a href=
|
|
"#RCL.SEARCH.LANG.RANGES">Range
|
|
clauses</a></span></dt>
|
|
<dt><span class="sect2">3.5.2. <a href=
|
|
"#RCL.SEARCH.LANG.MODIFIERS">Modifiers</a></span></dt>
|
|
</dl>
|
|
</dd>
|
|
<dt><span class="sect1">3.6. <a href=
|
|
"#RCL.SEARCH.ANCHORWILD">Anchored searches and
|
|
wildcards</a></span></dt>
|
|
<dd>
|
|
<dl>
|
|
<dt><span class="sect2">3.6.1. <a href=
|
|
"#RCL.SEARCH.WILDCARDS">More about
|
|
wildcards</a></span></dt>
|
|
<dt><span class="sect2">3.6.2. <a href=
|
|
"#RCL.SEARCH.ANCHOR">Anchored
|
|
searches</a></span></dt>
|
|
</dl>
|
|
</dd>
|
|
<dt><span class="sect1">3.7. <a href=
|
|
"#RCL.SEARCH.SYNONYMS">Using Synonyms
|
|
(1.22)</a></span></dt>
|
|
<dt><span class="sect1">3.8. <a href=
|
|
"#RCL.SEARCH.PTRANS">Path translations</a></span></dt>
|
|
<dt><span class="sect1">3.9. <a href=
|
|
"#RCL.SEARCH.CASEDIAC">Search case and diacritics
|
|
sensitivity</a></span></dt>
|
|
<dt><span class="sect1">3.10. <a href=
|
|
"#RCL.SEARCH.DESKTOP">Desktop
|
|
integration</a></span></dt>
|
|
<dd>
|
|
<dl>
|
|
<dt><span class="sect2">3.10.1. <a href=
|
|
"#RCL.SEARCH.SHORTCUT">Hotkeying
|
|
recoll</a></span></dt>
|
|
<dt><span class="sect2">3.10.2. <a href=
|
|
"#RCL.KICKER-APPLET">The KDE Kicker Recoll
|
|
applet</a></span></dt>
|
|
</dl>
|
|
</dd>
|
|
</dl>
|
|
</dd>
|
|
<dt><span class="chapter">4. <a href=
|
|
"#RCL.PROGRAM">Programming interface</a></span></dt>
|
|
<dd>
|
|
<dl>
|
|
<dt><span class="sect1">4.1. <a href=
|
|
"#RCL.PROGRAM.FILTERS">Writing a document input
|
|
handler</a></span></dt>
|
|
<dd>
|
|
<dl>
|
|
<dt><span class="sect2">4.1.1. <a href=
|
|
"#RCL.PROGRAM.FILTERS.SIMPLE">Simple input
|
|
handlers</a></span></dt>
|
|
<dt><span class="sect2">4.1.2. <a href=
|
|
"#RCL.PROGRAM.FILTERS.MULTIPLE">"Multiple"
|
|
handlers</a></span></dt>
|
|
<dt><span class="sect2">4.1.3. <a href=
|
|
"#RCL.PROGRAM.FILTERS.ASSOCIATION">Telling
|
|
<span class="application">Recoll</span> about the
|
|
handler</a></span></dt>
|
|
<dt><span class="sect2">4.1.4. <a href=
|
|
"#RCL.PROGRAM.FILTERS.HTML">Input handler
|
|
output</a></span></dt>
|
|
<dt><span class="sect2">4.1.5. <a href=
|
|
"#RCL.PROGRAM.FILTERS.PAGES">Page
|
|
numbers</a></span></dt>
|
|
</dl>
|
|
</dd>
|
|
<dt><span class="sect1">4.2. <a href=
|
|
"#RCL.PROGRAM.FIELDS">Field data
|
|
processing</a></span></dt>
|
|
<dt><span class="sect1">4.3. <a href=
|
|
"#RCL.PROGRAM.PYTHONAPI">Python API</a></span></dt>
|
|
<dd>
|
|
<dl>
|
|
<dt><span class="sect2">4.3.1. <a href=
|
|
"#RCL.PROGRAM.PYTHONAPI.INTRO">Introduction</a></span></dt>
|
|
<dt><span class="sect2">4.3.2. <a href=
|
|
"#RCL.PROGRAM.PYTHONAPI.ELEMENTS">Interface
|
|
elements</a></span></dt>
|
|
<dt><span class="sect2">4.3.3. <a href=
|
|
"#RCL.PROGRAM.PYTHONAPI.LOG">Log messages for
|
|
Python scripts</a></span></dt>
|
|
<dt><span class="sect2">4.3.4. <a href=
|
|
"#RCL.PROGRAM.PYTHONAPI.SEARCH">Python search
|
|
interface</a></span></dt>
|
|
<dt><span class="sect2">4.3.5. <a href=
|
|
"#RCL.PROGRAM.PYTHONAPI.UPDATE">Creating Python
|
|
external indexers</a></span></dt>
|
|
<dt><span class="sect2">4.3.6. <a href=
|
|
"#RCL.PROGRAM.PYTHONAPI.COMPAT">Package
|
|
compatibility with the previous
|
|
version</a></span></dt>
|
|
</dl>
|
|
</dd>
|
|
</dl>
|
|
</dd>
|
|
<dt><span class="chapter">5. <a href=
|
|
"#RCL.INSTALL">Installation and
|
|
configuration</a></span></dt>
|
|
<dd>
|
|
<dl>
|
|
<dt><span class="sect1">5.1. <a href=
|
|
"#RCL.INSTALL.BINARY">Installing a binary
|
|
copy</a></span></dt>
|
|
<dt><span class="sect1">5.2. <a href=
|
|
"#RCL.INSTALL.EXTERNAL">Supporting
|
|
packages</a></span></dt>
|
|
<dt><span class="sect1">5.3. <a href=
|
|
"#RCL.INSTALL.BUILDING">Building from
|
|
source</a></span></dt>
|
|
<dd>
|
|
<dl>
|
|
<dt><span class="sect2">5.3.1. <a href=
|
|
"#RCL.INSTALL.BUILDING.PREREQS">Prerequisites</a></span></dt>
|
|
<dt><span class="sect2">5.3.2. <a href=
|
|
"#RCL.INSTALL.BUILDING.BUILDING">Building</a></span></dt>
|
|
<dt><span class="sect2">5.3.3. <a href=
|
|
"#RCL.INSTALL.BUILDING.INSTALL">Installing</a></span></dt>
|
|
<dt><span class="sect2">5.3.4. <a href=
|
|
"#RCL.INSTALL.BUILDING.PYTHON">Python API
|
|
package</a></span></dt>
|
|
<dt><span class="sect2">5.3.5. <a href=
|
|
"#RCL.INSTALL.BUILDING.SOLARIS">Building on
|
|
Solaris</a></span></dt>
|
|
</dl>
|
|
</dd>
|
|
<dt><span class="sect1">5.4. <a href=
|
|
"#RCL.INSTALL.CONFIG">Configuration
|
|
overview</a></span></dt>
|
|
<dd>
|
|
<dl>
|
|
<dt><span class="sect2">5.4.1. <a href=
|
|
"#RCL.INSTALL.CONFIG.ENVIR">Environment
|
|
variables</a></span></dt>
|
|
<dt><span class="sect2">5.4.2. <a href=
|
|
"#RCL.INSTALL.CONFIG.RECOLLCONF">Recoll main
|
|
configuration file, recoll.conf</a></span></dt>
|
|
<dt><span class="sect2">5.4.3. <a href=
|
|
"#RCL.INSTALL.CONFIG.FIELDS">The fields
|
|
file</a></span></dt>
|
|
<dt><span class="sect2">5.4.4. <a href=
|
|
"#RCL.INSTALL.CONFIG.MIMEMAP">The mimemap
|
|
file</a></span></dt>
|
|
<dt><span class="sect2">5.4.5. <a href=
|
|
"#RCL.INSTALL.CONFIG.MIMECONF">The mimeconf
|
|
file</a></span></dt>
|
|
<dt><span class="sect2">5.4.6. <a href=
|
|
"#RCL.INSTALL.CONFIG.MIMEVIEW">The mimeview
|
|
file</a></span></dt>
|
|
<dt><span class="sect2">5.4.7. <a href=
|
|
"#RCL.INSTALL.CONFIG.PTRANS">The <code class=
|
|
"filename">ptrans</code> file</a></span></dt>
|
|
<dt><span class="sect2">5.4.8. <a href=
|
|
"#RCL.INSTALL.CONFIG.EXAMPLES">Examples of
|
|
configuration adjustments</a></span></dt>
|
|
</dl>
|
|
</dd>
|
|
</dl>
|
|
</dd>
|
|
</dl>
|
|
</div>
|
|
<div class="list-of-tables">
|
|
<p><b>List of Tables</b></p>
|
|
<dl>
|
|
<dt>3.1. <a href="#idm1465">Keyboard shortcuts</a></dt>
|
|
</dl>
|
|
</div>
|
|
<div class="chapter">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h1 class="title"><a name="RCL.INTRODUCTION" id=
|
|
"RCL.INTRODUCTION"></a>Chapter 1. Introduction</h1>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>This document introduces full text search notions and
|
|
describes the installation and use of the <span class=
|
|
"application">Recoll</span> application. It is updated for
|
|
<span class="application">Recoll</span> 1.29.</p>
|
|
<p><span class="application">Recoll</span> was for a long
time dedicated to Unix-like systems. It was only ported to
<span class="application">MS-Windows</span> in 2015.
Many references in this manual, especially file locations,
are specific to Unix and not valid on <span class=
"application">Windows</span>, where some of the described
features are also not available. The manual will be progressively
updated. Until this happens, on <span class=
"application">Windows</span>, most references to shared files
can be translated by looking under the Recoll installation
directory (typically
<code class="filename">C:/Program Files (x86)/Recoll</code>).
In particular, anything referenced in <code class=
"filename">/usr/share</code> in this document will be found
in the <code class="filename">Share</code> subdirectory.
The user configuration is stored by default under
<code class="filename">AppData/Local/Recoll</code> inside the
user directory, along with the index itself.</p>
|
|
<div class="sect1">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h2 class="title" style="clear: both"><a name=
|
|
"RCL.INTRODUCTION.TRYIT" id=
|
|
"RCL.INTRODUCTION.TRYIT"></a>1.1. Giving it a
|
|
try</h2>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>If you do not like reading manuals (who does?) but wish
|
|
to give <span class="application">Recoll</span> a try, just
|
|
<a class="link" href="#RCL.INSTALL.BINARY" title=
|
|
"5.1. Installing a binary copy">install</a> the
|
|
application and start the <span class=
|
|
"command"><strong>recoll</strong></span> graphical user
|
|
interface (GUI), which will ask permission to index your
|
|
home directory, allowing you to search immediately after
|
|
indexing completes.</p>
|
|
<p>Do not do this if your home directory contains a huge
|
|
number of documents and you do not want to wait or are very
|
|
short on disk space. In this case, you may first want to
|
|
customize the <a class="link" href="#RCL.INDEXING.CONFIG"
|
|
title="2.3. Index configuration">configuration</a> to
|
|
restrict the indexed area (shortcut: from the <span class=
|
|
"command"><strong>recoll</strong></span> GUI go to:
|
|
<span class="guimenu">Preferences</span> → <span class=
|
|
"guimenuitem">Indexing configuration</span>, then adjust
|
|
the <span class="guilabel">Top directories</span>
|
|
section).</p>
|
|
<p>On <span class="application">Unix</span>-like systems,
|
|
you may need to install the appropriate <a class="link"
|
|
href="#RCL.INSTALL.EXTERNAL" title=
|
|
"5.2. Supporting packages">supporting applications</a>
|
|
for document types that need them (for example <span class=
|
|
"application">antiword</span> for <span class=
|
|
"application">Microsoft Word</span> files). The
|
|
<span class="application">Recoll</span> for <span class=
|
|
"application">Windows</span> package is self-contained and
|
|
includes most useful auxiliary programs.</p>
|
|
</div>
|
|
<div class="sect1">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h2 class="title" style="clear: both"><a name=
|
|
"RCL.INTRODUCTION.SEARCH" id=
|
|
"RCL.INTRODUCTION.SEARCH"></a>1.2. Full text
|
|
search</h2>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p><span class="application">Recoll</span> is a full text
|
|
search application, which means that it finds your data by
|
|
content rather than by external attributes (like the file
|
|
name). You specify words (terms) which should or should not
|
|
appear in the text you are looking for, and receive in
|
|
return a list of matching documents, ordered so that the
|
|
most <span class="emphasis"><em>relevant</em></span>
|
|
documents will appear first.</p>
|
|
<p>You do not need to remember in what file or email
|
|
message you stored a given piece of information. You just
|
|
ask for related terms, and the tool will return a list of
|
|
documents where these terms are prominent, in a similar way
|
|
to Internet search engines.</p>
|
|
<p>Full text search applications try to determine which
|
|
documents are most relevant to the search terms you
|
|
provide. Computer algorithms for determining relevance can
|
|
be very complex, and in general are inferior to the power
|
|
of the human mind to rapidly determine relevance. The
|
|
quality of relevance guessing is probably the most
|
|
important aspect when evaluating a search application.
|
|
<span class="application">Recoll</span> relies on the
|
|
<span class="application">Xapian</span> probabilistic
|
|
information retrieval library to determine relevance.</p>
|
|
<p>In many cases, you are looking for all the forms of a
|
|
word, including plurals, different tenses for a verb, or
|
|
terms derived from the same root or <span class=
|
|
"emphasis"><em>stem</em></span> (example: <em class=
|
|
"replaceable"><code>floor, floors, floored,
|
|
flooring...</code></em>). Queries are usually automatically
|
|
expanded to all such related terms (words that reduce to
|
|
the same stem). This expansion can be disabled when searching
for a specific form.</p>
|
|
<p>Stemming, by itself, does not accommodate
|
|
misspellings or phonetic searches. A full text search
|
|
application may also support this form of approximation.
|
|
For example, a search for <em class=
|
|
"replaceable"><code>aliterattion</code></em> returning no
|
|
result might propose <em class=
|
|
"replaceable"><code>alliteration, alteration, alterations,
|
|
or altercation</code></em> as possible replacement terms.
|
|
<span class="application">Recoll</span> bases its
|
|
suggestions on the actual index contents, so that
|
|
suggestions may be made for words which would not appear in
|
|
a standard dictionary.</p>
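<p>The idea of drawing suggestions from the index vocabulary rather
than from a dictionary can be sketched as follows. This is only an
illustration: it uses the similarity matching of the standard Python
<code class="literal">difflib</code> module as a stand-in for a real
speller, and the function name <code class="literal">suggest_terms</code>
and the sample term set are assumptions made up for the example.</p>

```python
import difflib


def suggest_terms(misspelled, index_terms, limit=4):
    """Return up to `limit` index terms that look similar to `misspelled`.

    `index_terms` stands for the set of terms actually present in the
    index (hypothetical data here), so a suggestion can be any indexed
    word, whether or not a dictionary knows it.
    """
    candidates = sorted(index_terms)
    # get_close_matches ranks candidates by similarity ratio and drops
    # those scoring below the cutoff.
    return difflib.get_close_matches(
        misspelled.lower(), candidates, n=limit, cutoff=0.6
    )


# Hypothetical index vocabulary, for illustration only.
index_terms = {"alliteration", "alteration", "alterations",
               "altercation", "xapian"}
print(suggest_terms("aliterattion", index_terms))
```

<p>Because the candidate list is the index vocabulary itself, a term
such as a project name can be suggested as soon as it appears in an
indexed document.</p>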
|
|
</div>
|
|
<div class="sect1">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h2 class="title" style="clear: both"><a name=
|
|
"RCL.INTRODUCTION.RECOLL" id=
|
|
"RCL.INTRODUCTION.RECOLL"></a>1.3. Recoll
|
|
overview</h2>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p><span class="application">Recoll</span> uses the
|
|
<a class="ulink" href="http://www.xapian.org" target=
|
|
"_top"><span class="application">Xapian</span></a>
|
|
information retrieval library as its storage and retrieval
|
|
engine. <span class="application">Xapian</span> is a very
|
|
mature package using <a class="ulink" href=
|
|
"http://www.xapian.org/docs/intro_ir.html" target="_top">a
|
|
sophisticated probabilistic ranking model</a>.</p>
|
|
<p>The <span class="application">Xapian</span> library
|
|
manages an index database which describes where terms
|
|
appear in your document files. It efficiently processes the
|
|
complex queries which are produced by the <span class=
|
|
"application">Recoll</span> query expansion mechanism, and
|
|
is in charge of the all-important relevance computation
|
|
task.</p>
|
|
<p><span class="application">Recoll</span> provides the
|
|
mechanisms and interface to get data into and out of the
|
|
index. This includes translating the many possible document
|
|
formats into pure text, handling term variations (using
|
|
<span class="application">Xapian</span> stemmers), and
|
|
spelling approximations (using the <span class=
|
|
"application">aspell</span> speller), interpreting user
|
|
queries and presenting results.</p>
|
|
<p>In short, <span class=
"application">Recoll</span> does the dirty footwork, while
<span class="application">Xapian</span> deals with the
intelligent parts of the process.</p>
|
|
<p>The <span class="application">Xapian</span> index can be
|
|
big (roughly the size of the original document set), but it
|
|
is not a document archive. <span class=
|
|
"application">Recoll</span> can only display documents that
|
|
still exist at the place from which they were indexed.</p>
|
|
<p><span class="application">Recoll</span> stores all
|
|
internal data in <span class="application">Unicode
|
|
UTF-8</span> format, and it can index many types of files
|
|
with different character sets, encodings, and languages
|
|
into the same index. It can process documents embedded
|
|
inside other documents (for example a PDF document stored
|
|
inside a Zip archive sent as an email attachment...), down
|
|
to an arbitrary depth.</p>
|
|
<p>Stemming is the process by which <span class=
|
|
"application">Recoll</span> reduces words to their radicals
|
|
so that searching does not depend, for example, on a word
|
|
being singular or plural (floor, floors), or on a verb
|
|
tense (flooring, floored). Because the mechanisms used for
|
|
stemming depend on the specific grammatical rules for each
|
|
language, there is a separate <span class=
|
|
"application">Xapian</span> stemmer module for most common
|
|
languages where stemming makes sense.</p>
|
|
<p><span class="application">Recoll</span> stores the
|
|
unstemmed versions of terms in the main index and uses
|
|
auxiliary databases for term expansion (one for each
|
|
stemming language), which means that you can switch
|
|
stemming languages between searches, or add a language
|
|
without needing a full reindex.</p>
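<p>The separation between unstemmed index terms and an auxiliary
expansion database can be pictured with the minimal sketch below. It
is a toy model, not the actual implementation: the crude
suffix-stripping function <code class="literal">toy_stem</code> stands
in for a real stemmer, and all names and sample terms are assumptions
chosen for the example.</p>

```python
from collections import defaultdict


def toy_stem(word):
    """Very crude English stemmer used only for this illustration."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_stem_db(index_terms):
    """Build the auxiliary map: stem -> unstemmed terms found in the index."""
    stem_db = defaultdict(set)
    for term in index_terms:
        stem_db[toy_stem(term)].add(term)
    return stem_db


def expand_query_term(term, stem_db):
    """Expand a query term to every indexed term sharing its stem."""
    return sorted(stem_db.get(toy_stem(term), {term}))


# The main index keeps the unstemmed terms (hypothetical sample data).
index_terms = {"floor", "floors", "floored", "flooring", "flood", "door"}
stem_db = build_stem_db(index_terms)
print(expand_query_term("floors", stem_db))
```

<p>Since the main index is untouched, adding a language in this model
only means building one more auxiliary map from the existing terms,
which is why no full reindex is needed.</p>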
|
|
<p>Storing documents written in different languages in the
|
|
same index is possible, and commonly done. In this
|
|
situation, you can specify several stemming languages for
|
|
the index.</p>
|
|
<p><span class="application">Recoll</span> currently makes
|
|
no attempt at automatic language recognition, which means
|
|
that the stemmer will sometimes be applied to terms from
|
|
other languages with potentially strange results. In
|
|
practice, even if this introduces possibilities of
confusion, this approach has proven quite useful, and
|
|
it is much less cumbersome than separating your documents
|
|
according to what language they are written in.</p>
|
|
<p>By default, <span class="application">Recoll</span>
|
|
strips most accents and diacritics from terms, and converts
|
|
them to lower case before either storing them in the index
|
|
or searching for them. As a consequence, it is impossible
|
|
to search for a particular capitalization of a term
|
|
(<code class="literal">US</code> / <code class=
|
|
"literal">us</code>), or to discriminate two terms based on
|
|
diacritics (<code class="literal">sake</code> /
|
|
<code class="literal">saké</code>, <code class=
|
|
"literal">mate</code> / <code class=
|
|
"literal">maté</code>).</p>
|
|
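<p>The effect of this default normalization can be illustrated with a
short sketch. It only shows the general idea, using the standard Python
<code class="literal">unicodedata</code> module to drop all combining
marks; it is not the code used by the application, which strips
<span class="emphasis"><em>most</em></span> rather than all
diacritics. The function name
<code class="literal">normalize_term</code> is an assumption.</p>

```python
import unicodedata


def normalize_term(term):
    """Strip diacritics and fold case, as a simplified illustration."""
    # NFD splits an accented letter into a base letter followed by
    # combining marks, which can then be filtered out.
    decomposed = unicodedata.normalize("NFD", term)
    stripped = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return stripped.lower()


for pair in (("US", "us"), ("sake", "saké"), ("mate", "maté")):
    # Both members of each pair collapse to the same stored term.
    print(pair, normalize_term(pair[0]) == normalize_term(pair[1]))
```

<p>Once both forms map to the same term, the index can no longer tell
them apart, which is why a search cannot be restricted to one of
them in the default configuration.</p>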
<p><span class="application">Recoll</span> can optionally
|
|
store the raw terms, without accent stripping or case
|
|
conversion. In this configuration, default searches will
|
|
behave as before, but it is possible to perform searches
|
|
sensitive to case and diacritics. This is described in more
|
|
detail in the section about <a class="link" href=
|
|
"#RCL.INDEXING.CONFIG.SENS" title=
|
|
"2.3.2. Index case and diacritics sensitivity">index
|
|
case and diacritics sensitivity</a>.</p>
|
|
<p><span class="application">Recoll</span> uses many
|
|
parameters to define exactly what to index, and how to
|
|
classify and decode the source documents. These are kept in
|
|
<a class="link" href="#RCL.INDEXING.CONFIG" title=
|
|
"2.3. Index configuration">configuration files</a>. A
|
|
default configuration is copied into a standard location
|
|
(usually something like <code class=
|
|
"filename">/usr/share/recoll/examples</code>) during
|
|
installation. The default values set by the configuration
|
|
files in this directory may be overridden by values set
|
|
inside your personal configuration. With the default
|
|
configuration, <span class="application">Recoll</span> will
|
|
index your home directory with generic parameters. Most
|
|
common parameters can be set by using configuration menus
|
|
in the <span class="command"><strong>recoll</strong></span>
|
|
GUI. Some less common parameters can only be set by editing
|
|
the text files (the new values will be preserved by the
|
|
GUI).</p>
|
|
<p>The <a class="link" href="#RCL.INDEXING.PERIODIC.EXEC"
|
|
title="Running the indexer">indexing process</a> is started
|
|
automatically (after asking permission), the first time you
|
|
execute the <span class=
|
|
"command"><strong>recoll</strong></span> GUI. Indexing can
|
|
also be performed by executing the <span class=
|
|
"command"><strong>recollindex</strong></span> command.
|
|
<span class="application">Recoll</span> indexing is
|
|
multithreaded by default when appropriate hardware
|
|
resources are available, and can perform multiple tasks in
parallel for text extraction, segmentation and index
|
|
updates.</p>
|
|
<p><a class="link" href="#RCL.SEARCH" title=
|
|
"Chapter 3. Searching">Searches</a> are usually
|
|
performed inside the <span class=
|
|
"command"><strong>recoll</strong></span> GUI, which has
|
|
many options to help you find what you are looking for.
|
|
However, there are other ways to query the index:</p>
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style="list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p>A <a class="link" href="#RCL.SEARCH.COMMANDLINE"
|
|
title=
|
|
"3.4. Searching on the command line">command
|
|
line interface</a>.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>A <a class="link" href="#RCL.PROGRAM.PYTHONAPI"
|
|
title="4.3. Python API"><span class=
|
|
"application">Python</span> programming
|
|
interface</a>.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>A <a class="link" href="#RCL.SEARCH.KIO" title=
|
|
"3.3. Searching with the KDE KIO slave"><span class="application">
|
|
KDE</span> KIO slave module</a>.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>An Ubuntu Unity <a class="ulink" href=
|
|
"https://www.lesbonscomptes.com/recoll/pages/download.html"
|
|
target="_top">Scope</a> module.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>A Gnome Shell <a class="ulink" href=
|
|
"https://www.lesbonscomptes.com/recoll/pages/download.html"
|
|
target="_top">Search Provider</a>.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>A <a class="ulink" href=
|
|
"https://framagit.org/medoc92/recollwebui" target=
|
|
"_top">Web interface</a>.</p>
|
|
</li>
|
|
</ul>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<div class="chapter">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h1 class="title"><a name="RCL.INDEXING" id=
|
|
"RCL.INDEXING"></a>Chapter 2. Indexing</h1>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<div class="sect1">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h2 class="title" style="clear: both"><a name=
|
|
"RCL.INDEXING.INTRODUCTION" id=
|
|
"RCL.INDEXING.INTRODUCTION"></a>2.1. Introduction</h2>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>Indexing is the process by which the set of documents is
|
|
analyzed and the data entered into the database.
|
|
<span class="application">Recoll</span> indexing is
|
|
normally incremental: documents will only be processed if
|
|
they have been modified since the last run. On the first
|
|
execution, all documents will need processing. A full index
|
|
build can be forced later by specifying an option to the
|
|
indexing command (<span class=
|
|
"command"><strong>recollindex</strong></span> <code class=
|
|
"option">-z</code> or <code class="option">-Z</code>).</p>
|
|
<p><span class=
|
|
"command"><strong>recollindex</strong></span> skips files
|
|
which caused an error during a previous pass. This is a
|
|
performance optimization, and the command line option
|
|
<code class="option">-k</code> can be set to retry failed
|
|
files, for example after updating an input handler.</p>
|
|
<p>The following sections give an overview of different
|
|
aspects of the indexing processes and configuration, with
|
|
links to detailed sections.</p>
|
|
<p>Depending on your data, temporary files may be needed
|
|
during indexing, some of them possibly quite big. You can
|
|
use the <code class="envar">RECOLL_TMPDIR</code> or
|
|
<code class="envar">TMPDIR</code> environment variables to
|
|
determine where they are created (the default is to use
|
|
<code class="filename">/tmp</code>). Using <code class=
|
|
"envar">TMPDIR</code> has the nice property that it may
|
|
also be taken into account by auxiliary commands executed
|
|
by <span class=
|
|
"command"><strong>recollindex</strong></span>.</p>
|
|
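<p>As an illustration of the options and variables mentioned above,
the following sketch prepares a
<span class="command"><strong>recollindex</strong></span> invocation
from a Python script. It assumes that
<span class="command"><strong>recollindex</strong></span> is in the
<code class="envar">PATH</code>; the helper names and the
<code class="filename">/var/tmp</code> directory are hypothetical
choices made for the example. Only the <code class="option">-z</code>
and <code class="option">-k</code> options described in this section
are used.</p>

```python
import os
import subprocess


def recollindex_command(full_rebuild=False, retry_failed=False):
    """Build the recollindex argument list for the requested behaviour."""
    cmd = ["recollindex"]
    if full_rebuild:
        cmd.append("-z")  # force a full index build
    if retry_failed:
        cmd.append("-k")  # retry files which failed during a previous pass
    return cmd


def indexing_environment(tmpdir, base_env=None):
    """Return a copy of the environment with the temporary directory set."""
    env = dict(os.environ if base_env is None else base_env)
    env["RECOLL_TMPDIR"] = tmpdir
    # TMPDIR may also be honoured by auxiliary commands run by the indexer.
    env["TMPDIR"] = tmpdir
    return env


def run_indexer(tmpdir, **options):
    """Run the indexer (not called here; needs an installed recollindex)."""
    return subprocess.run(
        recollindex_command(**options),
        env=indexing_environment(tmpdir),
        check=True,
    )


print(recollindex_command(retry_failed=True))
```

<p>A call such as <code class="literal">run_indexer("/var/tmp",
retry_failed=True)</code> would then perform an incremental pass that
also retries previously failed files, with temporary files created
under the chosen directory.</p>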
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.INDEXING.INTRODUCTION.MODES" id=
|
|
"RCL.INDEXING.INTRODUCTION.MODES"></a>2.1.1. Indexing
|
|
modes</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p><span class="application">Recoll</span> indexing can
|
|
be performed in two main modes:</p>
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style=
|
|
"list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p><b><a class="link" href="#RCL.INDEXING.PERIODIC"
|
|
title="2.10. Periodic indexing">Periodic (or
|
|
batch) indexing</a> . </b><span class=
|
|
"command"><strong>recollindex</strong></span> is
|
|
executed at discrete times. On <span class=
|
|
"application">Unix</span>-like systems, the typical
|
|
usage is to have a nightly run <a class="link"
|
|
href="#RCL.INDEXING.PERIODIC.AUTOMAT" title=
|
|
"Linux: using cron to automate indexing">programmed</a>
|
|
into your <span class=
|
|
"command"><strong>cron</strong></span> file. On
|
|
<span class="application">Windows</span>, this is
|
|
the only mode available, and the Windows Task
|
|
Scheduler can be used to run indexing. In both
|
|
cases, the GUI includes an easy interface to the
|
|
system batch scheduler.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><b><a class="link" href="#RCL.INDEXING.MONITOR"
|
|
title=
|
|
"2.11. Unix-like systems: real time indexing">Real
|
|
time indexing</a> . </b>(Only available on
|
|
<span class="application">Unix</span>-like
|
|
systems). <span class=
|
|
"command"><strong>recollindex</strong></span> runs
|
|
permanently as a daemon and uses a file system
|
|
alteration monitor (e.g. <span class=
|
|
"application">inotify</span>) to detect file
|
|
changes. New or updated files are indexed at once.
|
|
Monitoring a big file system tree can consume
|
|
significant system resources.</p>
|
|
</li>
|
|
</ul>
|
|
</div>
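<p>As an illustration, the two modes might be started as follows.
The schedule is an arbitrary example, and the
<code class="option">-m</code> (monitor) option is taken from the
recollindex manual page, which should be checked for your
version:</p>
<pre class="programlisting">
# Periodic indexing: crontab entry running recollindex every night at 03:00
0 3 * * * recollindex

# Real time indexing: shell command starting recollindex as a monitoring daemon
recollindex -m
</pre>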
|
|
<div class="simplesect">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h4 class="title"><a name="idm190" id=
|
|
"idm190"></a><span class=
|
|
"application">Unix</span>-like systems: choosing
|
|
an indexing mode</h4>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>The choice between the two methods is mostly a
|
|
matter of preference, and they can be combined by
|
|
setting up multiple indexes (e.g. use periodic indexing
|
|
on a big documentation directory, and real time
|
|
indexing on a small home directory), or, with
|
|
<span class="application">Recoll</span> 1.24 and newer,
|
|
by <a class="link" href="#RCL.INDEXING.MONITOR" title=
|
|
"2.11. Unix-like systems: real time indexing">configuring
|
|
the index so that only a subset of the tree will be
|
|
monitored.</a></p>
|
|
<p>The choice of method and the parameters used can be
|
|
configured from the <span class=
|
|
"command"><strong>recoll</strong></span> GUI:
|
|
<span class="guimenu">Preferences</span> → <span class=
|
|
"guimenuitem">Indexing schedule</span> dialog.</p>
|
|
</div>
|
|
</div>
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.INDEXING.INTRODUCTION.CONFIG" id=
|
|
"RCL.INDEXING.INTRODUCTION.CONFIG"></a>2.1.2. Configurations,
|
|
multiple indexes</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p><span class="application">Recoll</span> supports
|
|
defining multiple indexes, each defined by its own
|
|
configuration directory. A configuration directory
|
|
contains <a class="link" href="#RCL.INDEXING.CONFIG"
|
|
title="2.3. Index configuration">several files</a>
|
|
which describe what should be indexed and how.</p>
|
|
<p>When <span class=
|
|
"command"><strong>recoll</strong></span> or <span class=
|
|
"command"><strong>recollindex</strong></span> is first
|
|
executed, it creates a default configuration directory.
|
|
This configuration is the one used for indexing and
|
|
querying when no specific configuration is specified. It
|
|
is located in <code class=
|
|
"filename">$HOME/.recoll/</code> for <span class=
|
|
"application">Unix</span>-like systems and <code class=
|
|
"filename">%LOCALAPPDATA%\Recoll</code> on <span class=
|
|
"application">Windows</span> (typically <code class=
|
|
"filename">C:\Users\[me]\Appdata\Local\Recoll</code>).</p>
|
|
<p>All configuration parameters have defaults, defined in
|
|
system-wide files. Without further customisation, the
|
|
default configuration will process your complete home
|
|
directory, with a reasonable set of defaults. It can be
|
|
adjusted to process a different area of the file system,
|
|
select files in different ways, and many other
|
|
things.</p>
|
|
<p>In some cases, it may be useful to create additional
|
|
configuration directories, for example, to separate
|
|
personal and shared indexes, or to take advantage of the
|
|
organization of your data to improve search
|
|
precision.</p>
|
|
<p>In order to do this, you would create an empty
|
|
directory in a location of your choice, and then instruct
|
|
<span class="command"><strong>recoll</strong></span> or
|
|
<span class="command"><strong>recollindex</strong></span>
|
|
to use it by setting either a command line option
|
|
(<code class="literal">-c</code> <em class=
|
|
"replaceable"><code>/some/directory</code></em>), or an
|
|
environment variable (<code class=
|
|
"envar">RECOLL_CONFDIR</code>=<em class=
|
|
"replaceable"><code>/some/directory</code></em>). Any
|
|
modification performed by the commands (e.g.
|
|
configuration customisation or searches by <span class=
|
|
"command"><strong>recoll</strong></span> or index
|
|
creation by <span class=
|
|
"command"><strong>recollindex</strong></span>) would then
|
|
apply to the new directory and not to the default
|
|
one.</p>
|
|
<p>Once multiple indexes are created, you can use each of
|
|
them separately by setting the <code class=
|
|
"literal">-c</code> option or the <code class=
|
|
"envar">RECOLL_CONFDIR</code> environment variable when
|
|
starting a command, to select the desired index.</p>
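<p>A short sketch of both ways of selecting a configuration, where
<em class="replaceable"><code>/some/directory</code></em> stands for
a directory of your choice:</p>
<pre class="programlisting">
# Create the (initially empty) configuration directory
mkdir /some/directory

# Select it with the command line option...
recollindex -c /some/directory

# ...or with the environment variable, here for a single GUI session
RECOLL_CONFDIR=/some/directory recoll
</pre>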
|
|
<p>It is also possible to instruct one configuration to
|
|
query one or several other indexes in addition to its
|
|
own, by using the <span class="guimenuitem">External
|
|
index</span> function in the <span class=
|
|
"command"><strong>recoll</strong></span> GUI, or some
|
|
other functions in the command line and programming
|
|
tools.</p>
|
|
<p>A plausible usage scenario for the multiple index
|
|
feature would be for a system administrator to set up a
|
|
central index for shared data, that you choose to search
|
|
or not in addition to your personal data. Of course,
|
|
there are other possibilities. For example, there are
|
|
many cases where you know the subset of files that should
|
|
be searched, and where narrowing the search can improve
|
|
the results. You can achieve approximately the same
|
|
effect with the directory filter in advanced search, but
|
|
multiple indexes may have better performance and may be
|
|
worth the trouble in some cases.</p>
|
|
<p>A more advanced use case would be to use multiple
|
|
indexes to improve indexing performance, by updating
|
|
several indexes in parallel (using multiple CPU cores and
|
|
disks, or possibly several machines), and then merging
|
|
them, or querying them in parallel.</p>
|
|
<p>See the section about <a class="link" href=
|
|
"#RCL.INDEXING.CONFIG.MULTIPLE" title=
|
|
"2.3.1. Multiple indexes">configuring multiple
|
|
indexes</a> for more details.</p>
|
|
</div>
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name="idm235" id=
|
|
"idm235"></a>2.1.3. Document types</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p><span class="application">Recoll</span> knows about
|
|
quite a few different document types. The parameters for
|
|
document types recognition and processing are set in
|
|
<a class="link" href="#RCL.INDEXING.CONFIG" title=
|
|
"2.3. Index configuration">configuration
|
|
files</a>.</p>
|
|
<p>Most file types, like HTML or word processing files,
|
|
only hold one document. Some file types, like email
|
|
folders or zip archives, can hold many individually
|
|
indexed documents, which may themselves be compound ones.
|
|
Such hierarchies can go quite deep, and <span class=
|
|
"application">Recoll</span> can process, for example, a
|
|
<span class="application">LibreOffice</span> document
|
|
stored as an attachment to an email message inside an
|
|
email folder archived in a zip file...</p>
|
|
<p><span class=
|
|
"command"><strong>recollindex</strong></span> processes
|
|
plain text, HTML, OpenDocument (Open/LibreOffice), email
|
|
formats, and a few others internally.</p>
|
|
<p>Other file types (e.g. PostScript, PDF, MS Word, RTF
|
|
...) need external applications for preprocessing. The
|
|
list is in the <a class="link" href=
|
|
"#RCL.INSTALL.EXTERNAL" title=
|
|
"5.2. Supporting packages">installation</a> section.
|
|
After every indexing operation, <span class=
|
|
"application">Recoll</span> updates a list of commands
|
|
that would be needed for indexing existing file types.
|
|
This list can be displayed by selecting the menu option
|
|
<span class="guimenu">File</span> → <span class=
|
|
"guimenuitem">Show Missing Helpers</span> in the
|
|
<span class="command"><strong>recoll</strong></span> GUI.
|
|
It is stored in the <code class="filename">missing</code>
|
|
text file inside the configuration directory.</p>
|
|
<p>After installing a missing handler, you may need to
|
|
tell <span class=
|
|
"command"><strong>recollindex</strong></span> to retry
|
|
the failed files, by adding option <code class=
|
|
"literal">-k</code> to the command line, or by using the
|
|
GUI <span class="guimenu">File</span> → <span class=
|
|
"guimenuitem">Special indexing</span> menu. This is
|
|
because <span class=
|
|
"command"><strong>recollindex</strong></span>, in its
|
|
default operation mode, will not retry files which caused
|
|
an error during an earlier pass. In special cases, it may
|
|
be useful to reset the data for a category of files
|
|
before indexing. See the <span class=
|
|
"command"><strong>recollindex</strong></span> manual
|
|
page. If your index is not too big, it may be simpler to
|
|
just reset it.</p>
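<p>A possible sequence after installing a helper, assuming the
default configuration directory on a Unix-like system:</p>
<pre class="programlisting">
# Show the helpers recorded as missing by the last indexing pass
cat "$HOME/.recoll/missing"

# After installing the relevant packages, retry the files which failed
recollindex -k
</pre>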
|
|
<p>By default, <span class="application">Recoll</span>
|
|
will try to index any file type that it has a way to
|
|
read. This is sometimes not desirable, and there are ways
|
|
to either exclude some types, or on the contrary define a
|
|
positive list of types to be indexed. In the latter case,
|
|
any type not in the list will be ignored.</p>
|
|
<p>Excluding files by name can be done by adding wildcard
|
|
name patterns to the <a class="link" href=
|
|
"#RCL.INSTALL.CONFIG.RECOLLCONF.SKIPPEDNAMES">skippedNames</a>
|
|
list, which can be done from the GUI Index configuration
|
|
menu. Excluding by type can be done by setting the
|
|
<a class="link" href=
|
|
"#RCL.INSTALL.CONFIG.RECOLLCONF.EXCLUDEDMIMETYPES">excludedmimetypes</a>
|
|
list in the configuration file (1.20 and later). This can
|
|
be redefined for subdirectories.</p>
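<p>For illustration only, an
<code class="literal">excludedmimetypes</code> setting in
<code class="filename">recoll.conf</code> could look like the
following (the MIME types and the directory are arbitrary
examples):</p>
<pre class="programlisting">
# Do not index PDF files anywhere
excludedmimetypes = application/pdf

# In this subtree, skip HTML files as well
[/path/to/my/dir]
excludedmimetypes = application/pdf text/html
</pre>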
|
|
<p>You can also define an exclusive list of MIME types to
|
|
be indexed (no others will be indexed), by setting the
|
|
<a class="link" href=
|
|
"#RCL.INSTALL.CONFIG.RECOLLCONF.INDEXEDMIMETYPES">indexedmimetypes</a>
|
|
configuration variable. Example:</p>
|
|
<pre class="programlisting">
indexedmimetypes = text/html application/pdf
</pre>
|
|
<p>It is possible to redefine this parameter for
|
|
subdirectories. Example:</p>
|
|
<pre class="programlisting">
[/path/to/my/dir]
indexedmimetypes = application/pdf
</pre>
|
|
<p>(When using sections like this, don't forget that they
|
|
remain in effect until the end of the file or another
|
|
section indicator).</p>
|
|
<p><code class="literal">excludedmimetypes</code> or
|
|
<code class="literal">indexedmimetypes</code> can be set
|
|
either by editing the <a class="link" href=
|
|
"#RCL.INSTALL.CONFIG.RECOLLCONF" title=
|
|
"5.4.2. Recoll main configuration file, recoll.conf">
|
|
configuration file (<code class=
|
|
"filename">recoll.conf</code>)</a> for the index, or by
|
|
using the GUI index configuration tool.</p>
|
|
<div class="note" style=
|
|
"margin-left: 0.5in; margin-right: 0.5in;">
|
|
<h3 class="title">Note about MIME types</h3>
|
|
<p>When editing the <code class=
|
|
"literal">indexedmimetypes</code> or <code class=
|
|
"literal">excludedmimetypes</code> lists, you should
|
|
use the MIME values listed in the <code class=
|
|
"filename">mimemap</code> file or in Recoll result
|
|
lists in preference to <code class="literal">file
|
|
-i</code> output: there are a number of differences.
|
|
The <code class="literal">file -i</code> output should
|
|
only be used for files without extensions, or for which
|
|
the extension is not listed in <code class=
"filename">mimemap</code>.</p>
|
|
</div>
|
|
</div>
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name="idm284" id=
|
|
"idm284"></a>2.1.4. Indexing failures</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>Indexing may fail for some documents, for a number of
|
|
reasons: a helper program may be missing, the document
|
|
may be corrupt, we may fail to uncompress a file because
|
|
no file system space is available, etc.</p>
|
|
<p>The <span class="application">Recoll</span> indexer in
|
|
versions 1.21 and later does not retry failed files by
|
|
default, because some indexing failures can be quite
|
|
costly (for example failing to uncompress a big file
|
|
because of insufficient disk space). Retrying will only
|
|
occur if an explicit option (<code class=
|
|
"option">-k</code>) is set on the <span class=
|
|
"command"><strong>recollindex</strong></span> command
|
|
line, or if a script executed when <span class=
|
|
"command"><strong>recollindex</strong></span> starts up
|
|
says so. The script is defined by a configuration
|
|
variable (<code class=
|
|
"literal">checkneedretryindexscript</code>), and makes a
|
|
rather crude attempt at deciding whether a helper command may
|
|
have been installed, by checking if any of the common
|
|
<code class="filename">bin</code> directories have
|
|
changed.</p>
|
|
</div>
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name="idm294" id=
|
|
"idm294"></a>2.1.5. Recovery</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>In the rare case where the index becomes corrupted
|
|
(which can show up as strange search results or
|
|
crashes), the index files need to be erased before
|
|
restarting a clean indexing pass. Just delete the
|
|
<code class="filename">xapiandb</code> directory (see
|
|
<a class="link" href="#RCL.INDEXING.STORAGE" title=
|
|
"2.2. Index storage">next section</a>), or,
|
|
alternatively, start the next <span class=
|
|
"command"><strong>recollindex</strong></span> with the
|
|
<code class="option">-z</code> option, which will reset
|
|
the database before indexing. The difference between the
|
|
two methods is that the second will not change the
|
|
current index format, which may be undesirable if a newer
|
|
format is supported by the <span class=
|
|
"application">Xapian</span> version.</p>
|
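<p>Sketched as shell commands, assuming the default index location
under <code class="filename">$HOME/.recoll</code>, the two methods
are:</p>
<pre class="programlisting">
# Method 1: delete the index data, then run a normal indexing pass
rm -rf "$HOME/.recoll/xapiandb"
recollindex

# Method 2: let recollindex reset the database before indexing
recollindex -z
</pre>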
|
</div>
|
|
</div>
|
|
<div class="sect1">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h2 class="title" style="clear: both"><a name=
|
|
"RCL.INDEXING.STORAGE" id=
|
|
"RCL.INDEXING.STORAGE"></a>2.2. Index
|
|
storage</h2>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>The default location for the index data is the
|
|
<code class="filename">xapiandb</code> subdirectory of the
|
|
<span class="application">Recoll</span> configuration
|
|
directory, typically <code class=
|
|
"filename">$HOME/.recoll/xapiandb/</code>. This can be
|
|
changed via two different methods (with different
|
|
purposes):</p>
|
|
<div class="orderedlist">
|
|
<ol class="orderedlist" type="1">
|
|
<li class="listitem">
|
|
<p>For a given configuration directory, you can
|
|
specify a non-default storage location for the index
|
|
by setting the <code class="varname">dbdir</code>
|
|
parameter in the configuration file (see the
|
|
<a class="link" href="#RCL.INSTALL.CONFIG.RECOLLCONF"
|
|
title=
|
|
"5.4.2. Recoll main configuration file, recoll.conf">
|
|
configuration section</a>). This method would mainly
|
|
be of use if you wanted to keep the configuration
|
|
directory in its default location, but desired
|
|
another location for the index, typically because of disk
usage or performance concerns.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>You can specify a different configuration
|
|
directory by setting the <code class=
|
|
"envar">RECOLL_CONFDIR</code> environment variable,
|
|
or using the <code class="option">-c</code> option to
|
|
the <span class="application">Recoll</span> commands.
|
|
This method would typically be used to index
|
|
different areas of the file system to different
|
|
indexes. For example, if you were to issue the
|
|
following command:</p>
|
|
<pre class="programlisting">recoll -c ~/.indexes-email</pre>
|
|
<p>Then <span class="application">Recoll</span> would
|
|
use configuration files stored in <code class=
|
|
"filename">~/.indexes-email/</code> and, (unless
|
|
specified otherwise in <code class=
|
|
"filename">recoll.conf</code>) would look for the
|
|
index in <code class=
|
|
"filename">~/.indexes-email/xapiandb/</code>.</p>
|
|
<p>Using multiple configuration directories and
|
|
<a class="link" href="#RCL.INSTALL.CONFIG.RECOLLCONF"
|
|
title=
|
|
"5.4.2. Recoll main configuration file, recoll.conf">
|
|
configuration options</a> allows you to tailor
|
|
multiple configurations and indexes to handle
|
|
whatever subset of the available data you wish to
|
|
make searchable.</p>
|
|
</li>
|
|
</ol>
|
|
</div>
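<p>As an example of the first method, a
<code class="filename">recoll.conf</code> entry along these lines
would keep the configuration directory in place and move only the
index data (the path is a placeholder):</p>
<pre class="programlisting">
# Store the index data outside the configuration directory
dbdir = /path/to/fast/disk/xapiandb
</pre>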
|
|
<p>The size of the index is determined by the size of the
|
|
set of documents, but the ratio can vary a lot. For a
|
|
typical mixed set of documents, the index size will often
|
|
be close to the data set size. In specific cases (a set of
|
|
compressed mbox files for example), the index can become
|
|
much bigger than the documents. It may also be much smaller
|
|
if the documents contain a lot of images or other
|
|
non-indexed data (an extreme example being a set of mp3
|
|
files where only the tags would be indexed).</p>
|
|
<p>Of course, images, sound and video do not increase the
|
|
index size, which means that in most cases, the space used
|
|
by the index will be negligible compared to the total
|
|
amount of data on the computer.</p>
|
|
<p>The index data directory (<code class=
|
|
"filename">xapiandb</code>) only contains data that can be
|
|
completely rebuilt by an index run (as long as the original
|
|
documents exist), and it can always be destroyed
|
|
safely.</p>
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.INDEXING.STORAGE.FORMAT" id=
|
|
"RCL.INDEXING.STORAGE.FORMAT"></a>2.2.1. <span class="application">Xapian</span>
|
|
index formats</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p><span class="application">Xapian</span> versions
|
|
usually support several formats for index storage. A
|
|
given major <span class="application">Xapian</span>
|
|
version will have a current format, used to create new
|
|
indexes, and will also support the format from the
|
|
previous major version.</p>
|
|
<p><span class="application">Xapian</span> will not
|
|
automatically convert an existing index from the older
|
|
format to the newer one. If you want to upgrade to the
|
|
new format, or if a very old index needs to be converted
|
|
because its format is not supported any more, you will
|
|
have to explicitly delete the old index (typically
|
|
<code class="filename">~/.recoll/xapiandb</code>), then
|
|
run a normal indexing command. Using <span class=
|
|
"command"><strong>recollindex</strong></span> option
|
|
<code class="option">-z</code> would not work in this
|
|
situation.</p>
|
|
</div>
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.INDEXING.STORAGE.SECURITY" id=
|
|
"RCL.INDEXING.STORAGE.SECURITY"></a>2.2.2. Security
|
|
aspects</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>The <span class="application">Recoll</span> index does
|
|
not hold complete copies of the indexed documents (it
|
|
almost does after version 1.24). But it does hold enough
|
|
data to allow for an almost complete reconstruction. If
|
|
confidential data is indexed, access to the database
|
|
directory should be restricted.</p>
|
|
<p><span class="application">Recoll</span> will create
|
|
the configuration directory with a mode of 0700 (access
|
|
by owner only). As the index data directory is by default
|
|
a sub-directory of the configuration directory, this
|
|
should result in appropriate protection.</p>
|
|
<p>If you use another setup, you should think of the kind
|
|
of protection you need for your index, set the directory
|
|
and files access modes appropriately, and also maybe
|
|
adjust the <code class="literal">umask</code> used during
|
|
index updates.</p>
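<p>A minimal sketch for an index stored in a non-default location,
with <code class="filename">/path/to/xapiandb</code> as a
placeholder for the actual index directory:</p>
<pre class="programlisting">
# Remove all access for group and others on the existing index data
chmod -R go-rwx /path/to/xapiandb

# Run the update with a restrictive umask, in a subshell so that the
# umask of the calling shell is left unchanged
(umask 077; recollindex)
</pre>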
|
|
</div>
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.INDEXING.STORAGE.BIG" id=
|
|
"RCL.INDEXING.STORAGE.BIG"></a>2.2.3. Special
|
|
considerations for big indexes</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>This need only concern you if your index is going to
|
|
be bigger than around 5 GBytes. Beyond 10 GBytes, it
|
|
becomes a serious issue. Most people have much smaller
|
|
indexes. For reference, 5 GBytes would be around 2000
|
|
bibles, a lot of text. If you have a huge text dataset
|
|
(remember: images don't count, the text content of PDFs
|
|
is typically less than 5% of the file size), read on.</p>
|
|
<p>The amount of writing performed by Xapian during index
|
|
creation is not linear with the index size (it is
|
|
somewhere between linear and quadratic). For big indexes
|
|
this becomes a performance issue, and may even be an SSD
|
|
disk wear issue.</p>
|
|
<p>The problem can be mitigated by observing the
|
|
following rules:</p>
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style=
|
|
"list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p>Partition the data set and create several
|
|
indexes of reasonable size rather than a huge one.
|
|
These indexes can then be queried in parallel
|
|
(using the <span class="application">Recoll</span>
|
|
external indexes facility), or merged using
|
|
<span class=
|
|
"command"><strong>xapian-compact</strong></span>.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>Have a lot of RAM available and set the
|
|
<code class="literal">idxflushmb</code>
|
|
<span class="application">Recoll</span>
|
|
configuration parameter as high as you can without
|
|
swapping (experimentation will be needed). 200
|
|
would be a minimum in this context.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>Use Xapian 1.4.10 or newer, as this version
|
|
brought a significant improvement in the amount of
|
|
writes.</p>
|
|
</li>
|
|
</ul>
|
|
</div>
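<p>The first two rules might translate as follows. All paths are
placeholders, 200 is simply the minimum mentioned above, and the
<span class="command"><strong>xapian-compact</strong></span>
invocation (source databases first, destination last) should be
checked against the documentation of your
<span class="application">Xapian</span> version:</p>
<pre class="programlisting">
# In recoll.conf: raise the flush threshold (value in megabytes)
idxflushmb = 200
</pre>
<pre class="programlisting">
# Merge two partial indexes into a new one
xapian-compact /path/to/config1/xapiandb /path/to/config2/xapiandb /path/to/merged/xapiandb
</pre>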
|
|
</div>
|
|
</div>
|
|
<div class="sect1">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h2 class="title" style="clear: both"><a name=
|
|
"RCL.INDEXING.CONFIG" id=
|
|
"RCL.INDEXING.CONFIG"></a>2.3. Index
|
|
configuration</h2>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>Variables stored inside the <a class="link" href=
|
|
"#RCL.INSTALL.CONFIG" title=
|
|
"5.4. Configuration overview"><span class=
|
|
"application">Recoll</span> configuration files</a> control
|
|
which areas of the file system are indexed, and how files
|
|
are processed. The values can be set by editing the text
|
|
files. Most of the more commonly used ones can also be
|
|
adjusted by using the <a class="link" href=
|
|
"#RCL.INDEXING.CONFIG.GUI" title=
|
|
"2.3.4. The index configuration GUI">dialogs in the
|
|
<span class="command"><strong>recoll</strong></span>
|
|
GUI</a>.</p>
|
|
<p>The first time you start <span class=
|
|
"command"><strong>recoll</strong></span>, you will be asked
|
|
whether or not you would like it to build the index. If you
|
|
want to adjust the configuration before indexing, just
|
|
click <span class="guilabel">Cancel</span> at this point,
|
|
which will get you into the configuration interface. If you
|
|
exit at this point, <code class="filename">recoll</code>
|
|
will have created a default configuration directory with
|
|
empty configuration files, which you can then edit.</p>
|
|
<p>The configuration is documented inside the <a class=
|
|
"link" href="#RCL.INSTALL.CONFIG" title=
|
|
"5.4. Configuration overview">installation chapter</a>
|
|
of this document, or in the <a class="ulink" href=
|
|
"https://www.lesbonscomptes.com/recoll/manpages/recoll.conf.5.html"
|
|
target="_top"><span class="citerefentry"><span class=
|
|
"refentrytitle">recoll.conf</span>(5)</span></a> manual
|
|
page. Both documents are automatically generated from the
|
|
comments inside the configuration file.</p>
|
|
<p>The most immediately useful variable is probably
|
|
<a class="link" href=
|
|
"#RCL.INSTALL.CONFIG.RECOLLCONF.TOPDIRS"><code class=
|
|
"varname">topdirs</code></a>, which lists the subtrees and
|
|
files to be indexed.</p>
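<p>For instance, to index two directories instead of the whole
home directory (the directory names are arbitrary examples):</p>
<pre class="programlisting">
topdirs = ~/Documents ~/Mail
</pre>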
|
|
<p>The applications needed to index file types other than
|
|
text, HTML or email (e.g. PDF, PostScript, MS Word...) are
|
|
described in the <a class="link" href=
|
|
"#RCL.INSTALL.EXTERNAL" title=
|
|
"5.2. Supporting packages">external packages
|
|
section</a>.</p>
|
|
<p>There are two incompatible types of Recoll indexes,
|
|
depending on the treatment of character case and
|
|
diacritics. A <a class="link" href=
|
|
"#RCL.INDEXING.CONFIG.SENS" title=
|
|
"2.3.2. Index case and diacritics sensitivity">further
|
|
section</a> describes the two types in more detail. The
|
|
default type is appropriate in most cases.</p>
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.INDEXING.CONFIG.MULTIPLE" id=
|
|
"RCL.INDEXING.CONFIG.MULTIPLE"></a>2.3.1. Multiple
|
|
indexes</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>Multiple <span class="application">Recoll</span>
|
|
indexes can be created by using several configuration
|
|
directories which are typically set to index different
|
|
areas of the file system.</p>
|
|
<p>A specific index can be selected by setting the
|
|
<code class="envar">RECOLL_CONFDIR</code> environment
|
|
variable or giving the <code class="option">-c</code>
|
|
option to <span class=
|
|
"command"><strong>recoll</strong></span> and <span class=
|
|
"command"><strong>recollindex</strong></span>.</p>
|
|
<p>The <span class=
|
|
"command"><strong>recollindex</strong></span> program,
|
|
used for creating or updating indexes, always works on a
|
|
single index. The different configurations are entirely
|
|
independent (no parameters are ever shared between
|
|
configurations when indexing).</p>
|
|
<p>All the search interfaces (<span class=
|
|
"command"><strong>recoll</strong></span>, <span class=
|
|
"command"><strong>recollq</strong></span>, the Python
|
|
API, etc.) operate with a main configuration, from which
|
|
both configuration and index data are used, and can also
|
|
query data from multiple additional indexes. Only the
|
|
index data from the latter is used; their configuration
|
|
parameters are ignored. This implies that some parameters
|
|
should be consistent among index configurations which are
|
|
to be used together.</p>
|
|
<p>When searching, the current main index (defined by
|
|
<code class="envar">RECOLL_CONFDIR</code> or <code class=
|
|
"option">-c</code>) is always active. If this is
|
|
undesirable, you can set up your base configuration to
|
|
index an empty directory.</p>
|
|
<p>Index configuration parameters can be set either by
|
|
using a text editor on the files, or, for most
|
|
parameters, by using the <a class="link" href=
|
|
"#RCL.INDEXING.CONFIG.GUI" title=
|
|
"2.3.4. The index configuration GUI"><span class=
|
|
"command"><strong>recoll</strong></span> index
|
|
configuration GUI</a>. In the latter case, the
|
|
configuration directory for which parameters are modified
|
|
is the one which was selected by <code class=
|
|
"envar">RECOLL_CONFDIR</code> or the <code class=
|
|
"option">-c</code> parameter, and there is no way to
|
|
switch configurations within the GUI.</p>
|
|
<p>See the <a class="link" href=
|
|
"#RCL.INSTALL.CONFIG.RECOLLCONF" title=
|
|
"5.4.2. Recoll main configuration file, recoll.conf">
|
|
configuration section</a> for a detailed description of
|
|
the parameters.</p>
|
|
<p>Some configuration parameters must be consistent among
|
|
a set of multiple indexes used together for searches.
|
|
Most importantly, all indexes to be queried concurrently
|
|
must have the same option concerning character case and
|
|
diacritics stripping, but there are other constraints.
|
|
Most of the relevant parameters affect the <a class=
|
|
"link" href="#RCL.INSTALL.CONFIG.RECOLLCONF.TERMS" title=
|
|
"Parameters affecting how we generate terms and organize the index">
|
|
term generation</a>.</p>
|
|
<p>Using multiple configurations requires a small amount of
|
|
command line or file manager usage. The user must
|
|
explicitly create additional configuration directories,
|
|
the GUI will not do it. This is to avoid mistakenly
|
|
creating additional directories when an argument is
|
|
mistyped. Also, the GUI or the indexer must be launched
|
|
with a specific option or environment to work on the
|
|
right configuration.</p>
|
|
<div class="simplesect">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h4 class="title"><a name="idm415" id=
"idm415"></a>In practice: creating and using an
|
|
additional index</h4>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>Initially creating the configuration and index:</p>
|
|
<pre class="programlisting">
mkdir <em class="replaceable"><code>/path/to/my/new/config</code></em></pre>
|
|
<p>Configuring the new index can be done from the
|
|
<span class="command"><strong>recoll</strong></span>
|
|
GUI, launched from the command line to pass the
|
|
<code class="literal">-c</code> option (you could
|
|
create a desktop file to do it for you), and then using
|
|
the <a class="link" href="#RCL.INDEXING.CONFIG.GUI"
|
|
title="2.3.4. The index configuration GUI">GUI
|
|
index configuration tool</a> to set up the index.</p>
|
|
<pre class="programlisting">
recoll -c <em class="replaceable"><code>/path/to/my/new/config</code></em></pre>
|
|
<p>Alternatively, you can just start a text editor on
|
|
the main configuration file:</p>
|
|
<pre class="programlisting">
<em class="replaceable"><code>someEditor</code></em> <em class="replaceable"><code>/path/to/my/new/config</code></em>/<a class="link" href="#RCL.INSTALL.CONFIG.RECOLLCONF" title="5.4.2. Recoll main configuration file, recoll.conf"><code class="filename">recoll.conf</code></a>
</pre>
|
|
<p>Creating and updating the index can be done from the
|
|
command line:</p>
|
|
<pre class="programlisting">recollindex -c <em class="replaceable"><code>/path/to/my/new/config</code></em>
</pre>
|
|
<p>or from the File menu of a GUI launched with the
|
|
same option (<span class=
|
|
"command"><strong>recoll</strong></span>, see
|
|
above).</p>
|
|
<p>The same GUI would also let you set up batch
|
|
indexing for the new index. Real time indexing can only
|
|
be set up from the GUI for the default index (the menu
|
|
entry will be inactive if the GUI was started with a
|
|
non-default <code class="literal">-c</code>
|
|
option).</p>
|
|
<p>The new index can be queried alone with</p>
|
|
<pre class="programlisting">
recoll -c <em class="replaceable"><code>/path/to/my/new/config</code></em></pre>
|
|
<p>Or, in parallel with the default index, by starting
|
|
<span class="command"><strong>recoll</strong></span>
|
|
without a <code class="literal">-c</code> option, and
|
|
using the <span class="guimenu">Preferences</span> →
|
|
<span class="guimenuitem">External Index Dialog</span>
|
|
menu.</p>
|
|
</div>
|
|
</div>
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.INDEXING.CONFIG.SENS" id=
|
|
"RCL.INDEXING.CONFIG.SENS"></a>2.3.2. Index
|
|
case and diacritics sensitivity</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>As of <span class="application">Recoll</span> version
|
|
1.18 you have a choice of building an index with terms
|
|
stripped of character case and diacritics, or one with
|
|
raw terms. For a source term of <code class=
|
|
"literal">Résumé</code>, the former will store
|
|
<code class="literal">resume</code>, the latter
|
|
<code class="literal">Résumé</code>.</p>
|
|
<p>Each type of index allows performing searches
|
|
insensitive to case and diacritics: with a raw index, the
|
|
user entry will be expanded to match all case and
|
|
diacritics variations present in the index. With a
|
|
stripped index, the search term will be stripped before
|
|
searching.</p>
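<p>As a rough illustration of the two approaches, the following
Python sketch strips case and diacritics from a term, then expands
a user entry against the terms of a raw index. The function names
and the sample terms are made up for this example, and the Unicode
decomposition used here only approximates what <span class=
"application">Recoll</span> does internally.</p>

```python
import unicodedata


def strip_term(term):
    """Return the term without diacritics and in lower case."""
    decomposed = unicodedata.normalize("NFD", term)
    without_marks = "".join(
        c for c in decomposed if not unicodedata.combining(c))
    return without_marks.lower()


def expand_in_raw_index(user_entry, raw_terms):
    """Return every raw term matching user_entry once both are stripped."""
    key = strip_term(user_entry)
    return sorted(t for t in raw_terms if strip_term(t) == key)


# Hypothetical terms found in a raw index.
raw_terms = {"Résumé", "résumé", "resume", "RESUME", "result"}

# Stripped index: the search term is stripped before searching.
stripped = strip_term("Résumé")

# Raw index: the user entry is expanded to all variations present.
variants = expand_in_raw_index("resume", raw_terms)

print(stripped)
print(variants)
```

<p>With a stripped index, only the single stripped form is looked
up. With a raw index, each of the returned variants has to be
searched for.</p>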
|
|
<p>A raw index allows using case and diacritics to
|
|
discriminate between terms, e.g., returning different
|
|
results when searching for <code class=
|
|
"literal">US</code> and <code class="literal">us</code>
|
|
or <code class="literal">resume</code> and <code class=
|
|
"literal">résumé</code>. Read the <a class="link" href=
|
|
"#RCL.SEARCH.CASEDIAC" title=
|
|
"3.9. Search case and diacritics sensitivity">section
|
|
about search case and diacritics sensitivity</a> for more
|
|
details.</p>
|
|
<p>The type of index to be created is controlled by the
|
|
<code class="literal">indexStripChars</code>
|
|
configuration variable which can only be changed by
|
|
editing the configuration file. Any change implies an
|
|
index reset (not automated by <span class=
|
|
"application">Recoll</span>), and all indexes in a search
|
|
must be set in the same way (again, not checked by
|
|
<span class="application">Recoll</span>).</p>
|
|
<p><span class="application">Recoll</span> creates a
|
|
stripped index by default if <code class=
|
|
"literal">indexStripChars</code> is not set.</p>
|
|
<p>As a cost for added capability, a raw index will be
|
|
slightly bigger than a stripped one (around 10%). Also,
|
|
searches will be more complex, so probably slightly
|
|
slower, and the feature is relatively little used, so
|
|
that a certain amount of weirdness cannot be
|
|
excluded.</p>
|
|
<p>One of the most adverse consequences of using a raw
index is that some phrase and proximity searches may
become impossible: because each term needs to be
expanded, and all combinations searched for, the
multiplicative expansion may become unmanageable.</p>
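<p>The arithmetic behind this can be sketched as follows. The
variant counts are invented for the example and do not come from
a real index: the point is only that the number of combinations
is the product of the number of variations of each term.</p>

```python
import math


def phrase_combinations(variants_per_term):
    """Number of term combinations to search for a phrase, given the
    number of case/diacritics variations found for each term."""
    return math.prod(variants_per_term)


# A two-word phrase with few variations stays manageable.
few = phrase_combinations([2, 3])

# An eight-word phrase where each word has six variations does not.
many = phrase_combinations([6] * 8)

print(few, many)
```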
|
|
</div>
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.INDEXING.CONFIG.THREADS" id=
|
|
"RCL.INDEXING.CONFIG.THREADS"></a>2.3.3. Indexing
|
|
threads configuration (<span class=
|
|
"application">Unix</span>-like systems)</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>The <span class="application">Recoll</span> indexing
|
|
process <span class=
|
|
"command"><strong>recollindex</strong></span> can use
|
|
multiple threads to speed up indexing on multiprocessor
|
|
systems. The work done to index files is divided into
|
|
several stages and some of the stages can be executed by
|
|
multiple threads. The stages are:</p>
|
|
<div class="orderedlist">
|
|
<ol class="orderedlist" type="1">
|
|
<li class="listitem">
|
|
<p>File system walking: this is always performed by
|
|
the main thread.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>File conversion and data extraction.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>Text processing (splitting, stemming, etc.).</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><span class="application">Xapian</span> index
|
|
update.</p>
|
|
</li>
|
|
</ol>
|
|
</div>
|
|
<p>You can also read a <a class="ulink" href=
|
|
"http://www.recoll.org/pages/idxthreads/threadingRecoll.html"
|
|
target="_top">longer document</a> about the
|
|
transformation of <span class="application">Recoll</span>
|
|
indexing to multithreading.</p>
|
|
<p>The threads configuration is controlled by two
|
|
configuration file parameters.</p>
|
|
<div class="variablelist">
|
|
<dl class="variablelist">
|
|
<dt><span class="term"><code class=
|
|
"varname">thrQSizes</code></span></dt>
|
|
<dd>
|
|
<p>This variable defines the job input queues
configuration. There are three possible queues for
stages 2, 3 and 4, and this parameter should give
the queue depth for each stage (three integer
values). If a value of -1 is used for a given
stage, no queue is used, and the thread will go on
performing the next stage. In practice, deep queues
have not been shown to increase performance. A
value of 0 for the first queue tells <span class=
"application">Recoll</span> to perform
autoconfiguration (nothing else is needed in
this case, and <code class="varname">thrTCounts</code>
is not used). This is the default configuration.</p>
|
|
</dd>
|
|
<dt><span class="term"><code class=
|
|
"varname">thrTCounts</code></span></dt>
|
|
<dd>
|
|
<p>This defines the number of threads used for each
|
|
stage. If a value of -1 is used for one of the
|
|
queue depths, the corresponding thread count is
|
|
ignored. It makes no sense to use a value other
|
|
than 1 for the last stage because updating the
|
|
<span class="application">Xapian</span> index is
|
|
necessarily single-threaded (and protected by a
|
|
mutex).</p>
|
|
</dd>
|
|
</dl>
|
|
</div>
|
|
<div class="note" style=
|
|
"margin-left: 0.5in; margin-right: 0.5in;">
|
|
<h3 class="title">Note</h3>
|
|
<p>If the first value in <code class=
|
|
"varname">thrQSizes</code> is 0, <code class=
|
|
"varname">thrTCounts</code> is ignored.</p>
|
|
</div>
|
|
<p>The following example would use three queues (of depth
2), and 4 threads for converting source documents, 2 for
processing their text, and one to update the index. This
was measured to be the best configuration on the test
system (a four-processor machine with multiple disks).</p>
<pre class="programlisting">
thrQSizes = 2 2 2
thrTCounts = 4 2 1
</pre>
|
|
<p>The following example would use a single queue, and
the complete processing for each document would be
performed by a single thread (several documents will
still be processed in parallel in most cases). The
threads will use mutual exclusion when entering the index
update stage. In practice, the performance would generally
be close to the previous case, but worse in certain
cases (e.g. a Zip archive would be processed purely
sequentially), so the previous approach is preferred.
YMMV... The last two values for thrTCounts are ignored.</p>
<pre class="programlisting">
thrQSizes = 2 -1 -1
thrTCounts = 6 1 1
</pre>
|
|
<p>The following example would disable multithreading.
|
|
Indexing will be performed by a single thread.</p>
|
|
<pre class="programlisting">
thrQSizes = -1 -1 -1
</pre>
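<p>The rules above can be summarised by a small Python sketch which
reads the two values and reports what each stage would do. This is
only a reading aid written for this section: the function name and
the wording of the results are invented, and this is not how
<span class="command"><strong>recollindex</strong></span> parses
its configuration.</p>

```python
STAGES = ("conversion", "text processing", "index update")


def interpret_thread_config(thr_q_sizes, thr_t_counts="1 1 1"):
    """Describe the effect of thrQSizes / thrTCounts values."""
    depths = [int(v) for v in thr_q_sizes.split()]
    # A first queue depth of 0 means autoconfiguration:
    # nothing else is looked at.
    if depths and depths[0] == 0:
        return "autoconfiguration"
    counts = [int(v) for v in thr_t_counts.split()]
    if len(depths) != 3 or len(counts) != 3:
        raise ValueError("three values are expected for each parameter")
    plan = {}
    for stage, depth, count in zip(STAGES, depths, counts):
        if depth == -1:
            # No queue: the previous thread goes on with this stage.
            plan[stage] = "no queue, thread count ignored"
        else:
            plan[stage] = "queue depth %d, %d thread(s)" % (depth, count)
    return plan


auto = interpret_thread_config("0 0 0")
three_queues = interpret_thread_config("2 2 2", "4 2 1")
single_queue = interpret_thread_config("2 -1 -1", "6 1 1")
no_threads = interpret_thread_config("-1 -1 -1")
```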
|
|
</div>
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name="RCL.INDEXING.CONFIG.GUI"
|
|
id="RCL.INDEXING.CONFIG.GUI"></a>2.3.4. The
|
|
index configuration GUI</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>Most parameters for a given index configuration can be
|
|
set from a <span class=
|
|
"command"><strong>recoll</strong></span> GUI running on
|
|
this configuration (either as default, or by setting
<code class="envar">RECOLL_CONFDIR</code> or using the
<code class="option">-c</code> option).</p>
|
|
<p>The interface is started from the <span class=
|
|
"guimenu">Preferences</span> → <span class=
|
|
"guimenuitem">Index Configuration</span> menu entry. It
|
|
is divided into four tabs, <span class="guilabel">Global
|
|
parameters</span>, <span class="guilabel">Local
|
|
parameters</span>, <span class="guilabel">Web
|
|
history</span> (which is explained in the next section)
|
|
and <span class="guilabel">Search parameters</span>.</p>
|
|
<p>The <span class="guilabel">Global parameters</span>
|
|
tab allows setting global variables, like the lists of
|
|
top directories, skipped paths, or stemming
|
|
languages.</p>
|
|
<p>The <span class="guilabel">Local parameters</span> tab
|
|
allows setting variables that can be redefined for
|
|
subdirectories. This second tab has an initially empty
|
|
list of customisation directories, to which you can add.
|
|
The variables are then set for the currently selected
|
|
directory (or at the top level if the empty line is
|
|
selected).</p>
|
|
<p>The <span class="guilabel">Search parameters</span>
|
|
section defines parameters which are used at query time,
|
|
but are global to an index and affect all search tools,
|
|
not only the GUI.</p>
|
|
<p>The meaning for most entries in the interface is
|
|
self-evident and documented by a <code class=
|
|
"literal">ToolTip</code> popup on the text label. For
|
|
more detail, you will need to refer to the <a class=
|
|
"link" href="#RCL.INSTALL.CONFIG" title=
|
|
"5.4. Configuration overview">configuration
|
|
section</a> of this guide.</p>
|
|
<p>The configuration tool normally respects the comments
|
|
and most of the formatting inside the configuration file,
|
|
so that it is quite possible to use it on hand-edited
files, which you might nevertheless want to back up
first.</p>
|
|
</div>
|
|
</div>
|
|
<div class="sect1">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h2 class="title" style="clear: both"><a name=
|
|
"RCL.INDEXING.REMOVABLE" id=
|
|
"RCL.INDEXING.REMOVABLE"></a>2.4. Removable
|
|
volumes</h2>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p><span class="application">Recoll</span> used to have no
|
|
support for indexing removable volumes (portable disks, USB
|
|
keys, etc.). Recent versions have improved the situation
|
|
and support indexing removable volumes in two different
|
|
ways:</p>
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style="list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p>By indexing the volume in the main, fixed, index,
and ensuring that the volume data is not purged if
the indexing runs while the volume is not mounted (since
<span class="application">Recoll</span> 1.25.2).</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>By storing a volume index on the volume itself
|
|
(since <span class="application">Recoll</span>
|
|
1.24).</p>
|
|
</li>
|
|
</ul>
|
|
</div>
|
|
<div class="simplesect">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.INDEXING.REMOVABLE.MAIN" id=
|
|
"RCL.INDEXING.REMOVABLE.MAIN"></a>Indexing
|
|
removable volumes in the main index</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>As of version 1.25.2, <span class=
|
|
"application">Recoll</span> provides a simple way to
|
|
ensure that the index data for an absent volume will not
|
|
be purged. Two conditions must be met:</p>
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style=
|
|
"list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p>The volume mount point must be a member of the
|
|
<code class="literal">topdirs</code> list.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>The mount directory must be empty (when the
|
|
volume is not mounted).</p>
|
|
</li>
|
|
</ul>
|
|
</div>
|
|
<p>If <span class=
|
|
"command"><strong>recollindex</strong></span> finds that
|
|
one of the <code class="literal">topdirs</code> is empty
|
|
when starting up, any existing data for the tree will be
|
|
preserved by the indexing pass (no purge for this
|
|
area).</p>
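<p>The check described above amounts to looking for empty
directories among the <code class="literal">topdirs</code> entries.
The following Python sketch shows the idea with a hypothetical
helper and temporary directories standing in for mount points. It
is an illustration of the rule, not the code used by <span class=
"command"><strong>recollindex</strong></span>.</p>

```python
import os
import tempfile


def topdirs_to_preserve(topdirs):
    """Return the topdirs entries which exist but are empty, i.e. the
    ones which look like mount points of absent volumes."""
    return [d for d in topdirs
            if os.path.isdir(d) and not os.listdir(d)]


with tempfile.TemporaryDirectory() as tmp:
    mounted = os.path.join(tmp, "mounted")   # volume present
    absent = os.path.join(tmp, "absent")     # empty mount directory
    os.makedirs(mounted)
    os.makedirs(absent)
    with open(os.path.join(mounted, "doc.txt"), "w") as f:
        f.write("some document")

    preserved = topdirs_to_preserve([mounted, absent])
    expected = [absent]
    print(preserved == expected)
```

<p>This also shows why the mount directory must be empty when the
volume is absent: a stray file in it would make the tree look
present, and the existing index data would then be purged.</p>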
|
|
</div>
|
|
<div class="simplesect">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.INDEXING.REMOVABLE.SELF" id=
|
|
"RCL.INDEXING.REMOVABLE.SELF"></a>Self contained
|
|
volumes</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>As of <span class="application">Recoll</span> 1.24, it
|
|
has become possible to build self-contained datasets
|
|
including a <span class="application">Recoll</span>
|
|
configuration directory and index together with the
|
|
indexed documents, and to move such a dataset around (for
|
|
example copying it to a USB drive), without having to
|
|
adjust the configuration for querying the index.</p>
|
|
<div class="note" style=
|
|
"margin-left: 0.5in; margin-right: 0.5in;">
|
|
<h3 class="title">Note</h3>
|
|
<p>This is a query-time feature only. The index must
|
|
only be updated in its original location. If an update
|
|
is necessary in a different location, the index must be
|
|
reset.</p>
|
|
</div>
|
|
<p>The principle of operation is that the configuration
|
|
stores the location of the original configuration
|
|
directory, which must reside on the movable volume. If
|
|
the volume is later mounted elsewhere, <span class=
|
|
"application">Recoll</span> adjusts the paths stored
|
|
inside the index by the difference between the original
|
|
and current locations of the configuration directory.</p>
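<p>The path adjustment can be pictured with the following Python
sketch. It assumes that the translation consists in dropping the
trailing part common to both configuration directory paths, and
then replacing the remaining original prefix with the current one.
The function name and the sample document path are invented, and
the actual <span class="application">Recoll</span> implementation
may differ in its details.</p>

```python
from pathlib import PurePosixPath


def translate_path(stored_path, orgidxconfdir, current_confdir):
    """Map a path stored at indexing time to its location after a move."""
    org = PurePosixPath(orgidxconfdir).parts
    cur = PurePosixPath(current_confdir).parts
    # Count the trailing components shared by both configuration paths.
    common = 0
    while (common < min(len(org), len(cur))
           and org[-1 - common] == cur[-1 - common]):
        common += 1
    org_prefix = PurePosixPath(*org[:len(org) - common])
    cur_prefix = PurePosixPath(*cur[:len(cur) - common])
    try:
        relative = PurePosixPath(stored_path).relative_to(org_prefix)
    except ValueError:
        # The path is not under the original dataset: leave it alone.
        return stored_path
    return str(cur_prefix / relative)


moved = translate_path(
    "/home/me/mydata/docs/report.pdf",
    "/home/me/mydata/recoll-confdir",
    "/media/me/somelabel/recoll-confdir",
)
outside = translate_path(
    "/etc/hosts",
    "/home/me/mydata/recoll-confdir",
    "/media/me/somelabel/recoll-confdir",
)
print(moved)
print(outside)
```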
|
|
<p>To make a long story short, here follows a script to
create a <span class="application">Recoll</span>
configuration and index under a given directory (given as
the single parameter). The resulting data set (files + recoll
directory) can later be moved to a CDROM or thumb
drive. Longer explanations come after the script.</p>
|
|
<pre class="programlisting">#!/bin/sh

# Print a message and exit with an error status
fatal()
{
    echo "$*";exit 1
}
usage()
{
    fatal "Usage: init-recoll-volume.sh top-directory"
}

test $# = 1 || usage
topdir=$1
test -d "$topdir" || fatal "$topdir should be a directory"

# Make the top directory path absolute before deriving other paths
# from it, so that a relative argument also works
cd "$topdir" || fatal "can't change directory to $topdir"
topdir=`pwd`

confdir="$topdir/recoll-config"
test ! -d "$confdir" || fatal "$confdir should not exist"
mkdir "$confdir" || fatal "can't create $confdir"

# Index the whole tree, and record the original configuration location
(echo "topdirs = \"$topdir\""; \
 echo "orgidxconfdir = $confdir") > "$confdir/recoll.conf"

recollindex -c "$confdir"
</pre>
|
|
<p>The examples below will assume that you have a dataset
|
|
under <code class="filename">/home/me/mydata/</code>,
|
|
with the index configuration and data stored inside
|
|
<code class=
|
|
"filename">/home/me/mydata/recoll-confdir</code>.</p>
|
|
<p>In order to be able to run queries after the dataset
|
|
has been moved, you must ensure the following:</p>
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style=
|
|
"list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p>The main configuration file must define the
|
|
<a class="link" href=
|
|
"#RCL.INSTALL.CONFIG.RECOLLCONF.ORGIDXCONFDIR">orgidxconfdir</a>
|
|
variable to be the original location of the
|
|
configuration directory (<code class=
|
|
"filename">orgidxconfdir=/home/me/mydata/recoll-confdir</code>
|
|
must be set inside <code class=
|
|
"filename">/home/me/mydata/recoll-confdir/recoll.conf</code>
|
|
in the example above).</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>The configuration directory must exist with the
|
|
documents, somewhere under the directory which will
|
|
be moved. E.g. if you are moving <code class=
|
|
"filename">/home/me/mydata</code> around, the
|
|
configuration directory must exist somewhere below
|
|
this point, for example <code class=
|
|
"filename">/home/me/mydata/recoll-confdir</code>,
|
|
or <code class=
|
|
"filename">/home/me/mydata/sub/recoll-confdir</code>.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>You should keep the default locations for the
|
|
index elements which are relative to the
|
|
configuration directory by default (principally
|
|
<code class="literal">dbdir</code>). Only the paths
|
|
referring to the documents themselves (e.g.
|
|
<code class="literal">topdirs</code> values) should
|
|
be absolute (in general, they are only used when
|
|
indexing anyway).</p>
|
|
</li>
|
|
</ul>
|
|
</div>
|
|
<p>Only the first point needs an explicit user action,
|
|
the <span class="application">Recoll</span> defaults are
|
|
compatible with the third one, and the second is
|
|
natural.</p>
|
|
<p>If, after the move, the configuration directory needs
|
|
to be copied out of the dataset (for example because the
|
|
thumb drive is too slow), you can set the <a class="link"
|
|
href=
"#RCL.INSTALL.CONFIG.RECOLLCONF.CURIDXCONFDIR">curidxconfdir</a>
|
|
variable inside the copied configuration to define the
|
|
location of the moved one. For example if <code class=
|
|
"filename">/home/me/mydata</code> is now mounted onto
|
|
<code class="filename">/media/me/somelabel</code>, but
|
|
the configuration directory and index have been copied to
|
|
<code class="filename">/tmp/tempconfig</code>, you would
|
|
set <code class="literal">curidxconfdir</code> to
|
|
<code class=
|
|
"filename">/media/me/somelabel/recoll-confdir</code>
|
|
inside <code class=
|
|
"filename">/tmp/tempconfig/recoll.conf</code>.
|
|
<code class="literal">orgidxconfdir</code> would still be
|
|
<code class=
|
|
"filename">/home/me/mydata/recoll-confdir</code> in the
|
|
original and the copy.</p>
|
|
<p>If you are regularly copying the configuration out of
|
|
the dataset, it will be useful to write a script to
|
|
automate the procedure. This can't really be done inside
|
|
<span class="application">Recoll</span> because there are
|
|
probably many possible variants. One example would be to
|
|
copy the configuration to make it writable, but keep the
|
|
index data on the medium because it is too big - in this
|
|
case, the script would also need to set <code class=
|
|
"literal">dbdir</code> in the copied configuration.</p>
|
|
<p>The same set of modifications (<span class=
|
|
"application">Recoll</span> 1.24) has also made it
|
|
possible to run queries from a read-only configuration
|
|
directory (with slightly reduced function of course, such
|
|
as not recording the query history).</p>
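<p>As an example of automating the copy of a configuration out of
a moved dataset, here is a minimal Python sketch for the variant
where the index stays on the medium. The helper name is invented,
and the sketch assumes that the index lives in an <code class=
"filename">xapiandb</code> subdirectory of the configuration
directory. The new values are written at the top of the copied
<code class="filename">recoll.conf</code> so that they do not end
up inside a directory-specific section. Adapt the details to your
own layout.</p>

```python
import os
import shutil
import tempfile


def copy_config_out(dataset_confdir, dest_confdir, index_dirname="xapiandb"):
    """Copy a configuration directory out of a moved dataset, leaving
    the index data on the medium. dest_confdir must not exist yet."""
    # Copy everything except the (possibly big) index directory.
    shutil.copytree(dataset_confdir, dest_confdir,
                    ignore=shutil.ignore_patterns(index_dirname))
    conf_path = os.path.join(dest_confdir, "recoll.conf")
    with open(conf_path) as f:
        original = f.read()
    extra = (
        "curidxconfdir = %s\n" % dataset_confdir
        + "dbdir = %s\n" % os.path.join(dataset_confdir, index_dirname)
    )
    # Prepend the new values to keep them in the global part of the file.
    with open(conf_path, "w") as f:
        f.write(extra + original)
    return conf_path


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for /media/me/somelabel/recoll-confdir
    moved = os.path.join(tmp, "somelabel", "recoll-confdir")
    os.makedirs(os.path.join(moved, "xapiandb"))
    with open(os.path.join(moved, "recoll.conf"), "w") as f:
        f.write("orgidxconfdir = /home/me/mydata/recoll-confdir\n")

    copied = os.path.join(tmp, "tempconfig")
    conf = copy_config_out(moved, copied)
    with open(conf) as f:
        demo_lines = f.read().splitlines()
    demo_index_copied = os.path.exists(os.path.join(copied, "xapiandb"))
```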
|
|
</div>
|
|
</div>
|
|
<div class="sect1">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h2 class="title" style="clear: both"><a name=
|
|
"RCL.INDEXING.WebQUEUE" id=
|
|
"RCL.INDEXING.WebQUEUE"></a>2.5. <span class=
|
|
"application">Unix</span>-like systems: indexing
|
|
visited Web pages</h2>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>With the help of a <span class=
|
|
"application">Firefox</span> extension, <span class=
|
|
"application">Recoll</span> can index the Internet pages
|
|
that you visit. The extension has a long history: it was
|
|
initially designed for the <span class=
|
|
"application">Beagle</span> indexer, then adapted to
|
|
<span class="application">Recoll</span> and the
|
|
<span class="application">Firefox</span> <span class=
|
|
"application">XUL</span> API. The current version of the
|
|
extension is located in the <a class="ulink" href=
|
|
"https://addons.mozilla.org/en-US/firefox/addon/recoll-we/"
|
|
target="_top">Mozilla add-ons repository</a>, uses the
|
|
<span class="application">WebExtensions</span> API, and
|
|
works with current <span class="application">Firefox</span>
|
|
versions.</p>
|
|
<p>The extension works by copying visited Web pages to an
|
|
indexing queue directory, which <span class=
|
|
"application">Recoll</span> then processes, storing the
|
|
data into a local cache, then indexing it, then removing
|
|
the file from the queue.</p>
|
|
<div class="note" style=
|
|
"margin-left: 0.5in; margin-right: 0.5in;">
|
|
<h3 class="title">The local cache is not an archive</h3>
|
|
<p>As mentioned above, a copy of the indexed Web pages is
|
|
retained by Recoll in a local cache (from which data is
|
|
fetched for previews, or when resetting the index). The
|
|
cache is not changed by an index reset, just read for
|
|
indexing. The cache has a maximum size, which can be
|
|
adjusted from the <span class="guilabel">Index
|
|
configuration</span> / <span class="guilabel">Web
|
|
history</span> panel (<code class=
|
|
"literal">webcachemaxmbs</code> parameter in <code class=
|
|
"filename">recoll.conf</code>). Once the maximum size is
|
|
reached, old pages are erased to make room for new ones.
|
|
The pages which you want to keep indefinitely need to be
|
|
explicitly archived elsewhere. Using a very high value
for the cache size can avoid data erasure, but see the
'Howto' page linked at the end of this section for more
details and gotchas.</p>
|
|
</div>
|
|
<p>The visited Web pages indexing feature can be enabled on
|
|
the <span class="application">Recoll</span> side from the
|
|
GUI <span class="guilabel">Index configuration</span>
|
|
panel, or by editing the configuration file (set
|
|
<code class="varname">processwebqueue</code> to 1).</p>
|
|
<p>The <span class="application">Recoll</span> GUI has a
|
|
tool to list and edit the contents of the Web cache.
|
|
(<span class="guimenu">Tools</span> → <span class=
|
|
"guimenuitem">Webcache editor</span>)</p>
|
|
<p>The <span class=
|
|
"command"><strong>recollindex</strong></span> command has
|
|
two options to help manage the Web cache:</p>
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style="list-style-type: disc;">
|
|
<li class="listitem"><code class=
|
|
"option">--webcache-compact</code> will recover the
|
|
space from erased entries. It may need to use twice the
|
|
disk space currently needed for the Web cache.</li>
|
|
<li class="listitem"><code class=
|
|
"option">--webcache-burst <em class=
|
|
"replaceable"><code>destdir</code></em></code> will
|
|
extract all current entries into pairs of metadata and
|
|
data files created inside <em class=
"replaceable"><code>destdir</code></em>.</li>
|
|
</ul>
|
|
</div>
|
|
<p>You can find more details on Web indexing, its usage and
|
|
configuration in a <a class="ulink" href=
|
|
"https://www.lesbonscomptes.com/recoll/faqsandhowtos/IndexWebHistory"
|
|
target="_top">Recoll 'Howto' entry</a>.</p>
|
|
</div>
|
|
<div class="sect1">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h2 class="title" style="clear: both"><a name=
|
|
"RCL.INDEXING.EXTATTR" id=
|
|
"RCL.INDEXING.EXTATTR"></a>2.6. <span class=
|
|
"application">Unix</span>-like systems: using
|
|
extended attributes</h2>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>User extended attributes are named pieces of information
|
|
that most modern file systems can attach to any file.</p>
|
|
<p><span class="application">Recoll</span> processes
|
|
extended attributes as document fields by default.</p>
|
|
<p>A <a class="ulink" href=
|
|
"http://www.freedesktop.org/wiki/CommonExtendedAttributes"
|
|
target="_top">freedesktop standard</a> defines a few
|
|
special attributes, which are handled as such by
|
|
<span class="application">Recoll</span>:</p>
|
|
<div class="variablelist">
|
|
<dl class="variablelist">
|
|
<dt><span class="term">mime_type</span></dt>
|
|
<dd>
|
|
<p>If set, this overrides any other determination of
|
|
the file MIME type.</p>
|
|
</dd>
|
|
<dt><span class="term">charset</span></dt>
|
|
<dd>
|
|
<p>If set, this defines the file character set
|
|
(mostly useful for plain text files).</p>
|
|
</dd>
|
|
</dl>
|
|
</div>
|
|
<p>By default, other attributes are handled as <span class=
|
|
"application">Recoll</span> fields of the same name.</p>
|
|
<p>On Linux, the <code class="literal">user</code> prefix
|
|
is removed from the name.</p>
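<p>The default name translation can be pictured with the following
Python sketch, which reads the extended attributes of a file and
derives field names from them. It relies on the Linux-only
<code class="literal">os.listxattr()</code> and <code class=
"literal">os.getxattr()</code> functions. The helper names are
invented, and the sketch ignores any customisation made in the
<code class="filename">fields</code> file.</p>

```python
import os


def field_name_for_xattr(attr_name):
    """Derive a field name from an extended attribute name by
    removing the Linux 'user.' namespace prefix."""
    prefix = "user."
    if attr_name.startswith(prefix):
        return attr_name[len(prefix):]
    return attr_name


def xattr_fields(path):
    """Return the extended attributes of a file as a dictionary of
    field names and text values (Linux only)."""
    fields = {}
    for name in os.listxattr(path):
        value = os.getxattr(path, name)
        fields[field_name_for_xattr(name)] = value.decode(
            "utf-8", errors="replace")
    return fields


print(field_name_for_xattr("user.mime_type"))
print(field_name_for_xattr("user.charset"))
```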
|
|
<p>The name translation can be configured more precisely
|
|
inside the <a class="link" href=
|
|
"#RCL.INSTALL.CONFIG.FIELDS" title=
|
|
"5.4.3. The fields file"><code class=
|
|
"filename">fields</code> configuration file</a>.</p>
|
|
</div>
|
|
<div class="sect1">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h2 class="title" style="clear: both"><a name=
|
|
"RCL.INDEXING.EXTTAGS" id=
|
|
"RCL.INDEXING.EXTTAGS"></a>2.7. <span class=
|
|
"application">Unix</span>-like systems: importing
|
|
external tags</h2>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>During indexing, it is possible to import metadata for
|
|
each file by executing commands. This allows, for example,
|
|
extracting tag data from an external application and
|
|
storing it in a field for indexing.</p>
|
|
<p>See the <a class="link" href=
|
|
"#RCL.INSTALL.CONFIG.RECOLLCONF.METADATACMDS">section about
|
|
the <code class="literal">metadatacmds</code> field</a> in
|
|
the main configuration chapter for a description of the
|
|
configuration syntax.</p>
|
|
<p>For example, if you want <span class=
|
|
"application">Recoll</span> to use tags managed by
|
|
<span class="application">tmsu</span> in a field named
|
|
<em class="replaceable"><code>tags</code></em>, you would
|
|
add the following to the configuration file:</p>
|
|
<pre class="programlisting">[/some/area/of/the/fs]
metadatacmds = ; <em class=
"replaceable"><code>tags</code></em> = tmsu tags %f
</pre>
|
|
<div class="note" style=
|
|
"margin-left: 0.5in; margin-right: 0.5in;">
|
|
<h3 class="title">Note</h3>
|
|
<p>Depending on the <span class="application">tmsu</span>
|
|
version, you may need/want to add options like
|
|
<code class="literal">--database=/some/db</code>.</p>
|
|
</div>
|
|
<p>You may want to restrict this processing to a subset of
|
|
the directory tree, because it may slow down indexing a bit
|
|
(<code class="literal">[/some/area/of/the/fs]</code>).</p>
|
|
<p>Note the initial semi-colon after the equal sign.</p>
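<p>To make the structure of the value more concrete, the following
Python sketch splits a <code class="literal">metadatacmds</code>
value into field names and commands, and builds the command which
would be run for a given file by replacing <code class=
"literal">%f</code> with the file path. The function names are
invented and the parsing is deliberately simplified (for example,
a semi-colon inside a command is not handled). It is not the
parser used by <span class="application">Recoll</span>.</p>

```python
import shlex


def parse_metadatacmds(value):
    """Split a metadatacmds value into a {field name: command} dict."""
    commands = {}
    for part in value.split(";"):
        part = part.strip()
        if not part:
            # Skips the empty element before the initial semi-colon.
            continue
        field, _, command = part.partition("=")
        commands[field.strip()] = command.strip()
    return commands


def build_argv(command, path):
    """Return the argument list for a command, with %f replaced."""
    return [path if token == "%f" else token
            for token in shlex.split(command)]


commands = parse_metadatacmds("; tags = tmsu tags %f")
argv = build_argv(commands["tags"], "/some/area/of/the/fs/my file.txt")
print(commands)
print(argv)
```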
|
|
<p>In the example above, the output of <span class=
|
|
"command"><strong>tmsu</strong></span> is used to set a
|
|
field named <em class="replaceable"><code>tags</code></em>.
|
|
The field name is arbitrary and could be <em class=
|
|
"replaceable"><code>tmsu</code></em> or <em class=
|
|
"replaceable"><code>myfield</code></em> just the same, but
|
|
<em class="replaceable"><code>tags</code></em> is an alias
|
|
for the standard <span class="application">Recoll</span>
|
|
<code class="literal">keywords</code> field, and the
|
|
<span class="command"><strong>tmsu</strong></span> output
|
|
will just augment its contents. This will avoid the need to
|
|
extend the <a class="link" href="#RCL.PROGRAM.FIELDS"
|
|
title="4.2. Field data processing">field
|
|
configuration</a>.</p>
|
|
<p>Once re-indexing is performed (you will need to force
the file reindexing, <span class=
"application">Recoll</span> will not detect the need by
itself), you will be able to search the field from the query
language, through any of its aliases: <em class=
"replaceable"><code>tags:some/alternate/values</code></em>
or <em class=
"replaceable"><code>tags:all,these,values</code></em>. The
compact field search syntax is supported for <span class=
"application">Recoll</span> 1.20
and later. For older versions, you would need to repeat the
<em class="replaceable"><code>tags:</code></em> specifier
for each term, e.g. <em class=
"replaceable"><code>tags:some</code></em> <code class=
"literal">OR</code> <em class=
"replaceable"><code>tags:alternate</code></em>.</p>
|
|
<p>Tag changes will not be detected by the indexer if the
file itself did not change. One possible workaround would
be to update the file <code class="literal">ctime</code>
when you modify the tags, which would be consistent with
how extended attributes function. A pair of <span class=
"command"><strong>chmod</strong></span> commands could
accomplish this, or a <code class="literal">touch -a</code>.
Alternatively, just couple the tag update with a
<code class="literal">recollindex -e -i</code> <em class=
"replaceable"><code>/path/to/the/file</code></em>.</p>
|
|
</div>
|
|
<div class="sect1">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h2 class="title" style="clear: both"><a name=
|
|
"RCL.INDEXING.PDF" id=
|
|
"RCL.INDEXING.PDF"></a>2.8. The PDF input
|
|
handler</h2>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>The PDF format is very important for scientific and
|
|
technical documentation, and document archival. It has
|
|
extensive facilities for storing metadata along with the
|
|
document, and these facilities are actually used in the
|
|
real world.</p>
|
|
<p>In consequence, the <span class=
|
|
"command"><strong>rclpdf.py</strong></span> PDF input
|
|
handler has more complex capabilities than most others, and
|
|
it is also more configurable. Specifically, <span class=
|
|
"command"><strong>rclpdf.py</strong></span> has the
|
|
following features:</p>
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style="list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p>It can be configured to extract specific metadata
|
|
tags from an XMP packet.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>It can extract PDF attachments.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>It can automatically perform OCR if the document
|
|
text is empty. This is done by executing an external
|
|
program and is now described in a <a class="link"
|
|
href="#RCL.INDEXING.OCR" title=
|
|
"2.9. Recoll and OCR">separate section</a>,
|
|
because the OCR framework can also be used with
|
|
non-PDF image files.</p>
|
|
</li>
|
|
</ul>
|
|
</div>
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name="RCL.INDEXING.PDF.XMP"
|
|
id="RCL.INDEXING.PDF.XMP"></a>2.8.1. XMP
|
|
fields extraction</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>The <code class="filename">rclpdf.py</code> script in
|
|
<span class="application">Recoll</span> version 1.23.2
|
|
and later can extract XMP metadata fields by executing
|
|
the <span class="command"><strong>pdfinfo</strong></span>
|
|
command (usually found with <span class=
|
|
"application">poppler-utils</span>). This is controlled
|
|
by the <a class="link" href=
|
|
"#RCL.INSTALL.CONFIG.RECOLLCONF.PDFEXTRAMETA">pdfextrameta</a>
|
|
configuration variable, which specifies which tags to
|
|
extract and, possibly, how to rename them.</p>
|
|
<p>The <a class="link" href=
|
|
"#RCL.INSTALL.CONFIG.RECOLLCONF.PDFEXTRAMETAFIX">pdfextrametafix</a>
|
|
variable can be used to designate a file with Python code
|
|
to edit the metadata fields (available for <span class=
|
|
"application">Recoll</span> 1.23.3 and later. 1.23.2 has
|
|
equivalent code inside the handler script). Example:</p>
|
|
<pre class="programlisting">import sys
import re

class MetaFixer(object):
    def __init__(self):
        pass

    def metafix(self, nm, txt):
        if nm == 'bibtex:pages':
            txt = re.sub(r'--', '-', txt)
        elif nm == 'someothername':
            # do something else
            pass
        elif nm == 'stillanother':
            # etc.
            pass

        return txt

    def wrapup(self, metaheaders):
        pass
</pre>
|
|
<p>If the 'metafix()' method is defined, it is called for
each metadata field. A new MetaFixer object is created
for each PDF document (so the object can keep state, for
example to eliminate duplicate values). If the
'wrapup()' method is defined, it is called at the end of
XMP fields processing with the whole metadata as
parameter, as an array of '(nm, val)' pairs, allowing an
alternate approach for editing or adding/deleting
fields.</p>
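<p>As a further illustration, here is a hypothetical variant which
uses 'wrapup()' to drop duplicate values. It assumes that the
handler takes into account changes made in place to the list
passed to 'wrapup()', which you should check against your
<code class="filename">rclpdf.py</code> version. The
'simulate_handler()' function only exists so that the example can
be run on its own. It is not part of <span class=
"application">Recoll</span>.</p>

```python
import re


class MetaFixer(object):
    def __init__(self):
        pass

    def metafix(self, nm, txt):
        if nm == 'bibtex:pages':
            return re.sub(r'--', '-', txt)
        # Normalise white space so that duplicates compare equal.
        return " ".join(txt.split())

    def wrapup(self, metaheaders):
        # Keep the first occurrence of each (name, value) pair.
        seen = set()
        unique = []
        for nm, val in metaheaders:
            if (nm, val) not in seen:
                seen.add((nm, val))
                unique.append((nm, val))
        metaheaders[:] = unique


def simulate_handler(raw_fields):
    """Stand-in for the calls described above: one MetaFixer per
    document, metafix() for each field, then wrapup() on the list."""
    fixer = MetaFixer()
    headers = [(nm, fixer.metafix(nm, val)) for nm, val in raw_fields]
    fixer.wrapup(headers)
    return headers


result = simulate_handler([
    ("dc:title", "A  title"),
    ("dc:title", "A title"),
    ("bibtex:pages", "10--12"),
])
print(result)
```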
|
|
</div>
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name="RCL.INDEXING.PDF.ATTACH"
|
|
id="RCL.INDEXING.PDF.ATTACH"></a>2.8.2. PDF
|
|
attachment indexing</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>If <span class="application">pdftk</span> is
|
|
installed, and if the <a class="link" href=
|
|
"#RCL.INSTALL.CONFIG.RECOLLCONF.PDFATTACH">pdfattach</a>
|
|
configuration variable is set, the PDF input handler will
|
|
try to extract PDF attachments for indexing as
|
|
sub-documents of the PDF file. This is disabled by
|
|
default, because it slows down PDF indexing a bit even if
|
|
not one attachment is ever found (PDF attachments are
|
|
uncommon in my experience).</p>
|
|
</div>
|
|
</div>
|
|
<div class="sect1">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h2 class="title" style="clear: both"><a name=
|
|
"RCL.INDEXING.OCR" id=
|
|
"RCL.INDEXING.OCR"></a>2.9. Recoll and OCR</h2>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>This is new in <span class="application">Recoll</span>
|
|
1.26.5. Older versions had a more limited, non-caching
|
|
capability to execute an external OCR program in the PDF
|
|
handler. The new function has the following features:</p>
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style="list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p>The OCR output is cached, stored as separate
|
|
files. The caching is ultimately based on a hash
|
|
value of the original file contents, so that it is
|
|
immune to file renames. A first path-based layer
|
|
ensures fast operation for unchanged (unmoved) files,
|
|
and the data hash (which is still orders of magnitude
|
|
faster than OCR) is only re-computed if the file has
|
|
moved. OCR is only performed if the file was not
|
|
previously processed or if it changed.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>The support for a specific program is implemented
|
|
in a simple Python module. It should be
|
|
straightforward to add support for any OCR engine
|
|
with a capability to run from the command line.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>Modules initially exist for <span class=
|
|
"application">tesseract</span> (Linux and Windows),
|
|
and <span class="application">ABBYY FineReader</span>
|
|
(Linux, tested with version 11). ABBYY FineReader is
|
|
a commercial closed source program, but it sometimes
|
|
performs better than tesseract.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>The OCR is currently only called from the PDF
|
|
handler, but there should be no problem using it for
|
|
other image types.</p>
|
|
</li>
|
|
</ul>
|
|
</div>
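<p>The two-layer caching idea described in the list above can be sketched as follows. This is an illustration only, not the actual <span class="application">Recoll</span> code: the class name, the in-memory dictionaries (the real cache is stored as files) and the choice of SHA-256 are all assumptions made for the example.</p>

```python
import hashlib
import os


class OcrCache(object):
    """Toy cache: a path-based layer in front of a content-hash layer."""

    def __init__(self):
        self.by_path = {}  # path -> ((mtime_ns, size), content digest)
        self.by_hash = {}  # content digest -> OCR text

    def _digest(self, path):
        h = hashlib.sha256()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(65536), b''):
                h.update(chunk)
        return h.hexdigest()

    def get_text(self, path, run_ocr):
        st = os.stat(path)
        signature = (st.st_mtime_ns, st.st_size)
        entry = self.by_path.get(path)
        if entry is not None and entry[0] == signature:
            # Fast layer: same path, same time stamp and size, no hashing.
            digest = entry[1]
        else:
            # File is new at this path, moved or modified: hash the data.
            digest = self._digest(path)
            self.by_path[path] = (signature, digest)
        if digest not in self.by_hash:
            # Unknown contents: this is the only place where OCR runs.
            self.by_hash[digest] = run_ocr(path)
        return self.by_hash[digest]
```

<p>Here <code class="literal">run_ocr</code> stands for any callable which takes a file path and returns text. A renamed file costs one hash computation but no OCR run, which matches the behaviour described above.</p>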
|
|
<p>To enable this feature, you need to install one of the
|
|
supported OCR applications (<span class=
|
|
"application">tesseract</span> or <span class=
|
|
"application">ABBYY</span>), enable OCR in the PDF handler,
|
|
and tell <span class="application">Recoll</span> where the
|
|
appropriate command resides. The last parts are done by
|
|
setting configuration variables. See the <a class="link"
|
|
href="#RCL.INSTALL.CONFIG.RECOLLCONF.OCR" title=
|
|
"Parameters for OCR processing">relevant section</a>. All
|
|
parameters can be localized in subdirectories through the
|
|
usual main configuration mechanism (path sections).</p>
|
|
</div>
|
|
<div class="sect1">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h2 class="title" style="clear: both"><a name=
|
|
"RCL.INDEXING.PERIODIC" id=
|
|
"RCL.INDEXING.PERIODIC"></a>2.10. Periodic
|
|
indexing</h2>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<div class="simplesect">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.INDEXING.PERIODIC.EXEC" id=
|
|
"RCL.INDEXING.PERIODIC.EXEC"></a>Running the
|
|
indexer</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>The <span class=
|
|
"command"><strong>recollindex</strong></span> program
|
|
performs index updates. You can start it either from the
|
|
command line or from the <span class=
|
|
"guimenu">File</span> menu in the <span class=
|
|
"command"><strong>recoll</strong></span> GUI program.
|
|
When started from the GUI, the indexing will run on the
|
|
same configuration <span class=
|
|
"command"><strong>recoll</strong></span> was started on.
|
|
When started from the command line, <span class=
|
|
"command"><strong>recollindex</strong></span> will use
|
|
the <code class="envar">RECOLL_CONFDIR</code> variable or
|
|
accept a <code class="option">-c</code> <em class=
|
|
"replaceable"><code>confdir</code></em> option to specify
|
|
a non-default configuration directory.</p>
|
|
<p>If the <span class=
|
|
"command"><strong>recoll</strong></span> program finds no
|
|
index when it starts, it will automatically start
|
|
indexing (unless canceled).</p>
|
|
<p>The GUI <span class="guimenu">File</span> menu has
|
|
entries to start or stop the current indexing operation.
|
|
When indexing is not currently running, you have a choice
|
|
between <span class="guimenuitem">Update Index</span> or
|
|
<span class="guimenuitem">Rebuild Index</span>. The first
|
|
choice only processes changed files, the second one
|
|
erases the index before starting so that all files are
|
|
processed.</p>
|
|
<p>On Linux and Windows, the GUI can be used to manage
|
|
the indexing operation. Stopping the indexer can be done
|
|
from the <span class=
|
|
"command"><strong>recoll</strong></span> GUI <span class=
|
|
"guimenu">File</span> → <span class="guimenuitem">Stop
|
|
Indexing</span> menu entry.</p>
|
|
<p>On Linux, the <span class=
|
|
"command"><strong>recollindex</strong></span> indexing
|
|
process can be interrupted by sending an interrupt
|
|
(<span class="keysym">Ctrl-C</span>, SIGINT) or terminate
|
|
(SIGTERM) signal.</p>
|
|
<p>When stopped, some time may elapse before <span class=
|
|
"command"><strong>recollindex</strong></span> exits,
|
|
because it needs to properly flush and close the
|
|
index.</p>
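<p>For scripts which supervise the indexer themselves, the stop sequence described above (send a terminate signal, then allow time for the index to be flushed) could look like the following sketch. The function name and the default timeout are arbitrary choices for the example.</p>

```python
import signal
import subprocess


def stop_indexer(proc, timeout=60):
    """Send SIGTERM to a child indexer and wait for it to exit.

    Returns the exit status, or None if the process is still running
    after `timeout` seconds (it may still be flushing the index).
    """
    proc.send_signal(signal.SIGTERM)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        return None
```

<p>A caller would typically create the process with <code class="literal">subprocess.Popen(["recollindex"])</code> and call <code class="literal">stop_indexer(proc)</code> later. A generous timeout is preferable to a forced kill, since the indexer needs time to close the index properly.</p>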
|
|
<p>After an interruption, the index will be somewhat
|
|
inconsistent because some operations which are normally
|
|
performed at the end of the indexing pass will have been
|
|
skipped (for example, the stemming and spelling databases
|
|
will be missing or out of date). You just need to
|
|
restart indexing at a later time to restore consistency.
|
|
The indexing will restart at the interruption point (the
|
|
full file tree will be traversed, but files that were
|
|
indexed up to the interruption and for which the index is
|
|
still up to date will not need to be reindexed).</p>
|
|
</div>
|
|
<div class="simplesect">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.INDEXING.PERIODIC.CMDLINE" id=
|
|
"RCL.INDEXING.PERIODIC.CMDLINE"></a>recollindex
|
|
command line</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p><span class=
|
|
"command"><strong>recollindex</strong></span> has many
|
|
options which are listed in its <a class="ulink" href=
|
|
"https://www.lesbonscomptes.com/recoll/manpages/recollindex.1.html"
|
|
target="_top">manual page</a>. Only a few will be
|
|
described here.</p>
|
|
<p>Option <code class="option">-z</code> will reset the
|
|
index when starting. This is almost the same as
|
|
destroying the index files (the nuance is that the
|
|
<span class="application">Xapian</span> format version
|
|
will not be changed).</p>
|
|
<p>Option <code class="option">-Z</code> will force the
|
|
update of all documents without resetting the index
|
|
first. This will not have the "clean start" aspect of
|
|
<code class="option">-z</code>, but the advantage is that
|
|
the index will remain available for querying while it is
|
|
rebuilt, which can be a significant advantage if it is
|
|
very big (some installations need days for a full index
|
|
rebuild).</p>
|
|
<p>Option <code class="option">-k</code> will force
|
|
retrying files which previously failed to be indexed, for
|
|
example because of a missing helper program.</p>
|
|
<p>Also of possible interest are the <code class=
|
|
"option">-i</code> and <code class="option">-f</code>
|
|
options. <code class="option">-i</code> allows indexing
|
|
an explicit list of files (given as command line
|
|
parameters or read from <code class=
|
|
"literal">stdin</code>). <code class="option">-f</code>
|
|
tells <span class=
|
|
"command"><strong>recollindex</strong></span> to ignore
|
|
file selection parameters from the configuration.
|
|
Together, these options allow building a custom file
|
|
selection process for some area of the file system, by
|
|
adding the top directory to the <code class=
|
|
"varname">skippedPaths</code> list and using an
|
|
appropriate file selection method to build the file list
|
|
to be fed to <span class=
|
|
"command"><strong>recollindex</strong></span>
|
|
<code class="option">-if</code>. Trivial example:</p>
|
|
<pre class="programlisting">
|
|
find . -name indexable.txt -print | recollindex -if
|
|
</pre>
|
|
<p><span class=
|
|
"command"><strong>recollindex</strong></span>
|
|
<code class="option">-i</code> will not descend into
|
|
subdirectories specified as parameters, but just add them
|
|
as index entries. It is up to the external file selection
|
|
method to build the complete file list.</p>
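<p>When the selection rule is more involved than what <span class="command"><strong>find</strong></span> can express, the same approach can be scripted. The sketch below assumes that <span class="command"><strong>recollindex</strong></span> <code class="option">-if</code> accepts one path per line on its standard input, as the <span class="command"><strong>find</strong></span> example suggests; the function names and the selection rule are placeholders.</p>

```python
import os
import subprocess


def select_files(top, wanted_name="indexable.txt"):
    """Return the sorted paths under `top` whose base name is `wanted_name`."""
    selected = []
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            if name == wanted_name:
                selected.append(os.path.join(dirpath, name))
    return sorted(selected)


def index_files(paths, command=("recollindex", "-if")):
    """Feed the paths, one per line, to the indexer's standard input."""
    if not paths:
        return 0
    data = "\n".join(paths) + "\n"
    return subprocess.run(list(command), input=data, text=True).returncode
```

<p>Calling <code class="literal">index_files(select_files("."))</code> is then equivalent to the shell pipeline above. Because <code class="option">-i</code> does not descend into directories, the walk has to list every file explicitly.</p>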
|
|
</div>
|
|
<div class="simplesect">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.INDEXING.PERIODIC.AUTOMAT" id=
|
|
"RCL.INDEXING.PERIODIC.AUTOMAT"></a>Linux: using
|
|
<span class="command"><strong>cron</strong></span>
|
|
to automate indexing</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>The most common way to set up indexing is to have a
|
|
cron task execute it every night. For example, the
|
|
following <code class="filename">crontab</code> entry
|
|
would do it every day at 3:30AM (supposing <span class=
|
|
"command"><strong>recollindex</strong></span> is in your
|
|
PATH):</p>
|
|
<pre class="screen">
|
|
30 3 * * * recollindex > /some/tmp/dir/recolltrace 2>&1
|
|
</pre>
|
|
<p>Or, using <span class=
|
|
"command"><strong>anacron</strong></span>:</p>
|
|
<pre class="screen">
|
|
1 15 recollindex su mylogin -c "recollindex > /tmp/rcltraceme 2>&1"
|
|
</pre>
|
|
<p>The <span class="application">Recoll</span> GUI has
|
|
dialogs to manage <code class="filename">crontab</code>
|
|
entries for <span class=
|
|
"command"><strong>recollindex</strong></span>. You can
|
|
reach them from the <span class=
|
|
"guimenu">Preferences</span> → <span class=
|
|
"guimenuitem">Indexing Schedule</span> menu. They only
|
|
work with the good old <span class=
|
|
"command"><strong>cron</strong></span>, and do not give
|
|
access to all features of <span class=
|
|
"command"><strong>cron</strong></span> scheduling.
|
|
Entries created via the tool are marked with a
|
|
<code class="literal">RCLCRON_RCLINDEX=</code> marker so
|
|
that the tool knows which entries belong to it. As a side
|
|
effect, this sets an environment variable for the
|
|
process, but it is not actually used; it is just a
|
|
marker.</p>
|
|
<p>The usual command to edit your <code class=
|
|
"filename">crontab</code> is <span class=
|
|
"command"><strong>crontab</strong></span> <code class=
|
|
"option">-e</code> (which will usually start the
|
|
<span class="command"><strong>vi</strong></span> editor
|
|
to edit the file). You may have more sophisticated tools
|
|
available on your system.</p>
|
|
<p>Please be aware that there may be differences between
|
|
your usual interactive command line environment and the
|
|
one seen by crontab commands. In particular, the PATH
|
|
variable may be of concern. Please check the crontab
|
|
manual pages about possible issues.</p>
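<p>One way to check the PATH issue ahead of time is to resolve the command against the search path that cron is expected to use. The sketch below relies on the standard <code class="literal">shutil.which()</code> function; the value <code class="literal">/usr/bin:/bin</code> used in the note after it is only a common default and is an assumption about your cron implementation.</p>

```python
import shutil


def resolve_in_path(command, search_path):
    """Return the executable that `command` resolves to under `search_path`.

    `search_path` is a colon-separated directory list, as in PATH.
    Returns None when the command would not be found.
    """
    return shutil.which(command, path=search_path)
```

<p>If <code class="literal">resolve_in_path("recollindex", "/usr/bin:/bin")</code> returns <code class="literal">None</code>, either use the full path of the command in the <code class="filename">crontab</code> entry or set PATH explicitly there.</p>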
|
|
</div>
|
|
</div>
|
|
<div class="sect1">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h2 class="title" style="clear: both"><a name=
|
|
"RCL.INDEXING.MONITOR" id=
|
|
"RCL.INDEXING.MONITOR"></a>2.11. <span class=
|
|
"application">Unix</span>-like systems: real time
|
|
indexing</h2>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>Real time monitoring/indexing is performed by starting
|
|
the <span class=
|
|
"command"><strong>recollindex</strong></span> <code class=
|
|
"option">-m</code> command. With this option, <span class=
|
|
"command"><strong>recollindex</strong></span> will detach
|
|
from the terminal and become a daemon, permanently
|
|
monitoring file changes and updating the index.</p>
|
|
<p>In this situation, the <span class=
|
|
"command"><strong>recoll</strong></span> GUI <span class=
|
|
"guimenu">File</span> menu makes two operations available:
|
|
<span class="guimenuitem">Stop</span> and <span class=
|
|
"guimenuitem">Trigger incremental pass</span>.</p>
|
|
<p><span class="guimenuitem">Trigger incremental
|
|
pass</span> has the same effect as restarting the indexer,
|
|
and will cause a complete walk of the indexed area,
|
|
processing the changed files, then switch to monitoring.
|
|
This is only marginally useful, maybe in cases where the
|
|
indexer is configured to delay updates, or to force an
|
|
immediate rebuild of the stemming and phonetic data, which
|
|
are only processed at intervals by the real time
|
|
indexer.</p>
|
|
<p>While it is convenient that data is indexed in real
|
|
time, repeated indexing can generate a significant load on
|
|
the system when files such as email folders change. Also,
|
|
monitoring large file trees by itself significantly taxes
|
|
system resources. You probably do not want to enable it if
|
|
your system is short on resources. Periodic indexing is
|
|
adequate in most cases.</p>
|
|
<p>As of <span class="application">Recoll</span> 1.24, you
|
|
can set the <a class="link" href=
|
|
"#RCL.INSTALL.CONFIG.RECOLLCONF.MONITORDIRS">monitordirs</a>
|
|
configuration variable to specify that only a subset of
|
|
your indexed files will be monitored for instant indexing.
|
|
In this situation, an incremental pass on the full tree can
|
|
be triggered by either restarting the indexer, or just
|
|
running <span class=
|
|
"command"><strong>recollindex</strong></span>, which will
|
|
notify the running process. The <span class=
|
|
"command"><strong>recoll</strong></span> GUI also has a
|
|
menu entry for this.</p>
|
|
<div class="simplesect">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.INDEXING.MONITOR.START.SYSTEMD" id=
|
|
"RCL.INDEXING.MONITOR.START.SYSTEMD"></a>Automatic
|
|
daemon start with systemd</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>The installation contains two example files (in
|
|
<code class="filename">share/recoll/examples</code>) for
|
|
starting the indexing daemon with <span class=
|
|
"application">systemd</span>.</p>
|
|
<p><code class="filename">recollindex-user.service</code>
|
|
would be used for starting <span class=
|
|
"command"><strong>recollindex</strong></span> as a user
|
|
service, and can be installed with the following
|
|
commands:</p>
|
|
<pre class=
"programlisting">systemctl --user link /usr/share/recoll/examples/recollindex-user.service
systemctl --user enable --now recollindex-user.service</pre>
|
|
<p>The indexer will start when the user logs in and run
|
|
while there is a session open for them.</p>
|
|
<p><code class=
|
|
"filename">recollindex-system.service</code> would be
|
|
used for starting the indexer at boot time, running as a
|
|
specific user. It can be useful when running the text
|
|
search as a shared service (e.g. when users access it
|
|
through the WEB UI). You will need to edit it to replace
|
|
the @SOMEUSER@ value with something which makes sense in
|
|
your case, then install it as a regular <span class=
|
|
"application">systemd</span> system service. Of course,
|
|
if you want to run several such units, you will also need
|
|
to rename the installed file.</p>
|
|
</div>
|
|
<div class="simplesect">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.INDEXING.MONITOR.START" id=
|
|
"RCL.INDEXING.MONITOR.START"></a>Automatic daemon
|
|
start from the desktop session</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>Under <span class="application">KDE</span>,
|
|
<span class="application">Gnome</span> and some other
|
|
desktop environments, the daemon can be automatically
|
|
started when you log in, by creating a desktop file
|
|
inside the <code class=
|
|
"filename">~/.config/autostart</code> directory. This can
|
|
be done for you by the <span class=
"application">Recoll</span> GUI. Use the <span class=
"guimenu">Preferences</span> → <span class="guimenuitem">Indexing Schedule</span>
|
|
menu.</p>
|
|
<p>With older <span class="application">X11</span>
|
|
setups, starting the daemon is normally performed as part
|
|
of the user session script.</p>
|
|
<p>The <code class="filename">rclmon.sh</code> script can
|
|
be used to easily start and stop the daemon. It can be
|
|
found in the <code class="filename">examples</code>
|
|
directory (typically <code class=
|
|
"filename">/usr/local/[share/]recoll/examples</code>).</p>
|
|
<p>For example, a good old <span class=
|
|
"application">xdm</span>-based session could have a
|
|
<code class="filename">.xsession</code> script with the
|
|
following lines at the end:</p>
|
|
<pre class="programlisting">recollconf=$HOME/.recoll-home
recolldata=/usr/local/share/recoll
RECOLL_CONFDIR=$recollconf $recolldata/examples/rclmon.sh start

fvwm
</pre>
|
|
<p>The indexing daemon gets started, then the window
|
|
manager, for which the session waits.</p>
|
|
<p>By default the indexing daemon will monitor the state
|
|
of the X11 session, and exit when it finishes, so it is not
|
|
necessary to kill it explicitly. (The <span class=
|
|
"application">X11</span> server monitoring can be
|
|
disabled with option <code class="option">-x</code> to
<span class=
"command"><strong>recollindex</strong></span>.)</p>
|
|
<p>If you use the daemon completely out of an
|
|
<span class="application">X11</span> session, you need to
|
|
add option <code class="option">-x</code> to disable
|
|
<span class="application">X11</span> session monitoring
|
|
(else the daemon will not start).</p>
|
|
</div>
|
|
<div class="simplesect">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.INDEXING.MONITOR.DETAILS" id=
|
|
"RCL.INDEXING.MONITOR.DETAILS"></a>Miscellaneous
|
|
details</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>By default, the messages from the indexing daemon will
|
|
be sent to the same file as those from the interactive
|
|
commands (<code class="literal">logfilename</code>). You
|
|
may want to change this by setting the <code class=
|
|
"varname">daemlogfilename</code> and <code class=
|
|
"varname">daemloglevel</code> configuration parameters.
|
|
Also, the log file will only be truncated when the daemon
|
|
starts. If the daemon runs permanently, the log file may
|
|
grow quite big, depending on the log level.</p>
|
|
<p><b>Increasing resources for inotify. </b>On Linux
|
|
systems, monitoring a big tree may require increasing the
|
|
resources available to inotify, which are normally
|
|
defined in <code class=
|
|
"filename">/etc/sysctl.conf</code>.</p>
|
|
<pre class="programlisting">
### inotify
#
# cat /proc/sys/fs/inotify/max_queued_events - 16384
# cat /proc/sys/fs/inotify/max_user_instances - 128
# cat /proc/sys/fs/inotify/max_user_watches - 16384
#
# -- Change to:
#
fs.inotify.max_queued_events=32768
fs.inotify.max_user_instances=256
fs.inotify.max_user_watches=32768
</pre>
|
|
<p>In particular, you will need to trim your tree or adjust
|
|
the <code class="literal">max_user_watches</code> value
|
|
if indexing exits with a message about errno <code class=
|
|
"literal">ENOSPC</code> (28) from <code class=
|
|
"function">inotify_add_watch</code>.</p>
|
|
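<p>To estimate whether a tree fits within the current limit, you can compare its number of directories with the kernel setting. The sketch below assumes that the monitor needs roughly one inotify watch per directory; treat the result as an order-of-magnitude check rather than an exact requirement.</p>

```python
import os


def count_directories(top):
    """Count `top` and every directory below it (symlinks not followed)."""
    return sum(1 for _ in os.walk(top))


def read_watch_limit(path="/proc/sys/fs/inotify/max_user_watches"):
    """Read the per-user inotify watch limit from the given file."""
    with open(path) as f:
        return int(f.read().strip())
```

<p>If <code class="literal">count_directories(top)</code> comes close to <code class="literal">read_watch_limit()</code>, raise <code class="literal">fs.inotify.max_user_watches</code> or restrict the monitored area. Keep in mind that other applications draw from the same per-user limit.</p>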
<p><b>Slowing down the reindexing rate for fast changing
|
|
files. </b>When using the real time monitor, it may
|
|
happen that some files need to be indexed, but change so
|
|
often that they impose an excessive load for the system.
|
|
<span class="application">Recoll</span> provides a
|
|
configuration option to specify a minimum interval during
|
|
which a file, specified by a wildcard pattern, cannot be
|
|
reindexed. See the <code class=
|
|
"varname">mondelaypatterns</code> parameter in the
|
|
<a class="link" href=
|
|
"#RCL.INSTALL.CONFIG.RECOLLCONF.MISC" title=
|
|
"Miscellaneous parameters">configuration section</a>.</p>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<div class="chapter">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h1 class="title"><a name="RCL.SEARCH" id=
|
|
"RCL.SEARCH"></a>Chapter 3. Searching</h1>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<div class="sect1">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h2 class="title" style="clear: both"><a name=
|
|
"RCL.SEARCH.INTRODUCTION" id=
|
|
"RCL.SEARCH.INTRODUCTION"></a>3.1. Introduction</h2>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>Getting answers to specific queries is of course the
|
|
whole point of <span class="application">Recoll</span>. The
|
|
multiple provided interfaces always understand simple
|
|
queries made of one or several words, and return
|
|
appropriate results in most cases.</p>
|
|
<p>In order to make the most of <span class=
|
|
"application">Recoll</span> though, it may be worthwhile to
|
|
understand how it processes your input. Five different
|
|
modes exist:</p>
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style="list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p>In <code class="literal">All Terms</code> mode,
|
|
<span class="application">Recoll</span> looks for
|
|
documents containing all your input terms.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><code class="literal">Query Language</code> mode
|
|
behaves like <code class="literal">All Terms</code>
|
|
in the absence of special input, but it can also do
|
|
much more. This is the best mode for getting the most out
|
|
of <span class="application">Recoll</span>.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>In <code class="literal">Any Term</code> mode,
|
|
<span class="application">Recoll</span> looks for
|
|
documents containing any of your input terms, preferring
|
|
those which contain more.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>In <code class="literal">File Name</code> mode,
|
|
<span class="application">Recoll</span> will only
|
|
match file names, not content. Using a small subset
|
|
of the index allows things like left-hand wildcards
|
|
without performance issues, and may sometimes be
|
|
useful.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>The GUI <code class="literal">Advanced
|
|
Search</code> mode is actually not more powerful than
|
|
the query language, but it helps you build complex
|
|
queries without having to remember the language, and
|
|
avoids any interpretation ambiguity, as it bypasses
|
|
the user input parser.</p>
|
|
</li>
|
|
</ul>
|
|
</div>
|
|
<p>These five input modes are supported by the different
|
|
user interfaces which are described in the following
|
|
sections.</p>
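<p>The difference between the <code class="literal">All Terms</code> and <code class="literal">Any Term</code> modes can be illustrated with a toy model in which each document is reduced to a set of words. This is a simplification for explanation only: the real relevance ranking is more elaborate than counting matching terms.</p>

```python
def all_terms(docs, terms):
    """Names of the documents which contain every term."""
    return sorted(name for name, words in docs.items()
                  if all(t in words for t in terms))


def any_term(docs, terms):
    """Names of the documents which contain at least one term,
    those matching more terms first."""
    scored = []
    for name, words in docs.items():
        hits = sum(1 for t in terms if t in words)
        if hits:
            scored.append((-hits, name))
    return [name for _, name in sorted(scored)]
```

<p>With documents <code class="literal">a.txt</code> (containing "virtual" and "reality") and <code class="literal">b.txt</code> (containing only "virtual"), a search for both words returns only <code class="literal">a.txt</code> in the first mode, and both documents, <code class="literal">a.txt</code> first, in the second.</p>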
|
|
</div>
|
|
<div class="sect1">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h2 class="title" style="clear: both"><a name=
|
|
"RCL.SEARCH.GUI" id=
|
|
"RCL.SEARCH.GUI"></a>3.2. Searching with the Qt
|
|
graphical user interface</h2>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>The <span class="command"><strong>recoll</strong></span>
|
|
program provides the main user interface for searching. It
|
|
is based on the <span class="application">Qt</span>
|
|
library.</p>
|
|
<p><span class="command"><strong>recoll</strong></span> has
|
|
two search interfaces:</p>
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style="list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p>Simple search (the default, on the main screen)
|
|
has a single entry field where you can enter multiple
|
|
words.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>Advanced search (a panel accessed through the
|
|
<span class="guilabel">Tools</span> menu or the
|
|
toolbox bar icon) has multiple entry fields, which
|
|
you may use to build a logical condition, with
|
|
additional filtering on file type, location in the
|
|
file system, modification date, and size.</p>
|
|
</li>
|
|
</ul>
|
|
</div>
|
|
<p>In most cases, you can enter the terms as you think of
|
|
them, even if they contain embedded punctuation or other
|
|
non-textual characters (e.g. <span class=
|
|
"application">Recoll</span> can handle things like email
|
|
addresses).</p>
|
|
<p>The main case where you should enter text differently
|
|
from how it is printed is for East Asian languages
|
|
(Chinese, Japanese, Korean). Words composed of single or
|
|
multiple characters should be entered separated by white
|
|
space in this case (they would typically be printed without
|
|
white space).</p>
|
|
<p>Some searches can be quite complex, and you may want to
|
|
re-use them later, perhaps with some tweaking. <span class=
|
|
"application">Recoll</span> can save and restore searches.
|
|
See <a class="link" href="#RCL.SEARCH.SAVING" title=
|
|
"3.2.15. Saving and restoring queries (1.21 and later)">
|
|
Saving and restoring queries</a>.</p>
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name="RCL.SEARCH.GUI.SIMPLE"
|
|
id="RCL.SEARCH.GUI.SIMPLE"></a>3.2.1. Simple
|
|
search</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<div class="procedure">
|
|
<ol class="procedure" type="1">
|
|
<li class="step">
|
|
<p>Start the <span class=
|
|
"command"><strong>recoll</strong></span>
|
|
program.</p>
|
|
</li>
|
|
<li class="step">
|
|
<p>Possibly choose a search mode: <span class=
|
|
"guilabel">Any term</span>, <span class=
|
|
"guilabel">All terms</span>, <span class=
|
|
"guilabel">File name</span> or <span class=
|
|
"guilabel">Query language</span>.</p>
|
|
</li>
|
|
<li class="step">
|
|
<p>Enter search term(s) in the text field at the
|
|
top of the window.</p>
|
|
</li>
|
|
<li class="step">
|
|
<p>Click the <span class="guilabel">Search</span>
|
|
button or hit the <span class=
|
|
"keycap"><strong>Enter</strong></span> key to start
|
|
the search.</p>
|
|
</li>
|
|
</ol>
|
|
</div>
|
|
<p>The initial default search mode is <span class=
|
|
"guilabel">Query language</span>. Without special
|
|
directives, this will look for documents containing all
|
|
of the search terms (the ones with more terms will get
|
|
better scores), just like the <span class="guilabel">All
|
|
terms</span> mode. <span class="guilabel">Any term</span>
|
|
will search for documents where at least one of the terms
|
|
appears. <span class="guilabel">File name</span> will
|
|
exclusively look for file names, not contents.</p>
|
|
<p>All search modes allow terms to be expanded with
|
|
wildcard characters (<code class="literal">*</code>,
|
|
<code class="literal">?</code>, <code class=
|
|
"literal">[]</code>). See the <a class="link" href=
|
|
"#RCL.SEARCH.WILDCARDS" title=
|
|
"3.6.1. More about wildcards">section about
|
|
wildcards</a> for more details.</p>
|
|
<p>In all modes except <span class="guilabel">File
|
|
name</span>, you can search for exact phrases (adjacent
|
|
words in a given order) by enclosing the input inside
|
|
double quotes. Ex: <code class="literal">"virtual
|
|
reality"</code>.</p>
|
|
<p>The <span class="guilabel">Query Language</span>
|
|
features are described in <a class="link" href=
|
|
"#RCL.SEARCH.LANG" title="3.5. The query language">a
|
|
separate section</a>.</p>
|
|
<p>When using a stripped index (the default), character
|
|
case has no influence on search, except that you can
|
|
disable stem expansion for any term by capitalizing it.
|
|
For example, a search for <code class="literal">floor</code> will
|
|
also normally look for <code class=
|
|
"literal">flooring</code>, <code class=
|
|
"literal">floored</code>, etc., but a search for
|
|
<code class="literal">Floor</code> will only look for
|
|
<code class="literal">floor</code>, in any character
|
|
case. Stemming can also be disabled globally in the
|
|
preferences. When using a raw index, <a class="link"
|
|
href="#RCL.SEARCH.CASEDIAC" title=
|
|
"3.9. Search case and diacritics sensitivity">the
|
|
rules are a bit more complicated</a>.</p>
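<p>The capitalization rule for a stripped index can be summarized by the following sketch. The stem table is a made-up stand-in for the stemming database, and the function name is hypothetical.</p>

```python
# Hypothetical stem database: lowercase stem mapped to its known variants.
STEM_DB = {"floor": ["floor", "flooring", "floored"]}


def expand_term(term, stem_db=STEM_DB):
    """Return the lowercase terms to look up for one input word."""
    lowered = term.lower()
    if term[:1].isupper():
        # Capitalized input: no stem expansion, match the word itself only.
        return [lowered]
    return list(stem_db.get(lowered, [lowered]))
```

<p>In this model <code class="literal">floor</code> expands to three terms while <code class="literal">Floor</code> expands to one. Both results are lowercase, because a stripped index does not preserve character case.</p>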
|
|
<p><span class="application">Recoll</span> remembers the
|
|
last few searches that you performed. You can directly
|
|
access the search history by clicking the clock button on
|
|
the right of the search entry, while the latter is empty.
|
|
Otherwise, the history is used for entry completion (see
|
|
below). Only the search texts are remembered, not the mode
|
|
(all/any/file name).</p>
|
|
<p>While text is entered in the search area, <span class=
|
|
"command"><strong>recoll</strong></span> will display
|
|
possible completions, filtered from the history and the
|
|
index search terms. This can be disabled with a GUI
|
|
Preferences option.</p>
|
|
<p>Double-clicking on a word in the result list or a
|
|
preview window will insert it into the simple search
|
|
entry field.</p>
|
|
<p>You can cut and paste any text into an <span class=
|
|
"guilabel">All terms</span> or <span class="guilabel">Any
|
|
term</span> search field, punctuation, newlines and all -
|
|
except for wildcard characters (single <code class=
|
|
"literal">?</code> characters are ok). <span class=
|
|
"application">Recoll</span> will process it and produce a
|
|
meaningful search. This is what most differentiates this
|
|
mode from the <span class="guilabel">Query
|
|
Language</span> mode, where you have to care about the
|
|
syntax.</p>
|
|
<p>You can use the <a class="link" href=
|
|
"#RCL.SEARCH.GUI.COMPLEX" title=
|
|
"3.2.8. Complex/advanced search"><span class=
|
|
"guimenu">Tools</span> → <span class=
|
|
"guimenuitem">Advanced search</span></a> dialog for more
|
|
complex searches.</p>
|
|
<p>The <span class="guilabel">File name</span> search
|
|
mode will specifically look for file names. The point of
|
|
having a separate file name search is that wild card
|
|
expansion can be performed more efficiently on a small
|
|
subset of the index (allowing wild cards on the left of
|
|
terms without excessive cost). Things to know:</p>
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style=
|
|
"list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p>White space in the entry should match white
|
|
space in the file name, and is not treated
|
|
specially.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>The search is insensitive to character case and
|
|
accents, independently of the type of index.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>An entry without any wild card character and not
|
|
capitalized will be prepended and appended with '*'
|
|
(e.g. <em class="replaceable"><code>etc</code></em>
|
|
-> <em class=
|
|
"replaceable"><code>*etc*</code></em>, but
|
|
<em class="replaceable"><code>Etc</code></em> ->
|
|
<em class="replaceable"><code>etc</code></em>).</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>If you have a big index (many files),
|
|
excessively generic fragments may result in
|
|
inefficient searches.</p>
|
|
</li>
|
|
</ul>
|
|
</div>
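<p>The wildcard rule from the list above can be sketched with the standard <code class="literal">fnmatch</code> module. This only models the documented wrapping and case rules; accent folding is left out, and lowercasing both sides is an assumption about how case insensitivity could be obtained.</p>

```python
import fnmatch

WILDCARDS = "*?["


def filename_pattern(entry):
    """Turn a File name search entry into a lowercase glob pattern."""
    has_wildcard = any(c in entry for c in WILDCARDS)
    capitalized = entry[:1].isupper()
    pattern = entry.lower()
    if not has_wildcard and not capitalized:
        # Plain lowercase entry: match it anywhere inside the name.
        pattern = "*" + pattern + "*"
    return pattern


def matches(entry, filename):
    """Case-insensitive match of a file name against a search entry."""
    return fnmatch.fnmatchcase(filename.lower(), filename_pattern(entry))
```

<p>For instance, <code class="literal">etc</code> matches a file named <code class="filename">fetch.c</code> because the pattern becomes <code class="literal">*etc*</code>, whereas <code class="literal">Etc</code> only matches a file whose whole name is <code class="filename">etc</code> in any character case.</p>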
|
|
</div>
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name="RCL.SEARCH.GUI.RESLIST"
|
|
id="RCL.SEARCH.GUI.RESLIST"></a>3.2.2. The
|
|
result list</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>After starting a search, a list of results will
|
|
instantly be displayed in the main window.</p>
|
|
<p>By default, the document list is presented in order of
|
|
relevance (how well the system estimates that the
|
|
document matches the query). You can sort the result by
|
|
ascending or descending date by using the vertical arrows
|
|
in the toolbar.</p>
|
|
<p>Clicking the <code class="literal">Preview</code> link
|
|
for an entry will open an internal preview window for the
|
|
document. Further <code class="literal">Preview</code>
|
|
clicks for the same search will open tabs in the existing
|
|
preview window. You can use <span class=
|
|
"keycap"><strong>Shift</strong></span>+Click to force the
|
|
creation of another preview window, which may be useful
|
|
to view the documents side by side. (You can also browse
|
|
successive results in a single preview window by typing
|
|
<span class=
|
|
"keycap"><strong>Shift</strong></span>+<span class=
|
|
"keycap"><strong>ArrowUp/Down</strong></span> in the
|
|
window).</p>
|
|
<p>Clicking the <code class="literal">Open</code> link
|
|
will start an external viewer for the document. By
|
|
default, <span class="application">Recoll</span> lets the
|
|
desktop choose the appropriate application for most
|
|
document types. See <a class="link" href=
"#RCL.SEARCH.GUI.RESLIST.APPLICATIONS" title=
"Customising the applications">below</a> for
|
|
customizing the applications.</p>
|
|
<p>You can click on the <code class="literal">Query
|
|
details</code> link at the top of the results page to see
|
|
the query actually performed, after stem expansion and
|
|
other processing.</p>
|
|
<p>Double-clicking on any word inside the result list or
|
|
a preview window will insert it into the simple search
|
|
text.</p>
|
|
<p>The result list is divided into pages (the size of
|
|
which you can change in the preferences). Use the arrow
|
|
buttons in the toolbar or the links at the bottom of the
|
|
page to browse the results.</p>
|
|
<p>The <code class="literal">Preview</code> and
|
|
<code class="literal">Open</code> links may not be
|
|
present for all entries, meaning that <span class=
|
|
"application">Recoll</span> has no configured way to
|
|
preview a given file type (which was indexed by name
|
|
only), or no configured external editor for the file
|
|
type. This can sometimes be adjusted simply by tweaking
|
|
the <a class="link" href="#RCL.INSTALL.CONFIG.MIMEMAP"
|
|
title="5.4.4. The mimemap file"><code class=
|
|
"filename">mimemap</code></a> and <a class="link" href=
|
|
"#RCL.INSTALL.CONFIG.MIMEVIEW" title=
|
|
"5.4.6. The mimeview file"><code class=
|
|
"filename">mimeview</code></a> configuration files (the
|
|
latter can be modified with the user preferences
|
|
dialog).</p>
|
|
<p>The format of the result list entries is entirely
|
|
configurable by using the preference dialog to <a class=
|
|
"link" href="#RCL.SEARCH.GUI.CUSTOM.RESLIST" title=
|
|
"The result list format">edit an HTML fragment</a>.</p>
|
|
<div class="simplesect">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h4 class="title"><a name=
|
|
"RCL.SEARCH.GUI.RESLIST.APPLICATIONS" id=
|
|
"RCL.SEARCH.GUI.RESLIST.APPLICATIONS"></a>Customising
|
|
the applications</h4>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>By default <span class="application">Recoll</span>
|
|
lets the desktop choose what application should be used
|
|
to open a given document, with exceptions.</p>
|
|
<p>The details of this behaviour can be customized with
|
|
the <span class="guimenu">Preferences</span> →
|
|
<span class="guimenuitem">GUI configuration</span> →
|
|
<span class="guimenuitem">User interface</span> →
|
|
<span class="guimenuitem">Choose editor
|
|
applications</span> dialog or by editing the <a class=
|
|
"link" href="#RCL.INSTALL.CONFIG.MIMEVIEW" title=
|
|
"5.4.6. The mimeview file"><code class=
|
|
"filename">mimeview</code> configuration file</a>.</p>
|
|
<p>When <span class="guilabel">Use desktop
|
|
preferences</span>, at the top of the dialog, is
|
|
checked, the desktop default is generally used, but
|
|
there is a small default list of exceptions, for MIME
|
|
types where the <span class="application">Recoll</span>
|
|
choice should override the desktop one. These are
|
|
applications which are well integrated with
|
|
<span class="application">Recoll</span>, for example,
|
|
on Linux, <span class="application">evince</span> for
|
|
viewing PDF and Postscript files because of its support
|
|
for opening the document at a specific page and passing
|
|
a search string as an argument. You can add or remove
|
|
document types to the exceptions by using the
|
|
dialog.</p>
|
|
<p>If you prefer to completely customize the choice of
|
|
applications, you can uncheck <span class=
|
|
"guilabel">Use desktop preferences</span>, in which
|
|
case the <span class="application">Recoll</span>
|
|
predefined applications will be used, and can be
|
|
changed for each document type. This is probably not
|
|
the most convenient approach in most cases.</p>
|
|
<p>In all cases, the applications choice dialog accepts
|
|
multiple selections of MIME types in the top section,
|
|
and lets you define how they are processed in the
|
|
bottom one. In most cases, you will be using
|
|
<code class="literal">%f</code> as a placeholder to be
|
|
replaced by the file name in the application command
|
|
line.</p>
|
|
<p>You may also change the choice of applications by
|
|
editing the <a class="link" href=
|
|
"#RCL.INSTALL.CONFIG.MIMEVIEW" title=
|
|
"5.4.6. The mimeview file"><code class=
|
|
"filename">mimeview</code></a> configuration file if
|
|
you find this more convenient.</p>
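<p>As a rough sketch of what such an entry can look like (the
values below are illustrative assumptions, so check them against the
<code class="filename">mimeview</code> file shipped with your
version), a line in the <code class="literal">[view]</code> section
maps a MIME type to a command line, with
<code class="literal">%f</code> replaced by the file name:</p>
<pre class="programlisting">
# Illustrative sketch only, not a copy of the shipped defaults
[view]
# %f: file name. %p and %s are assumed here to stand for the page
# number and a search term, which evince accepts as options.
application/pdf = evince --page-index=%p --find=%s %f
</pre>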
|
|
<p>Under <span class="application">Unix</span>-like
|
|
systems, each result list entry also has a right-click
|
|
menu with an <span class="guilabel">Open With</span>
|
|
entry. This lets you choose an application from the
|
|
list of those which registered with the desktop for the
|
|
document MIME type, on a case by case basis.</p>
|
|
</div>
|
|
<div class="sect3">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h4 class="title"><a name=
|
|
"RCL.SEARCH.GUI.RESLIST.SUGGS" id=
|
|
"RCL.SEARCH.GUI.RESLIST.SUGGS"></a>No results:
|
|
the spelling suggestions</h4>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>When a search yields no result, and if the
|
|
<span class="application">aspell</span> dictionary is
|
|
configured, <span class="application">Recoll</span>
|
|
will try to check for misspellings among the query
|
|
terms, and will propose lists of replacements. Clicking
|
|
on one of the suggestions will replace the word and
|
|
restart the search. You can hold any of the modifier
|
|
keys (Ctrl, Shift, etc.) while clicking if you would
|
|
rather stay on the suggestion screen because several
|
|
terms need replacement.</p>
|
|
</div>
|
|
<div class="sect3">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h4 class="title"><a name=
|
|
"RCL.SEARCH.GUI.RESULTLIST.MENU" id=
|
|
"RCL.SEARCH.GUI.RESULTLIST.MENU"></a>The result
|
|
list right-click menu</h4>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>Apart from the Preview and Open links, you can
|
|
display a pop-up menu by right-clicking over a
|
|
paragraph in the result list. This menu has the
|
|
following entries:</p>
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style=
|
|
"list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Preview</span></p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Open</span></p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Open With</span></p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Run Script</span></p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Copy File
|
|
Name</span></p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Copy Url</span></p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Save to File</span></p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Find similar</span></p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Preview Parent
|
|
document</span></p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Open Parent
|
|
document</span></p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Open Snippets
|
|
Window</span></p>
|
|
</li>
|
|
</ul>
|
|
</div>
|
|
<p>The <span class="guilabel">Preview</span> and
|
|
<span class="guilabel">Open</span> entries do the same
|
|
thing as the corresponding links.</p>
|
|
<p><span class="guilabel">Open With</span>
|
|
(<span class="application">Unix</span>-like systems)
|
|
lets you open the document with one of the applications
|
|
claiming to be able to handle its MIME type (the
|
|
information comes from the <code class=
|
|
"literal">.desktop</code> files in <code class=
|
|
"filename">/usr/share/applications</code>).</p>
|
|
<p><span class="guilabel">Run Script</span>
|
|
(<span class="application">Unix</span>-like systems)
|
|
allows starting an arbitrary command on the result
|
|
file. It will only appear for results which are
|
|
top-level files. See <a class="link" href=
|
|
"#RCL.SEARCH.GUI.RUNSCRIPT" title=
|
|
"3.2.4. Unix-like systems: running arbitrary commands on result files">
|
|
further</a> for a more detailed description.</p>
|
|
<p>The <span class="guilabel">Copy File Name</span> and
|
|
<span class="guilabel">Copy Url</span> copy the
|
|
relevant data to the clipboard, for later pasting.</p>
|
|
<p><span class="guilabel">Save to File</span> allows
|
|
saving the contents of a result document to a chosen
|
|
file. This entry will only appear if the document does
|
|
not correspond to an existing file, but is a
|
|
subdocument inside such a file (e.g. an email
|
|
attachment). It is especially useful to extract
|
|
attachments with no associated editor.</p>
|
|
<p>The <span class="guilabel">Open/Preview Parent
|
|
document</span> entries allow working with the higher
|
|
level document (e.g. the email message an attachment
|
|
comes from). <span class="application">Recoll</span> is
|
|
sometimes not totally accurate as to what it can or
|
|
can't do in this area. For example the <span class=
|
|
"guilabel">Parent</span> entry will also appear for an
|
|
email which is part of an mbox folder file, but you
|
|
can't actually visualize the mbox (there will be an
|
|
error dialog if you try).</p>
|
|
<p>If the document is a top-level file, <span class=
|
|
"guilabel">Open Parent</span> will start the default
|
|
file manager on the enclosing filesystem directory.</p>
|
|
<p>The <span class="guilabel">Find similar</span> entry
|
|
will select a number of relevant terms from the current
|
|
document and enter them into the simple search field.
|
|
You can then start a simple search, with a good chance
|
|
of finding documents related to the current result. I
|
|
can't remember a single instance where this function
|
|
was actually useful to me...</p>
|
|
<p><a name="RCL.SEARCH.GUI.RESULTLIST.MENU.SNIPPETS"
|
|
id="RCL.SEARCH.GUI.RESULTLIST.MENU.SNIPPETS"></a>The
|
|
<span class="guilabel">Open Snippets Window</span>
|
|
entry will only appear for documents which support page
|
|
breaks (typically PDF, Postscript, DVI). The snippets
|
|
window lists extracts from the document, taken around
|
|
search term occurrences, along with the corresponding
|
|
page number, as links which can be used to start the
|
|
native viewer on the appropriate page. If the viewer
|
|
supports it, its search function will also be primed
|
|
with one of the search terms.</p>
|
|
</div>
|
|
</div>
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name="RCL.SEARCH.GUI.RESTABLE"
|
|
id="RCL.SEARCH.GUI.RESTABLE"></a>3.2.3. The
|
|
result table</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>As an alternative to the result list, the results can
|
|
also be displayed in spreadsheet-like fashion. You can
|
|
switch to this presentation by clicking the table-like
|
|
icon in the toolbar (this is a toggle, click again to
|
|
restore the list).</p>
|
|
<p>Clicking on the column headers will allow sorting by
|
|
the values in the column. You can click again to invert
|
|
the order, and use the header right-click menu to reset
|
|
sorting to the default relevance order (you can also use
|
|
the sort-by-date arrows to do this).</p>
|
|
<p>Both the list and the table display the same
|
|
underlying results. The sort order set from the table is
|
|
still active if you switch back to the list mode. You can
|
|
click twice on a date sort arrow to reset it from
|
|
there.</p>
|
|
<p>The header right-click menu allows adding or deleting
|
|
columns. The columns can be resized, and their order can
|
|
be changed (by dragging). All the changes are recorded
|
|
when you quit <span class=
|
|
"command"><strong>recoll</strong></span>.</p>
|
|
<p>Hovering over a table row will update the detail area
|
|
at the bottom of the window with the corresponding
|
|
values. You can click the row to freeze the display. The
|
|
bottom area is equivalent to a result list paragraph,
|
|
with links for starting a preview or a native
|
|
application, and an equivalent right-click menu. Typing
|
|
<span class="keycap"><strong>Esc</strong></span> (the
|
|
Escape key) will unfreeze the display.</p>
|
|
</div>
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.SEARCH.GUI.RUNSCRIPT" id=
|
|
"RCL.SEARCH.GUI.RUNSCRIPT"></a>3.2.4. <span class="application">Unix</span>-like
|
|
systems: running arbitrary commands on result
|
|
files</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>Apart from the <span class="guilabel">Open</span> and
|
|
<span class="guilabel">Open With</span> operations, which
|
|
allow starting an application on a result document (or a
|
|
temporary copy), based on its MIME type, it is also
|
|
possible to run arbitrary commands on results which are
|
|
top-level files, using the <span class="guilabel">Run
|
|
Script</span> entry in the results pop-up menu.</p>
|
|
<p>The commands which will appear in the <span class=
|
|
"guilabel">Run Script</span> submenu must be defined by
|
|
<code class="literal">.desktop</code> files inside the
|
|
<code class="filename">scripts</code> subdirectory of the
|
|
current configuration directory.</p>
|
|
<p>Here follows an example of a <code class=
|
|
"literal">.desktop</code> file, which could be named for
|
|
example, <code class=
|
|
"filename">~/.recoll/scripts/myscript.desktop</code> (the
|
|
exact file name inside the directory is irrelevant):</p>
|
|
<pre class="programlisting">
[Desktop Entry]
Type=Application
Name=MyFirstScript
Exec=/home/me/bin/tryscript %F
MimeType=*/*
</pre>
|
|
<p>The <code class="literal">Name</code> attribute
|
|
defines the label which will appear inside the
|
|
<span class="guilabel">Run Script</span> menu. The
|
|
<code class="literal">Exec</code> attribute defines the
|
|
program to be run, which does not need to actually be a
|
|
script, of course. The <code class=
|
|
"literal">MimeType</code> attribute is not used, but
|
|
needs to exist.</p>
|
|
<p>The commands defined this way can also be used from
|
|
links inside the <a class="link" href=
|
|
"#RCL.SEARCH.GUI.CUSTOM.RESLIST.PARA" title=
|
|
"The paragraph format">result paragraph</a>.</p>
|
|
<p>As an example, it might make sense to write a script
|
|
which would move the document to the trash and purge it
|
|
from the <span class="application">Recoll</span>
|
|
index.</p>
|
|
</div>
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.SEARCH.GUI.THUMBNAILS" id=
|
|
"RCL.SEARCH.GUI.THUMBNAILS"></a>3.2.5. <span class="application">Unix</span>-like
|
|
systems: displaying thumbnails</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>The default format for the result list entries and the
|
|
detail area of the result table display an icon for each
|
|
result document. The icon is either a generic one
|
|
determined from the MIME type, or a thumbnail of the
|
|
document appearance. Thumbnails are only displayed if
|
|
found in the standard <span class=
|
|
"application">freedesktop</span> location, where they
|
|
would typically have been created by a file manager.</p>
|
|
<p>Recoll has no capability to create thumbnails. A
|
|
relatively simple trick is to use the <span class=
|
|
"guilabel">Open parent document/folder</span> entry in
|
|
the result list popup menu. This should open a file
|
|
manager window on the containing directory, which should
|
|
in turn create the thumbnails (depending on your
|
|
settings). Restarting the search should then display the
|
|
thumbnails.</p>
|
|
<p>There are also <a class="ulink" href=
|
|
"https://www.lesbonscomptes.com/recoll/faqsandhowtos/ResultsThumbnails.html"
|
|
target="_top">some pointers about thumbnail
|
|
generation</a> in the <span class=
|
|
"application">Recoll</span> FAQ.</p>
|
|
</div>
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name="RCL.SEARCH.GUI.PREVIEW"
|
|
id="RCL.SEARCH.GUI.PREVIEW"></a>3.2.6. The
|
|
preview window</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>The preview window opens when you first click a
|
|
<code class="literal">Preview</code> link inside the
|
|
result list.</p>
|
|
<p>Subsequent preview requests for a given search open
|
|
new tabs in the existing window (except if you hold the
|
|
<span class="keycap"><strong>Shift</strong></span> key
|
|
while clicking, which will open a new window for side by
|
|
side viewing).</p>
|
|
<p>Starting another search and requesting a preview will
|
|
create a new preview window. The old one stays open until
|
|
you close it.</p>
|
|
<p>You can close a preview tab by typing <span class=
|
|
"keycap"><strong>Ctrl-W</strong></span> (<span class=
|
|
"keycap"><strong>Ctrl</strong></span> + <span class=
|
|
"keycap"><strong>W</strong></span>) in the window.
|
|
Closing the last tab, or using the window manager button
|
|
in the top of the frame will also close the window.</p>
|
|
<p>You can display successive or previous documents from
|
|
the result list inside a preview tab by typing
|
|
<span class=
|
|
"keycap"><strong>Shift</strong></span>+<span class=
|
|
"keycap"><strong>Down</strong></span> or <span class=
|
|
"keycap"><strong>Shift</strong></span>+<span class=
|
|
"keycap"><strong>Up</strong></span> (<span class=
|
|
"keycap"><strong>Down</strong></span> and <span class=
|
|
"keycap"><strong>Up</strong></span> are the arrow
|
|
keys).</p>
|
|
<p>A right-click menu in the text area allows switching
|
|
between displaying the main text or the contents of
|
|
fields associated with the document (e.g. author, abstract,
|
|
etc.). This is especially useful in cases where the term
|
|
match did not occur in the main text but in one of the
|
|
fields. In the case of images, you can switch between
|
|
three displays: the image itself, the image metadata as
|
|
extracted by <span class=
|
|
"command"><strong>exiftool</strong></span> and the
|
|
fields, which is the metadata stored in the index.</p>
|
|
<p>You can print the current preview window contents by
|
|
typing <span class=
|
|
"keycap"><strong>Ctrl-P</strong></span> (<span class=
|
|
"keycap"><strong>Ctrl</strong></span> + <span class=
|
|
"keycap"><strong>P</strong></span>) in the window
|
|
text.</p>
|
|
<div class="sect3">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h4 class="title"><a name=
|
|
"RCL.SEARCH.GUI.PREVIEW.SEARCH" id=
|
|
"RCL.SEARCH.GUI.PREVIEW.SEARCH"></a>Searching
|
|
inside the preview</h4>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>The preview window has an internal search
|
|
capability, mostly controlled by the panel at the
|
|
bottom of the window, which works in two modes: as a
|
|
classical editor incremental search, where we look for
|
|
the text entered in the entry zone, or as a way to walk
|
|
the matches between the document and the <span class=
|
|
"application">Recoll</span> query that found it.</p>
|
|
<div class="variablelist">
|
|
<dl class="variablelist">
|
|
<dt><span class="term">Incremental text
|
|
search</span></dt>
|
|
<dd>
|
|
<p>The preview tabs have an internal incremental
|
|
search function. You initiate the search either
|
|
by typing a <span class=
|
|
"keycap"><strong>/</strong></span> (slash) or
|
|
<span class=
|
|
"keycap"><strong>Ctrl-F</strong></span> inside the
|
|
text area or by clicking into the <span class=
|
|
"guilabel">Search for:</span> text field and
|
|
entering the search string. You can then use the
|
|
<span class="guilabel">Next</span> and
|
|
<span class="guilabel">Previous</span> buttons to
|
|
find the next/previous occurrence. You can also
|
|
type <span class=
|
|
"keycap"><strong>F3</strong></span> inside the
|
|
text area to get to the next occurrence.</p>
|
|
<p>If you have a search string entered and you
|
|
use Shift-Up/Shift-Down to browse the results, the
|
|
search is initiated for each successive document.
|
|
If the string is found, the cursor will be
|
|
positioned at the first occurrence of the search
|
|
string.</p>
|
|
</dd>
|
|
<dt><span class="term">Walking the match
|
|
lists</span></dt>
|
|
<dd>
|
|
<p>If the entry area is empty when you click the
|
|
<span class="guilabel">Next</span> or
|
|
<span class="guilabel">Previous</span> buttons,
|
|
the editor will be scrolled to show the next
|
|
match to any search term (the next highlighted
|
|
zone). If you select a search group from the
|
|
dropdown list and click <span class=
|
|
"guilabel">Next</span> or <span class=
|
|
"guilabel">Previous</span>, the match list for
|
|
this group will be walked. This is not the same
|
|
as a text search, because the occurrences will
|
|
include non-exact matches (as caused by stemming
|
|
or wildcards). The search will revert to the text
|
|
mode as soon as you edit the entry area.</p>
|
|
</dd>
|
|
</dl>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name="RCL.SEARCH.GUI.FRAGBUTS"
|
|
id="RCL.SEARCH.GUI.FRAGBUTS"></a>3.2.7. The
|
|
Query Fragments window</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>Selecting the <span class="guimenu">Tools</span> →
|
|
<span class="guimenuitem">Query Fragments</span> menu
|
|
entry will open a window with radio- and check-buttons
|
|
which can be used to activate query language fragments
|
|
for filtering the current query. This can be useful if
|
|
you have frequent reusable selectors, for example,
|
|
filtering on alternate directories, or searching just one
|
|
category of files, not covered by the standard category
|
|
selectors.</p>
|
|
<p>The contents of the window are entirely customizable,
|
|
and defined by the contents of the <code class=
|
|
"filename">fragbuts.xml</code> file inside the
|
|
configuration directory. The sample file distributed with
|
|
<span class="application">Recoll</span> (which you should
|
|
be able to find under <code class=
|
|
"filename">/usr/share/recoll/examples/fragbuts.xml</code>),
|
|
contains an example which filters the results from the
|
|
Web history.</p>
|
|
<p>Here follows an example:</p>
|
|
<pre class="programlisting">
<?xml version="1.0" encoding="UTF-8"?>
<fragbuts version="1.0">

  <radiobuttons>
    <!-- Actually useful: toggle Web queue results inclusion -->
    <fragbut>
      <label>Include Web Results</label>
      <frag></frag>
    </fragbut>

    <fragbut>
      <label>Exclude Web Results</label>
      <frag>-rclbes:BGL</frag>
    </fragbut>

    <fragbut>
      <label>Only Web Results</label>
      <frag>rclbes:BGL</frag>
    </fragbut>

  </radiobuttons>

  <buttons>

    <fragbut>
      <label>Example: Year 2010</label>
      <frag>date:2010-01-01/2010-12-31</frag>
    </fragbut>

    <fragbut>
      <label>Example: c++ files</label>
      <frag>ext:cpp OR ext:cxx</frag>
    </fragbut>

    <fragbut>
      <label>Example: My Great Directory</label>
      <frag>dir:/my/great/directory</frag>
    </fragbut>

  </buttons>

</fragbuts>
</pre>
|
|
<p>Each <code class="literal">radiobuttons</code> or
|
|
<code class="literal">buttons</code> section defines a
|
|
line of radiobuttons or checkbuttons, respectively, inside the window.
|
|
Any number of checkbuttons can be selected, but the
|
|
radiobuttons in a line are exclusive.</p>
|
|
<p>Each <code class="literal">fragbut</code> section
|
|
defines the label for a button, and the Query Language
|
|
fragment which will be added (as an AND filter) before
|
|
performing the query if the button is active.</p>
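<p>To illustrate the AND filtering, here is a conceptual sketch (the
user entry <code class="literal">budget report</code> is a made-up
example, and the last line only describes the effect, not the exact
internal form of the query):</p>
<pre class="programlisting">
Simple search entry:   budget report
Active button:         Example: Year 2010
Fragment added:        date:2010-01-01/2010-12-31
Effect:                documents matching the entry AND dated within 2010
</pre>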
|
|
</div>
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name="RCL.SEARCH.GUI.COMPLEX"
|
|
id=
|
|
"RCL.SEARCH.GUI.COMPLEX"></a>3.2.8. Complex/advanced
|
|
search</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>The advanced search dialog helps you build more
|
|
complex queries without memorizing the search language
|
|
constructs. It can be opened through the <span class=
|
|
"guilabel">Tools</span> menu or through the main
|
|
toolbar.</p>
|
|
<p><span class="application">Recoll</span> keeps a
|
|
history of searches. See <a class="link" href=
|
|
"#RCL.SEARCH.GUI.COMPLEX.HISTORY" title=
|
|
"Advanced search history">Advanced search
|
|
history</a>.</p>
|
|
<p>The dialog has two tabs:</p>
|
|
<div class="orderedlist">
|
|
<ol class="orderedlist" type="1">
|
|
<li class="listitem">
|
|
<p>The first tab lets you specify terms to search
|
|
for, and permits specifying multiple clauses which
|
|
are combined to build the search.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>The second tab lets you filter the results according
|
|
to file size, date of modification, MIME type, or
|
|
location.</p>
|
|
</li>
|
|
</ol>
|
|
</div>
|
|
<p>Click on the <span class="guilabel">Start
|
|
Search</span> button in the advanced search dialog, or
|
|
type <span class="keycap"><strong>Enter</strong></span>
|
|
in any text field to start the search. The button in the
|
|
main window always performs a simple search.</p>
|
|
<p>Click on the <code class="literal">Show query
|
|
details</code> link at the top of the result page to see
|
|
the query expansion.</p>
|
|
<div class="sect3">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h4 class="title"><a name=
|
|
"RCL.SEARCH.GUI.COMPLEX.TERMS" id=
|
|
"RCL.SEARCH.GUI.COMPLEX.TERMS"></a>Advanced
|
|
search: the "find" tab</h4>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>This part of the dialog lets you construct a query
|
|
by combining multiple clauses of different types. Each
|
|
entry field is configurable for the following
|
|
modes:</p>
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style=
|
|
"list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p>All terms.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>Any term.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>None of the terms.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>Phrase (exact terms in order within an
|
|
adjustable window).</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>Proximity (terms in any order within an
|
|
adjustable window).</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>Filename search.</p>
|
|
</li>
|
|
</ul>
|
|
</div>
|
|
<p>Additional entry fields can be created by clicking
|
|
the <span class="guilabel">Add clause</span>
|
|
button.</p>
|
|
<p>When searching, the non-empty clauses will be
|
|
combined either with an AND or an OR conjunction,
|
|
depending on the choice made on the left (<span class=
|
|
"guilabel">All clauses</span> or <span class=
|
|
"guilabel">Any clause</span>).</p>
|
|
<p>Entries of all types except "Phrase" and "Proximity"
|
|
accept a mix of single words and phrases enclosed in
|
|
double quotes. Stemming and wildcard expansion will be
|
|
performed as for simple search.</p>
|
|
<p><b>Phrases and Proximity searches. </b>These
|
|
two clauses work in similar ways, with the difference
|
|
that proximity searches do not impose an order on the
|
|
words. In both cases, an adjustable number (slack) of
|
|
non-matched words may be accepted between the searched
|
|
ones (use the counter on the left to adjust this
|
|
count). For phrases, the default count is zero (exact
|
|
match). For proximity it is ten (meaning that two
|
|
search terms would be matched if found within a window
|
|
of twelve words). Examples: a phrase search for
|
|
<code class="literal">quick fox</code> with a slack of
|
|
0 will match <code class="literal">quick fox</code> but
|
|
not <code class="literal">quick brown fox</code>. With
|
|
a slack of 1 it will match the latter, but not
|
|
<code class="literal">fox quick</code>. A proximity
|
|
search for <code class="literal">quick fox</code> with
|
|
the default slack will match the latter, and also
|
|
<code class="literal">a fox is a cunning and quick
|
|
animal</code>.</p>
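<p>The window arithmetic behind these examples can be written out as
follows (a worked restatement of the rules above, counting the window
as the number of search terms plus the slack):</p>
<pre class="programlisting">
window = number of search terms + slack

Phrase,    slack 0:   2 + 0  = 2    matches "quick fox"
Phrase,    slack 1:   2 + 1  = 3    matches "quick brown fox"
Proximity, slack 10:  2 + 10 = 12   matches "fox quick" and
                                    "a fox is a cunning and quick animal"
                                    (fox ... quick spans 6 words)
</pre>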
|
|
</div>
|
|
<div class="sect3">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h4 class="title"><a name=
|
|
"RCL.SEARCH.GUI.COMPLEX.FILTER" id=
|
|
"RCL.SEARCH.GUI.COMPLEX.FILTER"></a>Advanced
|
|
search: the "filter" tab</h4>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>This part of the dialog has several sections which
|
|
allow filtering the results of a search according to a
|
|
number of criteria:</p>
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style=
|
|
"list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p>The first section allows filtering by dates of
|
|
last modification. You can specify both a minimum
|
|
and a maximum date. The initial values are set
|
|
according to the oldest and newest documents
|
|
found in the index.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>The next section allows filtering the results
|
|
by file size. There are two entries for minimum
|
|
and maximum size. Enter decimal numbers. You can
|
|
use suffix multipliers: <code class=
|
|
"literal">k/K</code>, <code class=
|
|
"literal">m/M</code>, <code class=
|
|
"literal">g/G</code>, <code class=
|
|
"literal">t/T</code> for 1E3, 1E6, 1E9, 1E12
|
|
respectively.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>The next section allows filtering the results
|
|
by their MIME types, or MIME categories (e.g.
|
|
media/text/message/etc.).</p>
|
|
<p>You can transfer the types between two boxes,
|
|
to define which will be included or excluded by
|
|
the search.</p>
|
|
<p>The state of the file type selection can be
|
|
saved as the default (the file type filter will
|
|
not be activated at program start-up, but the
|
|
lists will be in the restored state).</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>The bottom section allows restricting the
|
|
search results to a sub-tree of the indexed area.
|
|
You can use the <span class=
|
|
"guilabel">Invert</span> checkbox to search for
|
|
files not in the sub-tree instead. If you use
|
|
directory filtering often and on big subsets of
|
|
the file system, you may think of setting up
|
|
multiple indexes instead, as the performance may
|
|
be better.</p>
|
|
<p>You can use relative/partial paths for
|
|
filtering. For example, entering <code class=
|
|
"literal">dirA/dirB</code> would match either
|
|
<code class=
|
|
"filename">/dir1/dirA/dirB/myfile1</code> or
|
|
<code class=
|
|
"filename">/dir2/dirA/dirB/someother/myfile2</code>.</p>
|
|
</li>
|
|
</ul>
|
|
</div>
|
|
</div>
|
|
<div class="sect3">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h4 class="title"><a name=
|
|
"RCL.SEARCH.GUI.COMPLEX.HISTORY" id=
|
|
"RCL.SEARCH.GUI.COMPLEX.HISTORY"></a>Advanced
|
|
search history</h4>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>The advanced search tool memorizes the last 100
|
|
searches performed. You can walk the saved searches by
|
|
using the up and down arrow keys while the keyboard
|
|
focus belongs to the advanced search dialog.</p>
|
|
<p>The complex search history can be erased, along with
|
|
the one for simple search, by selecting the
|
|
<span class="guimenu">File</span> → <span class=
|
|
"guimenuitem">Erase Search History</span> menu
|
|
entry.</p>
|
|
</div>
|
|
</div>
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.SEARCH.GUI.TERMEXPLORER" id=
|
|
"RCL.SEARCH.GUI.TERMEXPLORER"></a>3.2.9. The
|
|
term explorer tool</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p><span class="application">Recoll</span> automatically
|
|
manages the expansion of search terms to their
|
|
derivatives (e.g. plural/singular, verb inflections). But
|
|
there are other cases where the exact search term is not
|
|
known. For example, you may not remember the exact
|
|
spelling, or only know the beginning of the name.</p>
|
|
<p>The search will only propose replacement terms with
|
|
spelling variations when no matching documents were found.
|
|
In some cases, both proper spellings and misspellings are
|
|
present in the index, and it may be interesting to look
|
|
for them explicitly.</p>
|
|
<p>The term explorer tool (started from the toolbar icon
|
|
or from the <span class="guilabel">Term explorer</span>
|
|
entry of the <span class="guilabel">Tools</span> menu)
|
|
can be used to search the full index terms list. It has
|
|
several modes of operation:</p>
|
|
<div class="variablelist">
|
|
<dl class="variablelist">
|
|
<dt><span class="term">Wildcard</span></dt>
|
|
<dd>
|
|
<p>In this mode of operation, you can enter a
|
|
search string with shell-like wildcards (*, ?, []).
|
|
For example, <em class="replaceable"><code>xapi*</code></em>
|
|
would display all index terms beginning with
|
|
<em class="replaceable"><code>xapi</code></em>.
|
|
(More about wildcards <a class="link" href=
|
|
"#RCL.SEARCH.WILDCARDS" title=
|
|
"3.6.1. More about wildcards">here</a>).</p>
|
|
</dd>
|
|
<dt><span class="term">Regular expression</span></dt>
|
|
<dd>
|
|
<p>This mode will accept a regular expression as
|
|
input. Example: <em class=
|
|
"replaceable"><code>word[0-9]+</code></em>. The
|
|
expression is implicitly anchored at the beginning.
|
|
For example, <em class="replaceable"><code>press</code></em>
|
|
will match <em class=
|
|
"replaceable"><code>pression</code></em> but not
|
|
<em class=
|
|
"replaceable"><code>expression</code></em>. You can
|
|
use <em class=
|
|
"replaceable"><code>.*press</code></em> to match
|
|
the latter, but be aware that this will cause a
|
|
full index term list scan, which can be quite
|
|
long.</p>
|
|
</dd>
|
|
<dt><span class="term">Stem expansion</span></dt>
|
|
<dd>
|
|
<p>This mode will perform the usual stem expansion
normally done as part of user input processing. As
such it is probably mostly useful to demonstrate
the process.</p>
|
|
</dd>
|
|
<dt><span class="term">Spelling/Phonetic</span></dt>
|
|
<dd>
|
|
<p>In this mode, you enter the term as you think it
|
|
is spelled, and <span class=
|
|
"application">Recoll</span> will do its best to
|
|
find index terms that sound like your entry. This
|
|
mode uses the <span class=
|
|
"application">Aspell</span> spelling application,
|
|
which must be installed on your system for things
|
|
to work (if your documents contain non-ASCII
|
|
characters, <span class="application">Recoll</span>
|
|
needs an aspell version newer than 0.60 for UTF-8
|
|
support). The language which is used to build the
|
|
dictionary out of the index terms (which is done at
|
|
the end of an indexing pass) is the one defined by
|
|
your NLS environment. Weird things will probably
|
|
happen if languages are mixed up.</p>
|
|
</dd>
|
|
<dt><span class="term">Show index
|
|
statistics</span></dt>
|
|
<dd>
|
|
<p>This will print a long list of boring numbers
about the index.</p>
|
|
</dd>
|
|
<dt><span class="term">List files which could not be
|
|
indexed</span></dt>
|
|
<dd>
|
|
<p>This will show the files which caused errors,
|
|
usually because <span class=
|
|
"command"><strong>recollindex</strong></span> could
|
|
not translate their format into text.</p>
|
|
</dd>
|
|
</dl>
|
|
</div>
|
|
<p>Note that in cases where <span class=
|
|
"application">Recoll</span> does not know the beginning
|
|
of the string to search for (e.g. a wildcard expression
|
|
like <em class="replaceable"><code>*coll</code></em>),
|
|
the expansion can take quite a long time because the full
|
|
index term list will have to be processed. The expansion
|
|
is currently limited to 10000 results for wildcards and
|
|
regular expressions. It is possible to change the limit
|
|
in the configuration file.</p>
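<p>The effect of anchoring can be reproduced outside
<span class="application">Recoll</span> on a toy term list. The
following is a minimal sketch which uses
<span class="command"><strong>grep</strong></span> as a stand-in:
the term list is made up, and the commands only illustrate the
matching rules, not the way <span class="application">Recoll</span>
actually walks its index.</p>
<pre class="screen"># Toy term list standing in for the index terms (made-up data).
terms='press
pression
expression
compress'

# Regular expression mode is implicitly anchored at the beginning:
# only terms starting with "press" are candidates.
anchored=$(printf '%s\n' "$terms" | grep -E '^press')

# A leading .* lifts the restriction, so every term must be examined.
unanchored=$(printf '%s\n' "$terms" | grep -E '^.*press')

printf 'anchored:\n%s\n' "$anchored"
printf 'unanchored:\n%s\n' "$unanchored"</pre>
<p>The anchored form only lists the two terms which begin with
<code class="literal">press</code>, while the unanchored one lists
all four. On a real index, the second form is the one which forces
a scan of the full term list.</p>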
|
|
<p>Double-clicking on a term in the result list will
|
|
insert it into the simple search entry field. You can
|
|
also cut/paste between the result list and any entry
|
|
field (the end of lines will be taken care of).</p>
|
|
</div>
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name="RCL.SEARCH.GUI.MULTIDB"
|
|
id=
|
|
"RCL.SEARCH.GUI.MULTIDB"></a>3.2.10. Multiple
|
|
indexes</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>See the section describing <a class="link" href=
|
|
"#RCL.INDEXING.CONFIG.MULTIPLE" title=
|
|
"2.3.1. Multiple indexes">the use of multiple
|
|
indexes</a> for generalities. Only the aspects concerning
|
|
the <span class="command"><strong>recoll</strong></span>
|
|
GUI are described here.</p>
|
|
<p>A <span class="command"><strong>recoll</strong></span>
|
|
program instance is always associated with a specific
|
|
index, which is the one to be updated when requested from
|
|
the <span class="guimenu">File</span> menu, but it can
|
|
use any number of <span class="application">Recoll</span>
|
|
indexes for searching. The external indexes can be
|
|
selected through the <span class="guilabel">external
|
|
indexes</span> tab in the preferences dialog.</p>
|
|
<p>Index selection is performed in two phases. A set of
|
|
all usable indexes must first be defined, and then the
|
|
subset of indexes to be used for searching. These
|
|
parameters are retained across program executions (they
|
|
are kept separately for each <span class=
|
|
"application">Recoll</span> configuration). The set of
|
|
all indexes is usually quite stable, while the active
|
|
ones might typically be adjusted quite frequently.</p>
|
|
<p>The main index (defined by <code class=
|
|
"envar">RECOLL_CONFDIR</code>) is always active. If this
|
|
is undesirable, you can set up your base configuration to
|
|
index an empty directory.</p>
|
|
<p>When adding a new index to the set, you can select
|
|
either a <span class="application">Recoll</span>
|
|
configuration directory, or directly a <span class=
|
|
"application">Xapian</span> index directory. In the first
|
|
case, the <span class="application">Xapian</span> index
|
|
directory will be obtained from the selected
|
|
configuration.</p>
|
|
<p>As building the set of all indexes can be a little
|
|
tedious when done through the user interface, you can use
|
|
the <code class="envar">RECOLL_EXTRA_DBS</code>
|
|
environment variable to provide an initial set. This
|
|
might typically be set up by a system administrator so
|
|
that every user does not have to do it. The variable
|
|
should define a colon-separated list of index
|
|
directories, for example:</p>
|
|
<pre class=
|
|
"screen">export RECOLL_EXTRA_DBS=/some/place/xapiandb:/some/other/db</pre>
|
|
<p>Another environment variable, <code class=
|
|
"envar">RECOLL_ACTIVE_EXTRA_DBS</code>, allows adding to
|
|
the active list of indexes. This variable was suggested
|
|
and implemented by a <span class=
|
|
"application">Recoll</span> user. It is mostly useful if
|
|
you use scripts to mount external volumes with
|
|
<span class="application">Recoll</span> indexes. By using
|
|
<code class="envar">RECOLL_EXTRA_DBS</code> and
|
|
<code class="envar">RECOLL_ACTIVE_EXTRA_DBS</code>, you
|
|
can add and activate the index for the mounted volume
|
|
when starting <span class=
|
|
"command"><strong>recoll</strong></span>. Unreachable
|
|
indexes will automatically be deactivated when starting
|
|
up.</p>
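<p>A mount script could set both variables along the following
lines. This is only a sketch: the index path is hypothetical, and
it assumes that <code class="envar">RECOLL_ACTIVE_EXTRA_DBS</code>
uses the same colon-separated format as
<code class="envar">RECOLL_EXTRA_DBS</code>.</p>
<pre class="screen"># Hypothetical index location on the mounted volume; adjust to your setup.
volume_index=/media/backup/xapiandb

# Make the index known to recoll, appending to any existing list...
export RECOLL_EXTRA_DBS="${RECOLL_EXTRA_DBS:+$RECOLL_EXTRA_DBS:}$volume_index"

# ...and activate it for searching. The colon-separated format is
# assumed to be the same as for RECOLL_EXTRA_DBS.
export RECOLL_ACTIVE_EXTRA_DBS="${RECOLL_ACTIVE_EXTRA_DBS:+$RECOLL_ACTIVE_EXTRA_DBS:}$volume_index"

# Start the GUI from this environment once the volume is mounted:
# recoll</pre>
<p>The <code class="literal">${VAR:+...}</code> expansions keep any
value already present in the environment, so that a list provided
by a system administrator is extended rather than replaced.</p>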
|
|
</div>
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name="RCL.SEARCH.GUI.HISTORY"
|
|
id=
|
|
"RCL.SEARCH.GUI.HISTORY"></a>3.2.11. Document
|
|
history</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>Documents that you actually view (with the internal
|
|
preview or an external tool) are entered into the
|
|
document history, which is remembered.</p>
|
|
<p>You can display the history list by using the
|
|
<span class="guilabel">Tools/</span><span class=
|
|
"guilabel">Doc History</span> menu entry.</p>
|
|
<p>You can erase the document history by using the
|
|
<span class="guilabel">Erase document history</span>
|
|
entry in the <span class="guimenu">File</span> menu.</p>
|
|
</div>
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name="RCL.SEARCH.GUI.SORT" id=
|
|
"RCL.SEARCH.GUI.SORT"></a>3.2.12. Sorting
|
|
search results and collapsing duplicates</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>The documents in a result list are normally sorted in
|
|
order of relevance. It is possible to specify a different
|
|
sort order, either by using the vertical arrows in the
|
|
GUI toolbox to sort by date, or switching to the result
|
|
table display and clicking on any header. The sort order
|
|
chosen inside the result table remains active if you
|
|
switch back to the result list, until you click
the vertical arrows so that both are unchecked (you are
then back to sorting by relevance).</p>
|
|
<p>Sort parameters are remembered between program
|
|
invocations, but result sorting is normally always
|
|
inactive when the program starts. It is possible to keep
|
|
the sorting activation state between program invocations
|
|
by checking the <span class="guilabel">Remember sort
|
|
activation state</span> option in the preferences.</p>
|
|
<p>It is also possible to hide duplicate entries inside
|
|
the result list (documents with the exact same contents
|
|
as the displayed one). The test of identity is based on
|
|
an MD5 hash of the document container, not only of the
|
|
text contents (so that, for example, a text document with an image
added will not be a duplicate of the text-only version).
Duplicate hiding is controlled by an entry in the
|
|
<span class="guilabel">GUI configuration</span> dialog,
|
|
and is off by default.</p>
|
|
<p>When a result document does have undisplayed
|
|
duplicates, a <code class="literal">Dups</code> link will
|
|
be shown with the result list entry. Clicking the link
|
|
will display the paths (URLs + ipaths) for the duplicate
|
|
entries.</p>
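<p>The identity test can be pictured with a small experiment. The
sketch below is not <span class="application">Recoll</span> code:
it hashes made-up strings standing in for document containers, and
assumes that the GNU
<span class="command"><strong>md5sum</strong></span> command is
available.</p>
<pre class="screen"># Return the MD5 hash of a string standing in for a document container.
hash_of() { printf '%s' "$1" | md5sum | cut -d' ' -f1; }

a=$(hash_of 'report text')
b=$(hash_of 'report text')
c=$(hash_of 'report text plus an embedded image')

# Identical containers have identical hashes: b would be hidden behind a.
if [ "$a" = "$b" ]; then echo "a and b are duplicates"; fi

# Any difference in the container changes the hash: c stays a separate result.
if [ "$a" != "$c" ]; then echo "c is not a duplicate of a"; fi</pre>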
|
|
</div>
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.SEARCH.GUI.SHORTCUTS" id=
|
|
"RCL.SEARCH.GUI.SHORTCUTS"></a>3.2.13. Keyboard
|
|
shortcuts</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>A number of common actions within the graphical
|
|
interface can be triggered through keyboard shortcuts. As
|
|
of <span class="application">Recoll</span> 1.29, many of
|
|
the shortcut values can be customised from a screen in
|
|
the GUI preferences. Most shortcuts are specific to a
|
|
given context (e.g. within a preview window, within the
|
|
result table).</p>
|
|
<div class="table">
|
|
<a name="idm1465" id="idm1465"></a>
|
|
<p class="title"><b>Table 3.1. Keyboard
|
|
shortcuts</b></p>
|
|
<div class="table-contents">
|
|
<table class="table" summary="Keyboard shortcuts"
|
|
border="1">
|
|
<colgroup>
|
|
<col align="left" class="c1">
|
|
<col align="left" class="c2">
|
|
</colgroup>
|
|
<thead>
|
|
<tr>
|
|
<th align="left">Description</th>
|
|
<th align="left">Default value</th>
|
|
</tr>
|
|
</thead>
|
|
<tbody>
|
|
<tr>
|
|
<td colspan="2" align="left"><span class=
|
|
"command"><strong>Context: almost
|
|
everywhere</strong></span></td>
|
|
</tr>
|
|
<tr>
|
|
<td align="left">Program exit</td>
|
|
<td align="left">Ctrl+Q</td>
|
|
</tr>
|
|
<tr>
|
|
<td colspan="2" align="left"><span class=
|
|
"command"><strong>Context: advanced
|
|
search</strong></span></td>
|
|
</tr>
|
|
<tr>
|
|
<td align="left">Load the next entry from the
|
|
search history</td>
|
|
<td align="left">Up</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="left">Load the previous entry from
|
|
the search history</td>
|
|
<td align="left">Down</td>
|
|
</tr>
|
|
<tr>
|
|
<td colspan="2" align="left"><span class=
|
|
"command"><strong>Context: main
|
|
window</strong></span></td>
|
|
</tr>
|
|
<tr>
|
|
<td align="left">Clear search. This will move
|
|
the keyboard cursor to the simple search entry
|
|
and erase the current text</td>
|
|
<td align="left">Ctrl+S</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="left">Move the keyboard cursor to
|
|
the search entry area without erasing the
|
|
current text</td>
|
|
<td align="left">Ctrl+L</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="left">Move the keyboard cursor to
|
|
the search entry area without erasing the
|
|
current text</td>
|
|
<td align="left">Ctrl+Shift+S</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="left">Toggle displaying the current
|
|
results as a table or as a list</td>
|
|
<td align="left">Ctrl+T</td>
|
|
</tr>
|
|
<tr>
|
|
<td colspan="2" align="left"><span class=
|
|
"command"><strong>Context: main window, when
|
|
showing the results as a
|
|
table</strong></span></td>
|
|
</tr>
|
|
<tr>
|
|
<td align="left">Move the keyboard cursor to
the currently selected row in the table, or to
the first one if none is selected</td>
|
|
<td align="left">Ctrl+R</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="left">Jump to row 0-9 or a-z in the
|
|
table</td>
|
|
<td align="left">Ctrl+[0-9] or
|
|
Ctrl+Shift+[a-z]</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="left">Cancel the current
|
|
selection</td>
|
|
<td align="left">Esc</td>
|
|
</tr>
|
|
<tr>
|
|
<td colspan="2" align="left"><span class=
|
|
"command"><strong>Context: preview
|
|
window</strong></span></td>
|
|
</tr>
|
|
<tr>
|
|
<td align="left">Close the preview window</td>
|
|
<td align="left">Esc</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="left">Close the current tab</td>
|
|
<td align="left">Ctrl+W</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="left">Open a print dialog for the
|
|
current tab contents</td>
|
|
<td align="left">Ctrl+P</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="left">Load the next result from the
|
|
list to the current tab</td>
|
|
<td align="left">Shift+Down</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="left">Load the previous result from
|
|
the list to the current tab</td>
|
|
<td align="left">Shift+Up</td>
|
|
</tr>
|
|
<tr>
|
|
<td colspan="2" align="left"><span class=
|
|
"command"><strong>Context: result
|
|
table</strong></span></td>
|
|
</tr>
|
|
<tr>
|
|
<td align="left">Copy the text contained in the
|
|
selected document to the clipboard</td>
|
|
<td align="left">Ctrl+G</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="left">Open the current document and
|
|
exit Recoll</td>
|
|
<td align="left">Ctrl+Shift+O</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="left">Open the current document</td>
|
|
<td align="left">Ctrl+O</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="left">Show a full preview for the
|
|
current document</td>
|
|
<td align="left">Ctrl+D</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="left">Toggle showing the column
|
|
names</td>
|
|
<td align="left">Ctrl+H</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="left">Show a snippets (keyword in
|
|
context) list for the current document</td>
|
|
<td align="left">Ctrl+E</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="left">Toggle showing the row
|
|
letters/numbers</td>
|
|
<td align="left">Ctrl+V</td>
|
|
</tr>
|
|
<tr>
|
|
<td colspan="2" align="left"><span class=
|
|
"command"><strong>Context: snippets
|
|
window</strong></span></td>
|
|
</tr>
|
|
<tr>
|
|
<td align="left">Close the snippets window</td>
|
|
<td align="left">Esc</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="left">Find in the snippets list
|
|
(method #1)</td>
|
|
<td align="left">Ctrl+F</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="left">Find in the snippets list
|
|
(method #2)</td>
|
|
<td align="left">/</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="left">Find the next instance of the
|
|
search term</td>
|
|
<td align="left">F3</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="left">Find the previous instance of
|
|
the search term</td>
|
|
<td align="left">Shift+F3</td>
|
|
</tr>
|
|
</tbody>
|
|
</table>
|
|
</div>
|
|
</div><br class="table-break">
|
|
</div>
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name="RCL.SEARCH.GUI.TIPS" id=
|
|
"RCL.SEARCH.GUI.TIPS"></a>3.2.14. Search
|
|
tips</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<div class="sect3">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h4 class="title"><a name=
|
|
"RCL.SEARCH.GUI.TIPS.TERMS" id=
|
|
"RCL.SEARCH.GUI.TIPS.TERMS"></a>Terms and search
|
|
expansion</h4>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p><b>Term completion. </b>While typing into the
|
|
simple search entry, a popup menu will appear and show
|
|
completions for the current string. Values preceded by
|
|
a clock icon come from the history, those preceded by a
|
|
magnifier icon come from the index terms. This can be
|
|
disabled in the preferences.</p>
|
|
<p><b>Picking up new terms from result or preview
|
|
text. </b>Double-clicking on a word in the result
|
|
list or in a preview window will copy it to the simple
|
|
search entry field.</p>
|
|
<p><b>Wildcards. </b>Wildcards can be used inside
|
|
search terms in all forms of searches. <a class="link"
|
|
href="#RCL.SEARCH.WILDCARDS" title=
|
|
"3.6.1. More about wildcards">More about
|
|
wildcards</a>.</p>
|
|
<p><b>Automatic suffixes. </b>Words like
|
|
<code class="literal">odt</code> or <code class=
|
|
"literal">ods</code> can be automatically turned into
|
|
query language <code class="literal">ext:xxx</code>
|
|
clauses. This can be enabled in the <span class=
|
|
"guilabel">Search preferences</span> panel in the
|
|
GUI.</p>
|
|
<p><b>Disabling stem expansion. </b>Entering a
|
|
capitalized word in any search field will prevent stem
|
|
expansion (no search for <code class=
|
|
"literal">gardening</code> if you enter <code class=
|
|
"literal">Garden</code> instead of <code class=
|
|
"literal">garden</code>). This is the only case where
|
|
character case should make a difference for a
|
|
<span class="application">Recoll</span> search. You can
|
|
also disable stem expansion or change the stemming
|
|
language in the preferences.</p>
|
|
<p><b>Finding related documents. </b>Selecting the
|
|
<span class="guilabel">Find similar documents</span>
|
|
entry in the result list paragraph right-click menu
|
|
will select a set of "interesting" terms from the
|
|
current result, and insert them into the simple search
|
|
entry field. You can then edit the list if needed and
start a search to find documents which may be
related to the current result.</p>
|
|
<p><b>File names. </b>File names are added as
|
|
terms during indexing, and you can specify them as
|
|
ordinary terms in normal search fields (<span class=
|
|
"application">Recoll</span> used to index all
|
|
directories in the file path as terms. This has been
|
|
abandoned as it did not seem really useful).
|
|
Alternatively, you can use the specific file name
|
|
search which will <span class=
|
|
"emphasis"><em>only</em></span> look for file names,
|
|
and may be faster than the generic search especially
|
|
when using wildcards.</p>
|
|
</div>
|
|
<div class="sect3">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h4 class="title"><a name=
|
|
"RCL.SEARCH.GUI.TIPS.PHRASES" id=
|
|
"RCL.SEARCH.GUI.TIPS.PHRASES"></a>Working with
|
|
phrases and proximity</h4>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p><b>Phrase searches. </b>A phrase can be looked
|
|
for by enclosing a number of terms in double quotes.
|
|
Example: <code class="literal">"user manual"</code>
|
|
will look only for occurrences of <code class=
|
|
"literal">user</code> immediately followed by
|
|
<code class="literal">manual</code>. You can use the
|
|
<span class="guilabel">"Phrase"</span> field of the
|
|
advanced search dialog to the same effect. Phrases can
|
|
be entered alongside simple terms in all simple or advanced
|
|
search entry fields, except <span class=
|
|
"guilabel">"Phrase"</span>.</p>
|
|
<p><b>Proximity searches. </b>A proximity search
|
|
differs from a phrase search in that it does not impose
|
|
an order on the terms. Proximity searches can be
|
|
entered by specifying the <span class=
|
|
"guilabel">"Proximity"</span> type in the advanced
|
|
search, or by postfixing a phrase search with a 'p'.
|
|
Example: "user manual"p would also match "manual user".
|
|
Also see <a class="link" href=
|
|
"#RCL.SEARCH.LANG.MODIFIERS" title=
|
|
"3.5.2. Modifiers">the modifier section</a> from
|
|
the query language documentation.</p>
|
|
<p><b>AutoPhrases. </b>This option can be set in
|
|
the preferences dialog. If it is set, a phrase will be
|
|
automatically built and added to simple searches when
|
|
looking for <code class="literal">Any terms</code>.
|
|
This will not radically change the results, but will
give a relevance boost to the results where the search
terms appear as a phrase. For example, searching for
|
|
<code class="literal">virtual reality</code> will still
|
|
find all documents where either <code class=
|
|
"literal">virtual</code> or <code class=
|
|
"literal">reality</code> or both appear, but those
|
|
which contain <code class="literal">virtual
|
|
reality</code> should appear sooner in the list.</p>
|
|
<p>Phrase searches can slow down a query if most of the
|
|
terms in the phrase are common. If the <code class=
|
|
"varname">autophrase</code> option is on, very common
|
|
terms will be removed from the automatically
|
|
constructed phrase. The removal threshold can be
|
|
adjusted from the search preferences.</p>
|
|
<p><b>Phrases and abbreviations. </b>Dotted
|
|
abbreviations like <code class="literal">I.B.M.</code>
|
|
are also automatically indexed as a word without the
|
|
dots: <code class="literal">IBM</code>. Searching for
|
|
the word inside a phrase (e.g. <code class=
"literal">"the IBM company"</code>) will only match the
dotted abbreviation if you increase the phrase slack
|
|
(using the advanced search panel control, or the
|
|
<code class="literal">o</code> query language
|
|
modifier). Literal occurrences of the word will be
|
|
matched normally.</p>
|
|
</div>
|
|
<div class="sect3">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h4 class="title"><a name=
|
|
"RCL.SEARCH.GUI.TIPS.MISC" id=
|
|
"RCL.SEARCH.GUI.TIPS.MISC"></a>Others</h4>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p><b>Using fields. </b>You can use the <a class=
|
|
"link" href="#RCL.SEARCH.LANG" title=
|
|
"3.5. The query language">query language</a> and
|
|
field specifications to only search certain parts of
|
|
documents. This can be especially helpful with email,
|
|
for example only searching emails from a specific
|
|
originator: <code class="literal">search tips
|
|
from:helpfulgui</code></p>
|
|
<p><b>Adjusting the result table columns. </b>When
|
|
displaying results in table mode, you can use a right
|
|
click on the table headers to activate a pop-up menu
|
|
which will let you adjust what columns are displayed.
|
|
You can drag the column headers to adjust their order.
|
|
You can click them to sort by the field displayed in
|
|
the column. You can also save the result list in CSV
|
|
format.</p>
|
|
<p><b>Changing the GUI geometry. </b>It is
|
|
possible to configure the GUI in wide form factor by
|
|
dragging the toolbars to one of the sides (their
|
|
location is remembered between sessions), and moving
|
|
the category filters to a menu (can be set in the
|
|
<span class="guimenu">Preferences</span> → <span class=
|
|
"guimenuitem">GUI configuration</span> → <span class=
|
|
"guimenuitem">User interface</span> panel).</p>
|
|
<p><b>Query explanation. </b>You can get an exact
|
|
description of what the query looked for, including
|
|
stem expansion, and Boolean operators used, by clicking
|
|
on the result list header.</p>
|
|
<p><b>Advanced search history. </b>You can display
|
|
any of the last 100 complex searches performed by using
|
|
the up and down arrow keys while the advanced search
|
|
panel is active.</p>
|
|
<p><b>Forced opening of a preview window. </b>You
|
|
can use <span class=
|
|
"keycap"><strong>Shift</strong></span>+Click on a
|
|
result list <code class="literal">Preview</code> link
|
|
to force the creation of a preview window instead of a
|
|
new tab in the existing one.</p>
|
|
</div>
|
|
</div>
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name="RCL.SEARCH.SAVING" id=
|
|
"RCL.SEARCH.SAVING"></a>3.2.15. Saving and
|
|
restoring queries (1.21 and later)</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>Both simple and advanced query dialogs save recent
|
|
history, but the amount is limited: old queries will
|
|
eventually be forgotten. Also, important queries may be
|
|
difficult to find among others. This is why both types of
|
|
queries can also be explicitly saved to files, from the
|
|
GUI menus: <span class="guimenu">File</span> →
|
|
<span class="guimenuitem">Save last query / Load last
|
|
query</span></p>
|
|
<p>The default location for saved queries is a
|
|
subdirectory of the current configuration directory, but
|
|
saved queries are ordinary files and can be written or
|
|
moved anywhere.</p>
|
|
<p>Some of the saved query parameters are part of the
|
|
preferences (e.g. <code class="literal">autophrase</code>
|
|
or the active external indexes), and may differ when the
|
|
query is loaded from the time it was saved. In this case,
|
|
<span class="application">Recoll</span> will warn of the
|
|
differences, but will not change the user
|
|
preferences.</p>
|
|
</div>
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name="RCL.SEARCH.GUI.CUSTOM"
|
|
id=
|
|
"RCL.SEARCH.GUI.CUSTOM"></a>3.2.16. Customizing
|
|
the search interface</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>You can customize some aspects of the search interface
|
|
by using the <span class="guimenu">GUI
|
|
configuration</span> entry in the <span class=
|
|
"guimenu">Preferences</span> menu.</p>
|
|
<p>There are several tabs in the dialog, dealing with the
|
|
interface itself, the parameters used for searching and
|
|
returning results, and what indexes are searched.</p>
|
|
<p><a name="RCL.SEARCH.GUI.CUSTOM.UI" id=
|
|
"RCL.SEARCH.GUI.CUSTOM.UI"></a><b>User interface
|
|
parameters: </b></p>
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style=
|
|
"list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Highlight color for query
|
|
terms</span>: Terms from the user query are
|
|
highlighted in the result list samples and the
|
|
preview window. The color can be chosen here. Any
|
|
Qt color string should work (e.g. <code class=
|
|
"literal">red</code>, <code class=
|
|
"literal">#ff0000</code>). The default is
|
|
<code class="literal">blue</code>.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Style sheet</span>: The
|
|
name of a <span class="application">Qt</span> style
|
|
sheet text file which is applied to the whole
|
|
Recoll application on startup. The default value is
|
|
empty, but there is a skeleton style sheet
|
|
(<code class="filename">recoll.qss</code>) inside
|
|
the <code class=
|
|
"filename">/usr/share/recoll/examples</code>
|
|
directory. Using a style sheet, you can change most
|
|
<span class=
|
|
"command"><strong>recoll</strong></span> graphical
|
|
parameters: colors, fonts, etc. See the sample file
|
|
for a few simple examples.</p>
|
|
<p>You should be aware that parameters (e.g.: the
|
|
background color) set inside the <span class=
|
|
"application">Recoll</span> GUI style sheet will
|
|
override global system preferences, with possible
|
|
strange side effects: for example if you set the
|
|
foreground to a light color and the background to a
|
|
dark one in the desktop preferences, but only the
|
|
background is set inside the <span class=
|
|
"application">Recoll</span> style sheet, and it is
|
|
light too, then text will appear light-on-light
|
|
inside the <span class="application">Recoll</span>
|
|
GUI.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Maximum text size
highlighted for preview</span>: inserting highlights
on search terms inside the text before inserting it
in the preview window involves quite a lot of
processing, and can be disabled over the given text
size to speed up loading.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Prefer HTML to plain text
|
|
for preview</span>: if set, Recoll will display HTML
|
|
as such inside the preview window. If this causes
|
|
problems with the Qt HTML display, you can uncheck
|
|
it to display the plain text version instead.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Activate links in
|
|
preview</span>: if set, Recoll will turn HTTP links
|
|
found inside plain text into proper HTML anchors,
|
|
and clicking a link inside a preview window will
|
|
start the default browser on the link target.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Plain text to HTML line
|
|
style</span>: when displaying plain text inside the
|
|
preview window, <span class=
|
|
"application">Recoll</span> tries to preserve some
|
|
of the original text line breaks and indentation.
|
|
It can either use PRE HTML tags, which will
preserve the indentation well but will force horizontal
|
|
scrolling for long lines, or use BR tags to break
|
|
at the original line breaks, which will let the
|
|
editor introduce other line breaks according to the
|
|
window width, but will lose some of the original
|
|
indentation. The third option has been available in
|
|
recent releases and is probably now the best one:
|
|
use PRE tags with line wrapping.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Choose editor
|
|
application</span>: this opens a dialog which
|
|
allows you to select the application to be used to
|
|
open each MIME type. The default is to use the
|
|
<span class=
|
|
"command"><strong>xdg-open</strong></span> utility,
|
|
but you can use this dialog to override it, setting
|
|
exceptions for MIME types that will still be opened
|
|
according to <span class=
|
|
"application">Recoll</span> preferences. This is
|
|
useful for passing parameters like page numbers or
|
|
search strings to applications that support them
|
|
(e.g. <span class="application">evince</span>).
|
|
This cannot be done with <span class=
|
|
"command"><strong>xdg-open</strong></span> which
|
|
only supports passing one parameter.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Disable Qt autocompletion
|
|
in search entry</span>: this will disable the
|
|
completion popup. It will then only appear, and display
|
|
the full history, either if you enter only white
|
|
space in the search area, or if you click the clock
|
|
button on the right of the area.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Document filter choice
|
|
style</span>: this will let you choose if the
|
|
document categories are displayed as a list or a
|
|
set of buttons, or a menu.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Start with simple search
|
|
mode</span>: this lets you choose the value of the
simple search type on program startup: either a
fixed value (e.g. <code class="literal">Query
Language</code>), or the value in use when the
program last exited.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Start with advanced
|
|
search dialog open</span>: if you use this dialog
frequently, checking this entry will make it
open when recoll starts.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Remember sort activation
|
|
state</span>: if set, Recoll will remember the sort
tool state between invocations. It normally starts
|
|
with sorting disabled.</p>
|
|
</li>
|
|
</ul>
|
|
</div>
|
|
<p><a name="RCL.SEARCH.GUI.CUSTOM.RL" id=
|
|
"RCL.SEARCH.GUI.CUSTOM.RL"></a><b>Result list
|
|
parameters: </b></p>
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style=
|
|
"list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Number of results in a
|
|
result page</span></p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Result list font</span>:
|
|
There is quite a lot of information shown in the
|
|
result list, and you may want to customize the font
|
|
and/or font size. The rest of the fonts used by
|
|
<span class="application">Recoll</span> are
|
|
determined by your generic Qt config (try the
|
|
<span class=
|
|
"command"><strong>qtconfig</strong></span>
|
|
command).</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><a name="RCL.SEARCH.GUI.CUSTOM.RESULTPARA" id=
|
|
"RCL.SEARCH.GUI.CUSTOM.RESULTPARA"></a><span class=
|
|
"guilabel">Edit result list paragraph format
|
|
string</span>: allows you to change the
|
|
presentation of each result list entry. See the
|
|
<a class="link" href=
|
|
"#RCL.SEARCH.GUI.CUSTOM.RESLIST" title=
|
|
"The result list format">result list customisation
|
|
section</a>.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><a name="RCL.SEARCH.GUI.CUSTOM.RESULTHEAD" id=
|
|
"RCL.SEARCH.GUI.CUSTOM.RESULTHEAD"></a><span class=
|
|
"guilabel">Edit result page HTML header
|
|
insert</span>: allows you to define text inserted
|
|
at the end of the result page HTML header. More
|
|
detail in the <a class="link" href=
|
|
"#RCL.SEARCH.GUI.CUSTOM.RESLIST" title=
|
|
"The result list format">result list customisation
|
|
section</a>.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Date format</span>:
|
|
allows specifying the format used for displaying
|
|
dates inside the result list. This should be
|
|
specified as an strftime() string (man
|
|
strftime).</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><a name="RCL.SEARCH.GUI.CUSTOM.ABSSEP" id=
|
|
"RCL.SEARCH.GUI.CUSTOM.ABSSEP"></a><span class=
|
|
"guilabel">Abstract snippet separator</span>: for
|
|
synthetic abstracts built from index data, which
|
|
are usually made of several snippets from different
|
|
parts of the document, this defines the snippet
|
|
separator, an ellipsis by default.</p>
|
|
</li>
|
|
</ul>
|
|
</div>
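<p>As an illustration of the strftime() format strings accepted by the <span class="guilabel">Date format</span> entry, here is a minimal Python sketch. The function name and the sample formats are assumptions made for the example; Python's <code class="literal">time.strftime</code> follows the same conventions as the C function, but this is not Recoll's own code.</p>

```python
import time


def format_result_date(epoch_seconds, date_format="%Y-%m-%d %H:%M:%S"):
    """Format a document timestamp with an strftime()-style format string."""
    # gmtime() keeps the example independent of the local time zone;
    # a real display would more likely use time.localtime().
    return time.strftime(date_format, time.gmtime(epoch_seconds))


print(format_result_date(0))                   # 1970-01-01 00:00:00
print(format_result_date(86400, "%d/%m/%Y"))   # 02/01/1970
```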
|
|
<p><a name="RCL.SEARCH.GUI.CUSTOM.SEARCH" id=
|
|
"RCL.SEARCH.GUI.CUSTOM.SEARCH"></a><b>Search
|
|
parameters: </b></p>
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style=
|
|
"list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Hide duplicate
|
|
results</span>: decides if result list entries are
|
|
shown for identical documents found in different
|
|
places.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Stemming language</span>:
|
|
stemming obviously depends on the document's
|
|
language. This listbox will let you choose among the
|
|
stemming databases which were built during indexing
|
|
(this is set in the <a class="link" href=
|
|
"#RCL.INSTALL.CONFIG.RECOLLCONF" title=
|
|
"5.4.2. Recoll main configuration file, recoll.conf">
|
|
main configuration file</a>), or later added with
|
|
<span class="command"><strong>recollindex
|
|
-s</strong></span> (See the recollindex manual).
|
|
Stemming languages which are dynamically added will
|
|
be deleted at the next indexing pass unless they
|
|
are also added in the configuration file.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Automatically add phrase
|
|
to simple searches</span>: a phrase will be
|
|
automatically built and added to simple searches
|
|
when looking for <code class="literal">Any
|
|
terms</code>. This will give a relevance boost to
|
|
the results where the search terms appear as a
|
|
phrase (consecutive and in order).</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Autophrase term frequency
|
|
threshold percentage</span>: very frequent terms
|
|
should not be included in automatic phrase searches
|
|
for performance reasons. The parameter defines the
|
|
cutoff percentage (percentage of the documents
|
|
where the term appears).</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Replace abstracts from
|
|
documents</span>: this decides if we should
|
|
synthesize and display an abstract in place of an
|
|
explicit abstract found within the document
|
|
itself.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Dynamically build
|
|
abstracts</span>: this decides if <span class=
|
|
"application">Recoll</span> tries to build document
|
|
abstracts (lists of <span class=
|
|
"emphasis"><em>snippets</em></span>) when
|
|
displaying the result list. Abstracts are
|
|
constructed by taking context from the document
|
|
information, around the search terms.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Synthetic abstract
|
|
size</span>: the target length of the synthetic abstract; adjust to taste.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Synthetic abstract
|
|
context words</span>: how many words should be
|
|
displayed around each term occurrence.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><span class="guilabel">Query language magic file
|
|
name suffixes</span>: a list of words which
|
|
automatically get turned into <code class=
|
|
"literal">ext:xxx</code> file name suffix clauses
|
|
when starting a query language query (e.g.:
|
|
<code class="literal">doc xls xlsx...</code>). This
|
|
will save some typing for people who use file types
|
|
a lot when querying.</p>
|
|
</li>
|
|
</ul>
|
|
</div>
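<p>To make the abstract-related parameters above more concrete, the following Python sketch assembles a synthetic abstract from the words surrounding each search term, and joins the resulting snippets with a separator. It is only an illustration of the idea: the function name, its parameters and the merging rule are assumptions, and the real abstracts are built from index data rather than from the raw text.</p>

```python
import re


def build_abstract(text, terms, context_words=2, separator=" ... "):
    """Join the words surrounding each search term occurrence into one string."""
    words = text.split()
    wanted = {term.lower() for term in terms}
    snippets = []
    last_end = 0  # index just past the last word already used in a snippet
    for i, word in enumerate(words):
        # Compare without punctuation and without case
        if re.sub(r"\W+", "", word).lower() not in wanted:
            continue
        start = max(i - context_words, last_end)
        end = min(i + context_words + 1, len(words))
        if start == end:
            continue  # this occurrence is already fully covered
        if snippets and start == last_end:
            # Adjacent or overlapping context: extend the previous snippet
            snippets[-1] += " " + " ".join(words[start:end])
        else:
            snippets.append(" ".join(words[start:end]))
        last_end = end
    return separator.join(snippets)


sample = "The quick brown fox jumps over the lazy dog near the quiet river bank"
print(build_abstract(sample, ["fox", "river"], context_words=1))
```

<p>Here <code class="literal">context_words</code> plays the role of the <span class="guilabel">Synthetic abstract context words</span> setting, and <code class="literal">separator</code> that of the <span class="guilabel">Abstract snippet separator</span>.</p>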
|
|
<p><a name="RCL.SEARCH.GUI.CUSTOM.EXTRADB" id=
|
|
"RCL.SEARCH.GUI.CUSTOM.EXTRADB"></a><b>External
|
|
indexes: </b>This panel will let you browse for
|
|
additional indexes that you may want to search. External
|
|
indexes are designated by their database directory (e.g.
|
|
<code class=
|
|
"filename">/home/someothergui/.recoll/xapiandb</code>,
|
|
<code class=
|
|
"filename">/usr/local/recollglobal/xapiandb</code>).</p>
|
|
<p>Once entered, the indexes will appear in the
|
|
<span class="guilabel">External indexes</span> list, and
|
|
you can choose which ones you want to use at any moment by
|
|
checking or unchecking their entries.</p>
|
|
<p>Your main database (the one the current configuration
|
|
indexes to), is always implicitly active. If this is not
|
|
desirable, you can set up your configuration so that it
|
|
indexes, for example, an empty directory. An alternative
|
|
indexer may also need to implement a way of purging the
|
|
index of stale data.</p>
|
|
<div class="sect3">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h4 class="title"><a name=
|
|
"RCL.SEARCH.GUI.CUSTOM.RESLIST" id=
|
|
"RCL.SEARCH.GUI.CUSTOM.RESLIST"></a>The result
|
|
list format</h4>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>Recoll normally uses a full-featured HTML processor
|
|
to display the result list and the <a class="link"
|
|
href="#RCL.SEARCH.GUI.RESULTLIST.MENU.SNIPPETS">snippets
|
|
window</a>. Depending on the version, this may be based
|
|
on either Qt WebKit or Qt WebEngine. It is then
|
|
possible to completely customise the result list with
|
|
full support for CSS and Javascript.</p>
|
|
<p>It is also possible to build <span class=
|
|
"application">Recoll</span> to use a simpler Qt
|
|
QTextBrowser widget to display the HTML, which may be
|
|
necessary if the ones above have not been ported to the
|
|
system, or to reduce the application size and
|
|
dependencies. There are limits to what you can do in
|
|
this case, but it is still possible to decide what data
|
|
each result will contain, and how it will be
|
|
displayed.</p>
|
|
<p>The result list presentation can be customized by
|
|
adjusting two elements:</p>
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style=
|
|
"list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p>The paragraph format</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>HTML code inside the header section. For
|
|
versions 1.21 and later, this is also used for
|
|
the <a class="link" href=
|
|
"#RCL.SEARCH.GUI.RESULTLIST.MENU.SNIPPETS">snippets
|
|
window</a>.</p>
|
|
</li>
|
|
</ul>
|
|
</div>
|
|
<p>The paragraph format and the header fragment can be
|
|
edited from the <span class="guilabel">Result
|
|
list</span> tab of the <span class="guilabel">GUI
|
|
configuration</span>.</p>
|
|
<p>The header fragment is used both for the result list
|
|
and the snippets window. The snippets list is a table
|
|
and has a <code class="literal">snippets</code> class
|
|
attribute. Each paragraph in the result list is a
|
|
table, with class <code class="literal">respar</code>,
|
|
but this can be changed by editing the paragraph
|
|
format.</p>
|
|
<p>There are a few examples on the <a class="ulink"
|
|
href="http://www.recoll.org/pages/custom.html" target=
|
|
"_top">page about customising the result list</a> on
|
|
the <span class="application">Recoll</span> web
|
|
site.</p>
|
|
<div class="sect4">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h5 class="title"><a name=
|
|
"RCL.SEARCH.GUI.CUSTOM.RESLIST.PARA" id=
|
|
"RCL.SEARCH.GUI.CUSTOM.RESLIST.PARA"></a>The
|
|
paragraph format</h5>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>This is an arbitrary HTML string where the
|
|
following printf-like <code class="literal">%</code>
|
|
substitutions will be performed:</p>
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style=
|
|
"list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p><b>%A. </b>Abstract</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><b>%D. </b>Date</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><b>%I. </b>Icon image name. This is
|
|
normally determined from the MIME type. The
|
|
associations are defined inside the <a class=
|
|
"link" href="#RCL.INSTALL.CONFIG.MIMECONF"
|
|
title=
|
|
"5.4.5. The mimeconf file"><code class=
|
|
"filename">mimeconf</code> configuration
|
|
file</a>. If a thumbnail for the file is found
|
|
at the standard Freedesktop location, this will
|
|
be displayed instead.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><b>%K. </b>Keywords (if any)</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><b>%L. </b>Precooked Preview, Edit, and
|
|
possibly Snippets links</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><b>%M. </b>MIME type</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><b>%N. </b>result Number inside the
|
|
result page</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><b>%P. </b>Parent folder Url. In the
|
|
case of an embedded document, this is the
|
|
parent folder for the top level container
|
|
file.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><b>%R. </b>Relevance percentage</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><b>%S. </b>Size information</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><b>%T. </b>Title or Filename if not
|
|
set.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><b>%t. </b>Title or empty.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><b>%(filename). </b>File name.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><b>%U. </b>Url</p>
|
|
</li>
|
|
</ul>
|
|
</div>
|
|
<p>The format of the Preview, Edit, and Snippets
|
|
links is <code class="literal"><a
|
|
href="P%N"></code>, <code class="literal"><a
|
|
href="E%N"></code> and <code class="literal"><a
|
|
href="A%N"></code> where <em class=
|
|
"replaceable"><code>docnum</code></em> (%N) expands
|
|
to the document number inside the result page.</p>
|
|
<p>A link target defined as <code class=
|
|
"literal">"F%N"</code> will open the document
|
|
corresponding to the <code class="literal">%P</code>
|
|
parent folder expansion, usually creating a file
|
|
manager window on the folder where the container file
|
|
resides. E.g.:</p>
|
|
<pre class=
|
|
"programlisting"><a href="F%N">%P</a></pre>
|
|
<p>A link target defined as <code class=
|
|
"literal">R%N|<em class=
|
|
"replaceable"><code>scriptname</code></em></code>
|
|
will run the corresponding script on the result file
|
|
(if the document is embedded, the script will be
|
|
started on the top-level parent). See the <a class=
|
|
"link" href="#RCL.SEARCH.GUI.RUNSCRIPT" title=
|
|
"3.2.4. Unix-like systems: running arbitrary commands on result files">
|
|
section about defining scripts</a>.</p>
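<p>The link targets described above all share the same shape: an action letter, the result number, and for <code class="literal">R</code> links a script name after a vertical bar. The following Python sketch decodes such a target. The function name, the action labels and the validation rules are assumptions made for the example, not a description of how the GUI handles clicks internally.</p>

```python
import re

# One action letter, the result number, then an optional |scriptname part
_LINK = re.compile(r"([PEAFR])(\d+)(?:\|(.+))?")

ACTIONS = {
    "P": "preview",
    "E": "edit",
    "A": "snippets",
    "F": "open parent folder",
    "R": "run script",
}


def parse_link_target(href):
    """Return (action, result number, script name or None), or None if invalid."""
    match = _LINK.fullmatch(href)
    if match is None:
        return None
    letter, docnum, script = match.groups()
    if (letter == "R") != (script is not None):
        # In this sketch a script name is required for R links, and only for them
        return None
    return ACTIONS[letter], int(docnum), script


print(parse_link_target("P3"))
print(parse_link_target("R12|myscript"))  # "myscript" is a made-up script name
```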
|
|
<p>In addition to the predefined values above, all
|
|
strings like <code class=
|
|
"literal">%(fieldname)</code> will be replaced by the
|
|
value of the field named <code class=
|
|
"literal">fieldname</code> for this document. Only
|
|
stored fields can be accessed in this way, the value
|
|
of indexed but not stored fields is not known at this
|
|
point in the search process (see <a class="link"
|
|
href="#RCL.PROGRAM.FIELDS" title=
|
|
"4.2. Field data processing">field
|
|
configuration</a>). There are currently very few
|
|
fields stored by default, apart from the values above
|
|
(only <code class="literal">author</code> and
|
|
<code class="literal">filename</code>), so this
|
|
feature will need some custom local configuration to
|
|
be useful. An example candidate would be the
|
|
<code class="literal">recipient</code> field which is
|
|
generated by the message input handlers.</p>
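<p>The substitution mechanism can be pictured with the following Python sketch, which expands both single-letter codes and <code class="literal">%(fieldname)</code> references in a format string. The function name and the handling of unknown codes (left unchanged) and of unknown fields (expanded to nothing) are assumptions made for the example; the actual behaviour of the GUI in those cases is not specified here.</p>

```python
import re

# Matches either a %(fieldname) reference or a single-character %X code
_SUBST = re.compile(r"%(?:\((\w+)\)|(.))")


def expand_paragraph_format(fmt, values, fields):
    """Expand %X codes from 'values' and %(name) references from 'fields'."""
    def replace(match):
        field_name, letter = match.groups()
        if field_name is not None:
            # Fields which are not stored expand to nothing in this sketch
            return fields.get(field_name, "")
        # Unknown codes are left as they are in this sketch
        return values.get(letter, match.group(0))
    return _SUBST.sub(replace, fmt)


values = {"N": "3", "T": "Eugenie Grandet", "M": "text/html"}
fields = {"author": "balzac"}
print(expand_paragraph_format("%N. %T [%M] by %(author)", values, fields))
```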
|
|
<p>The default value for the paragraph format string
|
|
is:</p>
|
|
<pre class="screen">
|
|
"<table class=\"respar\">\n"
|
|
"<tr>\n"
|
|
"<td><a href='%U'><img src='%I' width='64'></a></td>\n"
|
|
"<td>%L &nbsp;<i>%S</i> &nbsp;&nbsp;<b>%T</b><br>\n"
|
|
"<span style='white-space:nowrap'><i>%M</i>&nbsp;%D</span>&nbsp;&nbsp;&nbsp; <i>%U</i>&nbsp;%i<br>\n"
|
|
"%A %K</td>\n"
|
|
"</tr></table>\n"
|
|
</pre>
|
|
<p>You may, for example, try the following for a more
|
|
web-like experience:</p>
|
|
<pre class="screen">
|
|
<u><b><a href="P%N">%T</a></b></u><br>
|
|
%A<font color=#008000>%U - %S</font> - %L
|
|
</pre>
|
|
<p>Note that the P%N link in the above paragraph
|
|
makes the title a preview link. Or the clean
|
|
looking:</p>
|
|
<pre class="screen">
|
|
<img src="%I" align="left">%L <font color="#900000">%R</font>
|
|
&amp;nbsp;&amp;nbsp;&lt;b&gt;%T&lt;/b&gt;&lt;br&gt;%S&amp;nbsp;
|
|
<font color="#808080"><i>%U</i></font>
|
|
<table bgcolor="#e0e0e0">
|
|
<tr><td><div>%A</div></td></tr>
|
|
</table>%K
|
|
</pre>
|
|
<p>These samples, and some others are <a class=
|
|
"ulink" href=
|
|
"http://www.recoll.org/pages/custom.html" target=
|
|
"_top">on the web site, with pictures to show how
|
|
they look.</a></p>
|
|
<p>It is also possible to <a class="link" href=
|
|
"#RCL.SEARCH.GUI.CUSTOM.ABSSEP">define the value of
|
|
the snippet separator inside the abstract
|
|
section</a>.</p>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<div class="sect1">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h2 class="title" style="clear: both"><a name=
|
|
"RCL.SEARCH.KIO" id=
|
|
"RCL.SEARCH.KIO"></a>3.3. Searching with the KDE
|
|
KIO slave</h2>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name="RCL.SEARCH.KIO.INTRO"
|
|
id="RCL.SEARCH.KIO.INTRO"></a>3.3.1. What's
|
|
this</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>The <span class="application">Recoll</span> KIO slave
|
|
allows performing a <span class=
|
|
"application">Recoll</span> search by entering an
|
|
appropriate URL in a KDE open dialog, or with an
|
|
HTML-based interface displayed in <span class=
|
|
"command"><strong>Konqueror</strong></span>.</p>
|
|
<p>The HTML-based interface is similar to the Qt-based
|
|
interface, but slightly less powerful for now. Its
|
|
advantage is that you can perform your search while
|
|
staying fully within the KDE framework: drag and drop
|
|
from the result list works normally and you have your
|
|
normal choice of applications for opening files.</p>
|
|
<p>The alternative interface uses a directory view of
|
|
search results. Due to limitations in the current KIO
|
|
slave interface, it is currently not obviously useful (to
|
|
me).</p>
|
|
<p>The interface is described in more detail inside a
|
|
help file which you can access by entering <code class=
|
|
"filename">recoll:/</code> inside the <span class=
|
|
"command"><strong>konqueror</strong></span> URL line
|
|
(this works only if the recoll KIO slave has been
|
|
previously installed).</p>
|
|
<p>The instructions for building this module are located
|
|
in the source tree. See: <code class=
|
|
"filename">kde/kio/recoll/00README.txt</code>. Some Linux
|
|
distributions do package the kio-recoll module, so check
|
|
before diving into the build process: maybe it's already
|
|
out there ready for one-click installation.</p>
|
|
</div>
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.SEARCH.KIO.SEARCHABLEDOCS" id=
|
|
"RCL.SEARCH.KIO.SEARCHABLEDOCS"></a>3.3.2. Searchable
|
|
documents</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>As a sample application, the <span class=
|
|
"application">Recoll</span> KIO slave could allow
|
|
preparing a set of HTML documents (for example a manual)
|
|
so that they become their own search interface inside
|
|
<span class=
|
|
"command"><strong>konqueror</strong></span>.</p>
|
|
<p>This can be done by either explicitly inserting
|
|
<code class="literal"><a
|
|
href="recoll://..."></code> links around some document
|
|
areas, or automatically by adding a very small
|
|
<span class="application">javascript</span> program to
|
|
the documents, like the following example, which would
|
|
initiate a search by double-clicking any term:</p>
|
|
<pre class=
|
|
"programlisting"><script language="JavaScript">
|
|
function recollsearch() {
|
|
var t = document.getSelection();
|
|
window.location.href = 'recoll://search/query?qtp=a&p=0&q=' +
|
|
encodeURIComponent(t);
|
|
}
|
|
</script>
|
|
....
|
|
<body ondblclick="recollsearch()">
|
|
|
|
</pre>
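<p>The same kind of URL can be generated outside the browser, for example when preparing the explicit links in advance. The Python sketch below builds it with the parameter names and values used by the JavaScript example (<code class="literal">qtp=a</code>, <code class="literal">p=0</code>, <code class="literal">q=...</code>); the function name is an assumption, and the percent-encoding may differ from <code class="literal">encodeURIComponent</code> for a few punctuation characters.</p>

```python
from urllib.parse import quote, urlencode


def recoll_search_url(term):
    """Build a recoll:// search URL similar to the JavaScript example."""
    # Parameter names and values are copied from the JavaScript example;
    # quote_via=quote encodes spaces as %20 instead of '+'.
    params = {"qtp": "a", "p": "0", "q": term}
    return "recoll://search/query?" + urlencode(params, quote_via=quote)


print(recoll_search_url("john doe"))
```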
|
|
</div>
|
|
</div>
|
|
<div class="sect1">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h2 class="title" style="clear: both"><a name=
|
|
"RCL.SEARCH.COMMANDLINE" id=
|
|
"RCL.SEARCH.COMMANDLINE"></a>3.4. Searching on
|
|
the command line</h2>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>There are several ways to obtain search results as a
|
|
text stream, without a graphical interface:</p>
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style="list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p>By passing option <code class="option">-t</code>
|
|
to the <span class=
|
|
"command"><strong>recoll</strong></span> program, or
|
|
by calling it as <span class=
|
|
"command"><strong>recollq</strong></span> (through a
|
|
link).</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>By using the <span class=
|
|
"command"><strong>recollq</strong></span>
|
|
program.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>By writing a custom <span class=
|
|
"application">Python</span> program, using the
|
|
<a class="link" href="#RCL.PROGRAM.PYTHONAPI" title=
|
|
"4.3. Python API">Recoll Python API</a>.</p>
|
|
</li>
|
|
</ul>
|
|
</div>
|
|
<p>The first two methods work in the same way and
|
|
accept/need the same arguments (except for the additional
|
|
<code class="option">-t</code> to <span class=
|
|
"command"><strong>recoll</strong></span>). The query to be
|
|
executed is specified as command line arguments.</p>
|
|
<p><span class="command"><strong>recollq</strong></span> is
|
|
not always built by default. You can use the <code class=
|
|
"filename">Makefile</code> in the <code class=
|
|
"filename">query</code> directory to build it. This is a
|
|
very simple program, and if you can program a little C++,
|
|
you may find it useful to tailor its output format to your
|
|
needs. Apart from being easily customised, <span class=
|
|
"command"><strong>recollq</strong></span> is only really
|
|
useful on systems where the Qt libraries are not available,
|
|
otherwise it is redundant with <code class="literal">recoll
|
|
-t</code>.</p>
|
|
<p><span class="command"><strong>recollq</strong></span>
|
|
has a <a class="ulink" href=
|
|
"https://www.lesbonscomptes.com/recoll/manpages/recollq.1.html"
|
|
target="_top">man page</a>. The Usage string follows:</p>
|
|
<pre class="programlisting">
|
|
recollq: usage:
|
|
-P: Show the date span for all the documents present in the index
|
|
[-o|-a|-f] [-q] <query string>
|
|
Runs a recoll query and displays result lines.
|
|
Default: will interpret the argument(s) as a xesam query string
|
|
Query elements:
|
|
* Implicit AND, exclusion, field spec: t1 -t2 title:t3
|
|
* OR has priority: t1 OR t2 t3 OR t4 means (t1 OR t2) AND (t3 OR t4)
|
|
* Phrase: "t1 t2" (needs additional quoting on cmd line)
|
|
-o Emulate the GUI simple search in ANY TERM mode
|
|
-a Emulate the GUI simple search in ALL TERMS mode
|
|
-f Emulate the GUI simple search in filename mode
|
|
-q is just ignored (compatibility with the recoll GUI command line)
|
|
Common options:
|
|
-c <configdir> : specify config directory, overriding $RECOLL_CONFDIR
|
|
-d also dump file contents
|
|
-n [first-]<cnt> define the result slice. The default value for [first]
|
|
is 0. Without the option, the default max count is 2000.
|
|
Use n=0 for no limit
|
|
-b : basic. Just output urls, no mime types or titles
|
|
-Q : no result lines, just the processed query and result count
|
|
-m : dump the whole document meta[] array for each result
|
|
-A : output the document abstracts
|
|
-S fld : sort by field <fld>
|
|
-D : sort descending
|
|
-s stemlang : set stemming language to use (must exist in index...)
|
|
Use -s "" to turn off stem expansion
|
|
-T <synonyms file>: use the parameter (Thesaurus) for word expansion
|
|
-i <dbdir> : additional index, several can be given
|
|
-e use url encoding (%xx) for urls
|
|
-F <field name list> : output exactly these fields for each result.
|
|
The field values are encoded in base64, output in one line and
|
|
separated by one space character. This is the recommended format
|
|
for use by other programs. Use a normal query with option -m to
|
|
see the field names. Use -F '' to output all fields, but you probably
|
|
also want option -N in this case
|
|
-N : with -F, print the (plain text) field names before the field values
|
|
</pre>
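<p>As the usage text recommends the <code class="option">-F</code> format for use by other programs, here is a minimal Python sketch which decodes one such output line. It assumes that the values appear in the same order as the requested field names and that option <code class="option">-N</code> was not used; the function name is hypothetical, and the exact line layout should be checked against real output.</p>

```python
import base64


def decode_field_line(line, field_names):
    """Decode one result line: base64 values separated by single spaces."""
    tokens = line.rstrip("\n").split(" ")
    values = [
        base64.b64decode(token).decode("utf-8", errors="replace")
        for token in tokens
    ]
    # Pair each decoded value with the field name requested at that position
    return dict(zip(field_names, values))


# Build a sample line by hand, as no index is available in this example
sample_values = ["comptes.html", "dockes"]
sample_line = " ".join(
    base64.b64encode(value.encode("utf-8")).decode("ascii")
    for value in sample_values
)
print(decode_field_line(sample_line, ["filename", "author"]))
```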
|
|
<p>Sample execution:</p>
|
|
<pre class="programlisting">
|
|
recollq 'ilur -nautique mime:text/html'
|
|
Recoll query: ((((ilur:(wqf=11) OR ilurs) AND_NOT (nautique:(wqf=11) OR nautiques OR nautiqu OR nautiquement)) FILTER Ttext/html))
|
|
4 results
|
|
text/html [file:///Users/uncrypted-dockes/projets/bateaux/ilur/comptes.html] [comptes.html] 18593 bytes
|
|
text/html [file:///Users/uncrypted-dockes/projets/nautique/webnautique/articles/ilur1/index.html] [Constructio...
|
|
text/html [file:///Users/uncrypted-dockes/projets/pagepers/index.html] [psxtcl/writemime/recoll]...
|
|
text/html [file:///Users/uncrypted-dockes/projets/bateaux/ilur/factEtCie/recu-chasse-maree....
|
|
</pre>
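<p>When calling <span class="command"><strong>recollq</strong></span> from a script, it can help to build the argument list programmatically. The following Python sketch only uses options which appear in the usage text above (<code class="option">-c</code>, <code class="option">-o</code>, <code class="option">-a</code>, <code class="option">-f</code>, <code class="option">-n</code>, <code class="option">-b</code>); the function and parameter names are assumptions made for the example.</p>

```python
import shlex


def recollq_argv(query, mode=None, max_results=None, urls_only=False,
                 config_dir=None):
    """Build an argument list for recollq from options in its usage text."""
    argv = ["recollq"]
    if config_dir is not None:
        argv += ["-c", config_dir]
    if mode is not None:
        if mode not in ("o", "a", "f"):
            raise ValueError("mode must be 'o', 'a' or 'f'")
        argv.append("-" + mode)
    if max_results is not None:
        argv += ["-n", str(max_results)]
    if urls_only:
        argv.append("-b")
    # The whole query is passed as one argument, as in the sample execution
    argv.append(query)
    return argv


argv = recollq_argv("ilur -nautique mime:text/html", max_results=10,
                    urls_only=True)
print(shlex.join(argv))
```

<p>The resulting list can then be handed to <code class="literal">subprocess.run</code> on a system where <span class="command"><strong>recollq</strong></span> is installed.</p>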
|
|
</div>
|
|
<div class="sect1">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h2 class="title" style="clear: both"><a name=
|
|
"RCL.SEARCH.LANG" id=
|
|
"RCL.SEARCH.LANG"></a>3.5. The query
|
|
language</h2>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>The query language processor is activated in the GUI
|
|
simple search entry when the search mode selector is set to
|
|
<span class="guilabel">Query Language</span>. It can also
|
|
be used with the KIO slave or the command line search. It
|
|
broadly has the same capabilities as the complex search
|
|
interface in the GUI.</p>
|
|
<p>The language was based on the now defunct <a class=
|
|
"ulink" href=
|
|
"http://www.xesam.org/main/XesamUserSearchLanguage95"
|
|
target="_top">Xesam</a> user search language
|
|
specification.</p>
|
|
<p>If the results of a query language search puzzle you and
|
|
you are unsure what was actually searched for, you can use
|
|
the GUI <code class="literal">Show Query</code> link at the
|
|
top of the result list to check the exact query which was
|
|
finally executed by Xapian.</p>
|
|
<p>Here follows a sample request that we are going to
|
|
explain:</p>
|
|
<pre class="programlisting">
|
|
author:"john doe" Beatles OR Lennon Live OR Unplugged -potatoes
|
|
</pre>
|
|
<p>This would search for all documents with <em class=
|
|
"replaceable"><code>John Doe</code></em> appearing as a
|
|
phrase in the author field (exactly what this is would
|
|
depend on the document type, e.g. the <code class=
|
|
"literal">From:</code> header, for an email message), and
|
|
containing either <em class=
|
|
"replaceable"><code>beatles</code></em> or <em class=
|
|
"replaceable"><code>lennon</code></em> and either
|
|
<em class="replaceable"><code>live</code></em> or
|
|
<em class="replaceable"><code>unplugged</code></em> but not
|
|
<em class="replaceable"><code>potatoes</code></em> (in any
|
|
part of the document).</p>
|
|
<p>An element is composed of an optional field
|
|
specification, and a value, separated by a colon (the field
|
|
separator is the last colon in the element). Examples:
|
|
<em class="replaceable"><code>Eugenie</code></em>,
|
|
<em class="replaceable"><code>author:balzac</code></em>,
|
|
<em class="replaceable"><code>dc:title:grandet</code></em>,
|
|
<em class="replaceable"><code>dc:title:"eugenie
|
|
grandet"</code></em></p>
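<p>The "last colon" rule can be sketched in Python as follows. The function name is hypothetical, and the handling of double-quoted values (the colon is only looked for before the opening quote) is an assumption made so that a quoted phrase is kept whole.</p>

```python
def split_element(element):
    """Split a query element into (field, value); field is '' if absent."""
    # Only look for the separator before a double-quoted value, if any
    quote_pos = element.find('"')
    head = element if quote_pos == -1 else element[:quote_pos]
    field, sep, _ = head.rpartition(":")
    if not sep:
        return "", element
    # Everything after the last colon of the head is the value
    return field, element[len(field) + 1:]


for sample in ("Eugenie", "author:balzac", "dc:title:grandet",
               'dc:title:"eugenie grandet"'):
    print(split_element(sample))
```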
|
|
<p>The colon, if present, means "contains". Xesam defines
|
|
other relations, which are mostly unsupported for now
|
|
(except in special cases, described further down).</p>
|
|
<p>All elements in the search entry are normally combined
|
|
with an implicit AND. It is possible to specify that
|
|
elements be OR'ed instead, as in <em class=
|
|
"replaceable"><code>Beatles</code></em> <code class=
|
|
"literal">OR</code> <em class=
|
|
"replaceable"><code>Lennon</code></em>. The <code class=
|
|
"literal">OR</code> must be entered literally (capitals),
|
|
and it has priority over the AND associations: <em class=
|
|
"replaceable"><code>word1</code></em> <em class=
|
|
"replaceable"><code>word2</code></em> <code class=
|
|
"literal">OR</code> <em class=
|
|
"replaceable"><code>word3</code></em> means <em class=
|
|
"replaceable"><code>word1</code></em> AND (<em class=
|
|
"replaceable"><code>word2</code></em> <code class=
|
|
"literal">OR</code> <em class=
|
|
"replaceable"><code>word3</code></em>) not (<em class=
|
|
"replaceable"><code>word1</code></em> AND <em class=
|
|
"replaceable"><code>word2</code></em>) <code class=
|
|
"literal">OR</code> <em class=
|
|
"replaceable"><code>word3</code></em>.</p>
|
|
<p><span class="application">Recoll</span> versions 1.21
|
|
and later allow using parentheses to group elements, which
|
|
will sometimes make things clearer, and may allow
|
|
expressing combinations which would have been difficult
|
|
otherwise.</p>
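<p>The priority of <code class="literal">OR</code> over the implicit AND can be illustrated with a small Python sketch, which turns a query into a list of OR groups that are then ANDed together. This is a deliberately simplified model: it splits on white space only, so it does not handle quoted phrases or parentheses, and the function name is an assumption.</p>

```python
def group_terms(query):
    """Return a list of OR-groups; the groups are implicitly ANDed together."""
    groups = []
    join_next = False  # True right after an OR keyword
    for token in query.split():
        if token == "OR":
            # An OR with nothing before it is ignored in this sketch
            join_next = bool(groups)
        elif join_next:
            groups[-1].append(token)
            join_next = False
        else:
            groups.append([token])
    return groups


print(group_terms("word1 word2 OR word3"))
print(group_terms("Beatles OR Lennon Live OR Unplugged -potatoes"))
```

<p>The first call yields one group for <code class="literal">word1</code> and one for <code class="literal">word2 OR word3</code>, which matches the interpretation described above.</p>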
|
|
<p>An element preceded by a <code class="literal">-</code>
|
|
specifies a term that should <span class=
|
|
"emphasis"><em>not</em></span> appear.</p>
|
|
<p>As usual, words inside quotes define a phrase (the order
|
|
of words is significant), so that <em class=
|
|
"replaceable"><code>title:"prejudice pride"</code></em> is
|
|
not the same as <em class=
|
|
"replaceable"><code>title:prejudice
|
|
title:pride</code></em>, and is unlikely to find a
|
|
result.</p>
|
|
<p>Words inside phrases and capitalized words are not
|
|
stem-expanded. Wildcards may be used anywhere inside a
|
|
term. Specifying a wild-card on the left of a term can
|
|
produce a very slow search (or even an incorrect one if the
|
|
expansion is truncated because of excessive size). Also see
|
|
<a class="link" href="#RCL.SEARCH.WILDCARDS" title=
|
|
"3.6.1. More about wildcards">More about
|
|
wildcards</a>.</p>
|
|
<p>To save you some typing, recent <span class=
|
|
"application">Recoll</span> versions (1.20 and later)
|
|
interpret a comma-separated list of terms for a field as an
|
|
AND list inside the field. Use slash characters ('/') for
|
|
an OR list. No white space is allowed. So</p>
|
|
<pre class="programlisting">author:john,lennon</pre>
|
|
<p>will search for documents with <code class=
|
|
"literal">john</code> and <code class=
|
|
"literal">lennon</code> inside the <code class=
|
|
"literal">author</code> field (in any order), and</p>
|
|
<pre class="programlisting">author:john/ringo</pre>
|
|
<p>would search for <code class="literal">john</code> or
|
|
<code class="literal">ringo</code>. This behaviour only
|
|
happens for field queries: without a field, comma- or
slash-separated input will produce a phrase search. You
|
|
can use a <code class="literal">text</code> field name to
|
|
search the main text this way.</p>
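<p>The expansion of these lists can be sketched in Python as follows. The function name is hypothetical; the sketch gives the comma priority when both separators are present, which is an arbitrary choice, and it does not special-case path-valued criteria such as <code class="literal">dir</code>.</p>

```python
def expand_field_list(element):
    """Expand field:a,b (AND) or field:a/b (OR) into single-term clauses."""
    field, sep, value = element.rpartition(":")
    if not sep:
        # No field: according to the text above, this is a phrase search
        return None
    if "," in value:
        operator, terms = "AND", value.split(",")
    elif "/" in value:
        operator, terms = "OR", value.split("/")
    else:
        operator, terms = "AND", [value]
    return operator, [field + ":" + term for term in terms]


print(expand_field_list("author:john,lennon"))
print(expand_field_list("author:john/ringo"))
```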
|
|
<p>Modifiers can be set on a double-quote value, for
|
|
example to specify a proximity search (unordered). See
|
|
<a class="link" href="#RCL.SEARCH.LANG.MODIFIERS" title=
|
|
"3.5.2. Modifiers">the modifier section</a>. No space
|
|
must separate the final double-quote and the modifiers
|
|
value, e.g. <em class="replaceable"><code>"two
|
|
one"po10</code></em></p>
|
|
<p><span class="application">Recoll</span> currently
|
|
manages the following default fields:</p>
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style="list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p><code class="literal">title</code>, <code class=
|
|
"literal">subject</code> or <code class=
|
|
"literal">caption</code> are synonyms which specify
|
|
data to be searched for in the document title or
|
|
subject.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><code class="literal">author</code> or
|
|
<code class="literal">from</code> for searching the
|
|
document originators.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><code class="literal">recipient</code> or
|
|
<code class="literal">to</code> for searching the
|
|
document recipients.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><code class="literal">keyword</code> for searching
|
|
the document-specified keywords (few documents
|
|
actually have any).</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><code class="literal">filename</code> for the
|
|
document's file name. This is not necessarily set for
|
|
all documents: internal documents contained inside a
|
|
compound one (for example an EPUB section) do not
|
|
inherit the container file name any more; this was
|
|
replaced by an explicit field (see next).
|
|
Sub-documents can still have a specific <code class=
|
|
"literal">filename</code>, if it is implied by the
|
|
document format, for example the attachment file name
|
|
for an email attachment.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><code class="literal">containerfilename</code>.
|
|
This is set for all documents, both top-level and
|
|
contained sub-documents, and is always the name of
|
|
the filesystem directory entry which contains the
|
|
data. The terms from this field can only be matched
|
|
by an explicit field specification (as opposed to
|
|
terms from <code class="literal">filename</code>
|
|
which are also indexed as general document content).
|
|
This avoids getting matches for all the sub-documents
|
|
when searching for the container file name.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><code class="literal">ext</code> specifies the
|
|
file name extension (Ex: <code class=
|
|
"literal">ext:html</code>).</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><code class="literal">rclmd5</code> is the MD5
|
|
checksum for the document. This is used for
|
|
displaying the duplicates of a search result (when
|
|
querying with the option to collapse duplicate
|
|
results). Incidentally, this could be used to find
|
|
the duplicates of any given file by computing its MD5
|
|
checksum and executing a query with just the
|
|
<code class="literal">rclmd5</code> value.</p>
|
|
</li>
|
|
</ul>
|
|
</div>
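<p>The duplicate search suggested for the <code class="literal">rclmd5</code> field could be prepared with a Python sketch like the following, which computes the MD5 checksum of a file and builds the corresponding query clause. It assumes that the indexed value is the lowercase hexadecimal digest of the raw file contents; the function name is hypothetical.</p>

```python
import hashlib


def rclmd5_query(path, chunk_size=65536):
    """Return a query clause matching documents with the same MD5 checksum."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so that large files are not loaded into memory at once
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return "rclmd5:" + digest.hexdigest()
```

<p>The returned string can then be entered as a query language search, or passed to a command line query.</p>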
|
|
<p><span class="application">Recoll</span> 1.20 and later
|
|
have a way to specify aliases for the field names, which
|
|
will save typing, for example by aliasing <code class=
|
|
"literal">filename</code> to <em class=
|
|
"replaceable"><code>fn</code></em> or <code class=
|
|
"literal">containerfilename</code> to <em class=
|
|
"replaceable"><code>cfn</code></em>. See the <a class=
|
|
"link" href="#RCL.INSTALL.CONFIG.FIELDS" title=
|
|
"5.4.3. The fields file">section about the
|
|
<code class="filename">fields</code> file</a>.</p>
|
|
<p>The document input handlers used while indexing have the
|
|
possibility to create other fields with arbitrary names,
|
|
and aliases may be defined in the configuration, so that
|
|
the exact field search possibilities may be different for
|
|
you if someone took care of the customisation.</p>
|
|
<p>The field syntax also supports a few field-like, but
|
|
special, criteria:</p>
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style="list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p><code class="literal">dir</code> for filtering the
|
|
results on file location (Ex: <code class=
|
|
"literal">dir:/home/me/somedir</code>). <code class=
|
|
"literal">-dir</code> also works to find results not
|
|
in the specified directory (release >= 1.15.8).
|
|
Tilde expansion will be performed as usual (except
|
|
for a bug in versions 1.19 to 1.19.11p1). Wildcards
|
|
will be expanded, but please <a class="link" href=
|
|
"#RCL.SEARCH.WILDCARDS.PATH" title=
|
|
"Wildcards and path filtering">have a look</a> at an
|
|
important limitation of wildcards in path
|
|
filters.</p>
|
|
<p>Relative paths also make sense, for example,
|
|
<code class="literal">dir:share/doc</code> would
|
|
match either <code class=
"filename">/usr/share/doc</code> or <code class=
"filename">/usr/local/share/doc</code>.</p>
|
|
<p>Several <code class="literal">dir</code> clauses
|
|
can be specified, both positive and negative. For
|
|
example the following makes sense:</p>
|
|
<pre class="programlisting">
|
|
dir:recoll dir:src -dir:utils -dir:common
|
|
</pre>
|
|
<p>This would select results which have both
|
|
<code class="filename">recoll</code> and <code class=
|
|
"filename">src</code> in the path (in any order), and
|
|
which have neither <code class=
"filename">utils</code> nor <code class=
|
|
"filename">common</code>.</p>
|
|
<p>You can also use <code class="literal">OR</code>
|
|
conjunctions with <code class="literal">dir:</code>
|
|
clauses.</p>
|
|
<p>A special aspect of <code class=
|
|
"literal">dir</code> clauses is that the values in
|
|
the index are not transcoded to UTF-8, and never
|
|
lower-cased or unaccented, but stored as binary. This
|
|
means that you need to enter the values in the exact
|
|
lower or upper case, and that searches for names with
|
|
diacritics may sometimes be impossible because of
|
|
character set conversion issues. Non-ASCII UNIX file
|
|
paths are an unending source of trouble and are best
|
|
avoided.</p>
|
|
<p>You need to use double-quotes around the path
|
|
value if it contains space characters.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><code class="literal">size</code> for filtering
|
|
the results on file size. Example: <code class=
|
|
"literal">size<10000</code>. You can use
|
|
<code class="literal"><</code>, <code class=
|
|
"literal">></code> or <code class=
|
|
"literal">=</code> as operators. You can specify a
|
|
range like the following: <code class=
|
|
"literal">size>100 size<1000</code>. The usual
|
|
<code class="literal">k/K, m/M, g/G, t/T</code> can
|
|
be used as (decimal) multipliers. Ex: <code class=
|
|
"literal">size>1k</code> to search for files
|
|
bigger than 1000 bytes.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><code class="literal">date</code> for searching or
|
|
filtering on dates. The syntax for the argument is
|
|
based on the ISO8601 standard for dates and time
|
|
intervals. Only dates are supported, no times. The
|
|
general syntax is 2 elements separated by a
|
|
<code class="literal">/</code> character. Each
|
|
element can be a date or a period of time. Periods
|
|
are specified as <code class=
|
|
"literal">P</code><em class=
|
|
"replaceable"><code>n</code></em><code class=
|
|
"literal">Y</code><em class=
|
|
"replaceable"><code>n</code></em><code class=
|
|
"literal">M</code><em class=
|
|
"replaceable"><code>n</code></em><code class=
|
|
"literal">D</code>. The <em class=
|
|
"replaceable"><code>n</code></em> numbers are the
|
|
respective numbers of years, months or days, any of
|
|
which may be missing. Dates are specified as
|
|
<em class=
|
|
"replaceable"><code>YYYY</code></em>-<em class=
|
|
"replaceable"><code>MM</code></em>-<em class=
|
|
"replaceable"><code>DD</code></em>. The days and
|
|
months parts may be missing. If the <code class=
|
|
"literal">/</code> is present but an element is
|
|
missing, the missing element is interpreted as the
|
|
lowest or highest date in the index. Examples:</p>
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style=
|
|
"list-style-type: circle;">
|
|
<li class="listitem">
|
|
<p><code class=
|
|
"literal">2001-03-01/2002-05-01</code> the
|
|
basic syntax for an interval of dates.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><code class=
|
|
"literal">2001-03-01/P1Y2M</code> the same
|
|
specified with a period.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><code class="literal">2001/</code> from the
|
|
beginning of 2001 to the latest date in the
|
|
index.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><code class="literal">2001</code> the whole
|
|
year of 2001</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><code class="literal">P2D/</code> means 2
|
|
days ago up to now if there are no documents
|
|
with dates in the future.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><code class="literal">/2003</code> all
|
|
documents from 2003 or older.</p>
|
|
</li>
|
|
</ul>
|
|
</div>
|
|
<p>Periods can also be specified with lower-case letters
(e.g. <code class="literal">p2y</code>).</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><code class="literal">mime</code> or <code class=
|
|
"literal">format</code> for specifying the MIME type.
|
|
These clauses are processed separately from the normal
|
|
Boolean logic of the search. Multiple values will be
|
|
OR'ed (instead of the normal AND). You can specify
|
|
types to be excluded, with the usual <code class=
|
|
"literal">-</code>, and use wildcards. Example:
|
|
<em class="replaceable"><code>mime:text/*
|
|
-mime:text/plain</code></em>. Specifying an explicit
|
|
boolean operator before a <code class=
|
|
"literal">mime</code> specification is not supported
|
|
and will produce strange results.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><code class="literal">type</code> or <code class=
|
|
"literal">rclcat</code> for specifying the category
|
|
(as in text/media/presentation/etc.). The
|
|
classification of MIME types in categories is defined
|
|
in the <span class="application">Recoll</span>
|
|
configuration (<code class=
|
|
"filename">mimeconf</code>), and can be modified or
|
|
extended. The default category names are those which
|
|
permit filtering results in the main GUI screen.
|
|
Categories are OR'ed like MIME types above, and can
|
|
be negated with <code class="literal">-</code>.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><code class="literal">issub</code> for specifying
|
|
that only standalone (<code class=
|
|
"literal">issub:0</code>) or only embedded
|
|
(<code class="literal">issub:1</code>) documents
|
|
should be returned as results.</p>
|
|
</li>
|
|
</ul>
|
|
</div>
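<p>As an illustration of the <code class="literal">size</code>
and <code class="literal">date</code> value syntax described above,
here is a minimal Python sketch. The helper names
<code class="literal">size_to_bytes</code> and
<code class="literal">parse_period</code> are hypothetical and are
not part of <span class="application">Recoll</span>; the code only
mirrors the rules stated in the text (decimal multipliers, periods
written as PnYnMnD in either case).</p>

```python
import re

# Decimal multipliers, as described for the size criterion (1k = 1000 bytes).
_MULTIPLIERS = {"k": 10**3, "m": 10**6, "g": 10**9, "t": 10**12}


def size_to_bytes(value):
    """Turn '1k' or '25M' into a byte count (hypothetical helper)."""
    match = re.fullmatch(r"(\d+)([kKmMgGtT]?)", value.strip())
    if match is None:
        raise ValueError(f"not a size value: {value!r}")
    number, suffix = match.groups()
    return int(number) * _MULTIPLIERS.get(suffix.lower(), 1)


def parse_period(period):
    """Split 'P1Y2M' or 'p2d' into a (years, months, days) tuple."""
    match = re.fullmatch(
        r"P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?", period.strip(), re.IGNORECASE
    )
    # A bare 'P' with no component is not a usable period.
    if match is None or not any(match.groups()):
        raise ValueError(f"not a period: {period!r}")
    return tuple(int(part) if part else 0 for part in match.groups())


print(size_to_bytes("1k"))      # bytes threshold behind size>1k
print(parse_period("P1Y2M"))    # period used in 2001-03-01/P1Y2M
```

<p>The sketch only decodes the values; how
<span class="application">Recoll</span> turns a period into actual
interval bounds is not shown here.</p>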
|
|
<div class="note" style=
|
|
"margin-left: 0.5in; margin-right: 0.5in;">
|
|
<h3 class="title">Note</h3>
|
|
<p><code class="literal">mime</code>, <code class=
|
|
"literal">rclcat</code>, <code class=
|
|
"literal">size</code>, <code class="literal">issub</code>
|
|
and <code class="literal">date</code> criteria always
|
|
affect the whole query (they are applied as a final
|
|
filter), even if set with other terms inside
parentheses.</p>
|
|
</div>
|
|
<div class="note" style=
|
|
"margin-left: 0.5in; margin-right: 0.5in;">
|
|
<h3 class="title">Note</h3>
|
|
<p><code class="literal">mime</code> (or the equivalent
|
|
<code class="literal">rclcat</code>) is the <span class=
|
|
"emphasis"><em>only</em></span> field with an
|
|
<code class="literal">OR</code> default. You do need to
|
|
use <code class="literal">OR</code> with <code class=
|
|
"literal">ext</code> terms for example.</p>
|
|
</div>
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name="RCL.SEARCH.LANG.RANGES"
|
|
id="RCL.SEARCH.LANG.RANGES"></a>3.5.1. Range
|
|
clauses</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p><span class="application">Recoll</span> 1.24 and later
|
|
support range clauses on fields which have been
|
|
configured to support it. No default field uses them
|
|
currently, so this paragraph is only interesting if you
|
|
modified the fields configuration and possibly use a
|
|
custom input handler.</p>
|
|
<p>A range clause looks like one of the following:</p>
|
|
<pre class="programlisting"><em class=
|
|
"replaceable"><code>myfield</code></em>:<em class=
|
|
"replaceable"><code>small</code></em>..<em class=
|
|
"replaceable"><code>big</code></em>
|
|
<em class="replaceable"><code>myfield</code></em>:<em class=
|
|
"replaceable"><code>small</code></em>..
|
|
<em class="replaceable"><code>myfield</code></em>:..<em class=
|
|
"replaceable"><code>big</code></em>
|
|
</pre>
|
|
<p>The nature of the clause is indicated by the two dots
|
|
<code class="literal">..</code>, and the effect is to
|
|
filter the results for which the <em class=
|
|
"replaceable"><code>myfield</code></em> value is in the
|
|
possibly open-ended interval.</p>
|
|
<p>See the section about the <a class="link" href=
|
|
"#RCL.INSTALL.CONFIG.FIELDS" title=
|
|
"5.4.3. The fields file"><code class=
|
|
"filename">fields</code> configuration file</a> for the
|
|
details of configuring a field for range searches (list
|
|
them in the [values] section).</p>
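<p>The three forms of a range clause can be told apart
mechanically. The following Python sketch splits a clause into its
field name and bounds, using <code class="literal">None</code> for
an open end. <code class="literal">parse_range</code> is a
hypothetical name used for illustration only, and the sketch makes
no assumption about how the bounds are compared (this depends on
the field configuration).</p>

```python
def parse_range(clause):
    """Split 'field:low..high' into (field, low, high).

    An omitted bound is returned as None, which stands for the
    open end of the interval.
    """
    field, colon, bounds = clause.partition(":")
    if not colon or ".." not in bounds:
        raise ValueError(f"not a range clause: {clause!r}")
    low, _, high = bounds.partition("..")
    return field, (low or None), (high or None)


for clause in ("myfield:small..big", "myfield:small..", "myfield:..big"):
    print(parse_range(clause))
```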
|
|
</div>
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.SEARCH.LANG.MODIFIERS" id=
|
|
"RCL.SEARCH.LANG.MODIFIERS"></a>3.5.2. Modifiers</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>Some characters are recognized as search modifiers
|
|
when found immediately after the closing double quote of
|
|
a phrase, as in <code class="literal">"some
|
|
term"modifierchars</code>. The actual "phrase" can be a
|
|
single term of course. Supported modifiers:</p>
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style=
|
|
"list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p><code class="literal">l</code> can be used to
|
|
turn off stemming (mostly makes sense with
|
|
<code class="literal">p</code> because stemming is
|
|
off by default for phrases).</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><code class="literal">s</code> can be used to
|
|
turn off synonym expansion, if a synonyms file is
|
|
in place (only for <span class=
|
|
"application">Recoll</span> 1.22 and later).</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><code class="literal">o</code> can be used to
|
|
specify a "slack" for phrase and proximity
|
|
searches: the number of additional terms that may
|
|
be found between the specified ones. If
|
|
<code class="literal">o</code> is followed by an
|
|
integer number, this is the slack, else the default
|
|
is 10.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><code class="literal">p</code> can be used to
|
|
turn the default phrase search into a proximity one
|
|
(unordered). Example: <code class="literal">"order
|
|
any in"p</code></p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><code class="literal">C</code> will turn on case
|
|
sensitivity (if the index supports it).</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><code class="literal">D</code> will turn on
|
|
diacritics sensitivity (if the index supports
|
|
it).</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>A weight can be specified for a query element by
|
|
specifying a decimal value at the start of the
|
|
modifiers. Example: <code class=
|
|
"literal">"Important"2.5</code>.</p>
|
|
</li>
|
|
</ul>
|
|
</div>
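<p>When queries are generated by a script, the modifier string can
be assembled from options. The Python sketch below does this for
the modifiers listed above. The function name and its keyword
arguments are hypothetical, and the relative order of the letters
is an assumption: the text only states that a weight comes at the
start of the modifiers.</p>

```python
def with_modifiers(phrase, *, weight=None, no_stem=False, no_synonyms=False,
                   proximity=False, slack=None, case=False, diacritics=False):
    """Return the quoted phrase followed by its modifier characters."""
    if '"' in phrase:
        raise ValueError("the phrase itself must not contain double quotes")
    mods = ""
    if weight is not None:
        mods += str(weight)          # the weight goes first
    if no_stem:
        mods += "l"                  # turn off stemming
    if no_synonyms:
        mods += "s"                  # turn off synonym expansion
    if proximity:
        mods += "p"                  # unordered proximity search
    if case:
        mods += "C"                  # case sensitivity
    if diacritics:
        mods += "D"                  # diacritics sensitivity
    if slack is not None:
        mods += f"o{slack}"          # explicit slack value
    return f'"{phrase}"{mods}'


print(with_modifiers("order any in", proximity=True))
print(with_modifiers("Important", weight=2.5))
```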
|
|
</div>
|
|
</div>
|
|
<div class="sect1">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h2 class="title" style="clear: both"><a name=
|
|
"RCL.SEARCH.ANCHORWILD" id=
|
|
"RCL.SEARCH.ANCHORWILD"></a>3.6. Anchored
|
|
searches and wildcards</h2>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>Some special characters are interpreted by <span class=
|
|
"application">Recoll</span> in search strings to expand or
|
|
specialize the search. Wildcards expand a root term in
|
|
controlled ways. Anchor characters can restrict a search to
|
|
succeed only if the match is found at or near the beginning
|
|
of the document or one of its fields.</p>
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name="RCL.SEARCH.WILDCARDS"
|
|
id="RCL.SEARCH.WILDCARDS"></a>3.6.1. More
|
|
about wildcards</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>All words entered in <span class=
|
|
"application">Recoll</span> search fields will be
|
|
processed for wildcard expansion before the request is
|
|
finally executed.</p>
|
|
<p>The wildcard characters are:</p>
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style=
|
|
"list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p><code class="literal">*</code> which matches 0
|
|
or more characters.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><code class="literal">?</code> which matches a
|
|
single character.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><code class="literal">[]</code> which allows
defining sets of characters to be matched (e.g.
<code class="literal">[</code><strong class=
"userinput"><code>abc</code></strong><code class=
"literal">]</code> matches a single character which
may be 'a' or 'b' or 'c', and <code class=
"literal">[</code><strong class=
"userinput"><code>0-9</code></strong><code class=
"literal">]</code> matches any single digit).</p>
|
|
</li>
|
|
</ul>
|
|
</div>
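<p>Python's standard <code class="literal">fnmatch</code> module
uses the same three kinds of wildcards, so it can serve as a rough
model of how a pattern expands against a list of index terms. This
is only an analogy: the term list is invented, and
<span class="application">Recoll</span> does not use this module.</p>

```python
from fnmatch import fnmatchcase

# A made-up list standing in for the index term list.
TERMS = ["recoll", "record", "recall", "index", "index2", "rxcoll"]


def expand(pattern, terms=TERMS):
    """Return the terms matched by a wildcard pattern, in list order."""
    return [term for term in terms if fnmatchcase(term, pattern)]


print(expand("rec*"))        # '*' matches 0 or more characters
print(expand("rec?ll"))      # '?' matches exactly one character
print(expand("index[0-9]"))  # '[]' matches one character from a set
print(expand("index*"))      # includes 'index' itself: '*' may match nothing
```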
|
|
<p>You should be aware of a few things when using
|
|
wildcards.</p>
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style=
|
|
"list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p>Using a wildcard character at the beginning of a
|
|
word can make for a slow search because
|
|
<span class="application">Recoll</span> will have
|
|
to scan the whole index term list to find the
|
|
matches. However, this is much less a problem for
|
|
field searches, and queries like <em class=
|
|
"replaceable"><code>author:*@domain.com</code></em>
|
|
can sometimes be very useful.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>For <span class="application">Recoll</span>
version 1.18 only, when working with a raw index
(preserving character case and diacritics), the
literal part of a wildcard expression will be
matched exactly for case and diacritics. This is
no longer true for versions 1.19 and later.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>Using a <code class="literal">*</code> at the
|
|
end of a word can produce more matches than you
|
|
would think, and strange search results. You can
|
|
use the <a class="link" href=
|
|
"#RCL.SEARCH.GUI.TERMEXPLORER" title=
|
|
"3.2.9. The term explorer tool">term
|
|
explorer</a> tool to check what completions exist
|
|
for a given term. You can also see exactly what
|
|
search was performed by clicking on the link at the
|
|
top of the result list. In general, for natural
|
|
language terms, stem expansion will produce better
|
|
results than an ending <code class=
|
|
"literal">*</code> (stem expansion is turned off
|
|
when any wildcard character appears in the
|
|
term).</p>
|
|
</li>
|
|
</ul>
|
|
</div>
|
|
<div class="sect3">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h4 class="title"><a name=
|
|
"RCL.SEARCH.WILDCARDS.PATH" id=
|
|
"RCL.SEARCH.WILDCARDS.PATH"></a>Wildcards and
|
|
path filtering</h4>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>Due to the way that <span class=
|
|
"application">Recoll</span> processes wildcards inside
|
|
<code class="literal">dir</code> path filtering
|
|
clauses, they will have a multiplicative effect on the
|
|
query size. A clause containing wildcards in several
|
|
path elements, such as <code class=
|
|
"literal">dir:</code><em class=
|
|
"replaceable"><code>/home/me/*/*/docdir</code></em>,
|
|
will almost certainly fail if your indexed tree is of
|
|
any realistic size.</p>
|
|
<p>Depending on the case, you may be able to work
|
|
around the issue by specifying the path elements more
|
|
narrowly, with a constant prefix, or by using 2
|
|
separate <code class="literal">dir:</code> clauses
|
|
instead of multiple wildcards, as in <code class=
|
|
"literal">dir:</code><em class=
|
|
"replaceable"><code>/home/me</code></em> <code class=
|
|
"literal">dir:</code><em class=
|
|
"replaceable"><code>docdir</code></em>. The latter
|
|
query is not equivalent to the initial one because it
|
|
does not specify a number of directory levels, but
|
|
that's the best we can do (and it may be actually more
|
|
useful in some cases).</p>
|
|
</div>
|
|
</div>
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name="RCL.SEARCH.ANCHOR" id=
|
|
"RCL.SEARCH.ANCHOR"></a>3.6.2. Anchored
|
|
searches</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>Two characters are used to specify that a search hit
|
|
should occur at the beginning or at the end of the text.
|
|
<code class="literal">^</code> at the beginning of a term
or phrase constrains the match to the start of the text;
<code class="literal">$</code> at the end constrains it to
the end.</p>
|
|
<p>As this function is implemented as a phrase search it
|
|
is possible to specify a maximum distance at which the
|
|
hit should occur, either through the controls of the
|
|
advanced search panel, or using the query language, for
|
|
example, as in:</p>
|
|
<pre class="programlisting">"^someterm"o10</pre>
|
|
<p>which would force <code class=
|
|
"literal">someterm</code> to be found within 10 terms of
|
|
the start of the text. This can be combined with a field
|
|
search as in <code class=
|
|
"literal">somefield:"^someterm"o10</code> or <code class=
|
|
"literal">somefield:someterm$</code>.</p>
|
|
<p>This feature can also be used with an actual phrase
|
|
search, but in this case, the distance applies to the
|
|
whole phrase and anchor, so that, for example,
|
|
<code class="literal">bla bla my unexpected term</code>
|
|
at the beginning of the text would be a match for
|
|
<code class="literal">"^my term"o5</code>.</p>
|
|
<p>Anchored searches can be very useful for searches
|
|
inside somewhat structured documents like scientific
|
|
articles, in case explicit metadata has not been supplied
|
|
(a most frequent case), for example for looking for
|
|
matches inside the abstract or the list of authors (which
|
|
occur at the top of the document).</p>
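<p>The way the distance applies to the whole phrase plus the
anchor can be modelled in a few lines of Python. This is a
simplified sketch and not the actual implementation: it splits the
text on white space, treats the start anchor as a virtual position
before the first word, and counts the words which are not part of
the phrase. The exact boundary behaviour of the real search may
differ by one.</p>

```python
def anchored_start_match(text, phrase_terms, slack):
    """Model of a '^phrase'oN search: ordered terms near the text start."""
    words = text.lower().split()
    position = -1  # virtual anchor, just before the first word
    for term in phrase_terms:
        try:
            # Earliest occurrence after the previous term keeps the span minimal.
            position = words.index(term.lower(), position + 1)
        except ValueError:
            return False  # a term is missing, or not in the right order
    # Words covered from the anchor to the last term, minus the phrase itself.
    extra_terms = (position + 1) - len(phrase_terms)
    return extra_terms <= slack


text = "bla bla my unexpected term"
print(anchored_start_match(text, ["my", "term"], 5))  # 3 extra words
print(anchored_start_match(text, ["my", "term"], 2))
```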
|
|
</div>
|
|
</div>
|
|
<div class="sect1">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h2 class="title" style="clear: both"><a name=
|
|
"RCL.SEARCH.SYNONYMS" id=
|
|
"RCL.SEARCH.SYNONYMS"></a>3.7. Using Synonyms
|
|
(1.22)</h2>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p><b>Term synonyms and text search: </b>in general,
|
|
there are two main ways to use term synonyms for searching
|
|
text:</p>
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style="list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p>At index creation time, they can be used to alter
|
|
the indexed terms, either increasing or decreasing
|
|
their number, by expanding the original terms to all
|
|
synonyms, or by reducing all synonym terms to a
|
|
canonical one.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>At query time, they can be used to match texts
|
|
containing terms which are synonyms of the ones
|
|
specified by the user, either by expanding the query
|
|
for all synonyms, or by reducing the user entry to
|
|
canonical terms (the latter only works if the
|
|
corresponding processing has been performed while
|
|
creating the index).</p>
|
|
</li>
|
|
</ul>
|
|
</div>
|
|
<p><span class="application">Recoll</span> only uses
|
|
synonyms at query time. A user query term which is part of a
|
|
synonym group will be optionally expanded into an
|
|
<code class="literal">OR</code> query for all terms in the
|
|
group.</p>
|
|
<p>Synonym groups are defined inside ordinary text files.
|
|
Each line in the file defines a group.</p>
|
|
<p>Example:</p>
|
|
<pre class="programlisting">
|
|
hi hello "good morning"
|
|
|
|
# not sure about "au revoir" though. Is this english ?
|
|
bye goodbye "see you" \
|
|
"au revoir"
|
|
</pre>
|
|
<p>As usual, lines beginning with a <code class=
|
|
"literal">#</code> are comments, empty lines are ignored,
|
|
and lines can be continued by ending them with a
|
|
backslash.</p>
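<p>To make the file format and the query-time expansion concrete,
here is a small Python sketch which reads the example above and
expands one term into an <code class="literal">OR</code> query. The
function names are hypothetical, and the parsing is a simplified
reading of the rules just described, not the code used by
<span class="application">Recoll</span>.</p>

```python
import re

# A token is either a double-quoted multi-word synonym or a bare word.
_TOKEN = re.compile(r'"([^"]*)"|(\S+)')

SAMPLE = '''hi hello "good morning"

# not sure about "au revoir" though. Is this english ?
bye goodbye "see you" \\
"au revoir"
'''


def parse_synonyms(text):
    """Return the synonym groups, one list of terms per logical line."""
    groups, pending = [], ""
    for raw in text.splitlines():
        line = raw.strip()
        if not pending and (not line or line.startswith("#")):
            continue  # skip empty lines and comments
        if line.endswith("\\"):
            pending += line[:-1] + " "  # continued on the next line
            continue
        groups.append([q or w for q, w in _TOKEN.findall(pending + line)])
        pending = ""
    if pending.strip():  # file ended on a continuation line
        groups.append([q or w for q, w in _TOKEN.findall(pending)])
    return groups


def expand(term, groups):
    """Expand a term into an OR query over its synonym group, if any."""
    for group in groups:
        if term in group:
            return " OR ".join(f'"{t}"' if " " in t else t for t in group)
    return term


groups = parse_synonyms(SAMPLE)
print(expand("hello", groups))
```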
|
|
<p>Multi-word synonyms are supported, but be aware that
|
|
these will generate phrase queries, which may degrade
|
|
performance and will disable stemming expansion for the
|
|
phrase terms.</p>
|
|
<p>The contents of the synonyms file must be casefolded
|
|
(not only lowercased), because this is what is expected at the
|
|
point in the query processing where it is used. There are a
|
|
few cases where this makes a difference, for example,
|
|
German sharp s should be expressed as <code class=
|
|
"literal">ss</code>, Greek final sigma as sigma. For
|
|
reference, Python3 has an easy way to casefold words
|
|
(str.casefold()).</p>
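<p>For example, the following Python sketch shows the difference
between lowercasing and casefolding for the two cases mentioned
above, and casefolds the lines of a synonyms file while leaving
comment lines untouched. <code class="literal">casefold_line</code>
is a hypothetical helper name.</p>

```python
def casefold_line(line):
    """Casefold one line of a synonyms file, leaving comments as they are."""
    if line.lstrip().startswith("#"):
        return line
    return line.casefold()


# lower() keeps the sharp s and the final sigma, casefold() does not.
for word in ("Straße", "ς"):
    print(word, word.lower(), word.casefold())

print(casefold_line('Straße "Guten Morgen"'))
```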
|
|
<p>The synonyms file can be specified in the <span class=
|
|
"guilabel">Search parameters</span> tab of the <span class=
|
|
"guilabel">GUI configuration</span> <span class=
|
|
"guilabel">Preferences</span> menu entry, or as an option
|
|
for command-line searches.</p>
|
|
<p>Once the file is defined, the use of synonyms can be
|
|
enabled or disabled directly from the <span class=
|
|
"guilabel">Preferences</span> menu.</p>
|
|
<p>The synonyms are searched for matches with user terms
|
|
after the latter are stem-expanded, but the contents of the
|
|
synonyms file itself are not subjected to stem expansion.
|
|
This means that a match will not be found if the form
|
|
present in the synonyms file is not present anywhere in the
|
|
document set (same with accents when using a raw
|
|
index).</p>
|
|
<p>The synonyms function is probably not going to help you
|
|
find your letters to Mr. Smith. It is best used for
|
|
domain-specific searches. For example, it was initially
|
|
suggested by a user performing searches among historical
|
|
documents: the synonyms file would contain nicknames and
|
|
aliases for each of the persons of interest.</p>
|
|
</div>
|
|
<div class="sect1">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h2 class="title" style="clear: both"><a name=
|
|
"RCL.SEARCH.PTRANS" id=
|
|
"RCL.SEARCH.PTRANS"></a>3.8. Path
|
|
translations</h2>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>In some cases, the document paths stored inside the
|
|
index do not match the actual ones, so that document
|
|
previews and accesses will fail. This can occur in a number
|
|
of circumstances:</p>
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style="list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p>When using multiple indexes it is a relatively
|
|
common occurrence that some will actually reside on a
|
|
remote volume, for example mounted via NFS. In this
|
|
case, the paths used to access the documents on the
|
|
local machine are not necessarily the same as the
|
|
ones used while indexing on the remote machine. For
|
|
example, <code class="filename">/home/me</code> may
|
|
have been used as a <code class=
"literal">topdirs</code> element while indexing, but
|
|
the directory might be mounted as <code class=
|
|
"filename">/net/server/home/me</code> on the local
|
|
machine.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>The case may also occur with removable disks. It
|
|
is perfectly possible to configure an index to live
|
|
with the documents on the removable disk, but it may
|
|
happen that the disk is not mounted at the same place
|
|
so that the document paths from the index are
|
|
invalid.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>As a last example, one could imagine that a big
|
|
directory has been moved, but that it is currently
|
|
inconvenient to run the indexer.</p>
|
|
</li>
|
|
</ul>
|
|
</div>
|
|
<p><span class="application">Recoll</span> has a facility
|
|
for rewriting access paths when extracting the data from
|
|
the index. The translations can be defined for the main
|
|
index and for any additional query index.</p>
|
|
<p>The path translation facility will be useful whenever
|
|
the document paths seen by the indexer are not the same as
|
|
the ones which should be used at query time.</p>
|
|
<p>In the above NFS example, <span class=
|
|
"application">Recoll</span> could be instructed to rewrite
|
|
any <code class="filename">file:///home/me</code> URL from
|
|
the index to <code class=
|
|
"filename">file:///net/server/home/me</code>, allowing
|
|
accesses from the client.</p>
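<p>The effect of such a translation can be pictured as a prefix
substitution on the stored URL. The Python sketch below illustrates
the idea with the NFS example. It is not the
<span class="application">Recoll</span> implementation: the
function name and the dictionary format are assumptions, and it
does not read the actual configuration file.</p>

```python
def translate_url(url, translations):
    """Rewrite the path prefix of a file:// URL using a {old: new} mapping."""
    scheme = "file://"
    if not url.startswith(scheme):
        return url  # only local file URLs are handled in this sketch
    path = url[len(scheme):]
    # Try the longest prefixes first so the most specific entry wins.
    for old in sorted(translations, key=len, reverse=True):
        old_prefix = old.rstrip("/")
        new_prefix = translations[old].rstrip("/")
        # Match whole path elements only: /home/me must not match /home/meother.
        if path == old_prefix or path.startswith(old_prefix + "/"):
            return scheme + new_prefix + path[len(old_prefix):]
    return url


mapping = {"/home/me": "/net/server/home/me"}
print(translate_url("file:///home/me/doc.txt", mapping))
```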
|
|
<p>The translations are defined in the <a class="link"
|
|
href="#RCL.INSTALL.CONFIG.PTRANS" title=
|
|
"5.4.7. The ptrans file"><code class=
|
|
"filename">ptrans</code></a> configuration file, which can
|
|
be edited by hand or from the GUI external indexes
|
|
configuration dialog: <span class=
|
|
"guimenu">Preferences</span> → <span class=
|
|
"guimenuitem">External index dialog</span>, then click the
|
|
<span class="guilabel">Paths translations</span> button on
|
|
the right below the index list.</p>
|
|
<div class="note" style=
|
|
"margin-left: 0.5in; margin-right: 0.5in;">
|
|
<h3 class="title">Note</h3>
|
|
<p>Due to a current bug, the GUI must be restarted after
|
|
changing the <code class="filename">ptrans</code> values
|
|
(even when they were changed from the GUI).</p>
|
|
</div>
|
|
</div>
|
|
<div class="sect1">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h2 class="title" style="clear: both"><a name=
|
|
"RCL.SEARCH.CASEDIAC" id=
|
|
"RCL.SEARCH.CASEDIAC"></a>3.9. Search case and
|
|
diacritics sensitivity</h2>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>For <span class="application">Recoll</span> versions
|
|
1.18 and later, and <span class="emphasis"><em>when working
|
|
with a raw index</em></span> (not the default), searches
|
|
can be sensitive to character case and diacritics. How this
|
|
happens is controlled by configuration variables and what
|
|
search data is entered.</p>
|
|
<p>The general default is that searches entered without
|
|
upper-case or accented characters are insensitive to case
|
|
and diacritics. An entry of <code class=
|
|
"literal">resume</code> will match any of <code class=
|
|
"literal">Resume</code>, <code class=
|
|
"literal">RESUME</code>, <code class=
|
|
"literal">résumé</code>, <code class=
|
|
"literal">Résumé</code> etc.</p>
|
|
<p>Two configuration variables can automate switching on
|
|
sensitivity (they were documented but actually did nothing
|
|
until <span class="application">Recoll</span> 1.22):</p>
|
|
<div class="variablelist">
|
|
<dl class="variablelist">
|
|
<dt><span class="term">autodiacsens</span></dt>
|
|
<dd>
|
|
<p>If this is set, search sensitivity to diacritics
|
|
will be turned on as soon as an accented character
|
|
exists in a search term. When the variable is set to
|
|
true, <code class="literal">resume</code> will start
|
|
a diacritics-insensitive search, but <code class=
|
|
"literal">résumé</code> will be matched exactly. The
|
|
default value is <span class=
|
|
"emphasis"><em>false</em></span>.</p>
|
|
</dd>
|
|
<dt><span class="term">autocasesens</span></dt>
|
|
<dd>
|
|
<p>If this is set, search sensitivity to character
|
|
case will be turned on as soon as an upper-case
|
|
character exists in a search term <span class=
|
|
"emphasis"><em>except for the first one</em></span>.
|
|
When the variable is set to true, <code class=
|
|
"literal">us</code> or <code class=
|
|
"literal">Us</code> will start a
|
|
case-insensitive search, but <code class=
|
|
"literal">US</code> will be matched exactly. The
|
|
default value is <span class=
|
|
"emphasis"><em>true</em></span> (contrary to
|
|
<code class="literal">autodiacsens</code>).</p>
|
|
</dd>
|
|
</dl>
|
|
</div>
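<p>The decision rules for the two variables can be summarised by
the following Python sketch. It is a model of the behaviour
described above, not code taken from
<span class="application">Recoll</span>; in particular, detecting
an accented character through Unicode decomposition is an
assumption made for the illustration.</p>

```python
import unicodedata


def has_diacritics(term):
    """True if any character decomposes into a base letter plus a mark."""
    decomposed = unicodedata.normalize("NFD", term)
    return any(unicodedata.combining(ch) for ch in decomposed)


def auto_sensitivity(term, autocasesens=True, autodiacsens=False):
    """Return (case_sensitive, diacritics_sensitive) for a search term.

    The defaults mirror the documented default values of the variables.
    """
    # The first character is ignored: it only turns off stem expansion.
    case_sensitive = autocasesens and any(ch.isupper() for ch in term[1:])
    diac_sensitive = autodiacsens and has_diacritics(term)
    return case_sensitive, diac_sensitive


for term in ("us", "Us", "US"):
    print(term, auto_sensitivity(term))
print("résumé", auto_sensitivity("résumé", autodiacsens=True))
```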
|
|
<p>As in the past, capitalizing the first letter of a word
|
|
will turn off its stem expansion and have no effect on
|
|
case-sensitivity.</p>
|
|
<p>You can also explicitly activate case and diacritics
|
|
sensitivity by using modifiers with the query language.
|
|
<code class="literal">C</code> will make the term
|
|
case-sensitive, and <code class="literal">D</code> will
|
|
make it diacritics-sensitive. Examples:</p>
|
|
<pre class="programlisting">
|
|
"us"C
|
|
</pre>
|
|
<p>will search for the term <code class="literal">us</code>
|
|
exactly (<code class="literal">Us</code> will not be a
|
|
match).</p>
|
|
<pre class="programlisting">
|
|
"resume"D
|
|
</pre>
|
|
<p>will search for the term <code class=
|
|
"literal">resume</code> exactly (<code class=
|
|
"literal">résumé</code> will not be a match).</p>
|
|
<p>When either case or diacritics sensitivity is activated,
|
|
stem expansion is turned off. Having both does not make
|
|
much sense.</p>
|
|
</div>
|
|
<div class="sect1">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h2 class="title" style="clear: both"><a name=
|
|
"RCL.SEARCH.DESKTOP" id=
|
|
"RCL.SEARCH.DESKTOP"></a>3.10. Desktop
|
|
integration</h2>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>Being independent of the desktop type has its drawbacks:
|
|
<span class="application">Recoll</span> desktop integration
|
|
is minimal. However there are a few tools available:</p>
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style="list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p>Users of recent Ubuntu-derived distributions, or
|
|
any other Gnome desktop systems (e.g. Fedora) can
|
|
install the <a class="ulink" href=
|
|
"https://www.lesbonscomptes.com/recoll/pages/download.html#gssp"
|
|
target="_top">Recoll GSSP</a> (Gnome Shell Search
|
|
Provider).</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>The <span class="application">KDE</span> KIO Slave
|
|
was described in a <a class="link" href=
|
|
"#RCL.SEARCH.KIO" title=
|
|
"3.3. Searching with the KDE KIO slave">previous
|
|
section</a>. It can provide search results inside
|
|
<span class=
|
|
"command"><strong>Dolphin</strong></span>.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>If you use an oldish version of Ubuntu Linux, you
|
|
may find the <a class="ulink" href=
|
|
"https://www.lesbonscomptes.com/recoll/faqsandhowtos/UnityLens"
|
|
target="_top">Ubuntu Unity Lens</a> module
|
|
useful.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>There is also an independently developed <a class=
|
|
"ulink" href=
|
|
"http://kde-apps.org/content/show.php/recollrunner?content=128203"
|
|
target="_top">Krunner plugin</a>.</p>
|
|
</li>
|
|
</ul>
|
|
</div>
|
|
<p>Here follow a few other things that may help.</p>
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name="RCL.SEARCH.SHORTCUT" id=
|
|
"RCL.SEARCH.SHORTCUT"></a>3.10.1. Hotkeying
|
|
recoll</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>It is surprisingly convenient to be able to show or
|
|
hide the <span class="application">Recoll</span> GUI with
|
|
a single keystroke. Recoll comes with a small Python
|
|
script, based on the <span class=
|
|
"application">libwnck</span> window manager interface
|
|
library, which will allow you to do just this. The
|
|
detailed instructions are on <a class="ulink" href=
|
|
"https://www.lesbonscomptes.com/recoll/faqsandhowtos/HotRecoll"
|
|
target="_top">this wiki page</a>.</p>
|
|
</div>
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name="RCL.KICKER-APPLET" id=
|
|
"RCL.KICKER-APPLET"></a>3.10.2. The KDE Kicker
|
|
Recoll applet</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>This is probably obsolete now. Anyway:</p>
|
|
<p>The <span class="application">Recoll</span> source
|
|
tree contains the source code to the <span class=
|
|
"application">recoll_applet</span>, a small application
|
|
derived from the <span class=
|
|
"application">find_applet</span>. This can be used to add
|
|
a small <span class="application">Recoll</span> launcher
|
|
to the KDE panel.</p>
|
|
<p>The applet is not automatically built with the main
|
|
<span class="application">Recoll</span> programs, nor is
|
|
it included with the main source distribution (because
|
|
the KDE build boilerplate makes it relatively big). You
|
|
can download its source from the recoll.org download
|
|
page. Use the omnipotent <strong class=
|
|
"userinput"><code>configure;make;make
|
|
install</code></strong> incantation to build and
|
|
install.</p>
|
|
<p>You can then add the applet to the panel by
|
|
right-clicking the panel and choosing the <span class=
|
|
"guilabel">Add applet</span> entry.</p>
|
|
<p>The <span class="application">recoll_applet</span> has
|
|
a small text window where you can type a <span class=
|
|
"application">Recoll</span> query (in query language
|
|
form), and an icon which can be used to restrict the
|
|
search to certain types of files. It is quite primitive,
|
|
and launches a new recoll GUI instance every time (even
|
|
if it is already running). You may find it useful
|
|
anyway.</p>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<div class="chapter">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h1 class="title"><a name="RCL.PROGRAM" id=
|
|
"RCL.PROGRAM"></a>Chapter 4. Programming
|
|
interface</h1>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p><span class="application">Recoll</span> has an Application
|
|
Programming Interface, usable both for indexing and
|
|
searching, currently accessible from the <span class=
|
|
"application">Python</span> language.</p>
|
|
<p>Another less radical way to extend the application is to
|
|
write input handlers for new types of documents.</p>
|
|
<p>The processing of metadata attributes for documents
|
|
(<code class="literal">fields</code>) is highly
|
|
configurable.</p>
|
|
<div class="sect1">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h2 class="title" style="clear: both"><a name=
|
|
"RCL.PROGRAM.FILTERS" id=
|
|
"RCL.PROGRAM.FILTERS"></a>4.1. Writing a
|
|
document input handler</h2>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<div class="note" style=
|
|
"margin-left: 0.5in; margin-right: 0.5in;">
|
|
<h3 class="title">Terminology</h3>
|
|
<p>The small programs or pieces of code which handle the
|
|
processing of the different document types for
|
|
<span class="application">Recoll</span> used to be called
|
|
<code class="literal">filters</code>, which is still
|
|
reflected in the name of the directory which holds them
|
|
and many configuration variables. They were named this
|
|
way because one of their primary functions is to filter
|
|
out the formatting directives and keep the text content.
|
|
However these modules may have other behaviours, and the
|
|
term <code class="literal">input handler</code> is now
|
|
progressively substituted in the documentation.
|
|
<code class="literal">filter</code> is still used in many
|
|
places though.</p>
|
|
</div>
|
|
<p><span class="application">Recoll</span> input handlers
|
|
cooperate to translate from the multitude of input document
|
|
formats, simple ones such as <span class=
|
|
"application">opendocument</span>, <span class=
|
|
"application">acrobat</span>, or compound ones such as
|
|
<span class="application">Zip</span> or <span class=
|
|
"application">Email</span>, into the final <span class=
|
|
"application">Recoll</span> indexing input format, which is
|
|
plain text (in many cases the processing pipeline has an
|
|
intermediary HTML step, which may be used for better
|
|
previewing presentation). Most input handlers are
|
|
executable programs or scripts. A few handlers are coded in
|
|
C++ and live inside <span class=
|
|
"command"><strong>recollindex</strong></span>. This latter
|
|
kind will not be described here.</p>
|
|
<p>There are two kinds of external executable input
|
|
handlers:</p>
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style="list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p>Simple <code class="literal">exec</code> handlers
|
|
run once and exit. They can be bare programs like
|
|
<span class=
|
|
"command"><strong>antiword</strong></span>, or
|
|
scripts using other programs. They are very simple to
|
|
write, because they just need to print the converted
|
|
document to the standard output. Their output can be
|
|
plain text or HTML. HTML is usually preferred because
|
|
it can store metadata fields and it allows preserving
|
|
some of the formatting for the GUI preview. However,
|
|
these handlers have limitations:</p>
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style=
|
|
"list-style-type: circle;">
|
|
<li class="listitem">
|
|
<p>They can only process one document per
|
|
file.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>The output MIME type must be known and
|
|
fixed.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>The character encoding, if relevant, must be
|
|
known and fixed (or possibly just depending on
|
|
location).</p>
|
|
</li>
|
|
</ul>
|
|
</div>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>Multiple <code class="literal">execm</code>
|
|
handlers can process multiple files (sparing the
|
|
process startup time which can be very significant),
|
|
or multiple documents per file (e.g.: for archives or
|
|
multi-chapter publications). They communicate with
|
|
the indexer through a simple protocol, but are
|
|
nevertheless a bit more complicated than the older
|
|
kind. Most of the new handlers are written in
|
|
<span class="application">Python</span> (exception:
|
|
<span class="command"><strong>rclimg</strong></span>
|
|
which is written in Perl because <code class=
|
|
"literal">exiftool</code> has no real Python
|
|
equivalent). The Python handlers use common modules
|
|
to factor out the boilerplate, which can make them
|
|
very simple in favorable cases. The subdocuments
|
|
output by these handlers can be directly indexable
|
|
(text or HTML), or they can be other simple or
|
|
compound documents that will need to be processed by
|
|
another handler.</p>
|
|
</li>
|
|
</ul>
|
|
</div>
|
|
<p>In both cases, handlers deal with regular file system
|
|
files, and can process either a single document, or a
|
|
linear list of documents in each file. <span class=
|
|
"application">Recoll</span> is responsible for performing
|
|
up-to-date checks, dealing with more complex embedding and
other upper-level issues.</p>
|
|
<p>A simple handler returning a document in <code class=
"literal">text/plain</code> format can transfer no
|
|
metadata to the indexer. Generic metadata, like document
|
|
size or modification date, will be gathered and stored by
|
|
the indexer.</p>
|
|
<p>Handlers that produce <code class=
|
|
"literal">text/html</code> format can return an arbitrary
|
|
amount of metadata inside HTML <code class=
|
|
"literal">meta</code> tags. These will be processed
|
|
according to the directives found in the <a class="link"
|
|
href="#RCL.PROGRAM.FIELDS" title=
|
|
"4.2. Field data processing"><code class=
|
|
"filename">fields</code> configuration file</a>.</p>
|
|
<p>The handlers that can handle multiple documents per file
|
|
return a single piece of data to identify each document
|
|
inside the file. This piece of data, called an <code class=
"literal">ipath</code>, will be sent back by <span class=
|
|
"application">Recoll</span> to extract the document at
|
|
query time, for previewing, or for creating a temporary
|
|
file to be opened by a viewer. These handlers can also
|
|
return metadata either as HTML <code class=
|
|
"literal">meta</code> tags, or as named data through the
|
|
communication protocol.</p>
|
|
<p>The following section describes the simple handlers, and
|
|
the next one gives a few explanations about the
|
|
<code class="literal">execm</code> ones. You could
|
|
conceivably write a simple handler with only the elements
|
|
in the manual. This will not be the case for the other
|
|
ones, for which you will have to look at the code.</p>
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.PROGRAM.FILTERS.SIMPLE" id=
|
|
"RCL.PROGRAM.FILTERS.SIMPLE"></a>4.1.1. Simple
|
|
input handlers</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p><span class="application">Recoll</span> simple
|
|
handlers are usually shell-scripts, but this is in no way
|
|
necessary. Extracting the text from the native format is
|
|
the difficult part. Outputting the format expected by
|
|
<span class="application">Recoll</span> is trivial.
|
|
Happily enough, most document formats have translators or
|
|
text extractors which can be called from the handler. In
|
|
some cases the output of the translating program is
|
|
completely appropriate, and no intermediate shell-script
|
|
is needed.</p>
|
|
<p>Input handlers are called with a single argument which
|
|
is the source file name. They should output the result to
|
|
stdout.</p>
|
|
<p>When writing a handler, you should decide if it will
|
|
output plain text or HTML. Plain text is simpler, but you
|
|
will not be able to add metadata or vary the output
|
|
character encoding (this will be defined in a
|
|
configuration file). Additionally, some formatting may be
|
|
easier to preserve when previewing HTML. Actually the
|
|
deciding factor is metadata: <span class=
|
|
"application">Recoll</span> has a way to <a class="link"
|
|
href="#RCL.PROGRAM.FILTERS.HTML" title=
|
|
"4.1.4. Input handler output">extract metadata from
|
|
the HTML header and use it for field searches</a>.</p>
|
|
<p>The <code class=
|
|
"envar">RECOLL_FILTER_FORPREVIEW</code> environment
|
|
variable (values <code class="literal">yes</code>,
|
|
<code class="literal">no</code>) tells the handler if the
|
|
operation is for indexing or previewing. Some handlers
|
|
use this to output a slightly different format, for
|
|
example stripping uninteresting repeated keywords (e.g.
|
|
<code class="literal">Subject:</code> for email) when
|
|
indexing. This is not essential.</p>
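<p>As an illustration of the calling convention described above, here is a minimal sketch of a simple handler written in Python rather than as a shell script. It is not one of the handlers shipped with <span class="application">Recoll</span>: the input format (text files in which lines starting with <code class="literal">#</code> are comments) and the function names are assumptions made for the example. It outputs <code class="literal">text/plain</code> encoded in UTF-8, which would have to match the charset declared for the handler in the configuration.</p>

```python
#!/usr/bin/env python3
"""Sketch of a simple (exec) handler that outputs text/plain."""
import os
import sys


def for_preview(environ=None):
    """True when the handler is run for previewing, not indexing."""
    environ = os.environ if environ is None else environ
    return environ.get("RECOLL_FILTER_FORPREVIEW", "no") == "yes"


def convert_text(text, preview):
    """Drop '#' comment lines (hypothetical format) when indexing."""
    lines = text.splitlines()
    if not preview:
        lines = [ln for ln in lines if not ln.startswith("#")]
    return "\n".join(lines) + "\n"


if __name__ == "__main__" and len(sys.argv) == 2:
    # The single argument is the source file name.
    with open(sys.argv[1], encoding="utf-8", errors="replace") as f:
        result = convert_text(f.read(), for_preview())
    # Write UTF-8 bytes to stdout, independently of the locale.
    sys.stdout.buffer.write(result.encode("utf-8"))
```

<p>Keeping the conversion in a function separate from the command-line part makes it possible to test the handler logic without running the indexer.</p>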
|
|
<p>You should look at one of the simple handlers, for
|
|
example <span class=
|
|
"command"><strong>rclps</strong></span> for a starting
|
|
point.</p>
|
|
<p>Don't forget to make your handler executable before
|
|
testing!</p>
|
|
</div>
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.PROGRAM.FILTERS.MULTIPLE" id=
|
|
"RCL.PROGRAM.FILTERS.MULTIPLE"></a>4.1.2. "Multiple"
|
|
handlers</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>If you can program and want to write an <code class=
|
|
"literal">execm</code> handler, it should not be too
|
|
difficult to make sense of one of the existing
|
|
handlers.</p>
|
|
<p>The existing handlers differ in the amount of helper
|
|
code which they are using:</p>
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style=
|
|
"list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p><code class="literal">rclimg</code> is written
|
|
in Perl and handles the execm protocol all by
|
|
itself (showing how trivial it is).</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>All the Python handlers share at least the
|
|
<code class="filename">rclexecm.py</code> module,
|
|
which handles the communication. Have a look at,
|
|
for example, <code class="filename">rclzip</code>
|
|
for a handler which uses <code class=
|
|
"filename">rclexecm.py</code> directly.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>Most Python handlers which process
|
|
single-document files by executing another command
|
|
are further abstracted by using the <code class=
|
|
"filename">rclexec1.py</code> module. See for
|
|
example <code class="filename">rclrtf.py</code> for
|
|
a simple one, or <code class=
|
|
"filename">rcldoc.py</code> for a slightly more
|
|
complicated one (possibly executing several
|
|
commands).</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>Handlers which extract text from an XML document
|
|
by using an XSLT style sheet are now executed
|
|
inside <span class=
|
|
"command"><strong>recollindex</strong></span>, with
|
|
only the style sheet stored in the <code class=
|
|
"filename">filters/</code> directory. These can use
|
|
a single style sheet (e.g. <code class=
|
|
"filename">abiword.xsl</code>), or two sheets for
|
|
the data and metadata (e.g. <code class=
|
|
"filename">opendoc-body.xsl</code> and <code class=
|
|
"filename">opendoc-meta.xsl</code>). The
|
|
<code class="filename">mimeconf</code>
|
|
configuration file defines how the sheets are used;
have a look. Before the C++ import, the XSL-based
handlers used a common module, <code class=
"filename">rclgenxslt.py</code>, which is still around
but unused at the moment. The handler for OpenXML
|
|
presentations is still the Python version because
|
|
the format did not fit with what the C++ code does.
|
|
It would be a good base for another similar
|
|
issue.</p>
|
|
</li>
|
|
</ul>
|
|
</div>
|
|
<p>There is a sample trivial handler based on
|
|
<code class="filename">rclexecm.py</code>, with many
|
|
comments, not actually used by <span class=
|
|
"application">Recoll</span>. It would index a text file
|
|
as one document per line. Look for <code class=
|
|
"filename">rcltxtlines.py</code> in the <code class=
|
|
"filename">src/filters</code> directory in the online
|
|
<span class="application">Recoll</span> <a class="ulink"
|
|
href="https://framagit.org/medoc92/recoll" target=
"_top">Git repository</a> (the sample is not in the
|
|
distributed release at the moment).</p>
|
|
<p>You can also have a look at the slightly more complex
|
|
<span class="command"><strong>rclzip</strong></span>
|
|
which uses Zip file paths as identifiers (<code class=
|
|
"literal">ipath</code>).</p>
|
|
<p><code class="literal">execm</code> handlers sometimes
|
|
need to make a choice for the nature of the <code class=
|
|
"literal">ipath</code> elements that they use in
|
|
communication with the indexer. Here are a few
|
|
guidelines:</p>
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style=
|
|
"list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p>Use ASCII or UTF-8 (if the identifier is an
|
|
integer print it, for example, like printf %d would
|
|
do).</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>If at all possible, the data should make some
|
|
kind of sense when printed to a log file to help
|
|
with debugging.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><span class="application">Recoll</span> uses a
|
|
colon (<code class="literal">:</code>) as a
|
|
separator to store a complex path internally (for
|
|
deeper embedding). Colons inside the <code class=
|
|
"literal">ipath</code> elements output by a handler
|
|
will be escaped, but would be a bad choice as a
|
|
handler-specific separator (mostly, again, for
|
|
debugging issues).</p>
|
|
</li>
|
|
</ul>
|
|
</div>
|
|
<p>In any case, the main goal is that it should be easy
|
|
for the handler to extract the target document, given the
|
|
file name and the <code class="literal">ipath</code>
|
|
element.</p>
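<p>The guidelines above can be illustrated with a small sketch for a file indexed as one document per line. It does not use <code class="filename">rclexecm.py</code> and is not the code of any existing handler: the function names are hypothetical, and the only point is to show an <code class="literal">ipath</code> which is plain ASCII, readable in a log, free of colons, and sufficient to find the document again.</p>

```python
def list_ipaths(text):
    """Return one ipath per non-empty line: its 1-based number, as decimal ASCII."""
    return [str(n) for n, line in enumerate(text.splitlines(), 1)
            if line.strip()]


def extract(text, ipath):
    """Return the subdocument designated by ipath (a decimal line number)."""
    return text.splitlines()[int(ipath) - 1]


sample = "alpha\n\nbeta\n"
for ipath in list_ipaths(sample):
    print(ipath, extract(sample, ipath))
```

<p>Because the identifier is just the line number, extracting a document at query time needs nothing more than the file contents and the <code class="literal">ipath</code> value.</p>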
|
|
<p><code class="literal">execm</code> handlers will also
|
|
produce a document with a null <code class=
|
|
"literal">ipath</code> element. Depending on the type of
|
|
document, this may have some associated data (e.g. the
|
|
body of an email message), or none (typical for an
|
|
archive file). If it is empty, this document will be
|
|
useful anyway for some operations, as the parent of the
|
|
actual data documents.</p>
|
|
</div>
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.PROGRAM.FILTERS.ASSOCIATION" id=
|
|
"RCL.PROGRAM.FILTERS.ASSOCIATION"></a>4.1.3. Telling
|
|
<span class="application">Recoll</span> about the
|
|
handler</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>There are two elements that link a file to the handler
|
|
which should process it: the association of file to MIME
|
|
type and the association of a MIME type with a
|
|
handler.</p>
|
|
<p>The association of files to MIME types is mostly based
|
|
on name suffixes. The types are defined inside the
|
|
<a class="link" href="#RCL.INSTALL.CONFIG.MIMEMAP" title=
|
|
"5.4.4. The mimemap file"><code class=
|
|
"filename">mimemap</code> file</a>. Example:</p>
|
|
<pre class="programlisting">
.doc = application/msword
</pre>
|
|
<p>If no suffix association is found for the file name,
|
|
<span class="application">Recoll</span> will try to
|
|
execute a system command (typically <span class=
|
|
"command"><strong>file -i</strong></span> or <span class=
|
|
"command"><strong>xdg-mime</strong></span>) to determine
|
|
a MIME type.</p>
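<p>The lookup order described above (suffix table first, external command as a fallback) can be sketched as follows. This is only a model of the logic, not <span class="application">Recoll</span> code: the function name, the lower-casing of the suffix and the <code class="literal">fallback</code> parameter are assumptions, and details such as multi-part suffixes are ignored.</p>

```python
import os

# Suffix table in the spirit of the mimemap example.
MIMEMAP = {".doc": "application/msword"}


def mime_from_name(path, mimemap=MIMEMAP, fallback=None):
    """Return the MIME type for path, or None if it cannot be determined.

    fallback, if set, is a callable taking the path. It stands for the
    system command which would be run when no suffix matches.
    """
    suffix = os.path.splitext(path)[1].lower()  # lower-casing is an assumption
    if suffix in mimemap:
        return mimemap[suffix]
    return fallback(path) if fallback is not None else None


print(mime_from_name("/home/me/report.doc"))
```

<p>In a real setup the fallback would wrap the external command; it is left as a parameter here so that the sketch does not depend on any particular tool being installed.</p>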
|
|
<p>The second element is the association of MIME types to
|
|
handlers in the <a class="link" href=
|
|
"#RCL.INSTALL.CONFIG.MIMECONF" title=
|
|
"5.4.5. The mimeconf file"><code class=
|
|
"filename">mimeconf</code> file</a>. A sample will
|
|
probably be better than a long explanation:</p>
|
|
<pre class="programlisting">
[index]
application/msword = exec antiword -t -i 1 -m UTF-8;\
     mimetype = text/plain ; charset=utf-8

application/ogg = exec rclogg

text/rtf = exec unrtf --nopict --html; charset=iso-8859-1; mimetype=text/html

application/x-chm = execm rclchm
</pre>
|
|
<p>The fragment specifies that:</p>
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style=
|
|
"list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p><code class="literal">application/msword</code>
|
|
files are processed by executing the <span class=
|
|
"command"><strong>antiword</strong></span> program,
|
|
which outputs <code class=
|
|
"literal">text/plain</code> encoded in <code class=
|
|
"literal">utf-8</code>.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><code class="literal">application/ogg</code>
|
|
files are processed by the <span class=
|
|
"command"><strong>rclogg</strong></span> script,
|
|
with default output type (<code class=
|
|
"literal">text/html</code>, with encoding specified
|
|
in the header, or <code class=
|
|
"literal">utf-8</code> by default).</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><code class="literal">text/rtf</code> is
|
|
processed by <span class=
|
|
"command"><strong>unrtf</strong></span>, which
|
|
outputs <code class="literal">text/html</code>. The
|
|
<code class="literal">iso-8859-1</code> encoding is
|
|
specified because it is not the <code class=
|
|
"literal">utf-8</code> default, and not output by
|
|
<span class="command"><strong>unrtf</strong></span>
|
|
in the HTML header section.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><code class="literal">application/x-chm</code>
|
|
is processed by a persistent handler. This is
|
|
determined by the <code class=
|
|
"literal">execm</code> keyword.</p>
|
|
</li>
|
|
</ul>
|
|
</div>
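<p>To make the structure of these values explicit, the following sketch splits one of them into the handler kind, the command line and the attributes. It is an illustration of the format as shown in the sample, not the parser used by <span class="application">Recoll</span>: the function name is hypothetical, and quoting or line continuation are not handled.</p>

```python
def parse_handler_spec(value):
    """Split 'exec cmd args; attr=val; ...' into (kind, argv, attrs)."""
    parts = [p.strip() for p in value.split(";")]
    words = parts[0].split()
    kind, argv = words[0], words[1:]
    attrs = {}
    for part in parts[1:]:
        if part:
            key, _, val = part.partition("=")
            attrs[key.strip()] = val.strip()
    return kind, argv, attrs


spec = "exec unrtf --nopict --html; charset=iso-8859-1; mimetype=text/html"
print(parse_handler_spec(spec))
```

<p>An entry with no attributes, such as the one for <code class="literal">application/ogg</code>, yields an empty attribute dictionary, which corresponds to the default output type described above.</p>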
|
|
</div>
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.PROGRAM.FILTERS.HTML" id=
|
|
"RCL.PROGRAM.FILTERS.HTML"></a>4.1.4. Input
|
|
handler output</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>Both the simple and persistent input handlers can
|
|
return any MIME type to Recoll, which will further
|
|
process the data according to the MIME configuration.</p>
|
|
<p>Most input filters produce either <code class=
|
|
"literal">text/plain</code> or <code class=
|
|
"literal">text/html</code> data. There are exceptions,
|
|
for example, filters which process archive files
|
|
(<code class="literal">zip</code>, <code class=
|
|
"literal">tar</code>, etc.) will usually return the
|
|
documents as they are found, without processing them
|
|
further.</p>
|
|
<p>There is nothing to say about <code class=
|
|
"literal">text/plain</code> output, except that its
|
|
character encoding should be consistent with what is
|
|
specified in the <code class="filename">mimeconf</code>
|
|
file.</p>
|
|
<p>For filters producing HTML, the output could be very
|
|
minimal like the following example:</p>
|
|
<pre class="programlisting">
&lt;html&gt;
  &lt;head&gt;
    &lt;meta http-equiv="Content-Type" content="text/html;charset=UTF-8"/&gt;
  &lt;/head&gt;
  &lt;body&gt;
    Some text content
  &lt;/body&gt;
&lt;/html&gt;
</pre>
|
|
<p>You should take care to escape some characters inside
|
|
the text by transforming them into appropriate entities.
|
|
At the very minimum, "<code class="literal">&amp;</code>"
should be transformed into "<code class=
"literal">&amp;amp;</code>", and "<code class=
"literal">&lt;</code>" should be transformed into
"<code class="literal">&amp;lt;</code>". This is not
|
|
always properly done by external helper programs which
|
|
output HTML, and of course never by those which output
|
|
plain text.</p>
|
|
<p>When encapsulating plain text in an HTML body, the
|
|
display of a preview may be improved by enclosing the
|
|
text inside <code class="literal">&lt;pre&gt;</code>
|
|
tags.</p>
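<p>For a handler written in Python, the escaping and the wrapping described above could look like the following sketch. The function name is hypothetical. <code class="literal">html.escape()</code> is the standard library function, which replaces <code class="literal">&amp;</code>, <code class="literal">&lt;</code> and <code class="literal">&gt;</code> with the corresponding entities.</p>

```python
import html


def text_to_html(text, charset="UTF-8"):
    """Wrap plain text in a minimal HTML document with a charset header."""
    body = html.escape(text, quote=False)
    return (
        "<html>\n"
        "<head>\n"
        f'<meta http-equiv="Content-Type" content="text/html;charset={charset}"/>\n'
        "</head>\n"
        "<body>\n"
        f"<pre>{body}</pre>\n"
        "</body>\n"
        "</html>\n"
    )


print(text_to_html("a < b & c"))
```

<p>The <code class="literal">charset</code> value only declares the encoding: the handler still has to encode its output bytes accordingly when writing them.</p>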
|
|
<p>The character set needs to be specified in the header.
|
|
It does not need to be UTF-8 (<span class=
|
|
"application">Recoll</span> will take care of translating
|
|
it), but it must be accurate for good results.</p>
|
|
<p><span class="application">Recoll</span> will process
|
|
<code class="literal">meta</code> tags inside the header
|
|
as possible document field candidates. Document fields
|
|
can be processed by the indexer in different ways, for
|
|
searching or displaying inside query results. This is
|
|
described in a <a class="link" href="#RCL.PROGRAM.FIELDS"
|
|
title="4.2. Field data processing">following
|
|
section.</a></p>
|
|
<p>By default, the indexer will process the standard
|
|
header fields if they are present: <code class=
|
|
"literal">title</code>, <code class=
|
|
"literal">meta/description</code>, and <code class=
|
|
"literal">meta/keywords</code> are both indexed and
|
|
stored for query-time display.</p>
|
|
<p>A predefined non-standard <code class=
|
|
"literal">meta</code> tag will also be processed by
|
|
<span class="application">Recoll</span> without further
|
|
configuration: if a <code class="literal">date</code> tag
|
|
is present and has the right format, it will be used as
|
|
the document date (for display and sorting), in
|
|
preference to the file modification date. The date format
|
|
should be as follows:</p>
|
|
<pre class="programlisting">
&lt;meta name="date" content="YYYY-mm-dd HH:MM:SS"&gt;
or
&lt;meta name="date" content="YYYY-mm-ddTHH:MM:SS"&gt;
</pre>
|
|
<p>Example:</p>
|
|
<pre class="programlisting">
&lt;meta name="date" content="2013-02-24 17:50:00"&gt;
</pre>
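<p>A handler written in Python could produce this tag with <code class="literal">strftime()</code>, as in the sketch below. The function name is hypothetical, and the question of which time zone the value should be expressed in is left open here.</p>

```python
from datetime import datetime


def date_meta(dt):
    """Format a datetime as a 'date' meta tag (first form shown above)."""
    return '<meta name="date" content="%s">' % dt.strftime("%Y-%m-%d %H:%M:%S")


print(date_meta(datetime(2013, 2, 24, 17, 50, 0)))
```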
|
|
<p>Input handlers also have the possibility to "invent"
|
|
field names. These should also be output as meta tags:</p>
|
|
<pre class="programlisting">
&lt;meta name="somefield" content="Some textual data" /&gt;
</pre>
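<p>When the field value comes from document data, it has to be quoted so that it cannot break the attribute. A possible way to do this in Python is sketched below. The function name is hypothetical, and the sketch relies only on ordinary HTML attribute escaping as done by <code class="literal">html.escape()</code>.</p>

```python
import html


def meta_tag(name, value):
    """Build a meta tag for a plain text field, escaping attribute values."""
    return '<meta name="%s" content="%s" />' % (
        html.escape(name, quote=True),
        html.escape(value, quote=True),
    )


print(meta_tag("somefield", 'Some "textual" data & more'))
```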
|
|
<p>You can embed HTML markup inside the content of custom
|
|
fields, for improving the display inside result lists. In
|
|
this case, add a (wildly non-standard) <code class=
|
|
"literal">markup</code> attribute to tell <span class=
|
|
"application">Recoll</span> that the value is HTML and
|
|
should not be escaped for display.</p>
|
|
<pre class="programlisting">
&lt;meta name="somefield" markup="html" content="Some &lt;i&gt;textual&lt;/i&gt; data" /&gt;
</pre>
|
|
<p>As written above, the processing of fields is
|
|
described in a <a class="link" href="#RCL.PROGRAM.FIELDS"
|
|
title="4.2. Field data processing">further
|
|
section</a>.</p>
|
|
<p>Persistent filters can use another, probably simpler,
|
|
method to produce metadata, by calling the <code class=
|
|
"literal">setfield()</code> helper method. This avoids
|
|
the necessity to produce HTML, and any issue with HTML
|
|
quoting. See, for example, <code class=
|
|
"filename">rclaudio</code> in <span class=
|
|
"application">Recoll</span> 1.23 and later for an example
|
|
of a handler which outputs <code class=
|
|
"literal">text/plain</code> and uses <code class=
|
|
"literal">setfield()</code> to produce metadata.</p>
|
|
</div>
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.PROGRAM.FILTERS.PAGES" id=
|
|
"RCL.PROGRAM.FILTERS.PAGES"></a>4.1.5. Page
|
|
numbers</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>The indexer will interpret <code class=
|
|
"literal">^L</code> characters in the handler output as
|
|
indicating page breaks, and will record them. At query
|
|
time, this allows starting a viewer on the right page for
|
|
a hit or a snippet. Currently, only the PDF, Postscript
|
|
and DVI handlers generate page breaks.</p>
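<p>The relation between <code class="literal">^L</code> characters and page numbers can be made concrete with a short sketch. It only illustrates the convention (the form feed character, <code class="literal">\f</code> in Python, ends a page) and says nothing about how the indexer actually records the breaks. The function name is hypothetical.</p>

```python
FORM_FEED = "\f"  # the ^L character


def page_of_offset(text, offset):
    """Return the 1-based page number containing character position offset."""
    return text.count(FORM_FEED, 0, offset) + 1


output = "page one\fpage two\fpage three"
print(page_of_offset(output, output.index("two")))
```

<p>A handler for a paginated format therefore only has to emit one form feed between the text of consecutive pages.</p>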
|
|
</div>
|
|
</div>
|
|
<div class="sect1">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h2 class="title" style="clear: both"><a name=
|
|
"RCL.PROGRAM.FIELDS" id=
|
|
"RCL.PROGRAM.FIELDS"></a>4.2. Field data
|
|
processing</h2>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p><code class="literal">Fields</code> are named pieces of
|
|
information in or about documents, like <code class=
|
|
"literal">title</code>, <code class=
|
|
"literal">author</code>, <code class=
|
|
"literal">abstract</code>.</p>
|
|
<p>The field values for documents can appear in several
|
|
ways during indexing: either output by input handlers as
|
|
<code class="literal">meta</code> fields in the HTML header
|
|
section, or extracted from file extended attributes, or
|
|
added as attributes of the <code class="literal">Doc</code>
|
|
object when using the API, or again synthesized internally
|
|
by <span class="application">Recoll</span>.</p>
|
|
<p>The <span class="application">Recoll</span> query
|
|
language allows searching for text in a specific field.</p>
|
|
<p><span class="application">Recoll</span> defines a number
|
|
of default fields. Additional ones can be output by
|
|
handlers, and described in the <code class=
|
|
"filename">fields</code> configuration file.</p>
|
|
<p>Fields can be:</p>
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style="list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p><code class="literal">indexed</code>, meaning that
|
|
their terms are separately stored in inverted lists
|
|
(with a specific prefix), and that a field-specific
|
|
search is possible.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><code class="literal">stored</code>, meaning that
|
|
their value is recorded in the index data record for
|
|
the document, and can be returned and displayed with
|
|
search results.</p>
|
|
</li>
|
|
</ul>
|
|
</div>
|
|
<p>A field can be either or both indexed and stored. This
|
|
and other aspects of field handling are defined inside the
|
|
<code class="filename">fields</code> configuration
|
|
file.</p>
|
|
<p>Some fields may also be designated as supporting range
queries, meaning that the results may be selected for an
interval of their values. See the <a class="link" href=
|
|
"#RCL.INSTALL.CONFIG.FIELDS" title=
|
|
"5.4.3. The fields file">configuration section</a> for
|
|
more details.</p>
|
|
<p>The sequence of events for field processing is as
|
|
follows:</p>
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style="list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p>During indexing, <span class=
|
|
"command"><strong>recollindex</strong></span> scans
|
|
all <code class="literal">meta</code> fields in HTML
|
|
documents (most document types are transformed into
|
|
HTML at some point). It compares the name for each
|
|
element to the configuration defining what should be
|
|
done with fields (the <code class=
"filename">fields</code> file).</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>If the name for the <code class=
|
|
"literal">meta</code> element matches one for a field
|
|
that should be indexed, the contents are processed
|
|
and the terms are entered into the index with the
|
|
prefix defined in the <code class=
|
|
"filename">fields</code> file.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>If the name for the <code class=
|
|
"literal">meta</code> element matches one for a field
|
|
that should be stored, the content of the element is
|
|
stored with the document data record, from which it
|
|
can be extracted and displayed at query time.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>At query time, if a field search is performed, the
|
|
index prefix is computed and the match is only
|
|
performed against appropriately prefixed terms in the
|
|
index.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>At query time, the field can be displayed inside
|
|
the result list by using the appropriate directive in
|
|
the definition of the <a class="link" href=
|
|
"#RCL.SEARCH.GUI.CUSTOM.RESLIST" title=
|
|
"The result list format">result list paragraph
|
|
format</a>. All fields are displayed on the fields
|
|
screen of the preview window (which you can reach
|
|
through the right-click menu). This is independent of
whether the search which produced the results used the
field.</p>
|
|
</li>
|
|
</ul>
|
|
</div>
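<p>The sequence above can be modeled with a toy in-memory index. This is a deliberately simplified sketch, not a description of the actual index format: the prefix values, the set of stored fields and all function names are illustrative assumptions, and the term processing is reduced to lower-casing and splitting on white space.</p>

```python
from collections import defaultdict

# Illustrative configuration: which fields are indexed (with which prefix)
# and which are stored. Real values come from the fields file.
PREFIXES = {"author": "A", "title": "S"}
STORED = {"author", "title"}


def index_doc(index, records, docid, fields):
    """Enter prefixed terms for indexed fields, keep values of stored ones."""
    for name, value in fields.items():
        prefix = PREFIXES.get(name)
        if prefix is not None:
            for term in value.lower().split():
                index[prefix + term].add(docid)
        if name in STORED:
            records.setdefault(docid, {})[name] = value


def field_search(index, name, word):
    """Match only against terms carrying the prefix of the given field."""
    return sorted(index.get(PREFIXES[name] + word.lower(), set()))


index, records = defaultdict(set), {}
index_doc(index, records, 1, {"author": "Jane Doe", "title": "Notes"})
index_doc(index, records, 2, {"author": "John Doe", "comment": "ignored"})
print(field_search(index, "author", "doe"), records[1]["title"])
```

<p>Fields which are neither indexed nor stored (<code class="literal">comment</code> in this sketch) leave no trace in either structure.</p>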
|
|
<p>You can find more information in the <a class="link"
|
|
href="#RCL.INSTALL.CONFIG.FIELDS" title=
|
|
"5.4.3. The fields file">section about the
|
|
<code class="filename">fields</code> file</a>, or in
|
|
comments inside the file.</p>
|
|
<p>You can also have a look at the <a class="ulink" href=
|
|
"https://www.lesbonscomptes.com/recoll/faqsandhowtos/HandleCustomField"
|
|
target="_top">example in the FAQs area</a>, detailing how
|
|
one could add a <span class="emphasis"><em>page
|
|
count</em></span> field to PDF documents for displaying
|
|
inside result lists.</p>
|
|
</div>
|
|
<div class="sect1">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h2 class="title" style="clear: both"><a name=
|
|
"RCL.PROGRAM.PYTHONAPI" id=
|
|
"RCL.PROGRAM.PYTHONAPI"></a>4.3. Python API</h2>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.PROGRAM.PYTHONAPI.INTRO" id=
|
|
"RCL.PROGRAM.PYTHONAPI.INTRO"></a>4.3.1. Introduction</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>The <span class="application">Recoll</span> Python
|
|
programming interface can be used both for searching and
|
|
for creating/updating an index. Bindings exist for
|
|
Python2 and Python3 (Jan 2021: Python2 support will be
|
|
dropped soon).</p>
|
|
<p>The search interface is used in a number of active
|
|
projects: the <a class="ulink" href=
|
|
"https://www.lesbonscomptes.com/recoll/pages/download.html#gssp"
|
|
target="_top"><span class="application">Recoll</span>
|
|
<span class="application">Gnome Shell Search
|
|
Provider</span></a>, the <a class="ulink" href=
|
|
"https://framagit.org/medoc92/recollwebui" target=
|
|
"_top"><span class="application">Recoll</span> Web
|
|
UI</a>, and the <a class="ulink" href=
|
|
"https://www.lesbonscomptes.com/upmpdcli/upmpdcli-manual.html#UPRCL"
|
|
target="_top">upmpdcli UPnP Media Server</a>, in addition
|
|
to many small scripts.</p>
|
|
<p>The index update section of the API may be used to
|
|
create and update <span class="application">Recoll</span>
|
|
indexes on specific configurations (separate from the
|
|
ones created by <span class=
|
|
"command"><strong>recollindex</strong></span>). The
|
|
resulting databases can be queried alone, or in
|
|
conjunction with regular ones, through the GUI or any of
|
|
the query interfaces.</p>
|
|
<p>The search API is modeled along the Python database
|
|
API version 2.0 specification (early versions used the
|
|
version 1.0 spec).</p>
|
|
<p>The <code class="literal">recoll</code> package
|
|
contains two modules:</p>
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style=
|
|
"list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p>The <code class="literal">recoll</code> module
|
|
contains functions and classes used to query (or
|
|
update) the index.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>The <code class="literal">rclextract</code>
|
|
module contains functions and classes used at query
|
|
time to access document data. The <code class=
|
|
"literal">recoll</code> module must be imported
|
|
before <code class="literal">rclextract</code>.</p>
|
|
</li>
|
|
</ul>
|
|
</div>
|
|
<p>There is a good chance that your system repository has
|
|
packages for the Recoll Python API, sometimes in a
|
|
package separate from the main one (maybe named something
|
|
like python-recoll). Otherwise, refer to the <a class="link"
|
|
href="#RCL.INSTALL.BUILDING" title=
|
|
"5.3. Building from source">Building from source
|
|
chapter</a>.</p>
|
|
<p>As an introduction, the following small sample will
|
|
run a query and list the title and url for each of the
|
|
results. The <code class="filename">python/samples</code>
|
|
source directory contains several examples of Python
|
|
programming with <span class="application">Recoll</span>,
|
|
exercising the extension more completely, and especially
|
|
its data extraction features.</p>
|
|
<pre class="programlisting">
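#!/usr/bin/python3

from recoll import recoll

# Open the index (the usual configuration defaults apply)
db = recoll.connect()
# Create a query object and run a query language string
query = db.query()
nres = query.execute("some query")
# Fetch at most 20 results and print the URL and title of each
results = query.fetchmany(20)
for doc in results:
    print("%s %s" % (doc.url, doc.title))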
</pre>
|
|
<p>You can also take a look at the source for the
|
|
<a class="ulink" href=
|
|
"https://framagit.org/medoc92/recollwebui/-/blob/master/webui.py"
|
|
target="_top">Recoll WebUI</a>, the <a class="ulink"
|
|
href="https://framagit.org/medoc92/upmpdcli/-/blob/master/src/mediaserver/cdplugins/uprcl/uprclfolders.py"
|
|
target="_top">upmpdcli local media server</a>, or the
|
|
<a class="ulink" href=
|
|
"https://framagit.org/medoc92/recoll-gssp/-/blob/master/gssp-recoll.py"
|
|
target="_top">Gnome Shell Search Provider</a>.</p>
|
|
</div>
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.PROGRAM.PYTHONAPI.ELEMENTS" id=
|
|
"RCL.PROGRAM.PYTHONAPI.ELEMENTS"></a>4.3.2. Interface
|
|
elements</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>A few elements in the interface are specific and
|
|
need an explanation.</p>
|
|
<div class="variablelist">
|
|
<dl class="variablelist">
|
|
<dt><a name="RCL.PROGRAM.PYTHONAPI.ELEMENTS.IPATH"
|
|
id="RCL.PROGRAM.PYTHONAPI.ELEMENTS.IPATH"></a><span class="term">ipath</span></dt>
|
|
<dd>
|
|
<p>This data value (set as a field in the Doc
|
|
object) is stored, along with the URL, but not
|
|
indexed by <span class="application">Recoll</span>.
|
|
Its contents are not interpreted by the index
|
|
layer, and its use is up to the application. For
|
|
example, the <span class=
|
|
"application">Recoll</span> file system indexer
|
|
uses the <code class="literal">ipath</code> to
|
|
store the part of the document access path internal
|
|
to (possibly nested) container documents.
|
|
<code class="literal">ipath</code> in this case is
|
|
a vector of access elements (e.g. the first part
|
|
could be a path inside a zip file to an archive
|
|
member which happens to be an mbox file, the second
|
|
element would be the message sequential number
|
|
inside the mbox etc.). <code class=
|
|
"literal">url</code> and <code class=
|
|
"literal">ipath</code> are returned in every search
|
|
result and define the access to the original
|
|
document. <code class="literal">ipath</code> is
|
|
empty for top-level document/files (e.g. a PDF
|
|
document which is a filesystem file). The
|
|
<span class="application">Recoll</span> GUI knows
|
|
about the structure of the <code class=
|
|
"literal">ipath</code> values used by the
|
|
filesystem indexer, and uses it for such functions
|
|
as opening the parent of a given document.</p>
|
|
</dd>
|
|
<dt><a name="RCL.PROGRAM.PYTHONAPI.ELEMENTS.UDI" id=
|
|
"RCL.PROGRAM.PYTHONAPI.ELEMENTS.UDI"></a><span class=
|
|
"term">udi</span></dt>
|
|
<dd>
|
|
<p>An <code class="literal">udi</code> (unique
|
|
document identifier) identifies a document. Because
|
|
of limitations inside the index engine, it is
|
|
restricted in length (to 200 bytes), which is why a
|
|
regular URI cannot be used. The structure and
|
|
contents of the <code class="literal">udi</code> are
|
|
defined by the application and opaque to the index
|
|
engine. For example, the internal file system
|
|
indexer uses the complete document path (file path
|
|
+ internal path), truncated to length, the
|
|
suppressed part being replaced by a hash value. The
|
|
<code class="literal">udi</code> is not explicit in
|
|
the query interface (it is used "under the hood" by
|
|
the <code class="filename">rclextract</code>
|
|
module), but it is an explicit element of the
|
|
update interface.</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.PROGRAM.PYTHONAPI.ELEMENTS.PARENTUDI" id=
|
|
"RCL.PROGRAM.PYTHONAPI.ELEMENTS.PARENTUDI"></a><span class="term">parent_udi</span></dt>
|
|
<dd>
|
|
<p>If this attribute is set on a document when
|
|
entering it in the index, it designates its
|
|
physical container document. In a multilevel
|
|
hierarchy, this may not be the immediate parent.
|
|
<code class="literal">parent_udi</code> is
|
|
optional, but its use by an indexer may simplify
|
|
index maintenance, as <span class=
|
|
"application">Recoll</span> will automatically
|
|
delete all children defined by <code class=
|
|
"literal">parent_udi == udi</code> when the
|
|
document designated by <code class=
"literal">udi</code> is destroyed. For example, if a
|
|
<code class="literal">Zip</code> archive contains
|
|
entries which are themselves containers, like
|
|
<code class="literal">mbox</code> files, all the
|
|
subdocuments inside the <code class=
|
|
"literal">Zip</code> file (mbox, messages, message
|
|
attachments, etc.) would have the same <code class=
|
|
"literal">parent_udi</code>, matching the
|
|
<code class="literal">udi</code> for the
|
|
<code class="literal">Zip</code> file, and all
|
|
would be destroyed when the <code class=
|
|
"literal">Zip</code> file (identified by its
|
|
<code class="literal">udi</code>) is removed from
|
|
the index. The standard filesystem indexer uses
|
|
<code class="literal">parent_udi</code>.</p>
|
|
</dd>
|
|
<dt><span class="term">Stored and indexed
|
|
fields</span></dt>
|
|
<dd>
|
|
<p>The <a class="link" href=
|
|
"#RCL.INSTALL.CONFIG.FIELDS" title=
|
|
"5.4.3. The fields file"><code class=
|
|
"filename">fields</code> file</a> inside the
|
|
<span class="application">Recoll</span>
|
|
configuration defines which document fields are
|
|
either <code class="literal">indexed</code>
|
|
(searchable), <code class="literal">stored</code>
|
|
(retrievable with search results), or both. Apart
|
|
from a few standard/internal fields, only the
|
|
<code class="literal">stored</code> fields are
|
|
retrievable through the Python search
|
|
interface.</p>
|
|
</dd>
|
|
</dl>
|
|
</div>
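<p>As an illustration of the <code class="literal">udi</code>
construction described above for the file system indexer (complete
document path, truncated, the suppressed part being replaced by a
hash), here is a minimal sketch. It is not the actual
<span class="application">Recoll</span> implementation: the
<code class="literal">make_udi</code> name, the
<code class="literal">|</code> separator between file path and
internal path, and the choice of SHA-256 as the hash are all
assumptions made for the example.</p>
<pre class="programlisting">
```python
import hashlib

MAX_UDI_BYTES = 200  # length limit stated above


def make_udi(path, ipath="", maxlen=MAX_UDI_BYTES):
    """Hypothetical helper: build a bounded-length identifier.

    The complete document path (file path + internal path) is used as is
    when it fits. Otherwise it is truncated and the suppressed part is
    replaced by a hash, so that distinct long paths stay distinct.
    """
    full = (path + "|" + ipath).encode("utf-8")
    if maxlen >= len(full):
        return full.decode("utf-8")
    keep = maxlen - 32  # leave room for 32 hexadecimal digits
    # Dropping a character split by the cut only makes the result shorter.
    head = full[:keep].decode("utf-8", "ignore")
    return head + hashlib.sha256(full[keep:]).hexdigest()[:32]
```
</pre>
<p>An external indexer is free to use any other scheme, as long as
the resulting value is unique and fits within the length limit.</p>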
|
|
</div>
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.PROGRAM.PYTHONAPI.LOG" id=
|
|
"RCL.PROGRAM.PYTHONAPI.LOG"></a>4.3.3. Log
|
|
messages for Python scripts</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>Two specific configuration variables, <code class=
"literal">pyloglevel</code> and <code class=
"literal">pylogfilename</code>, allow overriding the
|
|
generic values for Python programs. Set <code class=
|
|
"literal">pyloglevel</code> to 2 to suppress default
|
|
startup messages (printed at level 3).</p>
|
|
</div>
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.PROGRAM.PYTHONAPI.SEARCH" id=
|
|
"RCL.PROGRAM.PYTHONAPI.SEARCH"></a>4.3.4. Python
|
|
search interface</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<div class="sect3">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h4 class="title"><a name=
|
|
"RCL.PROGRAM.PYTHONAPI.RECOLL" id=
|
|
"RCL.PROGRAM.PYTHONAPI.RECOLL"></a>The recoll
|
|
module</h4>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<div class="simplesect">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h5 class="title"><a name=
|
|
"RCL.PROGRAM.PYTHONAPI.RECOLL.CONNECT" id=
|
|
"RCL.PROGRAM.PYTHONAPI.RECOLL.CONNECT"></a>connect(confdir=None,
|
|
extra_dbs=None, writable = False)</h5>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>The <code class="literal">connect()</code>
|
|
function connects to one or several <span class=
|
|
"application">Recoll</span> index(es) and returns a
|
|
<code class="literal">Db</code> object.</p>
|
|
<p>This call initializes the recoll module, and it
|
|
should always be performed before any other call or
|
|
object creation.</p>
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style=
|
|
"list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p><code class="literal">confdir</code> may
|
|
specify a configuration directory. The usual
|
|
defaults apply.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><code class="literal">extra_dbs</code> is a
|
|
list of additional indexes (Xapian
|
|
directories).</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><code class="literal">writable</code>
|
|
decides if we can index new data through this
|
|
connection.</p>
|
|
</li>
|
|
</ul>
|
|
</div>
|
|
</div>
|
|
<div class="simplesect">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h5 class="title"><a name=
|
|
"RCL.PROGRAM.PYTHONAPI.RECOLL.DB" id=
|
|
"RCL.PROGRAM.PYTHONAPI.RECOLL.DB"></a>The Db
|
|
class</h5>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>A Db object is created by a <code class=
|
|
"literal">connect()</code> call and holds a
|
|
connection to a Recoll index.</p>
|
|
<div class="variablelist">
|
|
<dl class="variablelist">
|
|
<dt><span class="term">Db.close()</span></dt>
|
|
<dd>
|
|
<p>Closes the connection. You can't do anything
|
|
with the <code class="literal">Db</code> object
|
|
after this.</p>
|
|
</dd>
|
|
<dt><span class="term">Db.query(),
|
|
Db.cursor()</span></dt>
|
|
<dd>
|
|
<p>These aliases return a blank <code class=
|
|
"literal">Query</code> object for this
|
|
index.</p>
|
|
</dd>
|
|
<dt><span class=
|
|
"term">Db.setAbstractParams(maxchars,
|
|
contextwords)</span></dt>
|
|
<dd>
|
|
<p>Set the parameters used to build snippets
|
|
(sets of keywords in context text fragments).
|
|
<code class="literal">maxchars</code> defines
|
|
the maximum total size of the abstract.
|
|
<code class="literal">contextwords</code>
|
|
defines how many terms are shown around the
|
|
keyword.</p>
|
|
</dd>
|
|
<dt><span class="term">Db.termMatch(match_type,
|
|
expr, field='', maxlen=-1, casesens=False,
|
|
diacsens=False, lang='english')</span></dt>
|
|
<dd>
|
|
<p>Expand an expression against the index term
|
|
list. Performs the basic function from the GUI
|
|
term explorer tool. <code class=
|
|
"literal">match_type</code> can be either of
|
|
<code class="literal">wildcard</code>,
|
|
<code class="literal">regexp</code> or
|
|
<code class="literal">stem</code>. Returns a
|
|
list of terms expanded from the input
|
|
expression.</p>
|
|
</dd>
|
|
</dl>
|
|
</div>
|
|
</div>
|
|
<div class="simplesect">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h5 class="title"><a name=
|
|
"RCL.PROGRAM.PYTHONAPI.RECOLL.CLASSES.QUERY"
|
|
id="RCL.PROGRAM.PYTHONAPI.RECOLL.CLASSES.QUERY">
|
|
</a>The Query class</h5>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>A <code class="literal">Query</code> object
|
|
(equivalent to a cursor in the Python DB API) is
|
|
created by a <code class="literal">Db.query()</code>
|
|
call. It is used to execute index searches.</p>
|
|
<div class="variablelist">
|
|
<dl class="variablelist">
|
|
<dt><span class="term">Query.sortby(fieldname,
|
|
ascending=True)</span></dt>
|
|
<dd>
|
|
<p>Sort results by <em class=
|
|
"replaceable"><code>fieldname</code></em>, in
|
|
ascending or descending order. Must be called
|
|
before executing the search.</p>
|
|
</dd>
|
|
<dt><span class=
|
|
"term">Query.execute(query_string, stemming=1,
|
|
stemlang="english", fetchtext=False,
|
|
collapseduplicates=False)</span></dt>
|
|
<dd>
|
|
<p>Starts a search for <em class=
|
|
"replaceable"><code>query_string</code></em>, a
|
|
<span class="application">Recoll</span> search
|
|
language string. If the index stores the
|
|
document texts and <code class=
|
|
"literal">fetchtext</code> is True, store the
|
|
document extracted text in <code class=
|
|
"literal">doc.text</code>.</p>
|
|
</dd>
|
|
<dt><span class=
|
|
"term">Query.executesd(SearchData,
|
|
fetchtext=False,
|
|
collapseduplicates=False)</span></dt>
|
|
<dd>
|
|
<p>Starts a search for the query defined by the
|
|
SearchData object. If the index stores the
|
|
document texts and <code class=
|
|
"literal">fetchtext</code> is True, store the
|
|
document extracted text in <code class=
|
|
"literal">doc.text</code>.</p>
|
|
</dd>
|
|
<dt><span class=
|
|
"term">Query.fetchmany(size=query.arraysize)</span></dt>
|
|
<dd>
|
|
<p>Fetches the next <code class=
|
|
"literal">Doc</code> objects in the current
|
|
search results, and returns them as an array of
|
|
the required size, which is by default the
|
|
value of the <code class=
|
|
"literal">arraysize</code> data member.</p>
|
|
</dd>
|
|
<dt><span class=
|
|
"term">Query.fetchone()</span></dt>
|
|
<dd>
|
|
<p>Fetches the next <code class=
|
|
"literal">Doc</code> object from the current
|
|
search results. Generates a StopIteration
|
|
exception if there are no results left.</p>
|
|
</dd>
|
|
<dt><span class="term">Query.close()</span></dt>
|
|
<dd>
|
|
<p>Closes the query. The object is unusable
|
|
after the call.</p>
|
|
</dd>
|
|
<dt><span class="term">Query.scroll(value,
|
|
mode='relative')</span></dt>
|
|
<dd>
|
|
<p>Adjusts the position in the current result
|
|
set. <code class="literal">mode</code> can be
|
|
<code class="literal">relative</code> or
|
|
<code class="literal">absolute</code>.</p>
|
|
</dd>
|
|
<dt><span class=
|
|
"term">Query.getgroups()</span></dt>
|
|
<dd>
|
|
<p>Retrieves the expanded query terms as a list
|
|
of pairs. Meaningful only after executexx. In
|
|
each pair, the first entry is a list of user
|
|
terms (of size one for simple terms, or more
|
|
for group and phrase clauses), the second a
|
|
list of query terms as derived from the user
|
|
terms and used in the Xapian Query.</p>
|
|
</dd>
|
|
<dt><span class=
|
|
"term">Query.getxquery()</span></dt>
|
|
<dd>
|
|
<p>Return the Xapian query description as a
|
|
Unicode string. Meaningful only after
|
|
executexx.</p>
|
|
</dd>
|
|
<dt><span class="term">Query.highlight(text,
|
|
ishtml = 0, methods = object)</span></dt>
|
|
<dd>
|
|
<p>Will insert &lt;span class="rclmatch"&gt;,
&lt;/span&gt; tags around the match areas in
|
|
the input text and return the modified text.
|
|
<code class="literal">ishtml</code> can be set
|
|
to indicate that the input text is HTML and
|
|
that HTML special characters should not be
|
|
escaped. <code class="literal">methods</code>
|
|
if set should be an object with methods
|
|
startMatch(i) and endMatch() which will be
|
|
called for each match and should return a begin
|
|
and end tag, respectively.</p>
|
|
</dd>
|
|
<dt><span class="term">Query.makedocabstract(doc,
|
|
methods = object)</span></dt>
|
|
<dd>
|
|
<p>Create a snippets abstract for <code class=
|
|
"literal">doc</code> (a <code class=
|
|
"literal">Doc</code> object) by selecting text
|
|
around the match terms. If methods is set, will
|
|
also perform highlighting. See the highlight
|
|
method.</p>
|
|
</dd>
|
|
<dt><span class="term">Query.getsnippets(doc,
|
|
maxoccs = -1, ctxwords = -1, sortbypage=False,
|
|
methods = object)</span></dt>
|
|
<dd>
|
|
<p>Will return a list of extracts from the
|
|
result document by selecting text around the
|
|
match terms. Each entry in the result list is a
|
|
triple: page number, term, text. By default,
|
|
the most relevant snippets appear first in the
|
|
list. Set <code class=
|
|
"literal">sortbypage</code> to sort by page
|
|
number instead. If <code class=
|
|
"literal">methods</code> is set, the fragments
|
|
will be highlighted (see the highlight method).
|
|
If <code class="literal">maxoccs</code> is set,
|
|
it defines the maximum result list length.
|
|
<code class="literal">ctxwords</code> allows
|
|
adjusting the individual snippet context
|
|
size.</p>
|
|
</dd>
|
|
<dt><span class="term">Query.__iter__() and
|
|
Query.next()</span></dt>
|
|
<dd>
|
|
<p>So that things like <code class=
|
|
"literal">for doc in query:</code> will
|
|
work.</p>
|
|
</dd>
|
|
</dl>
|
|
</div>
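<p>The <code class="literal">methods</code> parameter accepted by
<code class="literal">highlight()</code>,
<code class="literal">makedocabstract()</code> and
<code class="literal">getsnippets()</code> can be sketched as
follows. The class name and the bracket markers are arbitrary
choices for the example. Only the two method names come from the
description above, and the meaning of the
<code class="literal">startMatch()</code> argument is not detailed
here, so the sketch ignores it.</p>
<pre class="programlisting">
```python
class BracketMarker:
    """Hypothetical 'methods' object marking matches with [[ and ]]."""

    def startMatch(self, idx):
        # idx is the value passed by Recoll for this match; unused here.
        return "[["

    def endMatch(self):
        return "]]"


# Usage, assuming 'query' has been executed and 'doc' is one of its results:
#   marked = query.highlight("some text", methods=BracketMarker())
#   abstract = query.makedocabstract(doc, methods=BracketMarker())
```
</pre>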
|
|
<div class="variablelist">
|
|
<dl class="variablelist">
|
|
<dt><span class=
|
|
"term">Query.arraysize</span></dt>
|
|
<dd>
|
|
<p>Default number of records processed by
|
|
fetchmany (r/w).</p>
|
|
</dd>
|
|
<dt><span class="term">Query.rowcount</span></dt>
|
|
<dd>
|
|
<p>Number of records returned by the last
|
|
execute.</p>
|
|
</dd>
|
|
<dt><span class=
|
|
"term">Query.rownumber</span></dt>
|
|
<dd>
|
|
<p>Next index to be fetched from results.
|
|
Normally increments after each fetchone() call,
|
|
but can be set/reset before the call to effect
|
|
seeking (equivalent to using <code class=
|
|
"literal">scroll()</code>). Starts at 0.</p>
|
|
</dd>
|
|
</dl>
|
|
</div>
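<p>The positioning and fetching calls above can be combined to
retrieve results page by page. The following helper is only a
sketch: <code class="literal">fetch_page</code> is a hypothetical
name, and the code assumes that
<code class="literal">fetchmany()</code> returns a shorter list
when fewer results remain.</p>
<pre class="programlisting">
```python
def fetch_page(query, pagenum, pagesize=10):
    """Return the Doc objects for result page 'pagenum' (first page is 0)."""
    # Seek to the first result of the page, then fetch at most one page.
    query.scroll(pagenum * pagesize, mode="absolute")
    return query.fetchmany(pagesize)
```
</pre>
<p>Setting <code class="literal">query.rownumber</code> directly
before fetching should be equivalent to the
<code class="literal">scroll()</code> call, according to the
attribute description above.</p>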
|
|
</div>
|
|
<div class="simplesect">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h5 class="title"><a name=
|
|
"RCL.PROGRAM.PYTHONAPI.RECOLL.CLASSES.DOC" id=
|
|
"RCL.PROGRAM.PYTHONAPI.RECOLL.CLASSES.DOC"></a>The
|
|
Doc class</h5>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>A <code class="literal">Doc</code> object contains
|
|
index data for a given document. The data is
|
|
extracted from the index when searching, or set by
|
|
the indexer program when updating. The Doc object has
|
|
many attributes to be read or set by its user. It
|
|
mostly matches the Rcl::Doc C++ object. Some of the
|
|
attributes are predefined, but, especially when
|
|
indexing, others can be set, the name of which will
|
|
be processed as field names by the indexing
|
|
configuration. Inputs can be specified as Unicode or
|
|
strings. Outputs are Unicode objects. All dates are
|
|
specified as Unix timestamps, printed as strings.
|
|
Please refer to the <code class=
|
|
"filename">rcldb/rcldoc.cpp</code> C++ file for a
|
|
full description of the predefined attributes. Here
|
|
follows a short list.</p>
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style=
|
|
"list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p><code class="literal">url</code> the
|
|
document URL but see also <code class=
|
|
"literal">getbinurl()</code></p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><code class="literal">ipath</code> the
|
|
document <code class="literal">ipath</code> for
|
|
embedded documents.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><code class="literal">fbytes, dbytes</code>
|
|
the document file and text sizes.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><code class="literal">fmtime, dmtime</code>
|
|
the document file and document times.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><code class="literal">xdocid</code> the
|
|
document Xapian document ID. This is useful if
|
|
you want to access the document through a
|
|
direct Xapian operation.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><code class="literal">mtype</code> the
|
|
document MIME type.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>Fields stored by default: <code class=
|
|
"literal">author</code>, <code class=
|
|
"literal">filename</code>, <code class=
|
|
"literal">keywords</code>, <code class=
|
|
"literal">recipient</code></p>
|
|
</li>
|
|
</ul>
|
|
</div>
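<p>Because dates such as <code class="literal">fmtime</code> and
<code class="literal">dmtime</code> are Unix timestamps printed as
strings, they usually need converting before display. A minimal
sketch follows. The <code class="literal">doc_time</code> name is
hypothetical, and treating an empty value as "no date" is an
assumption.</p>
<pre class="programlisting">
```python
from datetime import datetime, timezone


def doc_time(value):
    """Convert a date field such as doc.fmtime (Unix timestamp printed as a
    string) to a timezone-aware UTC datetime, or None if the field is empty."""
    if not value:
        return None
    return datetime.fromtimestamp(int(value), tz=timezone.utc)
```
</pre>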
|
|
<p>At query time, only the fields that are defined as
|
|
<code class="literal">stored</code> either by default
|
|
or in the <code class="filename">fields</code>
|
|
configuration file will be meaningful in the
|
|
<code class="literal">Doc</code> object. The document
|
|
processed text may be present or not, depending on whether
the index stores the text at all and, if it does, on
|
|
the <code class="literal">fetchtext</code> query
|
|
execute option. See also the <code class=
|
|
"literal">rclextract</code> module for accessing
|
|
document contents.</p>
|
|
<div class="variablelist">
|
|
<dl class="variablelist">
|
|
<dt><span class="term">get(key), []
|
|
operator</span></dt>
|
|
<dd>
|
|
<p>Retrieve the named document attribute. You
|
|
can also use <code class="literal">getattr(doc,
|
|
key)</code> or <code class=
|
|
"literal">doc.key</code>.</p>
|
|
</dd>
|
|
<dt><span class="term">doc.key =
|
|
value</span></dt>
|
|
<dd>
|
|
<p>Set the named document attribute. You
|
|
can also use <code class="literal">setattr(doc,
|
|
key, value)</code>.</p>
|
|
</dd>
|
|
<dt><span class="term">getbinurl()</span></dt>
|
|
<dd>
|
|
<p>Retrieve the URL in byte array format (no
|
|
transcoding), for use as parameter to a system
|
|
call.</p>
|
|
</dd>
|
|
<dt><span class="term">setbinurl(url)</span></dt>
|
|
<dd>
|
|
<p>Set the URL in byte array format (no
|
|
transcoding).</p>
|
|
</dd>
|
|
<dt><span class="term">items()</span></dt>
|
|
<dd>
|
|
<p>Return a dictionary of doc object
|
|
keys/values.</p>
|
|
</dd>
|
|
<dt><span class="term">keys()</span></dt>
|
|
<dd>
|
|
<p>Return a list of doc object keys (attribute
|
|
names).</p>
|
|
</dd>
|
|
</dl>
|
|
</div>
|
|
</div>
|
|
<div class="simplesect">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h5 class="title"><a name=
|
|
"RCL.PROGRAM.PYTHONAPI.RECOLL.CLASSES.SEARCHDATA"
|
|
id=
|
|
"RCL.PROGRAM.PYTHONAPI.RECOLL.CLASSES.SEARCHDATA">
|
|
</a>The SearchData class</h5>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>A <code class="literal">SearchData</code> object
|
|
allows building a query by combining clauses, for
|
|
execution by <code class=
|
|
"literal">Query.executesd()</code>. It can be used in
|
|
replacement of the query language approach. The
|
|
interface is going to change a little, so no detailed
|
|
doc for now...</p>
|
|
<div class="variablelist">
|
|
<dl class="variablelist">
|
|
<dt><span class=
|
|
"term">addclause(type='and'|'or'|'excl'|'phrase'|'near'|'sub',
|
|
qstring=string, slack=0, field='', stemming=1,
|
|
subSearch=SearchData)</span></dt>
|
|
<dd></dd>
|
|
</dl>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<div class="sect3">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h4 class="title"><a name=
|
|
"RCL.PROGRAM.PYTHONAPI.RCLEXTRACT" id=
|
|
"RCL.PROGRAM.PYTHONAPI.RCLEXTRACT"></a>The
|
|
rclextract module</h4>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>Prior to <span class="application">Recoll</span>
|
|
1.25, index queries could not provide document content
|
|
because it was never stored. <span class=
|
|
"application">Recoll</span> 1.25 and later usually
|
|
store the document text, which can be optionally
|
|
retrieved when running a query (see <code class=
|
|
"literal">query.execute()</code> above - the result is
|
|
always plain text).</p>
|
|
<p>The <code class="literal">rclextract</code> module
|
|
can give access to the original document and to the
|
|
document text content (if not stored by the index, or
|
|
to access an HTML version of the text). Accessing the
|
|
original document is particularly useful if it is
|
|
embedded (e.g. an email attachment).</p>
|
|
<p>You need to import the <code class=
|
|
"literal">recoll</code> module before the <code class=
|
|
"literal">rclextract</code> module.</p>
|
|
<div class="simplesect">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h5 class="title"><a name=
|
|
"RCL.PROGRAM.PYTHONAPI.RCLEXTRACT.CLASSES.EXTRACTOR"
|
|
id=
|
|
"RCL.PROGRAM.PYTHONAPI.RCLEXTRACT.CLASSES.EXTRACTOR">
|
|
</a>The Extractor class</h5>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<div class="variablelist">
|
|
<dl class="variablelist">
|
|
<dt><span class="term">Extractor(doc)</span></dt>
|
|
<dd>
|
|
<p>An <code class="literal">Extractor</code>
|
|
object is built from a <code class=
|
|
"literal">Doc</code> object, output from a
|
|
query.</p>
|
|
</dd>
|
|
<dt><span class=
|
|
"term">Extractor.textextract(ipath)</span></dt>
|
|
<dd>
|
|
<p>Extract document defined by <em class=
|
|
"replaceable"><code>ipath</code></em> and
|
|
return a <code class="literal">Doc</code>
|
|
object. The <code class=
|
|
"literal">doc.text</code> field has the
|
|
document text converted to either text/plain or
|
|
text/html according to <code class=
|
|
"literal">doc.mimetype</code>. The typical use
|
|
would be as follows:</p>
|
|
<pre class="programlisting">
|
|
from recoll import recoll, rclextract
|
|
|
|
qdoc = query.fetchone()
|
|
extractor = rclextract.Extractor(qdoc)
|
|
doc = extractor.textextract(qdoc.ipath)
|
|
# use doc.text, e.g. for previewing</pre>
|
|
<p>Passing <code class=
|
|
"literal">qdoc.ipath</code> to <code class=
|
|
"literal">textextract()</code> is redundant,
|
|
but reflects the fact that the <code class=
|
|
"literal">Extractor</code> object actually has
|
|
the capability to access the other entries in a
|
|
compound document.</p>
|
|
</dd>
|
|
<dt><span class=
|
|
"term">Extractor.idoctofile(ipath, targetmtype,
|
|
outfile='')</span></dt>
|
|
<dd>
|
|
<p>Extracts document into an output file, which
|
|
can be given explicitly or will be created as a
|
|
temporary file to be deleted by the caller.
|
|
Typical use:</p>
|
|
<pre class="programlisting">
|
|
from recoll import recoll, rclextract
|
|
|
|
qdoc = query.fetchone()
|
|
extractor = rclextract.Extractor(qdoc)
|
|
filename = extractor.idoctofile(qdoc.ipath, qdoc.mimetype)</pre>
|
|
<p>In all cases the output is a copy, even if
|
|
the requested document is a regular system
|
|
file, which may be wasteful in some cases. If
|
|
you want to avoid this, you can test for a
|
|
simple file document as follows:</p>
|
|
<pre class="programlisting">
|
|
not doc.ipath and ("rclbes" not in doc.keys() or doc["rclbes"] == "FS")
|
|
</pre>
|
|
</dd>
|
|
</dl>
|
|
</div>
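<p>The simple file test can be wrapped in a small function, for
example to decide whether calling
<code class="literal">idoctofile()</code> is needed at all. This
is a sketch: <code class="literal">is_plain_file</code> is a
hypothetical name, and accepting an empty
<code class="literal">rclbes</code> value is an addition based on
the statement, in the external indexers section, that the
filesystem indexer uses either "FS" or an empty value.</p>
<pre class="programlisting">
```python
def is_plain_file(doc):
    """True if 'doc' is a regular file indexed by the file system indexer."""
    if doc.ipath:
        return False  # embedded document: extraction is needed
    # The file system indexer sets rclbes to "FS" or leaves it empty or unset.
    return "rclbes" not in doc.keys() or doc["rclbes"] in ("", "FS")


# Usage, assuming 'qdoc' comes from a query and 'extractor' was built from it:
#   if is_plain_file(qdoc): access the file designated by qdoc.url directly
#   else: filename = extractor.idoctofile(qdoc.ipath, qdoc.mimetype)
```
</pre>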
|
|
</div>
|
|
</div>
|
|
<div class="sect3">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h4 class="title"><a name=
|
|
"RCL.PROGRAM.PYTHONAPI.SEARCH.EXAMPLE" id=
|
|
"RCL.PROGRAM.PYTHONAPI.SEARCH.EXAMPLE"></a>Search
|
|
API usage example</h4>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>The following sample would query the index with a
|
|
user language string. See the <code class=
|
|
"filename">python/samples</code> directory inside the
|
|
<span class="application">Recoll</span> source for
|
|
other examples. The <code class=
|
|
"filename">recollgui</code> subdirectory has a very
|
|
embryonic GUI which demonstrates the highlighting and
|
|
data extraction functions.</p>
|
|
<pre class="programlisting">
|
|
#!/usr/bin/python3
|
|
|
|
from recoll import recoll
|
|
|
|
db = recoll.connect()
|
|
db.setAbstractParams(maxchars=80, contextwords=4)
|
|
|
|
query = db.query()
|
|
nres = query.execute("some user question")
|
|
print("Result count: %d" % nres)
|
|
if nres > 5:
    nres = 5
for i in range(nres):
    doc = query.fetchone()
    print("Result #%d" % (query.rownumber))
    for k in ("title", "size"):
        print("%s : %s" % (k, getattr(doc, k)))
    print("%s\n" % query.makedocabstract(doc))
|
|
</pre>
|
|
</div>
|
|
</div>
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.PROGRAM.PYTHONAPI.UPDATE" id=
|
|
"RCL.PROGRAM.PYTHONAPI.UPDATE"></a>4.3.5. Creating
|
|
Python external indexers</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>The update API can be used to create an index from
|
|
data which is not accessible to the regular <span class=
|
|
"application">Recoll</span> indexer, or structured to
|
|
present difficulties to the <span class=
|
|
"application">Recoll</span> input handlers.</p>
|
|
<p>An indexer created using this API will have the same
kind of work to do as the Recoll file system indexer: look
for modified documents, extract their text, call the API
to index it, and purge from the index the data for
documents which no longer exist in the document
store.</p>
|
|
<p>The data for such an external indexer should be stored
|
|
in an index separate from any used by the <span class=
|
|
"application">Recoll</span> internal file system indexer.
|
|
The reason is that the main document indexer purge pass
|
|
(removal of deleted documents) would also remove all the
|
|
documents belonging to the external indexer, as they were
|
|
not seen during the filesystem walk. The main indexer
|
|
documents would also probably be a problem for the
|
|
external indexer's own purge operation.</p>
|
|
<p>While there would be ways to enable multiple foreign
|
|
indexers to cooperate on a single index, it is just
|
|
simpler to use separate ones, and use the multiple index
|
|
access capabilities of the query interface, if
|
|
needed.</p>
|
|
<p>There are two parts in the update interface:</p>
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style=
|
|
"list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p>Methods inside the <code class=
|
|
"filename">recoll</code> module allow inserting
|
|
data into the index, to make it accessible by the
|
|
normal query interface.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>An interface based on script execution is
|
|
defined to allow either the GUI or the <code class=
|
|
"filename">rclextract</code> module to access
|
|
original document data for previewing or
|
|
editing.</p>
|
|
</li>
|
|
</ul>
|
|
</div>
|
|
<div class="sect3">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h4 class="title"><a name=
|
|
"RCL.PROGRAM.PYTHONAPI.UPDATE.UPDATE" id=
|
|
"RCL.PROGRAM.PYTHONAPI.UPDATE.UPDATE"></a>Python
|
|
update interface</h4>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>The update methods are part of the <code class=
|
|
"filename">recoll</code> module described above. The
|
|
connect() method is used with a <code class=
"literal">writable=True</code> parameter to obtain a
|
|
writable <code class="literal">Db</code> object. The
|
|
following <code class="literal">Db</code> object
|
|
methods are then available.</p>
|
|
<div class="variablelist">
|
|
<dl class="variablelist">
|
|
<dt><span class="term">addOrUpdate(udi, doc,
|
|
parent_udi=None)</span></dt>
|
|
<dd>
|
|
<p>Add or update index data for a given document.
|
|
The <code class="literal"><a class="link" href=
|
|
"#RCL.PROGRAM.PYTHONAPI.ELEMENTS.UDI">udi</a></code>
|
|
string must define a unique id for the document.
|
|
It is an opaque interface element and not
|
|
interpreted inside Recoll. <code class=
|
|
"literal">doc</code> is a <code class=
|
|
"literal"><a class="link" href=
|
|
"#RCL.PROGRAM.PYTHONAPI.RECOLL.CLASSES.DOC"
|
|
title="The Doc class">Doc</a></code> object,
|
|
created from the data to be indexed (the main
|
|
text should be in <code class=
|
|
"literal">doc.text</code>). If <code class=
|
|
"literal"><a class="link" href=
|
|
"#RCL.PROGRAM.PYTHONAPI.ELEMENTS.PARENTUDI">parent_udi</a></code>
|
|
is set, this is a unique identifier for the
|
|
top-level container (e.g. for the filesystem
|
|
indexer, this would be the one which is an actual
|
|
file).</p>
|
|
</dd>
|
|
<dt><span class="term">delete(udi)</span></dt>
|
|
<dd>
|
|
<p>Purge index from all data for <code class=
|
|
"literal">udi</code>, and all documents (if any)
|
|
which have a matching <code class=
|
|
"literal">parent_udi</code>.</p>
|
|
</dd>
|
|
<dt><span class="term">needUpdate(udi,
|
|
sig)</span></dt>
|
|
<dd>
|
|
<p>Test if the index needs to be updated for the
|
|
document identified by <code class=
|
|
"literal">udi</code>. If this call is to be used,
|
|
the <code class="literal">doc.sig</code> field
|
|
should contain a signature value when calling
|
|
<code class="literal">addOrUpdate()</code>. The
|
|
<code class="literal">needUpdate()</code> call
|
|
then compares its parameter value with the stored
|
|
<code class="literal">sig</code> for <code class=
|
|
"literal">udi</code>. <code class=
|
|
"literal">sig</code> is an opaque value, compared
|
|
as a string.</p>
|
|
<p>The filesystem indexer uses a concatenation of
|
|
the decimal string values for file size and
|
|
update time, but a hash of the contents could
|
|
also be used.</p>
|
|
<p>As a side effect, if the return value is false
|
|
(the index is up to date), the call will set the
|
|
existence flag for the document (and any
|
|
subdocument defined by its <code class=
|
|
"literal">parent_udi</code>), so that a later
|
|
<code class="literal">purge()</code> call will
|
|
preserve them.</p>
|
|
<p>The use of <code class=
|
|
"literal">needUpdate()</code> and <code class=
|
|
"literal">purge()</code> is optional, and the
|
|
indexer may use another method for checking the
|
|
need to reindex or to delete stale entries.</p>
|
|
</dd>
|
|
<dt><span class="term">purge()</span></dt>
|
|
<dd>
|
|
<p>Delete all documents that were not touched
|
|
during the just finished indexing pass (since
|
|
open-for-write). These are the documents for which the
|
|
needUpdate() call was not performed, indicating
|
|
that they no longer exist in the primary storage
|
|
system.</p>
|
|
</dd>
|
|
</dl>
|
|
</div>
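<p>The way these methods fit together during one indexing pass can
be sketched as follows. This is an illustration under stated
assumptions, not a reference implementation:
<code class="literal">file_sig</code>,
<code class="literal">index_pass</code> and
<code class="literal">make_doc</code> are hypothetical names, the
file path is used directly as <code class="literal">udi</code>
(which supposes it fits within the length limit), and the
signature merely imitates the size and update time concatenation
described for the filesystem indexer.</p>
<pre class="programlisting">
```python
import os


def file_sig(path):
    """Signature built from file size and modification time, as decimal strings."""
    st = os.stat(path)
    return "%d%d" % (st.st_size, int(st.st_mtime))


def index_pass(db, paths, make_doc):
    """Run one indexing pass over 'paths' using a writable Db-like object.

    make_doc is a caller-supplied (hypothetical) function which builds a
    Doc for a path, with at least its url and text attributes set.
    Returns the list of udis which were actually (re)indexed.
    """
    updated = []
    for path in paths:
        udi = path  # assumption: the path is short enough to serve as udi
        sig = file_sig(path)
        if not db.needUpdate(udi, sig):
            continue  # up to date, and now flagged as still existing
        doc = make_doc(path)
        doc.sig = sig  # stored so that the next needUpdate() can compare
        db.addOrUpdate(udi, doc)
        updated.append(udi)
    db.purge()  # remove documents which were not seen during this pass
    return updated
```
</pre>
<p>In a real indexer, <code class="literal">db</code> would be
obtained with <code class="literal">recoll.connect()</code> and
<code class="literal">writable=True</code>, on an index separate
from the one used by the file system indexer.</p>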
|
|
</div>
|
|
<div class="sect3">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h4 class="title"><a name=
|
|
"RCL.PROGRAM.PYTHONAPI.UPDATE.ACCESS" id=
|
|
"RCL.PROGRAM.PYTHONAPI.UPDATE.ACCESS"></a>Query
|
|
data access for external indexers (1.23)</h4>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p><span class="application">Recoll</span> has internal
|
|
methods to access document data for its internal
|
|
(filesystem) indexer. An external indexer needs to
|
|
provide data access methods if it needs integration
|
|
with the GUI (e.g. preview function), or support for
|
|
the <code class="filename">rclextract</code>
|
|
module.</p>
|
|
<p>The index data and the access method are linked by
|
|
the <code class="literal">rclbes</code> (recoll backend
|
|
storage) <code class="literal">Doc</code> field. You
|
|
should set this to a short string value identifying
|
|
your indexer (e.g. the filesystem indexer uses either
|
|
"FS" or an empty value, the Web history indexer uses
|
|
"BGL").</p>
|
|
<p>The link is actually performed inside a <code class=
|
|
"filename">backends</code> configuration file (stored
|
|
in the configuration directory). This defines commands
|
|
to execute to access data from the specified indexer.
|
|
For example, for the mbox indexing sample found in the
|
|
Recoll source (which sets <code class=
|
|
"literal">rclbes="MBOX"</code>):</p>
|
|
<pre class="programlisting">[MBOX]
|
|
fetch = /path/to/recoll/src/python/samples/rclmbox.py fetch
|
|
makesig = /path/to/recoll/src/python/samples/rclmbox.py makesig
|
|
</pre>
|
|
<p><code class="literal">fetch</code> and <code class=
|
|
"literal">makesig</code> define two commands to execute
|
|
to respectively retrieve the document text and compute
|
|
the document signature (the example implementation uses
|
|
the same script with different first parameters to
|
|
perform both operations).</p>
|
|
<p>The scripts are called with three additional
|
|
arguments: <code class="literal">udi</code>,
|
|
<code class="literal">url</code>, <code class=
|
|
"literal">ipath</code>, stored with the document when
|
|
it was indexed, and may use any or all to perform the
|
|
requested operation. The caller expects the result data
|
|
on <code class="literal">stdout</code>.</p>
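<p>A data access script of this kind could look roughly like the following sketch. It is not the <code class="literal">rclmbox.py</code> sample: the <code class="literal">STORE</code> dictionary stands in for whatever primary storage the indexer uses, the function names are invented, and the argument order (operation, then <code class="literal">udi</code>, <code class="literal">url</code>, <code class="literal">ipath</code>) is assumed from the order in which the text lists them.</p>

```python
import sys

# Hypothetical primary storage: udi -> document text.
STORE = {"msg-1": "Subject: hello\n\nFirst message body\n"}


def fetch(udi, url, ipath):
    """Return the document text for this udi."""
    return STORE[udi]


def makesig(udi, url, ipath):
    """Return an opaque signature string; here simply the text length."""
    return str(len(STORE[udi]))


def run(argv):
    """argv is [script, operation, udi, url, ipath]."""
    op, udi, url, ipath = argv[1:5]
    handlers = {"fetch": fetch, "makesig": makesig}
    return handlers[op](udi, url, ipath)


if __name__ == "__main__" and len(sys.argv) >= 5:
    # The caller expects the result data on stdout.
    sys.stdout.write(run(sys.argv))
```

<p>Using one script with the operation as first parameter mirrors the layout of the <code class="literal">[MBOX]</code> example above, where both commands point to the same file.</p>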
|
|
</div>
|
|
<div class="sect3">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h4 class="title"><a name=
|
|
"RCL.PROGRAM.PYTHONAPI.UPDATE.SAMPLES" id=
|
|
"RCL.PROGRAM.PYTHONAPI.UPDATE.SAMPLES"></a>External
|
|
indexer samples</h4>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>The Recoll source tree has two samples of external
|
|
indexers in the <code class=
|
|
"filename">src/python/samples</code> directory. The
|
|
more interesting one is <code class=
|
|
"filename">rclmbox.py</code> which indexes a directory
|
|
containing <code class="literal">mbox</code> folder
|
|
files. It exercises most features in the update
|
|
interface, and has a data access interface.</p>
|
|
<p>See the comments inside the file for more
|
|
information.</p>
|
|
</div>
|
|
</div>
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.PROGRAM.PYTHONAPI.COMPAT" id=
|
|
"RCL.PROGRAM.PYTHONAPI.COMPAT"></a>4.3.6. Package
|
|
compatibility with the previous version</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>The following code fragments can be used to ensure
|
|
that code can run with both the old and the new API (as
|
|
long as it does not use the new abilities of the new API
|
|
of course).</p>
|
|
<p>Adapting to the new package structure:</p>
|
|
<pre class="programlisting">
|
|
try:
|
|
from recoll import recoll
|
|
from recoll import rclextract
|
|
hasextract = True
|
|
except ImportError:
|
|
import recoll
|
|
hasextract = False
|
|
</pre>
|
|
<p>Adapting to the change of nature of the <code class=
|
|
"literal">next</code> <code class="literal">Query</code>
|
|
member. The same test can be used to choose to use the
|
|
<code class="literal">scroll()</code> method (new) or set
|
|
the <code class="literal">next</code> value (old).</p>
|
|
<pre class=
|
|
"programlisting">rownum = query.next if type(query.next) == int else query.rownumber</pre>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<div class="chapter">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h1 class="title"><a name="RCL.INSTALL" id=
|
|
"RCL.INSTALL"></a>Chapter 5. Installation and
|
|
configuration</h1>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<div class="sect1">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h2 class="title" style="clear: both"><a name=
|
|
"RCL.INSTALL.BINARY" id=
|
|
"RCL.INSTALL.BINARY"></a>5.1. Installing a
|
|
binary copy</h2>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p><span class="application">Recoll</span> binary copies
|
|
are always distributed as regular packages for your system.
|
|
They can be obtained either through the system's normal
|
|
software distribution framework (e.g. <span class=
|
|
"application">Debian/Ubuntu apt</span>, <span class=
|
|
"application">FreeBSD</span> ports, etc.), or from some
|
|
type of "backports" repository providing versions newer
|
|
than the standard ones, or found on the <span class=
|
|
"application">Recoll</span> Web site in some cases. The
|
|
most up-to-date information about Recoll packages can
|
|
usually be found on the <a class="ulink" href=
|
|
"http://www.recoll.org/pages/download.html" target=
|
|
"_top"><span class="application">Recoll</span> Web site
|
|
downloads page</a>.</p>
|
|
<p>The <span class="application">Windows</span> version of
|
|
Recoll comes in a self-contained setup file; there is
|
|
nothing else to install.</p>
|
|
<p>On <span class="application">Unix</span>-like systems,
|
|
the package management tools will automatically install
|
|
hard dependencies for packages obtained from a proper
|
|
package repository. You will have to deal with them by hand
|
|
for downloaded packages (for example, when <span class=
|
|
"command"><strong>dpkg</strong></span> complains about
|
|
missing dependencies).</p>
|
|
<p>In all cases, you will have to check or install
|
|
<a class="link" href="#RCL.INSTALL.EXTERNAL" title=
|
|
"5.2. Supporting packages">supporting applications</a>
|
|
for the file types that you want to index beyond those that
|
|
are natively processed by <span class=
|
|
"application">Recoll</span> (text, HTML, email files, and a
|
|
few others).</p>
|
|
<p>You may also want to have a look at the <a class="link"
|
|
href="#RCL.INSTALL.CONFIG" title=
|
|
"5.4. Configuration overview">configuration
|
|
section</a> (but this may not be necessary for a quick test
|
|
with default parameters). Most parameters can be more
|
|
conveniently set from the GUI.</p>
|
|
</div>
|
|
<div class="sect1">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h2 class="title" style="clear: both"><a name=
|
|
"RCL.INSTALL.EXTERNAL" id=
|
|
"RCL.INSTALL.EXTERNAL"></a>5.2. Supporting
|
|
packages</h2>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<div class="note" style=
|
|
"margin-left: 0.5in; margin-right: 0.5in;">
|
|
<h3 class="title">Note</h3>
|
|
<p>The <span class="application">Windows</span>
|
|
installation of <span class="application">Recoll</span>
|
|
is self-contained. <span class=
|
|
"application">Windows</span> users can skip this
|
|
section.</p>
|
|
</div>
|
|
<p><span class="application">Recoll</span> uses external
|
|
applications to index some file types. You need to install
|
|
them for the file types that you wish to have indexed
|
|
(these are optional run-time dependencies: none is needed
|
|
for building or running <span class=
|
|
"application">Recoll</span> except for indexing their
|
|
specific file type).</p>
|
|
<p>After an indexing pass, the commands that were found
|
|
missing can be displayed from the <span class=
|
|
"command"><strong>recoll</strong></span> <span class=
|
|
"guilabel">File</span> menu. The list is stored in the
|
|
<code class="filename">missing</code> text file inside the
|
|
configuration directory.</p>
|
|
<p>The past has proven that I was unable to maintain an up
|
|
to date application list in this manual. Please check
|
|
<a class="ulink" href=
|
|
"http://www.recoll.org/pages/features.html#doctypes"
|
|
target="_top">http://www.recoll.org/pages/features.html</a>
|
|
for a complete list along with links to the home pages or
|
|
best source/patches pages, and misc tips. What follows is
|
|
only a very short extract of the stable essentials.</p>
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style="list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p>PDF files need <span class=
|
|
"command"><strong>pdftotext</strong></span> which is
|
|
part of <span class="application">Poppler</span>
|
|
(usually comes with the <code class=
|
|
"literal">poppler-utils</code> package). Avoid the
|
|
original one from <span class=
|
|
"application">Xpdf</span>.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>MS Word documents need <span class=
|
|
"command"><strong>antiword</strong></span>. It is
|
|
also useful to have <span class=
|
|
"command"><strong>wvWare</strong></span> installed as
|
|
it may be used as a fallback for some files which
|
|
<span class=
|
|
"command"><strong>antiword</strong></span> does not
|
|
handle.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>RTF files need <span class=
|
|
"command"><strong>unrtf</strong></span>, which, in
|
|
its older versions, has much trouble with non-western
|
|
character sets. Many Linux distributions carry
|
|
outdated <span class=
|
|
"command"><strong>unrtf</strong></span> versions.
|
|
Check <a class="ulink" href=
|
|
"http://www.recoll.org/pages/features.html#doctypes"
|
|
target=
|
|
"_top">http://www.recoll.org/pages/features.html</a>
|
|
for details.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>Pictures: <span class="application">Recoll</span>
|
|
uses the <span class="application">Exiftool</span>
|
|
<span class="application">Perl</span> package to
|
|
extract tag information. Most image file formats are
|
|
supported.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>Up to <span class="application">Recoll</span>
|
|
1.24, many XML-based formats need the <span class=
|
|
"command"><strong>xsltproc</strong></span> command,
|
|
which usually comes with <span class=
|
|
"application">libxslt</span>. These are: abiword, fb2
|
|
ebooks, kword, openoffice, opendocument svg.
|
|
<span class="application">Recoll</span> 1.25 and
|
|
later process them internally (using libxslt).</p>
|
|
</li>
|
|
</ul>
|
|
</div>
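<p>Before running an indexing pass, it can be handy to check which of the helper commands listed above are present. The following sketch uses the standard Python <code class="literal">shutil.which()</code> call; the <code class="literal">HELPERS</code> list and the function names are invented for the example, and <code class="literal">exiftool</code> is assumed to be the executable name installed by the Exiftool package.</p>

```python
import shutil

# Command names taken from the list above; "exiftool" is assumed to be the
# executable name installed by the Exiftool package.
HELPERS = ["pdftotext", "antiword", "wvWare", "unrtf", "exiftool"]


def find_helpers(names=HELPERS):
    """Map each command name to its full path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


def missing_helpers(names=HELPERS):
    """Return the sorted names of the commands that were not found."""
    return sorted(n for n, path in find_helpers(names).items() if path is None)


if __name__ == "__main__":
    for name in missing_helpers():
        print("not found on PATH:", name)
```

<p>This only tests for the presence of a command on the search path. It does not tell whether the installed version is recent enough, which matters for <span class="command"><strong>unrtf</strong></span> as noted above.</p>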
|
|
</div>
|
|
<div class="sect1">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h2 class="title" style="clear: both"><a name=
|
|
"RCL.INSTALL.BUILDING" id=
|
|
"RCL.INSTALL.BUILDING"></a>5.3. Building from
|
|
source</h2>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.INSTALL.BUILDING.PREREQS" id=
|
|
"RCL.INSTALL.BUILDING.PREREQS"></a>5.3.1. Prerequisites</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>The following prerequisites are described in broad
|
|
terms and not as specific package names (which will
|
|
depend on the exact platform). The dependencies should be
|
|
available as packages on most common Unix derivatives,
|
|
and it should be quite uncommon that you would have to
|
|
build one of them.</p>
|
|
<p>If you do not need the GUI, you can avoid all GUI
|
|
dependencies by disabling its build (see the configure
options in the Building section below).</p>
|
|
<p>The shopping list:</p>
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style=
|
|
"list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p>If you start from git code, you will need the
|
|
<span class=
|
|
"command"><strong>autoconf</strong></span>,
|
|
<span class=
|
|
"command"><strong>automake</strong></span> and
|
|
<span class=
|
|
"command"><strong>libtool</strong></span> triad.
|
|
They are not needed for building from tar
|
|
distributions.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>C++ compiler. Recent versions require C++11
|
|
compatibility (1.23 and later).</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><span class=
|
|
"command"><strong>bison</strong></span> command
|
|
(for <span class="application">Recoll</span> 1.21
|
|
and later).</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>For building the documentation: the <span class=
|
|
"command"><strong>xsltproc</strong></span> command,
|
|
and the Docbook XML and style sheet files. You can
|
|
avoid this dependency by disabling documentation
|
|
building with the <code class=
|
|
"literal">--disable-userdoc</code> <span class=
|
|
"command"><strong>configure</strong></span>
|
|
option.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>Development files for <a class="ulink" href=
|
|
"http://www.xapian.org" target="_top"><span class=
|
|
"application">Xapian core</span></a>.</p>
|
|
<div class="important" style=
|
|
"margin-left: 0.5in; margin-right: 0.5in;">
|
|
<h3 class="title">Important</h3>
|
|
<p>If you are building Xapian for an older CPU
|
|
(before Pentium 4 or Athlon 64), you need to add
|
|
the <code class="option">--disable-sse</code>
|
|
flag to the configure command. Otherwise, all Xapian
applications will crash with an <code class=
|
|
"literal">illegal instruction</code> error.</p>
|
|
</div>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>Development files for <a class="ulink" href=
"http://qt-project.org/downloads" target=
"_top"><span class="application">Qt 5</span></a>
and its own dependencies (X11 etc.).</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>Development files for libxslt.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>Development files for <span class=
|
|
"application">zlib</span>.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>Development files for <span class=
|
|
"application">Python</span> (or use <code class=
|
|
"literal">--disable-python-module</code>).</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>Development files for libchm.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>You may also need <a class="ulink" href=
|
|
"http://www.gnu.org/software/libiconv/" target=
|
|
"_top">libiconv</a>. On <span class=
|
|
"application">Linux</span> systems, the iconv
|
|
interface is part of libc and you should not need
|
|
to do anything special.</p>
|
|
</li>
|
|
</ul>
|
|
</div>
|
|
<p>Check the <a class="ulink" href=
|
|
"http://www.recoll.org/pages/download.html" target=
|
|
"_top"><span class="application">Recoll</span> download
|
|
page</a> for up to date version information.</p>
|
|
</div>
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.INSTALL.BUILDING.BUILDING" id=
|
|
"RCL.INSTALL.BUILDING.BUILDING"></a>5.3.2. Building</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p><span class="application">Recoll</span> has been built
|
|
on Linux, FreeBSD, Mac OS X, and Solaris. Most versions
|
|
after 2005 should be ok, maybe some older ones too
|
|
(Solaris 8 used to be ok). If you build on another
|
|
system, and need to modify things, <a class="ulink" href=
|
|
"mailto:jfd@recoll.org" target="_top">I would very much
|
|
welcome patches</a>.</p>
|
|
<p><b>Configure options: </b></p>
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style=
|
|
"list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p><code class="option">--without-aspell</code>
|
|
will disable the code for phonetic matching of
|
|
search terms.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><code class="option">--with-fam</code> or
|
|
<code class="option">--with-inotify</code> will
|
|
enable the code for real time indexing. Inotify
|
|
support is enabled by default on Linux systems.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><code class="option">--with-qzeitgeist</code>
|
|
will enable sending <span class=
|
|
"application">Zeitgeist</span> events about the
|
|
visited search results, and needs the <span class=
|
|
"application">qzeitgeist</span> package.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><code class="option">--disable-webkit</code> is
|
|
available from version 1.17 to implement the result
|
|
list with a <span class="application">Qt</span>
|
|
QTextBrowser instead of a WebKit widget if you do
|
|
not or can't depend on the latter.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><code class="option">--disable-qtgui</code>
|
|
Disable the Qt interface. Will allow building the
|
|
indexer and the command line search program in
|
|
absence of a Qt environment.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><code class="option">--enable-webengine</code>
|
|
Enable the use of Qt Webengine (only meaningful if
|
|
the Qt GUI is enabled), in place of Qt Webkit.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><code class="option">--disable-idxthreads</code>
|
|
is available from version 1.19 to suppress
|
|
multithreading inside the indexing process. You can
|
|
also use the run-time configuration to restrict
|
|
<span class=
|
|
"command"><strong>recollindex</strong></span> to
|
|
using a single thread, but the compile-time option
|
|
may disable a few more unused locks. This only
|
|
applies to the use of multithreading for the core
|
|
index processing (data input). The <span class=
|
|
"application">Recoll</span> monitor mode always
|
|
uses at least two threads of execution.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><code class=
|
|
"option">--disable-python-module</code> will avoid
|
|
building the <span class=
|
|
"application">Python</span> module.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><code class="option">--disable-python-chm</code>
|
|
will avoid building the Python libchm interface
|
|
used to index CHM files.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><code class="option">--enable-camelcase</code>
|
|
will enable splitting <em class=
|
|
"replaceable"><code>camelCase</code></em> words.
|
|
This is not enabled by default as it has the
|
|
unfortunate side-effect of making some phrase
|
|
searches quite confusing: for example, <code class=
|
|
"literal">"MySQL manual"</code> would be matched by
|
|
<code class="literal">"MySQL manual"</code> and
|
|
<code class="literal">"my sql manual"</code> but
|
|
not <code class="literal">"mysql manual"</code>
|
|
(only inside phrase searches).</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><code class="option">--with-file-command</code>
|
|
Specify the version of the 'file' command to use
|
|
(e.g. --with-file-command=/usr/local/bin/file). Can
be useful to enable the GNU version on systems
|
|
where the native one is bad.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><code class="option">--disable-x11mon</code>
|
|
Disable <span class="application">X11</span>
|
|
connection monitoring inside recollindex. Together
|
|
with --disable-qtgui, this allows building recoll
|
|
without <span class="application">Qt</span> and
|
|
<span class="application">X11</span>.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><code class="option">--disable-userdoc</code>
|
|
will avoid building the user manual. This avoids
|
|
having to install the Docbook XML/XSL files and the
|
|
TeX toolchain used for translating the manual to
|
|
PDF.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><code class="option">--enable-recollq</code>
|
|
Enable building the <span class=
|
|
"command"><strong>recollq</strong></span> command
|
|
line query tool (recoll -t without need for Qt).
|
|
This is done by default if --disable-qtgui is set
|
|
but this option enables forcing it.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><code class="option">--disable-pic</code>
|
|
(<span class="application">Recoll</span> versions
|
|
up to 1.21 only) will compile <span class=
"application">Recoll</span> with position-dependent
|
|
code. This is incompatible with building the KIO or
|
|
the <span class="application">Python</span> or
|
|
<span class="application">PHP</span> extensions,
|
|
but might yield very marginally faster code.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>Of course the usual <span class=
|
|
"application">autoconf</span> <span class=
|
|
"command"><strong>configure</strong></span>
|
|
options, like <code class="option">--prefix</code>
|
|
apply.</p>
|
|
</li>
|
|
</ul>
|
|
</div>
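<p>As an illustration of how the options above combine, the following sketch assembles a <span class="command"><strong>configure</strong></span> command line for a build without the GUI. The <code class="literal">configure_args</code> helper and its parameters are invented for the example; only the option names come from the list above, and the sketch does not run anything.</p>

```python
def configure_args(headless=False, prefix=None, userdoc=True):
    """Return the configure command line as a list of arguments."""
    args = ["./configure"]
    if headless:
        # Together, these two options allow building without Qt and X11.
        args += ["--disable-qtgui", "--disable-x11mon"]
    if not userdoc:
        # Avoids the Docbook and TeX dependencies.
        args.append("--disable-userdoc")
    if prefix:
        args.append("--prefix=" + prefix)
    return args


if __name__ == "__main__":
    print(" ".join(configure_args(headless=True, userdoc=False,
                                  prefix="/usr/local")))
```

<p>The resulting list can be passed to <code class="literal">subprocess.run()</code> from the top of the source tree, or simply printed and pasted into a shell.</p>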
|
|
<p>Normal procedure (for source extracted from a tar
|
|
distribution):</p>
|
|
<pre class="screen">
|
|
<strong class=
|
|
"userinput"><code>cd recoll-xxx</code></strong>
|
|
<strong class=
|
|
"userinput"><code>./configure</code></strong>
|
|
<strong class="userinput"><code>make</code></strong>
|
|
<strong class=
|
|
"userinput"><code>(practices usual hardship-repelling invocations)</code></strong>
|
|
</pre>
|
|
<p>When building from source cloned from the git
|
|
repository, you also need to install <span class=
|
|
"application">autoconf</span>, <span class=
|
|
"application">automake</span>, and <span class=
|
|
"application">libtool</span> and you must execute
|
|
<code class="literal">sh autogen.sh</code> in the top
|
|
source directory before running <code class=
|
|
"literal">configure</code>.</p>
|
|
</div>
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.INSTALL.BUILDING.INSTALL" id=
|
|
"RCL.INSTALL.BUILDING.INSTALL"></a>5.3.3. Installing</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>Use <strong class="userinput"><code>make
|
|
install</code></strong> in the root of the source tree.
|
|
This will copy the commands to <code class=
|
|
"filename"><em class=
|
|
"replaceable"><code>prefix</code></em>/bin</code> and the
|
|
sample configuration files, scripts and other shared data
|
|
to <code class="filename"><em class=
|
|
"replaceable"><code>prefix</code></em>/share/recoll</code>.</p>
|
|
</div>
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.INSTALL.BUILDING.PYTHON" id=
|
|
"RCL.INSTALL.BUILDING.PYTHON"></a>5.3.4. Python
|
|
API package</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>The Python interface can be found in the source tree,
|
|
under the <code class="filename">python/recoll</code>
|
|
directory.</p>
|
|
<p>As of <span class="application">Recoll</span> 1.19,
|
|
the module can be compiled for Python3.</p>
|
|
<p>The normal <span class="application">Recoll</span>
|
|
build procedure (see above) installs the API package for
|
|
the default system version (python) along with the main
|
|
code. The package for other Python versions (e.g. python3
|
|
if the system default is python2) must be explicitly
|
|
built and installed.</p>
|
|
<p>The <code class="filename">python/recoll/</code>
|
|
directory contains the usual <code class=
|
|
"filename">setup.py</code>. After configuring and
|
|
building the main <span class="application">Recoll</span>
|
|
code, you can use the script to build and install the
|
|
Python module:</p>
|
|
<pre class="screen">
|
|
<strong class=
|
|
"userinput"><code>cd recoll-xxx/python/recoll</code></strong>
|
|
<strong class=
|
|
"userinput"><code>pythonX setup.py build</code></strong>
|
|
<strong class=
|
|
"userinput"><code>sudo pythonX setup.py install</code></strong>
|
|
</pre>
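<p>After installing, a quick way to check that a given interpreter can see the package is to ask the import system for it. This is a minimal sketch for Python 3 (it relies on <code class="literal">importlib.util.find_spec()</code>); the <code class="literal">module_available</code> name is invented for the example.</p>

```python
import importlib.util


def module_available(name):
    """Return True if a top-level module or package of this name is found."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    print("recoll module found:", module_available("recoll"))
```

<p>Run it with the same <span class="command"><strong>pythonX</strong></span> interpreter that was used for <code class="filename">setup.py</code>, as each Python version has its own package directory.</p>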
|
|
</div>
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.INSTALL.BUILDING.SOLARIS" id=
|
|
"RCL.INSTALL.BUILDING.SOLARIS"></a>5.3.5. Building
|
|
on Solaris</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>We did not test building the GUI on Solaris for recent
|
|
versions. You will need at least Qt 4.4. There are some
|
|
hints on <a class="ulink" href=
"http://www.recoll.org/download-1.14.html" target=
"_top">an old web site page</a>, which may still be
|
|
valid.</p>
|
|
<p>The 1.19 indexer and Python module builds have been tested
and do work, with a few minor glitches. Be sure
|
|
to use GNU <span class=
|
|
"command"><strong>make</strong></span> and <span class=
|
|
"command"><strong>install</strong></span>.</p>
|
|
</div>
|
|
</div>
|
|
<div class="sect1">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h2 class="title" style="clear: both"><a name=
|
|
"RCL.INSTALL.CONFIG" id=
|
|
"RCL.INSTALL.CONFIG"></a>5.4. Configuration
|
|
overview</h2>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>Most of the parameters specific to the <span class=
|
|
"command"><strong>recoll</strong></span> GUI are set
|
|
through the <span class="guilabel">Preferences</span> menu
|
|
and stored in the standard Qt place (<code class=
|
|
"filename">$HOME/.config/Recoll.org/recoll.conf</code>).
|
|
You probably do not want to edit this by hand.</p>
|
|
<p><span class="application">Recoll</span> indexing options
|
|
are set inside text configuration files located in a
|
|
configuration directory. There can be several such
|
|
directories, each of which defines the parameters for one
|
|
index.</p>
|
|
<p>The configuration files can be edited by hand or through
|
|
the <span class="guilabel">Index configuration</span>
|
|
dialog (<span class="guilabel">Preferences</span> menu).
|
|
The GUI tool will try to respect your formatting and
|
|
comments as much as possible, so it is quite possible to
|
|
use both approaches on the same configuration.</p>
|
|
<p>The most accurate documentation for the configuration
|
|
parameters is given by comments inside the default files,
|
|
and we will just give a general overview here.</p>
|
|
<p>For each index, there are at least two sets of
|
|
configuration files. System-wide configuration files are
|
|
kept in a directory named like <code class=
|
|
"filename">/usr/share/recoll/examples</code>, and define
|
|
default values, shared by all indexes. For each index, a
|
|
parallel set of files defines the customized
|
|
parameters.</p>
|
|
<p>The default location of the customized configuration is
|
|
the <code class="filename">.recoll</code> directory in your
|
|
home. Most people will only use this directory.</p>
|
|
<p>This location can be changed, or others can be added
|
|
with the <code class="envar">RECOLL_CONFDIR</code>
|
|
environment variable or the <code class="option">-c</code>
|
|
option parameter to <span class=
|
|
"command"><strong>recoll</strong></span> and <span class=
|
|
"command"><strong>recollindex</strong></span>.</p>
|
|
<p>In addition (as of <span class=
|
|
"application">Recoll</span> version 1.19.7), it is possible
|
|
to specify two additional configuration directories which
|
|
will be stacked before and after the user configuration
|
|
directory. These are defined by the <code class=
|
|
"envar">RECOLL_CONFTOP</code> and <code class=
|
|
"envar">RECOLL_CONFMID</code> environment variables. Values
|
|
from configuration files inside the top directory will
|
|
override user ones, values from configuration files inside
|
|
the middle directory will override system ones and be
|
|
overridden by user ones. These two variables may be of use
|
|
to applications which augment <span class=
|
|
"application">Recoll</span> functionality, and need to add
|
|
configuration data without disturbing the user's files.
|
|
Please note that the two, currently single, values will
|
|
probably be interpreted as colon-separated lists in the
|
|
future: do not use colon characters inside the directory
|
|
paths.</p>
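<p>The resulting priority order (top, then user, then middle, then system) can be pictured as a chain of dictionaries searched in turn. The sketch below uses the standard Python <code class="literal">collections.ChainMap</code> class to model this. It is not how <span class="application">Recoll</span> implements the lookup, and the <code class="literal">effective_config</code> function and the parameter names in the usage lines are invented for the example.</p>

```python
from collections import ChainMap


def effective_config(system, mid=None, user=None, top=None):
    """Stack the four levels; the first mapping holding a name wins."""
    return ChainMap(top or {}, user or {}, mid or {}, system)


if __name__ == "__main__":
    cfg = effective_config(
        system={"a": "system", "b": "system", "c": "system"},
        mid={"b": "mid", "c": "mid"},
        user={"c": "user"},
    )
    for name in sorted(cfg):
        print(name, "=", cfg[name])
```

<p>A value set in the middle directory is seen only as long as neither the user nor the top directory defines the same name.</p>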
|
|
<p>If the <code class="filename">.recoll</code> directory
|
|
does not exist when <span class=
|
|
"command"><strong>recoll</strong></span> or <span class=
|
|
"command"><strong>recollindex</strong></span> are started,
|
|
it will be created with a set of empty configuration files.
|
|
<span class="command"><strong>recoll</strong></span> will
|
|
give you a chance to edit the configuration file before
|
|
starting indexing. <span class=
|
|
"command"><strong>recollindex</strong></span> will proceed
|
|
immediately. To avoid mistakes, the automatic directory
|
|
creation will only occur for the default location, not if
|
|
<code class="option">-c</code> or <code class=
|
|
"envar">RECOLL_CONFDIR</code> were used (in the latter
|
|
cases, you will have to create the directory).</p>
|
|
<p>All configuration files share the same format. For
|
|
example, a short extract of the main configuration file
|
|
might look as follows:</p>
|
|
<pre class="programlisting">
|
|
# Space-separated list of files and directories to index.
|
|
topdirs = ~/docs /usr/share/doc
|
|
|
|
[~/somedirectory-with-utf8-txt-files]
|
|
defaultcharset = utf-8
|
|
</pre>
|
|
<p>There are three kinds of lines:</p>
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style="list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p>Comment (starts with <span class=
|
|
"emphasis"><em>#</em></span>) or empty.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>Parameter assignment (<span class=
|
|
"emphasis"><em>name = value</em></span>).</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>Section definition ([<span class=
|
|
"emphasis"><em>somedirname</em></span>]).</p>
|
|
</li>
|
|
</ul>
|
|
</div>
|
|
<p>Long lines can be broken by ending each incomplete part
|
|
with a backslash (<code class="literal">\</code>).</p>
|
|
<p>Depending on the type of configuration file, section
|
|
definitions either separate groups of parameters or allow
|
|
redefining some parameters for a directory sub-tree. They
|
|
stay in effect until another section definition, or the end
|
|
of file, is encountered. Some of the parameters used for
|
|
indexing are looked up hierarchically from the current
|
|
directory location upwards. Not all parameters can be
|
|
meaningfully redefined; this is specified for each in the
|
|
next section.</p>
|
|
<div class="important" style=
|
|
"margin-left: 0.5in; margin-right: 0.5in;">
|
|
<h3 class="title">Important</h3>
|
|
<p>Global parameters <span class="emphasis"><em>must
|
|
not</em></span> be defined in a directory subsection,
|
|
else they will not be found at all by the <span class=
|
|
"application">Recoll</span> code, which looks for them at
|
|
the top level (e.g. <code class=
|
|
"literal">skippedPaths</code>).</p>
|
|
</div>
|
|
<p>When found at the beginning of a file path, the tilde
|
|
character (~) is expanded to the name of the user's home
|
|
directory, as a shell would do.</p>
|
|
<p>Some parameters are lists of strings. White space is
|
|
used for separation. List elements with embedded spaces can
|
|
be quoted using double-quotes. Double quotes inside these
|
|
elements can be escaped with a backslash.</p>
|
|
<p>No value inside a configuration file can contain a
|
|
newline character. Long lines can be continued by escaping
|
|
the physical newline with backslash, even inside quoted
|
|
strings.</p>
|
|
<pre class="programlisting">
|
|
astringlist = "some string \
|
|
with spaces"
|
|
thesame = "some string with spaces"
|
|
</pre>
|
|
<p>Parameters which are not part of string lists can't be
|
|
quoted, and leading and trailing space characters are
|
|
stripped before the value is used.</p>
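<p>The format rules above are simple enough to be sketched as a small reader. The following is an approximation written for illustration, not the parser used by <span class="application">Recoll</span>: the function names are invented, continuation lines are joined verbatim, and list values are split with the standard <code class="literal">shlex</code> module, which also honours single quotes (an assumption that may not match the real behaviour).</p>

```python
import shlex

SAMPLE = r'''
# Space-separated list of files and directories to index.
topdirs = ~/docs /usr/share/doc
astringlist = "some string \
with spaces"

[~/somedirectory-with-utf8-txt-files]
defaultcharset = utf-8
'''


def parse_config(text):
    """Return {section: {name: value}}; the top level is section ''."""
    result = {"": {}}
    section = ""
    pending = ""
    for raw in text.splitlines():
        line = pending + raw
        if line.endswith("\\"):
            # Continuation: drop the backslash and join with the next line.
            pending = line[:-1]
            continue
        pending = ""
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # empty line or comment
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]  # section definition
            result.setdefault(section, {})
            continue
        name, sep, value = line.partition("=")
        if sep:
            # Leading and trailing spaces are stripped from the value.
            result[section][name.strip()] = value.strip()
    return result


def as_list(value):
    """Split a string-list value, honouring double quotes and backslashes."""
    return shlex.split(value)


if __name__ == "__main__":
    cfg = parse_config(SAMPLE)
    print(as_list(cfg[""]["topdirs"]))
    print(as_list(cfg[""]["astringlist"]))
```

<p>The hierarchical lookup of per-directory sections and the tilde expansion are not modelled here; <code class="literal">os.path.expanduser()</code> would be one way to handle the latter.</p>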
|
|
<p><b>Encoding issues. </b>Most of the configuration
|
|
parameters are plain ASCII. Two particular sets of values
|
|
may cause encoding issues:</p>
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style="list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p>File path parameters may contain non-ascii
|
|
characters and should use the exact same byte values
|
|
as found in the file system directory. Usually, this
|
|
means that the configuration file should use the
|
|
system default locale encoding.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>The <code class="envar">unac_except_trans</code>
|
|
parameter should be encoded in UTF-8. If your system
|
|
locale is not UTF-8, and you need to also specify
|
|
non-ascii file paths, this poses a difficulty because
|
|
common text editors cannot handle multiple encodings
|
|
in a single file. In this relatively unlikely case,
|
|
you can edit the configuration file as two separate
|
|
text files with appropriate encodings, and
|
|
concatenate them to create the complete
|
|
configuration.</p>
|
|
</li>
|
|
</ul>
|
|
</div>
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.INSTALL.CONFIG.ENVIR" id=
|
|
"RCL.INSTALL.CONFIG.ENVIR"></a>5.4.1. Environment
|
|
variables</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<div class="variablelist">
|
|
<dl class="variablelist">
|
|
<dt><span class="term"><code class=
|
|
"varname">RECOLL_CONFDIR</code></span></dt>
|
|
<dd>
|
|
<p>Defines the main configuration directory.</p>
|
|
</dd>
|
|
<dt><span class="term"><code class=
|
|
"varname">RECOLL_TMPDIR, TMPDIR</code></span></dt>
|
|
<dd>
|
|
<p>Locations for temporary files, in this order of
|
|
priority. The default if none of these is set is to
|
|
use <code class="filename">/tmp</code>. Big
|
|
temporary files may be created during indexing,
|
|
mostly for decompressing, and also for processing,
|
|
e.g. email attachments.</p>
|
|
</dd>
|
|
<dt><span class="term"><code class=
|
|
"varname">RECOLL_CONFTOP,
|
|
RECOLL_CONFMID</code></span></dt>
|
|
<dd>
|
|
<p>Allow adding configuration directories with
priorities below and above the user directory (see
the Configuration overview section above for
details).</p>
|
|
</dd>
|
|
<dt><span class="term"><code class=
|
|
"varname">RECOLL_EXTRA_DBS,
|
|
RECOLL_ACTIVE_EXTRA_DBS</code></span></dt>
|
|
<dd>
|
|
<p>Help for setting up external indexes. See
|
|
<a class="link" href="#RCL.SEARCH.GUI.MULTIDB"
|
|
title="3.2.10. Multiple indexes">this
|
|
paragraph</a> for explanations.</p>
|
|
</dd>
|
|
<dt><span class="term"><code class=
|
|
"varname">RECOLL_DATADIR</code></span></dt>
|
|
<dd>
|
|
<p>Defines a replacement for the default location of
Recoll data files, normally found in, e.g.,
<code class=
"filename">/usr/share/recoll</code>.</p>
|
|
</dd>
|
|
<dt><span class="term"><code class=
|
|
"varname">RECOLL_FILTERSDIR</code></span></dt>
|
|
<dd>
|
|
<p>Defines a replacement for the default location of
Recoll filters, normally found in, e.g.,
<code class=
"filename">/usr/share/recoll/filters</code>.</p>
|
|
</dd>
|
|
<dt><span class="term"><code class=
|
|
"varname">ASPELL_PROG</code></span></dt>
|
|
<dd>
|
|
<p><span class=
|
|
"command"><strong>aspell</strong></span> program to
|
|
use for creating the spelling dictionary. The
|
|
result has to be compatible with the <code class=
|
|
"filename">libaspell</code> which <span class=
|
|
"application">Recoll</span> is using.</p>
|
|
</dd>
|
|
</dl>
|
|
</div>
|
|
</div>
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF"></a>5.4.2. Recoll
|
|
main configuration file, recoll.conf</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<div class="sect3">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h4 class="title"><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.WHATDOCS" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.WHATDOCS"></a>Parameters
|
|
affecting what documents we index</h4>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<div class="variablelist">
|
|
<dl class="variablelist">
|
|
<dt><a name="RCL.INSTALL.CONFIG.RECOLLCONF.TOPDIRS"
|
|
id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.TOPDIRS"></a><span class="term"><code class="varname">topdirs</code></span></dt>
|
|
<dd>
|
|
<p>Space-separated list of files or directories
to recursively index. Defaults to ~ (indexes
$HOME). You can use symbolic links in the list;
they will be followed, independently of the value
of the followLinks variable.</p>
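<p>As an illustration only, an entry indexing the home
directory and one extra tree could look like the following
sketch. The <code class="filename">/srv/shared/docs</code>
path is a hypothetical example, not a default.</p>
<pre class="programlisting">
# Index the home directory and a (hypothetical) shared tree
topdirs = ~ /srv/shared/docs
</pre>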
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.MONITORDIRS" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.MONITORDIRS"></a><span class="term"><code class="varname">monitordirs</code></span></dt>
|
|
<dd>
|
|
<p>Space-separated list of files or directories
|
|
to monitor for updates. When running the
|
|
real-time indexer, this allows monitoring only a
|
|
subset of the whole indexed area. The elements
|
|
must be included in the tree defined by the
|
|
'topdirs' members.</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.SKIPPEDNAMES" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.SKIPPEDNAMES"></a><span class="term"><code class="varname">skippedNames</code></span></dt>
|
|
<dd>
|
|
<p>Files and directories which should be ignored.
Whitespace-separated list of simple wildcard patterns
(not paths: they must not contain '/'), which are
tested against file and directory names. The list in
the default configuration does not exclude hidden
directories (names beginning with a dot), which means
that it may index quite a few things that you do not
want. On the other hand, email user agents like
Thunderbird usually store messages in hidden
directories, and you probably want these indexed. One
possible solution is to have ".*" in "skippedNames",
and add things like "~/.thunderbird" "~/.evolution" to
"topdirs". Not even the file names are indexed for
names matching the patterns in this list; see the
"noContentSuffixes" variable for an alternative
approach which indexes the file names. Can be
redefined for any subtree.</p>
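<p>A minimal sketch of the approach suggested above,
assuming that the <code class="varname">skippedNames+</code>
form described below is used so that the default list is
kept:</p>
<pre class="programlisting">
# Skip all hidden files and directories...
skippedNames+ = .*
# ...but still index selected mail stores by listing them explicitly
topdirs = ~ ~/.thunderbird ~/.evolution
</pre>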
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.SKIPPEDNAMES-" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.SKIPPEDNAMES-"></a><span class="term"><code class="varname">skippedNames-</code></span></dt>
|
|
<dd>
|
|
<p>List of patterns to remove from the
default skippedNames list.</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.SKIPPEDNAMES+" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.SKIPPEDNAMES+"></a><span class="term"><code class="varname">skippedNames+</code></span></dt>
|
|
<dd>
|
|
<p>List of patterns to add to the default
skippedNames list.</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.ONLYNAMES" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.ONLYNAMES"></a><span class="term"><code class="varname">onlyNames</code></span></dt>
|
|
<dd>
|
|
<p>Regular file name filter patterns. If this is
set, only the file names not in skippedNames and
matching one of the patterns will be considered
for indexing. Can be redefined per subtree. Does
not apply to directories.</p>
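<p>For example, the following sketch restricts indexing to
two kinds of files in one subtree. It assumes the usual
bracketed directory section syntax of the configuration
file; the <code class="filename">/home/me/notes</code> path
and the patterns are hypothetical.</p>
<pre class="programlisting">
# Parameters after a section line apply to that subtree
[/home/me/notes]
# Only consider plain text and Markdown files here
onlyNames = *.txt *.md
</pre>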
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.NOCONTENTSUFFIXES"
|
|
id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.NOCONTENTSUFFIXES"></a><span class="term"><code class="varname">noContentSuffixes</code></span></dt>
|
|
<dd>
|
|
<p>List of name endings (not necessarily
|
|
dot-separated suffixes) for which we don't try
|
|
MIME type identification, and don't uncompress or
|
|
index content. Only the names will be indexed.
|
|
This complements the now obsoleted recoll_noindex
|
|
list from the mimemap file, which will go away in
|
|
a future release (the move from mimemap to
|
|
recoll.conf allows editing the list through the
|
|
GUI). This is different from skippedNames because
|
|
these are name ending matches only (not wildcard
|
|
patterns), and the file name itself gets indexed
|
|
normally. This can be redefined for
|
|
subdirectories.</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.NOCONTENTSUFFIXES-"
|
|
id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.NOCONTENTSUFFIXES-"></a><span class="term"><code class="varname">noContentSuffixes-</code></span></dt>
|
|
<dd>
|
|
<p>List of name endings to remove from the
|
|
default noContentSuffixes list.</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.NOCONTENTSUFFIXES+"
|
|
id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.NOCONTENTSUFFIXES+"></a><span class="term"><code class="varname">noContentSuffixes+</code></span></dt>
|
|
<dd>
|
|
<p>List of name endings to add to the default
|
|
noContentSuffixes list.</p>
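<p>A short sketch of extending the default list. The two
endings are hypothetical examples; files with these endings
would have their names indexed, but not their contents.</p>
<pre class="programlisting">
# Index only the names of files with these endings
noContentSuffixes+ = .dump .bak
</pre>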
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.SKIPPEDPATHS" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.SKIPPEDPATHS"></a><span class="term"><code class="varname">skippedPaths</code></span></dt>
|
|
<dd>
|
|
<p>Absolute paths we should not go into.
|
|
Space-separated list of wildcard expressions for
|
|
absolute filesystem paths. Must be defined at the
|
|
top level of the configuration file, not in a
|
|
subsection. Can contain files and directories.
|
|
The database and configuration directories will
|
|
automatically be added. The expressions are
|
|
matched using 'fnmatch(3)' with the FNM_PATHNAME
|
|
flag set by default. This means that '/'
|
|
characters must be matched explicitly. You can
|
|
set 'skippedPathsFnmPathname' to 0 to disable the
|
|
use of FNM_PATHNAME (meaning that '/*/dir3' will
|
|
match '/dir1/dir2/dir3'). The default value
|
|
contains the usual mount point for removable
|
|
media to remind you that it is a bad idea to have
|
|
Recoll work on these (esp. with the monitor:
|
|
media gets indexed on mount, all data gets erased
|
|
on unmount). Explicitly adding '/media/xxx' to
|
|
the 'topdirs' variable will override this.</p>
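<p>The effect of FNM_PATHNAME can be illustrated with the
following sketch. The paths are hypothetical, and
<code class="filename">/media</code> is listed again on the
assumption that setting the variable replaces the default
value.</p>
<pre class="programlisting">
# Keep removable media excluded and skip each user's top-level tmp
# directory. With FNM_PATHNAME in effect (the default), '*' does not
# match '/', so /home/*/tmp matches /home/alice/tmp but not
# /home/alice/work/tmp.
skippedPaths = /media /home/*/tmp
</pre>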
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.SKIPPEDPATHSFNMPATHNAME"
|
|
id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.SKIPPEDPATHSFNMPATHNAME">
|
|
</a><span class="term"><code class=
|
|
"varname">skippedPathsFnmPathname</code></span></dt>
|
|
<dd>
|
|
<p>Set to 0 to override use of FNM_PATHNAME for
|
|
matching skipped paths.</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.NOWALKFN" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.NOWALKFN"></a><span class="term"><code class="varname">nowalkfn</code></span></dt>
|
|
<dd>
|
|
<p>File name which will cause its parent
|
|
directory to be skipped. Any directory containing
|
|
a file with this name will be skipped as if it
|
|
was part of the skippedPaths list. Ex:
|
|
.recoll-noindex</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.DAEMSKIPPEDPATHS"
|
|
id="RCL.INSTALL.CONFIG.RECOLLCONF.DAEMSKIPPEDPATHS">
|
|
</a><span class="term"><code class=
|
|
"varname">daemSkippedPaths</code></span></dt>
|
|
<dd>
|
|
<p>skippedPaths equivalent specific to real time
|
|
indexing. This enables having parts of the tree
|
|
which are initially indexed but not monitored. If
|
|
daemSkippedPaths is not set, the daemon uses
|
|
skippedPaths.</p>
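<p>A possible use, sketched with a hypothetical archive
directory which is indexed by batch runs but not watched by
the real-time indexer:</p>
<pre class="programlisting">
skippedPaths = /media
# Used by the monitoring daemon instead of skippedPaths
daemSkippedPaths = /media /home/me/archive
</pre>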
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.ZIPUSESKIPPEDNAMES"
|
|
id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.ZIPUSESKIPPEDNAMES"></a><span class="term"><code class="varname">zipUseSkippedNames</code></span></dt>
|
|
<dd>
|
|
<p>Use skippedNames inside Zip archives. Fetched
|
|
directly by the rclzip handler. Skip the patterns
|
|
defined by skippedNames inside Zip archives. Can
|
|
be redefined for subdirectories. See
|
|
https://www.lesbonscomptes.com/recoll/faqsandhowtos/FilteringOutZipArchiveMembers.html</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.ZIPSKIPPEDNAMES" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.ZIPSKIPPEDNAMES"></a><span class="term"><code class="varname">zipSkippedNames</code></span></dt>
|
|
<dd>
|
|
<p>Space-separated list of wildcard expressions
|
|
for names that should be ignored inside zip
|
|
archives. This is used directly by the zip
|
|
handler. If zipUseSkippedNames is not set,
|
|
zipSkippedNames defines the patterns to be
|
|
skipped inside archives. If zipUseSkippedNames is
|
|
set, the two lists are concatenated and used. Can
|
|
be redefined for subdirectories. See
|
|
https://www.lesbonscomptes.com/recoll/faqsandhowtos/FilteringOutZipArchiveMembers.html</p>
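<p>As a sketch of how the two parameters combine (the
patterns are hypothetical examples):</p>
<pre class="programlisting">
# Apply the regular skippedNames patterns inside Zip archives too
zipUseSkippedNames = 1
# Additional patterns, used only inside archives
zipSkippedNames = *.class *.o
</pre>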
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.FOLLOWLINKS" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.FOLLOWLINKS"></a><span class="term"><code class="varname">followLinks</code></span></dt>
|
|
<dd>
|
|
<p>Follow symbolic links during indexing. The
|
|
default is to ignore symbolic links to avoid
|
|
multiple indexing of linked files. No effort is
|
|
made to avoid duplication when this option is set
|
|
to true. This option can be set individually for
|
|
each of the 'topdirs' members by using sections.
|
|
It cannot be changed below the 'topdirs' level.
|
|
Links in the 'topdirs' list itself are always
|
|
followed.</p>
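<p>A minimal sketch of a per-member setting, assuming the
usual bracketed directory section syntax. Both paths are
hypothetical.</p>
<pre class="programlisting">
topdirs = /home/me/docs /home/me/archive

# Follow symbolic links only inside this topdirs member
[/home/me/archive]
followLinks = 1
</pre>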
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.INDEXEDMIMETYPES"
|
|
id="RCL.INSTALL.CONFIG.RECOLLCONF.INDEXEDMIMETYPES">
|
|
</a><span class="term"><code class=
|
|
"varname">indexedmimetypes</code></span></dt>
|
|
<dd>
|
|
<p>Restrictive list of indexed mime types.
|
|
Normally not set (in which case all supported
|
|
types are indexed). If it is set, only the types
|
|
from the list will have their contents indexed.
|
|
The names will be indexed anyway if
|
|
indexallfilenames is set (default). MIME type
|
|
names should be taken from the mimemap file (the
|
|
values may be different from xdg-mime or file -i
|
|
output in some cases). Can be redefined for
|
|
subtrees.</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.EXCLUDEDMIMETYPES"
|
|
id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.EXCLUDEDMIMETYPES"></a><span class="term"><code class="varname">excludedmimetypes</code></span></dt>
|
|
<dd>
|
|
<p>List of excluded MIME types. Lets you exclude
|
|
some types from indexing. MIME type names should
|
|
be taken from the mimemap file (the values may be
|
|
different from xdg-mime or file -i output in some
|
|
cases). Can be redefined for subtrees.</p>
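<p>For instance, the following sketch excludes two types
from content indexing. The type names are given as an
assumption and should be checked against the mimemap
file.</p>
<pre class="programlisting">
# Do not index the contents of JPEG images and MP4 videos
excludedmimetypes = image/jpeg video/mp4
</pre>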
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.NOMD5TYPES" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.NOMD5TYPES"></a><span class="term"><code class="varname">nomd5types</code></span></dt>
|
|
<dd>
|
|
<p>Don't compute md5 for these types. md5
|
|
checksums are used only for deduplicating
|
|
results, and can be very expensive to compute on
|
|
multimedia or other big files. This list lets you
|
|
turn off md5 computation for selected types. It
|
|
is global (no redefinition for subtrees). At the
|
|
moment, it only has an effect for external
|
|
handlers (exec and execm). The file types can be
|
|
specified by listing either MIME types (e.g.
|
|
audio/mpeg) or handler names (e.g. rclaudio).</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.COMPRESSEDFILEMAXKBS"
|
|
id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.COMPRESSEDFILEMAXKBS">
|
|
</a><span class="term"><code class=
|
|
"varname">compressedfilemaxkbs</code></span></dt>
|
|
<dd>
|
|
<p>Size limit for compressed files. We need to
|
|
decompress these in a temporary directory for
|
|
identification, which can be wasteful in some
|
|
cases. Limit the waste. Negative means no limit.
|
|
0 results in no processing of any compressed
|
|
file. Default 50 MB.</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.TEXTFILEMAXMBS" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.TEXTFILEMAXMBS"></a><span class="term"><code class="varname">textfilemaxmbs</code></span></dt>
|
|
<dd>
|
|
<p>Size limit for text files. Mostly for skipping
|
|
monster logs. Default 20 MB.</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.INDEXALLFILENAMES"
|
|
id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.INDEXALLFILENAMES"></a><span class="term"><code class="varname">indexallfilenames</code></span></dt>
|
|
<dd>
|
|
<p>Index the file names of unprocessed files.
Index the names of files whose contents we do not
index because of an excluded or unsupported
MIME type.</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.USESYSTEMFILECOMMAND"
|
|
id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.USESYSTEMFILECOMMAND">
|
|
</a><span class="term"><code class=
|
|
"varname">usesystemfilecommand</code></span></dt>
|
|
<dd>
|
|
<p>Use a system command for file MIME type
guessing as a final step in file type
identification. This is generally useful, but will
usually cause the indexing of many bogus 'text'
files. See 'systemfilecommand' for the command
used.</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.SYSTEMFILECOMMAND"
|
|
id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.SYSTEMFILECOMMAND"></a><span class="term"><code class="varname">systemfilecommand</code></span></dt>
|
|
<dd>
|
|
<p>Command used to guess MIME types if the
internal methods fail. This should be a "file -i"
workalike. The file path will be added as a last
parameter to the command line. "xdg-mime" works
better than the traditional "file" command, and
is now the configured default (with a hard-coded
fallback to "file").</p>
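<p>A sketch of enabling the external guess. The exact
command line shown is an assumption about a typical
xdg-mime invocation, not a value taken from this
section.</p>
<pre class="programlisting">
# Fall back to an external command when internal methods fail
usesystemfilecommand = 1
# The file path is appended as the last argument
systemfilecommand = xdg-mime query filetype
</pre>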
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.PROCESSWEBQUEUE" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.PROCESSWEBQUEUE"></a><span class="term"><code class="varname">processwebqueue</code></span></dt>
|
|
<dd>
|
|
<p>Decide if we process the Web queue. The queue
|
|
is a directory where the Recoll Web browser
|
|
plugins create the copies of visited pages.</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.TEXTFILEPAGEKBS" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.TEXTFILEPAGEKBS"></a><span class="term"><code class="varname">textfilepagekbs</code></span></dt>
|
|
<dd>
|
|
<p>Page size for text files. If this is set,
|
|
text/plain files will be divided into documents
|
|
of approximately this size. Will reduce memory
|
|
usage at index time and help with loading data in
|
|
the preview window at query time. Particularly
|
|
useful with very big files, such as application
|
|
or system logs. Also see textfilemaxmbs and
|
|
compressedfilemaxkbs.</p>
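<p>The size-related parameters could be combined as in the
following sketch. The values are arbitrary examples; note
the different units implied by the parameter names.</p>
<pre class="programlisting">
# Skip compressed files over about 100 MB (value in KB)
compressedfilemaxkbs = 100000
# Skip text files over 50 MB (value in MB)
textfilemaxmbs = 50
# Split text/plain files into documents of about 1000 KB
textfilepagekbs = 1000
</pre>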
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.MEMBERMAXKBS" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.MEMBERMAXKBS"></a><span class="term"><code class="varname">membermaxkbs</code></span></dt>
|
|
<dd>
|
|
<p>Size limit for archive members. This is passed
|
|
to the filters in the environment as
|
|
RECOLL_FILTER_MAXMEMBERKB.</p>
|
|
</dd>
|
|
</dl>
|
|
</div>
|
|
</div>
|
|
<div class="sect3">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h4 class="title"><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.TERMS" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.TERMS"></a>Parameters
|
|
affecting how we generate terms and organize the
|
|
index</h4>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<div class="variablelist">
|
|
<dl class="variablelist">
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.INDEXSTRIPCHARS" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.INDEXSTRIPCHARS"></a><span class="term"><code class="varname">indexStripChars</code></span></dt>
|
|
<dd>
|
|
<p>Decide if we store character case and
|
|
diacritics in the index. If we do, searches
|
|
sensitive to case and diacritics can be
|
|
performed, but the index will be bigger, and some
|
|
marginal weirdness may sometimes occur. The
|
|
default is a stripped index. When using multiple
|
|
indexes for a search, this parameter must be
|
|
defined identically for all. Changing the value
|
|
implies an index reset.</p>
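<p>As an example, a configuration keeping case and
diacritics in the index (a non-stripped index) would
contain:</p>
<pre class="programlisting">
# 0: store case and diacritics; needs an index reset when changed
indexStripChars = 0
</pre>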
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.INDEXSTOREDOCTEXT"
|
|
id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.INDEXSTOREDOCTEXT"></a><span class="term"><code class="varname">indexStoreDocText</code></span></dt>
|
|
<dd>
|
|
<p>Decide if we store the documents' text content
|
|
in the index. Storing the text allows extracting
|
|
snippets from it at query time, instead of
|
|
building them from index position data. Newer
|
|
Xapian index formats have rendered our use of
|
|
positions list unacceptably slow in some cases.
|
|
The last Xapian index format with good
|
|
performance for the old method is Chert, which is
|
|
default for 1.2, still supported but not default
|
|
in 1.4 and will be dropped in 1.6. The stored
|
|
document text is translated from its original
|
|
format to UTF-8 plain text, but not stripped of
|
|
upper-case, diacritics, or punctuation signs.
|
|
Storing it increases the index size by 10-20%
|
|
typically, but also allows for nicer snippets, so
|
|
it may be worth enabling it even if not strictly
|
|
needed for performance if you can afford the
|
|
space. The variable only has an effect when
|
|
creating an index, meaning that the xapiandb
|
|
directory must not exist yet. Its exact effect
|
|
depends on the Xapian version. For Xapian 1.4, if
|
|
the variable is set to 0, the Chert format will
|
|
be used, and the text will not be stored. If the
|
|
variable is 1, Glass will be used, and the text
|
|
stored. For Xapian 1.2, and for versions 1.5 and
newer, the index format is always the
|
|
default, but the variable controls if the text is
|
|
stored or not, and the abstract generation
|
|
method. With Xapian 1.5 and later, and the
|
|
variable set to 0, abstract generation may be
|
|
very slow, but this setting may still be useful
|
|
to save space if you do not use abstract
|
|
generation at all.</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.NONUMBERS" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.NONUMBERS"></a><span class="term"><code class="varname">nonumbers</code></span></dt>
|
|
<dd>
|
|
<p>Decides if terms will be generated for
|
|
numbers. For example "123", "1.5e6", 192.168.1.4,
|
|
would not be indexed if nonumbers is set
|
|
("value123" would still be). Numbers are often
|
|
quite interesting to search for, and this should
|
|
probably not be set except for special
|
|
situations, e.g. scientific documents with huge
|
|
amounts of numbers in them, where setting
|
|
nonumbers will reduce the index size. This can
|
|
only be set for a whole index, not for a
|
|
subtree.</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.DEHYPHENATE" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.DEHYPHENATE"></a><span class="term"><code class="varname">dehyphenate</code></span></dt>
|
|
<dd>
|
|
<p>Determines if we index 'coworker' also when
|
|
the input is 'co-worker'. This is new in version
|
|
1.22, and on by default. Setting the variable to
|
|
off allows restoring the previous behaviour.</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.BACKSLASHASLETTER"
|
|
id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.BACKSLASHASLETTER"></a><span class="term"><code class="varname">backslashasletter</code></span></dt>
|
|
<dd>
|
|
<p>Process backslash as normal letter. This may
|
|
make sense for people wanting to index TeX
|
|
commands as such but is not of much general
|
|
use.</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.UNDERSCOREASLETTER"
|
|
id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.UNDERSCOREASLETTER"></a><span class="term"><code class="varname">underscoreasletter</code></span></dt>
|
|
<dd>
|
|
<p>Process underscore as normal letter. This
|
|
makes sense in so many cases that one wonders if
|
|
it should not be the default.</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.MAXTERMLENGTH" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.MAXTERMLENGTH"></a><span class="term"><code class="varname">maxtermlength</code></span></dt>
|
|
<dd>
|
|
<p>Maximum term length. Words longer than this
|
|
will be discarded. The default is 40 and used to
|
|
be hard-coded, but it can now be adjusted. You
|
|
need an index reset if you change the value.</p>
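<p>A sketch combining this parameter with some of the
preceding ones. The values are illustrative only.</p>
<pre class="programlisting">
# Keep generating terms for numbers
nonumbers = 0
# Treat '_' as a letter, e.g. for source code identifiers
underscoreasletter = 1
# Accept longer terms than the default of 40
maxtermlength = 60
</pre>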
|
|
</dd>
|
|
<dt><a name="RCL.INSTALL.CONFIG.RECOLLCONF.NOCJK"
|
|
id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.NOCJK"></a><span class="term"><code class="varname">nocjk</code></span></dt>
|
|
<dd>
|
|
<p>Decides if specific East Asian (Chinese Korean
|
|
Japanese) characters/word splitting is turned
|
|
off. This will save a small amount of CPU if you
|
|
have no CJK documents. If your document base does
|
|
include such text but you are not interested in
|
|
searching it, setting nocjk may be a significant
|
|
time and space saver.</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.CJKNGRAMLEN" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.CJKNGRAMLEN"></a><span class="term"><code class="varname">cjkngramlen</code></span></dt>
|
|
<dd>
|
|
<p>This lets you adjust the size of n-grams used
|
|
for indexing CJK text. The default value of 2 is
|
|
probably appropriate in most cases. A value of 3
|
|
would allow more precision and efficiency on
|
|
longer words, but the index will be approximately
|
|
twice as large.</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.INDEXSTEMMINGLANGUAGES"
|
|
id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.INDEXSTEMMINGLANGUAGES">
|
|
</a><span class="term"><code class=
|
|
"varname">indexstemminglanguages</code></span></dt>
|
|
<dd>
|
|
<p>Languages for which to create stemming
|
|
expansion data. Stemmer names can be found by
|
|
executing 'recollindex -l', or this can also be
|
|
set from a list in the GUI. The values are full
|
|
language names, e.g. english, french...</p>
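<p>For example, using the two language names mentioned
above:</p>
<pre class="programlisting">
indexstemminglanguages = english french
</pre>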
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.DEFAULTCHARSET" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.DEFAULTCHARSET"></a><span class="term"><code class="varname">defaultcharset</code></span></dt>
|
|
<dd>
|
|
<p>Default character set. This is used for files
|
|
which do not contain a character set definition
|
|
(e.g.: text/plain). Values found inside files,
|
|
e.g. a 'charset' tag in HTML documents, will
|
|
override it. If this is not set, the default
|
|
character set is the one defined by the NLS
|
|
environment ($LC_ALL, $LC_CTYPE, $LANG), or
|
|
ultimately iso-8859-1 (cp-1252 in fact). If for
|
|
some reason you want a general default which does
|
|
not match your LANG and is not 8859-1, use this
|
|
variable. This can be redefined for any
|
|
sub-directory.</p>
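<p>A sketch of a general default with a per-directory
override, assuming the usual bracketed directory section
syntax. The directory and the chosen character sets are
hypothetical.</p>
<pre class="programlisting">
# General default for files which do not declare a character set
defaultcharset = utf-8

# Older documents stored in a different encoding
[/home/me/legacy-docs]
defaultcharset = iso-8859-15
</pre>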
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.UNAC_EXCEPT_TRANS"
|
|
id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.UNAC_EXCEPT_TRANS"></a><span class="term"><code class="varname">unac_except_trans</code></span></dt>
|
|
<dd>
|
|
<p>A list of characters, encoded in UTF-8, which
|
|
should be handled specially when converting text
|
|
to unaccented lowercase. For example, in Swedish,
|
|
the letter a with diaeresis has full alphabet
|
|
citizenship and should not be turned into an a.
|
|
Each element in the space-separated list has the
|
|
special character as first element and the
|
|
translation following. The handling of both the
|
|
lowercase and upper-case versions of a character
|
|
should be specified, as membership in the list
will turn off both standard accent and case
processing. The value is global and affects both
indexing and querying. Examples: Swedish:
unac_except_trans = ää Ää öö Öö üü Üü ßss œoe Œoe
æae Æae ﬀff ﬁfi ﬂfl åå Åå . German:
unac_except_trans = ää Ää öö Öö üü Üü ßss œoe Œoe
æae Æae ﬀff ﬁfi ﬂfl . In French, you probably want
to decompose oe and ae, and nobody would type a
German ß: unac_except_trans = ßss œoe Œoe æae Æae
ﬀff ﬁfi ﬂfl . The default for all, until someone
protests, follows. These decompositions are not
performed by unac, but it is unlikely that
someone would type the composed forms in a
search. unac_except_trans = ßss œoe Œoe æae Æae
ﬀff ﬁfi ﬂfl</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.MAILDEFCHARSET" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.MAILDEFCHARSET"></a><span class="term"><code class="varname">maildefcharset</code></span></dt>
|
|
<dd>
|
|
<p>Overrides the default character set for email
|
|
messages which don't specify one. This is mainly
|
|
useful for readpst (libpst) dumps, which are
|
|
utf-8 but do not say so.</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.LOCALFIELDS" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.LOCALFIELDS"></a><span class="term"><code class="varname">localfields</code></span></dt>
|
|
<dd>
|
|
<p>Set fields on all files (usually of a specific
file system area). The syntax is the usual one:
name = value ; attr1 = val1 ; [...]. The value is
empty, so this needs an initial semi-colon. This is
useful, e.g., for setting the rclaptg field for
application selection inside mimeview.</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.TESTMODIFUSEMTIME"
|
|
id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.TESTMODIFUSEMTIME"></a><span class="term"><code class="varname">testmodifusemtime</code></span></dt>
|
|
<dd>
|
|
<p>Use mtime instead of ctime to test if a file
|
|
has been modified. The time is used in addition
|
|
to the size, which is always used. Setting this
|
|
can reduce re-indexing on systems where extended
|
|
attributes are used (by some other application),
|
|
but not indexed, because changing extended
|
|
attributes only affects ctime. Notes: - This may
|
|
prevent detection of change in some marginal file
|
|
rename cases (the target would need to have the
|
|
same size and mtime). - You should probably also
|
|
set noxattrfields to 1 in this case, except if
|
|
you still prefer to perform xattr indexing, for
|
|
example if the local file update pattern makes it
|
|
of value (as in general, there is a risk for pure
|
|
extended attributes updates without file
|
|
modification to go undetected). Perform a full
|
|
index reset after changing this.</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.NOXATTRFIELDS" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.NOXATTRFIELDS"></a><span class="term"><code class="varname">noxattrfields</code></span></dt>
|
|
<dd>
|
|
<p>Disable extended attributes conversion to
|
|
metadata fields. This probably needs to be set if
|
|
testmodifusemtime is set.</p>
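<p>The combination suggested by the two entries above can
be written as:</p>
<pre class="programlisting">
# Detect changes with mtime (plus size) instead of ctime
testmodifusemtime = 1
# Do not convert extended attributes to metadata fields
noxattrfields = 1
</pre>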
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.METADATACMDS" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.METADATACMDS"></a><span class="term"><code class="varname">metadatacmds</code></span></dt>
|
|
<dd>
|
|
<p>Define commands to gather external metadata,
|
|
e.g. tmsu tags. There can be several entries,
|
|
separated by semi-colons, each defining which
|
|
field name the data goes into and the command to
|
|
use. Don't forget the initial semi-colon. All the
|
|
field names must be different. You can use
|
|
aliases in the "field" file if necessary. As a
|
|
not too pretty hack conceded to convenience, any
|
|
field name beginning with "rclmulti" will be
|
|
taken as an indication that the command returns
|
|
multiple field values inside a text blob
|
|
formatted as a recoll configuration file
|
|
("fieldname = fieldvalue" lines). The rclmultixx
|
|
name will be ignored, and field names and values
|
|
will be parsed from the data. Example:
|
|
metadatacmds = ; tags = tmsu tags %f; rclmulti1 =
|
|
cmdOutputsConf %f</p>
|
|
</dd>
|
|
</dl>
|
|
</div>
|
|
</div>
|
|
<div class="sect3">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h4 class="title"><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.STORE" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.STORE"></a>Parameters
|
|
affecting where and how we store things</h4>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<div class="variablelist">
|
|
<dl class="variablelist">
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.CACHEDIR" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.CACHEDIR"></a><span class="term"><code class="varname">cachedir</code></span></dt>
|
|
<dd>
|
|
<p>Top directory for Recoll data. Recoll data
|
|
directories are normally located relative to the
|
|
configuration directory (e.g. ~/.recoll/xapiandb,
|
|
~/.recoll/mboxcache). If 'cachedir' is set, the
|
|
directories are stored under the specified value
|
|
instead (e.g. if cachedir is ~/.cache/recoll, the
|
|
default dbdir would be ~/.cache/recoll/xapiandb).
|
|
This affects dbdir, webcachedir, mboxcachedir,
|
|
aspellDicDir, which can still be individually
|
|
specified to override cachedir. Note that if you
|
|
have multiple configurations, each must have a
|
|
different cachedir, there is no automatic
|
|
computation of a subpath under cachedir.</p>
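<p>Using the example value given above, the setting would
read:</p>
<pre class="programlisting">
# Data directories (xapiandb, mboxcache, ...) are created under this path
cachedir = ~/.cache/recoll
</pre>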
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.MAXFSOCCUPPC" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.MAXFSOCCUPPC"></a><span class="term"><code class="varname">maxfsoccuppc</code></span></dt>
|
|
<dd>
|
|
<p>Maximum file system occupation over which we
|
|
stop indexing. The value is a percentage,
|
|
corresponding to what the "Capacity" df output
|
|
column shows. The default value is 0, meaning no
|
|
checking.</p>
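<p>For example, to stop indexing when the file system is
90% full (an arbitrary threshold chosen for
illustration):</p>
<pre class="programlisting">
maxfsoccuppc = 90
</pre>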
|
|
</dd>
|
|
<dt><a name="RCL.INSTALL.CONFIG.RECOLLCONF.DBDIR"
|
|
id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.DBDIR"></a><span class="term"><code class="varname">dbdir</code></span></dt>
|
|
<dd>
|
|
<p>Xapian database directory location. This will
|
|
be created on first indexing. If the value is not
|
|
an absolute path, it will be interpreted as
|
|
relative to cachedir if set, or the configuration
|
|
directory (-c argument or $RECOLL_CONFDIR). If
|
|
nothing is specified, the default is then
|
|
~/.recoll/xapiandb/</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.IDXSTATUSFILE" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.IDXSTATUSFILE"></a><span class="term"><code class="varname">idxstatusfile</code></span></dt>
|
|
<dd>
|
|
<p>Name of the scratch file where the indexer
|
|
process updates its status. Default:
|
|
idxstatus.txt inside the configuration
|
|
directory.</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.MBOXCACHEDIR" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.MBOXCACHEDIR"></a><span class="term"><code class="varname">mboxcachedir</code></span></dt>
|
|
<dd>
|
|
<p>Directory location for storing mbox message
|
|
offsets cache files. This is normally 'mboxcache'
|
|
under cachedir if set, or else under the
|
|
configuration directory, but it may be useful to
|
|
share a directory between different
|
|
configurations.</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.MBOXCACHEMINMBS" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.MBOXCACHEMINMBS"></a><span class="term"><code class="varname">mboxcacheminmbs</code></span></dt>
|
|
<dd>
|
|
<p>Minimum mbox file size over which we cache the
|
|
offsets. There is really no sense in caching
|
|
offsets for small files. The default is 5 MB.</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.MBOXMAXMSGMBS" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.MBOXMAXMSGMBS"></a><span class="term"><code class="varname">mboxmaxmsgmbs</code></span></dt>
|
|
<dd>
|
|
<p>Maximum mbox member message size in megabytes.
|
|
Size over which we assume that the mbox format is
|
|
bad or we misinterpreted it, at which point we
|
|
just stop processing the file.</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.WEBCACHEDIR" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.WEBCACHEDIR"></a><span class="term"><code class="varname">webcachedir</code></span></dt>
|
|
<dd>
|
|
<p>Directory where we store the archived web
|
|
pages. This is only used by the web history
|
|
indexing code. Default: cachedir/webcache if
|
|
cachedir is set, else
|
|
$RECOLL_CONFDIR/webcache</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.WEBCACHEMAXMBS" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.WEBCACHEMAXMBS"></a><span class="term"><code class="varname">webcachemaxmbs</code></span></dt>
|
|
<dd>
|
|
<p>Maximum size in MB of the Web archive. This is
|
|
only used by the web history indexing code.
|
|
Default: 40 MB. Reducing the size will not
|
|
physically truncate the file.</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.WEBQUEUEDIR" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.WEBQUEUEDIR"></a><span class="term"><code class="varname">webqueuedir</code></span></dt>
|
|
<dd>
|
|
<p>The path to the Web indexing queue. This used
|
|
to be hard-coded in the old plugin as
|
|
~/.recollweb/ToIndex so there would be no need or
|
|
possibility to change it, but the WebExtensions
|
|
plugin now downloads the files to the user
|
|
Downloads directory, and a script moves them to
|
|
webqueuedir. The script reads this value from the
|
|
config so it has become possible to change
|
|
it.</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.WEBDOWNLOADSDIR" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.WEBDOWNLOADSDIR"></a><span class="term"><code class="varname">webdownloadsdir</code></span></dt>
|
|
<dd>
|
|
<p>The path to browser downloads directory. This
|
|
is where the new browser add-on extension has to
|
|
create the files. They are then moved by a script
|
|
to webqueuedir.</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.ASPELLDICDIR" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.ASPELLDICDIR"></a><span class="term"><code class="varname">aspellDicDir</code></span></dt>
|
|
<dd>
|
|
<p>Aspell dictionary storage directory location.
|
|
The aspell dictionary (aspdict.(lang).rws) is
|
|
normally stored in the directory specified by
|
|
cachedir if set, or under the configuration
|
|
directory.</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.FILTERSDIR" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.FILTERSDIR"></a><span class="term"><code class="varname">filtersdir</code></span></dt>
|
|
<dd>
|
|
<p>Directory location for executable input
|
|
handlers. If RECOLL_FILTERSDIR is set in the
|
|
environment, we use it instead. Defaults to
|
|
$prefix/share/recoll/filters. Can be redefined
|
|
for subdirectories.</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.ICONSDIR" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.ICONSDIR"></a><span class="term"><code class="varname">iconsdir</code></span></dt>
|
|
<dd>
|
|
<p>Directory location for icons. The only reason
|
|
to change this would be if you want to change the
|
|
icons displayed in the result list. Defaults to
|
|
$prefix/share/recoll/images</p>
|
|
</dd>
|
|
</dl>
|
|
</div>
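<p>As an illustration of how the storage parameters above combine, a
<code class="filename">recoll.conf</code> fragment could look like the
following. This is only a sketch: the paths and numeric values are
hypothetical, chosen for the example, and are not recommendations.</p>
<pre class="programlisting">
# Keep all index data under one directory instead of the configuration directory
cachedir = ~/.cache/recoll
# dbdir is relative, so it resolves under cachedir: ~/.cache/recoll/xapiandb
dbdir = xapiandb
# Stop indexing when the file system is more than 90% full
maxfsoccuppc = 90
# Share the mbox offsets cache with other configurations (hypothetical path)
mboxcachedir = ~/.cache/recoll-shared/mboxcache
# Only cache offsets for mbox files larger than 10 MB
mboxcacheminmbs = 10
</pre>
<p>With several configurations, each one would need its own
<code class="varname">cachedir</code> value, as no subpath is computed
automatically.</p>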
|
|
</div>
|
|
<div class="sect3">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h4 class="title"><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.PERFS" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.PERFS"></a>Parameters
|
|
affecting indexing performance and resource
|
|
usage</h4>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<div class="variablelist">
|
|
<dl class="variablelist">
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.IDXFLUSHMB" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.IDXFLUSHMB"></a><span class="term"><code class="varname">idxflushmb</code></span></dt>
|
|
<dd>
|
|
<p>Threshold (megabytes of new data) where we
|
|
flush from memory to disk index. Setting this
|
|
allows some control over memory usage by the
|
|
indexer process. A value of 0 means no explicit
|
|
flushing, which lets Xapian do its own
|
|
thing, meaning flushing every
|
|
$XAPIAN_FLUSH_THRESHOLD documents created,
|
|
modified or deleted: as memory usage depends on
|
|
average document size, not only document count,
|
|
the Xapian approach is not very useful, and
|
|
you should let Recoll manage the flushes. The
|
|
program compiled value is 0. The configured
|
|
default value (from this file) is now 50 MB, and
|
|
should be ok in many cases. You can set it as low
|
|
as 10 to conserve memory, but if you are looking
|
|
for maximum speed, you may want to experiment
|
|
with values between 20 and 200. In my experience,
|
|
values beyond this are always counterproductive.
|
|
If you find otherwise, please drop me a note.</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.FILTERMAXSECONDS"
|
|
id="RCL.INSTALL.CONFIG.RECOLLCONF.FILTERMAXSECONDS">
|
|
</a><span class="term"><code class=
|
|
"varname">filtermaxseconds</code></span></dt>
|
|
<dd>
|
|
<p>Maximum external filter execution time in
|
|
seconds. Default: 1200 (20 minutes). Set to 0 for no
|
|
limit. This is mainly to avoid infinite loops in
|
|
PostScript files (loop.ps).</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.FILTERMAXMBYTES" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.FILTERMAXMBYTES"></a><span class="term"><code class="varname">filtermaxmbytes</code></span></dt>
|
|
<dd>
|
|
<p>Maximum virtual memory space for filter
|
|
processes (setrlimit(RLIMIT_AS)), in megabytes.
|
|
Note that this includes any mapped libs (there is
|
|
no reliable Linux way to limit the data space
|
|
only), so we need to be a bit generous here.
|
|
Anything over 2000 will be ignored on 32-bit
|
|
machines.</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.THRQSIZES" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.THRQSIZES"></a><span class="term"><code class="varname">thrQSizes</code></span></dt>
|
|
<dd>
|
|
<p>Stage input queues configuration. There are
|
|
three internal queues in the indexing pipeline
|
|
stages (file data extraction, terms generation,
|
|
index update). This parameter defines the queue
|
|
depths for each stage (three integer values). If
|
|
a value of -1 is given for a given stage, no
|
|
queue is used, and the thread will go on
|
|
performing the next stage. In practice, deep
|
|
queues have not been shown to increase
|
|
performance. Default: a value of 0 for the first
|
|
queue tells Recoll to perform autoconfiguration
|
|
based on the detected number of CPUs (no need for
|
|
the two other values in this case). Use thrQSizes
|
|
= -1 -1 -1 to disable multithreading
|
|
entirely.</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.THRTCOUNTS" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.THRTCOUNTS"></a><span class="term"><code class="varname">thrTCounts</code></span></dt>
|
|
<dd>
|
|
<p>Number of threads used for each indexing
|
|
stage. The three stages are: file data
|
|
extraction, terms generation, index update. The
|
|
use of the counts is also controlled by some
|
|
special values in thrQSizes: if the first queue
|
|
depth is 0, all counts are ignored
|
|
(autoconfigured); if a value of -1 is used for a
|
|
queue depth, the corresponding thread count is
|
|
ignored. It makes no sense to use a value other
|
|
than 1 for the last stage because updating the
|
|
Xapian index is necessarily single-threaded (and
|
|
protected by a mutex).</p>
|
|
</dd>
|
|
</dl>
|
|
</div>
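<p>The performance parameters above could be combined as in the
following sketch. The values are illustrative assumptions, not measured
recommendations, and the best settings depend on the machine and on the
data.</p>
<pre class="programlisting">
# Flush to disk after every 100 MB of new data
idxflushmb = 100
# Stop external filters which run for more than 10 minutes
filtermaxseconds = 600
# Explicit queue depths for the three stages (no autoconfiguration)
thrQSizes = 2 2 2
# Threads per stage; the last stage (index update) stays at 1
thrTCounts = 4 2 1
</pre>
<p>Leaving <code class="varname">thrQSizes</code> at its default lets
Recoll pick the values from the CPU count, in which case
<code class="varname">thrTCounts</code> is ignored.</p>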
|
|
</div>
|
|
<div class="sect3">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h4 class="title"><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.MISC" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.MISC"></a>Miscellaneous
|
|
parameters</h4>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<div class="variablelist">
|
|
<dl class="variablelist">
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.LOGLEVEL" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.LOGLEVEL"></a><span class="term"><code class="varname">loglevel</code></span></dt>
|
|
<dd>
|
|
<p>Log file verbosity 1-6. A value of 2 will
|
|
print only errors and warnings. 3 will print
|
|
information like document updates, 4 is quite
|
|
verbose and 6 very verbose.</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.LOGFILENAME" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.LOGFILENAME"></a><span class="term"><code class="varname">logfilename</code></span></dt>
|
|
<dd>
|
|
<p>Log file destination. Use 'stderr' (default)
|
|
to write to the console.</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.IDXLOGLEVEL" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.IDXLOGLEVEL"></a><span class="term"><code class="varname">idxloglevel</code></span></dt>
|
|
<dd>
|
|
<p>Override loglevel for the indexer.</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.IDXLOGFILENAME" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.IDXLOGFILENAME"></a><span class="term"><code class="varname">idxlogfilename</code></span></dt>
|
|
<dd>
|
|
<p>Override logfilename for the indexer.</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.DAEMLOGLEVEL" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.DAEMLOGLEVEL"></a><span class="term"><code class="varname">daemloglevel</code></span></dt>
|
|
<dd>
|
|
<p>Override loglevel for the indexer in real time
|
|
mode. The default is to use the idx... values if
|
|
set, else the log... values.</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.DAEMLOGFILENAME" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.DAEMLOGFILENAME"></a><span class="term"><code class="varname">daemlogfilename</code></span></dt>
|
|
<dd>
|
|
<p>Override logfilename for the indexer in real
|
|
time mode. The default is to use the idx...
|
|
values if set, else the log... values.</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.PYLOGLEVEL" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.PYLOGLEVEL"></a><span class="term"><code class="varname">pyloglevel</code></span></dt>
|
|
<dd>
|
|
<p>Override loglevel for the python module.</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.PYLOGFILENAME" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.PYLOGFILENAME"></a><span class="term"><code class="varname">pylogfilename</code></span></dt>
|
|
<dd>
|
|
<p>Override logfilename for the python
|
|
module.</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.ORGIDXCONFDIR" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.ORGIDXCONFDIR"></a><span class="term"><code class="varname">orgidxconfdir</code></span></dt>
|
|
<dd>
|
|
<p>Original location of the configuration
|
|
directory. This is used exclusively for movable
|
|
datasets. Locating the configuration directory
|
|
inside the directory tree makes it possible to
|
|
provide automatic query time path translations
|
|
once the data set has moved (for example, because
|
|
it has been mounted on another location).</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.CURIDXCONFDIR" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.CURIDXCONFDIR"></a><span class="term"><code class="varname">curidxconfdir</code></span></dt>
|
|
<dd>
|
|
<p>Current location of the configuration
|
|
directory. Complements orgidxconfdir for movable
|
|
datasets. This should be used if the
|
|
configuration directory has been copied from the
|
|
dataset to another location, either because the
|
|
dataset is readonly and an r/w copy is desired,
|
|
or for performance reasons. This records the
|
|
original moved location before copy, to allow
|
|
path translation computations. For example if a
|
|
dataset originally indexed as
|
|
'/home/me/mydata/config' has been mounted to
|
|
'/media/me/mydata', and the GUI is running from a
|
|
copied configuration, orgidxconfdir would be
|
|
'/home/me/mydata/config', and curidxconfdir (as
|
|
set in the copied configuration) would be
|
|
'/media/me/mydata/config'.</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.IDXRUNDIR" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.IDXRUNDIR"></a><span class="term"><code class="varname">idxrundir</code></span></dt>
|
|
<dd>
|
|
<p>Indexing process current directory. The input
|
|
handlers sometimes leave temporary files in the
|
|
current directory, so it makes sense to have
|
|
recollindex chdir to some temporary directory. If
|
|
the value is empty, the current directory is not
|
|
changed. If the value is (literal) tmp, we use
|
|
the temporary directory as set by the environment
|
|
(RECOLL_TMPDIR else TMPDIR else /tmp). If the
|
|
value is an absolute path to a directory, we go
|
|
there.</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.CHECKNEEDRETRYINDEXSCRIPT"
|
|
id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.CHECKNEEDRETRYINDEXSCRIPT">
|
|
</a><span class="term"><code class=
|
|
"varname">checkneedretryindexscript</code></span></dt>
|
|
<dd>
|
|
<p>Script used to heuristically check if we need
|
|
to retry indexing files which previously failed.
|
|
The default script checks the modified dates on
|
|
/usr/bin and /usr/local/bin. A relative path will
|
|
be looked up in the filters dirs, then in the
|
|
path. Use an absolute path to do otherwise.</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.RECOLLHELPERPATH"
|
|
id="RCL.INSTALL.CONFIG.RECOLLCONF.RECOLLHELPERPATH">
|
|
</a><span class="term"><code class=
|
|
"varname">recollhelperpath</code></span></dt>
|
|
<dd>
|
|
<p>Additional places to search for helper
|
|
executables. This is only used on Windows for
|
|
now.</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.IDXABSMLEN" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.IDXABSMLEN"></a><span class="term"><code class="varname">idxabsmlen</code></span></dt>
|
|
<dd>
|
|
<p>Length of abstracts we store while indexing.
|
|
Recoll stores an abstract for each indexed file.
|
|
The text can come from an actual 'abstract'
|
|
section in the document or will just be the
|
|
beginning of the document. It is stored in the
|
|
index so that it can be displayed inside the
|
|
result lists without decoding the original file.
|
|
The idxabsmlen parameter defines the size of the
|
|
stored abstract. The default value is 250 bytes.
|
|
The search interface gives you the choice to
|
|
display this stored text or a synthetic abstract
|
|
built by extracting text around the search terms.
|
|
If you always prefer the synthetic abstract, you
|
|
can reduce this value and save a little
|
|
space.</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.IDXMETASTOREDLEN"
|
|
id="RCL.INSTALL.CONFIG.RECOLLCONF.IDXMETASTOREDLEN">
|
|
</a><span class="term"><code class=
|
|
"varname">idxmetastoredlen</code></span></dt>
|
|
<dd>
|
|
<p>Truncation length of stored metadata fields.
|
|
This does not affect indexing (the whole field is
|
|
processed anyway), just the amount of data stored
|
|
in the index for the purpose of displaying fields
|
|
inside result lists or previews. The default
|
|
value is 150 bytes which may be too low if you
|
|
have custom fields.</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.IDXTEXTTRUNCATELEN"
|
|
id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.IDXTEXTTRUNCATELEN"></a><span class="term"><code class="varname">idxtexttruncatelen</code></span></dt>
|
|
<dd>
|
|
<p>Truncation length for all document texts. Only
|
|
index the beginning of documents. This is not
|
|
recommended except if you are sure that the
|
|
interesting keywords are at the top and have
|
|
severe disk space issues.</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.ASPELLLANGUAGE" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.ASPELLLANGUAGE"></a><span class="term"><code class="varname">aspellLanguage</code></span></dt>
|
|
<dd>
|
|
<p>Language definitions to use when creating the
|
|
aspell dictionary. The value must match a set of
|
|
aspell language definition files. You can type
|
|
"aspell dicts" to see a list. The default if this
|
|
is not set is to use the NLS environment to guess
|
|
the value. The values are the 2-letter language
|
|
codes (e.g. 'en', 'fr'...).</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.ASPELLADDCREATEPARAM"
|
|
id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.ASPELLADDCREATEPARAM">
|
|
</a><span class="term"><code class=
|
|
"varname">aspellAddCreateParam</code></span></dt>
|
|
<dd>
|
|
<p>Additional option and parameter to aspell
|
|
dictionary creation command. Some aspell packages
|
|
may need an additional option (e.g. on Debian
|
|
Jessie: --local-data-dir=/usr/lib/aspell). See
|
|
Debian bug 772415.</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.ASPELLKEEPSTDERR"
|
|
id="RCL.INSTALL.CONFIG.RECOLLCONF.ASPELLKEEPSTDERR">
|
|
</a><span class="term"><code class=
|
|
"varname">aspellKeepStderr</code></span></dt>
|
|
<dd>
|
|
<p>Set this to have a look at aspell dictionary
|
|
creation errors. There are always many, so this
|
|
is mostly for debugging.</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.NOASPELL" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.NOASPELL"></a><span class="term"><code class="varname">noaspell</code></span></dt>
|
|
<dd>
|
|
<p>Disable aspell use. The aspell dictionary
|
|
generation takes time, and some combinations of
|
|
aspell version, language, and local terms, result
|
|
in aspell crashing, so it sometimes makes sense
|
|
to just disable the thing.</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.MONAUXINTERVAL" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.MONAUXINTERVAL"></a><span class="term"><code class="varname">monauxinterval</code></span></dt>
|
|
<dd>
|
|
<p>Auxiliary database update interval. The real
|
|
time indexer only updates the auxiliary databases
|
|
(stemdb, aspell) periodically, because it would
|
|
be too costly to do it for every document change.
|
|
The default period is one hour.</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.MONIXINTERVAL" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.MONIXINTERVAL"></a><span class="term"><code class="varname">monixinterval</code></span></dt>
|
|
<dd>
|
|
<p>Minimum interval (seconds) between processings
|
|
of the indexing queue. The real time indexer does
|
|
not process each event when it comes in, but lets
|
|
the queue accumulate, to diminish overhead and to
|
|
aggregate multiple events affecting the same
|
|
file. Default: 30 seconds.</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.MONDELAYPATTERNS"
|
|
id="RCL.INSTALL.CONFIG.RECOLLCONF.MONDELAYPATTERNS">
|
|
</a><span class="term"><code class=
|
|
"varname">mondelaypatterns</code></span></dt>
|
|
<dd>
|
|
<p>Timing parameters for the real time indexing.
|
|
Definitions for files which get a longer delay
|
|
before reindexing is allowed. This is for
|
|
fast-changing files, that should only be
|
|
reindexed once in a while. A list of
|
|
wildcardPattern:seconds pairs. The patterns are
|
|
matched with fnmatch(pattern, path, 0). You can
|
|
quote entries containing white space with double
|
|
quotes (quote the whole entry, not the pattern).
|
|
The default is empty. Example: mondelaypatterns =
|
|
*.log:20 "*with spaces.*:30"</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.IDXNICEPRIO" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.IDXNICEPRIO"></a><span class="term"><code class="varname">idxniceprio</code></span></dt>
|
|
<dd>
|
|
<p>"nice" process priority for the indexing
|
|
processes. Default: 19 (lowest). Appeared with
|
|
1.26.5. Prior versions were fixed at 19.</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.MONIONICECLASS" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.MONIONICECLASS"></a><span class="term"><code class="varname">monioniceclass</code></span></dt>
|
|
<dd>
|
|
<p>ionice class for the indexing process. Despite
|
|
the misleading name, and on platforms where this
|
|
is supported, this affects all indexing
|
|
processes, not only the real time/monitoring
|
|
ones. The default value is 3 (use lowest "Idle"
|
|
priority).</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.MONIONICECLASSDATA"
|
|
id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.MONIONICECLASSDATA"></a><span class="term"><code class="varname">monioniceclassdata</code></span></dt>
|
|
<dd>
|
|
<p>ionice class level parameter if the class
|
|
supports it. The default is empty, as the default
|
|
"Idle" class has no levels.</p>
|
|
</dd>
|
|
</dl>
|
|
</div>
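<p>A few of the miscellaneous parameters above are shown together in
the following sketch. The log file path and the numeric values are
hypothetical, and the ionice values assume the Linux class numbering
(class 2 is "best effort", with levels 0 to 7).</p>
<pre class="programlisting">
# General log settings: errors, warnings and document updates, to the console
loglevel = 3
logfilename = stderr
# More verbose output for the indexer, in its own file (hypothetical path)
idxloglevel = 4
idxlogfilename = /tmp/recollindex.log
# Run the indexer from the temporary directory set by the environment
idxrundir = tmp
# Build the aspell dictionary from the English language definitions
aspellLanguage = en
# Process the real time indexing queue at most once a minute
monixinterval = 60
# Longer reindexing delays for fast-changing files
mondelaypatterns = *.log:20 "*with spaces.*:30"
# Less extreme CPU and I/O priorities than the defaults
idxniceprio = 10
monioniceclass = 2
monioniceclassdata = 7
</pre>
<p>The <code class="varname">daem...</code> and
<code class="varname">py...</code> log parameters follow the same
pattern if the real time indexer or the Python module need separate
logs.</p>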
|
|
</div>
|
|
<div class="sect3">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h4 class="title"><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.QUERY" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.QUERY"></a>Query-time
|
|
parameters (no impact on the index)</h4>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<div class="variablelist">
|
|
<dl class="variablelist">
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.AUTODIACSENS" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.AUTODIACSENS"></a><span class="term"><code class="varname">autodiacsens</code></span></dt>
|
|
<dd>
|
|
<p>Auto-trigger diacritics sensitivity (raw index
|
|
only). If the index is not stripped, decide if we
|
|
automatically trigger diacritics sensitivity if
|
|
the search term has accented characters (not in
|
|
unac_except_trans). Else you need to use the
|
|
query language and the "D" modifier to specify
|
|
diacritics sensitivity. Default is no.</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.AUTOCASESENS" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.AUTOCASESENS"></a><span class="term"><code class="varname">autocasesens</code></span></dt>
|
|
<dd>
|
|
<p>Auto-trigger case sensitivity (raw index
|
|
only). If the index is not stripped (see
|
|
indexStripChars), decide if we automatically
|
|
trigger character case sensitivity if the search
|
|
term has upper-case characters in any but the
|
|
first position. Else you need to use the query
|
|
language and the "C" modifier to specify
|
|
character-case sensitivity. Default is yes.</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.MAXTERMEXPAND" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.MAXTERMEXPAND"></a><span class="term"><code class="varname">maxTermExpand</code></span></dt>
|
|
<dd>
|
|
<p>Maximum query expansion count for a single
|
|
term (e.g.: when using wildcards). This only
|
|
affects queries, not indexing. We used to not
|
|
limit this at all (except for filenames where the
|
|
limit was too low at 1000), but it is
|
|
unreasonable with a big index. Default 10000.</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.MAXXAPIANCLAUSES"
|
|
id="RCL.INSTALL.CONFIG.RECOLLCONF.MAXXAPIANCLAUSES">
|
|
</a><span class="term"><code class=
|
|
"varname">maxXapianClauses</code></span></dt>
|
|
<dd>
|
|
<p>Maximum number of clauses we add to a single
|
|
Xapian query. This only affects queries, not
|
|
indexing. In some cases, the result of term
|
|
expansion can be multiplicative, and we want to
|
|
avoid eating all the memory. Default 50000.</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.SNIPPETMAXPOSWALK"
|
|
id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.SNIPPETMAXPOSWALK"></a><span class="term"><code class="varname">snippetMaxPosWalk</code></span></dt>
|
|
<dd>
|
|
<p>Maximum number of positions we walk while
|
|
populating a snippet for the result list. The
|
|
default of 1,000,000 may be insufficient for very
|
|
big documents; the consequence would be snippets
|
|
with possibly meaning-altering missing words.</p>
|
|
</dd>
|
|
</dl>
|
|
</div>
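<p>As a sketch of how the query-time parameters above might be set for
a large raw (unstripped) index: the numeric values below are
illustrative assumptions, and the example assumes that boolean
parameters are written as 1 and 0.</p>
<pre class="programlisting">
# Accented search terms trigger diacritics sensitivity automatically
autodiacsens = 1
# Upper-case characters after the first position trigger case sensitivity
autocasesens = 1
# Allow more expansions per term for wildcard-heavy queries
maxTermExpand = 20000
# Allow larger Xapian queries, at the cost of memory
maxXapianClauses = 100000
# Walk more positions when building snippets for very big documents
snippetMaxPosWalk = 2000000
</pre>
<p>As these parameters have no impact on the index, changing them does
not require reindexing.</p>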
|
|
</div>
|
|
<div class="sect3">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h4 class="title"><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.PDF" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.PDF"></a>Parameters
|
|
for the PDF input script</h4>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<div class="variablelist">
|
|
<dl class="variablelist">
|
|
<dt><a name="RCL.INSTALL.CONFIG.RECOLLCONF.PDFOCR"
|
|
id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.PDFOCR"></a><span class="term"><code class="varname">pdfocr</code></span></dt>
|
|
<dd>
|
|
<p>Attempt OCR of PDF files with no text content.
|
|
This can be defined in subdirectories. The
|
|
default is off because OCR is so very slow.</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.PDFATTACH" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.PDFATTACH"></a><span class="term"><code class="varname">pdfattach</code></span></dt>
|
|
<dd>
|
|
<p>Enable PDF attachment extraction by executing
|
|
pdftk (if available). This is normally disabled,
|
|
because it does slow down PDF indexing a bit even
|
|
if not one attachment is ever found.</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.PDFEXTRAMETA" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.PDFEXTRAMETA"></a><span class="term"><code class="varname">pdfextrameta</code></span></dt>
|
|
<dd>
|
|
<p>Extract text from selected XMP metadata tags.
|
|
This is a space-separated list of qualified XMP
|
|
tag names. Each element can also include a
|
|
translation to a Recoll field name, separated by
|
|
a '|' character. If the second element is absent,
|
|
the tag name is used as the Recoll field name.
|
|
You will also need to add specifications to the
|
|
"fields" file to direct processing of the
|
|
extracted data.</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.PDFEXTRAMETAFIX" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.PDFEXTRAMETAFIX"></a><span class="term"><code class="varname">pdfextrametafix</code></span></dt>
|
|
<dd>
|
|
<p>Define name of XMP field editing script. This
|
|
defines the name of a script to be loaded for
|
|
editing XMP field values. The script should
|
|
define a 'MetaFixer' class with a metafix()
|
|
method which will be called with the qualified
|
|
tag name and value of each selected field, for
|
|
editing or erasing. A new instance is created for
|
|
each document, so that the object can keep state
|
|
for, e.g. eliminating duplicate values.</p>
|
|
</dd>
|
|
</dl>
|
|
</div>
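<p>The PDF parameters above could be used as in the following sketch.
The XMP tag names, the script path and the
<code class="filename">~/scans</code> directory are hypothetical, and
the example assumes that boolean parameters are written as 1 and 0.</p>
<pre class="programlisting">
# Extract attachments with pdftk, accepting slightly slower PDF indexing
pdfattach = 1
# bibtex:location is stored as the "location" field; the other tags keep their names
pdfextrameta = bibtex:location|location bibtex:booktitle bibtex:pages
# Script defining the MetaFixer class used to edit the extracted values
pdfextrametafix = /path/to/fixerscript.py

# Only attempt OCR for PDF files below this directory
[~/scans]
pdfocr = 1
</pre>
<p>Restricting <code class="varname">pdfocr</code> to a subdirectory
limits the cost of OCR to the files which are likely to need it. The
extracted fields would still need matching entries in the
<code class="filename">fields</code> file.</p>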
|
|
</div>
|
|
<div class="sect3">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h4 class="title"><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.OCR" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.OCR"></a>Parameters
|
|
for OCR processing</h4>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<div class="variablelist">
|
|
<dl class="variablelist">
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.OCRPROGS" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.OCRPROGS"></a><span class="term"><code class="varname">ocrprogs</code></span></dt>
|
|
<dd>
|
|
<p>OCR modules to try. The top OCR script will
|
|
try to load the corresponding modules in order
|
|
and use the first which reports being capable of
|
|
performing OCR on the input file. Modules for
|
|
tesseract (tesseract) and ABBYY FineReader
|
|
(abbyy) are present in the standard distribution.
|
|
For compatibility with the previous version, if
|
|
this is not defined at all, the default value is
|
|
"tesseract". Use an explicit empty value if
|
|
needed. A value of "abbyy tesseract" will try
|
|
everything.</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.OCRCACHEDIR" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.OCRCACHEDIR"></a><span class="term"><code class="varname">ocrcachedir</code></span></dt>
|
|
<dd>
|
|
<p>Location for caching OCR data. The default if
|
|
this is empty or undefined is to store the cached
|
|
OCR data under $RECOLL_CONFDIR/ocrcache.</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.TESSERACTLANG" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.TESSERACTLANG"></a><span class="term"><code class="varname">tesseractlang</code></span></dt>
|
|
<dd>
|
|
<p>Language to assume for tesseract OCR.
|
|
Important for improving the OCR accuracy. This
|
|
can also be set through the contents of a file in
|
|
the currently processed directory. See the
|
|
rclocrtesseract.py script. Example values: eng,
|
|
fra... See the tesseract documentation.</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.TESSERACTCMD" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.TESSERACTCMD"></a><span class="term"><code class="varname">tesseractcmd</code></span></dt>
|
|
<dd>
|
|
<p>Path for the tesseract command. Do not quote.
|
|
This is mostly useful on Windows, or for
|
|
specifying a non-default tesseract command. E.g.
|
|
on Windows: tesseractcmd =
|
|
C:/ProgramFiles(x86)/Tesseract-OCR/tesseract.exe</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.ABBYYLANG" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.ABBYYLANG"></a><span class="term"><code class="varname">abbyylang</code></span></dt>
|
|
<dd>
|
|
<p>Language to assume for abbyy OCR. Important
|
|
for improving the OCR accuracy. This can also be
|
|
set through the contents of a file in the
|
|
currently processed directory. See the
|
|
rclocrabbyy.py script. Typical values: English,
|
|
French... See the ABBYY documentation.</p>
|
|
</dd>
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.ABBYYCMD" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.ABBYYCMD"></a><span class="term"><code class="varname">abbyycmd</code></span></dt>
|
|
<dd>
|
|
<p>Path for the abbyy command. The ABBYY directory
|
|
is usually not in the path, so you should set
|
|
this.</p>
|
|
</dd>
|
|
</dl>
|
|
</div>
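<p>A possible OCR setup using the parameters above is sketched below.
The cache location and the ABBYY command path are hypothetical and
would need to be adapted to the local installation.</p>
<pre class="programlisting">
# Try ABBYY first, then fall back to tesseract
ocrprogs = abbyy tesseract
# Keep cached OCR data outside the configuration directory (hypothetical path)
ocrcachedir = ~/.cache/recoll-ocr
# Languages assumed by each OCR module
tesseractlang = eng
abbyylang = English
# The ABBYY directory is usually not in the path (hypothetical location)
abbyycmd = /opt/abbyy/abbyyocr
</pre>
<p>If only tesseract is installed, leaving
<code class="varname">ocrprogs</code> undefined has the same effect as
setting it to "tesseract".</p>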
|
|
</div>
|
|
<div class="sect3">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h4 class="title"><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.SPECLOCATIONS" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.SPECLOCATIONS"></a>Parameters
|
|
set for specific locations</h4>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<div class="variablelist">
|
|
<dl class="variablelist">
|
|
<dt><a name=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.MHMBOXQUIRKS" id=
|
|
"RCL.INSTALL.CONFIG.RECOLLCONF.MHMBOXQUIRKS"></a><span class="term"><code class="varname">mhmboxquirks</code></span></dt>
|
|
<dd>
|
|
<p>Enable thunderbird/mozilla-seamonkey mbox
|
|
format quirks. Set this for the directory where
|
|
the email mbox files are stored.</p>
|
|
</dd>
|
|
</dl>
|
|
</div>
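<p>Location-specific parameters are set inside a section named after
the directory. The following sketch assumes that the mail folders are
stored under <code class="filename">~/.thunderbird</code> and that
<code class="literal">tbird</code> is the quirk name understood by the
mbox handler. Both are assumptions which should be checked against the
default configuration file.</p>
<pre class="programlisting">
# Applies only to files below this directory
[~/.thunderbird]
mhmboxquirks = tbird
</pre>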
|
|
</div>
|
|
</div>
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.INSTALL.CONFIG.FIELDS" id=
|
|
"RCL.INSTALL.CONFIG.FIELDS"></a>5.4.3. The
|
|
fields file</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>This file contains information about dynamic fields
|
|
handling in <span class="application">Recoll</span>. Some
|
|
very basic fields have hard-wired behaviour, and, mostly,
|
|
you should not change the original data inside the
|
|
<code class="filename">fields</code> file. But you can
|
|
create custom fields fitting your data and handle them
|
|
just like they were native ones.</p>
|
|
<p>The <code class="filename">fields</code> file has
|
|
several sections, which each define an aspect of fields
|
|
processing. Quite often, you'll have to modify several
|
|
sections to obtain the desired behaviour.</p>
|
|
<p>We only give a short description here; refer
to the comments inside the default file for more
detailed information.</p>
|
|
<p>Field names should be lowercase alphabetic ASCII.</p>
|
|
<div class="variablelist">
|
|
<dl class="variablelist">
|
|
<dt><span class="term">[prefixes]</span></dt>
|
|
<dd>
|
|
<p>A field becomes indexed (searchable) by having a
prefix defined in this section. A more complete
explanation of what prefixes are, and of the ones used
by a standard Recoll installation, is given in the
comments of the default file. In a nutshell:
extension prefixes should be all caps, begin with
XY, and be short, e.g. XYMFLD.</p>
|
|
</dd>
|
|
<dt><span class="term">[values]</span></dt>
|
|
<dd>
|
|
<p>Fields listed in this section will be stored as
|
|
<span class="application">Xapian</span>
|
|
<code class="literal">values</code> inside the
|
|
index. This makes them available for range queries,
which allow filtering results according to the field
value. This feature currently supports string and
integer data. See the comments in the file for more
detail.</p>
|
|
</dd>
|
|
<dt><span class="term">[stored]</span></dt>
|
|
<dd>
|
|
<p>A field becomes stored (displayable inside
|
|
results) by having its name listed in this section
|
|
(typically with an empty value).</p>
|
|
</dd>
|
|
<dt><span class="term">[aliases]</span></dt>
|
|
<dd>
|
|
<p>This section defines lists of synonyms for the
canonical names used inside the <code class=
"literal">[prefixes]</code> and <code class=
"literal">[stored]</code> sections.</p>
|
|
</dd>
|
|
<dt><span class="term">[queryaliases]</span></dt>
|
|
<dd>
|
|
<p>This section also defines aliases for the
canonical field names, with the difference that the
substitution will only be used at query time,
avoiding any possibility that the alias would
pick up random metadata from documents during indexing.</p>
|
|
</dd>
|
|
<dt><span class="term">handler-specific
|
|
sections</span></dt>
|
|
<dd>
|
|
<p>Some input handlers may need specific
|
|
configuration for handling fields. Only the email
|
|
message handler currently has such a section (named
|
|
<code class="literal">[mail]</code>). It allows
|
|
indexing arbitrary email headers in addition to the
|
|
ones indexed by default. Other such sections may
|
|
appear in the future.</p>
|
|
</dd>
|
|
</dl>
|
|
</div>
|
|
<p>Here follows a small example of a personal
|
|
<code class="filename">fields</code> file. This would
|
|
extract a specific email header and use it as a
|
|
searchable field, with data displayable inside result
|
|
lists. (Side note: as the email handler does no decoding
on the values, only plain ASCII headers can be indexed,
and only the first occurrence will be used for headers
that occur several times.)</p>
|
|
<pre class="programlisting">[prefixes]
# Index mailmytag contents (with the given prefix)
mailmytag = XMTAG

[stored]
# Store mailmytag inside the document data record (so that it can be
# displayed - as %(mailmytag) - in result lists).
mailmytag =

[queryaliases]
filename = fn
containerfilename = cfn

[mail]
# Extract the X-My-Tag mail header, and use it internally with the
# mailmytag field name
x-my-tag = mailmytag
</pre>
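<p>As a further sketch, an <code class="literal">[aliases]</code>
entry could give the new field a shorter synonym. The
<code class="literal">mytag</code> alias name is hypothetical, and the
entry format (canonical name on the left, aliases on the right) is
assumed to be the same as for the
<code class="literal">[queryaliases]</code> entries.</p>
<pre class="programlisting">[aliases]
# Hypothetical alias: treat "mytag" as a synonym for the mailmytag field
mailmytag = mytag
</pre>
<p>After reindexing the affected messages, a query language search
such as <code class="literal">mailmytag:somevalue</code> should then
select messages according to the contents of the header.</p>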
|
|
<div class="sect3">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h4 class="title"><a name=
|
|
"RCL.INSTALL.CONFIG.FIELDS.XATTR" id=
|
|
"RCL.INSTALL.CONFIG.FIELDS.XATTR"></a>Extended
|
|
attributes in the fields file</h4>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p><span class="application">Recoll</span> versions
1.19 and later process user extended file attributes as
document fields by default.</p>
|
|
<p>Attributes are processed as fields of the same name,
|
|
after removing the <code class="literal">user</code>
|
|
prefix on Linux.</p>
|
|
<p>The <code class="literal">[xattrtofields]</code>
section of the <code class="filename">fields</code>
file allows specifying translations from extended
attribute names to <span class=
"application">Recoll</span> field names. An empty
translation disables use of the corresponding attribute
data.</p>
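<p>A minimal sketch of such a section follows. The attribute names
are hypothetical, and they are assumed to be written without the
<code class="literal">user</code> prefix.</p>
<pre class="programlisting">[xattrtofields]
# Hypothetical: use the contents of a "tags" attribute as the keywords field
tags = keywords
# Hypothetical: an empty translation, so the "secret" attribute is ignored
secret =
</pre>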
|
|
</div>
|
|
</div>
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.INSTALL.CONFIG.MIMEMAP" id=
|
|
"RCL.INSTALL.CONFIG.MIMEMAP"></a>5.4.4. The
|
|
mimemap file</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p><code class="filename">mimemap</code> specifies the
|
|
file name extension to MIME type mappings.</p>
|
|
<p>For file names without an extension, or with an
|
|
unknown one, a system command (<span class=
|
|
"command"><strong>file</strong></span> <code class=
|
|
"option">-i</code>, or <span class=
|
|
"command"><strong>xdg-mime</strong></span>) will be
|
|
executed to determine the MIME type (this can be switched
|
|
off, or the command changed inside the main configuration
|
|
file).</p>
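<p>As a sketch, the corresponding settings in
<code class="filename">recoll.conf</code> might look as follows. The
<code class="varname">usesystemfilecommand</code> and
<code class="varname">systemfilecommand</code> parameter names are
given from memory and should be checked against the
<code class="filename">recoll.conf</code> section of this manual.</p>
<pre class="programlisting"># Assumed parameter names, see the lead-in above
# Switch off the last-resort MIME type guessing altogether:
usesystemfilecommand = 0
# Or keep it enabled, and choose the command to be executed:
# systemfilecommand = file -i
</pre>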
|
|
<p>All extension values in <code class=
"filename">mimemap</code> must be entered in lower case.
File name extensions are lower-cased for comparison
during indexing, meaning that an upper case <code class=
"filename">mimemap</code> entry will never be
matched.</p>
|
|
<p>The mappings can be specified on a per-subtree basis,
which may be useful in some cases. For example, <span class=
"application">okular</span> notes have a <code class=
"filename">.xml</code> extension but should be handled
specially, which is possible because they are usually all
located in one place:</p>
|
|
<pre class="programlisting">[~/.kde/share/apps/okular/docdata]
.xml = application/x-okular-notes</pre>
|
|
<p>The <code class="varname">recoll_noindex</code>
|
|
<code class="filename">mimemap</code> variable has been
|
|
moved to <code class="filename">recoll.conf</code> and
|
|
renamed to <code class=
|
|
"varname">noContentSuffixes</code>, while keeping the
|
|
same function, as of <span class=
|
|
"application">Recoll</span> version 1.21. For older
|
|
<span class="application">Recoll</span> versions, see the
|
|
documentation for <code class=
|
|
"varname">noContentSuffixes</code> but use <code class=
|
|
"varname">recoll_noindex</code> in <code class=
|
|
"filename">mimemap</code>.</p>
|
|
</div>
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.INSTALL.CONFIG.MIMECONF" id=
|
|
"RCL.INSTALL.CONFIG.MIMECONF"></a>5.4.5. The
|
|
mimeconf file</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>The main purpose of the <code class=
|
|
"filename">mimeconf</code> file is to specify how the
|
|
different MIME types are handled for indexing. This is
|
|
done in the <code class="literal">[index]</code> section,
|
|
which should not be modified casually. See the comments
|
|
in the file.</p>
|
|
<p>The file also contains other definitions which affect
|
|
the query language and the GUI, and which, in retrospect,
|
|
should have been stored elsewhere.</p>
|
|
<p>The <code class="literal">[icons]</code> section
allows you to change the icons which are displayed by the
<span class="command"><strong>recoll</strong></span> GUI
in the result lists. The values are the basenames of the
<code class="literal">png</code> images inside the
<code class="filename">iconsdir</code> directory (which
is itself defined in <code class=
"filename">recoll.conf</code>).</p>
|
|
<p>The <code class="literal">[categories]</code> section
|
|
defines the groupings of MIME types into <code class=
|
|
"literal">categories</code> as used when adding an
|
|
<code class="literal">rclcat</code> clause to a <a class=
|
|
"link" href="#RCL.SEARCH.LANG" title=
|
|
"3.5. The query language">query language</a> query.
|
|
<code class="literal">rclcat</code> clauses are also used
|
|
by the default <code class="literal">guifilters</code>
|
|
buttons in the GUI (see next).</p>
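<p>As an illustration, a new category could be declared as sketched
below. The <code class="literal">blobs</code> category name and both
MIME types are hypothetical, and the entry format (a category name
followed by a list of MIME types) is assumed from the layout of the
default file.</p>
<pre class="programlisting">[categories]
# Hypothetical category grouping two made-up MIME types
blobs = application/x-blobapp application/x-otherblob
</pre>
<p>A query language clause such as
<code class="literal">rclcat:blobs</code> would then be expected to
restrict results to documents of these types.</p>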
|
|
<p>The filter controls appear at the top of the
<span class="command"><strong>recoll</strong></span> GUI,
either as checkboxes just above the result list, or as a
drop-down list in the tool area.</p>
|
|
<p>By default, they are labeled: <code class=
|
|
"literal">media</code>, <code class=
|
|
"literal">message</code>, <code class=
|
|
"literal">other</code>, <code class=
|
|
"literal">presentation</code>, <code class=
|
|
"literal">spreadsheet</code> and <code class=
|
|
"literal">text</code>, and each maps to a document
|
|
category. This is determined in the <code class=
|
|
"literal">[guifilters]</code> section, where each control
|
|
is defined by a variable naming a query language
|
|
fragment.</p>
|
|
<p>A simple example will hopefully make things
|
|
clearer.</p>
|
|
<pre class="programlisting">[guifilters]

Big Books = dir:"~/My Books" size&gt;10K
My Docs = dir:"~/My Documents"
Small Books = dir:"~/My Books" size&lt;10K
System Docs = dir:/usr/share/doc
</pre>
|
|
<p>The above definition would create four filter
|
|
checkboxes, labelled <code class="literal">Big
|
|
Books</code>, <code class="literal">My Docs</code>,
|
|
etc.</p>
|
|
<p>The text after the equal sign must be a valid query
|
|
language fragment, and, when the button is checked, it
|
|
will be combined with the rest of the query with an AND
|
|
conjunction.</p>
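<p>Any query language fragment can be used on the right-hand side,
not only <code class="literal">dir:</code> clauses. The following
sketch uses hypothetical labels, and assumes the
<code class="literal">mime:</code> field specifier of the query
language.</p>
<pre class="programlisting">[guifilters]
# Hypothetical labels; the values are query language fragments
PDF = mime:application/pdf
My Spreadsheets = dir:"~/My Documents" rclcat:spreadsheet
</pre>
<p>With <code class="literal">PDF</code> checked, a search for
<code class="literal">budget</code> should then behave roughly like
the query <code class="literal">budget mime:application/pdf</code>.</p>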
|
|
<p>Any name text before a colon character will be erased
in the display, but used for sorting. You can use this to
display the checkboxes in any order you like. For
example, the following would do exactly the same as
above, but with the checkboxes displayed in the reverse
order.</p>
|
|
<pre class="programlisting">[guifilters]

d:Big Books = dir:"~/My Books" size&gt;10K
c:My Docs = dir:"~/My Documents"
b:Small Books = dir:"~/My Books" size&lt;10K
a:System Docs = dir:/usr/share/doc
</pre>
|
|
<p>As you may have guessed, the default <code class=
"literal">[guifilters]</code> section looks like:</p>
|
|
<pre class="programlisting">[guifilters]
text = rclcat:text
spreadsheet = rclcat:spreadsheet
presentation = rclcat:presentation
media = rclcat:media
message = rclcat:message
other = rclcat:other
</pre>
|
|
</div>
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.INSTALL.CONFIG.MIMEVIEW" id=
|
|
"RCL.INSTALL.CONFIG.MIMEVIEW"></a>5.4.6. The
|
|
mimeview file</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p><code class="filename">mimeview</code> specifies which
programs are started when you click on an <span class=
"guilabel">Open</span> link in a result list. For example, HTML is
normally displayed using <span class=
"application">firefox</span>, but you may prefer
<span class="application">Konqueror</span>, or your
<span class="application">openoffice.org</span> program
might be named <span class=
"command"><strong>ooffice</strong></span> instead of
<span class="command"><strong>openoffice</strong></span>.</p>
|
|
<p>Changes to this file can be done by direct editing, or
|
|
through the <span class=
|
|
"command"><strong>recoll</strong></span> GUI preferences
|
|
dialog.</p>
|
|
<p>If <span class="guilabel">Use desktop preferences to
|
|
choose document editor</span> is checked in the
|
|
<span class="application">Recoll</span> GUI preferences,
|
|
all <code class="filename">mimeview</code> entries will
|
|
be ignored except the one labelled <code class=
|
|
"literal">application/x-all</code> (which is set to use
|
|
<span class="command"><strong>xdg-open</strong></span> by
|
|
default).</p>
|
|
<p>In this case, the <code class=
|
|
"literal">xallexcepts</code> top level variable defines a
|
|
list of MIME type exceptions which will be processed
|
|
according to the local entries instead of being passed to
|
|
the desktop. This is so that specific <span class=
|
|
"application">Recoll</span> options such as a page number
|
|
or a search string can be passed to applications that
|
|
support them, such as the <span class=
|
|
"application">evince</span> viewer.</p>
|
|
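<p>A minimal sketch of such an exception list follows. The selection
of MIME types is only an example, and the format is assumed to be a
space-separated list.</p>
<pre class="programlisting"># Top level of mimeview, outside of the [view] section
# Example selection: keep using the local entries for these types
xallexcepts = application/pdf application/postscript application/x-dvi
</pre>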
<p>As for the other configuration files, the normal usage
|
|
is to have a <code class="filename">mimeview</code>
|
|
inside your own configuration directory, with just the
|
|
non-default entries, which will override those from the
|
|
central configuration file.</p>
|
|
<p>All viewer definition entries must be placed under a
|
|
<code class="literal">[view]</code> section.</p>
|
|
<p>The keys in the file are normally MIME types. You can
add an application tag to specialize the choice for an
area of the filesystem (using a <code class=
"varname">localfields</code> specification in
<code class="filename">recoll.conf</code>). The syntax for
the key is <em class=
"replaceable"><code>mimetype</code></em><code class=
"literal">|</code><em class=
"replaceable"><code>tag</code></em>.</p>
|
|
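<p>For illustration only, a tagged entry could look like the
following. Both the <code class="literal">mytag</code> tag and the
<code class="literal">mypdfviewer</code> command are hypothetical.</p>
<pre class="programlisting">[view]
# Hypothetical tag and viewer: only used for PDF documents carrying the tag
application/pdf|mytag = mypdfviewer %f
</pre>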
<p>The <code class="varname">nouncompforviewmts</code>
entry (placed at the top level, outside of the
<code class="literal">[view]</code> section) holds a
list of MIME types that should not be uncompressed before
starting the viewer (if they are found compressed, e.g.
<em class=
"replaceable"><code>mydoc.doc.gz</code></em>).</p>
|
|
<p>The right side of each assignment holds a command to
|
|
be executed for opening the file. The following
|
|
substitutions are performed:</p>
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style=
|
|
"list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p><b>%D. </b>Document date</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><b>%f. </b>File name. This may be the name
|
|
of a temporary file if it was necessary to create
one (e.g. to extract a subdocument from a
container).</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><b>%i. </b>Internal path, for subdocuments
|
|
of containers. The format depends on the container
|
|
type. If this appears in the command line,
|
|
<span class="application">Recoll</span> will not
|
|
create a temporary file to extract the subdocument,
|
|
expecting the called application (possibly a
|
|
script) to be able to handle it.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><b>%M. </b>MIME type</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><b>%p. </b>Page index. Only significant for
|
|
a subset of document types, currently only PDF,
|
|
Postscript and DVI files. Can be used to start the
|
|
editor at the right page for a match or
|
|
snippet.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><b>%s. </b>Search term. The value will only
|
|
be set for documents with indexed page numbers (e.g.
PDF). The value will be one of the matched search
terms. This allows pre-setting the value in the
"Find" entry inside Evince, for example, for easy
highlighting of the term.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p><b>%u. </b>URL.</p>
|
|
</li>
|
|
</ul>
|
|
</div>
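<p>The following sketch shows how some of these substitutions could
be combined. It assumes that the installed
<span class="application">evince</span> accepts the
<code class="option">--page-index</code> and
<code class="option">--find</code> options.</p>
<pre class="programlisting">[view]
# Open the PDF at the page of the match, with the search term preset
application/pdf = evince --page-index=%p --find=%s %f
</pre>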
|
|
<p>In addition to the predefined values above, all
|
|
strings like <code class="literal">%(fieldname)</code>
|
|
will be replaced by the value of the field named
|
|
<code class="literal">fieldname</code> for the document.
|
|
This could be used in combination with field
|
|
customisation to help with opening the document.</p>
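<p>For instance, a viewer entry could pass a field value to the
application. This is only a sketch:
<code class="literal">blobviewer</code> and its
<code class="option">--title</code> option are hypothetical, and the
<code class="literal">title</code> field is assumed to be set for the
document.</p>
<pre class="programlisting">[view]
# Hypothetical viewer and option: pass the title field of the document
application/x-blobapp = blobviewer --title=%(title) %f
</pre>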
|
|
</div>
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.INSTALL.CONFIG.PTRANS" id=
|
|
"RCL.INSTALL.CONFIG.PTRANS"></a>5.4.7. The
|
|
<code class="filename">ptrans</code> file</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p><code class="filename">ptrans</code> specifies
|
|
query-time path translations. These can be useful in
|
|
<a class="link" href="#RCL.SEARCH.PTRANS" title=
|
|
"3.8. Path translations">multiple cases</a>.</p>
|
|
<p>The file has a section for any index which needs
|
|
translations, either the main one or additional query
|
|
indexes. The sections are named with the <span class=
|
|
"application">Xapian</span> index directory names. No
|
|
slash character should exist at the end of the paths (all
|
|
comparisons are textual). An example should make things
sufficiently clear:</p>
|
|
<pre class="programlisting">
[/home/me/.recoll/xapiandb]
/this/directory/moved = /to/this/place

[/path/to/additional/xapiandb]
/server/volume1/docdir = /net/server/volume1/docdir
/server/volume2/docdir = /net/server/volume2/docdir
</pre>
|
|
</div>
|
|
<div class="sect2">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h3 class="title"><a name=
|
|
"RCL.INSTALL.CONFIG.EXAMPLES" id=
|
|
"RCL.INSTALL.CONFIG.EXAMPLES"></a>5.4.8. Examples
|
|
of configuration adjustments</h3>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<div class="sect3">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h4 class="title"><a name=
|
|
"RCL.INSTALL.CONFIG.EXAMPLES.ADDVIEW" id=
|
|
"RCL.INSTALL.CONFIG.EXAMPLES.ADDVIEW"></a>Adding
|
|
an external viewer for a non-indexed type</h4>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>Imagine that you have some kind of file which does
|
|
not have indexable content, but for which you would
|
|
like to have a functional <span class=
|
|
"guilabel">Open</span> link in the result list (when
|
|
found by file name). The file names end in <em class=
|
|
"replaceable"><code>.blob</code></em> and can be
|
|
displayed by application <em class=
|
|
"replaceable"><code>blobviewer</code></em>.</p>
|
|
<p>You need two entries in the configuration files for
|
|
this to work:</p>
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style=
|
|
"list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p>In <code class=
|
|
"filename">$RECOLL_CONFDIR/mimemap</code>
|
|
(typically <code class=
|
|
"filename">~/.recoll/mimemap</code>), add the
|
|
following line:</p>
|
|
<pre class="programlisting">
.blob = application/x-blobapp
</pre>
|
|
<p>Note that the MIME type is made up here, and
|
|
you could call it <em class=
|
|
"replaceable"><code>diesel/oil</code></em> just
|
|
the same.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>In <code class=
|
|
"filename">$RECOLL_CONFDIR/mimeview</code> under
|
|
the <code class="literal">[view]</code> section,
|
|
add:</p>
|
|
<pre class="programlisting">
application/x-blobapp = blobviewer %f
</pre>
|
|
<p>We are assuming here that <em class=
"replaceable"><code>blobviewer</code></em> wants
a file name parameter; you would use
<code class="literal">%u</code> if it preferred
URLs.</p>
|
|
</li>
|
|
</ul>
|
|
</div>
|
|
<p>If you just wanted to change the application used by
|
|
<span class="application">Recoll</span> to display a
|
|
MIME type which it already knows, you would just need
|
|
to edit <code class="filename">mimeview</code>. The
|
|
entries you add in your personal file override those in
|
|
the central configuration, which you do not need to
|
|
alter. <code class="filename">mimeview</code> can also
be modified from the GUI.</p>
|
|
</div>
|
|
<div class="sect3">
|
|
<div class="titlepage">
|
|
<div>
|
|
<div>
|
|
<h4 class="title"><a name=
|
|
"RCL.INSTALL.CONFIG.EXAMPLES.ADDINDEX" id=
|
|
"RCL.INSTALL.CONFIG.EXAMPLES.ADDINDEX"></a>Adding
|
|
indexing support for a new file type</h4>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<p>Let us now imagine that the above <em class=
|
|
"replaceable"><code>.blob</code></em> files actually
|
|
contain indexable text and that you know how to extract
|
|
it with a command line program. Getting <span class=
|
|
"application">Recoll</span> to index the files is easy.
|
|
You need to perform the above alteration, and also to
|
|
add data to the <code class="filename">mimeconf</code>
|
|
file (typically in <code class=
|
|
"filename">~/.recoll/mimeconf</code>):</p>
|
|
<div class="itemizedlist">
|
|
<ul class="itemizedlist" style=
|
|
"list-style-type: disc;">
|
|
<li class="listitem">
|
|
<p>Under the <code class="literal">[index]</code>
|
|
section, add the following line (more about the
|
|
<em class="replaceable"><code>rclblob</code></em>
|
|
indexing script later):</p>
|
|
<pre class="programlisting">
application/x-blobapp = exec rclblob</pre>
|
|
<p>Or if the files are mostly text and you don't
|
|
need to process them for indexing:</p>
|
|
<pre class="programlisting">
application/x-blobapp = internal text/plain</pre>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>Under the <code class="literal">[icons]</code>
|
|
section, you should choose an icon to be
|
|
displayed for the files inside the result lists.
|
|
Icons are normally 64x64 pixel PNG files which
|
|
live in <code class=
|
|
"filename">/usr/share/recoll/images</code>.</p>
|
|
</li>
|
|
<li class="listitem">
|
|
<p>Under the <code class=
|
|
"literal">[categories]</code> section, you should
|
|
add the MIME type where it makes sense (you can
|
|
also create a category). Categories may be used
|
|
for filtering in advanced search.</p>
|
|
</li>
|
|
</ul>
|
|
</div>
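<p>Put together, the first two additions to a personal
<code class="filename">mimeconf</code> might look like the following
sketch. The <code class="literal">myicon</code> icon name is
hypothetical and stands for the basename of a PNG file in the icons
directory. The optional <code class="literal">[categories]</code>
entry is left out here.</p>
<pre class="programlisting">[index]
# Run the rclblob handler to extract text from the .blob files
application/x-blobapp = exec rclblob

[icons]
# Hypothetical icon name, for a file named myicon.png
application/x-blobapp = myicon
</pre>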
|
|
<p>The <em class=
|
|
"replaceable"><code>rclblob</code></em> handler should
|
|
be an executable program or script which exists inside
|
|
<code class=
|
|
"filename">/usr/share/recoll/filters</code>. It will be
|
|
given a file name as argument and should output the
text or HTML contents on the standard output.</p>
|
|
<p>The <a class="link" href="#RCL.PROGRAM.FILTERS"
|
|
title=
|
|
"4.1. Writing a document input handler">filter
|
|
programming</a> section describes in more detail how to
|
|
write an input handler.</p>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
</body>
|
|
</html>
|