Periodic (or
+ title="2.10. Periodic indexing">Periodic (or
batch) indexing . recollindex is
executed at discrete times. On
Real
+ "2.11. Unix-like systems: real time indexing">Real
time indexing . (Only available on
Unix-like
systems).
indexing on a small home directory), or, with
Recoll 1.24 and newer,
by configuring
+ "2.11. Unix-like systems: real time indexing">configuring
the index so that only a subset of the tree will be
monitored. The choice of method and the parameters used can be
@@ -1136,8 +1135,8 @@ alink="#0000FF">
different areas of the file system to different
indexes. For example, if you were to issue the
following command: Then Recoll would
use configuration files stored in If both tesseract and
- pdftoppm
- (generally from the poppler-utils package) are
- installed, the PDF handler may attempt OCR on PDF files
- with no text content. This is controlled by the pdfocr
- configuration variable, which is false by default because
- OCR is very slow. The choice of language is very important for
- successfull OCR. Recoll has currently no way to determine
- this from the document itself. You can set the language
- to use through the contents of a The PDF handler can execute an external program to run
+ OCR if no text is found in the document. This is now
+ described in a separate section. This is new in Recoll
+ 1.26.5. Older versions had a more limited, non-caching
+ capability to execute an external OCR program in the PDF
+ handler. The new function has the following features: The OCR output is cached, stored as separate
+ files. The caching is ultimately based on a hash
+ value of the original file contents, so that it is
+ immune to file renames. A first path-based layer
+ ensures fast operation for unchanged (unmoved) files,
+ and the data hash (which is still orders of magnitude
+ faster than OCR) is only re-computed if the file has
+ moved. OCR is only performed if the file was not
+ previously processed or if it changed. The support for a specific program is implemented
+ in a simple Python module. It should be
+ straightforward to add support for any OCR engine
+ with a capability to run from the command line. Modules initially exist for tesseract (Linux and Windows),
+ and ABBYY FineReader
+ (Linux, tested with version 11). ABBYY FineReader is
+ a commercial closed-source program, but it sometimes
+ performs better than tesseract. The OCR is currently only called from the PDF
+ handler, but there should be no problem using it for
+ other image types. Configuration. See the relevant section. All
+ parameters can be localized in subdirectories through the
+ usual main configuration mechanism (path sections). Another environment variable, A link target defined as Languages for which to create stemming
expansion data. Stemmer names can be found by
executing 'recollindex -l', or this can also be
- set from a list in the GUI. "nice" process priority for the indexing
+ processes. Default: 19 (lowest). Appeared with
+ 1.26.5. Prior versions were fixed at 19. ionice class for the real time indexing
- process On platforms where this is supported. The
- default value is 3. ionice class for the indexing process. Despite
+ the misleading name, and on platforms where this
+ is supported, this affects all indexing
+ processes, not only the real time/monitoring
+ ones. The default value is 3 (use lowest "Idle"
+ priority). ionice class parameter for the real time
- indexing process. On platforms where this is
- supported. The default is empty. ionice class level parameter if the class
+ supports it. The default is empty, as the default
+ "Idle" class has no levels. Attempt OCR of PDF files with no text content
- if both tesseract and pdftoppm are installed.
+ Attempt OCR of PDF files with no text content.
This can be defined in subdirectories. The
- default is off because OCR is so very slow. Language to assume for PDF OCR. This is very
- important for having a reasonable rate of errors
- with tesseract. This can also be set through a
- configuration variable or directory-local
- parameters. See the rclpdf.py script. OCR modules to try. The top OCR script will
+ try to load the corresponding modules in order
+ and use the first which reports being capable of
+ performing OCR on the input file. Modules for
+ tesseract and ABBYY FineReader are present in the
+ standard distribution. Location for caching OCR data. The default if
+ this is empty or undefined is to store the cached
+ OCR data under $RECOLL_CONFDIR/ocrcache. Language to assume for tesseract OCR.
+ Important for improving the OCR accuracy. This
+ can also be set through the contents of a file in
+ the currently processed directory. See the
+ rclocrtesseract.py script. Example values: eng,
+ fra... See the tesseract documentation. Path for the tesseract command. This is mostly
+ useful on Windows, or for specifying a
+ non-default tesseract command. E.g. on Windows:
+ C:/Program Files (x86)/Tesseract-OCR/tesseract.exe Language to assume for abbyy OCR. Important
+ for improving the OCR accuracy. This can also be
+ set through the contents of a file in the
+ currently processed directory. See the
+ rclocrabbyy.py script. Typical values: English,
+ French... See the ABBYY documentation. Path for the abbyy command. The ABBYY directory
+ is usually not in the path, so you should set
+ this. The
- recoll -c ~/.indexes-email
+ recoll -c ~/.indexes-email
~/.indexes-email/ and, (unless
@@ -2141,45 +2140,16 @@ metadatacmds = ;
- .ocrpdflang text file in the same
- directory as the PDF document, or through the
- RECOLL_TESSERACT_LANG
- environment variable, or through the contents of an
- ocrpdf text file inside the
- configuration directory. If none of the above are used,
- Recoll will try to guess
- the language from the NLS environment.
+
+
- export RECOLL_EXTRA_DBS=/some/place/xapiandb:/some/other/db
+ export RECOLL_EXTRA_DBS=/some/place/xapiandb:/some/other/db
RECOLL_ACTIVE_EXTRA_DBS allows adding to
the active list of indexes. This variable was suggested
@@ -4565,8 +4589,8 @@ fs.inotify.max_user_watches=32768
parent folder expansion, usually creating a file
manager window on the folder where the container file
resides. E.g.:
- <a href="F%N">%P</a>
+ <a href="F%N">%P</a>
R%N|
@@ -4708,8 +4732,8 @@ fs.inotify.max_user_watches=32768
javascript program to
the documents, like the following example, which would
initiate a search by double-clicking any term:scriptname
- <script language="JavaScript">
+
<script language="JavaScript">
function recollsearch() {
var t = document.getSelection();
window.location.href = 'recoll://search/query?qtp=a&p=0&q=' +
@@ -8838,7 +8862,8 @@ for i in range(nres):
idxnicepriomonioniceclassmonioniceclassdatapdfocr
pdfocrlang
+
+ ocrprogsocrcachedirtesseractlangtesseractcmdabbyylangabbyycmd
- [~/.kde/share/apps/okular/docdata]
+
[~/.kde/share/apps/okular/docdata]
.xml = application/x-okular-notes
recoll_noindex
mimemap variable has been
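
Reviewer sketch (not part of the patch): the OCR caching described in the hunk above, with a path-based layer in front of a content-hash layer, could look roughly like the Python below. This is a minimal illustration only, not Recoll's actual implementation. The class name `OcrCache`, the in-memory dictionaries, the use of SHA-256, and the (mtime, size) signature used to detect an unchanged path are all assumptions made for this example; the manual text only states that the cache is ultimately keyed on a hash of the file contents and that a path-based layer avoids rehashing unchanged files.

```python
import hashlib
import os


class OcrCache:
    """Hypothetical two-layer cache: path layer -> content hash -> OCR text."""

    def __init__(self):
        self.by_path = {}   # path -> ((mtime_ns, size), content hash)
        self.by_hash = {}   # content hash -> cached OCR text
        self.ocr_runs = 0   # how many times the slow OCR step actually ran

    @staticmethod
    def _signature(path):
        # Cheap check used by the path layer (an assumption of this sketch).
        st = os.stat(path)
        return (st.st_mtime_ns, st.st_size)

    @staticmethod
    def _content_hash(path):
        # Hash of the file contents: slower than stat(), far faster than OCR.
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
        return h.hexdigest()

    def get_text(self, path, run_ocr):
        """Return OCR text for path, calling run_ocr(path) only when needed."""
        sig = self._signature(path)
        entry = self.by_path.get(path)
        if entry is not None and entry[0] == sig:
            digest = entry[1]            # fast path: same path, unchanged file
        else:
            digest = self._content_hash(path)   # moved, new, or modified file
            self.by_path[path] = (sig, digest)
        if digest not in self.by_hash:
            self.ocr_runs += 1           # contents never seen: run the OCR
            self.by_hash[digest] = run_ocr(path)
        return self.by_hash[digest]
```

With this layout, renaming a file causes one rehash but no new OCR run, because the content hash still matches an existing entry. Only new or modified contents reach `run_ocr`.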
diff --git a/src/doc/user/usermanual.xml b/src/doc/user/usermanual.xml
index 8723ad59..5533e342 100644
--- a/src/doc/user/usermanual.xml
+++ b/src/doc/user/usermanual.xml
@@ -1414,30 +1414,9 @@ metadatacmds = ;