From 17d29774b0dbdb371ef06905307bd384fa52f7e5 Mon Sep 17 00:00:00 2001 From: Jean-Francois Dockes Date: Thu, 27 Feb 2020 18:17:51 +0100 Subject: [PATCH] document the new ocr function and its config --- src/doc/user/recoll.conf.xml | 72 +++++++--- src/doc/user/usermanual.html | 255 ++++++++++++++++++++++++----------- src/doc/user/usermanual.xml | 68 ++++++---- src/sampleconf/recoll.conf | 77 ++++++++--- 4 files changed, 338 insertions(+), 134 deletions(-) diff --git a/src/doc/user/recoll.conf.xml b/src/doc/user/recoll.conf.xml index b442218d..30bc4eef 100644 --- a/src/doc/user/recoll.conf.xml +++ b/src/doc/user/recoll.conf.xml @@ -247,8 +247,8 @@ will reduce the index size. This can only be set for a whole index, not for a subtree. dehyphenate -Determines if we index -'coworker' also when the input is 'co-worker'. This is new +Determines if we index 'coworker' +also when the input is 'co-worker'. This is new in version 1.22, and on by default. Setting the variable to off allows restoring the previous behaviour. @@ -279,7 +279,8 @@ as large. indexstemminglanguages Languages for which to create stemming expansion data. Stemmer names can be found by executing 'recollindex --l', or this can also be set from a list in the GUI. +-l', or this can also be set from a list in the GUI. The values are full +language names, e.g. english, french... defaultcharset Default character @@ -608,9 +609,9 @@ space issues. aspellLanguage Language definitions to use when creating the aspell dictionary. The value must match a set of aspell language -definition files. You can type "aspell dicts" to see a list The default -if this is not set is to use the NLS environment to guess the -value. +definition files. You can type "aspell dicts" to see a list The default +if this is not set is to use the NLS environment to guess the value. The +values are the 2-letter language codes (e.g. 'en', 'fr'...) 
aspellAddCreateParam Additional option and parameter to aspell dictionary creation @@ -650,14 +651,20 @@ patterns are matched with fnmatch(pattern, path, 0) You can quote entries containing white space with double quotes (quote the whole entry, not the pattern). The default is empty. Example: mondelaypatterns = *.log:20 "*with spaces.*:30" + +idxniceprio +"nice" process priority for the indexing processes. Default: 19 +(lowest) Appeared with 1.26.5. Prior versions were fixed at 19. monioniceclass -ionice class for the real time indexing process On platforms where this is supported. The default value is -3. +ionice class for the indexing process. Despite the misleading name, and on platforms where this is +supported, this affects all indexing processes, +not only the real time/monitoring ones. The default value is 3 (use +lowest "Idle" priority). monioniceclassdata -ionice class parameter for the real time indexing process. On platforms where this is supported. The default is -empty. +ionice class level parameter if the class supports it. The default is empty, as the default "Idle" class has no +levels. Query-time parameters (no impact on the index) @@ -700,14 +707,8 @@ with possibly meaning-altering missing words. Parameters for the PDF input script pdfocr -Attempt OCR of PDF files with no text content if both tesseract and -pdftoppm are installed. This can be defined in subdirectories. The default is off because -OCR is so very slow. - -pdfocrlang -Language to assume for PDF OCR. This is very important for having a reasonable rate of errors -with tesseract. This can also be set through a configuration variable -or directory-local parameters. See the rclpdf.py script. +Attempt OCR of PDF files with no text content. This can be defined in subdirectories. The default is off because +OCR is so very slow. Will only do anything if ocrprogs is defined. 
pdfattach Enable PDF attachment extraction by executing pdftk (if @@ -732,6 +733,41 @@ selected field, for editing or erasing. A new instance is created for each document, so that the object can keep state for, e.g. eliminating duplicate values. + +Parameters for OCR processing + +ocrprogs +OCR modules to try. The top OCR script will try to load the corresponding modules in +order and use the first which reports being capable of performing OCR on +the input file. Modules for tesseract and ABBYY FineReader are present in +the standard distribution. + +ocrcachedir +Location for caching OCR data. The default if this is empty or undefined is to store the cached +OCR data under $RECOLL_CONFDIR/ocrcache. + +tesseractlang +Language to assume for tesseract OCR. Important for improving the OCR accuracy. This can also be set +through the contents of a file in +the currently processed directory. See the rclocrtesseract.py +script. Example values: eng, fra... See the tesseract documentation. + +tesseractcmd +Path for the tesseract command. This is mostly useful on Windows, or for specifying a non-default +tesseract command. e.g. on Windows: +C:/Program Files (x86)/Tesseract-OCR/tesseract.exe + +abbyylang +Language to assume for abbyy OCR. Important for improving the OCR accuracy. This can also be set +through the contents of a file in +the currently processed directory. See the rclocrabbyy.py +script. Typical values: English, French... See the ABBYY documentation. + + +abbyycmd +Path for the abbyy command The ABBY directory is usually not in the path, so you should set this. + + Parameters set for specific locations diff --git a/src/doc/user/usermanual.html b/src/doc/user/usermanual.html index fbea9846..9f038bf2 100644 --- a/src/doc/user/usermanual.html +++ b/src/doc/user/usermanual.html @@ -3,7 +3,7 @@ + "HTML Tidy for HTML5 for Linux version 5.6.0"> Recoll user manual @@ -157,20 +157,19 @@ alink="#0000FF">
2.8.1. OCR with - Tesseract
-
2.8.2. XMP fields extraction
-
2.8.3. 2.8.2. PDF attachment indexing
2.9. Recoll and OCR
+
2.10. Periodic indexing
-
2.10. 2.11. Unix-like systems: real time indexing
@@ -781,7 +780,7 @@ alink="#0000FF"> "list-style-type: disc;">
  • Periodic (or + title="2.10. Periodic indexing">Periodic (or batch) indexingrecollindex is executed at discrete times. On

  • Real + "2.11. Unix-like systems: real time indexing">Real time indexing(Only available on Unix-like systems). indexing on a small home directory), or, with Recoll 1.24 and newer, by configuring + "2.11. Unix-like systems: real time indexing">configuring the index so that only a subset of the tree will be monitored.

    The choice of method and the parameters used can be @@ -1136,8 +1135,8 @@ alink="#0000FF"> different areas of the file system to different indexes. For example, if you were to issue the following command:

    -
    -              recoll -c ~/.indexes-email
    +
    recoll -c ~/.indexes-email

    Then Recoll would use configuration files stored in ~/.indexes-email/ and, (unless @@ -2141,45 +2140,16 @@ metadatacmds = ; -

    -
    -
    -
    -

    2.8.1. OCR with - Tesseract

    -
    -
    -
    -

    If both tesseract and - pdftoppm - (generally from the poppler-utils package) are - installed, the PDF handler may attempt OCR on PDF files - with no text content. This is controlled by the pdfocr - configuration variable, which is false by default because - OCR is very slow.

    -

    The choice of language is very important for - successfull OCR. Recoll has currently no way to determine - this from the document itself. You can set the language - to use through the contents of a .ocrpdflang text file in the same - directory as the PDF document, or through the - RECOLL_TESSERACT_LANG - environment variable, or through the contents of an - ocrpdf text file inside the - configuration directory. If none of the above are used, - Recoll will try to guess - the language from the NLS environment.

    -
    +

    The PDF handler can execute an external program to run + OCR if no text is found in the document. This is now + described in a separate section.

    2.8.2. XMP + id="RCL.INDEXING.PDF.XMP">2.8.1. XMP fields extraction

    @@ -2236,7 +2206,7 @@ metadatacmds = ;

    2.8.3. PDF + id="RCL.INDEXING.PDF.ATTACH">2.8.2. PDF attachment indexing

    @@ -2252,13 +2222,67 @@ metadatacmds = ;
    +
    +
    +
    +
    +

    2.9. Recoll and OCR

    +
    +
    +
    +

    This is new in Recoll + 1.26.5. Older versions had a more limited, non-caching + capability to execute an external OCR program in the PDF + handler. The new function has the following features:

    +
    +
      +
    • +

      The OCR output is cached, stored as separate + files. The caching is ultimately based on a hash + value of the original file contents, so that it is + immune to file renames. A first path-based layer + ensures fast operation for unchanged (unmoved) files, + and the data hash (which is still orders of magnitude + faster than OCR) is only re-computed if the file has + moved. OCR is only performed if the file was not + previously processed or if it changed.

      +
    • +
    • +

      The support for a specific program is implemented + in a simple Python module. It should be + straightforward to add support for any OCR engine + with a capability to run from the command line.

      +
    • +
    • +

      Modules initially exist for tesseract (Linux and Windows), + and ABBYY FineReader + (Linux, tested with version 11). ABBYY FineReader is + a commercial closed source program, but it sometimes + performs better than tesseract.

      +
    • +
    • +

      The OCR is currently only called from the PDF + handler, but there should be no problem using it for + other image types.

      +
    • +
    +
    +

    Configuration. See the relevant section. All + parameters can be localized in subdirectories through the + usual main configuration mechanism (path sections).
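
The two-layer caching described in the feature list above can be illustrated with a minimal Python sketch. This is not Recoll's actual implementation: the class name `OcrCache`, the in-memory dictionaries (the text says the real cache is stored as separate files), the SHA-256 hash and the use of modification time and size to detect an unchanged path are all assumptions made for illustration.

```python
import hashlib
import os


class OcrCache:
    """Hypothetical two-layer OCR cache: path layer first, content hash second."""

    def __init__(self, run_ocr):
        self.run_ocr = run_ocr   # callable(path) -> text: the slow OCR step
        self.by_path = {}        # path -> (mtime_ns, size, content hash)
        self.by_hash = {}        # content hash -> cached OCR text

    @staticmethod
    def _hash(path):
        # Hash the file contents in chunks: much cheaper than running OCR.
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def get(self, path):
        st = os.stat(path)
        entry = self.by_path.get(path)
        if entry and entry[:2] == (st.st_mtime_ns, st.st_size):
            # Fast path: same path, apparently unchanged (assumed criterion).
            digest = entry[2]
        else:
            # New, moved or modified file: recompute the content hash.
            digest = self._hash(path)
            self.by_path[path] = (st.st_mtime_ns, st.st_size, digest)
        if digest not in self.by_hash:
            # Content never seen before: only now pay for the OCR.
            self.by_hash[digest] = self.run_ocr(path)
        return self.by_hash[digest]
```

In this sketch, a renamed file misses the path layer but hits the hash layer, so the OCR callable is not invoked again. It only runs when the contents are new.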

    +

    2.9. Periodic + "RCL.INDEXING.PERIODIC">2.10. Periodic indexing

    @@ -2431,7 +2455,7 @@ metadatacmds = ;

    2.10. 2.11. Unix-like systems: real time indexing

    @@ -3759,8 +3783,8 @@ fs.inotify.max_user_watches=32768 that every user does not have to do it. The variable should define a colon-separated list of index directories, ie:

    -
    -          export RECOLL_EXTRA_DBS=/some/place/xapiandb:/some/other/db
    +
    export RECOLL_EXTRA_DBS=/some/place/xapiandb:/some/other/db

    Another environment variable, RECOLL_ACTIVE_EXTRA_DBS allows adding to the active list of indexes. This variable was suggested @@ -4565,8 +4589,8 @@ fs.inotify.max_user_watches=32768 parent folder expansion, usually creating a file manager window on the folder where the container file resides. E.g.:

    -
    -              <a href="F%N">%P</a>
    +
    <a href="F%N">%P</a>

    A link target defined as R%N|scriptname @@ -4708,8 +4732,8 @@ fs.inotify.max_user_watches=32768 javascript program to the documents, like the following example, which would initiate a search by double-clicking any term:

    -
    -          <script language="JavaScript">
    +          
    <script language="JavaScript">
             function recollsearch() {
             var t = document.getSelection();
             window.location.href = 'recoll://search/query?qtp=a&p=0&q=' +
    @@ -8838,7 +8862,8 @@ for i in range(nres):
                       

    Languages for which to create stemming expansion data. Stemmer names can be found by executing 'recollindex -l', or this can also be - set from a list in the GUI.

    + set from a list in the GUI. The values are full + language names, e.g. english, french...

    + the value. The values are the 2-letter language + codes (e.g. 'en', 'fr'...)

    idxniceprio
    +
    +

    "nice" process priority for the indexing + processes. Default: 19 (lowest) Appeared with + 1.26.5. Prior versions were fixed at 19.

    +
    +
    monioniceclass
    -

    ionice class for the real time indexing - process On platforms where this is supported. The - default value is 3.

    +

    ionice class for the indexing process. Despite + the misleading name, and on platforms where this + is supported, this affects all indexing + processes, not only the real time/monitoring + ones. The default value is 3 (use lowest "Idle" + priority).

    monioniceclassdata
    -

    ionice class parameter for the real time - indexing process. On platforms where this is - supported. The default is empty.

    +

    ionice class level parameter if the class + supports it. The default is empty, as the default + "Idle" class has no levels.

    @@ -9611,20 +9648,10 @@ for i in range(nres): id= "RCL.INSTALL.CONFIG.RECOLLCONF.PDFOCR">pdfocr
    -

    Attempt OCR of PDF files with no text content - if both tesseract and pdftoppm are installed. +

    Attempt OCR of PDF files with no text content. This can be defined in subdirectories. The - default is off because OCR is so very slow.

    -
    -
    pdfocrlang
    -
    -

    Language to assume for PDF OCR. This is very - important for having a reasonable rate of errors - with tesseract. This can also be set through a - configuration variable or directory-local - parameters. See the rclpdf.py script.

    + default is off because OCR is so very slow. Will + only do anything if ocrprogs is defined.

    +
    +
    +
    +
    +

    Parameters + for OCR processing

    +
    +
    +
    +
    +
    +
    ocrprogs
    +
    +

    OCR modules to try. The top OCR script will + try to load the corresponding modules in order + and use the first which reports being capable of + performing OCR on the input file. Modules for + tesseract and ABBYY FineReader are present in the + standard distribution.

    +
    +
    ocrcachedir
    +
    +

    Location for caching OCR data. The default if + this is empty or undefined is to store the cached + OCR data under $RECOLL_CONFDIR/ocrcache.

    +
    +
    tesseractlang
    +
    +

    Language to assume for tesseract OCR. + Important for improving the OCR accuracy. This + can also be set through the contents of a file in + the currently processed directory. See the + rclocrtesseract.py script. Example values: eng, + fra... See the tesseract documentation.

    +
    +
    tesseractcmd
    +
    +

    Path for the tesseract command. This is mostly + useful on Windows, or for specifying a + non-default tesseract command. e.g. on Windows: + C:/Program Files (x86)/Tesseract-OCR/tesseract.exe

    +
    +
    abbyylang
    +
    +

    Language to assume for abbyy OCR. Important + for improving the OCR accuracy. This can also be + set through the contents of a file in the + currently processed directory. See the + rclocrabbyy.py script. Typical values: English, + French... See the ABBYY documentation.

    +
    +
    abbyycmd
    +
    +

    Path for the abbyy command. The ABBYY directory + is usually not in the path, so you should set + this.

    +
    +
    +
    +
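
The selection rule described for ocrprogs (try the listed modules in order and use the first one that reports being able to process the input file) can be sketched as follows. This is only an illustration, not the code of the real OCR script: the classes, the `can_ocr`/`run_ocr` method names, the `REGISTRY` dictionary and the `pick_module` function are hypothetical. The space-separated value format follows the sample `ocrprogs = abbyy tesseract` line.

```python
class TesseractModule:
    """Hypothetical stand-in for an OCR module (not the real rclocrtesseract.py)."""
    name = "tesseract"

    def can_ocr(self, path):
        # Assumption for the sketch: this module only claims PDF files.
        return path.lower().endswith(".pdf")

    def run_ocr(self, path):
        return f"[{self.name} text for {path}]"


class UnavailableModule:
    """Hypothetical module whose external program is not installed."""
    name = "abbyy"

    def can_ocr(self, path):
        return False

    def run_ocr(self, path):
        raise RuntimeError("OCR program not available")


# Hypothetical mapping from ocrprogs names to module classes.
REGISTRY = {"tesseract": TesseractModule, "abbyy": UnavailableModule}


def pick_module(ocrprogs, path, registry=REGISTRY):
    """Return the first listed module that reports it can OCR path, else None."""
    for name in ocrprogs.split():
        factory = registry.get(name)
        if factory is None:
            continue  # unknown module name: skip it and try the next one
        module = factory()
        if module.can_ocr(path):
            return module
    return None
```

With the value `abbyy tesseract`, the sketch asks the abbyy stand-in first, gets a refusal, and falls back to the tesseract stand-in. If no module accepts the file, the caller gets `None` and no OCR is attempted.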
    @@ -9858,8 +9959,8 @@ for i in range(nres): "filename">.xml extension but should be handled specially, which is possible because they are usually all located in one place. Example:

    -
    -          [~/.kde/share/apps/okular/docdata]
    +          
    [~/.kde/share/apps/okular/docdata]
             .xml = application/x-okular-notes

    The recoll_noindex mimemap variable has been diff --git a/src/doc/user/usermanual.xml b/src/doc/user/usermanual.xml index 8723ad59..5533e342 100644 --- a/src/doc/user/usermanual.xml +++ b/src/doc/user/usermanual.xml @@ -1414,30 +1414,9 @@ metadatacmds = ; tags = tmsu tags %f specific metadata tags from an XMP packet, and to extract PDF attachments. - - OCR with Tesseract - - If both tesseract and - pdftoppm (generally from the - poppler-utils package) are installed, - the PDF handler may attempt OCR on PDF files with no text - content. This is controlled by the - pdfocr - configuration variable, which is false by default because - OCR is very slow. - - The choice of language is very important for successfull - OCR. Recoll has currently no way to determine this from the - document itself. You can set the language to use through the - contents of a .ocrpdflang text file in the - same directory as the PDF document, or through the - RECOLL_TESSERACT_LANG environment variable, or - through the contents of an ocrpdf text file - inside the configuration directory. If none of the above are used, - &RCL; will try to guess the language from the NLS - environment. - - + The PDF handler can execute an external program to run OCR if + no text is found in the document. This is now described in a + separate section. XMP fields extraction @@ -1510,6 +1489,47 @@ metadatacmds = ; tags = tmsu tags %f + + Recoll and OCR + + This is new in &RCL; 1.26.5. Older versions had a more limited, + non-caching capability to execute an external OCR program in the PDF + handler. The new function has the following features: + + + The OCR output is cached, stored as separate + files. The caching is ultimately based on a hash value of the + original file contents, so that it is immune to file renames. 
A + first path-based layer ensures fast operation for unchanged + (unmoved files), and the data hash (which is still orders of + magnitude faster than OCR) is only re-computed if the file has + moved. OCR is only performed if the file was not previously + processed or if it changed. + The support for a specific program is implemented + in a simple Python module. It should be straightforward to add + support for any OCR engine with a capability to run from the + command line. + Modules initially exist for + tesseract (Linux and Windows), and + ABBYY FineReader (Linux, tested with + version 11). ABBYY FineReader is a commercial closed source + program, but it sometimes perform better than + tesseract. + The OCR is currently only called from the PDF + handler, but there should be no problem using it for other image + types. + + + + Configuration. See the + + relevant section. All parameters can be localized in + subdirectories through the usual main configuration mechanism (path + sections). + + + + Periodic indexing diff --git a/src/sampleconf/recoll.conf b/src/sampleconf/recoll.conf index e390db68..a72e1f42 100644 --- a/src/sampleconf/recoll.conf +++ b/src/sampleconf/recoll.conf @@ -350,7 +350,8 @@ indexStoreDocText = 1 # # Languages for which to create stemming expansion # data.Stemmer names can be found by executing 'recollindex -# -l', or this can also be set from a list in the GUI. +# -l', or this can also be set from a list in the GUI. The values are full +# language names, e.g. english, french... indexstemminglanguages = english # Default character @@ -760,9 +761,9 @@ checkneedretryindexscript = rclcheckneedretry.sh # # Language definitions to use when creating the aspell # dictionary.The value must match a set of aspell language -# definition files. You can type "aspell dicts" to see a list The default -# if this is not set is to use the NLS environment to guess the -# value. +# definition files. 
You can type "aspell dicts" to see a list The default +# if this is not set is to use the NLS environment to guess the value. The +# values are the 2-letter language codes (e.g. 'en', 'fr'...) #aspellLanguage = en # @@ -902,19 +903,11 @@ snippetMaxPosWalk = 1000000 # # -# Attempt OCR of PDF files with no text content if both tesseract and -# pdftoppm are installed. +# Attempt OCR of PDF files with no text content. # This can be defined in subdirectories. The default is off because -# OCR is so very slow. -#pdfocr = 0 - -# -# Language to assume for PDF OCR. -# This is very important for having a reasonable rate of errors -# with tesseract. This can also be set through a configuration variable -# or directory-local parameters. See the rclpdf.py script. +# OCR is so very slow. Will only do anything if ocrprogs is defined. # -#pdfocrlang = eng +#pdfocr = 0 # # @@ -946,6 +939,60 @@ snippetMaxPosWalk = 1000000 #pdfextrametafix = /path/to/fixerscript.py +# Parameters for OCR processing + + +# +# OCR modules to try. +# The top OCR script will try to load the corresponding modules in +# order and use the first which reports being capable of performing OCR on +# the input file. Modules for tesseract and ABBYY FineReader are present in +# the standard distribution. +# +#ocrprogs = abbyy tesseract + +# +# Location for caching OCR data. +# The default if this is empty or undefined is to store the cached +# OCR data under $RECOLL_CONFDIR/ocrcache. +# +#ocrcachedir= + + +# +# Language to assume for tesseract OCR. +# Important for improving the OCR accuracy. This can also be set +# through the contents of a file in +# the currently processed directory. See the rclocrtesseract.py +# script. Example values: eng, fra... See the tesseract documentation. +# +#tesseractlang = eng + +# +# Path for the tesseract command. +# This is mostly useful on Windows, or for specifying a non-default +# tesseract command. e.g. 
on Windows: +# C:/Program Files (x86)/Tesseract-OCR/tesseract.exe +# +#tesseractcmd = c:/Program Files (x86)/Tesseract-OCR/tesseract.exe + +# +# Language to assume for abbyy OCR. +# Important for improving the OCR accuracy. This can also be set +# through the contents of a file in +# the currently processed directory. See the rclocrabbyy.py +# script. Typical values: English, French... See the ABBYY documentation. +# +# +#abbyylang = English + +# +# Path for the abbyy command +# The ABBY directory is usually not in the path, so you should set this. +# +# +abbyycmd = /opt/ABBYYOCR11/abbyyocr11 + # Parameters set for specific # locations