Jean-Francois Dockes 2019-03-10 14:47:05 +01:00
parent ed45e5f00e
commit 3b55d03b39
3 changed files with 110 additions and 106 deletions

@ -6920,96 +6920,94 @@ recollindex -c "$confdir"
</div> </div>
</div> </div>
</div> </div>
<p>Index queries do not provide document content (only <p>Prior to <span class="application">Recoll</span>
a partial and imprecise reconstruction is performed to 1.25, index queries never provide document content
show the snippets text). In order to access the actual because it is not stored. More recent versions usually
document data, the data extraction part of the indexing store the document text, which can be optionally
process must be performed (subdocument access and retrieved when running a query (see <code class=
format translation). This is not trivial in the case of "literal">query.execute()</code> above - the result is
embedded documents. The <code class= always plain text).</p>
"literal">rclextract</code> module provides a single <p>The <code class="literal">rclextract</code> module
class which can be used to access the data content for can give access to the original document and to the
result documents.</p> document text content (if not stored by the index, or
to access an HTML version of the text). Accessing the
original document is particularly useful if it is
embedded (e.g. an email attachment).</p>
<p>You need to import the <code class=
"literal">recoll</code> module before the <code class=
"literal">rclextract</code> module.</p>
<div class="sect4"> <div class="sect4">
<div class="titlepage"> <div class="titlepage">
<div> <div>
<div> <div>
<h5 class="title"><a name= <h5 class="title"><a name=
"RCL.PROGRAM.PYTHONAPI.RCLEXTRACT.CLASSES" id= "RCL.PROGRAM.PYTHONAPI.RCLEXTRACT.CLASSES.EXTRACTOR"
"RCL.PROGRAM.PYTHONAPI.RCLEXTRACT.CLASSES"></a>Classes</h5> id=
"RCL.PROGRAM.PYTHONAPI.RCLEXTRACT.CLASSES.EXTRACTOR">
</a>The Extractor class</h5>
</div> </div>
</div> </div>
</div> </div>
<div class="sect5"> <div class="variablelist">
<div class="titlepage"> <dl class="variablelist">
<div> <dt><span class="term">Extractor(doc)</span></dt>
<div> <dd>
<h6 class="title"><a name= <p>An <code class="literal">Extractor</code>
"RCL.PROGRAM.PYTHONAPI.RCLEXTRACT.CLASSES.EXTRACTOR" object is built from a <code class=
id= "literal">Doc</code> object, output from a
"RCL.PROGRAM.PYTHONAPI.RCLEXTRACT.CLASSES.EXTRACTOR"> query.</p>
</a>The Extractor class</h6> </dd>
</div> <dt><span class=
</div> "term">Extractor.textextract(ipath)</span></dt>
</div> <dd>
<div class="variablelist"> <p>Extract document defined by <em class=
<dl class="variablelist"> "replaceable"><code>ipath</code></em> and
<dt><span class= return a <code class="literal">Doc</code>
"term">Extractor(doc)</span></dt> object. The <code class=
<dd> "literal">doc.text</code> field has the
<p>An <code class="literal">Extractor</code> document text converted to either text/plain or
object is built from a <code class= text/html according to <code class=
"literal">Doc</code> object, output from a "literal">doc.mimetype</code>. The typical use
query.</p> would be as follows:</p>
</dd> <pre class="programlisting">
<dt><span class= from recoll import recoll, rclextract
"term">Extractor.textextract(ipath)</span></dt>
<dd>
<p>Extract document defined by <em class=
"replaceable"><code>ipath</code></em> and
return a <code class="literal">Doc</code>
object. The <code class=
"literal">doc.text</code> field has the
document text converted to either text/plain
or text/html according to <code class=
"literal">doc.mimetype</code>. The typical
use would be as follows:</p>
<pre class="programlisting">
qdoc = query.fetchone() qdoc = query.fetchone()
extractor = rclextract.Extractor(qdoc) extractor = rclextract.Extractor(qdoc)
doc = extractor.textextract(qdoc.ipath) doc = extractor.textextract(qdoc.ipath)
# use doc.text, e.g. for previewing</pre> # use doc.text, e.g. for previewing</pre>
<p>Passing <code class= <p>Passing <code class=
"literal">qdoc.ipath</code> to <code class= "literal">qdoc.ipath</code> to <code class=
"literal">textextract()</code> is redundant, "literal">textextract()</code> is redundant,
but reflects the fact that the <code class= but reflects the fact that the <code class=
"literal">Extractor</code> object actually "literal">Extractor</code> object actually has
has the capability to access the other the capability to access the other entries in a
entries in a compound document.</p> compound document.</p>
</dd> </dd>
<dt><span class= <dt><span class=
"term">Extractor.idoctofile(ipath, targetmtype, "term">Extractor.idoctofile(ipath, targetmtype,
outfile='')</span></dt> outfile='')</span></dt>
<dd> <dd>
<p>Extracts document into an output file, <p>Extracts document into an output file, which
which can be given explicitly or will be can be given explicitly or will be created as a
created as a temporary file to be deleted by temporary file to be deleted by the caller.
the caller. Typical use:</p> Typical use:</p>
<pre class="programlisting"> <pre class="programlisting">
from recoll import recoll, rclextract
qdoc = query.fetchone() qdoc = query.fetchone()
extractor = rclextract.Extractor(qdoc) extractor = rclextract.Extractor(qdoc)
filename = extractor.idoctofile(qdoc.ipath, qdoc.mimetype)</pre> filename = extractor.idoctofile(qdoc.ipath, qdoc.mimetype)</pre>
<p>In all cases the output is a copy, even if <p>In all cases the output is a copy, even if
the requested document is a regular system the requested document is a regular system
file, which may be wasteful in some cases. If file, which may be wasteful in some cases. If
you want to avoid this, you can test for a you want to avoid this, you can test for a
simple file document as follows:</p> simple file document as follows:</p>
<pre class="programlisting"> <pre class="programlisting">
not doc.ipath and (not "rclbes" in doc.keys() or doc["rclbes"] == "FS") not doc.ipath and (not "rclbes" in doc.keys() or doc["rclbes"] == "FS")
</pre> </pre>
</dd> </dd>
</dl> </dl>
</div>
</div> </div>
</div> </div>
</div> </div>
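
Note on the "simple file document" test quoted in the hunk above: a minimal sketch of that expression wrapped in a helper, so that a caller could decide whether to open the file directly instead of copying it with `idoctofile()`. The `StubDoc` class and the `is_plain_file` name are hypothetical and exist only so the snippet runs without Recoll installed; the sketch assumes a real `Doc` exposes `ipath`, `keys()` and item access as the documented expression implies.

```python
class StubDoc:
    """Hypothetical stand-in for a Recoll Doc object (illustration only)."""

    def __init__(self, ipath="", **fields):
        self.ipath = ipath
        self._fields = fields

    def keys(self):
        return list(self._fields.keys())

    def __getitem__(self, key):
        return self._fields[key]


def is_plain_file(doc):
    """Return True when doc looks like a regular file system file.

    Same logic as the expression in the manual: no internal path, and
    either no "rclbes" backend field or a backend value of "FS".
    """
    return not doc.ipath and ("rclbes" not in doc.keys() or doc["rclbes"] == "FS")
```

With a real query result, the same helper would be called on the `Doc` returned by `query.fetchone()`; the backend value `"OTHER"` used in testing is an arbitrary placeholder, not a documented Recoll value.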

@ -5349,40 +5349,45 @@ recollindex -c "$confdir"
<sect3 id="RCL.PROGRAM.PYTHONAPI.RCLEXTRACT"> <sect3 id="RCL.PROGRAM.PYTHONAPI.RCLEXTRACT">
<title>The rclextract module</title> <title>The rclextract module</title>
<para>Index queries do not provide document content (only a
partial and imprecise reconstruction is performed to show the
snippets text). In order to access the actual document data, the
data extraction part of the indexing process must be performed
(subdocument access and format translation). This is not trivial
in the case of embedded documents. The
<literal>rclextract</literal> module provides a single class
which can be used to access the data content for result
documents.</para>
<sect4 id="RCL.PROGRAM.PYTHONAPI.RCLEXTRACT.CLASSES"> <para>Prior to &RCL; 1.25, index queries never provide document
<title>Classes</title> content because it is not stored. More recent versions usually
store the document text, which can be optionally retrieved when
running a query (see <literal>query.execute()</literal>
above - the result is always plain text).</para>
<sect5 id="RCL.PROGRAM.PYTHONAPI.RCLEXTRACT.CLASSES.EXTRACTOR"> <para>The <literal>rclextract</literal> module can give access to
<title>The Extractor class</title> the original document and to the document text content (if not
stored by the index, or to access an HTML version of the text).
Accessing the original document is particularly useful if it is
embedded (e.g. an email attachment).</para>
<variablelist> <para>You need to import the <literal>recoll</literal> module
before the <literal>rclextract</literal> module.</para>
<varlistentry> <sect4 id="RCL.PROGRAM.PYTHONAPI.RCLEXTRACT.CLASSES.EXTRACTOR">
<term>Extractor(doc)</term> <title>The Extractor class</title>
<listitem><para>An <literal>Extractor</literal> object is
built from a <literal>Doc</literal> object, output <variablelist>
from a query.</para></listitem>
</varlistentry> <varlistentry>
<varlistentry> <term>Extractor(doc)</term>
<term>Extractor.textextract(ipath)</term> <listitem><para>An <literal>Extractor</literal> object is
<listitem><para>Extract document defined by built from a <literal>Doc</literal> object, output
<replaceable>ipath</replaceable> and return a from a query.</para></listitem>
<literal>Doc</literal> object. The </varlistentry>
<literal>doc.text</literal> field has the document text <varlistentry>
converted to either text/plain or text/html according to <term>Extractor.textextract(ipath)</term>
<literal>doc.mimetype</literal>. The typical use would be <listitem><para>Extract document defined by
as follows:</para> <replaceable>ipath</replaceable> and return a
<literal>Doc</literal> object. The
<literal>doc.text</literal> field has the document text
converted to either text/plain or text/html according to
<literal>doc.mimetype</literal>. The typical use would be
as follows:</para>
<programlisting> <programlisting>
from recoll import recoll, rclextract
qdoc = query.fetchone() qdoc = query.fetchone()
extractor = rclextract.Extractor(qdoc) extractor = rclextract.Extractor(qdoc)
doc = extractor.textextract(qdoc.ipath) doc = extractor.textextract(qdoc.ipath)
@ -5401,6 +5406,8 @@ doc = extractor.textextract(qdoc.ipath)
temporary file to be deleted by the caller. Typical temporary file to be deleted by the caller. Typical
use:</para> use:</para>
<programlisting> <programlisting>
from recoll import recoll, rclextract
qdoc = query.fetchone() qdoc = query.fetchone()
extractor = rclextract.Extractor(qdoc) extractor = rclextract.Extractor(qdoc)
filename = extractor.idoctofile(qdoc.ipath, qdoc.mimetype)</programlisting> filename = extractor.idoctofile(qdoc.ipath, qdoc.mimetype)</programlisting>
@ -5417,8 +5424,7 @@ not doc.ipath and (not "rclbes" in doc.keys() or doc["rclbes"] == "FS")
</variablelist> </variablelist>
</sect5> <!-- Extractor class --> </sect4>
</sect4> <!-- rclextract classes -->
</sect3> <!-- rclextract module --> </sect3> <!-- rclextract module -->

@ -1,6 +1,6 @@
# Configuration # Configuration
# The name of the source DocBook xml file # The name of the source DocBook xml file
INPUT_XML = ../usermanual.xml ../recoll.conf.xml INPUT_XML = ../usermanual.xml
# The makefile assumes that you have a # The makefile assumes that you have a
# directory named images that contains # directory named images that contains