Jean-Francois Dockes 2019-03-10 14:47:05 +01:00
parent ed45e5f00e
commit 3b55d03b39
3 changed files with 110 additions and 106 deletions


@@ -6920,96 +6920,94 @@ recollindex -c "$confdir"
</div>
</div>
</div>
<p>Index queries do not provide document content (only
a partial and imprecise reconstruction is performed to
show the snippets text). In order to access the actual
document data, the data extraction part of the indexing
process must be performed (subdocument access and
format translation). This is not trivial in the case of
embedded documents. The <code class=
"literal">rclextract</code> module provides a single
class which can be used to access the data content for
result documents.</p>
<p>Prior to <span class="application">Recoll</span>
1.25, index queries never provided document content
because it was not stored. More recent versions usually
store the document text, which can be optionally
retrieved when running a query (see <code class=
"literal">query.execute()</code> above - the result is
always plain text).</p>
<p>The <code class="literal">rclextract</code> module
can give access to the original document and to the
document text content (if not stored by the index, or
to access an HTML version of the text). Accessing the
original document is particularly useful if it is
embedded (e.g. an email attachment).</p>
<p>You need to import the <code class=
"literal">recoll</code> module before the <code class=
"literal">rclextract</code> module.</p>
<div class="sect4">
<div class="titlepage">
<div>
<div>
<h5 class="title"><a name=
"RCL.PROGRAM.PYTHONAPI.RCLEXTRACT.CLASSES" id=
"RCL.PROGRAM.PYTHONAPI.RCLEXTRACT.CLASSES"></a>Classes</h5>
"RCL.PROGRAM.PYTHONAPI.RCLEXTRACT.CLASSES.EXTRACTOR"
id=
"RCL.PROGRAM.PYTHONAPI.RCLEXTRACT.CLASSES.EXTRACTOR">
</a>The Extractor class</h5>
</div>
</div>
</div>
<div class="sect5">
<div class="titlepage">
<div>
<div>
<h6 class="title"><a name=
"RCL.PROGRAM.PYTHONAPI.RCLEXTRACT.CLASSES.EXTRACTOR"
id=
"RCL.PROGRAM.PYTHONAPI.RCLEXTRACT.CLASSES.EXTRACTOR">
</a>The Extractor class</h6>
</div>
</div>
</div>
<div class="variablelist">
<dl class="variablelist">
<dt><span class=
"term">Extractor(doc)</span></dt>
<dd>
<p>An <code class="literal">Extractor</code>
object is built from a <code class=
"literal">Doc</code> object, output from a
query.</p>
</dd>
<dt><span class=
"term">Extractor.textextract(ipath)</span></dt>
<dd>
<p>Extract document defined by <em class=
"replaceable"><code>ipath</code></em> and
return a <code class="literal">Doc</code>
object. The <code class=
"literal">doc.text</code> field has the
document text converted to either text/plain
or text/html according to <code class=
"literal">doc.mimetype</code>. The typical
use would be as follows:</p>
<pre class="programlisting">
<div class="variablelist">
<dl class="variablelist">
<dt><span class="term">Extractor(doc)</span></dt>
<dd>
<p>An <code class="literal">Extractor</code>
object is built from a <code class=
"literal">Doc</code> object, output from a
query.</p>
</dd>
<dt><span class=
"term">Extractor.textextract(ipath)</span></dt>
<dd>
<p>Extract document defined by <em class=
"replaceable"><code>ipath</code></em> and
return a <code class="literal">Doc</code>
object. The <code class=
"literal">doc.text</code> field has the
document text converted to either text/plain or
text/html according to <code class=
"literal">doc.mimetype</code>. The typical use
would be as follows:</p>
<pre class="programlisting">
from recoll import recoll, rclextract
qdoc = query.fetchone()
extractor = rclextract.Extractor(qdoc)
doc = extractor.textextract(qdoc.ipath)
# use doc.text, e.g. for previewing</pre>
<p>Passing <code class=
"literal">qdoc.ipath</code> to <code class=
"literal">textextract()</code> is redundant,
but reflects the fact that the <code class=
"literal">Extractor</code> object actually
has the capability to access the other
entries in a compound document.</p>
</dd>
<dt><span class=
"term">Extractor.idoctofile(ipath, targetmtype,
outfile='')</span></dt>
<dd>
<p>Extracts document into an output file,
which can be given explicitly or will be
created as a temporary file to be deleted by
the caller. Typical use:</p>
<pre class="programlisting">
<p>Passing <code class=
"literal">qdoc.ipath</code> to <code class=
"literal">textextract()</code> is redundant,
but reflects the fact that the <code class=
"literal">Extractor</code> object actually has
the capability to access the other entries in a
compound document.</p>
</dd>
<dt><span class=
"term">Extractor.idoctofile(ipath, targetmtype,
outfile='')</span></dt>
<dd>
<p>Extracts document into an output file, which
can be given explicitly or will be created as a
temporary file to be deleted by the caller.
Typical use:</p>
<pre class="programlisting">
from recoll import recoll, rclextract
qdoc = query.fetchone()
extractor = rclextract.Extractor(qdoc)
filename = extractor.idoctofile(qdoc.ipath, qdoc.mimetype)</pre>
<p>In all cases the output is a copy, even if
the requested document is a regular system
file, which may be wasteful in some cases. If
you want to avoid this, you can test for a
simple file document as follows:</p>
<pre class="programlisting">
<p>In all cases the output is a copy, even if
the requested document is a regular system
file, which may be wasteful in some cases. If
you want to avoid this, you can test for a
simple file document as follows:</p>
<pre class="programlisting">
not doc.ipath and (not "rclbes" in doc.keys() or doc["rclbes"] == "FS")
</pre>
</dd>
</dl>
</div>
</dd>
</dl>
</div>
</div>
</div>
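The "simple file document" test quoted in this hunk can be wrapped in a small helper. This is only a sketch: `FakeDoc` is a hypothetical stand-in for a Recoll `Doc` (dict-style fields plus an `ipath` attribute) so that the snippet runs without an index, `is_simple_file` is an illustrative name, and `"WEB"` is an arbitrary non-`"FS"` value chosen for the demonstration.

```python
class FakeDoc(dict):
    """Hypothetical stand-in for a Recoll Doc, used only for this sketch."""

    def __init__(self, ipath="", **fields):
        super().__init__(**fields)
        self.ipath = ipath


def is_simple_file(doc):
    """Apply the test from the manual text, in the idiomatic 'not in' form."""
    # An empty ipath means the result is not an entry inside another document.
    # A missing "rclbes" field, or the value "FS", passes the second half of the test.
    return not doc.ipath and ("rclbes" not in doc.keys() or doc["rclbes"] == "FS")


plain = FakeDoc()                      # no ipath, no rclbes field
attachment = FakeDoc(ipath="1")        # has an ipath, so it is embedded
other_backend = FakeDoc(rclbes="WEB")  # rclbes present and not "FS"

print(is_simple_file(plain), is_simple_file(attachment), is_simple_file(other_backend))
```

With a real query result, the same function would be called on the `Doc` returned by `query.fetchone()` before deciding whether `idoctofile()` is worth the copy.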


@@ -5349,40 +5349,45 @@ recollindex -c "$confdir"
<sect3 id="RCL.PROGRAM.PYTHONAPI.RCLEXTRACT">
<title>The rclextract module</title>
<para>Index queries do not provide document content (only a
partial and imprecise reconstruction is performed to show the
snippets text). In order to access the actual document data, the
data extraction part of the indexing process must be performed
(subdocument access and format translation). This is not trivial
in the case of embedded documents. The
<literal>rclextract</literal> module provides a single class
which can be used to access the data content for result
documents.</para>
<para>Prior to &RCL; 1.25, index queries never provided document
content because it was not stored. More recent versions usually
store the document text, which can be optionally retrieved when
running a query (see <literal>query.execute()</literal>
above - the result is always plain text).</para>
<sect4 id="RCL.PROGRAM.PYTHONAPI.RCLEXTRACT.CLASSES">
<title>Classes</title>
<sect5 id="RCL.PROGRAM.PYTHONAPI.RCLEXTRACT.CLASSES.EXTRACTOR">
<title>The Extractor class</title>
<para>The <literal>rclextract</literal> module can give access to
the original document and to the document text content (if not
stored by the index, or to access an HTML version of the text).
Accessing the original document is particularly useful if it is
embedded (e.g. an email attachment).</para>
<variablelist>
<para>You need to import the <literal>recoll</literal> module
before the <literal>rclextract</literal> module.</para>
<sect4 id="RCL.PROGRAM.PYTHONAPI.RCLEXTRACT.CLASSES.EXTRACTOR">
<title>The Extractor class</title>
<varlistentry>
<term>Extractor(doc)</term>
<listitem><para>An <literal>Extractor</literal> object is
built from a <literal>Doc</literal> object, output
from a query.</para></listitem>
</varlistentry>
<varlistentry>
<term>Extractor.textextract(ipath)</term>
<listitem><para>Extract document defined by
<replaceable>ipath</replaceable> and return a
<literal>Doc</literal> object. The
<literal>doc.text</literal> field has the document text
converted to either text/plain or text/html according to
<literal>doc.mimetype</literal>. The typical use would be
as follows:</para>
<variablelist>
<varlistentry>
<term>Extractor(doc)</term>
<listitem><para>An <literal>Extractor</literal> object is
built from a <literal>Doc</literal> object, output
from a query.</para></listitem>
</varlistentry>
<varlistentry>
<term>Extractor.textextract(ipath)</term>
<listitem><para>Extract document defined by
<replaceable>ipath</replaceable> and return a
<literal>Doc</literal> object. The
<literal>doc.text</literal> field has the document text
converted to either text/plain or text/html according to
<literal>doc.mimetype</literal>. The typical use would be
as follows:</para>
<programlisting>
from recoll import recoll, rclextract
qdoc = query.fetchone()
extractor = rclextract.Extractor(qdoc)
doc = extractor.textextract(qdoc.ipath)
@@ -5401,6 +5406,8 @@ doc = extractor.textextract(qdoc.ipath)
temporary file to be deleted by the caller. Typical
use:</para>
<programlisting>
from recoll import recoll, rclextract
qdoc = query.fetchone()
extractor = rclextract.Extractor(qdoc)
filename = extractor.idoctofile(qdoc.ipath, qdoc.mimetype)</programlisting>
@@ -5417,8 +5424,7 @@ not doc.ipath and (not "rclbes" in doc.keys() or doc["rclbes"] == "FS")
</variablelist>
</sect5> <!-- Extractor class -->
</sect4> <!-- rclextract classes -->
</sect4>
</sect3> <!-- rclextract module -->
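The `textextract()` flow documented in this section (build an `Extractor` from a result `Doc`, then extract by `ipath`) can be sketched as a small function. To keep the snippet runnable without a Recoll index, the extractor class is passed in as a parameter. `extract_text`, `FakeExtractor` and the sample text are hypothetical names and data introduced for illustration. With a real installation one would presumably pass `rclextract.Extractor` instead, assuming the class is exposed by the `rclextract` module as the section's introduction states.

```python
from types import SimpleNamespace


def extract_text(qdoc, make_extractor):
    """Return (mimetype, text) for a query result document."""
    extractor = make_extractor(qdoc)
    # Passing qdoc.ipath is redundant for the result itself, as the manual notes,
    # but the same call could name another entry of a compound document.
    doc = extractor.textextract(qdoc.ipath)
    return doc.mimetype, doc.text


class FakeExtractor:
    """Hypothetical stand-in mimicking the Extractor interface described above."""

    def __init__(self, doc):
        self.doc = doc

    def textextract(self, ipath):
        # A real extractor would run the data extraction part of indexing here.
        return SimpleNamespace(mimetype="text/plain",
                               text="sample text for ipath " + ipath)


qdoc = SimpleNamespace(ipath="2")  # stand-in for query.fetchone()
mimetype, text = extract_text(qdoc, FakeExtractor)
print(mimetype, text)
```

Since `doc.text` is either text/plain or text/html according to `doc.mimetype`, returning the MIME type with the text lets a previewer choose how to render it.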


@@ -1,6 +1,6 @@
# Configuration
# The name of the source DocBook xml file
INPUT_XML = ../usermanual.xml ../recoll.conf.xml
INPUT_XML = ../usermanual.xml
# The makefile assumes that you have a
# directory named images that contains