recoll

Author	SHA1	Message	Date
Jean-Francois Dockes	9565663f09	textsplit: create isNGRAMMED() method to replace isCJK() and let the latter actually return what it says	2020-04-14 09:27:26 +02:00
Jean-Francois Dockes	eb53b598d6	Textsplit: lost char at korean->ascii transition	2020-04-10 14:54:13 +01:00
Jean-Francois Dockes	de246349da	textsplit: use more regular test for ISHANGUL. CJK: do not ignore whitespace, break on alphabetic non cjk character	2020-04-10 14:28:14 +02:00
Jean-Francois Dockes	1afc606718	textsplit: break on it.error() not only it.eof(). Seems to make a difference in rare cases? Add Komoran support but this one often fails	2020-03-26 09:31:19 +01:00
Jean-Francois Dockes	9719177c82	Korean external splitter: add some support for Mecab	2020-03-23 16:20:32 +01:00
Jean-Francois Dockes	384e3a1087	korean textsplit with extern help from konlpy, first step	2020-03-22 10:09:50 +01:00
Jean-Francois Dockes	5be3ed89c5	comments	2020-03-21 10:16:44 +01:00
Jean-Francois Dockes	bbf8c90185	experiment: ignore all ascii whitespace when generating cjk ngrams	2019-07-21 19:13:24 +02:00
Jean-Francois Dockes	baa6062de1	Do not process hangul as words, but as ngrams. Same issues as with Katakana: word separation too hard	2019-07-21 19:09:51 +02:00
Jean-Francois Dockes	6b058e9758	Regularise processing of hangul characters (there was a mixup of cjk/regular processing), and add a build-time option to either use cjk/ngram or regular term splitting for them	2019-07-21 19:09:51 +02:00
Jean-Francois Dockes	34bb62a8d9	got rid of a few unused variable warnings	2019-04-11 15:31:27 +02:00
Jean-Francois Dockes	0cbc46732f	Fixed the FSF address	2019-03-04 11:19:14 +01:00
Jean-Francois Dockes	bbeaebf632	textsplit: process unicode apostrophes and right quotation mark as ascii single quote	2019-02-01 16:10:51 +01:00
Jean-Francois Dockes	b1ff34407d	Simplify initialization by moving static config textsplit init from rclconfig to textsplit	2019-02-01 09:09:15 +01:00
Jean-Francois Dockes	bdc8d3eb38	Add config variable to process backslashes as letters	2019-01-29 18:32:19 +01:00
Jean-Francois Dockes	55e2fe5d27	Prevent text splitter bad array access and stl assertion crash (fedora rpmbuild) in marginal case. There were probably no real consequence beyond triggering the assertion	2018-11-15 18:19:39 +01:00
Jean-Francois Dockes	9244e31574	fixed a few spelling errors, mostly in comments and debug messages	2018-05-03 16:20:36 +02:00
Jean-Francois Dockes	5b35ecfe36	Windows warning suppression (no real changes)	2018-01-19 17:26:43 +01:00
Jean-Francois Dockes	15ea565e9f	m_words_in_span was not always properly reset between invocations (if discardspan() was not called for some reason), resulting in crashes	2017-05-15 10:26:38 +02:00
Jean-Francois Dockes	f853f39ef3	Partially revert change treating Katakana as words, going back to n-grams. Did not work well because of separator-less compounds mostly	2017-04-25 10:20:38 +02:00
Jean-Francois Dockes	adaf7c77f9	Process katakana-western transitions as word breaks	2017-04-21 12:08:43 +02:00
Jean-Francois Dockes	9661a4431e	wen	2017-04-18 14:39:12 +02:00
Jean-Francois Dockes	0b0385e459	got rid of the STD_SHARED_XX std/tr1 defines	2016-07-13 15:12:25 +02:00
Jean-Francois Dockes	d8f4500f90	fix debuglog ref in test driver + std=c++11	2016-07-12 19:32:02 +02:00
Jean-Francois Dockes	f6a999de84	logging now uses c++ streams	2016-07-12 09:41:04 +02:00
Jean-Francois Dockes	a905a92328	arrange so that ' .net' is split as .net and net. Previously it only produced .net, which meant that matching filename extensions, like in fn:pdf$ did not work well because of cases where a special char or a space occurred before the .	2016-06-20 17:25:25 +02:00
Jean-Francois Dockes	0a9d55e790	Suppressed a couple warnings (unsigned issues) + small windows release fixes	2016-01-29 17:30:50 +01:00
Jean-Francois Dockes	c1c73573d8	more int fixups --HG-- branch : WINDOWSPORT	2015-09-02 07:34:59 +02:00
Jean-Francois Dockes	1cbf02f713	Suppressed many integer size warnings by a mix of type adjustments and casts, none of which should have a real effect. --HG-- branch : WINDOWSPORT	2015-09-01 19:39:20 +02:00
Jean-Francois Dockes	82295328cc	Test for end() after lower_bound call before dereferencing! --HG-- branch : WINDOWSPORT	2015-09-01 14:44:30 +02:00
Jean-Francois Dockes	e1bb1a3022	Make dehyphenate (co-worker->coworker) optional	2015-08-19 11:34:26 +02:00
Jean-Francois Dockes	94eb3119ce	Generate an additional unhyphenated term for singly hyphenated words: co-worker will index as [co worker], [co-worker] and [coworker]. Only produce terms for alphanumeric hashtags (discard #,xyz)	2015-08-13 18:18:49 +02:00
Jean-Francois Dockes	abe9fb671f	clean up autoconf of unordered_xx, prepare change to shared_ptr	2015-08-09 10:21:46 +02:00
Jean-Francois Dockes	f70ec1cab7	comment	2015-07-31 11:23:17 +02:00
Jean-Francois Dockes	0be78cfe48	index #hashtags as such	2014-07-29 09:56:00 +02:00
Jean-Francois Dockes	efaa1fb3a3	fix textsplit core dump caused by interaction of new 1.20 code with little-tested camelcase splitting section	2014-07-28 22:12:35 +02:00
Jean-Francois Dockes	8e7bac08c1	test driver	2014-06-10 17:41:46 +02:00
Jean-Francois Dockes	96d99ad6e5	textsplit: check for underflow while trimming the span	2014-05-19 18:52:51 +02:00
Jean-Francois Dockes	0145234b60	translate unicode hyphen (0x2010) into ascii minus	2014-04-30 09:59:51 +02:00
Jean-Francois Dockes	077aed3018	fix term byte offsets produced by new textsplit: for highlighting	2014-04-24 12:42:10 +02:00
Jean-Francois Dockes	ece15318ab	New text splitter with word accumulator and full partial span generation. Search/Index seem ok. Still a pb with use for highlighting (preview)	2014-04-24 10:13:19 +02:00
Jean-Francois Dockes	e12d66865e	Deal with tr1 being gone in c0x11 compilers	2013-10-18 13:02:48 +02:00
Jean-Francois Dockes	b4c7efe490	Added (unifdefd) code to detect garbage data like undecoded base64 by looking at word length stats	2013-04-27 08:29:55 +02:00
Jean-Francois Dockes	d06e45946a	Handle wildcards as normal chars everywhere when splitting for query	2013-03-30 12:49:31 +01:00
Jean-Francois Dockes	0ae8ec99f6	more utf-8 err checking prevents bogus terms in index	2013-03-30 10:24:10 +01:00
Jean-Francois Dockes	df49598a8d	make comma a normal wordsplit char	2013-03-22 10:06:02 +01:00
Jean-Francois Dockes	dcf937d650	remove use of - as span-building character.	2013-03-04 12:16:11 +01:00
Jean-Francois Dockes	d2f7f11715	Use dynamic lib for shared recoll code	2012-12-29 14:27:01 +01:00
Jean-Francois Dockes	4544571490	term generation: only keep @ when not at start of term	2012-11-18 08:25:19 +01:00
Jean-Francois Dockes	d86f74a9e8	missing include	2012-10-16 16:10:14 +02:00

109 Commits
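
The 2015-08-13 and 2015-08-19 entries above describe generating an extra unhyphenated term for singly hyphenated words, later made optional. The sketch below only illustrates that described behaviour. It is not recoll's code: the function name `hyphen_terms` and the `dehyphenate` flag are hypothetical, and emitting the two halves as separate terms is an assumed reading of "[co worker]".

```python
def hyphen_terms(word: str, dehyphenate: bool = True) -> list[str]:
    """Return illustrative index terms for a word (hypothetical helper)."""
    parts = word.split("-")
    # Only singly hyphenated words with two non-empty halves qualify.
    if len(parts) != 2 or not all(parts):
        return [word]
    # The two halves, then the hyphenated form itself.
    terms = [parts[0], parts[1], word]
    if dehyphenate:
        # Optional joined form, e.g. co-worker -> coworker.
        terms.append(parts[0] + parts[1])
    return terms


if __name__ == "__main__":
    print(hyphen_terms("co-worker"))
    print(hyphen_terms("co-worker", dehyphenate=False))
```

With the flag disabled, only the halves and the hyphenated form are produced. Words with no hyphen, or with more than one, are returned unchanged in this sketch.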