I use Zend Lucene to index Swedish texts. The problem is that Lucene splits words at the Swedish characters åäö. For example, the word "världens" becomes two words, "v" and "ldens", in the index.

Is there a way to tell Zend Lucene which additional characters to accept, so that it does not tokenize at them?

+1  A: 

Use analysers. See the Zend Lucene docs on text analysis, on using UTF-8, and on writing your own analyser. I recommend you just use a UTF-8 analyser.

Yacoby
+3  A: 

Use a UTF-8 compatible text analyzer instead of the default text analyzer for tokenization. Note that this requires PHP's PCRE (Perl-compatible regular expressions) library to be compiled with UTF-8 support (the default if you use the PCRE library bundled with PHP, but possibly not enabled if you use a shared library). For the case-insensitive versions of the UTF-8 compatible analyzers, you also need the mbstring extension to be enabled.
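A minimal sketch of what that could look like, assuming Zend Framework 1 is on the include path and the input text is UTF-8 encoded. The index path, the field name `body` and the sample text are made up for illustration; the PCRE check is the usual probe for Unicode property support.

```php
<?php
// Assumption: Zend Framework 1 is on the include path.
require_once 'Zend/Search/Lucene.php';
require_once 'Zend/Search/Lucene/Analysis/Analyzer/Common/Utf8.php';
require_once 'Zend/Search/Lucene/Analysis/Analyzer/Common/Utf8/CaseInsensitive.php';

// The UTF-8 analyzers rely on PCRE being built with UTF-8 support.
if (@preg_match('/\pL/u', 'a') !== 1) {
    die('PCRE was compiled without UTF-8 support.');
}

// The case-insensitive variant additionally needs mbstring;
// fall back to the case-sensitive UTF-8 analyzer if it is missing.
$analyzer = extension_loaded('mbstring')
    ? new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8_CaseInsensitive()
    : new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8();

// Replace the default analyzer before indexing.
Zend_Search_Lucene_Analysis_Analyzer::setDefault($analyzer);

// Hypothetical index location and field name.
$index = Zend_Search_Lucene::create('/tmp/swedish-index');

$doc = new Zend_Search_Lucene_Document();
// The third argument declares the encoding of the field value.
$doc->addField(Zend_Search_Lucene_Field::text('body', 'världens bästa', 'UTF-8'));

$index->addDocument($doc);
$index->commit();
```

The same analyzer should also be set as the default before searching, otherwise query terms are tokenized differently from the indexed text. An index built earlier with the default analyzer would need to be rebuilt.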

ax