Zend Lucene - tokenizing swedish chars

tags:

lucene
zend-framework
zend-lucene
zend-search-lucene

answers:

+2 Q:

I use Zend Lucene to index Swedish texts. The problem is that Lucene splits words at the Swedish characters å, ä and ö. For example, the word "världens" ends up in the index as two words, "v" and "ldens".

Is there a way to tell Zend Lucene which additional characters it should accept, so that it does not split words at them?

+1 A:

Use analysers. See the docs about text analysis, about using UTF-8, and about writing your own analyser. I recommend you just use a UTF-8 analyser.

Yacoby 2009-12-30 14:35:30

+3 A:

Use a UTF-8 compatible text analyzer instead of the default text analyzer for tokenization. Note that this requires PHP's PCRE (Perl-compatible regular expressions) library to be compiled with UTF-8 support. That is the default if you use the PCRE library bundled with PHP, but it may not be enabled if you use a shared library. For the case-insensitive versions of the UTF-8 compatible analyzers, you also need the mbstring extension to be enabled.
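As a rough sketch of what that could look like, assuming Zend Framework 1.x with its classes loadable (for example through its autoloader): the helper names `pcreSupportsUtf8` and `useUtf8Analyzer` are made up for illustration, while the `Zend_Search_Lucene_*` classes are the ones shipped with the framework.

```php
<?php
// Sketch only: assumes Zend Framework 1.x classes can be autoloaded.

/** Returns true if PCRE was built with UTF-8 support (hypothetical helper). */
function pcreSupportsUtf8()
{
    // \pL (any Unicode letter) with the /u modifier only matches
    // when PCRE has UTF-8 support compiled in.
    return @preg_match('/\pL/u', 'a') === 1;
}

/** Installs a UTF-8 aware default analyzer (hypothetical helper). */
function useUtf8Analyzer()
{
    if (!pcreSupportsUtf8()) {
        throw new RuntimeException('PCRE was built without UTF-8 support');
    }

    // The case-insensitive variant additionally relies on mbstring,
    // so fall back to the case-sensitive one when it is missing.
    $analyzer = extension_loaded('mbstring')
        ? new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8_CaseInsensitive()
        : new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8();

    Zend_Search_Lucene_Analysis_Analyzer::setDefault($analyzer);

    // Parse query strings as UTF-8 too, so searches match the indexed tokens.
    Zend_Search_Lucene_Search_QueryParser::setDefaultEncoding('utf-8');
}
```

Call `useUtf8Analyzer()` before both indexing and searching, since the same analyzer has to be used on both sides. When adding fields, pass the encoding explicitly, e.g. `Zend_Search_Lucene_Field::Text('body', $text, 'UTF-8')` (the field name `body` is just a placeholder). An index that was built with the default analyzer still contains the split tokens and needs to be rebuilt.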

ax 2009-12-30 14:36:27
