I'm having some serious issues using Zend_Lucene and foreign characters like åäö. These issues appear both when the index is created and when it's queried. I've tried both iso-8859-1 and utf-8.

ISO-8859-1

The query that doesn't work looks like "+_area:skåne". With Zend_Lucene I get no matches, but if I run the same query in Luke I get many matching documents.

The index contains 20 fields. The "_area" field is added with the following syntax:

$doc->addField(Zend_Search_Lucene_Field::keyword('_area', strtolower($item['area']), 'iso-8859-1'));

I am using the Zend_Search_Lucene_Analysis_Analyzer_Common_TextNum_CaseInsensitive analyzer.

During indexing, the notice below appeared intermittently (the indexed documents were randomly selected from a database with iso-8859-1 encoding):

Notice: iconv(): Detected an illegal character in input string in TextNum.php.

This was "solved" by checking if $this->_input is empty, as it seemed that this caused the notices. Note: The weird query results were a pre-existing condition.

When I search keyword fields using foreign characters I receive the notice above, but when I search text fields it behaves differently: it generates about a hundred copies of the notice below.

Notice: Undefined offset: 1996 in \Zend\Search\Lucene\Search\Query\MultiTerm.php on line 472

But it produces what looks like a correct result set! On a side note, this second query doesn't generate any results in Luke.

UTF-8

I've also tried UTF-8 because, to my knowledge, Zend_Lucene uses it internally. Since the data set is ISO-8859-1, I convert it using utf8_encode. But the indexing produces the following errors.

Notice: Undefined offset: 266979 in \Zend\Search\Lucene\Index\SegmentInfo.php on line 632

Notice: Trying to get property of non-object in \Zend\Search\Lucene\Index\SegmentMerger.php on line 196

Notice: Trying to get property of non-object in \Zend\Search\Lucene\Index\SegmentMerger.php on line 200

Notice: Undefined index: in \Zend\Search\Lucene\Index\SegmentWriter.php on line 231

Notice: Trying to get property of non-object in \Zend\Search\Lucene\Index\SegmentWriter.php on line 231

Notice: Undefined offset: 250595 in \Zend\Search\Lucene\Index\SegmentInfo.php on line 2020

Notice: Trying to get property of non-object in \Zend\Search\Lucene\Index\SegmentInfo.php on line 2020

Notice: Undefined index: in \Zend\Search\Lucene\Index\SegmentWriter.php on line 465 ...


So. Can someone please shed some light? :) I believe (after days of googling) that I'm not the only one experiencing this.

+1  A: 

I suggest you try using a UTF-8 compatible text analyzer. It looks like the analyzer you are using destroys the non-ASCII characters. You should make sure that the text is input properly, and that it reaches Lucene in the proper format.

Yuval F
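
The suggestion above could look roughly like the sketch below. It assumes Zend Framework 1.x is loaded, and that its UTF-8 analyzers work on the installation (they need PCRE built with UTF-8 support and, for the case-insensitive variant, mbstring). The helper name configureUtf8Search() is made up for illustration; it only does anything if the Zend classes are available.

```php
<?php
// Sketch only: assumes Zend Framework 1.x is on the include path / autoloaded.
// configureUtf8Search() is a hypothetical helper name, not part of Zend.
function configureUtf8Search()
{
    $analyzerClass = 'Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num_CaseInsensitive';
    if (!class_exists($analyzerClass)) {
        return false; // Zend_Search_Lucene not available; nothing configured
    }

    // Tokenize UTF-8 letters and digits and lower-case the resulting terms,
    // instead of the ASCII-oriented TextNum analyzer.
    Zend_Search_Lucene_Analysis_Analyzer::setDefault(
        new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num_CaseInsensitive()
    );

    // Tell the query parser that query strings arrive as UTF-8.
    Zend_Search_Lucene_Search_QueryParser::setDefaultEncoding('utf-8');

    return true;
}

$configured = configureUtf8Search();
```

The same analyzer should be set as the default both when building the index and when querying it, so that terms are tokenized the same way on both sides.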
I actually was able to solve this right after I posted, but your answer would probably have led me to the solution. I went the UTF-8 way, and by using all mb_* functions and such, I managed to run the indexing without errors. It seems to be working now, and I can query the index and it returns valid results. Thanks for the quick response!
Znarkus
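
For readers wondering what "the UTF-8 way with mb_* functions" might look like, here is a minimal sketch. It is not the poster's actual code: the helper name toIndexableUtf8() is hypothetical, and it assumes the mbstring extension is installed and that the source data really is ISO-8859-1.

```php
<?php
// Hypothetical helper (not from the original post): convert an ISO-8859-1
// value to UTF-8, then lower-case it in a multibyte-aware way. Plain
// strtolower() works on bytes and may not handle characters such as Å.
function toIndexableUtf8($latin1Value)
{
    $utf8 = mb_convert_encoding($latin1Value, 'UTF-8', 'ISO-8859-1');
    return mb_strtolower($utf8, 'UTF-8');
}

// "SKÅNE" written as ISO-8859-1 bytes (Å is 0xC5 in that encoding).
$area = toIndexableUtf8("SK\xC5NE");

// With Zend Framework 1.x loaded, the value could then be indexed as UTF-8:
// $doc->addField(Zend_Search_Lucene_Field::keyword('_area', $area, 'utf-8'));
```

The point is to convert once, at the boundary, and then treat every string that reaches the index or the query parser as UTF-8.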