I'm having some serious issues using Zend_Lucene and foreign characters like åäö. These issues appear both when the index is created and when it's queried. I've tried both iso-8859-1 and utf-8.
ISO-8859-1
The query that doesn't work looks like "+_area:skåne". With Zend_Lucene I'm getting no matches, but if I run this query in Luke I get many matching documents.
The index contains 20 fields. The "_area" field is added with the following syntax:
$doc->addField(Zend_Search_Lucene_Field::keyword('_area', strtolower($item['area']), 'iso-8859-1'));
I am using the Zend_Search_Lucene_Analysis_Analyzer_Common_TextNum_CaseInsensitive analyzer.
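For reference, here is a minimal sketch of the lowercasing step on its own, outside of Zend. It assumes the mbstring extension is available; the helper name lowercaseLatin1 is made up for illustration. Depending on PHP version and locale, plain strtolower may leave bytes such as 0xC5 (Å) untouched, whereas mb_strtolower handles them when told the encoding. This only isolates one step and is not a confirmed cause of the missing matches.

```php
<?php
// Hypothetical helper (not part of Zend): lowercase a value that is
// known to be ISO-8859-1 encoded, using mbstring instead of strtolower.
function lowercaseLatin1(string $value): string
{
    // With the encoding given explicitly, 0xC5 (Å) is mapped to 0xE5 (å).
    return mb_strtolower($value, 'ISO-8859-1');
}

// "SKÅNE" written as raw ISO-8859-1 bytes, so the encoding of this
// source file does not influence the result.
$area  = "SK\xC5NE";
$lower = lowercaseLatin1($area);

// Print the bytes as hex to inspect what would be handed to the index.
echo bin2hex($lower), "\n";
```

Dumping the bytes with bin2hex before calling addField makes it easy to compare what is actually stored against what the query contains.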
While indexing, the notice below appeared occasionally (the indexed documents were selected at random from a database with ISO-8859-1 encoding):
Notice: iconv(): Detected an illegal character in input string in TextNum.php.
This was "solved" by checking if $this->_input is empty, as it seemed that this caused the notices. Note: The weird query results were a pre-existing condition.
When I search keyword fields using foreign characters I receive the notice above, but when I search text fields it behaves differently: I get about a hundred copies of the notice below.
Notice: Undefined offset: 1996 in \Zend\Search\Lucene\Search\Query\MultiTerm.php on line 472
But it produces what looks like a correct result set! On a side note, this second query doesn't generate any results in Luke.
UTF-8
I've also tried UTF-8 because, to my knowledge, Zend_Lucene uses it internally. Since the data set is ISO-8859-1, I convert it using utf8_encode. But the indexing produces the following errors.
Notice: Undefined offset: 266979 in \Zend\Search\Lucene\Index\SegmentInfo.php on line 632
Notice: Trying to get property of non-object in \Zend\Search\Lucene\Index\SegmentMerger.php on line 196
Notice: Trying to get property of non-object in \Zend\Search\Lucene\Index\SegmentMerger.php on line 200
Notice: Undefined index: in \Zend\Search\Lucene\Index\SegmentWriter.php on line 231
Notice: Trying to get property of non-object in \Zend\Search\Lucene\Index\SegmentWriter.php on line 231
Notice: Undefined offset: 250595 in \Zend\Search\Lucene\Index\SegmentInfo.php on line 2020
Notice: Trying to get property of non-object in \Zend\Search\Lucene\Index\SegmentInfo.php on line 2020
Notice: Undefined index: in \Zend\Search\Lucene\Index\SegmentWriter.php on line 465 ...
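The ISO-8859-1 to UTF-8 conversion step described above can be sketched in isolation as follows. This is only an illustration under assumptions: it uses mbstring rather than utf8_encode, the helper name toLowerUtf8 is hypothetical, and the check for already-converted input is a heuristic. It says nothing about why the segment notices appear.

```php
<?php
// Hypothetical helper (not part of Zend): return a lowercased UTF-8
// version of a value that is expected to be ISO-8859-1.
function toLowerUtf8(string $value): string
{
    // Heuristic guard against double encoding: if the value contains
    // non-ASCII bytes and is already valid UTF-8, do not convert again.
    $hasHighBytes = preg_match('/[\x80-\xFF]/', $value) === 1;
    if (!($hasHighBytes && mb_check_encoding($value, 'UTF-8'))) {
        $value = mb_convert_encoding($value, 'UTF-8', 'ISO-8859-1');
    }

    // Lowercase with the encoding stated explicitly.
    return mb_strtolower($value, 'UTF-8');
}

// "SKÅNE" as raw ISO-8859-1 bytes.
$fromLatin1 = toLowerUtf8("SK\xC5NE");

// "SKÅNE" as raw UTF-8 bytes; the guard should leave the encoding alone.
$fromUtf8 = toLowerUtf8("SK\xC3\x85NE");

echo bin2hex($fromLatin1), "\n";
echo bin2hex($fromUtf8), "\n";
```

Both calls should yield the same bytes, which is a quick way to check that no value is being converted twice before it reaches the index.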
So. Can someone please shed some light? :) I believe (after days of googling) that I'm not the only one experiencing this.