views: 281
answers: 3

I'm creating a website in three languages: English, Russian, and Chinese. I hope that if I use UTF-8 in both the application and the database, there won't be any problems with input and output (will there?)

But the most frightening part of it is search. It has to be good: full-text, properly indexed, and so on. I hope it will understand morphology, use stemming, etc.

First, I looked at Zend_Search_Lucene, but as I realised from http://framework.zend.com/issues/browse/ZF/component/10021, it has problems with Chinese. :(

Now I'm thinking about Sphinx. It supports both English and Russian stemming. I'm not sure how good it is with Chinese, and I have no idea how hard it would be for me to add support for it. http://www.sphinxsearch.com/forum/view.html?id=1554 is a silver lining, but as an inexperienced Sphinx user I don't think I understand what is said there.
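For reference, a minimal sphinx.conf sketch of such a setup (morphology, ngram_len and ngram_chars are real Sphinx options; the index name, source and path below are placeholders, and the CJK codepoint range follows the Sphinx documentation). Since stemming does not apply to Chinese, CJK text is typically indexed as single-character n-grams instead:

    index i_multilang
    {
        source       = src_multilang        # placeholder source
        path         = /var/data/multilang  # placeholder path
        charset_type = utf-8
        morphology   = stem_enru            # Snowball stemming for English + Russian
        ngram_len    = 1                    # index CJK text one character at a time
        ngram_chars  = U+3000..U+2FA1F      # codepoints treated as CJK n-gram chars
    }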


So,

does anyone have any experience in such 'language-agnostic' search and can share it with me, please?

and can you give me something to test the search. As a native Russian speaker with some basic knowledge of English I can test both Russian and English searches by myself, but I don't even know about which parts of this Chinese pics are words. Please, give me some Chinese strings to put them into index and some queries with expected results!

A: 

From the Xapian docs:

Xapian uses the Snowball Stemming Algorithms. At present, these support Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, and Turkish. There are also implementations of Lovins' English stemmer, Porter's original English stemmer, the Kraaij-Pohlmann Dutch stemmer, and a variation of the German stemmer which normalises umlauts.

For some of the world's languages, Chinese for example, the concept of stemming is not applicable, but it is certainly meaningful for the many languages of the Indo-European group.

http://xapian.org/docs/stemming.html

peufeu
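To make the quote concrete, here is a tiny sketch using the Xapian PHP bindings. This assumes the xapian extension is installed; the class and method names follow the examples shipped with the bindings, so treat it as an illustration rather than a tested snippet:

    <?php
    // Snowball stemmers for two of the languages in question; per the quoted
    // docs there is no Chinese stemmer, since stemming doesn't apply there.
    $en = new XapianStem("english");
    $ru = new XapianStem("russian");

    echo $en->apply("connections"), "\n"; // Snowball English gives "connect"
    echo $ru->apply("поисковые"), "\n";   // a stemmed Russian form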
That's just a direct copy from the docs. The Chinese problem is about how to split the text before it goes into the database, without doing something really dumb like inserting a space between every character.
goodwill
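To illustrate the splitting problem: a common workaround for CJK text, when real word segmentation isn't available, is to index overlapping character bigrams rather than inserting spaces between single characters. A rough PHP sketch, assuming valid UTF-8 input; cjk_bigrams is a made-up helper name, and a real indexer would apply it only to runs of CJK characters:

    <?php
    // Naive bigram splitter for CJK text (hypothetical helper, not a library call).
    // Splits a UTF-8 string into characters, then joins each pair of neighbours.
    function cjk_bigrams($text)
    {
        $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
        $grams = array();
        for ($i = 0; $i < count($chars) - 1; $i++) {
            $grams[] = $chars[$i] . $chars[$i + 1];
        }
        return $grams;
    }

    // "中文维基" yields "中文", "文维", "维基"
    print_r(cjk_bigrams("中文维基"));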
A: 

Isn't Google User Search enough for your needs? What exactly don't you like about it?

FractalizeR
I can't use it in this situation :(
valya
+3  A: 

Hi!

Ideographic characters in languages such as Chinese or Japanese require two terminal character positions, so you will have problems with UTF-8; you should use UTF-16 instead.

Apart from that, any search engine that supports UTF-16 and meets your requirements (e.g. stemming) should work fine. That is, if you like Sphinx, go for it!

Seb
Oh! Thanks for the comment! Sphinx doesn't support Chinese morphology, does it?
valya
Sure it does! As long as you are consistent with your encoding throughout the whole application, it can handle everything. Take a look here: http://www.sphinxsearch.com/faq.html#encoding
Seb
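A sketch of what "consistent with your encoding throughout the whole application" can look like in PHP, assuming a MySQL backend (the connection details are placeholders):

    <?php
    // Keep one encoding (UTF-8 here) end to end: mbstring, HTTP output, database.
    mb_internal_encoding('UTF-8');                     // default for mb_* functions
    header('Content-Type: text/html; charset=utf-8');  // what the browser receives

    // Placeholder credentials; the point is that the database connection
    // must use the same encoding as the rest of the application.
    $db = new PDO('mysql:host=localhost;dbname=site', 'user', 'password');
    $db->exec("SET NAMES utf8");                       // MySQL talks UTF-8 too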
I've tried using some texts from http://zh.wikipedia.org/zh-tw/Wikipedia:%E9%A6%96%E9%A1%B5 in my application, just like the English ones. The texts were saved and are displayed correctly. Maybe I've misunderstood you?
valya
Are you still using UTF-8, or have you changed *the whole app* to UTF-16? If you're still on UTF-8, then inconsistencies may appear...
Seb
No, I haven't changed it yet, but I'm going to. Can you please tell me if there are any potential problems lurking in a change from UTF-8 to UTF-16? I've never worked with UTF-16 in my life. Thanks!
valya
Well, that could be it. I'd suggest trying it out in a new, separate project before converting everything, so you gain experience with UTF-16. You can't just change the encoding and expect everything to work fine: if you have strings in UTF-8, you'll have to convert them to UTF-16. Also, you'll want to look at PHP's multi-byte string functions: http://php.net/manual/en/book.mbstring.php. Good luck!
Seb
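Following that suggestion, here is a tiny standalone experiment one could run before converting anything: round-tripping a mixed-language string between UTF-8 and UTF-16 with mbstring. The sample string is arbitrary:

    <?php
    $utf8  = "english, русский, 中文";
    $utf16 = mb_convert_encoding($utf8, 'UTF-16', 'UTF-8');

    // Byte-oriented functions like strlen() count bytes, not characters,
    // which is where most UTF-8/UTF-16 surprises come from.
    echo strlen($utf8), "\n";               // byte count
    echo mb_strlen($utf8, 'UTF-8'), "\n";   // character count
    echo mb_strlen($utf16, 'UTF-16'), "\n"; // same character count in UTF-16

    // The round trip back to UTF-8 should be lossless.
    var_dump($utf8 === mb_convert_encoding($utf16, 'UTF-8', 'UTF-16'));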