I have lots of users posting articles with names in different languages. I need a library to transliterate those article names into English (Latin) letters, for example turning Russian 'р' into English 'r', and so on for all European languages, Russian and Asian languages. Where can I get such a library?

45 seconds of Google gave me this: "This extension allows you to transliterate text in non-latin characters (such as Chinese, Cyrillic, Greek etc) to latin characters." It seems to be what I really need. Has anyone tried this in real life?

+1  A: 

Will iconv do?

With this module, you can turn a string represented by a local character set into the one represented by another character set, which may be the Unicode character set.

From PHP manual:

$text = "This is the Euro symbol '€'.";

echo 'Original : ', $text, PHP_EOL;
echo 'TRANSLIT : ', iconv("UTF-8", "ISO-8859-1//TRANSLIT", $text), PHP_EOL;
echo 'IGNORE   : ', iconv("UTF-8", "ISO-8859-1//IGNORE", $text), PHP_EOL;
echo 'Plain    : ', iconv("UTF-8", "ISO-8859-1", $text), PHP_EOL;

If that won't do, check out these

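As an alternative, define the character map in an array and use str_replace or strtr to do the conversion. (mb_substitute_character only sets the placeholder used for characters that cannot be converted; it does not map one character to another.)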
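A minimal sketch of the character-map approach, assuming Russian input only. The map `$cyrillicToLatin` and the helper `transliterateWithMap` are hypothetical names made up for this example, and the romanizations are one informal convention rather than any ISO standard.

```php
<?php
// Hypothetical map of Russian Cyrillic letters to Latin approximations.
// The spellings are an informal convention chosen for this sketch.
$cyrillicToLatin = [
    'а' => 'a', 'б' => 'b', 'в' => 'v', 'г' => 'g', 'д' => 'd', 'е' => 'e',
    'ё' => 'yo', 'ж' => 'zh', 'з' => 'z', 'и' => 'i', 'й' => 'y', 'к' => 'k',
    'л' => 'l', 'м' => 'm', 'н' => 'n', 'о' => 'o', 'п' => 'p', 'р' => 'r',
    'с' => 's', 'т' => 't', 'у' => 'u', 'ф' => 'f', 'х' => 'kh', 'ц' => 'ts',
    'ч' => 'ch', 'ш' => 'sh', 'щ' => 'shch', 'ъ' => '', 'ы' => 'y', 'ь' => '',
    'э' => 'e', 'ю' => 'yu', 'я' => 'ya',
    'А' => 'A', 'Б' => 'B', 'В' => 'V', 'Г' => 'G', 'Д' => 'D', 'Е' => 'E',
    'Ё' => 'Yo', 'Ж' => 'Zh', 'З' => 'Z', 'И' => 'I', 'Й' => 'Y', 'К' => 'K',
    'Л' => 'L', 'М' => 'M', 'Н' => 'N', 'О' => 'O', 'П' => 'P', 'Р' => 'R',
    'С' => 'S', 'Т' => 'T', 'У' => 'U', 'Ф' => 'F', 'Х' => 'Kh', 'Ц' => 'Ts',
    'Ч' => 'Ch', 'Ш' => 'Sh', 'Щ' => 'Shch', 'Ъ' => '', 'Ы' => 'Y', 'Ь' => '',
    'Э' => 'E', 'Ю' => 'Yu', 'Я' => 'Ya',
];

function transliterateWithMap(string $text, array $map): string
{
    // strtr() with an array tries the longest keys first and never
    // re-scans text it has already replaced, so replacements cannot
    // cascade into each other. Characters missing from the map are
    // left unchanged.
    return strtr($text, $map);
}

echo transliterateWithMap('Москва', $cyrillicToLatin), PHP_EOL; // Moskva
```

strtr is used here instead of str_replace because str_replace applies its pairs one after another, so an earlier replacement can be rewritten by a later one. A map like this only covers the scripts you write it for; it does nothing for Chinese or Thai.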

Gordon
Will iconv actually turn `Москва` into `Moskva`? Wow if it does. Can't try out right now...
Pekka
@Pekka no clue. I haven't used iconv too often, let alone with Russian character sets, but basically, yes, this is what it should be able to do.
Gordon
@Gordon doesn't seem to work for me, it just drops the Cyrillic characters. I think this is more complex, see my answer.
Pekka
+1  A: 

I am not a linguist, far from it, but I submit to you the possibility that what you are trying to do is impossible, or extremely complex to implement.

After all, translating names is more than just "converting alphabets." It is comparatively easy for Russian because every Cyrillic character actually has a Latin counterpart (they are sister alphabets).

I don't know about Arabic, but for Chinese you will need a romanization system like Pinyin to get anywhere. It's more complex than simply replacing characters.

Here's a full list of ISO Romanizations - If I understand correctly, a solution that works for you would have to implement those rules.

So the task would be:

  • Analyze a text containing numerous different character ranges

  • Identify which character range each word belongs to (อักษรไทย is Thai; Москва is Cyrillic; and so on)

  • Apply the correct method of romanization to every word.
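The first two steps can be sketched with PCRE's Unicode script properties. This is only an illustration under assumptions: `detectScript` and `tagWordsByScript` are hypothetical helper names, only four scripts are checked, words are split on whitespace (which does not work for languages written without spaces), and a word is labelled by the first script that matches any of its characters.

```php
<?php
// Return the name of the first script (from a short, fixed list) that
// occurs anywhere in $word, or 'Unknown' if none of them matches.
function detectScript(string $word): string
{
    $scripts = [
        'Cyrillic' => '/\p{Cyrillic}/u',
        'Thai'     => '/\p{Thai}/u',
        'Han'      => '/\p{Han}/u',
        'Latin'    => '/\p{Latin}/u',
    ];
    foreach ($scripts as $name => $pattern) {
        if (preg_match($pattern, $word) === 1) {
            return $name;
        }
    }
    return 'Unknown';
}

// Split a text on whitespace and pair every word with its detected script.
function tagWordsByScript(string $text): array
{
    $words = preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
    $tagged = [];
    foreach ($words as $word) {
        $tagged[] = [$word, detectScript($word)];
    }
    return $tagged;
}

foreach (tagWordsByScript('อักษรไทย Москва Moscow') as [$word, $script]) {
    echo $word, ' => ', $script, PHP_EOL;
}
```

The third step, choosing a romanization method per script, would then dispatch on the returned name; that is the hard part and is not attempted here.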

Now I'm very interested to hear about any libraries that can do this in PHP, but it is well possible that there are none.
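One candidate worth testing is the Transliterator class from PHP's intl extension, which wraps ICU's transform rules. The sketch below assumes intl is installed (it returns null otherwise); `romanize` is a hypothetical wrapper name, and the quality of the result for Chinese, Japanese, Hebrew or Arabic is subject to all the caveats discussed in this thread.

```php
<?php
// Romanize $text with ICU via the intl extension, if it is available.
// 'Any-Latin' converts any supported script to Latin letters, and
// 'Latin-ASCII' then strips accents and other non-ASCII Latin forms.
function romanize(string $text): ?string
{
    if (!class_exists('Transliterator')) {
        return null; // intl extension not loaded
    }
    $transliterator = Transliterator::create('Any-Latin; Latin-ASCII');
    if ($transliterator === null) {
        return null; // ICU rejected the transform id
    }
    $result = $transliterator->transliterate($text);
    return $result === false ? null : $result;
}

var_dump(romanize('Москва'));
```

Because ICU picks one reading per character, the output for Han characters is a mechanical Pinyin-style guess, not a context-aware romanization.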

Pekka
@Pekka The way I understood the OP, if there is Åå, he wants it to be Aa. That's what iconv does, *if* it has a mapping. It finds the closest approximate, nothing more, nothing less. So either it will turn Москва into Mockba or Moskva. Like I said, I'm not sure (that's why I posed the answer as a question), but even if it doesn't he can still use the custom mapping approach or try any of the other libs, like Recode or MB_*
Gordon
@Gordon I'm sure this can be made to work for Russian. But he wants `all european languages, russian and asian languages`; the latter is going to be *very* tough.
Pekka
@Gordon: The problem is mainly with languages like Chinese and Japanese. Japanese is workable if you have kana (kanji can be read in many different ways, depending on context), but for Chinese, they only use Hanzi characters. To the best of my knowledge, Chinese does not have a single pronunciation for every Hanzi character. There's also a problem for languages like Hebrew and Arabic, since vowels are not necessarily present, but implied. You can mechanically transliterate the stuff that *is* there, but the result may well be useless.
Michael Madsen
There may be other languages with similar problems, but I can't think of any.
Michael Madsen
@Michael: Right you are about Japanese; most Chinese characters have just one pronunciation, and for those that have multiple, the alternates typically occur only in certain words and phrases, so the situation is far less extreme than in Japanese. Hebrew and Arabic are a huge problem for transliteration, because vowels are entirely dependent on the grammatical structure of the sentence. Think of the obvious (to a human) difference between "see" and "saw" in the English sentence "I saw him yesterday" and then represent it as "' s hm ystrd".
Jon Purdy
+2  A: 

Google has an AJAX transliteration API which does a good job on many major scripts.

Edit: Damn, it appears on further inspection that this only allows conversions from the Latin alphabet. It's kind of silly that Google hasn't made the reverse functionality available, since they're already using it in Google Translate to provide romanisations for Cyrillic, Chinese, Thai, Hindi, and others, though notably not abjads such as Hebrew and Arabic.

Further Edit: I thought of a possible workaround: detect the language and use an AJAX query to run it through Google Translate using the same source language as destination language, e.g. Chinese-to-Chinese. Firebug reveals that the transliteration is output in a div whose ID is translit. Transliterations are typically heavily accented, so you'll need to convert them. This is by no means something to rely on (though Google typically doesn't make frequent structural changes to their HTML), but it is certainly an interesting possibility.

Jon Purdy
Still a big +1 for pointing this out!
Pekka
Thank you very much!
Jon Purdy
+1, and there is one more way: there is something they call "romanization", which is a rendering in readable and easily transliterable Roman letters.
Blender
http://code.google.com/apis/ajaxlanguage/documentation/referenceTransliteration.html
Blender
@Ole Jak: Romanisation is just transliteration to the Latin alphabet. And that's a better link, though you'll still note at the bottom that English unfortunately isn't in the list of valid destination languages.
Jon Purdy