Hey guys!
Does anyone know if there is support for Chinese pinyin? I get the results here with correct Chinese pinyin (see "Show romanization" link).
Thank you.
Hey guys!
Does anyone know if there is support for Chinese pinyin? I get the results here with correct Chinese pinyin (see "Show romanization" link).
Thank you.
I don't know if Google AJAX Language APIs have support for converting to pinyin, but if they don't it actually isn't too hard to do a passable conversion on your on. (The reverse conversion, from pinyin to hanzi (characters) is much more tricky, because pinyin is very lossy.)
To do the conversion yourself, grab the Unihan.zip, a downloaable verion of the Unihan database. The file you actually care about is Unihan_Readings.txt. It also contains a bunch of stuff you don't care about, and it's also stored in a pretty inefficient way, so don't be too worried about the large file sizes. You should extract the stuff you care about and store it in a more efficient way.
In it you'll find tab-delimited lines like this:
U+597D kCantonese hou2 hou3
U+597D kDefinition good, excellent, fine; well
U+597D kHangul 호
U+597D kHanyuPinlu hao3(6060) hao1(142) hao4(115)
U+597D kHanyuPinyin 21028.010:hǎo,hào
U+597D kJapaneseKun KONOMU SUKU YOI
U+597D kJapaneseOn KOU
U+597D kKorean HO
U+597D kMandarin HAO3 HAO4
U+597D kTang *xɑ̀u *xɑ̌u
U+597D kVietnamese háo
U+597D kXHC1983 0445.030:hǎo 0448.030:hào
The left column ("U+597D") is the unicode codepoint, the middle column is an attribute name, and the right column is the attribute value. You can extract either the kHanyuPinyin attributes or the kMandarin attributes. They encode basically the same information -- just go with whichever is an easier format for you to deal with. (hǎo == HAO3, hào == HAO4, if that isn't obvious)
You'll note that for some characters (like the example I've chosen here) there are multiple pronunciations. This is the one tricky bit. Depending on how much precision you want, you may be able to get away with just using the first romanization listed, as they're in order of decreasing frequency. (Actually, this is one of the places where kHanyuPinyin is a bit different from kMandarin -- it actually has multiple lists of pronunciations, each ordered by frequency.)
Google translate includes "show/hide romanization" which is BETTER than UNIHAN for two reasons. First, known words are logically grouped together in the proper manner (at least it tries to do that). Secondly, Chinese characters have more than one possible pronunciation. It is not a trivial problem to figure out which pinyin transliteration is the right one. That's what the translation engine does.