ansaurus

Question

Google AJAX Language API with Chinese language

Answer 1

+1 A:

I don't know if the Google AJAX Language APIs support converting to pinyin, but if they don't, it actually isn't too hard to do a passable conversion on your own. (The reverse conversion, from pinyin to hanzi (characters), is much trickier, because pinyin is very lossy.)

To do the conversion yourself, grab Unihan.zip, a downloadable version of the Unihan database. The file you actually care about is Unihan_Readings.txt. It also contains a bunch of stuff you don't care about, and it's stored in a pretty inefficient way, so don't be too worried about the large file sizes. You should extract the stuff you care about and store it in a more efficient way.
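As a rough sketch of that extraction step, the following Python copies only the lines for one attribute out of the archive into a much smaller file. The function name, the output path, and the default attribute are my own choices for illustration, and it assumes the tab-separated layout described below.

```python
import io
import zipfile


def extract_field_lines(zip_path, out_path, field="kMandarin",
                        member="Unihan_Readings.txt"):
    """Copy only the lines for one attribute out of the archive.

    Returns the number of lines kept. `field` and `member` are assumptions
    based on the file and attribute names mentioned in this answer.
    """
    marker = "\t" + field + "\t"  # attribute name sits between two tabs
    kept = 0
    with zipfile.ZipFile(zip_path) as archive, \
         archive.open(member) as raw, \
         open(out_path, "w", encoding="utf-8") as out:
        # The zip member is a binary stream, so decode it as UTF-8 text.
        for line in io.TextIOWrapper(raw, encoding="utf-8"):
            if marker in line:
                out.write(line)
                kept += 1
    return kept


# Example call (paths are placeholders):
# extract_field_lines("Unihan.zip", "mandarin_only.txt")
```

Filtering before parsing keeps the working file small, which matters if you plan to ship the data to the client.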

In it you'll find tab-delimited lines like this:

U+597D  kCantonese      hou2 hou3
U+597D  kDefinition     good, excellent, fine; well
U+597D  kHangul         호
U+597D  kHanyuPinlu     hao3(6060) hao1(142) hao4(115)
U+597D  kHanyuPinyin    21028.010:hǎo,hào
U+597D  kJapaneseKun    KONOMU SUKU YOI
U+597D  kJapaneseOn     KOU
U+597D  kKorean         HO
U+597D  kMandarin       HAO3 HAO4
U+597D  kTang           *xɑ̀u *xɑ̌u
U+597D  kVietnamese     háo
U+597D  kXHC1983        0445.030:hǎo 0448.030:hào

The left column ("U+597D") is the Unicode codepoint, the middle column is an attribute name, and the right column is the attribute value. You can extract either the kHanyuPinyin attributes or the kMandarin attributes. They encode basically the same information -- just go with whichever format is easier for you to deal with. (hǎo == HAO3, hào == HAO4, if that isn't obvious.)
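Here is a minimal sketch of parsing those three columns in Python. The function name and the sample lines are made up for illustration, and it assumes kMandarin values are space-separated as in the excerpt above; other releases of the file may format the values differently.

```python
def parse_unihan_readings(lines, field="kMandarin"):
    """Map each character to the list of readings given for one attribute."""
    readings = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # not a codepoint / attribute / value line
        codepoint, name, value = parts
        if name != field:
            continue
        # "U+597D" -> the character itself
        char = chr(int(codepoint[2:], 16))
        readings[char] = value.split()
    return readings


# Two sample lines copied from the excerpt above, with real tabs.
SAMPLE = [
    "U+597D\tkDefinition\tgood, excellent, fine; well\n",
    "U+597D\tkMandarin\tHAO3 HAO4\n",
]
table = parse_unihan_readings(SAMPLE)
print(table)
```

In practice you would pass an open file object instead of the sample list, since iterating over a file yields its lines.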

You'll note that for some characters (like the example I've chosen here) there are multiple pronunciations. This is the one tricky bit. Depending on how much precision you want, you may be able to get away with just using the first romanization listed, as they're in order of decreasing frequency. (Actually, this is one of the places where kHanyuPinyin is a bit different from kMandarin -- it actually has multiple lists of pronunciations, each ordered by frequency.)
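The "just use the first romanization" shortcut could look roughly like this. The table here is hand-written in the kMandarin style shown above (a real one would be built from Unihan_Readings.txt), and the function name is hypothetical. Characters without an entry are passed through unchanged.

```python
# Hand-written sample table for illustration only.
SAMPLE_READINGS = {
    "你": ["NI3"],
    "好": ["HAO3", "HAO4"],
}


def first_readings(text, readings):
    """Replace each known character with its first listed reading."""
    tokens = []
    for ch in text:
        options = readings.get(ch)
        # Take the first (most frequent) reading; keep unknown characters as-is.
        tokens.append(options[0].lower() if options else ch)
    return tokens


print(" ".join(first_readings("你好", SAMPLE_READINGS)))  # ni3 hao3
```

This ignores context entirely, so characters whose reading depends on the surrounding word will sometimes come out wrong.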

Laurence Gonsalves 2010-01-08 19:05:08

Yeah, I was thinking about this too, but getting data from Unihan means another query to the DB, and it's not the best solution for long words. I'm pretty sure the Google AJAX Language API uses the same dictionary as Google Translate does, but the question is how to retrieve the pinyin as well as the translation itself?

Den Thomas 2010-01-08 19:11:57

I agree that it'd be nice to get this info from the API you're already using, if it has it. This is more of a "Plan B". I'm not sure what sort of DB you're referring to, but you can probably store the data you extract from Unihan on the client. Take a look at http://xenomachina.com/toys/pinyin2hanzi.html, a page on my site that does the inverse mapping. It has a js file with the entire pinyin-to-hanzi mapping (extracted from Unihan.txt), and it's only 50K.

Laurence Gonsalves 2010-01-08 19:42:36

Thanks for the suggestion!

Den Thomas 2010-01-08 19:56:19

Answer 2

A:

Google Translate includes "show/hide romanization", which is BETTER than Unihan for two reasons. First, known words are logically grouped together in the proper manner (at least it tries to do that). Second, many Chinese characters have more than one possible pronunciation, and it is not a trivial problem to figure out which pinyin transliteration is the right one. That's what the translation engine does.

Jeffrey Singer 2010-04-03 18:11:07

Answer 3

A:

You can trick the API into giving you Pinyin by translating from Chinese to Chinese. Sample link.

mdm 2010-09-15 13:30:03
