This may not really be a Python-related question, but it pertains to character encoding in general. I'm mining tweets from Twitter, and there appears to be a large Japanese user community posting messages in Japanese. When I encoded the tweets for an XML file I used UTF-8, e.g. tweet = tweet.encode('utf-8'), but none of the Japanese tweets appeared as they should have. How should I have encoded them? What was my mistake? And if I were to store the data in a CSV file instead, what encoding scheme would I use in that case?
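A simplified sketch of what I'm doing (the file handling around it is illustrative):
tweet = tweet.encode('utf-8')  # tweet comes straight from the Twitter feed
xml_file.write('<tweet>%s</tweet>' % tweet)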
+2
A:
Normally you would query the format for what encoding the data is in. Having said that, Shift-JIS is quite a popular encoding for Japanese text.
>>> u'あいうえお'.encode('shift-jis')
'\x82\xa0\x82\xa2\x82\xa4\x82\xa6\x82\xa8'
Ignacio Vazquez-Abrams
2010-02-16 05:50:16
What encoding scheme would you suggest for Slavic languages or South Asian languages?
rohanbk
2010-02-16 18:49:51
I would look through Python's standard encodings for clues. http://docs.python.org/library/codecs.html#standard-encodings
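For Cyrillic text, for example, cp1251 is a common legacy encoding:
>>> u'привет'.encode('cp1251')
'\xef\xf0\xe8\xe2\xe5\xf2'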
Ignacio Vazquez-Abrams
2010-02-16 21:03:56
+2
A:
There should be a way to query the encoding of the tweets when you read them from Twitter. Decode them to Unicode as they come into your program, then encode them when you write them back out to an XML file. Chinese text, for example, might use the gbk encoding:
import codecs

# Decode the raw bytes to Unicode using the source encoding (gbk here)
unicode_data = data.decode('gbk')

# codecs.open() re-encodes the Unicode data as UTF-8 on each write
f = codecs.open('out.xml', 'w', 'utf-8')
f.write(unicode_data)
f.close()
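The same approach covers the CSV part of the question. Python 2's csv module works on byte strings rather than Unicode, so the usual workaround is to encode each field to UTF-8 just before writing. A minimal sketch, with rows standing in for your already-decoded tweet data:
import csv

f = open('out.csv', 'wb')
writer = csv.writer(f)
for row in rows:
    # The csv module can't take unicode objects directly in Python 2,
    # so encode each field to a UTF-8 byte string first
    writer.writerow([field.encode('utf-8') for field in row])
f.close()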
Mark Tolonen
2010-02-16 16:04:34