This may not really be a Python-related question, but it pertains to character encoding in general. I'm mining tweets from Twitter, and there appears to be a large Japanese user community posting messages in Japanese. When I encoded the tweets for an XML file I used UTF-8, e.g. tweet = tweet.encode('utf-8'), but none of the Japanese tweets appeared as they should have. How should I have encoded them? What was my mistake? And if I were to store the data in a CSV file instead, what encoding scheme would I use in that case?
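A simplified sketch of what I'm doing (the file handling around it is illustrative):
tweet = tweet.encode('utf-8')  # tweet comes straight from the Twitter feed
xml_file.write('<tweet>%s</tweet>' % tweet)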
+2
A:
Normally you would query the format for what encoding the data is in. Having said that, Shift-JIS is quite a popular encoding for Japanese text.
>>> u'あいうえお'.encode('shift-jis')
'\x82\xa0\x82\xa2\x82\xa4\x82\xa6\x82\xa8'
Ignacio Vazquez-Abrams
2010-02-16 05:50:16
What encoding scheme would you suggest for Slavic languages or South Asian languages?
rohanbk
2010-02-16 18:49:51
I would look through Python's standard encodings for clues. http://docs.python.org/library/codecs.html#standard-encodings
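For Cyrillic text, for example, cp1251 is a common legacy encoding:
>>> u'привет'.encode('cp1251')
'\xef\xf0\xe8\xe2\xe5\xf2'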
Ignacio Vazquez-Abrams
2010-02-16 21:03:56
+2
A:
There should be a way to query the encoding of the tweets when you read them from Twitter. Decode them to Unicode as they come into your program, then encode them when you write them back out to an XML file. Chinese text, for example, might use the gbk encoding:
import codecs

# Decode the raw bytes to Unicode using the source encoding (gbk here)
unicode_data = data.decode('gbk')

# codecs.open() re-encodes the Unicode data as UTF-8 on each write
f = codecs.open('out.xml', 'w', 'utf-8')
f.write(unicode_data)
f.close()
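The same approach covers the CSV part of the question. Python 2's csv module works on byte strings rather than Unicode, so the usual workaround is to encode each field to UTF-8 just before writing. A minimal sketch, with rows standing in for your already-decoded tweet data:
import csv

f = open('out.csv', 'wb')
writer = csv.writer(f)
for row in rows:
    # The csv module can't take unicode objects directly in Python 2,
    # so encode each field to a UTF-8 byte string first
    writer.writerow([field.encode('utf-8') for field in row])
f.close()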
Mark Tolonen
2010-02-16 16:04:34