ansaurus

Question

How can I make this Python2.6 function work with Unicode?

Answer 1

+3 A:

For each string that you read from your file, you can convert them to unicode by calling rawness.decode('utf-8'), if you have the text in UTF-8. You will end up with unicode objects. Also, I don't know what "jotted" is, but you may want to make sure it's a unicode object and use u'\n'.join(jotted) instead.

Update:

It appears that the NLTK library doesn't like unicode objects. Fine, then you have to make sure that you are using str instances with UTF-8 encoded text. Try using this:

tokens = nltk.wordpunct_tokenize(unirawness)
nltktext = nltk.Text([token.encode('utf-8') for token in tokens])

and this:

jottedf = u'\n'.join(jotted)
filemydata.write(jottedf.encode('utf-8'))

but if jotted is really a list of UTF-8-encoded str, then you don't need this and this should be enough:

jottedf = '\n'.join(jotted)
filemydata.write(jottedf)

By the way, it looks as though NLTK isn't very cautious with respect to unicode and encoding (at least, the demos). Better be careful and check that it has processed your tokens correctly. Also, and this may have caused the fact that you get errors with Hungarian text and not German text, check your encodings.

Ivo 2010-09-22 10:49:04

@Ivo I added that bit and got an error. I amended the question to show it and the amendment.

old Ixfoxleigh 2010-09-23 21:50:51

@Ivo Woo. *initiate euphoria*... they were all UTF-8... and yes, the .join function was right as-is, but the openbook function needed your two fixes... thank you!!!

old Ixfoxleigh 2010-09-26 00:49:32

ansaurus

tags:

views:

answers:

How can I make this Python2.6 function work with Unicode?

Update:

related questions