I've got this function, which I modified from material in chapter 1 of the online NLTK book. It's been very useful to me but, despite reading the chapter on Unicode, I feel just as lost as before.
def openbookreturnvocab(book):
fileopen = open(book)
rawness = fileopen.read()
tokens = nltk.wordpunct_tokenize(rawness)
nltktext = nltk.Text(tokens)
nltkwords = [w.lower() for w in nltktext]
nltkvocab = sorted(set(nltkwords))
return nltkvocab
When I tried it the other day on Also Sprach Zarathustra, it clobbered words with an umlat over the o's and u's. I'm sure some of you will know why that happened. I'm also sure that it's quite easy to fix. I know that it just has to do with calling a function that re-encodes the tokens into unicode strings. If so, that it seems to me it might not happen inside that function definition at all, but here, where I prepare to write to file:
def jotindex(jotted, filename, readmethod):
filemydata = open(filename, readmethod)
jottedf = '\n'.join(jotted)
filemydata.write(jottedf)
filemydata.close()
return 0
I heard that what I had to do was encode the string into unicode after reading it from the file. I tried amending the function like so:
def openbookreturnvocab(book):
fileopen = open(book)
rawness = fileopen.read()
unirawness = rawness.decode('utf-8')
tokens = nltk.wordpunct_tokenize(unirawness)
nltktext = nltk.Text(tokens)
nltkwords = [w.lower() for w in nltktext]
nltkvocab = sorted(set(nltkwords))
return nltkvocab
But that brought this error, when I used it on Hungarian. When I used it on German, I had no errors.
>>> import bookroutines
>>> elles1 = bookroutines.openbookreturnvocab("lk1-les1")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "bookroutines.py", line 9, in openbookreturnvocab
nltktext = nltk.Text(tokens)
File "/usr/lib/pymodules/python2.6/nltk/text.py", line 285, in __init__
self.name = " ".join(map(str, tokens[:8])) + "..."
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 4: ordinal not in range(128)
I fixed the function that files the data like so:
def jotindex(jotted, filename, readmethod):
filemydata = open(filename, readmethod)
jottedf = u'\n'.join(jotted)
filemydata.write(jottedf)
filemydata.close()
return 0
However, that brought this error, when I tried to file the German:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "bookroutines.py", line 23, in jotindex
filemydata.write(jottedf)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 414: ordinal not in range(128)
>>>
...which is what you get when you try to write the u'\n'.join'ed data.
>>> jottedf = u'/n'.join(elles1)
>>> filemydata.write(jottedf)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 504: ordinal not in range(128)