Hi,

I am using Python 2.6.1 and am having a UTF-8 related problem with my code. The problem is reproducible with this code:

# -*- coding: utf-8 -*-
import os, sys
import string, time
import codecs, re
bDATA='"Domenick Lombardozzi","Eddie Marsan","Isaach De Bankolé","John Hawkes"'
print (bDATA)
fileObj = codecs.open("btvresp1.txt", "r", "utf-8")
data = fileObj.read()
print (data)

The first print of bDATA works just fine. However, if the same data is read from the file btvresp1.txt, Python complains as follows:

cat btvresp2.txt
"Domenick Lombardozzi","Eddie Marsan","Isaach De Bankol?","John Hawkes"

python
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29) 
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> # -*- coding: utf-8 -*-
... 
>>> import os, sys
>>> import string, time
>>> import codecs, re
>>> bDATA='"Domenick Lombardozzi","Eddie Marsan","Isaach De Bankol","John Hawkes"'
>>> print (bDATA)
"Domenick Lombardozzi","Eddie Marsan","Isaach De Bankol","John Hawkes"
>>> fileObj = codecs.open("btvresp2.txt", "r", "utf-8")
>>> data = fileObj.read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/codecs.py", line 666, in read
    return self.reader.read(size)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/codecs.py", line 472, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 55-57: invalid data

I am not sure why the same data causes problems when it is read from a file. Can someone shed light on why this happens and how I can resolve it?

Thanks in advance!

+2  A: 

It looks like the contents of your file are not encoded in UTF-8. Are you sure you didn't save it in some other encoding? When you cat the file, the terminal displays a ? instead of the é, which would also hint at an encoding problem in the file, since your terminal seems to use UTF-8.

Also you have two files, btvresp1.txt and btvresp2.txt. Are you using the correct one?

sth
My bad. btvresp1.txt and btvresp2.txt differ very little (I deleted a few names from the bDATA list). Since I am using a Mac, the Terminal application is UTF-8 aware and I am able to see é just fine.
Bapatla
+1  A: 

codecs.open returns an object whose read method returns a unicode string, not an encoded byte string -- that's the whole point of the codecs.open function. So, your print (data), if and when you get to it, will be entirely, drastically different from your working print (bDATA): the latter is printing utf-8 encoded byte strings, the former will be trying to print unicode objects (which may or may not work depending on your environment -- but you should be fine on a Terminal.app set to use utf-8 encoding).
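
To make the byte string versus unicode distinction concrete, here is a minimal sketch. It is written so that it also runs on Python 3, hence the explicit b prefix on the byte string; the variable names are illustrative and not taken from the question's code.

```python
# -*- coding: utf-8 -*-
# UTF-8 encoded bytes for "Bankolé": é (U+00E9) occupies two bytes, 0xC3 0xA9.
encoded = b"Bankol\xc3\xa9"

# Decoding yields a text (unicode) object, the kind of value that a
# reader opened with codecs.open(..., "r", "utf-8") returns from read().
decoded = encoded.decode("utf-8")

print(len(encoded))  # 8: counts bytes
print(len(decoded))  # 7: counts characters
```

The two objects hold the same name, but one is a sequence of bytes and the other a sequence of characters, which is why printing them can behave differently.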

However, your problems come much earlier: the codecs-produced file-like object asserts that bytes 55 to 57 are not a valid utf-8 encoding. The way to check this is something like...:

>>> f = open("btvresp2.txt", "rb")
>>> print repr(f.read()[50:65])

where I'm also showing a few bytes before and after, for context. If you do that and edit your question to show us the results, we might be able to guess what encoding your file is actually in (the only certainty, at this point, is that it's not in utf-8 encoding).
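
The same check can be wrapped in a small helper that locates the offending bytes automatically. This is only a sketch: find_bad_utf8 is a hypothetical name, and the sample bytes are an assumption (a name with a single-byte é, as a non-UTF-8 file might contain). The exact error span reported can differ between Python versions, so only the start offset is used.

```python
def find_bad_utf8(raw, context=5):
    """Return (offset, surrounding bytes) for the first invalid UTF-8
    sequence in raw, or None if raw decodes cleanly."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the offset of the first byte that failed to decode.
        lo = max(err.start - context, 0)
        return err.start, raw[lo:err.end + context]
    return None

# Assumed sample: é stored as the single byte 0xE9, which is not valid UTF-8 here.
sample = b'"Isaach De Bankol\xe9","John Hawkes"'
print(find_bad_utf8(sample))
```

For a real file, pass in the result of open(filename, "rb").read().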

Alex Martelli
Here is the modified code and its output:

$ cat btvresp2.txt
"Isaach De Bankol?","John Hawkes"
$ cat btvtestout.py
# -*- coding: utf-8 -*-
import os, sys
import string, time
import codecs, re
bDATA='"Isaach De Bankolé","John Hawkes"'
print (bDATA)
fileObj = open("btvresp2.txt", "rb")
print repr(fileObj.read()[11:25])
$ python btvtestout.py
"Isaach De Bankolé","John Hawkes"
'Bankol\xe9","John'
Bapatla
@Bapatla, the `\xe9` for lowercase e with acute accent is **not** utf-8 -- it's probably some `ISO-8859-x` (where `x` could be e.g. `1` or `15`) or maybe `CP-1252` or something like that. Try reading it with iso-8859-1 and see what happens.
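
A minimal sketch of that suggestion, assuming the file really is ISO-8859-1 (which the single `\xe9` byte suggests but does not prove). The bytes are the ones shown in the repr output; the variable names are illustrative.

```python
# Bytes as reported by repr(): é is the single byte 0xE9.
raw = b'Bankol\xe9","John'

# Decode with the encoding the data appears to be in ...
text = raw.decode("iso-8859-1")

# ... and re-encode as UTF-8 if a UTF-8 byte string is needed downstream.
utf8_bytes = text.encode("utf-8")

print(repr(utf8_bytes))
```

When reading from the file directly, passing "iso-8859-1" instead of "utf-8" as the encoding argument of codecs.open performs the same decoding step.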
Alex Martelli
Yes! Using iso-8859-1, I was able to print out the character properly. So, at present, it looks like the framework I am using has an issue with iso-8859-1: it is defaulting to utf8 even when I specify other encodings. Thanks for your help!
Bapatla