ansaurus

Question

Python utf-8 handling

Answer 1

+2 A:

It looks like the contents of your file are not encoded in UTF-8. Are you sure you didn't save it in some other encoding? When you cat the file, the terminal displays a ? instead of the é, which would also hint at an encoding problem in the file, since your terminal seems to use UTF-8.

Also you have two files, btvresp1.txt and btvresp2.txt. Are you using the correct one?

sth 2010-07-20 18:26:45

My bad. btvresp1.txt and btvresp2.txt differ very little (I deleted a few names from the bDATA list). Since I am using a Mac, the Terminal application is UTF-8 aware, and I am able to see é just fine.

Bapatla 2010-07-20 19:00:41

Answer 2

+1 A:

codecs.open returns an object whose read method returns a unicode string, not an encoded byte string; that's the whole point of the codecs.open function. So your print (data), if and when you get to it, will be drastically different from your working print (bDATA): the latter prints a utf-8 encoded byte string, while the former will try to print a unicode object (which may or may not work depending on your environment, but you should be fine on a Terminal.app set to use utf-8 encoding).
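A minimal sketch of that difference, written for Python 3 (the thread itself uses Python 2, where the decoded type is called `unicode`). The temporary file and the sample string are stand-ins for illustration, not the poster's actual file:

```python
import codecs
import os
import tempfile

# Hypothetical sample data, written out as UTF-8 bytes.
sample = u'"Isaach De Bankol\u00e9","John Hawkes"'
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as out:
    out.write(sample.encode("utf-8"))

# Binary mode returns the raw, still-encoded bytes.
with open(path, "rb") as raw_file:
    raw = raw_file.read()

# codecs.open decodes while reading and returns text.
with codecs.open(path, "r", encoding="utf-8") as text_file:
    text = text_file.read()

os.remove(path)

# The byte string is one element longer, because the accented
# character takes two bytes in UTF-8 but is a single character.
print(type(raw).__name__, len(raw))
print(type(text).__name__, len(text))
```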

However, your problems come much earlier: the codecs-produced file-like object reports that bytes 55 to 57 are not valid utf-8. The way to check this is something like:

>>> f = open("btvresp2.txt", "rb")
>>> print repr(f.read()[50:65])

where I'm also showing a few bytes before and after, for context. If you do that and edit your question to show us the results, we might be able to guess what encoding your file is actually in (the only certainty, at this point, is that it's not in utf-8 encoding).
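The same inspection can be automated by asking the decode error where it happened. This is only a sketch for Python 3: `show_bad_bytes` is a hypothetical helper name, and the byte string is typed in by hand to mimic the file contents quoted in this thread.

```python
def show_bad_bytes(data, encoding="utf-8", context=8):
    """Return (start, end, snippet) for the first undecodable span, or None."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as err:
        # err.start and err.end mark the offending bytes; widen the
        # slice a little on each side to show some context.
        lo = max(err.start - context, 0)
        hi = min(err.end + context, len(data))
        return err.start, err.end, data[lo:hi]
    return None


# Hand-typed stand-in for the file contents.
data = b'"Isaach De Bankol\xe9","John Hawkes"'
result = show_bad_bytes(data)
print(result)
```

A return value of `None` would mean the data decoded cleanly in the given encoding.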

Alex Martelli 2010-07-20 19:58:17

Here is the modified code and its output:

    $ cat btvresp2.txt
    "Isaach De Bankol?","John Hawkes"
    $ cat btvtestout.py
    # -*- coding: utf-8 -*-
    import os, sys
    import string, time
    import codecs, re
    bDATA='"Isaach De Bankolé","John Hawkes"'
    print (bDATA)
    fileObj = open("btvresp2.txt", "rb")
    print repr(fileObj.read()[11:25])
    $ python btvtestout.py
    "Isaach De Bankolé","John Hawkes"
    'Bankol\xe9","John'

Bapatla 2010-07-20 21:56:27

@Bapatla, the `\xe9` for lowercase e with acute accent is **not** utf-8 (in utf-8 that character is the two bytes `\xc3\xa9`). It's probably some `ISO-8859-x` (where `x` could be e.g. `1` or `15`) or maybe `CP-1252` or something like that. Try reading it with iso-8859-1 and see what happens.
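The claim about `\xe9` can be checked directly by decoding the bytes with each candidate encoding. A rough sketch for Python 3, where `try_decodings` is a hypothetical helper and the list of encodings is just the set of guesses mentioned above:

```python
raw = b"Bankol\xe9"  # the bytes reported by repr() in this thread


def try_decodings(data, encodings):
    """Map each encoding name to the decoded text, or None if decoding fails."""
    results = {}
    for name in encodings:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


results = try_decodings(raw, ["utf-8", "iso-8859-1", "iso-8859-15", "cp1252"])
for name, text in results.items():
    if text is None:
        print(name, "-> cannot decode")
    else:
        # Print the code point rather than the character itself, so the
        # output does not depend on the terminal's encoding.
        print(name, "-> last character is U+%04X" % ord(text[-1]))

# For comparison, the UTF-8 encoding of U+00E9 is two bytes.
utf8_bytes = u"\u00e9".encode("utf-8")
print(repr(utf8_bytes))
```

A successful decode only shows that an encoding is possible, not that it is the one the file was saved in: several single-byte encodings map this byte to the same character.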

Alex Martelli 2010-07-21 01:01:24

Yes! Using iso-8859-1, I was able to print the character properly. So at present it looks like the framework I am using has an issue with iso-8859-1: it defaults to utf-8 even when I specify another encoding. Thanks for your help!

Bapatla 2010-07-26 19:30:13
