I tried to do that, and I got these errors:

>>> import re  
>>> x = 'Ingl\xeas'  
>>> x  
'Ingl\xeas'  
>>> print x  
Ingl�s  
>>> x.decode('utf8')  
Traceback (most recent call last):  
    File "<stdin>", line 1, in <module>  
    File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode  
        return codecs.utf_8_decode(input, errors, True)  
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 4-5: unexpected end of data  
>>> x.decode('utf8', 'ignore')  
u'Ingl'  
>>> x.decode('utf8', 'replace')  
u'Ingl\ufffd'  
>>> print x.decode('utf8', 'replace')  
Ingl�  
>>> print x.decode('utf8', 'xmlcharrefreplace')  
Traceback (most recent call last):  
    File "<stdin>", line 1, in <module>  
    File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode  
        return codecs.utf_8_decode(input, errors, True)  
TypeError: don't know how to handle UnicodeDecodeError in error callback  

When I use the print statement, I want this:

>>> print x  
u'Inglês'  

Any help is welcome.

+4  A: 

You need to know how the input data is encoded before you decode it. In some of your attempts, you're trying to decode it from UTF-8, but Python throws an exception because the input isn't valid UTF-8. It looks like it might be latin-1. This works for me:

>>> x = 'Ingl\xeas'
>>> print x.decode('latin1')
Inglês

You mention "non-ASCII HTML". If you're writing a web server script and you're getting data from an HTTP request, you should check the Content-Type header. In an ideal world, it will tell you which encoding the client is using for the data. Keep in mind that the client may be working incorrectly.
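As a rough sketch of the Content-Type check (written for Python 3, where the request body is a bytes object; the helper names and the latin-1 fallback are assumptions for illustration, not part of any web framework), the standard library's email.message.Message can parse the charset parameter out of the header value:

```python
from email.message import Message


def charset_from_content_type(header_value, default=None):
    """Return the charset parameter of a Content-Type header value, or default."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lowercases the name and returns `default` if absent.
    return msg.get_content_charset(default)


def decode_body(raw_body, content_type, fallback="latin-1"):
    """Decode raw_body with the declared charset, else the assumed fallback."""
    charset = charset_from_content_type(content_type, fallback)
    return raw_body.decode(charset)


print(decode_body(b"Ingl\xeas", "text/html; charset=ISO-8859-1"))
```

If the client declares a charset that Python does not know, decode() raises LookupError, and if the declared charset is simply wrong you may still get UnicodeDecodeError or garbage, so treat the header as a hint rather than a guarantee.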

Hope that helps!

Daniel Stutzbach
A: 
Ingl\xeas

is not UTF-8 but (probably) Windows-1252 or latin1 encoded. So you first need to decode it; only then can you encode it to UTF-8.

Therefore:

>>> x = 'Ingl\xeas'
>>> print x.decode("cp1252")
Inglês

Similarly,

>>> x.decode("cp1252").encode("UTF-8")
'Ingl\xc3\xaas'

which is the correct UTF-8 representation.

By the way, in Python 3, you can (at least in the interactive console under Windows) simply type

>>> x = 'Ingl\xeas'
>>> print (x)
Inglês

since Python 3 strings are always Unicode strings (not counting bytes objects).

Tim Pietzcker
Python 3 has two string types, just like Python 2. 3's `str` is 2's `unicode` with trivial modifications. 3's `bytes` is 2's `str` with moderate modifications.
Mike Graham
Your Python 3 example throws a UnicodeEncodeError exception.
Daniel Stutzbach
@Daniel: Not in the interactive shell.
Tim Pietzcker
@Tim: it does for me. I guess it depends on how the installation is set up? I get: UnicodeEncodeError: 'ascii' codec can't encode character '\xea' in position 4: ordinal not in range(128)
Daniel Stutzbach
Oh, it might have to do with the local environment. I'm on Windows, therefore the interactive shell's encoding is Windows-1252. Under Linux, it might be UTF-8. Will edit my post.
Tim Pietzcker
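The disagreement in these comments comes down to the encoding of the console that print writes to. A minimal Python 3 sketch (the helper name printable is hypothetical) that looks at sys.stdout.encoding and escapes whatever the console cannot represent, instead of raising UnicodeEncodeError:

```python
import sys


def printable(text, encoding=None):
    """Return text with characters the target encoding lacks shown as escapes."""
    # sys.stdout.encoding can be None when stdout is replaced; assume ASCII then.
    encoding = encoding or sys.stdout.encoding or "ascii"
    return text.encode(encoding, errors="backslashreplace").decode(encoding)


# What this prints depends on the console encoding, as discussed above.
print(printable("Ingl\xeas"))
```

On a console whose encoding includes U+00EA (cp1252, UTF-8) the text comes through unchanged; on an ASCII-only console the character is shown as a backslash escape.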
A: 

Some observations:

(1) latin1 will decode ANY 8-bit byte without throwing an exception, so a successful latin1 decode proves nothing about the real encoding. Use latin1 only when you have exhausted all other possibilities. Use chardet to help decide which encoding a particular file, web page, or XML stream uses.
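To illustrate point (1), here is a small Python 3 sketch (the helper can_decode is a hypothetical name) showing that latin1 accepts all 256 byte values, while cp1252 and UTF-8 reject some input and therefore give you at least a little evidence when they succeed:

```python
every_byte = bytes(range(256))

# latin-1 maps each byte to the code point with the same number,
# so decoding can never fail, whatever the data really is.
as_latin1 = every_byte.decode("latin-1")


def can_decode(raw, encoding):
    """Return True if raw decodes under encoding without an exception."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


for name in ("latin-1", "cp1252", "utf-8"):
    print(name, can_decode(every_byte, name), can_decode(b"Ingl\xeas", name))
```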

(2) Possible alternatives based on very limited evidence (ONE character):

>>> import unicodedata as ucd
>>> for codepage in range(1250, 1259):
...    try:
...        uc = "\xea".decode(str(codepage))
...    except UnicodeDecodeError:
...        continue  # skip this code page; uc would be stale or undefined
...    if uc == u'\xea': print codepage, ucd.name(uc)
...
1252 LATIN SMALL LETTER E WITH CIRCUMFLEX
1254 LATIN SMALL LETTER E WITH CIRCUMFLEX
1256 LATIN SMALL LETTER E WITH CIRCUMFLEX
1258 LATIN SMALL LETTER E WITH CIRCUMFLEX
>>>

(3) The range U+0080 to U+009F (inclusive) is assigned to the "C1 control characters", whose purpose hardly anyone outside unicode.org can explain. Whatever encoding you use (even UTF-8), decoding to unicode without an exception does not mean you are out of the woods. Check for characters in that range. If you find any, your data is corrupt or your choice of encoding is wrong.

def check_for_c1_control_characters(unicode_obj):
    return any(u'\u0080' <= c <= u'\u009F' for c in unicode_obj)

or use a regex, as in this example of how to fix one of the many ways the data can be corrupted.
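One common form of this corruption is cp1252 text that was decoded as latin1, which turns curly quotes and similar characters into C1 controls. A minimal Python 3 sketch of the regex approach for that one case (the function names are hypothetical, and this is not necessarily the same fix as the example referred to above):

```python
import re

# Matches any character in the C1 control range U+0080..U+009F.
C1_CONTROLS = re.compile("[\u0080-\u009f]")


def has_c1_controls(text):
    """Return True if text contains at least one C1 control character."""
    return C1_CONTROLS.search(text) is not None


def repair_cp1252_read_as_latin1(text):
    """Reinterpret each C1 control as the cp1252 character for the same byte."""
    def fix(match):
        ch = match.group(0)
        try:
            return ch.encode("latin-1").decode("cp1252")
        except UnicodeDecodeError:
            # A few bytes (e.g. 0x81) are undefined in cp1252; leave them alone.
            return ch
    return C1_CONTROLS.sub(fix, text)


damaged = b"\x93Ingl\xeas\x94".decode("latin-1")
print(has_c1_controls(damaged), repair_cp1252_read_as_latin1(damaged))
```

This only helps when the assumption about the original encoding (cp1252) is right; other kinds of corruption need a different repair.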

John Machin