ansaurus

Question

How to encode HTML non-ASCII data to UTF-8 in Python

Answer 1

+4 A:

You need to know how the input data is encoded before you can decode it. In some of your attempts, you're trying to decode it as UTF-8, but Python throws an exception because the input isn't valid UTF-8. It looks like it might be latin-1. This works for me:

>>> x = 'Ingl\xeas'
>>> print x.decode('latin1')
Inglês

You mention "non-ASCII HTML". If you're writing a web server script and you're getting data from an HTTP request, you should check the Content-Type header. In an ideal world, it will tell you which encoding the client is using for the data. Keep in mind that the client may be working incorrectly.
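As a rough sketch of that check, the charset can be pulled out of a Content-Type value with the standard library's `email.message.Message`. The helper names `charset_from_content_type` and `decode_body`, and the choice of cp1252 as the fallback, are assumptions made for illustration; the code is written to run on Python 3 and also on Python 2.7.

```python
from email.message import Message


def charset_from_content_type(header_value, default=None):
    """Return the lower-cased charset named in a Content-Type value, if any."""
    msg = Message()
    msg['Content-Type'] = header_value
    return msg.get_content_charset(default)


def decode_body(raw_bytes, header_value, fallback='cp1252'):
    """Decode with the declared charset; fall back if it is missing or wrong."""
    charset = charset_from_content_type(header_value) or fallback
    try:
        return raw_bytes.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # The client declared an unknown charset or one that does not match
        # the bytes it actually sent (assumption: cp1252 is a sane fallback).
        return raw_bytes.decode(fallback, 'replace')
```

For example, `decode_body(b'Ingl\xeas', 'text/html; charset=ISO-8859-1')` decodes the body using the declared charset, while a header that wrongly claims UTF-8 ends up on the fallback path.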

Hope that helps!

Daniel Stutzbach 2010-03-07 16:22:28

Answer 2

A:

Ingl\xeas

is not UTF-8 but (probably) Windows-1252- or latin1-encoded. So you first need to decode it. Only then can you encode it to UTF-8.

Therefore:

>>> x = 'Ingl\xeas'
>>> print x.decode("cp1252")
Inglês

Similarly,

>>> x.decode("cp1252").encode("UTF-8")
'Ingl\xc3\xaas'

which is the correct UTF-8 representation.
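The decode-then-encode step can be wrapped in a small helper. This is only a sketch: `to_utf8` is a hypothetical name, and the source encoding has to be supplied by the caller because it cannot be read off the bytes themselves. The `b''` literals let the same code run on Python 3 and Python 2.7.

```python
def to_utf8(raw_bytes, source_encoding):
    """Transcode bytes from a known source encoding to UTF-8 bytes."""
    text = raw_bytes.decode(source_encoding)  # bytes -> unicode text
    return text.encode('utf-8')               # unicode text -> UTF-8 bytes


utf8_bytes = to_utf8(b'Ingl\xeas', 'cp1252')
```

Decoding `utf8_bytes` as UTF-8 gives back the same text that decoding the original bytes as cp1252 does.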

By the way, in Python 3, you can (at least in the interactive console under Windows) simply type

>>> x = 'Ingl\xeas'
>>> print (x)
Inglês

since Python 3 strings are always Unicode strings (not counting bytes objects).
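Whether that print succeeds depends on the encoding of the console it writes to. A minimal sketch for checking this in advance follows; `can_encode` is a hypothetical helper name, and treating a console that reports no encoding as ASCII is an assumption.

```python
import sys


def can_encode(text, encoding=None):
    """Return True if text can be written using the given (or console) encoding."""
    # Assumption: a stream that reports no encoding is treated as ASCII.
    encoding = encoding or getattr(sys.stdout, 'encoding', None) or 'ascii'
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True
```

With an ASCII console, `can_encode(u'Ingl\xeas')` is false, which matches a UnicodeEncodeError on print; with cp1252 or UTF-8 it is true.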

Tim Pietzcker 2010-03-07 16:24:07

Python 3 has two string types, just like Python 2. 3's `str` is 2's `unicode` with trivial modifications. 3's `bytes` is 2's `str` with moderate modifications.

Mike Graham 2010-03-07 17:28:34

Your Python 3 example throws a UnicodeEncodeError exception.

Daniel Stutzbach 2010-03-07 20:57:13

@Daniel: Not in the interactive shell.

Tim Pietzcker 2010-03-08 07:31:24

@Tim: it does for me. I guess it depends on how the installation is set up? I get: UnicodeEncodeError: 'ascii' codec can't encode character '\xea' in position 4: ordinal not in range(128)

Daniel Stutzbach 2010-03-08 13:59:01

Oh, it might have to do with the local environment. I'm on Windows, therefore the interactive shell's encoding is Windows-1252. Under Linux, it might be UTF-8. Will edit my post.

Tim Pietzcker 2010-03-08 14:37:35

Answer 3

A:

Some observations:

(1) latin1 will decode ANY sequence of 8-bit bytes without throwing an exception, so a successful latin1 decode proves nothing about the real encoding. Use latin1 only when you have exhausted all other possibilities. Use chardet to help decide what encoding a particular file, webpage, or XML stream is in.
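One way to act on this is to try stricter encodings first and keep latin1 as the last resort. The sketch below uses only the standard library rather than chardet; the function name `decode_with_fallbacks` and the candidate order (UTF-8, then cp1252) are assumptions for illustration.

```python
def decode_with_fallbacks(raw_bytes, candidates=('utf-8', 'cp1252')):
    """Try strict encodings in order; return (text, encoding_used)."""
    for encoding in candidates:
        try:
            return raw_bytes.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # latin-1 accepts every byte, so it goes last and never raises.
    return raw_bytes.decode('latin-1'), 'latin-1'
```

A result that reports `'latin-1'` should be treated as a guess, not as confirmation of the encoding.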

(2) Possible alternatives based on very limited evidence (ONE character):

>>> import unicodedata as ucd
>>> for codepage in range(1250, 1259):
...    try:
...        uc = "\xea".decode(str(codepage))
...    except UnicodeDecodeError:
...        continue
...    if uc == u'\xea': print codepage, ucd.name(uc)
...
1252 LATIN SMALL LETTER E WITH CIRCUMFLEX
1254 LATIN SMALL LETTER E WITH CIRCUMFLEX
1256 LATIN SMALL LETTER E WITH CIRCUMFLEX
1258 LATIN SMALL LETTER E WITH CIRCUMFLEX
>>>

(3) The range U+0080 to U+009F (inclusive) is assigned to the "C1 control characters", whose purpose hardly anyone outside unicode.org could explain. Whatever encoding you use (even UTF-8), a decode to unicode that raises no exception does not mean you are out of the woods. Check the result for characters in that range. If you find any, your data is corrupt or your choice of encoding is wrong.

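def check_for_c1_control_characters(unicode_obj):
    # The u prefix matters in Python 2: without it, '\u0080' is a plain
    # six-character byte string, not the single character U+0080.
    return any(u'\u0080' <= c <= u'\u009F' for c in unicode_obj)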

or use a regex, as in this example of how to fix one of the many ways the data can be corrupted.
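A minimal sketch of the regex variant, together with a repair for one specific kind of corruption (cp1252 bytes that were decoded as latin1). The names `C1_CONTROLS`, `has_c1_controls` and `repair_latin1_misdecode` are hypothetical, and the repair assumes that particular cause; other kinds of damage are left untouched.

```python
import re

# Matches any character in the C1 control range U+0080..U+009F.
C1_CONTROLS = re.compile(u'[\u0080-\u009f]')


def has_c1_controls(text):
    """Return True if the decoded text contains a C1 control character."""
    return C1_CONTROLS.search(text) is not None


def repair_latin1_misdecode(text):
    """Undo cp1252 data that was decoded as latin-1 (assumed cause)."""
    if not has_c1_controls(text):
        return text
    try:
        # Recover the original bytes, then decode them as cp1252.
        return text.encode('latin-1').decode('cp1252')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # some other kind of damage; leave it alone
```

For instance, text that contains U+0093 and U+0094 where curly quotes were intended is a typical sign of this mix-up.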

John Machin 2010-03-08 00:19:09
