ansaurus

Question

How do I print a list of strings, when I can't know the char encoding in advance?

Answer 1

+1 A:

The UnicodeDammit module from BeautifulSoup can automagically detect the encoding.

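# BeautifulSoup 3 import path (Python 2)
from BeautifulSoup import UnicodeDammit

# Pass in the raw byte string whose encoding is unknown
u = UnicodeDammit("Ólafur Jóhann Ólafsson")

print u.unicode           # the decoded Unicode text
print u.originalEncoding  # the encoding UnicodeDammit guessed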
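
If BeautifulSoup isn't available, a rough standard-library fallback is to try a short list of candidate encodings in order. This is only a sketch in Python 3 syntax, not what UnicodeDammit does internally; the helper name `decode_with_fallback` and the candidate list are assumptions picked for illustration, and a wrong guess can still decode "successfully" into garbage.

```python
def decode_with_fallback(raw, candidates=("utf-8", "cp1252")):
    """Return (text, encoding) for the first candidate that decodes raw cleanly."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so it never fails,
    # but the result may be mojibake if the real encoding was something else.
    return raw.decode("latin-1"), "latin-1"


# The same name, encoded two different ways (hypothetical input data).
names = ["Ólafur Jóhann Ólafsson".encode("utf-8"),
         "Ólafur Jóhann Ólafsson".encode("cp1252")]

for raw in names:
    text, encoding = decode_with_fallback(raw)
    # ascii() keeps the demo output safe on any terminal.
    print(encoding, ascii(text))
```

Putting UTF-8 first matters: it is strict enough that non-UTF-8 input usually fails to decode, whereas single-byte encodings such as cp1252 accept almost anything.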

leoluk 2010-09-06 16:08:36

This is great. Thank you. I will allow a while for more people to answer, but this sounds like it'll do the trick.

kobrien 2010-09-06 16:13:44

Answer 2

+1 A:

This page may help you: http://wiki.python.org/moin/PrintFails

The problem, I guess, is that you need to print those names to the console. Do you really need that, or is it just a test environment? If you only use the console for testing, you could switch to other tools, such as unit tests, to check exactly which values you get.

dmitko 2010-09-06 20:10:02

Answer 3

+1 A:

First of all, you should decode data to Unicode (the absence of encoding) when reading from a file, pipe, socket, terminal, etc., and encode Unicode to an appropriate byte encoding when sending or persisting data. I suspect getting this backwards is the root of your problem.
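
As a rough illustration of that decode-on-input, encode-on-output rule, here is a sketch in Python 3 syntax. The in-memory buffers stand in for a real file or socket, and the choice of UTF-8 for input and cp437 for output is an assumption made for the example.

```python
import io

# Bytes as they might arrive from a file, pipe, or socket (UTF-8 assumed).
incoming = io.BytesIO("1\u00bd cups".encode("utf-8"))

# Decode once, at the input boundary.
text = incoming.read().decode("utf-8")

# Work with Unicode text inside the program.
shouted = text.upper()

# Encode once, at the output boundary, in whatever the destination expects.
outgoing = io.BytesIO()
outgoing.write(shouted.encode("cp437"))
payload = outgoing.getvalue()
```

Everything between the two boundaries only ever sees Unicode text, so no implicit conversions can sneak in.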

The web service should declare the encoding in the headers or in the data received. print normally encodes Unicode automatically to the terminal's encoding (discovered through sys.stdout.encoding) or, in the absence of that, to plain ascii. If the characters in your data are not supported by the target encoding, you'll get a UnicodeEncodeError.
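
To see that failure mode without depending on a real terminal, the sketch below (Python 3 syntax) wraps an in-memory buffer in a text stream with an ASCII encoding. `make_stream` is a hypothetical helper, and using `errors="replace"` is just one possible way to avoid the exception, not something the original poster's setup is known to need.

```python
import io


def make_stream(encoding, errors="strict"):
    """Build a text stream over an in-memory buffer, standing in for stdout."""
    buffer = io.BytesIO()
    stream = io.TextIOWrapper(buffer, encoding=encoding, errors=errors,
                              newline="\n")
    return stream, buffer


# Strict mode: characters outside the target encoding raise UnicodeEncodeError.
strict_stream, _ = make_stream("ascii")
try:
    print("\u00bd", file=strict_stream)
    strict_stream.flush()
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# errors="replace" substitutes "?" for unsupported characters instead of raising.
lenient_stream, lenient_buffer = make_stream("ascii", errors="replace")
print("\u00bd", file=lenient_stream)
lenient_stream.flush()
replaced = lenient_buffer.getvalue()
```

The lenient stream loses information, so it is better suited to debugging output than to data you intend to keep.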

Since that is not the error you received, you should post some code so we can see what you are doing. Most likely, you are encoding a byte string instead of decoding it. Here's an example of this:

>>> data = '\xc2\xbd' # UTF-8 encoded 1/2 symbol.
>>> data.encode('cp437')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\dev\python\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

What I did here is call encode on a byte string. Since encode requires a Unicode string, Python used the default ascii encoding to decode the byte string to Unicode first, before encoding to cp437.
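
The implicit step can also be written out by hand. Python 3 refuses to call encode on a byte string at all, so this sketch (Python 3 syntax, with variable names chosen only for illustration) spells out the ASCII decode that Python 2 performed silently.

```python
data = b"\xc2\xbd"  # UTF-8 encoded 1/2 symbol, as in the session above

# Python 2 effectively ran data.decode("ascii").encode("cp437").
try:
    data.decode("ascii").encode("cp437")
    error_name = None
    failing_codec = None
except UnicodeDecodeError as exc:
    error_name = type(exc).__name__
    failing_codec = exc.encoding

# Decoding with the encoding the bytes were actually written in succeeds.
fixed = data.decode("utf-8").encode("cp437")
```

The exception comes from the hidden decode step, which is why the traceback mentions the ascii codec even though the call asked for cp437.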

Fix this by decoding the data instead of encoding it; print will then encode to stdout automatically. As long as your terminal supports the characters in the data, it will display properly:

>>> import sys
>>> sys.stdout.encoding
'cp437'
>>> print data.decode('utf8') # implicit encode to sys.stdout.encoding
½
>>> print data.decode('utf8').encode('cp437') # explicit encode.
½

Mark Tolonen 2010-09-07 04:19:04
