ansaurus

Question

How do I properly work with Unicode characters in Python to avoid encoding errors?

Answer 1

A:

This is a bit beyond my expertise, but I think # -*- coding: iso-8859-15 -*- at the top declares the text encoding that your Python source file is saved in.

Is it really saved in iso-8859-15?

Paul D. Waite 2010-02-10 17:49:18
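To illustrate why the declared encoding has to match how the bytes were actually written, here is a small sketch. The variable names are made up for illustration, and the snippet is written so that it should run on both Python 2.7 and Python 3. It decodes the same two bytes once as UTF-8 and once as iso-8859-15:

```python
# The two bytes UTF-8 uses for u-umlaut (U+00FC).
raw = b'\xc3\xbc'

# Decoded with the encoding it was written in: a single character.
as_utf8 = raw.decode('utf-8')

# Decoded with the wrong encoding: each byte becomes its own character.
as_latin9 = raw.decode('iso-8859-15')

print(repr(as_utf8))
print(repr(as_latin9))
```

If the file is really saved as UTF-8 but declared as iso-8859-15, non-ASCII characters in Unicode literals get misread in the same way.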

Answer 2

+1 A:

I guess you're using Python 2.x.

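The file encoding declaration tells the interpreter which encoding the source file is saved in, and therefore how non-ASCII characters in string literals are read.

You should handle all strings as unicode values, not str ones. If you read a str from the outside world, you should decode it to unicode explicitly. The same applies in reverse to output: encode unicode values to str explicitly before writing or sending them.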

```
# -*- coding: utf-8 -*-
u_dia_str = '\xc3\xbc'   # str
lambda_unicode = u'λ'    # unicode

# input value
u_dia = u_dia_str.decode('utf-8')

sig_unicode = u'%s%s' % (u_dia, lambda_unicode)
# => u'üλ'

# output value
sig_str = sig_unicode.encode('utf-8')
# => '\xc3\xbc\xce\xbb'
```

Andrey Vlasovskikh 2010-02-10 17:53:32

OK, I decode the input as UTF-8, and now I can get past that part. But immediately after that, I hash the string with MD5 like this: hashed_sig = hashlib.md5(the_sig).hexdigest() And now I get the same ascii codec error as before. Is this a limitation of hashlib? Or am I still doing something wrong?

Gordon Fontenot 2010-02-10 18:11:34

Never mind. Got it. I didn't realize that I had to re-encode. Thanks for the help.

Gordon Fontenot 2010-02-10 18:20:28
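For anyone hitting the same error: hashlib operates on bytes, so a unicode value has to be encoded explicitly before hashing. In Python 2, passing a unicode object makes the interpreter attempt an implicit ASCII encode, which is where the ascii codec error comes from. A minimal sketch, assuming the signature should be hashed as UTF-8 (the sample value of the_sig is invented for illustration):

```python
import hashlib

# Hypothetical signature text containing non-ASCII characters
# (u-umlaut followed by Greek lambda).
the_sig = u'\xfc\u03bb'

# Encode the unicode text to bytes first, then hash the bytes.
hashed_sig = hashlib.md5(the_sig.encode('utf-8')).hexdigest()

print(hashed_sig)
```

Whichever encoding you pick, the receiver has to use the same one, or the two sides will compute different hashes for the same text.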

Answer 3

+4 A:

There are a few things you should do to fix this.

Convert all string literals that contain non-ASCII characters to Unicode literals. Example: u'über'.
Do intermediate processing on Unicode. In other words, if you receive an encoded string (no matter the encoding), decode it to Unicode before working on it. Example:
```
s = utf8_string.decode('utf8') + latin1_string.decode('latin1')
```
When outputting the string or sending it somewhere, encode it with an encoding that your receiver understands. Example: send(s.encode('utf8')).

Complete example:

```
input1 = get_possibly_nonascii_input().decode('iso-8859-1')
input2 = get_possibly_nonascii_input().decode('iso-8859-1')
input3 = u'üvw'

s = u'%s -> %s' % (input3, (input1 + input2).upper())

send_output(s.encode('utf8'))
```

Max Shawabkeh 2010-02-10 17:57:18
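The complete example above relies on placeholder functions, so it will not run as posted. Here is a runnable sketch of the same decode, process, encode flow. The bodies of get_possibly_nonascii_input and send_output are stand-ins invented for illustration (the real ones would read from and write to wherever your data lives), and the snippet is written so that it should run on both Python 2.7 and Python 3:

```python
# Stand-ins for the placeholder I/O functions (hypothetical bodies).
def get_possibly_nonascii_input():
    # Pretend the outside world hands us ISO-8859-1 encoded bytes:
    # the letters e-acute, t, e-acute.
    return b'\xe9t\xe9'

sent = []

def send_output(data):
    # Collect the outgoing bytes instead of really sending them.
    sent.append(data)

# 1. Decode at the boundary.
input1 = get_possibly_nonascii_input().decode('iso-8859-1')
input2 = get_possibly_nonascii_input().decode('iso-8859-1')
input3 = u'\xfcvw'

# 2. Do all intermediate processing on Unicode.
s = u'%s -> %s' % (input3, (input1 + input2).upper())

# 3. Encode only when the data leaves the program.
send_output(s.encode('utf8'))

print(repr(sent[0]))
```

Note that upper() handles the accented letters correctly here only because it runs on decoded text rather than on raw bytes.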

Awesome. This worked. I had to decode the input, then re-encode it to UTF-8 before passing it to hashlib. Thanks a lot. It looks like it's working now.

Gordon Fontenot 2010-02-10 18:19:34
