I'm working on a Python plugin for Google Quick Search Box, and it's doing some odd things with non-ASCII characters. The code works fine up until I try constructing a string containing the non-ASCII characters (ü has been my test character). I am using the following code snippet for the construction, where new_task is the variable that comes in from GQSB.

the_sig = ("%sapi_key%sauth_token%smethod%sname%sparse%stimeline%s" %
           (api_secret, api_key, the_token, method, new_task, doParse, timeline))

It's giving me this error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

If I am understanding correctly, this is because I am trying to combine a unicode character with an ascii string. Everything I could find told me to declare the encoding at the top with this:

# -*- coding: iso-8859-15 -*-

Which I have. And when I pull the code snippet that constructs the string into a new script, it works just fine. But for some reason, in the context of the rest of the code, it fails every time. The only thing I can think of is that it is because it's inside its own class, but that doesn't make any sense to me.

The full code can be found on GitHub here

Thanks in advance for any help. I am stumped on this one.

A: 

This is a bit beyond my expertise, but I think # -*- coding: iso-8859-15 -*- at the top declares the text encoding that your Python source file is saved in.

Is it really saved in iso-8859-15?

Paul D. Waite
+1  A: 

I guess you're using Python 2.x.

The file encoding declaration only specifies how the interpreter reads the source file, including its string literals. It has no effect on data that arrives at run time.

You should handle all strings as unicode values, not str ones. If you read a str from the outside world, you should decode it to unicode explicitly. The same applies in reverse to output: encode unicode to str explicitly before writing it out.

# -*- coding: utf-8 -*-
u_dia_str = '\xc3\xbc'   # str
lambda_unicode = u'λ'    # unicode

# input value
u_dia = u_dia_str.decode('utf-8')

sig_unicode = u'%s%s' % (u_dia, lambda_unicode)
# => u'üλ'

# output value
sig_str = sig_unicode.encode('utf-8')
# => '\xc3\xbc\xce\xbb'
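
For reference, the error from the question can be reproduced by doing explicitly what Python 2 does implicitly when a byte str is formatted together with unicode values: it decodes the bytes with the ascii codec. This is a minimal sketch, assuming the input arrives as UTF-8 bytes. The name task_bytes is illustrative, and the b'' prefix keeps the snippet valid on Python 2.6+ as well as Python 3.

```python
# Illustrative name; assumes the input arrives UTF-8 encoded.
task_bytes = b'\xc3\xbc'  # UTF-8 bytes for u'\xfc' (u with umlaut)

# Python 2 performs this ascii decode implicitly when str meets unicode.
try:
    task_bytes.decode('ascii')
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

# Decoding with the codec the bytes were actually written in succeeds.
decoded = task_bytes.decode('utf-8')
```

Here error_message holds the same "'ascii' codec can't decode byte 0xc3 in position 0" text as the traceback in the question, while decoded is the one-character unicode value.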
Andrey Vlasovskikh
OK, I decode the input as UTF-8, and now I can get past that part. But immediately after that, I hash the string with MD5 like this: hashed_sig = hashlib.md5(the_sig).hexdigest() And now I get the same ascii codec error as before. Is this a limitation of hashlib? Or am I still doing something wrong?
Gordon Fontenot
Never mind. Got it. I didn't realize that I had to re-encode. Thanks for the help.
Gordon Fontenot
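
The re-encoding step mentioned above might look roughly like this. hashlib digests operate on bytes, so a unicode signature has to be encoded before it is hashed. All sample values are placeholders, and UTF-8 is an assumption about what the receiving API expects.

```python
import hashlib

# Placeholder values; in the plugin these come from the API and from GQSB.
api_secret = u'secret'
api_key = u'key'
the_token = u'token'
method = u'some.method'
doParse = u'1'
timeline = u'12345'

# Decode the incoming bytes once, at the boundary.
new_task = b'\xc3\xbcber task'.decode('utf-8')

# Build the signature entirely from unicode values.
the_sig = (u"%sapi_key%sauth_token%smethod%sname%sparse%stimeline%s" %
           (api_secret, api_key, the_token, method, new_task, doParse, timeline))

# Encode to bytes only when handing the text to hashlib.
hashed_sig = hashlib.md5(the_sig.encode('utf-8')).hexdigest()
```

Keeping the_sig as unicode until the last moment means the format string never has to guess an encoding.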
+4  A: 

There are a few things you should do to fix this.

  1. Convert all string literals that contain non-ASCII characters to Unicode literals. Example: u'über'.

  2. Do intermediate processing on Unicode. In other words, if you receive an encoded string (no matter the encoding), decode it to Unicode before working on it. Example:

    s = utf8_string.decode('utf8') + latin1_string.decode('latin1')
    
  3. When outputting the string or sending it somewhere, encode it with an encoding that your receiver understands. Example: send(s.encode('utf8')).

Complete example:

input1 = get_possibly_nonascii_input().decode('iso-8859-1')
input2 = get_possibly_nonascii_input().decode('iso-8859-1')
input3 = u'üvw'

s =  u'%s -> %s' % (input3, (input1 + input2).upper())

send_output(s.encode('utf8'))
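
A self-contained variant of the example above, for experimenting. The two byte strings stand in for get_possibly_nonascii_input(), and the final assignment stands in for send_output(). Both are assumptions made so that the sketch runs on its own, and the inputs deliberately use two different encodings.

```python
# Stand-ins for external input: the same character in two encodings.
latin1_bytes = b'\xfc'      # u with umlaut, ISO-8859-1
utf8_bytes = b'\xc3\xbc'    # u with umlaut, UTF-8

# 1. Decode at the boundary, each input with its own codec.
input1 = latin1_bytes.decode('iso-8859-1')
input2 = utf8_bytes.decode('utf-8')
input3 = u'\xfcvw'

# 2. Do all intermediate processing on unicode values.
s = u'%s -> %s' % (input3, (input1 + input2).upper())

# 3. Encode once, on the way out.
output_bytes = s.encode('utf-8')
```

After decoding, input1 and input2 are equal even though their byte representations differed, which is why the processing in step 2 does not need to know where each value came from.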
Max Shawabkeh
Awesome. This worked. I had to decode, then re-encode to UTF-8 to send it to hashlib. Thanks a lot. It looks like it's working now.
Gordon Fontenot