I am getting the very familiar:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe8' in position 24: ordinal not in range(128)

I have checked multiple posts on SO, and they recommend variable.encode('ascii', 'ignore').

However, this is not working. Even after doing this I get the same error.

The stack trace:

'ascii' codec can't encode character u'\x92' in position 18: ordinal not in range(128)
Traceback (most recent call last):
  File "/base/python_runtime/python_lib/versions/1/google/appengine/ext/webapp/__init__.py", line 513, in __call__
    handler.post(*groups)
  File "/base/data/home/apps/autominer1/1.343038273644030157/siteinfo.py", line 2160, in post
    imageAltTags.append(str(image["alt"]))
UnicodeEncodeError: 'ascii' codec can't encode character u'\x92' in position 18: ordinal not in range(128)

The code responsible:

siteUrl = urlfetch.fetch("http://www." + domainName, headers={'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9b5) Gecko/2008032620 Firefox/3.0b5'})

webPage = siteUrl.content.decode('utf-8', 'replace').encode('ascii', 'replace')

htmlDom = BeautifulSoup(webPage)

imageTags = htmlDom.findAll('img', {'alt': True})

for image in imageTags:
    if len(image["alt"]) > 3:
        imageAltTags.append(str(image["alt"]))

Any help would be greatly appreciated. Thanks.

A: 

Python treats two different things as strings: 'raw' strings (byte strings, type str in Python 2) and 'unicode' strings. Only the latter actually represent text. If you have a raw string and you want to treat it as text, you first need to convert it to a unicode string. To do this, you need to know the encoding of the string (the way unicode codepoints are represented as bytes in the raw string) and call .decode(encoding) on the raw string.
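As a rough illustration of the decode step, here is a minimal sketch. The byte string is a made-up example rather than anything from the question, and the literals are written so the snippet runs on both Python 2 and Python 3.

```python
# Hypothetical raw bytes, e.g. as they might arrive from a network fetch.
raw = b'caf\xc3\xa8'

# Decoding with the correct encoding turns the bytes into text.
text = raw.decode('utf-8')

# The raw string counts bytes; the unicode string counts code points.
lengths = (len(raw), len(text))

# Decoding with the wrong encoding raises no error here, but the
# resulting text is wrong: two characters where there should be one.
mis_decoded = raw.decode('latin-1')
```

The last line is the reason the encoding has to be known rather than guessed: a wrong guess can silently produce the wrong text.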

When you call str() on a unicode string, the opposite transformation takes place: Python encodes the unicode string as bytes. If you don't specify a character set, it defaults to ascii, which can represent only the first 128 codepoints. That is why str(image["alt"]) fails in your traceback: the alt text contains u'\x92', which is outside that range.
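The failure can be reproduced without App Engine or BeautifulSoup. In the sketch below, the explicit encode('ascii') stands in for what Python 2's str() does implicitly when the default encoding is ascii. The sample text and the helper name ascii_or_error are made up for illustration.

```python
# Hypothetical alt text containing a non-ASCII code point (U+00E8).
alt_text = u'the caf\xe8 menu'

def ascii_or_error(text):
    """Return (encoded_bytes, None) on success or (None, reason) on failure."""
    try:
        return text.encode('ascii'), None
    except UnicodeEncodeError as exc:
        return None, exc.reason

# Strict ASCII encoding fails on the first code point above 127.
encoded, reason = ascii_or_error(alt_text)

# Error handlers avoid the exception, at the cost of losing characters.
ignored = alt_text.encode('ascii', 'ignore')    # the character is dropped
replaced = alt_text.encode('ascii', 'replace')  # the character becomes '?'
```

Note that 'ignore' and 'replace' only affect the encode call they are passed to. Any later implicit conversion of unicode text, such as str() in Python 2, is strict again.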

Instead, you should do one of two things:

  • Represent 'imageAltTags' as a list of unicode strings, and thus drop the str() call. This is probably the best approach.
  • Instead of str(x), call x.encode(encoding). The encoding to use will depend on what you're doing, but the most likely choice is utf-8, e.g. x.encode('utf-8').
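Both options can be sketched as follows. The alt_values list is a hypothetical stand-in for the image["alt"] values from the question, which are assumed here to already be unicode strings. The names image_alt_tags and encoded_alt_tags are illustrative only.

```python
# Hypothetical stand-ins for the unicode alt values read from the parsed page.
alt_values = [u'pic', u'Today\u2019s special', u'caf\xe8 terrace']

# Option 1: keep everything as unicode and never call str().
image_alt_tags = [alt for alt in alt_values if len(alt) > 3]

# Option 2: encode explicitly, and only where bytes are really needed
# (for example, just before writing the values out).
encoded_alt_tags = [alt.encode('utf-8') for alt in image_alt_tags]
```

Keeping text as unicode internally and encoding only at the output boundary means the choice of encoding is made once, in one place.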
Nick Johnson
This is a very common issue that Python 2 users run into every day. It happens so often that I ended up blogging about it: http://wesc.livejournal.com/1743.html
wescpy
I dumped str() and now things work fine. I am dealing with everything as a unicode string. Thanks!
demos