I'm reading and parsing an Amazon XML file, and while the XML file shows an apostrophe ('), I get the following error when I try to print it:

'ascii' codec can't encode character u'\u2019' in position 16: ordinal not in range(128) 

From what I've read online thus far, the error is coming from the fact that the XML file is in UTF-8, but Python wants to handle it as an ASCII encoded character. Is there a simple way to make the error go away and have my program print the XML as it reads?

Thanks!

A: 

You can use something of the form

s.decode('utf-8')

which will convert a UTF-8 encoded bytestring into a Python Unicode string. But the exact procedure to use depends on exactly how you load and parse the XML file, e.g. if you don't ever access the XML string directly, you might have to use a decoder object from the codecs module.
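As a minimal sketch of that conversion: the sample bytes below are made up for illustration (they are not taken from the asker's file), and the variable names are hypothetical. The same lines should behave the same under Python 2.6+ and Python 3.

```python
# Bytes as they might arrive from a UTF-8 encoded XML file (hypothetical sample).
# \xe2\x80\x99 is the UTF-8 encoding of U+2019, the right single quotation mark.
raw = b'Svoboda\xe2\x80\x99s text'

# Decode: UTF-8 bytes -> Unicode string.
text = raw.decode('utf-8')

# Encode: Unicode string -> UTF-8 bytes, e.g. before writing to a byte stream.
round_trip = text.encode('utf-8')
```

Decoding only applies to byte strings. If the parser already returns Unicode strings, the step that matters is the final encode.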

David Zaslavsky
It's already encoded in UTF-8. The error is specifically on: myStrings = deque([u'Dorf and Svoboda\u2019s text builds on the str... and Computer Engineering\u2019s subdisciplines.']) The string is in UTF-8, as you can see, but it gets mad about the internal '\u2019'.
Alex B
Oh, OK, I thought you were having a different problem.
David Zaslavsky
@Alex B: No, the string is Unicode, not UTF-8. To **encode** it as UTF-8, use `'...'.encode('utf-8')`.
sth
A: 

Likely, your problem is that you parsed it okay, and now you're trying to print the contents of the XML and can't because there are some non-ASCII Unicode characters. Try encoding your Unicode string as ASCII first:

unicodeData.encode('ascii', 'ignore')

The 'ignore' part tells it to just skip those characters. From the Python docs:

>>> u = unichr(40960) + u'abcd' + unichr(1972)
>>> u.encode('utf-8')
'\xea\x80\x80abcd\xde\xb4'
>>> u.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character '\ua000' in position 0: ordinal not in range(128)
>>> u.encode('ascii', 'ignore')
'abcd'
>>> u.encode('ascii', 'replace')
'?abcd?'
>>> u.encode('ascii', 'xmlcharrefreplace')
'&#40960;abcd&#1972;'
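Applied to a shortened version of the string from the question, the same error handlers can be compared side by side. This is only a sketch: the variable names are hypothetical, and the results are byte strings, so Python 3 displays them with a b prefix.

```python
# Shortened version of the string from the question; \u2019 is a curly apostrophe.
text = u'Dorf and Svoboda\u2019s text'

# 'ignore' silently drops the character that ASCII cannot represent.
dropped = text.encode('ascii', 'ignore')

# 'replace' substitutes a question mark for it.
replaced = text.encode('ascii', 'replace')

# 'xmlcharrefreplace' writes it as a numeric XML character reference.
xml_ref = text.encode('ascii', 'xmlcharrefreplace')

# Encoding as UTF-8 keeps the character, at the cost of non-ASCII bytes.
utf8 = text.encode('utf-8')
```

If the output goes back into XML or HTML, 'xmlcharrefreplace' keeps the information that 'ignore' throws away.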

You might want to read this article: http://www.joelonsoftware.com/articles/Unicode.html, which I found very useful as a basic tutorial on what's going on. After reading it, you'll stop feeling like you're just guessing which commands to use (or at least that's what happened to me).

Scott Stafford
Perfect, it removed the 's, but at least it'll print. Thanks!
Alex B
I'm trying to make the following string safe: ' foo “bar bar” df' (note the curly quotes), but the above still fails for me.
Rosarch
@Rosarch: Fails how? Same error? And which error-handling rule did you use?
Scott Stafford
@Rosarch, your problem is probably earlier. Try this code:
# -*- coding: latin-1 -*-
u = u' foo “bar bar” df'
print u.encode('ascii', 'ignore')
For you, the error was probably thrown while converting your string INTO Unicode, given the encoding you specified for the Python script.
Scott Stafford
@Scott Stafford: I went ahead and made my issue into its own question: http://stackoverflow.com/questions/3224427/python-sanitize-a-string-for-unicode
Rosarch
A: 

Please check also this answer to a related question: “Python UnicodeDecodeError - Am I misunderstanding encode?”

ΤΖΩΤΖΙΟΥ