ansaurus

Question

Answer 1

A:

You can use something of the form

s.decode('utf-8')

which will convert a UTF-8 encoded bytestring into a Python Unicode string. The exact procedure depends on how you load and parse the XML file: if you never access the XML string directly, for example, you might have to use a decoder object from the codecs module instead.
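
As a minimal sketch of both routes mentioned above, assuming the input really is UTF-8: the sample bytes and the variable names are made up for illustration, and the in-memory stream merely stands in for an open file.

```python
import codecs
import io

# Hypothetical sample: UTF-8 bytes containing a right single quotation mark.
raw = b'Dorf and Svoboda\xe2\x80\x99s text'

# Route 1: decode the bytestring directly into a Unicode string.
text = raw.decode('utf-8')

# Route 2: wrap a byte stream in a reader from the codecs module,
# so decoding happens while the data is being read.
stream = io.BytesIO(raw)
reader = codecs.getreader('utf-8')(stream)
streamed = reader.read()

# Encoding goes the other way, back to UTF-8 bytes.
round_trip = text.encode('utf-8')
```

Both routes should produce the same Unicode string; which one fits depends on whether your code holds the bytes itself or only a file-like object.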

David Zaslavsky 2010-07-11 19:04:23

It's already encoded in UTF-8. The error is specifically: `myStrings = deque([u'Dorf and Svoboda\u2019s text builds on the str... and Computer Engineering\u2019s subdisciplines.'])` The string is in UTF-8 as you can see, but it gets mad about the internal '\u2019'.

Alex B 2010-07-11 19:09:17

Oh, OK, I thought you were having a different problem.

David Zaslavsky 2010-07-11 19:25:07

@Alex B: No, the string is Unicode, not UTF-8. To **encode** it as UTF-8 use `'...'.encode('utf-8')`

sth 2010-07-11 19:33:45

Answer 2

A:

Likely, your problem is that you parsed it okay, and now you're trying to print the contents of the XML and can't because there are some non-ASCII Unicode characters in it. Try encoding your Unicode string as ASCII first:

unicodeData.encode('ascii', 'ignore')

The 'ignore' part tells it to just skip any characters that ASCII cannot represent. From the Python docs:

>>> u = unichr(40960) + u'abcd' + unichr(1972)
>>> u.encode('utf-8')
'\xea\x80\x80abcd\xde\xb4'
>>> u.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character '\ua000' in position 0: ordinal not in range(128)
>>> u.encode('ascii', 'ignore')
'abcd'
>>> u.encode('ascii', 'replace')
'?abcd?'
>>> u.encode('ascii', 'xmlcharrefreplace')
'&#40960;abcd&#1972;'
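
The same handlers applied to a string like the one in the question can be sketched as follows. This is only an illustration: the sample text and variable names are assumptions, and `backslashreplace` is one more standard handler that the excerpt above does not show.

```python
# Hypothetical sample containing U+2019 (right single quotation mark).
text = u'Dorf and Svoboda\u2019s text'

# With the default 'strict' handler the ASCII codec raises an error.
try:
    text.encode('ascii')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Each handler trades fidelity for printability in a different way.
ignored = text.encode('ascii', 'ignore')             # drops the character
replaced = text.encode('ascii', 'replace')           # substitutes '?'
xml_refs = text.encode('ascii', 'xmlcharrefreplace') # numeric XML reference
escaped = text.encode('ascii', 'backslashreplace')   # Python-style escape
```

If the output is going back into XML or HTML, 'xmlcharrefreplace' is probably the better choice, since the original character can still be recovered from the reference.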

You might want to read this article: http://www.joelonsoftware.com/articles/Unicode.html, which I found very useful as a basic tutorial on what's going on. After reading it, you'll stop feeling like you're just guessing which commands to use (or at least that's what happened to me).

Scott Stafford 2010-07-11 19:10:51

Perfect, it removed the 's but at least it'll print. Thanks!

Alex B 2010-07-11 19:14:44

I'm trying to make the following string safe: ' foo “bar bar” df' (note the curly quotes), but the above still fails for me.

Rosarch 2010-07-11 19:26:49

@Rosarch: Fails how? Same error? And which error-handling rule did you use?

Scott Stafford 2010-07-11 20:17:43

@Rosarch, your problem is probably earlier. Try this code:

# -*- coding: latin-1 -*-
u = u' foo “bar bar” df'
print u.encode('ascii', 'ignore')

For you, it was probably the conversion of your string INTO unicode, using the encoding you specified for the Python script, that threw the error.

Scott Stafford 2010-07-11 20:48:52
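
To illustrate the distinction drawn in the comment above, here is a small sketch in which the failure happens while turning bytes INTO Unicode, before any encode call runs. It assumes the original bytes are UTF-8; the sample data and names are hypothetical.

```python
# Hypothetical sample: UTF-8 bytes for a string with curly quotes.
raw = b' foo \xe2\x80\x9cbar bar\xe2\x80\x9d df'

# Decoding with the wrong codec fails at the decode step,
# which raises UnicodeDecodeError rather than UnicodeEncodeError.
try:
    raw.decode('ascii')
    failed_step = None
except UnicodeDecodeError:
    failed_step = 'decode'

# Decoding with the right codec first, then encoding to ASCII, works.
text = raw.decode('utf-8')
safe = text.encode('ascii', 'ignore')
```

Checking whether the traceback names UnicodeDecodeError or UnicodeEncodeError is usually the quickest way to tell which of the two steps went wrong.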

@Scott Stafford: I went ahead and made my issue into its own question: http://stackoverflow.com/questions/3224427/python-sanitize-a-string-for-unicode

Rosarch 2010-07-11 21:12:07

Answer 3

A:

Please check also this answer to a related question: “Python UnicodeDecodeError - Am I misunderstanding encode?”

ΤΖΩΤΖΙΟΥ 2010-07-11 22:39:06


Python Unicode Encode Error
