When I feed a UTF-8 encoded XML file to an ExpatParser instance:

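from __future__ import with_statement  # required for `with` on Python 2.5
import codecs
import xml.sax

def test(filename):
    parser = xml.sax.make_parser()
    with codecs.open(filename, 'r', encoding='utf-8') as f:
        for line in f:  # each line is a decoded unicode object
            parser.feed(line)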

...I get the following:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "test.py", line 72, in search_test
    parser.feed(line)
  File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/xml/sax/expatreader.py", line 207, in feed
    self._parser.Parse(data, isFinal)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xb4' in position 29: ordinal not in range(128)

I'm probably missing something obvious here. How do I change the parser's encoding from 'ascii' to 'utf-8'?

+2  A: 

Your code fails in Python 2.6, but works in 3.0.

This does work in 2.6, presumably because it allows the parser itself to figure out the encoding (perhaps by reading the encoding optionally specified on the first line of the XML file, and otherwise defaulting to utf-8):

import xml.sax

def test(filename):
    parser = xml.sax.make_parser()
    # hand parse() the undecoded file and let the parser deal with the encoding
    parser.parse(open(filename))
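
If you still need to feed the parser incrementally, a minimal sketch (an assumption on my part, not something verified on 2.5) is to read the file in binary mode and feed the undecoded bytes, so that decoding is left to the parser. TextCollector and feed_bytes are hypothetical names used only for illustration, and the in-memory sample stands in for a real file:

```python
import io
import xml.sax


class TextCollector(xml.sax.handler.ContentHandler):
    """Hypothetical handler that just collects character data."""

    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


def feed_bytes(stream, handler, chunk_size=4096):
    """Feed raw (undecoded) bytes from a binary stream to a SAX parser."""
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)  # bytes, not unicode
    parser.close()


# In-memory stand-in for open(filename, 'rb')
sample = (u'<?xml version="1.0" encoding="utf-8"?>'
          u'<test><name>Champs-\u00c9lys\u00e9es</name></test>').encode('utf-8')

handler = TextCollector()
feed_bytes(io.BytesIO(sample), handler)
text = u''.join(handler.chunks)
```

With a real file you would pass open(filename, 'rb') instead of the BytesIO object.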
Stephan202
This worked in 2.5, too.
Daniel Weaver
+2  A: 

The SAX parser in Python 2.6 should be able to parse utf-8 without mangling it. Although you've left out the ContentHandler you're using with the parser, if that content handler attempts to print any non-ascii characters to your console, that will cause a crash.

For example, say I have this XML doc:

<?xml version="1.0" encoding="utf-8"?>
<test>
   <name>Champs-Élysées</name>
</test>

And this parsing apparatus:

import xml.sax

class MyHandler(xml.sax.handler.ContentHandler):

    def startElement(self, name, attrs):
        print "StartElement: %s" % name

    def endElement(self, name):
        print "EndElement: %s" % name

    def characters(self, ch):
        #print "Characters: '%s'" % ch
        pass

parser = xml.sax.make_parser()
parser.setContentHandler(MyHandler())

for line in open('text.xml', 'r'):
    parser.feed(line)

This will parse just fine, and the content will indeed preserve the accented characters in the XML. The only issue is the line in characters() that I've commented out. If you uncomment it and run this in the console in Python 2.6, it will produce the exception you're seeing, because the print statement must convert the characters to ascii for output.
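
The failure described here boils down to an implicit ASCII encode of a unicode string. As a rough illustration (the variable names are made up, and the `except ... as` syntax assumes Python 2.6 or later), the same error can be triggered without any parser, and an explicit UTF-8 encode avoids it:

```python
text = u'Champs-\u00c9lys\u00e9es'

# Encoding to ascii fails on the accented characters
try:
    text.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError as exc:
    ascii_failed = True
    reason = exc.reason

# Encoding explicitly to UTF-8 always succeeds for this text
utf8_bytes = text.encode('utf-8')
```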

You have 3 possible solutions:

One: Make sure your terminal supports unicode, then create a sitecustomize.py entry in your site-packages and set the default character set to utf-8:

import sys
sys.setdefaultencoding('utf-8')

Two: Don't print the output to the terminal (tongue-in-cheek)

Three: Normalize the output using unicodedata.normalize to convert non-ascii chars to ascii equivalents, or encode the chars to ascii for text output: ch.encode('ascii', 'replace'). Of course, using this method you won't be able to properly evaluate the text.
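
To make option three concrete, here is a small sketch of both variants. The choice of the NFKD form and the 'ignore' error handler is my assumption; the answer above only names unicodedata.normalize and encode('ascii', 'replace'):

```python
import unicodedata

text = u'Champs-\u00c9lys\u00e9es'

# Variant A: replace every non-ascii character with '?'
replaced = text.encode('ascii', 'replace')

# Variant B: decompose accented letters into base letter + combining mark,
# then drop the combining marks when encoding to ascii
decomposed = unicodedata.normalize('NFKD', text)
stripped = decomposed.encode('ascii', 'ignore')
```

Variant A gives Champs-?lys?es and variant B gives Champs-Elysees. Either way the original characters are lost, which is the drawback mentioned above.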

Using option one above, your code worked just fine for me in Python 2.5.

Jarret Hardie
+2  A: 

Jarret Hardie already explained the issue. But for those of you who are coding for the command line and don't have sys.setdefaultencoding visible, the quick workaround for this bug (or "feature") is:

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

Hopefully reload(sys) won't break anything else.
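
For context, a small sketch of why the reload is needed: on Python 2, site.py removes sys.setdefaultencoding at startup, so the attribute is normally gone by the time your script runs. The helper name below is hypothetical. On Python 3 the setter does not exist at all and the default is already utf-8:

```python
import sys


def default_encoding_info():
    """Return the current default encoding and whether the setter is visible."""
    return sys.getdefaultencoding(), hasattr(sys, 'setdefaultencoding')


encoding, setter_visible = default_encoding_info()
print("default encoding: %s, setter visible: %s" % (encoding, setter_visible))
```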

More details in this old blog:

The Illusive setdefaultencoding

janpf