The SAX parser in Python 2.6 should be able to parse utf-8 without mangling it. Although you've left out the ContentHandler you're using with the parser, if that content handler attempts to print any non-ascii characters to your console, that will cause a crash.
For example, say I have this XML doc:
<?xml version="1.0" encoding="utf-8"?>
<test>
<name>Champs-Élysées</name>
</test>
And this parsing apparatus:
import xml.sax
class MyHandler(xml.sax.handler.ContentHandler):
def startElement(self, name, attrs):
print "StartElement: %s" % name
def endElement(self, name):
print "EndElement: %s" % name
def characters(self, ch):
#print "Characters: '%s'" % ch
pass
parser = xml.sax.make_parser()
parser.setContentHandler(MyHandler())
for line in open('text.xml', 'r'):
parser.feed(line)
This will parse just fine, and the content will indeed preserve the accented characters in the XML. The only issue is that line in def characters()
that I've commented out. Running in the console in Python 2.6, this will produce the exception you're seeing because the print function must convert the characters to ascii for output.
You have 3 possible solutions:
One: Make sure your terminal supports unicode, then create a sitecustomize.py
entry in your site-packages
and set the default character set to utf-8:
import sys
sys.setdefaultencoding('utf-8')
Two: Don't print the output to the terminal (tongue-in-cheek)
Three: Normalize the output using unicodedata.normalize
to convert non-ascii chars to ascii equivalents, or encode
the chars to ascii for text output: ch.encode('ascii', 'replace')
. Of course, using this method you won't be able to properly evaluate the text.
Using option one above, your code worked just fine for my in Python 2.5.