I'm using xml.sax to parse XML held in unicode strings that were originally entered in a web form. On my local machine (Python 2.5, using the default expat xmlreader, running through App Engine), it works fine. However, the exact same code and input strings fail on the production App Engine servers with "not well-formed". For example, it happens with the code below:

from xml import sax
class MyHandler(sax.ContentHandler):
  pass

handler = MyHandler()
# Both of these unicode strings return 'not well-formed'
# on app engine, but work locally
sax.parseString(u"<a>b</a>", handler)
sax.parseString(u"<!DOCTYPE a[<!ELEMENT a (#PCDATA)> ]><a>b</a>", handler)

# Both of these work, but output unicode
sax.parseString("<a>b</a>", handler)
sax.parseString("<!DOCTYPE a[<!ELEMENT a (#PCDATA)> ]><a>b</a>", handler)

resulting in the error:

  File "<string>", line 1, in <module>
  File "/base/python_dist/lib/python2.5/xml/sax/__init__.py", line 49, in parseString
    parser.parse(inpsrc)
  File "/base/python_dist/lib/python2.5/xml/sax/expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/base/python_dist/lib/python2.5/xml/sax/xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "/base/python_dist/lib/python2.5/xml/sax/expatreader.py", line 211, in feed
    self._err_handler.fatalError(exc)
  File "/base/python_dist/lib/python2.5/xml/sax/handler.py", line 38, in fatalError
    raise exception
SAXParseException: <unknown>:1:1: not well-formed (invalid token)

Is there any reason why App Engine's parser, which also uses Python 2.5 and expat, would fail when given unicode input?

+1  A: 

You are not supposed to parse a unicode string; you should parse a UTF-8 encoded string. A unicode string is not well-formed XML by default, according to the XML 1.0 specification, so you need to encode the unicode string as UTF-8 before feeding it to the parser.

vtd-xml-author
You're right, passing in original_string.encode('utf-8') fixes the problem. Odd that the standard parser allows straight unicode to be passed in.
Derek Dahmer
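
For reference, the fix confirmed in the comment above might look roughly like the sketch below. The names TextCollector and parse_unicode_xml are hypothetical helpers introduced only for illustration; the only library call relied on is sax.parseString, which is given UTF-8 bytes instead of the unicode object.

```python
from xml import sax


class TextCollector(sax.ContentHandler):
    """Hypothetical handler that gathers character data for inspection."""

    def __init__(self):
        sax.ContentHandler.__init__(self)
        self.chunks = []

    def characters(self, content):
        # SAX may deliver text in several pieces, so collect them all.
        self.chunks.append(content)


def parse_unicode_xml(text, handler):
    """Hypothetical helper: encode the unicode string as UTF-8, then parse the bytes."""
    sax.parseString(text.encode("utf-8"), handler)


handler = TextCollector()
parse_unicode_xml(u"<!DOCTYPE a[<!ELEMENT a (#PCDATA)> ]><a>b</a>", handler)
result = u"".join(handler.chunks)
```

This assumes the input either has no XML encoding declaration or declares UTF-8. If the string carried a declaration naming another encoding, the encoded bytes and the declaration would disagree, and the declaration would need to be adjusted first.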