ansaurus

Question

XML parsing with expat in Python: handling data

Answer 1

A:

expat does not mess up; &lt; is simply the XML encoding for the character <. On the contrary, if expat returned the literal &lt;, that would be a bug with respect to the XML spec. That being said, you can of course get the escaped version back by using xml.sax.saxutils.escape:

>>> from xml.sax.saxutils import escape
>>> escape("<fail/>")
'&lt;fail/&gt;'

The expat parser is also free to report string data in whatever chunks it sees fit, so you have to concatenate them yourself.
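A minimal sketch of that concatenation with xml.parsers.expat might look like the following. The helper name collect_text and the sample document are made up for illustration, and the sketch assumes flat elements without nested children (a nested start tag would discard the text collected so far).

```python
import xml.parsers.expat


def collect_text(xml_bytes):
    """Return (element_name, text) pairs, joining the chunks expat reports."""
    results = []
    chunks = []  # character data seen since the last start tag

    def start_element(name, attrs):
        # Simplification: drop any text collected before this start tag.
        chunks.clear()

    def char_data(data):
        # expat may call this several times for a single run of text,
        # for example once per entity reference.
        chunks.append(data)

    def end_element(name):
        results.append((name, "".join(chunks)))
        chunks.clear()

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start_element
    parser.CharacterDataHandler = char_data
    parser.EndElementHandler = end_element
    parser.Parse(xml_bytes, True)  # True marks this as the final chunk of input
    return results


pairs = collect_text(b"<msg>a &lt; b &amp;&amp; c &gt; d</msg>")
print(pairs)
```

The joined text contains the decoded characters (<, &, >), not the entity references. If you would rather have the parser do the joining where it can, pyexpat parser objects also have a buffer_text attribute that can be set to True before parsing.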

Torsten Marek 2009-07-17 18:49:20

Answer 2

A:

Both SAX and StAX parsers are free to break up the strings in whatever way is convenient for them (although StAX has a COALESCE mode for forcing it to assemble the pieces for you).

The reason is that streaming software often does not need the complete string at once, so the parser spares it the overhead of reassembling the string fragments.

Usually I accumulate text in a variable, and use the contents when I see the next StartElement or EndElement event. At that point, I also reset the accumulated-text variable to empty.
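Here is a rough sketch of that pattern using Python's xml.sax. The class name TextAccumulator, the texts attribute, and the choice to skip whitespace-only runs are assumptions made for this example, not part of any library.

```python
import xml.sax


class TextAccumulator(xml.sax.ContentHandler):
    """Collect each run of text between two consecutive element events."""

    def __init__(self):
        super().__init__()
        self._buffer = []  # fragments received since the last flush
        self.texts = []    # completed text runs, in document order

    def _flush(self):
        text = "".join(self._buffer)
        self._buffer = []  # reset the accumulated text to empty
        if text.strip():   # assumption: whitespace-only runs are not interesting
            self.texts.append(text)

    def startElement(self, name, attrs):
        self._flush()

    def endElement(self, name):
        self._flush()

    def characters(self, content):
        # The parser may deliver one run of text in several calls.
        self._buffer.append(content)


handler = TextAccumulator()
xml.sax.parseString(b"<p>one &amp; two<b>bold</b>tail</p>", handler)
print(handler.texts)
```

Flushing on both start and end events keeps the text before a child element, the child's own text, and the trailing text as separate runs.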

lavinio 2009-07-17 21:37:28
