I am attempting to parse an XML file using python expat. I have the following line in my XML file:

<Action>&lt;fail/&gt;</Action>

expat identifies the start and end tags but converts &lt; to the less-than character, and likewise &gt; to the greater-than character, so it parses the line like this:

outcome:

START 'Action'
DATA '<'
DATA 'fail/'
DATA '>'
END 'Action'

instead of the desired:

START 'Action'
DATA '&lt;fail/&gt;'
END 'Action'

I would like to get the desired outcome. How do I prevent expat from messing up?

A: 

expat does not mess up; &lt; is simply the XML encoding for the character <. On the contrary, if expat returned the literal &lt;, that would be a bug with respect to the XML spec. That said, you can of course get the escaped version back by using xml.sax.saxutils.escape:

>>> from xml.sax.saxutils import escape
>>> escape("<fail/>")
'&lt;fail/&gt;'

The expat parser is also free to report string data in whatever chunks it sees fit, so you have to concatenate them yourself.
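Putting the two points together, here is a minimal sketch of how the desired output could be produced: collect the character data chunks, join them at the next tag, and re-escape the joined text. The function name parse_events and the (kind, value) event tuples are made up for this illustration; only the expat handlers and escape come from the standard library.

```python
import xml.parsers.expat
from xml.sax.saxutils import escape


def parse_events(xml_text):
    """Return (kind, value) events, with text chunks joined and re-escaped."""
    events = []
    chunks = []  # character data seen since the last start/end tag

    def flush():
        # Emit the accumulated text as one DATA event, escaped again.
        if chunks:
            events.append(("DATA", escape("".join(chunks))))
            chunks.clear()

    def start(name, attrs):
        flush()
        events.append(("START", name))

    def end(name):
        flush()
        events.append(("END", name))

    def char_data(data):
        # expat may call this several times for one run of text.
        chunks.append(data)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = char_data
    parser.Parse(xml_text, True)
    return events


for kind, value in parse_events("<Action>&lt;fail/&gt;</Action>"):
    print(kind, repr(value))
```

Note that escape only restores &amp;, &lt; and &gt;, so the result is an equivalent escaped form of the text, not necessarily the exact bytes that were in the file.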

Torsten Marek
A: 

Both SAX and StAX parsers are free to break up the strings in whatever way is convenient for them (although StAX has a COALESCE mode for forcing it to assemble the pieces for you).

The reason is that software which streams its data can often process the fragments as they arrive, in which case it never has to pay the overhead of reassembling them.

Usually I accumulate text in a variable, and use the contents when I see the next StartElement or EndElement event. At that point, I also reset the accumulated-text variable to empty.
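As a rough illustration of that accumulate-and-reset pattern, here is a sketch using Python's xml.sax. The class name TextAccumulator and its texts list are invented for this example; a real handler would do something more useful with the text than append it to a list.

```python
import xml.sax


class TextAccumulator(xml.sax.ContentHandler):
    """Collects each run of text between tags as a single string."""

    def __init__(self):
        super().__init__()
        self._pieces = []  # fragments seen since the last tag
        self.texts = []    # completed runs of text

    def _flush(self):
        # Use the accumulated text, then reset the buffer to empty.
        text = "".join(self._pieces)
        if text:
            self.texts.append(text)
        self._pieces = []

    def startElement(self, name, attrs):
        self._flush()

    def endElement(self, name):
        self._flush()

    def characters(self, content):
        # May be called several times for one run of text.
        self._pieces.append(content)


handler = TextAccumulator()
xml.sax.parseString(b"<Action>&lt;fail/&gt;</Action>", handler)
print(handler.texts)
```

The handler never relies on how the parser splits the text, so it behaves the same whether characters is called once or several times per run.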

lavinio