views: 76
answers: 2

I'm trying to get a reader to recover from broken XML. Using the libxml2.XML_PARSE_RECOVER option with the DOM api (libxml2.readDoc) works and it recovers from entity problems.

However, using the option with the reader API (which is essential given the size of the documents we are parsing) does not work. The reader just gets stuck in a perpetual loop, with reader.Read() returning -1:

Sample code (with small example):

import libxml2

DOC = "<a>some broken & xml</a>"

reader = libxml2.readerForDoc(DOC, "urn:bogus", None, libxml2.XML_PARSE_RECOVER | libxml2.XML_PARSE_NOERROR)

ret = reader.Read()
while ret:
    print 'ret: %d' % ret
    print "node name: ", reader.Name(), reader.NodeType()
    ret = reader.Read()

Any ideas how to recover correctly?

A: 

I'm not too sure about the current state of the libxml2 Python bindings; even the libxml2 site suggests using lxml instead. Parsing this document and recovering from the stray & is nice and clean in lxml:

from cStringIO import StringIO
from lxml import etree

DOC = "<a>some broken & xml</a>"

reader = etree.XMLParser(recover=True)
tree = etree.parse(StringIO(DOC), reader)
print etree.tostring(tree.getroot())

The parsers page in the lxml docs goes into more detail about setting up a parser and iterating over the contents.
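If you do need to walk the recovered contents afterwards, the tree lxml hands back can be iterated directly. A minimal sketch, written for modern Python and assuming lxml is installed (exactly what text survives recovery depends on how libxml2 repairs the input):

```python
from io import BytesIO
from lxml import etree

DOC = b"<a>some broken & xml</a>"

# recover=True tells libxml2 to repair what it can instead of aborting.
parser = etree.XMLParser(recover=True)
tree = etree.parse(BytesIO(DOC), parser)

# iter() walks every element of the recovered tree in document order.
for element in tree.getroot().iter():
    print(element.tag)
```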

Edit:

If you want to parse a document incrementally, the XMLParser class can be used as well, since it is a subclass of _FeedParser:

DOC = "<a>some broken & xml</a>"
reader = etree.XMLParser(recover=True)

# Feed the parser one character at a time to mimic incremental input.
for data in StringIO(DOC).read():
    reader.feed(data)

tree = reader.close()
print etree.tostring(tree)
dcolish
Unfortunately I looked into lxml too, but your suggestion above uses the DOM API, and due to the size of our documents that isn't an option. The lxml iterparse API doesn't support recovery.
bee
If you're only trying to parse incrementally, look into the _FeedParser interface for lxml; I'll edit the above sample with its usage. I have not been able to find an iterative method of parsing that yields elements as they are parsed. http://codespeak.net/lxml/api/lxml.etree._FeedParser-class.html
dcolish
Thanks for all your efforts. Technically what we need is both incremental parsing and event-driven pulling of elements, with recovery. It's a shame lxml doesn't meet these requirements.
bee
A: 

Isn't the XML broken in some consistent way? Isn't there some pattern you could follow to repair your XML before parsing?

For example, if the errors are caused only by unescaped ampersands and you don't use CDATA sections or processing instructions, the input can be repaired with a regexp.
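If so, a small pre-processing pass is enough before handing the document to any parser. A sketch of that idea (the escape_ampersands helper and its entity pattern are my own illustration, and only cover the bare-ampersand case):

```python
import re
import xml.etree.ElementTree as ET

def escape_ampersands(text):
    # Escape any & that doesn't already start an entity reference
    # such as &amp; or &#38;, leaving real entities untouched.
    return re.sub(r'&(?!#?\w+;)', '&amp;', text)

DOC = "<a>some broken & xml</a>"
repaired = escape_ampersands(DOC)
# repaired == "<a>some broken &amp; xml</a>"

# The repaired document now parses with a plain, non-recovering parser.
root = ET.fromstring(repaired)
```

After the repair no recovery mode is needed at all, which sidesteps the reader-API problem entirely, at the cost of having to know the breakage pattern up front.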

EDIT: Then take a look at sgmllib in the Python standard library. BeautifulSoup uses it, so it may be useful in your case. (BeautifulSoup itself offers only the tree representation, not events.)
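sgmllib itself is Python 2 only (it was removed in Python 3), but its stdlib descendant html.parser illustrates the same event-driven, error-tolerant style. A sketch of that approach, with html.parser swapped in for sgmllib and an illustrative EventLogger class of my own:

```python
from html.parser import HTMLParser

class EventLogger(HTMLParser):
    """Record start-tag, data, and end-tag events as they are parsed."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

parser = EventLogger()
# The bare & is tolerated: it simply comes through as character data.
parser.feed("<a>some broken & xml</a>")
parser.close()
```

This gives event-driven pulling of elements with tolerance for broken markup, though it is an HTML parser and knows nothing about XML namespaces or well-formedness.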

Krab
In the examples I've looked at, each individual source has broken XML, and all in different ways! Other common mistakes are opening and closing tags whose casing doesn't match. It'd be difficult to work around every single one, reliably at least. To top it off, having the providers fix their sources isn't an option; we have to support them as the previous provider did!
bee