views: 76
answers: 2

I'm trying to get a reader to recover from broken XML. Using the libxml2.XML_PARSE_RECOVER option with the DOM api (libxml2.readDoc) works and it recovers from entity problems.

However, using the option with the reader API (which is essential given the size of the documents we are parsing) does not work. The reader just gets stuck in a perpetual loop, with reader.Read() returning -1:

Sample code (with small example):

import libxml2

DOC = "<a>some broken & xml</a>"

reader = libxml2.readerForDoc(DOC, "urn:bogus", None, libxml2.XML_PARSE_RECOVER | libxml2.XML_PARSE_NOERROR)

ret = reader.Read()
while ret:
    print 'ret: %d' % ret
    print "node name: ", reader.Name(), reader.NodeType()
    ret = reader.Read()

Any ideas how to recover correctly?

A: 

I'm not too sure about the current state of the libxml2 Python bindings; even the libxml2 site suggests using lxml instead. Parsing this document and recovering from the stray & is nice and clean in lxml:

from cStringIO import StringIO
from lxml import etree

DOC = "<a>some broken & xml</a>"

reader = etree.XMLParser(recover=True)
tree = etree.parse(StringIO(DOC), reader)
print etree.tostring(tree.getroot())

The parsers page in the lxml docs goes into more detail about setting up a parser and iterating over the contents.
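If you do need to walk the recovered contents afterwards, the tree lxml hands back can be iterated directly. A minimal sketch, written for modern Python and assuming lxml is installed (exactly what text survives recovery depends on how libxml2 repairs the input):

```python
from io import BytesIO
from lxml import etree

DOC = b"<a>some broken & xml</a>"

# recover=True tells libxml2 to repair what it can instead of aborting.
parser = etree.XMLParser(recover=True)
tree = etree.parse(BytesIO(DOC), parser)

# iter() walks every element of the recovered tree in document order.
for element in tree.getroot().iter():
    print(element.tag)
```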

Edit:

If you want to parse a document incrementally, the XMLParser class can be used as well, since it is a subclass of _FeedParser:

DOC = "<a>some broken & xml</a>"
reader = etree.XMLParser(recover=True)

# Feed the parser one character at a time to mimic incremental input.
for data in StringIO(DOC).read():
    reader.feed(data)

tree = reader.close()
print etree.tostring(tree)
dcolish
Unfortunately I looked into lxml too, but your suggestion above uses the DOM API, and due to the size of our documents that isn't an option. The lxml iterparse API doesn't support recovery.
bee
If you're only trying to parse incrementally, look into the _FeedParser interface for lxml; I'll edit the above sample with its usage. I have not been able to find an iterative method of parsing that yields elements as they are parsed. http://codespeak.net/lxml/api/lxml.etree._FeedParser-class.html
dcolish
Thanks for all your efforts. Technically what we need is both incremental parsing and event-driven pulling of elements, with recovery. It's a shame lxml doesn't meet these requirements.
bee
A: 

Isn't the XML broken in some consistent way? Isn't there some pattern you could follow to repair your XML before parsing?

For example, if the errors are caused only by unescaped ampersands and you don't use CDATA sections or processing instructions, the input can be repaired with a regexp.
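If so, a small pre-processing pass is enough before handing the document to any parser. A sketch of that idea (the escape_ampersands helper and its entity pattern are my own illustration, and only cover the bare-ampersand case):

```python
import re
import xml.etree.ElementTree as ET

def escape_ampersands(text):
    # Escape any & that doesn't already start an entity reference
    # such as &amp; or &#38;, leaving real entities untouched.
    return re.sub(r'&(?!#?\w+;)', '&amp;', text)

DOC = "<a>some broken & xml</a>"
repaired = escape_ampersands(DOC)
# repaired == "<a>some broken &amp; xml</a>"

# The repaired document now parses with a plain, non-recovering parser.
root = ET.fromstring(repaired)
```

After the repair no recovery mode is needed at all, which sidesteps the reader-API problem entirely, at the cost of having to know the breakage pattern up front.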

EDIT: Then take a look at sgmllib in the Python standard library. BeautifulSoup uses it, so it may be useful in your case. (BeautifulSoup itself offers only the tree representation, not events.)
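sgmllib itself is Python 2 only (it was removed in Python 3), but its stdlib descendant html.parser illustrates the same event-driven, error-tolerant style. A sketch of that approach, with html.parser swapped in for sgmllib and an illustrative EventLogger class of my own:

```python
from html.parser import HTMLParser

class EventLogger(HTMLParser):
    """Record start-tag, data, and end-tag events as they are parsed."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

parser = EventLogger()
# The bare & is tolerated: it simply comes through as character data.
parser.feed("<a>some broken & xml</a>")
parser.close()
```

This gives event-driven pulling of elements with tolerance for broken markup, though it is an HTML parser and knows nothing about XML namespaces or well-formedness.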

Krab
In the examples I've looked at, each individual source has broken XML, and all in different ways! Other common mistakes are opening and closing tags whose casing doesn't match. It'd be difficult to work around every single one, reliably at least. To top it off, having the providers fix their sources isn't an option; we have to support them as the previous provider did!
bee