I'm trying to get a reader to recover from broken XML. Using the libxml2.XML_PARSE_RECOVER option with the DOM api (libxml2.readDoc) works and it recovers from entity problems.
However using the option with the reader API (which is essential due to the size of documents we are parsing) does not work. It just gets stuck in a perpetual loop (with reader.Read() returning -1):
Sample code (with small example):
import cStringIO
import libxml2
DOC = "<a>some broken & xml</a>"
reader = libxml2.readerForDoc(DOC, "urn:bogus", None, libxml2.XML_PARSE_RECOVER | libxml2.XML_PARSE_NOERROR)
ret = reader.Read()
while ret:
print 'ret: %d' % ret
print "node name: ", reader.Name(), reader.NodeType()
ret = reader.Read()
Any ideas how to recover correctly?