lxml easily validates XML files against any DTD or XMLSchema, as long as you parse them with etree.XML().
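For reference, here is a minimal sketch of that in-memory case, assuming a small made-up schema. The names XSD and validate_in_memory are hypothetical and only for illustration. It parses with a schema-validating etree.XMLParser and returns whatever the error log reports.

```python
from lxml import etree

# Hypothetical schema for illustration: <a> must contain exactly one <b>.
XSD = b"""<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="a" type="AType"/>
  <xsd:complexType name="AType">
    <xsd:sequence>
      <xsd:element name="b" type="xsd:string"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>"""


def validate_in_memory(xml_bytes, xsd_bytes=XSD):
    """Parse the whole document in memory; return [] if it is valid,
    otherwise a list of (line, column, message) tuples."""
    schema = etree.XMLSchema(etree.XML(xsd_bytes))
    parser = etree.XMLParser(schema=schema)
    try:
        etree.XML(xml_bytes, parser)
    except etree.XMLSyntaxError as e:
        errors = [(entry.line, entry.column, entry.message)
                  for entry in e.error_log]
        # Fall back to the exception text if the log happens to be empty.
        return errors or [(0, 0, str(e))]
    return []
```

Calling validate_in_memory(b"<a><c/></a>") should return at least one error tuple, since <c> is not allowed by this schema.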
I need to do the same trick with etree.iterparse(), so that the whole XML file won't be loaded into memory. There are two problems here:
1. The DTD is ignored by iterparse (whether it is internal or external).
2. The XML is validated against the XMLSchema, but the errors carry no information about the line or column where they occurred (both are set to 0).
I tried asking on the lxml-dev mailing list, but got no answer for point 2 and nothing that would help with point 1.
The solution doesn't have to use lxml, but it must not load the whole XML file into RAM.
Does anyone have experience solving such problems?
For everyone who will ask about versions: it's lxml 2.2.4 running on Python 2.6.
The code for validation with XMLSchema is below. For validation against an internal DTD, remove the line that creates the etree.XMLSchema object and replace the line:
parser = etree.iterparse(open('badInputFile.xml'), schema=schema)
with
parser = etree.iterparse(open('badInputFile.xml'), dtd_validation=True)
Code:
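from lxml import etree

schema = etree.XMLSchema(file='mySchema.xsd')
try:
    # Stream the document; schema validation happens while parsing
    parser = etree.iterparse(open('badInputFile.xml'), schema=schema)
    for aTuple in parser:
        print(aTuple)
except etree.XMLSyntaxError as e:
    print(e.position)  # (line, column); both come back as 0 here
    print(e.lineno)
    print(e.error_log)
    raise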
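To make the behaviour easier to reproduce, the same loop can be wrapped in a small self-contained function. This is only a sketch under assumptions: the schema, the XSD constant and the validate_streaming name are made up for illustration. Each element is cleared after its event so that the tree does not grow with the file. Whether the reported line and column are populated may depend on the lxml and libxml2 versions; on the setup described above they are 0.

```python
import io

from lxml import etree

# Hypothetical schema for illustration: <a> must contain exactly one <b>.
XSD = b"""<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="a" type="AType"/>
  <xsd:complexType name="AType">
    <xsd:sequence>
      <xsd:element name="b" type="xsd:string"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>"""


def validate_streaming(source, schema):
    """Stream a file-like object through iterparse with schema validation.

    Returns [] if no validation error was raised, otherwise a list of
    (line, column, message) tuples taken from the exception's error log.
    """
    try:
        for _event, element in etree.iterparse(source, schema=schema):
            element.clear()  # drop processed content to keep memory use low
    except etree.XMLSyntaxError as e:
        errors = [(entry.line, entry.column, entry.message)
                  for entry in e.error_log]
        # Fall back to the exception text if the log happens to be empty.
        return errors or [(0, 0, str(e))]
    return []


schema = etree.XMLSchema(etree.XML(XSD))
valid_result = validate_streaming(io.BytesIO(b"<a><b>x</b></a>"), schema)
invalid_result = validate_streaming(io.BytesIO(b"<a><c/></a>"), schema)
```

Here valid_result should be an empty list and invalid_result should hold at least one tuple; printing invalid_result shows which line and column values the installed version actually reports.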