lxml makes it easy to validate an XML document against a DTD or an XML Schema, as long as you parse it with etree.XML().

I need to do the same with etree.iterparse(), so that the whole XML file is not loaded into memory. There are two problems:
1. The DTD is ignored by iterparse, whether it is internal or external.
2. The XML is validated against the XML Schema, but the errors carry no information about the line or column where they occurred (both are set to 0).

I asked on the lxml-dev mailing list, but got no answer for point 2 and nothing that helped with point 1. The solution doesn't have to use lxml, but it mustn't load the whole XML file into RAM.

Does anyone have experience solving this kind of problem?

For those who will ask about versions: this is lxml 2.2.4 running on Python 2.6.

The code for validation against an XML Schema is below. To validate against an internal DTD instead, remove the line that creates the etree.XMLSchema object and replace the line:

parser = etree.iterparse(open('badInputFile.xml'), schema=schema)

with

parser = etree.iterparse(open('badInputFile.xml'), dtd_validation=True)

Code:

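from lxml import etree

schema = etree.XMLSchema(file='mySchema.xsd')
try:
    # Stream the document, validating against the schema while parsing
    parser = etree.iterparse(open('badInputFile.xml'), schema=schema)
    for aTuple in parser:
        print(aTuple)
except etree.XMLSyntaxError as e:
    # For schema errors these come back as 0 (problem 2 above)
    print(e.position)
    print(e.lineno)
    print(e.error_log)
    raise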
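
For anyone who wants to reproduce problem 2 without my files, here is a minimal self-contained sketch. The schema, the two sample documents, and the helper names stream_validate and run_demo are all made up for illustration; only etree.XMLSchema, etree.iterparse(schema=...), and the XMLSyntaxError attributes come from the code above. It also reads the individual error_log entries (line, column, message), since those may carry details that e.lineno and e.position do not. Whether the line and column are actually populated may depend on the lxml version.

```python
import os
import shutil
import tempfile

from lxml import etree

# Hypothetical sample schema: <root> holds one or more integer <item> elements.
XSD = b"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="root">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="item" type="xs:integer" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""
GOOD_XML = b"<root>\n  <item>1</item>\n  <item>2</item>\n</root>\n"
BAD_XML = b"<root>\n  <item>1</item>\n  <item>oops</item>\n</root>\n"


def stream_validate(xml_path, schema):
    """Return None if the file validates, else details of the failure."""
    try:
        with open(xml_path, 'rb') as source:
            for event, element in etree.iterparse(source, schema=schema):
                element.clear()  # discard content of elements already seen
    except etree.XMLSyntaxError as e:
        # Collect per-entry details from the error log attached to the exception
        entries = [(entry.line, entry.column, entry.message)
                   for entry in e.error_log]
        return str(e), e.lineno, e.position, entries
    return None


def run_demo():
    """Write the sample files to a temp dir and validate both of them."""
    workdir = tempfile.mkdtemp()
    try:
        paths = {}
        samples = [('s.xsd', XSD), ('good.xml', GOOD_XML), ('bad.xml', BAD_XML)]
        for name, data in samples:
            paths[name] = os.path.join(workdir, name)
            with open(paths[name], 'wb') as f:
                f.write(data)
        schema = etree.XMLSchema(file=paths['s.xsd'])
        return (stream_validate(paths['good.xml'], schema),
                stream_validate(paths['bad.xml'], schema))
    finally:
        shutil.rmtree(workdir)


if __name__ == '__main__':
    good, bad = run_demo()
    print(good)
    print(bad)
```

The valid document should come back as None; for the invalid one, the interesting part is whether the reported line and column are 0 or real values.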