I need to read a large XML file (65 MB), validate it against an XSD, and run XPath queries on it. Below is my lxml version. Validation seems to be pretty quick, but the query takes a long time to run (over 5 minutes).
I have a couple of questions. First, how would a performance-minded Python programmer write this program using lxml? Second, if lxml is not the right tool for the job, what is, and could you please give a code snippet?
import sys
from datetime import datetime
from lxml import etree
start = datetime.now()
# Load the schema; the parser validates the document while parsing it.
with open("library.xsd", "rb") as schema_file:
    schema = etree.XMLSchema(file=schema_file)
parser = etree.XMLParser(schema=schema)
# Binary mode lets lxml handle the XML encoding declaration itself.
with open(sys.argv[1], "rb") as data_file:
    tree = etree.parse(data_file, parser)
root = tree.getroot()
end = datetime.now()
delta = end-start
print("Parsing time = " + str(delta))
start = datetime.now()
name_list = root.xpath("book/author/name/text()")
print ("Size of list = " + str(len(name_list)))
end = datetime.now()
delta = end-start
print("Query time = " + str(delta))
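For the second question, one alternative I can imagine (a sketch only, not measured against my data) is to stream the document instead of building the full tree and running XPath over it. The version below uses the standard library's xml.etree.ElementTree.iterparse so that it runs without lxml; lxml's etree.iterparse has a similar interface. The function name iter_author_names and the sample document are made up for illustration, the element names are assumed from my XPath expression, and it does no schema validation.

```python
import io
import xml.etree.ElementTree as ET


def iter_author_names(source):
    """Yield the text of each <root>/book/author/name element while streaming."""
    path = []  # tags of the currently open elements, root first
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            path.append(elem.tag)
            continue
        # "end" event: the element and its text are now complete.
        if path[1:] == ["book", "author", "name"] and elem.text is not None:
            yield elem.text
        path.pop()
        if len(path) == 1:
            # A direct child of the root (one book) is finished, so drop its
            # contents. The emptied element itself stays attached to the root.
            elem.clear()


# Hypothetical sample data; the real file is assumed to have a similar shape.
SAMPLE = b"""<library>
  <book><title>A</title><author><name>Ann</name></author></book>
  <book><title>B</title><author><name>Bob</name></author></book>
</library>"""

names = list(iter_author_names(io.BytesIO(SAMPLE)))
print("Size of list = " + str(len(names)))
```

I have not verified whether this is actually faster than the XPath call on the real 65 MB file.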