I need to read a large XML file (65 MB), validate it against an XSD, and run XPath queries on it. Below, I've given an lxml version of that. Running the query takes a long time (over 5 minutes), but validation seems to be pretty quick.

I have a couple of questions. How would a performance-minded Python programmer write this program using lxml? Secondly, if lxml is not the right tool for the job, what is? Could you please give a code snippet?

import sys
from datetime import datetime
from lxml import etree

start = datetime.now()
schema_file = open("library.xsd")
schema = etree.XMLSchema(file=schema_file)
parser = etree.XMLParser(schema=schema)
data_file = open(sys.argv[1], 'r')
tree = etree.parse(data_file, parser)
root = tree.getroot()
data_file.close()
schema_file.close()
end = datetime.now()
delta = end - start
print("Parsing time = ", delta)

start = datetime.now()
name_list = root.xpath("book/author/name/text()")
print("Size of list = " + str(len(name_list)))
end = datetime.now()
delta = end - start
print("Query time = ", delta)
A: 

I wonder if you can rewrite the XPath expression to run faster. One thing that may help is to avoid building the name_list node-set (if you don't need it later) and have the nodes counted inside lxml. Something like this:

start = datetime.now()
name_list_len = root.xpath("count(book/author/name/text())")
print("Size of list = " + str(int(name_list_len)))  # XPath count() returns a float
end = datetime.now()

Otherwise, you may find the expat parser faster for extracting the text, but it doesn't validate and is more complex to use (you'll need to write a state machine and a couple of callbacks). If you just need the text, it may be faster to use the C implementation of the ElementTree API (cElementTree; on modern Python it is the accelerator behind xml.etree.ElementTree). The lxml benchmarks make interesting reading and do seem to hint that you could extract the text more quickly that way.
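To illustrate the stream-based alternative: here is a rough, non-validating sketch using the standard library's iterparse, assuming the elements are called "name" as in the question (the function name and file layout are my own invention):

```python
import xml.etree.ElementTree as ElementTree

def extract_names(path):
    """Collect the text of every <name> element without keeping the full tree."""
    result = []
    # iterparse fires an 'end' event as each element is completed
    for event, elem in ElementTree.iterparse(path):
        if elem.tag == "name":
            result.append(elem.text)
            elem.clear()  # release the element's contents once captured
    return result
```

This only streams the text extraction; you would still have to run schema validation in a separate pass if you need it.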

One common XPath performance issue is unnecessary use of '//' at the start of an expression. In such cases, making the expression absolute, e.g.:

 name_list = root.xpath("/rootelement/book/author/name/text()")

can be a lot quicker if the document's structure allows it. That shouldn't be an issue here, though.

Andrew Walker
Thinking about it, I don't think my answer is quite right - your xpath expression is already absolute (it doesn't start with '//'). Still, 5 mins seems rather long.
Andrew Walker
I've updated the answer to fix this.
Andrew Walker
A: 

The lxml benchmarks are quite useful. It appears to me that extracting element nodes using XPath is fast, but extracting text nodes can be slow. Below are a few solutions that are pretty fast.

def bench_lxml_xpath_direct(root):  # Very slow; very fast if text() is removed
    name_list = root.xpath("book/author/name/text()")
    print("Size of list = " + str(len(name_list)))

def bench_lxml_xpath_loop(root):  # Fast
    name_list = root.xpath("book/author/name")
    result = [n.text for n in name_list]
    print("Size of list = " + str(len(result)))

def bench_lxml_getiterator(tree):  # Very fast
    result = [name.text for name in tree.getiterator("name")]
    print("Size of list = " + str(len(result)))


def bench_lxml_findall(tree):  # Superfast
    # ".//name" rather than "//name": ElementPath expressions
    # must be relative, unlike full XPath
    result = [name.text for name in tree.findall(".//name")]
    print("Size of list = " + str(len(result)))
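A related option, not part of the benchmarks above, is lxml's own iterparse, which streams the file instead of holding the whole tree in memory. A sketch, assuming the same "name" elements:

```python
from lxml import etree

def bench_lxml_iterparse(path):
    result = []
    # tag="name" makes lxml fire end-events only for matching elements
    for event, elem in etree.iterparse(path, tag="name"):
        result.append(elem.text)
        elem.clear()  # release the element once its text is captured
    print("Size of list = " + str(len(result)))
    return result
```

Note that, unlike the functions above, this takes a file path rather than an already-parsed tree, so the timing includes parsing.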
Sumant