I need to read a large XML file (65 MB), validate it against an XSD, and run XPath queries on it. Below, I've given an lxml version of that. Running the query takes a long time (over 5 minutes), but validation seems to be pretty quick.

I have a couple of questions. How would a performance-minded Python programmer write this program using lxml? Secondly, if lxml is not the right tool for the job, what is? Could you please give a code snippet?

import sys
from datetime import datetime
from lxml import etree

start = datetime.now()
schema_file = open("library.xsd")
schema = etree.XMLSchema(file=schema_file)
parser = etree.XMLParser(schema=schema)
data_file = open(sys.argv[1], 'r')
tree = etree.parse(data_file, parser)
root = tree.getroot()
data_file.close()
schema_file.close()
end = datetime.now()
delta = end - start
print("Parsing time = ", delta)

start = datetime.now()
name_list = root.xpath("book/author/name/text()")
print("Size of list = " + str(len(name_list)))
end = datetime.now()
delta = end - start
print("Query time = ", delta)
A: 

I wonder if you can rewrite the XPath expression to run faster. One thing that may help is to avoid building the name_list node-set (if you don't need it later) and have the nodes counted inside lxml. Something like this:

start = datetime.now()
name_list_len = root.xpath("count(book/author/name/text())")
print("Size of list = " + str(int(name_list_len)))  # XPath count() returns a float
end = datetime.now()

Otherwise, you may find the expat parser faster for extracting the text, but it doesn't validate and is more complex to use (you'll need to write a state machine and a couple of callbacks). If you just need the text, it may be faster to use the C implementation of the ElementTree API (cElementTree; on modern Python it is the accelerator behind xml.etree.ElementTree). The lxml benchmarks make interesting reading and do seem to hint that you could extract the text more quickly that way.
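To illustrate the stream-based alternative: here is a rough, non-validating sketch using the standard library's iterparse, assuming the elements are called "name" as in the question (the function name and file layout are my own invention):

```python
import xml.etree.ElementTree as ElementTree

def extract_names(path):
    """Collect the text of every <name> element without keeping the full tree."""
    result = []
    # iterparse fires an 'end' event as each element is completed
    for event, elem in ElementTree.iterparse(path):
        if elem.tag == "name":
            result.append(elem.text)
            elem.clear()  # release the element's contents once captured
    return result
```

This only streams the text extraction; you would still have to run schema validation in a separate pass if you need it.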

One common XPath performance issue is unnecessary use of '//' at the start of an expression. In such cases, making the expression absolute, e.g.:

 name_list = root.xpath("/rootelement/book/author/name/text()")

can be a lot quicker if the document's structure allows it. That shouldn't be an issue here, though.

Andrew Walker
Thinking about it, I don't think my answer is quite right - your xpath expression is already absolute (it doesn't start with '//'). Still, 5 mins seems rather long.
Andrew Walker
I've updated the answer to fix this.
Andrew Walker
A: 

The lxml benchmarks are quite useful. It appears to me that extracting element nodes using XPath is fast, but extracting text nodes can be slow. Below are a few solutions that are pretty fast.

def bench_lxml_xpath_direct(root):  # Very slow; very fast if text() is removed
    name_list = root.xpath("book/author/name/text()")
    print("Size of list = " + str(len(name_list)))

def bench_lxml_xpath_loop(root):  # Fast
    name_list = root.xpath("book/author/name")
    result = [n.text for n in name_list]
    print("Size of list = " + str(len(result)))

def bench_lxml_getiterator(tree):  # Very fast
    result = [name.text for name in tree.getiterator("name")]
    print("Size of list = " + str(len(result)))


def bench_lxml_findall(tree):  # Superfast
    # ".//name" rather than "//name": ElementPath expressions
    # must be relative, unlike full XPath
    result = [name.text for name in tree.findall(".//name")]
    print("Size of list = " + str(len(result)))
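A related option, not part of the benchmarks above, is lxml's own iterparse, which streams the file instead of holding the whole tree in memory. A sketch, assuming the same "name" elements:

```python
from lxml import etree

def bench_lxml_iterparse(path):
    result = []
    # tag="name" makes lxml fire end-events only for matching elements
    for event, elem in etree.iterparse(path, tag="name"):
        result.append(elem.text)
        elem.clear()  # release the element once its text is captured
    print("Size of list = " + str(len(result)))
    return result
```

Note that, unlike the functions above, this takes a file path rather than an already-parsed tree, so the timing includes parsing.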
Sumant