I have an application that uses html5lib to parse HTML leniently. I use the minidom tree builder because I need a real DOM API; ElementTree is not appropriate for what I'm doing.

Here's how I do this:

import html5lib
# Build a minidom tree instead of the default ElementTree one.
parser = html5lib.XHTMLParser(tree=html5lib.treebuilders.getTreeBuilder('dom'))
document = parser.parse(html)

However, parsing huge files is becoming a performance bottleneck, and lxml parsing is about 80 times faster than html5lib (I benchmarked it).

How do I parse with lxml or a similarly fast bad-html-tolerant library, and manipulate with a DOM-compatible API?

+2  A: 

I think I found a solution:

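from xml.dom.pulldom import SAX2DOM
import lxml.html
import lxml.sax
def parse_lxml_dom(html):
    # Parse leniently with lxml (fast, tolerant of broken markup).
    tree = lxml.html.document_fromstring(html)
    # Replay the lxml tree as SAX events into a minidom builder.
    handler = SAX2DOM()
    lxml.sax.saxify(tree, handler)
    return handler.document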

However, this is only about 7 times faster than html5lib. The saxify call takes quite a while.
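For what it's worth, here is a minimal sketch of how the result can be used through the normal minidom API. The function is repeated so the snippet runs on its own; the sample markup and the variable names are made up for illustration, and it assumes lxml is installed.

```python
from xml.dom.pulldom import SAX2DOM

import lxml.html
import lxml.sax


def parse_lxml_dom(html):
    # Same approach as above: lxml parses, SAX2DOM rebuilds a minidom tree.
    tree = lxml.html.document_fromstring(html)
    handler = SAX2DOM()
    lxml.sax.saxify(tree, handler)
    return handler.document


# Hypothetical sample input: sloppy markup with unclosed <p> tags
# and no <html>/<body> wrapper.
SAMPLE_HTML = "<p class='intro'>Hello<p>World"

doc = parse_lxml_dom(SAMPLE_HTML)

# From here on, only standard xml.dom.minidom calls are used.
root_tag = doc.documentElement.tagName
paragraphs = doc.getElementsByTagName("p")
paragraph_count = len(paragraphs)
first_class = paragraphs[0].getAttribute("class")
first_text = paragraphs[0].firstChild.data

print(root_tag, paragraph_count, first_class, first_text)
```

The returned object is an ordinary minidom Document, so code written against the html5lib 'dom' tree builder should mostly carry over, though the two parsers may repair broken markup differently.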

Christian Oudard