I have an application where I've been using html5lib to leniently parse HTML. I use the minidom interface because I need a real DOM API, and ElementTree is not appropriate for what I'm doing.
Here's how I do this:
import html5lib
parser = html5lib.XHTMLParser(tree=html5lib.treebuilders.getTreeBuilder('dom'))
document = parser.parse(html)  # html is the markup string; the result is a minidom Document
However, parsing huge files is becoming a performance bottleneck, and lxml parsing is about 80 times faster than html5lib (I benchmarked it).
How do I parse with lxml, or a similarly fast library that tolerates malformed HTML, and then manipulate the result through a DOM-compatible API?
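To make the question concrete, here is a minimal sketch of one possible bridge, not a tested solution: parse with the fast library, then copy the resulting tree into a minidom Document. It relies only on the ElementTree-style attributes (tag, attrib, text, tail), which lxml elements also expose. The helper names etree_to_minidom and _copy_into are hypothetical, and the standard library's xml.etree.ElementTree stands in for lxml so the sketch runs without extra dependencies.

```python
import xml.etree.ElementTree as ET  # stand-in: lxml elements expose the same tag/attrib/text/tail API
from xml.dom import minidom


def etree_to_minidom(root):
    """Copy an ElementTree-style element tree into a new minidom Document.

    Hypothetical helper for illustration only.
    """
    doc = minidom.Document()
    top = doc.createElement(root.tag)
    doc.appendChild(top)
    _copy_into(root, top, doc)
    return doc


def _copy_into(src, dst, doc):
    """Recursively copy attributes, text, and children of src onto the DOM node dst."""
    for name, value in src.attrib.items():
        dst.setAttribute(name, value)
    if src.text:
        dst.appendChild(doc.createTextNode(src.text))
    for child in src:
        # Comments and processing instructions have a non-string tag;
        # this sketch skips them but keeps the text that follows them.
        if isinstance(child.tag, str):
            node = doc.createElement(child.tag)
            dst.appendChild(node)
            _copy_into(child, node, doc)
        if child.tail:
            dst.appendChild(doc.createTextNode(child.tail))


# Usage with the stand-in parser; with lxml the tree would instead come
# from something like lxml.html.fromstring(html).
tree = ET.fromstring('<html><body><p class="x">Hi <b>there</b>!</p></body></html>')
dom = etree_to_minidom(tree)
first_p = dom.getElementsByTagName("p")[0]
print(first_p.getAttribute("class"))
```

Two caveats apply to this sketch. The copy is a second pass over the whole tree, so it would give back part of the parsing speedup, and I have not measured how much. The recursion is one level deep per nesting level, so very deeply nested documents might need an iterative version.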