I'm trying to wean myself off BeautifulSoup, which I love but which seems to be (aggressively) unsupported. I'm trying to work with html5lib and lxml instead, but I can't figure out how to use the "find" and "findall" methods.
By looking at the docs for html5lib, I came up with this for a test program:
import io  # cStringIO exists only on Python 2; io.StringIO is the Python 3 equivalent
f = io.StringIO()
f.write("""
<html>
<body>
<table>
<tr>
<td>one</td>
<td>1</td>
</tr>
<tr>
<td>two</td>
<td>2</td>
</tr>
</table>
</body>
</html>
""")
f.seek(0)
import html5lib
from html5lib import treebuilders
from lxml import etree # why?
parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("lxml"))
etree_document = parser.parse(f)
root = etree_document.getroot()
root.find(".//tr")
But this returns None. I noticed that if I call "etree.tostring(root)" I get all my data back, but every tag is prefixed with "html:" (e.g. <html:table>). However, root.find(".//html:tr") throws a KeyError.
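The "html:" prefix suggests the elements live in the XHTML namespace (http://www.w3.org/1999/xhtml), in which case an unqualified ".//tr" would never match. Below is a minimal sketch of that behavior. It assumes the namespace is the cause, and it uses the standard library's xml.etree.ElementTree (which shares the find/findall path syntax with lxml) rather than html5lib itself. The hand-written sample document and the XHTML_NS name are illustrative, not taken from html5lib's output.

```python
import xml.etree.ElementTree as ET

# Assumption: this is the namespace html5lib assigns to HTML elements by default.
XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hand-written stand-in for the parsed tree: same table, with every element
# placed in the XHTML namespace via the default xmlns declaration.
doc = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body><table>'
    "<tr><td>one</td><td>1</td></tr>"
    "<tr><td>two</td><td>2</td></tr>"
    "</table></body></html>"
)
root = ET.fromstring(doc)

# An unqualified tag name does not match namespaced elements.
unqualified = root.find(".//tr")

# Option 1: spell out the namespace in Clark notation, {uri}tag.
qualified = root.findall(".//{%s}tr" % XHTML_NS)

# Option 2: use a prefix, but tell find/findall what the prefix maps to.
prefixed = root.findall(".//html:tr", namespaces={"html": XHTML_NS})

# Collect the cell text of each row found with the qualified lookup.
rows = [[td.text for td in tr.findall("{%s}td" % XHTML_NS)] for tr in qualified]

print(unqualified)
print(len(qualified), len(prefixed))
print(rows)
```

If this is the cause, the prefixed form only works when the prefix is mapped through the "namespaces" argument; the prefix shown by tostring is not known to find on its own. As far as I can tell, html5lib.HTMLParser also accepts namespaceHTMLElements=False, which should leave the tags unqualified so that ".//tr" matches directly, but I have not verified that here.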
Can someone put me back on the right track?