ansaurus

Question

html5lib/lxml examples for BeautifulSoup users?

Answer 1

A:

Try:

root.find('.//{http://www.w3.org/1999/xhtml}tr')

You have to specify the namespace rather than the namespace prefix (html:tr). For more information, see the lxml docs, particularly the section:

Tutorial - Namespaces

ars 2010-09-12 19:57:00

Answer 2

A:

It appears that using the "lxml" html5lib TreeBuilder causes html5lib to build the tree in the XHTML namespace -- which makes sense, as lxml is an XML library, and XHTML is how one represents HTML as XML. You can use lxml's qname syntax with the find() method to do something like:

root.find('.//{http://www.w3.org/1999/xhtml}tr')

Or you can use lxml's full XPath functions to do something like:

root.xpath('.//html:tr', namespaces={'html': 'http://www.w3.org/1999/xhtml'})

The lxml documentation has more information on how it uses XML namespaces.

llasram 2010-09-12 20:09:01

Answer 3

+1 A:

In general, use lxml.html for HTML. Then you don't need to worry about generating your own parser & worrying about namespaces.

>>> import lxml.html as l
>>> doc = """
...    <html><body>
...    <table>
...      <tr>
...        <td>one</td>
...        <td>1</td>
...      </tr>
...      <tr>
...        <td>two</td>
...        <td>2</td
...      </tr>
...    </table>
...    </body></html>"""
>>> doc = l.document_fromstring(doc)
>>> doc.finall('.//tr')
[<Element tr at ...>, <Element tr at ...>] #doctest: +ELLIPSIS

FYI, lxml.html also allows you to use CSS selectors, which I find is an easier syntax.

>>> doc.cssselect('tr')
[<Element tr at ...>, <Element tr at ...>] #doctest: +ELLIPSIS

Tim McNamara 2010-09-12 23:06:38

great! Now I just need to find the method in lxml to parse a file (instead of having to read a file into a string) and I'm all set.

Chris Curvey 2010-09-14 22:55:51

ah, found the parse-a-file at http://stackoverflow.com/questions/3569152/parsing-html-with-lxml

Chris Curvey 2010-09-14 23:00:11

Am glad to see you've been successful! Actually, it's simpler than that example: `lxml.html.parse` can take a URL, file name or file-like object as its argument. One gotcha though is that the function returns a tree, not the root element. Use `lxml.html.parse(file).get_root()` to get the root node.

Tim McNamara 2010-09-15 20:39:27

ansaurus

tags:

views:

answers:

html5lib/lxml examples for BeautifulSoup users?

related questions