Hi all. I am trying to use html5lib to parse an HTML page into something I can query with XPath. html5lib has close to zero documentation and I've spent too much time trying to figure this problem out. The ultimate goal is to pull out the second row of a table:

<html>
    <table>
        <tr><td>Header</td></tr>
        <tr><td>Want This</td></tr>
    </table>
</html>

So let's try it:

>>> import html5lib
>>> doc = html5lib.parse('<html><table><tr><td>Header</td></tr><tr><td>Want This</td> </tr></table></html>', treebuilder='lxml')
>>> doc
<lxml.etree._ElementTree object at 0x1a1c290>

That looks good, let's see what else we have:

>>> import lxml.etree
>>> root = doc.getroot()
>>> print(lxml.etree.tostring(root))
<html:html xmlns:html="http://www.w3.org/1999/xhtml"><html:head/><html:body><html:table><html:tbody><html:tr><html:td>Header</html:td></html:tr><html:tr><html:td>Want This</html:td></html:tr></html:tbody></html:table></html:body></html:html>

LOL WUT?

Seriously. I was planning on using some XPath to get at the data I want, but that doesn't seem to work. So what can I do? I am willing to try different libraries and approaches.
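
For what it's worth, the serialized output above puts every element in the XHTML namespace (http://www.w3.org/1999/xhtml), which is the likely reason an unqualified query such as //tr matches nothing. The sketch below is not the html5lib API; it re-parses that serialized string with the standard library's xml.etree.ElementTree just to show the effect, and the prefix "h" is an arbitrary name chosen for the example. With lxml the analogous call should be xpath(..., namespaces={...}), and html5lib.parse is understood to accept namespaceHTMLElements=False to skip the namespace entirely, but treat both as assumptions to check against your installed versions.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# The tree html5lib produced, as serialized in the question.
serialized = (
    '<html:html xmlns:html="http://www.w3.org/1999/xhtml"><html:head/><html:body>'
    '<html:table><html:tbody>'
    '<html:tr><html:td>Header</html:td></html:tr>'
    '<html:tr><html:td>Want This</html:td></html:tr>'
    '</html:tbody></html:table></html:body></html:html>'
)
root = ET.fromstring(serialized)

# An unqualified path does not match namespaced elements.
plain_rows = root.findall(".//tr")

# Map a prefix of our choosing ("h") to the XHTML namespace and qualify each step.
ns = {"h": XHTML_NS}
rows = root.findall(".//h:tr", ns)
second_cell = rows[1].find("h:td", ns).text

print(len(plain_rows), len(rows), second_cell)
```

The point is only that each step of the path has to carry the namespace; the same idea applies whichever tree library ends up holding the document.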

A: 

Try using jQuery, which lets you retrieve all the elements. Alternatively, you can put an id on your row and pull it out by that id.

1) ... ...

$("td")[1].innerHTML will be what you want

2) ... ...

$("#blah").text() will be what you want

yamspog
I think the request was for a Python solution.
Greg
A: 

I believe you can run CSS selector queries on lxml objects, like so:

elements = root.cssselect('div.content')
data = elements[0].text
z33m
+1  A: 

With BeautifulSoup, you can do it like this:

>>> soup = BeautifulSoup.BeautifulSoup('<html><table><tr><td>Header</td></tr><tr><td>Want This</td></tr></table></html>')
>>> soup.findAll('td')[1].string
u'Want This'
>>> soup.findAll('tr')[1].td.string
u'Want This'

(Obviously that's a really crude example, but ya.)

isbadawi
A: 

You can use xml.dom.minidom:

from xml.dom.minidom import parseString

doc = parseString('<html><table><tr><td>Header</td></tr><tr><td>Want This</td> </tr></table></html>')

def parse(node):
    # Recursively print every node in the document tree.
    for child in node.childNodes:
        print(child)
        parse(child)

parse(doc)

There are several ways to reach the first column of the second row: you can walk the document until you come across it, jump straight to it via attributes, and so on. The module is well documented:

http://docs.python.org/py3k/library/xml.dom.html

http://docs.python.org/py3k/library/xml.dom.minidom.html
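
As one possible way to go straight to the cell rather than walking the whole tree, here is a minimal sketch using getElementsByTagName. It assumes the markup is well-formed XML, as in the question's sample; the variable names are just for illustration.

```python
from xml.dom.minidom import parseString

doc = parseString(
    '<html><table><tr><td>Header</td></tr>'
    '<tr><td>Want This</td></tr></table></html>'
)

# All <tr> elements in document order; index 1 is the second row.
rows = doc.getElementsByTagName('tr')
second_row = rows[1]

# First <td> of that row; its only child is a text node.
cell = second_row.getElementsByTagName('td')[0]
text = cell.firstChild.data

print(text)
```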

rcoyner
But what happens if the HTML is not well-formed XML?
Mark
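
To illustrate the concern in the comment above: minidom is a strict XML parser, so markup that browsers accept can still be rejected. The sketch below is only a demonstration; parses_as_xml is a hypothetical helper written for this example, and the unclosed <br> stands in for typical real-world HTML.

```python
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError


def parses_as_xml(markup):
    """Return True if minidom accepts the markup, False if expat rejects it."""
    try:
        parseString(markup)
    except ExpatError:
        return False
    return True


well_formed = '<html><table><tr><td>Header</td></tr></table></html>'
# <br> is never closed, which is fine in HTML but not in XML.
tag_soup = '<html><table><tr><td>Header<br></td></tr></table></html>'

print(parses_as_xml(well_formed), parses_as_xml(tag_soup))
```

For pages like the second one, an HTML-aware parser (lxml.html, html5lib, BeautifulSoup) is the safer choice.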
+5  A: 

Lack of documentation is a good reason to avoid a library IMO, no matter how cool it is. Are you wedded to using html5lib? Have you looked at lxml.html?

Here is a way to do this with lxml:

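from lxml import html
# The sample markup from the question
text = '<html><table><tr><td>Header</td></tr><tr><td>Want This</td></tr></table></html>'
tree = html.fromstring(text)
[td.text for td in tree.xpath("//td")]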

Result:

['Header', 'Want This']
Ryan Ginstrom
+1  A: 

I always recommend trying the lxml library. It's very fast and has many features.

It also supports the html5lib parser if you need that, via the lxml.html.html5parser module.

>>> from lxml.html import fromstring, tostring

>>> html = """
... <html>
...     <table>
...         <tr><td>Header</td></tr>
...         <tr><td>Want This</td></tr>
...     </table>
... </html>
... """
>>> doc = fromstring(html)
>>> tr = doc.cssselect('table tr')[1]
>>> print(tostring(tr, encoding='unicode'))
<tr><td>Want This</td></tr>
Ruslan Spivak
This is how I'd do it, except I'd use "print(doc.cssselect('tr')[1].text_content())" to get at the contents of the second row, rather than have lxml show the HTML.
Greg