I'm new to Python and I have been trying to search through HTML that has been parsed with BeautifulSoup, using regular expressions. I haven't had any success, and I think the reason is that I don't completely understand how to set up the regular expressions properly. I've looked at older questions about similar problems, but I still haven't figured it out. If somebody could show how to extract "/torrent/32726/0/" and "Slackware Linux 13.0 [x86 DVD ISO]", along with a detailed explanation of how the regular expression works, it would be really helpful.

<td class="name">
  <a href="/torrent/32726/0/">
   Slackware Linux 13.0 [x86 DVD ISO]
  </a>
 </td>
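For reference, the two pieces of the snippet above could be captured with the standard library's re module roughly like this (a sketch only; the answers below explain why a real parser is the more robust choice for HTML in general):

```python
import re

html = '<td class="name"><a href="/torrent/32726/0/">Slackware Linux 13.0 [x86 DVD ISO]</a></td>'

# Group 1 captures everything between href=" and the next double quote;
# group 2 captures the link text between > and </a>.
# .*? is non-greedy, so each group stops at the first closing delimiter.
pattern = re.compile(r'<a href="(.*?)">(.*?)</a>')

match = pattern.search(html)
if match:
    print(match.group(1))  # /torrent/32726/0/
    print(match.group(2))  # Slackware Linux 13.0 [x86 DVD ISO]
```

This only works for markup shaped exactly like the snippet (one href attribute, double quotes, no nested tags), which is the usual argument for using a parser instead.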

Edit: What I meant to say is that I am trying to extract "/torrent/32726/0/" and "Slackware Linux 13.0 [x86 DVD ISO]" using BeautifulSoup's functions to search the parse tree. I've been trying various things after searching and reading the documentation, but I'm still not sure how to go about it.

A: 

Regex + html = NO! OH GOD NO!

Use this instead: http://docs.python.org/library/htmlparser.html
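A minimal sketch of that approach, using Python 3's import path for the same module (the linked docs cover the Python 2 spelling, HTMLParser). The subclass simply records the href of each anchor tag along with the text inside it:

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, text) pairs for every <a> tag encountered."""

    def __init__(self):
        super().__init__()
        self.links = []        # collected (href, text) pairs
        self._in_link = False  # True while inside an <a> element
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._in_link = True
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        if self._in_link:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._in_link:
            self.links.append((self._href, ''.join(self._text).strip()))
            self._in_link = False


parser = LinkExtractor()
parser.feed('<td class="name"><a href="/torrent/32726/0/">'
            'Slackware Linux 13.0 [x86 DVD ISO]</a></td>')
print(parser.links)
```

Unlike BeautifulSoup, this gives you no tree to search afterwards, so filtering by the enclosing td's class would take extra bookkeeping in the handler methods.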

fredley
+3  A: 

BeautifulSoup can also extract node values from your HTML.

from BeautifulSoup import BeautifulSoup

html = ('<html><head><title>Page title</title></head>'
        '<body>'
        '<table><tr>'
        '<td class="name"><a href="/torrent/32726/0/">Slackware Linux 13.0 [x86 DVD ISO]</a></td>'
        '<td class="name"><a href="/torrent/32727/0/">Slackware Linux 14.0 [x86 DVD ISO]</a></td>'
        '<td class="name"><a href="/torrent/32728/0/">Slackware Linux 15.0 [x86 DVD ISO]</a></td>'
        '</tr></table>'
        '</body>'
        '</html>')
soup = BeautifulSoup(html)
links = [td.find('a') for td in soup.findAll('td', { "class" : "name" })]
for link in links:
    print link.string

Output:

Slackware Linux 13.0 [x86 DVD ISO]  
Slackware Linux 14.0 [x86 DVD ISO]  
Slackware Linux 15.0 [x86 DVD ISO]  
systempuntoout
Hey you never used the re module ¬¬
razpeitia
A: 

Documentation: http://www.crummy.com/software/BeautifulSoup/documentation.html

from BeautifulSoup import BeautifulSoup

html = '<td class="name"><a href="/torrent/32726/0/">Slackware Linux 13.0 [x86 DVD ISO]</a></td>'

soup = BeautifulSoup(html)

a = soup.find('td', 'name').find('a')

print a['href']
# /torrent/32726/0/

print a.string
# Slackware Linux 13.0 [x86 DVD ISO]

If you want to parse multiple td elements, you could do something like:

rows = soup.findAll('td', 'name')

for row in rows:
    a = row.find('a')

    print a['href']
    print a.string
Arnar Yngvason
+1  A: 

You could use lxml.html to parse the html document:

from lxml import html

doc = html.parse('http://example.com')

for a in doc.cssselect('td a'):
    print a.get('href')
    print a.text_content()

You will have to look at how the document is structured to find the best way of selecting the links you want (there might be other tables with links that you do not need, etc.); you might, for instance, first want to find the right table element. There are also options besides CSS selectors (XPath, for example) for searching the document or an element.

If you need to, you can turn the links into absolute links with the .make_links_absolute() method. Do it on the document right after parsing, and all the URLs will be absolute, which is very convenient.
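What "absolute" means here can be sketched with the standard library's urljoin, which is the same resolution lxml applies to every link for you (the base URL below is hypothetical, just for illustration):

```python
from urllib.parse import urljoin  # Python 3; on Python 2 this lives in urlparse

# Hypothetical URL of the page the href was scraped from.
base = 'http://example.com/torrents/'

# A root-relative href like the one in the question resolves against
# the scheme and host of the base URL, replacing its path.
print(urljoin(base, '/torrent/32726/0/'))  # http://example.com/torrent/32726/0/
```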

Steven