I'm new to Python and I have been trying to search through HTML that has been parsed with BeautifulSoup, using regular expressions. I haven't had any success, and I think the reason is that I don't completely understand how to set up the regular expressions properly. I've looked at older questions about similar problems, but I still haven't figured it out. If somebody could show how to extract "/torrent/32726/0/" and "Slackware Linux 13.0 [x86 DVD ISO]", along with a detailed explanation of how the regular expression works, it would be really helpful.

<td class="name">
  <a href="/torrent/32726/0/">
   Slackware Linux 13.0 [x86 DVD ISO]
  </a>
 </td>
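For reference, the two pieces of the snippet above could be captured with the standard library's re module roughly like this (a sketch only; the answers below explain why a real parser is the more robust choice for HTML in general):

```python
import re

html = '<td class="name"><a href="/torrent/32726/0/">Slackware Linux 13.0 [x86 DVD ISO]</a></td>'

# Group 1 captures everything between href=" and the next double quote;
# group 2 captures the link text between > and </a>.
# .*? is non-greedy, so each group stops at the first closing delimiter.
pattern = re.compile(r'<a href="(.*?)">(.*?)</a>')

match = pattern.search(html)
if match:
    print(match.group(1))  # /torrent/32726/0/
    print(match.group(2))  # Slackware Linux 13.0 [x86 DVD ISO]
```

This only works for markup shaped exactly like the snippet (one href attribute, double quotes, no nested tags), which is the usual argument for using a parser instead.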

Edit: What I meant to say is that I am trying to extract "/torrent/32726/0/" and "Slackware Linux 13.0 [x86 DVD ISO]" using BeautifulSoup's functions to search the parse tree. I've been trying various things after searching and reading the documentation, but I'm still not sure how to go about it.

A: 

Regex + html = NO! OH GOD NO!

Use this instead: http://docs.python.org/library/htmlparser.html
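A minimal sketch of that approach, using Python 3's import path for the same module (the linked docs cover the Python 2 spelling, HTMLParser). The subclass simply records the href of each anchor tag along with the text inside it:

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, text) pairs for every <a> tag encountered."""

    def __init__(self):
        super().__init__()
        self.links = []        # collected (href, text) pairs
        self._in_link = False  # True while inside an <a> element
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._in_link = True
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        if self._in_link:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._in_link:
            self.links.append((self._href, ''.join(self._text).strip()))
            self._in_link = False


parser = LinkExtractor()
parser.feed('<td class="name"><a href="/torrent/32726/0/">'
            'Slackware Linux 13.0 [x86 DVD ISO]</a></td>')
print(parser.links)
```

Unlike BeautifulSoup, this gives you no tree to search afterwards, so filtering by the enclosing td's class would take extra bookkeeping in the handler methods.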

fredley
+3  A: 

BeautifulSoup can also extract node values from your HTML.

from BeautifulSoup import BeautifulSoup

html = ('<html><head><title>Page title</title></head>'
        '<body>'
        '<table><tr>'
        '<td class="name"><a href="/torrent/32726/0/">Slackware Linux 13.0 [x86 DVD ISO]</a></td>'
        '<td class="name"><a href="/torrent/32727/0/">Slackware Linux 14.0 [x86 DVD ISO]</a></td>'
        '<td class="name"><a href="/torrent/32728/0/">Slackware Linux 15.0 [x86 DVD ISO]</a></td>'
        '</tr></table>'
        '</body>'
        '</html>')
soup = BeautifulSoup(html)
links = [td.find('a') for td in soup.findAll('td', { "class" : "name" })]
for link in links:
    print link.string

Output:

Slackware Linux 13.0 [x86 DVD ISO]  
Slackware Linux 14.0 [x86 DVD ISO]  
Slackware Linux 15.0 [x86 DVD ISO]  
systempuntoout
Hey you never used the re module ¬¬
razpeitia
A: 

Documentation: http://www.crummy.com/software/BeautifulSoup/documentation.html

from BeautifulSoup import BeautifulSoup

html = '<td class="name"><a href="/torrent/32726/0/">Slackware Linux 13.0 [x86 DVD ISO]</a></td>'

soup = BeautifulSoup(html)

a = soup.find('td', 'name').find('a')

print a['href']
# /torrent/32726/0/

print a.string
# Slackware Linux 13.0 [x86 DVD ISO]

If you want to parse multiple td elements, you could do something like:

rows = soup.findAll('td', 'name')

for row in rows:
    a = row.find('a')

    print a['href']
    print a.string
Arnar Yngvason
+1  A: 

You could use lxml.html to parse the html document:

from lxml import html

doc = html.parse('http://example.com')

for a in doc.cssselect('td a'):
    print a.get('href')
    print a.text_content()

You will have to look at how the document is structured to find the best way of selecting the links you want (there might be other tables with links that you do not need, etc.); you might, for instance, first want to find the right table element. There are also options besides CSS selectors (XPath, for example) for searching the document or an element.

If you need to, you can turn the links into absolute links with the .make_links_absolute() method. Do it on the document right after parsing, and all the URLs will be absolute, which is very convenient.
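What "absolute" means here can be sketched with the standard library's urljoin, which is the same resolution lxml applies to every link for you (the base URL below is hypothetical, just for illustration):

```python
from urllib.parse import urljoin  # Python 3; on Python 2 this lives in urlparse

# Hypothetical URL of the page the href was scraped from.
base = 'http://example.com/torrents/'

# A root-relative href like the one in the question resolves against
# the scheme and host of the base URL, replacing its path.
print(urljoin(base, '/torrent/32726/0/'))  # http://example.com/torrent/32726/0/
```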

Steven