I'm new to Python and I'm using BeautifulSoup to parse a website and extract data from it. I have the following code:

for line in raw_data: #raw_data is the parsed html separated into smaller blocks
    d = {}
    d['name'] = line.find('div', {'class':'torrentname'}).find('a')
    print d['name']

This prints:

<a href="/ubuntu-9-10-desktop-i386-t3144211.html">
<strong class="red">Ubuntu</strong> 9.10 desktop (i386)</a>

Normally I would be able to extract 'Ubuntu 9.10 desktop (i386)' by writing:

d['name'] = line.find('div', {'class':'torrentname'}).find('a').string

but because of the nested strong tag, .string returns None. Is there a way to extract the strong tags and then use .string, or is there a better way? I have tried using BeautifulSoup's extract() method but I couldn't get it to work.

Edit: I just realized that my solution does not work if there are two sets of strong tags, because the space between the words is left out. How can I fix this?

+1  A: 

Use the ".text" property:

d['name'] = line.find('div', {'class':'torrentname'}).find('a').text

Or do a join on findAll(text=True):

anchor = line.find('div', {'class':'torrentname'}).find('a')
d['name'] = ''.join(anchor.findAll(text=True))
Matt Austin
This doesn't work. It doesn't keep the spaces in an example like this: <strong class="red">Ubuntu</strong> <strong class="red">Linux</strong>. It comes out as UbuntuLinux.
FlowofSoul
I have updated the answer with an additional option.
Matt Austin
Thanks so much, that works great! Could you explain how that second line of code works?
FlowofSoul
The BeautifulSoup documentation says the text argument allows you to "search for NavigableString objects instead of Tags". findAll returns a Python list of those strings, which can then be joined together (.join) to form one string. See http://www.crummy.com/software/BeautifulSoup/documentation.html
Matt Austin
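
To make the explanation above concrete, here is a rough sketch of the same idea using only Python's standard-library html.parser, so it runs without BeautifulSoup installed. It is an analogue of findAll(text=True) followed by ''.join, not BeautifulSoup's own implementation. The names AnchorTextCollector and anchor_text are made up for this illustration, and the sample markup is adapted from the question and the comment above.

```python
from html.parser import HTMLParser


class AnchorTextCollector(HTMLParser):
    """Collect every text node found inside <a> elements (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # how many <a> tags we are currently inside
        self.pieces = []  # text nodes, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Whitespace-only nodes are kept on purpose: the space between
        # two <strong> tags is a text node of its own.
        if self.depth > 0:
            self.pieces.append(data)


def anchor_text(html):
    """Return the text of all <a> elements, joined into one string."""
    parser = AnchorTextCollector()
    parser.feed(html)
    parser.close()
    return "".join(parser.pieces)


# Sample markup assumed for illustration (two <strong> tags, as in the comment).
sample = (
    '<a href="/ubuntu-9-10-desktop-i386-t3144211.html">'
    '<strong class="red">Ubuntu</strong> '
    '<strong class="red">Linux</strong> 9.10</a>'
)
print(anchor_text(sample))  # Ubuntu Linux 9.10
```

The space survives because the parser reports the whitespace between the two strong tags as a separate text node, and joining with an empty string simply concatenates the nodes in order.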