ansaurus

Question

Removing Tags from HTML Parsed with BeautifulSoup

Answer 1

+1 A:

Use the ".text" property:

d['name'] = line.find('div', {'class':'torrentname'}).find('a').text

Or join the results of findAll(text=True), which keeps the whitespace between tags:

anchor = line.find('div', {'class':'torrentname'}).find('a')
d['name'] = ''.join(anchor.findAll(text=True))

Matt Austin 2010-08-29 03:54:02

This doesn't work. It doesn't keep the spaces in an example like this: <strong class="red">Ubuntu</strong> <strong class="red">Linux</strong>. It comes out as UbuntuLinux.

FlowofSoul 2010-08-29 04:24:05

I have updated the answer with an additional option.

Matt Austin 2010-08-29 05:29:17

Thanks so much, that works great! Could you explain how that second line of code works?

FlowofSoul 2010-08-29 15:29:33

The BeautifulSoup documentation says the text argument allows you to "search for NavigableString objects instead of Tags". findAll returns a Python list of those strings, which can then be joined together (''.join) to form one string. See http://www.crummy.com/software/BeautifulSoup/documentation.html

Matt Austin 2010-08-30 04:46:03
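
To make the explanation above concrete, here is a minimal sketch of the "collect the text nodes, then join them" idea. It deliberately does not use BeautifulSoup: it uses the standard library's html.parser as a stand-in, so the TextCollector class and the sample markup are assumptions for illustration only. It also shows one possible reason the space went missing in the first approach, namely that each text node was stripped before joining; the thread does not confirm that this is what happened.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: collect every text node in document order."""

    def __init__(self):
        super().__init__()
        self.strings = []

    def handle_data(self, data):
        # Called for each run of text between tags, including
        # whitespace-only runs such as the space between two tags.
        self.strings.append(data)


# Sample markup modelled on the example from the comments.
html = ('<a href="#"><strong class="red">Ubuntu</strong> '
        '<strong class="red">Linux</strong></a>')

collector = TextCollector()
collector.feed(html)
collector.close()

# Joining the raw text nodes keeps the whitespace-only node.
joined = ''.join(collector.strings)

# Stripping each node before joining throws that space away.
stripped = ''.join(s.strip() for s in collector.strings)

print(collector.strings)
print(joined)
print(stripped)
```

With BeautifulSoup itself, the list of text nodes comes from anchor.findAll(text=True) as in the answer, and the ''.join(...) step is the same.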
