I'm working on some screen scraping software and have run into an issue with Beautiful Soup. I'm using Python 2.4.3 and Beautiful Soup 3.0.7a.

I need to remove an <hr> tag, but it can have many different attributes, so a simple replace() call won't cut it.

Given the following HTML:

<h1>foo</h1>
<h2><hr/>bar</h2>

And the following code:

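from BeautifulSoup import BeautifulSoup

# string holds the HTML shown above
soup = BeautifulSoup(string)

# Remove every <hr> tag, whatever its attributes
for tag in soup.findAll('hr'):
    tag.extract()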

for i in soup.findAll(['h1', 'h2']):
    print i
    print i.string

The output is:

<h1>foo</h1>
foo
<h2>bar</h2>
None

Am I misunderstanding extract(), or is this a bug in Beautiful Soup?

A: 

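It may be a bug. Fortunately, there is another way to get the string: read the tag's next attribute, which points at the element that follows the tag in parse order (here, the text left inside it):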

from BeautifulSoup import BeautifulSoup

string = """<h1>foo</h1>
<h2><hr/>bar</h2>"""

soup = BeautifulSoup(string)

# Remove every <hr> tag from the tree
for tag in soup.findAll('hr'):
    tag.extract()

for i in soup.findAll(['h1', 'h2']):
    print i, i.next

# <h1>foo</h1> foo
# <h2>bar</h2> bar
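
As for why .string comes back as None, one plausible explanation is that this version of Beautiful Soup fills in .string once while parsing and does not recompute it when the tree changes. Under that assumption, <h2> had two children (<hr/> and the text) at parse time, so .string was never set. The toy model below illustrates the assumption with a hypothetical Node class. It is not Beautiful Soup's actual code, and it runs on its own without the library.

```python
class Node(object):
    """Hypothetical stand-in for a parsed tag (not Beautiful Soup's class)."""

    def __init__(self, name, contents=()):
        self.name = name
        self.contents = list(contents)
        # Assumption: .string is decided once, at "parse" time, and only
        # when the tag's single child is a plain string.
        self.string = None
        if len(self.contents) == 1 and isinstance(self.contents[0], str):
            self.string = self.contents[0]

    def remove_child(self, child):
        # Drops the child but deliberately leaves .string untouched.
        self.contents.remove(child)

    def text(self):
        # Recomputes the text from the current children on every call.
        return ''.join([c for c in self.contents if isinstance(c, str)])


h1 = Node('h1', ['foo'])
hr = Node('hr')
h2 = Node('h2', [hr, 'bar'])

h2.remove_child(hr)

print(h1.string)   # foo
print(h2.string)   # None
print(h2.text())   # bar
```

If that assumption holds, reading the text from the tag's current children avoids the stale value. In Beautiful Soup itself, ''.join(i.findAll(text=True)) should have the same effect as text() in the sketch.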