The strategy I used is to replace each unwanted tag with its contents. Children that are
NavigableString objects are kept as they are; any other child is recursed into first, so that
its own unwanted tags are stripped before it is appended. Try this:

    # Python 2 with BeautifulSoup 3
    from BeautifulSoup import BeautifulSoup, NavigableString

    def strip_tags(html, invalid_tags):
        soup = BeautifulSoup(html)

        for tag in soup.findAll(True):
            if tag.name in invalid_tags:
                s = ""

                for c in tag.contents:
                    # A child that is not a plain NavigableString (for example
                    # a nested tag) is stripped recursively before appending.
                    if type(c) != NavigableString:
                        c = strip_tags(unicode(c), invalid_tags)
                    s += unicode(c)

                # Swap the unwanted tag for its flattened contents.
                tag.replaceWith(s)

        return soup

    html = "<p>Good, <b>bad</b>, and <i>ug<b>l</b><u>y</u></i></p>"
    invalid_tags = ['b', 'i', 'u']
    print strip_tags(html, invalid_tags)

The result is:

    <p>Good, bad, and ugly</p>

I gave this same answer on another question. It seems to come up a lot.
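
The code above targets Python 2 and BeautifulSoup 3, which does not run on Python 3. As a rough sketch for Python 3 readers, the same effect (drop the markup of the unwanted tags, keep their text) can be had with only the standard library's `html.parser`. This is a different technique from the recursion above: it streams through the input and re-emits everything except the unwanted start and end tags. The names `TagStripper` and `strip_tags_stdlib` are made up for this example, and the sketch assumes tag names are given in lowercase. It also drops comments and doctype declarations.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Re-emit HTML, omitting unwanted tags but keeping their text (hypothetical helper)."""

    def __init__(self, invalid_tags):
        # convert_charrefs=False so references such as &amp; pass through unchanged
        super().__init__(convert_charrefs=False)
        # HTMLParser lowercases tag names, so these are assumed to be lowercase
        self.invalid_tags = set(invalid_tags)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.invalid_tags:
            # get_starttag_text() returns the start tag as written in the input
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>
        if tag not in self.invalid_tags:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.invalid_tags:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        self.parts.append("&#%s;" % name)


def strip_tags_stdlib(html, invalid_tags):
    parser = TagStripper(invalid_tags)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


html = "<p>Good, <b>bad</b>, and <i>ug<b>l</b><u>y</u></i></p>"
print(strip_tags_stdlib(html, ['b', 'i', 'u']))  # <p>Good, bad, and ugly</p>
```

Because no tree is built, nesting needs no special handling: an unwanted tag inside another unwanted tag is simply skipped when its own start and end tags are encountered.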