views:

736

answers:

3

Currently I have code that does something like this:

soup = BeautifulSoup(value)

    for tag in soup.findAll(True):
        if tag.name not in VALID_TAGS:
            tag.extract()
    soup.renderContents()

Except I don't want to throw away the contents inside the invalid tag. How do I get rid of the tag but keep the contents inside when calling soup.renderContents()?

+1  A: 

You'll presumably have to move tag's children to be children of tag's parent before you remove the tag -- is that what you mean?

If so, then, while inserting the contents in the right place is tricky, something like this should work:

from BeautifulSoup import BeautifulSoup

VALID_TAGS = 'div', 'p'

value = '<div><p>Hello <b>there</b> my friend!</p></div>'

soup = BeautifulSoup(value)

for tag in soup.findAll(True):
    if tag.name not in VALID_TAGS:
        for i, x in enumerate(tag.parent.contents):
          if x == tag: break
        else:
          print "Can't find", tag, "in", tag.parent
          continue
        for r in reversed(tag.contents):
          tag.parent.insert(i, r)
        tag.extract()
print soup.renderContents()

with the example value, this prints <div><p>Hello there my friend!</p></div> as desired.

Alex Martelli
I still want value = "Hello <div>there</div> my friend!" to be valid.
Jason Christa
@Jason, apart from needing an outermost tag, the string you give is perfectly valid and comes out unchanged from the code I give, so I have absolutely no idea what your comment is **about**!
Alex Martelli
+1  A: 

I have a simpler solution but I don't know if there's a drawback to it.

from BeautifulSoup import BeautifulSoup

VALID_TAGS = 'div', 'p'

value = '<div><p>Hello <b>there</b> my friend!</p></div>'

soup = BeautifulSoup(value)

for tag in soup.findAll(True):
    if tag.name not in VALID_TAGS:
        tag.replaceWith(tag.renderContents())

print soup.renderContents()

This will also print <div><p>Hello there my friend!</p></div> as desired.

Etienne
+3  A: 

The strategy I used is to replace a tag with its contents if they are of type NavigableString and if they aren't, then recurse into them and replace their contents with NavigableString, etc. Try this:

from BeautifulSoup import BeautifulSoup, NavigableString

def strip_tags(html, invalid_tags):
  soup = BeautifulSoup(html)
  for tag in soup.findAll(True):
    if tag.name in invalid_tags:
      s = ""
      for c in tag.contents:
        if type(c) != NavigableString:
          c = strip_tags(unicode(c), invalid_tags)
        s += unicode(c)
      tag.replaceWith(s)
  return soup

html = "<p>Good, <b>bad</b>, and <i>ug<b>l</b><u>y</u></i></p>"
invalid_tags = ['b', 'i', 'u']
print strip_tags(html, invalid_tags)

The result is:

<p>Good, bad, and ugly</p>

I gave this same answer on another question. It seems to come up a lot.

Jesse Dhillon