Hey all,

I have a string that I scraped from an XML file, and it contains some HTML formatting tags

(<b>, <i>, etc.)

Is there a quick and easy way to remove all of these tags from the text?

I tried

str = str.replace("<b>","")

and repeated it for the other tags, but that doesn't work.

+1  A: 

The answer depends on your exact needs. You could look at regular expressions, but I would advise using BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/) if you want to clean up bad XML or HTML.
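
For simple, well-formed fragments like the one in the question, a regular expression may be enough. This is only a minimal sketch: `strip_tags_naive` is a hypothetical name, and the pattern assumes no `>` appears inside attribute values, comments, or script blocks, so it will misbehave on messy markup.

```python
import re

# Naive pattern: a "<", then one or more characters that are not ">", then ">".
TAG_RE = re.compile(r"<[^>]+>")

def strip_tags_naive(text):
    """Remove anything that looks like <...>; the text between tags is kept."""
    return TAG_RE.sub("", text)

print(strip_tags_naive("Some <b>bold</b> and <i>italic</i> text"))
```

Unlike chained `str.replace` calls, this catches closing tags and tags with attributes in one pass.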

Achim
+1: I second that. Never use regex for XML/HTML parsing.
eruciform
Doesn't sound like he wants to parse any html, just strip it all away so he is left with plain text (kind of like the innerHTML function).
Stephen Swensen
Stephen, you're correct. I'm not trying to parse the string, I just want to remove the HTML formatting (I want anything inside a <> removed completely).
Alex B
Oops, I meant the innerText property, not the "innerHTML function"
Stephen Swensen
You will not be able to "just" remove the HTML formatting without more sophisticated parsing. It might be possible for some simple samples, but not for complex ones.
Achim
+3  A: 

Using lxml.html:

lxml.html.fromstring(s).text_content()

This strips all tags and converts all entities to their corresponding characters.
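
If lxml is not installed, something similar can be approximated with the standard library's `html.parser`. This is not lxml's implementation, just a sketch of the same idea; `TextExtractor` and `text_content` are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text between tags, discarding the tags themselves."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; before
        # the text is passed to handle_data.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def text_content(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)

print(text_content("<p>Good &amp; <b>bad</b></p>"))
```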

lunaryorn
Thanks! I get AttributeError: 'module' object has no attribute 'html' when I try this though
Alex B
Never mind, it works!
Alex B
Yeah, if you get an AttributeError it's probably your import statement, i.e. you want `import lxml.html` followed by `lxml.html.fromstring(s).text_content()`.
ChrisJF
+1  A: 

Here's how to use the BeautifulSoup module to strip only some tags (keeping their contents), leaving the rest of the HTML alone:

from BeautifulSoup import BeautifulSoup, NavigableString

def strip_tags(html, invalid_tags):
  soup = BeautifulSoup(html)
  for tag in soup.findAll(True):
    if tag.name in invalid_tags:
      s = ""
      for c in tag.contents:
        if type(c) != NavigableString:
          c = strip_tags(unicode(c), invalid_tags)
        s += unicode(c)
      tag.replaceWith(s)
  return soup

html = "<p>Good, <b>bad</b>, and <i>ug<b>l</b><u>y</u></i></p>"
invalid_tags = ['b', 'i', 'u']
print strip_tags(html, invalid_tags)

Result:

<p>Good, bad, and ugly</p>
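
The snippet above targets Python 2 and BeautifulSoup 3 (`print` statement, `unicode`, `from BeautifulSoup import ...`). As a rough Python 3 alternative that needs no third-party package, the same selective stripping can be sketched with the standard library's `html.parser`. `TagFilter` and `strip_selected_tags` are hypothetical names, and this sketch silently drops comments and doctype declarations.

```python
from html.parser import HTMLParser

class TagFilter(HTMLParser):
    """Re-emit the markup, leaving out any tag whose name is in invalid_tags."""

    def __init__(self, invalid_tags):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.invalid_tags = set(invalid_tags)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.invalid_tags:
            # get_starttag_text() returns the tag exactly as it appeared.
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are emitted once, not as a pair.
        if tag not in self.invalid_tags:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.invalid_tags:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

def strip_selected_tags(markup, invalid_tags):
    parser = TagFilter(invalid_tags)
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)

html = "<p>Good, <b>bad</b>, and <i>ug<b>l</b><u>y</u></i></p>"
print(strip_selected_tags(html, ["b", "i", "u"]))
```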
Jesse Dhillon