Hello,

I want to process some HTML code and remove the tags as in the example:

"<p><b>This</b> is a very interesting paragraph.</p>" results in "This is a very interesting paragraph."

I'm using Python. Do you know of any library I could use to remove the HTML tags?

Thanks!

+4  A: 

BeautifulSoup

kevingessner
+6  A: 

This question may help you: http://stackoverflow.com/questions/753052/strip-html-from-strings-in-python

No matter what solution you choose, I'd recommend avoiding regular expressions. They can be slow when processing large strings, they might not work due to invalid HTML, and stripping HTML with regex isn't always secure or reliable.
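As a rough sketch of a parser-based approach that needs nothing beyond the standard library, something like the following could work. The names `TagStripper` and `strip_tags` are made up for this example; only `html.parser.HTMLParser` and its `handle_data` hook come from Python itself.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects only the text nodes of a document (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into plain characters.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser for each run of text between tags.
        self.chunks.append(data)

    def get_text(self):
        return "".join(self.chunks)


def strip_tags(html):
    """Return the text content of `html` with all tags dropped."""
    stripper = TagStripper()
    stripper.feed(html)
    stripper.close()  # flush any text still buffered by the parser
    return stripper.get_text()


print(strip_tags("<p><b>This</b> is a very interesting paragraph.</p>"))
# This is a very interesting paragraph.
```

Note that this keeps the contents of `<script>` and `<style>` elements as text, and the result is plain text rather than sanitized HTML, so it still has to be escaped before being put back into a page.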

Colin O'Dell
It's not merely the case that parsing HTML with regexen is difficult, slow, or inadvisable. The problem is that parsing HTML with regexen is literally [impossible](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags).
Antal S-Z
@Antal - Good point :) I've changed "parsing" to "stripping" in my answer to make it accurate.
Colin O'Dell
+2  A: 
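import libxml2

# parseDoc is an XML parser, so the input must be well-formed.
text = "<p><b>This</b> is a very interesting paragraph.</p>"
root = libxml2.parseDoc(text)
print(root.content)
# This is a very interesting paragraph.

root.freeDoc()  # the bindings do not free the document automatically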
eumiro
A: 

Depending on your needs, you could just use the regular expression /<(.|\n)*?>/ and replace all matches with empty strings. This is fine for one-off manual cleanup of markup you trust, but if you're building this as an application feature, you'll need a more robust and secure option.
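In Python, that pattern might be applied roughly as below. This is only a sketch: `TAG_PATTERN` and `strip_tags_regex` are names invented for the example, and the second call shows one kind of input where the approach goes wrong (a `>` inside an attribute value).

```python
import re

# The pattern from above, written as a Python raw string.
TAG_PATTERN = re.compile(r"<(.|\n)*?>")


def strip_tags_regex(text):
    """Replace every non-greedy <...> match with an empty string."""
    return TAG_PATTERN.sub("", text)


print(strip_tags_regex("<p><b>This</b> is a very interesting paragraph.</p>"))
# This is a very interesting paragraph.

# A '>' inside a quoted attribute ends the match too early,
# so part of the attribute leaks into the output.
print(strip_tags_regex("<a title='a>b'>x</a>"))
# b'>x
```

The second result is why this is only reasonable for input you control.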

Daniel Mendel
A: 

You can use lxml.

ghostdog74