tags:

views:

4750

answers:

8

Because regular expressions scare me, I'm trying to find a way to remove all HTML tags and resolve HTML entities from a string in Python.

+4  A: 

How about parsing the HTML data and extracting the data with the help of the parser ?

I'd try something like the author described in chapter 8.3 in the Dive Into Python (link) book

bernhardrusch
+5  A: 

While I agree with Lucas that regular expressions are not all that scary, I still think that you should go with a specialized HTML parser. This is because the HTML standard is hairy enough (especially if you want to parse arbitrarily "HTML" pages taken off the Internet) that you would need to write a lot of code to handle the corner cases. It seems that python includes one out of the box.

You should also check out the python bindings for TidyLib which can clean up broken HTML, making the success rate of any HTML parsing much higher.

Cd-MaN
+1  A: 

You might need something more complicated than a regular expression. Web pages often have angle brackets that aren't part of a tag, like this:

 <div>5 < 7</div>

Stripping the tags with regex will return the string "5 " and treat

 < 7</div>

as a single tag and strip it out.

I suggest looking for already-written code that does this for you. I did a search and found this: http://zesty.ca/python/scrape.html It also can resolve HTML entities.

yjerem
+6  A: 

Use BeautifulSoup! It's perfect for this, where you have incoming markup of dubious virtue and need to get something reasonable out of it. Just pass in the original text, extract all the string tags, and join them.

John Millikin
+9  A: 

Use lxml which is the best xml/html library for python.

import lxml.html
t = lxml.html.fromstring("...")
t.text_content()

And if you just want to sanitize the html look at the lxml.html.clean module

Peter Hoffmann
A: 

Regular expressions are not scary, but writing your own regexes to strip HTML is a sure path to madness (and it won't work, either). Follow the path of wisdom, and use one of the many good HTML-parsing libraries.

Lucas' example is also broken because "sub" is not a method of a Python string. You'd have to "import re", then call re.sub(pattern, repl, string). But that's neither here nor there, as the correct answer to your question does not involve writing any regexes.

Carl Meyer
A: 

Looking at the amount of sense people are demonstrating in other answers here, I'd say that using a regex probably isn't the best idea for your situation. Go for something tried and tested, and treat my previous answer as a demonstration that regexes need not be that scary.

Lucas Richter
A: 

Actually the link to Dive Into Python should be this

Bartosz Radaczyński