tags:
views: 740
answers: 7

How can I remove all HTML from a string in Python? For example, how can I turn:

blah blah <a href="blah">link</a>

into

blah blah link

Thanks!

+3  A: 

You can use a regular expression to remove all the tags:

>>> import re
>>> s = 'blah blah <a href="blah">link</a>'
>>> re.sub('<[^>]*>', '', s)
'blah blah link'
Pourquoi Litytestdata
You can simplify your regex to '<.*?>', which will accomplish the same result, but this assumes properly formatted HTML, as does yours.
Unkwntech
Do you have to check for quoted >, or are those not allowed? Can you have <a ... title="1>2"> or something?
Daniel LeCheminant
@Unkwntech: I prefer <[^>]*> over <.*?> since the former does not need to keep backtracking to find the end of the tag.
Pourquoi Litytestdata
@Daniel L: Ideally, >s in attributes should be replaced with &gt;. It is possible to modify the above regexp to ignore >s in attributes, but I'll leave that as an exercise for the interested reader (one possibility is sketched below).
Pourquoi Litytestdata
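A minimal sketch of one such modification, assuming attribute values are quoted: treat quoted strings inside a tag as a single unit, so a '>' inside them doesn't end the match.

>>> import re
>>> s = 'blah blah <a href="blah" title="1>2">link</a>'
>>> re.sub(r'''<(?:[^>"']|"[^"]*"|'[^']*')*>''', '', s)
'blah blah link'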
That's not going to work well with things like "line1<br>line2", newlines, double spaces, etc. It also won't decode HTML entities. Quick and dirty might be good enough, but to really do this right you're going to need to use a real HTML library like BeautifulSoup or lxml.
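On the entity point, the standard library can decode the basic named entities; anything beyond &amp;, &lt; and &gt; needs a fuller approach (e.g. BeautifulSoup). A minimal sketch:

>>> from xml.sax.saxutils import unescape
>>> unescape('1 &lt; 2 &amp;&amp; 3 &gt; 2')
'1 < 2 && 3 > 2'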
Why not r'<[^>]+>'? There is no '<>' tag in HTML.
J.F. Sebastian
@J.F. Sebastian: I don't see that it makes a difference worth worrying about.
Pourquoi Litytestdata
+3  A: 

Try Beautiful Soup. Throw away everything except the text.

George V. Reilly
A: 
>>> import re
>>> s = 'blah blah <a href="blah">link</a>'
>>> q = re.compile(r'<.*?>', re.IGNORECASE)
>>> re.sub(q, '', s)
'blah blah link'
Selinap
+9  A: 

When your regular expression solution hits a wall, try this super easy (and reliable) BeautifulSoup program.

from BeautifulSoup import BeautifulSoup

html = "<a> Keep me </a>"
soup = BeautifulSoup(html)

text_parts = soup.findAll(text=True)
text = ''.join(text_parts)
Triptych
BeautifulSoup hits the same wall too. See http://stackoverflow.com/questions/598817/python-html-removal/600471#600471
J.F. Sebastian
+4  A: 

There is also a small library called stripogram which can be used to strip away some or all HTML tags.

You can use it like this:

from stripogram import html2text, html2safehtml
# Only allow <b>, <a>, <i>, <br>, and <p> tags
clean_html = html2safehtml(original_html, valid_tags=("b", "a", "i", "br", "p"))
# Don't process <img> tags, just strip them out. Use an indent of 4 spaces
# and a page that's 80 characters wide.
text = html2text(original_html, ignore_tags=("img",), indent_width=4, page_width=80)

So if you want to simply strip out all HTML, you pass valid_tags=() to the first function.
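For example, a minimal sketch along those lines (assuming stripogram is installed; behaviour as described in the note above):

from stripogram import html2safehtml
# Empty valid_tags: every tag is stripped, leaving only the text
text_only = html2safehtml(original_html, valid_tags=())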

You can find the documentation here.

MrTopf
+2  A: 

html2text will do something like this.
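A minimal sketch, assuming the html2text package is installed and exposes its module-level html2text() function, which converts HTML to Markdown-style plain text:

import html2text

text = html2text.html2text('blah blah <a href="blah">link</a>')
# text is roughly 'blah blah [link](blah)' -- links become Markdown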

RexE
html2text is great for producing nicely formatted, readable output without an extra step. If all the HTML strings you need to convert are as simple as your example, then BeautifulSoup is the way to go. If more complex, html2text does a great job of preserving the readable intent of the original.
Jarret Hardie
+2  A: 

Regexes, BeautifulSoup, html2text don't work if an attribute has '>' in it. See Is “>” (U+003E GREATER-THAN SIGN) allowed inside an html-element attribute value?

An 'HTML/XML parser'-based solution might help in such cases, e.g., stripogram suggested by @MrTopf does work.

Here's an ElementTree-based solution:

# from xml.etree import ElementTree as etree  # stdlib alternative
from lxml import etree

str_ = 'blah blah <a href="blah">link</a> END'
root = etree.fromstring('<html>%s</html>' % str_)
print ''.join(root.itertext()) # lxml or ElementTree 1.3+

Output:

blah blah link END
J.F. Sebastian