tags:
views: 740
answers: 7

How can I remove all HTML from a string in Python? For example, how can I turn:

blah blah <a href="blah">link</a>

into

blah blah link

Thanks!

+3  A: 

You can use a regular expression to remove all the tags:

>>> import re
>>> s = 'blah blah <a href="blah">link</a>'
>>> re.sub('<[^>]*>', '', s)
'blah blah link'
Pourquoi Litytestdata
You can simplify your regex to '<.*?>', which will accomplish the same result, but this assumes properly formatted HTML, as does yours.
Unkwntech
Do you have to check for quoted >, or are those not allowed? Can you have <a ... title="1>2"> or something?
Daniel LeCheminant
@Unkwntech: I prefer <[^>]*> over <.*?> since the former does not need to keep backtracking to find the end of the tag.
Pourquoi Litytestdata
@Daniel L: Ideally, >s in attributes should be replaced with &gt;. It is possible to modify the above regexp to ignore >s in attributes, but I'll leave that as an exercise for the interested reader (one possibility is sketched below).
Pourquoi Litytestdata
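A minimal sketch of one such modification, assuming attribute values are quoted: treat quoted strings inside a tag as a single unit, so a '>' inside them doesn't end the match.

>>> import re
>>> s = 'blah blah <a href="blah" title="1>2">link</a>'
>>> re.sub(r'''<(?:[^>"']|"[^"]*"|'[^']*')*>''', '', s)
'blah blah link'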
That's not going to work well with things like "line1<br>line2", newlines, double spaces, etc. It also won't decode HTML entities. Quick and dirty might be good enough, but to really do this right you're going to need to use a real HTML library like BeautifulSoup or lxml.
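On the entity point, the standard library can decode the basic named entities; anything beyond &amp;, &lt; and &gt; needs a fuller approach (e.g. BeautifulSoup). A minimal sketch:

>>> from xml.sax.saxutils import unescape
>>> unescape('1 &lt; 2 &amp;&amp; 3 &gt; 2')
'1 < 2 && 3 > 2'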
Why not r'<[^>]+>'? There is no '<>' tag in HTML.
J.F. Sebastian
@J.F. Sebastian: I don't see that it makes a difference worth worrying about.
Pourquoi Litytestdata
+3  A: 

Try Beautiful Soup. Throw away everything except the text.

George V. Reilly
A: 
>>> import re
>>> s = 'blah blah <a href="blah">link</a>'
>>> q = re.compile(r'<.*?>', re.IGNORECASE)
>>> re.sub(q, '', s)
'blah blah link'
Selinap
+9  A: 

When your regular expression solution hits a wall, try this super easy (and reliable) BeautifulSoup program.

from BeautifulSoup import BeautifulSoup

html = "<a> Keep me </a>"
soup = BeautifulSoup(html)

text_parts = soup.findAll(text=True)
text = ''.join(text_parts)
Triptych
BeautifulSoup hits the same wall too. See http://stackoverflow.com/questions/598817/python-html-removal/600471#600471
J.F. Sebastian
+4  A: 

There is also a small library called stripogram which can be used to strip away some or all HTML tags.

You can use it like this:

from stripogram import html2text, html2safehtml
# Only allow <b>, <a>, <i>, <br>, and <p> tags
clean_html = html2safehtml(original_html, valid_tags=("b", "a", "i", "br", "p"))
# Don't process <img> tags, just strip them out. Use an indent of 4 spaces
# and a page that's 80 characters wide.
text = html2text(original_html, ignore_tags=("img",), indent_width=4, page_width=80)

So if you want to simply strip out all HTML, you pass valid_tags=() to the first function.
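For example, a minimal sketch along those lines (assuming stripogram is installed; behaviour as described in the note above):

from stripogram import html2safehtml
# Empty valid_tags: every tag is stripped, leaving only the text
text_only = html2safehtml(original_html, valid_tags=())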

You can find the documentation here.

MrTopf
+2  A: 

html2text will do something like this.
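A minimal sketch, assuming the html2text package is installed and exposes its module-level html2text() function, which converts HTML to Markdown-style plain text:

import html2text

text = html2text.html2text('blah blah <a href="blah">link</a>')
# text is roughly 'blah blah [link](blah)' -- links become Markdown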

RexE
html2text is great for producing nicely formatted, readable output without an extra step. If all the HTML strings you need to convert are as simple as your example, then BeautifulSoup is the way to go. If more complex, html2text does a great job of preserving the readable intent of the original.
Jarret Hardie
+2  A: 

Regexes, BeautifulSoup, html2text don't work if an attribute has '>' in it. See Is “>” (U+003E GREATER-THAN SIGN) allowed inside an html-element attribute value?

An 'HTML/XML parser'-based solution might help in such cases, e.g., stripogram suggested by @MrTopf does work.

Here's an ElementTree-based solution:

# from xml.etree import ElementTree as etree  # stdlib alternative
from lxml import etree

str_ = 'blah blah <a href="blah">link</a> END'
root = etree.fromstring('<html>%s</html>' % str_)
print ''.join(root.itertext()) # lxml or ElementTree 1.3+

Output:

blah blah link END
J.F. Sebastian