ansaurus

Question

Using Python, remove HTML tags/formatting from a string

Answer 1

+3 A:

If you are going to use regex:

import re
def striphtml(data):
    p = re.compile(r'<.*?>')
    return p.sub('', data)

>>> striphtml('<a href="foo.com" class="bar">I Want This <b>text!</b></a>')
'I Want This text!'
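One thing the regex leaves untouched is character entities such as `&amp;` or `&gt;`. A minimal sketch of one way to handle that, assuming Python 3.4+ where the standard library provides `html.unescape`; the function name `strip_and_unescape` is made up for this illustration and is not part of the answer above:

```python
import html
import re

# Same pattern as above, compiled once at import time.
TAG_RE = re.compile(r'<.*?>')

def strip_and_unescape(data):
    # Remove the tags first, then decode entities. Doing it in the other
    # order would turn '&lt;b&gt;' into '<b>' and then strip it as a tag.
    return html.unescape(TAG_RE.sub('', data))

print(strip_and_unescape('<p>Fish &amp; chips &gt; salad</p>'))
# Fish & chips > salad
```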

Zonda333 2010-08-03 17:09:10

This will only work reliably on well-formed HTML (i.e., no unescaped `<` or `>` outside of actual tags, no malformed tags like `<b class="forgot-to-close"`, etc.). That said, it is the first approach I'd try, depending on the source data.
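To make those failure cases concrete, here is a small sketch that runs the same pattern on a few inputs that are not well-formed. The sample strings are invented for illustration:

```python
import re

TAG_RE = re.compile(r'<.*?>')

def striphtml(data):
    # Same approach as the answer: delete everything from '<' to the next '>'.
    return TAG_RE.sub('', data)

# Unescaped '<' and '>' in the text are treated as one big tag.
print(repr(striphtml('<p>if a < b and b > c</p>')))      # 'if a  c'

# A '>' inside an attribute value ends the match too early.
print(repr(striphtml('<a title="x > y">link</a>')))      # ' y">link'

# A tag that is never closed is not matched at all.
print(repr(striphtml('<b class="forgot-to-close"text')))  # unchanged
```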

Will McCutchen 2010-08-03 17:26:27

Answer 2

A:

Depending on whether the text itself can contain '>' or '<', I would either write a small function that removes everything between those characters, or use a parsing library.

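def cleanString(inStr):
  # Find the first '<' and the first '>' that follows it.
  a = inStr.find('<')
  b = inStr.find('>', a + 1)
  # Stop when no complete '<...>' pair is left.
  if a < 0 or b < 0:
    return inStr
  # Cut out the tag and recurse on what remains.
  return cleanString(inStr[:a] + inStr[b + 1:])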

snurre 2010-08-03 17:15:44

Answer 3

+4 A:

AFAIK, using regex to parse HTML is a bad idea. You would be better off using an HTML/XML parser like Beautiful Soup.

volting 2010-08-03 17:17:16

+1 for Beautiful Soup

derekerdmann 2010-08-03 17:34:23

I am using Beautiful Soup, but I also want to be able to strip HTML tags manually. Thanks!

Blankman 2010-08-03 18:01:29

@Blankman it would have been a good idea to mention that in your question.

volting 2010-08-03 18:31:15

Answer 4

+1 A:

Use SGMLParser. Regex works in simple cases, but HTML has a lot of intricacies you would rather not have to deal with yourself.

>>> from sgmllib import SGMLParser
>>>
>>> class TextExtracter(SGMLParser):
...     def __init__(self):
...         self.text = []
...         SGMLParser.__init__(self)
...     def handle_data(self, data):
...         self.text.append(data)
...     def getvalue(self):
...         return ''.join(self.text)
...
>>> ex = TextExtracter()
>>> ex.feed('<html>hello &gt; world</html>')
>>> ex.getvalue()
'hello > world'
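Note that `sgmllib` was removed in Python 3. A rough equivalent of the class above using the standard library's `html.parser.HTMLParser` might look like the following. This is a sketch, not a drop-in replacement: the names `TextExtractor` and `strip_tags` are made up here, and `convert_charrefs=True` is assumed so that entities such as `&gt;` are decoded before `handle_data` sees them.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    def __init__(self):
        # convert_charrefs=True decodes entities like '&gt;' in text data.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.parts.append(data)

    def get_text(self):
        return ''.join(self.parts)

def strip_tags(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return parser.get_text()

print(strip_tags('<html>hello &gt; world</html>'))
# hello > world
```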

Wai Yip Tung 2010-08-03 17:32:37

Answer 5

A:

Use lxml.html. It's much faster than BeautifulSoup, and getting the raw text takes a single call.

>>> import lxml.html
>>> page = lxml.html.document_fromstring('<!DOCTYPE html>...</html>')
>>> page.cssselect('body')[0].text_content()
'...'

Tim McNamara 2010-08-03 19:57:46
