ansaurus

Question

Answer 1

+15 A:

I think you will want to look into Beautiful Soup. Once you do, use the advice from this article and strip the HTML elements very simply like this:

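from bs4 import BeautifulSoup  # Beautiful Soup 4; version 3 used "from BeautifulSoup import BeautifulSoup"

# page holds the HTML source as a string
''.join(BeautifulSoup(page, 'html.parser').find_all(string=True))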

Andrew Hare 2009-04-15 18:27:17

Answer 2

+2 A:

You can either use a different HTML parser (such as lxml or Beautiful Soup) that has functions to extract just the text, or run a regex over your string that strips out the tags. See http://www.amk.ca/python/howto/regex/
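As a rough sketch of the regex route, assuming a tag is simply anything between "<" and the next ">": the names TAG_RE and strip_tags_regex are made up for illustration, and the pattern will misfire on comments, scripts, or attribute values that contain ">", so a real parser is safer for untrusted or messy input.

```python
import html
import re

# Assumption: a "tag" is anything from "<" up to the next ">".
TAG_RE = re.compile(r'<[^>]*>')

def strip_tags_regex(markup):
    """Drop everything that looks like a tag, then decode entities such as &amp;."""
    text = TAG_RE.sub('', markup)
    return html.unescape(text)

print(strip_tags_regex('<p>Hello, <b>world</b>!</p>'))
print(strip_tags_regex('Fish &amp; <i>chips</i>'))
```

The html.unescape step is optional; leave it out if you want entities kept exactly as they appear in the source.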

jcoon 2009-04-15 18:31:01

Answer 3

+7 A:

I have always used this function to strip HTML tags, since it requires only the Python standard library.

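from html.parser import HTMLParser  # Python 3; on Python 2 the module is named HTMLParser

class MLStripper(HTMLParser):
    """Collects only the text it is fed and discards the tags."""
    def __init__(self):
        # The base initialiser sets up parser state and calls reset()
        super().__init__()
        self.fed = []
    def handle_data(self, d):
        # Called for each run of text found between tags
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    s.close()  # process any text still buffered by the parser
    return s.get_data()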

2009-05-29 11:47:41

Answer 4

+1 A:

You can write your own function:

def StripTags(text):
    # Repeatedly cut out the first "<...>" span until none is left
    while True:
        start = text.find("<")
        if start < 0:
            break
        stop = text.find(">", start)
        if stop < 0:
            break  # unmatched "<": leave the rest of the text untouched
        text = text[:start] + text[stop + 1:]
    return text

Gunslinger_ 2010-10-04 15:26:49


Strip html from strings in python
