from mechanize import Browser
br = Browser()
br.open('http://somewebpage')
html = br.response().readlines()
for line in html:
  print line

When printing a line from an HTML file, I'm trying to find a way to show only the contents of each HTML element and not the markup itself. If it finds '<a href="whatever.com">some text</a>' it should print only 'some text', '<b>hello</b>' should print 'hello', and so on. How would one go about doing this?

+15  A: 

I think you will want to look into Beautiful Soup. Once you do, use the advice from this article and strip the HTML elements very simply like this:

from BeautifulSoup import BeautifulSoup

''.join(BeautifulSoup(page).findAll(text=True))
Andrew Hare
+2  A: 

You can either use a different HTML parser (like lxml or Beautiful Soup) that has functions to extract just the text, or run a regex on your line string that strips out the tags. See http://www.amk.ca/python/howto/regex/
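
For the regex route, a minimal sketch might look like the following. It is written for Python 3 (an assumption; the question uses Python 2), and the names `TAG_RE` and `strip_tags_regex` are made up for illustration. The pattern is deliberately naive: it treats anything between `<` and the next `>` as a tag, so it will misbehave on comments, script blocks, or a `>` inside an attribute value.

```python
import re

# Naive pattern: a "<", then one or more characters that are not ">", then ">".
TAG_RE = re.compile(r'<[^>]+>')

def strip_tags_regex(line):
    """Remove anything that looks like an HTML tag from one string."""
    return TAG_RE.sub('', line)

print(strip_tags_regex('<a href="whatever.com">some text</a>'))
print(strip_tags_regex('<b>hello</b>'))
```

This is fine for quick, line-by-line cleanup of simple markup; for anything messier, a real parser is the safer choice.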

jcoon
+7  A: 

I have always used this function to strip HTML tags, as it requires only the Python stdlib.

from HTMLParser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()
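
If you are on Python 3, the same idea could be sketched roughly as below. This is an adaptation, not the answer's original code: the module moved to `html.parser`, the base class needs a proper `super().__init__()` call, and the names `TextExtractor` and `strip_tags_py3` are hypothetical.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects only the text between tags and ignores the tags themselves."""

    def __init__(self):
        # convert_charrefs defaults to True, so entities such as &amp;
        # arrive in handle_data already decoded.
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def get_text(self):
        return ''.join(self.parts)

def strip_tags_py3(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.get_text()

print(strip_tags_py3('<a href="whatever.com">some text</a>'))
```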
+1  A: 

You can write your own function:

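def StripTags(text):
    # Repeatedly cut out the first "<...>" span until none is left.
    finished = False
    while not finished:
        finished = True
        start = text.find("<")
        if start >= 0:
            # Look for the closing ">" only after the opening "<".
            stop = text.find(">", start)
            if stop >= 0:
                text = text[:start] + text[stop + 1:]
                finished = False
    return text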
Gunslinger_