I need to do some HTML parsing with python. After some research lxml seems to be my best choice but I am having a hard time finding examples that help me with what I am trying to do. this is why i am hear. I need to scrape a page for all of its viewable text.. strip out all tags and javascript.. I need it to leave me with what text is viewable. sounds simple enough.. i did it with the HTMLParser but its not handling javascript well
class HTML2Text(HTMLParser.HTMLParser):
def __init__(self):
HTMLParser.HTMLParser.__init__(self)
self.output = cStringIO.StringIO()
def get_text(self):
return self.output.getvalue()
def handle_data(self, data):
self.output.write(data)
def ParseHTML(source):
p = HTML2Text()
p.feed(source)
text = p.get_text()
return text
Any Ideas for a way to do this with lxml or a better way to do it HTMLParser.. HTMLParser would be best because no additional libs are needed.. thanks everyone
Scott F.