import sys
from lxml.html.clean import Cleaner

def clean(text):
    try:
        cleaner = Cleaner(scripts=True, embedded=True, meta=True,
                          page_structure=True, links=True, style=True,
                          remove_tags=['a', 'li', 'td'])
        cleaned = cleaner.clean_html(text)  # clean once and reuse the result
        print len(cleaned) - len(text)      # size difference after cleaning
        return cleaned
    except Exception:
        print 'Error in clean_html'
        print sys.exc_info()
        return text

I put together the above (ugly) code as my initial foray into Python land. I'm trying to use the lxml Cleaner to clean up a couple of HTML pages, so that in the end I'm left with just the text and nothing else. But try as I might, the above doesn't seem to work: I'm still left with a substantial amount of markup (and it doesn't appear to be broken HTML), and in particular links, which aren't getting removed despite the arguments I pass via remove_tags and links=True.

Any idea what's going on? Perhaps I'm barking up the wrong tree with lxml? I thought this was the way to go for HTML parsing in Python.

+2  A: 

I think you should check out Beautiful Soup. Use the advice from this article and strip the HTML elements in the following way:

from BeautifulSoup import BeautifulSoup

''.join(BeautifulSoup(page).findAll(text=True))

Where page is your string of HTML.
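
For instance, here is a minimal self-contained sketch of that approach (the sample page string is made up purely for illustration):

from BeautifulSoup import BeautifulSoup

page = '<p>Hello <a href="http://example.com">world</a>, some <b>bold</b> text</p>'
# findAll(text=True) collects every text node, skipping all tags
text = ''.join(BeautifulSoup(page).findAll(text=True))
print text   # -> Hello world, some bold text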

Should you need further clarification, you can check out the Dive into Python case study on HTML parsing.

KushalP
It seems BS is deprecated (and googling suggests lxml is the way forward), so ideally I wanted to learn some lxml [as the documentation is mildly bewildering..]
sadhu_