I have a simple script that fetches an HTML page and passes it to BeautifulSoup to remove all script and style tags. I then want to pass the resulting HTML to another method. Is there an easy way to do this? I skimmed BeautifulSoup.py but haven't found one yet.

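from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 (Python 2)

soup = BeautifulSoup(html)  # html holds the fetched page source
# Remove each matched tag itself, rather than re-querying the tree
for script in soup("script"):
    script.extract()

for style in soup("style"):
    style.extract()
contents = soup.html.contents  # a list of node objects, not an HTML string
text = loader.extract_text(contents)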

contents = soup.html.contents only returns a list of BeautifulSoup node objects, not an HTML string. Is there a method that returns the raw HTML after soup has modified it? Or do I need to walk the contents list and piece the HTML back together, excluding the script and style tags?

Or is there an even better solution to accomplish what I want?

+1  A: 

unicode(soup) gives you the HTML as a string. (On Python 3 with the newer bs4 package, str(soup) does the same.)

Also what you want is this:

for elem in soup.findAll(['script', 'style']):
    elem.extract()
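
For anyone on Python 3, here is a minimal end-to-end sketch of the same idea. It assumes the newer bs4 package (where findAll is spelled find_all and str(soup) replaces unicode(soup)) and the standard-library "html.parser" backend. The function name strip_scripts_and_styles and the sample markup are made up for illustration and are not from the question.

```python
from bs4 import BeautifulSoup


def strip_scripts_and_styles(html):
    """Return the HTML as a string with all <script> and <style> tags removed."""
    # "html.parser" is the parser bundled with Python; other parsers may
    # serialize the document slightly differently.
    soup = BeautifulSoup(html, "html.parser")
    # Passing a list matches any tag whose name is in the list.
    for elem in soup.find_all(["script", "style"]):
        elem.extract()  # detach the tag (and its contents) from the tree
    # str() serializes the modified tree back to markup.
    return str(soup)


# Hypothetical sample page, used only to exercise the function.
sample_html = (
    "<html><head><style>p { color: red; }</style></head>"
    "<body><script>alert('hi');</script><p>Hello</p></body></html>"
)
cleaned = strip_scripts_and_styles(sample_html)
print(cleaned)
```

The returned string can then be handed to whatever method needs the cleaned markup, such as the loader.extract_text call in the question.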
THC4k
Hah, so simple. Thanks!
Hallik