I need to search through an HTML document, find all of the spans, and delete the spans and everything between them. What would a regular expression look like that would match everything between <span> and </span>?
-1: there are a lot of cases where this fails, nested <span>s for instance.
nosklo
2009-01-15 13:43:18
+18
A:
Use BeautifulSoup. Or be sad. HTML and regular expressions don't mix.
Here's the entire program:
import urllib2
from BeautifulSoup import BeautifulSoup
# Grab your html
html = urllib2.urlopen("http://www.google.com").read()
# Create a soup object
soup = BeautifulSoup(html)
# Find all the spans, even if they're malformed
spans = soup.findAll("span")
# Remove all the spans from the soup object
[span.extract() for span in spans]
# Dump your new HTML to stdout.
print soup
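If installing BeautifulSoup is not an option, the same parse-then-remove idea can be roughly sketched with Python 3's standard-library html.parser. This swaps in a different technique from the program above: a depth counter tracks nesting instead of a parse tree. The names SpanStripper and strip_spans are hypothetical, chosen only for illustration, and the sketch assumes every <span> is properly closed.

```python
from html.parser import HTMLParser


class SpanStripper(HTMLParser):
    """Re-emit HTML while dropping every <span> element and its contents."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as written
        super().__init__(convert_charrefs=False)
        self.depth = 0   # how many <span> elements we are currently inside
        self.out = []    # pieces of markup to keep

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.depth += 1
        elif self.depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>; a self-closed <span/> is dropped
        if tag != "span" and self.depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "span":
            self.depth = max(self.depth - 1, 0)
        elif self.depth == 0:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if self.depth == 0:
            self.out.append(data)

    def handle_entityref(self, name):
        if self.depth == 0:
            self.out.append("&%s;" % name)

    def handle_charref(self, name):
        if self.depth == 0:
            self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        # Text inside a comment is never parsed as tags, so a "<span>"
        # written in a comment is left alone
        if self.depth == 0:
            self.out.append("<!--%s-->" % data)


def strip_spans(html):
    """Return html with all <span> elements (nested ones too) removed."""
    parser = SpanStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


nested = strip_spans("<p>a <span>b <span>c</span> d</span> e</p>")
commented = strip_spans("<!-- <span>x</span> --><b>y</b>")
print(nested)
print(commented)
```

Note that with this sketch an unclosed <span> swallows the rest of the document, which is one reason a more forgiving parser such as BeautifulSoup is the safer choice for malformed pages.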
Triptych
2009-01-15 04:44:55
While I agree, for this particular thing there's no reason to introduce BeautifulSoup.
Nick Stinemates
2009-01-15 04:49:08
no? how about a span in a comment? or as a string in javascript code? or one that's malformed?
Triptych
2009-01-15 05:00:30
This is good, however, I think using a list comprehension solely for a side-effect is bad form. Recommend a plain for loop here.
Dustin
2009-01-15 05:15:54
I should also download a gzipped version of the HTML, wrap it in a try/except block, encode the output, etc. Just trying to keep it simple.
Triptych
2009-01-15 05:29:44
I second Dustin's opinion. Don't use a list comprehension if you don't need a list of the results.
nosklo
2009-01-15 13:41:11
agreed, using list comprehensions for their side effects is ++ungood, but sadly it's in the original documentation :(
hop
2009-01-16 02:17:19
A:
You have to be careful with regular expressions: they won't work if spans are nested.
BeautifulSoup looks like a nice tool.
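To make the nesting problem concrete, here is a small sketch using Python's re module. The pattern SPAN_RE is only one plausible naive attempt, assumed here for illustration and not taken from the question; the sample strings are made up.

```python
import re

# A naive non-greedy pattern for "a span and everything inside it"
SPAN_RE = re.compile(r"<span\b[^>]*>.*?</span>", re.DOTALL | re.IGNORECASE)

flat = "<p>keep <span>drop</span> keep</p>"
nested = "<p>keep <span>outer <span>inner</span> tail</span> keep</p>"

# Flat spans are removed cleanly
flat_result = SPAN_RE.sub("", flat)

# With nesting, the match stops at the FIRST </span>, so part of the
# outer span's text and its closing tag are left behind
nested_result = SPAN_RE.sub("", nested)

print(flat_result)    # <p>keep  keep</p>
print(nested_result)  # <p>keep  tail</span> keep</p>
```

Making the pattern greedy does not help either: it would then run from the first <span> to the last </span> in the document and delete everything in between, including content that was never inside a span.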
Norman Ramsey
2009-01-15 06:15:28