I need to search through an HTML document, find all of the spans, and delete the spans and everything between them. What would a regular expression look like that would match everything between <span> and </span>?

A: 

<span.+?<\/span> will match the tags and anything in between them.

-1: there are a lot of cases where this fails, nested <span>s for instance.
nosklo
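
To make the objection concrete, here is a small sketch that applies the pattern from the answer to two made-up inputs. The sample strings and variable names are illustrative assumptions, not part of the original question. It shows the nested case leaving a stray closing tag behind, and a span that contains a newline not matching at all, because "." does not match newlines unless re.DOTALL is set.

```python
import re

# The pattern from the answer above, used verbatim.
SPAN_RE = re.compile(r"<span.+?<\/span>")

# Hypothetical inputs chosen to show two failure modes.
nested = "<p>a<span>outer<span>inner</span>tail</span>b</p>"
multiline = "<p>a<span>one\ntwo</span>b</p>"

# The lazy match stops at the FIRST </span>, so the outer span's
# remaining text and closing tag are left in the output.
nested_result = SPAN_RE.sub("", nested)

# "." does not cross the newline, so nothing is removed here.
multiline_result = SPAN_RE.sub("", multiline)

print(nested_result)
print(multiline_result)
```

Adding re.DOTALL would fix the second case but not the first; matching nested pairs needs something that can count open tags.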
+18  A: 

Use BeautifulSoup. Or be sad. HTML and regular expressions don't mix.

Here's the entire program:

import urllib.request
from bs4 import BeautifulSoup

# Grab your html
html  = urllib.request.urlopen("http://www.google.com").read()

# Create a soup object (html.parser is the parser bundled with Python)
soup  = BeautifulSoup(html, "html.parser")

# Find all the spans, even if they're malformed
spans = soup.find_all("span")

# Remove all the spans from the soup object
[span.extract() for span in spans]

# Dump your new HTML to stdout.
print(soup)
Triptych
While I agree, for this particular thing there's no reason to introduce Beautiful Soup.
Nick Stinemates
No? How about a span in a comment? Or as a string in JavaScript code? Or one that's malformed?
Triptych
This is good, however, I think using a list comprehension solely for a side-effect is bad form. Recommend a plain for loop here.
Dustin
I should also download a gzipped version of the HTML, wrap it in a try/except block, encode the output, etc. Just trying to keep it simple.
Triptych
I second Dustin's opinion. Don't use a list comprehension if you don't need a list of the results.
nosklo
agreed, using list comprehensions for their side effects is ++ungood, but sadly it's in the original documentation :(
hop
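
For reference, the plain for loop the commenters recommend might look like the sketch below. It assumes the beautifulsoup4 package (imported as bs4) is installed and uses a made-up inline HTML string instead of fetching a page, so it runs without network access.

```python
from bs4 import BeautifulSoup  # assumes the beautifulsoup4 package is installed

# Hypothetical sample document with nested spans.
html = "<p>a<span>outer<span>inner</span>tail</span>b</p>"
soup = BeautifulSoup(html, "html.parser")

# A plain loop makes it obvious that extract() is called for its
# side effect (detaching the tag), not to build a list of results.
for span in soup.find_all("span"):
    span.extract()

result = str(soup)
print(result)
```

The behaviour is the same as the list comprehension; the loop just avoids building a throwaway list.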
A: 

You have to be careful with regular expressions: they won't work if spans are nested.

BeautifulSoup looks like a nice tool.

Norman Ramsey
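
If adding a dependency is not an option, nesting can also be handled with the standard library's html.parser by counting how many spans are currently open. This is only a sketch of that alternative technique, not something proposed in the answers above; SpanStripper and strip_spans are hypothetical names. End tags are re-emitted in normalized lowercase form, and an unclosed <span> would swallow the rest of the document.

```python
from html.parser import HTMLParser


class SpanStripper(HTMLParser):
    """Re-emit the input, skipping every <span> element and its contents."""

    def __init__(self):
        # convert_charrefs=False so entities like &amp; pass through unchanged.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.depth = 0  # number of <span> elements currently open

    def _emit(self, text):
        # Only keep text that is not inside any span.
        if self.depth == 0:
            self.out.append(text)

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.depth += 1
        else:
            self._emit(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>: emit once, never touch depth.
        if tag != "span":
            self._emit(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "span":
            self.depth = max(0, self.depth - 1)  # ignore stray </span>
        else:
            self._emit("</%s>" % tag)

    def handle_data(self, data):
        self._emit(data)

    def handle_entityref(self, name):
        self._emit("&%s;" % name)

    def handle_charref(self, name):
        self._emit("&#%s;" % name)

    def handle_comment(self, data):
        # Comments are copied verbatim, even if they mention <span>.
        self._emit("<!--%s-->" % data)

    def handle_decl(self, decl):
        self._emit("<!%s>" % decl)


def strip_spans(html):
    parser = SpanStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(strip_spans("<p>a<span>outer<span>inner</span>tail</span>b</p>"))
```

Because the parser treats comments as comments, a span mentioned inside one is left alone, which is one of the cases raised in the comments on the BeautifulSoup answer.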