ansaurus

Question

Regular Expression in Python - Parsing html

Answer 1

A:

<span.+?<\/span> will match the tags and anything in between them.

2009-01-15 04:34:28

-1: there are a lot of cases where this fails, nested <span>s for instance.

nosklo 2009-01-15 13:43:18
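
The nested-span failure mentioned in the comment above can be reproduced with a short sketch. The sample string is invented for illustration; only the pattern comes from the answer (the `\/` escape is dropped because Python does not need it).

```python
import re

# Pattern from the answer above.
SPAN_RE = re.compile(r"<span.+?</span>")

# Hypothetical input with one span nested inside another.
html = "a<span>outer <span>inner</span> tail</span>b"

# The lazy match stops at the FIRST closing tag, so the outer span's
# tail text and its closing tag are left behind.
stripped = SPAN_RE.sub("", html)
print(stripped)  # a tail</span>b
```

Since `.` does not match newlines by default, a span that stretches over several lines would also be skipped unless `re.DOTALL` is passed.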

Answer 2

+18 A:

Use BeautifulSoup. Or be sad. HTML and regular expressions don't mix.

Here's the entire program:

import urllib2
from BeautifulSoup import BeautifulSoup

# Grab your html
html  = urllib2.urlopen("http://www.google.com").read()

# Create a soup object
soup  = BeautifulSoup(html)

# Find all the spans, even if they're malformed
spans = soup.findAll("span")

# Remove all the spans from the soup object
[span.extract() for span in spans]

# Dump your new HTML to stdout.
print soup

Triptych 2009-01-15 04:44:55

While I agree, for this particular task there's no reason to introduce BeautifulSoup.

Nick Stinemates 2009-01-15 04:49:08

No? How about a span in a comment? Or as a string in JavaScript code? Or one that's malformed?

Triptych 2009-01-15 05:00:30
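
To make the comment and JavaScript cases concrete, here is a minimal sketch comparing the regex with a real parser. It uses the standard library's `html.parser` rather than BeautifulSoup so that it runs without extra packages. The `SpanCounter` class and the sample markup are hypothetical, made up for this illustration.

```python
import re
from html.parser import HTMLParser


class SpanCounter(HTMLParser):
    """Hypothetical helper: counts only real <span> start tags."""

    def __init__(self):
        super().__init__()
        self.span_tags = 0

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.span_tags += 1


# Invented sample: one span in a comment, one inside a JavaScript
# string, and one real element.
html = (
    "<!-- <span>commented out</span> -->"
    "<script>var s = '<span>in a string</span>';</script>"
    "<span>real</span>"
)

counter = SpanCounter()
counter.feed(html)
counter.close()

# The regex cannot tell markup from text that merely looks like markup.
regex_hits = len(re.findall(r"<span.+?</span>", html))

print(counter.span_tags, regex_hits)
```

The parser reports a single span element, while the regex also matches the two look-alikes in the comment and the script string.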

This is good, however, I think using a list comprehension solely for a side-effect is bad form. Recommend a plain for loop here.

Dustin 2009-01-15 05:15:54

I should also download a gzipped version of the HTML, wrap it in a try/except block, encode the output, etc. Just trying to keep it simple.

Triptych 2009-01-15 05:29:44

I second Dustin's opinion. Don't use a list comprehension if you don't need a list of the results.

nosklo 2009-01-15 13:41:11

agreed, using list comprehensions for their side effects is ++ungood, but sadly it's in the original documentation :(

hop 2009-01-16 02:17:19
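
The plain for loop suggested in the comments might look like the sketch below. It assumes Python 3 and the `bs4` package (BeautifulSoup 4) instead of the Python 2 / BeautifulSoup 3 setup in the answer, and it parses an invented inline string rather than fetching a page, so the result is reproducible.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with a nested span.
html = "<p>keep <span>drop <span>nested</span></span> this</p>"

# "html.parser" is the parser bundled with Python; other parsers
# (lxml, html5lib) may treat malformed input differently.
soup = BeautifulSoup(html, "html.parser")

# find_all() returns a list-like snapshot, so removing elements
# while looping over it is safe.
for span in soup.find_all("span"):
    span.extract()

print(soup)
```

Extracting the outer span also removes everything nested inside it, so the inner span never shows up in the output.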

Answer 3

A:

You have to be careful with regular expressions: they won't work if spans are nested.

BeautifulSoup looks like a nice tool.

Norman Ramsey 2009-01-15 06:15:28
