Hi all, I have a question about Python regular expressions.

I would like to match a div block like

<div class="leftTail"><ul class="hotnews">any news stuff</ul></div>

I was thinking of a pattern like

p = re.compile(r'<div\s+class=\"leftTail\">[^(div)]+</div>')

but it doesn't seem to work properly.

I also tried another pattern:

p = re.compile(r'<div\s+class=\"leftTail\">[\W|\w]+</div>')

With this one I get much more than I expected: it matches everything up to the last closing tag in the file.

Thanks for any help

+4  A: 

Don't use regular expressions to parse XML or HTML. You'll never be able to get it to work correctly for nested divs.
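
To illustrate what a parser-based approach might look like, here is a minimal sketch using only the standard library's `html.parser`. The class name `DivExtractor` and the sample `html` string are made up for this example; the idea is simply to count how deep we are inside the target div, so a nested `</div>` does not end the block early.

```python
from html.parser import HTMLParser


class DivExtractor(HTMLParser):
    """Collect the text inside <div class="leftTail">, tracking nested divs."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # 0 means we are outside the target div
        self.chunks = []    # text pieces found inside the target div

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth > 0:
            self.depth += 1          # a nested div inside the target
        elif dict(attrs).get("class") == "leftTail":
            self.depth = 1           # entered the target div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1          # back to 0 once the target div closes

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


# Hypothetical sample input with a nested div inside the target block.
html = (
    '<div class="leftTail"><ul class="hotnews">any news stuff</ul>'
    '<div>nested</div> tail</div><div>other</div>'
)
parser = DivExtractor()
parser.feed(html)
parser.close()
text = "".join(parser.chunks)
print(text)
```

This only handles an exact `class="leftTail"` attribute and assumes the divs are properly closed; a full parser library copes with messier markup.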

Laurence Gonsalves
Actually, you can, but it's a complete PITA and you're far better off to just use a proper X/HTML parser.
Matthew Scharley
"Real" regular expressions can't deal with nesting. That's what separates regular languages from context free languages. Most regex implementations are more powerful than strictly regular, but most still aren't powerful enough to deal with nesting.
Laurence Gonsalves
+11  A: 

You might want to consider graduating to an actual HTML parser. I suggest you give Beautiful Soup a try. There are many crazy ways for HTML to be formatted, and the regular expressions may not work correctly all the time, even if you write them correctly.
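
As a rough sketch of how that could look for the div in the question (assuming the `beautifulsoup4` package is installed, and using a made-up `html` string in place of the real page):

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Hypothetical sample input; in practice this would be the page source.
html = (
    '<div class="leftTail"><ul class="hotnews">any news stuff</ul></div>'
    '<div>other</div>'
)

soup = BeautifulSoup(html, "html.parser")
block = soup.find("div", class_="leftTail")  # None if there is no such div

if block is not None:
    print(block)             # the whole div, markup included
    print(block.get_text())  # just the text inside it
```

`find` returns only the first match; `find_all` with the same arguments would return every `leftTail` div on the page.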

steveha
Thanks, Beautiful Soup works great!
xlione
+2  A: 

Try this:

p = re.compile(r'<div\s+class=\"leftTail\">.*?</div>')
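
A quick way to check the non-greedy version is sketched below. Two things here are additions, not part of the pattern above: the `re.DOTALL` flag (by default `.` does not match newlines, so a div spanning several lines would not match without it) and the sample `html` string. Note that `.*?` stops at the first `</div>`, so this only works if the block contains no nested divs.

```python
import re

# Non-greedy match; re.DOTALL lets "." also match newline characters.
p = re.compile(r'<div\s+class="leftTail">.*?</div>', re.DOTALL)

# Hypothetical sample input with the block spread over several lines.
html = (
    '<div class="leftTail"><ul class="hotnews">\nany news stuff\n</ul></div>'
    '<p>more</p><div>other</div>'
)

m = p.search(html)
matched = m.group(0) if m else None
print(matched)
```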
Rubens Farias