I'd like to match the contents of each paragraph in an HTML document using a Python regular expression. These paragraphs always contain BR tags, like so:

<p class="thisClass">this is nice <br /><br /> isn't it?</p>

I'm currently using this pattern:

pattern = re.compile(r'<p class="thisClass">(.*?)</p>')

Then I'm using:

pattern.findall(html)

to find all the matches. However, it only matches two of the 28 paragraphs I have, and it looks like that's because those two don't have BR tags inside them and the rest do. What am I doing wrong? What can I do to fix it? Thanks!

+5  A: 

I don't think it is failing because of the <br/> but rather because the paragraph is spread across multiple lines. By default, "." matches any character except a newline, so (.*?) cannot span a line break. Use the DOTALL mode to fix this:

pattern = re.compile(r'<p class="thisClass">(.*?)</p>', re.DOTALL)
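
To see the difference, here is a minimal sketch that runs the same pattern with and without the flag. The sample HTML string and the variable names are made up for illustration; the real page will differ.

```python
import re

# Hypothetical sample input: one single-line paragraph and one that wraps
# across lines around its <br /> tags.
html = (
    '<p class="thisClass">short one</p>\n'
    '<p class="thisClass">this is nice <br />\n'
    "<br /> isn't it?</p>\n"
)

regex = r'<p class="thisClass">(.*?)</p>'

# Without DOTALL, "." stops at "\n", so only the single-line paragraph matches.
without_flag = re.findall(regex, html)

# With DOTALL, "." also matches "\n", so both paragraphs match.
with_flag = re.findall(regex, html, re.DOTALL)

print(without_flag)
print(with_flag)
```

If the paragraphs that fail are the ones containing line breaks, this is likely the same situation as in the question.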
Nadia Alramli
Or rather you should use the re.DOTALL mode to make the dot also match newlines. http://www.regular-expressions.info/python.html
Rene Saarsoo
@Rene, thanks, you are right. I've fixed my answer.
Nadia Alramli
Thanks for your answer! I know it's a newbie mistake ;)
sotangochips
+2  A: 

It turns out the answer was to include re.S as a flag, which allows the "." character to match newlines as well.

pattern = re.compile(r'<p class="thisClass">(.*?)</p>', re.S)

This works perfectly.

sotangochips
re.S is just the short alias for re.DOTALL.
Nadia Alramli
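
A quick way to check that the two spellings are interchangeable, as a small sketch (the test string and variable names are only illustrative):

```python
import re

# re.S and re.DOTALL name the same flag, so this comparison holds.
print(re.S == re.DOTALL)

# Either spelling lets "." match the newline between "a" and "b".
text = "a\nb"
short_form = re.findall(r"a.b", text, re.S)
long_form = re.findall(r"a.b", text, re.DOTALL)
print(short_form == long_form)
```

Which one to use is a matter of taste; re.DOTALL is a bit more self-explanatory to someone reading the code later.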