I have the following VB.Net 2.0 in an ASP.Net app:

output = Regex.Replace(output, "<p>(?:(?:\<\!\-\-.*?\-\-\>)|&(?:nbsp|\#0*160|x0*A0);|<br\s*/?>|[\s\u00A0]+)*</p>", String.Empty, RegexOptions.Compiled Or RegexOptions.CultureInvariant Or RegexOptions.IgnoreCase Or RegexOptions.Singleline)

Examples it matches correctly:

  • <p></p>
  • <p> </p>
  • <p><br/><br/></p>
  • <p><!-- comment --><!-- comment --></p>
  • <p>&nbsp;&nbsp;</p>
  • <p><br/>&nbsp;</p>
  • <p><!-- comment --><br/><!-- comment --></p>
  • <p>&nbsp;<br/></p>

An example I'd like it to match, but it doesn't:

  • <p > <!--[if !supportLineBreakNewLine]--><br /> <!--[endif]--></p>

How do I make the groups and repetitions work the way I want them to?

Edit: oops, forgot the comment group. Edit #2: oops, forgot a fail. Edit #3: fixed examples. Edit #4: updated regex based on answers

Conclusion:

Here are my benchmark results for all three answers. Since all three now match everything, I ran each one through 10,000 iterations on a block of text:

Mine:
<p\s*>(?:(?:<!--.*?-->)|&(?:nbsp|\#0*160|x0*A0);|<br\s*/?>|[\s\u00A0]+)*</p>
6.312

Gumbo:
<p\s*>(?:[\s\u00A0]+|&(?:nbsp|\#0*160|x0*A0);|<br\s*/?>|<!--(?:[^-]+|-(?!-))*-->)*</p>
6.05

steamer25:
<p\s*>(?:(?:\&nbsp\;)|(?:\&\#0*160\;)|(?:<br\s*/?>)|\s|\u00A0|<!\-\-[^(?:\-\-)]*\-\->)*</p>
6.121

Gumbo's was the fastest, so I'll mark his as the correct answer.
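
For anyone who wants to repeat the comparison, here is a rough sketch of how such a timing run could be set up with `System.Diagnostics.Stopwatch`. The original test harness and block of text are not shown in this thread, so the `Sample` string, the module name, and the `TimeReplace` helper are all hypothetical, and the numbers it prints will not reproduce the figures above.

```vbnet
Imports System
Imports System.Diagnostics
Imports System.Text.RegularExpressions

Module RegexTimingSketch
    Private Const Iterations As Integer = 10000

    ' Hypothetical input; the block of text used for the figures above is not shown.
    Private Const Sample As String = _
        "<p>Keep me</p><p > <!--[if !supportLineBreakNewLine]--><br /> <!--[endif]--></p><p>&nbsp;</p>"

    ' Hypothetical helper: times Iterations calls to Replace for one pattern.
    Function TimeReplace(ByVal pattern As String) As Double
        Dim opts As RegexOptions = RegexOptions.Compiled Or RegexOptions.CultureInvariant _
            Or RegexOptions.IgnoreCase Or RegexOptions.Singleline
        Dim re As New Regex(pattern, opts)

        ' Warm-up call so one-time setup is not counted in the loop.
        re.Replace(Sample, String.Empty)

        Dim watch As Stopwatch = Stopwatch.StartNew()
        For i As Integer = 1 To Iterations
            re.Replace(Sample, String.Empty)
        Next
        watch.Stop()
        Return watch.Elapsed.TotalSeconds
    End Function

    Sub Main()
        Dim names() As String = {"Mine", "Gumbo", "steamer25"}
        Dim patterns() As String = { _
            "<p\s*>(?:(?:<!--.*?-->)|&(?:nbsp|\#0*160|x0*A0);|<br\s*/?>|[\s\u00A0]+)*</p>", _
            "<p\s*>(?:[\s\u00A0]+|&(?:nbsp|\#0*160|x0*A0);|<br\s*/?>|<!--(?:[^-]+|-(?!-))*-->)*</p>", _
            "<p\s*>(?:(?:\&nbsp\;)|(?:\&\#0*160\;)|(?:<br\s*/?>)|\s|\u00A0|<!\-\-[^(?:\-\-)]*\-\->)*</p>"}

        For i As Integer = 0 To patterns.Length - 1
            Console.WriteLine("{0}: {1:F3} s", names(i), TimeReplace(patterns(i)))
        Next
    End Sub
End Module
```

With differences this small, the ranking may well change with the input text and the machine, so it is probably worth timing against real page content before drawing conclusions.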

+1  A: 

Try this regular expression:

<p\s*>(?:[\s\u00A0]+|&(?:nbsp|\#0*160|x0*A0);|<br\s*/?>|<!--(?:[^-]+|-(?!-))*-->)*</p>
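
One behavioural difference that may matter more than speed: the comment branch `<!--(?:[^-]+|-(?!-))*-->` cannot step over a `--`, whereas `<!--.*?-->` can be stretched by backtracking across the end of one comment to the end of a later one. The sketch below illustrates this with a made-up input in which real text sits between two comments; the module and variable names are illustrative only.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module CommentBranchDemo
    Sub Main()
        Dim opts As RegexOptions = RegexOptions.IgnoreCase Or RegexOptions.Singleline

        ' Comment branch written as <!--.*?--> (as in the question).
        Dim lazyDot As New Regex( _
            "<p\s*>(?:(?:<!--.*?-->)|&(?:nbsp|\#0*160|x0*A0);|<br\s*/?>|[\s\u00A0]+)*</p>", opts)

        ' Comment branch from this answer: the comment body may not contain "--".
        Dim noDoubleDash As New Regex( _
            "<p\s*>(?:[\s\u00A0]+|&(?:nbsp|\#0*160|x0*A0);|<br\s*/?>|<!--(?:[^-]+|-(?!-))*-->)*</p>", opts)

        ' Hypothetical input: a paragraph that is NOT empty.
        Dim sample As String = "<p><!-- a -->visible text<!-- b --></p>"

        Console.WriteLine("lazy dot:       {0}", lazyDot.IsMatch(sample))
        Console.WriteLine("no double dash: {0}", noDoubleDash.IsMatch(sample))
    End Sub
End Module
```

The first pattern should report True here, because `.*?` is allowed to grow from `<!-- a -->` all the way to the end of `<!-- b -->` once the shorter attempt fails, so the paragraph would be removed along with its text. The second should report False.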
Gumbo
Seems to be missing a ')' somewhere
travis
Ah, I had to escape the '#'. It still doesn't seem to match that last item.
travis
+1  A: 
<p\s*>(?:(?:\&nbsp\;)|(?:\&\#0*160\;)|(?:<br\s*/?>)|\s|\u00A0|<!\-\-[^(?:\-\-)]*\-\->)*</p>

You don't need to escape the angle brackets (< and >), and I've added a branch that matches comments.
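
A caveat about the comment branch: inside square brackets, `(?:\-\-)` is not a group. `[^(?:\-\-)]` is a negated character class that excludes the single characters `(`, `?`, `:`, `-`, and `)`, so a comment containing any of them is not matched. A minimal sketch of the effect, using made-up inputs and illustrative names:

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module CharacterClassDemo
    Sub Main()
        ' Pattern from this answer, unchanged.
        Dim re As New Regex( _
            "<p\s*>(?:(?:\&nbsp\;)|(?:\&\#0*160\;)|(?:<br\s*/?>)|\s|\u00A0|<!\-\-[^(?:\-\-)]*\-\->)*</p>", _
            RegexOptions.IgnoreCase Or RegexOptions.Singleline)

        ' Hypothetical inputs: the second comment differs only by containing ":".
        Dim plain As String = "<p><!-- remove later --></p>"
        Dim withColon As String = "<p><!-- note: remove later --></p>"

        Console.WriteLine("plain comment: {0}", re.IsMatch(plain))
        Console.WriteLine("with colon:    {0}", re.IsMatch(withColon))
    End Sub
End Module
```

The first line should print True and the second False. The conditional comments in the question's failing example happen to avoid all five excluded characters, which is why this pattern still handles them.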

steamer25
Thanks for the tip on the angle brackets. It still doesn't match that last item correctly, though.
travis
+1  A: 

UGH! I see my problem: it was in the <p> tag itself, not the grouping.

<p\s*>(?:(?:<!--.*?-->)|&(?:nbsp|\#0*160|x0*A0);|<br\s*/?>|[\s\u00A0]+)*</p>

Notice the \s* in the opening tag, which lets it match <p > as well as <p>. Points for all!
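
As a quick check of that change, here is a minimal sketch that runs the pattern with and without the `\s*` against a few samples. The module and variable names are illustrative only, `RegexOptions.Compiled` is left out for brevity, and the last sample (`&#xA0;`) is an extra case that is not among the examples in the question.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module EmptyParagraphDemo
    Sub Main()
        Dim opts As RegexOptions = RegexOptions.CultureInvariant _
            Or RegexOptions.IgnoreCase Or RegexOptions.Singleline

        ' Everything after the opening tag is identical in both patterns.
        Dim tail As String = _
            "(?:(?:<!--.*?-->)|&(?:nbsp|\#0*160|x0*A0);|<br\s*/?>|[\s\u00A0]+)*</p>"

        Dim before As New Regex("<p>" & tail, opts)
        Dim after As New Regex("<p\s*>" & tail, opts)

        Dim samples() As String = { _
            "<p></p>", _
            "<p>&nbsp;<br/></p>", _
            "<p > <!--[if !supportLineBreakNewLine]--><br /> <!--[endif]--></p>", _
            "<p>&#xA0;</p>"}

        For Each sample As String In samples
            Console.WriteLine("{0} | before: {1} | after: {2}", _
                sample, before.IsMatch(sample), after.IsMatch(sample))
        Next
    End Sub
End Module
```

The third sample should match only the second pattern, since the literal `<p>` can never match `<p >`. Neither pattern should match the last sample: the third entity alternative is `x0*A0`, so it looks for `&xA0;` rather than `&#xA0;`. If the hexadecimal entity is meant to be covered, `\#x0*A0` is presumably what was intended, though that is an assumption rather than something stated in the thread.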

travis