This one is a real head scratcher for me...

var matches = Regex.Matches("<p>test something<script language=\"javascript\">alert('hello');</script> and here's <b>bold</b> and <i>italic</i> and <a href=\"http://popw.com/\">link</a>.</p>", "</?(?!p|a|b|i)\b[^>]*>");

The regex is supposed to capture any HTML tag (open or close) that's not p, a, b, or i. I've plugged the input string and regex into countless testing pages, and every one of them returns the script tag (open and close) as matches. But it absolutely doesn't work in the code: the matches variable has a count of 0.

Am I missing something incredibly obvious?

+8  A: 

You forgot to escape the backslash in the pattern string.

"</?(?!p|a|b|i)\\b[^>]*>"
Guffa
Or, I should have used the C# verbatim string literal indicator. Duh. @"</?(?!p|a|b|i)\b[^>]*>"
Jeff Putz
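
To make the escaping issue concrete, here is a minimal sketch that runs the question's input against the three spellings of the pattern. The class name EscapeDemo and the helper CountMatches are made up for this illustration; only Regex.Matches comes from the original code. In a regular C# string, "\b" is the backspace character (U+0008), so the regex engine looks for a literal backspace instead of a word boundary.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeDemo
{
    // Input string from the question.
    public const string Html =
        "<p>test something<script language=\"javascript\">alert('hello');</script>" +
        " and here's <b>bold</b> and <i>italic</i> and <a href=\"http://popw.com/\">link</a>.</p>";

    // Hypothetical helper: number of matches of a pattern in the sample input.
    public static int CountMatches(string pattern) => Regex.Matches(Html, pattern).Count;

    public static void Main()
    {
        string unescaped = "</?(?!p|a|b|i)\b[^>]*>";   // "\b" becomes a backspace character
        string escaped   = "</?(?!p|a|b|i)\\b[^>]*>";  // "\\b" reaches the regex engine as \b
        string verbatim  = @"</?(?!p|a|b|i)\b[^>]*>";  // verbatim literal, no escaping needed

        Console.WriteLine(unescaped.Contains("\u0008")); // True
        Console.WriteLine(escaped == verbatim);          // True
        Console.WriteLine(CountMatches(unescaped));      // 0
        Console.WriteLine(CountMatches(verbatim));       // 2 (the opening and closing script tags)
    }
}
```

The escaped and verbatim literals produce the same string, so choosing between them is purely a matter of readability.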
A: 

(?! ) is a negative look-ahead. It matches zero characters if its contained pattern does not match from the current position.

(?!p|a|b|i)\b will look at the next character to see if it matches p|a|b|i. If it does, the look-ahead fails to match anything. If the contained pattern fails to match, the look-ahead succeeds, and the engine tries to match the next token in the pattern from the same position, in this case a word boundary. As a result, any tag whose name merely starts with p, a, b, or i is skipped as well.

What you want is probably something like this:

@"</?(?!(?:p|a|b|i)\b)\w+[^>]*>"

It looks ahead for something that matches (?:p|a|b|i)\b. If that pattern fails to match, the look-ahead succeeds, and it will match at least one word character, followed by any number of characters up until the closing ">".

MizardX
No, honestly what I had was what I wanted, and it passes all of my unit tests. I just made the stupid mistake of not writing it as a verbatim string literal (or escaping the \ as Guffa suggested).
Jeff Putz
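
For anyone comparing the two patterns from this thread, the following rough sketch runs both against a made-up sample string (the sample, the class name LookaheadDemo, and the helper Tags are assumptions for illustration, not from the original posts). The original pattern skips every tag whose name starts with p, a, b, or i, while the pattern with \b inside the look-ahead skips only the tags named exactly p, a, b, or i. Which behaviour is right depends on what the tests are meant to allow.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class LookaheadDemo
{
    // Pattern from the question (as a verbatim literal).
    public const string Original = @"</?(?!p|a|b|i)\b[^>]*>";
    // Pattern suggested in the second answer.
    public const string Anchored = @"</?(?!(?:p|a|b|i)\b)\w+[^>]*>";

    // Hypothetical helper: all matched tags, in order of appearance.
    public static string[] Tags(string input, string pattern) =>
        Regex.Matches(input, pattern).Cast<Match>().Select(m => m.Value).ToArray();

    public static void Main()
    {
        // Made-up sample containing tags that merely start with p, b, or i.
        string sample =
            "<p>one</p><pre>two</pre><br><img src=\"x.png\"><b>three</b><script>four</script>";

        Console.WriteLine(string.Join(" ", Tags(sample, Original)));
        // <script> </script>

        Console.WriteLine(string.Join(" ", Tags(sample, Anchored)));
        // <pre> </pre> <br> <img src="x.png"> <script> </script>
    }
}
```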