I'm using the HTML-sanitizing whitelist code found here:
http://refactormycode.com/codes/333-sanitize-html

I needed to add the "font" tag as an additional tag to match, so I tried adding this condition after the <img tag check:

if (tagname.StartsWith("<font"))
{
    // detailed <font> tag checking
    // Non-escaped expression (for testing in a Regex editor app)
    // ^<font(\s*size="\d{1}")?(\s*color="((#[0-9a-f]{6})|(#[0-9a-f]{3})|red|green|blue|black|white)")?(\s*face="(Arial|Courier New|Garamond|Georgia|Tahoma|Verdana)")?\s*?>$
    if (!IsMatch(tagname, @"<font
                            (\s*size=""\d{1}"")?
                            (\s*color=""((#[0-9a-f]{6})|(#[0-9a-f]{3})|red|green|blue|black|white)"")?
                            (\s*face=""(Arial|Courier New|Garamond|Georgia|Tahoma|Verdana)"")?
                             \s*?>"))
    {
        html = html.Remove(tag.Index, tag.Length);
    }
}

Aside from the condition above, my code is almost identical to the code on the page I linked to. When I test this in C#, it throws an exception saying "Not enough )'s". I've counted the parentheses several times and run the expression through a few online JavaScript-based regex testers, and none of them report any problems.

Am I missing something in my regex that is causing a parenthesis to be escaped? What do I need to do to fix this?

UPDATE
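After a lot of trial and error, I remembered that the # sign starts a comment in regexes when RegexOptions.IgnorePatternWhitespace is on, which it is in the IsMatch helper I'm using. The key to fixing this is to escape the # character. In case anyone else comes across the same problem, I've included my fix: each # sign is escaped, and the space in "Courier New" is written as \s because literal spaces in the pattern are ignored under the same option.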

if (tagname.StartsWith("<font"))
{
    // detailed <font> tag checking
    // Non-escaped expression (for testing in a Regex editor app)
    // ^<font(\s*size="\d{1}")?(\s*color="((#[0-9a-f]{6})|(#[0-9a-f]{3})|red|green|blue|black|white)")?(\s*face="(Arial|Courier New|Garamond|Georgia|Tahoma|Verdana)")?\s*?>$
    if (!IsMatch(tagname, @"<font
                            (\s*size=""\d{1}"")?
                            (\s*color=""((\#[0-9a-f]{6})|(\#[0-9a-f]{3})|red|green|blue|black|white)"")?
                            (\s*face=""(Arial|Courier\sNew|Garamond|Georgia|Tahoma|Verdana)"")?
                             \s*?>"))
    {
        html = html.Remove(tag.Index, tag.Length);
    }
}
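For anyone who wants to reproduce the behaviour in isolation, here is a minimal sketch. The class name HashCommentDemo and the helper Compiles are made up for illustration and are not part of the sanitizer code; the only assumption is that the pattern is compiled with RegexOptions.IgnorePatternWhitespace.

```csharp
using System;
using System.Text.RegularExpressions;

static class HashCommentDemo
{
    // Hypothetical helper: returns true if the pattern compiles under the given options.
    public static bool Compiles(string pattern, RegexOptions options)
    {
        try
        {
            new Regex(pattern, options);
            return true;
        }
        catch (ArgumentException) // regex parse errors are ArgumentExceptions
        {
            return false;
        }
    }

    static void Main()
    {
        const RegexOptions x = RegexOptions.IgnorePatternWhitespace;
        string unescaped = @"(#[0-9a-f]{6})";   // under x, everything after # is a comment
        string escaped = @"(\#[0-9a-f]{6})";    // \# is always a literal hash

        Console.WriteLine(Compiles(unescaped, RegexOptions.None)); // True
        Console.WriteLine(Compiles(unescaped, x));                 // False: "(" is never closed
        Console.WriteLine(Compiles(escaped, x));                   // True
        Console.WriteLine(Regex.IsMatch("#ff0000", "^" + escaped + "$", x)); // True
    }
}
```

With the option on, the unescaped # swallows the rest of the line, including the closing parenthesis, which is consistent with the "Not enough )'s" message.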
+2  A: 

I don't see anything obviously wrong with the regex. I would try isolating the problem by removing pieces of the regex until the problem goes away and then focus on the part that causes the issue.
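One way to automate that kind of isolation is sketched below. It works in the opposite direction (adding pieces until compilation fails rather than removing them), and the names RegexBisect and FirstFailingPiece are hypothetical. The options argument is also an assumption: pass whatever options your own IsMatch helper really uses.

```csharp
using System;
using System.Text.RegularExpressions;

static class RegexBisect
{
    // Compiles ever-longer prefixes of the pattern (pieces joined by newlines)
    // and returns the index of the first piece that breaks compilation, or -1.
    public static int FirstFailingPiece(string[] pieces, RegexOptions options)
    {
        for (int count = 1; count <= pieces.Length; count++)
        {
            string pattern = string.Join("\n", pieces, 0, count);
            try
            {
                new Regex(pattern, options);
            }
            catch (ArgumentException)
            {
                return count - 1;
            }
        }
        return -1;
    }

    static void Main()
    {
        // The pattern from the question, split at its line breaks.
        string[] pieces =
        {
            @"<font",
            @"(\s*size=""\d{1}"")?",
            @"(\s*color=""((#[0-9a-f]{6})|(#[0-9a-f]{3})|red|green|blue|black|white)"")?",
            @"(\s*face=""(Arial|Courier New|Garamond|Georgia|Tahoma|Verdana)"")?",
            @"\s*?>",
        };

        Console.WriteLine(FirstFailingPiece(pieces, RegexOptions.None));                    // -1
        Console.WriteLine(FirstFailingPiece(pieces, RegexOptions.IgnorePatternWhitespace)); // 2
    }
}
```

Each piece has to be balanced on its own for this to work, which is the case when the pattern is split at its existing line breaks.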

Robert Gamble
I'm not sure it has anything to do with the regex - it works fine for me
Marc Gravell
+1  A: 

It works fine for me... what version of the .NET framework are you using, and what is the exact exception?

Also - what does your IsMatch method look like? Is this just a pass-through to Regex.IsMatch?

[update] The problem is that the OP's example code didn't show that they are using the IgnorePatternWhitespace regex option. With this option the pattern fails to parse; without it (i.e. as presented) the code is fine.

Marc Gravell
+4  A: 

Your IsMatch method uses RegexOptions.IgnorePatternWhitespace, which allows comments inside the regular expression, so you have to escape the # character; otherwise it will be interpreted as the start of a comment.

if (!IsMatch(tagname,@"<font(\s*size=""\d{1}"")?
    (\s*color=""((\#[0-9a-f]{6})|(\#[0-9a-f]{3})|red|green|blue|black|white)"")?
    (\s*face=""(Arial|Courier\sNew|Garamond|Georgia|Tahoma|Verdana)"")?
    \s?>"))
{
    html = html.Remove(tag.Index, tag.Length);
}
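The same option also drops unescaped whitespace from the pattern, which matters for a face name such as "Courier New". A small sketch of that effect follows; PatternWhitespaceDemo and MatchesFace are illustrative names only, not part of the sanitizer.

```csharp
using System;
using System.Text.RegularExpressions;

static class PatternWhitespaceDemo
{
    const RegexOptions X = RegexOptions.IgnorePatternWhitespace;

    // Hypothetical helper: matches the whole input against one face-name pattern.
    public static bool MatchesFace(string facePattern, string input)
    {
        return Regex.IsMatch(input, "^(" + facePattern + ")$", X);
    }

    static void Main()
    {
        // The literal space in the pattern is ignored, so it behaves like "CourierNew".
        Console.WriteLine(MatchesFace("Courier New", "Courier New"));   // False
        Console.WriteLine(MatchesFace("Courier New", "CourierNew"));    // True

        // \s, or a space inside a character class, survives the option.
        Console.WriteLine(MatchesFace(@"Courier\sNew", "Courier New")); // True
        Console.WriteLine(MatchesFace("Courier[ ]New", "Courier New")); // True
    }
}
```

Note that \s also accepts tabs and newlines, so the [ ] form is the stricter choice if only a single space should be allowed.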
CMS
+1  A: 
Dan Finucane
The order of the attributes will always be the same for me because of the text editor control I'm using. I don't need to escape my " because of the @ sign. That's a good catch with "courier new." I didn't see that one.
Dan Herbert