tags:

views:

1326

answers:

6

I am tying to make comments in a blog engine XSS-safe. Tried a lot of different approaches but find it very difficult.

When I am displaying the comments I am first using Microsoft AntiXss 3.0 to html encode the whole thing. Then I am trying to html decode the safe tags using a whitelist approach.

Been looking at Steve Downing's example in Atwood's "Sanitize HTML" thread at refactormycode.

My problem is that the AntiXss library encodes the values to &#DECIMAL; notation and I don't know how to rewrite Steve's example, since my regex knowledge is limited.

I tried the following code where I simply replaced entities to decimal form but it does not work properly.

< with <
> with >

My rewrite:

class HtmlSanitizer
{
    /// <summary>
    /// A regex that matches things that look like a HTML tag after HtmlEncoding.  Splits the input so we can get discrete
    /// chunks that start with &lt; and ends with either end of line or &gt;
    /// </summary>
    private static Regex _tags = new Regex("&#60;(?!&#62;).+?(&#62;|$)", RegexOptions.Singleline | RegexOptions.ExplicitCapture | RegexOptions.Compiled);


    /// <summary>
    /// A regex that will match tags on the whitelist, so we can run them through 
    /// HttpUtility.HtmlDecode
    /// FIXME - Could be improved, since this might decode &gt; etc in the middle of
    /// an a/link tag (i.e. in the text in between the opening and closing tag)
    /// </summary>
    private static Regex _whitelist = new Regex(@"
^&#60;/?(a|b(lockquote)?|code|em|h(1|2|3)|i|li|ol|p(re)?|s(ub|up|trong|trike)?|ul)&#62;$
|^&#60;(b|h)r\s?/?&#62;$
|^&#60;a(?!&#62;).+?&#62;$
|^&#60;img(?!&#62;).+?/?&#62;$",


      RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace |
      RegexOptions.ExplicitCapture | RegexOptions.Compiled);

    /// <summary>
    /// HtmlDecode any potentially safe HTML tags from the provided HtmlEncoded HTML input using 
    /// a whitelist based approach, leaving the dangerous tags Encoded HTML tags
    /// </summary>
    public static string Sanitize(string html)
    {

        string tagname = "";
        Match tag;
        MatchCollection tags = _tags.Matches(html);
        string safeHtml = "";

        // iterate through all HTML tags in the input
        for (int i = tags.Count - 1; i > -1; i--)
        {
            tag = tags[i];
            tagname = tag.Value.ToLowerInvariant();

            if (_whitelist.IsMatch(tagname))
            {
                // If we find a tag on the whitelist, run it through 
                // HtmlDecode, and re-insert it into the text
                safeHtml = HttpUtility.HtmlDecode(tag.Value);
                html = html.Remove(tag.Index, tag.Length);
                html = html.Insert(tag.Index, safeHtml);
            }

        }

        return html;
    }

}

My input testing html is:

<p><script language="javascript">alert('XSS')</script><b>bold should work</b></p>

After AntiXss it turns into:

&#60;p&#62;&#60;script language&#61;&#34;javascript&#34;&#62;alert&#40;&#39;XSS&#39;&#41;&#60;&#47;script&#62;&#60;b&#62;bold should work&#60;&#47;b&#62;&#60;&#47;p&#62;

When I run the version of Sanitize(string html) above it gives me:

<p><script language="javascript">alert&#40;&#39;XSS&#39;&#41;</script><b>bold should work</b></p>

The regex is matching script from the whitelist which I don't want. Any help with this would be highly appreciated.

+1  A: 

Have you considered using Markdown or VBCode or some similar approaches for the users to mark their comments up with? Then you can disallow all HTML.

If you must allow HTML then I would consider using a HTML parser (in the spirit of HTMLTidy) and do the white-listing there.

PEZ
+1  A: 

Hi Pez. Yes I am using the WMD editor with markdown, but I want the users to be able to post HTML and code examples like on Stack Overflow, so I don't want to disallow HTML completely.

I have been looking at HTML Tidy but not tried it yet. I am however using the Html Agility Pack to make sure the HTML is correct (no orphan tags). This is done before I run AntiXss.

I will try out HTML Tidy if I can't make my current solution work as I like, thanks for the suggestion.

jesperlind
But on Stackoverflow all HTML is escaped. No whiste-listing. Or am I mistaken?
PEZ
jesperlind
I think you are right, perhaps disabling html and only accepting markdown would be sufficient. Wonder if the comment textbox on Stack Overflow uses WDM editor as well as the answer box?
jesperlind
I checked Html Agility Pack up some. On the surface it looks like it could assist in the whitelisting. But anyway, better escape all HTML.
PEZ
A: 

I'm on a Mac so I can't test your C# code. But to me it seems like you should make the _whitelist regexp only work with the tag names. It might mean you have to make two passes, one for opening and one for closing tags. But it will make it much simpler.

PEZ
Thanks for trying this out and your helpful comments. I'm on Mac also but using Visual Studio in a VMware Fusion partition of Vista.
jesperlind
+1  A: 

Your problem is that C# is missinterpretating your regexp. You need to escape the #-sign. Without the escape it matches too much.

private static Regex _whitelist = new Regex(@"
    ^&\#60;(&\#47;)? (a|b(lockquote)?|code|em|h(1|2|3)|i|li|ol|p(re)?|s(ub|up|trong|trike)?|ul)&\#62;$
    |^&\#60;(b|h)r\s?(&\#47;)?&\#62;$
    |^&\#60;a(?!&\#62;).+?&\#62;$
    |^&\#60;img(?!&\#62;).+?(&\#47;)?&\#62;$",

    RegexOptions.Singleline |
    RegexOptions.IgnorePatternWhitespace |
    RegexOptions.ExplicitCapture 
    RegexOptions.Compiled
 );

Update 2: You might be interested in this xss and regexp site.

some
Thanks. Figured out this myself from reading regex references a minute ago. With your version the starting tags are replaces properly. But the ending tags are not matched since the slash is encoded to decimal form / by AntiXss. But now we are close.
jesperlind
@jesperlind: Oh, I didn't look for that. Updated the script, changed all/ to (/)
some
Beware for attributes on a and img tags... I suggest you handle these specially and only allow href and src.
some
@some: Thank you very much, this works great now. My poor RegEx knowledge got a lot better by this exercise.
jesperlind
@jesperlind: No problem, I learnt some c#! I added some links that you might be interested in.
some
Thanks for the links. I used to read hack.ers.org but but my rss stopped working a while ago for some reason. Resubscribed. The JavaScript testing site for RegEx is very interesting.
jesperlind
A: 

I will here post the complete code again (slightly refactored and with updated comments) if anybody is interested in using this.

I also decided to remove the img tag from the whitelist as @Pez and @some pointed out that this can be dangerous to allow.

Also have to point out that I have not tested this properly against possible XSS attacks. It's just a stating point for my to se how well this method works.

class HtmlSanitizer
{
    /// <summary>
    /// A regex that matches things that look like a HTML tag after HtmlEncoding to &#DECIMAL; notation. Microsoft AntiXSS 3.0 can be used to preform this. Splits the input so we can get discrete
    /// chunks that start with &#60; and ends with either end of line or &#62;
    /// </summary>
    private static readonly Regex _tags = new Regex(@"&\#60;(?!&\#62;).+?(&\#62;|$)", RegexOptions.Singleline | RegexOptions.ExplicitCapture | RegexOptions.Compiled);


    /// <summary>
    /// A regex that will match tags on the whitelist, so we can run them through 
    /// HttpUtility.HtmlDecode
    /// FIXME - Could be improved, since this might decode &#60; etc in the middle of
    /// an a/link tag (i.e. in the text in between the opening and closing tag)
    /// </summary>

    private static readonly Regex _whitelist = new Regex(@"
^&\#60;(&\#47;)? (a|b(lockquote)?|code|em|h(1|2|3)|i|li|ol|p(re)?|s(ub|up|trong|trike)?|ul)&\#62;$
|^&\#60;(b|h)r\s?(&\#47;)?&\#62;$
|^&\#60;a(?!&\#62;).+?&\#62;$",


      RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace |
      RegexOptions.ExplicitCapture | RegexOptions.Compiled);

    /// <summary>
    /// HtmlDecode any potentially safe HTML tags from the provided HtmlEncoded HTML input using 
    /// a whitelist based approach, leaving the dangerous tags Encoded HTML tags
    /// </summary>
    public static string Sanitize(string html)
    {
        Match tag;
        MatchCollection tags = _tags.Matches(html);

        // iterate through all HTML tags in the input
        for (int i = tags.Count - 1; i > -1; i--)
        {
            tag = tags[i];
            string tagname = tag.Value.ToLowerInvariant();

            if (_whitelist.IsMatch(tagname))
            {
                // If we find a tag on the whitelist, run it through 
                // HtmlDecode, and re-insert it into the text
                string safeHtml = HttpUtility.HtmlDecode(tag.Value);
                html = html.Remove(tag.Index, tag.Length);
                html = html.Insert(tag.Index, safeHtml);
            }
        }
        return html;
    }
}
jesperlind
You still have to sanitize the content of the a-tag, or it is possible to use href="javascript:evil" or onmouseover="evil" etc....
some
A: 

link text

hi "hi", thanks for the google link, but I've already tried that ;)
jesperlind