ansaurus

Question

Sanitize html encoded text (#decimal notation) from AntiXSS v3 output

Answer 1

+1 A:

Have you considered using Markdown or VBCode or some similar approaches for the users to mark their comments up with? Then you can disallow all HTML.

If you must allow HTML then I would consider using an HTML parser (in the spirit of HTMLTidy) and doing the whitelisting there.

PEZ 2008-12-28 16:13:06

Answer 2

+1 A:

Hi Pez. Yes I am using the WMD editor with markdown, but I want the users to be able to post HTML and code examples like on Stack Overflow, so I don't want to disallow HTML completely.

I have been looking at HTML Tidy but not tried it yet. I am however using the Html Agility Pack to make sure the HTML is correct (no orphan tags). This is done before I run AntiXss.

I will try out HTML Tidy if I can't make my current solution work as I like, thanks for the suggestion.

jesperlind 2008-12-28 16:30:17

But on Stack Overflow all HTML is escaped. No whitelisting. Or am I mistaken?

PEZ 2008-12-28 16:43:37

jesperlind 2008-12-28 16:56:42

I think you are right; perhaps disabling HTML and only accepting Markdown would be sufficient. I wonder if the comment textbox on Stack Overflow uses the WMD editor as well as the answer box?

jesperlind 2008-12-28 17:00:24

I looked into Html Agility Pack a bit. On the surface it looks like it could assist with the whitelisting. But anyway, it is better to escape all HTML.

PEZ 2008-12-28 17:11:50

Answer 3

A:

I'm on a Mac so I can't test your C# code. But to me it seems like you should make the _whitelist regexp only work with the tag names. It might mean you have to make two passes, one for opening and one for closing tags. But it will make it much simpler.
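As a rough illustration of the tag-name-only idea, here is a minimal sketch that assumes the input is a single AntiXSS-encoded tag such as &#60;b&#62; or &#60;&#47;b&#62;. The class name TagNameWhitelist, the method IsAllowed and the Allowed set are hypothetical; the tag names are copied from the whitelist regex discussed in this thread. Tags with attributes are deliberately rejected here and would need separate handling.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TagNameWhitelist
{
    // Hypothetical allow-list, taken from the tag names in the thread's whitelist regex.
    private static readonly HashSet<string> Allowed = new HashSet<string>
    {
        "a", "b", "blockquote", "code", "em", "h1", "h2", "h3", "i", "li",
        "ol", "p", "pre", "s", "sub", "sup", "strong", "strike", "ul"
    };

    // Pass 1: encoded opening tags without attributes, e.g. &#60;b&#62;
    private static readonly Regex Opening =
        new Regex(@"^&#60;(?<name>[a-z][a-z0-9]*)&#62;$");

    // Pass 2: encoded closing tags, e.g. &#60;&#47;b&#62;
    private static readonly Regex Closing =
        new Regex(@"^&#60;&#47;(?<name>[a-z][a-z0-9]*)&#62;$");

    public static bool IsAllowed(string encodedTag)
    {
        string lowered = encodedTag.ToLowerInvariant();
        Match m = Opening.Match(lowered);
        if (!m.Success)
        {
            m = Closing.Match(lowered);
        }
        // Only the captured tag name is compared against the allow-list.
        return m.Success && Allowed.Contains(m.Groups["name"].Value);
    }
}
```

Because IgnorePatternWhitespace is not used in this sketch, the # characters need no escaping. Keeping the names in a set rather than inside the pattern makes the list easier to review and extend.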

PEZ 2008-12-28 16:51:06

Thanks for trying this out and your helpful comments. I'm on Mac also but using Visual Studio in a VMware Fusion partition of Vista.

jesperlind 2008-12-28 18:20:12

Answer 4

+1 A:

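Your problem is that C# is misinterpreting your regexp. With RegexOptions.IgnorePatternWhitespace, an unescaped # starts a comment that runs to the end of the line, so you need to escape the #-sign. Without the escape the rest of each line is ignored and the pattern matches too much.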

private static Regex _whitelist = new Regex(@"
    ^&\#60;(&\#47;)? (a|b(lockquote)?|code|em|h(1|2|3)|i|li|ol|p(re)?|s(ub|up|trong|trike)?|ul)&\#62;$
    |^&\#60;(b|h)r\s?(&\#47;)?&\#62;$
    |^&\#60;a(?!&\#62;).+?&\#62;$
    |^&\#60;img(?!&\#62;).+?(&\#47;)?&\#62;$",

    RegexOptions.Singleline |
    RegexOptions.IgnorePatternWhitespace |
    RegexOptions.ExplicitCapture |
    RegexOptions.Compiled
 );
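
The effect of the unescaped # can be seen in isolation with a small sketch. The class and field names below (HashEscapeDemo, Unescaped, Escaped) are made up for illustration, and the pattern is reduced to a single tag so the difference is easy to see.

```csharp
using System.Text.RegularExpressions;

public static class HashEscapeDemo
{
    // Unescaped '#': under IgnorePatternWhitespace everything from '#' to the
    // end of the line is treated as a comment, so this pattern is effectively "^&".
    public static readonly Regex Unescaped =
        new Regex(@"^&#60;b&#62;$", RegexOptions.IgnorePatternWhitespace);

    // Escaped '\#': the '#' is a literal character, so the whole pattern is used.
    public static readonly Regex Escaped =
        new Regex(@"^&\#60;b&\#62;$", RegexOptions.IgnorePatternWhitespace);
}
```

With these definitions, Unescaped.IsMatch("&#60;script&#62;") is true, since any input starting with & matches, while Escaped.IsMatch accepts only the encoded b tag.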

Update 2: You might be interested in this XSS and regexp site.

some 2008-12-28 17:13:06

Thanks. I figured this out myself from reading regex references a minute ago. With your version the starting tags are replaced properly. But the ending tags are not matched, since the slash is encoded to decimal form &#47; by AntiXss. But now we are close.

jesperlind 2008-12-28 17:39:36

@jesperlind: Oh, I didn't look for that. Updated the script, changed every / to (&\#47;)

some 2008-12-28 17:53:48

Beware of attributes on a and img tags... I suggest you handle these specially and only allow href and src.

some 2008-12-28 17:59:13

@some: Thank you very much, this works great now. My poor RegEx knowledge got a lot better by this exercise.

jesperlind 2008-12-28 18:01:38

@jesperlind: No problem, I learnt some C#! I added some links that you might be interested in.

some 2008-12-28 18:12:37

Thanks for the links. I used to read hack.ers.org but my RSS feed stopped working a while ago for some reason. Resubscribed. The JavaScript testing site for RegEx is very interesting.

jesperlind 2008-12-28 19:01:18

Answer 5

A:

Here is the complete code again (slightly refactored and with updated comments), in case anybody is interested in using it.

I also decided to remove the img tag from the whitelist as @Pez and @some pointed out that this can be dangerous to allow.

I also have to point out that I have not tested this properly against possible XSS attacks. It's just a starting point for me to see how well this method works.

using System.Text.RegularExpressions;
using System.Web;

class HtmlSanitizer
{
    /// <summary>
    /// A regex that matches things that look like an HTML tag after HTML-encoding to &#DECIMAL; notation
    /// (Microsoft AntiXSS 3.0 can be used to perform this). Splits the input so we can get discrete
    /// chunks that start with &#60; and end with either the end of the input or &#62;
    /// </summary>
    private static readonly Regex _tags = new Regex(@"&\#60;(?!&\#62;).+?(&\#62;|$)", RegexOptions.Singleline | RegexOptions.ExplicitCapture | RegexOptions.Compiled);


    /// <summary>
    /// A regex that will match tags on the whitelist, so we can run them through 
    /// HttpUtility.HtmlDecode
    /// FIXME - Could be improved, since this might decode &#60; etc in the middle of
    /// an a/link tag (i.e. in the text in between the opening and closing tag)
    /// </summary>

    private static readonly Regex _whitelist = new Regex(@"
^&\#60;(&\#47;)? (a|b(lockquote)?|code|em|h(1|2|3)|i|li|ol|p(re)?|s(ub|up|trong|trike)?|ul)&\#62;$
|^&\#60;(b|h)r\s?(&\#47;)?&\#62;$
|^&\#60;a(?!&\#62;).+?&\#62;$",


      RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace |
      RegexOptions.ExplicitCapture | RegexOptions.Compiled);

    /// <summary>
    /// HtmlDecode any potentially safe HTML tags from the provided HtmlEncoded HTML input using 
    /// a whitelist-based approach, leaving the dangerous tags HTML-encoded
    /// </summary>
    public static string Sanitize(string html)
    {
        Match tag;
        MatchCollection tags = _tags.Matches(html);

        // iterate backwards through all encoded HTML tags in the input,
        // so that replacing a later match does not shift the indexes of earlier ones
        for (int i = tags.Count - 1; i > -1; i--)
        {
            tag = tags[i];
            string tagname = tag.Value.ToLowerInvariant();

            if (_whitelist.IsMatch(tagname))
            {
                // If we find a tag on the whitelist, run it through 
                // HtmlDecode, and re-insert it into the text
                string safeHtml = HttpUtility.HtmlDecode(tag.Value);
                html = html.Remove(tag.Index, tag.Length);
                html = html.Insert(tag.Index, safeHtml);
            }
        }
        return html;
    }
}

jesperlind 2008-12-28 18:17:37

You still have to sanitize the content of the a-tag, or it is possible to use href="javascript:evil", onmouseover="evil", etc.
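
One possible way to narrow the a-tag rule is sketched below, under stated assumptions: the input is a single encoded tag, only a lone href attribute with an http or https URL in double quotes is accepted, and System.Net.WebUtility.HtmlDecode is used instead of the HttpUtility.HtmlDecode call from the posted code so the example needs no System.Web reference. The names AnchorCheck, StrictAnchor and IsSafeAnchor are hypothetical, and this has not been tested against real XSS payloads.

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class AnchorCheck
{
    // Accepts only <a href="http(s)://..."> with no other attributes.
    // The URL may not contain quotes, angle brackets or whitespace.
    private static readonly Regex StrictAnchor = new Regex(
        @"^<a\s+href=""(?<url>https?://[^""'<>\s]+)""\s*>$",
        RegexOptions.IgnoreCase);

    public static bool IsSafeAnchor(string encodedTag)
    {
        // Decode the &#NN; entities once, then validate the plain tag text.
        string decoded = WebUtility.HtmlDecode(encodedTag);
        return StrictAnchor.IsMatch(decoded);
    }
}
```

A check like this could be applied to tags starting with &#60;a before they are decoded and re-inserted, so that javascript: URLs and event-handler attributes stay encoded.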

some 2008-12-28 22:53:02

Answer 6

A:

link text

2009-08-11 04:29:45

hi "hi", thanks for the google link, but I've already tried that ;)

jesperlind 2009-08-13 00:59:24
