views:

608

answers:

3

I'm trying to write a regex function to remove onclick (also onload, onmouseover etc.) attributes from HTML elements. I want to do this on the server side before the HTML is sent to the client.

I have content coming from a Rich Text editor and being displayed on screen in a div, and I want to protect against XSS (Cross Site Scripting). Obviously I can't HTML encode it using Server.HtmlEncode() because the rich text stores the text as HTML markup, so I'm using a blacklisting approach, looking for certain elements such as <script> and <style>. I'm now trying to look for onclick, onmouseover etc. attributes, so far I have the following:

returnVal = Regex.Replace(returnVal, @"\<(.*?)(\ on[a-z]+\=\""?.*?\""?)*(.*?)\>",
               "<$1 $3>", RegexOptions.Singleline | RegexOptions.IgnoreCase);

...which isn't working, and I've tried a few variations. Basically I want it so that...

<p style="font-style: italic" onclick="alert('hacked!!');">Hello World</p>

gets turned into...

<p style="font-style: italic">Hello World</p>

Any ideas? Cheers!

A: 

have you seen here: http://refactormycode.com/codes/333-sanitize-html ?

AndreasKnudsen
Thanks, but I've seen that before, almost all of the examples are whitelist approach, we've already tried Marcus McConnell's code (right at the bottom), but even that had problems with certain href attributes that used relative paths. The trouble with the whitelist approach is that there are many edge cases to look out for and they're hard to get right.
Sunday Ironfoot
+2  A: 

Try this regex:


returnValue = 
    Regex.Replace(
        returnValue,
        @"(<[\s\S]*?) on.*?\=(['""])[\s\S]*?\2([\s\S]*?>)", 
        delegate(Match match)
        {
            return String.Concat(match.Groups[1].Value, match.Groups[3].Value);
        }, RegexOptions.Compiled | RegexOptions.IgnoreCase);

HTH

Rubens Farias
WOW! Almost perfect except if the onclick attribute appears twice in a single element it'll only remove one of them. I have stuck that code in a while loop using Regex.IsMatch( with your expression and it seems to work. I'll post the code in a separate post in this question, because code samples don't come out very well in these comments.
Sunday Ironfoot
Forgot to say, thank you very much!!
Sunday Ironfoot
A: 

This is a response to 'Rubens Farias' answer with the code sample I came up with. I used a while loop like so...

while (Regex.IsMatch(returnVal, @"(<[\s\S]*?) on.*?\=(['""])[\s\S]*?\2([\s\S]*?>)", RegexOptions.Compiled | RegexOptions.IgnoreCase))
{
    returnVal = Regex.Replace(returnVal, @"(<[\s\S]*?) on.*?\=(['""])[\s\S]*?\2([\s\S]*?>)",
                    delegate(Match match)
                    {
                        return String.Concat(match.Groups[1].Value, match.Groups[3].Value);
                    }, RegexOptions.Compiled | RegexOptions.IgnoreCase);
}

For those interested, here is the entire method I'm using to help protect against XSS...

/// <summary>
///     'Helps' protect against XSS (Cross Site Scripting attacks) by stripping out known evil HTML elements
///     such as script and style. Used for outputing text generated by a Rich Text Editor. Doesn't HTML encode!
/// </summary>
/// <param name="input">Input string to strip bad HTML elements from</param>
public static string XSSProtect(string input)
{
    string returnVal = input ?? "";

    returnVal = Regex.Replace(returnVal, @"\<script(.*?)\>(.*?)\<\/script(.*?)\>", "", RegexOptions.Singleline | RegexOptions.IgnoreCase);
    returnVal = Regex.Replace(returnVal, @"\<style(.*?)\>(.*?)\<\/style(.*?)\>", "", RegexOptions.Singleline | RegexOptions.IgnoreCase);

    while (Regex.IsMatch(returnVal, @"(<[\s\S]*?) on.*?\=(['""])[\s\S]*?\2([\s\S]*?>)", RegexOptions.Compiled | RegexOptions.IgnoreCase))
    {
        returnVal = Regex.Replace(returnVal, @"(<[\s\S]*?) on.*?\=(['""])[\s\S]*?\2([\s\S]*?>)",
                        delegate(Match match)
                        {
                            return String.Concat(match.Groups[1].Value, match.Groups[3].Value);
                        }, RegexOptions.Compiled | RegexOptions.IgnoreCase);
    }

    return returnVal;
}
Sunday Ironfoot
I must admit I'm a little nervous with this approach, is there any situation where this could end up in an infinite loop?
Sunday Ironfoot
I'm not comfortable too; you also should add an extra \s* after and before = symbol, avoiding onclick = "alert()"
Rubens Farias