views:

251

answers:

2

I am wondering whether there is a C# class or third-party library that removes dangerous content such as script tags.

I know you can use a regex, but I also know people can write their script tags in so many ways that they can fool the regex into thinking the input is OK.
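
To make that concern concrete, here is a minimal sketch of how a simple regex filter can be fooled. The class and method names (`NaiveFilterDemo`, `NaiveStrip`) are hypothetical and exist only for illustration; the pattern is a deliberately naive one, not a filter taken from any library.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NaiveFilterDemo
{
    // Hypothetical naive filter: removes <script>...</script> blocks in a single pass.
    public static string NaiveStrip(string html)
    {
        return Regex.Replace(
            html,
            "<script>.*?</script>",
            "",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }

    // Input crafted so that removing the inner tags reassembles an outer script tag.
    public const string NestedPayload =
        "<scr<script></script>ipt>alert(1)</scr<script></script>ipt>";

    // Input with no script tag at all; the script lives in an event-handler attribute.
    public const string AttributePayload = "<img src=x onerror=alert(1)>";
}
```

With this sketch, `NaiveFilterDemo.NaiveStrip(NaiveFilterDemo.NestedPayload)` returns `<script>alert(1)</script>`, and the `AttributePayload` string comes back unchanged, so both inputs get past the filter.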

I have also heard that HTML Agility Pack is good, so I am wondering whether a script-removal class has been made for it.

Edit

http://htmlagilitypack.codeplex.com/Thread/View.aspx?ThreadId=24346

I found this on their forums. However, I am not sure it is a complete solution, as the author does not have any tests to back it up. It would be nicer if this were on some site where tons of people were using the script every day, so you could see whether anything gets by.

Great example (almost), Thanks! A few ways to make it stronger that I saw, though:

1) Use case-insensitive search when looking for links with "javascript:", "vbscript:", "jscript:". For example, the original example would not remove the HTML:

click me

2) Remove any style attributes that contain an expression rule. Internet Explorer evaluates the CSS "expression" rule as script. For example, the following would produce a message box:

bad code

3) Also remove tags

I honestly have no idea why "expression" has not been removed from IE; it is a major flaw in my opinion. (Try the div example in Internet Explorer and you'll see why, even in IE8.) I just wish there was an easier, standard way to clean up HTML input from a user.

Here's the code updated with these improvements. Let me know if you see anything wrong:

    public string ScrubHTML(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        //Remove potentially harmful elements
        HtmlNodeCollection nc = doc.DocumentNode.SelectNodes("//script|//link|//iframe|//frameset|//frame|//applet|//object|//embed");
        if (nc != null)
        {
            foreach (HtmlNode node in nc)
            {
                node.ParentNode.RemoveChild(node, false);

            }
        }

        //remove hrefs to java/j/vbscript URLs
        nc = doc.DocumentNode.SelectNodes("//a[starts-with(translate(@href, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'javascript')]|//a[starts-with(translate(@href, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'jscript')]|//a[starts-with(translate(@href, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'vbscript')]");
        if (nc != null)
        {

            foreach (HtmlNode node in nc)
            {
                node.SetAttributeValue("href", "#");
            }
        }


        //remove img with refs to java/j/vbscript URLs
        nc = doc.DocumentNode.SelectNodes("//img[starts-with(translate(@src, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'javascript')]|//img[starts-with(translate(@src, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'jscript')]|//img[starts-with(translate(@src, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'vbscript')]");
        if (nc != null)
        {
            foreach (HtmlNode node in nc)
            {
                node.SetAttributeValue("src", "#");
            }
        }

        //remove common on<Event> handler attributes from all tags
        //(the double-click attribute is "ondblclick"; there is no "ondoubleclick")
        //note: this list is not exhaustive (e.g. onerror is not covered)
        nc = doc.DocumentNode.SelectNodes("//*[@onclick or @onmouseover or @onfocus or @onblur or @onmouseout or @ondblclick or @onload or @onunload]");
        if (nc != null)
        {
            foreach (HtmlNode node in nc)
            {
                node.Attributes.Remove("onfocus");
                node.Attributes.Remove("onblur");
                node.Attributes.Remove("onclick");
                node.Attributes.Remove("onmouseover");
                node.Attributes.Remove("onmouseout");
                node.Attributes.Remove("ondblclick");
                node.Attributes.Remove("onload");
                node.Attributes.Remove("onunload");
            }
        }

        // remove any style attributes that contain the word expression (IE evaluates this as script)
        nc = doc.DocumentNode.SelectNodes("//*[contains(translate(@style, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'expression')]");
        if (nc != null)
        {
            foreach (HtmlNode node in nc)
            {
                node.Attributes.Remove("stYle");
            }
        }

        return doc.DocumentNode.WriteTo();
    } 
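
The scheme and event-handler checks in the code above can also be written as small standalone predicates, which makes them easier to test in isolation. This is only a sketch under stated assumptions: `UrlChecks`, `IsScriptUrl`, and `IsEventHandlerAttribute` are hypothetical names, not part of HTML Agility Pack, and treating every attribute that starts with "on" as an event handler is a deliberately conservative assumption that goes beyond the fixed list used above.

```csharp
using System;
using System.Linq;

public static class UrlChecks
{
    // The three schemes named in the original post.
    private static readonly string[] ScriptSchemes = { "javascript:", "jscript:", "vbscript:" };

    // True if the URL starts with a script scheme, ignoring case.
    // Whitespace and control characters are dropped first (an assumption:
    // this is stricter than a plain starts-with check).
    public static bool IsScriptUrl(string url)
    {
        if (string.IsNullOrEmpty(url)) return false;
        string compact = new string(
            url.Where(c => !char.IsWhiteSpace(c) && !char.IsControl(c)).ToArray());
        return ScriptSchemes.Any(
            s => compact.StartsWith(s, StringComparison.OrdinalIgnoreCase));
    }

    // True for any attribute name beginning with "on" (onclick, onerror, ...).
    public static bool IsEventHandlerAttribute(string name)
    {
        return !string.IsNullOrEmpty(name)
            && name.Trim().StartsWith("on", StringComparison.OrdinalIgnoreCase);
    }
}
```

For example, `UrlChecks.IsScriptUrl("JaVaScRiPt:alert(1)")` and `UrlChecks.IsScriptUrl(" java\tscript:alert(1)")` are both true, while `UrlChecks.IsScriptUrl("http://example.com/javascript")` is false. A prefix test like `IsEventHandlerAttribute` may flag more attributes than strictly necessary, which is the safer direction for a sanitizer.
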
A: 

I would use built-in methods. As I see it, if a user wants to break your program, they will find a way to do it. But if you combine multiple methods of sanitizing user input, your program will only be more secure.

For instance, with a string variable named "myString", I would combine regex character stripping with manual character stripping, just to be safe.

This will remove everything that isn't alphanumeric.

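    // Requires: using System.Text.RegularExpressions;
    myString = Regex.Replace(myString, "[^a-z0-9]", "", RegexOptions.IgnoreCase);
    // Extra manual passes (no-ops after the regex above, kept as a second layer)
    myString = myString.Replace("/", "");
    myString = myString.Replace("<", "");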

etc.

You could also extend this further by removing text that is between "<" and ">" characters and then between ">" and "<".
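
A rough sketch of what these two stripping approaches do to a sample string is shown below. The names `StripDemo`, `KeepAlphanumeric`, and `StripAngleBracketRuns` are hypothetical, and the second pattern is just one possible reading of "remove text between < and >"; neither is meant as a complete sanitizer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StripDemo
{
    // The approach from this answer: drop every character that is not a letter or digit.
    public static string KeepAlphanumeric(string input)
    {
        return Regex.Replace(input, "[^a-z0-9]", "", RegexOptions.IgnoreCase);
    }

    // Hypothetical helper: remove each run that starts with '<' and ends at the next '>'.
    public static string StripAngleBracketRuns(string input)
    {
        return Regex.Replace(input, "<[^>]*>", "");
    }
}
```

For the input `Hello, <b>World</b>!`, `KeepAlphanumeric` returns `HellobWorldb` (spaces and punctuation are lost, and the tag names remain as text), while `StripAngleBracketRuns` returns `Hello, World!`. Note that `StripAngleBracketRuns` leaves an unterminated tag such as `<img src=x onerror=alert(1)` untouched, because the pattern only matches when a closing `>` is present.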

I prefer not to use external third-party libraries unless I have to, because you have to distribute the library as well, you're relying on someone else's program to make yours secure, and if there's a vulnerability in their software, yours is vulnerable too.

Jeffrey Kern
On the other hand if I use a third-party library I can often gain all the robustness and competency of their solution without spending many hours coding a solution that's incomplete due to my lack of complete understanding of the nuances of a problem. Picking the right third-party library is another question.
jball
Valid point. Another concern regarding the 3rd party package is if that package needs any external libraries, etc. Since the deployment environment wasn't specified, you cannot make assumptions really about what would be best. But if I was in an IT department with hundreds of machines, if I had to install a 3rd party library onto all of those machines along with other components to get said library to run, I'd probably freak out a little bit. Soo..... time wasted developing your own solution vs. time wasted deploying someone else's lol
Jeffrey Kern
I'd rather use something made by someone with more knowledge in that area. It seems like just changing a space could let a script tag get through. There are many creative ways, I think, to write the script tag, and I don't have time to figure out what they all are and then test every case to see if it would pass. If it has already been made, why reinvent the wheel? Also, yes, if they have a vulnerability then it is in yours too. But then you had better not use anything, even the built-in security stuff that comes with C#, like encryption. It could be flawed, but if it is, it will be found faster than you will find
chobo2
the vulnerability your script probably will have.
chobo2
Which is a valid point. However, if someone wants to break your application, they will find a way. It isn't a matter of if, but when. For my applications, I put in place enough security to deter casual users and programmers from breaking it. But if someone is going to decompile, unobfuscate, and figure out how I implemented said security in order to break my app - that's fine by me for two reasons.... 1, they're a better programmer than me, and 2, I made something worthwhile enough for them to focus their attention on. :)
Jeffrey Kern
And for the C# encryption - that is fine by me, because it is a part of the C# runtime environment, the same environment needed to run your app in the first place. In addition, for encryption for instance - the algorithms are based off of tried-and-true publicly published methods and techniques, and not some clever solution that Joe Blow is selling for $10 as a component. Do whatever works best for you and your project, by all means - I just would suggest not losing sleep over trying to protect your app from everything. Cuz a fool will come along and somehow find out how to break it anyway!
Jeffrey Kern
I agree that if someone wants to break in, they will get in. But even if I write my own code to remove it, I will probably miss a lot, and you still have to cover against script kiddies. So I am thinking a third-party library that has been around for a while would probably stop 90-95% of attackers, whereas with my limited knowledge of how many ways you can write a script tag, hide it to sneak it through, and who knows what other dangerous tags, I could probably only prevent maybe 60-70% at best, and who knows how long that would take me to do.
chobo2
I just edited my post with a script I found. Just looking at it, I see stuff I never thought about, such as that iframes could be potentially dangerous, that they could try to link to a JavaScript file, etc.
chobo2