views:

152

answers:

2

I been finding that for something that I consider pretty import there is very little information or libraries on how to deal with this problem.

I found this while searching. I really don't know all the million ways that a hacker could try to insert the dangerous tags.

I have a rich html editor so I need to keep non dangerous tags but strip out bad ones.

So is this script missing anything?

It uses html agility pack.

public string ScrubHTML(string html)
{
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);

    //Remove potentially harmful elements
    HtmlNodeCollection nc = doc.DocumentNode.SelectNodes("//script|//link|//iframe|//frameset|//frame|//applet|//object|//embed");
    if (nc != null)
    {
        foreach (HtmlNode node in nc)
        {
            node.ParentNode.RemoveChild(node, false);

        }
    }

    //remove hrefs to java/j/vbscript URLs
    nc = doc.DocumentNode.SelectNodes("//a[starts-with(translate(@href, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'javascript')]|//a[starts-with(translate(@href, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'jscript')]|//a[starts-with(translate(@href, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'vbscript')]");
    if (nc != null)
    {

        foreach (HtmlNode node in nc)
        {
            node.SetAttributeValue("href", "#");
        }
    }


    //remove img with refs to java/j/vbscript URLs
    nc = doc.DocumentNode.SelectNodes("//img[starts-with(translate(@src, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'javascript')]|//img[starts-with(translate(@src, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'jscript')]|//img[starts-with(translate(@src, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'vbscript')]");
    if (nc != null)
    {
        foreach (HtmlNode node in nc)
        {
            node.SetAttributeValue("src", "#");
        }
    }

    //remove on<Event> handlers from all tags
    nc = doc.DocumentNode.SelectNodes("//*[@onclick or @onmouseover or @onfocus or @onblur or @onmouseout or @ondoubleclick or @onload or @onunload]");
    if (nc != null)
    {
        foreach (HtmlNode node in nc)
        {
            node.Attributes.Remove("onFocus");
            node.Attributes.Remove("onBlur");
            node.Attributes.Remove("onClick");
            node.Attributes.Remove("onMouseOver");
            node.Attributes.Remove("onMouseOut");
            node.Attributes.Remove("onDoubleClick");
            node.Attributes.Remove("onLoad");
            node.Attributes.Remove("onUnload");
        }
    }

    // remove any style attributes that contain the word expression (IE evaluates this as script)
    nc = doc.DocumentNode.SelectNodes("//*[contains(translate(@style, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'expression')]");
    if (nc != null)
    {
        foreach (HtmlNode node in nc)
        {
            node.Attributes.Remove("stYle");
        }
    }

    return doc.DocumentNode.WriteTo();
} 

Edit

2 people have suggested whitelisting. I actually like the idea of whitelisting but never actually did it because no one can actually tell me how to do it in C# and I can't even really find tutorials for how to do it in c#(the last time I looked. I will check it out again).

  1. How do you make a white list? Is it just a list collection?

  2. How do you actual parse out all html tags, script tags and every other tag?

  3. Once you have the tags how do you determine which ones are allowed? Compare them to you list collection? But what happens if the content is coming in and has like 100 tags and you have 50 allowed. You got to compare each of those 100 tag by 50 allowed tags. Thats quite a bit to go through and could be slow.

  4. Once you found a invalid tag how do you remove it? I don't really want to reject a whole set of text if one tag was found to be invalid. I rather remove and insert the rest.

  5. Should I be using html agility pack?

+1  A: 

Yes, I already see you're missing onmousedown, onmouseup, onchange, onsubmit, etc. This is part of why should use whitelisting for both tags and attributes. Even if you had a perfect blacklist now (very unlikely), tags and attributes are added fairly often.

See Why use a whitelist for HTML sanitizing?.

Matthew Flaschen
See Edit.......
chobo2
+1  A: 

That code is dangerous -- you should be whitelisting elements, not blacklisting them.

In other words, make a small list of tags and attributes you want to allow, and don't let any others through.

EDIT: I'm not familiar with HTML agility pack, but I see no reason why it wouldn't work for this. Since I don't know the framework, I'll give you pseudo-code for what you need to do.

doc.LoadHtml(html);

var validTags = new List<string>(new string[] {"b", "i", "u", "strong", "em"});

var nodes = doc.DocumentNode.SelectAllNodes();
foreach(HtmlNode node in nodes)
    if(!validTags.Contains(node.Tag.ToLower()))
        node.Parent.ReplaceNode(node, node.InnerHtml);

Basically, for each tag, if it's not contained in the whitelist, replace the tag with just its inner HTML. Again, I don't know your framework, so I can't really give you specifics, sorry. Hopefully this gets you started in the right direction.

zildjohn01
See Edit.......
chobo2
Hi. So how about attributes to tags or classes? what happens if you allow links so it would be <a href="">hi</a> how would that look in the list of valid tags? Also you get all nodes but once your doing checking them out do you put the nodes back together? Like right now they are all separate from the link of SelectAllNodes()
chobo2
What framework do you uses? Html agility pack does not seem to have slectAllNodes
chobo2