My C# site allows users to submit HTML to be displayed on the site. I would like to limit the tags and attributes allowed in that HTML, but I can't figure out how to do this in .NET.

I've tried using Html Agility Pack, but I don't see how to modify the HTML. I can see how to walk through the HTML and find certain data, but actually generating an output file is baffling me.

Does anyone have a good example of cleaning up HTML in .NET? The Agility Pack might be the answer, but the documentation is lacking.

+2  A: 

You should only accept well-formed HTML (markup that also parses as XML).

You can then use LINQ to XML to parse and modify it.

You can make a recursive function that takes an element from the user and returns a new element with a whitelisted set of tags and attributes.

For example:

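using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Maps allowed tags to the attributes allowed on each tag.
static readonly Dictionary<string, string[]> AllowedTags = new Dictionary<string, string[]>(StringComparer.OrdinalIgnoreCase) {
    { "b",    new string[0] },
    { "img",  new string[] { "src", "alt" } },
    //...
};

// Returns a copy of dirtyElement that keeps only whitelisted attributes, whitelisted child elements, and text.
// The caller must make sure dirtyElement itself is an allowed tag, or the AllowedTags lookup throws.
static XElement CleanElement(XElement dirtyElement) {
    string[] allowedAttributes = AllowedTags[dirtyElement.Name.LocalName];
    return new XElement(dirtyElement.Name,
        dirtyElement.Attributes()
            .Where(a => allowedAttributes.Contains(a.Name.LocalName, StringComparer.OrdinalIgnoreCase)),
        dirtyElement.Nodes()
            // Keep text; drop comments and any element that is not whitelisted.
            .Where(n => n is XText || (n is XElement && AllowedTags.ContainsKey(((XElement)n).Name.LocalName)))
            .Select(n => n is XElement ? CleanElement((XElement)n) : n));
}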

If you allow hyperlinks, make sure to disallow javascript: URLs; this code doesn't do that.
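
One way to handle that is to whitelist URL schemes rather than blacklist javascript:. The following is a minimal sketch, not part of the code above: the class name UrlWhitelist, the method IsSafeUrl, and the list of allowed schemes are all assumptions made for illustration. It relies only on System.Uri, and it deliberately rejects relative URLs, so adjust it if you need those.

```csharp
using System;

static class UrlWhitelist
{
    // Assumption: these are the only schemes acceptable in user-supplied links.
    static readonly string[] AllowedSchemes = { "http", "https", "mailto" };

    // Returns true only for absolute URLs whose scheme is whitelisted.
    public static bool IsSafeUrl(string url)
    {
        if (string.IsNullOrWhiteSpace(url)) return false;

        Uri uri;
        // Anything that does not parse as an absolute URI is rejected outright.
        if (!Uri.TryCreate(url.Trim(), UriKind.Absolute, out uri)) return false;

        // Uri.Scheme is normalized to lower case, so "JaVaScRiPt:" is caught too.
        return Array.IndexOf(AllowedSchemes, uri.Scheme) >= 0;
    }
}
```

You would call IsSafeUrl on the value of every href or src attribute before copying it into the cleaned element, and drop the attribute when it returns false.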

SLaks
+1 Nice - I like the "home-brewed" approach.
David Robbins
+2  A: 

With HtmlAgilityPack you can remove unwanted tags from the input:

node.ParentNode.RemoveChild(node);
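
To show how that call fits into a full load, clean, and output cycle, here is a rough sketch. It assumes the HtmlAgilityPack package is referenced; the class name HapSanitizer, the method Clean, and the whitelist contents are hypothetical. Note that removing a disallowed node this way also discards everything inside it.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class HapSanitizer
{
    // Hypothetical whitelist: tag name -> attributes allowed on that tag.
    static readonly Dictionary<string, string[]> AllowedTags =
        new Dictionary<string, string[]>(StringComparer.OrdinalIgnoreCase) {
            { "p",   new string[0] },
            { "b",   new string[0] },
            { "img", new[] { "src", "alt" } },
        };

    public static string Clean(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Snapshot with ToList() so removing nodes does not disturb the enumeration.
        foreach (var node in doc.DocumentNode.Descendants().ToList())
        {
            if (node.NodeType == HtmlNodeType.Comment)
            {
                node.ParentNode.RemoveChild(node);
                continue;
            }
            if (node.NodeType != HtmlNodeType.Element) continue; // keep text nodes

            string[] allowedAttributes;
            if (!AllowedTags.TryGetValue(node.Name, out allowedAttributes))
            {
                // Drops the tag together with all of its children.
                node.ParentNode.RemoveChild(node);
                continue;
            }

            // Strip every attribute that is not whitelisted for this tag.
            foreach (var attribute in node.Attributes.ToList())
                if (!allowedAttributes.Contains(attribute.Name, StringComparer.OrdinalIgnoreCase))
                    attribute.Remove();
        }

        // The modified markup; doc.Save(...) writes it to a file or TextWriter instead.
        return doc.DocumentNode.OuterHtml;
    }
}
```

The cleaned markup comes back as a string from DocumentNode.OuterHtml, which is the "generating an output" step the question asks about.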
morsanu
That's the method I was looking for. Thanks.
spaetzel
A: 

One tool you can use is SGMLReader, available on SourceForge. It turns HTML into well-formed XML, which you can then read through an XmlReader or load into an XmlDocument for further processing. I have used it before to parse web pages whose HTML is not always well formed.

Adam Gritt
+2  A: 

I would strongly recommend Microsoft's Anti-XSS Library for sanitizing input. It supports sanitizing HTML.

David Stratton
A: 

Have you had a look at MarkdownSharp, which is open source and was created by the guys here?

Jamie Dixon