Are there any problems with what I am doing here? This is my first time dealing with something like this, and I just want to make sure I understand the risks of the different methods.

I am using WMD to get user input, and I am displaying it with a Literal control. Since the content is uneditable once entered, I will be storing the HTML rather than the Markdown.

string input = Server.HtmlEncode(stringThatComesFromWMDTextArea);

and then run something like this for the tags I want users to be able to use:

// unescape whitelisted tags
string output = input.Replace("&lt;b&gt;", "<b>").Replace("&lt;/b&gt;", "</b>")
                     .Replace("&lt;i&gt;", "<i>").Replace("&lt;/i&gt;", "</i>");

Edit: Here is what I am doing currently:

public static string EncodeAndWhitelist(string html)
{
    string[] whiteList = { "b", "i", "strong", "img", "ul", "li" };

    // Encode everything, then selectively un-encode the whitelisted tags.
    string encodedHTML = HttpUtility.HtmlEncode(html);
    foreach (string wl in whiteList)
        encodedHTML = encodedHTML
            .Replace("&lt;" + wl + "&gt;", "<" + wl + ">")
            .Replace("&lt;/" + wl + "&gt;", "</" + wl + ">");
    return encodedHTML;
}
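
For context, a minimal sketch of how this gets wired up (the Literal control name litComment is assumed):

// rendering the stored, sanitised HTML
litComment.Text = EncodeAndWhitelist(stringThatComesFromWMDTextArea);
// e.g. "<b>hi</b> <script>alert(1)</script>"
//   comes out as "<b>hi</b> &lt;script&gt;alert(1)&lt;/script&gt;"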
  1. Will what I am doing here keep me protected from XSS?
  2. Are there any other considerations that should be made?
  3. Is there a good list of normal tags to whitelist?
A:

If your requirements really are basic enough that you can get away with such simple string replacements, then yes, this is ‘safe’ against XSS. (However, it's still possible to submit non-well-formed content in which <i> and <b> are mis-nested or left unclosed, which could break the layout of the page the content ends up inserted into.)

But this is rarely enough. For example, <a href="..."> and <img src="..." /> are currently not allowed (the simple replacement only matches a bare tag with no attributes). If you wanted to allow these, or any other markup that carries attribute values, you'd have a whole lot more work to do. You might then approach it with regex, but that gives you endless problems with accidental nesting and replacement of already-replaced content, because regular expressions cannot reliably parse HTML.

To solve both problems, the usual approach is to use an [X][HT]ML parser on the input, then walk the DOM removing all but known-good elements and attributes, then finally re-serialise to [X]HTML. The result is then guaranteed well-formed and contains only safe content.
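
For example, a rough sketch in C# with HTML Agility Pack (the whitelist contents and class name are illustrative only, and this deliberately ignores URL vetting, which comes up below):

using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class HtmlWhitelister
{
    // Illustrative whitelist: element name -> allowed attribute names.
    static readonly Dictionary<string, string[]> Allowed =
        new Dictionary<string, string[]>(StringComparer.OrdinalIgnoreCase)
        {
            { "b", new string[0] }, { "i", new string[0] },
            { "strong", new string[0] },
            { "ul", new string[0] }, { "li", new string[0] },
            { "a", new[] { "href" } },
            { "img", new[] { "src", "alt" } },
        };

    public static string Sanitize(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);                 // parses even malformed input
        Clean(doc.DocumentNode);            // prune the tree in place
        return doc.DocumentNode.InnerHtml;  // re-serialise the cleaned DOM
    }

    static void Clean(HtmlNode parent)
    {
        // Iterate over a copy so removals don't disturb the enumeration.
        foreach (HtmlNode node in parent.ChildNodes.ToList())
        {
            if (node.NodeType == HtmlNodeType.Comment)
            {
                node.Remove();              // comments can hide payloads
            }
            else if (node.NodeType == HtmlNodeType.Element)
            {
                string[] okAttrs;
                if (!Allowed.TryGetValue(node.Name, out okAttrs))
                {
                    node.Remove();          // unknown element: drop it entirely
                    continue;
                }
                // Strip attributes not whitelisted for this element
                // (onclick, style, and friends all disappear here).
                foreach (HtmlAttribute attr in node.Attributes.ToList())
                    if (!okAttrs.Contains(attr.Name, StringComparer.OrdinalIgnoreCase))
                        node.Attributes.Remove(attr.Name);
                Clean(node);                // recurse into surviving children
            }
            // plain text nodes pass through
        }
    }
}

Because the parser has already normalised the tree, the output is well-formed regardless of what was submitted.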

bobince
So, assuming I wanted something more robust, what would you suggest for the parsers you mentioned? Could HTML Agility Pack handle it? Isn't there something that does all this already?
Blankasaurus
Yes, HTML Agility Pack is a good choice. Once you've got the DOM parsed, it's a relatively trivial exercise to write a recursive function that removes all but known-good elements/attributes from the tree. Also, if you allow `href`/`src`/etc., remember to check the URLs for known-good schemes like `http`/`https`, to avoid injection through `javascript:` URLs and the like; a minimal check is sketched below.
bobince
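
A minimal version of that URL check might look like this (the helper name is assumed; it rejects relative URLs for simplicity):

static bool IsSafeUrl(string url)
{
    // Accept only absolute http/https URLs; javascript:, data:, vbscript:
    // and anything unparseable all fail the check.
    Uri parsed;
    return Uri.TryCreate(url, UriKind.Absolute, out parsed)
        && (parsed.Scheme == Uri.UriSchemeHttp || parsed.Scheme == Uri.UriSchemeHttps);
}

In the attribute-stripping loop sketched in the answer, any `href`/`src` whose value fails this check would then be removed along with the other disallowed attributes.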