views:

2496

answers:

10

What's an example of something dangerous that would not be caught by the code below?

EDIT: After some of the comments I added another line, commented below. See Vinko's comment in David Grant's answer. So far only Vinko has answered the question, which asks for specific examples that would slip through this function. Vinko provided one, but I've edited the code to close that hole. If another of you can think of another specific example, you'll have my vote!

public static string strip_dangerous_tags(string text_with_tags)
{
    string s = Regex.Replace(text_with_tags, @"<script", "<scrSAFEipt", RegexOptions.IgnoreCase);
    s = Regex.Replace(s, @"</script", "</scrSAFEipt", RegexOptions.IgnoreCase);
    s = Regex.Replace(s, @"<object", "</objSAFEct", RegexOptions.IgnoreCase);
    s = Regex.Replace(s, @"</object", "</obSAFEct", RegexOptions.IgnoreCase);
    // ADDED AFTER THIS QUESTION WAS POSTED
    s = Regex.Replace(s, @"javascript", "javaSAFEscript", RegexOptions.IgnoreCase);

    s = Regex.Replace(s, @"onabort", "onSAFEabort", RegexOptions.IgnoreCase);
    s = Regex.Replace(s, @"onblur", "onSAFEblur", RegexOptions.IgnoreCase);
    s = Regex.Replace(s, @"onchange", "onSAFEchange", RegexOptions.IgnoreCase);
    s = Regex.Replace(s, @"onclick", "onSAFEclick", RegexOptions.IgnoreCase);
    s = Regex.Replace(s, @"ondblclick", "onSAFEdblclick", RegexOptions.IgnoreCase);
    s = Regex.Replace(s, @"onerror", "onSAFEerror", RegexOptions.IgnoreCase);
    s = Regex.Replace(s, @"onfocus", "onSAFEfocus", RegexOptions.IgnoreCase);

    s = Regex.Replace(s, @"onkeydown", "onSAFEkeydown", RegexOptions.IgnoreCase);
    s = Regex.Replace(s, @"onkeypress", "onSAFEkeypress", RegexOptions.IgnoreCase);
    s = Regex.Replace(s, @"onkeyup", "onSAFEkeyup", RegexOptions.IgnoreCase);

    s = Regex.Replace(s, @"onload", "onSAFEload", RegexOptions.IgnoreCase);

    s = Regex.Replace(s, @"onmousedown", "onSAFEmousedown", RegexOptions.IgnoreCase);
    s = Regex.Replace(s, @"onmousemove", "onSAFEmousemove", RegexOptions.IgnoreCase);
    s = Regex.Replace(s, @"onmouseout", "onSAFEmouseout", RegexOptions.IgnoreCase);
    s = Regex.Replace(s, @"onmouseup", "onSAFEmouseup", RegexOptions.IgnoreCase);
    s = Regex.Replace(s, @"onmouseup", "onSAFEmouseup", RegexOptions.IgnoreCase);

    s = Regex.Replace(s, @"onreset", "onSAFEresetK", RegexOptions.IgnoreCase);
    s = Regex.Replace(s, @"onresize", "onSAFEresize", RegexOptions.IgnoreCase);
    s = Regex.Replace(s, @"onselect", "onSAFEselect", RegexOptions.IgnoreCase);
    s = Regex.Replace(s, @"onsubmit", "onSAFEsubmit", RegexOptions.IgnoreCase);
    s = Regex.Replace(s, @"onunload", "onSAFEunload", RegexOptions.IgnoreCase);

    return s;
}
+3  A: 
<a href="javascript:document.writeln('on' + 'unload' + ' and more malicious stuff here...');">example</a>

Any time you can write a string to the document, a big door swings open.

There are myriad places to inject malicious things into HTML/JavaScript. For this reason, Facebook didn't initially allow JavaScript in their applications platform. Their solution was to later implement a markup/script compiler that allows them to seriously filter out the bad stuff.

As said already, whitelist a few tags and attributes and strip out everything else. Don't blacklist a few known malicious attributes and allow everything else.

David Grant
How can document.writeln() do anything if the scripts are stripped?
eyelidlessness
but the "document.writeln" part has to be within a set of <script>,</script> tags, right? So, my logic protects against your example.
Corey Trager
<a href:"javascript:document.writeln ...
Vinko Vrsalovic
Oh, you're right. Then,<a href="javascript:malicious()"></a>
David Grant
Vinko. Thanks. Very clear.
Corey Trager
@David - if this is the correct answer, how about expanding it with an explanation? After reading the comments I got it but many will look at it in bewilderment :)
Kev
+7  A: 

As David shows, there's no easy way to protect with just some regexes you can always forget something, like javascript: in your case. You better escape the HTML entities on output. There is a lot of discussion about the best way to do this, depending on what you actually need to allow, but what's certain is that your function is not enough.

Jeff has talked a bit about this here.

Vinko Vrsalovic
Vinko, you "javascript:" example was helpful. I know Jeff Atwood's position, but it's too late for my app, which is already distributed. I'm stuck with allowing HTML input, and so I'd like to make the blacklist approach work.
Corey Trager
+3  A: 

Although I can't provide a specific example of why not, I am going to go ahead and outright say no. This is more on principal. Regex's are an amazing tool but they should only be used for certain problems. They are fantastic for data matching and searching.

They are not however a good tool for security. It is too easy to mess up a regex and have it be only partially correct. Hackers can find lots of wiggle room inside a poorly or even well constructed regex. I would try another avenue to prevent cross site scripting.

JaredPar
+9  A: 

You're much better off turning all < into &lt; and all > into &gt;, then converting acceptable tags back. In other words, whitelist, don't blacklist.

ceejayoz
I can't, because in the history of my codebase. I'm committed to making the blacklist approach work and I not convinced (yet) that it can't.
Corey Trager
Even this approach can have its downfalls, if you check the XSS cheat sheet (http://ha.ckers.org/xss.html). There are ways to make something look like one of the valid tags but to obfuscate attribs on that tag. Agree about the whitelist not blacklist thing though.
thomasrutter
See David Grant's answer to this question for a really basic example of what I mean (ie, if you have defined "a" as an acceptable tag and "href" as an acceptable attribute). There are more complex ways to obfuscate though, even if you filtered for the string "javascript:" in the href attrib.
thomasrutter
The difference between an exploited firewall (or relay) and a secure firewall (or relay) is blacklist vs. whitelist. Whitelisting is much harder to exploit.Blacklists must be *continuously* maintained; as hackers find new exploits, your 'lists must be updated, or you will suddenly become vulnerable. Microsoft *still* releases patches to Windows XP, because new holes are still being found, and that sucker is YEARs old. Now Imagine instead of the thousands of developers they have at their disposal, it is just you.
BryanH
+1  A: 

Another vote for whitelisting. But it looks like you're going about this the wrong way. The way I do it, is to parse the HTML into a tag tree. If the tag you're parsing is in the whitelist, give it a tree node, and parse on. Same goes for its attributes.

Dropped attributes are just dropped. Everything else is HTML-escaped literal content.

And the bonus of this route is because you're effectively regenerating all the markup, it's all completely valid markup! (I hate it when people leave comments and they screw up the validation/design.)

Re "I can't whitelist" (para): Blacklisting is a maintenance-heavy approach. You'll have to keep an eye on new exploits and make sure your covered. It's a miserable existence. Just do it right once and you'll never need to touch it again.

Oli
+36  A: 

It's never enough – whitelist, don't blacklist

For example javascript: pseudo-URL can be obfuscated with HTML entities, you've forgotten about <embed> and there are dangerous CSS properties like behavior and expression in IE.

There are countless ways to evade filters and such approach is bound to fail. Even if you find and block all exploits possible today, new unsafe elements and attributes may be added in the future.

There are only two good ways to secure HTML:

  • convert it to text by replacing every < with &lt;.
    If you want to allow users enter formatted text, you can use your own markup (e.g. markdown like SO does).

  • parse HTML into DOM, check every element and attribute and remove everything that is not whitelisted.
    You will also need to check contents of allowed attributes like href (make sure that URLs use safe protocol, block all unknown protocols).
    Once you've cleaned up the DOM, generate new, valid HTML from it. Never work on HTML as if it was text, because invalid markup, comments, entities, etc. can easily fool your filter.

Also make sure your page declares its encoding, because there are exploits that take advantage of browsers auto-detecting wrong encoding.

porneL
A: 

Whitespace makes you vulnerable. Read this.

Rich
+2  A: 

Take a look at the XSS cheatsheet at http://ha.ckers.org/xss.html it's not a complete list but a good start.

One that comes to mind is <img src="http://badsite.com/javascriptfile" />

You also forgot onmouseover, and the style tag.

The easiest thing to do really is entity escaping. If the vector can't render properly in the first place, an incomplete blacklist won't matter.

tduehr
A: 

From a different point of view, what happens when someone wants to have 'javascript' or 'functionload' or 'visionblurred' in what they submit? This can happen in most places for any number of reasons... From what I understand, those will become 'javaSAFEscript', 'functionSAFEload' and 'visionSAFEblurred'(!!).

If this might apply to you, and you're stuck with the blacklist approach, be sure to use the exact matching regexes to avoid annoying the user. In other words, be at the optimum point between security and usability, compromising either as little as possible.

sundar
+2  A: 

As an example of an attack that makes it through this:

  <div style="color: expression('alert(4)')">

Shameless plug: The Caja project defines whitelists of HTML elements and attributes so that it can control how and when scripts in HTML get executed.

See the project at http://code.google.com/p/google-caja/ and the whitelists are the JSON files in http://code.google.com/p/google-caja/source/browse/#svn/trunk/src/com/google/caja/lang/html and http://code.google.com/p/google-caja/source/browse/#svn/trunk/src/com/google/caja/lang/css

Mike Samuel