I'm looking at encoding strings to prevent XSS attacks. We want to use a whitelist approach, where any character outside the whitelist gets encoded. For example, we take a character like '(' and output '&#40;' instead. As far as we can tell, this will prevent most XSS.

The problem is that we've got a lot of international users, and when the whole site is in Japanese, encoding every character becomes a major bandwidth hog. Is it safe to say that characters outside the basic ASCII set aren't a vulnerability and don't need to be encoded, or are there characters outside the ASCII set that still need to be encoded?

+10  A: 

It might be (a lot) easier to just pass the character encoding to htmlentities()/htmlspecialchars():

echo htmlspecialchars($string,  ENT_QUOTES, 'utf-8');

Whether this is sufficient depends on what you're printing (and where).

see also:
http://shiflett.org/blog/2005/dec/googles-xss-vulnerability
http://jimbojw.com/wiki/index.php?title=Sanitizing_user_input_against_XSS
http://www.erich-kachel.de/?p=415 (in German; if I find something similar in English, I'll update)

Edit: well, I guess you can get the main point without being fluent in German ;) The string

javascript:eval(String.fromCharCode(97,108,101,114,116,40,39,88,83,83,39,41)) 
passes htmlentities() unchanged. Now consider something like
<a href="<?php echo htmlentities($_GET['homepage']); ?>">
which will send
<a href="javascript:eval(String.fromCharCode(97,108,101,114,116,40,39,88,83,83,39,41))">
to the browser. And that boils down to
href="javascript:eval(\"alert('XSS')\")"
While htmlentities() gets the job done for the contents of an element, it is not enough for attributes such as href, where the value itself (here a javascript: URL) is the attack.
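
One way to close the href hole described above is to check the URL scheme against an allow-list before escaping. This is only a minimal sketch: safe_href() is a hypothetical helper name, the http/https allow-list is an assumption, and returning '#' for anything else (including relative URLs) is just one possible policy.

```php
<?php
// Hypothetical helper (not from the original answer): returns an
// attribute-safe URL, or '#' when the scheme is not on the allow-list.
function safe_href(string $url): string
{
    $url = trim($url);
    // parse_url() returns null when the URL has no scheme and false when
    // the URL is seriously malformed; both cases are rejected here.
    $scheme = parse_url($url, PHP_URL_SCHEME);
    if (!is_string($scheme) || !in_array(strtolower($scheme), ['http', 'https'], true)) {
        return '#';
    }
    // Escaping is still needed so quotes and ampersands cannot break out
    // of the attribute value.
    return htmlspecialchars($url, ENT_QUOTES, 'UTF-8');
}
```

It would be used as <a href="<?php echo safe_href($_GET['homepage']); ?>">, so the javascript: string above ends up as href="#".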

VolkerK
+2  A: 

In general, yes, you can depend on anything non-ASCII to be "safe". However, there are some very important caveats to consider:

  1. Always ensure that what you're sending to the client is tagged as UTF-8. This means having a header that explicitly says "Content-Type: text/html; charset=utf-8" on every single page, including all of your error pages if any of the content on those error pages is generated from user input. (Many people forget to test their 404 page, and have that page include the not-found URL verbatim.)
  2. Always ensure that what you're sending to the client is valid UTF-8. This means you cannot simply pass bytes received from the user back to the user again. You need to decode the bytes as UTF-8, apply your HTML-encoding XSS prevention, and then encode the result as UTF-8 as you write it back out.
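
In PHP the two caveats above might look roughly like the sketch below. The function names send_utf8_header() and encode_for_html() are made up for illustration. ENT_SUBSTITUTE is a real htmlspecialchars() flag that replaces invalid byte sequences with U+FFFD, which stands in here for the decode and re-encode step; whether substituting is preferable to rejecting the request outright is a judgment call.

```php
<?php
// Hypothetical helper: tag the response as UTF-8. Must run before any
// output is written, otherwise PHP can no longer set headers.
function send_utf8_header(): void
{
    if (!headers_sent()) {
        header('Content-Type: text/html; charset=utf-8');
    }
}

// Hypothetical helper: escape the HTML metacharacters and make sure the
// result is valid UTF-8. Non-ASCII characters pass through unencoded.
function encode_for_html(string $raw): string
{
    // ENT_SUBSTITUTE swaps invalid UTF-8 sequences for U+FFFD instead of
    // making htmlspecialchars() return an empty string.
    return htmlspecialchars($raw, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}
```

With this, Japanese text is sent as-is and only characters like < > & " ' are turned into entities.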

The first of those two caveats is to keep the client's browser from seeing a bunch of high-bit (non-ASCII) bytes and falling back to some local multibyte character set. That local multibyte character set may have multiple ways of specifying harmful ASCII characters that you won't have defended against. Related to this, some older versions of certain browsers - cough IE cough - were a bit overeager in detecting that a page was UTF-7; this opens up no end of XSS possibilities. To defend against this, you might want to make sure you HTML-encode any outgoing "+" sign; this is excessive paranoia when you're generating proper Content-Type headers, but it will save you when some future person flips a switch that turns off your custom headers. (For example, by putting a poorly configured caching reverse proxy in front of your app, or by doing something to insert an extra banner header - PHP won't let you set any HTTP headers if any output is already written.)
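
The "+" defence can be bolted onto the normal escaping step. A small sketch, assuming a hypothetical wrapper named encode_with_plus(); the numeric entity &#43; is simply the decimal code for "+".

```php
<?php
// Hypothetical wrapper: normal HTML escaping plus an encoded "+" sign,
// so UTF-7 style sequences such as "+ADw-" never reach the browser intact.
function encode_with_plus(string $text): string
{
    $escaped = htmlspecialchars($text, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
    // htmlspecialchars() never emits a "+" itself, so replacing afterwards
    // cannot damage the entities it produced.
    return str_replace('+', '&#43;', $escaped);
}
```

For example, the UTF-7 spelling of <script>, "+ADw-script+AD4-", comes out as "&#43;ADw-script&#43;AD4-".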

The second of those is because it is possible in UTF-8 to specify "overlong" sequences that, while invalid under current specs, will be interpreted by older browsers as ASCII characters. (See what Wikipedia has to say.) Also, it is possible that someone may insert a single bad byte into a request; if you pass this back to the user, it can cause some browsers to replace both the bad byte and one or more bytes after it with "?" or some other "couldn't understand this" character. That is, a single bad byte could cause some good bytes to be swallowed up as well. If you look closely at what you're outputting, there's probably a spot somewhere where an attacker who was able to wipe a byte or two out of the output could do some XSS. Decoding the input as UTF-8 and then re-encoding it prevents this attack vector.
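
If you would rather reject bad input than repair it, a validity check up front is enough to catch both overlong sequences and stray bytes. The sketch below assumes a hypothetical helper is_valid_utf8() and relies on PCRE's "u" modifier, which validates the subject string as UTF-8; mb_check_encoding() would be an alternative if the mbstring extension is available.

```php
<?php
// Hypothetical helper: true only when $bytes is well-formed UTF-8.
function is_valid_utf8(string $bytes): bool
{
    // With the "u" modifier, preg_match() returns false (not 0) when the
    // subject is not valid UTF-8; the empty pattern matches anything valid.
    return preg_match('//u', $bytes) === 1;
}

// "\xC0\xBC" is an overlong two-byte spelling of "<".
$overlong = "\xC0\xBCscript\xC0\xBE";
// "\xE3\x81" is a three-byte character with its last byte missing.
$truncated = "abc\xE3\x81";

var_dump(is_valid_utf8($overlong));   // bool(false)
var_dump(is_valid_utf8($truncated));  // bool(false)
var_dump(is_valid_utf8("日本語"));     // bool(true)
```

A request that fails this check can be answered with an error page before any of its bytes are echoed back.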

Daniel Martin