Hi,

I have a website that declares its output to be UTF-8, but I never actually verify that it is. Should I use a regular expression or the iconv library to convert UTF-8 to UTF-8 (dropping invalid sequences)? Is it a security issue if I don't?

+2  A: 

First of all, I would never blindly encode the content as UTF-8 a (possibly) second time, because that would lead to invalid characters, as you say. I would certainly try to detect whether the content's charset is something other than UTF-8 before attempting such a thing.
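To illustrate the "detect before converting" idea, here is a minimal Python sketch. The function name `ensure_utf8` and the choice of Latin-1 as the fallback charset are assumptions made for the example; in practice the fallback would depend on where the data comes from.

```python
def ensure_utf8(data: bytes, fallback: str = "latin-1") -> bytes:
    """Return data unchanged if it is already valid UTF-8.

    Only when strict decoding fails is the input treated as the
    (assumed) fallback charset and converted to UTF-8.
    """
    try:
        data.decode("utf-8")  # strict check, result is discarded
        return data
    except UnicodeDecodeError:
        return data.decode(fallback).encode("utf-8")


def blind_reencode(data: bytes) -> bytes:
    """The mistake described above: always assume Latin-1 input."""
    return data.decode("latin-1").encode("utf-8")


already_utf8 = "\u00e9".encode("utf-8")  # the letter e with an acute accent
print(ensure_utf8(already_utf8))         # left untouched
print(blind_reencode(already_utf8))      # double-encoded, shows up as mojibake
```

The blind version turns the two bytes of the accented letter into four bytes, which is the garbage a second encoding pass produces.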

Secondly, if the content comes from a source that you control and whose charset you control, such as a UTF-8 file or a database that uses UTF-8 both in its tables and on the connection, I would trust that source unless something hints that I can't and that something funky is going on. If the content comes from more or less random places outside your control, that is all the more reason to inspect it and, if you can detect another charset, to re-encode or transform it. So the bottom line is: it depends.

As to whether this is a security issue, I wouldn't think so (at least I can't think of any scenario where it could be exploited), but I'll leave it to others to be definitive about that.

Knut Haugen
+1  A: 

Not a security issue, but your users (especially non-English speakers) will be very annoyed if you send invalid UTF-8 byte streams.

In the best case (which is what most browsers do), all invalid sequences just disappear or show up as gibberish. In the worst case, the browser stops interpreting your page and says something like "invalid encoding". That is what some text editors on Linux (gedit, for example) do.

OK, to keep it realistic: if you have an English-centered website that doesn't rely heavily on maths characters or Unicode arrows, it will make almost no difference. But if you serve, e.g., a Chinese site, you can totally screw it up.

Cheers,

Boldewyn
+1  A: 

Everybody gets charsets messed up, so generally you can't trust any outside source. It's good practice to verify that the provided input is indeed valid for the charset it claims to use. Luckily, with UTF-8, you can make a fairly safe assertion about validity.
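As a rough sketch of such a validity check in Python: a strict decode either succeeds or raises, so it can serve as the assertion. The helper name `is_valid_utf8` is made up for this example.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 under Python's strict decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(is_valid_utf8("\u65e5\u672c\u8a9e".encode("utf-8")))  # well-formed multi-byte text
print(is_valid_utf8(b"\xff\xfe"))                           # bytes that never occur in UTF-8
print(is_valid_utf8(b"\xc3"))                               # truncated 2-byte sequence
```

The reason this works well for UTF-8 is that text in most other encodings containing non-ASCII bytes is unlikely to form well-formed UTF-8 sequences by accident, although pure ASCII input passes the check regardless of what charset it was meant to be.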

troelskn
A: 

If it's possible for users to send in arbitrary bytes, then yes, there are security implications of not ensuring valid UTF-8 output. Depending on how you're storing data, though, there are also security implications of not ensuring valid UTF-8 data on input (e.g., it's possible to create a variant of this SQL injection attack that works with UTF-8 input if the UTF-8 is allowed to be invalid). So you really should be using iconv to convert UTF-8 to UTF-8 on input, which lets you avoid the whole issue of validating UTF-8 on output.
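For readers not using iconv, a comparable "UTF-8 to UTF-8" cleanup on input can be sketched in Python with the decoder's error handlers. The function name `sanitize_utf8` is hypothetical, and whether to drop or replace bad bytes is a design choice this sketch leaves to the caller.

```python
def sanitize_utf8(data: bytes, drop_invalid: bool = False) -> bytes:
    """Round-trip bytes through the UTF-8 decoder so the result is always valid.

    Invalid bytes are replaced with U+FFFD by default, or removed
    entirely when drop_invalid is True.
    """
    errors = "ignore" if drop_invalid else "replace"
    return data.decode("utf-8", errors=errors).encode("utf-8")


raw = b"caf\xc3\xa9 \xff ok"                   # one stray 0xFF byte in the middle
print(sanitize_utf8(raw))                      # stray byte becomes U+FFFD
print(sanitize_utf8(raw, drop_invalid=True))   # stray byte is removed
```

Whichever variant is used, it should run before any escaping or filtering, because removing bytes afterwards can join text that the filter saw as separate pieces.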

The two main security reasons to check that the output is valid UTF-8 are to avoid "overlong" byte sequences (that is, byte sequences that mean some character like '<' but are encoded in more bytes than necessary) and to avoid invalid byte sequences. The overlong encoding issue is obvious: if your filter changes '<' into '&lt;', it might not convert a sequence that means '<' but is written differently. Note that all current-generation browsers will mark overlong sequences as invalid, but some people may be using old browsers.
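Here is a small Python sketch of the overlong problem. The lenient decoder below is deliberately broken and entirely hypothetical; it only mimics what an old, sloppy parser might do with 2-byte sequences, and the byte-level filter is a stand-in for whatever escaping you use.

```python
def escape_lt(data: bytes) -> bytes:
    """Byte-level filter that only knows the one-byte form of '<'."""
    return data.replace(b"<", b"&lt;")


def lenient_decode(data: bytes) -> str:
    """Hypothetical sloppy decoder: accepts any 2-byte sequence, even overlong ones."""
    out = []
    i = 0
    while i < len(data):
        b0 = data[i]
        if b0 & 0xE0 == 0xC0 and i + 1 < len(data):
            # Combine payload bits without checking for the shortest form
            out.append(chr(((b0 & 0x1F) << 6) | (data[i + 1] & 0x3F)))
            i += 2
        else:
            out.append(chr(b0))
            i += 1
    return "".join(out)


def strict_rejects(data: bytes) -> bool:
    """True if Python's strict UTF-8 decoder refuses the input."""
    try:
        data.decode("utf-8")
        return False
    except UnicodeDecodeError:
        return True


payload = b"\xc0\xbcscript>"         # 0xC0 0xBC is an overlong encoding of '<'
filtered = escape_lt(payload)        # the filter finds nothing to escape
print(lenient_decode(filtered))      # the sloppy decoder sees a real tag opener
print(strict_rejects(payload))       # a strict decoder refuses the bytes
```

Validating with a strict decoder before filtering closes this gap, since the overlong form never reaches the filter.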

The issue with invalid sequences is that some UTF-8 parsers will allow an invalid sequence to eat some number of valid bytes that follow the invalid ones. Again, this is not an issue if everyone always has a current browser, but...
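To make the "eating" behaviour concrete, here is a Python sketch that compares a hypothetical greedy parser with Python's built-in decoder. The greedy function is invented for illustration and does not correspond to any specific browser.

```python
def greedy_decode(data: bytes) -> str:
    """Hypothetical lenient parser: trusts the lead byte's announced length
    and swallows that many bytes, even if they are not continuation bytes."""
    out = []
    i = 0
    while i < len(data):
        b0 = data[i]
        if b0 < 0x80:
            out.append(chr(b0))
            i += 1
            continue
        if b0 & 0xE0 == 0xC0:
            length = 2
        elif b0 & 0xF0 == 0xE0:
            length = 3
        elif b0 & 0xF8 == 0xF0:
            length = 4
        else:
            length = 1
        chunk = data[i:i + length]
        try:
            out.append(chunk.decode("utf-8"))
        except UnicodeDecodeError:
            out.append("\ufffd")  # the whole chunk collapses into one marker
        i += length
    return "".join(out)


data = b'alt="\xc3" onload='                         # lone lead byte before a closing quote
print(greedy_decode(data))                           # the closing quote is gone
print(data.decode("utf-8", errors="replace"))        # the closing quote survives
```

With the greedy parser the attribute's closing quote disappears, so the text after it would be read as part of the attribute value; the built-in decoder replaces only the bad byte and keeps the quote.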

Daniel Martin