I've just been reworking my Encoding::FixLatin Perl module to handle overlong UTF-8 byte sequences and convert them to the shortest normal form.

My question is quite simply "is this a bad idea"?

A number of sources (including this RFC) suggest that any over-long UTF-8 should be treated as an error and rejected. They caution against "naive implementations" and leave me with the impression that these things are inherently unsafe.

Since the whole purpose of my module is to clean up messy data files with mixed encodings and convert them to nice clean utf8, this seems like just one more thing I can clean up so the application layer doesn't have to deal with it. My code does not concern itself with any semantic meaning the resulting characters might have, it simply converts them into a normalised form.
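For concreteness, here is a minimal sketch of the decoding arithmetic involved (illustrative only, not the module's actual code). The two-byte sequence "\xC0\xBC" is an overlong encoding of U+003C ('<'), whose shortest form is the single byte "\x3C":

```perl
# Decode a 2-byte UTF-8 sequence by hand: 110xxxxx 10yyyyyy -> code point.
my ($b1, $b2) = (0xC0, 0xBC);
my $cp = (($b1 & 0x1F) << 6) | ($b2 & 0x3F);
printf "U+%04X (%s)\n", $cp, chr($cp);   # prints: U+003C (<)
```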

Am I missing something? Is there a hidden danger I haven't considered?

+3  A: 

Yes, this is a bad idea.

Maybe some of the data in one of these messy data files was checked to see that it didn't contain a dangerous sequence of ASCII characters.

The canonical example that caused many problems: '\xC0\xBCscript>'. "Fix" the overlong sequence to plain ASCII '<' and you have accidentally created a security hole.
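A hedged sketch of that failure mode (the payload and the byte-level screen here are illustrative, not taken from any real filter):

```perl
# Illustrative only: a byte-level screen for '<' passes, then a naive
# overlong "fix" reintroduces the very character it screened for.
my $input = "\xC0\xBCscript>alert(1)\xC0\xBC/script>";
die "blocked\n" if $input =~ /</;        # pre-screen: no literal '<' found
(my $fixed = $input) =~ s/\xC0\xBC/</g;  # shortest-form conversion
print "$fixed\n";                        # <script>alert(1)</script>
```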

No tool has ever generated overlongs for any legitimate purpose. If you're trying to repair mixed encoding files, you should consider encountering one as a sign that you have mis-guessed the encoding.

bobince
I'm afraid I don't follow your logic. My module is not an application, it's a data filter. I don't see how there is anything inherently insecure about the text '<script>' or '..' or anything else, for that matter, in data. If a web application used my module as a *part* of its input filtering process, how has this made things less secure?
Grant McLean
A *component* of a web application might use your filter. For example imagine an external database/service used to store sections of HTML content to drop into a web page. It has been pre-screened to remove any potentially-harmful elements or attributes, but it's from an application working in ISO 8859-1 so it doesn't see overlongs as a problem (or maybe doesn't know about Unicode at all).
bobince
Now your web application talks to that service to fetch some HTML snippets, and knowing that the service isn't UTF-8-friendly, runs it through your filter. An unexpected and, if included in the page, dangerous `<script>` tag appears. This may be a contrived example, but for something as general-purpose as a filter, there are endless possible scenarios. Better to fail-safe. In any case, nothing generates overlongs (except deliberate exploits). There are no overlongs in the wild that you would ever need to convert to canonical sequences. You don't stand to gain anything; there is only risk.
bobince
A component might misuse all sorts of things to create security holes. If you change the data through any filter, it doesn't matter if you've pre-screened it. You should always check the data on the way out.
brian d foy
I see that the `[\xC0][\xBC]script>` thing is a reported bug, but I'm missing the part about how a Unicode normalization would change that to `<script>`. _Translating_ it to ASCII might be a problem, but that's not the same thing as Unicode normalization.
brian d foy
I agree that this could lead to vulnerabilities, but so could any decode routine (esp. base64_decode()!). I think they can be avoided if you follow the normal security practice of sanitizing input before use.
Rook
@brian d foy: My module is not doing Unicode normalisation. Its primary purpose is to clean up the distressingly common situation of a data file (e.g. a database dump) that contains both Latin-1 characters and UTF-8 characters, producing UTF-8 output. In this context it *would* definitely convert "\xC0\xBCscript" to "<script>".
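For reference, typical usage looks roughly like this (a sketch; check the module's documentation for the exact interface):

```perl
# Sketch of the mixed-encoding cleanup described above.
use Encoding::FixLatin qw(fix_latin);

my $messy = "caf\xE9 na\xC3\xAFve";   # Latin-1 'é' alongside UTF-8 'ï'
my $clean = fix_latin($messy);        # both emerge as well-formed UTF-8
```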
Grant McLean
Ah, okay, I thought you were talking about having the valid UTF-8 string after you'd fixed it.
brian d foy
You'd have the same problem with `+ADw-` in UTF-7 or 0x4C in EBCDIC. The proper approach is to decode the string and *then* check for unwanted characters.
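A sketch of that ordering in Perl, using the `+ADw-` example from this comment (assumes core Encode's UTF-7 support):

```perl
# Decode first, then screen the decoded characters.
use Encode qw(decode);

my $raw  = '+ADw-script+AD4-';                       # UTF-7 for '<script>'
my $text = decode('UTF-7', $raw, Encode::FB_CROAK);  # dies on malformed input
die "markup not allowed\n" if $text =~ /[<>]/;       # screen after decoding
```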
dan04
+2  A: 

I don't think this is a bad idea from a security or usability perspective.

From a security perspective, you should be sanitizing user input before use. So you can run your clean-up routines, and then make sure the data doesn't contain greater-than/less-than symbols (<>) before it is printed out. You should also make sure you call mysql_real_escape_string() before inserting it into the database. Keep in mind that character-encoding issues such as GBK vs Latin-1 can lead to SQL injection when you aren't using mysql_real_escape_string(). (The function name should be pretty similar regardless of your platform-specific MySQL library bindings.)
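In Perl/DBI terms the same advice usually means parameterized queries rather than hand escaping (a hedged sketch; `$dbh`, `$clean_text`, and the `posts` table are assumed for illustration):

```perl
# Placeholders let the driver handle quoting and charset issues.
# $dbh is an already-connected DBI handle; table and column are hypothetical.
my $sth = $dbh->prepare('INSERT INTO posts (body) VALUES (?)');
$sth->execute($clean_text);
```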

Sanitizing all user input up front is generally a terrible idea because you don't know how the specific variable will be used. For instance, SQL injection and XSS involve very different control characters, and applying the same sanitization for both often leads to vulnerabilities.

Rook
+1  A: 

I don't know if it is a bad idea in your scenario; however, since this kind of change is not bijective, it may lead to data loss.

If you have incorrectly detected the encoding of your data, you may interpret legitimate data as UTF-8 overlongs and change it to the shortest normal form. There will be no way to retrieve the original data later.
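To illustrate the information loss (a sketch reusing the `\xC0\xBC` example from the other answer):

```perl
# Two different inputs collapse to the same output after the "fix",
# so the mapping cannot be reversed.
for my $input ("\xC0\xBC", "<") {            # Latin-1 "À¼" versus a real '<'
    (my $out = $input) =~ s/\xC0\xBC/</g;
    printf "%d byte(s) in -> '%s' out\n", length($input), $out;
}
```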

From personal experience, I know that when such things can happen, they WILL happen, and you will potentially not notice the error before it is too late...

dodecaplex
Thanks for your response. There's not really any safe way to handle the situation you describe apart from converting from the known single-byte encoding to UTF-8, in which case over-long sequences will not be encountered. The niche which Encoding::FixLatin occupies is cleaning up data which contains characters in multiple encodings. The heuristics used do have the potential to introduce data corruption, and the module documentation describes the risks.
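That known-encoding path looks roughly like this (a sketch using core Encode; the input string is hypothetical):

```perl
# With a known source encoding, a straight transcode is safe: decoding
# ISO-8859-1 can never produce overlong UTF-8 on output.
use Encode qw(decode encode);

my $latin1_bytes = "caf\xE9";   # hypothetical Latin-1 input
my $utf8_bytes   = encode('UTF-8', decode('ISO-8859-1', $latin1_bytes));
```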
Grant McLean