I've got an application where users input text into forms.

The data is saved into a MySQL database (collation: utf8_general_ci) and then output as XML (encoding: UTF-8).

The problem is that people tend to cut and paste their information from other sources, for instance Microsoft Word documents or PDFs.

This input text often has characters which are incorrect for the output encoding, things like "smart quotes", which come from a document in Windows-1252 encoding.

This causes problems, obviously, when transforming or otherwise working on the XML because the characters are illegal.

So, how to sanitise the input?

Previously, I've used some fairly brute-force methods, things like the "de-moronize" script which consists of a long list of search-and-replace operations.

Is this still the best way to do it? Is there any other way?

Can I just set the accept-charset attribute on the form and have the browser do it for me?

If so, which browsers will do that and are there likely to be any problems?

Also, how come my database is accepting these characters, which are reserved/control characters in UTF-8?

As you can see, I know enough about encodings to know I have a problem, but I'm now a bit out of my depth...

TIA

+2  A: 

You might try the Perl Encode module:

http://search.cpan.org/dist/Encode/Encode.pm

It supports conversion between a number of character sets, including UTF-8 of course. I just checked my install of Perl and it also supports "cp1252", which is just another name for Windows-1252 according to Wikipedia. You can check your own install with the following one-liner:

perl -MEncode -e 'print map {"$_\n"} Encode->encodings(":all");'
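
For instance, here's a minimal sketch of converting cp1252 bytes to UTF-8 with Encode (the $raw string is just example input; assume you've first established the bytes really are Windows-1252):

use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: $raw holds bytes that really are Windows-1252.
my $raw = "\x93smart quotes\x94";       # cp1252 bytes 147 and 148

my $chars = decode('cp1252', $raw);     # bytes -> Perl character string
my $utf8  = encode('UTF-8', $chars);    # character string -> UTF-8 bytes

print $utf8, "\n";                      # now "\xE2\x80\x9C...\xE2\x80\x9D"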
Brian Ferris
Cheers. I have done that and I'm ready to go.
AmbroseChapel
+7  A: 

This input text often has characters which are incorrect for the output encoding, things like "smart quotes", which come from a document in Windows-1252 encoding.

“Smart quotes” (bytes 147 and 148 in cp1252) are perfectly valid Unicode characters, U+201C and U+201D. Your application should be capable of handling them seamlessly; if not, you're doing something wrong and most likely all non-ASCII characters will fail.

Regardless of whether the characters came from someone typing them or someone pasting them in from Word, the browser should be submitting UTF-8-encoded characters to your application, which should be storing the same UTF-8 bytes to the database.

If the browser is not submitting in UTF-8, chances are you're failing to set the charset of the HTML page containing the form. This can be done using the:

Content-Type: text/html;charset=utf-8

HTTP header and/or the:

<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />

element in <head>.
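
For example, a rough sketch for a plain CGI-style script (the /submit action is a placeholder; adapt to whatever framework you're actually using):

#!/usr/bin/perl
use strict;
use warnings;

# Send the charset in the HTTP header, so the browser treats the page,
# and therefore the form submission, as UTF-8.
print "Content-Type: text/html;charset=utf-8\r\n\r\n";

print <<'HTML';
<!DOCTYPE html>
<html>
<head><meta http-equiv="Content-Type" content="text/html;charset=utf-8" /></head>
<body>
<form method="post" action="/submit">
<textarea name="comment"></textarea>
<input type="submit" value="Send" />
</form>
</body>
</html>
HTML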

Can I just set the accept-charset attribute on the form and have the browser do it for me?

No, accept-charset is basically useless thanks to IE, which misinterprets it to mean “try using this charset if the one on the page can't encode the characters we want”, instead of “always use this charset”. This means if you use accept-charset you can end up with a mixture of encodings submitted at once, with no way to figure out which is which. Nice!

how come my database is accepting these characters, which are reserved/control characters in UTF-8?

In MySQL, utf8_general_ci is just a collation, i.e. rules for comparison and ordering, layered over the utf8 character set. Underneath, it's still storing the data as bytes and doesn't really care if they're not valid UTF-8 sequences.
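
It's also worth making sure the client connection is talking UTF-8 to the server. A rough sketch with DBI/DBD::mysql (database name and credentials are placeholders):

use strict;
use warnings;
use DBI;

# mysql_enable_utf8 makes DBD::mysql exchange UTF-8 with the server and
# mark returned strings as character data. The database name and
# credentials below are placeholders.
my $dbh = DBI->connect(
    'DBI:mysql:database=myapp;host=localhost',
    'user', 'secret',
    { RaiseError => 1, mysql_enable_utf8 => 1 },
);

$dbh->do('SET NAMES utf8');   # set the session character set explicitly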

It's a good idea to decode and check incoming UTF-8 sequences in your app anyway, because “overlong sequences”, invalid in modern Unicode, can hide a ‘<’ character that will still be recognised by older browsers (at least IE6 pre-SP2, Opera 7).
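
In Perl, for example, a strict decode will reject malformed and overlong sequences. A minimal sketch using Encode (the sample $bytes is a deliberately bad sequence):

use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Assumption: $bytes is the raw, undecoded form submission.
my $bytes = "\xC0\xBC";   # overlong encoding of '<': invalid UTF-8

my $text = eval { decode('UTF-8', $bytes, FB_CROAK) };
if (!defined $text) {
    die "Rejecting submission: not well-formed UTF-8\n";
}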

ETA:

So, I entered a string containing byte 146

No, you entered the Unicode character U+2019. The browser deals with Unicode characters, not bytes, right up until the point it has to submit the serialised form to the server. It's then that it decides how to turn the characters into bytes, and if the page is being handled as UTF-8, it will always choose UTF-8.

(If it's not UTF-8, browsers tend to cheat in a non-standards-compliant way: for all characters that can't fit in the encoding, they'll encode them to HTML character references like ‘&#8217;’. This is wrong because you now can't tell the difference between a browser-escaped ‘&’ and a real, user-typed ‘&’, and it's insidiously wrong because if you then echo the reference as unescaped HTML it looks like you're getting it right, when in fact you've just made a big old security hole.)

It went into the database as 146

Really, a ‘\x92’ byte, not ‘\xC2\x92’, ‘\xE2\x80\x99’ or ‘&#146;’?
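
A quick way to check is to dump the bytes as hex (a sketch; $value stands in for the raw, undecoded string as fetched from the database):

use strict;
use warnings;

my $value = "\xE2\x80\x99";   # placeholder: the column value as raw bytes

# Print each byte as hex: "92" means a bare cp1252 byte,
# "C2 92" a UTF-8-encoded U+0092, "E2 80 99" a UTF-8 U+2019.
printf '%02X ', ord($_) for split //, $value;
print "\n";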

it came out when I produced the (UTF-8-encoded) XML, as 146. No complaints from the browser

Then it did not come out as a single byte 146. A browser will complain when given a bare ‘\x92’ in an XML file. (Not in an HTML file, where invalid UTF-8 sequences come out as a missing-character glyph.)

I suspect it is coming out as a ‘&#146;’ character reference, which is well-formed (though the character U+0092 is part of the C1 control set, so won't render as anything useful). If this is what's happening, your form page is not being picked up as UTF-8 after all, and you're suffering the browser-auto-escaping-submission problem described above.

bobince
OK "bytes 147 and 148 are perfectly valid Unicode characters, U+201C and U+201D" is the part I don't get. The browser can't *know* the encoding of the pasted test. How does it know they're CP1252 smart quotes? Surely they're just bytes! Or is the use of 147-148 so unusual it's safe to assume that?
AmbroseChapel
The browser doesn't deal with bytes, it deals with Unicode characters. You paste Unicode characters U+201C (et al) into a text area; the browser neither knows nor cares how they were stored in the application you copied them out of (which was probably Unicode anyway).
bobince
It might be *submitting* bytes 147/148, but that would be because it thinks the encoding of the page containing the form was cp1252 and therefore the encoding your form wants to get is cp1252. If you don't tell it otherwise, it can default to cp1252 (or other system codepages in other countries).
bobince
I'm still confused. If I cut a string containing byte 147 from Word, then paste it into a browser input field, what are you saying the browser does? Automatically convert it to U+201C because it "knows" it's a smart quote and it has an internal 1252-UTF8 mapping?
AmbroseChapel
The encoding of the page *is* UTF-8, by the way. I may be able to track down the actual user and get them to tell me how exactly they entered the text, as it's an internal app.
AmbroseChapel
Is it only one user who gets the problem, then? If you paste in some non-ASCII yourself (here's some to test: “«æšʩЖあ☃»”), does it come through OK?
bobince
Yes it does. So, I entered a string containing byte 146 (CP-1252 right curly single quote). It went into the database as 146, and it came out when I produced the (UTF-8-encoded) XML, as 146. No complaints from the browser or my XML editor, despite 146/U+0092 being a control character in UTF-8.
AmbroseChapel
[more comments in answer]
bobince
You're right! I suppose it really comes down to me not having the right tools to look at characters and determine what they really are. But when I do sprintf("%X",ord($_)) on the characters in that string, it is indeed ‘\xE2\x80\x99’ for the smart quote! Thank you.
AmbroseChapel
+1  A: 

"Can I just set the accept-charset attribute on the form and have the browser do it for me?"

Only if you're prepared to trust "the browser" - that might be suitable in some applications, but in general it's leaving yourself wide open to mischief (or worse).

(Also see bobince's warnings about IE...)

Iain

bigiain