I want to standardise on UTF-8 for our web application. All our databases and internet stuff are in UTF-8, and all our web servers are sending the charset=utf-8 HTTP header. However, I've discovered that by changing the encoding in my Firefox (View -> Character Encoding) to something else, I can enter Latin-9 characters into a form, and PHP just treats them as malformed UTF-8.

How much do I have to worry about that? Is it possible for the user's web browser to override the utf8 charset header and send non-UTF8?

Update: Several people have suggested accept-charset on the individual forms. However I'd rather not have to change every webform. Assuming I can control the HTTP content-type header, and it's set to UTF8, do I have anything to worry about?

+3  A: 

Try adding the accept-charset attribute to your form elements.

Lars Haugseth
+1  A: 

Place an accept-charset="UTF-8" attribute on the form element; that will cause the form post to be UTF-8 regardless of the encoding of the page content.

AnthonyWJones
A: 

Is it possible for the user's web browser to override the utf8 charset header and send non-UTF8?

Of course. You don't control the client, and the client can do whatever it wants, including letting users override the normal encodings and cause junk (or what passes for junk) to be sent to your server.

That said, it sounds like you've taken most of the important steps here. Your actual HTML document is UTF-8 encoded and explicitly marked as such, which means that browsers will generally default to submitting forms in that encoding too. (Note that the HTML spec doesn't require this. Specifying the accept-charset on the form explicitly is the only spec-compliant guarantee.) I suspect that this will work as expected in all modern browsers, and you could test this easily.

On the server, your job is always to validate your input to the extent that it's important to your service. Although the vast majority of your users will be benevolent and using modern standard browsers, the HTTP protocol is open, and both wacky users and malicious hackers are out there, and both can throw any kind of data they want at you. Make sure that you're not making assumptions about data encodings when security or authenticated data is involved, and sanitize this stuff before you shove it into databases.
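
As a rough illustration of that server-side check, here is a minimal sketch in Python rather than PHP. The function names `is_valid_utf8` and `decode_form_value` are made up for this example, and it assumes you have access to the raw submitted bytes. In PHP, `mb_check_encoding($value, 'UTF-8')` plays a similar role.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the raw bytes decode cleanly as UTF-8."""
    try:
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def decode_form_value(raw: bytes) -> str:
    """Decode a submitted value, rejecting anything that is not valid UTF-8."""
    if not is_valid_utf8(raw):
        raise ValueError("form value is not valid UTF-8")
    return raw.decode("utf-8")


# "cafe" with an e-acute, as sent by a browser forced to Latin-9 (ISO-8859-15):
# the accented letter arrives as the single byte 0xE9, which is malformed UTF-8.
latin9_bytes = "caf\u00e9".encode("iso-8859-15")

# The same text sent as UTF-8: the accented letter is the two bytes 0xC3 0xA9.
utf8_bytes = "caf\u00e9".encode("utf-8")
```

Note that a check like this only catches input that is not well-formed UTF-8. Pure ASCII input passes whichever encoding the browser used, since ASCII bytes are identical in Latin-9 and UTF-8.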

quixoto