A simple HTML file:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
<form method="post" action="test.jsp" accept-charset="utf-8" enctype="application/x-www-form-urlencoded">
    <input type="text" name="P"/>
    <input type="submit" value="subMit"/>
</form>
</body>
</html>

The server sends the HTML file with the header Content-Type: text/html; charset=utf-8. Everything says: "dear browser, when you post this form, please post it UTF-8 encoded". The browser actually does this: every value entered in the input field is UTF-8 encoded. BUT the browser won't tell the server! The HTTP header of the POST request contains a Content-Type: application/x-www-form-urlencoded field, but the charset is omitted (tested with FF3.6 and IE8).

The problem is that the application server I use (Tomcat 6) expects the charset in the Content-Type header (as stated in RFC2388), like this: Content-Type: application/x-www-form-urlencoded;charset=utf-8. If the charset is omitted, it assumes ISO-8859-1, which is not the charset used for encoding. The result is broken data.
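The mismatch can be reproduced outside the server with plain JDK classes. This is only a sketch of the effect, not of Tomcat's actual parsing code; the class and method names (`FormCharsetMismatch`, `encodeUtf8`, `decode`) are made up for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

public class FormCharsetMismatch {

    /** Percent-encodes a field value as UTF-8, as a browser does for a UTF-8 page. */
    public static String encodeUtf8(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    /** Percent-decodes a value using whatever charset the server assumes. */
    public static String decode(String encoded, String charset) {
        try {
            return URLDecoder.decode(encoded, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String typed = "\u00e9";                    // e-acute typed into field P
        String wire = encodeUtf8(typed);            // bytes C3 A9, percent-encoded
        String wrong = decode(wire, "ISO-8859-1");  // two chars: U+00C3 U+00A9
        String right = decode(wire, "UTF-8");       // one char: U+00E9

        System.out.println(wire);                                     // %C3%A9
        System.out.println(wrong.length() + " vs " + right.length()); // 2 vs 1
    }
}
```

The request body is pure ASCII either way; the damage only happens when the percent-decoded bytes are turned back into characters with the wrong charset.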

Does someone have a clue how to force the current browsers to append the charset to the Content-Type header?

+2  A: 

Does someone have a clue how to force the current browsers to append the charset to the Content-Type header?

No, no browser has ever supplied a charset parameter with the application/x-www-form-urlencoded media type. What's more, the HTML spec, which defines that type, does not propose a charset parameter, so the server can't reasonably expect to get one.

(HTML4 does expect a charset for the subparts of a multipart/form-data submission, but even in that case no browser actually complies.)

accept-charset="utf-8"

accept-charset is broken in IE, and shouldn't be used. It won't make a difference either way for forms in pages served as UTF-8, but in other cases it can end up with inconsistent results.

No, with forms you just have to serve the page they're in as UTF-8, and the results should come back as UTF-8, with no identifying marks to tell you that (except potentially for the _charset_ hack, but Tomcat doesn't support that).

So you have to tell the Servlet container what encoding to use for parameters if you don't want it to fall back to its default (which is usually wrong). In a limited set of circumstances you may be able to call ServletRequest.setCharacterEncoding() to do this, but this tends to be brittle, and doesn't work at all for parameters taken from the query string. There's not a standardised Servlet-level fix for this, sadly. For Tomcat you usually have to muck about with the server.xml instead of being able to fix it in the app.
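The brittleness comes from ordering: the Servlet API documents that setCharacterEncoding() has no effect once request parameters have been read. A toy model of that behaviour might look like the sketch below. `LazyFormRequest` is a hypothetical stand-in, not Tomcat's implementation or a real servlet class, and the ISO-8859-1 default is the fallback described above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.HashMap;
import java.util.Map;

/** Toy stand-in for a servlet request with lazily parsed form parameters. */
public class LazyFormRequest {
    private final String rawBody;
    private String encoding = "ISO-8859-1"; // assumed container default
    private Map<String, String> params;     // null until first parameter access

    public LazyFormRequest(String rawBody) {
        this.rawBody = rawBody;
    }

    /** Only takes effect before the body has been parsed. */
    public void setCharacterEncoding(String enc) {
        if (params == null) {
            encoding = enc;
        }
    }

    /** First call parses and decodes the whole body with the current encoding. */
    public String getParameter(String name) {
        if (params == null) {
            params = new HashMap<>();
            for (String pair : rawBody.split("&")) {
                int eq = pair.indexOf('=');
                if (eq < 0) continue; // skip fragments without a value
                params.put(decode(pair.substring(0, eq)), decode(pair.substring(eq + 1)));
            }
        }
        return params.get(name);
    }

    private String decode(String s) {
        try {
            return URLDecoder.decode(s, encoding);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        LazyFormRequest early = new LazyFormRequest("P=%C3%A9");
        early.setCharacterEncoding("UTF-8");       // set before any read: honoured
        System.out.println(early.getParameter("P").length()); // 1

        LazyFormRequest late = new LazyFormRequest("P=%C3%A9&debug=1");
        late.getParameter("debug");                // something reads a parameter first
        late.setCharacterEncoding("UTF-8");        // too late: body already decoded
        System.out.println(late.getParameter("P").length());  // 2
    }
}
```

In a real application the usual way to make the call early enough is a filter that runs first in the chain, which is still only as reliable as the guarantee that nothing in front of it touches the parameters.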

bobince
Good answer, except for the Tomcat part. The `ServletRequest#setCharacterEncoding()` actually sets the charset to be used to parse the request **body** (in other words: POST parameters) and the `URIEncoding` in `server.xml` actually sets the charset to be used to parse the request **URI** (in other words: GET parameters). As he is using POST in his example, just using `ServletRequest#setCharacterEncoding()` is sufficient. More details in this article: http://balusc.blogspot.com/2009/05/unicode-how-to-get-characters-right.html
BalusC
It's sufficient, it can just be fragile. If any request parameter is read, it will cause the request body to be read and decoded, after which any call to `setCharacterEncoding` will be ineffective. It's easy for some sneaky middleware component to mess things up by jumping in and reading a parameter...
bobince
@bobince: you mean "HTTP spec", not "HTML spec", don't you? Actually the HTTP spec says "Data in character sets other than 'ISO-8859-1' or its subsets MUST be labeled with an appropriate charset value." in the "3.7 Media Types" section: http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.7
Eduard Wirch
No, I mean HTML spec: the HTTP spec has nothing to say about how form data is encoded into a request body. The reference to ISO-8859-1 in RFC2616 only applies to those “some media types” which define a `charset` parameter; neither `application/x-www-form-urlencoded` nor `multipart/form-data` define one, so the rule doesn't affect form submissions. `form-urlencoded` doesn't even include any direct high bytes, only `%`-encoded versions of same, so it wouldn't be affected by that even if there was a `charset` parameter for it.
bobince
Meanwhile, the content of `multipart` subparts' headers is not affected by RFC2616 but by normal MIME header rules; it should be possible as per RFC2388 (which defines `multipart/form-data`) to specify an encoding for subparts, but no browsers do so and very few servers will even bother to look for it. (And the odd one will break if you try, which is why no browser added support for it.)
bobince