My encoding is set to ISO-8859-1.

I'm making an AJAX call using jQuery.ajax to a servlet. The URL (after it has been serialized by jQuery) ends up looking like this:

https://myurl.com/countryAndProvinceCodeServlet?action=getProvinces&label=%C3%85land+Islands

The actual label value is Åland Islands. When this comes to the servlet, the value that I receive is:

Ã\u0085land Islands

But this is not what I want. I'd like it to get decoded to Åland Islands. I've tried many things (setting scriptCharset, converting the string using getBytes()), but nothing seems to work.

+5  A: 

It is an unfortunate part of the Servlet specification that the encoding used to decode query parameters is not settable by servlets themselves. Instead it is left as a configuration matter for the server.

This makes deployment of internationalised web sites an enormous pain, especially because the default encoding chosen by the Servlet spec is not the most-likely-to-be-useful UTF-8, but ISO-8859-1. (Actual ISO-8859-1, not even Windows code page 1252, which is the encoding browsers will really submit when told to use ISO-8859-1!)
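To see where the stray characters in the question come from, here is a minimal sketch that decodes the same percent-encoded value twice, once as ISO-8859-1 and once as UTF-8. The class and helper names are made up for illustration; only `java.net.URLDecoder` is a real API, and it stands in here for whatever the container does internally.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class DecodeMismatchDemo {
    // Hypothetical helper: URL-decodes a value using the named charset.
    static String decode(String encoded, String charset) {
        try {
            return URLDecoder.decode(encoded, charset);
        } catch (UnsupportedEncodingException e) {
            // Both charsets used below are guaranteed to exist on every JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String encoded = "%C3%85land+Islands";

        // The two bytes 0xC3 0x85 become two separate characters, U+00C3 and U+0085.
        String asLatin1 = decode(encoded, "ISO-8859-1");
        // The same two bytes are read as one UTF-8 sequence for U+00C5.
        String asUtf8 = decode(encoded, "UTF-8");

        System.out.println(asLatin1.equals("\u00C3\u0085land Islands"));
        System.out.println(asUtf8.equals("\u00C5land Islands"));
    }
}
```

The first result matches the garbled value reported in the question; the second is the intended label.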

So reconfiguring this is a server-specific matter. For Tomcat, it means editing server.xml: setting the URIEncoding attribute on the relevant <Connector> element to UTF-8.

The alternative approach, if you don't have access to the server config, is to take each submitted parameter name/value and re-encode them. Luckily ISO-8859-1 preserves every byte submitted as a Unicode code point of the same number, so to convert the string as if it had been interpreted properly as UTF-8 in the first place, you can simply encode each String to a byte array using ISO-8859-1, and then decode the bytes back to a String using UTF-8. Of course if someone then re-configures the server to use UTF-8 you've got a problem...
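The re-encoding step described above can be sketched roughly as follows. The class and method names are hypothetical, and the sketch assumes Java 7 or later for `StandardCharsets`; it also assumes the container really did decode the parameter as ISO-8859-1.

```java
import java.nio.charset.StandardCharsets;

public class ReencodeDemo {
    // Hypothetical helper: recovers the original bytes by encoding as ISO-8859-1,
    // then decodes those bytes as the UTF-8 they actually were.
    static String latin1ToUtf8(String misdecoded) {
        byte[] rawBytes = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(rawBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The value the servlet receives for label=%C3%85land+Islands
        String received = "\u00C3\u0085land Islands";
        String fixed = latin1ToUtf8(received);
        System.out.println(fixed.equals("\u00C5land Islands"));
    }
}
```

Applying the helper to a string that was already decoded correctly would corrupt it, which is exactly the problem that appears if the server is later switched to UTF-8.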

bobince
Note that the inability to configure the encoding for query parameters using the Servlet API only applies to GET query parameters (in the URL), not to POST parameters (in the request body). Also note that the browser which *actually* sends CP1252 is only MSIE, not others. For the rest, great answer as always :)
BalusC
Great answer. I also came to somewhat the same conclusion after reading http://wiki.apache.org/tomcat/FAQ/CharacterEncoding. I have updated my question as well -- it also looks like jQuery might be to blame (not url-encoding properly?)
Vivin Paliath
@BalusC: actually all browsers have sent cp1252 for quite some time, even on non-Windows platforms. Some other ISO-8859-family encodings also mutate to their Windows equivalents. HTML5 is finally standardising this unfortunate wart. @Vivin: no, `%C3%85` is the correct way to send a UTF-8-encoded `Å`; the JavaScript `encodeURIComponent` function used by jQuery always chooses UTF-8 because it's the only sensible encoding to use in a modern site. It's just a pity Servlet's default doesn't agree.
bobince
@bobince Yes, just figured that out after some experimentation (`escape` vs `encodeURIComponent`)
Vivin Paliath
Yeah, `escape`/`unescape` is a bit naughty and should usually be avoided. Apart from its use of ISO-8859-1 to URL-encode, and the non-standard handling of non-ISO-8859-1 characters, it fails to encode the `+` character, which can lead to unexpected spaces.
bobince
+3  A: 

Bobince already went into detail, so I'll skip that part. If you really have no control over the container-managed URI encoding, your best bet is to take the URI decoding into your own hands. You can obtain the raw GET query string in a servlet with HttpServletRequest#getQueryString(). Then it's a matter of splitting it into name/value pairs and URL-decoding each one as UTF-8 yourself, using the usual String methods and URLDecoder#decode().

for (String parameter : request.getQueryString().split("&")) {
    String[] pair = parameter.split("=");
    String name = URLDecoder.decode(pair[0], "UTF-8");
    String value = URLDecoder.decode(pair[1], "UTF-8");
    // ...
}

Needless to say, keep in mind that this isn't a solution, but a workaround.

BalusC
+1 I've done this as a last resort before. You'll want to check that `pair` has the expected length, though, to avoid a non-`a=b`-format value in the query string causing an exception. (Ideally, splitting on only the first `=` may be a good idea too.)
bobince
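For reference, a more defensive variant of the loop in the answer, following the suggestions in the comment above, might look like the sketch below. The class name, the `parse` method and the choice of a `Map<String, List<String>>` are assumptions made for illustration, not part of the original answer; the sketch assumes Java 8 or later and takes the raw query string as a plain argument so it can run outside a servlet container.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class QueryStringParser {
    // Hypothetical helper: parses a raw query string, decoding names and values as UTF-8.
    static Map<String, List<String>> parse(String queryString) {
        Map<String, List<String>> params = new LinkedHashMap<>();
        // getQueryString() returns null when the URL has no query string.
        if (queryString == null || queryString.isEmpty()) {
            return params;
        }
        for (String parameter : queryString.split("&")) {
            if (parameter.isEmpty()) {
                continue; // skip empty segments such as "a=1&&b=2"
            }
            // Limit of 2 splits on the first '=' only, so "x=a=b" keeps "a=b" as the value.
            String[] pair = parameter.split("=", 2);
            String name = decode(pair[0]);
            // A bare name with no '=' gets an empty value instead of throwing.
            String value = pair.length > 1 ? decode(pair[1]) : "";
            params.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        }
        return params;
    }

    private static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        String raw = "action=getProvinces&label=%C3%85land+Islands&flag&x=a=b";
        System.out.println(parse(raw).keySet());
    }
}
```

In a servlet, the argument would be `request.getQueryString()`. Malformed percent sequences still make `URLDecoder.decode` throw an `IllegalArgumentException`, so callers may want to handle that as well.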