ansaurus

Question

Answer 1

+6 A:

When dealing with Strings, always remember: byte != char. In your first example you have the char c3, not the byte c3, and that is a huge difference: the byte would be part of a UTF-8 sequence, but the char is already Unicode. So when you convert that String to UTF-8, the Unicode character c3 (U+00C3) must become the byte sequence c3 83.
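To make the byte/char distinction concrete, here is a minimal sketch. The class name ByteVsChar and the hex helper are made up for illustration; only String.getBytes and the String(byte[], Charset) constructor come from the JDK.

```java
import java.nio.charset.StandardCharsets;

public class ByteVsChar {
    // Hypothetical helper: formats bytes as lowercase hex, e.g. "c3 83".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The char U+00C3 is already Unicode; encoding it to UTF-8 yields two bytes.
        System.out.println(hex("\u00C3".getBytes(StandardCharsets.UTF_8))); // c3 83

        // The bytes c3 a9 form one UTF-8 sequence; decoding them yields the single char U+00E9.
        byte[] utf8 = {(byte) 0xC3, (byte) 0xA9};
        String decoded = new String(utf8, StandardCharsets.UTF_8);
        System.out.println(decoded.length() + " " + (int) decoded.charAt(0)); // 1 233
    }
}
```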

So the question is: How did you get the String? There must be a bug in that code which doesn't properly handle UTF-8 encoded byte sequences.

ISO-8859-1 usually works because it maps every byte value between 0 and 255 to the char with the same code point and back again, so UTF-8 encoded byte sequences pass through it unmodified.
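A rough sketch of why that round trip is safe, and how it is commonly used to recover text that was decoded with the wrong charset. The class and method names (Latin1RoundTrip, redecodeAsUtf8, isLossless) are hypothetical, and this only masks the real bug in whatever produced the String.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Latin1RoundTrip {
    // Hypothetical helper: recovers text whose UTF-8 bytes were decoded as ISO-8859-1.
    static String redecodeAsUtf8(String misdecoded) {
        byte[] originalBytes = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    // Checks that every byte value 0..255 survives bytes -> String -> bytes unchanged.
    static boolean isLossless() {
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) {
            all[i] = (byte) i;
        }
        String asChars = new String(all, StandardCharsets.ISO_8859_1);
        return Arrays.equals(all, asChars.getBytes(StandardCharsets.ISO_8859_1));
    }

    public static void main(String[] args) {
        System.out.println(isLossless()); // true
        System.out.println(redecodeAsUtf8("d\u00C3\u00A9jeuner").equals("d\u00E9jeuner")); // true
    }
}
```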

Your last example is also wrong: the char e9 is é in both ISO-8859-1 and Unicode. It is not valid UTF-8, because it is a char rather than a byte, and because as a byte it would be missing the c3 prefix (é is the two-byte sequence c3 a9 in UTF-8). That said, it correctly represents the Unicode string you seek.

Aaron Digulla 2009-10-29 08:48:23

Thanks for the very informative answer. So it boils down to request.getParameter() in javax.servlet.http.HttpServletRequest not correctly handling UTF-8 encoded byte sequences, right? I have called req.setCharacterEncoding("UTF-8") on it, though. What possible workaround am I left with? It still isn't clear to me how I can get the original data for my parameters (their bytes, not chars), so that I can have some _non-buggy_ String implementation work out the right UTF string from them...

2009-10-29 10:33:15

My guess is that the sender encodes the data with UTF-8 but fails to set the correct HTTP headers for this.

Aaron Digulla 2009-10-29 11:25:30

So make sure that the PHP part generates web pages that correctly specify their encoding, especially in forms.

Aaron Digulla 2009-10-29 11:26:28

After that, the Java code should decode the data correctly without any manual corrections by you.

Aaron Digulla 2009-10-29 11:27:06

Yes, you are totally right. The culprit was the PHP cURL code, which only worked for me in POST mode. Also, on the return path (getting the string back from the database and to PHP through Groovy), I had some more problems that I solved by following the instructions given here: http://mathiasrichter.blogspot.com/2009/10/character-encoding-utf-8-with.html

2009-10-29 20:05:57

okay ... do I get "correct answer", then? :)

Aaron Digulla 2009-10-30 08:26:26

yes sorry, I didn't know I could do that :-) Thanks a lot!

2009-10-30 10:04:43

Answer 2

+1 A:

If you start with the Java String where "d\u00C3\u00A9jeuner".equals(stmt) then the data is already corrupt at this stage.

A Java char is not a C char. A char in Java is 16 bits wide and implicitly contains UTF-16 encoded data. Trying to store data in any other encoding in a Java char/String type is asking for trouble. Character data in any other encoding should be stored as byte data.

If you are reading the parameter using the servlet API, then it is likely that the HTTP request contains inconsistent or insufficient encoding information. Check the calling code and the HTTP headers. It is likely that the client is encoding the data as UTF-8, but the servlet is decoding it as ISO-8859-1.
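As an illustration of that mismatch, the following sketch reproduces the corrupt String outside of any servlet container. It is only a simulation: MisdecodeDemo and decodeAsLatin1 are hypothetical names, and the real decoding happens inside the container.

```java
import java.nio.charset.StandardCharsets;

public class MisdecodeDemo {
    // Simulates a receiver that decodes incoming bytes with the wrong charset.
    static String decodeAsLatin1(byte[] wireBytes) {
        return new String(wireBytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "d\u00E9jeuner";

        // The client encodes as UTF-8: 8 chars become 9 bytes (U+00E9 takes two).
        byte[] wireBytes = original.getBytes(StandardCharsets.UTF_8);

        // The receiver decodes as ISO-8859-1: each byte becomes one char.
        String received = decodeAsLatin1(wireBytes);
        System.out.println(received.equals("d\u00C3\u00A9jeuner")); // true
        System.out.println(received.length()); // 9
    }
}
```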

McDowell 2009-10-29 11:00:07

Answer 3

A:

Hi, I'm having a very similar problem, except that my form uses a GET request, not a POST request.

So, my URL is something like: http://localhost:4502/form.jsp?query=d%C3%A9jeuner

request.getCharacterEncoding() = ISO-8859-1
response.getCharacterEncoding() = UTF-8
request.getParameter("query") = dÃ©jeuner

So should the HttpServletRequest use UTF-8 to decode the request parameter (which it clearly doesn't), or is this simply a browser error because the browser does not set any character encoding header (which again doesn't make much sense, because it's not a POST request)? Here is the full set of headers; notice the %C3%A9 in the URL.

http://localhost:4502/form.jsp?query=d%C3%A9juerne

GET /form.jsp?query=d%C3%A9juerne HTTP/1.1
Host: localhost:4502
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-GB; rv:1.9.0.17) Gecko/2010010604 Ubuntu/9.04 (jaunty) Firefox/3.0.17
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-gb,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive

The problem I'm having is that I actually copied and pasted the query into the browser form and it was incorrectly encoded, both in Chrome and Firefox.
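For what it's worth, the value returned by request.getParameter("query") above is consistent with the query string being percent-decoded as ISO-8859-1 instead of UTF-8. The sketch below shows both outcomes with java.net.URLDecoder; it is not what the container does internally, and QueryDecodeDemo and its decode wrapper are names made up for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class QueryDecodeDemo {
    // Hypothetical wrapper: percent-decodes a raw query value with the given charset name.
    static String decode(String rawValue, String charsetName) {
        try {
            return URLDecoder.decode(rawValue, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String raw = "d%C3%A9jeuner";

        // Decoded as UTF-8, the bytes c3 a9 become the single char U+00E9.
        System.out.println(decode(raw, "UTF-8").equals("d\u00E9jeuner")); // true

        // Decoded as ISO-8859-1, each byte becomes its own char: U+00C3 U+00A9.
        System.out.println(decode(raw, "ISO-8859-1").equals("d\u00C3\u00A9jeuner")); // true
    }
}
```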

marto 2010-02-16 11:59:57

Answer 4

A:

After some further investigation I found this answer

http://stackoverflow.com/questions/138948/how-to-get-utf-8-working-in-java-webapps.

It's all about setting URIEncoding="UTF-8" in the Tomcat connector.

Now to figure out how to do this in the CMS we use (CQ5/Day).

marto 2010-02-16 14:11:12

Hi, welcome to Stack Overflow! Please do not post your own questions as answers to other people's questions! They will get lost in the noise and nobody will respond to them. Just post a question by clicking the `Ask Question` button at the top right. Once you have done that, please delete this noise from this topic as well.

BalusC 2010-02-16 14:15:15


utf-8 decoding in java
