ansaurus

Question

Answer 1

+6 A:

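There's no such thing as a "UTF-8 string" in Java. A String always holds Unicode characters (as UTF-16 code units); UTF-8 only comes into play when text is converted to or from bytes.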

When you call String.getBytes() without specifying an encoding, that uses the platform default encoding - that's almost always a bad idea.
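As a minimal sketch of the alternative, the encoding can be named explicitly when converting to bytes. The class and method names below are made up for illustration, and StandardCharsets assumes Java 7 or later.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ExplicitCharsetDemo {
    // Encodes the text as UTF-8 and returns each byte as an unsigned decimal value.
    static int[] utf8Bytes(String text) {
        // Naming the charset makes the result independent of the platform default.
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        int[] unsigned = new int[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            unsigned[i] = bytes[i] & 0xFF; // Java bytes are signed, so mask to 0..255
        }
        return unsigned;
    }

    public static void main(String[] args) {
        // U+544A is the character 告.
        System.out.println(Arrays.toString(utf8Bytes("\u544A"))); // prints [229, 145, 138]
    }
}
```

The three values are the UTF-8 bytes E5 91 8A. They describe how the character is encoded as bytes, not what the String itself contains.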

You shouldn't have to do anything to get the right characters here - the request should be handling it all for you. If it's not doing so, then chances are it's lost data already.

Could you give an example of what's actually going wrong? Specify the Unicode values of the characters in the string you're receiving (e.g. by using toCharArray() and then converting each char to an int) and what you expected to receive.

EDIT: To diagnose this, use something like this:

public static void dumpString(String text) {
    for (int i = 0; i < text.length(); i++) {
        System.out.println(i + ": " + (int) text.charAt(i));
    }
}

Note that this gives the decimal value of each char in the string (strictly speaking, each UTF-16 code unit). If you would rather see hex values, Integer.toHexString() will do the job. The main point is that it dumps the Unicode characters in the string rather than an encoded byte representation.
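For hex output, a variant along these lines prints each character in U+XXXX notation. This is only a sketch: the class and method names are hypothetical, and it relies on String.codePoints() (Java 8 or later) so that a character outside the Basic Multilingual Plane is reported as one value rather than as two surrogates.

```java
class CodePointDump {
    // Returns one "U+XXXX" label per Unicode code point in the text.
    static String[] describe(String text) {
        return text.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        // Written with an escape so the source file's own encoding cannot interfere.
        for (String label : describe("\u544A")) {
            System.out.println(label); // prints U+544A
        }
    }
}
```

If the string was decoded correctly, 告 shows up as the single value U+544A. Three values such as U+00E5 U+0091 U+008A would suggest the bytes were decoded with the wrong charset before the string ever reached this code.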

Jon Skeet 2010-10-29 07:18:36

告 This character, for example, needs to be converted. I get 229 145 138 as its decimal representation, which is correct according to http://www.ansell-uebersetzungen.com/gbuni.html because it matches this hex representation: E5 91 8A. So now I need it to be converted to Unicode.

Rob Hufschmitt 2010-10-29 07:28:38

So in my opinion the request sends the right characters, but I cannot read them in Java; they need to be converted to Unicode.

Rob Hufschmitt 2010-10-29 07:30:12

@Rob: No, that should appear in the string as U+544A. The hex representation you've quoted is the UTF-8 representation - which is *never* going to be what's in the string itself. You say you "get" 229 145 138 - when you do what? I'll edit my answer with some diagnostic code.

Jon Skeet 2010-10-29 07:42:20

Answer 2

+2 A:

First make sure that the data is actually encoded as UTF-8.

There is some inconsistency between browsers regarding the encoding used when sending HTML form data. The safest way to send UTF-8 encoded data from a web form is to put that form on a page that is served with the Content-Type: text/html; charset=utf-8 header or contains a <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> meta tag.

Now, to properly decode the data, call request.setCharacterEncoding("UTF-8") in your servlet before the first call to request.getParameter().

The servlet container takes care of the encoding for you. If you use setCharacterEncoding() properly you can expect getParameter() to return normal Java strings.
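To see why the declared encoding matters, here is a small simulation that does not use the servlet API at all. It assumes the browser sent 告 as the UTF-8 bytes E5 91 8A and decodes them twice: once as UTF-8, and once as ISO-8859-1, which is the default the servlet specification prescribes when no request encoding is set. The class and method names are hypothetical, and the percent-decoding of form data that a real container performs first is left out.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RequestDecodingDemo {
    // Stand-in for the container's decoding step: raw bytes plus a charset give a String.
    static String decode(byte[] wireBytes, Charset charset) {
        return new String(wireBytes, charset);
    }

    public static void main(String[] args) {
        byte[] wireBytes = {(byte) 0xE5, (byte) 0x91, (byte) 0x8A};

        // Correct charset: a single character, U+544A.
        String right = decode(wireBytes, StandardCharsets.UTF_8);
        // Wrong charset: three characters, U+00E5 U+0091 U+008A (229 145 138 in decimal).
        String wrong = decode(wireBytes, StandardCharsets.ISO_8859_1);

        System.out.println(right.length() + " char(s), first = " + (int) right.charAt(0));
        System.out.println(wrong.length() + " char(s), first = " + (int) wrong.charAt(0));
    }
}
```

A string of three characters with the values 229 145 138 is therefore a hint that the bytes were decoded as ISO-8859-1, which is the case setCharacterEncoding("UTF-8") is meant to prevent.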

Alexandre Jasmin 2010-10-29 07:33:54

The charset is set correctly in the HTML.

Rob Hufschmitt 2010-10-29 07:38:13

Now when I convert, I get the Unicode value 63 for each character, so I guess my conversion is still wrong.

Rob Hufschmitt 2010-10-29 07:38:56

@Rob You shouldn't have to do any manual conversion. You should call `setCharacterEncoding("UTF-8")` and use `request.getParameter()` to get a normal Java Unicode string. I suppose your code already works with plain ASCII characters?

Alexandre Jasmin 2010-10-29 07:47:10

And please use @Jon Skeet's code snippet to get the Unicode code point of each character instead of `String.getBytes()`.

Alexandre Jasmin 2010-10-29 07:53:54

@Alexandre Jasmin: Thank you so much, you really made my day!

Rob Hufschmitt 2010-10-29 07:55:13

You're welcome.

Alexandre Jasmin 2010-10-29 07:56:07

Answer 3

A:

You may also need a filter that takes care of the encoding of your requests. For example, the Spring Framework provides one: org.springframework.web.filter.CharacterEncodingFilter

endryha 2010-10-29 08:33:54

Answer 4

A:

String question = request.getParameter("searchWord");

is all you have to do in your servlet code. At this point you do not have to deal with encodings, charsets, etc. This is all handled by the servlet infrastructure. When you notice problems such as �, ?, or Ã¼ being displayed somewhere, there may be something wrong with the request the client sent. But without knowing anything about the infrastructure or seeing the logged HTTP traffic, it is hard to tell what is wrong.
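Each of those three symptoms usually points to a different kind of charset mismatch. The sketch below reproduces them with plain JDK calls; it is an illustration under the assumption that ISO-8859-1 is the "other" charset involved, and the class and method names are invented for the example.

```java
import java.nio.charset.StandardCharsets;

class MojibakeSymptoms {
    // UTF-8 bytes misread as ISO-8859-1: "ü" (U+00FC) turns into "Ã¼".
    static String utf8ReadAsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Encoding into a charset that lacks the character substitutes '?' (decimal 63).
    static String throughLatin1(String text) {
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1, StandardCharsets.ISO_8859_1);
    }

    // ISO-8859-1 bytes misread as UTF-8: malformed input becomes U+FFFD, shown as �.
    static String latin1ReadAsUtf8(String text) {
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Values are printed as numbers so the console's own encoding cannot distort them.
        utf8ReadAsLatin1("\u00FC").chars().forEach(c -> System.out.println("misread UTF-8: " + c));
        throughLatin1("\u544A").chars().forEach(c -> System.out.println("unmappable: " + c));
        latin1ReadAsUtf8("\u00FC").chars().forEach(c -> System.out.println("malformed: " + c));
    }
}
```

The second case matches the value 63 reported in the comments on Answer 2: once a character has been replaced by '?', the original data is gone and cannot be recovered by any later conversion.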

Michael Konietzka 2010-10-29 09:47:07


How to convert UTF8 to unicode
