ansaurus

Question

Unicode problem with JSF and HTML forms?

Answer 1

A:

A browser can't send Unicode over the wire as such; it has to encode the characters as bytes in some way. From the output of the exception (two kanji became five characters), I'm guessing the data was encoded as UTF-8 and the string title wasn't decoded correctly on the server side of the component.
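To make that guess concrete, here is a minimal sketch of the mismatch outside any framework. It is not taken from the asker's application: the class name `MojibakeDemo`, the sample string, and the choice of ISO-8859-1 as the wrong charset are all assumptions for illustration.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Hypothetical helper: encode as UTF-8, then decode with the wrong charset,
    // which is what happens if a server assumes ISO-8859-1 for UTF-8 form data.
    static String misdecode(String original) {
        byte[] wire = original.getBytes(StandardCharsets.UTF_8);
        return new String(wire, StandardCharsets.ISO_8859_1);
    }

    // Undo the damage: ISO-8859-1 maps every byte to exactly one char, so the
    // original bytes can be recovered and decoded again as UTF-8.
    static String repair(String garbled) {
        byte[] wire = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(wire, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String title = "\u65e5\u672c"; // two kanji, assumed sample input
        String garbled = misdecode(title);
        System.out.println(title.length() + " chars became " + garbled.length());
        System.out.println(repair(garbled).equals(title));
    }
}
```

Each kanji here takes three bytes in UTF-8, so the wrongly decoded string has more characters than the original. Some of those characters can be non-printing control codes, so the number you actually see in a log may be lower than the real length.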

I suggest setting the accept-charset attribute on the form. That should tell everyone to behave.

Aaron Digulla 2009-05-14 15:01:33

Your guess is my guess too. I need to use UTF-8 (my educational application may include Chinese and Sanskrit in the same input element). I'm not sure how setting accept-charset on the client-side form will make the server-side component decode UTF-8 correctly. How does that work? Anyhow, what is the syntax exactly? I'll give it a try...

Aaron Watters 2009-05-14 16:32:30

A form post/get is actually an HTTP request. With accept-charset, you tell the browser which charset the server expects. The browser will also put this information into a header field of the request so your framework will see it. That way, everyone involved will get a hint what to do.
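For what it's worth, here is a rough sketch of what "the framework will see it" could look like on the server side. It assumes the charset arrives as a `charset=` parameter in the Content-Type header value. Whether a given browser actually sends that parameter is not guaranteed, so the sketch takes a fallback. The class `RequestCharset` and the method `fromContentType` are made-up names, not part of JSF or the servlet API.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class RequestCharset {
    // Hypothetical helper: pick the charset named in a Content-Type header
    // value, or fall back when the header is missing or carries no charset.
    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            // Case-insensitive check for a "charset=" parameter.
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    return fallback; // unknown or malformed charset name
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(fromContentType(
            "application/x-www-form-urlencoded; charset=UTF-8",
            StandardCharsets.ISO_8859_1));
        System.out.println(fromContentType(
            "application/x-www-form-urlencoded",
            StandardCharsets.ISO_8859_1));
    }
}
```

If the second case is what your server receives, the fallback decides how the body is decoded, which is why a wrong server default can garble otherwise correct UTF-8 input.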

Aaron Digulla 2009-05-15 07:40:40

Answer 2

A:

Maybe you should think about localization.

2009-05-14 15:32:21

Answer 3

+1 A:

Questions I would be asking:

How is the form encoding the request (application/x-www-form-urlencoded or multipart/form-data)? Multi-part data will be decoded using a 3rd party MIME parser, so there is scope for trouble there. If the data is url-encoded, is it being escaped properly?
What charsets is the browser accepting?
What encoding is the server detecting? Is it a Unicode character set?
Is it just the logging that is writing as a lossy encoding (e.g. MacRoman)? What default charset is the server using?
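A couple of these questions can be checked with a few lines of plain Java. The sketch below is only an illustration: the class `FormDecodeCheck`, its `decode` helper, and the percent-escaped sample value are assumptions, not anything from the asker's code. It prints the JVM default charset and decodes the same url-encoded value with two different charsets.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.charset.Charset;

public class FormDecodeCheck {
    // Hypothetical helper: decode one url-encoded form value with a named charset.
    static String decode(String encoded, String charsetName) {
        try {
            return URLDecoder.decode(encoded, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        // The JVM default is used whenever no charset is given explicitly,
        // for example by String.getBytes() with no argument.
        System.out.println("default charset: " + Charset.defaultCharset());

        // Assumed sample: two kanji, UTF-8 encoded, then percent-escaped.
        String wire = "%E6%97%A5%E6%9C%AC";
        System.out.println(decode(wire, "UTF-8").length());      // decoded as intended
        System.out.println(decode(wire, "ISO-8859-1").length()); // one char per byte
    }
}
```

If the two lengths differ for your real input, the escaping is probably fine and the problem is the charset used to decode it.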

Since what you see on a console isn't necessarily what is in the string, you can dump the Unicode code points using this code:

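  public static void printCodepoints(char[] s) {
    // Walk the array one code point at a time. codePointAt combines a valid
    // surrogate pair and returns an unpaired surrogate as-is, so a trailing
    // high surrogate no longer reads past the end of the array.
    for (int i = 0; i < s.length; ) {
      int codePoint = Character.codePointAt(s, i);
      System.out.println(Integer.toHexString(codePoint));
      i += Character.charCount(codePoint);
    }
  }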
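As a usage note, here is a small variant of printCodepoints that returns the hex values on one line, which is easier to paste into a log message. The class `CodePointDump`, the method `hexCodePoints`, and the sample strings are assumptions for illustration only.

```java
public class CodePointDump {
    // Hypothetical variant of printCodepoints: returns the hex code points
    // joined by spaces instead of printing one per line.
    static String hexCodePoints(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(Integer.toHexString(cp));
            i += Character.charCount(cp); // 2 for a surrogate pair, else 1
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(hexCodePoints("\u65e5\u672c"));  // two kanji
        System.out.println(hexCodePoints("a\uD834\uDD1E")); // 'a' plus one surrogate pair
    }
}
```

A correctly decoded title should show one value per intended character. A string that was decoded with the wrong charset typically shows several values below 100 (hex) in place of each kanji.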

McDowell 2009-05-14 15:53:37

It's a multipart form. Maybe I'll try switching to url-encoding. thx.

Aaron Watters 2009-05-14 16:53:54

HEY! This appears to work! Just change to standard post encoding. Thanks

Aaron Watters 2009-05-14 17:00:56

I would not be so quick to celebrate. I've seen multipart/form-data used to _overcome_ character bugs and it is required if you want to do form file upload. Still, at least you have an idea about where the problem lies.

McDowell 2009-05-14 17:48:05
