I have an HTML form generated by JSF that maps an input element to a bean setter, and it looks to me like JSF is garbling Unicode input on the way in. To test this, I put the following exception in the setter:

public void setTitle(String title){
    System.out.println("title set with: "+title+"\n");
    if (title.startsWith("xxx")) {
        throw new RuntimeException("debug exception "+title);
    }
    this.title = title;
}

Then I entered the following text into the form's title input element: "xxxx 海陆". When I submit the form, the log prints:

title set with: xxxx ?????

(on a Unicode-compatible Mac terminal), and I get an error message on the response HTML page:

Error setting property 'title' in bean of type   
uk.ac.lancs.e_science.sakaiproject.api.blogger.post.Post: 
java.lang.RuntimeException: debug exception xxxx ���??

Any clues on what's wrong? Am I just full of it and have the wrong diagnosis? I think I've eliminated all other possibilities. Unicode seems to work fine in other components of the same application.

A: 

A browser can't send Unicode over the wire as such; it has to encode the characters as bytes in some way. From the output of the exception (two Chinese characters became five characters), I'm guessing the data was encoded as UTF-8 and the string title wasn't decoded correctly after it was received on the server side of the component.
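
To illustrate the kind of mismatch being guessed at here, a minimal sketch follows. It assumes the browser sent UTF-8 and the server decoded with ISO-8859-1; the second charset is purely an assumption for illustration, and the class and method names are hypothetical. With this particular pair the two characters become six, not five, so the charset actually used on the server may well be a different one.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Encode as UTF-8 (what the browser is assumed to send), then decode the
    // bytes with ISO-8859-1 (an assumed, wrong server-side charset).
    static String misdecode(String text) {
        byte[] wire = text.getBytes(StandardCharsets.UTF_8);
        return new String(wire, StandardCharsets.ISO_8859_1);
    }

    // Undo the damage: recover the original bytes, then decode them as UTF-8.
    // This only works because ISO-8859-1 maps every byte to exactly one char.
    static String repair(String garbled) {
        byte[] wire = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(wire, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u6d77\u9646"; // the two characters from the question
        String garbled = misdecode(original);
        System.out.println(original.length() + " chars sent, "
                + garbled.length() + " chars after the wrong decode");
        System.out.println("repair restores original: "
                + repair(garbled).equals(original));
    }
}
```

Each of the two characters takes three bytes in UTF-8, and a single-byte charset turns every byte into its own character, which is why the string grows.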

I suggest setting the accept-charset attribute on the form. That should tell everyone to behave.

Aaron Digulla
Your guess is my guess too. I need to use UTF-8 (my educational application may include Chinese and Sanskrit in the same input element). I'm not sure how setting accept-charset on the client-side form will make the server-side component decode UTF-8 correctly. How does that work? Anyhow, what is the syntax exactly? I'll give it a try...
Aaron Watters
A form POST/GET is actually an HTTP request. With accept-charset, you tell the browser which charset the server expects, for example <form method="post" accept-charset="UTF-8">. The browser may also put this information into a header field of the request so your framework will see it. That way, everyone involved will get a hint about what to do.
Aaron Digulla
A: 

Maybe you should think about localization.

A: 

Questions I would be asking:

  • How is the form encoding the request (application/x-www-form-urlencoded or multipart/form-data)? Multipart data will be decoded using a third-party MIME parser, so there is scope for trouble there. If the data is URL-encoded, is it being escaped properly?
  • What charsets is the browser accepting?
  • What encoding is the server detecting? Is it a Unicode character set?
  • Is it just the logging that is writing with a lossy encoding (e.g. MacRoman)? What default charset is the server using?
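
Two of the questions above (is URL-encoded data escaped and decoded properly, and what is the server's default charset) can be explored with a small standalone sketch. It uses only the JDK's URLEncoder and URLDecoder and is not what JSF does internally; the class and method names are hypothetical, and ISO-8859-1 is just an assumed example of a wrong server-side charset.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.Charset;

public class FormDecodeDemo {
    // Percent-encode a field value the way a UTF-8 form submission would.
    static String encodeField(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // Decode a percent-encoded field value using the given charset name.
    static String decodeField(String raw, String charsetName) {
        try {
            return URLDecoder.decode(raw, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String title = "xxxx \u6d77\u9646"; // the test input from the question
        String raw = encodeField(title);
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("on the wire:     " + raw);
        System.out.println("decoded as UTF-8 matches input: "
                + decodeField(raw, "UTF-8").equals(title));
        System.out.println("length when decoded as ISO-8859-1: "
                + decodeField(raw, "ISO-8859-1").length());
    }
}
```

If the escaped bytes on the wire are correct, a wrong result after decoding points at the charset the server chose rather than at the browser.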

Since what you see on a console isn't necessarily what is in the string, you can dump the Unicode code points using this code:

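  public static void printCodepoints(char[] s) {
    for (int i = 0; i < s.length; ) {
      // codePointAt joins a valid surrogate pair into one code point and
      // returns an unpaired surrogate unchanged, so a trailing high
      // surrogate cannot cause an out-of-bounds read.
      int codePoint = Character.codePointAt(s, i);
      System.out.println(Integer.toHexString(codePoint));
      // Advance by 1 char for BMP code points, 2 for supplementary ones.
      i += Character.charCount(codePoint);
    }
  }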
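
To make the point about console output concrete, here is a minimal sketch of a lossy log sink. US-ASCII stands in for whatever lossy encoding the real console or log might use (that choice is an assumption, as are the class and method names); the code points are read from the string itself, so they are unaffected.

```java
import java.nio.charset.StandardCharsets;

public class LossyConsoleDemo {
    // Simulate a console that cannot represent the text: characters with no
    // US-ASCII mapping are replaced by '?' during encoding.
    static String asSeenOnLossyConsole(String text) {
        byte[] out = text.getBytes(StandardCharsets.US_ASCII);
        return new String(out, StandardCharsets.US_ASCII);
    }

    // Hex code points of the string, separated by spaces.
    static String hexCodePoints(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(Integer.toHexString(cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        String title = "xxxx \u6d77\u9646"; // the test input from the question
        System.out.println("console view: " + asSeenOnLossyConsole(title));
        System.out.println("code points:  " + hexCodePoints(title));
    }
}
```

In this sketch an intact two-character string shows up as two question marks. The log in the question shows five, which hints (but does not prove) that the string already held the wrong characters before it was printed.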
McDowell
It's a multipart form. Maybe I'll try switching to URL encoding. Thanks.
Aaron Watters
HEY! This appears to work! Just change to standard post encoding. Thanks
Aaron Watters
I would not be so quick to celebrate. I've seen multipart/form-data used to _overcome_ character bugs and it is required if you want to do form file upload. Still, at least you have an idea about where the problem lies.
McDowell