Let's say I have a JSP page (I list only part of it here):

<%@ page language="java" contentType="text/html;charset=UTF-8"%>
<form>
    <input type="text">
    中華 <!-- characters in BIG5 encoding -->
</form>

On the server side I call request.setCharacterEncoding("UTF-8");. My problem is this: if I use an IME to type Chinese characters into the input box and then submit the form, what encoding will the characters in the input box have, and why?

If I instead copy the "中華" from the JSP page into the input box and submit the form, I find on the server side that the string from the input box is not in UTF-8 (as set by request.setCharacterEncoding) but in BIG5. So in Java/JSP the request does not really seem to follow the "UTF-8" setting. Why? Can someone explain this?

In ASP.NET, however, whatever characters I type into the input box and post, they always arrive on the server side as UTF-8 and never seem to get corrupted.

Why? Does ASP.NET handle this automatically? Does it convert the characters in the input box to UTF-8 automatically?

I always thought that a form post just treats all the characters in the form as raw hex bytes, does not process them in any way, and simply wraps those bytes with headers before sending them to the server. But if that is true, why do the characters never get corrupted in ASP.NET?

Thanks in advance!

A: 

Identify the point of failure.

中華

The characters you have chosen are (as Unicode codepoints) U+4E2D and U+83EF (in the CJK Unified Ideographs block). On the server, if you take the string you receive and output the values of the constituent characters using Integer.toHexString(mystring.charAt(i)), you should see these values. If this is not the case, there is a problem interpreting data from the client.
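As a rough sketch of that check (the class and method names here are made up for illustration, and the hard-coded string merely stands in for whatever request.getParameter(...) returns on your server):

```java
class CodePointDump {
    // Returns the UTF-16 code units of s as space-separated lowercase hex.
    static String toHex(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(Integer.toHexString(s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumption: stand-in for the string received from the form.
        String received = "\u4E2D\u83EF";
        System.out.println(toHex(received)); // prints "4e2d 83ef"
    }
}
```

If the output shows something other than 4e2d 83ef (for example six values such as e4 b8 ad e8 8f af), the bytes were most likely decoded with the wrong charset somewhere between the client and your code.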

You are specifying a page encoding of UTF-8. Encoded as UTF-8, the above characters should take on the following byte sequence values in the rendered HTML:

U+4E2D    0xE4 0xB8 0xAD
U+83EF    0xE8 0x8F 0xAF

So, save the page in the browser as a file and open it in a hex editor - you should see the characters encoded as above.
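If you want to double-check the expected byte values yourself, something along these lines should reproduce them (a minimal sketch; Utf8Bytes and utf8Hex are names invented for this example):

```java
import java.nio.charset.StandardCharsets;

class Utf8Bytes {
    // Encodes s as UTF-8 and formats each byte as 0xNN, separated by spaces.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to 0..255 so negative byte values print as two hex digits.
            sb.append(String.format("0x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(utf8Hex("\u4E2D")); // prints "0xE4 0xB8 0xAD"
        System.out.println(utf8Hex("\u83EF")); // prints "0xE8 0x8F 0xAF"
    }
}
```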

You can also glean information about what is being sent from the client by sending the form to a servlet, dumping the raw byte input to a file, and inspecting it with a hex editor. It is also worth inspecting the HTTP headers and what character encodings the server and client say they will accept and are sending (see Firebug).
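For the raw byte dump, a helper like the following could be a starting point. This is only a sketch: RawBodyDump and hexDump are hypothetical names, and the byte array in main stands in for the stream you would get from request.getInputStream() inside a servlet. Keep in mind that a normal form post is URL-encoded, so in a real dump the non-ASCII bytes will typically appear in percent-escaped form (%E4%B8%AD and so on) rather than as raw bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

class RawBodyDump {
    // Reads the stream to its end and returns every byte as two-digit hex.
    static String hexDump(InputStream in) {
        StringBuilder sb = new StringBuilder();
        try {
            int b;
            while ((b = in.read()) != -1) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(String.format("%02X", b));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumption: stand-in for the request body stream of a servlet.
        byte[] body = {(byte) 0xE4, (byte) 0xB8, (byte) 0xAD};
        System.out.println(hexDump(new ByteArrayInputStream(body))); // prints "E4 B8 AD"
    }
}
```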

McDowell