If I have an HTML page that is set to UTF-8, and I then enter Chinese characters encoded as Big5 into a form and submit it, what encoding does the server receive? Is the text automatically converted to UTF-8? How does this work? Thanks!

Supplement 1: Actually, I am not sure why the browser gets to decide which encoding to use. Isn't the encoding produced by the IME, that is, the tool I use to input Chinese characters?

Supplement 2: If everything works as Michael Madsen describes in his response below, then how does ASP.NET manage to handle this so that the characters never get corrupted no matter how I enter them in the form, while JSP can't?

A: 

The browser can send up its post in big5 if it wants to, and the server should be able to handle that. But what do you mean by "I input Chinese characters with encoding big5 in the form"? When you input the characters, it's up to the browser to decide which encoding to use, surely?

Jon Skeet
Actually, I am not sure why the browser gets to decide which encoding to use. Isn't the encoding produced by the IME, that is, the tool I use to input Chinese characters?
MemoryLeak
That's just going to get the text data into the browser in some appropriate fashion. The important thing is the textual values, not the encoding involved. Depending on the OS, browser and IME that could happen in a number of ways - but so long as the browser knows what Unicode characters to transmit, it can then decide to use whatever encoding it likes (and put it in the headers).
Jon Skeet
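To make the server side of this concrete, here is a minimal Python sketch of decoding a URL-encoded form body. It assumes the server already knows which charset the browser used (from a header or by convention); `decode_form` and the field name `comment` are hypothetical names chosen for illustration, not part of any framework mentioned in this thread.

```python
from urllib.parse import parse_qs


def decode_form(body: bytes, charset: str) -> dict[str, list[str]]:
    """Decode a URL-encoded form body using the given charset (hypothetical helper)."""
    # The percent-encoded body itself is plain ASCII; the charset only
    # matters when the %XX escapes are turned back into characters.
    return parse_qs(body.decode("ascii"), encoding=charset, errors="replace")


# The two characters "中文" as a browser would send them from a Big5 page.
body = b"comment=%A4%A4%A4%E5"

right = decode_form(body, "big5")   # charset matches what the browser used
wrong = decode_form(body, "utf-8")  # charset mismatch: text comes out corrupted

print(right)
print(wrong)
```

The bytes on the wire are identical in both calls. Only the charset the server assumes differs, and a mismatch there is one way submitted text ends up corrupted.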
What we see are "textual values", but the computer can only recognize the encoding, i.e. the byte (hex) values, right? So if I input a character with the IME and it generates a Big5 character, will the browser automatically translate it into UTF-8?
MemoryLeak
+3  A: 

The browser works with Unicode - when the characters are typed in there, they're internally stored as Unicode. When the form is submitted, it outputs the characters in whatever encoding is appropriate - usually the encoding of the page.

If you're talking about copy/pasting from a Big5 document, then it will already have been converted to Unicode when it's inserted into the clipboard - maybe even when the document is loaded, depending on your editor.

If you're talking about using some IME to input the characters, the question is kind of faulty, since your IME should be working exclusively with Unicode, and Big5 encoding is therefore never involved. If it is, then there's some layer in between doing the conversion to/from Unicode anyway, so regardless of that part, the browser never knows the source encoding.

Michael Madsen
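The first paragraph of the answer above can be illustrated with a short Python sketch: the same Unicode characters produce different bytes depending on which encoding is used at submit time. Python's `urllib.parse.quote` is used here only as a rough stand-in for a browser's form serialization, and the sample string is an assumption chosen for illustration.

```python
from urllib.parse import quote

# The characters as the browser holds them internally: Unicode code points,
# regardless of how they were typed or pasted.
text = "中文"

# The same characters serialized under two different page encodings.
utf8_bytes = text.encode("utf-8")
big5_bytes = text.encode("big5")

# Roughly what the field value looks like in a URL-encoded form submission.
utf8_field = quote(text, encoding="utf-8")
big5_field = quote(text, encoding="big5")

print(utf8_bytes.hex(), utf8_field)  # e4b8ade69687 %E4%B8%AD%E6%96%87
print(big5_bytes.hex(), big5_field)  # a4a4a4e5 %A4%A4%A4%E5
```

The encoding is chosen when the text is serialized for sending, not when it is typed.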
Why? Can the system automatically convert a string from Big5 into UTF-8?
MemoryLeak
Yes. The operating system knows how to go from the values in each encoding to an actual character, which is represented in the operating system's internal encoding whenever the operating system does something with it. That's why legacy apps still work on Windows - Windows uses UTF-16 internally, but legacy apps using a language-specific code page can call a compatibility layer which basically just calls the Unicode versions of the API functions after the text sent to the function has been converted (text returned from the API function is also converted the other way, of course).
Michael Madsen
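The conversion described in the comment above can be sketched in Python. This only illustrates the idea and is not how Windows implements it; Python's `big5` codec stands in for the Windows Traditional Chinese code page, which is an assumption, since the two are close but not identical.

```python
# "中文" as a legacy, non-Unicode application would hand it to the system.
legacy_bytes = b"\xa4\xa4\xa4\xe5"

# Step 1: code page bytes -> actual characters (Unicode code points).
text = legacy_bytes.decode("big5")

# Step 2: characters -> whatever encoding the consumer needs.
utf16_bytes = text.encode("utf-16-le")  # the form Windows uses internally
utf8_bytes = text.encode("utf-8")       # the form a UTF-8 page would submit

# Text returned to the legacy application is converted back the other way.
round_trip = text.encode("big5")

print(text, utf16_bytes.hex(), utf8_bytes.hex(), round_trip == legacy_bytes)
```

Going through Unicode in the middle means any supported encoding can be converted to any other without a dedicated Big5-to-UTF-8 table.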
Because the browser is Unicode-capable, it can understand and process the stuff given to it by the operating system without conversion. Once it's told to submit the form, it converts from the system encoding to the encoding requested by the server and sends the converted text.
Michael Madsen
Yeah, I don't know how you know this, but if it's OK, could you please tell me how I can verify it?
MemoryLeak
Take a look at the Windows utility AppLocale - http://www.microsoft.com/globaldev/tools/apploc.mspx - this allows you to run a single application as though the non-Unicode language was something other than the current system setting. Since all non-Unicode applications use the same *A (non-Unicode) versions of the Windows API functions, there must be code in those functions that handles the differing code pages correctly. The only way to do that with even the slightest bit of sanity is to simply convert the input and call the *W (Unicode) version.
Michael Madsen