Does a servlet know the encoding of a submitted form when it is specified using http-equiv?

When I specify the encoding of a POSTed form using http-equiv like this:

<HTML>
<head>
<meta http-equiv='Content-Type' content='text/html; charset=gb2312'/>
</head>
<BODY >
<form name="form" method="post" >
    <input type="text" name="v_rcvname" value="相宜本草">
</form>
</BODY>
</HTML>

Then, in the servlet, I call request.getCharacterEncoding() and get null! So, is there a way to tell the server which character encoding I am using for the data?

+4  A: 

This will indeed return null with most web browsers. But you can usually safely assume that the browser has actually used the encoding specified in the original response header, which in this case is gb2312. A common approach is to create a Filter which checks the request encoding and then uses ServletRequest#setCharacterEncoding() to force the desired value (which you should of course use consistently throughout your web application).

public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain) throws ServletException, IOException {
    // Only set the encoding when the client did not declare one.
    // This has to happen before any request parameter is read.
    if (request.getCharacterEncoding() == null) {
        request.setCharacterEncoding("gb2312");
    }
    chain.doFilter(request, response);
}

Map this Filter on a url-pattern that covers all servlet requests, e.g. /*.

If you don't do this, the servlet container will parse the parameters using its default encoding, which is usually ISO-8859-1 and is wrong here. Your input of 相宜本草 would end up like ÏàÒË±¾²Ý.
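To see why the wrong default garbles the input, here is a minimal standalone sketch (no servlet container involved) that encodes the sample text as GB2312 and then decodes the same bytes with both charsets. The class and method names are made up for illustration, and it assumes your JDK ships the GB2312 charset, which standard JDK builds normally do.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Decode raw bytes with the given charset (hypothetical helper).
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // Assumption: GB2312 is available in this JDK (it is in standard builds).
        Charset gb2312 = Charset.forName("GB2312");

        // The sample form value, written with escapes so the source file
        // encoding does not matter.
        String original = "\u76F8\u5B9C\u672C\u8349";

        // What the browser would send for a GB2312 page: two bytes per character.
        byte[] raw = original.getBytes(gb2312);

        // Decoding with the container default turns every byte into one character.
        String wrong = decode(raw, StandardCharsets.ISO_8859_1);
        // Decoding with the encoding the page actually used restores the text.
        String right = decode(raw, gb2312);

        System.out.println(wrong.length());         // 8
        System.out.println(right.equals(original)); // true
    }
}
```

Forcing gb2312 in the Filter has the same effect as the second decode call here: the container interprets the bytes with the charset the browser really used.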

BalusC
But I thought the browser is supposed to read the page's encoding, encode the form content into an array of bytes using that encoding, and then send those encoded bytes to the server along with the HTTP headers. The servlet should then interpret the HTTP headers and use the encoding that was sent to decode the bytes back into Strings. That would be the correct scenario, right? So the question is: why isn't it applied?
Mohammed
As said, most browsers don't send this information in the headers. That's why it may return `null`. Check it yourself with an HTTP header checker like Firebug or Fiddler. After all, you can safely assume that the data is encoded in the encoding which you've specified **yourself** in the original response. Give it a try. You'll see that it works.
BalusC
The `form` element has an `accept-charset` attribute. I can't remember testing this because I like to use Unicode everywhere. http://www.w3.org/TR/REC-html40/interact/forms.html#h-17.3
McDowell
@McDowell: This is ignored in almost all browsers as well, except MSIE. Even then, it's bogus in MSIE: any ISO-8859 encoding will be sent as CP1252. Never use it.
BalusC
@BalusC - thanks - good to know
McDowell
+1  A: 

It's impossible to send POST data back in GB2312. I think UTF-8 is the W3C recommendation and all new browsers only send data back in either Latin-1 or UTF-8.

We were able to get GB2312-encoded data back in old IE on Win 95, but it's generally not possible with the newer Unicode-based browsers.

See this trace from a test on Firefox:

POST / HTTP/1.1
Host: localhost:1234
User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 115
Connection: keep-alive
Content-Type: application/x-www-form-urlencoded
Content-Length: 46

My page is in GB2312 and I specified GB2312 everywhere, but Firefox simply ignores it.

Some broken browsers even encode Chinese in Latin-1. We recently added a hidden field with a known value. By checking the value, we can figure out the encoding.
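The hidden-field check can be sketched roughly like this. It is only an illustration, not our production code: the class name, the probe value, and the candidate list are all made up, and it assumes you can get at the raw percent-encoded value of the hidden field (for example from the unparsed request body) before the container decodes it.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class EncodingProbe {
    // Hypothetical known value placed in a hidden form field.
    static final String KNOWN_VALUE = "\u76F8\u5B9C";

    // Hypothetical candidate charsets, tried in this order.
    static final String[] CANDIDATES = {"UTF-8", "GB2312", "ISO-8859-1"};

    /**
     * Returns the first candidate charset under which the raw,
     * percent-encoded probe value decodes to KNOWN_VALUE, or null
     * if none of the candidates matches.
     */
    static String detect(String rawProbe) {
        for (String charset : CANDIDATES) {
            try {
                if (KNOWN_VALUE.equals(URLDecoder.decode(rawProbe, charset))) {
                    return charset;
                }
            } catch (UnsupportedEncodingException | IllegalArgumentException e) {
                // Charset not available or malformed escape: try the next one.
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(detect("%CF%E0%D2%CB"));       // GB2312
        System.out.println(detect("%E7%9B%B8%E5%AE%9C")); // UTF-8
    }
}
```

Pick a probe value whose bytes differ between the encodings you care about; a pure ASCII value would decode identically under all three candidates and tell you nothing.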

request.getCharacterEncoding() returns the encoding from the request's Content-Type header. As you can see from my trace, the header carries no charset, so it's always null.
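Roughly speaking, the lookup behaves like the sketch below. This is not the actual code of any servlet container, just a hypothetical helper (charsetOf is a made-up name) showing why a Content-Type without a charset parameter gives you null.

```java
public class ContentTypeCharset {
    /**
     * Extracts the charset parameter from a Content-Type header value.
     * Returns null when the header is missing or has no charset parameter.
     */
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            // Parameter names are case-insensitive.
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String value = p.substring(8).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // The header from the trace carries no charset.
        System.out.println(charsetOf("application/x-www-form-urlencoded"));
        // A client that did declare one would look like this.
        System.out.println(charsetOf("application/x-www-form-urlencoded; charset=GB2312"));
    }
}
```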

ZZ Coder
The first statement is not true. The data **is** sent in the encoding that you specify in the response header of the page with the form. I can confirm this for at least IE6/7/8, FF2/3, Opera9/10, Safari3/4 and Chrome. I do agree, however, that it's better to keep everything in UTF-8, with an eye on future expansion.
BalusC
I just tested again on IE8/FF3. I cannot get GB2312; everything is encoded in UTF-8. Post a trace if you have one.
ZZ Coder
The character encoding returns null here as well, but setting the character encoding as GB2312 yields correct characters on getParameter().
BalusC
I agree with BalusC: the fact that a browser doesn't send the charset as part of the Content-Type request header doesn't prevent it from correctly using the charset specified in http-equiv or in the form's accept-charset attribute to encode the data.
Mohammed