On the website I am trying to help with, a user can type a URL containing non-ASCII characters into the browser, such as the following Chinese characters:

  http://localhost:8080?a=测试

On the server, we get

  GET /a=%E6%B5%8B%E8%AF%95 HTTP/1.1

As you can see, the value is encoded as UTF-8 and then percent-encoded (URL encoded). We can handle this correctly by setting the encoding to UTF-8 in Tomcat.

However, certain browsers sometimes send Latin-1 (ISO-8859-1) instead:

  http://localhost:8080?a=ß

turns into

  GET /a=%DF HTTP/1.1

Is there any way to handle this correctly in Tomcat? It looks like the server has to do some intelligent guessing. We don't expect to handle Latin-1 correctly 100% of the time, but anything is better than what we do now, which is to assume everything is UTF-8.

The server is Tomcat 5.5. The supported browsers are IE 6+, Firefox 2+ and Safari on iPhone.

+2  A: 

Unfortunately, UTF-8 encoding is a "should" in the URI specification, which seems to assume that the origin server will generate all URLs in such a way that they will be meaningful to the destination server.

There are a couple of techniques that I would consider; all involve parsing the query string yourself (although you may know better than I whether setting the request encoding affects the query string to parameter mapping or just the body).
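As a rough illustration of "parsing the query string yourself", the sketch below splits a raw, still percent-encoded query string into name/value pairs without decoding anything, so that the charset decision can be made later per value. The class and method names (`RawQuery`, `split`) are hypothetical, and the raw string is assumed to come from something like `HttpServletRequest.getQueryString()`, which returns the query undecoded.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Tomcat or the servlet API.
final class RawQuery {

    /**
     * Splits "a=%DF&b=1" into {a=%DF, b=1}. Names and values are left
     * percent-encoded on purpose; no charset is applied here.
     * If a name repeats, the last value wins (a simplification).
     */
    static Map<String, String> split(String rawQuery) {
        Map<String, String> pairs = new LinkedHashMap<>();
        if (rawQuery == null || rawQuery.isEmpty()) {
            return pairs;
        }
        for (String pair : rawQuery.split("&")) {
            if (pair.isEmpty()) {
                continue; // tolerate "a=1&&b=2"
            }
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            pairs.put(name, value);
        }
        return pairs;
    }
}
```

Keeping the values encoded at this stage means the raw bytes are still available for whatever guessing strategy is applied afterwards.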

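First, examine the query string for isolated "high" bytes (bytes with the top bit set): in valid UTF-8, a non-ASCII character is always encoded as a sequence of two or more such bytes, so a lone high byte such as %DF cannot be UTF-8 (the Wikipedia entry on UTF-8 has a nice table of valid and invalid bytes).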
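One way to sketch this check in Java is to percent-decode the value to raw bytes, try a strict UTF-8 decode, and fall back to ISO-8859-1 only when the bytes are not valid UTF-8. The names `QueryDecoder`, `percentDecode` and `decodeGuessing` are made up for this example, and it assumes Java 7 or later for `StandardCharsets` (on the older JVMs that Tomcat 5.5 typically runs on, `Charset.forName("UTF-8")` would be needed instead).

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper illustrating the "lone high byte" heuristic.
final class QueryDecoder {

    /** Percent-decodes a raw (ASCII) query value into bytes without choosing a charset. */
    static byte[] percentDecode(String raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c == '%' && i + 2 < raw.length()
                    && Character.digit(raw.charAt(i + 1), 16) >= 0
                    && Character.digit(raw.charAt(i + 2), 16) >= 0) {
                int hi = Character.digit(raw.charAt(i + 1), 16);
                int lo = Character.digit(raw.charAt(i + 2), 16);
                out.write(hi * 16 + lo);
                i += 2; // skip the two hex digits
            } else if (c == '+') {
                out.write(' '); // form-style encoding of a space
            } else {
                out.write(c); // assumed to be plain ASCII
            }
        }
        return out.toByteArray();
    }

    /** Decodes as UTF-8 if the bytes are valid UTF-8, otherwise as ISO-8859-1. */
    static String decodeGuessing(byte[] bytes) {
        CharsetDecoder utf8 = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return utf8.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            // A lone high byte (or any other malformed sequence) lands here.
            return new String(bytes, StandardCharsets.ISO_8859_1);
        }
    }
}
```

With this sketch, `decodeGuessing(percentDecode("%E6%B5%8B%E8%AF%95"))` yields the two Chinese characters from the question, while `%DF` fails the strict UTF-8 decode and comes back as `ß`. The guess is not perfect: some Latin-1 byte pairs (for example C3 A9) are also valid UTF-8 and will be read as UTF-8.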

Less reliable would be to look at the "Accept-Charset" header in the request. I don't think this header is required (haven't looked at the HTTP spec to verify), and I know that Firefox, at least, will send a whole list of acceptable values. Picking the first value in the list might work, or it might not.
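A minimal sketch of the "pick the first value" idea, assuming the header text has already been read (for example with `request.getHeader("Accept-Charset")`). The class and method names are hypothetical, q-values are ignored entirely, and the result should be treated as a weak hint at best, since the header describes what the client accepts in responses rather than how it encoded the URL.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; ignores q-values and simply takes the first usable entry.
final class AcceptCharsetHint {

    /**
     * Returns the first charset named in the header that this JVM supports,
     * or the given fallback if the header is missing or names nothing usable.
     */
    static Charset firstListed(String headerValue, Charset fallback) {
        if (headerValue == null) {
            return fallback;
        }
        for (String item : headerValue.split(",")) {
            int semi = item.indexOf(';'); // strip parameters such as ";q=0.7"
            String name = (semi < 0 ? item : item.substring(0, semi)).trim();
            if (name.isEmpty() || name.equals("*")) {
                continue; // the wildcard says nothing specific
            }
            try {
                if (Charset.isSupported(name)) {
                    return Charset.forName(name);
                }
            } catch (IllegalArgumentException e) {
                // Syntactically illegal charset name: skip this entry.
            }
        }
        return fallback;
    }
}
```

For a header such as `ISO-8859-1,utf-8;q=0.7,*;q=0.7` this returns ISO-8859-1; with no header it returns whatever fallback the caller passes in.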

Finally, have you done any analysis on the logs, to see if a particular user-agent will consistently use this encoding?

kdgregory
A: 

Related to http://stackoverflow.com/questions/2657515

Roland Illig