So, the javadoc for both URLEncoder's encode and URLDecoder's decode has this note in it:

"Note: The World Wide Web Consortium Recommendation states that UTF-8 should be used. Not doing so may introduce incompatibilities."

However, if someone sends in a request with a different encoding type, wouldn't it be a bad idea to encode with UTF-8? Is there anything wrong with checking a header (if it exists) and using whatever encoding is specified in there? Perhaps some more background to this note would allow it to make more sense to me, if anyone can provide it.

+1  A: 

In the same documentation:

The recommended encoding scheme to use is UTF-8. However, for compatibility reasons, if an encoding is not specified, then the default encoding of the platform is used.

You can pass a different encoding, but since that departs from the W3C recommendation, it is usually a bad idea.
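To make the difference concrete, here is a minimal sketch of the same string encoded with two charsets. The class name and the sample value are just for illustration; the only real API used is URLEncoder.encode(String, String).

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class EncodeCharsetDemo {
    // Percent-encode a value using the named charset.
    // The checked exception is wrapped so callers need not declare it.
    static String encode(String value, String charset) {
        try {
            return URLEncoder.encode(value, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        String value = "caf\u00e9"; // "cafe" with an acute accent on the e

        // UTF-8 represents U+00E9 as two bytes, so two escapes appear.
        System.out.println(encode(value, "UTF-8"));      // caf%C3%A9

        // ISO-8859-1 represents it as one byte, so a single escape appears.
        System.out.println(encode(value, "ISO-8859-1")); // caf%E9
    }
}
```

The two results are not interchangeable: whoever decodes the URL has to know which charset was used, which is why sticking to UTF-8 is the safer default.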



Colin Hebert
Bah, I feel stupid for looking past that. Thanks.
AHungerArtist
+1  A: 

Tomcat and some other web servers have a separate setting that controls the decoder used for the URL in a GET request. Specifically, Tomcat will use its default character encoding (ISO-8859-1 in older versions, UTF-8 since Tomcat 8) unless one is specified with the URIEncoding attribute of the "Connector".
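A rough illustration of what goes wrong when the decoder's charset does not match the one the client used. This does not call into Tomcat at all; it only uses URLDecoder to mimic a container that decodes with ISO-8859-1 while the client encoded with UTF-8. The class name and sample value are made up for the example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class DecodeMismatchDemo {
    // Decode a percent-encoded value using the named charset.
    static String decode(String encoded, String charset) {
        try {
            return URLDecoder.decode(encoded, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        // "caf" followed by U+00E9, percent-encoded by a client using UTF-8.
        String encoded = "caf%C3%A9";

        // Matching charset: the two bytes are read back as one character.
        String ok = decode(encoded, "UTF-8");

        // Mismatched charset: each byte becomes its own character
        // (U+00C3 followed by U+00A9), so the text is garbled.
        String garbled = decode(encoded, "ISO-8859-1");

        System.out.println(ok.length());      // 4
        System.out.println(garbled.length()); // 5
    }
}
```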

I found the discussion in this post helpful when I was dealing with similar problems.

erickson
+1  A: 

Some countries' websites do use other encodings, because UTF-8 would be inefficient for their languages.

URLs are generally opaque. A URL is a sequence of ASCII characters that was generated by a website and is consumed by the same website. As long as the website itself can parse it, it's good.

On the other hand, people do want to look into URLs and try to understand the finer details. A browser, when displaying a URL full of %-encoded octets, may want to convert them back to characters. Unfortunately, it has to guess the character encoding. In theory the encoding can be anything, even a proprietary one.

Also, a third party may want to generate a URL to a website that they don't control. How many programs have dynamically generated Google search URLs? Again, the encoding supported by the website has to be guessed.

So if you are a website owner and you want to be nice, it's better to support UTF-8 encoded URLs. Of course, you don't have to. They're your URLs, so it's up to you.
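For instance, a third-party program building a search URL might look something like the sketch below. The host search.example.com and the q parameter are placeholders, not a real service's API, and the choice of UTF-8 is exactly the guess being described: it only works if the target site decodes UTF-8.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class SearchUrlDemo {
    // Hypothetical endpoint, used only for illustration.
    private static final String BASE = "https://search.example.com/search?q=";

    // Build a search URL, assuming the target site expects UTF-8.
    static String searchUrl(String query) {
        try {
            // URLEncoder produces form encoding, so spaces become '+'.
            return BASE + URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(searchUrl("url encoding"));
        // https://search.example.com/search?q=url+encoding

        System.out.println(searchUrl("na\u00efve"));
        // https://search.example.com/search?q=na%C3%AFve
    }
}
```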

irreputable