I have a web application (well, in fact it's just a servlet) which receives data from 3 different sources:

  • Source A is an HTML document written in UTF-8, and sends the data via <form method="get">.
  • Source B is written in ISO-8859-1, and sends the data via <form method="get">, too.
  • Source C is written in ISO-8859-1, and sends the data via <a href="http://my-servlet-url?param=value&param2=value2&etc">.

The servlet receives the request params and URL-decodes them using UTF-8. As you can expect, A works without problems, while B and C fail (you can't URL-decode in UTF-8 something that's encoded in ISO-8859-1...).
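To make the mismatch concrete, here is a minimal sketch in plain Java (no servlet API involved). The class name `CharsetMismatchDemo`, the helper `roundTrip` and the sample value are made up for illustration; it only shows what happens when the charset used to percent-encode a value differs from the one used to decode it.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class CharsetMismatchDemo {

    // Percent-encodes the text with one charset and decodes it with another,
    // mimicking a page that submits in encodeCharset while the servlet
    // decodes in decodeCharset.
    static String roundTrip(String text, String encodeCharset, String decodeCharset) {
        try {
            String encoded = URLEncoder.encode(text, encodeCharset);
            return URLDecoder.decode(encoded, decodeCharset);
        } catch (UnsupportedEncodingException e) {
            // Only reached if a charset name is not supported by the JVM.
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String value = "caf\u00e9"; // hypothetical parameter value with one non-ASCII character

        // Source A: encoded and decoded as UTF-8, the value survives.
        System.out.println(roundTrip(value, "UTF-8", "UTF-8"));

        // Sources B and C: encoded as ISO-8859-1 but decoded as UTF-8,
        // the accented character is lost.
        System.out.println(roundTrip(value, "ISO-8859-1", "UTF-8"));
    }
}
```

ASCII-only values pass through unchanged in both cases, which is why the problem only shows up with accented or other non-ASCII characters.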

I can make slight modifications to B and C, but I am not allowed to change them from ISO-8859-1 to UTF-8, which would solve all the problems.

In B, I've been able to solve the problem by adding accept-charset="UTF-8" to the <form>. So it sends the data in UTF-8 even with the page being ISO.

What can I do to fix C?

Alternatively, is there any way to determine the charset on the servlet, so I can call URL-decode with the right encoding in each case?


Edit: I've just found this, which seems to solve my problem. I still have to run some tests to determine whether it impacts performance, but I think I'll stick with that solution.

A: 

The browser will by default send the data in the same encoding as the requested page was returned in. This is controllable by the HTTP Content-Type header which you can also set using the HTML <meta> tag.

The accept-charset attribute of the HTML <form> element should be avoided since it's broken in MSIE. Almost all non-UTF-8 encodings are ignored and will be sent in platform default encoding (which is usually CP-1252 in case of Windows).

To fix A and B (POST) you basically need to call HttpServletRequest#setCharacterEncoding() before gathering request parameters. Keep in mind that this is a one-time task. You cannot get a parameter, then change the encoding, and then "re-get" the parameters.

To fix C (GET) you basically need to set the request URI encoding in the server configuration. Since it's unclear which server you're using, here's a Tomcat-targeted example: in the HTTP connector set the following attribute:

<Connector (...) URIEncoding="ISO-8859-1" />

However, this is already the default encoding in most servers. So you may not need to do anything for C.

As an alternative, you can grab the raw, not yet URL-decoded data yourself: from the request body (in case of POST) via HttpServletRequest#getInputStream(), or from the query string (in case of GET) via HttpServletRequest#getQueryString(). You can then guess the encoding based on the characters available in the parameters and URL-decode accordingly using the guessed encoding. A hidden input element with a specific character whose encoded bytes differ between UTF-8 and ISO-8859-1 may help a lot with this.
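A rough sketch of that guessing approach, in plain Java so it can be tried outside a container. The class `QueryCharsetGuesser` and its methods are hypothetical names, not part of the servlet API. It assumes you have already split the string returned by getQueryString() on `&` and `=` and pass in a single still-encoded value. The heuristic is: try strict UTF-8 first, and fall back to ISO-8859-1 if the bytes are not valid UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class QueryCharsetGuesser {

    // Turns a percent-encoded value into raw bytes without applying any charset.
    // Assumes the input is ASCII, as a query string normally is.
    static byte[] percentDecodeToBytes(String encoded) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < encoded.length(); i++) {
            char c = encoded.charAt(i);
            if (c == '%' && i + 2 < encoded.length()) {
                int hi = Character.digit(encoded.charAt(i + 1), 16);
                int lo = Character.digit(encoded.charAt(i + 2), 16);
                if (hi >= 0 && lo >= 0) {
                    out.write((hi << 4) | lo);
                    i += 2; // skip the two hex digits just consumed
                    continue;
                }
            }
            // Form encoding uses '+' for a space; everything else is copied as-is.
            out.write(c == '+' ? ' ' : c);
        }
        return out.toByteArray();
    }

    // Decodes as UTF-8 if the bytes are well-formed UTF-8, otherwise as ISO-8859-1.
    static String decodeGuessing(String encoded) {
        byte[] raw = percentDecodeToBytes(encoded);
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8, so assume the page submitted ISO-8859-1.
            return new String(raw, StandardCharsets.ISO_8859_1);
        }
    }

    public static void main(String[] args) {
        System.out.println(decodeGuessing("caf%C3%A9")); // value sent as UTF-8
        System.out.println(decodeGuessing("caf%E9"));    // value sent as ISO-8859-1
    }
}
```

This is only a heuristic: some ISO-8859-1 byte sequences happen to be valid UTF-8 as well, so a guess can be wrong. That is where the hidden input with a known character helps, since you can check how that one parameter arrived and apply the same charset to the rest.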

BalusC
Are you sure that `accept-charset` only works in MSIE? My ISO-8859-1 page is now sending the data correctly in UTF-8 (tried it in Chrome and Firefox). The problem I face is that I don't know which encoding is being used in each case, ISO-8859-1 or UTF-8. So I can't use `setCharacterEncoding()`. I hope that zildjohn01's suggestion will help to determine it.
AJPerez