ansaurus

Question

How do I correctly decode Unicode parameters passed to a servlet?

Answer 1

A:

You could always use JavaScript to manipulate the text further.

<div id="test">a</div>
<script>
var a = document.getElementById('test');
alert(a.innerHTML);
a.innerHTML = decodeURI("Yahoo!%E2%84%A2");
alert(a.innerHTML);
</script>

jacobangel 2009-01-22 17:13:24

Yes, decodeURIComponent() returns the correct value, but only if I extract the value from the URL in JavaScript. If I instead call decodeURIComponent('<%= request.getParameter("external_title") %>'), I don't get the correct value.

Grant Wagner 2009-01-22 17:32:45

Answer 2

+2 A:

I suspect that the data mutilation happens in the request, i.e. the declared encoding of the request does not match the one that is actually used for the data.

What does request.getCharacterEncoding() return?

I don't really know how JavaScript handles encodings or how to make it use a specific one.

You need to make sure that encodings are used correctly at all stages - do NOT try to "fix" the data by using new String() and getBytes() at a point where it has already been decoded incorrectly.

Edit: It may help to have the origin page (the one with the JavaScript) also encoded in UTF-8 and declared as such in its Content-Type. Then I believe JavaScript may default to using UTF-8 for its request - but this is not definite knowledge, just guesswork.

Michael Borgwardt 2009-01-22 17:16:17

request.getCharacterEncoding() is returning ISO-8859-1. So I think the problem is that encodeURIComponent() encodes the value as UTF-8, but it is getting mangled by the request encoding of ISO-8859-1.

Grant Wagner 2009-01-22 17:31:12

Answer 3

A:

I think I can get the following to work:

encodeURIComponent(escape(lnk.title))

That gives me %25u2122 (for ™) or %25AE (for ®), which will decode to %u2122 and %AE respectively in the servlet.

I should then be able to turn %u2122 into '\u2122' and %AE into '\u00AE' relatively easily using (char) (base-10 integer value of %uXXXX or %XX) in a match and replace loop using regular expressions.

i.e. - match /%u([0-9a-f]{4})/i, extract the matching subexpression, convert it to base-10, turn it into a char and append it to the output, then do the same with /%([0-9a-f]{2})/i
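The match-and-replace loop described above could be sketched in Java roughly as follows. This is only an illustration under stated assumptions: the class name EscapeDecoder and the method name unescape are hypothetical, and the sketch handles both patterns in a single pass (rather than two separate passes) so that text produced by one replacement is never decoded a second time. The hex digits are parsed with radix 16 to obtain the integer value that is cast to char.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the class and method names are illustrative only.
final class EscapeDecoder {
    // Matches %uXXXX (four hex digits) or %XX (two hex digits),
    // the two forms emitted by JavaScript's escape().
    private static final Pattern ESCAPED =
            Pattern.compile("%u([0-9a-fA-F]{4})|%([0-9a-fA-F]{2})");

    static String unescape(String input) {
        Matcher m = ESCAPED.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Copy the literal text between the previous match and this one.
            out.append(input, last, m.start());
            // Exactly one of the two groups is non-null for any match.
            String hex = m.group(1) != null ? m.group(1) : m.group(2);
            // Parse the hex digits and append the corresponding char.
            out.append((char) Integer.parseInt(hex, 16));
            last = m.end();
        }
        // Copy whatever follows the final match.
        out.append(input, last, input.length());
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(unescape("Yahoo!%u2122"));
        System.out.println(unescape("Acme%AE"));
    }
}
```

In the servlet this would be applied to the value returned by request.getParameter(), which by that point contains the %uXXXX / %XX text as plain ASCII.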

Grant Wagner 2009-01-22 18:22:34

This is one possible encoding scheme you could use to get around the Servlet Parameter Charset Problem. (One that didn't use the dodgy JavaScript escape() function might be better.) But any such scheme isn't the standard way to pass parameters in, so any other scripts/forms wouldn't be able to talk to it.

bobince 2009-01-22 18:39:26

Grant Wagner 2009-01-22 20:13:36

Answer 4

+4 A:

You are nearly there. encodeURIComponent correctly encodes to UTF-8, which is what you should always use in a URL today.

The problem is that the submitted query string is getting mutilated on the way into your server-side script, because getParameter() uses ISO-8859-1 instead of UTF-8. This stems from Ancient Times before the web settled on UTF-8 for URI/IRI, but it's rather pathetic that the Servlet spec hasn't been updated to match reality, or at least provide a reliable, supported option for it.

(There is request.setCharacterEncoding in Servlet 2.3, but it doesn't affect query string parsing, and if a single parameter has been read before, possibly by some other framework element, it won't work at all.)

So you need to futz around with container-specific methods to get proper UTF-8, often involving stuff in server.xml. This totally sucks for distributing web apps that should work anywhere. For Tomcat see http://wiki.apache.org/tomcat/FAQ/CharacterEncoding
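The mismatch described above can be reproduced outside a container with java.net.URLDecoder. The following is a minimal sketch, not what any particular servlet container does internally; the class name CharsetMismatchDemo and its decode helper are hypothetical. It decodes the same percent-encoded bytes once as UTF-8 and once as ISO-8859-1.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical demo class; it only illustrates the charset mismatch.
final class CharsetMismatchDemo {
    // Wraps URLDecoder.decode so callers need not handle the checked exception.
    static String decode(String encoded, String charset) {
        try {
            return URLDecoder.decode(encoded, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        // The UTF-8 percent-encoding of "Yahoo!" followed by U+2122,
        // taken from the example earlier in this thread.
        String encoded = "Yahoo!%E2%84%A2";

        // Bytes E2 84 A2 read as UTF-8 form one character, U+2122.
        String asUtf8 = decode(encoded, "UTF-8");
        // The same bytes read as ISO-8859-1 become three separate characters.
        String asLatin1 = decode(encoded, "ISO-8859-1");

        System.out.println(asUtf8 + " (" + asUtf8.length() + " chars)");
        System.out.println(asLatin1 + " (" + asLatin1.length() + " chars)");
    }
}
```

The three bytes E2 84 A2 yield a single character under UTF-8 and three characters (U+00E2, U+0084, U+00A2) under ISO-8859-1, which is the kind of mangled value getParameter() hands back when the container assumes ISO-8859-1.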

bobince 2009-01-22 18:36:34

Thanks for the explanation. At least I know I'm not crazy. I tried request.setCharacterEncoding() while looking for a solution and as you said, it didn't seem to do anything to help resolve my problem.

Grant Wagner 2009-01-22 19:49:32

Answer 5

A:

I had the same problem and solved it by decoding request.getQueryString() with URLDecoder.decode() and then extracting my parameters.

String[] parameters = URLDecoder.decode(request.getQueryString(), "UTF-8").split("&");
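One caveat with decoding the whole query string first: a value that contains an encoded & (%26) or = (%3D) will be split in the wrong place. A possible variant, sketched here with a hypothetical QueryStringParser class, splits the raw string first and decodes each name and value separately as UTF-8. It assumes the parameters arrive in the query string (not a POST body) and keeps only the last value when a name repeats.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; the names are illustrative only.
final class QueryStringParser {
    // Parses a raw (still percent-encoded) query string into decoded pairs.
    static Map<String, String> parse(String rawQuery) {
        Map<String, String> params = new LinkedHashMap<>();
        if (rawQuery == null || rawQuery.isEmpty()) {
            return params;
        }
        // Split on the literal separators BEFORE decoding, so an encoded
        // '&' (%26) or '=' (%3D) inside a value is not treated as a separator.
        for (String pair : rawQuery.split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            // A repeated name overwrites the earlier value in this sketch.
            params.put(decodeUtf8(name), decodeUtf8(value));
        }
        return params;
    }

    private static String decodeUtf8(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be supported by every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> p = parse("external_title=Yahoo!%E2%84%A2&q=a%26b");
        System.out.println(p.get("external_title"));
        System.out.println(p.get("q"));
    }
}
```

In a servlet the input would be request.getQueryString(), which returns the query string without container decoding.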

Modi 2010-03-31 14:58:23

Answer 6

A:

Hi Grant,

Can you share the code to convert %u2122 into '\u2122' and %AE into '\u00AE'? I am struggling with it. Thanks in advance :))

Kaushik 2010-05-13 06:29:00
