Suppose I have:

<a href="http://www.yahoo.com/" target="_yahoo" 
    title="Yahoo!&#8482;" onclick="return gateway(this);">Yahoo!</a>
<script type="text/javascript">
function gateway(lnk) {
    window.open(SERVLET +
        '?external_link=' + encodeURIComponent(lnk.href) +
        '&external_target=' + encodeURIComponent(lnk.target) +
        '&external_title=' + encodeURIComponent(lnk.title));
    return false;
}
</script>

I have confirmed external_title gets encoded as Yahoo!%E2%84%A2 and passed to SERVLET. If in SERVLET I do:

Writer writer = response.getWriter();
writer.write(request.getParameter("external_title"));

I get Yahoo!â„¢ in the browser. If I manually switch the browser character encoding to UTF-8, it changes to Yahoo!TM (which is what I want).

So I figured the encoding I was sending to the browser was wrong (it was Content-type: text/html; charset=ISO-8859-1). I changed SERVLET to:

response.setContentType("text/html; charset=utf-8");
Writer writer = response.getWriter();
writer.write(request.getParameter("external_title"));

Now the browser character encoding is UTF-8, but it outputs Yahoo!™ and I can't get the browser to render the correct character at all.

My question is: is there some combination of Content-type and/or new String(request.getParameter("external_title").getBytes(), "UTF-8"); and/or something else that will result in Yahoo!TM appearing in the SERVLET output?

A: 

You could always use javascript to manipulate the text further.

<div id="test">a</div>
<script>
var a = document.getElementById('test');
alert(a.innerHTML);
a.innerHTML = decodeURI("Yahoo!%E2%84%A2");
alert(a.innerHTML);
</script>
jacobangel
Yes, decodeURIComponent() returns the correct value, but only if I extract the value from the URL in JavaScript. If I attempt to decodeURIComponent('<%= request.getParameter("external_title") %>'); I don't get the correct value.
Grant Wagner
+2  A: 

I suspect that the data mutilation happens in the request, i.e. the declared encoding of the request does not match the one that is actually used for the data.

What does request.getCharacterEncoding() return?

I don't really know how JavaScript handles encodings or how to make it use a specific one.

You need to make sure that encodings are used correctly at all stages - do NOT try to "fix" the data by using new String() and getBytes() at a point where it has already been decoded incorrectly.

Edit: It may help to have the origin page (the one with the Javascript) also encoded in UTF-8 and declared as such in its Content-Type. Then I believe Javascript may default to using UTF-8 for its request - but this is not definite knowledge, just guesswork.

Michael Borgwardt
request.getCharacterEncoding() is returning ISO-8859-1. So I think the problem is that encodeURIComponent() encodes the value as UTF-8, but it is getting mangled by the request encoding of ISO-8859-1.
Grant Wagner
A: 

I think I can get the following to work:

encodeURIComponent(escape(lnk.title))

That gives me %25u2122 (for &#8482;) or %25AE (for &#174;), which will decode to %u2122 and %AE respectively in the servlet.

I should then be able to turn %u2122 into '\u2122' and %AE into '\u00AE' relatively easily using (char) (base-10 integer value of %uXXXX or %XX) in a match and replace loop using regular expressions.

i.e. - match /%u([0-9a-f]{4})/i, extract the matching subexpression, convert it to base-10, turn it into a char and append it to the output, then do the same with /%([0-9a-f]{2})/i
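The match-and-replace loop described above might look roughly like the following in Java. This is only a sketch: the class name EscapeDecoder and the method name unescape are made up for illustration, and it assumes the input contains nothing but the %uXXXX and %XX forms that JavaScript's escape() produces. Note that the hex digits are parsed as base 16 to obtain the character's numeric value.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not from the thread): reverses JavaScript escape() output.
final class EscapeDecoder {
    // Group 1 captures the four hex digits of %uXXXX, group 2 the two hex digits of %XX.
    private static final Pattern ESCAPE =
            Pattern.compile("%u([0-9a-fA-F]{4})|%([0-9a-fA-F]{2})");

    static String unescape(String input) {
        Matcher m = ESCAPE.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String hex = m.group(1) != null ? m.group(1) : m.group(2);
            // Parse the hex digits and turn the resulting code unit into a char.
            char decoded = (char) Integer.parseInt(hex, 16);
            // quoteReplacement keeps characters such as '$' from being treated as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints true if the trademark sign was restored from its %u form.
        System.out.println(unescape("Yahoo!%u2122").equals("Yahoo!\u2122"));
    }
}
```

Text that does not match either pattern, such as a lone % sign, is passed through unchanged.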

Grant Wagner
This is one possible encoding scheme you could use to get around the Servlet Parameter Charset Problem. (One that didn't use the dodgy JavaScript escape() function might be better.) But any such isn't the standard way to pass parameters in, so any other scripts/forms wouldn't be able to talk to it.
bobince
Grant Wagner
+4  A: 

You are nearly there. encodeURIComponent correctly encodes to UTF-8, which is what you should always use in a URL today.

The problem is that the submitted query string is getting mutilated on the way into your server-side script, because getParameter() uses ISO-8859-1 instead of UTF-8. This stems from Ancient Times before the web settled on UTF-8 for URI/IRI, but it's rather pathetic that the Servlet spec hasn't been updated to match reality, or at least provide a reliable, supported option for it.
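To see the effect outside a servlet container, here is a small standalone sketch that percent-decodes the same value with both charsets using java.net.URLDecoder. The class name QueryDecodingDemo is made up for illustration, and this only imitates what the container does internally; it is not the container's actual code.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical demo class: decodes one percent-encoded value under two charsets.
final class QueryDecodingDemo {
    // Percent-decodes a value, interpreting the resulting bytes in the named charset.
    static String decode(String encoded, String charset) {
        try {
            return URLDecoder.decode(encoded, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        String encoded = "Yahoo!%E2%84%A2"; // what encodeURIComponent sends for the title

        // Three bytes read as UTF-8 form a single character, the trademark sign.
        String asUtf8 = decode(encoded, "UTF-8");
        // The same three bytes read as ISO-8859-1 become three separate characters.
        String asLatin1 = decode(encoded, "ISO-8859-1");

        System.out.println(asUtf8.length());   // 7
        System.out.println(asLatin1.length()); // 9
    }
}
```

The three stray characters in the ISO-8859-1 result are what the browser ends up rendering in place of the trademark sign.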

(There is request.setCharacterEncoding in Servlet 2.3, but it doesn't affect query string parsing, and if a single parameter has been read before, possibly by some other framework element, it won't work at all.)

So you need to futz around with container-specific methods to get proper UTF-8, often involving stuff in server.xml. This totally sucks for distributing web apps that should work anywhere. For Tomcat see http://wiki.apache.org/tomcat/FAQ/CharacterEncoding

bobince
Thanks for the explanation. At least I know I'm not crazy. I tried request.setCharacterEncoding() while looking for a solution and as you said, it didn't seem to do anything to help resolve my problem.
Grant Wagner
A: 

I had the same problem and solved it by decoding request.getQueryString() with URLDecoder.decode() and then extracting my parameters.

String[] parameters = URLDecoder.decode(request.getQueryString(), "UTF-8").split("&");
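One caveat with decoding the whole query string first: a value containing an encoded & (%26) or = (%3D) would then be split in the wrong place. A variant that splits first and decodes each name and value afterwards could look like the sketch below. The class name Utf8QueryParser and the method parse are hypothetical, and the raw string passed in is assumed to be what request.getQueryString() returns.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: parses a raw query string as UTF-8, bypassing getParameter().
final class Utf8QueryParser {
    static Map<String, String> parse(String rawQuery) {
        Map<String, String> params = new LinkedHashMap<>();
        if (rawQuery == null || rawQuery.isEmpty()) {
            return params;
        }
        // Split on the literal separators first, while they are still unambiguous.
        for (String pair : rawQuery.split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            // If a name repeats, the last value wins in this simplified version.
            params.put(decode(name), decode(value));
        }
        return params;
    }

    private static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be supported on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> p = parse("external_title=Yahoo!%E2%84%A2&q=a%26b");
        // Prints true if the encoded ampersand stayed inside the value of q.
        System.out.println(p.get("q").equals("a&b"));
    }
}
```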

Modi
A: 

Hi Grant,

Can you share the code to convert %u2122 into '\u2122' and %AE into '\u00AE'? I am struggling with it. Thanks in advance :))

Kaushik