I have a JavaScript file that lots of people have embedded in their pages. Since I am hosting the file, I have control over the JavaScript itself; I cannot control how it is embedded, because lots of people are already using it.

This JavaScript file sends GET requests to my servlets, and the parameters passed with the request are recorded in a DB. For example, the JavaScript sends a request to http://myserver.com/servlet?p1=123&p2=aString, and the servlet records 123 and aString in the DB.

Before sending the strings I encode them with encodeURIComponent(). But what I figured out is that clients send the same string with different encodings, depending on either their browser or the site they are visiting. As a result, the same string is represented with different characters by the time it reaches the servlet (so the strings end up different).

What I am trying to do is convert the strings to one encoding on the JavaScript side, so that when they reach the server the same words are represented with the same characters.

How is this possible?

P.S. If there is a way to convert the encoding on the Java side, that would work too.

Edit: To be more precise, I select some words from the page and send them to the server. That is where the encoding causes problems.

Edit 2: I am NOT sending (and can't send) the GET requests via XMLHttpRequest, because the domains are different. I am using the add-a-script-tag-to-the-head method that @streetpc mentioned, roughly like the sketch below.
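A simplified sketch of what I do (the sendToServer name is just for illustration; the servlet URL matches the example above):

function sendToServer(p1, p2) {
    // Cross-domain GET: the browser fetches the URL as soon as the tag is
    // appended, and the servlet records the query parameters.
    var script = document.createElement('script');
    script.src = 'http://myserver.com/servlet'
               + '?p1=' + encodeURIComponent(p1)
               + '&p2=' + encodeURIComponent(p2);
    document.getElementsByTagName('head')[0].appendChild(script);
}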

Edit 3: At the moment I am sanitizing the strings by replacing non-ASCII characters on the JavaScript side, but I have a feeling that this is not the way to go:

function sanitize(word) {
    /*
    ğ : \u011f
    ü : \u00fc
    ş : \u015f
    ö : \u00f6
    ç : \u00e7
    ı : \u0131
    û : \u00fb
    */
    return encodeURIComponent(
            word.replace(/\u011f/g, '_g')
                .replace(/\u00fc/g, '_u')
                .replace(/\u00fb/g, '_u')
                .replace(/\u015f/g, '_s')
                .replace(/\u00f6/g, '_o')
                .replace(/\u00e7/g, '_c')
                .replace(/\u0131/g, '_i'));
}
+2  A: 

Do you specify the encoding of the JavaScript file in the HTTP headers? Like Content-Type: text/javascript; charset=utf-8, with the .js file being saved as UTF-8, of course. With Apache, you can configure:

AddCharset utf-8 .js 

Or you can make the hosted JavaScript file create another script tag with a charset='utf-8' attribute and add it to the head element (like most bookmarklets do).
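A minimal sketch of that approach (the src URL is a placeholder):

var script = document.createElement('script');
script.type = 'text/javascript';
// Force the browser to decode this file as UTF-8, regardless of the
// including page's own encoding:
script.charset = 'utf-8';
script.src = 'http://myserver.com/your-script.js'; // placeholder URL
document.getElementsByTagName('head')[0].appendChild(script);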

I think a script interpreted as UTF-8 should then get and manipulate its strings as UTF-8.

Then, in your Java Servlet, you can specify the input encoding to use:

request.setCharacterEncoding("UTF-8");

Edit: check this page about Character Encoding in JavaScript, especially the part named "Setting the Character Encoding".

streetpc
+3  A: 

what I figured out is every client sends the same string with different encodings

Whilst that would be normal for <form> submissions, it should not happen for XMLHttpRequest work. The encodeURIComponent function explicitly always writes URL-encoded UTF-8 bytes, regardless of the encoding of the page from which it was used. Of course persuading your servlet container to allow you to read those UTF-8 bytes without messing them up is another story, but that shouldn't depend on the client.
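You can verify this in any browser console; for instance, U+015F (ş) always comes out as its UTF-8 byte pair:

// Always the percent-encoded UTF-8 bytes of U+015F, whatever the page's charset:
encodeURIComponent('\u015f'); // "%C5%9F"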

What might be a problem is if you are using raw non-ASCII characters inside your script file itself. In that case the interpretation of those characters will vary according to the charset the browser is using to load the script. This may be affected by:

  1. any charset declared in the Content-Type: text/javascript;charset= header.
  2. any charset attribute declared on the <script src="..." charset="..."> element.
  3. the charset of the page that included the script.

(1) and (2) are not supported in all browsers. Normally you can rely on (3), but as a third-party script author that is out of your control. Therefore you should use only ASCII characters in your script. (Use \u1234 escapes to include non-ASCII characters in string literals in your script to get around this limitation.)
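For example (a small sketch; the variable names are illustrative), both literals below hold the same string, but only the second keeps the .js file pure ASCII and therefore decodes identically under any charset:

var risky = 'şey';     // raw bytes: meaning depends on the charset used to load the file
var safe = '\u015fey'; // pure ASCII: always decoded as ş followed by "ey"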

bobince
I am using non-ASCII characters, that is why I am having problems.
nimcap
You are using literal, raw non-ASCII characters in your returned `.js`? If so, you will need to encode them so they fit in only ASCII. For string literals that's easy, as above. (I can't think of a reason you'd need non-ASCII characters outside of string literals.)
bobince
I updated my question to be clearer: I am using non-ASCII chars, but not directly in the JS. I fetch the words from the page, and they usually contain non-ASCII chars.
nimcap
When contained in an HTML document, characters are already Unicode. If they are appearing correctly on the user's browser, they will definitely also come through `encodeURIComponent` correctly. If the words don't appear right in the user's browser, there's little you can do to recover them.
bobince
+1 nice one Bob. FWIW, the fact that `encodeURIComponent` specifically creates UTF-8 sequences of bytes is covered by section 15.1.9 of the spec (both 3rd and 5th editions). http://www.ecma-international.org/publications/standards/Ecma-262.htm
T.J. Crowder