I am having a problem displaying a JavaScript string with embedded Unicode character escape sequences (\uXXXX) where the initial "\" character is itself HTML-escaped as "&#92;". What do I need to do to transform the string so that it properly evaluates the escape sequences and produces output with the correct Unicode characters?

For example, I am dealing with input such as:

"this is a &#92;u201ctest&#92;u201d";

Attempting to decode the "&#92;" using a regular expression, e.g.:

var out  = text.replace(/&#92;/g,'\\');

results in the output text:

"this is a \u201ctest\u201d";

that is, the Unicode escape sequences are displayed as actual escape sequences, not the double quote characters I would like.

A: 

I'm not sure if this is it, but the answer might have something to do with eval(), if you can trust your input.

Kev
A: 

As it turns out, it's unescape() we want, but with '%uXXXX' rather than '\uXXXX':

unescape(yourteststringhere.replace(/&#92;/g,'%'))

Kev
I don't think this will work in general; unescape is for URLs, which don't handle multibyte Unicode characters.
JW
Doesn't the fact that there are 4 X's indicate multibytedness? ;) In any case, it works for me in FF3: var yourteststringhere = "Ein sch&#92;u00F6nes Beispiel eines mehrsprachigen Textes: &#92;u65E5&#92;u672C&#92;u8A9E";
Kev
And FF2, I might add.
Kev
I stand corrected. Thought you were using %XX, rather than %uXX.
JW
escape/unescape is actually its own weird animal which behaves differently to URL encoding (encodeURIComponent) or any other standard encoding scheme for the web. The %uXXXX escape for non-Latin-1 was introduced by IE, and is supported by most browsers now, but it's still not reliably documented.
bobince
...nonetheless, you could probably get away with the above in practice, as long as there are no other percentage signs in the string.
bobince
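The caveat about percent signs can be worked around by escaping literal percent signs before the substitution. This is only a minimal sketch: it assumes the backslash arrives as the numeric entity &#92;, the name decodeViaUnescape is hypothetical, and unescape() itself is a legacy function.

```javascript
// Hypothetical helper: protects literal "%" before reusing the unescape() trick.
function decodeViaUnescape(text) {
  return unescape(
    text
      .replace(/%/g, '%25')    // unescape() turns "%25" back into "%"
      .replace(/&#92;/g, '%')  // "&#92;uXXXX" becomes "%uXXXX"
  );
}

var out = decodeViaUnescape('100% of this is a &#92;u201ctest&#92;u201d');
```

The order matters: the existing percent signs have to be protected before new ones are introduced by the second replace.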
A: 

This is a terrible solution, but you can do this:

var x = "this is a &#92;u201ctest&#92;u201d".replace(/&#92;/g,'\\')
// x is now "this is a \u201ctest\u201d"
eval('x = "' + x + '"')
// x is now "this is a “test”"

It's terrible because:

  • eval can be dangerous, if you don't know what's in the string

  • the string quoting in the eval statement will break if you have actual quotation marks in your string
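
Both drawbacks can be avoided by decoding the escapes with a replacement callback instead of eval. This is a rough sketch, assuming that only \uXXXX escapes occur and that the backslash arrives either literally or as &#92;. The name decodeUnicodeEscapes is hypothetical.

```javascript
// Hypothetical helper: turns "\uXXXX" or "&#92;uXXXX" into the character itself.
function decodeUnicodeEscapes(text) {
  return text.replace(/(?:&#92;|\\)u([0-9a-fA-F]{4})/g, function (match, hex) {
    // hex is the four-digit code unit captured by the group
    return String.fromCharCode(parseInt(hex, 16));
  });
}

var x = decodeUnicodeEscapes('she said "this is a &#92;u201ctest&#92;u201d"');
```

Nothing is evaluated as code here, and quotation marks in the input pass through untouched.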

JW
A: 

How many accounts do you have, Jeffrey? I see at least three answers from you in different accounts... (if JW is still you.)

Anyway, I think you can go with the smart solution given by Kev, I tested it quickly in FF3, Opera 9.x, Safari 3 and even IE6! It worked in all browsers: most modern browsers are Unicode aware.
http://www.javascripter.net/faq/unescape.htm -> "(In Unicode-aware browsers, in addition to escape-sequences %XX, the unescape function also processes sequences of the form %uXXXX)"

Test code:

javascript:alert(unescape("this is a &#92;u201ctest&#92;u201d".replace(/&#92;/g, '%')));
PhiLho
No, JW is me...not Jeffrey Winter.
JW
OK, sorry for the confusion! ;-)
PhiLho
A: 

(I'm not JW; I have only a single account :) )

Kev's solution is doing the trick for me. I will evaluate it in other browsers, etc. to see if I encounter any other problems.

Thanks for the solution.

No probs, glad to help. :)
Kev
I think you are a liar, the account you posted this answer from is not the same one you posted the question from (userid 35789 for the question, 35803 for this answer). Seeing that Jeffrey is userid 35791 I believe this is also one of your accounts.
Robert Gamble
A: 

I think you are a liar, the account you posted this answer from is not the same one you posted the question.

Jeez, what's with the hostility? There's obviously something going on here that I'm not aware of. Is there some hidden benefit to having multiple accounts that gets people ticked off for some reason?

I should have said "I don't have any 'accounts'". This is the first question I've ever posted on this site; I simply filled in my name and email address and never created a specific account.

But I must say, I'm pretty impressed. I was hacking away all morning looking for an answer here and had a solution before I even got up from my desk. Thanks again.

Oh, okay, so you have actually posted as both "Jeffrey" and "Jeffrey Winter", account or no. I think that's what Robert Gamble was pointing out. I'm glad the vibe didn't scare you off.
Kev
BTW, if you wouldn't mind officially accepting my answer (I guess as "Jeffrey Winter") if you haven't found any problems, I'd appreciate the reputation points. :)
Kev
A: 

Are you sure '\' is the only character that might get HTML-escaped? Are you sure '\uXXXX' is the only kind of string escape in use?

If not, you'll need a general-purpose HTML-character/entity-reference-decoder and JS-string-literal-decoder. Unfortunately JavaScript has no built-in methods for this and it's quite tedious to do manually with a load of regexps.
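
As a rough idea of what the manual approach involves, the sketch below handles only numeric character references (decimal and hexadecimal) and a small set of string escapes. Named entities such as &amp; and the remaining escape forms are left out. Both function names are hypothetical, and String.fromCodePoint assumes a reasonably modern engine.

```javascript
// Decode decimal (&#92;) and hexadecimal (&#x5c;) character references.
function decodeNumericRefs(text) {
  return text.replace(/&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));/g, function (m, hex, dec) {
    var code = hex ? parseInt(hex, 16) : parseInt(dec, 10);
    // Leave out-of-range references untouched instead of throwing.
    return code <= 0x10FFFF ? String.fromCodePoint(code) : m;
  });
}

// Decode \uXXXX, \xHH and a few single-character escapes.
function decodeJsEscapes(text) {
  var simple = { n: '\n', t: '\t', r: '\r', '\\': '\\', '"': '"', "'": "'" };
  return text.replace(/\\(?:u([0-9a-fA-F]{4})|x([0-9a-fA-F]{2})|([ntr\\"']))/g,
    function (m, u, x, ch) {
      if (u) return String.fromCharCode(parseInt(u, 16));
      if (x) return String.fromCharCode(parseInt(x, 16));
      return simple[ch];
    });
}

var decoded = decodeJsEscapes(decodeNumericRefs('caf&#x5c;xe9 &#92;u201ctest&#92;u201d&#92;n'));
```

The HTML layer is decoded first because it is the outer encoding; only then do the backslashes exist for the second pass to find.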

It is possible to take advantage of the browser's HTML-decoder by assigning the string to an element's innerHTML property, and then ask JavaScript to decode the string as above:

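// Wrapped in a function (the name is illustrative) so that return is valid.
function decodeHtmlThenJs(s) {
    var el= document.createElement('div');
    el.innerHTML= s;
    return eval('"'+el.firstChild.data+'"');
}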

However this is an incredibly ugly hack and a security hole if the string comes from a source that isn't 100% trusted.

Where are the strings coming from? It would be nicer if possible to deal with the problem at the server end where you may have more powerful text handling features available. And if you could fix whatever it is that is unnecessarily HTML-escaping your backslashes you could find the problem fixes itself.

bobince