I have the string %u041E%u043B%u0435%u0433%20%u042F%u043A. How can I convert it to real UTF-8 or (better for me) to HTML entities?

+1  A: 

PHP has the decoding function

$string = html_entity_decode($string, ENT_COMPAT, "UTF-8");
Geek Num 88
What does ENT_COMPAT mean?
Blender
That's an HTML decoder. `%u....` is not HTML encoded.
bobince
Will it work for strings encoded in C# or any other language?
Blender
Nope... it does not even convert that strange Flash string to UTF-8 =(
Blender
ENT_COMPAT is the default value for the 2nd argument; I put it in only to reach the 3rd argument, which sets UTF-8.
Geek Num 88
+5  A: 

That's JavaScript escape() format. It is similar to URL-encoding but not compatible. Using it at all is usually a mistake.

The best thing to do is to change the script that generates it, to use proper URL-encoding (encodeURIComponent()) instead. Then you can decode it with urldecode or any other normal URL-decoding function on the server side.
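As a minimal sketch of the server side of that recommendation: the encoded value below is what encodeURIComponent("Олег Як") would be expected to send (the sample value and variable names are assumptions for illustration), and PHP's standard decoder turns it straight back into UTF-8.

```php
<?php
// Assumed client output: encodeURIComponent("Олег Як") in JavaScript
$encoded = '%D0%9E%D0%BB%D0%B5%D0%B3%20%D0%AF%D0%BA';

// rawurldecode() turns each %XX back into one byte, giving UTF-8 text
$decoded = rawurldecode($encoded);

echo $decoded, "\n"; // prints: Олег Як
```

Values that arrive through $_GET or $_POST are already URL-decoded by PHP, so an explicit call like this is only needed for strings obtained some other way.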

If you absolutely must interchange data in this non-standard format, you'll have to write a custom decoder for it. Here's a quick hack leveraging the HTML character-reference-decoder:

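function jsunescape($s) {
    // %uXXXX: a 4-hex-digit UTF-16 code unit, as produced by escape()
    $s= preg_replace('/%u([0-9A-Fa-f]{4})/', '&#x$1;', $s);
    // %XX: a 2-hex-digit code for characters below U+0100
    $s= preg_replace('/%([0-9A-Fa-f]{2})/', '&#x$1;', $s);
    // Let the HTML decoder turn the character references into UTF-8 bytes
    return html_entity_decode($s, ENT_COMPAT, 'utf-8');
}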

This returns a raw UTF-8 byte string. If you really want it in HTML character references like Ру... then leave off the html_entity_decode call. But normally you don't. Best to keep strings in raw format until they need to be escaped for final output — and best not to replace non-ASCII characters with character references at all unless you really need to.
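For the case where character references really are wanted, a sketch of the "leave off the html_entity_decode call" variant might look like this. The function name jsunescape_to_refs is hypothetical, and the patterns are restricted to hex digits as an extra precaution.

```php
<?php
// Hypothetical variant of jsunescape() that stops at character references
function jsunescape_to_refs($s) {
    // %uXXXX becomes &#xXXXX;
    $s = preg_replace('/%u([0-9A-Fa-f]{4})/', '&#x$1;', $s);
    // %XX becomes &#xXX;
    return preg_replace('/%([0-9A-Fa-f]{2})/', '&#x$1;', $s);
}

$refs = jsunescape_to_refs('%u041E%u043B%u0435%u0433%20%u042F%u043A');
echo $refs, "\n";
// prints: &#x041E;&#x043B;&#x0435;&#x0433;&#x20;&#x042F;&#x043A;
```

Note that any literal & or < already present in the input is passed through unchanged, so the result is not safe to drop into a page without further escaping.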

"What if a string like this comes to me: '%CE%EB%E5%E3+%DF%EA%F3%F8%EA%E8%ED'?"

That's URL-form-encoded, which is not directly compatible with escape() format. Whilst URL-encoding's 2-digit byte escapes are different from the crazy escape-format 4-digit code-unit-escapes, the character + is ambiguous. It could mean a plus (if the string came from escape), or a space (if it came from a browser form submission). There is no way to tell which it is. This is another reason not to use escape().
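The + ambiguity shows up in PHP's own two decoders, which differ only in how they treat that character. A small illustration (the sample input is made up):

```php
<?php
$input = 'a+b%20c';

// Form-style decoding: + is treated as a space
$asForm = urldecode($input);     // "a b c"

// RFC 3986-style decoding: + is left as a literal plus
$asRaw = rawurldecode($input);   // "a+b c"

echo $asForm, "\n", $asRaw, "\n";
```

Which of the two is right depends entirely on where the string came from, which is why the caller has to say so explicitly.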

Apart from that, if the charset of this string were UTF-8 then yes, the above function would be fine, converting both the URL-encoded bytes and the crazy escape()-format Unicode characters into raw UTF-8 bytes.

However it actually appears to be code page 1251 (Windows Russian). Do you really want to handle all your strings in cp1251? If so, you would have to change it a bit to make it encode the four-digit escapes into a different charset. This is messy:

function url_or_maybe_jsescape_decode($s, $charset, $isform) {
    if ($isform)
        $s= str_replace('+', ' ', $s);
    $s= preg_replace('/%u(....)/', '&#x$1;', $s);
    $s= preg_replace('/%(..)/', '&!#x$1;', $s);
    $s= html_entity_decode($s, ENT_COMPAT, $charset);
    $s= str_replace('&!', '&', $s);
    $s= html_entity_decode($s, ENT_COMPAT, 'utf-8');
    return $s;
}

echo url_or_maybe_jsescape_decode('%CE%EB%E5%E3+%DF%EA%F3%F8%EA%E8%ED', 'cp1251', TRUE);
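If a string is known to be plain form-encoded cp1251 with no %u escapes in it, a simpler route is to URL-decode it and then convert the bytes to UTF-8. This is only a sketch under that assumption, and it assumes the mbstring extension is available; iconv() would be an alternative.

```php
<?php
// urldecode() yields raw cp1251 bytes and turns + into a space
$bytes = urldecode('%CE%EB%E5%E3+%DF%EA%F3%F8%EA%E8%ED');

// Convert from Windows-1251 to UTF-8 so the rest of the app can stay UTF-8
$utf8 = mb_convert_encoding($bytes, 'UTF-8', 'Windows-1251');

echo $utf8, "\n"; // prints: Олег Якушкин
```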

I would strongly recommend:

  1. fixing the Flash file so that it uses proper encodeURIComponent and not escape, so you can use a standard URL-decoder instead of this ugly hack.

  2. using UTF-8 instead all the way through your application, so you can support languages other than just Russian, and you don't have to worry about the input encoding of submitted forms changing.

(All encodings that are not UTF-8 suck, and that's a FACT proven by SCIENCE!)

bobince
It works fine for me for now (while I use Flash). But what if a string like this comes to me: '%CE%EB%E5%E3+%DF%EA%F3%F8%EA%E8%ED'? Will your function do any harm to it?
Blender
I mean, will it still look like Олег Якушкин when displayed in a browser?
Blender
Finally a complete answer (with workarounds) addressing the incompatibilities between JavaScript's `escape()` and proper URL encoding. Would +5 if I could, and would suggest rewording the question title so future generations can profit from it.
Pekka
A: 

As suggested by others, convert it to Unicode HTML entities. This is the regex I use:

function escapePercentU($s) {
   $s = preg_replace( "/%u([A-Fa-f0-9]{4})/", "&#x$1;", $s);
   return html_entity_decode($s, ENT_COMPAT, 'utf-8');
}
ZZ Coder