I have inherited a database which contains strings such as:

\u5353\u8d8a\u4e9a\u9a6c\u900a: \u7f51\u4e0a\u8d2d\u7269: \u5728\u7ebf\u9500\u552e\u56fe\u4e66\uff0cDVD\uff0cCD\uff0c\u6570\u7801\uff0c\u73a9\u5177\uff0c\u5bb6\u5c45\uff0c\u5316\u5986

The question is, how do I get this to be displayed properly in an HTML page?

I'm using PHP5 to process the strings.

A: 

Here's a very thorough article on Unicode encoding in PHP: http://www.phpwact.org/php/i18n/charsets

Gavin Miller
+3  A: 

PHP < 6 is woefully unaware of Unicode, so you have to do everything yourself:

  • Make sure that your database is using a Unicode-capable encoding for its connections. In MySQL for example, the directive is default-character-set = . UTF-8 is a reasonable choice
  • Let the browser know which encoding you are using. There are several ways of doing this:

    1. Set a charset value in the Content-Type header. Something like header('Content-Type: text/html;charset=utf-8');

    2. Use a <meta http-equiv> version of the above header.

    3. Set the encoding in the XML declaration: <?xml version="1.0" encoding="utf-8"?>

Option 1 takes precedence over option 2. I'm not sure where option 3 fits in.
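
A minimal sketch of options 1 and 2 together, assuming the page is served as text/html in UTF-8. The helper name content_type_header is made up for illustration; header() and headers_sent() are standard PHP.

```php
<?php
// Hypothetical helper (not from the thread): builds the header line for a given charset.
function content_type_header($charset = 'utf-8') {
    return 'Content-Type: text/html; charset=' . $charset;
}

// Option 1: the HTTP header. It has to be sent before any page output.
if (!headers_sent()) {
    header(content_type_header());
}

// Option 2: the <meta http-equiv> fallback, placed inside <head>.
echo '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . "\n";
```

Sending both is harmless as long as they name the same charset.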

If you need to do any string processing prior to displaying the data, make sure you use the multibyte (mb_*) string functions. If you have Unicode data coming from other sources in other encodings, you'll need to use mb_convert_encoding.
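
To illustrate the difference, here is a small sketch that assumes the mbstring extension is loaded. The sample strings are my own: the first two characters of the question's string written out as UTF-8 bytes, and a made-up Latin-1 value.

```php
<?php
// U+5353 U+8D8A (the first two characters of the question's string) as UTF-8 bytes.
$utf8 = "\xE5\x8D\x93\xE8\xB6\x8A";

$bytes = strlen($utf8);                    // counts bytes: 6
$chars = mb_strlen($utf8, 'UTF-8');        // counts characters: 2
$first = mb_substr($utf8, 0, 1, 'UTF-8');  // first character, all 3 bytes of it

// Data arriving in another encoding (here ISO-8859-1) is converted to UTF-8 first.
$latin1    = "caf\xE9";
$converted = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
```

Passing the encoding explicitly to each mb_* call avoids depending on the mbstring.internal_encoding setting.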

oggy
+2  A: 
daremon
Brilliant.. thanks!
+1 for use of fileformat.info - I love that site ;)
Peter Bailey
+1  A: 

Based on daremon's submission, here is a "unicode_decode" function which will convert \uXXXX escapes into their UTF-8 counterparts.

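function unicode_decode($str) {
    // Decode each run of \uXXXX escapes as a unit so surrogate pairs stay intact.
    // preg_replace_callback replaces the deprecated /e modifier.
    return preg_replace_callback('/(?:\\\\u[0-9A-Fa-f]{4})+/', 'unicode_decode_run', $str);
}
function unicode_decode_run($m) {
    // The hex digits are big-endian UTF-16 code units, so name the byte order explicitly.
    return iconv('UTF-16BE', 'UTF-8', hex2str(str_replace('\\u', '', $m[0])));
}
function hex2str($hex) {
    // Turn each pair of hex digits into one byte.
    $r = '';
    for ($i = 0; $i < strlen($hex) - 1; $i += 2) {
        $r .= chr(hexdec($hex[$i] . $hex[$i + 1]));
    }
    return $r;
}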
I'm not exactly certain what iconv() does... PHP manuals are down right now.
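
An alternative sketch, not from the answers above: since \uXXXX is also the JSON escape syntax, json_decode() can do the conversion when the json extension is available (PHP 5.2+). The function name unicode_decode_json is hypothetical. It assumes the stored string contains only \uXXXX escapes and plain text, with no double quotes, other backslashes, or raw control characters.

```php
<?php
// Hypothetical helper: wraps the value in quotes so it parses as a JSON string literal.
function unicode_decode_json($str) {
    $decoded = json_decode('"' . $str . '"');
    // json_decode() returns null on invalid input; fall back to the original text.
    return $decoded === null ? $str : $decoded;
}

// Decode, then escape for HTML before printing into a UTF-8 page.
$page_text = unicode_decode_json('\u5353\u8d8a: DVD');
echo htmlspecialchars($page_text, ENT_QUOTES, 'UTF-8');
```

If the data may contain quotes or stray backslashes, a regex-based decoder is the safer choice.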