My data is in this format: U+597D or U+6211. I want to convert these code points to UTF-8 characters (the originals are 好 and 我). How can I do it?

+3  A: 
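// Rewrite each U+XXXX token as a hexadecimal HTML entity (note the "x"), then decode it to UTF-8.
// {4,6} also covers code points above U+FFFF.
$utf8string = html_entity_decode(preg_replace("/U\+([0-9A-F]{4,6})/", "&#x\\1;", $string), ENT_NOQUOTES, 'UTF-8');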

is probably the simplest solution.

Mez
That results in an HTML entity, not a UTF-8 character :)
Dor
Not in my tests it doesn't. It converts the code as shown in the Q to an HTML entity... THEN decodes the HTML entity.
Mez
Same problem here, I get an HTML entity...
Anthony
Your regex won't match all code points - you need {4,5} to match characters higher than U+FFFF.
Thanatos
No, the problem is that my browser shows "ɕD;" and "D;" in the html-source of the page, while it's supposed to show "好"
Anthony
Thanatos
weird, I've already tried both. I'll keep searching for a problem. Thanks anyway.
Anthony
Hm... seems like it works only for single-byte characters.
Anthony
The replacement string should read: "&#x\\1;"
Thanatos
Ok, it works with decimal code point of the character, i.e.
Anthony
+1  A: 

With the aid of the following table:

http://en.wikipedia.org/wiki/UTF-8#Description

can't be simpler :)

Simply mask the bits of each Unicode code point according to the range it falls in.
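
The masking can be sketched roughly as below, following the byte layouts in the linked table. The function names `codepoint_to_utf8` and `decode_u_notation` are made up for illustration, and the input is assumed to contain tokens of the form U+ followed by 4 to 6 hex digits; treat this as a minimal sketch rather than a hardened implementation.

```php
<?php
// Hypothetical helper: encode a single Unicode code point as UTF-8 bytes.
function codepoint_to_utf8(int $cp): string
{
    // Reject values outside the Unicode range and UTF-16 surrogates.
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        throw new InvalidArgumentException("Invalid code point: $cp");
    }
    if ($cp < 0x80) {            // 1 byte:  0xxxxxxx
        return chr($cp);
    }
    if ($cp < 0x800) {           // 2 bytes: 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {         // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

// Hypothetical helper: replace every U+XXXX token in a string with its UTF-8 character.
function decode_u_notation(string $s): string
{
    return preg_replace_callback(
        '/U\+([0-9A-Fa-f]{4,6})/',
        function (array $m): string {
            return codepoint_to_utf8(hexdec($m[1]));
        },
        $s
    );
}

echo decode_u_notation("U+597D U+6211"), "\n"; // prints "好 我"
```

One caveat: the {4,6} quantifier is greedy, so a token immediately followed by more hex digits (e.g. "U+597DA") is read as one longer code point.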

Dor