I have my data in this format: U+597D, or like this: U+6211. I want to convert them to UTF-8 (the original characters are 好 and 我). How can I do it?
+3
A:
$utf8string = html_entity_decode(preg_replace("/U\+([0-9A-F]{4})/", "&#\\1;", $string), ENT_NOQUOTES, 'UTF-8');
is probably the simplest solution.
Mez
2009-11-26 21:54:41
That results in an HTML entity, not a UTF-8 character :)
Dor
2009-11-26 21:57:04
Not in my tests it doesn't. It converts the code as shown in the question to an HTML entity... THEN decodes the HTML entity.
Mez
2009-11-26 22:12:53
Same problem here, I get an HTML entity...
Anthony
2009-11-26 22:27:41
Your regex won't match all code points - you need {4,5} to match characters higher than U+FFFF.
Thanatos
2009-11-26 22:32:42
No, the problem is that my browser shows "ɕD;" and "D;" in the html-source of the page, while it's supposed to show "好"
Anthony
2009-11-26 22:37:11
Thanatos
2009-11-26 22:39:07
weird, I've already tried both. I'll keep searching for a problem. Thanks anyway.
Anthony
2009-11-26 23:38:52
Hm... it seems to work only for one-byte characters.
Anthony
2009-11-26 23:53:46
The replacement string should read: "&#x\\1;" (a hexadecimal entity, since the digits after U+ are hex).
Thanatos
2009-11-27 00:04:49
OK, it works with the decimal code point of the character, i.e.
Anthony
2009-11-27 09:44:26
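Pulling the comments above together, here is a minimal sketch of the entity-based approach. It assumes the hexadecimal entity form (`&#x...;`), matches 4 to 6 hex digits, and decodes each match on its own so that other entities in the input are left alone. The function name `uplus_to_utf8` is hypothetical, and the type hints assume PHP 7 or later.

```php
<?php
// Hypothetical helper name; not part of the original answer.
function uplus_to_utf8(string $string): string
{
    return preg_replace_callback(
        // "U+" followed by 4 to 6 hex digits (covers code points above U+FFFF).
        '/U\+([0-9A-Fa-f]{4,6})/',
        function (array $m): string {
            // Build a hexadecimal numeric entity such as "&#x597D;"
            // and let html_entity_decode() turn it into UTF-8 bytes.
            return html_entity_decode('&#x' . $m[1] . ';', ENT_NOQUOTES, 'UTF-8');
        },
        $string
    );
}

echo uplus_to_utf8('U+597D U+6211'), "\n"; // prints "好 我"
```

Because the match is greedy, a code point immediately followed by other hex digits (for example `U+597Dab`) would be read as a longer number, so this sketch assumes the codes are separated by non-hex characters.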
+1
A:
With the aid of the following table:
http://en.wikipedia.org/wiki/UTF-8#Description
it can't be simpler :)
Simply mask the bits of each Unicode code point according to which range it falls in.
Dor
2009-11-26 21:54:47
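The masking idea from this answer could look roughly like the sketch below. It follows the byte layouts in the UTF-8 table linked above; the function names `codepoint_to_utf8` and `uplus_to_utf8_manual` are hypothetical, and leaving invalid code points untouched is an assumption of this sketch, not something stated in the thread.

```php
<?php
// Hypothetical helper: encode one Unicode code point as UTF-8 bytes.
// Returns null for values that are not valid Unicode scalar values.
function codepoint_to_utf8(int $cp): ?string
{
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        return null; // out of range or a surrogate
    }
    if ($cp < 0x80) {        // 1 byte:  0xxxxxxx
        return chr($cp);
    }
    if ($cp < 0x800) {       // 2 bytes: 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {     // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

// Hypothetical wrapper: replace every "U+XXXX" token in a string.
function uplus_to_utf8_manual(string $string): string
{
    return preg_replace_callback(
        '/U\+([0-9A-Fa-f]{4,6})/',
        function (array $m): string {
            // Keep the original text if the number is not a valid code point.
            return codepoint_to_utf8((int) hexdec($m[1])) ?? $m[0];
        },
        $string
    );
}

echo uplus_to_utf8_manual('U+597D U+6211'), "\n"; // prints "好 我"
```

This avoids any dependency on HTML entity handling, at the cost of a little more code.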