Question

Answer 1

+3 A:

$utf8string = html_entity_decode(preg_replace("/U\+([0-9A-F]{4})/", "&#\\1;", $string), ENT_NOQUOTES, 'UTF-8');

is probably the simplest solution.

Mez 2009-11-26 21:54:41

That results in an HTML entity, not a UTF-8 character :)

Dor 2009-11-26 21:57:04

Not in my tests it doesn't. It converts the code as shown in the Q to an HTML entity... THEN decodes the HTML entity.

Mez 2009-11-26 22:12:53

The same problem here, I get an HTML entity...

Anthony 2009-11-26 22:27:41

Your regex won't match all code points - you need {4,6} to match characters higher than U+FFFF, since code points run up to U+10FFFF.

Thanatos 2009-11-26 22:32:42

No, the problem is that my browser shows "ɕD;" and "D;" in the html-source of the page, while it's supposed to show "好"

Anthony 2009-11-26 22:37:11

Thanatos 2009-11-26 22:39:07

Weird, I've already tried both. I'll keep searching for the problem. Thanks anyway.

Anthony 2009-11-26 23:38:52

Hm... seems like it works only for one-byte characters.

Anthony 2009-11-26 23:53:46

The replacement string should read: "&#x\\1;"

Thanatos 2009-11-27 00:04:49

Ok, it works with decimal code point of the character, i.e.

Anthony 2009-11-27 09:44:26
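
Pulling the comments above together, here is a minimal sketch of the entity-based approach with a hexadecimal entity and a wider digit range. The function name `decodeUPlusNotation` is made up for illustration, and the `{4,6}` quantifier is an assumption that no further hex digits directly follow the code point in the input. Note that `html_entity_decode` will also decode any other entities already present in the string.

```php
<?php
// Hypothetical helper name; not part of the original answer.
function decodeUPlusNotation(string $string): string
{
    // Turn "U+597D" into the hexadecimal entity "&#x597D;" (note the "x"),
    // accepting 4 to 6 hex digits so code points above U+FFFF also match.
    $entities = preg_replace('/U\+([0-9A-Fa-f]{4,6})/', '&#x$1;', $string);

    // Decode the numeric entities into UTF-8 encoded characters.
    return html_entity_decode($entities, ENT_NOQUOTES, 'UTF-8');
}

echo decodeUPlusNotation('U+597D'), "\n"; // prints 好
```

Without the "x", the digits are read as a decimal number, which would explain the stray characters reported earlier in the thread.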

Answer 2

+1 A:

With the aid of the following table:

can't be simpler :)

Simply mask the bits of the Unicode code point according to which range it falls in.
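
The masking idea can be sketched roughly as follows, assuming the standard UTF-8 ranges (up to U+007F one byte, up to U+07FF two bytes, up to U+FFFF three bytes, up to U+10FFFF four bytes). The function name `codepointToUtf8` is hypothetical, and the surrogate check is an added safety measure rather than something the answer mentions.

```php
<?php
// Hypothetical helper: encode one Unicode code point as a UTF-8 byte string.
function codepointToUtf8(int $cp): string
{
    // Reject values outside Unicode and the UTF-16 surrogate range.
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        throw new InvalidArgumentException('Not a valid Unicode scalar value');
    }
    if ($cp <= 0x7F) {
        // 0xxxxxxx
        return chr($cp);
    }
    if ($cp <= 0x7FF) {
        // 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp <= 0xFFFF) {
        // 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

echo codepointToUtf8(hexdec('597D')), "\n"; // prints 好
```

Each continuation byte carries six bits of the code point, which is why every step shifts by a multiple of 6 and masks with 0x3F.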

Dor 2009-11-26 21:54:47

PHP: Convert unicode codepoint to UTF-8