A Unicode character in octets looks something like 110xxxxx 10xxxxxx. How can I transform these octets into hexadecimal notation like U+XXXX?
First of all, here's the document that definitively defines UTF-8 encoding.
The two octets in your example, 110xxxxx 10xxxxxx, encode one Unicode character. Its binary code is, well, just take those x's (bits) and put them together. You'll get a number in binary, which you can convert to decimal or hexadecimal if you want. Written in hexadecimal, that number is the XXXX in U+XXXX.
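As a worked instance of "put the x's together", here is a minimal sketch for the two-octet case only. It uses the octets 0xC3 0xA9 (the "é" from the sample string further down); the variable names are just for illustration.

```php
<?php
// Two-octet sequence 110xxxxx 10xxxxxx, here 0xC3 0xA9 ("é").
$lead  = 0xC3; // 11000011
$trail = 0xA9; // 10101001

// Keep the 5 payload bits of the leading octet and the 6 payload bits
// of the trailing octet, then join them into one number.
$codepoint = (($lead & 0x1F) << 6) | ($trail & 0x3F);

printf("U+%04X\n", $codepoint); // prints U+00E9
```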
How did I know that 110xxxxx 10xxxxxx encode one character? There may be three kinds of octets in a UTF-8 stream:
- 10xxxxxx - trailing octets
- 0xxxxxxx - ASCII characters
- 110xxxxx, 1110xxxx, etc. - leading octets in the sequence.
Sequences of leading and trailing octets are used to encode Unicode code points from 128 and up. 110xxxxx means it starts a sequence of two octets, 1110xxxx starts a sequence of three octets, etc. This way you can isolate sequences from each other. Then take the x's from all the octets in the sequence and that's your Unicode code point.
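The rules above can be sketched as a small hand-written decoder. This is only an illustration under stated assumptions: the function name utf8_decode_manual is made up, and the sketch does not reject overlong encodings or surrogate code points, which a strict decoder should.

```php
<?php
// Hypothetical helper: turns a UTF-8 byte string into a list of code points.
// Assumption: it checks sequence structure only, not overlong forms or surrogates.
function utf8_decode_manual(string $s): array {
    $codepoints = [];
    $len = strlen($s);
    for ($i = 0; $i < $len; $i += $extra + 1) {
        $b = ord($s[$i]);
        if ($b < 0x80) {                  // 0xxxxxxx: ASCII, stands alone
            $cp = $b;        $extra = 0;
        } elseif (($b & 0xE0) === 0xC0) { // 110xxxxx: starts two octets
            $cp = $b & 0x1F; $extra = 1;
        } elseif (($b & 0xF0) === 0xE0) { // 1110xxxx: starts three octets
            $cp = $b & 0x0F; $extra = 2;
        } elseif (($b & 0xF8) === 0xF0) { // 11110xxx: starts four octets
            $cp = $b & 0x07; $extra = 3;
        } else {                          // stray trailing octet or invalid byte
            throw new InvalidArgumentException("Unexpected octet at offset $i");
        }
        // Append the 6 payload bits of each trailing octet (10xxxxxx).
        for ($j = 1; $j <= $extra; $j++) {
            $t = ($i + $j < $len) ? ord($s[$i + $j]) : -1;
            if ($t === -1 || ($t & 0xC0) !== 0x80) {
                throw new InvalidArgumentException("Bad sequence at offset $i");
            }
            $cp = ($cp << 6) | ($t & 0x3F);
        }
        $codepoints[] = $cp;
    }
    return $codepoints;
}

$decoded = utf8_decode_manual("Caf\xc3\xa9 \xe6\x97\xa5");
// $decoded is [67, 97, 102, 233, 32, 26085]
```

Unlike the unpack-based approach, this returns a 0-indexed array.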
You can leverage iconv's UTF-8 decoder to avoid having to write one yourself:
function utf8_to_codepoints($s) {
    // Convert to 4-byte little-endian UCS-4, then unpack each 32-bit value.
    return unpack('V*', iconv('UTF-8', 'UCS-4LE', $s));
}
$data = "Caf\xc3\xa9 \xe6\x97\xa5\xe6\x9c\xac \xf0\x9d\x84\x9e"; // Café 日本 𝄞
var_export(utf8_to_codepoints($data));
gives:
array (
  1 => 67,
  2 => 97,
  3 => 102,
  4 => 233,
  5 => 32,
  6 => 26085,
  7 => 26412,
  8 => 32,
  9 => 119070,
)
which can be converted to U+nnnn format using dechex.
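A minimal sketch of that last step, assuming you want at least four upper-case hex digits. The helper name codepoint_to_notation is made up for illustration, and sprintf('U+%04X', $cp) would do the same job in one call.

```php
<?php
// Hypothetical helper: formats one code point as U+XXXX.
function codepoint_to_notation(int $cp): string {
    // Zero-pad to at least four hex digits; longer values such as
    // 0x1D11E keep all of their digits.
    return 'U+' . str_pad(strtoupper(dechex($cp)), 4, '0', STR_PAD_LEFT);
}

$codepoints = [67, 233, 26085, 119070]; // sample values from the output above
$notation = implode(' ', array_map('codepoint_to_notation', $codepoints));
echo $notation, "\n"; // prints: U+0043 U+00E9 U+65E5 U+1D11E
```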