ansaurus

Question

Unicode character in octets to hexadecimal

Answer 1

+1 A:

First of all, UTF-8 encoding is definitively specified in RFC 3629.

The two octets in your example, 110xxxxx 10xxxxxx, encode one Unicode character. To get its code point, take the x bits from both octets and put them together in order. The result is a number in binary, which you can convert to decimal or hexadecimal if you want. Written in hexadecimal, it is the same number as the XXXX in U+XXXX.
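As a rough illustration of "put the x's together", here is a small sketch. The function name is made up for this example, and the bytes 0xC3 0xA9 (the UTF-8 encoding of é) are just a sample input, not something taken from the question.

```php
<?php
// Hypothetical helper: combine the payload bits of a two-octet sequence
// 110xxxxx 10xxxxxx into one code point. Assumes both octets are well-formed.
function two_octets_to_codepoint(int $lead, int $trail): int {
    $high = $lead & 0x1F;   // low 5 bits of the leading octet (the xxxxx part)
    $low  = $trail & 0x3F;  // low 6 bits of the trailing octet (the xxxxxx part)
    return ($high << 6) | $low; // xxxxx followed by xxxxxx
}

// 0xC3 = 11000011, 0xA9 = 10101001 -> payload 00011 101001 = 0xE9
printf("U+%04X\n", two_octets_to_codepoint(0xC3, 0xA9)); // U+00E9
```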

How did I know that 110xxxxx 10xxxxxx encode one character? A UTF-8 stream contains three kinds of octets:

10xxxxxx - trailing octets
0xxxxxxx - ASCII characters
110xxxxx, 1110xxxx, 11110xxx - leading octets of a multi-octet sequence

Sequences of leading and trailing octets are used to encode Unicode code points 128 and above. 110xxxxx starts a sequence of two octets, 1110xxxx starts a sequence of three octets, and 11110xxx starts a sequence of four. This way you can tell where each sequence begins and ends. Then take the x's from all the octets in the sequence, in order, and that's your Unicode code point.
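The rules above can be sketched as a small decoder. This is only a minimal sketch: the function name is hypothetical, and it does not reject overlong encodings or surrogate code points the way a conforming decoder must.

```php
<?php
// Hypothetical sketch: decode a UTF-8 byte string into a list of code points
// by reading each leading octet, then folding in its trailing octets.
function decode_utf8_sketch(string $s): array {
    $out = [];
    $n = strlen($s);
    for ($i = 0; $i < $n; ) {
        $b = ord($s[$i]);
        if (($b & 0x80) === 0x00) {        // 0xxxxxxx: ASCII, one octet
            $len = 1; $cp = $b;
        } elseif (($b & 0xE0) === 0xC0) {  // 110xxxxx: starts two octets
            $len = 2; $cp = $b & 0x1F;
        } elseif (($b & 0xF0) === 0xE0) {  // 1110xxxx: starts three octets
            $len = 3; $cp = $b & 0x0F;
        } elseif (($b & 0xF8) === 0xF0) {  // 11110xxx: starts four octets
            $len = 4; $cp = $b & 0x07;
        } else {                           // stray trailing octet or invalid byte
            throw new InvalidArgumentException("Unexpected octet at offset $i");
        }
        if ($i + $len > $n) {
            throw new InvalidArgumentException("Truncated sequence at offset $i");
        }
        for ($k = 1; $k < $len; $k++) {
            $t = ord($s[$i + $k]);
            if (($t & 0xC0) !== 0x80) {    // every follower must be 10xxxxxx
                throw new InvalidArgumentException("Expected trailing octet");
            }
            $cp = ($cp << 6) | ($t & 0x3F); // append its six x bits
        }
        $out[] = $cp;
        $i += $len;
    }
    return $out;
}

// "é" is 0xC3 0xA9, "日" is 0xE6 0x97 0xA5
print_r(decode_utf8_sketch("\xc3\xa9\xe6\x97\xa5")); // 233 and 26085
```

In real code you would normally lean on a tested library decoder rather than hand-rolling one; the sketch is only meant to show how the leading octet determines the sequence length.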

azheglov 2010-10-04 20:07:12

Answer 2

+1 A:

You can leverage iconv's UTF-8 decoder to avoid having to write one yourself:

function utf8_to_codepoints($s) {
    return unpack('V*', iconv('UTF-8', 'UCS-4LE', $s));
}

$data= "Caf\xc3\xa9 \xe6\x97\xa5\xe6\x9c\xac \xf0\x9d\x84\x9e"; // Café 日本 𝄞
var_export(utf8_to_codepoints($data));

gives:

array (
  1 => 67,
  2 => 97,
  3 => 102,
  4 => 233,
  5 => 32,
  6 => 26085,
  7 => 26412,
  8 => 32,
  9 => 119070,
)

which can be converted to U+nnnn format using dechex.
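For instance, a small formatting helper might look like the sketch below. The function name is hypothetical, and padding to at least four hex digits is just the usual U+nnnn convention.

```php
<?php
// Hypothetical helper: format one code point in U+nnnn notation.
function codepoint_to_uplus(int $cp): string {
    // dechex() returns lowercase hex with no padding, so uppercase it
    // and left-pad to at least four digits (longer values are kept whole).
    return 'U+' . str_pad(strtoupper(dechex($cp)), 4, '0', STR_PAD_LEFT);
}

// A few code points from the output listed above.
foreach ([67, 233, 26085, 119070] as $cp) {
    echo codepoint_to_uplus($cp), "\n";
}
// Prints U+0043, U+00E9, U+65E5 and U+1D11E, one per line.
```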

bobince 2010-10-05 13:04:15
