Can someone please confirm that all Kanji characters in Chinese are 3 bytes long in UTF-8?

+1  A: 

Yes. Kanji occupy U+4E00 to U+9FAF, and UTF-8 uses 3 bytes for every code point from U+0800 to U+FFFF.
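
To make the range argument concrete, here is a minimal Python sketch of how the UTF-8 byte count follows from the code point value. The function name `utf8_length` is made up for illustration; the thresholds are the standard UTF-8 boundaries, and the result is cross-checked against Python's own encoder.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 uses for a single Unicode code point."""
    if code_point < 0x80:       # U+0000..U+007F (ASCII)
        return 1
    if code_point < 0x800:      # U+0080..U+07FF
        return 2
    if code_point < 0x10000:    # U+0800..U+FFFF (rest of the BMP)
        return 3
    return 4                    # U+10000..U+10FFFF (supplementary planes)


# Both ends of the range quoted above fall inside the 3-byte band,
# and the built-in encoder agrees with the range rule.
for cp in (0x4E00, 0x9FAF):
    assert utf8_length(cp) == len(chr(cp).encode("utf-8")) == 3
```

Anything at U+10000 or above falls through to the 4-byte case.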

gawi
+5  A: 

The commonly used Hanzi/Kanji characters are in the "CJK Unified Ideographs" block between U+4E00 and U+9FFF, and take 3 bytes in UTF-8. (The Japanese Hiragana and Katakana characters also take 3 bytes.)

However, there are also some very rarely-used characters in the "CJK Unified Ideographs Extension B" and "CJK Compatibility Ideographs Supplement" blocks, which take 4 bytes in UTF-8.

Also be aware that Chinese text often contains ASCII characters like the digits 0-9.
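
A quick way to check these cases is to encode one sample character from each block and count the bytes. This is only a sketch: the sample characters and the names `samples` and `byte_lengths` are chosen for illustration.

```python
# One sample character per block mentioned above (samples picked for illustration).
samples = {
    "CJK Unified Ideographs": "\u4e2d",                       # U+4E2D
    "Hiragana": "\u3042",                                     # U+3042
    "Katakana": "\u30a2",                                     # U+30A2
    "CJK Unified Ideographs Extension B": "\U00020000",       # U+20000
    "CJK Compatibility Ideographs Supplement": "\U0002f800",  # U+2F800
    "ASCII digit": "7",                                       # U+0037
}

# Encode each character as UTF-8 and record how many bytes it needs.
byte_lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}

for name, n in byte_lengths.items():
    print(f"{name}: {n} byte(s)")

# Mixed text: four ASCII digits (1 byte each) plus one ideograph (3 bytes).
mixed = "2024\u5e74"
mixed_bytes = len(mixed.encode("utf-8"))
```

So a byte count alone does not tell you how many characters a string holds; a 7-byte string like `mixed` above contains 5 characters.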

dan04
+1 Wow, apparently we have Chinese speakers on Stack Overflow. Cool :-).
sleske
Japanese text sourced from Shift-JIS is also likely to contain other non-Kanji, non-ASCII characters mapping to two-byte sequences. And then we'll shortly have the emoji to contend with, which are also outside the Basic Multilingual Plane and so 4 bytes...
bobince
@sleske: No, I don't *speak* Chinese. I've just done way too much work with character encoding.
dan04