Can someone please confirm that all Kanji characters in Chinese are 3 bytes long in UTF-8?
+5
A:
The commonly used Hanzi/Kanji characters are in the "CJK Unified Ideographs" block between U+4E00 and U+9FFF, and take 3 bytes in UTF-8. (The Japanese Hiragana and Katakana characters also take 3 bytes.)
However, there are also some very rarely used characters in the "CJK Unified Ideographs Extension B" and "CJK Compatibility Ideographs Supplement" blocks. These lie outside the Basic Multilingual Plane (U+20000 and above), so they take 4 bytes in UTF-8.
Also be aware that Chinese text often contains ASCII characters like the digits 0-9.
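If you want to check this yourself, here is a minimal Python sketch that encodes one sample character from each range mentioned above and reports its UTF-8 length. The helper name `utf8_len` and the sample code points are my own picks for illustration, not anything standard.

```python
def utf8_len(text: str) -> int:
    """Return the number of bytes needed to encode text as UTF-8."""
    return len(text.encode("utf-8"))


# One illustrative code point per range discussed above.
samples = [
    ("5", "ASCII digit"),
    ("\u3042", "Hiragana"),
    ("\u30a2", "Katakana"),
    ("\u4e2d", "CJK Unified Ideographs"),
    ("\U00020000", "CJK Unified Ideographs Extension B"),
    ("\U0002f800", "CJK Compatibility Ideographs Supplement"),
]

for ch, block in samples:
    print(f"U+{ord(ch):04X}  {block}: {utf8_len(ch)} byte(s)")

# Mixed text: two ideographs (3 bytes each) plus three ASCII digits (1 byte each).
mixed = "\u4e2d\u6587123"
print(f"{len(mixed)} characters, {utf8_len(mixed)} bytes")
```

The practical upshot is that you cannot derive a byte count from a character count (or vice versa) unless you know exactly which ranges the text draws from, so measure the encoded length instead of multiplying by 3.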
dan04
2010-09-09 23:50:51
+1 Wow, apparently we have Chinese speakers on stackoverflow. Cool :-).
sleske
2010-09-10 09:17:11
Japanese text sourced from Shift-JIS is also likely to contain other non-Kanji, non-ASCII characters mapping to two-byte sequences. And then we'll shortly have the emoji to contend with, which are also outside the Basic Multilingual Plane and so 4 bytes...
bobince
2010-09-10 11:28:30
@sleske: No, I don't *speak* Chinese. I've just done way too much work with character encoding.
dan04
2010-09-10 13:17:17