Unicode simply assigns an integer (the "code point") to each character, and encodings such as UTF-8 turn these integers into a sequence of bytes to be stored in memory. Some languages have characters that require multiple bytes to represent. My question is: why can't we simply store each character as the binary representation of its code point? Wouldn't that be easier?
Yes we can, and that is UTF-32.
The problem is that UTF-32 wastes a lot of space. For European, Hebrew, or Arabic text, UTF-8 takes only 1 to 2 bytes per code point, while UTF-32 always takes 4 bytes per code point.
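To see the difference concretely, one can encode the same text both ways and compare the byte counts. A minimal Python sketch (the sample strings are my own examples, not from the question):

```python
# Compare the storage cost of the same text under UTF-8 and UTF-32.
samples = {
    "English": "Hello, world",
    "Hebrew":  "שלום עולם",
    "Chinese": "你好，世界",
}

for name, text in samples.items():
    utf8_len = len(text.encode("utf-8"))
    utf32_len = len(text.encode("utf-32-be"))  # 4 bytes per code point, no BOM
    print(f"{name}: {len(text)} code points -> "
          f"UTF-8 {utf8_len} bytes, UTF-32 {utf32_len} bytes")
```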
If we stored the integer value with a variable size, e.g. 1 byte for 0–255, 2 bytes for 256–65535, and so on, we would have an ambiguity problem: should the bytes 5a 5a represent "ZZ" or "婚"? The solution is basically what we call UTF-8: some special bits indicate the length of the byte sequence, so there is only one way to decode it.
Firstly, there is a way to store them as raw code points: that's UTF-32 (or UCS-4). Each character always takes four bytes, storing the code point unmodified.
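As a sketch of what "unmodified" means here: encoding to big-endian UTF-32 (no BOM) yields exactly the code point value, zero-padded to four bytes. The example characters are arbitrary:

```python
for ch in "A€😀":
    print(f"U+{ord(ch):06X} -> {ch.encode('utf-32-be').hex(' ')}")
# U+000041 -> 00 00 00 41
# U+0020AC -> 00 00 20 ac
# U+01F600 -> 00 01 f6 00
```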
However, the reasons for using others such as UTF-8 include:
- ASCII compatibility: files that contain only U+0000 - U+007F don't need to change at all (see the sketch after this list)
- size efficiency: UTF-8 usually ends up in much smaller files
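A small check of the ASCII-compatibility point above (the sample string is mine): a pure-ASCII string encodes to identical bytes whether treated as ASCII or as UTF-8.

```python
ascii_text = "Plain ASCII: no byte changes needed."

# Byte-for-byte identical, so an existing ASCII file is already valid UTF-8.
assert ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print(ascii_text.encode("utf-8").hex(" "))
```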
How exactly would you save those code points? Some code points fit into one byte, others need 3 bytes. Would you use 4 bytes for each code point? And when you look at a byte stream, how do you know where one code point ends and the next one starts? UTF-8 (and other encodings) answers that.
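To sketch how UTF-8 answers that question: the first byte of every sequence encodes how many bytes the sequence occupies, so a decoder can walk a stream unambiguously. This is a simplified illustration, not a full validator, and `utf8_sequence_length` is a name I made up:

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """How many bytes the UTF-8 sequence starting with lead_byte occupies.

    The length is encoded in the leading byte itself:
      0xxxxxxx -> 1 byte, 110xxxxx -> 2, 1110xxxx -> 3, 11110xxx -> 4.
    """
    if lead_byte < 0x80:            # 0xxxxxxx
        return 1
    if lead_byte >> 5 == 0b110:     # 110xxxxx
        return 2
    if lead_byte >> 4 == 0b1110:    # 1110xxxx
        return 3
    if lead_byte >> 3 == 0b11110:   # 11110xxx
        return 4
    raise ValueError("continuation byte or invalid lead byte")

# Walk a byte stream one code point at a time.
data = "Zß婚😀".encode("utf-8")
i = 0
while i < len(data):
    n = utf8_sequence_length(data[i])
    print(data[i:i + n].hex(" "), "->", data[i:i + n].decode("utf-8"))
    i += n
```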