Unicode simply assigns an integer to each character. UTF-8 and other encodings are then used to turn these integers ("code points") into sequences of bytes for storage in memory. My question is: why can't we simply store a character as the binary representation of its Unicode value (its code point)? As things stand, some languages have characters that require multiple bytes to represent. Wouldn't it be easier to store them just as the binary of their code points?

+1  A: 

Yes we can, and that is UTF-32.

The problem is that UTF-32 wastes a lot of space. If the text is mostly European / Hebrew / Arabic, UTF-8 takes only 1 to 2 bytes per code point, but UTF-32 takes 4 bytes per code point.

If we stored the integer value at a variable size, e.g. 1 byte for 0 ~ 255, 2 bytes for 256 ~ 65535, etc., we would have an ambiguity problem: should the bytes 5a 5a represent "ZZ" or "婚" (U+5A5A)? The solution is basically what we call UTF-8: some special bits in each byte indicate the length of the byte sequence, so there is only one way to decode it.
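
Here is a minimal Python sketch of that ambiguity and of how UTF-8's marker bits remove it (Python chosen just for illustration):

    # "ZZ" is two code points U+005A U+005A; "婚" is the single code point U+5A5A.
    # A naive variable-width dump of raw code point values gives identical bytes:
    print(bytes([0x5A, 0x5A]))              # b'ZZ'  -- two 1-byte values?
    print((0x5A5A).to_bytes(2, "big"))      # b'ZZ'  -- or one 2-byte value? Ambiguous.

    # UTF-8 marks the length of each sequence in the high bits of its bytes,
    # so the two strings encode to different, unambiguous byte sequences:
    print("ZZ".encode("utf-8"))             # b'ZZ'            (two 1-byte sequences)
    print("\u5a5a".encode("utf-8"))         # b'\xe5\xa9\x9a'  (one 3-byte sequence)
    print(b"\xe5\xa9\x9a".decode("utf-8"))  # 婚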

KennyTM
Thanks. But one thing I can't understand is what is so special about some characters that they fit in only 1 or 2 bytes, while others require more bytes?
Daud
@Daud: Those characters are used more frequently.
KennyTM
Thanks. But what I meant was: if some characters fit into 1 or 2 bytes, what prevents other characters from fitting into 1 or 2 bytes?
Daud
@Daud: 1 byte can represent at most 256 different values.
KennyTM
Sorry, I think I am not stating my question properly. If one character can be represented by 1 byte (like the Hebrew/Arabic characters, as you said), why can't characters in other languages be represented in 1 byte? Why do they require more bytes? Thanks
Daud
@Daud: You could use a language-specific encoding, e.g. ISO-8859-8, to put everything into 1 byte. The problem is (1) you can't use more than 1 language in a file; (2) some languages (CJK) have >256 characters.
KennyTM
Thanks. I think I finally got it. Some languages require more than 1 byte because their character set has more than 256 characters. Thanks for persevering.
Daud
@Daud: No. You still think of Unicode as some sort of language-dependent "codepage", which is not correct. Unicode can represent almost every character on this planet. There are Unicode _encodings_ in which some languages need less storage space on disk than others. If you only work with English text, UTF-8 is probably the best encoding, because most characters need only 1 byte. In a multi-language environment, UTF-16 might be a bit more efficient. For Chinese text, GB18030, which is comparable to UTF-8 but optimized towards Chinese characters, might be another option.
soc
Thanks. Can you give an example in which a character in one language requires more space than a character in another language (or a link where this is concisely explained)? I just can't understand how, within the same encoding scheme, one character can be represented in less space than another.
Daud
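
A quick Python illustration of soc's point, with arbitrarily chosen example characters: the same character costs a different number of bytes depending on the encoding.

    for ch in ["a", "é", "ש", "中"]:
        print(f"U+{ord(ch):04X} {ch!r}: "
              f"UTF-8={len(ch.encode('utf-8'))}, "
              f"UTF-16={len(ch.encode('utf-16-be'))}, "
              f"UTF-32={len(ch.encode('utf-32-be'))} bytes")
    # U+0061 'a': UTF-8=1, UTF-16=2, UTF-32=4 bytes
    # U+00E9 'é': UTF-8=2, UTF-16=2, UTF-32=4 bytes
    # U+05E9 'ש': UTF-8=2, UTF-16=2, UTF-32=4 bytes
    # U+4E2D '中': UTF-8=3, UTF-16=2, UTF-32=4 bytes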
A: 

Firstly, there is a way to store them as raw code points: UTF-32, also known as UCS-4. Each character always takes four bytes, storing each code point unmodified.

However, the reasons for using others such as UTF-8 include:

  • ASCII compatibility: files that only contain U+0000 - U+007F don't need to change at all
  • size efficiency: UTF-8 usually ends up in much smaller files (both points are illustrated in the sketch below)
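
A quick Python check of both points (the string is chosen arbitrarily):

    text = "hello"

    # ASCII compatibility: pure-ASCII text encodes to the very same bytes in UTF-8,
    # so existing ASCII files are already valid UTF-8.
    assert text.encode("utf-8") == text.encode("ascii")   # b'hello' either way

    # Size efficiency: UTF-32 always spends 4 bytes per code point.
    print(len(text.encode("utf-8")))      # 5 bytes
    print(len(text.encode("utf-32-be")))  # 20 bytes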
Delan Azabani
Thanks. But one thing I can't understand is what is so special about some characters that they fit in only 1 or 2 bytes, while others require more bytes?
Daud
A: 

How exactly would you save those code points? Some code points fit into one byte, some need 3 bytes. Would you use 4 bytes for each code point? When you look at a byte stream, how do you know where one code point ends and the next one starts? UTF-8 (and other encodings) give you an answer to that.
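
A rough Python sketch (hand-rolled and incomplete, just to illustrate the idea) of how a UTF-8 decoder can tell where each code point starts and ends from the first byte of each sequence alone:

    def utf8_lengths(data: bytes):
        """Return the length in bytes of each UTF-8 sequence in `data`."""
        lengths, i = [], 0
        while i < len(data):
            b = data[i]
            if b < 0x80:            # 0xxxxxxx -> 1-byte sequence (ASCII)
                n = 1
            elif b >> 5 == 0b110:   # 110xxxxx -> 2-byte sequence
                n = 2
            elif b >> 4 == 0b1110:  # 1110xxxx -> 3-byte sequence
                n = 3
            else:                   # 11110xxx -> 4-byte sequence (validation omitted)
                n = 4
            lengths.append(n)
            i += n
        return lengths

    print(utf8_lengths("aש中".encode("utf-8")))   # [1, 2, 3]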

Peter Štibraný