Emacs 23 uses a character set four times larger than Unicode - why?

+3 Q:

From Emacs 23.1 NEWS:

*** The Emacs character set is now a superset of Unicode. (It has about four times the code space, which should be plenty).

And more details later on:

*** In multibyte buffers and strings, characters are represented by UTF-8 byte sequences. The character code space is now 0x0..0x3FFFFF with no gap; code points 0x0..0x10FFFF are Unicode characters of the same code points, while code points 0x3FFF80..0x3FFFFF are raw 8-bit bytes.

According to Wikipedia, the BMP of the UCS has 65,536 code points, the latest version of Unicode contains more than 107,000 characters, and the UCS has more than one million code points. 0x3FFFFF is more than four million.
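To make the sizes concrete, here is a small sketch of the arithmetic, using only the ranges quoted from the NEWS entry above. The variable names are made up for illustration.

```python
# Code space boundaries as quoted from the Emacs 23.1 NEWS entry.
EMACS_MAX = 0x3FFFFF       # highest Emacs character code
UNICODE_MAX = 0x10FFFF     # highest Unicode code point
RAW_BYTE_FIRST = 0x3FFF80  # first code in the raw 8-bit byte range

# Both ranges start at 0, so the size is the maximum plus one.
emacs_size = EMACS_MAX + 1
unicode_size = UNICODE_MAX + 1

# Number of codes set aside for raw 8-bit bytes.
raw_byte_count = EMACS_MAX - RAW_BYTE_FIRST + 1

# How many times larger the Emacs code space is.
ratio = emacs_size / unicode_size

print(f"Emacs code space:   {emacs_size}")
print(f"Unicode code space: {unicode_size}")
print(f"Raw byte codes:     {raw_byte_count}")
print(f"Ratio:              {ratio:.2f}")
```

So "about four times" is a rounding of roughly 3.76, and the raw-byte range holds 128 codes.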

What problems does this solve, or how else is it beneficial to have an internal character set that is a superset of Unicode?

+18 A:

Unicode is designed to encompass the required character sets for all human languages, which is certainly useful for globalisation/localisation of your code. But because Emacs is the tool of the gods themselves, it also has to encompass every character that may be used by deities of all kinds (including but not limited to the eldritch runes of the Great Old Ones), spacefaring races (including but not limited to our future alien overlords), ultra-intelligent machine intelligences (including but not limited to our future robot masters), and every other being that desires infinite cosmic power. That is potentially a whole lot of characters!

Or it could be that the UTF-8 encoding scheme has room for far more code points than Unicode actually uses, and Emacs simply makes use of more of that room, but I prefer my explanation above.
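As a rough illustration of how much room the scheme has: the original UTF-8 design, before it was restricted to U+10FFFF, allowed sequences of up to six bytes. The sketch below assumes that generalized scheme and only counts bytes; `utf8_length_generalized` is a hypothetical helper, and nothing here claims to be how Emacs itself stores codes above 0x10FFFF.

```python
def utf8_length_generalized(code_point: int) -> int:
    """Return the number of bytes needed for code_point under the
    original UTF-8 scheme (sequences of one to six bytes).

    Assumption: this is the pre-restriction design, not the modern
    standard, which stops at four bytes and U+10FFFF.
    """
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    # Largest value that fits in a sequence of 1, 2, ... 6 bytes.
    limits = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]
    for length, limit in enumerate(limits, start=1):
        if code_point <= limit:
            return length
    raise ValueError("code point does not fit in six bytes")


print(utf8_length_generalized(0x10FFFF))  # top of Unicode
print(utf8_length_generalized(0x3FFFFF))  # top of the Emacs code space
```

Under that assumption, the top of Unicode fits in four bytes, while the top of the Emacs code space needs a fifth.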

glenatron 2009-11-04 15:26:10

UTF-8 is an encoding of the Unicode character set. Neither is a sub/superset of the other.

jamessan 2009-11-04 15:28:13

Please note that Unicode is a charset and UTF-8 is a byte-encoding of the Unicode charset (i.e., UTF-8 is a way to represent any sequence of "abstract characters" in the Unicode charset as a sequence of bytes).

Justice 2009-11-04 15:29:17

Edited to cohere more closely with the comments above. Thanks for the explanation.

glenatron 2009-11-04 15:54:48
