This is the most confusing issue to me.

How is the beginning of a new character recognized?

How are the codepoints allocated?

Let's take Chinese characters as an example.

What range of codepoints is allocated to them, and why is it allocated that way? Is there any reason?

EDIT: Please describe it in your own words, not by citation.

Or could you recommend a book that discusses Unicode systematically and that you think makes it clear? (That's the most important part.)

+4  A: 

The Unicode Consortium is responsible for codepoint allocation. If you want a new character or a block of characters allocated, you can apply there. See the proposal pipeline for examples.

Aaron Digulla
+2  A: 

Take a look here for a general overview of Unicode that might be helpful: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses)

Donut
While generally a good resource for Unicode, it's not relevant to this question.
Joachim Sauer
+3  A: 

Chapter 2 of the Unicode specification defines the general structure of Unicode, including which ranges are allocated to which kinds of characters.

Martin v. Löwis
I encourage you to read the Unicode 5 standard. It is one of the best-written standards I've ever read. The opening chapters give a very readable introduction to every aspect of Unicode and character set issues in general. And it is freely available online as a PDF!
MtnViewMark
+1  A: 

It is better to say "character encoding" instead of "codepage".

A character encoding is a way to map characters to some data (and also vice versa!).

As Wikipedia says:

A character encoding system consists of a code that pairs each character from a given repertoire with something else, such as a sequence of natural numbers, octets or electrical pulses, in order to facilitate the transmission of data (generally numbers and/or text) through telecommunication networks or storage of text in computers
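For instance, Python 3's str.encode and bytes.decode realize exactly this mapping and its inverse (a minimal sketch):

    # Encode: character -> bytes; decode: bytes -> character (the "vice versa").
    data = "中".encode("utf-8")    # b'\xe4\xb8\xad'
    print(data)
    print(data.decode("utf-8"))    # back to '中'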

The most popular character encodings are ASCII, UTF-16, and UTF-8.

ASCII

ASCII was the first character encoding widely used in computers. In ASCII just one byte (in fact only 7 bits of it) is allocated per character, so ASCII has a very limited set of characters (English letters, digits, ...).

ASCII was used widely in old operating systems like MS-DOS, but it is not dead and is still used today. When you have a text file with 10 characters and it is 10 bytes, you have an ASCII file!
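A quick way to check that one-character-one-byte property, sketched in Python:

    # 10 ASCII characters encode to exactly 10 bytes.
    text = "0123456789"
    print(len(text.encode("ascii")))   # 10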


UTF-16

In UTF-16, two bytes are allocated per code unit, so a single unit can represent 65,536 different characters. (Characters outside that range take two units; see the surrogate discussion in the comments.)

Microsoft Windows uses UTF-16 internally.
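A small Python sketch of those sizes (utf-16-le is used here because the plain utf-16 codec prepends a 2-byte byte-order mark):

    # BMP characters take one 16-bit unit (2 bytes) in UTF-16;
    # characters outside the BMP take two units (4 bytes).
    print(len("A".encode("utf-16-le")))    # 2
    print(len("中".encode("utf-16-le")))   # 2
    print(len("😀".encode("utf-16-le")))   # 4  (surrogate pair)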


UTF-8

UTF-8 is another popular way of encoding characters. It uses a variable number of bytes (1 to 4) per character. It is also backward compatible with ASCII because it uses 1 byte for ASCII characters.

Most Unix-based systems use UTF-8.
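The variable length is easy to observe in Python:

    # UTF-8 uses 1 to 4 bytes per character,
    # and plain ASCII characters stay at 1 byte.
    for ch in ("A", "é", "中", "😀"):
        print(ch, len(ch.encode("utf-8")))   # 1, 2, 3, 4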


Programming languages do not depend on code-pages, though a specific implementation of a programming language may not support some encodings (like Turbo C++).

You can use any encoding in modern programming languages. They also come with tools for converting between encodings.
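For example, in Python a conversion is just a decode followed by an encode (a sketch using the built-in codecs):

    # Converting between encodings: decode from the old one, encode to the new one.
    latin1_bytes = "café".encode("latin-1")                 # b'caf\xe9'
    utf8_bytes = latin1_bytes.decode("latin-1").encode("utf-8")
    print(utf8_bytes)                                       # b'caf\xc3\xa9'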

There are different Unicode encodings like UTF-7, UTF-8, ... You can read about them here (recommended!) and, for more formal details, here.

Isaac
UTF-16 has a set of surrogates, which are basically 2 16-bit numbers in a row, which are used to represent characters outside the Basic Multilingual Plane (BMP) - where the BMP is the characters that can be represented by 16-bit values. Unicode is a 21-bit system.
Jonathan Leffler
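To make the surrogate mechanism concrete, here is the UTF-16 splitting arithmetic from the standard, sketched in Python (U+1F600 is just an arbitrary code point outside the BMP):

    # Split a code point above U+FFFF into a UTF-16 surrogate pair.
    cp = 0x1F600
    v = cp - 0x10000               # leaves a 20-bit value
    high = 0xD800 + (v >> 10)      # high (lead) surrogate
    low = 0xDC00 + (v & 0x3FF)     # low (trail) surrogate
    print(hex(high), hex(low))     # 0xd83d 0xde00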
Also, Unicode 16 is not a standard term. UCS-2 is an old term dating back to when the BMP was all there was to Unicode; UTF-16 is used these days (AFAIK, even by Microsoft).
Jonathan Leffler
@Jonathan: Yes, you were right. UTF-16 is the proper term! ;)
Isaac
@Jonathan - +1 for the BMP story and +1 for the UTF-16 term
Isaac
+1  A: 

Unicode is a standard specified by the Unicode Consortium. The specification defines Unicode’s character set, the Universal Character Set (UCS), and some encodings for those characters, the Unicode Transformation Formats UTF-7, UTF-8, UTF-16 and UTF-32.

How is the beginning of a new character recognized?

It depends on the encoding that is used. UTF-16 and UTF-32 are encodings with fixed code word lengths (16 and 32 bits respectively), while UTF-7 and UTF-8 have variable code word lengths (from 8 bits up to 32 bits) depending on the code point that is to be encoded.
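In UTF-8, for example, the start of each character can be recognized from the byte itself: continuation bytes always match the bit pattern 10xxxxxx, so any other byte begins a new character. A small Python sketch:

    # Bytes of the form 10xxxxxx are continuations;
    # anything else starts a new character.
    data = "héllo 中".encode("utf-8")
    starts = [i for i, b in enumerate(data) if (b & 0xC0) != 0x80]
    print(starts)   # byte offsets where a new character begins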

How are the codepoints allocated? Let's take Chinese characters as an example. What range of codepoints is allocated to them, and why is it allocated that way? Is there any reason?

The UCS is divided into named blocks (which in turn sit inside so-called planes). The first block is Basic Latin (U+0000–U+007F, encoded like ASCII), the second is Latin-1 Supplement (U+0080–U+00FF, encoded like ISO 8859-1), and so on.
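Regarding the Chinese characters from the question: most of them live in the CJK Unified Ideographs block (U+4E00–U+9FFF), with rarer ones in extension blocks elsewhere. You can verify this in Python:

    # Most Chinese characters fall in the CJK Unified Ideographs
    # block, U+4E00 to U+9FFF:
    for ch in "中文汉字":
        print(ch, hex(ord(ch)))   # 0x4e2d, 0x6587, 0x6c49, 0x5b57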

Gumbo