views:

71

answers:

4

I'm really trying to get better with this stuff. I'm pretty functional with internationalization concepts like this, but I need to get a better background on the theory behind it.

I've read Spolsky's article, but I'm still unclear because these three terms get used interchangeably a LOT -- even in that article. I think at least two of them are talking about the same thing.

I suspect a high percentage of developers flub their way through this stuff on a daily basis. I don't want to be one of those developers anymore.

+1  A: 

A Character Set is just that, a set of characters that can be used.
Each of these characters is mapped to an integer called code point.
How these code points are represented in memory is the encoding. An encoding is just a method to transform a code-point (U+0041 - Unicode code-point for the character 'A') into raw data (bits and bytes).

Zaki
A: 

The chapter on Unicode in this book, Advanced Perl Programming contains the best description of encoding, character sets and the other entities of unicode that I've come across. Unfortunately I don't think its available for free on line.

ennuikiller
I have a Safari subscription. Just downloaded the chapter, thanks.
Deane
+4  A: 

A ‘character set’ is just what it says: a properly-specified list of distinct characters.

An ‘encoding’ is a mapping between a character set (typically Unicode today) and a (usually byte-based) technical representation of the characters.

UTF-8 is an encoding, but not a character set. It is an encoding of the Unicode character set(*).

The confusion comes about because most other well-known encodings (eg.: ISO-8859-1) started out as separate character sets. Then when Unicode came along as a superset of most of these character sets, it became possible to think of them as different (but partial) encodings of the same (Unicode) character set, rather than just isolated character sets. Looking at them this way allows you to convert between them through Unicode easily, which would not be possible if they were merely isolated character sets. But it still makes sense to refer to them as character sets, so either term could be used.

A ‘code page’ is a term stemming from IBM, where it chose which set of symbols would be displayed. The term continued to be used by DOS and then Windows, through to Unicode-aware Windows where it just acts as an encoding with a numbered identifier. Whilst a numbered ‘code page’ is an idea not inherently limited to Microsoft, today the term would almost always just mean an encoding that Windows knows about.

When one is talking of code page ‹some number› one is typically talking about a Windows-specific encoding, as distinct from an encoding devised by a standards body. For example code page 28591 would not normally be referred to under that name, but simply ‘ISO-8859-1’. The Windows-specific Western European encoding based on ISO-8859-1 (with a few extra characters replacing some of its control codes) would normally be referred to as ‘code page 1252’.

[*: All the UTFs are encodings not character sets, but this kind of thing isn't exclusive to Unicode. For example the Japanese standard JIS X 0208 defines a character set and two different byte encodings for it: the somewhat unpleasant high-byte-based encoding (‘Shift-JIS’), and the deeply horrific escape-switching-based encoding (‘JIS’).]

bobince
+2  A: 

I thought Joel's article was pretty much spot on - it is the history behind the evolution of character sets and storage which has brought this about.

FWIW, in my oversimplistic view

  • Character Sets (ASCII, EBCDIC, UNICODE) would be the numeric representation of characters, independent of storage considerations
  • Encoding would relate to the efficient storage of characters, ANSI, UTF-7, UTF-8 etc, for file, across the wire etc
  • Code Page would be the 'kluge' needed when the demand for the addition of new characters (without wanting to increase storage capacity) meant that (certain) characters were only knowable in the additional context of a code page.

IMHO Wikipedia currently doesn't help things by defining code page as 'another name for character encoding' and redirecting 'character set' to 'character encoding'

nonnb