views: 376
answers: 7

I am confused about text encodings and charsets. For many reasons, I have to learn non-Unicode, non-UTF-8 stuff in my upcoming work.

I see the word "charset" in email headers, as in "ISO-2022-JP", but there's no such encoding listed in text editors. (I looked through several different text editors.)

What's the difference between text encoding and charset? I'd appreciate it if you could show me some use case examples.

+2  A: 

Every encoding has a particular charset associated with it, but there can be more than one encoding for a given charset. A charset is simply what it sounds like: a set of characters. There are a large number of charsets, including many that are intended for particular scripts or languages.

We are well along in the transition to Unicode, which includes a character set capable of representing almost all the world's scripts. However, there are multiple encodings for Unicode. An encoding is a way of mapping a string of characters to a string of bytes. Examples of Unicode encodings include UTF-8, UTF-16 BE, and UTF-16 LE. Each of these has advantages for particular applications or machine architectures.
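
To make that concrete, here is a minimal Python sketch (the string "café" and the codec names are just arbitrary examples) of one Unicode string under three of those encodings:

    # One character set (Unicode), three different encodings of the same text.
    text = "café"
    for name in ("utf-8", "utf-16-be", "utf-16-le"):
        data = text.encode(name)   # characters -> bytes under this encoding
        print(name, data.hex())    # the byte sequences differ per encoding

    # Decoding with the matching encoding recovers the identical string.
    assert all(text.encode(n).decode(n) == text for n in ("utf-8", "utf-16-be", "utf-16-le"))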

Matthew Flaschen
+8  A: 

Basically:

  1. a charset is the set of characters you can use
  2. an encoding is the way those characters are stored in memory or on disk as bytes (see the sketch below)
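
As a rough illustration (a minimal Python sketch; the Japanese sample text and the codec names are only examples), the same characters end up as different byte sequences depending on the encoding, and a character outside a legacy charset can't be stored at all:

    # The same characters (one repertoire) stored as different byte sequences
    # under different encodings; the sample text is arbitrary.
    s = "日本語"
    print(s.encode("iso-2022-jp"))   # the escape-sequence encoding from the question
    print(s.encode("shift_jis"))     # another legacy Japanese encoding
    print(s.encode("utf-8"))         # a Unicode encoding

    # A character that is not in the legacy charset cannot be encoded at all.
    try:
        "€".encode("shift_jis")
    except UnicodeEncodeError as err:
        print("not in this charset:", err)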
Svetlozar Angelov
True, but in actual use "charset" usually refers to *both* the character repertoire and the encoding scheme.
Alan Moore
+2  A: 

A character set, or character repertoire, is simply a set (an unordered collection) of characters. A coded character set assigns an integer (a "code point") to each character in the repertoire. An encoding is a way of representing code points unambiguously as a stream of bytes.
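
For example (a minimal Python sketch; the euro sign is just an arbitrary character from the repertoire):

    # character -> code point (coded character set) -> bytes (encoding)
    ch = "€"                             # a character from the repertoire
    print(hex(ord(ch)))                  # its code point: 0x20ac
    print(ch.encode("utf-8").hex())      # 'e282ac' -- three bytes in UTF-8
    print(ch.encode("utf-16-be").hex())  # '20ac'   -- two bytes in UTF-16 BE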

Jonathan Feinberg
+2  A: 

A charset is just a set; it either contains, e.g. the Euro sign, or else it doesn't. That's all.

An encoding is a bijective mapping from a character set to a set of integers. If it supports the Euro sign, it must assign a specific integer to that character and to no other.

Kilian Foth
Does it have to be bijective?
Jörg W Mittag
Well, encoding and decoding should be deterministic, so there really can't be any ambiguous mappings. I suppose you could have a non-contiguous set of integers as the codomain, but that would waste space when you store text, and engineers hate wasted space.
Kilian Foth
Legacy character encodings are often not bijective. For example, in IBM437, both ß and β are represented by 0xE1.
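
One side of that overlap is easy to check with Python's built-in cp437 codec (note that the codec decodes 0xE1 only to ß, so the β half does not round-trip here):

    # In IBM437 (cp437), byte 0xE1 decodes to ß; Python's codec treats ß as the
    # canonical decoding, so the β side of the overlap is not recoverable here.
    print(bytes([0xE1]).decode("cp437"))   # ß
    print("ß".encode("cp437"))             # b'\xe1'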
dan04
+3  A: 

In addition to the other answers, I think this article is a good read: http://www.joelonsoftware.com/articles/Unicode.html

mattanja
Thanks a lot for introducing the article. It *is* a good one.
TK
+1  A: 

I Googled it: http://en.wikipedia.org/wiki/Character_encoding

The difference seems to be subtle. The term "charset" doesn't really apply to Unicode. Unicode goes through a series of abstractions: abstract characters -> code points -> encoding of code points to bytes.

Charsets skip this and jump directly from characters to bytes: sequence of characters <-> sequence of bytes.

In short: encoding: code points -> bytes; charset: characters -> bytes.
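
A minimal Python sketch of that contrast (the character "é" and the codecs shown are only illustrative):

    # Unicode path: character -> code point -> bytes (encoding chosen separately)
    print(hex(ord("é")))                  # code point 0xe9
    print("é".encode("utf-8").hex())      # 'c3a9'
    print("é".encode("utf-16-be").hex())  # '00e9'

    # Legacy charset path: ISO-8859-1 maps the character straight to one byte
    print("é".encode("latin-1").hex())    # 'e9' -- the byte value happens to equal the code point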

Fakrudeen
+2  A: 

A character encoding consists of:

  1. The set of supported characters
  2. A mapping between characters and integers ("code points")
  3. How code points are encoded as a series of "code units" (e.g., 16-bit units for UTF-16)
  4. How code units are encoded into bytes (e.g., big-endian or little-endian)

Step #1 by itself is a "character repertoire" or abstract "character set", and #1 + #2 = a "coded character set".

But back before Unicode became popular, when everyone (except East Asians) was using a single-byte encoding, steps #3 and #4 were trivial (code point = code unit = byte). Thus, older protocols didn't clearly distinguish between "character encoding" and "coded character set", and they use "charset" when they really mean encoding.
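
Here is a minimal Python sketch of steps #3 and #4 (the character "A" is arbitrary): the same code point becomes one 16-bit code unit, which can be laid out in either byte order, while a single-byte encoding collapses the whole chain:

    ch = "A"                             # step #2: character -> code point U+0041
    print(ch.encode("utf-16-be").hex())  # '0041' -- one 16-bit code unit, big-endian bytes
    print(ch.encode("utf-16-le").hex())  # '4100' -- the same code unit, little-endian bytes
    print(ch.encode("latin-1").hex())    # '41'   -- single-byte encoding: code point = code unit = byte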

dan04