views:

102

answers:

5

What is the difference between charsets and character encoding? When I say I am using UTF-8 encoding, then what will be my charset? Does it take Unicode as the charset by default?

A: 

Charset is a synonym for character encoding.

Default encoding depends on the operating system and locale.

EDIT http://www.w3.org/TR/REC-xml/#sec-TextDecl

http://www.w3.org/TR/REC-xml/#NT-EncodingDecl

saugata
Then why do we have two attributes in XML: charset and encoding?
Neeraj
It does not ... edited post
saugata
A: 

A character set defines the mapping between numbers and characters. Almost all character sets say 65 is A, and agree in general about the mappings of numbers up to 127. But they often diverge when it comes to numbers above 127.
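A quick Python sketch of that point (not part of the original answer): ASCII-compatible charsets agree on the low byte range but assign different characters to the same high byte.

```python
# Below 128, ASCII-compatible character sets agree: byte 0x41 is "A".
low = bytes([0x41])
assert low.decode("ascii") == "A"
assert low.decode("latin-1") == "A"
assert low.decode("cp1252") == "A"

# Above 127 they diverge: the same byte 0x80 is the Euro sign in
# Windows-1252 but a C1 control character in ISO-8859-1 (Latin-1).
high = bytes([0x80])
print(repr(high.decode("cp1252")))   # '€'
print(repr(high.decode("latin-1")))  # '\x80'
```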

There are a lot of character sets

  • EBCDIC
  • Double Byte Character Set
  • ANSI
  • Different OEM char sets
  • Unicode, an effort to create a single character set that included every reasonable writing system on the planet and some make-believe ones like Klingon, too.

When you say character encoding, you're talking about how a Unicode code point (a character) is stored internally.

  • In UTF-8 encoding, every code point from 0 to 127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, or 4 bytes (the original design allowed up to 6, but UTF-8 is now limited to 4).
  • There's something called UTF-7, which is a lot like UTF-8 but guarantees that the high bit will always be zero.
  • There are hundreds of traditional encodings which can only store some code points correctly and change all the other code points into question marks. Some popular encodings of English text are Windows-1252 (the Windows 9x standard for Western European languages) and ISO-8859-1, aka Latin-1 (also useful for any Western European language).
  • UTF-7, UTF-8, UTF-16, and UTF-32 all have the nice property of being able to store any code point correctly.
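The variable-width behavior described above is easy to see in Python (an illustrative sketch, not from the original answer):

```python
# UTF-8 stores low code points in one byte and higher ones in more bytes.
for ch in ["A", "é", "€", "𝄞"]:
    raw = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(raw)} byte(s): {raw.hex(' ')}")

# A traditional single-byte encoding can't represent most code points;
# with errors="replace" they become question marks, as described above.
print("€".encode("ascii", errors="replace"))  # b'?'
```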

This post is almost entirely based on Joel Spolsky's post on Unicode: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets. Read it to get a better idea.

Amarghosh
A: 

Character set: defines which character has which numeric code point (ASCII, JIS, Unicode).

Encoding: defines how the numeric code point is physically represented (UTF, UCS, Shift-JIS).
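To make that distinction concrete (a Python sketch, not part of the original answer): one code point from the character set can have several physical representations, depending on the encoding.

```python
# One abstract character, one code point, several byte representations.
ch = "あ"  # HIRAGANA LETTER A, code point U+3042
print(f"U+{ord(ch):04X}")
print("utf-8:    ", ch.encode("utf-8").hex(" "))      # e3 81 82
print("utf-16-be:", ch.encode("utf-16-be").hex(" "))  # 30 42
print("shift_jis:", ch.encode("shift_jis").hex(" "))  # 82 a0
```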

devio
+1  A: 

UTF-8 is an encoding of the Unicode character set. Therefore if you're using UTF-8, the character set is Unicode, and you're not likely to have to specify it separately anywhere. The other main encoding of Unicode is UTF-16, which is rarely used in 8-bit byte streams because it contains zero bytes. If you are dealing with Unicode in a byte sequence, it is almost certainly encoded as UTF-8.
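The zero-byte point is easy to verify (an illustrative Python sketch, not from the original answer):

```python
# UTF-16 interleaves zero bytes into plain ASCII text; UTF-8 does not,
# which is one reason UTF-8 dominates byte-oriented protocols.
text = "hello"
print("utf-8:    ", text.encode("utf-8").hex(" "))
print("utf-16-le:", text.encode("utf-16-le").hex(" "))
assert 0 in text.encode("utf-16-le")
assert 0 not in text.encode("utf-8")
```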

Other than Unicode, character sets are usually considered to have a single fixed encoding, and then terms like character set, charset, codepage, and encoding are often used interchangeably, with usage varying by vendor. This is sloppy but usually creates no runtime problems.

The only possible exceptions I can think of are East Asian: JIS and EUC originally defined multiple encodings for the same character set, but in practice today, each encoding is just treated separately.

Joseph Boyle
There are more exceptions than that: IBM037 and IBM500 have exactly the same character repertoire as ISO-8859-1, in a completely different order.
dan04
Sorry for missing that; EBCDIC is a whole different universe I rarely think about. At least if you confuse an EBCDIC encoding with an ASCII-based one, you'll get garbage even with English-language text, and have to fix it immediately instead of leaving it as a time bomb for the foreign users and the i18n devs.
Joseph Boyle
A: 

According to Unicode terminology:

  • ACR: Abstract Character Repertoire = the set of characters to be encoded, for example, some alphabet or symbol set
  • CCS: Coded Character Set = a mapping from an abstract character repertoire to a set of nonnegative integers
  • CEF: Character Encoding Form = a mapping from a set of nonnegative integers that are elements of a CCS to a set of sequences of particular code units of some specified width, such as 32-bit integers
  • CES: Character Encoding Scheme = a reversible transformation from a set of sequences of code units (from one or more CEFs) to a serialized sequence of bytes
  • CM: Character Map = a mapping from sequences of members of an abstract character repertoire to serialized sequences of bytes bridging all four levels in a single operation
  • TES: Transfer Encoding Syntax = a reversible transform of encoded data, which may or may not contain textual data
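The first few layers can be traced in Python (a rough sketch of the model above, not from the original answer; the mapping of Python operations onto the Unicode terms is my own):

```python
char = "é"                    # ACR: an abstract character from the repertoire
cp = ord(char)                # CCS: character -> nonnegative integer (code point)
print(f"U+{cp:04X}")          # U+00E9

# CEF: code point -> code units of a fixed width. In UTF-16 the code
# unit is 16 bits; "é" fits in one unit, "𝄞" needs two (a surrogate pair).
units = len("𝄞".encode("utf-16-be")) // 2
print(units)                  # 2

# CES: code units -> a serialized byte sequence, including byte order.
print(char.encode("utf-16-le").hex())  # e9 00
print(char.encode("utf-16-be").hex())  # 00 e9

# CM: str.encode bridges all the levels in a single operation.
assert char.encode("utf-8") == b"\xc3\xa9"
```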

Older protocols like MIME use "charset" when they really mean "character encoding scheme". Originally, different character encodings were thought of as independent character repertoires instead of subsets of Unicode.

dan04