views: 355
answers: 5

I don't think I fully understand character sets, so I was wondering if anyone would be kind enough to explain them in layman's terms with examples (for Dummies). I know there is UTF-8, Latin-1 (ISO 8859-1), ASCII, etc.

The more answers the better really.

Thank you in advance;-)

+19  A: 

Try this: http://www.joelonsoftware.com/articles/Unicode.html

Aaron
Golden link! Must be put in the FAQ :)
Andrey
+4  A: 

People read characters, or glyphs. Computers use numbers exclusively. So, people decided to encode each character as a number so that computers could manipulate them. However, different people decided to use different numbering systems.

Most programming platforms support multiple "character encodings", which map each logical glyph, like an uppercase Roman letter 'A', to a series of one or more bytes. When one computer sends text to another, it is vital to know the character encoding that was used. This might be inferred from some rules of the protocol used, or the protocol might specify it explicitly, by name.
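
To illustrate, here is a minimal Java sketch (the class name and charset choices are mine) that prints the byte sequences two different encodings use for the same logical glyph 'A':

import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    public static void main(String[] args) {
        String a = "A"; // the logical glyph LATIN CAPITAL LETTER A
        for (byte b : a.getBytes(StandardCharsets.UTF_8))    // prints: 41
            System.out.printf("%02X ", b);
        System.out.println();
        for (byte b : a.getBytes(StandardCharsets.UTF_16BE)) // prints: 00 41
            System.out.printf("%02X ", b);
        System.out.println();
    }
}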

erickson
+2  A: 

This article is more targeted on Java (EE web) developers: Unicode - How to get characters right?

It explains the stuff in layman's terms and contains practical Java examples and solutions as well.

BalusC
+1  A: 

Character sets are simply names/aliases of encoding schemes used for representing symbols as a sequence of bytes.

ASCII is not universal. It only uses 7 bits per symbol, so it can't cover all language symbols.

Unicode is a universal standard that assigns each symbol in the world a unique numeric identifier, in the hexadecimal range 0 to 10FFFF. These unique codes are called code points. The code points are not used directly for encoding, but they can be transformed into bytes using any of Unicode's standard encoding schemes: UTF-8, UTF-16, UTF-32, etc.

With UTF-8, for example, every code point is represented as 1, 2, 3, or 4 bytes. ASCII characters keep their one-byte ASCII representation in UTF-8, so the two encoding schemes are equivalent for that subset of symbols.
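
To make the 1-to-4-byte rule concrete, here is a small Java sketch (the sample characters are arbitrary picks, one from each length class):

import java.nio.charset.StandardCharsets;

public class Utf8Lengths {
    public static void main(String[] args) {
        // Each sample is a single code point; its UTF-8 representation grows from 1 to 4 bytes.
        String[] samples = { "A", "\u00E9", "\u20AC", "\uD835\uDD0A" }; // U+0041, U+00E9, U+20AC, U+1D50A
        for (String s : samples) {
            int byteCount = s.getBytes(StandardCharsets.UTF_8).length;
            System.out.printf("U+%04X -> %d byte(s) in UTF-8%n", s.codePointAt(0), byteCount);
        }
    }
}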

In Java, strings are stored internally as UTF-16, but this is rarely something the programmer should be aware of. Once a string is constructed, it should be considered as "encoding-less". The situations where encoding is relevant are when converting byte arrays to strings, and vice versa:

byte[] ba = readSomething();          // raw bytes from some external source
String s = new String(ba, "UTF-8");   // decode: interpret the bytes as UTF-8 text
byte[] ba2 = s.getBytes("UTF-16");    // encode: turn the same text back into bytes, this time as UTF-16

--EDIT--

As pointed out in the comment, the internal Java representation is important when performing character-level manipulations on a string. Characters outside the Basic Multilingual Plane (for example some rare CJK ideographs and many symbols) are stored as two char values, a surrogate pair, so methods such as String.length() and String.charAt(i) count UTF-16 code units rather than characters and can return different results than expected. When dealing with such characters, it is important to use String.codePointCount(..) and String.codePointAt(..) instead.
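
For instance, a minimal sketch of the difference (the sample string is my own choice, pairing an ASCII letter with the supplementary character U+1D50A):

public class CodePointDemo {
    public static void main(String[] args) {
        // U+1D50A lies outside the BMP, so Java stores it as a surrogate pair (two chars).
        String s = "G\uD835\uDD0A";
        System.out.println(s.length());                            // 3 (UTF-16 code units)
        System.out.println(s.codePointCount(0, s.length()));       // 2 (actual code points)
        System.out.println(Integer.toHexString(s.codePointAt(1))); // 1d50a (the full code point)
        System.out.println(Integer.toHexString(s.charAt(1)));      // d835 (just the high surrogate)
    }
}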

Eyal Schneider
A Java developer should definitely be aware that the internal representation is UTF-16, or, more exactly, that what Java calls "characters" are actually BMP code points (including surrogate code points). Hence, String.length doesn't give the number of characters in a String, it gives the number of BMP code points, which can be higher than the number of characters if the string contains characters outside the BMP.
Artefacto
@Artefacto: I agree, so I added an explanation in the response. By the way, the length() method returns the number of UTF-16 code units, and not the number of code points (as returned by codePointCount(..))
Eyal Schneider
-1 Fundamental misconception: ASCII uses only 7 bits per character.
John Machin
@John Machin: thanks, fixed.
Eyal Schneider
+4  A: 

A character set is just that - a set of characters. Here's a set of three characters:

  • LATIN CAPITAL LETTER A
  • LATIN CAPITAL LETTER B
  • LATIN CAPITAL LETTER C

Unicode (the set of all* characters) calls each of these things a code point and assigns each one a number: U+0041, U+0042, U+0043. Go see the PDF charts for the assignments.

A character encoding maps these code points to the numerical byte sequences used in RAM or on disc. Anywhere characters are used, they need to be in an encoding of some form. The number of bytes used to encode each character varies (usually between 1 and 4). Different encodings use different sequences of bytes for their mappings. You can use a utility like this one to inspect the mappings.

The thing you see on the screen is a grapheme from a graphical font. It may be made up of more than one code point.
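
For example, a short Java sketch (my own example strings, using java.text.Normalizer) showing that the same grapheme 'é' can be one code point or two:

import java.text.Normalizer;

public class GraphemeDemo {
    public static void main(String[] args) {
        String composed = "\u00E9";    // é as a single code point (LATIN SMALL LETTER E WITH ACUTE)
        String decomposed = "e\u0301"; // é as 'e' followed by COMBINING ACUTE ACCENT
        System.out.println(composed + " " + decomposed);  // both render as é
        System.out.println(composed.equals(decomposed));  // false - different code point sequences
        // Normalizing to NFC composes 'e' + accent into the single code point form.
        System.out.println(composed.equals(Normalizer.normalize(decomposed, Normalizer.Form.NFC))); // true
    }
}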

Back in olden times, a character set and a character encoding were pretty much the same thing, and anyone who wanted their data to work on a computer in another country had major headaches. The Windows "ANSI" encoding 1252 uses a single byte for each character and can only support 256 values. The development of the Unicode standard separated the concepts of character set and encoding. Unicode is supported by multiple encodings (Unicode Transformation Formats) and has room for over a million characters.


Some examples of the byte representations of various characters in different encodings (where they're supported):

Grapheme: A
Code point: U+0041 LATIN CAPITAL LETTER A

ASCII                41
Windows-1252         41
ISO-8859-15          41
UTF-8                41
UTF-16BE             00 41

Grapheme: €
Code point: U+20AC EURO SIGN

ASCII                -
Windows-1252         80
ISO-8859-15          A4
UTF-8                E2 82 AC
UTF-16BE             20 AC

Grapheme: 𝔊
Code point: U+1D50A MATHEMATICAL FRAKTUR CAPITAL G

ASCII                -
Windows-1252         -
ISO-8859-15          -
UTF-8                F0 9D 94 8A
UTF-16BE             D8 35 DD 0A

Grapheme: é
Code points: U+0065 LATIN SMALL LETTER E + U+0301 COMBINING ACUTE ACCENT

ASCII                65 - (doesn't support the combining accent)
Windows-1252         65 - (doesn't support the combining accent)
ISO-8859-15          65 - (doesn't support the combining accent)
UTF-8                65 CC 81
UTF-16BE             00 65 03 01
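
The byte sequences above can be reproduced with a small Java sketch like this one (my own code; it assumes the JRE supplies the windows-1252 and ISO-8859-15 charsets, which are common but not guaranteed by the Java spec):

import java.nio.charset.Charset;

public class ByteDump {
    public static void main(String[] args) {
        String grapheme = "\u20AC"; // EURO SIGN; substitute any of the graphemes above
        for (String name : new String[] { "windows-1252", "ISO-8859-15", "UTF-8", "UTF-16BE" }) {
            StringBuilder hex = new StringBuilder();
            for (byte b : grapheme.getBytes(Charset.forName(name))) {
                hex.append(String.format("%02X ", b));
            }
            System.out.println(name + "\t" + hex.toString().trim());
        }
    }
}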

Most of the issues with character sets are when a programmer:

  • doesn't know when or how to transform from one set of mappings to another
  • chooses the wrong mapping
  • chooses a mapping that results in data loss
  • doesn't see that such transformations are being made by a library or tool

*OK, not all characters, but a lot.

You'll have to forgive any historical inaccuracies on my part - I know there are/were rival encodings to Unicode and I haven't done any research on who thought up what when. I recently wrote a post comparing character handling in different languages if you want to see some specifics.

McDowell