I'd start with this question: what is a character?
- The logical identity: a codepoint. Unicode assigns a number to each character that isn't necessarily related to any bit/byte form. Encodings (like UTF-8) define the mapping to byte values.
- The bits and bytes: the encoded form. One or more bytes per codepoint, values determined by the encoding used.
- The thing you see on the screen: a grapheme. A grapheme is built from one or more codepoints; this is the stuff at the presentation end of things.
This code transforms in.txt from windows-1252 to UTF-8 and saves it as out.txt.
using System;
using System.IO;
using System.Text;

public class Enc {
    public static void Main(string[] args) {
        Encoding win1252 = Encoding.GetEncoding(1252);
        Encoding utf8 = Encoding.UTF8;

        // The reader decodes windows-1252 bytes into UTF-16 chars;
        // the writer encodes those chars back out as UTF-8.
        using (StreamReader reader = new StreamReader("in.txt", win1252)) {
            using (StreamWriter writer = new StreamWriter("out.txt", false, utf8)) {
                char[] buffer = new char[1024];
                while (reader.Peek() >= 0) { // Peek() returns -1 at end of stream
                    int r = reader.Read(buffer, 0, buffer.Length);
                    writer.Write(buffer, 0, r);
                }
            }
        }
    }
}
Two transformations happen here. First, the bytes are decoded from windows-1252 to UTF-16 (in whatever byte order the platform uses, typically little-endian) into the char buffer. Then the contents of the buffer are encoded as UTF-8 and written out.
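For small files, the same round trip can be sketched more briefly (an illustrative snippet, not from the original code) with File.ReadAllText and File.WriteAllText; the intermediate string is the UTF-16 form:

using System.IO;
using System.Text;

public class EncShort {
    public static void Main(string[] args) {
        // Decode windows-1252 bytes into a UTF-16 string in memory...
        string text = File.ReadAllText("in.txt", Encoding.GetEncoding(1252));
        // ...then encode that string as UTF-8 on the way back out.
        File.WriteAllText("out.txt", text, Encoding.UTF8);
    }
}

Unlike the streaming version above, this loads the whole file into memory at once.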
Codepoints
Some example codepoints:
- U+0041 is LATIN CAPITAL LETTER A (A)
- U+00A3 is POUND SIGN (£)
- U+042F is CYRILLIC CAPITAL LETTER YA (Я)
- U+1D50A is MATHEMATICAL FRAKTUR CAPITAL G (𝔊)
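As a rough sketch of how those land in a C# string, char.ConvertFromUtf32 turns a codepoint number into its UTF-16 form: codepoints up to U+FFFF become a single char, while U+1D50A becomes a surrogate pair of two chars.

using System;

public class Codepoints {
    public static void Main(string[] args) {
        // Codepoints in the Basic Multilingual Plane fit in one 16-bit char...
        Console.WriteLine(char.ConvertFromUtf32(0x0041).Length);  // 1 ("A")
        Console.WriteLine(char.ConvertFromUtf32(0x042F).Length);  // 1 ("Я")
        // ...but U+1D50A needs a surrogate pair, i.e. two chars.
        Console.WriteLine(char.ConvertFromUtf32(0x1D50A).Length); // 2 ("𝔊")
    }
}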
Encodings
Anywhere you work with characters, it'll be in an encoding of some form. C# uses UTF-16 for its char type, which it defines as 16 bits wide.
You can think of an encoding as a tabular mapping between codepoints and byte representations.
CODEPOINT     UTF-16BE       UTF-8          WINDOWS-1252
U+0041 (A)    00 41          41             41
U+00A3 (£)    00 A3          C2 A3          A3
U+042F (Я)    04 2F          D0 AF          -
U+1D50A (𝔊)   D8 35 DD 0A    F0 9D 94 8A    -
The System.Text.Encoding class exposes the methods (GetBytes, GetString, and friends) that perform these transformations.
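For instance (a minimal sketch), GetBytes reproduces the £ row of the table above, and GetString goes back the other way:

using System;
using System.Text;

public class EncodeDemo {
    public static void Main(string[] args) {
        string pound = "\u00A3"; // £

        // U+00A3 encoded three ways; the bytes match the table row for £.
        Console.WriteLine(BitConverter.ToString(Encoding.BigEndianUnicode.GetBytes(pound)));  // 00-A3
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(pound)));              // C2-A3
        Console.WriteLine(BitConverter.ToString(Encoding.GetEncoding(1252).GetBytes(pound))); // A3

        // GetString is the reverse transformation: bytes back to a (UTF-16) string.
        Console.WriteLine(Encoding.UTF8.GetString(new byte[] { 0xC2, 0xA3 })); // £
    }
}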
Graphemes
The grapheme you see on the screen may be constructed from more than one codepoint. The character e-acute (é) can be represented with two codepoints, LATIN SMALL LETTER E U+0065 and COMBINING ACUTE ACCENT U+0301.
('é' is more usually represented by the single codepoint U+00E9. You can switch between them using normalization. Not all combining sequences have a single character equivalent, though.)
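A small sketch of both points: string.Normalize converts between the two representations, and System.Globalization.StringInfo counts "text elements" (roughly, graphemes).

using System;
using System.Globalization;
using System.Text;

public class Graphemes {
    public static void Main(string[] args) {
        string decomposed = "e\u0301";                                    // e + combining acute accent
        string composed = decomposed.Normalize(NormalizationForm.FormC);  // "\u00E9"

        Console.WriteLine(decomposed.Length); // 2 chars
        Console.WriteLine(composed.Length);   // 1 char

        // Both render as a single grapheme.
        Console.WriteLine(new StringInfo(decomposed).LengthInTextElements); // 1
        Console.WriteLine(new StringInfo(composed).LengthInTextElements);   // 1
    }
}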
Conclusions
- When you encode a C# string to an encoding, you are performing a transformation from UTF-16 to that encoding.
- Encoding can be a lossy transformation: most non-Unicode encodings can only encode a subset of all characters.
- Since not all codepoints fit into a single C# char, the number of chars in a string may be greater than the number of codepoints, and the number of codepoints may be greater than the number of rendered graphemes (see the sketch after this list).
- The "length" of a string is context-sensitive, so you need to know what meaning you're applying and use the appropriate algorithm. How this is handled is defined by the programming language you're using.
- Giving Latin-1 characters identical values in many encodings gives some people delusions of ASCII.
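As a sketch of those length differences (an illustrative snippet; the codepoint-counting loop is just one way to do it):

using System;
using System.Globalization;

public class Lengths {
    public static void Main(string[] args) {
        // 𝔊 (a surrogate pair) followed by e + combining acute accent.
        string s = char.ConvertFromUtf32(0x1D50A) + "e\u0301";

        Console.WriteLine(s.Length); // 4 UTF-16 chars

        // Count codepoints by stepping over surrogate pairs.
        int codepoints = 0;
        for (int i = 0; i < s.Length; i += char.IsSurrogatePair(s, i) ? 2 : 1) {
            codepoints++;
        }
        Console.WriteLine(codepoints); // 3 codepoints

        // StringInfo counts text elements (graphemes): 𝔊 and é.
        Console.WriteLine(new StringInfo(s).LengthInTextElements); // 2 graphemes
    }
}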
(This is a little more long-winded than I intended, and probably more than you wanted, so I'll stop. I wrote an even more long-winded post on Java encoding here.)