One thing I have never truly understood is the concept of character encoding. How encoding is handled in memory and in code often baffles me, to the point that I just copy an example from the internet without truly understanding what it does. I feel it's a really important and much-overlooked subject that more people should take the time to get right (including myself).

I am looking for some good, to-the-point resources for learning the different types of character encoding and converting between them (preferably in C#). Both books and online resources are welcome.

Thanks.


Edit 1:

Thanks for the responses so far. I am especially looking for more info on how .NET handles encoding. I know this may seem vague, but I don't really know what to ask for. I guess I am curious as to how encoding is represented in, say, the C# string class, and whether the class itself can manage different encoding types or whether there are separate classes for this.

+2  A: 

There's the famous Joel article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" http://www.joelonsoftware.com/articles/Unicode.html

Edit: That article is more about text formats, though. On re-reading, I guess you may be more interested in things like HTML encoding and URL encoding? Those are for escaping special characters that have significant meanings within HTML or URLs (e.g. < and > in HTML, or ? and = in URLs).

Andrew M
+2  A: 

Wikipedia has a pretty good explanation of character encoding in general: http://en.wikipedia.org/wiki/Character_encoding.

If you are looking for details of UTF-8, which is one of the most popular character encodings, you should read the UTF-8 and Unicode FAQ.

And, as was already pointed out, "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" is a very good beginner's tutorial.

Avi
+2  A: 

I'd start with this question: what is a character?

  • The logical identity: a codepoint. Unicode assigns a number to each character that isn't necessarily related to any bit/byte form. Encodings (like UTF-8) define the mapping to byte values.
  • The bits and bytes: the encoded form. One or more bytes per codepoint, values determined by the encoding used.
  • The thing you see on the screen: a grapheme. The grapheme is created from one or more codepoints. This is the stuff at the presentation end of things.

This code transforms in.txt from windows-1252 to UTF-8 and saves it as out.txt.

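using System;
using System.IO;
using System.Text;

public class Enc {
  public static void Main(String[] args) {
    // On .NET Core / .NET 5+, code page 1252 is only available after calling
    // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).
    Encoding win1252 = Encoding.GetEncoding(1252);
    // Note: Encoding.UTF8 makes StreamWriter emit a byte order mark.
    Encoding utf8 = Encoding.UTF8;
    using (StreamReader reader = new StreamReader("in.txt", win1252))
    using (StreamWriter writer = new StreamWriter("out.txt", false, utf8)) {
      char[] buffer = new char[1024];
      int r;
      // Read returns 0 at the end of the stream. The original test,
      // Peek() > 0, would also have stopped early on a NUL character.
      while ((r = reader.Read(buffer, 0, buffer.Length)) > 0) {
        writer.Write(buffer, 0, r);
      }
    }
  }
}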

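Two transformations happen here. First, the reader decodes the windows-1252 bytes into UTF-16 code units in the char buffer. (In memory these are just 16-bit values; byte order only becomes a question when chars are serialized to bytes.) Then the writer encodes the contents of the buffer as UTF-8.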
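The same two steps can be done in memory, which may make them easier to see. This is only a sketch: the TranscodeDemo class and Transcode method are names I made up, and I've used ISO-8859-1 rather than windows-1252 because it should be available without registering any extra encoding provider.

```csharp
using System;
using System.Text;

public static class TranscodeDemo {
  // Step 1 decodes the source bytes into a string (UTF-16 in memory).
  // Step 2 encodes that string into the target encoding.
  public static byte[] Transcode(byte[] input, Encoding source, Encoding target) {
    string text = source.GetString(input);  // bytes -> chars
    return target.GetBytes(text);           // chars -> bytes
  }

  public static void Main() {
    Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
    byte[] input = { 0x41, 0xA3 };          // "A£" in ISO-8859-1
    byte[] output = Transcode(input, latin1, Encoding.UTF8);
    Console.WriteLine(BitConverter.ToString(output));  // 41-C2-A3
  }
}
```

Encoding.Convert(source, target, bytes) does both steps in one call if you don't need the intermediate string.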

Codepoints

Some example code points:

  • U+0041 is LATIN CAPITAL LETTER A (A)
  • U+00A3 is POUND SIGN (£)
  • U+042F is CYRILLIC CAPITAL LETTER YA (Я)
  • U+1D50A is MATHEMATICAL FRAKTUR CAPITAL G (𝔊)
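If you want to poke at these in C#, something like the following should work. It is a rough sketch (CodePointDemo and its method names are mine, not part of .NET) built on char.ConvertFromUtf32 and char.ConvertToUtf32, which convert between a codepoint number and its UTF-16 string form.

```csharp
using System;

public static class CodePointDemo {
  // Builds the UTF-16 string for a codepoint (one or two chars).
  public static string FromCodePoint(int codePoint) {
    return char.ConvertFromUtf32(codePoint);
  }

  // Reads the codepoint that starts at the given char index.
  public static int CodePointAt(string s, int index) {
    return char.ConvertToUtf32(s, index);
  }

  public static void Main() {
    foreach (int cp in new[] { 0x41, 0xA3, 0x42F, 0x1D50A }) {
      string s = FromCodePoint(cp);
      // U+1D50A is outside the 16-bit range, so it needs two chars.
      Console.WriteLine("U+{0:X4} -> {1} char(s)", cp, s.Length);
    }
  }
}
```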

Encodings

Anywhere you work with characters, it'll be in an encoding of some form. C# uses UTF-16 for its char type, which it defines as 16 bits wide.

You can think of an encoding as a tabular mapping between codepoints and byte representations.

CODEPOINT       UTF-16BE        UTF-8     WINDOWS-1252
U+0041 (A)         00 41           41               41
U+00A3 (£)         00 A3        C2 A3               A3
U+042F (Ya)        04 2F        D0 AF                -
U+1D50A      D8 35 DD 0A  F0 9D 94 8A                -

The System.Text.Encoding class and its subclasses expose the methods that perform these transformations (GetBytes to encode, GetString to decode).
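As a quick check, the UTF-16BE and UTF-8 columns of that table can be reproduced with Encoding.BigEndianUnicode and Encoding.UTF8. This is a sketch with made-up names (TableDemo, Hex); I've left out the windows-1252 column because that code page may need an extra provider registered on newer runtimes.

```csharp
using System;
using System.Text;

public static class TableDemo {
  // Encodes a single codepoint and returns its bytes as hex, e.g. "C2 A3".
  public static string Hex(Encoding encoding, int codePoint) {
    string s = char.ConvertFromUtf32(codePoint);
    return BitConverter.ToString(encoding.GetBytes(s)).Replace("-", " ");
  }

  public static void Main() {
    int[] codePoints = { 0x41, 0xA3, 0x42F, 0x1D50A };
    foreach (int cp in codePoints) {
      Console.WriteLine("U+{0:X4}  {1,-11}  {2}",
          cp, Hex(Encoding.BigEndianUnicode, cp), Hex(Encoding.UTF8, cp));
    }
  }
}
```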

Graphemes

The grapheme you see on the screen may be constructed from more than one codepoint. The character e-acute (é) can be represented with two codepoints, LATIN SMALL LETTER E U+0065 and COMBINING ACUTE ACCENT U+0301.

('é' is more usually represented by the single codepoint U+00E9. You can switch between them using normalization. Not all combining sequences have a single character equivalent, though.)
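In .NET the switch is done with string.Normalize. A small sketch, assuming a default runtime configuration (NormalizeDemo, Compose and Decompose are just illustrative names):

```csharp
using System;
using System.Text;

public static class NormalizeDemo {
  // Form C composes sequences into single codepoints where one exists.
  public static string Compose(string s) {
    return s.Normalize(NormalizationForm.FormC);
  }

  // Form D decomposes characters into base letter plus combining marks.
  public static string Decompose(string s) {
    return s.Normalize(NormalizationForm.FormD);
  }

  public static void Main() {
    string twoCodepoints = "e\u0301";  // e + COMBINING ACUTE ACCENT
    string oneCodepoint = "\u00E9";    // precomposed e-acute

    // The == operator compares chars ordinally, so these differ
    // even though they render as the same grapheme.
    Console.WriteLine(twoCodepoints == oneCodepoint);           // False
    Console.WriteLine(Compose(twoCodepoints) == oneCodepoint);  // True
    Console.WriteLine(Decompose(oneCodepoint).Length);          // 2
  }
}
```

Normalizing both sides to the same form before comparing or storing text is the usual way to avoid surprises here.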

Conclusions

  • When you encode a C# string to an encoding, you are performing a transformation from UTF-16 to that encoding.
  • Encoding can be a lossy transformation - most non-Unicode encodings can only encode a subset of existing characters.
  • Since not all codepoints fit into a single C# char, the number of chars in a string may be greater than the number of codepoints, and the number of codepoints may in turn be greater than the number of rendered graphemes.
  • The "length" of a string is context-sensitive, so you need to know what meaning you're applying and use the appropriate algorithm. How this is handled is defined by the programming language you're using.
  • Giving the ASCII-range (Basic Latin) characters identical values in many encodings gives some people delusions of ASCII.
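To make the lossy-encoding and "length" points concrete, here is a rough sketch. LengthDemo and its methods are hypothetical names; StringInfo's "text elements" are .NET's approximation of graphemes, and I'm assuming the default Encoding.ASCII behaviour of substituting '?' for anything it can't encode.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class LengthDemo {
  // Counts codepoints by treating each surrogate pair as one unit.
  public static int CountCodePoints(string s) {
    int count = 0;
    for (int i = 0; i < s.Length; i++) {
      if (char.IsSurrogatePair(s, i)) i++;  // skip the low surrogate
      count++;
    }
    return count;
  }

  // Counts text elements (base character plus any combining marks).
  public static int CountTextElements(string s) {
    return new StringInfo(s).LengthInTextElements;
  }

  // Encodes to ASCII and decodes again; unencodable characters are lost.
  public static string RoundTripAscii(string s) {
    return Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(s));
  }

  public static void Main() {
    // e + combining acute accent, followed by U+1D50A.
    string s = "e\u0301" + char.ConvertFromUtf32(0x1D50A);
    Console.WriteLine(s.Length);                   // 4 chars
    Console.WriteLine(CountCodePoints(s));         // 3 codepoints
    Console.WriteLine(CountTextElements(s));       // 2 text elements
    Console.WriteLine(RoundTripAscii("A\u00A3"));  // A?
  }
}
```

If silent replacement is not what you want, Encoding.GetEncoding has an overload that takes an EncoderExceptionFallback, which throws instead of substituting.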

(This is a little more long-winded than I intended, and probably more than you wanted, so I'll stop. I wrote an even more long-winded post on Java encoding here.)

McDowell
Thanks McDowell, that's exactly the sort of thing I was looking for.
Nat Ryall