I'd start with this question: what is a character?
- The logical identity: a codepoint. Unicode assigns a number to each character that isn't necessarily related to any bit/byte form. Encodings (like UTF-8) define the mapping to byte values.
- The bits and bytes: the encoded form. One or more bytes per codepoint, values determined by the encoding used.
- The thing you see on the screen: a grapheme. A grapheme is built from one or more codepoints; this is the stuff at the presentation end of things.
This code transforms in.txt from windows-1252 to UTF-8 and saves it as out.txt.
using System;
using System.IO;
using System.Text;

public class Enc {
    public static void Main(string[] args) {
        Encoding win1252 = Encoding.GetEncoding(1252);
        Encoding utf8 = Encoding.UTF8;

        // The reader decodes windows-1252 bytes into UTF-16 chars;
        // the writer encodes those chars back out as UTF-8.
        using (StreamReader reader = new StreamReader("in.txt", win1252)) {
            using (StreamWriter writer = new StreamWriter("out.txt", false, utf8)) {
                char[] buffer = new char[1024];
                while (reader.Peek() >= 0) { // Peek() returns -1 at end of stream
                    int r = reader.Read(buffer, 0, buffer.Length);
                    writer.Write(buffer, 0, r);
                }
            }
        }
    }
}
Two transformations happen here. First, the bytes are decoded from windows-1252 to UTF-16 (in whatever byte order the platform uses, typically little-endian) into the char buffer. Then the contents of the buffer are encoded as UTF-8 and written out.
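For small files, the same round trip can be sketched more briefly (an illustrative snippet, not from the original code) with File.ReadAllText and File.WriteAllText; the intermediate string is the UTF-16 form:

using System.IO;
using System.Text;

public class EncShort {
    public static void Main(string[] args) {
        // Decode windows-1252 bytes into a UTF-16 string in memory...
        string text = File.ReadAllText("in.txt", Encoding.GetEncoding(1252));
        // ...then encode that string as UTF-8 on the way back out.
        File.WriteAllText("out.txt", text, Encoding.UTF8);
    }
}

Unlike the streaming version above, this loads the whole file into memory at once.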
Codepoints
Some example codepoints:
- U+0041 is LATIN CAPITAL LETTER A (A)
- U+00A3 is POUND SIGN (£)
- U+042F is CYRILLIC CAPITAL LETTER YA (Я)
- U+1D50A is MATHEMATICAL FRAKTUR CAPITAL G (𝔊)
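As a rough sketch of how those land in a C# string, char.ConvertFromUtf32 turns a codepoint number into its UTF-16 form: codepoints up to U+FFFF become a single char, while U+1D50A becomes a surrogate pair of two chars.

using System;

public class Codepoints {
    public static void Main(string[] args) {
        // Codepoints in the Basic Multilingual Plane fit in one 16-bit char...
        Console.WriteLine(char.ConvertFromUtf32(0x0041).Length);  // 1 ("A")
        Console.WriteLine(char.ConvertFromUtf32(0x042F).Length);  // 1 ("Я")
        // ...but U+1D50A needs a surrogate pair, i.e. two chars.
        Console.WriteLine(char.ConvertFromUtf32(0x1D50A).Length); // 2 ("𝔊")
    }
}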
Encodings
Anywhere you work with characters, it'll be in an encoding of some form. C# uses UTF-16 for its char type, which it defines as 16 bits wide.
You can think of an encoding as a tabular mapping between codepoints and byte representations.
CODEPOINT     UTF-16BE       UTF-8          WINDOWS-1252
U+0041 (A)    00 41          41             41
U+00A3 (£)    00 A3          C2 A3          A3
U+042F (Я)    04 2F          D0 AF          -
U+1D50A (𝔊)   D8 35 DD 0A    F0 9D 94 8A    -
The System.Text.Encoding class exposes the methods (GetBytes, GetString, and friends) that perform these transformations.
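For instance (a minimal sketch), GetBytes reproduces the £ row of the table above, and GetString goes back the other way:

using System;
using System.Text;

public class EncodeDemo {
    public static void Main(string[] args) {
        string pound = "\u00A3"; // £

        // U+00A3 encoded three ways; the bytes match the table row for £.
        Console.WriteLine(BitConverter.ToString(Encoding.BigEndianUnicode.GetBytes(pound)));  // 00-A3
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(pound)));              // C2-A3
        Console.WriteLine(BitConverter.ToString(Encoding.GetEncoding(1252).GetBytes(pound))); // A3

        // GetString is the reverse transformation: bytes back to a (UTF-16) string.
        Console.WriteLine(Encoding.UTF8.GetString(new byte[] { 0xC2, 0xA3 })); // £
    }
}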
Graphemes
The grapheme you see on the screen may be constructed from more than one codepoint. The character e-acute (é) can be represented with two codepoints, LATIN SMALL LETTER E U+0065 and COMBINING ACUTE ACCENT U+0301.
('é' is more usually represented by the single codepoint U+00E9. You can switch between them using normalization. Not all combining sequences have a single character equivalent, though.)
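A small sketch of both points: string.Normalize converts between the two representations, and System.Globalization.StringInfo counts "text elements" (roughly, graphemes).

using System;
using System.Globalization;
using System.Text;

public class Graphemes {
    public static void Main(string[] args) {
        string decomposed = "e\u0301";                                    // e + combining acute accent
        string composed = decomposed.Normalize(NormalizationForm.FormC);  // "\u00E9"

        Console.WriteLine(decomposed.Length); // 2 chars
        Console.WriteLine(composed.Length);   // 1 char

        // Both render as a single grapheme.
        Console.WriteLine(new StringInfo(decomposed).LengthInTextElements); // 1
        Console.WriteLine(new StringInfo(composed).LengthInTextElements);   // 1
    }
}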
Conclusions
- When you encode a C# string to an encoding, you are performing a transformation from UTF-16 to that encoding.
- Encoding can be a lossy transformation: most non-Unicode encodings can only encode a subset of all characters.
- Since not all codepoints fit into a single C# char, the number of chars in a string may be greater than the number of codepoints, and the number of codepoints may be greater than the number of rendered graphemes (see the sketch after this list).
- The "length" of a string is context-sensitive, so you need to know what meaning you're applying and use the appropriate algorithm. How this is handled is defined by the programming language you're using.
- Giving Latin-1 characters identical values in many encodings gives some people delusions of ASCII.
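As a sketch of those length differences (an illustrative snippet; the codepoint-counting loop is just one way to do it):

using System;
using System.Globalization;

public class Lengths {
    public static void Main(string[] args) {
        // 𝔊 (a surrogate pair) followed by e + combining acute accent.
        string s = char.ConvertFromUtf32(0x1D50A) + "e\u0301";

        Console.WriteLine(s.Length); // 4 UTF-16 chars

        // Count codepoints by stepping over surrogate pairs.
        int codepoints = 0;
        for (int i = 0; i < s.Length; i += char.IsSurrogatePair(s, i) ? 2 : 1) {
            codepoints++;
        }
        Console.WriteLine(codepoints); // 3 codepoints

        // StringInfo counts text elements (graphemes): 𝔊 and é.
        Console.WriteLine(new StringInfo(s).LengthInTextElements); // 2 graphemes
    }
}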
(This is a little more long-winded than I intended, and probably more than you wanted, so I'll stop. I wrote an even more long-winded post on Java encoding here.)