I have some code to dump strings to stdout to check their encoding, it looks like this:

    private void DumpString(string s)
    {   
        System.Console.Write("{0}: ", s);
        foreach (byte b in s)
        {   
            System.Console.Write("{0}({1}) ", (char)b, b.ToString("x2"));
        }       
        System.Console.WriteLine();
    }

Consider two strings, each of which appears as "ë" but is encoded differently. DumpString produces the following output for them:

    ë: e(65)(08)
    ë: ë(eb)

The code looks like this:

    DumpString(string1);
    DumpString(string2);

How can I convert string2, using System.Text.Encoding, so that it is byte-equivalent to string1?

+4  A: 

You're looking for the String.Normalize method.

SLaks
+6  A: 

They don't have different encodings. Strings in C# are always UTF-16 (thus, you shouldn't use byte to iterate over strings because you'll lose the top 8 bits). What they have is different normalization forms.
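To see the full code units rather than the truncated bytes, a dump helper can iterate over `char` instead of `byte`. The following is only a sketch; the class name `CodeUnitDump` and the method name `DumpCodeUnits` are made up for illustration and are not part of the question's code.

```csharp
using System;
using System.Text;

public static class CodeUnitDump
{
    // Formats every UTF-16 code unit of s as four hex digits, separated by spaces.
    // Iterating as char keeps all 16 bits; casting to byte would drop the top 8.
    public static string DumpCodeUnits(string s)
    {
        var sb = new StringBuilder();
        foreach (char c in s)
        {
            if (sb.Length > 0) sb.Append(' ');
            sb.Append(((int)c).ToString("x4"));
        }
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(DumpCodeUnits("\u0065\u0308")); // 0065 0308
        Console.WriteLine(DumpCodeUnits("\u00EB"));       // 00eb
    }
}
```

With this version the second code unit of the first string shows up as 0308 instead of 08.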

Your first string is "\u0065\u0308": LATIN SMALL LETTER E + COMBINING DIAERESIS. This is the decomposed form (NFD).

The second is "\u00EB": LATIN SMALL LETTER E WITH DIAERESIS. This is the precomposed form (NFC).

You can convert between them with string.Normalize.
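As a rough sketch of that conversion, assuming string1 and string2 hold exactly the two values described above: `NormalizationForm.FormD` gives the decomposed form and `NormalizationForm.FormC` the precomposed one. The class name `NormalizeExample` and the helpers `ToDecomposed` and `ToComposed` are hypothetical names used only for this example.

```csharp
using System;
using System.Text;

public static class NormalizeExample
{
    // Returns s in the decomposed form (NFD).
    public static string ToDecomposed(string s) => s.Normalize(NormalizationForm.FormD);

    // Returns s in the precomposed form (NFC).
    public static string ToComposed(string s) => s.Normalize(NormalizationForm.FormC);

    public static void Main()
    {
        string string1 = "\u0065\u0308"; // e + COMBINING DIAERESIS (assumed value)
        string string2 = "\u00EB";       // LATIN SMALL LETTER E WITH DIAERESIS (assumed value)

        // Ordinal comparison checks the code units themselves, not linguistic equality.
        Console.WriteLine(string.Equals(ToDecomposed(string2), string1, StringComparison.Ordinal)); // True
        Console.WriteLine(string.Equals(ToComposed(string1), string2, StringComparison.Ordinal));   // True
        Console.WriteLine(ToDecomposed(string2).Length); // 2
    }
}
```

Calling `Normalize()` with no argument uses FormC, so getting from string2 to string1 needs FormD passed explicitly.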

dan04