Is it possible in C# to use UTF-32 characters not in Plane 0 as a char?

string s = "\U0001D11E"; // valid (U+1D11E stands in for any character outside Plane 0)
char c = '\U0001D11E'; // generates a compiler error ("Too many characters in character literal")

And in s it is represented by two characters, not one.

Edit: I mean, is there a character and string type with full Unicode support, UTF-32 or UTF-8 per character? For example, if I want a for loop over the UTF-32 characters (possibly not in Plane 0) in a string.

+3  A: 

I only know this problem from Java, so I checked the documentation on char before answering; the behavior is indeed pretty much the same in .NET/C# and Java.

It seems that a char is indeed defined to be 16 bits wide and definitely can't hold anything outside of Plane 0. Only String/string is capable of handling those characters. In a char array, such a character is represented as two surrogate chars (a high surrogate followed by a low surrogate).
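
As a minimal sketch of the point above (U+1D11E, MUSICAL SYMBOL G CLEF, is just an arbitrary example chosen for illustration), the following shows one character outside Plane 0 occupying two char values in a string, and how the pair maps back to a single code point held in an int:

```csharp
using System;

// U+1D11E is an arbitrary example of a character outside Plane 0.
string s = "\U0001D11E";

// The string holds two UTF-16 code units: a high and a low surrogate.
Console.WriteLine(s.Length);                    // 2
Console.WriteLine(char.IsHighSurrogate(s[0]));  // True
Console.WriteLine(char.IsLowSurrogate(s[1]));   // True

// An int can hold the full code point; a single char cannot.
int codePoint = char.ConvertToUtf32(s, 0);
Console.WriteLine(codePoint.ToString("X"));     // 1D11E

// Converting the code point back yields the same two-char string.
string roundTrip = char.ConvertFromUtf32(codePoint);
Console.WriteLine(roundTrip == s);              // True
```

So if you need a single value per character, an int code point is probably the closest thing to a "UTF-32 char" here.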

Joachim Sauer
+2  A: 

C#'s System.String supports UTF-32 just fine, but you can't iterate through the string as if it were an array of System.Char, or use IEnumerable.

For example:

// iterating through a string NO UTF-32 SUPPORT
for (int i = 0; i < sample.Length; ++i)
{
    if (Char.IsDigit(sample[i]))
    {
        Console.WriteLine("IsDigit");
    }
    else if (Char.IsLetter(sample[i]))
    {
        Console.WriteLine("IsLetter");
    }
}

// iterating through a string WITH UTF-32 SUPPORT
for (int i = 0; i < sample.Length; ++i)
{
    if (Char.IsDigit(sample, i))
    {
        Console.WriteLine("IsDigit");
    }
    else if (Char.IsLetter(sample, i))
    {
        Console.WriteLine("IsLetter");
    }

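    // Skip the low surrogate only when a valid surrogate pair starts at i;
    // IsSurrogate alone would also skip the char after a stray low surrogate.
    if (Char.IsSurrogatePair(sample, i))
    {
        ++i;
    }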
}

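Note the subtle difference in the Char.IsDigit and Char.IsLetter calls: the (string, index) overloads look at the whole surrogate pair starting at that index, while the single-char overloads only ever see one 16-bit value. Also note that String.Length is always the number of 16-bit "characters" (UTF-16 code units), not the number of "characters" in the UTF-32 sense (code points).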
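
To make the Length difference concrete, here is a rough sketch that walks a string one code point at a time. CodePoints is a hypothetical helper written for this example, not a framework method, and the sample string is made up; it relies only on char.IsSurrogatePair and char.ConvertToUtf32.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not a framework API): yields each code point of a
// string as an int, stepping over both halves of a surrogate pair.
static IEnumerable<int> CodePoints(string text)
{
    for (int i = 0; i < text.Length; i++)
    {
        if (char.IsSurrogatePair(text, i))
        {
            yield return char.ConvertToUtf32(text, i);
            i++; // skip the low surrogate that was just consumed
        }
        else
        {
            // Plane 0 character, or an unpaired surrogate passed through as-is.
            yield return text[i];
        }
    }
}

// Made-up sample: 'a', one character outside Plane 0, then 'b'.
string sample = "a\U0001D11Eb";

int count = 0;
foreach (int cp in CodePoints(sample))
{
    Console.WriteLine("U+" + cp.ToString("X4"));
    count++;
}

Console.WriteLine(sample.Length); // 4 (16-bit code units)
Console.WriteLine(count);         // 3 (code points)
```

Checking IsSurrogatePair before calling ConvertToUtf32 avoids the exception that ConvertToUtf32 throws on an unpaired surrogate.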

Off topic, but UTF-32 support is completely unnecessary for an application to handle international languages, unless you have a specific business case for an obscure historical/technical language.

What you're talking about is not UTF-32, it's just UTF-16 that happens to contain supplementary characters (code points above U+FFFF, stored as surrogate pairs). In UTF-32, every character is stored as four bytes. .NET strings are always UTF-16.
Alan Moore
+1  A: 
Emperor XLII