I'm trying to handle the following character: ⨝ (http://www.fileformat.info/info/unicode/char/2a1d/index.htm)

If you check whether an empty string starts with this character, it always returns true, which makes no sense! Why is that?

// Visual Studio 2008 hides lines that contain this char literally (a bug in Visual Studio?!?), so I wrote its Unicode code point instead.
char specialChar = (char)10781;
string specialString = specialChar.ToString();

// prints 1
Console.WriteLine(specialString.Length);

// prints 10781
Console.WriteLine((int)specialChar);

// prints false
Console.WriteLine(string.Empty.StartsWith("A"));

// both print true. WTF?!?
Console.WriteLine(string.Empty.StartsWith(specialString));
Console.WriteLine(string.Empty.StartsWith(((char)10781).ToString()));
+3  A: 

Nice Unicode glitch ;-p

I'm not sure why it does this, but amusingly:

Console.WriteLine(string.Empty.StartsWith(specialString)); // true
Console.WriteLine(string.Empty.Contains(specialString)); // false
Console.WriteLine("abc".StartsWith(specialString)); // true
Console.WriteLine("abc".Contains(specialString)); // false

I'm guessing this is treated a bit like the non-joining character that Jon mentioned at DevDays: some string functions see it, and some don't. And when a function doesn't see it, the check becomes "does (some string) start with the empty string?", which is always true.
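
You can see that "invisibility" directly by comparing the character against the empty string both ways. A minimal sketch, assuming the default collation tables give this code point no weight (needs using System.Globalization):

CompareInfo ci = CultureInfo.InvariantCulture.CompareInfo;
string special = ((char)10781).ToString();

// prints 0 - culturally the character has no weight, so it is "equal" to ""
Console.WriteLine(ci.Compare(special, string.Empty));

// prints a non-zero value - ordinal comparison sees the raw code unit
Console.WriteLine(ci.Compare(special, string.Empty, CompareOptions.Ordinal));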

Marc Gravell
+1 from me. I hadn't seen Jon's talk.
RichardOD
+7  A: 

You can fix this by using an ordinal StringComparison:

From the MSDN docs:

When you specify either StringComparison.Ordinal or StringComparison.OrdinalIgnoreCase, the string comparison will be non-linguistic. That is, the features that are specific to the natural language are ignored when making comparison decisions. This means the decisions are based on simple byte comparisons and ignore casing or equivalence tables that are parameterized by culture. As a result, by explicitly setting the parameter to either the StringComparison.Ordinal or StringComparison.OrdinalIgnoreCase, your code often gains speed, increases correctness, and becomes more reliable.

    char specialChar = (char)10781;
    string specialString = Convert.ToString(specialChar);

    // prints 1
    Console.WriteLine(specialString.Length);

    // prints 10781
    Console.WriteLine((int)specialChar);

    // prints false
    Console.WriteLine(string.Empty.StartsWith("A"));

    // prints false
    Console.WriteLine(string.Empty.StartsWith(specialString, StringComparison.Ordinal));
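
The same ordinal overloads exist on the other comparison methods (IndexOf, EndsWith, and so on), so they can be fixed the same way. A small sketch of what I'd expect, given the culture-sensitive defaults:

    // culture-sensitive defaults treat the character as invisible
    Console.WriteLine("abc".IndexOf(specialString));   // prints 0
    Console.WriteLine("abc".EndsWith(specialString));  // prints true

    // ordinal overloads see the raw code units
    Console.WriteLine("abc".IndexOf(specialString, StringComparison.Ordinal));  // prints -1
    Console.WriteLine("abc".EndsWith(specialString, StringComparison.Ordinal)); // prints false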
RichardOD
Culture-sensitive-comparison-by-default seems like a big disastrous violation of the principle of least surprise. Is there any rule of thumb to determine which methods require a StringComparison to get ‘normal’ ordinal behaviour and which don't?
bobince
@bobince- have you seen this question- http://stackoverflow.com/questions/72696/which-is-generally-best-to-use-stringcomparison-ordinalignorecase-or-stringcom
RichardOD
+2  A: 

The underlying reason for this is that the default string comparison is locale-aware, meaning it uses tables of locale data for comparisons (including equality).

Many (if not most) Unicode characters have no weight assigned for many locales, and so are effectively invisible to the comparison (or match anything, or nothing).

See the entries on character weights on Michael Kaplan's blog, "Sorting It All Out". The series contains a lot of background information (the APIs discussed are native, but, as I understand it, the mechanisms in .NET are the same).

Quick version: this is a complex area, and getting expected (natural-language) comparisons right is hard; that tends to lead to odd behaviour with code points for glyphs outside your language.
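
If you want to see the missing weight for yourself, the sort key (which is what the comparison is ultimately built on) is one way. A minimal sketch, assuming the default collation data (needs using System.Globalization):

CompareInfo ci = CultureInfo.CurrentCulture.CompareInfo;
byte[] emptyKey = ci.GetSortKey(string.Empty).KeyData;
byte[] specialKey = ci.GetSortKey(((char)10781).ToString()).KeyData;

// both print the same value: the character contributes no weight,
// so its sort key comes out identical to the empty string's
Console.WriteLine(emptyKey.Length);
Console.WriteLine(specialKey.Length);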

Richard