I'm seeing some very strange sorting behaviour using CaseInsensitiveComparer.DefaultInvariant. Words that start with a leading hyphen "-" end up sorted as if the hyphen wasn't there rather than being sorted in front of actual letters which is what happens with other punctuation.

So given { "hello", ".net", "-less"} I end up with {".net", "hello", "-less" } instead of the expected {"-less", ".net", "hello"}.

Or, phrased as a test case:

[TestMethod]
public void TestMethod1()
{
    var rg = new String[] { 
        "x", "z", "y", "-less", ".net", "- more", "a", "b"
    };

    Array.Sort(rg, CaseInsensitiveComparer.DefaultInvariant);

    Assert.AreEqual(
        "- more,-less,.net,a,b,x,y,z", 
        String.Join(",", rg)
    );
}

... which fails like this:

Assert.AreEqual failed. 
Expected:<- more,-less,.net,a,b,x,y,z>. 
Actual:  <- more,.net,a,b,-less,x,y,z>.

Any ideas what's going on?

Edit:

Looks like .NET, by default, does fancy things when sorting strings, which causes leading hyphens to be sorted into strange places so that co-op and coop sort together. Thus, if you want your leading-hyphen words to end up at the beginning with the other punctuation, you have to tell it not to:

Array.Sort(rg, (a, b) => String.CompareOrdinal(a, b));
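
Note that String.CompareOrdinal is case-sensitive, whereas the original comparer ignored case. A minimal sketch, assuming case-insensitivity should be kept, is to pass the built-in StringComparer.OrdinalIgnoreCase instead. The OrdinalSortDemo class and its method name are hypothetical, introduced only for illustration:

```csharp
using System;

public static class OrdinalSortDemo
{
    // Returns the words sorted by code point, ignoring case, joined with commas.
    // Sorts a copy so the caller's array is left untouched.
    public static string SortOrdinalIgnoreCase(string[] words)
    {
        var copy = (string[])words.Clone();
        Array.Sort(copy, StringComparer.OrdinalIgnoreCase);
        return String.Join(",", copy);
    }
}
```

For the array in the test case above, this should give "- more,-less,.net,a,b,x,y,z", since the space (U+0020), hyphen (U+002D) and full stop (U+002E) all precede the letters in code-point order.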
+2  A: 

My guess would be that a dash immediately before a letter is being ignored, for purposes of sorting. When you sort a list of words, you'd like "inter-national" and "international" to be next to each other, wouldn't you? A dash by itself, on the other hand, is considered significant.

James Curran
Very nice guess..
Oren A
Not really - I'd like (and expect) sorting of embedded non-alpha chars according to their position in the ASCII charset. Are you saying that "inter-national" and "international" are the same according to this Comparer?
Steve Townsend
+1 for the genius guesswork
Steve Townsend
+8  A: 

Comparison procedures use the CultureInfo.InvariantCulture to determine the sort order and casing rules. String comparisons might have different results depending on the culture. For more information on culture-specific comparisons, see the System.Globalization namespace and Encoding and Localization. From here.

The interesting part:

A word sort performs a culture-sensitive comparison of strings in which certain nonalphanumeric Unicode characters might have special weights assigned to them. For example, the hyphen (-) might have a very small weight assigned to it so that "coop" and "co-op" appear next to each other in a sorted list. From here.

Forgotten Semicolon
Then OP can solve confusion as follows? "A string sort also performs a culture-sensitive comparison. It is similar to a word sort, except that there are no special cases, and all nonalphanumeric symbols come before all alphanumeric Unicode characters. Two strings can be compared using string sort rules by calling the CompareInfo.Compare method overloads that have an options parameter that is supplied a value of CompareOptions.StringSort. Note that this is the only method that the .NET Framework provides to compare two strings using string sort rules."
Steve Townsend
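
The string sort quoted in the comment above could be sketched roughly as follows. This is only an illustration, assuming the invariant culture and a case-insensitive comparison are wanted; the StringSortComparer class name is hypothetical. The exact order among punctuation characters depends on the culture data the runtime uses, so no particular output is claimed here:

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical comparer: culture-sensitive "string sort" that ignores case.
public sealed class StringSortComparer : IComparer<string>
{
    private static readonly CompareInfo Invariant =
        CultureInfo.InvariantCulture.CompareInfo;

    public int Compare(string x, string y)
    {
        // StringSort treats the hyphen like any other symbol, so symbols
        // sort before alphanumeric characters instead of being near-ignored.
        return Invariant.Compare(
            x, y, CompareOptions.StringSort | CompareOptions.IgnoreCase);
    }
}
```

It would be used as Array.Sort(rg, new StringSortComparer()). Unlike an ordinal comparison, this stays culture-sensitive, which may or may not be what the question needs.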
+1, good answer. Explode your brain by contemplating the many different kinds of dashes, the one on your keyboard is always the wrong one: http://en.wikipedia.org/wiki/Dash
Hans Passant
A: 

Sort order is dependent on the culture, so you can't assume characters will sort in ASCII order.

http://msdn.microsoft.com/en-us/library/a7zyyk0c.aspx

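In your example, the hyphen in "-less" is the ASCII hyphen-minus (U+002D), which comes before both "." (U+002E) and "h" (U+0068) in code-point order, so the result you see is not a code-point ordering. Under the culture-sensitive rules the hyphen carries very little weight, so "-less" sorts roughly as "less", after "hello", while ".net" still sorts first.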

MikeWyatt
+1  A: 

To sort the strings the way you need, you have to create a comparer class that compares strings using the CompareInfo class. This class allows you to specify various comparison options; the one that best matches your needs is OrdinalIgnoreCase.

From MSDN:

Ignored Search Values

Comparison operations, such as those performed by the IndexOf or LastIndexOf methods, can yield unexpected results if the value to search for is ignored. The search value is ignored if it is an empty string (""), a character or string consisting of characters having code points that are not considered in the operation because of comparison options, or a value with code points that have no linguistic significance. If the search value for the IndexOf method is an empty string, for example, the return value is zero.

Note
When possible, the application should use string comparison methods that accept a CompareOptions value to specify the kind of comparison expected. As a general rule, user-facing comparisons are best served by the use of linguistic options (using the current culture), while security comparisons should specify Ordinal or OrdinalIgnoreCase.

I have modified your test case, and this version passes:

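using System;
using System.Collections.Generic;
using System.Globalization;
using NUnit.Framework;

// Compares strings by code point, ignoring case, via the invariant culture's CompareInfo.
public class MyComparer : Comparer<string>
{
    private readonly CompareInfo compareInfo;

    public MyComparer()
    {
        compareInfo = CompareInfo.GetCompareInfo(CultureInfo.InvariantCulture.Name);
    }

    public override int Compare(string x, string y)
    {
        // OrdinalIgnoreCase must be used on its own; it cannot be combined with other options.
        return compareInfo.Compare(x, y, CompareOptions.OrdinalIgnoreCase);
    }
}

public class Class1
{
    [Test]
    public void TestMethod1()
    {
        var rg = new String[] {
            "x", "z", "y", "-less", ".net", "- more", "a", "b"
        };

        Array.Sort(rg, new MyComparer());

        Assert.AreEqual(
            "- more,-less,.net,a,b,x,y,z",
            String.Join(",", rg)
        );
    }
}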
Andrea Parodi