views:

86

answers:

3

In a test case I've written, the string comparison doesn't appear to work the same way between SQL server / .NET CLR.

This C# code:

string lesser =  "SR2-A1-10-90";
string greater = "SR2-A1-100-10";

Debug.WriteLine(string.Compare("A","B"));
Debug.WriteLine(string.Compare(lesser, greater));

Will output:

-1
1

This SQL Server code:

declare @lesser varchar(20);
declare @greater varchar(20);

set @lesser =  'SR2-A1-10-90';
set @greater = 'SR2-A1-100-10';

IF @lesser < @greater
    SELECT 'Less Than';
ELSE
    SELECT 'Greater than';

Will output:

Less Than

Why the difference?

+2  A: 
  • In SQL you used varchar which is basically ASCII (subject to collation) which will give - before 0
  • In C# all strings are Unicode

The finer points of UTF-xx (c#) vs UCS-2 (SQL Server) are quite tricky.

Edit:

I posted too soon

I get "Greater Than" on SQL Server 2008 with collation Latin1_General_CI_AS

Edit 2:

I'd also try SELECT ASCII(...) on your dash. For example, if the SQL snippet has ever been in a Word document the - (150) is not the - (45) I copied into SQL Server for testing out of my browser from your questions. See CP 1252 (= CP1 = SQL Server lingo)

Edit 3: See Martin Smith's answer: the 2 collations have different sort orders.

gbn
Ah, I see. When I use nvarchar(20), I get `Greater Than`.
Matt Brunell
I get this with varchar. Is your DB collation different to Server collation?
gbn
+2  A: 

Further to gbn's answer, you can make them behave the same by using CompareOptions.StringSort in C# (or by using StringComparison.Ordinal). This treats symbols as occurring before alphanumeric symbols, so "-" < "0".

However, Unicode vs ASCII doesn't explain anything, as the hex codes for the ASCII codepage are translated verbatim to the Unicode codepage: "-" is 002D (45) while "0" is 0030 (48).

What is happening is that .NET is using "linguistic" sorting by default, which is based on a non-ordinal ordering and weight applied to various symbols by the specified or current culture. This linguistic algorithm allows, for instance, "résumé" (spelled with accents) to appear immediately following "resume" (spelled without accents) in a sorted list of words, as "é" is given a fractional order just after "e" and well before "f". It also allows "cooperation" and "co-operation" to be placed closely together, as the dash symbol is given low "weight"; it matters only as the absolute final tiebreakers when sorting words like "bits", "bit's", and "bit-shift" (which would appear in that order).

So-called ordinal sorting (strictly according to Unicode values, with or without case insensitivity) will produce very different and sometimes illogical results, as variants of letters usually appear well after the basic undecorated Latin alphabet in ASCII/Unicode ordinals, while symbols occur before it. For instance, "é" comes after "z" and so the words "resume", "rosin", "ruble", "résumé" would be sorted in that order. "Bit's", "Bit-shift", "Biter", "Bits" would be sorted in that order as the apostrophe comes first, followed by the dash, then the letter "e", then the letter "s". Neither of these seem logical from a "natural language" perspective.

KeithS
Code page only affects chars > 127, no?
gbn
Cool. By using the `CompareOption.StringSort`, I can run my test, with the current database collation settings.
Matt Brunell
@gbn - Technically yes. It's also C# that seems to be sorting "wrong"; without StringSort, symbols chars come after alphanumerics. I'll edit.
KeithS
+5  A: 

This is documented here.

Windows collations (e.g. Latin1_General_CI_AS) use Unicode type collation rules. SQL Collations don't.

This causes the hyphen character to be treated differently between the two.

Martin Smith
select * from fn_helpcollations()where name like '%SQL_Latin1_General_CP1_CI_AS%'or name = 'Latin1_General_CI_AS'
gbn
The non-unicode sort is different, uses CP 1252. I'm sure LATIN1.. did the same too... no it doesn't.. http://msdn.microsoft.com/en-us/library/ms143515.aspx
gbn