ansaurus

Question

String Comparison differences between .NET and T-SQL?

Answer 1

+2 A:

In SQL you used varchar which is basically ASCII (subject to collation) which will give - before 0
In C# all strings are Unicode

The finer points of UTF-xx (c#) vs UCS-2 (SQL Server) are quite tricky.

Edit:

I posted too soon

I get "Greater Than" on SQL Server 2008 with collation Latin1_General_CI_AS

Edit 2:

I'd also try SELECT ASCII(...) on your dash. For example, if the SQL snippet has ever been in a Word document the - (150) is not the - (45) I copied into SQL Server for testing out of my browser from your questions. See CP 1252 (= CP1 = SQL Server lingo)

Edit 3: See Martin Smith's answer: the 2 collations have different sort orders.

gbn 2010-09-27 21:54:10

Ah, I see. When I use nvarchar(20), I get `Greater Than`.

Matt Brunell 2010-09-27 22:00:20

I get this with varchar. Is your DB collation different to Server collation?

gbn 2010-09-27 22:03:00

Answer 2

+2 A:

Further to gbn's answer, you can make them behave the same by using CompareOptions.StringSort in C# (or by using StringComparison.Ordinal). This treats symbols as occurring before alphanumeric symbols, so "-" < "0".

However, Unicode vs ASCII doesn't explain anything, as the hex codes for the ASCII codepage are translated verbatim to the Unicode codepage: "-" is 002D (45) while "0" is 0030 (48).

What is happening is that .NET is using "linguistic" sorting by default, which is based on a non-ordinal ordering and weight applied to various symbols by the specified or current culture. This linguistic algorithm allows, for instance, "résumé" (spelled with accents) to appear immediately following "resume" (spelled without accents) in a sorted list of words, as "é" is given a fractional order just after "e" and well before "f". It also allows "cooperation" and "co-operation" to be placed closely together, as the dash symbol is given low "weight"; it matters only as the absolute final tiebreakers when sorting words like "bits", "bit's", and "bit-shift" (which would appear in that order).

So-called ordinal sorting (strictly according to Unicode values, with or without case insensitivity) will produce very different and sometimes illogical results, as variants of letters usually appear well after the basic undecorated Latin alphabet in ASCII/Unicode ordinals, while symbols occur before it. For instance, "é" comes after "z" and so the words "resume", "rosin", "ruble", "résumé" would be sorted in that order. "Bit's", "Bit-shift", "Biter", "Bits" would be sorted in that order as the apostrophe comes first, followed by the dash, then the letter "e", then the letter "s". Neither of these seem logical from a "natural language" perspective.

KeithS 2010-09-27 22:05:27

Code page only affects chars > 127, no?

gbn 2010-09-27 22:10:12

Cool. By using the `CompareOption.StringSort`, I can run my test, with the current database collation settings.

Matt Brunell 2010-09-27 22:16:46

@gbn - Technically yes. It's also C# that seems to be sorting "wrong"; without StringSort, symbols chars come after alphanumerics. I'll edit.

KeithS 2010-09-27 22:50:47

Answer 3

+5 A:

This is documented here.

Windows collations (e.g. Latin1_General_CI_AS) use Unicode type collation rules. SQL Collations don't.

This causes the hyphen character to be treated differently between the two.

Martin Smith 2010-09-27 22:14:23

select * from fn_helpcollations()where name like '%SQL_Latin1_General_CP1_CI_AS%'or name = 'Latin1_General_CI_AS'

gbn 2010-09-27 22:25:22

The non-unicode sort is different, uses CP 1252. I'm sure LATIN1.. did the same too... no it doesn't.. http://msdn.microsoft.com/en-us/library/ms143515.aspx

gbn 2010-09-27 22:26:10

ansaurus

tags:

views:

answers:

String Comparison differences between .NET and T-SQL?

related questions