views:

537

answers:

5

This query:

select *
from op.tag
where tag = 'fussball'

Returns a result which has a tag column value of "fußball". Column "tag" is defined as nvarchar(150).

While I understand they are similar words grammatically, can anyone explain and defend this behavior? I assume it is related to the same collation settings which allow you to change case sensitivity on a column/table, but who would want this behavior? A unique constraint on the column also causes failure on inserts of one value when the other exists due to a constraint violation. How do I turn this off?

Follow-up bonus point question. Explain why this query does not return any rows:

select 1 
where 'fußball' = 'fussball'

Bonus question (answer?): @ScottCher pointed out to me privately that this is due to the string literal "fussball" being treated as a varchar. This query DOES return a result:

select 1 
where 'fußball' = cast('fussball' as nvarchar)

But then again, this one does not:

select 1 
where cast('fußball' as varchar) = cast('fussball' as varchar)

I'm confused.

+2  A: 

I guess the Unicode collation set for your connection/table/database specifies that ss == ß. The latter behavior would be because it's on a faulty fast path, or maybe it does a binary comparison, or maybe you're not passing in the ß in the right encoding (I agree it's stupid).

http://unicode.org/reports/tr10/#Searching mentions that U+00DF is special-cased. Here's an insightful excerpt:

Language-sensitive searching and matching are closely related to collation. Strings that compare as equal at some strength level are those that should be matched when doing language-sensitive matching. For example, at a primary strength, "ß" would match against "ss" according to the UCA, and "aa" would match "å" in a Danish tailoring of the UCA.

djc
Thanks! I edited to make a more direct link, and quote a relevant section of the article.
TheSoftwareJedi
+1  A: 

The SELECT does return a row with collation Latin1_General_CI_AS (SQL2000).

It does not with collation Latin1_General_BIN.

You can assign a table column a collation by using the COLLATE < collation > keyword after N/VARCHAR.

You can also compare strings with a specific collation using the syntax

string1 = string2 COLLATE < collation >
devio
+1  A: 

Some helper answers - not the complete one to your question, but still maybe helpful:

If you try:

SELECT 1 WHERE N'fußball' = N'fussball'  

you'll get "1" - when using the "N" to signify Unicode, the two strings are considered the same - why that's the case, I don't know (yet).

To find the default collation for a server, use

SELECT SERVERPROPERTY('Collation')

To find the collation of a given column in a database, use this query:

SELECT
  name 'Column Name', 
  OBJECT_NAME(object_id) 'Table Name', 
  collation_name
FROM sys.columns
WHERE object_ID = object_ID('your-table-name') 
AND name = 'your-column-name'
marc_s
+1  A: 

This isn't an answer that explains behavior, but may be relevant:

In this question, I learned that using the collation of

Latin1_General_Bin 

will avoid most collation quirks.

Michael J Swart
+1  A: 

Bonus question (answer?): @ScottCher pointed out to me privately that this is due to the string literal "fussball" being treated as a varchar. This query DOES return a result:

select 1 where 'fußball' = cast('fussball' as nvarchar)

Here you're dealing with the SQL Server data type precedence rules, as stated in Data Type Precedence. Comparisons are done always using the higher precedence type:

When an operator combines two expressions of different data types, the rules for data type precedence specify that the data type with the lower precedence is converted to the data type with the higher precedence.

Since nvarchar has a higher precedence than varchar, the comparison in your example will occur suing the nvarchar type, so it's really exactly the same as select 1 where N'fußball' =N'fussball' (ie. using Unicode types). I hope this also makes it clear why your last case doesn't return any row.

Remus Rusanu