views:

809

answers:

2

Hi, we faced very strange issue (really strange for such mature product): how to get number of characters in unicode string using Transact-SQL statements. The key problem of this issue that the len() TSQL function returns number of chars, excluding trailing blanks. the other variant is to use datalength (which return number of bytes) and divide by 2, so get numbers of unicode chars. But Unicode chars can be surrogate pairs so its wont work either.

We have 2 variants of solution, first is use len(replace()) and second is add single symbol and then substract 1 from result. But imo both variants are rather ugly.

declare @txt nvarchar(10)
set @txt = 'stack ' 
select @txt as variable, 
 len(@txt) as lenBehaviour, 
 DATALENGTH(@txt)/2 as datalengthBehaviour, 
 len(replace(@txt,' ','O')) as ReplaceBehaviour, 
 len(@txt+'.')-1 as addAndMinusBehaviour

Any other ideas how to count chars in string with traililng spaces ?

+2  A: 

I can't leave a comment so I will have to leave an answer (or shutup).

My vote would be for the addAndMinusBehaviour

I haven't got a good third alternative, there maybe some obscure whitespace rules to fiddle with in the options / SET / Collation assignment but don't know more detail off the top of my head.

but really addAndMinusBehaviour is probably the eaiest to implement, fastest to execute and if you document it, farily maintainable as well.

Rob
Vote up to give some points so you can comment in the future. (I hated not being able to comment too.)
Vaccano
+2  A: 

My understanding is that DATALENGTH(@txt)/2 should always give you the number of characters. SQL Server stores Unicode characters in UCS-2 which does not support surrogate pairs.

http://msdn.microsoft.com/en-us/library/ms186939.aspx

http://en.wikipedia.org/wiki/UCS2

Kim Major
After short discussion with our team lead we figured that UTF-16 text can be stored in UCS-2 chars subset. UTF-16 single symbol with surrogate will be transformed in 2 symbols of UCS-2 and thus broke correct len() result.The other bad side of locking on UTF-16 UCS-2 subset that you can't be sure about future versions of SQLServer. I cant found information about exactly SQL2008 nvarchar store mode, looks like it's similar to 2005's one you pointed. But as far we building long term supported system - we cant lock on that.BTW - good point about UCS-2, never noticed that before.
Alexey Shcherbak