According to this, SQL Server 2K5 uses UCS-2 internally. It can store UTF-16 data in UCS-2 (with the appropriate data types, nchar etc.); however, a supplementary character is stored as two UCS-2 code units (a surrogate pair).

This brings the obvious issues with the string functions, namely that what is one character to the user is treated as two characters by SQL Server.

I am somewhat surprised that SQL Server is basically only able to handle UCS-2, and even more so that this is not fixed in SQL 2K8. I do appreciate that some of these characters may not be all that common.

Aside from the functions suggested in the article, are there any suggestions on the best approach for dealing with the (broken) string functions and UTF-16 data in SQL Server 2K5?
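For concreteness, a minimal repro of the miscount (U+1D11E, the musical G clef, is just an arbitrary supplementary character):

    -- U+1D11E lies outside the BMP, so it is stored as the surrogate
    -- pair 0xD834 0xDD1E, i.e. two UCS-2 code units.
    DECLARE @s nvarchar(10);
    SET @s = NCHAR(55348) + NCHAR(56606);  -- 0xD834 + 0xDD1E
    SELECT LEN(@s)        AS char_count,   -- returns 2 for what is one character
           DATALENGTH(@s) AS byte_count;   -- returns 4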

+1  A: 

The string functions work fine with Unicode character strings; the ones that care about the number of characters treat a two-byte character as a single character, not two. The only ones to watch for are len() and datalength(), which return different values when using Unicode. They both return correct values, of course: len() returns the length in characters, and datalength() returns the length in bytes. They just happen to differ because of the two-byte characters.

So, as long as you use the proper functions in your code, everything should work transparently.
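A quick illustration of that distinction for plane-0 data:

    -- For BMP characters the two functions differ only by the
    -- two-bytes-per-character factor.
    DECLARE @s nvarchar(10);
    SET @s = N'héllo';
    SELECT LEN(@s)        AS char_count,   -- 5
           DATALENGTH(@s) AS byte_count;   -- 10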

EDIT: Just double-checked Books Online; Unicode data has worked seamlessly with the string functions since SQL Server 2000.

EDIT 2: As pointed out in the comments, SQL Server's string functions do not support the full Unicode character set, because they cannot parse surrogate pairs (characters outside plane 0); in other words, SQL Server's string functions only recognize up to 2 bytes per character. SQL Server will store and return the data correctly, but any string function that relies on character counts will not return the expected values. The most common workarounds seem to be either processing the string outside SQL Server, or using CLR integration to add Unicode-aware string processing functions.
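For what it's worth, a character count that treats a surrogate pair as one character can also be sketched in plain T-SQL (the name dbo.UnicodeLen is made up here, and a WHILE loop will be slow on long strings):

    CREATE FUNCTION dbo.UnicodeLen (@s nvarchar(4000))
    RETURNS int
    AS
    BEGIN
        -- Subtract one for every high surrogate, since each surrogate
        -- pair contributes two UCS-2 code units to the unit count.
        DECLARE @i int, @units int, @points int;
        SET @units  = DATALENGTH(@s) / 2;  -- code units, trailing spaces included
        SET @points = @units;
        SET @i = 1;
        WHILE @i <= @units
        BEGIN
            -- High surrogates occupy U+D800..U+DBFF (55296..56319).
            IF UNICODE(SUBSTRING(@s, @i, 1)) BETWEEN 55296 AND 56319
                SET @points = @points - 1;
            SET @i = @i + 1;
        END;
        RETURN @points;
    END;

With the G clef example above, dbo.UnicodeLen returns 1 where len() returns 2.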

Rick
You have misunderstood the question. UTF-16 allows for supplementary characters. This works by storing a single character (from the user's perspective) in two code units, i.e. 4 bytes. UCS-2 does not handle supplementary characters. Hence the 4 bytes are treated as two characters by SQL Server when in fact they are one character.
That's only for characters outside the standard defined languages; the whitepaper states these are primarily historical languages.
Rick
Comment on the edit: SQL Server works fine on UCS-2 Unicode data, but UCS-2 is a deprecated standard; Windows has used UTF-16 internally since Win2K.
Sure. But to offer Unicode 3.1 support, the full character set should be supported.
From the whitepaper: In the Unicode standard, there are 16 planes of characters, with the potential to define as many as 1,114,112 characters. Plane 0, or the Basic Multilingual Plane (BMP), can represent most of the world's written scripts, characters used in publishing, mathematical and technical symbols, geometric shapes, all level-100 Zapf Dingbats, and punctuation marks.
Rick
Yes, but it *does* *not* support the full Unicode character set.
I suspect the reason it sticks with UCS-2 rather than UTF-16 is that UCS-2 limits every character to two bytes, but is otherwise identical to UTF-16. This gives UCS-2 a high degree of compatibility with UTF-16, while also offering a size consistency that makes the maximum sizes of char (8000 bytes) and nchar (4000 two-byte characters) easier to enforce. Whatever the justification for sticking with UCS-2 over UTF-16, it indeed does NOT support surrogate pairs and therefore does not support the full Unicode character set, and that really, really sucks.
Triynko
CLR integration will not fix this. If, in fact, a .NET string stores UTF-16 data and SQL server stores UCS-2 data, then the types are ultimately not fully compatible. In other words, if a UTF-16 string (with 4-byte characters) goes into SQL Server and comes back out unscathed, then the decoding process must be incorrect or overly complex and inconsistent. The only legitimate work-arounds are (1) stripping UTF-16 strings of incompatible characters or (2) reading and writing the string's original bytes and processing it as a string only outside of SQL Server.
Triynko
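As a sketch of work-around (2) above, the bytes themselves do round-trip safely as long as SQL Server is never asked to interpret them:

    -- Shuttle the raw bytes so no code units are interpreted server-side;
    -- all character-level processing then happens in the client.
    DECLARE @s nvarchar(10), @raw varbinary(20);
    SET @s   = NCHAR(55348) + NCHAR(56606);  -- surrogate pair for U+1D11E
    SET @raw = CAST(@s AS varbinary(20));    -- 4 bytes, stored untouched
    SELECT CAST(@raw AS nvarchar(10)) AS round_trip;  -- identical on the way out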
A: 

Something to add that I just learned the hard way:

If you use an "N" field in Oracle (I'm running 9i) and access it via the .NET OracleClient, it seems that only parameterized SQL will work... the N'string' Unicode prefix doesn't seem to do the trick if you have some inline SQL.

And by "work", I mean: it will lose any characters not supported by the base character set. In my case, English characters work fine, while Cyrillic turns into question marks/garbage.
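To make the distinction concrete, a sketch of the two statement shapes (the table t and NVARCHAR2 column ncol are made up):

    -- Inline literal: the statement text is converted through the
    -- client/database character set, so non-base-charset characters can be lost.
    INSERT INTO t (ncol) VALUES (N'привет');
    -- Bind variable: the parameter value stays in Unicode end to end.
    INSERT INTO t (ncol) VALUES (:p);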

Here is a fuller discussion of the subject: http://forums.oracle.com/forums/thread.jspa?threadID=376847

I wonder if the ORA_NCHAR_LITERAL_REPLACE variable can be set in the connection string or something.

boomhauer