Is there a rule when we must use the Unicode types?

I have seen that most of the European languages (German, Italian, English, ...) are fine in the same database in VARCHAR columns.

I am looking for something like:

  1. If you have Chinese --> use NVARCHAR
  2. If you have German and Arabic --> use NVARCHAR

What about the collation of the server/database?

I don't want to always use NVARCHAR, as suggested here: http://stackoverflow.com/questions/35366/varchar-vs-nvarchar-performance

+2  A: 

Greek would need Unicode, i.e. the N column types (SQL Server stores those as UTF-16, not UTF-8): αβγ ;)

cherouvim
+2  A: 

You should use NVARCHAR anytime you have to store multiple languages. I believe you have to use it for the Asian languages but don't quote me on it.

Here's the problem: if you take Russian, for example, and store it in a varchar, you will be fine so long as you define the correct code page. But let's say you're using a default English SQL install; then the Russian characters will not be handled correctly. If you were using NVARCHAR() they would be handled properly.
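A rough way to see the code page effect outside SQL Server is to encode sample strings with Python's built-in codecs. This is only a sketch: the sample strings and the helper name `survives` are made up for illustration, cp1252 is assumed to stand in for the code page of a default English install, and cp1251 for a Cyrillic one. SQL Server's own conversion rules may differ in detail (for example, it can apply best-fit mappings).

```python
def survives(text, code_page):
    """Return True if text round-trips through the given code page unchanged."""
    # Characters the code page cannot represent are replaced with b"?".
    encoded = text.encode(code_page, errors="replace")
    return encoded.decode(code_page) == text


german = "Straße"
italian = "città"
russian = "Привет"

print(survives(german, "cp1252"))    # Western European code page
print(survives(italian, "cp1252"))
print(survives(russian, "cp1252"))   # Cyrillic does not fit here
print(survives(russian, "cp1251"))   # Cyrillic code page
print(russian.encode("cp1252", errors="replace"))
```

German and Italian share one code page, which is why they coexist in a VARCHAR column; Russian only survives when the column's code page is a Cyrillic one.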

Edit

OK, let me quote MSDN. Maybe I was too specific, but you don't want to store more than one code page in a varchar column; while you can, you shouldn't:

When you deal with text data that is stored in the char, varchar, varchar(max), or text data type, the most important limitation to consider is that only information from a single code page can be validated by the system. (You can store data from multiple code pages, but this is not recommended.) The exact code page used to validate and store the data depends on the collation of the column. If a column-level collation has not been defined, the collation of the database is used. To determine the code page that is used for a given column, you can use the COLLATIONPROPERTY function, as shown in the following code examples:

Here's some more:

This example illustrates the fact that many locales, such as Georgian and Hindi, do not have code pages, as they are Unicode-only collations. Those collations are not appropriate for columns that use the char, varchar, or text data type

So Georgian or Hindi really need to be stored as nvarchar. Arabic is also a problem:

Another problem you might encounter is the inability to store data when not all of the characters you wish to support are contained in the code page. In many cases, Windows considers a particular code page to be a "best fit" code page, which means there is no guarantee that you can rely on the code page to handle all text; it is merely the best one available. An example of this is the Arabic script: it supports a wide array of languages, including Baluchi, Berber, Farsi, Kashmiri, Kazakh, Kirghiz, Pashto, Sindhi, Uighur, Urdu, and more. All of these languages have additional characters beyond those in the Arabic language as defined in Windows code page 1256. If you attempt to store these extra characters in a non-Unicode column that has the Arabic collation, the characters are converted into question marks.
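The question-mark behaviour described in that quote can be approximated with Python's cp1256 codec. This is a hedged sketch, not SQL Server itself: the helper `lost_characters` and the three sample letters are chosen purely for illustration, and Python's codec table is assumed to be close enough to Windows code page 1256 for the point to hold.

```python
def lost_characters(text, code_page="cp1256"):
    """Return the characters of text that the code page cannot represent."""
    lost = []
    for ch in text:
        try:
            ch.encode(code_page)
        except UnicodeEncodeError:
            lost.append(ch)
    return lost


arabic_beh = "\u0628"    # BEH, used in Arabic itself
sindhi_beeh = "\u067b"   # BEEH, used in Sindhi
pashto_seen = "\u069a"   # SEEN WITH DOT BELOW AND DOT ABOVE, used in Pashto

sample = arabic_beh + sindhi_beeh + pashto_seen

# List the code points that would be lost in a cp1256 column.
print([f"U+{ord(c):04X}" for c in lost_characters(sample)])

# With replacement enabled, the unmappable letters come out as question marks.
print(sample.encode("cp1256", errors="replace"))
```

Running a check like this over real sample data is one way to decide whether a single code page is actually enough for a column.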

Something to keep in mind when you are using Unicode: although you can store different languages in a single column, you can only sort using a single collation. There are some languages that use Latin characters but do not sort like other Latin languages. Accents are a good example of this. I can't remember the example, but there was an Eastern European language whose Y didn't sort like the English Y. Then there is the Spanish ch, which Spanish users expect to be sorted after h.

All in all, with all the issues you have to deal with in internationalization, it is my opinion that it is easier to just use Unicode from the start, avoid the extra conversions, and take the space hit. Hence my statement earlier.
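To put a number on the "space hit", here is a small sketch comparing byte counts. It assumes a cp1252 collation for the VARCHAR case and UTF-16 for NVARCHAR, and it ignores row overhead, compression, and characters outside the Basic Multilingual Plane (which take four bytes in UTF-16). The sample string is arbitrary.

```python
text = "Grüße aus Köln"

# One byte per character, like VARCHAR under a cp1252 collation.
single_byte = text.encode("cp1252")

# Two bytes per character for this text, like NVARCHAR.
utf16 = text.encode("utf-16-le")

print(len(text), len(single_byte), len(utf16))
```

For text like this the Unicode column needs twice the bytes for the character data.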

JoshBerke
>>You should use NVARCHAR anytime you have to store multiple languagesThis is not true. German and Italian and English fit well in same table with VARCHAR columns. Please be more specific
Peter Gfader
+4  A: 

The real reasons to use NVARCHAR are: you have different languages in the same column, you need to address the columns in T-SQL without decoding, you want to be able to see the data "natively" in SSMS, or you want to standardize on Unicode.

If you treat the database as dumb storage, it is perfectly possible to store wide strings and different (even variable-length) encodings in VARCHAR (for instance UTF-8). The problem comes when you are attempting to encode and decode, especially if the code page is different for different rows. It also means that the SQL Server will not be able to deal with the data easily for purposes of querying within T-SQL on (potentially variably) encoded columns.
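A minimal sketch of the "dumb storage" situation, using Python to play both the application and the server. The assumption here is that the application writes UTF-8 bytes into a VARCHAR column whose collation uses code page 1252; the variable names are illustrative only.

```python
greek = "αβγ"

# What the application writes into the VARCHAR column.
stored = greek.encode("utf-8")
print(len(greek), len(stored))  # characters vs. bytes actually stored

# What a tool sees if it interprets those bytes with the column's
# code page (assumed to be cp1252 here).
as_seen_by_server = stored.decode("cp1252")
print(as_seen_by_server)

# The application can still recover the text, because it knows the
# real encoding; the server does not.
print(stored.decode("utf-8") == greek)
```

The bytes survive, but anything that looks at the column through its code page (SSMS, LIKE, LEN, sorting) works on the garbled form, not on the original text.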

Using NVARCHAR avoids all this.

I would recommend NVARCHAR for any column which will have user-entered data in it which is relatively unconstrained.

I would recommend VARCHAR for any column which is a natural key (like a vehicle license plate, SSN, serial number, service tag, order number, airport callsign, etc) or user-entered, but very constrained (like a phone number) or a code (ACTIVE/CLOSED, Y/N, M/F, M/S/D/W, etc). There is absolutely no reason to use NVARCHAR for those.

So for a simple rule:

VARCHAR when guaranteed to be constrained, NVARCHAR otherwise.

Cade Roux
>> when you have different languages in the same column .... That's it!
Peter Gfader
+2  A: 

Josh says: "....Something to keep in mind when you are using Unicode: although you can store different languages in a single column, you can only sort using a single collation. There are some languages that use Latin characters but do not sort like other Latin languages. Accents are a good example of this. I can't remember the example, but there was an Eastern European language whose Y didn't sort like the English Y. Then there is the Spanish ch, which Spanish users expect to be sorted after h."

I'm a native Spanish speaker, and "ch" is not a letter but two letters, "c" and "h". The Spanish alphabet is: abcdefghijklmn ñ opqrstuvwxyz. We don't expect "ch" after "h"; "i" comes after "h". The alphabet is the same as in English except for the ñ, or in HTML "&ntilde;".
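For what it's worth, the "ch" idea most likely comes from the traditional Spanish ordering, which treated "ch" as a single letter placed after "c" (not after "h"); the modern ordering matches what I describe above. The sketch below contrasts the two. It is a simplification: the key function `traditional_spanish_key` is hypothetical, it handles only "ch", and it ignores "ll", ñ, and accents.

```python
def traditional_spanish_key(word):
    """Sort key treating the digraph 'ch' as one letter between 'c' and 'd'."""
    key = []
    lowered = word.lower()
    i = 0
    while i < len(lowered):
        if lowered.startswith("ch", i):
            key.append(("c", 1))  # ranks after plain 'c' (rank 0), before 'd'
            i += 2
        else:
            key.append((lowered[i], 0))
            i += 1
    return key


words = ["cuna", "chico", "dama", "casa"]

modern = sorted(words)                                   # plain letter-by-letter order
traditional = sorted(words, key=traditional_spanish_key)  # 'ch' as its own letter

print(modern)
print(traditional)
```

The same column can only follow one of these orders at a time, which is the single-collation point Josh was making.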

Alex

Hi Alex, did you ever store different languages in 1 column? We had different columns for different languages in 1 table.
Peter Gfader