views:

466

answers:

2

I have a classic ASP page that gets POSTed to. The data gets POSTed as UTF-8 (I can see this in Fiddler). I then open an ADODB connection to a database and store the data in a VARCHAR field. If the data can be represented in ISO-8859-1 (e.g. iñtërnâtiônàlizætiøn) it is stored correctly in the VARCHAR field. If I try strings that can't be mapped to 8859-1 (e.g. Здравствуйте!) I get ????????????!. This all makes sense, as a VARCHAR field cannot hold Unicode. I also understand that using an NVARCHAR field should enable me to store these strings.
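The lossy conversion described above can be reproduced outside ASP and SQL Server. A minimal sketch, using Python's `cp1252` codec as a stand-in for the database's Latin1 code page (the ASP/ADODB stack is not involved here):

```python
# Hypothetical illustration: Python's cp1252 codec stands in for the
# single-byte code page a VARCHAR column uses under a Latin1 collation.
latin_ok = "iñtërnâtiônàlizætiøn"  # every character exists in Windows-1252
cyrillic = "Здравствуйте!"         # Cyrillic letters have no Windows-1252 mapping

# Round-trip through the code page, as storing in a VARCHAR column does.
stored_ok = latin_ok.encode("cp1252", errors="replace").decode("cp1252")
stored_bad = cyrillic.encode("cp1252", errors="replace").decode("cp1252")

print(stored_ok)   # iñtërnâtiônàlizætiøn  (survives intact)
print(stored_bad)  # ????????????!         (each unmappable character becomes '?')
```

Every character that cannot be mapped is replaced by `?`, which matches the `????????????!` seen in the database.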

My question is this. What settings in SQL Server or in the ADODB object control how the strings are converted from UTF-8 to 8859-1? Does VBScript (ASP) send the strings to ADODB.Connection.Execute as UTF-8 (or what I think it is actually doing - UTF-16) and the database itself handles the conversion? Is this controlled by the collation of the database (SQL_Latin1_General_CP1_CI_AS in this case)?

+1  A: 

You are correct.

VBScript and ADODB only know strings as Unicode (or UTF-16, as it's sometimes referred to).

It's the database's collation settings that determine how VARCHAR fields are encoded.

In SQL_Latin1_General_CP1_CI_AS, it's the CP1 part that determines which code page to use. Here, 1 is a legacy reference to code page 1252 (Windows-1252), which is a superset of ISO-8859-1.
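The superset relationship can be checked quickly (again using Python's codecs as a stand-in): Windows-1252 reuses the 0x80–0x9F range, which ISO-8859-1 reserves for control codes, for extra printable characters such as the euro sign.

```python
# The euro sign exists in Windows-1252 but not in strict ISO-8859-1.
euro = "\u20ac"

print(euro.encode("cp1252"))    # b'\x80' - encodes fine under Windows-1252
try:
    euro.encode("latin-1")      # strict ISO-8859-1 has no euro sign
except UnicodeEncodeError as e:
    print("not in ISO-8859-1:", e.reason)
```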

AnthonyWJones
Unicode is more than just UTF-16; UTF-16 is one of many Unicode encodings.
Dave DuPlantis
@Dave: ADODB, VBScript, VB6, .NET, the Windows API (you name it) all use the 2-byte encoding of Unicode. Hence the term "Unicode" has become synonymous with the UTF-16 encoding (for example, the API documentation for Scripting.FileSystemObject.OpenTextStream makes no mention of UTF-16, just "Unicode"). Whilst this is technically inaccurate, for all practical purposes it serves us well. I know of no system which actually stores Unicode characters as 32-bit words. In addition, UTF-8 is used consistently to refer to the 8-bit encoding. Hence most people understand "Unicode" to mean the 16-bit encoding.
AnthonyWJones
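The "2-byte encoding" point in the comments above can be illustrated (here in Python, as an aside): every character in the Basic Multilingual Plane occupies exactly two bytes in UTF-16, while characters outside it need a four-byte surrogate pair, which is what distinguishes UTF-16 from the older fixed-width UCS-2.

```python
# BMP characters: two bytes each in UTF-16.
s = "Здравствуйте!"                            # 13 characters
print(len(s.encode("utf-16-le")))              # 26 = 2 bytes per character

# Outside the BMP, UTF-16 uses a surrogate pair (4 bytes).
print(len("\U0001F600".encode("utf-16-le")))   # 4
```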
A: 

If you switch to using NVARCHAR instead, you'll need to remember to use the N prefix in your SQL commands whenever you use a string literal that is Unicode, like so:

INSERT INTO SOME_TABLE (someField) VALUES (N'Some Unicode Text')

SELECT * FROM SOME_TABLE WHERE someField=N'Some Unicode Text'

If you don't do this, the strings won't be treated as Unicode and your data will be silently converted to Latin1 (or whatever the default character set for the relevant database/table/field is), even if that field is an NVARCHAR.

RobV