Hi folks,

I have an issue with encoding. I want to put data from a UTF-8-encoded file into an MSSQL 2008 database. MSSQL only supports UCS2 encoding, so I decided to explicitly convert the retrieved data.

// connect to page file
_fsPage = new FileStream(mySettings.filePage, FileMode.Open, FileAccess.Read);
_streamPage = new StreamReader(_fsPage, System.Text.Encoding.UTF8);

Here's the conversion routine for the data:

private string ConvertTitle(string title)
{
  string utf8_String = Regex.Replace(Regex.Replace(title, @"\\.", _myEvaluator), @"(?<=[^\\])_", " ");
  byte[] utf8_bytes = System.Text.Encoding.UTF8.GetBytes(utf8_String);
  byte[] ucs2_bytes = System.Text.Encoding.Convert(System.Text.Encoding.UTF8, System.Text.Encoding.Unicode, utf8_bytes);
  string ucs2_String = System.Text.Encoding.Unicode.GetString(ucs2_bytes);

  return ucs2_String;
}

When stepping through the code for critical titles, the variable watch shows the correct characters for both the utf8 and ucs2 strings. But in the database the result is partially wrong: some special chars are saved correctly, others are not.

Wrong: ń becomes n.

Right: É and é, for example, are inserted correctly.

Any idea where the problem might be and how to solve it?

Thanks in advance, Frank

+3  A: 

I think you have a misunderstanding of what encodings are. An encoding is used to convert a bunch of bytes into a character string. A String does not itself have an encoding associated with it.

Internally, Strings are stored in memory as UTF-16LE bytes (which is why Windows persists in confusing everyone by calling the UTF-16LE encoding just “Unicode”). But you don't need to know that — to you, they're just strings of characters.

What your function does is:

  1. Takes a string and converts it to UTF-8 bytes.
  2. Takes those UTF-8 bytes and converts them to UTF-16LE bytes. (You could have just encoded straight to UTF-16LE instead of UTF-8 in step one.)
  3. Takes those UTF-16LE bytes and converts them back to a string. This gives you the exact same String you had in the first place!

So this function is redundant; you can actually just pass a normal String to SQL Server from .NET and not worry about it.
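To make the three steps above concrete, here is a minimal sketch that runs the same conversions as ConvertTitle (without the Regex part) and checks that the result equals the input. The class name, method name and sample title are made up for illustration.

```csharp
using System;
using System.Text;

public static class RoundTripDemo
{
    // Same conversions as in ConvertTitle, minus the Regex replacements.
    public static string RoundTrip(string input)
    {
        // Step 1: string -> UTF-8 bytes
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(input);
        // Step 2: UTF-8 bytes -> UTF-16LE bytes
        byte[] utf16Bytes = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, utf8Bytes);
        // Step 3: UTF-16LE bytes -> string
        return Encoding.Unicode.GetString(utf16Bytes);
    }

    public static void Main()
    {
        string title = "Gda\u0144sk caf\u00E9"; // contains both ń and é
        Console.WriteLine(RoundTrip(title) == title); // prints "True"
    }
}
```

Since the output is identical to the input, the conversion neither helps nor hurts; the string can be passed on as it is.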

The bit with the backslashes does do something, presumably application-specific; I don't understand what it's for. But nothing in that function will cause Windows to flatten characters like ń to n.

What /will/ cause that kind of flattening is when you try to put characters that aren't in the database's own encoding in the database. Presumably é is OK because that character is in your default encoding of cp1252 Western European, but ń is not so it gets mangled.
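A rough way to see this from .NET is to push a string through a single-byte encoding and back. This sketch uses ISO-8859-1 as a stand-in for cp1252 (an assumption made only because it is available everywhere without extra setup; both contain é and neither contains ń) and an explicit '?' replacement. SQL Server itself appears to apply a best-fit mapping instead, which is why you see n rather than ?. The class and method names are hypothetical.

```csharp
using System;
using System.Text;

public static class FlatteningDemo
{
    // Encode to a single-byte code page and decode again. Characters the
    // code page does not contain are replaced with '?'.
    public static string ThroughLatin1(string input)
    {
        Encoding latin1 = Encoding.GetEncoding(
            "iso-8859-1",
            new EncoderReplacementFallback("?"),
            new DecoderReplacementFallback("?"));
        return latin1.GetString(latin1.GetBytes(input));
    }

    public static void Main()
    {
        // é (U+00E9) exists in the code page and survives.
        Console.WriteLine(ThroughLatin1("\u00E9") == "\u00E9"); // prints "True"
        // ń (U+0144) does not exist in the code page and is lost.
        Console.WriteLine(ThroughLatin1("\u0144")); // prints "?"
    }
}
```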

SQL Server does use ‘UCS2’ (really UTF-16LE again) to store Unicode strings, but you have to tell it to, typically by using a NATIONAL CHARACTER (NCHAR/NVARCHAR) column type instead of plain CHAR/VARCHAR.
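On the .NET side, a sketch of passing the string unchanged as an NVARCHAR parameter could look like this. The table name Pages, the column name Title, the length 255 and the class name are all assumptions for illustration, and the column itself is assumed to be declared as NVARCHAR. System.Data.SqlClient ships with the .NET Framework; on newer runtimes it is a separate package.

```csharp
using System.Data;
using System.Data.SqlClient;

public static class TitleInsert
{
    // Builds an INSERT command that sends the title as Unicode (NVARCHAR).
    // "Pages" and "Title" are hypothetical names.
    public static SqlCommand BuildInsert(string title)
    {
        var cmd = new SqlCommand("INSERT INTO Pages (Title) VALUES (@title)");
        // Typing the parameter as NVarChar keeps the value in Unicode
        // all the way to the server.
        cmd.Parameters.Add("@title", SqlDbType.NVarChar, 255).Value = title;
        return cmd;
    }
}
```

To run it, assign an open SqlConnection to the command's Connection property and call ExecuteNonQuery(). If you write the value as a literal in T-SQL instead of using a parameter, it needs the N prefix (N'...') to be treated as Unicode.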

bobince
Yap, this encoding/Unicode/UTF stuff still gives me headaches. Anyways, you hit the nail on the head. After changing my column from varchar to nvarchar, the character is stored correctly. Many thanks!
Aaginor
+1  A: 

Hey, we were also very confused about encoding. Here's a useful page that explains it:

http://www.joelonsoftware.com/articles/Unicode.html

Also the answer to this question will help to explain it too:

http://stackoverflow.com/questions/1426733/in-c-string-character-encoding-what-is-the-difference-between-getbytes-getstr

CraftyFella
Yap, I already read Joel's article and agree with you that it's a pretty good one.
Aaginor
A: 

Convert data from SQL Server to a UTF-8 encoded file: http://www.xoowiki.com/Article/Batch/sql-server-utf-8-477.aspx

Sacha