I'm importing files in codepage 1252 encoding to a SQL Server 2008 database.

Some of the data contains a comma-like character that isn't the ordinary comma (character code 44) but instead has the code 8218.

The column that contains this value is encrypted via an algorithm in VB6. When I implement the same algorithm in C#, I get the value 130, which does not match 8218.

What am I missing?

EDIT: Thought I would share the solution. Thank god for Reflector. It was that simple.

+2  A: 

As ever, the key thing is to separate out each bit of the process, and check the strings at each stage.

So first write a program which just reads the file and dumps out the details of the strings, in terms of the Unicode values. I have some code on my strings page which will help with this. When you read the file, specify the encoding explicitly.
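A minimal sketch of that first step, assuming the input is a windows-1252 text file. The class name `StringDumper`, the helper `DumpString`, and the default path `input.txt` are hypothetical names chosen for illustration; this is not the code from the strings page mentioned above.

```csharp
using System;
using System.IO;
using System.Text;

public static class StringDumper
{
    // Formats each UTF-16 code unit as "U+XXXX (decimal)", one per line,
    // so the exact values in the string can be inspected.
    public static string DumpString(string text)
    {
        var sb = new StringBuilder();
        foreach (char c in text)
        {
            sb.AppendFormat("U+{0:X4} ({0})", (int)c).AppendLine();
        }
        return sb.ToString();
    }

    public static void Main(string[] args)
    {
        // Assumption: on .NET Core / .NET 5+ the code pages provider must be
        // registered before asking for code page 1252. On the .NET Framework
        // of the question's era, code page 1252 is available without this line.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding cp1252 = Encoding.GetEncoding(1252);

        // Hypothetical input path; pass the real file as the first argument.
        string path = args.Length > 0 ? args[0] : "input.txt";

        // Specify the encoding explicitly rather than relying on a default.
        string contents = File.ReadAllText(path, cp1252);
        Console.Write(DumpString(contents));
    }
}
```

If the file is decoded correctly, the low quotation mark should show up as `U+201A (8218)`; a line reading `U+0082 (130)` would suggest the bytes were decoded with a different encoding somewhere.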

Then write a separate program with hardcoded literals (using \uxxxx where necessary) to upload into the database. Then examine the strings in the database as accurately as you can. I would expect the actual uploading bit to just work, so long as the database has the appropriate settings.

There's a bit more on this general process on my "debugging unicode problems" page.

Jon Skeet
A: 

After fiddling a bit I came up with this:

/// <summary>
/// Some char codes produced by Unicode character handling
/// do not map correctly to codepage 1252. This function
/// decodes every char as codepage 1252, unless the char
/// needs more than one byte (its UTF-16 high byte is non-zero).
/// In that case it is round-tripped through Unicode unchanged.
/// </summary>
/// <param name="chars">The characters to fix.</param>
/// <returns>The string with single-byte values decoded as codepage 1252.</returns>
private string GetStringAfterFixingEncoding(IEnumerable<char> chars)
{
    // _encoding is a field not shown in this snippet;
    // presumably it holds Encoding.GetEncoding(1252).
    var result = new StringBuilder();

    foreach (var c in chars)
    {
        // UTF-16LE bytes: index 0 is the low byte, index 1 the high byte.
        var unicodeBytesForChar = Encoding.Unicode.GetBytes(new[] { c });

        if (unicodeBytesForChar.Length > 1 && unicodeBytesForChar[1] != 0)
            result.Append(Encoding.Unicode.GetChars(unicodeBytesForChar)[0]);
        else
            // Only the first decoded char (from the low byte) is used.
            result.Append(_encoding.GetChars(unicodeBytesForChar)[0]);
    }

    return result.ToString();
}
Daniel
+3  A: 

130 is the windows-1252 encoding for the character U+201A (decimal 8218), "Single Low-9 Quotation Mark". If you decode it correctly, the resulting char will have the numeric value 8218 because .NET uses UTF-16 ("Unicode") internally.

It sounds like you decoded the windows-1252 byte sequence as ISO-8859-1, which maps 0x82 (decimal 130) to a control character with numeric value 130. If that's the case, the real solution to your problem is to go back and change the part that's decoding it wrong in the first place.
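The difference between the two decodings can be checked with a small sketch like the one below. The class name `DecodeComparison` and the helper `DecodeSingleByte` are hypothetical; the only assumption about the data is that the byte in question is 0x82.

```csharp
using System;
using System.Text;

public static class DecodeComparison
{
    // Decodes one byte with the given encoding and returns the numeric
    // value of the resulting char.
    public static int DecodeSingleByte(byte value, Encoding encoding)
    {
        return encoding.GetString(new[] { value })[0];
    }

    public static void Main()
    {
        // Assumption: needed on .NET Core / .NET 5+ to make code page 1252
        // available; on the older .NET Framework this line can be dropped.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding windows1252 = Encoding.GetEncoding(1252);
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

        // The same byte, two different interpretations.
        Console.WriteLine(DecodeSingleByte(0x82, windows1252)); // 8218 (U+201A)
        Console.WriteLine(DecodeSingleByte(0x82, latin1));      // 130  (U+0082)
    }
}
```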

Alan Moore
Yes, but I don't own that data, and even though I have a copy of it, I am required to leave it in its original state. //D
Daniel