ansaurus

Question

Converting VB6 encoding application into C#

Answer 1

+2 A:

As ever, the key thing is to separate out each bit of the process, and check the strings at each stage.

So first write a program which just reads the file and dumps out the details of the strings, in terms of the Unicode values. I have some code on my strings page which will help with this. When you read the file, specify the encoding explicitly.
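A minimal sketch of such a dump program, assuming the input is a small UTF-8 file. The class name `StringDumper`, the helper `DumpCodeUnits`, and the temporary sample file are made up for illustration and are not the code from the strings page:

```csharp
using System;
using System.IO;
using System.Text;

public static class StringDumper
{
    // Formats each UTF-16 code unit of the text as U+XXXX, separated by spaces.
    public static string DumpCodeUnits(string text)
    {
        var sb = new StringBuilder();
        foreach (char c in text)
        {
            if (sb.Length > 0) sb.Append(' ');
            sb.Append("U+").Append(((int)c).ToString("X4"));
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // Hypothetical sample file: 'a' followed by U+201A encoded as UTF-8.
        string path = Path.GetTempFileName();
        File.WriteAllBytes(path, new byte[] { 0x61, 0xE2, 0x80, 0x9A });

        // Name the encoding explicitly instead of relying on a default.
        string text = File.ReadAllText(path, Encoding.UTF8);
        Console.WriteLine(DumpCodeUnits(text)); // prints "U+0061 U+201A"

        File.Delete(path);
    }
}
```

If the dump already shows unexpected values at this stage, the problem is in how the file is read, not in anything that happens later.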

Then write a separate program with hardcoded literals (using \uxxxx where necessary) to upload into the database. Then examine the strings in the database as accurately as you can. I would expect the actual uploading bit to just work, so long as the database has the appropriate settings.
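A rough sketch of the upload half, written against the generic `System.Data` interfaces so that it is not tied to one provider. The table `Texts`, the column `Content`, and the `@value` parameter syntax are assumptions for illustration; the actual names and parameter syntax depend on the database and provider in use, and the connection is expected to be open already:

```csharp
using System;
using System.Data;

public static class LiteralUpload
{
    // U+201A written as an escape, so the encoding of the source file cannot alter it.
    public const string Sample = "\u201A";

    // Inserts the text through a parameter rather than by building the SQL string.
    // "Texts", "Content" and "@value" are hypothetical names.
    public static int InsertText(IDbConnection connection, string text)
    {
        using (IDbCommand command = connection.CreateCommand())
        {
            command.CommandText = "INSERT INTO Texts (Content) VALUES (@value)";

            IDbDataParameter parameter = command.CreateParameter();
            parameter.ParameterName = "@value";
            parameter.DbType = DbType.String; // Unicode string type
            parameter.Value = text;
            command.Parameters.Add(parameter);

            return command.ExecuteNonQuery();
        }
    }

    public static void Main()
    {
        // Confirms the literal is the single code unit 8218 before it goes anywhere near the database.
        Console.WriteLine((int)Sample[0]); // prints 8218
    }
}
```

Passing the string as a parameter keeps the test focused on the database settings, since no string concatenation or literal escaping in the SQL text is involved.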

There's a bit more on this general process on my "debugging unicode problems" page.

Jon Skeet 2009-09-19 13:31:05

Answer 2

A:

After fiddling a bit I came up with this:

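// Requires: using System.Collections.Generic; using System.Text;
// _encoding is a field that is not shown here; going by the summary below,
// it is presumably Encoding.GetEncoding(1252).

/// <summary>
/// Some char codes produced by the Unicode character handling
/// do not map correctly to code page 1252. This function
/// decodes every char as code page 1252, unless the char
/// takes more than one byte. In that case the char is kept
/// as it is (round-tripped through Unicode).
/// </summary>
/// <param name="chars">The characters to fix.</param>
/// <returns>The string with all single-byte chars decoded as code page 1252.</returns>
private string GetStringAfterFixingEncoding(IEnumerable<char> chars)
{
    var result = new StringBuilder();

    foreach (var c in chars)
    {
        // UTF-16LE bytes of the char: low byte first, then high byte.
        var unicodeBytesForChar = Encoding.Unicode.GetBytes(new[] { c });

        if (unicodeBytesForChar.Length > 1 && unicodeBytesForChar[1] != 0)
        {
            // High byte is set, so the char is above U+00FF: keep it unchanged.
            result.Append(Encoding.Unicode.GetChars(unicodeBytesForChar)[0]);
        }
        else
        {
            // High byte is zero: decode the low byte as a code page 1252 byte.
            result.Append(_encoding.GetChars(unicodeBytesForChar)[0]);
        }
    }

    return result.ToString();
}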

Daniel 2009-09-19 18:51:57

Answer 3

+3 A:

130 is the windows-1252 encoding for the character U+201A (decimal 8218), "Single Low-9 Quotation Mark". If you decode it correctly, the resulting char will have the numeric value 8218 because .NET uses UTF-16 ("Unicode") internally.

It sounds like you decoded the windows-1252 byte sequence as ISO-8859-1, which maps 0x82 (decimal 130) to a control character with numeric value 130. If that's the case, the real solution to your problem is to go back and change the part that's decoding it wrong in the first place.
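The difference between the two decodings can be checked with a short sketch like the following. The class and method names are made up for illustration. The `Encoding.RegisterProvider` call is needed on .NET Core and later, where code page 1252 is not available by default; on the .NET Framework it can be left out:

```csharp
using System;
using System.Text;

public static class DecodeComparison
{
    // Decodes one byte with the given encoding and returns the numeric value of the resulting char.
    public static int DecodeSingleByte(byte value, Encoding encoding)
    {
        return encoding.GetString(new[] { value })[0];
    }

    public static void Main()
    {
        // Makes code page encodings such as 1252 available on .NET Core and later.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding windows1252 = Encoding.GetEncoding(1252);
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

        Console.WriteLine(DecodeSingleByte(0x82, windows1252)); // prints 8218 (U+201A)
        Console.WriteLine(DecodeSingleByte(0x82, latin1));      // prints 130 (control character U+0082)
    }
}
```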

Alan Moore 2009-09-20 01:04:02

Yes, but I don't own that data, and even if I have a copy of it, I am required to leave it in its original state. //D

Daniel 2009-09-20 05:31:42
