We are having trouble getting a Unicode string to convert to a UTF-8 string to send over the wire:

// Start with our unicode string.
string unicode = "Convert: \u10A0";

// Get an array of bytes representing the unicode string, two for each character.
byte[] source = Encoding.Unicode.GetBytes(unicode);

// Convert the Unicode bytes to UTF-8 representation.
byte[] converted = Encoding.Convert(Encoding.Unicode, Encoding.UTF8, source);

// Now that we have converted the bytes, save them to a new string.
string utf8 = Encoding.UTF8.GetString(converted);

// Send the converted string using a Microsoft function.
MicrosoftFunc(utf8);

Although we have converted the string to UTF-8, it's not arriving as UTF-8.

A:

After a much troubled and confusing morning, we found the answer to this problem.

The key point we were missing, and the reason this was so confusing, is that a .NET string is always stored as 16-bit (2-byte) Unicode, i.e. UTF-16, regardless of where its contents came from. So when we call GetString() on the UTF-8 bytes, they are decoded straight back into UTF-16 behind the scenes, and we are no better off than when we started.
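To make this concrete, here is a minimal sketch of the round trip, using the same sample text as the question. The class and method names (RoundTripDemo, RoundTripThroughUtf8) are made up for illustration; only the Encoding calls come from the framework.

```csharp
using System;
using System.Text;

public static class RoundTripDemo
{
    // Hypothetical helper: encode to UTF-8 bytes, then decode back to a string.
    public static string RoundTripThroughUtf8(string input)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(input);
        // GetString decodes the bytes; the result is an ordinary UTF-16 string again.
        return Encoding.UTF8.GetString(utf8Bytes);
    }

    public static void Main()
    {
        string original = "Convert: \u10A0";
        string roundTripped = RoundTripThroughUtf8(original);

        // The "UTF-8 string" is indistinguishable from the original string.
        Console.WriteLine(roundTripped == original);                        // True

        // Only the byte arrays differ: 12 bytes as UTF-8, 20 bytes as UTF-16.
        Console.WriteLine(Encoding.UTF8.GetBytes(original).Length);         // 12
        Console.WriteLine(Encoding.Unicode.GetBytes(roundTripped).Length);  // 20
    }
}
```

The encoding only exists in the byte array; once the bytes are turned back into a string, that information is gone.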

When we started seeing character errors and double-byte data at the other end, we knew something was wrong, but we couldn't spot the problem by looking at the code. Once we understood the point above, we realised that we needed to send the byte array itself if we wanted to preserve the encoding. Luckily, MicrosoftFunc() has an overload that takes a byte array instead of a string. This meant we could encode the Unicode string in the encoding of our choice and send it exactly as intended. The code changed to:

// Convert from a Unicode string to an array of bytes (encoded as UTF8).
byte[] source = Encoding.UTF8.GetBytes(unicode); 

// Send the encoded byte array directly! Do not send as a Unicode string.
MicrosoftFunc(source);

Summary:

From the above we can see that:

  • GetBytes(), amongst other things, effectively does an Encoding.Convert() from Unicode (because strings are always Unicode) to the encoding it was called on, and returns an array of encoded bytes.
  • GetString(), amongst other things, effectively does an Encoding.Convert() from the encoding it was called on to Unicode (because strings are always Unicode), and returns the result as a string object.
  • Convert() converts a byte array in one encoding to a byte array in another encoding. It works on bytes only; strings cannot be used (because strings are always Unicode).
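The three points above can be checked with a short sketch. The class and method names (EncodingSummaryDemo, ConvertMatchesGetBytes) are invented for this example, and it assumes the same sample text as the question.

```csharp
using System;
using System.Linq;
using System.Text;

public static class EncodingSummaryDemo
{
    // Hypothetical helper: true when converting UTF-16 bytes to UTF-8 with Convert()
    // gives the same bytes as calling Encoding.UTF8.GetBytes() on the string.
    public static bool ConvertMatchesGetBytes(string text)
    {
        byte[] utf16 = Encoding.Unicode.GetBytes(text);
        byte[] converted = Encoding.Convert(Encoding.Unicode, Encoding.UTF8, utf16);
        byte[] direct = Encoding.UTF8.GetBytes(text);
        return converted.SequenceEqual(direct);
    }

    public static void Main()
    {
        string text = "Convert: \u10A0";

        // GetBytes: string (UTF-16) -> bytes in the chosen encoding.
        byte[] utf8 = Encoding.UTF8.GetBytes(text);
        Console.WriteLine(BitConverter.ToString(utf8));

        // Convert: bytes in one encoding -> bytes in another.
        Console.WriteLine(ConvertMatchesGetBytes(text));          // True

        // GetString: bytes in the chosen encoding -> string (UTF-16).
        Console.WriteLine(Encoding.UTF8.GetString(utf8) == text); // True
    }
}
```

So the intermediate Encoding.Unicode.GetBytes() and Encoding.Convert() steps in the original code were not wrong, just unnecessary: Encoding.UTF8.GetBytes() gives the same bytes in one call.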
Nat Ryall