views:

1814

answers:

3

I have an old MySQL database with encoding set to UTF-8. I am using Ado.Net Entity framework to connect to it.

The string that I retrieve from it have strange characters when ë like characters are expected.

For example: "ë" is "ë".

I thought I could get this right by converting from UTF8 to UTF16.

 return Encoding.Unicode.GetString(                
            Encoding.Convert(
            Encoding.UTF8,
            Encoding.Unicode,
            Encoding.UTF8.GetBytes(utf8)));
    }

This however doesn't change a thing.

How could I get the data from this database in proper form?

+1  A: 

Even if the database is set to UTF8 you must do the following things to get Unicode fields to work correctly:

  1. Ensure you are using a Unicode field type like NVARCHAR or TEXT CHARSET utf8
  2. Whenever you insert anything into the field you must prefix it with the N character to indicate Unicode data as shown in the examples below
  3. Whenever you select based on Unicode data ensure you use the N prefix again


MySqlCommand cmd = new MySqlCommand("INSERT INTO EXAMPLE (someField) VALUES (N'Unicode Data')");

MySqlCommand cmd2 = new MySqlCommand("SELECT * FROM EXAMPLE WHERE someField=N'Unicode Data'");

If the database wasn't configured correctly or the data was inserted without using the N prefix it won't be possible to get the correct data out since it will have been downcast into the Latin 1/ASCII character set

RobV
I'm afraid that that downcast issue is the case.The tables are in UTF8, but the fields are just VARCHAR. Is there no way to "upcast" it to utf8 or whatever coding to get the data back?
Peter
Strangely, the Ruby on Rails app that uses the database doesn't seem to have any problems with getting the data correctly
Peter
A: 

Try set the encoding by "set names utf8" query. You can set this parameter in mysql config too.

erenon
A: 

As others have said this could be a db issue, but it could also be caused by using an old version of the .net mysql connector.

What I actually wanted to comment on was the utf8 to utf16 conversion. The string you are trying to convert is actually alreay unicode encoded, so your "ë" characters actually takes up 4 bytes (or more) and are no longer, at the point of your conversion, a misrepresentation of the "ë" character. That is the reason why your conversion doesn't do anything. If you want to do a conversion like that I think you would have to encode your utf8 string as a old style 1 byte per character string, using a codepage where the byte values of à and « actually represent the utf8 byte sequence of ë and then treat the bytes of this new string as an utf8 string. Fun stuff.

JJJ