I am trying to convert a database that appears to be UTF-8 encoded into Windows-1251 encoding (don't ask, but I need to do this). All of the Russian characters in the db show up as абвгдÐ. When I pull them out of the db into strings in my C# app, I still see абвгдÐ. No matter how I try to interpret such a string as UTF-8, it seems to be treated as a Latin-1 single-byte string, and my text does not show up as Russian. What I basically need to do is convert this Latin-1-looking, UTF-8-encoded string into Unicode so that I can convert it to 1251 later, but I have not been able to do this successfully. Anyone got any ideas?
Encoding.UTF8.GetString(Encoding.GetEncoding("iso-8859-1").GetBytes(s))
Now you have a normal Unicode string containing Cyrillic.
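To see why this works, here is a minimal sketch of the round trip. The class and method names (MojibakeRepair, SimulateMojibake, FixLatin1Mojibake) are made up for illustration, and it assumes the damage really is "UTF-8 bytes read as Latin-1" and nothing else.

```csharp
using System;
using System.Text;

public static class MojibakeRepair
{
    // Hypothetical helper: reproduces the damage by reading UTF-8 bytes
    // as if every byte were one Latin-1 character.
    public static string SimulateMojibake(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.GetEncoding("iso-8859-1").GetString(utf8Bytes);
    }

    // Hypothetical helper: undoes the damage. Latin-1 maps each character
    // U+0000..U+00FF to exactly one byte, so GetBytes recovers the original
    // UTF-8 byte sequence, which is then decoded properly.
    public static string FixLatin1Mojibake(string garbled)
    {
        byte[] originalBytes = Encoding.GetEncoding("iso-8859-1").GetBytes(garbled);
        return Encoding.UTF8.GetString(originalBytes);
    }
}
```

For example, the Cyrillic letter а (U+0430) is the two UTF-8 bytes D0 B0, which Latin-1 shows as Ð followed by °; passing that two-character string to FixLatin1Mojibake should give back а.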
Note that it is possible that your ‘Latin-1’ misencoded string might actually be a ‘Windows codepage 1252’ misencoded string; I can't tell from the given example, as it doesn't use any of the characters that differ between the two encodings. If this is the case, use GetEncoding(1252) instead.
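A sketch of the 1252 variant, in case it helps. The two encodings only disagree on bytes 0x80 to 0x9F, so a letter such as р (U+0440, UTF-8 bytes D1 80) is a good test: read as codepage 1252 it becomes Ñ followed by €, and € cannot be encoded in Latin-1 at all. The class and method names are hypothetical. The RegisterProvider call is an assumption about your runtime: it is needed on .NET Core / .NET 5+, while on .NET Framework the code pages are built in and the line can be dropped.

```csharp
using System;
using System.Text;

public static class Cp1252Repair
{
    // Hypothetical helper: same idea as the Latin-1 fix, but re-encodes
    // the garbled characters with Windows codepage 1252 instead.
    public static string FixCp1252Mojibake(string garbled)
    {
        // Assumption: running on .NET Core / .NET 5+, where code pages such
        // as 1252 must be registered before GetEncoding can find them.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding cp1252 = Encoding.GetEncoding(1252);
        byte[] originalBytes = cp1252.GetBytes(garbled);
        return Encoding.UTF8.GetString(originalBytes);
    }
}
```

If the repaired text comes out with question marks or replacement characters where some letters should be, that is a hint you picked the wrong one of the two encodings.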
Also, this assumes that it's the contents of the database that are at fault. If the database is supposed to be storing UTF-8 strings but you're pulling them out as if they were Latin-1 (or codepage 1252, due to that being the system codepage), then really you need to reconfigure your data access layer to set the right encoding. If you're using SQL Server, it's better to start using NVARCHAR.
I am using SQL Server, and all columns are NVARCHAR. The data was imported from a MySQL dump of a db that was latin1, not utf8, so all the Unicode strings are simply Latin-1 encoded. In any case, I figured it out, and it's very similar to what you suggested. Here's what I did to convert the Latin-1-encoded UTF-8 into 1251.
// reinterpret the Latin-1 characters as the UTF-8 bytes they really are
str = Encoding.UTF8.GetString(Encoding.GetEncoding("iso-8859-1").GetBytes(str));
// convert from UTF-8 to 1251
str = Encoding.GetEncoding(1251).GetString(Encoding.Convert(Encoding.UTF8, Encoding.GetEncoding(1251), Encoding.UTF8.GetBytes(str)));
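One thing worth noting about the second statement above: Encoding.Convert does produce 1251 bytes, but GetString immediately decodes them back, so str is still an ordinary .NET (UTF-16) string. If what you need in the end is the actual 1251 bytes, for example to write to a file, a sketch along these lines should do it. The class and method names are hypothetical, and the RegisterProvider call is only needed on .NET Core / .NET 5+ (on .NET Framework the code pages are built in).

```csharp
using System;
using System.Text;

public static class Cp1251Export
{
    // Hypothetical helper: encodes an already-repaired Unicode string
    // as Windows-1251 bytes. Characters that 1251 cannot represent are
    // replaced with '?' by the default encoder fallback.
    public static byte[] ToWindows1251(string text)
    {
        // Assumption: running on .NET Core / .NET 5+, where code page 1251
        // must be registered before GetEncoding can find it.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        return Encoding.GetEncoding(1251).GetBytes(text);
    }
}
```

In Windows-1251 the lowercase letters а to я occupy bytes 0xE0 to 0xFF, so ToWindows1251 applied to а should return the single byte 0xE0.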
Small correction: the method name is GetBytes, with a capital G:
str = Encoding.UTF8.GetString(Encoding.GetEncoding("iso-8859-1").GetBytes(str));