ansaurus

Question

How to recover text from a wrong encoding?

Answer 1

A:

The first argument to Encoding.Convert is the source encoding. Shouldn't that be chinese in your case? So

Encoding.Convert(chinese, unicode, chineseBytes);

might actually work. After all, you want to convert CP-936 to Unicode, not vice versa. And I'd suggest you don't even bother with CP-1252, since your text is very likely not Latin.

Joey 2009-10-14 06:45:34

I tried both combinations before asking the question, and neither worked, so I thought the one I posted was right, because the source encoding is not Chinese, right?

Magnetic_dud 2009-10-14 09:07:49

Answer 2

+3 A:

It's double-encoded text. The original is in Windows-936; then some application assumed the text was in ISO-8859-1 and encoded the result as UTF-8. Here is an example of how to decode it in Python:

>>> print 'ÐÂ¸è+¾«Ñ¡Õä²ØºÏ¼'.decode('utf8').encode('latin1').decode('cp936')
新歌+精选珍藏合辑

I'm sure you can do something similar in C#.
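For readers on Python 3, where str and bytes are separate types, the same round trip can be sketched roughly as follows. The helper names garble and recover are hypothetical, introduced only for illustration; the sketch simulates the damage from the known original text instead of relying on a pasted string, since pasting can silently drop invisible characters.

```python
def garble(text: str) -> bytes:
    """Simulate the damage: CP936 bytes misread as ISO-8859-1, then saved as UTF-8."""
    return text.encode("cp936").decode("latin-1").encode("utf-8")


def recover(stored: bytes) -> str:
    """Undo the two wrong steps in reverse order."""
    return stored.decode("utf-8").encode("latin-1").decode("cp936")


original = "新歌+精选珍藏合辑"   # the text shown in the answer above
stored = garble(original)        # what the misbehaving application wrote out
recovered = recover(stored)      # back to the original Chinese text
```

The key point is that each wrong step is reversed in the opposite order it was applied: UTF-8 first, then the mistaken single-byte encoding, then the real one.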

Lukáš Lalinský 2009-10-14 06:50:14

Suggest `u'ÐÂ¸è+¾«Ñ¡Õä²ØºÏ¼'.encode('cp1252').decode('cp936')`: the UTF-8 was only coping with the pasted bytes. Either way this depends on the encoding of the terminal you paste the string into.

bobince 2009-10-14 08:47:13

Yep, you are right.

Lukáš Lalinský 2009-10-14 09:11:13

you both gave me a good hint on where to look, thank you

Magnetic_dud 2009-10-14 09:51:54

Answer 3

A:

Do you really know that the source encoding is 936? I tried an online charset converter and don't see any valid Chinese characters.

St.Shadow 2009-10-14 06:58:03

Answer 4

+2 A:

Encoding unicode = Encoding.Unicode;

That's not what you want. “Unicode” is Microsoft's totally misleading name for what is really the UTF-16LE encoding. UTF-16LE plays no part here; what you have is a simple case where a 936 string has been misdecoded as 1252.

Windows codepage 1252 is similar to, but not the same as, ISO-8859-1. There is no way to tell which one was used for the example string, as it does not contain any of the bytes 0x80-0x9F, which are where the two encodings differ, but I'm assuming 1252 because that's the standard codepage on a western Windows install.

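Encoding latin = Encoding.GetEncoding(1252);
Encoding chinese = Encoding.GetEncoding(936);

// Re-encode the garbled string as 1252 bytes, then decode those bytes as 936.
string recovered = chinese.GetString(latin.GetBytes(s));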
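The point about bytes 0x80-0x9F can be checked with a short Python 3 sketch. It assumes the original text is the one recovered elsewhere in this thread, and the variable names are purely illustrative.

```python
# CP936 bytes of the text recovered elsewhere in this thread.
raw = "新歌+精选珍藏合辑".encode("cp936")

# Bytes in the range where codepage 1252 and ISO-8859-1 disagree.
in_c1_range = [b for b in raw if 0x80 <= b <= 0x9F]

# With no bytes in that range, both wrong decodings give the same garbled text.
same_mojibake = raw.decode("cp1252") == raw.decode("latin-1")

# For comparison, one byte that the two encodings decode differently.
euro_1252 = b"\x80".decode("cp1252")      # the euro sign in codepage 1252
euro_latin1 = b"\x80".decode("latin-1")   # a C1 control character in ISO-8859-1
```

If a damaged string did contain bytes in that range, the choice between 1252 and ISO-8859-1 would matter, and some bytes (0x81, for example) have no mapping in Python's cp1252 codec at all.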

bobince 2009-10-14 08:44:24

it works, thank you!

Magnetic_dud 2009-10-14 09:45:21
