I have some files that were created on Asian versions of Windows XP (Chinese and Japanese), and the file names are garbled, for example:

иè+¾«Ñ¡Õä²ØºÏ¼­

How can I recover the original text? I tried this in C#:

Encoding unicode = Encoding.Unicode;
Encoding chinese = Encoding.GetEncoding(936);
byte[] chineseBytes = chinese.GetBytes(garbledString);
byte[] unicodeBytes = Encoding.Convert(unicode, chinese, chineseBytes);
// (then convert the bytes back to a string)

I also tried changing unicode to windows-1252, but no luck.

A: 

The first argument to Encoding.Convert is the source encoding. Shouldn't that be chinese in your case? So

Encoding.Convert(chinese, unicode, chineseBytes);

might actually work. Because, after all, you want to convert CP-936 to Unicode and not vice-versa. And I'd suggest you don't even try bothering with CP-1252 since your text there is very likely not Latin.

Joey
I tried both combinations before asking the question, and neither worked. I assumed the one I posted was right, because the source encoding is not Chinese, right?
Magnetic_dud
+3  A: 

The text is double-encoded. The original is in Windows code page 936; some application then assumed the text was ISO-8859-1 and encoded the result as UTF-8. Here is an example of how to decode it in Python:

>>> print 'иè+¾«Ñ¡Õä²ØºÏ¼­'.decode('utf8').encode('latin1').decode('cp936')
新歌+精选珍藏合辑

I'm sure you can do something similar in C#.
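
For readers on Python 3, where str is already Unicode, the same chain can be sketched as below. This is only an illustration: it rebuilds the double-encoded bytes from the known-good text rather than from a pasted literal (which depends on the terminal encoding), and the variable names are made up for the example.

```python
# Illustrative sketch (Python 3); variable names are hypothetical.

original = "新歌+精选珍藏合辑"

# Step 1: the file name as the Chinese system stored it (CP-936 bytes).
raw = original.encode("cp936")

# Step 2: simulate the damage described above. The bytes are misread as
# ISO-8859-1 (every byte maps to one character) and saved as UTF-8.
double_encoded = raw.decode("latin-1").encode("utf-8")

# Step 3: undo it in reverse order: UTF-8 -> ISO-8859-1 bytes -> CP-936 text.
recovered = double_encoded.decode("utf-8").encode("latin-1").decode("cp936")

print(recovered == original)
```

If the garbled text is already a Unicode string rather than UTF-8 bytes, the first decode("utf-8") step is not needed.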

Lukáš Lalinský
Suggest `u'иè+¾«Ñ¡Õä²ØºÏ¼­'.encode('cp1252').decode('cp936')`: the UTF-8 was only coping with the pasted bytes. Either way this depends on the encoding of the terminal you paste the string into.
bobince
Yep, you are right.
Lukáš Lalinský
you both gave me a good hint on where to look, thank you
Magnetic_dud
A: 

Are you sure the source encoding is 936? I tried an online charset converter and don't see any valid Chinese characters.

St.Shadow
+2  A: 
Encoding unicode = Encoding.Unicode;

That's not what you want. “Unicode” is Microsoft's totally misleading name for what is really the UTF-16LE encoding. UTF-16LE plays no part here; what you have is a simple case where a 936 string has been misdecoded as 1252.

Windows codepage 1252 is similar to, but not the same as, ISO-8859-1. There is no way to tell which one was used for the example string, as it does not contain any of the bytes 0x80-0x9F that differ between the two encodings, but I'm assuming 1252 because that's the standard codepage on a western Windows install.

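Encoding latin = Encoding.GetEncoding(1252);
Encoding chinese = Encoding.GetEncoding(936);

// Re-encode the garbled string to its original bytes, then decode them as code page 936.
string recovered = chinese.GetString(latin.GetBytes(s));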
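
The point about bytes 0x80-0x9F can be checked with a short Python 3 sketch. It is an illustration only: the helper name is hypothetical, and the sample is rebuilt from the recovered Chinese text rather than from the pasted garbled string.

```python
# Illustrative sketch (Python 3); the helper name is hypothetical.

def uses_c1_range(data: bytes) -> bool:
    """True if any byte is in 0x80-0x9F, where CP-1252 and ISO-8859-1 disagree."""
    return any(0x80 <= b <= 0x9F for b in data)

original = "新歌+精选珍藏合辑"
raw = original.encode("cp936")

# The sample has no bytes in that range, so both misreadings look identical.
ambiguous = not uses_c1_range(raw)
same_garbling = raw.decode("cp1252") == raw.decode("latin-1")

# A byte in that range shows the difference: 0x80 is the euro sign in
# CP-1252 but a C1 control character in ISO-8859-1.
euro = b"\x80".decode("cp1252")
control = b"\x80".decode("latin-1")

# Reversal as in the answer: garbled text -> CP-1252 bytes -> CP-936 text.
garbled = raw.decode("cp1252")
recovered = garbled.encode("cp1252").decode("cp936")

print(ambiguous, same_garbling, recovered == original)
```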
bobince
it works, thank you!
Magnetic_dud