ansaurus

Question

What is the character encoding that could match this conversion: From "§" To "Ç"?

Answer 1

+3 A:

The files themselves don't have the wrong encoding; the problem arises when you read a file and decode it with the wrong encoding.

The fix is to decode the file with the same encoding that was used to encode it.

If you don't know which encoding that is, find the byte values of the problematic characters before they are decoded, then look for an encoding whose character set maps those byte values to the characters that you want.
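One way to look at the bytes before any decoding takes place is to open the file in binary mode. The following is only a minimal Python sketch; the helper names `hex_bytes` and `show_file_bytes` and the sample data are made up for illustration.

```python
from pathlib import Path


def hex_bytes(data: bytes) -> str:
    """Return the bytes as space-separated hex pairs."""
    return data.hex(" ")


def show_file_bytes(path: str, limit: int = 64) -> str:
    """Read a file in binary mode (no decoding) and dump its first bytes."""
    return hex_bytes(Path(path).read_bytes()[:limit])


if __name__ == "__main__":
    # Illustrative data only: "§" stored as UTF-8 is the two bytes C2 A7.
    print(hex_bytes("§".encode("utf-8")))
```

Comparing such a dump with the code tables of candidate encodings shows which encoding maps the byte to the character you expect.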

For example, the file could be encoded using IBM905, so that the character "Ç" is stored as the byte value 74 (0x4A). If you then decode the file using IBM278, the byte value 74 is interpreted as the character "§".
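The same effect can be reproduced in a few lines of Python. This sketch uses `cp1026` and `cp037` purely as stand-ins, on the assumption that IBM905 and IBM278 are not available as Python codec names; the helper `roundtrip` is hypothetical. The point is only that decoding with a different encoding than the one used for encoding changes the character.

```python
def roundtrip(text: str, encode_with: str, decode_with: str) -> str:
    """Encode text with one codec and decode the resulting bytes with another."""
    return text.encode(encode_with).decode(decode_with)


if __name__ == "__main__":
    raw = "Ç".encode("cp1026")
    print(raw.hex())                            # the byte that ends up in the file
    print(roundtrip("Ç", "cp1026", "cp1026"))   # same codec: the character survives
    print(roundtrip("Ç", "cp1026", "cp037"))    # different codec: a wrong character
```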

Here is a list of the possible combinations that I found among the built-in encodings:

from cp875 to IBM290
from cp875 to IBM420
from cp875 to x-EBCDIC-KoreanExtended
from cp875 to IBM-Thai
from cp875 to IBM880
from IBM290 to IBM290
from IBM290 to IBM420
from IBM290 to x-EBCDIC-KoreanExtended
from IBM290 to IBM-Thai
from IBM290 to IBM880
from IBM420 to IBM290
from IBM420 to IBM420
from IBM420 to x-EBCDIC-KoreanExtended
from IBM420 to IBM-Thai
from IBM420 to IBM880
from IBM424 to IBM290
from IBM424 to IBM420
from IBM424 to x-EBCDIC-KoreanExtended
from IBM424 to IBM-Thai
from IBM424 to IBM880
from x-EBCDIC-KoreanExtended to IBM290
from x-EBCDIC-KoreanExtended to IBM420
from x-EBCDIC-KoreanExtended to x-EBCDIC-KoreanExtended
from x-EBCDIC-KoreanExtended to IBM-Thai
from x-EBCDIC-KoreanExtended to IBM880
from IBM-Thai to IBM290
from IBM-Thai to IBM420
from IBM-Thai to x-EBCDIC-KoreanExtended
from IBM-Thai to IBM-Thai
from IBM-Thai to IBM880
from IBM880 to IBM290
from IBM880 to IBM420
from IBM880 to x-EBCDIC-KoreanExtended
from IBM880 to IBM-Thai
from IBM880 to IBM880
from cp1025 to IBM290
from cp1025 to IBM420
from cp1025 to x-EBCDIC-KoreanExtended
from cp1025 to IBM-Thai
from cp1025 to IBM880
from IBM1026 to IBM01143
from IBM1026 to IBM278
from IBM905 to IBM01143
from IBM905 to IBM278
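A list like this can be produced by brute force: encode the intended character with every available encoding, decode the result with every other one, and keep the pairs that yield the observed character. Below is a rough Python sketch of that idea. It searches Python's standard codecs, which is not necessarily the same set as the one used for the list above, so the results may differ or be empty. The function name `candidate_pairs` is an assumption, and strict error handling is used, so characters an encoding cannot represent are skipped rather than replaced.

```python
from encodings.aliases import aliases


def candidate_pairs(intended: str, observed: str) -> list[tuple[str, str]]:
    """Find (encode_codec, decode_codec) pairs that turn intended into observed."""
    codecs_to_try = sorted(set(aliases.values()))

    # Encode the intended text once per codec; skip codecs that cannot do it.
    encoded = {}
    for name in codecs_to_try:
        try:
            encoded[name] = intended.encode(name)
        except (LookupError, ValueError):
            continue  # not a text codec, not available here, or unencodable

    pairs = []
    for enc_name, raw in encoded.items():
        for dec_name in codecs_to_try:
            try:
                if raw.decode(dec_name) == observed:
                    pairs.append((enc_name, dec_name))
            except (LookupError, ValueError):
                continue  # bytes are not valid in this codec
    return pairs


if __name__ == "__main__":
    for enc_name, dec_name in candidate_pairs("Ç", "§"):
        print(f"from {enc_name} to {dec_name}")
```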

Guffa 2010-06-23 11:42:46

@Guffa: I think that's what the question was (i.e. the process described in your last paragraph), using SO's Mechanical Turk implementation.

Andrzej Doyle 2010-06-23 11:46:01

@Guffa, see if that image helps to identify the encoding.

Acacio Nerull 2010-06-23 12:03:52

@Guffa, do you know if it is possible to do this conversion using PHP?

Acacio Nerull 2010-06-23 12:07:59

@Acacio: From the image I can tell that the character is encoded as the two bytes C2 A7. I haven't found any built-in encoding that decodes these as the character "Ç". It seems that the contents of the files were at some point decoded using the wrong encoding and then saved as UTF-8, so the original information is lost for good. The best you can do is to try to apply the wrong conversion in reverse and hope to get back as much information as possible.
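Applying the wrong conversion in reverse means: read the file as UTF-8, re-encode the text with the encoding that was wrongly used for decoding, and decode those bytes with the encoding the data was originally written in. The Python sketch below illustrates this under stated assumptions: the codec names `cp1252` and `utf-8` are placeholders (the actual pair for these files is not known), and `undo_wrong_decode` is a hypothetical helper. It can only recover characters that survived the wrong decoding intact.

```python
def undo_wrong_decode(garbled: str, wrong_codec: str, right_codec: str) -> str:
    """Re-encode with the codec that was wrongly used, then decode correctly."""
    original_bytes = garbled.encode(wrong_codec)
    return original_bytes.decode(right_codec)


if __name__ == "__main__":
    # Simulate the damage: UTF-8 bytes of "Ç" wrongly decoded as cp1252.
    garbled = "Ç".encode("utf-8").decode("cp1252")
    print(garbled)
    print(undo_wrong_decode(garbled, "cp1252", "utf-8"))
```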

Guffa 2010-06-23 12:15:33

Ok thanks a lot!

Acacio Nerull 2010-06-23 12:18:49

@Acacio: I don't know what capabilities PHP has for encoding and decoding, so I don't know if it's easy to try the recovery in PHP.

Guffa 2010-06-23 12:19:28
