The line below is an example from one of the many files I have with the wrong character encoding:

REAPRESENTA§AO VIA DTENTRY

The correct presentation should be this:

REAPRESENTAÇAO VIA DTENTRY

There are more characters with the wrong encoding. How do I correct this?

[image]

+3  A: 

The files themselves don't have the wrong encoding; the problem is that you use the wrong encoding to decode them when you read them.

The correction is to use the same encoding to decode the file that was used to encode it.
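As a minimal sketch of that idea in Python: read the raw bytes and decode them with an explicitly named encoding instead of relying on a default. The helper name `read_text_with` and the codec `cp850` in the note below are only illustrations; the real encoding of the files is not known from the question.

```python
from pathlib import Path


def read_text_with(path, encoding):
    """Read a file's raw bytes and decode them with the given encoding.

    `encoding` must be the encoding the file was written with;
    any other choice can turn characters such as "Ç" into garbage.
    """
    return Path(path).read_bytes().decode(encoding)
```

For instance, a file written as cp850 (an assumed example) reads back correctly with `read_text_with(path, "cp850")`, while `read_text_with(path, "latin-1")` would return a different character in place of "Ç".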

If you don't know which encoding that is, find the byte values of the problematic characters before they are decoded, then look for an encoding whose character set maps those byte values to the characters you want.
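One way to look at the bytes before any decoding happens is a small hex dump. The sketch below is in Python; the function names `hex_dump` and `dump_file` are hypothetical helpers, not part of any library.

```python
def hex_dump(data: bytes, width: int = 16) -> str:
    """Render raw bytes as offset, hex values, and a printable-ASCII preview."""
    lines = []
    for start in range(0, len(data), width):
        chunk = data[start:start + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Show printable ASCII as-is and every other byte as a dot.
        preview = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{start:08x}  {hex_part.ljust(width * 3)} {preview}")
    return "\n".join(lines)


def dump_file(path, limit: int = 256) -> str:
    """Hex-dump the first `limit` bytes of a file, read in binary mode."""
    with open(path, "rb") as f:
        return hex_dump(f.read(limit))
```

Printing `dump_file(path)` for one of the affected files shows which byte values sit where the wrong character appears, and those values can then be looked up in the code tables of candidate encodings.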

For example, the file could be encoded using IBM905, so that the character "Ç" is encoded as the byte value 74. If you then decode the file using IBM278, the byte value 74 is interpreted as the character "§".

Here is a list of the possible combinations that I found among the built-in encodings:

from cp875 to IBM290
from cp875 to IBM420
from cp875 to x-EBCDIC-KoreanExtended
from cp875 to IBM-Thai
from cp875 to IBM880
from IBM290 to IBM290
from IBM290 to IBM420
from IBM290 to x-EBCDIC-KoreanExtended
from IBM290 to IBM-Thai
from IBM290 to IBM880
from IBM420 to IBM290
from IBM420 to IBM420
from IBM420 to x-EBCDIC-KoreanExtended
from IBM420 to IBM-Thai
from IBM420 to IBM880
from IBM424 to IBM290
from IBM424 to IBM420
from IBM424 to x-EBCDIC-KoreanExtended
from IBM424 to IBM-Thai
from IBM424 to IBM880
from x-EBCDIC-KoreanExtended to IBM290
from x-EBCDIC-KoreanExtended to IBM420
from x-EBCDIC-KoreanExtended to x-EBCDIC-KoreanExtended
from x-EBCDIC-KoreanExtended to IBM-Thai
from x-EBCDIC-KoreanExtended to IBM880
from IBM-Thai to IBM290
from IBM-Thai to IBM420
from IBM-Thai to x-EBCDIC-KoreanExtended
from IBM-Thai to IBM-Thai
from IBM-Thai to IBM880
from IBM880 to IBM290
from IBM880 to IBM420
from IBM880 to x-EBCDIC-KoreanExtended
from IBM880 to IBM-Thai
from IBM880 to IBM880
from cp1025 to IBM290
from cp1025 to IBM420
from cp1025 to x-EBCDIC-KoreanExtended
from cp1025 to IBM-Thai
from cp1025 to IBM880
from IBM1026 to IBM01143
from IBM1026 to IBM278
from IBM905 to IBM01143
from IBM905 to IBM278
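
A search of this kind can be sketched in Python as follows. This is not necessarily how the list above was produced, and Python's codec names differ from the names used in that list. The function name `find_codec_pairs` and the candidate list passed to it are assumptions for illustration.

```python
def find_codec_pairs(intended, observed, codecs):
    """Find (encode_with, decode_with) pairs that turn `intended` into `observed`.

    Each pair means: text written with the first codec and wrongly read
    with the second codec would show `observed` instead of `intended`.
    """
    pairs = []
    for enc in codecs:
        try:
            raw = intended.encode(enc)
        except (UnicodeEncodeError, LookupError):
            continue  # this codec cannot represent the intended text
        for dec in codecs:
            try:
                if raw.decode(dec) == observed:
                    pairs.append((enc, dec))
            except (UnicodeDecodeError, LookupError):
                continue  # these bytes are not valid in this codec
    return pairs
```

Calling `find_codec_pairs("Ç", "§", candidates)` with a list of codec names available on your system returns the combinations worth trying; the more characters you include in `intended` and `observed`, the fewer false matches remain.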
Guffa
@Guffa: I think that's what the question was (i.e. the process described in your last paragraph), using SO's Mechanical Turk implementation.
Andrzej Doyle
@Guffa, see if that image helps to identify the encoding.
Acacio Nerull
@Guffa, do you know if it is possible to do this conversion using PHP?
Acacio Nerull
@Acacio: From the image I can tell that the character is encoded as the two bytes C2 A7. I haven't found any built-in encoding that decodes this as the character "Ç". It seems that the contents of the files were at some point decoded using the wrong encoding and then saved as UTF-8, so the original information is lost forever. The best you can do is try to perform the wrong conversion in reverse and hope to get back as much information as possible.
Guffa
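
The reverse conversion mentioned in the comment above could be attempted along these lines in Python. This is only a sketch: `undo_wrong_decoding` is a hypothetical helper, and the two codecs to pass in (the one wrongly used for decoding and the one the data was originally written in) are unknown for these files and would have to be found by trial.

```python
def undo_wrong_decoding(text, wrong_codec, right_codec):
    """Best-effort reversal of a wrong decode step.

    Re-encode the damaged text with the codec that was wrongly used to
    decode it, which recovers the original bytes where possible, then
    decode those bytes with the codec they were actually written in.
    Characters that cannot be mapped are replaced rather than raising.
    """
    raw = text.encode(wrong_codec, errors="replace")
    return raw.decode(right_codec, errors="replace")


# The bytes C2 A7 are the UTF-8 encoding of "§" (U+00A7), so the file
# has to be read as UTF-8 first to obtain the damaged text.
damaged = bytes([0xC2, 0xA7]).decode("utf-8")
```

Recovery only works for characters that the wrong codec mapped one-to-one; anything that was replaced or dropped during the earlier conversion cannot be brought back this way.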
Ok thanks a lot!
Acacio Nerull
@Acacio: I don't know what capabilities PHP has for encoding and decoding, so I don't know if it's easy to try the recovery in PHP.
Guffa