Hi all,

I have a set of data that contains garbled text fields because of encoding errors during many imports/exports from one database to another. Most of the errors were caused by converting UTF-8 to ISO-8859-1. Strangely enough, the errors are not consistent: the word 'München' appears correctly as 'München' in some places and as the garbled 'MÃ¼nchen' elsewhere.

Is there a trick in SQL Server to correct this kind of crap? The first thing I can think of is to exploit the COLLATE clause, so that 'Ã¼' is interpreted as 'ü', but I don't know exactly how. If it isn't possible to fix this at the DB level, do you know of any tool that helps with bulk correction? (Not a manual find/replace tool, but a tool that somehow guesses the garbled text and corrects it.)

A: 

Download iconv - you can get binaries for Win32 as well as Unix/Linux. It is a command-line tool that takes a source file and, given the input and output encodings you specify, writes the converted text to STDOUT.

I find myself using this extensively to convert from UTF-16 (as output by SQL Server 2005 export files) to ASCII.
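For example, a minimal invocation to convert a file from UTF-8 to Latin-1 could look like this (iconv -l lists the encoding names your particular build supports):

    iconv -f UTF-8 -t ISO-8859-1 input.txt > output.txt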

You can download from here: http://gnuwin32.sourceforge.net/packages/libiconv.htm

PP
It is a nice tool, but here is the challenge: there is both *München* and *MÃ¼nchen* in the file. When I convert from UTF-8 to ISO-8859-1, the second one gets corrected, but the first one gets corrupted. I need something smarter.
ercan
Could be you have to write a best-guess script. Clearly you're carrying multiple encodings in a single file. Perhaps you could write a Perl script that decodes each line internally according to the following rule: if ( m/[\x80-\xFF][a-zA-Z0-9]/ ) { decode(iso) } else { decode(utf) }
PP
Sometimes both "München" and "MÃ¼nchen" appear in the same line! But you're right, it is almost impossible to make an intelligent guess in that situation. I will write a script and try converting every two-byte sequence that begins with 'Ã' to its ISO Latin 1 equivalent.
ercan
A: 

Given the complexity of the data (multiple encodings on a single line/entry), I think you'll have to export/dump the data and then run processing against that.

I think the best way will end up being a sequence of manual replacements. Maybe some kind of spell-checking code could find all the errors - then you can add explicit correction code, and iterate until the spell check stops finding errors.

(Obviously add any correct words to the dictionary for the spelling checker).
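For the explicit correction step, something like this could serve as a starting point (a minimal PHP sketch, not a complete solution; the replacement pairs are just examples based on the München case and would grow as the spell check finds more garbled words):

    $corrections = array(
        // "UTF-8 read as Latin-1" sequences mapped back to the intended character
        'Ã¼' => 'ü',
        'Ã¶' => 'ö',
        'Ã©' => 'é',
        // ...extend this map as the spell check turns up new errors
    );
    echo strtr('MÃ¼nchen', $corrections); // prints "München"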

Douglas Leeder
+1  A: 

I have been in exactly the same position. The production MySQL server was set up as latin1, old data was latin1, new data was utf8 but stored in latin1 columns, then utf8 columns were added, ... Each row could contain any number of encodings.

The big problem is that there is no single solution that corrects everything, because a lot of legacy encodings use the same bytes for different characters. That means you will have to resort to heuristics. In my "Utf8Voodoo" class, there is a huge array covering the bytes above 127 (0x80 to 0xFF), i.e. the non-ASCII characters of the legacy single-byte encodings.

    // ISO-8859-15 has the Euro sign, but ISO-8859-1 has also been used on the
    // site. Sigh. Windows-1252 has the Euro sign at 0x80 (and other printable
    // characters in 0x80-0x9F), but mb_detect_encoding never returns that
    // encoding when ISO-8859-* is in the detect list, so we cannot use it.
    // CP850 has accented letters and currency symbols in 0x80-0x9F. It occurs
    // just a few times, but enough to make it pretty much impossible to
    // automagically detect exactly which non-ISO encoding was used. Hence the
    // need for "likely bytes" in addition to the "magic bytes" below.

    /**
     * This array contains the magic bytes that determine possible encodings.
     * It works by elimination: the most specific byte patterns (the array's
     * keys) are listed first. When a match is found, the possible encodings
     * are that entry's value.
     */
    public static $legacyEncodingsMagicBytes = array(
        '/[\x81\x8D\x8F\x90\x9D]/' => array('CP850'),
        '/[\x80\x82-\x8C\x8E\x91-\x9C\x9E\x9F]/' => array('Windows-1252', 'CP850'),
        '/./' => array('ISO-8859-15', 'ISO-8859-1', 'Windows-1252', 'CP850'),
    );

    /**
     * This array contains the bytes that make it more likely for a string to
     * be a certain encoding. The keys are the pattern, the values are arrays
     * with (encoding => likeliness score modifier).
     */
    public static $legacyEncodingsLikelyBytes = array(
        // Byte | ISO-1  | ISO-15 | W-1252 | CP850
        // 0x80 | -      | -      | €      | Ç
        '/\x80/' => array(
            'Windows-1252' => +10,
        ),
        // Byte | ISO-1  | ISO-15 | W-1252 | CP850
        // 0x93 | -      | -      | “      | ô
        // 0x94 | -      | -      | ”      | ö
        // 0x95 | -      | -      | •      | ò
        // 0x96 | -      | -      | –      | û
        // 0x97 | -      | -      | —      | ù
        // 0x99 | -      | -      | ™      | Ö
        '/[\x93-\x97\x99]/' => array(
            'Windows-1252' => +1,
        ),
        // Byte | ISO-1  | ISO-15 | W-1252 | CP850
        // 0x86 | -      | -      | †      | å
        // 0x87 | -      | -      | ‡      | ç
        // 0x89 | -      | -      | ‰      | ë
        // 0x8A | -      | -      | Š      | è
        // 0x8C | -      | -      | Œ      | î
        // 0x8E | -      | -      | Ž      | Ä
        // 0x9A | -      | -      | š      | Ü
        // 0x9C | -      | -      | œ      | £
        // 0x9E | -      | -      | ž      | ×
        '/[\x86\x87\x89\x8A\x8C\x8E\x9A\x9C\x9E]/' => array(
            'Windows-1252' => -1,
        ),
        // Byte | ISO-1  | ISO-15 | W-1252 | CP850
        // 0xA4 | ¤      | €      | ¤      | ñ
        '/\xA4/' => array(
            'ISO-8859-15' => +10,
        ),
        // Byte | ISO-1  | ISO-15 | W-1252 | CP850
        // 0xA6 | ¦      | Š      | ¦      | ª
        // 0xBD | ½      | œ      | ½      | ¢
        '/[\xA6\xBD]/' => array(
            'ISO-8859-15' => -1,
        ),
        // Byte | ISO-1  | ISO-15 | W-1252 | CP850
        // 0x82 | -      | -      | ‚      | é
        // 0xA7 | §      | §      | §      | º
        // 0xFD | ý      | ý      | ý      | ²
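        // 0xCF | Ï      | Ï      | Ï      | ¤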
        '/[\x82\xA7\xCF\xFD]/' => array(
            'CP850' => +1
        ),
        // Byte | ISO-1  | ISO-15 | W-1252 | CP850
        // 0x91 | -      | -      | ‘      | æ
        // 0x92 | -      | -      | ’      | Æ
        // 0xB0 | °      | °      | °      | ░
        // 0xB1 | ±      | ±      | ±      | ▒
        // 0xB2 | ²      | ²      | ²      | ▓
        // 0xB3 | ³      | ³      | ³      | │
        // 0xB9 | ¹      | ¹      | ¹      | ╣
        // 0xBA | º      | º      | º      | ║
        // 0xBB | »      | »      | »      | ╗
        // 0xBC | ¼      | Œ      | ¼      | ╝
        // 0xC1 | Á      | Á      | Á      | ┴
        // 0xC2 | Â      | Â      | Â      | ┬
        // 0xC3 | Ã      | Ã      | Ã      | ├
        // 0xC4 | Ä      | Ä      | Ä      | ─
        // 0xC5 | Å      | Å      | Å      | ┼
        // 0xC8 | È      | È      | È      | ╚
        // 0xC9 | É      | É      | É      | ╔
        // 0xCA | Ê      | Ê      | Ê      | ╩
        // 0xCB | Ë      | Ë      | Ë      | ╦
        // 0xCC | Ì      | Ì      | Ì      | ╠
        // 0xCD | Í      | Í      | Í      | ═
        // 0xCE | Î      | Î      | Î      | ╬
        // 0xD9 | Ù      | Ù      | Ù      | ┘
        // 0xDA | Ú      | Ú      | Ú      | ┌
        // 0xDB | Û      | Û      | Û      | █
        // 0xDC | Ü      | Ü      | Ü      | ▄
        // 0xDF | ß      | ß      | ß      | ▀
        // 0xE7 | ç      | ç      | ç      | þ
        // 0xE8 | è      | è      | è      | Þ
        '/[\x91\x92\xB0-\xB3\xB9-\xBC\xC1-\xC5\xC8-\xCE\xD9-\xDC\xDF\xE7\xE8]/' => array(
            'CP850' => -1
        ),
        /* etc. */
    );

Then you loop over the bytes (not characters) in the strings and keep the scores. Let me know if you want some more info.
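For illustration, a scoring pass over the two arrays above might be wired up roughly like this (a rough sketch only, not the actual Utf8Voodoo code; the method name is made up):

    /**
     * Rough sketch: pick candidate encodings via the magic bytes, then let
     * every occurrence of a "likely" byte adjust each candidate's score.
     */
    public static function guessLegacyEncoding($string)
    {
        // The first magic-bytes pattern that matches determines the candidates.
        $candidates = array();
        foreach (self::$legacyEncodingsMagicBytes as $pattern => $encodings) {
            if (preg_match($pattern, $string)) {
                $candidates = array_fill_keys($encodings, 0);
                break;
            }
        }

        // Count the "likely" bytes and apply their score modifiers.
        foreach (self::$legacyEncodingsLikelyBytes as $pattern => $modifiers) {
            $count = preg_match_all($pattern, $string, $matches);
            foreach ($modifiers as $encoding => $modifier) {
                if (isset($candidates[$encoding])) {
                    $candidates[$encoding] += $count * $modifier;
                }
            }
        }

        // The highest-scoring candidate wins.
        arsort($candidates);
        reset($candidates);
        return key($candidates);
    }

Note that the patterns carry no /u modifier, so preg matches raw bytes here; the scoring really does run over bytes rather than decoded characters.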

janmoesen
This is written in PHP, right? A great example and a starting point for my problem too. Thanks! Do you have the complete class available on your website or somewhere else? I would understand if you don't want to open-source it, but it might be very helpful to others.
ercan
I would love to open-source it, but it was written on company time. I will have to check. And yes, it is PHP.
janmoesen