Hi all,

I have a set of data that contains garbled text fields because of encoding errors during many imports/exports from one database to another. Most of the errors were caused by converting UTF-8 to ISO-8859-1. Strangely enough, the errors are not consistent: the word 'München' appears correctly as 'München' in some places and as the garbled 'MÃ¼nchen' elsewhere.

Is there a trick in SQL Server to correct this kind of crap? The first thing I can think of is to exploit the COLLATE clause, so that 'Ã¼' is interpreted as 'ü', but I don't know exactly how. If it isn't possible to fix this at the DB level, do you know of any tool that helps with bulk correction? (Not a manual find/replace tool, but a tool that somehow guesses the garbled text and corrects it.)

A: 

Download iconv - you can get binaries for Win32 as well as Unix/Linux. It is a command-line tool that takes a source file and, given the input and output encodings you specify, writes the converted text to STDOUT.

I find myself using this extensively to convert from UTF-16 (as output by SQL Server 2005 export files) to ASCII.
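For example, a minimal invocation to convert a file from UTF-8 to Latin-1 could look like this (iconv -l lists the encoding names your particular build supports):

    iconv -f UTF-8 -t ISO-8859-1 input.txt > output.txt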

You can download from here: http://gnuwin32.sourceforge.net/packages/libiconv.htm

PP
It is a nice tool, but here is the challenge: there is both *München* and *MÃ¼nchen* in the file. When I convert from UTF-8 to ISO-8859-1, the second one gets corrected, but the first one gets corrupted. I need something smarter.
ercan
Could be you have to write a best-guess script. Clearly you're carrying multiple encodings in a single file. Perhaps you could write a Perl script that decodes each line internally according to the following rule: if ( m/[\x80-\xFF][a-zA-Z0-9]/ ) { decode(iso) } else { decode(utf) }
PP
Sometimes both "München" and "MÃ¼nchen" appear in the same line! But you're right, it is almost impossible to make an intelligent guess in that situation. I will write a script and try converting every two-byte sequence that begins with 'Ã' to its ISO Latin 1 equivalent.
ercan
A: 

Given the complexity of the data (multiple encodings on a single line/entry), I think you'll have to export/dump the data and then run processing against that.

I think the best way will end up being a sequence of manual replacements. Maybe some kind of spell-checking code could find all the errors - then you can add explicit correction code, and iterate until the spell check stops finding errors.

(Obviously add any correct words to the dictionary for the spelling checker).
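For the explicit correction step, something like this could serve as a starting point (a minimal PHP sketch, not a complete solution; the replacement pairs are just examples based on the München case and would grow as the spell check finds more garbled words):

    $corrections = array(
        // "UTF-8 read as Latin-1" sequences mapped back to the intended character
        'Ã¼' => 'ü',
        'Ã¶' => 'ö',
        'Ã©' => 'é',
        // ...extend this map as the spell check turns up new errors
    );
    echo strtr('MÃ¼nchen', $corrections); // prints "München"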

Douglas Leeder
+1  A: 

I have been in exactly the same position. The production MySQL server was set up as latin1, old data was latin1, new data was utf8 but stored in latin1 columns, then utf8 columns were added, ... Each row could contain any number of encodings.

The big problem is that there is no single solution that corrects everything, because a lot of legacy encodings use the same bytes for different characters. That means you will have to resort to heuristics. In my "Utf8Voodoo" class, there is a huge array covering the bytes above 127 (0x80 to 0xFF), i.e. the non-ASCII characters of the legacy single-byte encodings.

    // ISO-8859-15 has the Euro sign, but ISO-8859-1 has also been used on the
    // site. Sigh. Windows-1252 has the Euro sign at 0x80 (and other printable
    // characters in 0x80-0x9F), but mb_detect_encoding never returns that
    // encoding when ISO-8859-* is in the detect list, so we cannot use it.
    // CP850 has accented letters and currency symbols in 0x80-0x9F. It occurs
    // just a few times, but enough to make it pretty much impossible to
    // automagically detect exactly which non-ISO encoding was used. Hence the
    // need for "likely bytes" in addition to the "magic bytes" below.

    /**
     * This array contains the magic bytes that determine possible encodings.
     * It works by elimination: the most specific byte patterns (the array's
     * keys) are listed first. When a match is found, the possible encodings
     * are that entry's value.
     */
    public static $legacyEncodingsMagicBytes = array(
        '/[\x81\x8D\x8F\x90\x9D]/' => array('CP850'),
        '/[\x80\x82-\x8C\x8E\x91-\x9C\x9E\x9F]/' => array('Windows-1252', 'CP850'),
        '/./' => array('ISO-8859-15', 'ISO-8859-1', 'Windows-1252', 'CP850'),
    );

    /**
     * This array contains the bytes that make it more likely for a string to
     * be a certain encoding. The keys are the pattern, the values are arrays
     * with (encoding => likeliness score modifier).
     */
    public static $legacyEncodingsLikelyBytes = array(
        // Byte | ISO-1  | ISO-15 | W-1252 | CP850
        // 0x80 | -      | -      | €      | Ç
        '/\x80/' => array(
            'Windows-1252' => +10,
        ),
        // Byte | ISO-1  | ISO-15 | W-1252 | CP850
        // 0x93 | -      | -      | “      | ô
        // 0x94 | -      | -      | ”      | ö
        // 0x95 | -      | -      | •      | ò
        // 0x96 | -      | -      | –      | û
        // 0x97 | -      | -      | —      | ù
        // 0x99 | -      | -      | ™      | Ö
        '/[\x93-\x97\x99]/' => array(
            'Windows-1252' => +1,
        ),
        // Byte | ISO-1  | ISO-15 | W-1252 | CP850
        // 0x86 | -      | -      | †      | å
        // 0x87 | -      | -      | ‡      | ç
        // 0x89 | -      | -      | ‰      | ë
        // 0x8A | -      | -      | Š      | è
        // 0x8C | -      | -      | Œ      | î
        // 0x8E | -      | -      | Ž      | Ä
        // 0x9A | -      | -      | š      | Ü
        // 0x9C | -      | -      | œ      | £
        // 0x9E | -      | -      | ž      | ×
        '/[\x86\x87\x89\x8A\x8C\x8E\x9A\x9C\x9E]/' => array(
            'Windows-1252' => -1,
        ),
        // Byte | ISO-1  | ISO-15 | W-1252 | CP850
        // 0xA4 | ¤      | €      | ¤      | ñ
        '/\xA4/' => array(
            'ISO-8859-15' => +10,
        ),
        // Byte | ISO-1  | ISO-15 | W-1252 | CP850
        // 0xA6 | ¦      | Š      | ¦      | ª
        // 0xBD | ½      | œ      | ½      | ¢
        '/[\xA6\xBD]/' => array(
            'ISO-8859-15' => -1,
        ),
        // Byte | ISO-1  | ISO-15 | W-1252 | CP850
        // 0x82 | -      | -      | ‚      | é
        // 0xA7 | §      | §      | §      | º
        // 0xFD | ý      | ý      | ý      | ²
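        // 0xCF | Ï      | Ï      | Ï      | ¤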
        '/[\x82\xA7\xCF\xFD]/' => array(
            'CP850' => +1
        ),
        // Byte | ISO-1  | ISO-15 | W-1252 | CP850
        // 0x91 | -      | -      | ‘      | æ
        // 0x92 | -      | -      | ’      | Æ
        // 0xB0 | °      | °      | °      | ░
        // 0xB1 | ±      | ±      | ±      | ▒
        // 0xB2 | ²      | ²      | ²      | ▓
        // 0xB3 | ³      | ³      | ³      | │
        // 0xB9 | ¹      | ¹      | ¹      | ╣
        // 0xBA | º      | º      | º      | ║
        // 0xBB | »      | »      | »      | ╗
        // 0xBC | ¼      | Œ      | ¼      | ╝
        // 0xC1 | Á      | Á      | Á      | ┴
        // 0xC2 | Â      | Â      | Â      | ┬
        // 0xC3 | Ã      | Ã      | Ã      | ├
        // 0xC4 | Ä      | Ä      | Ä      | ─
        // 0xC5 | Å      | Å      | Å      | ┼
        // 0xC8 | È      | È      | È      | ╚
        // 0xC9 | É      | É      | É      | ╔
        // 0xCA | Ê      | Ê      | Ê      | ╩
        // 0xCB | Ë      | Ë      | Ë      | ╦
        // 0xCC | Ì      | Ì      | Ì      | ╠
        // 0xCD | Í      | Í      | Í      | ═
        // 0xCE | Î      | Î      | Î      | ╬
        // 0xD9 | Ù      | Ù      | Ù      | ┘
        // 0xDA | Ú      | Ú      | Ú      | ┌
        // 0xDB | Û      | Û      | Û      | █
        // 0xDC | Ü      | Ü      | Ü      | ▄
        // 0xDF | ß      | ß      | ß      | ▀
        // 0xE7 | ç      | ç      | ç      | þ
        // 0xE8 | è      | è      | è      | Þ
        '/[\x91\x92\xB0-\xB3\xB9-\xBC\xC1-\xC5\xC8-\xCE\xD9-\xDC\xDF\xE7\xE8]/' => array(
            'CP850' => -1
        ),
        /* etc. */
    );

Then you loop over the bytes (not characters) in the strings and keep the scores. Let me know if you want some more info.
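For illustration, a scoring pass over the two arrays above might be wired up roughly like this (a rough sketch only, not the actual Utf8Voodoo code; the method name is made up):

    /**
     * Rough sketch: pick candidate encodings via the magic bytes, then let
     * every occurrence of a "likely" byte adjust each candidate's score.
     */
    public static function guessLegacyEncoding($string)
    {
        // The first magic-bytes pattern that matches determines the candidates.
        $candidates = array();
        foreach (self::$legacyEncodingsMagicBytes as $pattern => $encodings) {
            if (preg_match($pattern, $string)) {
                $candidates = array_fill_keys($encodings, 0);
                break;
            }
        }

        // Count the "likely" bytes and apply their score modifiers.
        foreach (self::$legacyEncodingsLikelyBytes as $pattern => $modifiers) {
            $count = preg_match_all($pattern, $string, $matches);
            foreach ($modifiers as $encoding => $modifier) {
                if (isset($candidates[$encoding])) {
                    $candidates[$encoding] += $count * $modifier;
                }
            }
        }

        // The highest-scoring candidate wins.
        arsort($candidates);
        reset($candidates);
        return key($candidates);
    }

Note that the patterns carry no /u modifier, so preg matches raw bytes here; the scoring really does run over bytes rather than decoded characters.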

janmoesen
This is written in PHP, right? A great example and a starting point for my problem too. Thanks! Do you have the complete class available on your website or somewhere else? I would understand if you don't want to open-source it, but it might be very helpful to others.
ercan
I would love to open-source it, but it was written on company time. I will have to check. And yes, it is PHP.
janmoesen