My Perl app and MySQL database now handle incoming UTF-8 data properly, but I have to convert the pre-existing data. Some of the data appears to have been encoded as CP-1252 but not decoded as such before being encoded as UTF-8 and stored in MySQL. I've read the O'Reilly article Turning MySQL data in latin1 to utf8, but although it's frequently referenced, it's not a definitive solution.

I've looked at Encode::DoubleEncodedUTF8 and Encoding::FixLatin, but neither worked on my data.

This is what I've done so far:

use Encode qw(decode encode);

# $bytes is the raw value returned from the DB using BINARY()
my $characters = decode('utf-8', $bytes);
my $good = decode('utf-8', encode('cp-1252', $characters));

That fixes most of the cases, but when run against properly encoded records, it mangles them. I've tried using Encode::Guess and Encode::Detect, but they cannot distinguish between the properly encoded and the misencoded records, so I just undo the conversion if the \x{FFFD} character is found after the conversion.

Some records, though, are only partially converted. Here's an example where the left curly quote is properly converted, but the right curly quote gets mangled:

perl -CO -MEncode -e 'print decode("utf-8", encode("cp-1252", decode("utf-8", "\xC3\xA2\xE2\x82\xAC\xC5\x93four score\xC3\xA2\xE2\x82\xAC\xC2\x9D")))'

And here's an example where a right single quote did not convert:

perl -CO -MEncode -e 'print decode("utf-8", encode("cp-1252", decode("utf-8", "bob\xC3\xAF\xC2\xBF\xC2\xBDs")))'

Am I also dealing with double encoded data here? What more must I do to convert these records?

+5  A: 

With the "four score" example, it almost certainly is doubly-encoded data. It looks like either:

  1. cp1252 data that was run through a cp1252 to utf8 process twice, or
  2. utf8 data that was run through a cp1252 to utf8 process

(Naturally, both cases look identical)

Now, that's what you expected, so why didn't your code work?

First, I'd like to refer you to this table which shows the conversion from cp1252 to unicode. The important thing I want you to note is that there are some bytes (such as 0x9D) which are not valid in cp1252.

When I imagine writing a cp1252 to utf8 converter, therefore, I need to do something with those bytes that aren't in cp1252. The only sensible thing I can think of is to transform the unknown bytes into unicode characters with the same numeric value. In fact, this appears to be what happened. Let's take your "four score" example back one step at a time.

First, since it is valid utf-8, let's decode with:

$ perl -CO -MEncode -e '$a=decode("utf-8", 
  "\xC3\xA2\xE2\x82\xAC\xC5\x93" .
  "four score" .
  "\xC3\xA2\xE2\x82\xAC\xC2\x9D");
  for $c (split(//,$a)) {printf "%x ",ord($c);}' | fmt

This yields this sequence of unicode code points:

e2 20ac 153 66 6f 75 72 20 73 63 6f 72 65 e2 20ac 9d

("fmt" is a unix command that just reformats text so that we have nice line breaks with long data)

Now, let's represent each of these as a byte in cp1252, but when a unicode character can't be represented in cp1252, let's just replace it with a byte that has the same numeric value (instead of the default, which is to replace it with a question mark). If we're correct about what happened to the data, we should then have a valid utf8 byte stream.

$ perl -CO -MEncode -e '$a=decode("utf-8",
  "\xC3\xA2\xE2\x82\xAC\xC5\x93" .
  "four score" .
  "\xC3\xA2\xE2\x82\xAC\xC2\x9D");
  $a=encode("cp-1252", $a, sub { chr($_[0]) } );
  for $c (split(//,$a)) {printf "%x ",ord($c);}' | fmt

That third argument to encode - when it's a sub - tells what to do with unrepresentable characters.

This yields:

e2 80 9c 66 6f 75 72 20 73 63 6f 72 65 e2 80 9d

Now, this is a valid utf8 byte stream. Can't tell that by inspection? Well, let's ask perl to decode this byte stream as utf8:

$ perl -CO -MEncode -e '$a=decode("utf-8",
  "\xC3\xA2\xE2\x82\xAC\xC5\x93" .
  "four score" .
  "\xC3\xA2\xE2\x82\xAC\xC2\x9D");
  $a=encode("cp-1252", $a, sub { chr($_[0]) } );
  $a=decode("utf-8", $a, 1);
  for $c (split(//,$a)) {printf "%x ",ord($c);}' | fmt

Passing "1" as the third argument to decode ensures that our code will croak if the byte stream is invalid. This yields:

201c 66 6f 75 72 20 73 63 6f 72 65 201d

Or printed:

$ perl -CO -MEncode -e '$a=decode("utf-8",
  "\xC3\xA2\xE2\x82\xAC\xC5\x93" .
  "four score" .
  "\xC3\xA2\xE2\x82\xAC\xC2\x9D");
  $a=encode("cp-1252", $a, sub { chr($_[0]) } );
  $a=decode("utf-8", $a, 1);
  print "$a\n"'
“four score”

So I think that the full algorithm should be this:

  1. Grab the byte stream from mysql. Assign this to $bytestream.
  2. While $bytestream is a valid utf8 byte stream:
    1. Assign the current value of $bytestream to $good
    2. If $bytestream is all-ASCII (i.e., every byte is less than 0x80), break out of this "while ... valid utf8" loop.
    3. Set $bytestream to the result of "demangle($bytestream)", where demangle is given below. This routine undoes the cp1252-to-utf8 converter we think this data has suffered from.
  3. Put $good back in the database if it isn't undef. If $good was never assigned, assume $bytestream was a cp1252 byte stream and convert it to utf8. (Of course, optimize and don't do this if the loop in step 2 didn't change anything, etc.)


sub demangle {
  my($a) = shift;
  eval { # the non-string form of eval just traps exceptions
         # so that we return undef on exception
    local $SIG{__WARN__} = sub {}; # No warning messages
    $a = decode("utf-8", $a, 1);
    encode("cp-1252", $a, sub {$_[0] <= 255 or die $_[0]; chr($_[0])});
  }
}

This is based on the assumption that it's actually very rare for a string that isn't all-ASCII to be a valid utf-8 byte stream unless it really is utf-8. That is, it's not the sort of thing that happens accidentally.

EDITED TO ADD:

Note that this technique does not help much with your "bob's" example, unfortunately. I think that string also went through two rounds of cp1252-to-utf8 conversion, but there was also some corruption along the way. Using the same technique as before, we first read the byte sequence as utf8 and look at the sequence of unicode code points we get:

$ perl -CO -MEncode -e '$a=decode("utf-8",
  "bob\xC3\xAF\xC2\xBF\xC2\xBDs");
  for $c (split(//,$a)) {printf "%x ",ord($c);}' | fmt

This yields:

62 6f 62 ef bf bd 73

Now, it just so happens that for the three bytes ef bf bd, unicode and cp1252 agree. So representing this sequence of unicode code points in cp1252 is just:

62 6f 62 ef bf bd 73

That is, the same sequence of numbers. Now, this is in fact a valid utf-8 byte stream, but what it decodes to may surprise you:

$ perl -CO -MEncode -e '$a=decode("utf-8",
  "bob\xC3\xAF\xC2\xBF\xC2\xBDs");
  $a=encode("cp-1252", $a, sub { chr(shift) } );
  $a=decode("utf-8", $a, 1);
  for $c (split(//,$a)) {printf "%x ",ord($c);}' | fmt

62 6f 62 fffd 73

That is, the utf-8 byte stream, though legitimate, encoded the character U+FFFD, the Unicode replacement character, which is generally used for an untranslatable character. I suspect that what happened here is that the first *-to-utf8 transformation saw a character it didn't recognize and replaced it with "untranslatable". There's no way to programmatically recover the original character after that.

A consequence is that you can't detect whether a stream of bytes is valid utf8 (needed for that algorithm I gave above) simply by doing a decode and then looking for 0xFFFD. Instead, you should use something like this:

sub is_valid_utf8 {
  defined(eval { decode("utf-8", $_[0], 1) })
}
Daniel Martin