Until recently, my blog used mismatched character encoding settings for PHP and MySQL. I have since fixed the underlying problem, but I still have a ton of text that is filled with garbage. For instance, ï has become Ã¯.

Is there software that can use pattern recognition and statistics to automatically discover broken text and fix it?

For example, it looks like U+00EF (UTF-8 0xC3 0xAF) has become U+00C3 U+00AF (UTF-8 0xC3 0x83 0xC2 0xAF). In other words, each byte of the original UTF-8 encoding has been treated as a code point in its own right. This has happened to (seemingly random) non-ASCII characters across my site.
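To make the byte pattern concrete, here is a minimal Python sketch that reproduces it. It assumes the misreading step was plain Latin-1 (ISO-8859-1); the variable names are only for illustration.

```python
original = "\u00ef"                       # ï, a single code point U+00EF
utf8_bytes = original.encode("utf-8")     # the two bytes 0xC3 0xAF

# Assumed corruption step: each UTF-8 byte is read as its own Latin-1 character.
garbled = utf8_bytes.decode("latin-1")    # U+00C3 U+00AF, displayed as "Ã¯"

# Storing that garbled text as UTF-8 again yields the four bytes seen on the site.
double_encoded = garbled.encode("utf-8")  # 0xC3 0x83 0xC2 0xAF
```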

A: 

You probably want to look into regular expressions (http://en.wikipedia.org/wiki/Regular%5Fexpression). With them you can search for the characters in question and replace them.

Here is the MySQL regex documentation: http://dev.mysql.com/doc/refman/5.1/en/regexp.html

Michael Baker
I've edited my post to clarify: It's not just this character. I mean, I could certainly dump the DB, locate all non-ASCII character sequences, find their original values (where appropriate) and run a straight find-and-replace across the file... but I'm looking for a more general solution.
phyzome
+1  A: 

The example you cite looks like good old utf8-over-latin1. You might quickly try out a query like:

select convert(convert(the_problem_column using binary) using utf8)

to see if it irons out the problem.

An encoding conversion along those lines should work as long as all of your data went through the same sequence of encoding transformations, and as long as none of those transformations were lossy - you're just reversing the effect of some of those transformations.
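Outside the database, the same reversal can be sketched in a few lines of Python. This is only an illustration under the assumption that the text went through exactly one round of "UTF-8 bytes read as Latin-1"; `fix_mojibake` is a made-up helper name, and if the bytes were actually read as Windows-1252 (which is what MySQL's latin1 really is) the codec name would need to change.

```python
def fix_mojibake(text: str) -> str:
    """Undo one round of UTF-8 bytes having been misread as Latin-1."""
    try:
        # Turn each character back into the byte it came from,
        # then decode those bytes as the UTF-8 they originally were.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text does not fit the assumed corruption; leave it untouched.
        return text


broken = "na\u00c3\u00afve"   # displays as "naÃ¯ve"
print(fix_mojibake(broken))   # naïve
```

Text that is already correct fails the round trip and comes back unchanged, so running it twice on the same value should not do further damage.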

If you can't rely on the data having gone through the same set of encoding transformations, then it's a matter of scanning through the data for garbage characters and replacing them with the intended character, which is risky because it depends on somebody's definition of what was garbage and what was intended.
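For that mixed case, a handmade scan might look something like the sketch below. It assumes the garbage is UTF-8 misread as Latin-1, looks only for character runs that have the shape of a UTF-8 multi-byte sequence, and rewrites a run only if it decodes cleanly. The names `SUSPECT` and `repair_mixed` are invented for this example, and the risk described above still applies: a run that the author really meant to write would be "fixed" too.

```python
import re

# Characters that look like a UTF-8 lead byte followed by the right number of
# continuation bytes (0x80-0xBF), after each byte was misread as Latin-1.
SUSPECT = re.compile(
    "[\u00c2-\u00df][\u0080-\u00bf]"        # 2-byte sequences
    "|[\u00e0-\u00ef][\u0080-\u00bf]{2}"    # 3-byte sequences
    "|[\u00f0-\u00f4][\u0080-\u00bf]{3}"    # 4-byte sequences
)


def _repair_match(match):
    chunk = match.group(0)
    try:
        return chunk.encode("latin-1").decode("utf-8")
    except UnicodeDecodeError:
        # Right shape but not valid UTF-8 (e.g. an overlong form): keep as-is.
        return chunk


def repair_mixed(text):
    """Repair only the suspicious runs, leaving already-correct text alone."""
    return SUSPECT.sub(_repair_match, text)


print(repair_mixed("caf\u00e9 na\u00c3\u00afve"))   # café naïve
```

Here the correctly stored é is left alone because it is not followed by a continuation-range character, while the garbled pair is turned back into ï.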

There is some discussion in this answer of how you might do that kind of repair using handmade scripts. I don't know of a tool that is aware of the full range of natural languages and encodings, that takes a more advanced statistical approach to spotting possible problems, and that recommends the exact transformation to fix them - something like that would be useful.

d__