Until recently, my blog used mismatched character encoding settings for PHP and MySQL. I have since fixed the underlying problem, but I still have a ton of text that is filled with garbage. For instance, ï has become Ã¯.
Is there software that can use pattern recognition and statistics to automatically discover broken text and fix it?
For example, it looks like U+00EF (UTF-8 0xC3 0xAF) has become U+00C3 U+00AF (UTF-8 0xC3 0x83 0xC2 0xAF). In other words, each byte of the original UTF-8 encoding has been treated as a code point of its own and then encoded as UTF-8 a second time. This pattern has affected (seemingly random) non-ASCII characters across my site.
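
To make the pattern concrete, here is a minimal Python sketch of what I think happened, along with the obvious way to reverse it. It assumes the intermediate charset was ISO-8859-1 (Latin-1); if MySQL's `latin1` was involved, the real mapping is closer to Windows-1252, which differs for bytes 0x80 to 0x9F, so this would not cover every case. The function names `double_encode` and `try_repair` are just illustrative.

```python
def double_encode(text: str) -> str:
    """Reproduce the corruption: UTF-8 bytes misread as Latin-1 characters."""
    return text.encode("utf-8").decode("latin-1")


def try_repair(text: str) -> str:
    """Undo one round of the corruption, or return the text unchanged.

    Assumption: the wrong charset was Latin-1. Text that does not
    round-trip cleanly is treated as not broken and left alone.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


broken = double_encode("na\u00efve")      # U+00EF becomes U+00C3 U+00AF
print(broken.encode("utf-8").hex(" "))    # bytes as stored after corruption
print(try_repair(broken))
```

Text that is already correct usually fails the round trip (a lone 0xEF is not valid UTF-8) and is returned untouched, but this is only a heuristic; it does not do any statistical detection.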