ansaurus

Question

How do I recover a document that has been sent through the character encoding wringer?

Answer 1

A:

You probably want to look into regular expressions (http://en.wikipedia.org/wiki/Regular%5Fexpression). With them you can search for the characters in question and replace them.

Here is the MySQL regex documentation, http://dev.mysql.com/doc/refman/5.1/en/regexp.html.

Michael Baker 2009-09-12 23:23:10

I've edited my post to clarify: it's not just this one character. I mean, I could certainly dump the DB, locate all non-ASCII character sequences, find their original values (where appropriate) and run a straight find-and-replace across the file... but I'm looking for a more general solution.

phyzome 2009-09-14 14:03:45

Answer 2

A:

The example you cite looks like good old utf8-over-latin1: UTF-8 bytes that were interpreted as latin1 somewhere along the way. You might quickly try out a query like:

select convert(convert(the_problem_column using binary) using utf8)

to see if it irons out the problem.

An encoding conversion along those lines should work as long as all of your data went through the same sequence of encoding transformations and none of those transformations were lossy; you're just reversing the effect of some of them.
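To see why reversing the transformation works, here is a minimal Python sketch of the same round trip outside the database. The function name and the sample string are made up for illustration, and Python's latin-1 codec is only a stand-in: MySQL's latin1 is closer to Windows-1252, so real data may need cp1252 instead.

```python
def undo_utf8_over_latin1(text):
    """Reverse one round of 'UTF-8 bytes misread as Latin-1' (hypothetical helper)."""
    # Encoding with the wrong charset recovers the original bytes;
    # decoding those bytes as UTF-8 then recovers the intended text.
    return text.encode("latin-1").decode("utf-8")


original = "caf\u00e9"  # "café"

# Simulate the damage: UTF-8 bytes interpreted as Latin-1.
mangled = original.encode("utf-8").decode("latin-1")

# Because that step lost no information, it can be undone exactly.
repaired = undo_utf8_over_latin1(mangled)

# A lossy step cannot be undone: the accented letter is replaced by "?"
# and nothing in the result says which character used to be there.
lossy = original.encode("ascii", errors="replace").decode("ascii")
```

If the text had been mangled twice, the same helper would have to be applied twice, which is why it matters that all of the data went through the same sequence of transformations.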

If you can't rely on the data having gone through the same set of encoding transformations, then it's a matter of scanning through the data for garbage characters and replacing them with the intended character, which is risky because it depends on somebody's definition of what was garbage and what was intended.
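A rough Python sketch of that kind of scan follows, assuming the only damage is UTF-8 read as Latin-1 and that correct text is mixed in with it. All names here are hypothetical. It rewrites a run of suspicious characters only when the run is valid UTF-8 once turned back into bytes, which is exactly the sort of "definition of garbage" that can be wrong for some data.

```python
import re

# Runs of characters from U+0080 to U+00FF. UTF-8 bytes misread as
# Latin-1 always land in this range; legitimate accented text does too.
SUSPECT_RUN = re.compile(r"[\u0080-\u00ff]+")


def repair_run(match):
    """Re-decode one suspicious run, or leave it alone if that fails."""
    run = match.group(0)
    try:
        return run.encode("latin-1").decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8 when reinterpreted, so assume it was intended.
        return run


def repair_mixed(text):
    """Repair only the runs that look like UTF-8 misread as Latin-1."""
    return SUSPECT_RUN.sub(repair_run, text)


# One mangled "café" followed by one correct "café".
sample = "caf\u00c3\u00a9 and caf\u00e9"
fixed = repair_mixed(sample)
```

The risk described above is visible here: a string that legitimately contains the two characters U+00C3 U+00A9 would be "repaired" as well. Damage that went through Windows-1252, which produces characters outside this range, is not detected at all.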

There is some discussion in this answer on how you might do that kind of repair using handmade scripts. I don't know of a tool that is aware of the full range of natural languages and encodings, takes a more advanced statistical approach to spotting possible problems, and recommends the exact transformation to fix them. Something like that would be useful.

d__ 2009-09-14 23:43:17
