I have Unicode strings stored in a database. Some of the character encodings are wrong and instead of displaying actual characters for the language, it's now displaying characters that make no sense. How do I fix this issue? Is there a way to detect if strings have a wrong encoding?

+2  A: 

The problem with mojibake (the Japanese word gets used in English because Japan, a non-Western country with heavy early computer use, ran into the issue a lot) is that the garbled characters are generally valid characters in themselves, just nonsensical in context, which makes them much harder to detect with 100% accuracy.

The first thing you need to do is identify the encoding the data was really in and the encoding it was wrongly read as, and then write a converter to undo that mistake.

For example, if UTF-8 had been misinterpreted as ISO 8859-1, you would take the garbled text, encode it back into bytes as ISO 8859-1 (which recovers the original binary data), and then decode those bytes as UTF-8, as should have been done in the first place.
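The round trip described above can be sketched as follows. This is a minimal illustration rather than a definitive implementation: the helper name RepairUtf8ReadAsLatin1 is hypothetical, the code is written as a top-level program (C# 9 or later), and it assumes the data really was UTF-8 misread as ISO 8859-1.

```csharp
using System;
using System.Text;

// Hypothetical helper; assumes the text was UTF-8 but was decoded as ISO 8859-1.
string RepairUtf8ReadAsLatin1(string garbled)
{
    Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");

    // Step 1: re-encode with the wrong encoding to recover the original bytes.
    byte[] originalBytes = latin1.GetBytes(garbled);

    // Step 2: decode those bytes with the encoding that should have been used.
    return Encoding.UTF8.GetString(originalBytes);
}

// "caf\u00C3\u00A9" is the garbled form of "caf\u00E9" (cafe with an acute accent).
string repaired = RepairUtf8ReadAsLatin1("caf\u00C3\u00A9");
Console.WriteLine(repaired == "caf\u00E9"); // True
```

Only run a repair like this on strings you have already identified as damaged. Text that was stored correctly, or that contains characters ISO 8859-1 cannot represent, would not survive the round trip.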

Now for the hard part: finding the incorrect strings. If you can do this by some means that isn't heuristic, then that is the way to go (e.g. if you know that every record added within a particular range of id numbers is invalid, just use that).

Failing that, your best bet is to apply some heuristics, as follows:

  1. If a character in the text is not a graphical character, then it's probably caused by this mojibake issue.
  2. Certain sequences will be common in the given case of mojibake. For example, é in UTF-8 misinterpreted as ISO 8859-1 will become Ã©. Since Ã© is an extremely rare combination in real data (about the only time you'll see it deliberately is in a case like this, when someone is talking about how it can appear by mistake), any text containing it is almost certainly one that needs to be fixed. If you have some of the original data, you can find the sequences to look for by identifying the characters in it that are encoded differently in the two encodings and producing the corresponding garbled sequence (e.g. if ç appears in the data, and we work out that it would turn into Ã§, then we know that Ã§ is a sequence to look for).
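The two heuristics above might be combined along these lines. This is only a sketch: the name LooksLikeMojibake and the short suspectSequences list are hypothetical, "not a graphical character" is interpreted here as "a control character other than tab, carriage return or line feed", and the sequences assume UTF-8 misread as ISO 8859-1. It is written as a top-level program (C# 9 or later).

```csharp
using System;
using System.Linq;

// Hypothetical list of tell-tale sequences: the garbled forms of
// e-acute (U+00C3 U+00A9) and c-cedilla (U+00C3 U+00A7).
string[] suspectSequences = { "\u00C3\u00A9", "\u00C3\u00A7" };

bool LooksLikeMojibake(string text)
{
    // Heuristic 1: control characters that ordinary text rarely contains.
    bool hasOddControl = text.Any(c =>
        char.IsControl(c) && c != '\t' && c != '\r' && c != '\n');

    // Heuristic 2: sequences that almost never occur in real data.
    bool hasSuspectSequence = suspectSequences.Any(s => text.Contains(s));

    return hasOddControl || hasSuspectSequence;
}

Console.WriteLine(LooksLikeMojibake("caf\u00C3\u00A9")); // True
Console.WriteLine(LooksLikeMojibake("caf\u00E9"));       // False
```

Being heuristic, a check like this can still produce false positives and false negatives, so it is worth reviewing a sample of what it flags before repairing anything in bulk.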

Note that we can compute such sequences if we have System.Text.Encoding objects that correspond to the two encodings involved in the mojibake. If, for example, the data had been read using your system's default encoding when it should have been read as UTF-8, then you could use:

Encoding.Default.GetString(Encoding.UTF8.GetBytes(testString))

For example:

Encoding.Default.GetString(Encoding.UTF8.GetBytes("ç"))

returns "Ã§" when the system's default encoding is a Western European one such as Windows-1252.
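The same idea can be extended to build a whole table of sequences from a sample of known-good data. The sketch below makes some assumptions: BuildSuspectSequences is a hypothetical helper, the two encodings are passed explicitly instead of relying on Encoding.Default, surrogate pairs are skipped for simplicity, and the code is a top-level program (C# 9 or later).

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper: maps each garbled sequence back to the character it came from.
Dictionary<string, string> BuildSuspectSequences(
    string sampleText, Encoding actual, Encoding misread)
{
    var map = new Dictionary<string, string>();
    foreach (char c in sampleText)
    {
        if (char.IsSurrogate(c)) continue; // simplification: BMP characters only

        string original = c.ToString();
        // Encode as the data really was, then decode as it was wrongly read.
        string garbled = misread.GetString(actual.GetBytes(original));

        // Only characters that differ between the two encodings are useful.
        if (garbled != original) map[garbled] = original;
    }
    return map;
}

Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");
// Sample text: "facade cafe" with c-cedilla (U+00E7) and e-acute (U+00E9).
var table = BuildSuspectSequences("fa\u00E7ade caf\u00E9", Encoding.UTF8, latin1);
foreach (var pair in table)
    Console.WriteLine($"{pair.Key} -> {pair.Value}");
```

For this sample the table contains two entries, one for ç and one for é. The keys are the sequences to search for, and the values are what each one should be once repaired.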

Jon Hanna