Somewhere upstream of me, "something" happened that looks like Unicode mangling. One symptom is that a lowercase u umlaut (ü) gets converted to "Ã¼" (i.e., the single byte FC gets converted to the two bytes C3 BC). Assuming that I have no control over this upstream process, how can I reverse-engineer what's going on? And if that is possible, can I crank the sausage machine backwards and get the original text back?

(If it helps to understand this case, the text I received was in the form of a MySQL dump. I think somewhere in the dump/transport process it got mangled.)

+1  A: 

First of all, it looks like you've got UTF-8 encoded text: the bytes C3 BC are the UTF-8 encoding of ü, and they show up as "Ã¼" when interpreted in the encoding you expected, probably Latin-1.

You can test this guess by checking that the text contains only valid UTF-8 byte sequences and none of the illegal ones. See the Wikipedia article on UTF-8 for a reference on which byte sequences are valid and which are not. You can be pretty sure about the encoding if the text starts with a BOM, but a BOM is not required for UTF-8.
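
As a rough illustration of that check, here is a minimal Java sketch that asks a strict UTF-8 decoder whether a byte array is well-formed. The class and method names (`Utf8Check`, `isValidUtf8`) are made up for this example. Note that passing the check is only evidence, not proof: plain ASCII is also valid UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any library.
class Utf8Check {
    // Returns true if the bytes decode as UTF-8 without any malformed sequence.
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xC3, (byte) 0xBC}; // ü encoded as UTF-8
        byte[] latin1 = {(byte) 0xFC};            // ü encoded as Latin-1
        System.out.println(isValidUtf8(utf8));    // prints true
        System.out.println(isValidUtf8(latin1));  // prints false
    }
}
```

Text in Latin-1 that contains accented characters will almost always fail this check, which is what makes it a useful way to tell the two encodings apart.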

To get the text back in your required encoding, several tools are available, GNU recode for one.
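
If the text has already ended up in a Java string with each UTF-8 byte turned into one Latin-1 character, the reversal can be sketched as below. This assumes the wrong decoding really was ISO-8859-1; if it was Windows-1252 or something else, some bytes will not round-trip and the result will differ. The class and method names (`MojibakeRepair`, `repair`) are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any library.
class MojibakeRepair {
    // Undoes "UTF-8 bytes decoded as Latin-1": recover the raw bytes by
    // encoding as Latin-1, then decode those bytes as UTF-8.
    static String repair(String mangled) {
        byte[] raw = mangled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String mangled = "\u00C3\u00BC";   // the two characters "Ã¼"
        String repaired = repair(mangled); // the single character "ü"
        System.out.println(repaired.equals("\u00FC")); // prints true
    }
}
```

Characters outside Latin-1 in the input are replaced during the first step, so this only makes sense for strings that were produced by exactly this kind of misreading.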

mkluwe
Thank you - the Wikipedia article explained a lot. So essentially what I had was a string (in Java) consisting of characters that had somehow missed being decoded from UTF-8. So the fix in the end consisted of replacing: x = results.getString("field"); with: x = new String(results.getBytes("field"), "UTF-8"); Presumably I'll find a more elegant way of doing this, but this is a big step forward, especially in my understanding. Thanks both.
Steve Bennett
+3  A: 

Your text isn't 'mangled'. It's just in UTF-8. C3 BC is what the ü is supposed to be encoded as in UTF-8. Just set whatever software you use to UTF-8 as well, and all pain will go away. If you can't set your software to Unicode, seriously consider switching to newer software.
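
To see the byte values for yourself, a small Java sketch along these lines prints how ü is encoded in Latin-1 and in UTF-8, and what happens when the UTF-8 bytes are read as Latin-1. The class name `EncodingDemo` and the `hex` helper are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
class EncodingDemo {
    // Formats bytes as space-separated uppercase hex, e.g. "C3 BC".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String u = "\u00FC"; // ü
        System.out.println(hex(u.getBytes(StandardCharsets.ISO_8859_1))); // prints FC
        System.out.println(hex(u.getBytes(StandardCharsets.UTF_8)));      // prints C3 BC

        // Reading the UTF-8 bytes as Latin-1 yields two characters, "Ã¼".
        String misread = new String(u.getBytes(StandardCharsets.UTF_8),
                StandardCharsets.ISO_8859_1);
        System.out.println(misread.equals("\u00C3\u00BC")); // prints true
    }
}
```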

I know it's scary at first, but you will have to do that eventually, anyway. My favorite music typesetter switched to Unicode-only input a while ago (they even deliberately removed support for the old 8-bit code pages to get people to switch), and I was upset, thinking that Latin-1 was good enough for me, and it was stupid to break stuff that was working perfectly well... and then I got over it and just set emacs to Unicode buffers, and now I'll never have to think about character encoding again in my life!

Kilian Foth