ansaurus

Question

How to decode a string that has been UTF-8 encoded twice to simple UTF-8?

Answer 1

A:

You could try using

SELECT CONVERT(`your_column` USING ascii)
FROM `your_table`

at the MySQL query level. This is a stab in the dark, though.

chaos 2009-09-19 18:10:28

Answer 2

+6 A:

Peculiar problem, but I think I can reproduce it by a suitably-unholy mix of UTF-8 and Latin-1 (not by just two uses of UTF-8 without an interspersed mis-step in Latin-1 though). Here's the whole weird round trip, "there and back again" (Python 2.* or IronPython should both be able to reproduce this):

# -*- coding: utf-8 -*-
uni = u'Újratárgyalja'
enc1 = uni.encode('utf-8')
enc2 = enc1.decode('latin-1').encode('utf-8')
dec3 = enc2.decode('utf-8')
dec4 = dec3.encode('latin-1').decode('utf-8')

for x in (uni, enc1, enc2, dec3, dec4):
  print repr(x), x

This is the interesting output...:

u'\xdajrat\xe1rgyalja' Újratárgyalja
'\xc3\x9ajrat\xc3\xa1rgyalja' Újratárgyalja
'\xc3\x83\xc2\x9ajrat\xc3\x83\xc2\xa1rgyalja' ÃjratÃ¡rgyalja
u'\xc3\x9ajrat\xc3\xa1rgyalja' ÃjratÃ¡rgyalja
u'\xdajrat\xe1rgyalja' Újratárgyalja

The weird string starting with Ã appears as enc2, i.e. two utf-8 encodings WITH an interspersed latin-1 decoding thrown into the mix. And as you can see it can be undone by the exactly-converse sequence of operations: decode as utf-8, re-encode as latin-1, re-decode as utf-8 again -- and the original string is back (yay!).

I believe that the normal round-trip properties of both Latin-1 (aka ISO-8859-1) and UTF-8 should guarantee that this sequence will work (sorry, no C# around to try in that language right now, but I would expect that the encoding/decoding sequences should not depend on the specific programming language in use).

Alex Martelli 2009-09-19 18:15:17

Ingenious. And the answer is accepted. For my own curiosity though, I tried reproducing your results on Windows with Python 2.6.1. Had a tough time of it, because the very act of copying and pasting your code created issues (for example, pasting into Notepad and the IDLE editor were quite different!). Then executing it created more issues (had to resort to printing only repr(x) if outside of IDLE). [I know, I know, get a Real Operating System, etc.]

John Y 2009-09-19 20:00:28

Answer 3

+1 A:

When you write "The MySQL .Net connector downloads them this way." there's a good chance this means the MySQL .Net connector believes it is speaking Latin-1 to MySQL, while MySQL believes the conversation is in UTF-8. There's also a chance the column is declared as Latin-1, but actually contains UTF-8 data.

If it's the latter (column labelled Latin-1 but data is actually UTF-8) you will get mysterious collation problems and other bugs if you make use of MySQL's text processing functions, ORDER BY on the column, or other situations where the text "means something" rather than just being bytes sent over the wire.

In either case you should try to fix the underlying problem, not least because it is going to be a complete headache for whoever has to maintain the system otherwise.

tialaramex 2009-10-12 13:24:17

The column is declared as UTF-8 and the data stored in it is also UTF-8, but for some mysterious reasons the PHP's PDO extension encoded it twice.

RoliSoft 2009-10-13 10:30:03

You never mentioned anything about PHP before. So is the data actually corrupt in the MySQL DB?

tialaramex 2009-10-13 17:50:15

I mentioned it right in the first sentence, that it is stored in the MySQL database that way. However, I didn't mention that PHP's PDO extension stored them this way, because the original question was how to decode a string in C# that has been UTF-8 encoded twice.

RoliSoft 2009-10-13 20:04:01

OK, so in this case you definitely ought to fix it in the database, for all the reasons I gave in the second paragraph.It is annoyingly easy to misconfigure PHP + MySQL and do this, I've done it myself, but fortunately I caught it pretty quickly. You can do the same decode / encode run around inside MySQL, but you need to be careful (create a backup first).

tialaramex 2009-10-19 17:05:19

Answer 4

A:

Too bad no-one has answered the problem? Yes, we all agree that it is the "wrong" way to solve it, but sometimes you need to do a temporary workaround... and right now, I would have needed that snippet of code. :-)

Will publish it if I manage to solve it.

Per Lundberg 2010-02-11 08:46:01

ansaurus

tags:

views:

answers:

How to decode a string that has been UTF-8 encoded twice to simple UTF-8?

related questions