It seems that we have managed to insert two Unicode characters into our database for each Unicode character we wanted.

For example, for the Unicode char 0x3CBC, we've inserted the UTF-8 encodings of each of its component bytes (0xC383 and 0xC2BC).
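I think what must have happened is that the UTF-8 bytes were read back as ISO-8859-1 and re-encoded as UTF-8 somewhere along the way; a quick Python sketch that reproduces the corruption under that assumption:

text = u'\u00fc'                                     # u-umlaut, UTF-8 bytes C3 BC
once = text.encode('utf-8')                          # '\xc3\xbc'
twice = once.decode('iso-8859-1').encode('utf-8')    # each byte re-encoded separately
print(repr(twice))                                   # '\xc3\x83\xc2\xbc'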

Can anyone think of a simple solution for fixing this?

I've come up with something like

SELECT REPLACE(name, CONCAT(0xC3, 0x83, 0xC2, 0xBC), CONCAT(0xC3, 0xBC)) FROM lang

This works for the example above, but I don't want to have to do this for every Unicode character!

+3  A: 

for the Unicode char 0x3CBC

I am presuming you mean the Unicode char U+00FC LATIN SMALL LETTER U WITH DIAERESIS (ü), which is encoded in UTF-8 as \xC3\xBC.

I don't think you can make this change with a simple query inside MySQL. What you can do is:

-- convert doubly-encoded UTF-8 to singly-encoded
ALTER TABLE table MODIFY column TEXT CHARACTER SET latin1;
-- deliberately lose encoding information
ALTER TABLE table MODIFY column BLOB;
-- interpret the single-encoded UTF-8 bytes as UTF-8
ALTER TABLE table MODIFY column TEXT CHARACTER SET utf8;

for each affected column in the schema. This works for the specific example you give, but fails when one of the UTF-8 trail bytes is in the range 0x80-0x9F. This is because MySQL's latin1 encoding is not really ISO-8859-1 but Windows cp1252, which maps bytes in that range differently.
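To illustrate the cp1252 problem (a sketch; 0x93 is just one example of a trail byte in that range):

trail = b'\x93'                          # a UTF-8 trail byte in 0x80-0x9F
print(repr(trail.decode('iso-8859-1')))  # u'\x93'   - re-encodes back to 0x93
print(repr(trail.decode('cp1252')))      # u'\u201c' - re-encodes to E2 80 9C in UTF-8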

Probably the easiest way would be to dump the lot and do the conversion on the mysqldump file, e.g. from Python:

# Remove one level of UTF-8 encoding
#
dump = open('/path/to/dump.sql', 'rb').read()
dump = dump.decode('utf-8').encode('iso-8859-1')
open('/path/to/dump-out.sql', 'wb').write(dump)
bobince
+1 for the python solution. The .encode('iso-8859-1') is a nice hack to pull the raw bytes out of the unicode object.
Ian Clelland
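For what it's worth, here is that trick applied in isolation to the bytes from the question (a sketch, assuming the same doubly-encoded input):

double = b'\xc3\x83\xc2\xbc'                          # doubly-encoded form of 0xC3BC
single = double.decode('utf-8').encode('iso-8859-1')  # collapse one encoding layer
print(repr(single))                                   # '\xc3\xbc' - valid UTF-8 for U+00FC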