It seems that we have managed to insert two Unicode characters into our database for each Unicode character we wanted.

For example, for the Unicode char 0x3CBC, we've inserted the UTF-8 encodings of each of its component bytes (0xC383 and 0xC2BC).
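I think what must have happened is that the UTF-8 bytes were read back as ISO-8859-1 and re-encoded as UTF-8 somewhere along the way; a quick Python sketch that reproduces the corruption under that assumption:

text = u'\u00fc'                                     # u-umlaut, UTF-8 bytes C3 BC
once = text.encode('utf-8')                          # '\xc3\xbc'
twice = once.decode('iso-8859-1').encode('utf-8')    # each byte re-encoded separately
print(repr(twice))                                   # '\xc3\x83\xc2\xbc'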

Can anyone think of a simple solution for fixing this?

I've come up with something like

SELECT REPLACE(name, CONCAT(0xC3, 0x83, 0xC2, 0xBC), CONCAT(0xC3, 0xBC)) FROM lang

This works for the example above, but I don't want to have to do this for every Unicode character!

+3  A: 

for the Unicode char 0x3CBC

I am presuming you mean the Unicode char U+00FC LATIN SMALL LETTER U WITH DIAERESIS (ü), which is encoded in UTF-8 as \xC3\xBC.

I don't think you can make this change with a simple query inside MySQL. What you can do is:

-- convert doubly-encoded UTF-8 to singly-encoded
ALTER TABLE table MODIFY column TEXT CHARACTER SET latin1;
-- deliberately lose encoding information
ALTER TABLE table MODIFY column BLOB;
-- interpret the single-encoded UTF-8 bytes as UTF-8
ALTER TABLE table MODIFY column TEXT CHARACTER SET utf8;

for each affected column in the schema. This works for the specific example you give, but fails when one of the UTF-8 trail bytes is in the range 0x80-0x9F. This is because MySQL's latin1 encoding is not really ISO-8859-1 but Windows cp1252, which maps bytes in that range differently.
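To illustrate the cp1252 problem (a sketch; 0x93 is just one example of a trail byte in that range):

trail = b'\x93'                          # a UTF-8 trail byte in 0x80-0x9F
print(repr(trail.decode('iso-8859-1')))  # u'\x93'   - re-encodes back to 0x93
print(repr(trail.decode('cp1252')))      # u'\u201c' - re-encodes to E2 80 9C in UTF-8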

Probably the easiest way would be to dump the lot and do the conversion on the mysqldump file, e.g. from Python:

# Remove one level of UTF-8 encoding
#
dump = open('/path/to/dump.sql', 'rb').read()
dump = dump.decode('utf-8').encode('iso-8859-1')
open('/path/to/dump-out.sql', 'wb').write(dump)
bobince
+1 for the python solution. The .encode('iso-8859-1') is a nice hack to pull the raw bytes out of the unicode object.
Ian Clelland
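For what it's worth, here is that trick applied in isolation to the bytes from the question (a sketch, assuming the same doubly-encoded input):

double = b'\xc3\x83\xc2\xbc'                          # doubly-encoded form of 0xC3BC
single = double.decode('utf-8').encode('iso-8859-1')  # collapse one encoding layer
print(repr(single))                                   # '\xc3\xbc' - valid UTF-8 for U+00FC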