I'm not 100% sure on the particulars, so I'd love someone straightening me out, but I'll forge ahead with what I think is going on...

When I first set up my database, I used the system's default character encoding without even thinking about it, and it was latin1. I never considered i18n/l10n; it just didn't occur to me. I accepted the defaults and went with it.

Anyway, I've been using the database exclusively for a Rails app, and we've now got several GB of data, hundreds of thousands of rows, and many international users. I've noticed that many of our foreign users are inserting data that seems to be Unicode / non-latin1. Here is an example:

What about crazy Unicode stuff? ☢ ☠ ☭

database.yml

Here is our database.yml file.

development:
  adapter: mysql
  database: XXX
  username: YYY
  password: ZZZ
  host: localhost
  encoding: utf8

As you can see, we're setting our character encoding to utf8. However, all our tables have a default character set of latin1. I'm sure of this.

Update: After looking more closely, I found that our production database.yml does not specify an encoding, while my local copy was specifying utf8. This mismatch was causing problems when I would dump the production database and import it locally. It now seems the import itself was working fine, but Rails was reading the data incorrectly.

mysql CLI tool

When I view the data via the mysql CLI tool, it displays all the Unicode characters correctly. However, SHOW CREATE TABLE clearly shows that the tables' default charset is latin1. This leads me to believe that MySQL is somehow smart enough to store non-latin1 data.
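One way to see why this can "work" at all: any sequence of bytes is a valid latin1 string, so raw UTF-8 bytes pass through a latin1-labeled pipeline untouched as long as nothing along the way actually re-encodes them. A minimal Ruby sketch of that idea (the ☢ character is just an example):

```ruby
# The raw UTF-8 bytes for "☢" (three bytes: E2 98 A2).
bytes = "☢".dup.force_encoding("BINARY")

# Relabel (not convert) those bytes as latin1. Every byte sequence
# is a valid latin1 string, so nothing complains.
latin1 = bytes.dup.force_encoding("ISO-8859-1")
puts latin1.valid_encoding?    # => true

# Relabel them back as UTF-8 and the original character reappears,
# because the bytes themselves were never changed.
roundtrip = latin1.dup.force_encoding("UTF-8")
puts roundtrip == "☢"          # => true
```

The same thing is happening end to end here: the browser sends UTF-8 bytes, MySQL files them away as "latin1", and the utf-8 Content-Type header makes the browser interpret the returned bytes as UTF-8 again.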

HTTP header

Our HTTP Content-Type header is set to utf-8, like so:

Content-Type: text/html; charset=utf-8

Conversion Attempts

I've played a little with converting our tables to utf-8 encoding, all with no success. Mainly I tried dumping the database, running iconv to convert, then re-importing with the tables set to utf-8. MySQL had no errors, but all the Unicode data was garbled.

What to do?

I'm kind of stuck as to what to do (if anything). I'm a strong believer in not fixing what isn't broken, but this whole situation worries me. We've never had any complaints from users about not being able to store their data, and everything seems to be working fine. I'd just like to know what exactly is going on, who/what is doing the conversion (MySQL? Ruby? Rails? MySQL connection?), and any tips on how to proceed.

+1  A: 

Most likely the data stored in your tables is valid UTF-8, but MySQL thinks it's Latin-1 (because that's the datatype the column was declared with). It is also valid Latin-1, of course, since AFAIK any arbitrary sequence of bytes is valid Latin-1.

What happens when you convert to UTF-8 is that MySQL sees valid Latin-1 encoded data and converts that to the equivalent valid UTF-8. This means that you get data that's double-UTF-8-encoded, which is why it is garbled.
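The double-encoding effect is easy to reproduce in Ruby (a minimal sketch; "é" stands in for any multi-byte character):

```ruby
# "é" stored correctly as UTF-8 occupies two bytes: C3 A9.
utf8 = "é".encode("UTF-8")

# Mislabel those bytes as latin1 (which is effectively what a latin1
# column declaration does), then convert: each byte becomes its own
# character, yielding double-encoded mojibake.
mislabeled = utf8.dup.force_encoding("ISO-8859-1")
double = mislabeled.encode("UTF-8")
puts double        # => "Ã©"

# The damage is reversible: undo the bogus conversion, then relabel
# the recovered bytes as what they really were all along.
recovered = double.encode("ISO-8859-1").force_encoding("UTF-8")
puts recovered     # => "é"
```

Note the distinction the whole problem hinges on: `encode` converts the bytes, while `force_encoding` only changes the label attached to them.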

The way to get around this is to convert the column to a binary string and then to UTF-8 from there. MySQL does not convert the string when you do this (because you're converting it via a format that just says, "treat this string as a series of 0s and 1s").

-- Run as two separate statements: a single ALTER cannot MODIFY the
-- same column twice. Going through binary keeps MySQL from
-- reinterpreting the bytes.
ALTER TABLE MyTable
    MODIFY MyColumn CHAR(100) CHARACTER SET binary;
ALTER TABLE MyTable
    MODIFY MyColumn CHAR(100) CHARACTER SET utf8;
Hammerite
I've played a bit with the binary modification too, and it just takes so damn long. I think I'm going to dump the entire database (mysqldump), change the table defaults to utf8 in the SQL dump, then re-import. I *think* that should work, since MySQL won't do any conversion in that case; it will just take the data in as-is.
Micah