The MySQL database used by my Rails application currently has the default collation of latin1_swedish_ci. Since the default charset of Rails applications (including mine) is UTF-8, it seems sensible to me to use the utf8_general_ci collation in the database.

Is my thinking correct?

Assuming it is, what would be the best approach to migrate the collation and all the data in the database to the new encoding?

+1  A: 

Convert to UTF-8 as the charset.

Collation settings only affect how strings are compared and sorted, not how they are stored. Choose the collation that most of your users would expect.

Christoph Schiessl
+1  A: 

UTF-8, like any other Unicode encoding scheme, can store characters from any language, so it is an excellent choice of character set for your database.

The collation setting, on the other hand, is a completely separate issue from the encoding scheme. It involves sort orders, upper/lowercase conversions, string equality comparisons, and things like that which are language-specific. The collation setting should match the language that is used in the database.

The UTF-8 general collation is (I am assuming here—I'm not familiar with MySQL in particular) used for situations where the language is unknown and some simple default ordering is needed. It probably corresponds to the Unicode code point ordering, which is almost certainly not what you want if you're storing Swedish.
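To make the difference between code point ordering and a language-aware ordering concrete, here is a minimal Python sketch. It does not use MySQL at all; the `SWEDISH_ALPHABET` string and the `swedish_key` helper are hypothetical names invented for this illustration, and the key only approximates Swedish rules (a to z, then å, ä, ö) for lowercase words.

```python
# Hypothetical illustration: code point order vs. a simplified Swedish order.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"

def swedish_key(word):
    """Map each character to its position in the Swedish alphabet.

    Characters outside the alphabet sort after it, by code point.
    """
    return [
        SWEDISH_ALPHABET.index(ch) if ch in SWEDISH_ALPHABET
        else len(SWEDISH_ALPHABET) + ord(ch)
        for ch in word.lower()
    ]

words = ["zebra", "ärlig", "apple", "öl", "åka"]

code_point_order = sorted(words)               # plain Unicode code point comparison
swedish_order = sorted(words, key=swedish_key)  # simplified Swedish alphabet order

print(code_point_order)
print(swedish_order)
```

In code point order "ärlig" comes before "åka" because ä (U+00E4) precedes å (U+00E5), whereas the Swedish alphabet places å before ä. A language-specific collation exists to encode exactly this kind of rule.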

Jeffrey L Whitledge
+1  A: 

Provided the existing data in your database is CORRECTLY encoded in latin1, converting the tables to utf8 (using ALTER TABLE, as described in the docs) should just work.

Then all your application needs to do is continue doing whatever it did before. If your application wants to use Unicode characters, it should set its connection encoding to utf8 and send utf8, but that's its own problem.
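As a rough sketch of the ALTER TABLE approach, the Python snippet below only builds the SQL statements as strings, so they can be reviewed before being run against a backup. The function name `conversion_statements`, the database name `myapp_production`, and the table names are hypothetical; the statements rely on MySQL's `CONVERT TO CHARACTER SET ... COLLATE ...` clause and assume the stored data really is valid latin1.

```python
def conversion_statements(database, tables,
                          charset="utf8", collation="utf8_general_ci"):
    """Build MySQL statements that switch a database and its tables to utf8.

    Only strings are produced here; nothing is executed.
    """
    # Change the default for tables created in the future.
    statements = [
        f"ALTER DATABASE `{database}` CHARACTER SET {charset} COLLATE {collation};"
    ]
    # Convert each existing table, including the data in its text columns.
    for table in tables:
        statements.append(
            f"ALTER TABLE `{table}` CONVERT TO CHARACTER SET {charset} "
            f"COLLATE {collation};"
        )
    return statements

# Hypothetical database and table names, for illustration only.
for sql in conversion_statements("myapp_production", ["users", "posts"]):
    print(sql)
```

Generating the statements first, rather than executing them directly, makes it easier to test the conversion on a copy of the database.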


The problem is that a large number of crap web apps have historically sent utf8 data to MySQL while telling it the data was latin1. MySQL will honour this perfectly and save junk into the tables, as instructed.

Converting the tables from latin1 to utf8 will NOT repair this mistake, as you genuinely do have total rubbish in there. Repairing it is nontrivial, particularly if the app has sent different kinds of rubbish to the database over its lifetime.
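The failure mode can be reproduced outside the database. The Python sketch below is an illustration under assumptions, not a repair tool: it uses Python's `latin-1` codec as a stand-in for MySQL's latin1 (which is actually closer to Windows-1252), and it only works when every value was mangled in the same way.

```python
original = "é"

# The app sends UTF-8 bytes...
utf8_bytes = original.encode("utf-8")

# ...but the database is told they are latin1, so each byte becomes one character.
mojibake = utf8_bytes.decode("latin-1")

# Converting that "latin1" text to utf8 faithfully encodes the junk characters.
double_encoded = mojibake.encode("utf-8")

# Undoing it: reinterpret the junk characters as bytes, then decode as UTF-8.
repaired = double_encoded.decode("utf-8").encode("latin-1").decode("utf-8")

print(repr(mojibake), double_encoded, repr(repaired))
```

If some rows hold genuine latin1 and others hold mislabelled UTF-8, the final step raises a `UnicodeDecodeError` or produces wrong text for the genuine rows, which is why the repair is nontrivial.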

MarkR
Well, the data is coming from a Rails app which has its character encoding set to UTF-8, not latin1. Presumably this puts my app into the 'crap web app' category, since it is sending UTF-8 to a latin1 table? What do you suggest I do to convert the data?
Olly