Does anyone have any tips or gotchas to look out for when migrating MySQL tables from the default case-insensitive Swedish or ASCII charsets to UTF-8? Some of the projects that I'm involved in are striving for better internationalization, and the database is going to be a significant part of this change.

Before we look to alter the database, we are going to convert each site to use UTF-8 character encoding (from least critical to most) to help ensure all input/output is using the same character set.

Thanks for any help

+1  A: 

I am going to be going over the following sites/articles to help find an answer.

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) - Joel on Software

UTF-8 And Unicode FAQ

Hanselminutes episode "Sorting out Internationalization with Michael Kaplan"

I also just found a very on-topic post by Derek Sivers on the O'Reilly ONLamp blog as I was writing this out: Turning MySQL data in latin1 to utf8 utf-8

Mike H
A: 

Some hints:

  • Your CHAR and VARCHAR columns will use up to 3 times more disk space. (You probably won't see much disk space growth for Swedish words.)
  • Use SET NAMES utf8 before reading or writing to the database. If you don't do this, you will get partially garbled characters.
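
The garbling described in the second hint can be reproduced outside MySQL. Here is a minimal Python sketch, assuming the client sends UTF-8 bytes while the connection still interprets them as latin1 (roughly the situation when SET NAMES utf8 is skipped). The sample word is my own, not from the thread.

```python
# Illustration only: bytes written as UTF-8 but read back as latin1.
text = "smörgåsbord"

utf8_bytes = text.encode("utf-8")        # what a UTF-8 page submits
misread = utf8_bytes.decode("latin-1")   # what a latin1 connection assumes

# ASCII letters survive; ö and å each turn into two unrelated characters.
print(misread)

# Undoing the wrong step recovers the text, provided nothing was truncated.
recovered = misread.encode("latin-1").decode("utf-8")
print(recovered == text)
```

Only the non-ASCII characters are damaged, which is why the result looks "partially" garbled rather than completely unreadable.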
Harry
A: 

Your CHAR and VARCHAR columns will use up to 3 times more disk space.

Only if they're stuffed full of latin-1 with ordinals above 127. Otherwise, the increased space use of UTF-8 is minimal.
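
To get a feel for the actual growth, compare the encoded length of a string in latin1 and in UTF-8. This is a small Python sketch; the helper name `utf8_growth` and the sample words are mine, purely for illustration.

```python
def utf8_growth(text: str):
    """Return (latin1_bytes, utf8_bytes) for the given text."""
    return len(text.encode("latin-1")), len(text.encode("utf-8"))


for word in ["hello", "smörgåsbord", "åäö"]:
    latin1_len, utf8_len = utf8_growth(word)
    print(f"{word!r}: latin1={latin1_len} bytes, utf8={utf8_len} bytes")
```

Pure ASCII text does not grow at all, and each latin1 character above 127 costs one extra byte. The 3x figure is the worst case MySQL's utf8 has to allow for per character, not what latin1 data will actually need.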

John Millikin
A: 

The collations are not always favorable. You'll get umlauts collating to non-umlauted versions, which is not always correct. You might want to go with utf8_bin, but then everything is case-sensitive as well.
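
The trade-off can be pictured with a rough Python analogy. This is not MySQL's collation algorithm, only an approximation of an accent- and case-insensitive comparison next to a binary one; the `fold` helper is hypothetical.

```python
import unicodedata


def fold(text: str) -> str:
    """Strip combining marks and case, roughly like an accent-insensitive collation."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


# Folded comparison: the umlaut and the capital letter are both ignored.
folded_equal = fold("Müller") == fold("muller")

# Binary-style comparison: any difference in code points makes the strings differ.
binary_equal = "Müller" == "muller"

print(folded_equal, binary_equal)
```

With the folded comparison the two names match; with the binary comparison they do not, and neither would "Muller" and "muller".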

A: 

Beware index length limitations. If a table is structured, say:

    a varchar(255),
    b varchar(255),
    key (a, b)

You're going to go past the 1000-byte limit on key lengths. 255 + 255 = 510 bytes is okay, but 255*3 + 255*3 = 1530 bytes isn't going to work.
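
The arithmetic can be checked for any set of indexed columns with a few lines of Python. This sketch assumes the 1000-byte limit quoted above and 3 bytes per character for utf8; the function and variable names are my own.

```python
MAX_KEY_BYTES = 1000  # key length limit quoted above


def key_bytes(column_lengths, bytes_per_char):
    """Worst-case index key size in bytes for the given column lengths."""
    return sum(length * bytes_per_char for length in column_lengths)


columns = [255, 255]                 # a varchar(255), b varchar(255)
latin1_key = key_bytes(columns, 1)   # one byte per character
utf8_key = key_bytes(columns, 3)     # up to three bytes per character

print(latin1_key, latin1_key <= MAX_KEY_BYTES)
print(utf8_key, utf8_key <= MAX_KEY_BYTES)

# Largest equal prefix length per column that still fits under the limit.
max_prefix = MAX_KEY_BYTES // (3 * len(columns))
print(max_prefix)
```

If the full columns don't fit, one option is a prefix index such as key (a(166), b(166)), at the cost of the index no longer covering the whole value.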

JBB