I'm trying to figure out what collation I should be using for various types of data. 100% of the content I will be storing is user-submitted.

My understanding is that I should be using UTF-8 General CI (Case-Insensitive) instead of UTF-8 Binary. However, I can't find a clear distinction between UTF-8 General CI and UTF-8 Unicode CI.

1) Should I be storing user-submitted content in UTF-8 General or UTF-8 Unicode CI columns?

2) What type of data would UTF-8 Binary be applicable to?

+9  A: 

In general, utf8_general_ci is faster than utf8_unicode_ci, but less correct.

Here is the difference:

For any Unicode character set, operations performed using the _general_ci collation are faster than those for the _unicode_ci collation. For example, comparisons for the utf8_general_ci collation are faster, but slightly less correct, than comparisons for utf8_unicode_ci. The reason for this is that utf8_unicode_ci supports mappings such as expansions; that is, when one character compares as equal to combinations of other characters. For example, in German and some other languages “ß” is equal to “ss”. utf8_unicode_ci also supports contractions and ignorable characters. utf8_general_ci is a legacy collation that does not support expansions, contractions, or ignorable characters. It can make only one-to-one comparisons between characters.

Quoted from: http://dev.mysql.com/doc/refman/5.0/en/charset-unicode-sets.html

Both utf8_general_ci and utf8_unicode_ci perform case-insensitive comparison. In contrast, utf8_bin is case-sensitive (among other differences), because it compares the binary values of the characters.
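
To make the three behaviours concrete, here is a rough Python sketch. It is not MySQL's actual collation code and will not match MySQL's results in every case; the function names are made up for illustration. It only mimics the idea: a one-to-one fold (in the spirit of _general_ci), a fold that allows expansions such as "ß" to "ss" (in the spirit of _unicode_ci, approximated here with Python's str.casefold()), and a plain byte comparison (in the spirit of _bin).

```python
def one_to_one_key(text):
    """Fold each character to exactly one character (no expansions)."""
    out = []
    for ch in text:
        folded = ch.lower()
        # Keep the original character if lowercasing would produce
        # more than one character, so the mapping stays one-to-one.
        out.append(folded if len(folded) == 1 else ch)
    return "".join(out)


def expansion_key(text):
    """Fold with expansions allowed: casefold() turns 'ß' into 'ss'."""
    return text.casefold()


def equal_one_to_one(a, b):
    # Hypothetical stand-in for a _general_ci style comparison.
    return one_to_one_key(a) == one_to_one_key(b)


def equal_with_expansions(a, b):
    # Hypothetical stand-in for a _unicode_ci style comparison.
    return expansion_key(a) == expansion_key(b)


def equal_binary(a, b):
    # Hypothetical stand-in for a _bin style comparison: raw UTF-8 bytes.
    return a.encode("utf-8") == b.encode("utf-8")


if __name__ == "__main__":
    for a, b in [("Hello", "hELLO"), ("Straße", "STRASSE")]:
        print(a, b,
              equal_one_to_one(a, b),
              equal_with_expansions(a, b),
              equal_binary(a, b))
```

In this sketch, "Hello" and "hELLO" match under both case-insensitive comparisons but not under the binary one, while "Straße" and "STRASSE" match only when expansions are allowed. The one-to-one version is a single pass with no multi-character lookups, which hints at why the simpler collation is cheaper.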

Sagi
Thanks... performance is not a factor I had thought of, but it is *quite* important, so that helps!
Dolph
I think that if you don't have a good reason to use _unicode_ci, then use _general_ci.
Sagi
And what about utf8_bin?
Thorpe Obazee
@Thorpe Obazee - I've added an explanation to my answer.
Sagi
I knew about the case-sensitivity thing though. +1
Thorpe Obazee