Between utf8_general_ci
and utf8_unicode_ci
, are there any differences in terms of performance?
views:
1774answers:
2See the mysql manual, Unicode Character Sets section:
For any Unicode character set, operations performed using the _general_ci collation are faster than those for the _unicode_ci collation. For example, comparisons for the utf8_general_ci collation are faster, but slightly less correct, than comparisons for utf8_unicode_ci. The reason for this is that utf8_unicode_ci supports mappings such as expansions; that is, when one character compares as equal to combinations of other characters. For example, in German and some other languages “ß” is equal to “ss”. utf8_unicode_ci also supports contractions and ignorable characters. utf8_general_ci is a legacy collation that does not support expansions, contractions, or ignorable characters. It can make only one-to-one comparisons between characters.
So to summarize, utf_general_ci uses a smaller and less correct (according to the standard) set of comparisons than utf_unicode_ci which should implement the entire standard. The general_ci set will be faster because there is less computation to do.
There are at least two important differences.
Performance
utf8_general_ci is much faster at comparisons and sorting, because it only sorts each character as a single value. That is, for comparison and sorting, each character is converted into a single numeric value and then these values are compared.
utf8_unicode_ci uses a much more complex comparison algorithm where up to 4 parameters need to be taken into account for each character.
Accuracy of sorting in various languages
utf8_unicode_ci is based on the Unicode standard for sorting. utf8_general_ci is very close, but is not Unicode compliant, because compromises have been made to make it faster.
Unicode defines sets of rules for how characters should be sorted. These rules need to take into account the local conventions; not everybody sorts their characters in what we would call 'alphabetical order'. As far as latin ie "european" languages go, there is not much difference between the Unicode sorting and the simplified utf8_general_ci sorting in MySQL, but there are still a few differences.
For examples, Unicode collation sorts "ß" like "ss", and "Œ" like "OE", whereas utf8_general_ci sorts them as single characters like "s" and presumably "e" respectively.
In non-latin languages, such as Asian languages or languages with different alphabets, utf8_unicode_ci may either make no difference or a lot of difference depending on the language.
Some Unicode characters are defined as ignorable, which means they shouldn't count toward the sort order and you should move on to the next character instead. utf8_unicode_ci handles these properly, whereas for performance reasons utf8_general_ci doesn't and a word with the ignorable character will be sorted differently to a word without.
I tend to use utf8_general_ci in general, except where sorting is going to be important enough to justify the performance cost, where I'll use utf8_unicode_ci. Alternatively you could just use utf8_unicode_ci all the time, and only worry about performance if and when it becomes a problem.