views:

9262

answers:

5

I'm wondering if there is a "best" choice for collation in MySQL for a general website where you aren't 100% of what will be entered? I understand that all the encodings should be the same, such as MySQL, Apache, the HTML and anything inside PHP.

In the past I have set PHP to output in "UTF-8", but which collation does this match in MySQL? I'm thinking it's one of the UTF-8 ones, but I have used utf8_unicode_ci, utf8_general_ci, and utf8_bin before.

+1  A: 

For UTF-8 textual information, you should use utf8_general_ci because...

  • utf8_bin: compare strings by the binary value of each character in the string

  • utf8_general_ci: compare strings using general language rules and using case-insensitive comparisons

a.k.a. it will should making searching and indexing the data faster/more efficient/more useful.

mepcotterell
+16  A: 

The main difference is sorting accuracy (when comparing characters in the language) and performance. The only special one is utf8_bin which is for comparing characters in binary format.

utf8_general_ci is somewhat faster than utf8_unicode_ci, but less accurate (for sorting). The specific language utf8 encoding (such as utf8_swedish_ci) contain additional language rules that make them the most accurate to sort for those languages. Most of the time I use utf8_unicode_ci (I prefer accuracy to small performance improvements), unless I have a good reason to prefer a specific language.

You can read more on specific unicode character sets on the MySQL manual - http://dev.mysql.com/doc/refman/5.0/en/charset-unicode-sets.html

Eran Galperin
small performance improvements ? are you sure about this ? http://publib.boulder.ibm.com/infocenter/db2luw/v9r5/index.jsp?topic=/com.ibm.db2.luw.admin.nls.doc/doc/c0053263.html The collation you choose can significantly impact the performance of queries in the database.
Adam Ramadhan
This is for DB2 not MySQL. Also, there are no concrete numbers or benchmarks so you are just basing it on the opinion of the writer.
Eran Galperin
+4  A: 

Actually, you probably want to use utf8_unicode_ci or utf8_general_ci.

  • utf8_general_ci sorts by stripping away all accents and sorting as if it were ASCII
  • utf8_unicode_ci uses the Unicode sort order, so it sorts correctly in more languages

However, if you are only using this to store english text, these shouldn't differ.

Vegard Larsen
+6  A: 

Collations affect how data is sorted and how strings are compared to each other. That means you should use the collation that most of your users expect.

Example from the documentation:

utf8_general_ci also is satisfactory for both German and French, except that ‘ß’ is equal to ‘s’, and not to ‘ss’. If this is acceptable for your application, then you should use utf8_general_ci because it is faster. Otherwise, use utf8_unicode_ci because it is more accurate.

So - it depends on your expected user base and on how much you need correct sorting. For an English user base, utf8_general_ci should suffice, for other languages, like Swedish, special collations have been created.

Tomalak
+2  A: 

Be very, very aware of this problem that can occur when using utf8_general_ci.

MySQL will not distinguish between some characters in select statements, if the utf8_general_ci collate is used. This can lead to very nasty bugs - especially for example, where usernames are involved. Depending on the implementation that uses the database tables, this problem could allow malicious users to create a username matching an administrator account.

This problem exposes itself at the very least in early 5.x versions - I'm not sure if this behaviour as changed later.

I'm no DBA, but to avoid this problem, I always go with 'utf8-bin' instead of a case-insensitive one.

The script below describes the problem by example.

-- first, create a sandbox to play in
CREATE DATABASE `sandbox`;
use `sandbox`;

-- next, make sure that your client connection is of the same 
-- character/collate type as the one we're going to test next:
charset utf8 collate utf8_general_ci

-- now, create the table and fill it with values
CREATE TABLE `test` (`key` VARCHAR(16), `value` VARCHAR(16) )
    CHARACTER SET utf8 COLLATE utf8_general_ci;

INSERT INTO `test` VALUES ('Key ONE', 'value'), ('Key TWO', 'valúe');

-- (verify)
SELECT * FROM `test`;

-- now, expose the problem/bug:
SELECT * FROM test WHERE `value` = 'value';

--
-- Note that we get BOTH keys here! MySQLs UTF8 collates that are 
-- case insensitive (ending with _ci) do not distinguish between 
-- both values!
--
-- collate 'utf8_bin' doesn't have this problem, as I'll show next:
--

-- first, reset the client connection charset/collate type
charset utf8 collate utf8_bin

-- next, convert the values that we've previously inserted in the table
ALTER TABLE `test` CONVERT TO CHARACTER SET utf8 COLLATE utf8_bin;

-- now, re-check for the bug
SELECT * FROM test WHERE `value` = 'value';

--
-- Note that we get just one key now, as you'd expect.
--
-- This problem appears to be specific to utf8. Next, I'll try to 
-- do the same with the 'latin1' charset:
--

-- first, reset the client connection charset/collate type
charset latin1 collate latin1_general_ci

-- next, convert the values that we've previously inserted
-- in the table
ALTER TABLE `test` CONVERT TO CHARACTER SET latin1 COLLATE latin1_general_ci;

-- now, re-check for the bug
SELECT * FROM test WHERE `value` = 'value';

--
-- Again, only one key is returned (expected). This shows 
-- that the problem with utf8/utf8_generic_ci isn't present 
-- in latin1/latin1_general_ci
--
-- To complete the example, I'll check with the binary collate
-- of latin1 as well:

-- first, reset the client connection charset/collate type
charset latin1 collate latin1_bin

-- next, convert the values that we've previously inserted in the table
ALTER TABLE `test` CONVERT TO CHARACTER SET latin1 COLLATE latin1_bin;

-- now, re-check for the bug
SELECT * FROM test WHERE `value` = 'value';

--
-- Again, only one key is returned (expected).
--
-- Finally, I'll re-introduce the problem in the exact same 
-- way (for any sceptics out there):

-- first, reset the client connection charset/collate type
charset utf8 collate utf8_generic_ci

-- next, convert the values that we've previously inserted in the table
ALTER TABLE `test` CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;

-- now, re-check for the problem/bug
SELECT * FROM test WHERE `value` = 'value';

--
-- Two keys.
--

DROP DATABASE sandbox;
Guus