views:

491

answers:

1

I'd like to be able to do queries that normalize accented characters, so that for example:

é, è, and ê

are all treated as 'e', in queries using '=' and 'like'. I have a row with username field set to 'rené', and I'd like to be able to match on it with both 'rene' and 'rené'.

I'm attempting to do this with the 'collate' clause in MySQL 5.0.8. I get the following error:

mysql> select * from User where username = 'rené' collate utf8_general_ci;
ERROR 1253 (42000): COLLATION 'utf8_general_ci' is not valid for CHARACTER SET 'latin1'

FWIW, my table was created with:

CREATE TABLE `User` (
  `id` bigint(19) NOT NULL auto_increment,
  `username` varchar(32) NOT NULL,
  PRIMARY KEY  (`id`),
  UNIQUE KEY `uniqueUsername` (`username`)
) ENGINE=InnoDB AUTO_INCREMENT=56790 DEFAULT CHARSET=utf8
+3  A: 

I'd suggest that you save the normalized versions to your table in addition with the real username. Changing the encoding on the fly can be expensive, and you have to do the conversion again for every row on every search.

If you're using PHP, you can use iconv() to handle the conversion:

$username = 'rené';
$normalized = iconv('UTF-8', 'ASCII//TRANSLIT', $string);

Then you'd just save both versions and use the normalized version for searching and normal username for display. Comparing and selecting will be alot faster from the normalized column, provided that you normalize the search string also:

$search = mysql_real_escape_string(iconv('UTF-8', 'ASCII//TRANSLIT', $_GET['search']));
mysql_query("SELECT * FROM User WHERE normalized LIKE '%".$search."%'");

Of course this method might not be viable if you have several columns that need normalizations, but in your specific case this might work allright.

Tatu Ulmanen
Hmm, I'm a bit leery of keeping data in multiple places (DRY), unless it proves to be a bottleneck. In this case, it would involve 3 existing fields- username, firstName, and lastName (I vastly simplified my table structure for the purpose of asking a simple question).
Caffeine Coma