I have a table containing a dictionary of words in my language (Latvian).

CREATE TABLE words (
value varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

And let's say it has 3 words inside:
INSERT INTO words (value) VALUES ('tēja');
INSERT INTO words (value) VALUES ('vējš');
INSERT INTO words (value) VALUES ('feja');

I want to find all words that are exactly 4 characters long, where the second character is 'ē' and the third character is 'j'.

It feels to me that the correct query would be:
SELECT * FROM words WHERE value LIKE '_ēj_';
The problem with this query is that it returns not 2 entries ('tēja', 'vējš') but all three. As I understand it, this is because MySQL internally converts the strings to some ASCII representation?

There is also the BINARY modifier for LIKE:
SELECT * FROM words WHERE value LIKE BINARY '_ēj_';
But this does not return the 2 entries ('tēja', 'vējš') either, only one ('tēja'). I believe this has something to do with UTF-8 using 2 bytes for non-ASCII characters?
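
One way to check the byte-length guess is to compare character counts with byte counts for each row. This is only a diagnostic sketch; it assumes the three rows above were stored as valid UTF-8, and the column aliases are made up for illustration.

```sql
-- CHAR_LENGTH counts characters, LENGTH counts bytes.
-- Assuming correctly stored UTF-8, all three words have 4 characters,
-- but 'tēja' is 5 bytes, 'vējš' is 6 bytes and 'feja' is 4 bytes.
SELECT value,
       CHAR_LENGTH(value) AS chars,
       LENGTH(value)      AS bytes,
       HEX(value)         AS utf8_bytes
FROM words;
```

If the comparison is done byte by byte, the final `_` in the pattern can cover the single-byte 'a' in 'tēja' but not the two-byte 'š' in 'vējš', which would be consistent with the result seen above.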

So question:
What MySQL query would return my exact two words ('tēja','vējš')?

Thank you in advance

A: 

You have to use a proper collation.
I don't know about Latvian, but here is an example for German, to give you an idea: http://dev.mysql.com/doc/refman/5.0/en/charset-collation-effect.html

You can try some of the Baltic collations.

Col. Shrapnel
A: 

What MySQL query would return my exact two words ('tēja','vējš')?

SELECT * FROM words WHERE value LIKE '_ēj_' COLLATE utf8_bin;

The utf8_bin collation is not just diacritic-sensitive, but also case-sensitive. If you want to match only the letter with the diacritic and you don't care about upper/lower case, you would have to find a utf8_..._ci collation that doesn't treat e and ē as the same letter.
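
A rough sketch of what that case sensitivity means in practice. The upper-case row is a hypothetical addition made only for this illustration, and the connection character set is assumed to be utf8, as in the query above.

```sql
-- Hypothetical extra row, added only to illustrate case sensitivity
INSERT INTO words (value) VALUES ('TĒJA');

-- Under utf8_bin, 'Ē' and 'ē' are different characters,
-- so the upper-case row should not match this pattern
SELECT * FROM words WHERE value LIKE '_ēj_' COLLATE utf8_bin;

-- Under the column's own utf8_unicode_ci collation, case and the macron
-- are both ignored, so 'TĒJA' (and 'feja') should come back as well
SELECT * FROM words WHERE value LIKE '_ēj_';
```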

I can't immediately see one (there are plenty that don't collate ē at all, which would be okay if you only need case-sensitive matching on the non-diacritical letters). Interesting that the Latvian collation treats macron-letters as the same as plain letters, which you don't want (it knows š is different from s).
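
To hunt for a suitable collation, something along these lines may help. It is only a sketch: the `_utf8` introducers are there so the literals are read as utf8 regardless of the connection settings, and the column aliases are invented for the example. Going by the observation above, utf8_latvian_ci would be expected to give 1 for the first comparison and 0 for the second.

```sql
-- List the collations available for utf8
SHOW COLLATION LIKE 'utf8%';

-- Probe a candidate collation: does it treat these letter pairs as equal?
SELECT _utf8'e' = _utf8'ē' COLLATE utf8_latvian_ci AS e_equals_e_macron,
       _utf8's' = _utf8'š' COLLATE utf8_latvian_ci AS s_equals_s_caron;
```

Swapping in another collation name from the list lets you test each candidate the same way.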

Anyway, whatever collation you end up with, you will want to put your tables in that collation rather than manually specifying it in a query, so that comparisons can be properly indexed.
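
As a minimal sketch, assuming you settle on utf8_bin, the existing table could be converted like this:

```sql
-- Convert the existing columns and the table default to utf8_bin
ALTER TABLE words CONVERT TO CHARACTER SET utf8 COLLATE utf8_bin;

-- With the column itself in utf8_bin, no COLLATE clause is needed in the query
SELECT * FROM words WHERE value LIKE '_ēj_';
```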

bobince
Thank you, I did exactly as you said and changed the table to: CHARACTER SET utf8 COLLATE utf8_bin. I also expect to use some Cyrillic symbols, so I'll stick with UTF-8.
oskarae