tags:

views:

541

answers:

3

I have a column containing the list of names. I need to select only the first (unique) letters from the name. For non-utf-8 characters the following query works pretty well:

SELECT DISTINCT LEFT(T1.Name, 1) AS firstLetter

However, when the Name starts with UTF-8 encoded character this returns: �-sign. I suppose it's only the first character of a UTF-8 string.

Thanks in advance.

The question is how do I properly select complete first character from the column above.

P.S.: Table character set is utf8 and collation set to utf8_bin, the fields character set is utf8 and collation set to utf8_turkish_ci.

A: 

According to the documentation, the substring function is multi-byte safe. I tried with a russion database. This should do the trick:

SELECT DISTINCT SUBSTRING(T1.Name, 1, 1) AS firstLetter FROM T1
Daniel Schneller
Unfortunately, it didn't work out :(
turezky
What do you issue the select with? The command line client usually does not behave that well with multi-byte, as it depends on the underlying terminal. Try MySQL Query Browser or something like that.
Daniel Schneller
A: 

The ORD function will return the code for the left most character both if it is multibyte or basic ASCII. You can do something like:

SELECT DISTINCT ORD(T1.Name) AS firstCode

To get the characters back, you can then use the CHAR function specifying the UTF-8 character set to get something like:

SELECT DISTINCT CHAR(ORD(T1.NAME) USING utf8) as firstLetter
Il-Bhima
+5  A: 

LEFT(str, 1) is supposed to the leftmost character, not the leftmost byte. This means the query is doing what you want, even if the first character is a multibyte character.

I'm guessing the � sign emerges later, due to a connection/encoding/font/rendering problem. Try

SELECT LENGTH(LEFT(T1.Name, 1)) AS charLength

LENGTH returns how many bytes a string takes up, so if this query gives you any results of 2 or more, this means that LEFT() is indeed returning multibyte characters and your problem lies beyond the query itself.

If you are executing the query at the command line, maybe your terminal cannot render the characters, or otherwise they are getting mangled somewhere else. If you are using a scripting language, try to use that language's string length, and ord() functions, to help find out what's going on.

EDIT: Since you are using PHP, try this:

//Store a character returned from the database in $unicodechar
$unicodechar = $row[0];

//Now print out the value of each byte in the character
for($i = 0; $i < strlen($unicodechar); $i++)
{
    echo '0x' . dechex(ord($char[$i])) . ' ';
}
echo '\n';

If for example the result is this character then you should get "0xC4 0x9E". If you do indeed get this kind of thing, then PHP is getting the multibyte character properly, and the problem is either in the encoding of the web page itself (see this W3C page) or the browser/font cannot render that particular character.

Artelius
Right, this is what I suspect as well. See the comment in my answer. I tried LEFT, too, in the meantime and it is also multi-byte aware.
Daniel Schneller
Indeed, it returned 2 as the length of the char. Thanks for the heads up!//I am using PHP to view the results.
turezky