I am writing a string compare function to sort medical terms that often contain special accented characters from many different European languages, and I need to somehow achieve a collation similar to MySQL's latin1_general_ci.

First, I'm doing some basic munging on the strings to remove spaces, quotes, hyphens, parentheses, etc. The problem comes when I pass the strings on to strcoll() using the default locale, because it is not smart enough to consider, for example, an accented e as lexicographically equivalent to a normal e.

I'm wary of using a locale like German or French because it probably will not include all of the special characters I need to consider. Is there a locale that will give me something similar to the latin1_general_ci collation? Or is there maybe another solution?

My naive solution would be to create a large associative array that maps accented letters to their unaccented equivalents, then use it with str_replace(), but that sounds slow, tedious, and error-prone. I would rather use something built into the language if possible.

Also on that note, does strcmp() or strcasecmp() respect the collation of the current locale, or is it just strcoll() that does this?
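
For reference, the associative-array idea from the question could look roughly like the sketch below. It is only an illustration: fold_accents() and compare_terms() are hypothetical names, the table is a small subset, and it assumes the source file and the input strings are both UTF-8. It uses strtr() rather than str_replace() because strtr() accepts the map directly and applies it in a single pass.

```php
<?php
// Hypothetical helper (not from the thread): fold accented letters to ASCII.
// Assumes this file and the input strings are both UTF-8.
function fold_accents(string $s): string
{
    // Illustrative subset only; a real table would need many more entries.
    static $map = [
        'À' => 'A', 'Á' => 'A', 'Â' => 'A', 'Ä' => 'A',
        'È' => 'E', 'É' => 'E', 'Ê' => 'E', 'Ë' => 'E',
        'Ö' => 'O', 'Ü' => 'U', 'Ç' => 'C', 'Ñ' => 'N',
        'à' => 'a', 'á' => 'a', 'â' => 'a', 'ä' => 'a',
        'è' => 'e', 'é' => 'e', 'ê' => 'e', 'ë' => 'e',
        'ö' => 'o', 'ü' => 'u', 'ç' => 'c', 'ñ' => 'n',
        'ß' => 'ss',
    ];
    // strtr() with an array replaces every key with its value in one pass.
    return strtr($s, $map);
}

// Comparison callback for usort(): case-insensitive and accent-insensitive.
function compare_terms(string $a, string $b): int
{
    return strcasecmp(fold_accents($a), fold_accents($b));
}

$terms = ['Zenker', 'Déjérine-Klumpke', 'Dejerine-Sottas'];
usort($terms, 'compare_terms');
print_r($terms);
```

The obvious downside is the one already raised in the question: the table has to be maintained by hand, and any character missing from it falls back to its raw byte order.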

+1  A: 

Maybe this:

setlocale(LC_COLLATE, 'fr_FR.Latin1', 'fr.Latin1', 'fr_FR.Latin-1', 'fr.Latin-1');

strcmp() and strcasecmp() are not localized.

chaos
Is that French? Won't there be characters in, say, German that won't be accounted for in that collation? Or is FR doing something special? I did find an "Indo-European" locale and I am currently testing whether it produces the desired result and accounts for the special characters that I'm after.
Jonathan Collins
It is French, but I'm trying to use the .Latin1 / .Latin-1 modifier to force that charset. What it takes for that to actually be accepted is the mysterious part.
chaos
I just tried this and oddly enough it worked. Setting a locale other than the default 'C' enables strcoll() to sort all of the accented characters, even ones that aren't in that particular language. For example, setting fr_FR makes strcoll() aware of the German ß character. Odd! Thanks for your help.
Jonathan Collins
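
Putting the answer and the follow-up comments together, a minimal sketch of the setlocale() plus strcoll() approach might look like this. sort_terms() is a hypothetical helper, and the locale names passed to it are assumptions: which names are installed, and how they are spelled, varies from system to system. The strings must also be in the encoding the chosen locale expects.

```php
<?php
// Hypothetical helper: sort strings with strcoll() under the first
// locale from $candidates that the system actually provides.
function sort_terms(array $terms, array $candidates): array
{
    // Passing '0' only queries the current setting without changing it.
    $previous = setlocale(LC_COLLATE, '0');

    // setlocale() tries each candidate in order and returns false
    // if none of them is available on this machine.
    $chosen = setlocale(LC_COLLATE, $candidates);
    if ($chosen === false) {
        trigger_error('No candidate locale available; order will follow the current locale', E_USER_NOTICE);
    }

    // strcoll() compares according to the active LC_COLLATE locale.
    usort($terms, 'strcoll');

    // Restore the previous collation so other code is unaffected.
    if ($previous !== false) {
        setlocale(LC_COLLATE, $previous);
    }
    return $terms;
}

// The locale names here are assumptions, not guaranteed to exist.
$sorted = sort_terms(
    ['Zenker', 'Dejerine', 'Alzheimer'],
    ['fr_FR.UTF-8', 'en_US.UTF-8', 'C']
);
print_r($sorted);
```

Checking the return value of setlocale() matters here, because a failed call leaves the 'C' locale in place and strcoll() then falls back to plain byte ordering without any warning.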
A: 

You can also try the iconv functions to help normalize the strings. That'll handle the accented e to normal e situations. See this related question about sorting utf8 strings, too.

Richard Levasseur
How exactly can I use iconv? I tried this: iconv('ISO-8859-1', 'ASCII//TRANSLIT', 'Déjérine-Klumpke') but it turned the accented e characters into question marks.
Jonathan Collins
I figured that out. For some reason to do that transliteration, you need to set a locale other than the default 'C' locale.
Jonathan Collins
Note that it still isn't able to transliterate characters that aren't in that locale. For example, I tried en_US and it still turned the accented e above into a question mark. I believe the correct solution is still to set a locale other than 'C' and then use strcoll(), as it is seemingly able to collate all of the special characters regardless of the chosen locale.
Jonathan Collins
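
The iconv attempt from the comments above, with the locale set first, might be sketched like this. to_ascii() is a hypothetical helper and the locale names are assumptions. Unlike the call quoted in the comments, it takes UTF-8 input, on the assumption that the source file is saved as UTF-8. What //TRANSLIT actually produces depends on the iconv implementation and on the locale, so no particular output is promised for accented input.

```php
<?php
// Hypothetical helper: transliterate UTF-8 text to ASCII with iconv.
// Returns null if iconv reports a failure.
function to_ascii(string $utf8, array $candidates): ?string
{
    // As noted in the comments, transliteration did not work under
    // the default 'C' locale, so pick another LC_CTYPE locale first.
    setlocale(LC_CTYPE, $candidates);

    // @ suppresses the notice iconv raises on characters it cannot convert.
    $out = @iconv('UTF-8', 'ASCII//TRANSLIT', $utf8);
    return $out === false ? null : $out;
}

// Result is implementation-dependent: it may be a folded string,
// contain '?' placeholders, or be null.
var_dump(to_ascii('Déjérine-Klumpke', ['en_US.UTF-8', 'fr_FR.UTF-8']));
```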
Have you tried converting the strings to UTF-8 and setting the locale to a UTF-8 one? In Python, I managed to do what you want using http://docs.python.org/library/unicodedata.html. I -thought- I had seen a PHP library to do the normalization/decomposition, but I can't find it now.
Richard Levasseur
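
For what it's worth, the normalization/decomposition approach mentioned in the last comment can be sketched in PHP with the Normalizer class from the intl extension. This is an assumption about the environment (intl has to be installed), and strip_marks() is a hypothetical name. The idea is to decompose each accented letter into a base letter plus combining marks, then drop the marks.

```php
<?php
// Hypothetical helper: remove accents from UTF-8 text via Unicode
// decomposition. Requires the intl extension (Normalizer class).
function strip_marks(string $utf8): string
{
    // NFD splits 'é' into 'e' followed by a combining acute accent.
    $decomposed = Normalizer::normalize($utf8, Normalizer::FORM_D);
    if ($decomposed === false) {
        return $utf8; // invalid UTF-8 input: leave it unchanged
    }
    // \p{Mn} matches non-spacing combining marks; remove them all.
    return preg_replace('/\p{Mn}+/u', '', $decomposed) ?? $utf8;
}

if (class_exists('Normalizer')) {
    echo strip_marks('Déjérine-Klumpke'), "\n";
}
```

Characters without a canonical decomposition, such as ß, pass through untouched, so they would still need separate handling.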