Is it possible to sort an array with Unicode / UTF-8 characters in PHP using a natural order algorithm? For example (this array is already in the correct order):

$array = array
(
    0 => 'Agile',
    1 => 'Ágile',
    2 => 'Àgile',
    3 => 'Âgile',
    4 => 'Ägile',
    5 => 'Ãgile',
    6 => 'Test',
);

If I try with asort($array) I get the following result:

Array
(
    [0] => Agile
    [6] => Test
    [2] => Àgile
    [1] => Ágile
    [3] => Âgile
    [5] => Ãgile
    [4] => Ägile
)

And using natsort($array):

Array
(
    [2] => Àgile
    [1] => Ágile
    [3] => Âgile
    [5] => Ãgile
    [4] => Ägile
    [0] => Agile
    [6] => Test
)

How can I implement a function that returns the values in the correct order (0, 1, 2, 3, 4, 5, 6) under PHP 5? All the multibyte string functions (mbstring, iconv, ...) are available on my system.

EDIT: I want to natsort() the values, not the keys. The only reason I'm explicitly defining the keys (and using asort() instead of sort()) is to make it easier to see where the sorting of the Unicode values went wrong.

A: 
natsort($array);
$array = array_values($array);
JW
good one. Got my vote.
Babiker
The keys in my example are not the problem, they are only there to help sorting the unicode values.
Alix Axel
+11  A: 

The question is not as easy to answer as it seems at first glance. This is one of the areas where PHP's lack of Unicode support hits you with full force.

First of all, natsort(), as suggested by other posters, has nothing to do with sorting arrays of the type you want to sort. What you're looking for is a locale-aware sorting mechanism, because the order of strings with extended characters always depends on the language in use. Take German, for example: A and Ä can sometimes be sorted as if they were the same letter (DIN 5007/1), and sometimes Ä is sorted as if it were in fact "AE" (DIN 5007/2). In Swedish, in contrast, Ä comes at the end of the alphabet.

If you don't use Windows, you're in luck, as PHP provides some functions to do exactly this. Using a combination of setlocale(), usort(), strcoll() and the correct UTF-8 locale for your language, you get something like this:

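$array = array('Àgile', 'Ágile', 'Âgile', 'Ãgile', 'Ägile', 'Agile', 'Test');
// Passing '0' returns the current locale without changing it.
$oldLocale = setlocale(LC_COLLATE, '0');
setlocale(LC_COLLATE, '<<your_RFC1766_language_code>>.utf8');
usort($array, 'strcoll');
// Restore the previous collation locale.
setlocale(LC_COLLATE, $oldLocale);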

Please note that you must use the UTF-8 variant of the locale in order to sort UTF-8 strings. The example above resets the locale to its original value because a locale set with setlocale() can cause side effects in other running PHP scripts; please see the PHP manual for more details.
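
Since setlocale() returns false when the requested locale is not installed, it can be worth wrapping the steps above in a small helper that checks for this. The following is only a minimal sketch: sortWithLocale() is a hypothetical name, not a PHP built-in, and the locale names in the usage line are assumptions that depend on what your system provides.

```php
// Hypothetical helper (not part of PHP): sorts $values with strcoll()
// under the first available locale from $candidateLocales.
// Returns the sorted array, or null if none of the locales is installed.
function sortWithLocale(array $values, array $candidateLocales)
{
    // Passing '0' only reads the current setting; it changes nothing.
    $previous = setlocale(LC_COLLATE, '0');

    // setlocale() tries each candidate in turn and returns false
    // when none of them can be set; the locale is then left unchanged.
    if (setlocale(LC_COLLATE, $candidateLocales) === false) {
        return null;
    }

    usort($values, 'strcoll');

    // Restore the collation locale that was active before the call.
    setlocale(LC_COLLATE, $previous);

    return $values;
}

// Assumed locale names; `locale -a` lists the ones actually installed.
$sorted = sortWithLocale(
    array('Àgile', 'Ágile', 'Agile', 'Test'),
    array('de_DE.utf8', 'de_DE.UTF-8')
);
```

A null result means none of the candidate locales could be set, so the caller can fall back to another strategy instead of silently getting a byte-order sort.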

If you are on a Windows machine, there is currently no solution to this problem, and I assume there won't be one before PHP 6. Please see my own question on SO about this specific problem.
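
For readers on PHP 5.3 or later, the class-based sorting mentioned in the comments is the Collator class from the intl extension, which does not rely on setlocale() at all. What follows is a hedged sketch rather than a tested recommendation: it assumes intl is loaded, sortWithCollator() is a hypothetical helper name, and 'de_DE' is just an example locale.

```php
// Sketch only: requires the intl extension.
// sortWithCollator() is a hypothetical helper, not part of PHP.
function sortWithCollator(array $values, $locale)
{
    if (!class_exists('Collator')) {
        return null; // intl is not loaded
    }

    $collator = new Collator($locale);

    // Collator::sort() sorts the array in place (keys are reindexed)
    // and returns false on failure.
    if (!$collator->sort($values)) {
        return null;
    }

    return $values;
}

// 'de_DE' is an assumed example locale.
$sorted = sortWithCollator(array('Àgile', 'Test', 'Ägile', 'Agile'), 'de_DE');
```

If the original keys need to be kept, as in the question, Collator::asort() can be used in place of Collator::sort().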

Stefan Gehrig
Great insight, I'm developing on Windows but this will run on *nix machines. If I'm not mistaken, PHP 5.3 will support this kind of sorting through some kind of class; however, I want to refrain from relying on setlocale() for mainly two reasons: 1) it's unpredictable (it depends on the locales the OS has available) and 2) it's not thread-safe and may cause unexpected behavior on the server.
Alix Axel
Sorting using a multibyte version of the ord() function gives me exactly the same results as a simple sort(). =(
Alix Axel
Sorry but I cannot follow your second comment...
Stefan Gehrig
Regarding your first comment: you're absolutely right that the solution presented in my answer is not the one you might expect, as it's neither portable nor free of side effects. But it's the only one right now, besides implementing your own sorting on a character and byte level using, for example, ext/mbstring.
Stefan Gehrig
Regarding my second comment, I used the mbstring extension to code a multibyte equivalent of the original PHP ord() function, but the results it gave me were the same as those of the sort() function.
Alix Axel
Can MySQL be used to provide a steady workaround to this problem?
Alix Axel
Yes, sorting the data on the MySQL server would be a feasible workaround. MySQL does not suffer from those limitations. You can control the sort order by choosing the right collation for your data.
Stefan Gehrig
@Stefan Gehrig: http://stackoverflow.com/questions/2146420/sorting-problem-overlapping-array-keys
Alix Axel
A: 

Nailed it!

$array = array('Ägile', 'Ãgile', 'Test', 'カタカナ', 'かたかな', 'Ágile', 'Àgile', 'Âgile', 'Agile');

function Unaccent($string)
{
    return html_entity_decode(preg_replace('~&([a-z]{1,2})(?:acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml);~i', '$1', htmlentities($string, ENT_QUOTES, 'UTF-8')), ENT_QUOTES, 'UTF-8');
}

array_multisort(array_map('Unaccent', $array), $array);

Output:

Array
(
    [0] => Agile
    [1] => Àgile
    [2] => Ágile
    [3] => Âgile
    [4] => Ãgile
    [5] => Ägile
    [6] => Test
    [7] => かたかな
    [8] => カタカナ
)
Alix Axel
+1  A: 

Thanks, but that is not working with the Vietnamese language. Example array: $array = array('tôi', 'dinh', 'dũng', 'đức', 'đăng', 'diep', 'điệp', 'ân', 'anh'); I need help. Thanks in advance.

mienh