I currently have no clue how to sort an array that contains UTF-8 encoded strings in PHP. The array comes from an LDAP server, so sorting via a database (which would be no problem) is not an option. The following does not work on my Windows development machine (although I'd think that this should at least be a possible solution):

$array=array('Birnen', 'Äpfel', 'Ungetüme', 'Apfel', 'Ungetiere', 'Österreich');
$oldLocal=setlocale(LC_COLLATE, "0");
var_dump(setlocale(LC_COLLATE, 'German_Germany.65001'));
usort($array, 'strcoll');
var_dump(setlocale(LC_COLLATE, $oldLocal));
var_dump($array);

The output is:

string(20) "German_Germany.65001"
string(1) "C"
array(6) {
  [0]=>
  string(6) "Birnen"
  [1]=>
  string(9) "Ungetiere"
  [2]=>
  string(6) "Äpfel"
  [3]=>
  string(5) "Apfel"
  [4]=>
  string(9) "Ungetüme"
  [5]=>
  string(11) "Österreich"
}

This is complete nonsense. Using 1252 as the codepage for setlocale() gives a different output, but it is still plainly wrong:

string(19) "German_Germany.1252"
string(1) "C"
array(6) {
  [0]=>
  string(11) "Österreich"
  [1]=>
  string(6) "Äpfel"
  [2]=>
  string(5) "Apfel"
  [3]=>
  string(6) "Birnen"
  [4]=>
  string(9) "Ungetüme"
  [5]=>
  string(9) "Ungetiere"
}

Is there a way to sort an array of UTF-8 strings in a locale-aware manner?

I just noticed that this seems to be a PHP-on-Windows problem, as the same snippet with de_DE.utf8 as the locale works on a Linux machine. Nevertheless, a solution for this Windows-specific problem would be nice...

+3  A: 

This is a very complex issue, since UTF-8 encoded data can contain any Unicode character (i.e. characters from many 8-bit encodings which collate differently in different locales).

Perhaps you could convert your UTF-8 data into Unicode (I'm not familiar with PHP's Unicode functions, sorry), normalize it into NFD or NFKD, and then sort by code point. That might give a collation that makes sense to you (i.e. "A" before "Ä").
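A minimal sketch of that normalize-then-compare idea, assuming the intl extension is loaded (it provides the Normalizer class). sortByNfdCodePoints() is a hypothetical helper name used only for illustration, not a PHP built-in:

```php
<?php
// Assumption: the intl extension is installed (it provides Normalizer).
// sortByNfdCodePoints() is a hypothetical helper, not a PHP built-in.
function sortByNfdCodePoints(array $strings)
{
    usort($strings, function ($a, $b) {
        // NFD decomposes "Ä" into "A" followed by U+0308 (combining diaeresis),
        // so the base letter is compared before the accent.
        $na = Normalizer::normalize($a, Normalizer::FORM_D);
        $nb = Normalizer::normalize($b, Normalizer::FORM_D);
        // Byte-wise comparison of valid UTF-8 yields code point order.
        return strcmp($na, $nb);
    });
    // The original (non-normalized) strings are returned, only reordered.
    return $strings;
}
```

This is not a real collation: every uppercase letter still sorts before every lowercase one, so it only helps with accented characters.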

Check the links I provided.

EDIT: since you mention that your input data are clear (I assume they all fall within the "windows-1252" codepage), you should do the following conversion: UTF-8 → Unicode → Windows-1252, and then sort the Windows-1252 encoded data with the "CP1252" locale selected.
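A rough sketch of that recoding approach, assuming the mbstring extension is loaded and every character in the input exists in Windows-1252. sortUtf8ViaCp1252() is a hypothetical helper name, and the locale name to pass in depends on the platform:

```php
<?php
// Hypothetical helper: sort UTF-8 strings by recoding them to Windows-1252 first.
// Assumes mbstring is loaded and all characters exist in Windows-1252.
function sortUtf8ViaCp1252(array $strings, $locale)
{
    $oldLocale = setlocale(LC_COLLATE, '0');
    if (setlocale(LC_COLLATE, $locale) === false) {
        throw new RuntimeException("Locale not available: $locale");
    }

    // UTF-8 -> Windows-1252, so that every character is a single byte.
    $recoded = array();
    foreach ($strings as $s) {
        $recoded[] = mb_convert_encoding($s, 'Windows-1252', 'UTF-8');
    }

    usort($recoded, 'strcoll');
    setlocale(LC_COLLATE, $oldLocale);

    // Windows-1252 -> UTF-8 again for the caller.
    $result = array();
    foreach ($recoded as $s) {
        $result[] = mb_convert_encoding($s, 'UTF-8', 'Windows-1252');
    }
    return $result;
}
```

On Windows this would be called as sortUtf8ViaCp1252($array, 'German_Germany.1252'). Characters outside Windows-1252 would be lost in the conversion, so this only suits data known to fit that codepage.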

ΤΖΩΤΖΙΟΥ
Thanks for that info - I'll have a look at the links. But I doubt that the effort is worth the result, as I just want to sort a list of country and state names. Perhaps there is a simpler solution.
Stefan Gehrig
Seems to be a reasonable solution... I'll try sorting the converted array. You're right that Windows-1252 should cover all the characters used.
Stefan Gehrig
What do you mean by converting UTF-8 into Unicode? UTF-8 is a variable-length character encoding for Unicode.
grom
I mean converting a byte string of Unicode code points encoded as UTF-8 into the internal representation of a string of Unicode code points, whatever that representation would be in PHP (be it UCS-2 or UCS-4). I am assuming that PHP has such a concept.
ΤΖΩΤΖΙΟΥ
Thank you so much for this, I'll bookmark it as a reference.
Alix Axel
+1  A: 

Using your example with codepage 1252 worked perfectly fine here on my Windows development machine.

$array=array('Birnen', 'Äpfel', 'Ungetüme', 'Apfel', 'Ungetiere', 'Österreich');
$oldLocal=setlocale(LC_COLLATE, "0");
var_dump(setlocale(LC_COLLATE, 'German_Germany.1252'));
usort($array, 'strcoll');
var_dump(setlocale(LC_COLLATE, $oldLocal));
var_dump($array);

...snip...

This was with PHP 5.2.6. btw.


The above example is wrong: it uses ASCII encoding instead of UTF-8. I traced the strcoll() calls, and look what I found:

function traceStrColl($a, $b) {
    $outValue = strcoll($a, $b);
    echo "$a $b $outValue\r\n";
    return $outValue;
}

$array=array('Birnen', 'Äpfel', 'Ungetüme', 'Apfel', 'Ungetiere', 'Österreich');
setlocale(LC_COLLATE, 'German_Germany.65001');
usort($array, 'traceStrColl');
print_r($array);

gives:

Ungetüme Äpfel 2147483647
Ungetüme Birnen 2147483647
Ungetüme Apfel 2147483647
Ungetüme Ungetiere 2147483647
Österreich Ungetüme 2147483647
Äpfel Ungetiere 2147483647
Äpfel Birnen 2147483647
Apfel Äpfel 2147483647
Ungetiere Birnen 2147483647

I did find some bug reports which have been flagged as bogus... I suppose your best bet is to file a bug report, though...
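For what it's worth, 2147483647 is INT_MAX, which the Microsoft C runtime uses as the error return value (_NLSCMPERROR) of strcoll(), so the trace above shows failed comparisons rather than real results. A small sketch of a comparison callback that detects this; checkedStrColl() is a hypothetical name, and falling back to strcmp() is my own assumption about what a sensible default might be:

```php
<?php
// Hypothetical wrapper around strcoll() that notices the CRT error value.
function checkedStrColl($a, $b)
{
    $result = strcoll($a, $b);
    if ($result === 2147483647) {
        // INT_MAX signals that the comparison failed (e.g. an unsupported
        // codepage on Windows); fall back to plain byte order.
        return strcmp($a, $b);
    }
    return $result;
}
```

It can be passed to usort() in place of 'strcoll'. The fallback gives a consistent order, but not a locale-aware one.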

Huppie
Are you sure that the PHP file you used for testing is UTF-8 encoded? If I use ISO-8859-1 encoding for the file itself, I get the same result you posted above.
Stefan Gehrig
I double-checked it with a second file (made sure it was UTF-8 encoded), and now it does indeed seem to replicate your problem. Sorry for the crap in that case.
Huppie
A: 

Your collation needs to match the character set. Since your data is UTF-8 encoded, you should use a UTF-8 collation. It could be named differently on different platforms, but a good guess would be de_DE.utf8.

On UNIX systems, you can get a list of currently installed locales with the command

locale -a
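Since the name differs between platforms, one option is to let setlocale() try several spellings. A minimal sketch; chooseCollationLocale() is a hypothetical helper name, and the candidate names in the usage note are guesses that may not exist on a given system:

```php
<?php
// Hypothetical helper: try several platform-specific names for the same locale.
function chooseCollationLocale(array $candidates)
{
    // setlocale() accepts an array and tries each entry in order.
    // It returns the name that was actually set, or false if none worked.
    return setlocale(LC_COLLATE, $candidates);
}
```

For example, chooseCollationLocale(array('de_DE.utf8', 'de_DE.UTF-8')) would return whichever spelling the system knows, or false if neither is installed. Always check the return value before relying on strcoll().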
troelskn
I'm using a Windows machine for development... The appropriate UTF-8 codepage in Windows is 65001 - that's why my locale should be German_Germany.65001.
Stefan Gehrig
A: 

In the end, this problem cannot be solved in a simple way without recoding the strings (UTF-8 → Windows-1252 or ISO-8859-1), as suggested by ΤΖΩΤΖΙΟΥ, because of an obvious PHP bug discovered by Huppie. To summarize the problem, I created the following code snippet, which clearly demonstrates that the problem lies in the strcoll() function when using the Windows UTF-8 codepage 65001.

function traceStrColl($a, $b) {
    $outValue=strcoll($a, $b);
    echo "$a $b $outValue\r\n";
    return $outValue;
}

$locale=(defined('PHP_OS') && stristr(PHP_OS, 'win')) ? 'German_Germany.65001' : 'de_DE.utf8';

$string="ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÜabcdefghijklmnopqrstuvwxyzäöüß";
$array=array();
for ($i=0; $i<mb_strlen($string, 'UTF-8'); $i++) {
    $array[]=mb_substr($string, $i, 1, 'UTF-8');
}
$oldLocale=setlocale(LC_COLLATE, "0");
var_dump(setlocale(LC_COLLATE, $locale));
usort($array, 'traceStrColl');
setlocale(LC_COLLATE, $oldLocale);
var_dump($array);

The result is:

string(20) "German_Germany.65001"
a B 2147483647
[...]
array(59) {
  [0]=>
  string(1) "c"
  [1]=>
  string(1) "B"
  [2]=>
  string(1) "s"
  [3]=>
  string(1) "C"
  [4]=>
  string(1) "k"
  [5]=>
  string(1) "D"
  [6]=>
  string(2) "ä"
  [7]=>
  string(1) "E"
  [8]=>
  string(1) "g"
  [...]

The same snippet works on a Linux machine without any problems producing the following output:

string(10) "de_DE.utf8"
a B -1
[...]
array(59) {
  [0]=>
  string(1) "a"
  [1]=>
  string(1) "A"
  [2]=>
  string(2) "ä"
  [3]=>
  string(2) "Ä"
  [4]=>
  string(1) "b"
  [5]=>
  string(1) "B"
  [6]=>
  string(1) "c"
  [7]=>
  string(1) "C"
  [...]

The snippet also works when using Windows-1252 or ISO-8859-1 encoded strings (of course, the mb_* encodings and the locale must be changed accordingly).

I filed a bug report on bugs.php.net: Bug #46165 strcoll() does not work with UTF-8 strings on Windows. If you experience the same problem, you can give your feedback to the PHP team on the bug-report page (two other, probably related, bugs have been classified as bogus - I don't think that this bug is bogus ;-).

Thanks to all of you.

Stefan Gehrig
Your bug report got my vote ;-)
Huppie
A: 

Note that the sort order depends on the language. In German, A and Ä can sometimes be sorted as if they were the same letter, and sometimes Ä is sorted as if it were in fact "AE".

In Swedish, however, Ä comes at the end of the alphabet.
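To illustrate the two German conventions, here is a simplified sketch that builds sort keys by hand. germanSortKey() and sortGerman() are hypothetical helpers, the mapping tables are my own approximation rather than a full collation, and the input is assumed to be UTF-8 with precomposed umlauts:

```php
<?php
// Hypothetical, simplified sort keys for the two German conventions.
// $phonebook = false: "Ä" is treated like "A" (dictionary style).
// $phonebook = true:  "Ä" is treated like "AE" (phone book style).
function germanSortKey($s, $phonebook)
{
    $map = $phonebook
        ? array('Ä' => 'Ae', 'Ö' => 'Oe', 'Ü' => 'Ue',
                'ä' => 'ae', 'ö' => 'oe', 'ü' => 'ue', 'ß' => 'ss')
        : array('Ä' => 'A', 'Ö' => 'O', 'Ü' => 'U',
                'ä' => 'a', 'ö' => 'o', 'ü' => 'u', 'ß' => 'ss');
    // After the replacement the key is plain ASCII, so strtolower() is safe.
    return strtolower(strtr($s, $map));
}

function sortGerman(array $strings, $phonebook = false)
{
    usort($strings, function ($a, $b) use ($phonebook) {
        return strcmp(germanSortKey($a, $phonebook), germanSortKey($b, $phonebook));
    });
    return $strings;
}
```

With the words Adler, Affe, Ärger and Ast, the dictionary style puts Ärger after Affe, while the phone book style puts it before Affe. Swedish ordering is not covered here; it would need Ä placed after Z instead.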

Carl

Carl Seleborg
You're right - this property is respected when using the correct locale and strcoll() for sorting. The problem here is that on Windows, strcoll() seems to have a problem when the input strings are UTF-8 encoded.
Stefan Gehrig
+1  A: 

Update on this issue:

Even though the discussion around this problem suggested that we might have discovered a PHP bug in strcoll() and/or setlocale(), this is clearly not the case. The problem is rather a limitation of the Windows CRT implementation of setlocale() (PHP's setlocale() is just a thin wrapper around the CRT call). The following is a quotation from the MSDN page "setlocale, _wsetlocale":

The set of available languages, country/region codes, and code pages includes all those supported by the Win32 NLS API except code pages that require more than two bytes per character, such as UTF-7 and UTF-8. If you provide a code page like UTF-7 or UTF-8, setlocale will fail, returning NULL. The set of language and country/region codes supported by setlocale is listed in Language and Country/Region Strings.

It is therefore impossible to use locale-aware string operations within PHP on Windows when the strings are multi-byte encoded.

Stefan Gehrig