ansaurus

Question

Answer 1

+3 A:

This is a very complex issue, since UTF-8 encoded data can contain any Unicode character (i.e. characters from many 8-bit encodings which collate differently in different locales).

Perhaps if you converted your UTF-8 data into Unicode (I'm not familiar with PHP's Unicode functions, sorry), normalized it to NFD or NFKD, and then sorted on code points, you might get a collation that makes sense to you (i.e. "A" before "Ä").
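A minimal sketch of that idea, assuming the intl extension is installed (it provides the Normalizer class). The function name sortByNfdCodePoints is made up for illustration. Byte-wise comparison of valid UTF-8 is equivalent to comparing code points, so strcmp() on the NFD forms is enough:

```php
<?php
// Sketch only: sortByNfdCodePoints() is a hypothetical helper, not part of PHP.
function sortByNfdCodePoints(array $strings): array
{
    if (!class_exists('Normalizer')) {
        throw new RuntimeException('The intl extension is required.');
    }
    usort($strings, function (string $a, string $b): int {
        // NFD decomposes "Ä" into "A" followed by U+0308 (combining diaeresis).
        $na = Normalizer::normalize($a, Normalizer::FORM_D);
        $nb = Normalizer::normalize($b, Normalizer::FORM_D);
        // Comparing valid UTF-8 byte by byte orders strings by code point.
        return strcmp($na, $nb);
    });
    return $strings;
}

print_r(sortByNfdCodePoints(
    ['Birnen', 'Äpfel', 'Ungetüme', 'Apfel', 'Ungetiere', 'Österreich']
));
```

This is not a real locale collation: all uppercase letters still sort before all lowercase ones, and language-specific rules are ignored. It only ensures that an accented letter sorts next to its base letter.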

Check the links I provided.

EDIT: since you mention that your input data are clear (I assume they all fall within the "windows-1252" codepage), you should do the following conversion: UTF-8 → Unicode → Windows-1252, and then sort the Windows-1252 encoded data with the "CP1252" locale selected.
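A rough sketch of that round trip, assuming the mbstring extension is available. The helper name sortViaCp1252 and the locale names passed to it are assumptions; which names work depends entirely on the platform:

```php
<?php
// Sketch only: sortViaCp1252() is a hypothetical helper.
// $usedLocale receives the locale that was activated, or false if none was.
function sortViaCp1252(array $utf8Strings, array $localeCandidates, &$usedLocale = null): array
{
    // Recode from UTF-8 to the single-byte Windows-1252 encoding.
    $recoded = array_map(function (string $s): string {
        return mb_convert_encoding($s, 'Windows-1252', 'UTF-8');
    }, $utf8Strings);

    $oldLocale = setlocale(LC_COLLATE, '0');   // "0" only queries the current setting
    // setlocale() tries each candidate in turn and returns false if none fits.
    $usedLocale = setlocale(LC_COLLATE, $localeCandidates);

    // If no candidate was available, strcoll() uses whatever collation is active.
    usort($recoded, 'strcoll');
    setlocale(LC_COLLATE, $oldLocale);

    // Convert the sorted strings back to UTF-8.
    return array_map(function (string $s): string {
        return mb_convert_encoding($s, 'UTF-8', 'Windows-1252');
    }, $recoded);
}

$sorted = sortViaCp1252(
    ['Birnen', 'Äpfel', 'Ungetüme', 'Apfel', 'Ungetiere', 'Österreich'],
    ['German_Germany.1252', 'de_DE.ISO-8859-1'],   // assumed names
    $used
);
var_dump($used);
print_r($sorted);
```

Check $used before trusting the order: if it is false, no German single-byte locale was found and the result is not a German collation. Characters that do not exist in Windows-1252 cannot survive the conversion, so this only suits data known to fit that codepage.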

ΤΖΩΤΖΙΟΥ 2008-09-23 11:12:10

Thanks for that info - I'll have a look at the links. But I doubt that the effort is worth the result, as I just want to sort a list of country and state names. Perhaps there is a simpler solution.

Stefan Gehrig 2008-09-23 11:35:46

Seems to be a reasonable solution... I'll try sorting the converted array. You're right that Windows-1252 should cover all the characters used.

Stefan Gehrig 2008-09-23 12:20:01

What do you mean by "convert UTF-8 into Unicode"? UTF-8 is a variable-length character encoding for Unicode.

grom 2008-09-23 12:46:42

I mean converting a byte string of Unicode code points encoded as UTF-8 into the internal representation of a string of Unicode code points, whatever that representation would be in PHP (be it UCS-2 or UCS-4). I am assuming that PHP has such a concept.

ΤΖΩΤΖΙΟΥ 2008-09-23 19:41:31

Thank you so much for this, I'll bookmark it as a reference.

Alix Axel 2009-08-10 23:11:23

Answer 2

+1 A:

Using your example with codepage 1252 worked perfectly fine here on my windows development machine.

$array=array('Birnen', 'Äpfel', 'Ungetüme', 'Apfel', 'Ungetiere', 'Österreich');
$oldLocal=setlocale(LC_COLLATE, "0");
var_dump(setlocale(LC_COLLATE, 'German_Germany.1252'));
usort($array, 'strcoll');
var_dump(setlocale(LC_COLLATE, $oldLocal));
var_dump($array);

...snip...

This was with PHP 5.2.6, btw.

The above example is wrong: it uses ASCII encoding instead of UTF-8. I traced the strcoll() calls, and look what I found:

function traceStrColl($a, $b) {
    $outValue = strcoll($a, $b);
    echo "$a $b $outValue\r\n";
    return $outValue;
}

$array=array('Birnen', 'Äpfel', 'Ungetüme', 'Apfel', 'Ungetiere', 'Österreich');
setlocale(LC_COLLATE, 'German_Germany.65001');
usort($array, 'traceStrColl');
print_r($array);

gives:

Ungetüme Äpfel 2147483647
Ungetüme Birnen 2147483647
Ungetüme Apfel 2147483647
Ungetüme Ungetiere 2147483647
Österreich Ungetüme 2147483647
Äpfel Ungetiere 2147483647
Äpfel Birnen 2147483647
Apfel Äpfel 2147483647
Ungetiere Birnen 2147483647

I did find some bug reports which have been flagged as bogus... Your best bet is filing a bug report, I suppose...

Huppie 2008-09-23 11:21:18

Are you sure that your PHP file used for testing is UTF-8 encoded? If I use ISO-8859-1 encoding for the file itself, I get the same result you posted above.

Stefan Gehrig 2008-09-23 11:29:56

I double-checked it with a second file (made sure it was UTF-8 encoded), and now it does indeed replicate your problem. Sorry for the crap in that case.

Huppie 2008-09-23 11:32:49

Answer 3

A:

Your collation needs to match the character set. Since your data is UTF-8 encoded, you should use a UTF-8 collation. It could be named differently on different platforms, but a good guess would be de_DE.utf8.

On UNIX systems, you can get a list of currently installed locales with the command

locale -a
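From within PHP, a small probe along these lines can show which UTF-8 locale name, if any, the system accepts. The helper name firstAvailableLocale and the candidate names are assumptions. The sketch relies on setlocale() accepting an array of names and returning false when none of them can be set:

```php
<?php
// Sketch only: firstAvailableLocale() is a hypothetical helper.
// Returns the first candidate the system accepts, or false if none is accepted.
function firstAvailableLocale(array $candidates)
{
    $old = setlocale(LC_COLLATE, '0');           // remember the current collation locale
    $found = setlocale(LC_COLLATE, $candidates); // tries each name in order
    setlocale(LC_COLLATE, $old);                 // restore, this is only a probe
    return $found;
}

// Assumed spellings of the German UTF-8 locale on UNIX-like systems.
$locale = firstAvailableLocale(['de_DE.utf8', 'de_DE.UTF-8']);
if ($locale === false) {
    echo "No German UTF-8 locale is installed\n";
} else {
    echo "Usable collation locale: $locale\n";
}
```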

troelskn 2008-09-23 14:40:05

I'm using a Windows machine for development... The appropriate UTF-8 codepage on Windows is 65001 - that's why my locale should be German_Germany.65001.

Stefan Gehrig 2008-09-23 16:21:13

Answer 4

A:

Ultimately, this problem cannot be solved in a simple way without recoding the strings (UTF-8 → Windows-1252 or ISO-8859-1), as suggested by ΤΖΩΤΖΙΟΥ, because of an apparent PHP bug discovered by Huppie. To summarize the problem, I created the following code snippet, which clearly demonstrates that the problem lies in the strcoll() function when using the Windows UTF-8 codepage 65001.

function traceStrColl($a, $b) {
    $outValue=strcoll($a, $b);
    echo "$a $b $outValue\r\n";
    return $outValue;
}

// Match only a leading "WIN": a substring test would also match "Darwin" (macOS).
$locale=(defined('PHP_OS') && stripos(PHP_OS, 'WIN') === 0) ? 'German_Germany.65001' : 'de_DE.utf8';

$string="ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÜabcdefghijklmnopqrstuvwxyzäöüß";
$array=array();
for ($i=0; $i<mb_strlen($string, 'UTF-8'); $i++) {
    $array[]=mb_substr($string, $i, 1, 'UTF-8');
}
$oldLocale=setlocale(LC_COLLATE, "0");
var_dump(setlocale(LC_COLLATE, $locale));
usort($array, 'traceStrColl');
setlocale(LC_COLLATE, $oldLocale);
var_dump($array);

The result is:

string(20) "German_Germany.65001"
a B 2147483647
[...]
array(59) {
  [0]=>
  string(1) "c"
  [1]=>
  string(1) "B"
  [2]=>
  string(1) "s"
  [3]=>
  string(1) "C"
  [4]=>
  string(1) "k"
  [5]=>
  string(1) "D"
  [6]=>
  string(2) "ä"
  [7]=>
  string(1) "E"
  [8]=>
  string(1) "g"
  [...]

The same snippet works on a Linux machine without any problems producing the following output:

string(10) "de_DE.utf8"
a B -1
[...]
array(59) {
  [0]=>
  string(1) "a"
  [1]=>
  string(1) "A"
  [2]=>
  string(2) "ä"
  [3]=>
  string(2) "Ä"
  [4]=>
  string(1) "b"
  [5]=>
  string(1) "B"
  [6]=>
  string(1) "c"
  [7]=>
  string(1) "C"
  [...]

The snippet also works when using Windows-1252 (or ISO-8859-1) encoded strings (of course, the mb_* encodings and the locale must then be changed accordingly).

I filed a bug report on bugs.php.net: Bug #46165 strcoll() does not work with UTF-8 strings on Windows. If you experience the same problem, you can give your feedback to the PHP team on the bug-report page (two other, probably related, bugs have been classified as bogus - I don't think that this bug is bogus ;-).

Thanks to all of you.

Stefan Gehrig 2008-09-24 07:42:28

Your bug report got my vote ;-)

Huppie 2008-09-24 18:29:38

Answer 5

A:

Note that the sort order depends on the language. In German, A and Ä can sometimes be sorted as if they were the same letter, and sometimes Ä can be sorted as if it were in fact "AE".

In Swedish, however, Ä comes at the end of the alphabet.
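The "AE" variant can be approximated without any locale support by sorting on a transliterated key. This is only a sketch: germanPhonebookKey is a made-up helper, and the ß → ss mapping is my own addition rather than something stated above.

```php
<?php
// Sketch only: germanPhonebookKey() is a hypothetical helper.
// Builds a sort key in which umlauts are expanded (Ä -> AE, and so on).
function germanPhonebookKey(string $s): string
{
    // strtr() with an array replaces whole byte sequences, so it is safe
    // for UTF-8 input as long as this file itself is saved as UTF-8.
    return strtr($s, [
        'Ä' => 'AE', 'Ö' => 'OE', 'Ü' => 'UE',
        'ä' => 'ae', 'ö' => 'oe', 'ü' => 'ue',
        'ß' => 'ss', // assumption, not mentioned in the answer
    ]);
}

$names = ['Birnen', 'Äpfel', 'Ungetüme', 'Apfel', 'Ungetiere', 'Österreich'];
usort($names, function (string $a, string $b): int {
    // The keys are plain ASCII for this input, so a case-insensitive
    // byte comparison is sufficient.
    return strcasecmp(germanPhonebookKey($a), germanPhonebookKey($b));
});
print_r($names);
```

With this rule "Äpfel" is compared as "AEpfel" and therefore lands before "Apfel", which is exactly the kind of language-dependent difference described above.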

Carl

Carl Seleborg 2008-09-24 08:16:04

You're right - this property is respected when using the correct locale and strcoll() for sorting. The problem here is that on Windows, strcoll() seems to have a problem when the input strings are UTF-8 encoded.

Stefan Gehrig 2008-09-24 08:57:12

Answer 6

+1 A:

Update on this issue:

Even though the discussion around this problem suggested that we might have discovered a PHP bug in strcoll() and/or setlocale(), this is clearly not the case. The problem is rather a limitation of the Windows CRT implementation of setlocale() (PHP's setlocale() is just a thin wrapper around the CRT call). The following is a quotation from the MSDN page "setlocale, _wsetlocale":

The set of available languages, country/region codes, and code pages includes all those supported by the Win32 NLS API except code pages that require more than two bytes per character, such as UTF-7 and UTF-8. If you provide a code page like UTF-7 or UTF-8, setlocale will fail, returning NULL. The set of language and country/region codes supported by setlocale is listed in Language and Country/Region Strings.

It is therefore impossible to use locale-aware string operations within PHP on Windows when the strings are multi-byte encoded.

Stefan Gehrig 2008-12-08 09:54:40

How to sort an array of UTF-8 strings?