Look at IBM's Unicode for the working PHP programmer, especially listings 3 and 4.

On Ubuntu Lucid I get the same output from the code as IBM does, viz:

Здравсствуйте
Array
(
    [1] => 65279
    [2] => 1047
    [3] => 1076
    [4] => 1088
    [5] => 1072
    [6] => 1074
    [7] => 1089
    [8] => 1089
    [9] => 1090
    [10] => 1074
    [11] => 1091
    [12] => 1081
    [13] => 1090
    [14] => 1077
)
Здравсствуйте

However, on Windows I get a completely different response.

ðùð┤ÐÇð░ð▓ÐüÐüÐéð▓Ðâð╣ÐéðÁ
Array
(
    [1] => -131072
    [2] => 386138112
    [3] => 872677376
    [4] => 1074003968
    [5] => 805568512
    [6] => 839122944
    [7] => 1090781184
    [8] => 1090781184
    [9] => 1107558400
    [10] => 839122944
    [11] => 1124335616
    [12] => 956563456
    [13] => 1107558400
    [14] => 889454592
)
ðùð┤ÐÇð░ð▓ÐüÐüÐéð▓Ðâð╣ÐéðÁ

Aside from the fact that the Russian characters (which are in UTF-32) don't render in a CMD.EXE shell (because they're in UTF-32 not Windows' own UTF-16), why do the character values differ so significantly?

+2  A: 
function utf8_to_unicode_code($utf8_string)
{
    $expanded = iconv("UTF-8", "UTF-32", $utf8_string);
    return unpack("L*", $expanded);
}

This does two things wrong:

  1. It uses “UTF-32”, which makes iconv prepend an unwanted BOM to the string. That is why your first value is 65279 (0xFEFF, the BOM). You don't want stray BOMs hanging around the place causing trouble.

  2. It unpacks with machine-specific byte order (capital L), which iconv may well not agree with. To be honest, I wouldn't have expected a clash on a Windows box (i386 is little-endian regardless of OS), but clearly there is one: every value you've got is what a reversed byte order would produce. For example, 386138112 is 0x17040000, which is 0x00000417 (1047) with its bytes reversed.
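If you want to check the reversed-byte-order claim yourself, here is a minimal sketch. The helper name swap_byte_order is made up for illustration and is not from the IBM article. It packs a value as little-endian and reads the same four bytes back as big-endian.

```php
<?php
// Hypothetical helper (not from the IBM article): reverse the byte order
// of a 32-bit value to see what it would have been if read the other way.
function swap_byte_order(int $value): int
{
    // "V" writes the low 32 bits as little-endian bytes;
    // "N" reads those same four bytes as an unsigned big-endian integer.
    $unpacked = unpack("N", pack("V", $value));
    return $unpacked[1];
}

echo swap_byte_order(386138112), "\n"; // 1047  (U+0417)
echo swap_byte_order(-131072), "\n";   // 65279 (U+FEFF, the BOM)
```

Running the Windows values through it gives back the Ubuntu ones, which suggests iconv produced big-endian data with a BOM while unpack read it as little-endian.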

Better to state both byte orderings explicitly, and avoid the BOM. Use UCS-4LE as the encoding, and unpack with V*. The same goes for unicode_code_to_utf8.
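Something along these lines should do it. This is a sketch rather than a drop-in replacement: it assumes your iconv build knows the UCS-4LE name, and since I'm going from the function names rather than the article's exact code, the body of unicode_code_to_utf8 here is my own guess at the matching inverse.

```php
<?php
// Sketch of the two helpers with the byte order stated explicitly.
// Assumes the local iconv supports "UCS-4LE".
function utf8_to_unicode_code($utf8_string)
{
    // UCS-4LE is always little-endian and iconv emits no BOM for it.
    $expanded = iconv("UTF-8", "UCS-4LE", $utf8_string);
    // "V*" reads unsigned 32-bit little-endian values on any host.
    return unpack("V*", $expanded);
}

function unicode_code_to_utf8($unicode_list)
{
    // Pack with the same explicit byte order, then convert back.
    $packed = call_user_func_array(
        "pack",
        array_merge(array("V*"), array_values($unicode_list))
    );
    return iconv("UCS-4LE", "UTF-8", $packed);
}
```

With this version the array has no leading 65279 entry, and the values should come out the same on Linux and Windows.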

Also, ignore listing 6. The ellipsis character, like the fi ligature and others, is a ‘compatibility character’ that we wouldn't use in the modern Unicode-and-OpenType world. It's up to the font to provide contextual alternatives for fi or ... if it wants to, instead of requiring us to mangle the text.

bobince
+1 for catching the reverse byte order value. I was still staring and trying to figure out where those numbers came from.
steven_desu
Big green tick for you, @bobince. Thanks very much.
boost