ansaurus

Question

PHP and Unicode: Weirdness between Windows and Linux.

Answer 1

+2 A:

function utf8_to_unicode_code($utf8_string)
{
    $expanded = iconv("UTF-8", "UTF-32", $utf8_string);
    return unpack("L*", $expanded);
}

This does two things wrong:

It uses “UTF-32”, which prepends an unwanted BOM to the output, which is why the first value you get is 65279 (0xFEFF, the BOM). You don't want stray BOMs hanging around the place causing trouble.
It unpacks with machine-specific byte order (capital L), which iconv may well not agree with. To be honest I wouldn't have expected it to clash on a Windows box (as i386 is little-endian regardless of OS), but clearly it has, as the values you've got are all what would result from a reversed byte order.
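To see what a byte-order mismatch does to the numbers, here is a small sketch that uses only pack() and unpack(), so it doesn't depend on which iconv build is installed. The ellipsis code point U+2026 is just an illustrative value, and the variable names are mine rather than anything from the question.

```php
<?php
// Illustration only: the same 4 bytes read with the right and the wrong byte order.

$code = 0x2026;                 // HORIZONTAL ELLIPSIS, used as a sample code point

$big_endian    = pack("N", $code);   // bytes 00 00 20 26
$little_endian = pack("V", $code);   // bytes 26 20 00 00

// "V" always reads unsigned 32-bit little-endian, whatever the host CPU is.
$right = unpack("V", $little_endian);
$wrong = unpack("V", $big_endian);   // big-endian bytes misread as little-endian

printf("right: 0x%08X\n", $right[1]);
printf("wrong: 0x%08X\n", $wrong[1]);
```

If the numbers you see look like your expected code points with the bytes mirrored, this kind of mismatch between the encoder and the unpack format is the likely cause.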

Better to state both byte orderings explicitly, and avoid the BOM. Use UCS-4LE as the encoding, and unpack with V*. The same goes for unicode_code_to_utf8.
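A minimal sketch of what that change might look like. The original unicode_code_to_utf8 isn't shown here, so its signature (taking an array of code points) is an assumption on my part, and the loop over pack("V", ...) is just one way to build the byte string.

```php
<?php
// Sketch: explicit little-endian on both sides, and no BOM.

function utf8_to_unicode_code($utf8_string)
{
    // UCS-4LE fixes the byte order to little-endian and adds no BOM.
    $expanded = iconv("UTF-8", "UCS-4LE", $utf8_string);
    // "V*" reads unsigned 32-bit little-endian values on any machine.
    // Note: unpack() returns an array with keys starting at 1.
    return unpack("V*", $expanded);
}

// Assumed signature: takes an array of integer code points.
function unicode_code_to_utf8($codes)
{
    $packed = "";
    foreach ($codes as $code) {
        // "V" writes each code point as 4 little-endian bytes.
        $packed .= pack("V", $code);
    }
    return iconv("UCS-4LE", "UTF-8", $packed);
}
```

For example, utf8_to_unicode_code("h\xC3\xA9") (the UTF-8 bytes for "hé") should give 104 and 233 with no leading 65279, and passing those values back through unicode_code_to_utf8 should return the original bytes.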

Also ignore listing 6. The ellipsis character—like the fi-ligature and others—is a ‘compatibility character’ which we wouldn't use in the modern Unicode-and-OpenType world. It's up to the font to provide contextual alternatives for fi or ... if it wants to, instead of requiring us to mangle the text.

bobince 2010-10-04 12:28:26

+1 for catching the reverse byte order value. I was still staring and trying to figure out where those numbers came from.

steven_desu 2010-10-04 13:13:49

Big green tick for you, @bobince. Thanks very much.

boost 2010-10-07 04:12:17
