ansaurus

Question

How would you create a string of all UTF-8 characters? [PHP]

Answer 1

A:

:) Of course the last one wouldn't work: \x escape sequences are only interpreted inside double-quoted strings.

What's wrong with $char = chr(196).chr(128);? I mean, with chr($a).chr($b).
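
To make the quoting point concrete, here is a minimal sketch. It assumes U+0100 (bytes 0xC4 0x80) as the sample character; the variable names are only for illustration.

```php
<?php
// \x escapes are interpreted only inside double-quoted strings.
$double = "\xC4\x80";          // two bytes: 0xC4 0x80 (UTF-8 for U+0100)
$single = '\xC4\x80';          // eight literal characters: \ x C 4 \ x 8 0
$viaChr = chr(196) . chr(128); // the same two bytes, built from integers

var_dump($double === $viaChr); // bool(true)
var_dump(strlen($single));     // int(8)
```

So chr() with the right byte values should give the same result as the double-quoted escape form.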

Col. Shrapnel 2010-05-01 05:15:25

Answer 2

+3 A:

I'm not sure you can do this programmatically, mostly because there is a difference between a Unicode code point and a character. See http://www.unicode.org/standard/where for a few examples of characters that are represented by a combination of code points.

Some code points make no sense on their own and can only be used in conjunction with another character (think accents). See http://www.unicode.org/charts/charindex.html for a list of code points, and look at the section with all the "combining" code points.
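
As a rough illustration of the code point versus character distinction, the sketch below writes "é" two ways. The choice of "é" and the variable names are assumptions for the example; it relies on PCRE's /u modifier so that '.' matches one code point at a time.

```php
<?php
// Two ways to write "é"; both render the same but are different byte strings.
$precomposed = "\xC3\xA9";  // U+00E9 LATIN SMALL LETTER E WITH ACUTE
$combined    = "e\xCC\x81"; // U+0065 followed by U+0301 COMBINING ACUTE ACCENT

// With /u, '.' matches whole code points rather than bytes.
$countPrecomposed = preg_match_all('/./su', $precomposed); // 1 code point
$countCombined    = preg_match_all('/./su', $combined);    // 2 code points

var_dump($precomposed === $combined); // bool(false)
```

A list of every code point would contain U+0301 on its own, which is not something a user would see as a character.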

Also, for use in testing applications, you'd need something else besides a list of possible UTF-8 code points, namely several invalid/malformed UTF-8 sequences that your app needs to be able to recover gracefully from.

For this, take a look at Markus Kuhn's Unicode stress test.
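
A few malformed inputs can also be written by hand. The sketch below is not taken from that stress test; the sample byte strings and the helper name isValidUtf8 are assumptions for illustration. It relies on preg_match() with the /u modifier returning false when the subject is not well-formed UTF-8.

```php
<?php
// Hypothetical helper: true if $bytes is well-formed UTF-8.
// preg_match() with /u returns false (not 0) on invalid UTF-8 input.
function isValidUtf8($bytes) {
    return preg_match('//u', $bytes) === 1;
}

$samples = [
    'valid two-byte'     => "\xC4\x80",     // U+0100
    'lone continuation'  => "\x80",         // continuation byte with no lead byte
    'truncated sequence' => "\xE2\x82",     // first two bytes of U+20AC
    'overlong slash'     => "\xC0\xAF",     // overlong encoding of "/"
    'encoded surrogate'  => "\xED\xA0\x80", // U+D800 written as three bytes
];

foreach ($samples as $label => $bytes) {
    printf("%-20s %s\n", $label, isValidUtf8($bytes) ? 'ok' : 'malformed');
}
```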

Tim Pietzcker 2010-05-01 05:23:44

Answer 3

+1 A:

I quickly translated this from C, but it should give you the idea:

function encodeUTF8( $inValue ) {
    $result = "";
    $extra = 0; // stays 0 if none of the branches below match

    if ( $inValue < 0x00000080 ) {
        $result .= chr( $inValue );
        $extra = 0;
    } else if ( $inValue < 0x00000800 ) {
        $result .= chr( 0x00C0 | ( ( $inValue >> 6 ) & 0x001F ) );
        $extra = 6;
    } else if ( $inValue < 0x00010000 ) {
        $result .= chr( 0x00E0 | ( ( $inValue >> 12 ) & 0x000F ) );
        $extra = 12;
    } else if ( $inValue < 0x00200000 ) {
        $result .= chr( 0x00F0 | ( ( $inValue >> 18 ) & 0x0007 ) );
        $extra = 18;
    } else if ( $inValue < 0x04000000 ) {
        $result .= chr( 0x00F8 | ( ( $inValue >> 24 ) & 0x0003 ) );
        $extra = 24;
    } else if ( $inValue < 0x80000000 ) {
        $result .= chr( 0x00FC | ( ( $inValue >> 30 ) & 0x0001 ) );
        $extra = 30;
    }

    while ( $extra > 0 ) {
        $result .= chr( 0x0080 | ( ( $inValue >> ( $extra -= 6 ) ) & 0x003F ) );
    }

    return $result;
}

The logic is sound, but I am not sure about the PHP, so be sure to check it over. I have never tried to use chr like this.

There are a lot of values that you would not want to encode, like 0xD800-0xDFFF (surrogates), 0xE000-0xF8FF (private use) and 0xFFF0-0xFFFF (specials), and there are several other gaps for combining characters and reserved characters.
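
A filter for the values to skip might look like the sketch below. The function name isEncodableCodePoint is hypothetical, and the rules are one possible choice: it rejects surrogates (0xD800-0xDFFF) and the Unicode noncharacters, but lets private-use and combining code points through, so tighten it if you do not want those.

```php
<?php
// Hypothetical helper: true if $cp is a Unicode scalar value
// that is not one of the designated noncharacters.
function isEncodableCodePoint($cp) {
    if ($cp < 0 || $cp > 0x10FFFF) return false;      // outside the Unicode range
    if ($cp >= 0xD800 && $cp <= 0xDFFF) return false; // surrogates
    if ($cp >= 0xFDD0 && $cp <= 0xFDEF) return false; // noncharacter block
    if (($cp & 0xFFFE) === 0xFFFE) return false;      // U+xFFFE and U+xFFFF in every plane
    return true;
}

var_dump(isEncodableCodePoint(0x0041)); // bool(true)
var_dump(isEncodableCodePoint(0xD800)); // bool(false)
var_dump(isEncodableCodePoint(0xFFFF)); // bool(false)
```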

drawnonward 2010-05-01 06:52:55

Answer 4

+3 A:

You can leverage iconv (or a few other functions) to convert a code point number to a UTF-8 string:

function unichr($i) {
    return iconv('UCS-4LE', 'UTF-8', pack('V', $i));
}

$codeunits= [];
for ($i= 0; $i<0xD800; $i++)
    $codeunits[]= unichr($i);
for ($i= 0xE000; $i<0xFFFF; $i++)
    $codeunits[]= unichr($i);
$all= implode($codeunits);

(I avoided the surrogate range 0xD800–0xDFFF as they aren't valid to put in UTF-8 themselves; that would be “CESU-8”.)
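
The loops above stop below 0xFFFF. If you also want code points beyond the Basic Multilingual Plane, the same iconv technique should carry over. The sketch below assumes the iconv extension is available; codePointToUtf8 and utf8Range are hypothetical names introduced here.

```php
<?php
// Hypothetical helper using the same technique: pack the code point as
// 32-bit little-endian and let iconv transcode it to UTF-8.
function codePointToUtf8($cp) {
    return iconv('UCS-4LE', 'UTF-8', pack('V', $cp));
}

// Concatenate every code point in [$from, $to], skipping the surrogate range.
function utf8Range($from, $to) {
    $out = '';
    for ($cp = $from; $cp <= $to; $cp++) {
        if ($cp >= 0xD800 && $cp <= 0xDFFF) {
            continue; // surrogates are not valid in UTF-8
        }
        $out .= codePointToUtf8($cp);
    }
    return $out;
}

$emoticons = utf8Range(0x1F600, 0x1F64F); // 80 code points, 4 bytes each
// $everything = utf8Range(0, 0x10FFFF);  // over a million iconv calls, so expect it to be slow
```

Building the string in small ranges like this keeps memory and run time manageable when only part of the repertoire is needed for a test.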

bobince 2010-05-01 09:14:28

+1 Bingo. This is the best way, I guess. You take each code point (an integer), pack it into 32 bits little-endian (which amounts to encoding it by hand in UCS-4LE), and ask iconv to convert the encoding to UTF-8. (Did I already say that PHP sucks at Unicode?)

leonbloy 2010-05-01 12:34:42

I'm not sure. I can say “PHP sucks at Unicode” for you just in case you didn't, if that'd help.

bobince 2010-05-01 13:43:36

Awesome! I now have a useful list of UTF-8 characters to run through regex tests.

Xeoncross 2010-05-01 16:41:33
