Given the Unicode code point of a character (as a decimal or hex number), how can a CLI PHP script output that character? The chr() function doesn't seem to generate the proper output. Here's my test script, using the section sign U+00A7 (A7 in hex, 167 in decimal, which should be encoded as C2 A7 in UTF-8) as a test:

<?php
echo "Section sign: ".chr(167)."\n"; // Using CHR function
echo "Section sign: ".chr(0xA7)."\n";
echo "Section sign: ".pack("c", 0xA7)."\n"; // Using pack function?
echo "Section sign: §\n"; // Copy and paste of the symbol into source code

The output I get (via an SSH session to the server) is:

Section sign: ?
Section sign: ?
Section sign: ?
Section sign: §

So that proves that the terminal font I'm using has the section sign in it and that the SSH connection is passing it along successfully, but chr() isn't producing it correctly when given the code number.

If all I've got is the code number and not a copy/paste option, what options do I have?

+2  A: 

Apart from the mb_ functions and iconv, PHP has no knowledge of Unicode, so you'll have to UTF-8 encode the character yourself.

For that, Wikipedia has an excellent overview on how UTF-8 is structured. Here's a quick, dirty and untested function based on that article:

function codepointToUtf8($codepoint)
{
    if ($codepoint <= 0x7F) // U+0000-U+007F - 1 byte
        return chr($codepoint);
    if ($codepoint <= 0x7FF) // U+0080-U+07FF - 2 bytes
        return chr(0xC0 | ($codepoint >> 6)).chr(0x80 | ($codepoint & 0x3F));
    if ($codepoint <= 0xFFFF) // U+0800-U+FFFF - 3 bytes
        return chr(0xE0 | ($codepoint >> 12)).chr(0x80 | (($codepoint >> 6) & 0x3F)).chr(0x80 | ($codepoint & 0x3F));
    // U+010000-U+10FFFF - 4 bytes
    return chr(0xF0 | ($codepoint >> 18)).chr(0x80 | (($codepoint >> 12) & 0x3F)).chr(0x80 | (($codepoint >> 6) & 0x3F)).chr(0x80 | ($codepoint & 0x3F));
}
Michael Madsen
An excellent way to do that would just be to write the PHP document in UTF-8 in the first place.
Billy ONeal
Excellent answer, Michael; and thanks for the function! I created something similar for myself using the pack function rather than chr repeatedly. Billy, for this specific purpose, I need to go from a codepoint to a character; writing the PHP script in UTF-8 with the characters already embedded isn't an option.
MidnightLightning
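For reference, a pack-based variant along the lines mentioned in the comment above might look like the sketch below. This is not the commenter's actual code: the function name codepointToUtf8Pack and the range check are assumptions made for illustration. It computes the same byte values as the chr() version and lets pack('C*', ...) assemble the string.

```php
<?php
// Hypothetical helper (not the commenter's actual code): encode one
// Unicode code point as UTF-8 by computing the byte values, then
// letting pack('C*', ...) turn them into a binary string.
function codepointToUtf8Pack(int $cp): string
{
    // Reject values outside Unicode and the UTF-16 surrogate range.
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        throw new InvalidArgumentException("Not a Unicode scalar value: $cp");
    }
    if ($cp <= 0x7F) {          // U+0000-U+007F - 1 byte
        $bytes = [$cp];
    } elseif ($cp <= 0x7FF) {   // U+0080-U+07FF - 2 bytes
        $bytes = [0xC0 | ($cp >> 6), 0x80 | ($cp & 0x3F)];
    } elseif ($cp <= 0xFFFF) {  // U+0800-U+FFFF - 3 bytes
        $bytes = [0xE0 | ($cp >> 12), 0x80 | (($cp >> 6) & 0x3F), 0x80 | ($cp & 0x3F)];
    } else {                    // U+010000-U+10FFFF - 4 bytes
        $bytes = [0xF0 | ($cp >> 18), 0x80 | (($cp >> 12) & 0x3F),
                  0x80 | (($cp >> 6) & 0x3F), 0x80 | ($cp & 0x3F)];
    }
    return pack('C*', ...$bytes);
}

echo "Section sign: " . codepointToUtf8Pack(0xA7) . "\n";
echo bin2hex(codepointToUtf8Pack(0xA7)) . "\n"; // c2a7
```

The argument unpacking (...$bytes) needs PHP 5.6 or later and the type declarations need PHP 7; on older versions call_user_func_array('pack', ...) would be one way around that.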
+1  A: 

Don't forget that UTF-8 is a variable-length encoding.

§ is not among the first 128 (ASCII) characters, which are the only ones UTF-8 encodes in a single byte. In UTF-8, § is a multi-byte character: a C2 byte, which marks the start of a two-byte sequence, followed by A7. This should work:

echo "Section sign: ".chr(0xC2).chr(0xA7)."\n"; 
Pekka
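To see where C2 and A7 come from, here is a small worked sketch for U+00A7 only. The variable names are illustrative and not from the answers above; it just applies the two-byte layout 110xxxxx 10xxxxxx to the code point.

```php
<?php
// Worked example for U+00A7 only; variable names are illustrative.
$cp = 0xA7;                     // binary 10100111

// Two-byte UTF-8 layout: 110xxxxx 10xxxxxx
$lead = 0xC0 | ($cp >> 6);      // bits above the low six -> 0xC2
$cont = 0x80 | ($cp & 0x3F);    // low six bits           -> 0xA7

printf("%02X %02X\n", $lead, $cont); // prints "C2 A7"

// chr(167) on its own emits just the byte A7: a continuation byte
// with no lead byte, which is not valid UTF-8 by itself.
echo bin2hex(chr(167)), "\n";        // a7
echo "Section sign: " . chr($lead) . chr($cont) . "\n";
```

That lone A7 byte is presumably why the terminal in the question shows a ? for the chr() and pack() lines.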
A: 
chr

(PHP 4, PHP 5)

chr — Return a specific character

 Description

string chr ( int $ascii )
Returns a one-character string containing the character specified by ascii.

This function complements ord().

The important word there is "ascii" :) Try this one:

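function uchr ($codes) {
    // Accept either a list of code points or a single array of them
    if (is_scalar($codes)) $codes = func_get_args();
    $str = '';
    // Decode each code point as a numeric HTML entity into UTF-8
    foreach ($codes as $code) $str .= html_entity_decode('&#'.$code.';', ENT_NOQUOTES, 'UTF-8');
    return $str;
}
echo "Section sign: ".uchr(167)."\n"; // Using uchr() instead of chr()
echo "Section sign: ".uchr(0xA7)."\n";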
Joe Hopfgartner
+2  A: 

Assuming you have iconv, here's a simple way that doesn't involve implementing UTF-8 yourself:

function unichr($i) {
    return iconv('UCS-4LE', 'UTF-8', pack('V', $i));
}
bobince
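A slightly more defensive sketch of the same iconv idea is below, assuming the iconv extension is loaded. The wrapper name unichrChecked and the error handling are assumptions added for illustration, not part of the answer above. UCS-4LE is used because pack('V') produces a 32-bit little-endian integer.

```php
<?php
// Hypothetical wrapper around the iconv approach; assumes the iconv
// extension is available. pack('V') yields 32-bit little-endian,
// which matches the UCS-4LE source encoding.
function unichrChecked(int $cp): string
{
    // Reject values outside Unicode and the UTF-16 surrogate range.
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        throw new InvalidArgumentException("Not a Unicode scalar value: $cp");
    }
    $utf8 = iconv('UCS-4LE', 'UTF-8', pack('V', $cp));
    if ($utf8 === false) {
        throw new RuntimeException("iconv could not convert code point $cp");
    }
    return $utf8;
}

echo "Section sign: " . unichrChecked(167) . "\n";
echo "Section sign: " . unichrChecked(0xA7) . "\n";
```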