Hi,

I'm having a lot of trouble with Unicode (UTF-16) values in PHP and XML. I want to read a set of Unicode values from XML and output the correct glyphs to the browser. I get the same problem when I try UTF-8.

This is a simple working example I used for my first test:

$text = "\x00\x41";

$text = mb_convert_encoding($text, "ASCII", "UTF-16");

echo $text;

Output of above code:

A

However, when I try to get the values from XML things stop working.

XML:

<glyphs>
    <code>0041</code>
    <code>0042</code>
    <code>0043</code>
    <code>0044</code>
    <code>0045</code>
    <code>0046</code>
</glyphs>

In PHP I read each value from the XML above, split it into pairs of hex digits, and format the pairs as escape sequences, e.g. \x00\x41.

PHP:

// load xml
$xml = simplexml_load_file('encoding.xml');

if ($xml) {

    // get families
    foreach($xml->children() as $item) {

        $pairs = str_split($item, 2);

        $hex = "\x" . $pairs[0] . "\x" . $pairs[1];

        // check value...
        echo $hex . '<br/>';

        $text = mb_convert_encoding($hex, "ASCII", "UTF-16");

        echo $text;
    }

}
else {
    return 'The input is malformed.';
}

Output in browser:

\x00\x41
????
\x00\x42
????
\x00\x43
????
\x00\x44
????
\x00\x45
????
\x00\x46
????

Question marks should be A, B, C, D, E, F.

What am I doing wrong?

Thanks.

A: 

Are you setting the output correctly in your header?

header('Content-Type: text/html; charset=utf-8');

...and also in the HTML head?

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
Peter Loron
DB's code is converting to UTF-16, so I would specify that as the charset rather than UTF-8.
fsb
Yep, the charset is set.
DB
+1  A: 

"\x00" is hex notation inside a string, which is processed at compile time.
I think that when you write "\x" . "00", the compiler first tries to work out what "\x" means on its own (I have no clue what the result is) and only afterward concatenates the "00", so the result is not what you expect.
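
To make the point about escape processing concrete, here is a minimal sketch. As far as I know, PHP leaves a "\x" that is not followed by hex digits as a literal backslash and "x", so the string built at runtime is eight plain ASCII characters rather than two bytes. The variable names are only for illustration.

```php
<?php
// Escape sequences in a literal are resolved when the script is parsed.
$literal = "\x00\x41";

// Pieces joined at runtime: "\x" on its own is not a complete escape,
// so it stays as a backslash followed by "x".
$pairs  = str_split('0041', 2);
$joined = "\x" . $pairs[0] . "\x" . $pairs[1];

echo strlen($literal), "\n"; // 2
echo strlen($joined), "\n";  // 8
echo $joined, "\n";          // the eight characters \x00\x41
```

This would explain why the browser shows the text \x00\x41 instead of a glyph: mb_convert_encoding is being handed eight ASCII characters, not two UTF-16 bytes.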

Maybe this question can help, although it is in Java -> http://stackoverflow.com/questions/2126378/java-convert-string-uffff-into-char/

EDIT: just following up on the comment. Placing the literal "\x41" in your XML won't help either, because then you are reading a string of 4 characters.
So your problem can be restated as: how to convert a string of hex digits into the single character it represents, using UTF-16. It is the same problem as in the question that I linked above, except that you want to do it in PHP, not Java.
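
One possible way to do that conversion in PHP is sketched below. It assumes each value in the XML is a single big-endian UTF-16 code unit written as four hex digits, and that the mbstring extension is available. The helper name hexCodeToUtf8 is made up for this example.

```php
<?php
// Hypothetical helper: turns a 4-digit hex code such as "0041"
// into the character it names, encoded as UTF-8.
function hexCodeToUtf8(string $code): string
{
    // pack('H*', ...) turns hex digits into raw bytes: "0041" -> "\x00\x41".
    $utf16 = pack('H*', $code);

    // Assumption: the bytes are big-endian UTF-16, so name that explicitly.
    return mb_convert_encoding($utf16, 'UTF-8', 'UTF-16BE');
}

foreach (['0041', '0042', '0043'] as $code) {
    echo hexCodeToUtf8($code); // prints A, then B, then C
}
echo "\n";
```

Converting to UTF-8 rather than ASCII means codes outside the ASCII range (for example 00E9) should survive as well.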

Yoni
I wondered about that too. I tried changing the XML to <code>\x00\x41</code> and removing the string split and concatenation. It didn't work – I get the same output. I'll look into it some more.
DB
\x00 in the raw XML gets you a 4-character string in memory. You need to parse it and somehow convert it to a single character; that's why I referred you to the other SO question. I know how to do it in Java, not in PHP.
Yoni
A: 

For each test character, your test program writes a few ASCII characters, followed by '<br/>' in ASCII, followed by two bytes of UTF-16. This won't work. A file should use only one character encoding at a time.

First, rewrite your script to convert all the output to UTF-16 (or whatever).

Second, it appears that your browser is interpreting your mixed-encoding file as something other than UTF-16, perhaps ISO 8859-1 or Windows Latin 1, which are common defaults. It's unlikely that a browser would interpret a file as UTF-16 unless explicitly directed to (in the HTTP header or content type meta tag). If you left the content type unspecified (check whether your web server is sending a default), then some browsers attempt to guess the encoding. I doubt any would guess your mixed file was UTF-16.

Don't expect anything to work as you want until you've verified that the browser is interpreting the file according to the content type you specify.
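
As a rough sketch of the single-encoding idea, the script could build the whole response in one encoding and declare that same encoding in the header. The example below picks UTF-8 (any one encoding would do, as long as it matches the declared charset), assumes the codes are big-endian UTF-16 written as hex digits, and uses a made-up helper name, renderGlyphPage.

```php
<?php
// Hypothetical helper: builds the whole page body as one UTF-8 string.
function renderGlyphPage(array $codes): string
{
    $lines = [];
    foreach ($codes as $code) {
        // hex2bin("0041") yields the two bytes "\x00\x41" (assumed UTF-16BE).
        $glyph = mb_convert_encoding(hex2bin($code), 'UTF-8', 'UTF-16BE');

        // Label and glyph are both UTF-8 now, so they can share a line.
        $lines[] = $code . ' = ' . htmlspecialchars($glyph, ENT_QUOTES, 'UTF-8');
    }
    return implode("<br/>\n", $lines);
}

// Declare the encoding the bytes are actually in, before any output is sent.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}
echo renderGlyphPage(['0041', '0042', '0043']);
```

The point is only that the labels, the markup, and the glyphs all end up in the same encoding, and that the header names it.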

Finally, I recommend using iconv instead of mb_convert_encoding. iconv is better maintained and has a wider set of supported encodings.
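
For comparison, a minimal iconv sketch of the same kind of conversion follows. It assumes the iconv extension is enabled and that the input bytes are big-endian UTF-16. Note that iconv takes the source encoding first, the reverse of mb_convert_encoding's argument order.

```php
<?php
// "\x00\x41\x00\x42" is "AB" as big-endian UTF-16.
$utf16 = "\x00\x41\x00\x42";

// iconv(from, to, string) returns false if the conversion fails.
$utf8 = iconv('UTF-16BE', 'UTF-8', $utf16);

if ($utf8 === false) {
    echo "Conversion failed\n";
} else {
    echo $utf8, "\n"; // AB
}
```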

fsb
Thanks. I'm not quite sure how to do this. My XML contains UTF-16 values that I want to interpret in PHP. I don't mind if these values are converted into another encoding; I just want 0041 to display an A, 0042 a B, and so forth. Ultimately I will output as an image using imagettftext.
DB