$string = file_get_contents('http://example.com');

if ('UTF-8' === mb_detect_encoding($string)) {
    $dom = new DOMDocument();
    // hack to preserve UTF-8 characters
    $dom->loadHTML('<?xml encoding="UTF-8">' . $string);
    $dom->preserveWhiteSpace = false;
    $dom->encoding = 'UTF-8';
    $body = $dom->getElementsByTagName('body');
    echo htmlspecialchars($body->item(0)->nodeValue);
}

This turns all non-ASCII UTF-8 characters into Å, ¾, ¤ and other rubbish. Is there any other way to preserve UTF-8 characters?

Please don't post answers telling me to make sure I am outputting it as UTF-8; I have made sure I am.

Thanks in advance :)

+1  A: 

If it is definitely the DOM that is screwing up the encoding, this trick worked for me a while back the other way round (accepting ISO-8859-1 data). DOMDocument should be UTF-8 by default in any case, but you can still try:

    $dom = new DOMDocument('1.0', 'utf-8');
Pekka
This didn't help, but andrewmabbott has already solved my problem - check his answer ;)
Richard Knop
+1  A: 

I had similar problems recently and eventually found this workaround: convert all the non-ASCII characters to HTML entities before loading the HTML.

$string = mb_convert_encoding($string, 'HTML-ENTITIES', "UTF-8");
$dom->loadHTML($string);
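
For reference, here is a minimal self-contained sketch of the same idea. It is an illustration under assumptions rather than the exact code from this thread: the function name `bodyText` and the sample markup are made up, and it swaps in `mb_encode_numericentity()` because the `'HTML-ENTITIES'` target of `mb_convert_encoding()` is deprecated as of PHP 8.2. It assumes the mbstring and DOM extensions are loaded and that the input really is UTF-8.

```php
<?php
// Hypothetical helper (not from the thread): returns the text content of
// <body> for a UTF-8 encoded HTML string.
function bodyText(string $html): string
{
    // Map every code point from U+0080 to U+10FFFF to a numeric entity
    // (offset 0, mask 0x1FFFFF), leaving the ASCII markup untouched.
    $convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    $ascii = mb_encode_numericentity($html, $convmap, 'UTF-8');

    $dom = new DOMDocument();
    // The parser now only sees ASCII plus entities, so its guess about the
    // input charset no longer matters.
    $dom->loadHTML($ascii);

    $body = $dom->getElementsByTagName('body')->item(0);
    return $body === null ? '' : (string) $body->nodeValue;
}

// Sample input is an assumption, standing in for the fetched page.
echo bodyText('<html><body><p>Žluťoučký kůň</p></body></html>'), "\n";
```

The returned string is still UTF-8, so it can be passed through `htmlspecialchars()` exactly as in the question.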
andrewmabbott
WOW. Thanks a lot, worked perfectly. This was already driving me to the brink of madness.
Richard Knop
This is a great workaround, but it would still be interesting to find out why your production server's DOM screws up the UTF-8 in the first place. Maybe something to ask the administrator, if there is one.
Pekka
I am the administrator :D and I have no idea. I am using a very common setup, Debian 5.0 Lenny. Maybe it's some security "feature" that does this?
Richard Knop
Furthermore, I'm using the default php5 package for Debian from the official repositories, so it's the default installation with default settings. I haven't changed any default settings; I just added a few extensions I need for my applications, like ioncube, imagick, gd and curl (I think that's all of them).
Richard Knop
+1  A: 

At the top of the script where your PHP code lives (the code you posted here), make sure you send a UTF-8 header. I bet your output encoding is some variant of Latin-1 right now. Yes, I know the remote webpage is UTF-8, but this PHP script's output isn't.
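
A minimal sketch of what sending that header could look like, assuming the script emits HTML and nothing has been output yet. The `$contentType` variable and the `headers_sent()` guard are additions for illustration only:

```php
<?php
// Header line declaring the response charset (value assumed for an HTML page).
$contentType = 'Content-Type: text/html; charset=utf-8';

// header() fails with a warning once output has started, hence the guard.
if (!headers_sent()) {
    header($contentType);
}
```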

chris