+3  A: 

Update: I can reproduce the problem. Firefox also auto-detects the character set as "Chinese Simplified" when I output the raw XML feed. Either the Google feed is serving incorrectly encoded data (a Chinese Simplified encoding rather than UTF-8), or it serves different data when it is not fetched by a browser: in Firefox, the Content-Type header clearly says UTF-8.

Converting the incoming feed from Chinese Simplified (GB18030, this is what Firefox gave me) into UTF-8 works:

 $incoming = file_get_contents('http://www.google.com/ig/api?weather=11791&hl=zh-CN');
 $xml = iconv("GB18030", "utf-8", $incoming);
 $xml = simplexml_load_string($xml);

This doesn't explain or fix the underlying problem yet, though. I don't have time to take a deep look into this right now; maybe somebody else does. To me, it looks like Google are in fact serving incorrect data (which would surprise me. I didn't know they made mistakes like us mortals. :P)
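
For anyone who wants to check the encoding theory without hitting the feed, here is a minimal offline sketch. The sample XML string is made up (it is not the real response), and `preg_match()` with the `/u` modifier is only used as a quick "is this valid UTF-8?" test.

```php
<?php
// Made-up stand-in for the feed body (an assumption, not the real response).
$utf8Xml = '<weather><condition data="多云"/></weather>';

// Re-encode it as GB18030 to mimic the bytes that came back from the server.
$incoming = iconv('UTF-8', 'GB18030', $utf8Xml);

// With the /u modifier, preg_match() returns false for subjects that are not valid UTF-8.
$validBefore = (preg_match('//u', $incoming) === 1);

// Same conversion as above: GB18030 in, UTF-8 out.
$converted = iconv('GB18030', 'UTF-8', $incoming);
$validAfter = (preg_match('//u', $converted) === 1);

var_dump($validBefore, $validAfter);
```

If the theory holds, the first check should fail and the second should pass, which is what I would expect from the real feed body as well.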

Pekka
Pekka: I've tried that. The XML looks fine, but I get tons of parse errors when I pass it to simplexml_load_string :(. Do I need to cast it to a UTF-8 string or something? Does loading it through PHP give you an error?
John Himmelman
@John hang on, I'll try it out.
Pekka
@John see my update.
Pekka
@Pekka: Thanks! At least now I can feel good knowing that it wasn't my code that broke the system xD.
John Himmelman
@John you're welcome. I could be wrong, but from the looks of it, this actually seems to be faulty data.
Pekka
Actually, the server does advertise using the GB2312 charset for the response.
Josh Davis
+3  A: 

The problem here is that SimpleXML never sees the HTTP headers, so it can't use them to determine the character encoding of the document. It simply assumes UTF-8, even though Google's server does advertise the encoding as

Content-Type: text/xml; charset=GB2312
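
A small offline sketch of that behaviour, assuming a made-up sample document re-encoded as GB2312 (not the real feed). simplexml_load_string() only gets the bytes, so with no header and no encoding in an XML declaration it reads them as UTF-8:

```php
<?php
libxml_use_internal_errors(true); // collect parser errors instead of emitting warnings

// Hypothetical sample body, re-encoded as GB2312 to mimic the response.
$body = iconv('UTF-8', 'GB2312', '<weather><condition data="多云"/></weather>');

// The parser has no way to know the charset, treats the bytes as UTF-8 and rejects them.
$direct = simplexml_load_string($body);
$errors = libxml_get_errors();
libxml_clear_errors();

// Converting according to the advertised charset first lets the same body parse.
$fixed = simplexml_load_string(iconv('GB2312', 'UTF-8', $body));

var_dump($direct === false, count($errors) > 0, $fixed instanceof SimpleXMLElement);
```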

You can write a function that takes a look at that header using the super-secret magic variable $http_response_header and transforms the response accordingly. Something like this:

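function sxe($url)
{
    $xml = file_get_contents($url);
    if ($xml === false)
    {
        // the request failed, so there is nothing to parse
        return false;
    }

    // file_get_contents() fills $http_response_header in the local scope
    foreach ($http_response_header as $header)
    {
        // capture the charset only, without any trailing whitespace or parameters
        if (preg_match('#^Content-Type: text/xml; charset=([^;\s]+)#i', $header, $m))
        {
            switch (strtolower($m[1]))
            {
                case 'utf-8':
                    // already UTF-8, do nothing
                    break;

                case 'iso-8859-1':
                    // note: utf8_encode() is deprecated as of PHP 8.2
                    $xml = utf8_encode($xml);
                    break;

                default:
                    $xml = iconv($m[1], 'utf-8', $xml);
            }
            break;
        }
    }

    return simplexml_load_string($xml);
}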
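
If you want to test the header handling without making a request, the function above can be split into two small pieces. This is only a sketch: `charsetFromHeaders()` and `toUtf8()` are hypothetical names of mine, the regex is a looser variant that accepts any media type and a quoted charset, and as far as I know `$http_response_header` holds the headers of every response when redirects are followed, which is why the last match is kept.

```php
<?php
// Hypothetical helper: return the lower-cased charset from the Content-Type header, or null.
// $headers is a list of raw header lines, as found in $http_response_header.
function charsetFromHeaders(array $headers): ?string
{
    $charset = null;
    foreach ($headers as $header)
    {
        // accept any media type and an optionally quoted charset value
        if (preg_match('#^Content-Type:.*?;\s*charset="?([^";\s]+)#i', $header, $m))
        {
            // keep the last match (assumption: later entries belong to the final response)
            $charset = strtolower($m[1]);
        }
    }
    return $charset;
}

// Hypothetical helper: convert $body to UTF-8 when a different charset was advertised.
function toUtf8(string $body, ?string $charset): string
{
    if ($charset === null || $charset === 'utf-8')
    {
        return $body;
    }
    $converted = iconv($charset, 'UTF-8', $body);
    // iconv() returns false on failure; fall back to the untouched body
    return $converted === false ? $body : $converted;
}
```

Inside a function like sxe(), the last line would then become something like `return simplexml_load_string(toUtf8($xml, charsetFromHeaders($http_response_header)));`.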
Josh Davis
+1 Aaah, very nice. I was fooled by the fact that it serves `Content-Type: text/xml; charset=UTF-8` when called in a browser.
Pekka
+1  A: 

Excellent script. It worked for me.

It is awesome code. It made my day.

Thanks

Viren