Hi, how come this isn't working:

$url = "http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20html%20where%20xpath%3D%22%2F%2Fmeta%22%20and%20url%3D%22http://www.cnn.com%22&format=xml&diagnostics=false";

$xml = simplexml_load_file($url);

I get multiple errors telling me the HTTP request failed. Ultimately I want to get the results from this file into an array, e.g.

Description = CNN.com delivers the latest breaking news etc.

Keywords = CNN, CNN news, CNN.com, CNN TV etc.

But this initial stage isn't working. Any help please?

EDIT Additional information:

Errors:

warning: simplexml_load_file(http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20html%20where%20xpath%3D%22//meta%22%20and%20url%3D%22http://www.cnn.com%22&format=xml&diagnostics=false) [function.simplexml-load-file]: failed to open stream: HTTP request failed!
warning: simplexml_load_file() [function.simplexml-load-file]: I/O warning : failed to load external entity "http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20html%20where%20xpath%3D%22//meta%22%20and%20url%3D%22http://www.cnn.com%22&format=xml&diagnostics=false"
  • From my phpinfo(): allow_url_fopen is On (local value) and On (master value)
  • PHP version 5.2.11
  • Think it's valid (http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20html%20where%20xpath%3D%22//meta%22%20and%20url%3D%22http://www.cnn.com%22&format=xml&diagnostics=false)
+2  A: 
  1. Could you please post the exact errors?
  2. Is allow_url_fopen enabled on your server?
  3. Which version of PHP do you use?
  4. Is the requested file a valid XML file?
oezi
I have updated my question with the relevant answers. Thanks for your help.
Sean McRaghty
A: 

Well, the XML is GETable. As for validity, it lacks the <?xml version="1.0"?> declaration, but I don't think that's required.

<query xmlns:yahoo="http://www.yahooapis.com/v1/base.rng" yahoo:count="5" yahoo:created="2010-03-09T05:09:03Z" yahoo:lang="en-US" yahoo:updated="2010-03-09T05:09:03Z" yahoo:uri="http://query.yahooapis.com/v1/yql?q=select+*+from+html+where+xpath%3D%22%2F%2Fmeta%22+and+url%3D%22http%3A%2F%2Fwww.cnn.com%22"><results><meta content="HTML Tidy for Java (vers. 26 Sep 2004), see www.w3.org" name="generator"/><meta content="1800;url=?refresh=1" http-equiv="refresh"/><meta content="CNN.com delivers the latest breaking news and information on the latest top stories, weather, business, entertainment, politics, and more. For in-depth coverage, CNN.com provides special reports, video, audio, photo galleries, and interactive guides." name="Description"/><meta content="CNN, CNN news, CNN.com, CNN TV, news, news online, breaking news, U.S. news, world news, weather, business, CNN Money, sports, politics, law, technology, entertainment, education, travel, health, special reports, autos, developing story, news video, CNN Intl" name="Keywords"/><meta content="text/html; charset=iso-8859-1" http-equiv="content-type"/></results></query><!-- total: 250 -->

Tested it on my local server (PHP 5.3), no errors reported. I've used your source code and it works. Here's a print_r():


SimpleXMLElement Object
(
    [results] => SimpleXMLElement Object
        (
            [meta] => Array
                (
                    [0] => SimpleXMLElement Object
                        (
                            [@attributes] => Array
                                (
                                    [content] => HTML Tidy for Java (vers. 26 Sep 2004), see www.w3.org
                                    [name] => generator
                                )

                        )

                    [1] => SimpleXMLElement Object
                        (
                            [@attributes] => Array
                                (
                                    [content] => 1800;url=?refresh=1
                                    [http-equiv] => refresh
                                )

                        )

                    [2] => SimpleXMLElement Object
                        (
                            [@attributes] => Array
                                (
                                    [content] => CNN.com delivers the latest breaking news and information on the latest top stories, weather, business, entertainment, politics, and more. For in-depth coverage, CNN.com provides special reports, video, audio, photo galleries, and interactive guides.
                                    [name] => Description
                                )

                        )

                    [3] => SimpleXMLElement Object
                        (
                            [@attributes] => Array
                                (
                                    [content] => CNN, CNN news, CNN.com, CNN TV, news, news online, breaking news, U.S. news, world news, weather, business, CNN Money, sports, politics, law, technology, entertainment, education, travel, health, special reports, autos, developing story, news video, CNN Intl
                                    [name] => Keywords
                                )

                        )

                    [4] => SimpleXMLElement Object
                        (
                            [@attributes] => Array
                                (
                                    [content] => text/html; charset=iso-8859-1
                                    [http-equiv] => content-type
                                )

                        )

                )

        )

)

I'd suggest encoding the URL, but that's already done. You could try performing the query with cURL.
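
For reference, a minimal sketch of what such a cURL request could look like. The helper name `fetch_with_curl` and the 10-second timeout are assumptions made for illustration, not something taken from this thread. Unlike `simplexml_load_file()`, it hands back the cURL error message, which may help tell a blocked connection apart from a bad response.

```php
<?php
// Hypothetical helper (name and defaults are illustrative assumptions):
// fetch a URL with cURL and report what happened instead of only
// emitting a "failed to open stream" warning.
function fetch_with_curl($url, $timeout = 10)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);     // return the body as a string
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout); // give up if we cannot connect
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);        // overall time limit

    $body = curl_exec($ch);

    return array(
        'body'   => ($body === false) ? null : $body,        // null when the request failed
        'status' => curl_getinfo($ch, CURLINFO_HTTP_CODE),   // 0 if no HTTP response arrived
        'error'  => curl_error($ch),                         // empty string on success
    );
}

// Possible usage with the $url from the question:
//   $response = fetch_with_curl($url);
//   if ($response['body'] === null) {
//       echo 'cURL error: ' . $response['error'];
//   } else {
//       $xml = simplexml_load_string($response['body']);
//   }
```

If `error` mentions a timeout or a refused connection, something between the script and the remote host is probably blocking the request.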

Joel Alejandro
A: 

I tested this on my local XAMPP installation and I can't reproduce your error messages. My script:

<?php
$xml = simplexml_load_file("http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20html%20where%20xpath%3D%22//meta%22%20and%20url%3D%22http://www.cnn.com%22&format=xml&diagnostics=false");
var_dump($xml);
?>

Could you please try to open the requested page in your browser, save it as test.xml, upload it to your server and try simplexml_load_file("test.xml");? Do you get the same errors when you try this?

oezi
I tried your script (no luck) but then with a local file as you suggest and it DID work! So what does this mean? Thanks for your great help thus far...
Sean McRaghty
Hmmm... there must be something wrong with your server or PHP settings, but I've no idea what... Give me some more time: I'll be back.
oezi
A: 

So, here are two more ideas:

1. user_agent

Maybe the remote server blocks requests that come from scripts (or tries to). Please put this line in your file, somewhere before the call to simplexml_load_file():

ini_set('user_agent', "Mozilla/5.0 (Windows; U; Windows NT 5.1; en; rv:1.8) Gecko/20051111 Firefox/1.5");

and test it again.

2. very dumb workaround

If nothing helps, use copy($url, "/temp.xml"); or wget to download the response in your script and load that file instead. Also open the downloaded file in your browser: you might see a 403 Forbidden or similar, which would confirm my first idea.
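
A rough sketch of how the second idea could look in code. The function name `load_xml_via_copy` and the use of a temporary file (rather than a fixed path such as /temp.xml) are assumptions made for illustration. The response is copied to a local file first and parsed from there, so a download failure and a parse failure can be told apart.

```php
<?php
// Hypothetical helper (illustrative only): download $source to a temporary
// file with copy(), then parse the local copy with SimpleXML.
// Returns a SimpleXMLElement on success, or false if either step fails.
function load_xml_via_copy($source)
{
    $tmp = tempnam(sys_get_temp_dir(), 'yql');
    if ($tmp === false) {
        return false;               // could not create a temporary file
    }

    $xml = false;
    if (@copy($source, $tmp)) {     // copy() accepts URLs when allow_url_fopen is On
        $xml = simplexml_load_file($tmp);
    }

    unlink($tmp);                   // remove the temporary copy again
    return $xml;
}

// Possible usage with the $url from the question:
//   $xml = load_xml_via_copy($url);
//   if ($xml === false) { echo 'Download or parse failed'; }
```

To inspect what the server actually sent, skip the `unlink()` call while debugging and open the temporary file by hand.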

oezi
Your first method doesn't seem to work (although I am using a CMS so can't be sure that line is the first thing) and I'm afraid you'll need to explain the second one a bit before I understand it. It seems a bit inefficient and a large load on the server though? Thanks....
Sean McRaghty
+2  A: 

(Note: Potentially useless answer once a real answer has been found…)


While you're figuring out the XML problem (keep working on it!) know that you can also get the YQL response back as JSON. Here's a quickie example:

$url = "http://query.yahooapis.com/v1/public/yql?q=select+%2A+"
     . "from+html+where+xpath%3D%22%2F%2Fmeta%5B%40name%3D%27"
     . "Keywords%27+or+%40name%3D%27Description%27%5D%22+and+"
     . "url%3D%22http%3A%2F%2Fwww.cnn.com%22&format=json&diagnostics=false";

// Grab YQL response and parse JSON
$json   = file_get_contents($url);
$result = json_decode($json, TRUE);

// Loop over meta results looking for what we want
$items = $result['query']['results']['meta'];
$metas = array();
foreach ($items as $item) {
    $metas[$item['name']] = $item['content'];
}
print_r($metas);

Giving an array like (text truncated for the screen):

Array
(
    [Description] => CNN.com delivers the latest breaking news and …
    [Keywords] => CNN, CNN news, CNN.com, CNN TV, news, news online …
)

Note that the YQL query (try it in the console) is slightly different to yours, to make the PHP simpler.

salathe
Hi, this is great, but I want all the meta data, not just the description and keywords. The script needs to be applicable to every site, so it can parse every meta tag the site throws at it, not just those on a specified list. Thanks.
Sean McRaghty
In that case, change the YQL query to be less restrictive (like your original). Note that not all meta tags will contain `name` and `content` attributes.
salathe
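
A possible way to handle that in the loop, as a sketch only. The helper name `collect_meta` is made up for illustration, and it assumes that tags without a `name` show up in the decoded JSON with an `http-equiv` key, mirroring the XML attributes shown earlier. Tags with neither key, or without `content`, are skipped, and a non-array input yields an empty array instead of a foreach() warning.

```php
<?php
// Hypothetical helper (illustrative only): turn the decoded list of meta
// tags into a name => content array, tolerating missing attributes.
function collect_meta($items)
{
    $metas = array();
    if (!is_array($items)) {
        return $metas;                      // e.g. the request failed upstream
    }

    foreach ($items as $item) {
        if (!is_array($item) || !isset($item['content'])) {
            continue;                       // nothing useful to store
        }
        if (isset($item['name'])) {
            $key = $item['name'];
        } elseif (isset($item['http-equiv'])) {
            $key = $item['http-equiv'];     // assumed key for http-equiv tags
        } else {
            continue;                       // no usable identifier
        }
        $metas[$key] = $item['content'];
    }

    return $metas;
}

// Possible usage with the decoded $result from the answer above:
//   $items = isset($result['query']['results']['meta'])
//          ? $result['query']['results']['meta'] : null;
//   print_r(collect_meta($items));
```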
I actually just tried your code and unfortunately I get similar errors: `file_get_contents(http://query.yahooapis.com/v1/public/yql?q=select+%2A+from+html+where+xpath%3D%22%2F%2Fmeta...) [function.file-get-contents]: failed to open stream: HTTP request failed!` and `warning: Invalid argument supplied for foreach()` What could be wrong????
Sean McRaghty
This is a long-shot but do you have a firewall between wherever you're running the script from and the outside world? Try using cURL to see if it also gets blocked. Can you telnet to query.yahooapis.com?
salathe
Sorry, but I have no idea how to form a cURL request (is there some simple code I could try?) and I don't have a telnet client. Seems a shame that in my development environment (localhost) things don't work that will probably work in my production environment (i.e. the web).
Sean McRaghty
For a basic example of using cURL: http://php.net/curl.examples-basic . You also didn't answer my hunch about a firewall being in the way: do you have one set up between you and the outside world?
salathe
This was correct, the code worked when I changed internet connections, must've been a firewall....
Sean McRaghty
Good to hear you found the issue. :-)
salathe