I want to download a page from the web. This works in an ordinary browser such as Firefox, but when I use "file_get_contents" the server refuses and replies that it understands the request but doesn't allow such downloads.

So what can I do? I think I have seen a way, in some Perl scripts, to make a script look like a real browser by setting a user agent and cookies, so that the server thinks the script is a real web browser.

Does anyone have an idea how this can be done?

+8  A: 

Use cURL.

<?php
        // create curl resource
        $ch = curl_init();

        // set url
        curl_setopt($ch, CURLOPT_URL, "example.com");

        //return the transfer as a string
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);


        //change the UA to spoof IE7.
        curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)');

        // $output contains the output string
        $output = curl_exec($ch);

        // close curl resource to free up system resources
        curl_close($ch);     
?>

(From http://uk.php.net/manual/en/curl.examples-basic.php)

Rich Bradshaw
Good, but it still doesn't work. I need the script to tell the server that I'm using a browser.
Omar Abid
Oh, sorry - just add a curl_setopt for the UA - I've added it to my answer.
Rich Bradshaw
A: 

This answer takes your comment on Rich's answer into account.

The site is probably checking whether you are a real user by looking at the HTTP referer or the User-Agent string. Try setting these for your curl handle:

// pretend you came from their site already
curl_setopt($ch, CURLOPT_REFERER, 'http://domainofthesite.com');
// pretend you are Firefox 3.0.6 running on Windows Vista
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.6) Gecko/2009011913 Firefox/3.0.6');
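
Since the question also mentions cookies, the options from both answers can be gathered in one place together with a cookie file. This is only a sketch: the function names browser_like_options and fetch_like_browser and the cookie file path are made up for illustration, and a site may still refuse the request for other reasons.

```php
<?php
// Hypothetical helper: collects the options that make a cURL request look
// more like one sent by a browser. Names and defaults are illustrative only.
function browser_like_options($url, $userAgent, $referer, $cookieFile)
{
    return array(
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,        // return the body instead of printing it
        CURLOPT_USERAGENT      => $userAgent,  // sent as the User-Agent header
        CURLOPT_REFERER        => $referer,    // sent as the Referer header
        CURLOPT_COOKIEJAR      => $cookieFile, // received cookies are saved to this file
        CURLOPT_COOKIEFILE     => $cookieFile, // saved cookies are sent on later requests
        CURLOPT_FOLLOWLOCATION => true,        // follow redirects
    );
}

// Hypothetical helper: runs one request with the given options and returns
// the body as a string, or false on failure.
function fetch_like_browser(array $options)
{
    $ch = curl_init();
    curl_setopt_array($ch, $options);
    $output = curl_exec($ch);
    curl_close($ch);
    return $output;
}
```

Calling fetch_like_browser(browser_like_options($url, $userAgent, $referer, '/tmp/cookies.txt')) twice with the same cookie file would let the second request send back whatever cookies the first one received.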
Pim Jager
A: 

Another way to do it (though others have pointed out a better way) is to use PHP's fopen() function, like so:

$handle = fopen("http://www.example.com/", "r");//open specified URL for reading

It's especially useful if cURL isn't available.
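
On its own, fopen() sends no browser-like headers, so the server in the question may still refuse it. A stream context can add them for both fopen() and file_get_contents(). A minimal sketch, assuming allow_url_fopen is enabled; the function name browser_context is made up for illustration.

```php
<?php
// Hypothetical helper: builds a stream context whose HTTP requests carry
// the given User-Agent and Referer headers.
function browser_context($userAgent, $referer)
{
    return stream_context_create(array(
        'http' => array(
            'method'     => 'GET',
            'user_agent' => $userAgent,
            'header'     => "Referer: " . $referer . "\r\n",
            'timeout'    => 10, // give up after 10 seconds
        ),
    ));
}

// Usage (performs a network request, so it is left commented out):
// $ua = 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.6) Gecko/2009011913 Firefox/3.0.6';
// $context = browser_context($ua, 'http://www.example.com/');
// $html = file_get_contents('http://www.example.com/', false, $context);
// $handle = fopen('http://www.example.com/', 'r', false, $context);
```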

karim79
+1  A: 

Yeah, cURL is pretty good at getting page content. I use it with classes like DOMDocument and DOMXPath to turn the content into a usable form.

function __construct($useragent,$url)
    {
     $this->useragent='Firefox (WindowsXP) - Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.'.$useragent;
     $this->url=$url;

     $ch = curl_init();
     // send the full user agent string built above, not just the suffix
     curl_setopt($ch, CURLOPT_USERAGENT, $this->useragent);
     curl_setopt($ch, CURLOPT_URL, $this->url);
     curl_setopt($ch, CURLOPT_FAILONERROR, true);
     curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
     curl_setopt($ch, CURLOPT_AUTOREFERER, true);
     curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
     curl_setopt($ch, CURLOPT_TIMEOUT, 10);
     $html = curl_exec($ch);
     curl_close($ch);

     // curl_exec returns false on failure (FAILONERROR also covers HTTP errors)
     if ($html === false)
     {
      throw new Exception('Could not fetch '.$url);
     }

     $dom = new DOMDocument();
     // @ suppresses the warnings loadHTML raises on invalid markup
     @$dom->loadHTML($html);
     $this->xpath = new DOMXPath($dom);
    }
...
public function displayResults($site)
    {
    // $this->path is assumed to be filled in by the part of the class omitted above
    $data=$this->path[0]->length;
    for($i=0;$i<$data;$i++)
    {
    $delData=$this->path[0]->item($i);

    //get the href and title from the first link
    $link=$delData->getElementsByTagName('a')->item(0);
    $urlSite=$link->getAttribute('href');
    $titleSite=$link->nodeValue;

    //get the number of saves, defaulting to 0 when the span is missing or empty
    $span=$delData->getElementsByTagName('span')->item(0);
    $saves=($span===null || $span->nodeValue=='') ? 0 : $span->nodeValue;

    //build the array
    $this->newSiteBookmark[$i]['source']='delicious.com';
    $this->newSiteBookmark[$i]['url']=$urlSite;
    $this->newSiteBookmark[$i]['title']=$titleSite;
    $this->newSiteBookmark[$i]['saves']=$saves;
    }
    }

The latter is part of a class that scrapes data from delicious.com. Not very legal, though.
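
The DOMDocument and DOMXPath part can be tried on its own, without any download. A small sketch, assuming the HTML is already in a string; the function name extract_links and the sample markup are made up for illustration and are not taken from the class above.

```php
<?php
// Hypothetical helper: returns the href and text of every link in an HTML string.
function extract_links($html)
{
    $dom = new DOMDocument();
    // Real-world HTML is rarely valid; collect libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $links = array();
    // Select only <a> elements that actually have an href attribute.
    foreach ($xpath->query('//a[@href]') as $a) {
        $links[] = array(
            'url'   => $a->getAttribute('href'),
            'title' => trim($a->nodeValue),
        );
    }
    return $links;
}

$sample = '<html><body><p><a href="http://www.example.com/">Example</a></p></body></html>';
print_r(extract_links($sample));
```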

chosta
It's perfectly legal; the data is already available. It's just an inefficient way of doing it (HTML isn't the best format for data). I've been wishing recently that delicious provided more data (namely search results) in XML.
Ross
Well, I wish delicious provided an API method that can actually access bookmarks that don't come from your own profile, like the ma.gnolia.org "bookmark_find" method. That would have saved me some sleepless nights while doing my bachelor thesis :=)
chosta