I'm writing a website in PHP that aggregates data from various other websites. I have a function, returnPageSource, that takes a URL and returns the HTML from that URL as a string.

function returnPageSource($url){
 $ch = curl_init();
 $timeout = 5; // set to zero for no timeout  

 curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // means the page is returned
 curl_setopt($ch, CURLOPT_URL, $url);
 curl_setopt($ch, CURLOUT_CONNECTTIMEOUT, $timeout); // how long to wait to connect
 curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);  // follow redirects
 //curl_setopt($ch, CURLOPT_HEADER, False);   // only request body

 $fileContents = curl_exec($ch); // $fileContents contains the html source of the required website
 curl_close($ch);

 return $fileContents;
}

This works fine for some of the websites I need, like http://atensembl.arabidopsis.info/Arabidopsis_thaliana_TAIR/unisearch?species=Arabidopsis_thaliana_TAIR;idx=;q=At5g02310, but not for others, like http://www.bar.utoronto.ca/efp/cgi-bin/efpWeb.cgi?dataSource=Chemical&modeInput=Absolute&primaryGene=At5g02310&orthoListOn=0 . Does anybody have any idea why?

Update

Thanks for the responses. I've changed my User-Agent to match my browser's (Firefox 3, which can access the sites fine) and changed the timeout to 0. I still can't connect, but I can now get some error messages: curl_error() gives me "couldn't connect to host", and curl_getinfo($ch, CURLINFO_HTTP_CODE); returns HTTP code 0, neither of which is very helpful. I've also tried curl_setopt($ch, CURLOPT_VERBOSE, 1);, but that displayed nothing. Does anybody have any other ideas?

Final Update

I just realised I didn't explain what was wrong - I just needed to enter the proxy settings for my university (I'm using the university's server). Everything worked fine after that!
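For reference, a minimal sketch of what adding proxy settings to the function might look like. The proxy address proxy.example.edu:8080, the optional credentials, and the names proxyCurlOptions and returnPageSourceViaProxy are placeholders for illustration, not the actual values used here; CURLOPT_PROXY and CURLOPT_PROXYUSERPWD are standard cURL options.

```php
<?php
// Sketch only: the proxy host, port and credentials are hypothetical.
function proxyCurlOptions($url, $proxy, $proxyAuth = null) {
    $options = array(
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,   // return the page instead of printing it
        CURLOPT_FOLLOWLOCATION => true,   // follow redirects
        CURLOPT_CONNECTTIMEOUT => 5,      // seconds to wait for the connection
        CURLOPT_PROXY          => $proxy, // "host:port" of the proxy
    );
    if ($proxyAuth !== null) {
        $options[CURLOPT_PROXYUSERPWD] = $proxyAuth; // "user:password"
    }
    return $options;
}

function returnPageSourceViaProxy($url, $proxy, $proxyAuth = null) {
    $ch = curl_init();
    curl_setopt_array($ch, proxyCurlOptions($url, $proxy, $proxyAuth));
    $fileContents = curl_exec($ch); // false on failure
    curl_close($ch);
    return $fileContents;
}
```

A call would then look something like returnPageSourceViaProxy($url, 'proxy.example.edu:8080'), with the real proxy address supplied by whoever administers the network.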

+1  A: 

I assume you've tried setting the timeout to 0.

What HTTP status codes are these sites returning? Check curl_getinfo($ch, CURLINFO_HTTP_CODE);.

Something else to try could be spoofing the User-Agent header, perhaps with that of your own browser since you know that works to access these pages. They may just be trying to stop bots accessing the page.

Investigating the response headers and HTTP status codes should give you a little more information.
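As a rough sketch of both suggestions, something like the following sets a browser-like User-Agent and returns the status code and response headers alongside the body. The function name fetchWithDiagnostics and the User-Agent value you pass in are made up for illustration; substitute whatever your own browser sends.

```php
<?php
// Sketch only: fetchWithDiagnostics is a hypothetical helper name.
function fetchWithDiagnostics($url, $userAgent) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent); // sent as the User-Agent header
    curl_setopt($ch, CURLOPT_HEADER, true);          // prepend response headers to the output

    $response   = curl_exec($ch);                          // false if the request failed
    $status     = curl_getinfo($ch, CURLINFO_HTTP_CODE);   // 0 if no HTTP response arrived
    $headerSize = curl_getinfo($ch, CURLINFO_HEADER_SIZE); // total bytes of all headers received
    curl_close($ch);

    if ($response === false) {
        return array('status' => $status, 'headers' => '', 'body' => false);
    }
    return array(
        'status'  => $status,
        'headers' => substr($response, 0, $headerSize),
        'body'    => (string) substr($response, $headerSize),
    );
}
```

A status of 0 with a false body means no HTTP response came back at all, which points at a connection problem rather than the site rejecting the request.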

Edit:

I looked into this a bit more. One thing is that you've got a typo in the connection timeout option: you wrote CURLOUT_CONNECTTIMEOUT, and it should be CURLOPT_CONNECTTIMEOUT.

Anyway, I ran this script (below) which returned what you're looking for (I think). Check to see what's different between it and yours. I'm using PHP 5.2.8 if it helps.

<?php

$addresses = array(
    'http://atensembl.arabidopsis.info/Arabidopsis_thaliana_TAIR/unisearch?species=Arabidopsis_thaliana_TAIR;idx=;q=At5g02310',
    'http://www.bar.utoronto.ca/efp/cgi-bin/efpWeb.cgi?dataSource=Chemical&modeInput=Absolute&primaryGene=At5g02310&orthoListOn=0'
);

foreach ($addresses as $address) {
    echo "Address: $address\n";
    // This box doesn't have http registered as a transport layer - pfft
    //var_dump(fsockopen($address, 80));

    $ch = curl_init($address);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);

    $fc = curl_exec($ch);

    echo "Info: " . print_r(curl_getinfo($ch), true) . "\n";

    echo "$fc\n";

    curl_close($ch);
}

Which returns the following (TL;DR: my cURL can read the pages fine):

C:\Users\Ross>php -e D:\sandbox\curl.php

Address: http://atensembl.arabidopsis.info/Arabidopsis_thaliana_TAIR/unisearch?species=Arabidopsis_thaliana_TAIR;idx=;q=At5g02310

Info: Array
(
    [url] => http://atensembl.arabidopsis.info/Arabidopsis_thaliana_TAIR/unisearch?species=Arabidopsis_thaliana_TAIR;idx=;q=At5g02310
    [content_type] => text/html; charset=ISO-8859-1
    [http_code] => 200
    [header_size] => 168
    [request_size] => 151
    [filetime] => -1
    [ssl_verify_result] => 0
    [redirect_count] => 0
    [total_time] => 0.654
    [namelookup_time] => 0.004
    [connect_time] => 0.044
    [pretransfer_time] => 0.044
    [size_upload] => 0
    [size_download] => 7531
    [speed_download] => 11515
    [speed_upload] => 0
    [download_content_length] => 0
    [upload_content_length] => 0
    [starttransfer_time] => 0.57
    [redirect_time] => 0
)

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-gb"  lang="en-gb">
<head>
  <title>AtEnsembl release 49: Arabidopsis thaliana TAIR EnsEMBL UniSearch results</title>
  <style type="text/css" media="all">
    @import url(/css/ensembl.css);
    @import url(/css/content.css);
  </style>
  <style type="text/css" media="print">
    @import url(/css/printer-styles.css);
  </style>
  <style type="text/css" media="screen">
    @import url(/css/screen-styles.css);
  </style>
  <script type="text/javascript" src="/js/protopacked.js"></script>
  <script type="text/javascript" src="/js/core42.js"></script>
  <!-- Snipped for freedom - lots of lines -->
</body>
</html>

Address: http://www.bar.utoronto.ca/efp/cgi-bin/efpWeb.cgi?dataSource=Chemical&modeInput=Absolute&primaryGene=At5g02310&orthoListOn=0

Info: Array
(
    [url] => http://www.bar.utoronto.ca/efp/cgi-bin/efpWeb.cgi?dataSource=Chemical&modeInput=Absolute&primaryGene=At5g02310&orthoListOn=0
    [content_type] => text/html; charset=UTF-8
    [http_code] => 200
    [header_size] => 146
    [request_size] => 155
    [filetime] => -1
    [ssl_verify_result] => 0
    [redirect_count] => 0
    [total_time] => 2.695
    [namelookup_time] => 0.004
    [connect_time] => 0.131
    [pretransfer_time] => 0.131
    [size_upload] => 0
    [size_download] => 14156
    [speed_download] => 5252
    [speed_upload] => 0
    [download_content_length] => 0
    [upload_content_length] => 0
    [starttransfer_time] => 2.306
    [redirect_time] => 0
)

<html>
<head>
  <title>Arabidopsis eFP Browser</title>
  <link rel="stylesheet" type="text/css" href="efp.css"/>
  <link rel="stylesheet" type="text/css" href="domcollapse.css"/>
  <script type="text/javascript" src="efp.js"></script>
  <script type="text/javascript" src="domcollapse.js"></script>
</head>
<body>
<!-- SANITY SNIP -->
</body>
</html>

So what does this mean? I'm not entirely sure. I doubt that they're blocking you specifically, since you can access the page in your browser (unless you're running this script on a separate webserver). Try running my code above. If that works, try commenting out parts of your code to see what's different and possibly causing the failure. Also, what PHP version are you running?

Ross
Cheers for the answer, I've updated my post with the status codes.
Daniel
Your code wasn't giving me the correct output (I was getting 0 for everything), so I used WAMP to set up a server on my PC and tried it from there. It worked fine, and so did my code. So I guess the problem is to do with how the original server I was using had been set up.
Daniel
I'm meeting the person who runs the server tomorrow so hopefully we'll find the problem. Thanks for your help!
Daniel
The code should work on Linux so I'm glad you've narrowed it down to your machine. Good luck sorting this out!
Ross
+4  A: 

You should call curl_error() after curl_exec() to check which error has occurred (if any).
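A minimal sketch of that check, assuming a helper named fetchOrExplain (a made-up name). Note that curl_error() and curl_errno() have to be read before curl_close() is called on the handle.

```php
<?php
// Sketch only: fetchOrExplain is a hypothetical helper name.
// Returns array($body, $errorMessage); exactly one of the two is meaningful.
function fetchOrExplain($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);

    $body = curl_exec($ch);
    if ($body === false) {
        // curl_errno() is the numeric code, curl_error() the readable message
        $message = sprintf('cURL error %d: %s', curl_errno($ch), curl_error($ch));
        curl_close($ch);
        return array(false, $message);
    }

    curl_close($ch);
    return array($body, null);
}
```

Comparing with === false matters here, because an empty page body is a valid result and would otherwise be mistaken for a failure.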

Greg
+1  A: 

Two things to consider.

The first is that you've set your timeout too low. Connecting to those websites may be taking longer than 5 seconds.

The second is that the websites in question may be deliberately blocking your request. They may have a rule in place to block requests coming from curl, or they may have noticed suspicious activity (either your screen scraping or someone else's network abuse) coming from your IP address and are blocking or throttling the requests.
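To rule out the first possibility, here is a small sketch that raises the limits and reports whether a timeout was actually hit. CURLOPT_CONNECTTIMEOUT only limits the connection phase, while CURLOPT_TIMEOUT limits the whole transfer. The function names and the default values of 10 and 30 seconds are arbitrary choices for illustration.

```php
<?php
// Sketch only: function names and default limits are hypothetical.
function timeoutOptions($connectSeconds, $totalSeconds) {
    return array(
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_CONNECTTIMEOUT => $connectSeconds, // limit on establishing the connection
        CURLOPT_TIMEOUT        => $totalSeconds,   // limit on the whole transfer
    );
}

function fetchWithTimeouts($url, $connectSeconds = 10, $totalSeconds = 30) {
    $ch = curl_init($url);
    curl_setopt_array($ch, timeoutOptions($connectSeconds, $totalSeconds));

    $body = curl_exec($ch);
    // libcurl error 28 is "operation timed out": one of the two limits was hit
    $timedOut = ($body === false && curl_errno($ch) === 28);
    curl_close($ch);

    return array('body' => $body, 'timedOut' => $timedOut);
}
```

If timedOut stays false while the request still fails, the timeout is not the cause and blocking or a network problem becomes the more likely explanation.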

Alan Storm