ansaurus

Question

Why will this function using CURL work for some URLs but not others?

Answer 1

+1 A:

I assume you've tried setting the timeout to 0.

What HTTP status codes are these sites returning? Check with curl_getinfo($ch, CURLINFO_HTTP_CODE).

Something else to try is spoofing the User-Agent header, perhaps with that of your own browser, since you know that works for these pages. The sites may just be trying to stop bots from accessing the page.

Investigating the headers and HTTP codes should give you a little more information.
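Both suggestions can be sketched roughly as follows. This is only an illustration: the helper name fetch_with_user_agent and the User-Agent string passed to it are assumptions, not something taken from the question.

```php
<?php

// Hypothetical helper (not from the original question): fetch a URL while
// sending a caller-supplied User-Agent, and return the HTTP status code
// together with the response body.
function fetch_with_user_agent($url, $userAgent)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);

    $body = curl_exec($ch);                         // false if the transfer failed
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);  // 0 if no HTTP response arrived
    curl_close($ch);

    return array('http_code' => $code, 'body' => $body);
}
```

Calling fetch_with_user_agent($address, 'Mozilla/5.0 (example)') and comparing the 'http_code' value with and without the header should show whether the User-Agent matters. A code of 0 means cURL never received an HTTP response at all.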

Edit:

I looked into this a bit more. One thing is that you've got a typo in the connection timeout option: it should be CURLOPT_CONNECTTIMEOUT.

Anyway, I ran this script (below) which returned what you're looking for (I think). Check to see what's different between it and yours. I'm using PHP 5.2.8 if it helps.

<?php

$addresses = array(
    'http://atensembl.arabidopsis.info/Arabidopsis_thaliana_TAIR/unisearch?species=Arabidopsis_thaliana_TAIR;idx=;q=At5g02310',
    'http://www.bar.utoronto.ca/efp/cgi-bin/efpWeb.cgi?dataSource=Chemical&modeInput=Absolute&primaryGene=At5g02310&orthoListOn=0'
);

foreach ($addresses as $address) {
    echo "Address: $address\n";
    // This box doesn't have http registered as a transport layer - pfft
    //var_dump(fsockopen($address, 80));

    $ch = curl_init($address);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);

    $fc = curl_exec($ch);

    echo "Info: " . print_r(curl_getinfo($ch), true) . "\n";

    echo "$fc\n";

    curl_close($ch);
}

Which returns the following (TL;DR: my cURL can read the pages fine):

C:\Users\Ross>php -e D:\sandbox\curl.php

Address: http://atensembl.arabidopsis.info/Arabidopsis_thaliana_TAIR/unisearch?species=Arabidopsis_thaliana_TAIR;idx=;q=At5g02310

Info: Array
(
    [url] => http://atensembl.arabidopsis.info/Arabidopsis_thaliana_TAIR/unisearch?species=Arabidopsis_thaliana_TAIR;idx=;q=At5g02310
    [content_type] => text/html; charset=ISO-8859-1
    [http_code] => 200
    [header_size] => 168
    [request_size] => 151
    [filetime] => -1
    [ssl_verify_result] => 0
    [redirect_count] => 0
    [total_time] => 0.654
    [namelookup_time] => 0.004
    [connect_time] => 0.044
    [pretransfer_time] => 0.044
    [size_upload] => 0
    [size_download] => 7531
    [speed_download] => 11515
    [speed_upload] => 0
    [download_content_length] => 0
    [upload_content_length] => 0
    [starttransfer_time] => 0.57
    [redirect_time] => 0
)

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-gb"  lang="en-gb">
<head>
  <title>AtEnsembl release 49: Arabidopsis thaliana TAIR EnsEMBL UniSearch results</title>
  <style type="text/css" media="all">
    @import url(/css/ensembl.css);
    @import url(/css/content.css);
  </style>
  <style type="text/css" media="print">
    @import url(/css/printer-styles.css);
  </style>
  <style type="text/css" media="screen">
    @import url(/css/screen-styles.css);
  </style>
  <script type="text/javascript" src="/js/protopacked.js"></script>
  <script type="text/javascript" src="/js/core42.js"></script>
  <!-- Snipped for freedom - lots of lines -->
</body>
</html>

Address: http://www.bar.utoronto.ca/efp/cgi-bin/efpWeb.cgi?dataSource=Chemical&modeInput=Absolute&primaryGene=At5g02310&orthoListOn=0

Info: Array
(
    [url] => http://www.bar.utoronto.ca/efp/cgi-bin/efpWeb.cgi?dataSource=Chemical&modeInput=Absolute&primaryGene=At5g02310&orthoListOn=0
    [content_type] => text/html; charset=UTF-8
    [http_code] => 200
    [header_size] => 146
    [request_size] => 155
    [filetime] => -1
    [ssl_verify_result] => 0
    [redirect_count] => 0
    [total_time] => 2.695
    [namelookup_time] => 0.004
    [connect_time] => 0.131
    [pretransfer_time] => 0.131
    [size_upload] => 0
    [size_download] => 14156
    [speed_download] => 5252
    [speed_upload] => 0
    [download_content_length] => 0
    [upload_content_length] => 0
    [starttransfer_time] => 2.306
    [redirect_time] => 0
)

<html>
<head>
  <title>Arabidopsis eFP Browser</title>
  <link rel="stylesheet" type="text/css" href="efp.css"/>
  <link rel="stylesheet" type="text/css" href="domcollapse.css"/>
  <script type="text/javascript" src="efp.js"></script>
  <script type="text/javascript" src="domcollapse.js"></script>
</head>
<body>
<!-- SANITY SNIP -->
</body>
</html>

So what does this mean? I'm not entirely sure. I doubt that they're blocking you specifically (since you can access the page, unless you're running this script on a webserver). Try running my code above; if that works, try commenting out parts of your code to see what's different (and possibly causing the stoppage). Also, what PHP version are you running?

Ross 2009-02-15 22:00:04

Cheers for the answer, I've updated my post with the status codes.

Daniel 2009-02-16 04:46:30

Your code wasn't giving me the correct output (I was getting 0 for everything), so I used WAMP server to set up a server on my PC and tried it from there. It worked fine, and so did my code. So I guess the problem is to do with how the original server I was using had been set up.

Daniel 2009-02-16 21:55:29

I'm meeting the person who runs the server tomorrow so hopefully we'll find the problem. Thanks for your help!

Daniel 2009-02-16 21:57:22

The code should work on Linux so I'm glad you've narrowed it down to your machine. Good luck sorting this out!

Ross 2009-02-16 23:20:00

Answer 2

+4 A:

You should use curl_error() to check which error has occurred (if any).
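A minimal sketch of that check, assuming a hypothetical wrapper called fetch_or_error (the name and the returned array layout are illustrative only):

```php
<?php

// Hypothetical helper: run a request and return either the body or a
// description of the cURL-level failure.
function fetch_or_error($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);

    $body = curl_exec($ch);

    if ($body === false) {
        // curl_errno() gives the numeric error code, curl_error() the message.
        $result = array(
            'ok'    => false,
            'error' => sprintf('cURL error %d: %s', curl_errno($ch), curl_error($ch)),
        );
    } else {
        $result = array('ok' => true, 'body' => $body);
    }

    curl_close($ch);

    return $result;
}
```

Both functions must be called before curl_close(), since they read their values from the handle.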

Greg 2009-02-15 22:07:36

Answer 3

+1 A:

Two things to consider.

The first is that you've set your timeout too low. The request may be taking longer than 5 seconds on those websites.

The second is that the websites in question may be deliberately blocking your request. They may have a rule in place to block requests coming from curl, or they may have noticed suspicious activity (either your screen scraping or someone else's network abuse) coming from your IP address and are blocking or throttling the requests.
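One way to tell these two cases apart is to raise the timeouts and then look at the cURL error number and the HTTP status together. The sketch below is an assumption-laden illustration: the function names are hypothetical, and treating 403 or 429 as a sign of blocking is a heuristic, not a rule these particular sites are known to follow.

```php
<?php

// Hypothetical helper: fetch with separate limits for connecting and for the
// whole transfer, then classify the outcome.
function fetch_with_timeouts($url, $connectSeconds, $totalSeconds)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $connectSeconds); // time allowed to connect
    curl_setopt($ch, CURLOPT_TIMEOUT, $totalSeconds);          // time allowed for the whole request

    $body  = curl_exec($ch);
    $errno = curl_errno($ch);
    $code  = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    return array('body' => $body, 'verdict' => classify_result($errno, $code));
}

// Hypothetical helper: a rough guess at what went wrong, if anything.
function classify_result($errno, $httpCode)
{
    // 28 is libcurl's "operation timed out" error code.
    if ($errno === 28) {
        return 'timeout';
    }
    if ($errno !== 0) {
        return 'other cURL error';
    }
    // 403 (Forbidden) and 429 (Too Many Requests) often, but not always,
    // indicate deliberate blocking or throttling.
    if ($httpCode === 403 || $httpCode === 429) {
        return 'possibly blocked';
    }
    if ($httpCode >= 200 && $httpCode < 300) {
        return 'ok';
    }
    return 'other HTTP status';
}
```

If fetch_with_timeouts($address, 10, 60) still reports 'timeout', the site is probably slow or unreachable from that machine rather than rejecting the request outright.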

Alan Storm 2009-02-15 22:32:21
