Hello.

I am building a basic link checker at work using cURL. My application has a function called getHeaders() that returns an array of HTTP headers:

function getHeaders($url) {

    if(function_exists('curl_init')) {
        // create a new cURL resource
        $ch = curl_init();
        // set URL and other appropriate options
        $options = array(
            CURLOPT_URL => $url,
            CURLOPT_HEADER => true,
            CURLOPT_NOBODY => true,
            CURLOPT_FOLLOWLOCATION => 1,
            CURLOPT_RETURNTRANSFER => true );
        curl_setopt_array($ch, $options);
        // execute the request (CURLOPT_RETURNTRANSFER keeps the response from being printed)
        curl_exec($ch);
        $headers = curl_getinfo($ch);
        // close cURL resource, and free up system resources
        curl_close($ch);
    } else {
        echo "Error: cURL is not installed on the web server. Unable to continue.";
        return false;
    }
    return $headers;
}

print_r(getHeaders('mail.google.com'));

Which yields the following results:

Array
(
    [url] => http://mail.google.com
    [content_type] => text/html; charset=UTF-8
    [http_code] => 404
    [header_size] => 338
    [request_size] => 55
    [filetime] => -1
    [ssl_verify_result] => 0
    [redirect_count] => 0
    [total_time] => 0.128
    [namelookup_time] => 0.042
    [connect_time] => 0.095
    [pretransfer_time] => 0.097
    [size_upload] => 0
    [size_download] => 0
    [speed_download] => 0
    [speed_upload] => 0
    [download_content_length] => 0
    [upload_content_length] => 0
    [starttransfer_time] => 0.128
    [redirect_time] => 0
)

I've tested it with several long links, and the function follows redirects for all of them except, it seems, mail.google.com.

For fun, I passed the same URL (mail.google.com) to the W3C link checker, which produced:

Results

Links

Valid links!

List of redirects

The links below are not broken, but the document does not use the exact URL, and the links were redirected. It may be a good idea to link to the final location, for the sake of speed.

warning Line: 1 http://mail.google.com/mail/ redirected to

https://www.google.com/accounts/ServiceLogin?service=mail&passive=true&rm=false&continue=http%3A%2F%2Fmail.google.com%2Fmail%2F%3Fui%3Dhtml%26zy%3Dl&bsv=zpwhtygjntrz&scc=1&ltmpl=default&ltmplcache=2

Status: 302 -> 200 OK

This is a temporary redirect. Update the link if you believe it makes sense, or leave it as is. 

Anchors

Found 0 anchors.

Checked 1 document in 4.50 seconds.

Which is correct, as the address above is where I am redirected to when I enter mail.google.com into my browser.

What cURL options would I need to use to make my function return 200 for mail.google.com?

Why does the function above return a 404 status code rather than a 302?

TIA

A: 

Could it be that

mail.google.com -> mail.google.com/mail is a 404 and then a hard redirect

and

mail.google.com/mail -> https://www.google.com/accounts... etc is a 302 redirect
Ólafur Waage
A: 

Ólafur,

I'm not sure what you mean by a 'hard redirect'? How is a hard redirect implemented? Do you have an example?

I managed to get cURL to return 200 status code for mail.google.com by commenting out the 'CURLOPT_NOBODY' option, which was set to true.

php.net says: "set value to TRUE to exclude the body from the output. Request method is then set to HEAD. Changing this to FALSE does not change it to GET."
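
One way to act on that quote, sketched under assumptions: keep the cheap HEAD request, and repeat the check with an explicit GET when HEAD comes back with an error status. The names checkStatus() and shouldRetryWithGet() are made up for illustration, and treating every 4xx/5xx HEAD response as suspect is an assumption rather than anything php.net prescribes. CURLOPT_HTTPGET is the option that resets the request method to GET.

```php
<?php
/**
 * Decide whether a HEAD result is worth re-checking with GET.
 * Assumption: any 4xx/5xx answer to a HEAD request is treated as suspect.
 */
function shouldRetryWithGet(int $httpCode): bool
{
    return $httpCode >= 400;
}

/**
 * Return the final HTTP status code for $url, trying HEAD first.
 * Hypothetical helper; requires the cURL extension.
 */
function checkStatus(string $url): int
{
    $ch = curl_init($url);
    curl_setopt_array($ch, array(
        CURLOPT_NOBODY         => true,  // sends a HEAD request
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_RETURNTRANSFER => true,
    ));
    curl_exec($ch);
    $code = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);

    if (shouldRetryWithGet($code)) {
        // Setting NOBODY back to false is not enough on its own,
        // so the method is reset to GET explicitly.
        curl_setopt($ch, CURLOPT_NOBODY, false);
        curl_setopt($ch, CURLOPT_HTTPGET, true);
        curl_exec($ch);
        $code = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
    }

    curl_close($ch);
    return $code;
}
```

Calling checkStatus('http://mail.google.com') would then report whatever the GET request returns if the HEAD request was answered with an error.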

Does this mean that there was a redirect inside the body of mail.google.com?

Thanks

mejpark
What he meant by "hard redirect" is that, if the web server arrived at a location not found (404), then the server might have immediately redirected to another location defined on the server as the error document to handle 404 errors. It turns out this is not what happens in this case, but I am fairly certain that's what he meant by "hard redirect".
Dustin Fineout
A: 

The problem is that the redirect is specified through methods that cURL won't follow.

Here is the response from http://mail.google.com:

HTTP/1.1 200 OK
Cache-Control: public, max-age=604800
Expires: Mon, 22 Jun 2009 14:58:18 GMT
Date: Mon, 15 Jun 2009 14:58:18 GMT
Refresh: 0;URL=http://mail.google.com/mail/
Content-Type: text/html; charset=ISO-8859-1
X-Content-Type-Options: nosniff
Transfer-Encoding: chunked
Server: GFE/1.3

<html>
 <head>
  <meta http-equiv="Refresh" content="0;URL=http://mail.google.com/mail/" />
 </head>
 <body>
  <script type="text/javascript" language="javascript">
  <!--
   location.replace("http://mail.google.com/mail/")
  -->
  </script>
 </body>
</html>

As you can see, the page uses both a Refresh header (and its HTML meta equivalent) and JavaScript in the body to change the location to http://mail.google.com/mail/.
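
If the body is available (that is, the request was not sent with CURLOPT_NOBODY), the meta tag can be picked out with a regular expression. This is only a rough sketch: findMetaRefreshTarget() is a hypothetical name, and the pattern assumes the http-equiv attribute appears before content and that the content value is quoted, as in the response above. A real HTML parser would be more robust.

```php
<?php
/**
 * Return the URL of a <meta http-equiv="refresh"> tag, or null if none is found.
 * Assumes http-equiv precedes content and the content value is quoted.
 */
function findMetaRefreshTarget(string $html): ?string
{
    $pattern = '/<meta[^>]+http-equiv\s*=\s*["\']?refresh["\']?[^>]*'
             . 'content\s*=\s*["\']\s*\d+\s*;\s*url\s*=\s*([^"\'>\s]+)/i';

    if (preg_match($pattern, $html, $m)) {
        // Attribute values may contain entities such as &amp;
        return html_entity_decode($m[1]);
    }
    return null;
}
```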

If you then request http://mail.google.com/mail/, you are redirected (via the Location header, which cURL does follow) to the login page that the W3C checker correctly identified.

HTTP/1.1 302 Moved Temporarily
Cache-Control: no-cache, no-store, max-age=0, must-revalidate
Pragma: no-cache
Expires: Fri, 01 Jan 1990 00:00:00 GMT
Date: Mon, 15 Jun 2009 15:07:56 GMT
Location: https://www.google.com/accounts/ServiceLogin?service=mail&passive=true&rm=false&continue=http%3A%2F%2Fmail.google.com%2Fmail%2F%3Fui%3Dhtml%26zy%3Dl&bsv=zpwhtygjntrz&scc=1&ltmpl=default&ltmplcache=2
Content-Type: text/html; charset=UTF-8
X-Content-Type-Options: nosniff
Transfer-Encoding: chunked
Server: GFE/1.3

HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8
Cache-control: no-cache, no-store
Pragma: no-cache
Expires: Mon, 01-Jan-1990 00:00:00 GMT
Set-Cookie: GALX=B8zH60M78Ys;Path=/accounts;Secure
Date: Mon, 15 Jun 2009 15:07:56 GMT
X-Content-Type-Options: nosniff
Content-Length: 19939
Server: GFE/2.0

(HTML page content here, removed)

Perhaps you should add an additional step in your script to check for a Refresh header.
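
A minimal sketch of such a check, assuming you already have the header text as a string (for example the part of the cURL response that comes before the body when CURLOPT_HEADER is on). findRefreshTarget() is a hypothetical helper name, not a cURL function, and the pattern only covers the "seconds;URL=..." form shown in the response above.

```php
<?php
/**
 * Extract the target URL from a Refresh header in raw header text.
 * Returns null when the final response has no Refresh header with a URL.
 * Assumes $rawHeaders contains headers only, without the response body.
 */
function findRefreshTarget(string $rawHeaders): ?string
{
    // With CURLOPT_FOLLOWLOCATION, cURL returns one header block per hop,
    // separated by blank lines; only the last block describes the final page.
    $blocks = preg_split('/\r?\n\r?\n/', trim($rawHeaders));
    $last = end($blocks);

    // Matches e.g. "Refresh: 0;URL=http://example.com/" (case-insensitive).
    if (preg_match('/^Refresh:\s*\d+\s*;\s*url\s*=\s*(\S+)/mi', $last, $m)) {
        return trim($m[1], "\"'");
    }
    return null;
}
```

If the function returns a URL, the link checker could request that URL next and repeat the check, ideally with a cap on the number of hops so that a refresh loop cannot run forever.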

Another possible cause is that open_basedir is set in your PHP configuration, which disables CURLOPT_FOLLOWLOCATION. You can check this quickly by turning on error reporting, since PHP then reports a warning or notice when the option is set.

The results above were all obtained with the following cURL setup:

$useragent="Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.5) Gecko/2008120122 Firefox/3.0.5";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_AUTOREFERER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_USERAGENT, $useragent);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

$res = curl_exec($ch);

curl_close($ch);
Dustin Fineout