ansaurus

Question

Confusion with mail.google.com, cURL and http://validator.w3.org/checklink

Answer 1

A:

Could it be that

mail.google.com -> mail.google.com/mail is a 404 and then a hard redirect

and

mail.google.com/mail -> https://www.google.com/accounts... etc is a 302 redirect

Ólafur Waage 2009-06-13 19:17:51

Answer 2

A:

Ólafur,

I'm not sure what you mean by a 'hard redirect'? How is a hard redirect implemented? Do you have an example?

I managed to get cURL to return 200 status code for mail.google.com by commenting out the 'CURLOPT_NOBODY' option, which was set to true.

php.net says: "set value to TRUE to exclude the body from the output. Request method is then set to HEAD. Changing this to FALSE does not change it to GET."

Does this mean that there was a redirect inside the body of mail.google.com?

Thanks

mejpark 2009-06-14 11:33:44

What he meant by "hard redirect" is that, if the web server arrived at a location not found (404), then the server might have immediately redirected to another location defined on the server as the error document to handle 404 errors. It turns out this is not what happens in this case, but I am fairly certain that's what he meant by "hard redirect".

Dustin Fineout 2009-06-19 15:49:36

Answer 3

+1 A:

The problem is that the redirect is specified through methods that cURL won't follow.

Here is the response from http://mail.google.com:

HTTP/1.1 200 OK
Cache-Control: public, max-age=604800
Expires: Mon, 22 Jun 2009 14:58:18 GMT
Date: Mon, 15 Jun 2009 14:58:18 GMT
Refresh: 0;URL=http://mail.google.com/mail/
Content-Type: text/html; charset=ISO-8859-1
X-Content-Type-Options: nosniff
Transfer-Encoding: chunked
Server: GFE/1.3

<html>
 <head>
  <meta http-equiv="Refresh" content="0;URL=http://mail.google.com/mail/" />
 </head>
 <body>
  <script type="text/javascript" language="javascript">
  <!--
   location.replace("http://mail.google.com/mail/")
  -->
  </script>
 </body>
</html>

As you can see, the page uses three mechanisms to send the browser to http://mail.google.com/mail/: a Refresh header, the equivalent HTML meta tag, and JavaScript in the body. None of these is an HTTP redirect, so cURL reports the 200 and stops.
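A rough sketch of how a script could pick the meta refresh target out of a body like the one above. The function name findMetaRefreshTarget is made up for illustration, and the regular expression assumes that http-equiv appears before content inside the tag, as it does in Google's response; a real HTML parser would be more robust.

```php
<?php
/**
 * Return the URL named in a <meta http-equiv="Refresh"> tag, or null.
 * Hypothetical helper: assumes http-equiv comes before content in the tag.
 */
function findMetaRefreshTarget(string $html): ?string
{
    $pattern = '/<meta[^>]+http-equiv\s*=\s*["\']?refresh["\']?[^>]*'
             . 'content\s*=\s*["\']\s*\d+\s*;\s*url\s*=\s*([^"\'>\s]+)/i';
    if (preg_match($pattern, $html, $m) !== 1) {
        return null; // no meta refresh found
    }
    return $m[1];
}
```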

If you then request http://mail.google.com/mail/, you will be redirected (this time with a Location header, which cURL does follow) to the login page that, as you mentioned, the W3C link checker correctly identifies.

HTTP/1.1 302 Moved Temporarily
Cache-Control: no-cache, no-store, max-age=0, must-revalidate
Pragma: no-cache
Expires: Fri, 01 Jan 1990 00:00:00 GMT
Date: Mon, 15 Jun 2009 15:07:56 GMT
Location: https://www.google.com/accounts/ServiceLogin?service=mail&passive=true&rm=false&continue=http%3A%2F%2Fmail.google.com%2Fmail%2F%3Fui%3Dhtml%26zy%3Dl&bsv=zpwhtygjntrz&scc=1&ltmpl=default&ltmplcache=2
Content-Type: text/html; charset=UTF-8
X-Content-Type-Options: nosniff
Transfer-Encoding: chunked
Server: GFE/1.3

HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8
Cache-control: no-cache, no-store
Pragma: no-cache
Expires: Mon, 01-Jan-1990 00:00:00 GMT
Set-Cookie: GALX=B8zH60M78Ys;Path=/accounts;Secure
Date: Mon, 15 Jun 2009 15:07:56 GMT
X-Content-Type-Options: nosniff
Content-Length: 19939
Server: GFE/2.0

(HTML page content here, removed)

Perhaps you should add an additional step in your script to check for a Refresh header.
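Something along these lines might work as that extra step. This is only a sketch: findRefreshTarget is a hypothetical helper name, and it expects just the header portion of the response as plain text.

```php
<?php
/**
 * Return the URL named in a Refresh header, or null if there is none.
 * Hypothetical helper: $rawHeaders is the header part of a response.
 */
function findRefreshTarget(string $rawHeaders): ?string
{
    // Matches e.g. "Refresh: 0;URL=http://example.com/" (case-insensitive).
    $pattern = '/^Refresh:\s*\d+\s*;\s*url\s*=\s*(\S+)/mi';
    if (preg_match($pattern, $rawHeaders, $m) !== 1) {
        return null;
    }
    // Strip optional quotes around the URL.
    return trim($m[1], "\"'");
}
```

With CURLOPT_HEADER enabled, the header portion can be cut out of the cURL result with substr($res, 0, curl_getinfo($ch, CURLINFO_HEADER_SIZE)) before calling curl_close($ch). If the function returns a URL, the script can request that URL in a second cURL call.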

Another possible cause is that open_basedir is set in your PHP configuration, which disables CURLOPT_FOLLOWLOCATION. You can check this quickly by turning on error reporting: PHP emits a warning or notice when the option cannot be set.
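A quick way to check for this, as a sketch only. isOpenBasedirSet is a hypothetical helper name; it just reads the current configuration with ini_get, and whether the setting actually blocks CURLOPT_FOLLOWLOCATION depends on your PHP version.

```php
<?php
// Show every warning and notice while debugging.
error_reporting(E_ALL);
ini_set('display_errors', '1');

/**
 * Hypothetical helper: true if open_basedir has a non-empty value.
 */
function isOpenBasedirSet(): bool
{
    $value = ini_get('open_basedir'); // string, or false if unknown
    return $value !== false && $value !== '';
}

if (isOpenBasedirSet()) {
    echo "open_basedir is set; CURLOPT_FOLLOWLOCATION may be refused.\n";
}
```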

The results above were all obtained with the following cURL setup:

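// URL under test (the first response above came from this address)
$url = "http://mail.google.com";
$useragent = "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.5) Gecko/2008120122 Firefox/3.0.5";

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_AUTOREFERER, 1);    // set Referer automatically on redirects
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1); // follow Location headers only
curl_setopt($ch, CURLOPT_HEADER, 1);         // include response headers in the output
curl_setopt($ch, CURLOPT_USERAGENT, $useragent);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the response instead of printing it

$res = curl_exec($ch);

curl_close($ch);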

Dustin Fineout 2009-06-15 15:19:31
