tags:
views: 1465
answers: 5

I would like to create a batch script to go through 20,000 links in a DB and weed out all the 404s and such. How would I get the HTTP status code for a remote URL?

Preferably not using curl, since I don't have it installed.

+6  A: 

CURL would be perfect but since you don't have it, you'll have to get down and dirty with sockets. The technique is:

  1. Open a socket to the server.
  2. Send an HTTP HEAD request.
  3. Parse the response.

Here is a quick example:

<?php

$url = parse_url('http://www.example.com/index.html');

$host = $url['host'];
$path = $url['path'];
$port = isset($url['port']) ? $url['port'] : 80; // default to port 80

// Build a minimal HEAD request
$request = "HEAD $path HTTP/1.1\r\n"
          ."Host: $host\r\n"
          ."Connection: close\r\n"
          ."\r\n";

// Open a TCP socket to the server and send the request
$address = gethostbyname($host);
$socket = socket_create(AF_INET, SOCK_STREAM, SOL_TCP);
socket_connect($socket, $address, $port);

socket_write($socket, $request, strlen($request));

// The status code is the second token of the status line, e.g. "HTTP/1.1 200 OK"
$response = explode(' ', socket_read($socket, 1024));

print "<p>Response: ". $response[1] ."</p>\r\n";

socket_close($socket);

?>

UPDATE: I've added a few lines to parse the URL.

Adam Pierce
I believe that's: ."Host: $host\r\n" (i.e., not %host). But other than that, that'll work nicely.
Sean Schulte
Thanks for spotting that, Sean. I'll correct that little typo.
Adam Pierce
I should point out that not all web servers support or enable HEAD requests, even if the chance of hitting one is close to nil...
jcinacio
That's a "quick example"? Nice work.
Jim Nelson
A: 

A quick bit of googling found this link: http://www.webmasterworld.com/forum88/12559.htm. The most up-to-date version is near the bottom.

jimktrains
A: 

This page looks like it has a pretty good setup to download a page using either curl or fsockopen, and can get the HTTP headers using either method (which is what you want, really).

After using that method, you'd want to check $output['info']['http_code'] to get the data you want.
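
If curl does turn out to be available, the 'http_code' value that page exposes is presumably just what curl_getinfo() reports; a minimal sketch of that check (the URL is only an example):

<?php
// Minimal curl-based status check (only relevant if curl is available after all).
$ch = curl_init('http://www.example.com/index.html');
curl_setopt($ch, CURLOPT_NOBODY, true);         // issue a HEAD request, skip the body
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // don't echo the response
curl_exec($ch);

// Presumably the same value the linked page stores in $output['info']['http_code']
$status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);

echo $status; // e.g. 200 or 404
?>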

Hope that helps.

Sean Schulte
+1  A: 

You can use PEAR's HTTP::head function.
http://pear.php.net/manual/en/package.http.http.head.php
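
A rough sketch of how that might look; the exact keys of the returned array (assumed here to include 'response_code') should be checked against the manual page above:

<?php
require_once 'PEAR.php';
require_once 'HTTP.php'; // pear install HTTP

$result = HTTP::head('http://www.example.com/index.html');

if (PEAR::isError($result)) {
    echo 'Request failed: ' . $result->getMessage();
} else {
    // Assumed to hold the numeric status code; see the manual for the exact format
    echo $result['response_code'];
}
?>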

sanxiyn
+2  A: 

If I'm not mistaken, none of the PHP built-in functions return the HTTP status of a remote URL, so the best option would be to use sockets to open a connection to the server, send a request, and parse the response status:

pseudo code:

parse url => $host, $port, $path
$request = "GET $path HTTP/1.0\r\nHost: $host\r\n\r\n";
$fp = fsockopen($host, $port, $errno, $errstr, $timeout), check for any errors
fwrite($fp, $request)
while (!feof($fp)) {
   $headers .= fgets($fp, 4096);
   $status = <parse $headers>
   if (<status read>)
     break;
}
fclose($fp)
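
Filled in as runnable PHP, that pseudocode might look something like this (a sketch; the timeout value and variable names are my own choices):

<?php
$url  = parse_url('http://www.example.com/index.html');
$host = $url['host'];
$path = $url['path'];
$port = isset($url['port']) ? $url['port'] : 80;

$request = "GET $path HTTP/1.0\r\nHost: $host\r\nConnection: close\r\n\r\n";

// Open the connection with a 30 second timeout and check for errors
$fp = fsockopen($host, $port, $errno, $errstr, 30);
if (!$fp) {
    die("Connection failed: $errstr ($errno)");
}
fwrite($fp, $request);

// The first line of the response is the status line, e.g. "HTTP/1.0 200 OK"
$status_line = fgets($fp, 4096);
fclose($fp);

$parts = explode(' ', $status_line);
echo $parts[1]; // the numeric status code
?>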

Another option is to use an already-built HTTP client class in PHP that can return the headers without fetching the full page content; there should be a few open-source classes available on the net...

jcinacio