Given a list of urls, I would like to check that each url:

  • Returns a 200 OK status code
  • Returns a response within X amount of time

The end goal is a system that is capable of flagging urls as potentially broken so that an administrator can review them.

The script will be written in PHP and will most likely run on a daily basis via cron.

The script will be processing approximately 1000 urls at a go.

Question has two parts:

  • Are there any big gotchas with an operation like this? What issues have you run into?
  • What is the best method for checking the status of a url in PHP considering both accuracy and performance?

Thanks very much for taking the time.

+2  A: 
  1. fopen() supports http URI.
  2. If you need more flexibility (such as timeout), look into the cURL extension.
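For illustration, here is a minimal sketch of the fopen()/HTTP-stream approach mentioned in point 1, with a timeout set through a stream context. It assumes allow_url_fopen is enabled, and the function name check_via_stream() is hypothetical, not part of the original answer:

// A minimal sketch of the fopen()/HTTP-stream approach (assumes allow_url_fopen is on).
function check_via_stream($url, $timeout = 30) {
    $context = stream_context_create(array(
        'http' => array(
            'method'  => 'HEAD',   // headers are enough for a status check
            'timeout' => $timeout, // seconds before the stream gives up
        ),
    ));

    $fp = @fopen($url, 'r', false, $context);
    if ($fp === false) {
        return false; // connection failure, DNS failure, or an HTTP error status
    }

    // wrapper_data holds the raw response headers, e.g. "HTTP/1.1 200 OK"
    $meta = stream_get_meta_data($fp);
    fclose($fp);

    return isset($meta['wrapper_data'][0])
        && strpos($meta['wrapper_data'][0], ' 200 ') !== false;
}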
Serge - appTranslator
don't use fopen() - it doesn't support redirects and such.
Alex
+6  A: 

Look into cURL. There's a library for PHP.

There's also an executable version of cURL so you could even write the script in bash.
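If you stay in PHP but lean on the curl executable, one hedged way to do it (assuming curl is installed and on the PATH; the variable names are my own) is:

// Hypothetical wrapper around the curl binary; -w "%{http_code}" prints only the status code.
$url  = 'http://example.com/';
$cmd  = 'curl -s -o /dev/null --max-time 30 -w "%{http_code}" ' . escapeshellarg($url);
$code = trim(shell_exec($cmd));
$ok   = ($code === '200');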

antik
A: 

One potential problem you will undoubtedly run into is when the box this script is running on loses access to the Internet... you'll get 1000 false positives.

It would probably be better for your script to keep some type of history and only report a failure after 5 days of failure.
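As an illustration only, a file-based version of that history idea could look roughly like the sketch below; the file path, the $results array, and the flag_for_review() hook are all hypothetical, and a database table would work the same way:

// Hypothetical failure-count tracking across daily cron runs.
$historyFile = '/var/tmp/url_failures.json'; // hypothetical location
$failures    = array();
if (file_exists($historyFile)) {
    $failures = json_decode(file_get_contents($historyFile), true);
    if (!is_array($failures)) {
        $failures = array();
    }
}

// $results is assumed to map each URL to true (reachable) or false (failed) for today's run.
foreach ($results as $url => $ok) {
    $failures[$url] = $ok ? 0 : (isset($failures[$url]) ? $failures[$url] + 1 : 1);
    if ($failures[$url] >= 5) {
        flag_for_review($url); // hypothetical reporting hook for the administrator
    }
}

file_put_contents($historyFile, json_encode($failures));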

Also, the script should be self-checking in some way (like checking a known good web site [google?]) before continuing with the standard checks.
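A sanity check along those lines could be as simple as the following sketch. It assumes the is_available() function from the cURL answer further down, and that http://www.google.com/ is an acceptable known-good URL for your network:

// Abort the whole run if we can't even reach a known good site -- our own
// connectivity is probably the problem, not the 1000 URLs we're about to flag.
if (!is_available('http://www.google.com/', 10)) {
    error_log('Link checker aborted: no outbound connectivity');
    exit(1);
}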

BoltBait
Yeah, there is a major history component to the end application. I left it out of the question for simplicity. Also, checking some known good URL is a great idea. Thanks. :)
GloryFish
+4  A: 

I actually wrote something in PHP that does this over a database of 5k+ URLs. I used the PEAR class HTTP_Request, which has a method called getResponseCode(). I just iterate over the URLs, passing them to getResponseCode and evaluate the response.
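In sketch form, the loop body looks something like the following; it assumes the PEAR HTTP_Request package is installed, and the helper name get_status_code() is mine rather than part of the original answer:

// A minimal sketch of the HTTP_Request approach described above.
require_once 'PEAR.php';
require_once 'HTTP/Request.php';

function get_status_code($url) {
    $req    = new HTTP_Request($url);
    $result = $req->sendRequest();
    if (PEAR::isError($result)) {
        return 0; // DNS/connection failures come back as a PEAR_Error, not a status code
    }
    return $req->getResponseCode();
}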

However, it doesn't work for FTP addresses, URLs that don't begin with http or https (unconfirmed, but I believe it's the case), or sites with invalid security certificates (a 0 is returned). A 0 is also returned for server-not-found (there's no status code for that).

And it's probably easier than cURL, since you just include a few files and use a single function to get an integer code back.

Thomas Owens
+1  A: 

Seems like it might be a job for curl.

If you're not stuck on PHP, Perl's LWP might be an answer too.

Chris Kloberdanz
Amen to LWP. Perl is well suited here, and it handles timeouts too. Not to mention, it rocks. :)
Abyss Knight
A: 

You should be careful not to hammer the same website continuously or the owner might get upset. Maybe sort the list, and for multiple URLs from the same site institute some type of delay before the next request (or go on to another site and come back to that one later).
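As a rough sketch of that idea, you could group the URLs by host and round-robin across the groups; the commented-out check_url() call is a placeholder for whichever checker you end up using, and $urls is assumed to hold your flat list:

// Group URLs by host so consecutive requests rarely hit the same server.
$byHost = array();
foreach ($urls as $url) {
    $host = parse_url($url, PHP_URL_HOST);
    $byHost[$host][] = $url;
}

// Round-robin: one URL per host per pass, with a small pause between passes.
while (!empty($byHost)) {
    foreach (array_keys($byHost) as $host) {
        $url = array_shift($byHost[$host]);
        // check_url($url); // placeholder for the actual check
        if (empty($byHost[$host])) {
            unset($byHost[$host]);
        }
    }
    sleep(1); // stay polite between passes
}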

Kip
+1  A: 

You should also be aware of URLs returning 301 or 302 HTTP responses which redirect to another page. Generally this doesn't mean the link is invalid. For example, http://amazon.com returns 301 and redirects to http://www.amazon.com/.
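If you decide a redirect should count as "not broken", one hedged tweak is to accept the whole 2xx/3xx range rather than only 200; here $code stands for whatever status code your checker returned:

// Treat success and redirect responses as "link is alive"; only 4xx/5xx (or 0) get flagged.
$ok = ($code >= 200 && $code < 400);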

Kip
+8  A: 

Use the PHP cURL extension. Unlike fopen(), it can also make HTTP HEAD requests, which are sufficient to check the availability of a URL and save you a ton of bandwidth, as you don't have to download the entire body of the page just to check it.

As a starting point you could use some function like this:

function is_available($url, $timeout = 30) {
    $ch = curl_init(); // get cURL handle

    // set cURL options
    $opts = array(
        CURLOPT_RETURNTRANSFER => true,     // do not output to browser
        CURLOPT_URL            => $url,     // set URL
        CURLOPT_NOBODY         => true,     // do a HEAD request only
        CURLOPT_TIMEOUT        => $timeout, // set timeout
    );
    curl_setopt_array($ch, $opts);

    curl_exec($ch); // do it!

    $retval = curl_getinfo($ch, CURLINFO_HTTP_CODE) == 200; // check if HTTP OK

    curl_close($ch); // close handle

    return $retval;
}

However, there are plenty of possible optimizations: you might want to re-use the cURL instance and, if you're checking more than one URL per host, even re-use the connection.
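A sketch of the handle re-use idea, assuming $urls holds the list to check, could look like this:

// Re-use one cURL handle for all URLs; only CURLOPT_URL changes between requests,
// and cURL can keep connections to the same host alive across iterations.
$ch = curl_init();
curl_setopt_array($ch, array(
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_NOBODY         => true,
    CURLOPT_TIMEOUT        => 30,
));

$results = array();
foreach ($urls as $url) {
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_exec($ch);
    $results[$url] = (curl_getinfo($ch, CURLINFO_HTTP_CODE) == 200);
}
curl_close($ch);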

Oh, and this code checks strictly for HTTP response code 200. It does not follow redirects (302), but there is also a cURL option for that.
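For completeness, the redirect-following variant would just add two options to the $opts array above; both are standard cURL options, and the limit of 5 is an arbitrary choice:

$opts[CURLOPT_FOLLOWLOCATION] = true; // follow 301/302 responses
$opts[CURLOPT_MAXREDIRS]      = 5;    // but don't chase redirect chains forever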

Henning
+1  A: 

Just returning a 200 response is not enough; many valid links will continue to return 200 after they turn into porn / gambling portals once the former owner fails to renew the domain.

Domain squatters typically ensure that every URL in their domains returns 200.

MarkR
That is a true concern, as well. Checking for good (or bad) URLs is not a trivial problem.
Thomas Owens