Here is a brief overview of what I am doing, it is quite simple really:

  • Go out and fetch records from a database table.
  • Walk through all those records and, for each column that contains a URL, go out (using cURL) and make sure the URL is still valid.
  • For each record, a column is updated with a current timestamp indicating when it was last checked, and some other DB processing takes place.

Anyhow, all this works well and does exactly what it is supposed to. The problem is that I think performance could be greatly improved in terms of how I am validating the URLs with cURL.

Here is a brief (over simplified) excerpt from my code which demonstrates how cURL is being used:

$ch = curl_init();
// options that don't change per record only need to be set once
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
while($dbo = pg_fetch_object($dbres))
{
   // for each iteration set url to db record url
   curl_setopt($ch, CURLOPT_URL, $dbo->url);
   curl_exec($ch); // perform a cURL session
   $ihttp_code = intval(curl_getinfo($ch, CURLINFO_HTTP_CODE));
   // do checks on $ihttp_code and update db
}
// do other stuff here
curl_close($ch);

As you can see, I am just reusing the same cURL handle the entire time, but even if I strip out all of the other processing (database or otherwise), the script still takes incredibly long to run. Would changing any of the cURL options help improve performance? Tuning timeout values, etc.? Any input would be appreciated.

Thank you,

  • Nicholas
+4  A: 

Set CURLOPT_NOBODY to 1 (see the curl documentation) to tell curl not to ask for the body of the response. This makes curl issue a HEAD request instead of a GET. The response code will tell you whether the URL is valid, without transferring the bulk of the data back.
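Applied to your loop, that plus some timeouts might look like this (the timeout values and the redirect option are illustrative assumptions, not something from your question; tune them to your data):

```php
<?php
// Sketch: options for a pure liveness check, set once before the loop.
curl_setopt($ch, CURLOPT_NOBODY, true);          // send HEAD, skip the body
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);     // give up on dead hosts quickly (assumed value)
curl_setopt($ch, CURLOPT_TIMEOUT, 10);           // cap total time per URL (assumed value)
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // count redirect targets as reachable (assumption)

curl_setopt($ch, CURLOPT_URL, $dbo->url);        // inside the loop, as before
curl_exec($ch);
$ihttp_code = intval(curl_getinfo($ch, CURLINFO_HTTP_CODE));
```

The connect timeout is usually the bigger win: without it, a single unresponsive host can stall the loop for the system's default TCP timeout.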

If that's still too slow, then you'll likely see a vast improvement by running N threads (or processes) each doing 1/Nth of the work. The bottleneck may not be in your code, but in the response times of the remote servers. If they're slow to respond, then your loop will be slow to run.
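In PHP specifically, the usual way to get that parallelism without threads is the curl_multi API. A minimal sketch (function name and timeout values are my own, illustrative choices):

```php
<?php
// Sketch: check a batch of URLs in parallel and return their HTTP codes.
function check_urls(array $urls): array
{
    $mh = curl_multi_init();
    $handles = [];

    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_NOBODY, true);        // HEAD request only
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);   // assumed value
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);         // assumed value
        curl_multi_add_handle($mh, $ch);
        $handles[$url] = $ch;
    }

    // Drive all transfers until every handle has finished.
    do {
        $status = curl_multi_exec($mh, $running);
        if ($running) {
            curl_multi_select($mh); // wait for activity instead of busy-looping
        }
    } while ($running && $status == CURLM_OK);

    $codes = [];
    foreach ($handles as $url => $ch) {
        $codes[$url] = intval(curl_getinfo($ch, CURLINFO_HTTP_CODE));
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    return $codes;
}
```

You'd feed this batches of URLs from your result set rather than all of them at once, so you don't open thousands of connections simultaneously.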

slacy
Adding that parameter definitely helped, cutting execution time by 30-40% -- thank you!
Nicholas Kreidberg
Nice idea, thanks for the contribution!
Jay
We can't use multithreading here. Use curl_multi: http://www.askapache.com/php/curl-multi-downloads.html
mixdev