tags:

views:

540

answers:

2

I'm using a 'rolling' cURL multi implementation (like this SO post, based on this cURL code). It works fine to process thousands of URLs using up to 100 requests at the same time, with 5 instances of the script running as daemons (yeah, I know, this should be written in C or something).

Here's the problem: after processing ~200,000 urls (across the 5 instances) curl_multi_exec() seems to break for all instances of the script. I've tried shutting down the scripts, then restarting, and the same thing happens (not after 200,000 urls, but right on restart), the script hangs calling curl_multi_exec().

I put the script into 'single' mode, processing one regular cURL handle at time, and that works fine (but it's not quite the speed I need). My logging leads me to suspect that it may have hit a patch of slow/problematic connections (since every so often it seems to process on URL then hang again), but that would mean my CURLOPT_TIMEOUT is being ignored for the individual handles. Or maybe it's just something with running that many requests through cURL.

Anyone heard of anything like this?

Sample code (again based on this):

//some logging shows it hangs right here, only looping a time or two
//so the hang seems to be in the curl call
while(($execrun = 
    curl_multi_exec($master, $running)) == CURLM_CALL_MULTI_PERFORM);

//code to check for error or process whatever returned

I have CURLOPT_TIMEOUT set to 120, but in the cases where curl_multi_exec() finally returns some data, it's after 10 minutes of waiting.

I have a bunch of testing/checking yet to do, but thought maybe this might ring a bell with someone.

+1  A: 

I think your problem is releated to:

(62) CURLOPT_TIMEOUT does not work properly with the regular multi and multi_socket interfaces. The work-around for apps is to simply remove the easy handle once the time is up.

See also: http://curl.haxx.se/bug/view.cgi?id=2501457

If that is the case you should watch your curl handles for timeouts and remove them from the multi pool.

Goran Rakic
Exactly what I was looking for. Will try some test code and see what happens.
Tim Lytle
A: 

After much testing, I believe I've found what is causing this particular problem. I'm not saying the other answer is incorrect, just in this case not the issue I am having.

From what I can tell, curl_multi_exec() does not return until all DNS (failure or success) is resolved. If there are a bunch of urls with bad domains curl_multi_exec() doesn't return for at least:

(time it takes to get resolve error) * (number of urls with bad domain)

Here's someone else who has discovered this:

Just a note on the asynchronous nature of cURL’s multi functions: the DNS lookups are not (as far as I know today) asynchronous. So if one DNS lookup of your group fails, everything in the list of URLs after that fails also. We actually update our hosts.conf (I think?) file on our server daily in order to get around this. It gets the IP addresses there instead of looking them up. I believe it’s being worked on, but not sure if it’s changed in cURL yet.

Also, testing shows that cURL (at least my version) does follow the CURLOPT_CONNECTTIMEOUT setting. Of course the first step of a multi cycle may still take a long time, since cURL waits for every url to resolve or timeout.

Tim Lytle