I think you may be going too fast.
Most operating systems have a limit on the number of sockets that can be open at any one time, but it's actually worse than that.
When a socket is closed down, it is put into a special TIME_WAIT state for a certain amount of time, usually twice the maximum segment lifetime (2 * MSL). This ensures that any packets that were still out in the network and on their way to your socket when you shut it down can be captured and thrown away if they arrive before they die.
Once that time expires, you can be sure that all such packets have already died and the socket can safely be freed.
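If you want to see this in action, here's a tiny local demonstration (just a sketch; the exact behaviour can vary by OS and settings). It opens one short-lived connection, closes it from the client side, and then tries to rebind the client's ephemeral port; the bind fails because the dead connection is still hanging around in TIME_WAIT.

    import socket

    # Throwaway local listener on any free port.
    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    server_port = listener.getsockname()[1]

    # Open one short-lived connection and close it from the client side.
    client = socket.create_connection(("127.0.0.1", server_port))
    local_addr = client.getsockname()      # the ephemeral (ip, port) we were given
    server_side, _ = listener.accept()
    client.close()                         # active close: this end enters TIME_WAIT
    server_side.close()

    # The ephemeral port is still owned by the dead connection, so rebinding it
    # fails until the TIME_WAIT period (roughly 2 * MSL) expires.
    probe = socket.socket()
    try:
        probe.bind(local_addr)
        print("rebind succeeded (no TIME_WAIT visible on this system)")
    except OSError as exc:
        print("rebind refused; the old connection is still hanging around:", exc)
    finally:
        probe.close()
        listener.close()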
I think that's what's happening in your case: the sockets aren't being freed as quickly as you think.
We had a similar problem with code that opened lots of short-lived sessions. It ran fine for a while, but then the hardware got faster, allowing many more to be opened in a given time period. This manifested itself as an inability to open any more sessions.
One way to check this is to do

    netstat -a

from the command line and see how many sessions are actually in the TIME_WAIT state.
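If you'd rather get a number than eyeball the output, something like this (a rough diagnostic; it just shells out to netstat and counts matching lines) will do:

    import subprocess

    def count_time_wait():
        # Count netstat lines that mention TIME_WAIT.
        output = subprocess.run(["netstat", "-an"],
                                capture_output=True, text=True).stdout
        return sum(1 for line in output.splitlines() if "TIME_WAIT" in line)

    print("sockets in TIME_WAIT:", count_time_wait())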
If that does turn out to be the case, there are a few ways to handle it.
- re-use your sessions, either manually or by maintaining a connection pool.
- introduce a delay in each connection to try to avoid reaching the saturation point.
- go flat out until you reach saturation and then modify your behaviour, such as running your connect logic inside a while loop that retries up to 60 times with a two-second delay each time before giving up totally (there's a rough sketch of this after the list). This lets you run at full speed, slowing down only if there's a problem.
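To make that last option concrete, here's roughly what the fixed-delay variant might look like in Python. The host, port and limits are placeholders rather than anything from your setup:

    import socket
    import time

    MAX_TRIES = 60       # give up totally after this many failed attempts
    RETRY_DELAY = 2      # seconds to wait between failed attempts

    def connect_with_retry(host, port):
        """Open a socket, retrying at a fixed interval if the connect fails."""
        for attempt in range(1, MAX_TRIES + 1):
            try:
                return socket.create_connection((host, port))
            except OSError:
                if attempt == MAX_TRIES:
                    raise            # still failing: treat it as a permanent error
                time.sleep(RETRY_DELAY)

    # usage (hypothetical endpoint):
    # conn = connect_with_retry("example.com", 80)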
That last bullet point deserves some expansion. We actually used a back-off strategy in our aforementioned application, which would gradually lessen the load on a resource provider if it was complaining. So, instead of a fixed two-second delay between attempts, we opted for a one-second delay, then two seconds, then four and so on.
The general process for a back-off strategy is as follows, and it can be used in any case where there may be a temporary shortage of a resource. The action alluded to in the pseudo-code below would be the opening of a socket in your case.
    set maxdelay to 16    # maximum time period between attempts
    set maxtries to 10    # maximum attempts
    set delay to 0
    set tries to 0
    while more actions needed:
        if delay is not 0:
            sleep delay
        attempt action
        if action failed:
            add 1 to tries
            if tries is greater than maxtries:
                exit with permanent error
            if delay is 0:
                set delay to 1
            else:
                double delay
            if delay is greater than maxdelay:
                set delay to maxdelay
        else:
            set delay to 0
            set tries to 0
This allows the process to run at full speed in the vast majority of cases but back off when errors start occurring, hopefully giving the resource provider time to recover. The gradually increasing delays give more serious resource shortages time to clear, and the maximum-tries limit catches what you would term permanent errors (or ones that are taking too long to recover from).
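To tie it back to your problem, here's a minimal Python sketch of that loop with the action being the opening of a socket. The endpoints and limits are placeholders you'd tune for your own situation:

    import socket
    import time

    MAX_DELAY = 16   # maximum delay between attempts (seconds)
    MAX_TRIES = 10   # maximum failed attempts before giving up

    def open_sessions(endpoints):
        """Open a connection to each (host, port), backing off when opens start failing."""
        delay = 0
        tries = 0
        connections = []
        for host, port in endpoints:
            while True:                   # keep retrying this endpoint
                if delay:
                    time.sleep(delay)
                try:
                    connections.append(socket.create_connection((host, port)))
                except OSError:
                    tries += 1
                    if tries > MAX_TRIES:
                        raise             # permanent error: give up totally
                    # back off: 1, 2, 4, ... seconds, capped at MAX_DELAY
                    delay = min(delay * 2, MAX_DELAY) if delay else 1
                else:
                    delay = 0             # success: back to full speed
                    tries = 0
                    break
        return connections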