views:

451

answers:

2

I've got a strange issue with a server accepting TCP connections. Even though there are normally some processes waiting, at some volume of connections it hangs.

Long version:

The server is written in Perl and binds a $srv socket with the reuse flag and listen == 5. Afterwards, it forks into 10 processes with a loop of $clt=$srv->accept(); do_processing($clt); $clt->shutdown(2);

The client written in C is also very simple - it sends some lines, then receives all lines available and does a shutdown(sockfd, 2); There's nothing async going on and at the end both send and receive queues are empty (as reported by netstat).

Connections last only ~20ms. All clients behave the same way, are the same implementation, etc. Now let's say I'm accepting X connections from client 1 and another X from client 2. Processes still report that they're idle all the time. If I add another X connections from client 3, suddenly the server processes start hanging just after accepting. The first blocking thing they do after accept(); is while (<$clt>) ... - but they don't get any data (on the first try already). Suddenly all 10 processes are in this state and do not stop waiting. On strace, the server processes seem to hang on read(), which makes sense.

There are loads of connections in TIME_WAIT state belonging to that server (~100 when the problem starts to manifest), but this might be a red herring.

What could be happening here?


After some more analysis: It turned out that the client was at fault, not closing previous connections properly before trying the next one. The servers at the beginning of the load-balancing list were left stale connections.

A: 

Does it surge and then pause a long time (circa two minutes or so) and then surge again? If so you may not have your system max open files limit set high enough.

Benjamin Franz
The open files ulimit is at 1024. The server never goes over ~100 dead (time_wait) connections and never over 10 (1 per forked process) live connections. When the blocked connections start, they happen on around ¼ of connections (3 will go through, one will block for ~5 seconds, until the timeout protection kicks in and respawns the process).
viraptor
+1  A: 

This probably isn't the solution to your problem, but it might solve a problem you'll experience in the future: don't forget to close() the sockets when you're done! shutdown() will disconnect the stream, but it'll still eat a file descriptor.

Since you said strace is showing processes stuck in read(), then your problem seems to be that the client isn't sending the data you expect it to be sending. You should either fix your client, or add an alarm() to your server processes so that they can survive dead clients.

apenwarr