Hello folks.

This involves two automated unit tests, each of which starts up a TCP/IP server that creates a non-blocking socket, bind()s and listen()s, then loops on select() waiting for a client to connect and download some data.
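
For illustration, the server side of each test is roughly along these lines (a simplified C# sketch, not the actual test code; the port number, backlog, and timeout values are made up):

```csharp
// Simplified sketch of the per-test server: non-blocking listening socket,
// bind()/listen(), then a select() loop waiting for the client.
// Port, backlog and timeout are illustrative values only.
using System.Collections.Generic;
using System.Net;
using System.Net.Sockets;

class TestServerSketch
{
    public static void Run()
    {
        var listener = new Socket(AddressFamily.InterNetwork,
                                  SocketType.Stream, ProtocolType.Tcp);
        listener.Blocking = false;                               // non-blocking
        listener.Bind(new IPEndPoint(IPAddress.Loopback, 9050)); // made-up port
        listener.Listen(10);

        bool served = false;
        while (!served)
        {
            // A listening socket reported as "readable" by select() has a
            // pending connection ready to accept.
            var readable = new List<Socket> { listener };
            Socket.Select(readable, null, null, 1000 * 1000);    // 1 second

            if (readable.Contains(listener))
            {
                Socket client = listener.Accept();
                // ... send the test data to the client here ...
                client.Shutdown(SocketShutdown.Both);
                client.Close();
                served = true;
            }
        }

        listener.Close();
    }
}
```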

The catch is that they work perfectly when run separately, but when run as a test suite the second test's client fails to connect with WSAECONNREFUSED...

UNLESS

there is a Thread.Sleep() of several seconds between them??!!!

Interestingly, there is a retry loop that reattempts the connection every second after a failure, so the second test keeps retrying until it times out after 10 minutes.
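
The client-side retry is roughly this shape (again just a sketch; only the one-second retry and ten-minute timeout come from the real code, the rest is illustrative):

```csharp
// Sketch of the client retry loop: on WSAECONNREFUSED, wait a second and
// try again, giving up after ten minutes.
using System;
using System.Net;
using System.Net.Sockets;
using System.Threading;

class TestClientSketch
{
    public static Socket ConnectWithRetry(IPEndPoint server)
    {
        DateTime deadline = DateTime.UtcNow + TimeSpan.FromMinutes(10);

        while (DateTime.UtcNow < deadline)
        {
            var socket = new Socket(AddressFamily.InterNetwork,
                                    SocketType.Stream, ProtocolType.Tcp);
            try
            {
                socket.Connect(server);
                return socket;                              // connected
            }
            catch (SocketException ex)
            {
                socket.Close();
                if (ex.SocketErrorCode != SocketError.ConnectionRefused)
                    throw;                                  // some other failure
                Thread.Sleep(1000);                         // retry every second
            }
        }

        throw new TimeoutException("Gave up waiting for the server to accept.");
    }
}
```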

During that time, netstat -na shows the correct port in the LISTEN state for the server socket. So if it is in the LISTEN state, why won't it accept the connection?

In the code, there are log messages that show select() NEVER even reports the socket as ready to read (which, for a listening socket, means a connection is ready to accept).

Obviously the problem must be related to some race condition between finishing one test (which means close() and shutdown() on each end of the socket) and the start-up of the next.

This wouldn't be so bad if the retry logic let it connect eventually after a couple of seconds. However, it seems to get "gummed up" and the retries never succeed.

However, for some strange reason, the listening socket SAYS it's in the LISTEN state even though it keeps refusing connections.

So that means it's the Windoze O/S which is actually catching the SYN packet and returning an RST packet (which means "Connection Refused").

The only other time I ever saw this error was when the code had a problem that caused hundreds of sockets to get stuck in TIME_WAIT state. But that's not the case here. netstat shows only about a dozen sockets with only 1 or 2 in TIME_WAIT at any given moment.

Please help.

A: 

From This MSDN site:

The TIME_WAIT state determines the time that must elapse before TCP can release a closed connection and reuse its resources. This interval between closure and release is known as the TIME_WAIT state or 2MSL state. During this time, the connection can be reopened at much less cost to the client and server than establishing a new connection. The TIME_WAIT behavior is specified in RFC 793 which requires that TCP maintains a closed connection for an interval at least equal to twice the maximum segment lifetime (MSL) of the network. When a connection is released, its socket pair and internal resources used for the socket can be used to support another connection.

Windows TCP reverts to a TIME_WAIT state subsequent to the closing of a connection. While in the TIME_WAIT state, a socket pair cannot be re-used. The TIME_WAIT period is configurable by modifying the following DWORD registry setting that represents the TIME_WAIT period in seconds.

HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\TCPIP\Parameters\TcpTimedWaitDelay

By default, the MSL is defined to be 120 seconds. The TcpTimedWaitDelay registry setting defaults to a value of 240 seconds, which represents 2 times the maximum segment lifetime of 120 seconds, or 4 minutes. However, you can use this entry to customize the interval. Reducing the value of this entry allows TCP to release closed connections faster, providing more resources for new connections. However, if the value is too low, TCP might release connection resources before the connection is complete, requiring the server to use additional resources to re-establish the connection. This registry setting can be set from 0 to 300 seconds.

I think the minimum you can set the value to is 30 (you can try smaller, but it might not work).
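
If you do want to experiment with it, something like the following sketch sets the value (note this is a machine-wide change, needs administrator rights, and may not take effect until a reboot):

```csharp
// Sketch: set TcpTimedWaitDelay to 30 seconds. This is a machine-wide
// setting, needs administrator rights, and may not take effect until reboot.
using Microsoft.Win32;

class TimeWaitTweakSketch
{
    public static void Apply()
    {
        Registry.SetValue(
            @"HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\TCPIP\Parameters",
            "TcpTimedWaitDelay",
            30,
            RegistryValueKind.DWord);
    }
}
```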

You can look at Winsock Programmer's FAQ for a more detailed explanation.

Romain Hippeau
I'm not convinced that the listening socket goes into TIME_WAIT when closed; there's no sense in it doing so, since it's not an established connection and so you can't get any delayed packets... However, I can't find a TCP state transition diagram that shows this ;) Anyway, I would always suggest that changing the time wait delay on a machine-wide basis should be the 'solution' of last resort in situations where TIME_WAIT is a problem. If the listening socket DOES transition into TIME_WAIT then closing via an RST (i.e. enabling linger with a zero timeout before the close) might solve the issue better.
Len Holgate
That is, according to IBM and Microsoft, what Windows does. After you open a server socket and then close it, whether there was a connection or not, the OS puts it into TIME_WAIT. This is a Windows post.
Romain Hippeau
Can you point me to a document which shows this? I understand that the question is asking about code running on Windows, but I've been building and running very similar tests on Windows for 12 years or so and haven't seen this issue.
Len Holgate
Ooh, I've just found a TCP state transition diagram that actually shows the transition from Listen to closed; looks like it shouldn't go through `TIME_WAIT`: http://www.ssfnet.org/Exchange/tcp/tcpTutorialNotes.html#ST
Len Holgate
@Len Holgate http://support.microsoft.com/kb/173619 http://msdn.microsoft.com/en-us/library/ms737757(VS.85).aspx
Romain Hippeau
IMHO, neither link actually states that sockets in the LISTEN state will transition to TIME_WAIT upon closure. I understand how TIME_WAIT works for sockets in an ESTABLISHED state. Whilst the link you've added to your question provides the standard TCP state transition table from TCP/IP Illustrated Volume 1, it doesn't contain any transition from LISTEN to CLOSED. Unfortunately I've checked the text and that doesn't mention this transition either, and that's the transition that we're talking about here... The transition table that I link to in my comment DOES show a LISTEN to CLOSED transition...
Len Holgate
@Len Holgate I updated the post to point to the WinSock FAQ which does explain it. I am done with this post.
Romain Hippeau
The point that I've been making all along is that none of these sources of yours explicitly states that a socket in LISTEN state will transition to CLOSED via TIME_WAIT. None of the links that you've supplied do anything to clarify this specific point. Given the purpose of TIME_WAIT it doesn't make sense, to me at least, for a socket in LISTEN state to need to transition to TIME_WAIT. Since you believe that this is the cause of the problem the questioner is having I had hoped to learn something and to be pointed to some resources that clarify the situation.
Len Holgate
A: 

I run lots of tests like this across build machines with various Windows operating systems (XP through Windows 7) with various numbers of cores and I've never seen it be a problem.

I don't believe that the listen socket transitioning to TIME_WAIT is likely to be your problem; I've certainly never seen it and I regularly run client server tests with the same port where I start and stop servers within the TIME_WAIT delay period.

If you were starting your second server before your first had closed its socket (or if the socket were in TIME_WAIT), then I'd expect your second server to get an error when you attempted to bind().
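
For example, a quick check along these lines (illustrative only, not your code) would show whether the port really is still held, because Bind() itself would then fail with WSAEADDRINUSE rather than the client's connect being refused:

```csharp
// Sketch: if the previous server still held the port you'd see the failure
// here, at Bind(), as WSAEADDRINUSE - not as a refused connect on the client.
using System;
using System.Net;
using System.Net.Sockets;

class BindCheckSketch
{
    public static bool TryBind(int port)
    {
        var listener = new Socket(AddressFamily.InterNetwork,
                                  SocketType.Stream, ProtocolType.Tcp);
        try
        {
            listener.Bind(new IPEndPoint(IPAddress.Loopback, port));
            listener.Listen(10);
            listener.Close();
            return true;                 // port was free
        }
        catch (SocketException ex)
        {
            listener.Close();
            Console.WriteLine("Bind failed: " + ex.SocketErrorCode);
            return false;                // e.g. SocketError.AddressAlreadyInUse
        }
    }
}
```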

Personally I think it's more likely that there's an issue in your code that accepts connections - that is, your test might have found a bug ;)

Can we have a look at the code between your listen and the accept loop?

Do you have the problem if you reverse the order of the tests?

Are the client and server running on the same machine, does it change things if they aren't?

Etc.

I have some TCP test tools (http://www.lenholgate.com/archives/000568.html); if you set up your test system to run the test client from that link against an example server from this one (http://www.lenholgate.com/archives/000569.html), do you still see your problem? (That is, run my server with my client in your test system, so that it's run the same way as your stuff, and see whether my stuff works.)

Len Holgate
+1  A: 

The fundamental problem was that, when closing the socket, a separate thread was trying to read any remaining bytes. That thread held the read end of the socket open for a fixed number of milliseconds while repeatedly trying to read any remaining data.

That logic has been replaced so that it reads any remaining data and closes properly when the read returns 0, so the socket now closes much more quickly.
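
The replacement logic is roughly along these lines (a simplified sketch of the approach rather than the exact code):

```csharp
// Sketch of the replacement close logic: shut down our send side, read until
// Receive() returns 0 (the peer has closed), then close the socket.
using System.Net.Sockets;

class GracefulCloseSketch
{
    public static void CloseGracefully(Socket socket)
    {
        socket.Shutdown(SocketShutdown.Send);   // we're done sending

        var buffer = new byte[4096];
        while (socket.Receive(buffer) > 0)
        {
            // discard any remaining data from the peer
        }

        socket.Close();                         // Receive() returned 0: peer closed
    }
}
```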

So it turned out to be improper closing of the socket in my own code.

Thanks for all the help!

Wayne
Cool, your tests found a bug :)
Len Holgate