views:

402

answers:

4

I have a problem with a piece of legacy c++/winsock code that is part of a multi-threaded socket server. The application creates a thread that handles connections from clients, of which there are typically a couple of hundred connected at any one time. It typically runs without a problem for several days (continuously), and then suddenly stops accepting connections. This only happens in production, never test.

It uses WSAEventSelect() to detect FD_ACCEPT network events. The (simplified) code for the connection handler is:

SOCKET listener;
HANDLE hStopEvent;

// ... initialise listener and hStopEvent, and other stuff ...

HANDLE hAcceptEvent = WSACreateEvent();
WSAEventSelect(listener, hAcceptEvent, FD_ACCEPT); 
HANDLE rghEvents[] = { hStopEvent, hAcceptEvent };

bool bExit = false;
while(!bExit)
{
    DWORD nEvent = WaitForMultipleObjects(2, rghEvents, FALSE, INFINITE);
    switch(nEvent)
    {
        case WAIT_OBJECT_0:
            bExit = true;
            break;
        case WAIT_OBJECT_1:
            HandleConnect();
            WSAResetEvent(hAcceptEvent);
            break;
        case WAIT_ABANDONED_0:
        case WAIT_ABANDONED_0 + 1:
        case WAIT_FAILED:
            LogError();
            break;
    }
}

From detailed logging I know that, when the problem occurs, the thread enters WaitForMultipleObjects() and never emerges, even though there are clients attempting to connect and waiting for an accept. The WAIT_FAILED and WAIT_ABANDONED_x conditions never occur.

While I haven't ruled-out a config problem on the server, or even some kind of resource leak (can't find anything), I am also wondering if the event created by WSACreateEvent() is somehow being 'dissassociated' from the FD_ACCEPT network event - causing it to never fire.

So, am I doing something wrong here? Is there something I should be doing that I'm not? Or a better way? I'd appreciate any suggestions! Thanks.

EDIT

The socket is a non-blocking socket.

EDIT

Problem solved by using the approach suggested by kipkennedy (below). Changed hAcceptEvent to be an auto-reset event, and removed the call to WSAResetEvent() which was no-longer needed.

+1  A: 

Code looks fine. The only thing I can suggest is calling WSAWaitForMultipleObjects() instead of the global version.

John Dibling
Thanks John, I'll try that!
Andy Johnson
Looks like WSAWaitForMultipleObjects() is just a wrapper around WaitForMultipleObjects(), so I'm not sure thats the problem. I appreciate your suggestion though John.
Andy Johnson
+1  A: 

From reading the docs, it appears that WSAEventSelect() is as parsimonious about notifications as WSAAsyncSelect(). The stack doesn't signal FD_ACCEPT every time a connection comes in. To Winsock, the notification is its way of saying:

You called accept() earlier, and it failed with WSAEWOULDBLOCK. Go ahead and call it again, it should succeed this time.

The solution is to call accept() before calling WSAEventSelect(), and only call WSAEventSelect() after you get WSAEWOULDBLOCK. For this to work as you expect, you need to have set the listening socket to non-blocking. (That might seem obvious, but it isn't actually required.)

Warren Young
Warren - thanks for taking the time to respond. I should have said that the socket is non-blocking. The code as posted works ok for several days, responding to all connection attempts, then suddenly stops signalling the event when any clients try to connect. I'm starting to suspect a config issue on the server, or a resource leak.
Andy Johnson
+1  A: 

Maybe an FD_ACCEPT is signaling during HandleConnect() after after the accept() and before the return and subsequent ResetEvent(). Then, ResetEvent() ends up resetting all signals and no re-enabling accept() is ever called. For example, the following sequence is possible:

  1. Event signaled, WaitForMultipleObjects() returns
  2. During HandleConnect(), sometime after accept() is called, the Event is signaled again
  3. HandleConnect() returns
  4. ResetEvent() resets the event, masking the second signal
  5. WaitForMultipleObjects() never returns since as far as Windows is concerned, it has already signaled the subsequent event and no subsequent accepts() have re-enabled it

A couple possible solutions: 1) loop on accept() in HandleConnect() until WSAEWOULDBLOCK is returned 2) use an auto-reset event or immediately reset the event before calling HandleConnect()

kipkennedy
I finally got time to try your suggestion, and it appears to have resolved the problem. Huge thanks for your help! For the record, I changed the code to use an auto-reset event as described in the 2nd solution in your post.
Andy Johnson
+1  A: 

After an accept event occured you must not do a WSAResetEvent(hAcceptEven). You must issue a WSAEnumNetworkEvents ( listener, hAcceptEvent, &some_struct). This functions clears the internal state of the socket (ar copies this state into some_struct) and after that you can receive new connections.

aa