I hit a bug in my code, which uses WSARecv and WSAGetOverlappedResult on an overlapped socket. Under heavy load, WSAGetOverlappedResult fails with WSASYSCALLFAILURE ('A system call that should never fail has failed') and my TCP stream is out of sync afterwards, causing mayhem in the upper levels of my program.

So far I have not been able to isolate it to a given set of hardware or drivers. Has somebody hit this issue as well, and found a solution or workaround?

+1  A: 

How many connections, how many pending recvs, how many outstanding sends? What does perfmon or task manager say about the amount of non-paged pool used? How much memory is in the box? Does it go away if you run the program on Vista or above? Do you have any LSPs installed?

You could be exhausting non-paged pool and causing a badly written driver to misbehave when it fails to allocate memory. This issue is less likely to bite on Vista or later as the amount of non-paged pool available has increased dramatically (see http://www.lenholgate.com/archives/000837.html for details). Alternatively you might be hitting the "locked pages" limit (you can only lock a fixed number of pages in memory on the OS and each pending I/O operation locks one or more pages depending on buffer size and allocation alignment).

Len Holgate
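To make the "locked pages" point in the answer above concrete, here is a minimal sketch of how buffer size and alignment determine the number of pages a single pending I/O touches. The helper name `pagesSpanned` and the fixed 4 KiB page size are assumptions for illustration; real code would query the page size from the OS.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed page size for illustration; 4 KiB is typical on x86, but real code
// should query the actual value from the OS.
constexpr std::size_t kPageSize = 4096;

// Hypothetical helper: number of pages touched by a buffer of `length` bytes
// that starts at `address`.
constexpr std::size_t pagesSpanned(std::uintptr_t address, std::size_t length)
{
    if (length == 0)
        return 0; // an empty buffer touches no pages
    const std::uintptr_t first = address / kPageSize;
    const std::uintptr_t last  = (address + length - 1) / kPageSize;
    return static_cast<std::size_t>(last - first + 1);
}
```

Under these assumptions, a 4096-byte buffer that starts on a page boundary touches one page, while the same buffer starting 100 bytes into a page touches two. With many pending operations, unaligned buffers can therefore roughly double the number of locked pages.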
About 10 connections with one pending receive per connection. Sends are implemented blocking, so that there is at most one outstanding send. I'll check for the rest later. Testing on Vista will take some time, since we don't have a test cluster yet. Thanks for all the pointers - I'll post again once I have tracked it down.
eile
Unlikely to be resource limits then. Possibly just a bad driver? Can you try on a machine with a different network card and drivers?
Len Holgate
We have the issue consistently on different machines and drivers. I'm beginning to think that there is a bug with overlapped I/O on XP. Our application is somewhat different in that a bunch of render slaves send pixel data to a single machine, as opposed to server applications, which send data to a bunch of clients.
eile
"Bug with overlapped IOs on XP" If so, I've never seen it and I do a LOT of work with overlapped socket IO and have done for 10 years or so. It's more likely that you have a bug in your code, sorry ;) Can you post some code that shows the area that fails?
Len Holgate
Sure - the connection is here: http://www.equalizergraphics.com/cgi-bin/viewvc.cgi/trunk/src/lib/net/socketConnection.cpp?view=markup It is de-multiplexed by a WaitForMultipleObjects in here: http://www.equalizergraphics.com/cgi-bin/viewvc.cgi/trunk/src/lib/net/connectionSet.cpp?view=markup And used from Node::runReceiverThread/handleData in here: http://www.equalizergraphics.com/cgi-bin/viewvc.cgi/trunk/src/lib/net/node.cpp?view=markup
eile
Personally I wouldn't be setting the event in your overlapped when a read returns 0 indicating that the read has completed straight away; it's being set by the read call, there could be a race condition between you accessing it after WSARecv() returns and whatever you're doing with it (destroying it eventually?) after your wait on it returns in another thread...
Len Holgate
This question says that the event is not set by the WSARecv in this case: http://stackoverflow.com/questions/2511690/hevent-member-in-overlapped-win32-structure In any case, I never hit this condition, plus the read and WaitForMultipleObjects are in the same thread.
eile
That surprises me about the event, but then I don't tend to use the event; I just use an IOCP to deal with completions. Fair enough re the threading.
Len Holgate
You never answered my question about whether you have any layered service providers installed on the machines in question; virus checkers and firewalls sometimes install them. I read from a MS guy that `WSASYSCALLFAILURE` is the result if lastError isn't set correctly by the function and that LSPs tend to be the main culprit...
Len Holgate
I finally got around to checking this. There are no additional LSPs installed.
eile
That's a pity... I'm out of ideas.
Len Holgate
Thanks anyway. I'm trying to implement a workaround, and immediately have hit the next Winsock oddity. Sigh. http://stackoverflow.com/questions/3296920/winsock-blocking-sockets-multithreading-deadlock
eile
A: 

It seems I have solved this issue by sleeping 1 ms and retrying WSAGetOverlappedResult when it reports WSASYSCALLFAILURE.

I had another issue, related to overlapped events firing even though there is no data, which I also had to solve first. The test has now been running for over an hour, with a few WSASYSCALLFAILUREs handled correctly. Hopefully the overnight test will succeed as well.

@Len: thanks again for your help.

EDIT: The overnight test was successful. My bug was caused by two interdependent issues:

Issue 1: WaitForMultipleObjects in ConnectionSet::select occasionally signals data on an empty socket, causing SocketConnection::readSync to deadlock. Fix: Do a non-blocking read on the first byte of each packet. Reset the ConnectionSet if the socket was empty.

Issue 2: WSAGetOverlappedResult occasionally returns WSASYSCALLFAILURE, causing the TCP stream to go out of sync. Fix: Retry WSAGetOverlappedResult after a small sleep period.
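The retry described in Issue 2 can be sketched roughly as follows. This is not the code from the linked revision: the Winsock call is hidden behind a callable so the sketch compiles without `<winsock2.h>`, and `OverlappedResult`, `getResultWithRetry` and the bounded `maxAttempts` are hypothetical names and choices made for illustration. Only the 1 ms sleep and the retry on WSASYSCALLFAILURE come from the description above.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Numeric value of WSASYSCALLFAILURE in Winsock, repeated here so the sketch
// builds without the Windows headers.
constexpr int kWsaSysCallFailure = 10107;

// Hypothetical result type. In real code `ok` would be the return value of
// WSAGetOverlappedResult and `error` the value of WSAGetLastError().
struct OverlappedResult
{
    bool          ok;    // true if the operation completed successfully
    int           error; // error code when !ok
    unsigned long bytes; // bytes transferred when ok
};

// Hypothetical helper: calls `query` once, then sleeps and calls it again for
// as long as it reports WSASYSCALLFAILURE, up to `maxAttempts` calls in total.
// Any other outcome (success or a different error) is returned immediately.
inline OverlappedResult getResultWithRetry(
    const std::function<OverlappedResult()>& query,
    int maxAttempts = 10, // assumption: bound the retries instead of looping forever
    std::chrono::milliseconds delay = std::chrono::milliseconds(1))
{
    OverlappedResult result = query();
    for (int attempt = 1;
         attempt < maxAttempts && !result.ok &&
         result.error == kWsaSysCallFailure;
         ++attempt)
    {
        std::this_thread::sleep_for(delay);
        result = query();
    }
    return result;
}
```

On Windows, `query` would wrap the actual WSAGetOverlappedResult call for the pending read. Bounding the number of attempts means a persistent failure is still reported to the caller rather than hidden by an endless loop.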

http://equalizer.svn.sourceforge.net/viewvc/equalizer?view=revision&revision=4649

eile
Hmm... I'm generally very wary of 'fixes' which require sleeps... I expect it's more likely that there's a race condition in your code and that your 'fix' will come back to bite you in a couple of years' time when you move to hardware that you can't dream of yet ;)
Len Holgate
I agree on a general basis with your comments, and normally don't do 'fixes' like that. In this instance, however, I have debugged the code to death and am fairly convinced that it's not a race in my code. The error code is WSASYSCALLFAILURE, which is an Un-Error. Furthermore, it occurs only with unusual communication patterns (many nodes sending one node a lot of data), which might be the reason it's still in Winsock. We've had other issues with Winsock before, e.g., this one: http://www.mombu.com/microsoft/alt-winsock-programming/t-wsasendto-resulting-in-stack-corruption-678924.html
eile
Again that looks likely to be a race condition IMHO... How do you know when your per i/o is done with, I don't see any reference counting going on and, IMHO, you need reference counting of per socket and per i/o data to ensure you clean up at the right time and not before...
Len Holgate
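The reference counting of per-I/O data suggested in the comment above could look something like this. It is only a sketch of the general idea, not code from either poster: `PerIoData`, `PendingIo` and the use of `std::shared_ptr` are assumptions chosen for illustration. The table keeps each buffer alive for as long as an operation on it is outstanding, regardless of what the caller does with its own reference.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical per-I/O state: the buffer an overlapped operation reads into.
struct PerIoData
{
    explicit PerIoData(std::size_t size) : buffer(size) {}
    std::vector<char> buffer;
};

// Hypothetical table of outstanding operations. It owns one reference to each
// PerIoData from start() until complete(), so the buffer cannot be freed
// while the operation is still pending.
class PendingIo
{
public:
    // Register an operation and return an id that identifies it on completion.
    std::size_t start(std::shared_ptr<PerIoData> io)
    {
        const std::size_t id = nextId_++;
        pending_.emplace(id, std::move(io));
        return id;
    }

    // Hand the table's reference to the caller and forget the operation.
    // Returns an empty pointer if the id is unknown.
    std::shared_ptr<PerIoData> complete(std::size_t id)
    {
        auto it = pending_.find(id);
        if (it == pending_.end())
            return nullptr;
        std::shared_ptr<PerIoData> io = std::move(it->second);
        pending_.erase(it);
        return io;
    }

    std::size_t size() const { return pending_.size(); }

private:
    std::size_t nextId_ = 0;
    std::map<std::size_t, std::shared_ptr<PerIoData>> pending_;
};
```

With this arrangement the buffer is destroyed only after both the issuing code and the completion handler have released their references, which rules out cleaning up "before the right time" by construction.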
Data receive is done by a single thread. The data is received into an eq::net::Command, which is referenced when it's inserted into a net::CommandQueue to be dispatched to the handling thread, and dereferenced when it has been handled. The net::CommandCache takes care to recycle these packets for the ReceiverThread. I've extensively reviewed and tested this code using valgrind on Linux.
eile