We are moving large amounts of data on a LAN, and it has to happen very rapidly and reliably. Currently we use Windows TCP sockets from C++. Using large (synchronous) sends moves the data much faster than a bunch of smaller (synchronous) sends, but it will frequently deadlock for large gaps of time (0.15 seconds), causing the overall transfer rate to plummet. This deadlock happens under very particular circumstances, which makes me believe it should be preventable altogether. More importantly, if we don't really know the cause, we don't really know it won't happen some time with smaller sends anyway. Can anyone explain this deadlock?
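
For context, the sending side is essentially a tight loop of blocking send() calls pushing large chunks. A simplified sketch of that loop (names and sizes are illustrative, error handling trimmed; the real code differs):

    // Simplified sketch of the sender loop (illustrative only).
    #include <winsock2.h>

    // Push one large message through a blocking socket using a few big send()
    // calls rather than many small ones.
    bool SendLargeMessage(SOCKET sock, const char* data, int totalBytes)
    {
        int sent = 0;
        while (sent < totalBytes)
        {
            // A blocking send() normally returns as soon as the data has been
            // copied into the TCP send buffer; it only blocks longer when the
            // stack decides to defer (the "heuristics" discussed below).
            int n = send(sock, data + sent, totalBytes - sent, 0);
            if (n == SOCKET_ERROR)
                return false;
            sent += n;
        }
        return true;
    }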

Deadlock description (OK, zombie-locked, it isn't dead, but for 0.15 or so seconds it stops, then starts again)

  1. The receiving side sends an ACK.
  2. The sending side sends a packet containing the end of a message (the push flag is set).
  3. The call to socket.recv takes about 0.15 seconds (!) to return.
  4. About the time the call returns, an ACK is sent by the receiving side.
  5. Then the next packet from the sender is finally sent (why was it waiting? The TCP window is plenty big).

The odd thing about (3) is that typically that call doesn't take much time at all and receives exactly the same amount of data. On a 2 GHz machine that's 300 million instructions' worth of time. I am assuming the call doesn't (heaven forbid) wait for the received data to be ACKed before it returns, so the ACK must be waiting for the call to return, or both must be delayed by something else.

The problem NEVER happens when there is a second packet of data (part of the same message) arriving between steps 1 and 2. That very clearly makes it sound like it has to do with the fact that Windows TCP will not send back a no-data ACK until either a second packet arrives or a 200 ms timer expires. However, the delay is less than 200 ms (it's more like 150 ms).
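
If the delayed-ACK timer does turn out to be involved, one experiment we could try (assuming the SIO_TCP_SET_ACK_FREQUENCY ioctl is available on our version of Windows) is forcing the receiver to ACK every segment; roughly:

    // Experimental only: ask the receiving socket to ACK every segment instead
    // of every second one (assumes SIO_TCP_SET_ACK_FREQUENCY exists on this
    // Windows version).
    #include <winsock2.h>
    #include <mstcpip.h>

    bool DisableDelayedAck(SOCKET sock)
    {
        DWORD ackFrequency  = 1;   // 1 = ACK every segment
        DWORD bytesReturned = 0;
        return WSAIoctl(sock, SIO_TCP_SET_ACK_FREQUENCY,
                        &ackFrequency, sizeof(ackFrequency),
                        NULL, 0, &bytesReturned,
                        NULL, NULL) == 0;
    }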

The third unseemly character (and to my mind the real culprit) is (5). Send is definitely being called well before that 0.15 seconds is up, but the data NEVER hits the wire before that ACK arrives. That is the most bizarre part of this deadlock to me. It's not a TCP blockage, because the TCP window is plenty big: we set SO_RCVBUF to something like 500*1460 (which is still under a megabyte). The data is coming in very fast (basically there is a loop spinning out data via send), so the buffer should fill almost immediately. MSDN mentions that there are various "heuristics" used in deciding when a send hits the wire, and that an already-pending send plus a full buffer will cause send to block until the data hits the wire (otherwise send apparently just copies data into the TCP send buffer and returns).

Anyway, why the sender doesn't actually send more data during that 0.15-second pause is the most bizarre part to me. The information above was captured on the receiving side via Wireshark (except of course the socket.recv return times, which were logged to a text file). We tried changing the send buffer to zero and turning off Nagle on the sender (yes, I know Nagle is about not sending small packets, but we tried turning Nagle off in case it was part of the unstated "heuristics" affecting whether the message would be posted to the wire; technically, Microsoft's Nagle holds back a small packet while there is un-ACKed data outstanding, so it seemed like a possibility).
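
Those two sender-side experiments amount to a couple of setsockopt calls; roughly (simplified sketch, error handling omitted):

    // Sender-side experiments described above (simplified sketch).
    #include <winsock2.h>

    void ApplySenderExperiments(SOCKET sock)
    {
        // Turn off Nagle so a small trailing segment is not held back while
        // an ACK is outstanding.
        BOOL noDelay = TRUE;
        setsockopt(sock, IPPROTO_TCP, TCP_NODELAY,
                   (const char*)&noDelay, sizeof(noDelay));

        // A zero-byte send buffer makes the stack transmit directly from the
        // application's buffer, so send() doesn't return until the stack is
        // finished with that buffer.
        int sndBuf = 0;
        setsockopt(sock, SOL_SOCKET, SO_SNDBUF,
                   (const char*)&sndBuf, sizeof(sndBuf));
    }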

+1  A: 

The send blocking until the previous ACK is received almost certainly indicates that the TCP receive window is full (you can check this by using Wireshark to analyse the network traffic).

No matter how big your TCP window is, if the receiving application isn't processing data as fast as it's arriving then the TCP window will eventually fill up. How fast are we talking here? What is the receiving side doing with the data? (If you're writing the received data to disk then it's quite possible that your disk just can't keep up with a gigabit network at full bore.)


OK, so you have a 730,000-byte receive window and you're streaming data at 480 Mbps. That means it takes only 12 ms to entirely fill your window - so when the 150 ms delay on the receive side occurs, the receive window fills up almost instantly and causes the sender to stall.
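
(Spelling out the arithmetic: 500 * 1460 = 730,000 bytes of advertised window, and 730,000 bytes * 8 bits per byte / 480,000,000 bits per second ≈ 12 ms to fill it.)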

So your root cause is this 150 ms delay in scheduling your receive process. Any number of things could cause that (it could be as simple as the kernel needing to flush dirty pages to disk to create some more free pages for your application); you could try increasing your process's scheduling priority, but there's no guarantee that that will help.
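
If you want to try that, a minimal Win32 sketch (assuming the thread calling recv is the one being starved) would be something like:

    // Raise the priority of the receive process and of the thread that blocks
    // in recv(). This may or may not help, depending on what is actually
    // stealing those 150 ms.
    #include <windows.h>

    void RaiseReceivePriority()
    {
        SetPriorityClass(GetCurrentProcess(), HIGH_PRIORITY_CLASS);
        SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_HIGHEST);
    }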

caf
Actually we used Wireshark to get most of the information above. We have a very large receive window (window scaling is automatically invoked by Windows since our receive buffer is large). We need speeds on the order of at least 450 Mbps over 1000 Mbps Ethernet. Should be doable. The receiving side primarily moves the data around in memory right now. The pauses exactly correspond to the time socket.recv takes to return, and not to the return time of any other function. It only happens in just the right circumstance.
John Robertson
As I said, the size of the receive window is immaterial. Check the value of the "Window" field in the TCP subsection of Wireshark's output, for that delayed ACK packet.
caf
(see updated answer).
caf
Thanks for the updated answer. It's very kind of you to take the time. I wish that were what was happening, but neither side sends anything in that 0.15 seconds. Early on we did have a receive buffer that was too small, and the "window" value sent back in the TCP packets would rapidly shrink as data arrived until nothing could be sent. Playing with SO_RCVBUF (from the default to 0.5 MB to 50 MB or more) eliminated that. Typically during those pauses the value of "window" in the TCP packets is exactly SO_RCVBUF (in fact, typically during these captures it simply stays at that value the whole time).
John Robertson
On the other hand, your suggestion about a busy kernel on the receive side is something I have wondered about. I should probably try to nail down what the kernel is doing during that time. What troubles me is that it still doesn't explain why nothing is being sent during that time. There is a huge window advertised with each ACK that is sent back, but the sender still waits for another ACK.
John Robertson
It isn't necessary for the advertised receive window to ever fall below the maximum to get a stall - if the receiver simply *stops sending* ACKs for 12 ms, the sender will send 730,000 bytes and then stop sending. When the receiver wakes from its slumber, it can pass all that data onto the application and then advertise a maximum receive window again (so you'll never see it advertise anything less than that). If you want to be able to ride out a 150 ms freeze you'll need to up that window to much higher than you've currently got - at least 10 MB.
caf
Interesting. Where did you find that out? I didn't know that. It isn't sending that 730,000 bytes. If it sent even a fraction of that there would be no problem. When it deadlocks, it just got an ACK, sent a single push packet, and refuses to send more until it gets another ACK. If it would send just one more packet of 1460 bytes the deadlock wouldn't be there (since MS TCP would send a no-data ACK when it got the second packet). As we have played with it, the size of the outgoing message (meaning how much data is posted in each socket.send call) appears important in the MS heuristics.
John Robertson
Hmm, I just noticed that you're capturing the data only on the *receive* side - if the receive kernel is spacing out completely for that 150 ms, it's probable that the receive timestamps are all being delayed too during that time. It would be interesting to see how a packet capture on the sending side compares. (I still think that it looks like the receiver is temporarily freezing, which is *directly* causing it not to send ACKs, rather than the lack of ACKs being caused by the sender somehow.)
caf
Interesting. I hadn't accepted your answer as the solution because things didn't seem to add up. Our testing is temporarily halted, so I'll accept your solution. :) Since no packets are missed, the receiver isn't freezing to the extent that it is not taking in data. Since there are no dropped packets during the freeze, wouldn't your suggestion require that the networking kernel be able to receive data into some internal buffer (or on the network card), but for that received data to not be available to Wireshark for timestamping until a later time? Where did you learn about the 730,000 bytes thing?
John Robertson
I just got the 730,000 bytes from your question - 500*1460. The 12 ms is from that divided by 480 Mbps.
caf