Hi everyone,

I have a small application which sends files over the network to an agent located on a Windows OS.

When this application runs on Windows, everything works fine, the communication is OK and the files are all copied successfully.

But when this application runs on Linux (Red Hat 5.3; the receiver is still Windows), the Wireshark network trace shows TCP Zero Window and TCP Window Full messages appearing every 1-2 seconds. The agent then closes the connection after a few minutes.

The Windows and Linux code is almost identical, and pretty simple. The only non-trivial operation is a setsockopt call with SO_SNDBUF and a value of 0xFFFF. Removing this call didn't help.

Can someone please help me with this issue?

EDIT: adding the sending code - it looks like it handles partial writes properly:

int totalSent = 0;
while (totalSent != dataLen)
{
    int bytesSent
        = ::send(_socket, (char *)(data + totalSent), dataLen - totalSent, 0);

    if (bytesSent == 0) {
        return totalSent;
    }
    else if (bytesSent == SOCKET_ERROR) {
#ifdef __WIN32
        int errcode = WSAGetLastError();
        if (errcode == WSAEWOULDBLOCK) {
#else
        if ((errno == EWOULDBLOCK) || (errno == EAGAIN)) {
#endif
            // would block: loop around and retry the send
        }
        else {
            if (!totalSent) {
                totalSent = SOCKET_ERROR;
            }
            break;
        }
    }
    else {
        totalSent += bytesSent;
    }
}

Thanks in advance.

A: 

The most likely problem is that you have a bug in your code where you don't handle partial reads or partial writes correctly. TCP between Linux and Windows is known to work.

janm
A: 

A common mistake when developing with TCP sockets is making incorrect assumptions about read()/write() behavior.

When you perform a read or write operation you must check the return value: the call may not have read or written the requested number of bytes, so you usually need a loop to keep track of progress and make sure the entire buffer was transferred.

João Pinto
+2  A: 

Not seeing your code I'll have to guess.

The reason you get a Zero window in TCP is because there is no room in the receiver's recv buffer.

There are a number of ways this can occur. One common cause is sending over a LAN or other relatively fast network connection where one computer is significantly faster than the other. As an extreme example, say you've got a 3GHz computer sending as fast as possible over Gigabit Ethernet to another machine running a 1GHz CPU. Since the sender can send much faster than the receiver is able to read, the receiver's recv buffer fills up, causing the TCP stack to advertise a Zero window to the sender.

Now this can cause problems on both the sending and receiving sides if they're not ready to deal with it. On the sending side, the send buffer can fill up, causing calls to send either to block or to fail if you're using non-blocking I/O. On the receiving side, you could be spending so much time on I/O that the application has no chance to process any of its data, giving the appearance of being locked up.

Edit

From some of your answers and code it sounds like your app is single threaded and you're trying to do non-blocking sends for some reason. I assume you're setting the socket to non-blocking in some other part of the code.

Generally, I would say that this is not a good idea. Ideally, if you're worried about your app hanging on a send(2) you should set a long timeout on the socket using setsockopt and use a separate thread for the actual sending.

See socket(7):

SO_RCVTIMEO and SO_SNDTIMEO Specify the receiving or sending timeouts until reporting an error. The parameter is a struct timeval. If an input or output function blocks for this period of time, and data has been sent or received, the return value of that function will be the amount of data transferred; if no data has been transferred and the timeout has been reached then -1 is returned with errno set to EAGAIN or EWOULDBLOCK just as if the socket was specified to be nonblocking. If the timeout is set to zero (the default) then the operation will never timeout.

Your main thread can push each file descriptor into a queue, using, say, a boost mutex for queue access, then start 1 to N threads to do the actual sending using blocking I/O with send timeouts.

Your send function should look something like this ( assuming you're setting a timeout ):

// blocking send; a timeout is reported to the caller via errno on a short send
int doSend(int s, const void *buf, size_t dataLen) {
    size_t totalSent = 0;

    while (totalSent != dataLen)
    {
        ssize_t bytesSent
            = send(s, (const char *)buf + totalSent, dataLen - totalSent, MSG_NOSIGNAL);

        if (bytesSent < 0) {
            if (errno == EINTR)
                continue;   // interrupted by a signal: restart the send
            break;          // real error: caller checks errno
        }

        totalSent += bytesSent;
    }
    return (int)totalSent;
}

The MSG_NOSIGNAL flag ensures that your application isn't killed by writing to a socket that's been closed or reset by the peer. Sometimes I/O operations are interrupted by signals, and checking for EINTR allows you to restart the send.

Generally, you should call doSend in a loop with chunks of data that are of TCP_MAXSEG size.

On the receive side you can write a similar blocking recv function using a timeout in a separate thread.

Robert S. Barnes
A: 

A read() returning 0 indicates a closed connection. A send() returning 0 however just indicates 0 bytes sent. You have to continue the loop. To avoid busy waiting you can use select to wake up when the socket is writeable again.

Peter G.
You're wrong. A return of 0 from send has no special meaning. A return of 0 only indicates a closed connection for reads. On send, a closed connection is indicated by a return of -1 with `errno` set to `EPIPE` if the `MSG_NOSIGNAL` flag is passed; otherwise a SIGPIPE signal is raised and the program is terminated.
Robert S. Barnes
@Robert I fail to see where you prove me wrong. I especially pointed out that sending 0 bytes is not special beyond the fact that 0 bytes were sent.
Peter G.
Sorry Peter, I must have misread your post and switched what you said about send and read in my own head.
Robert S. Barnes
+1  A: 

Hi all,

I tried disabling Nagle's algorithm (with TCP_NODELAY), and somehow it helped. The transfer rate is much higher, and the TCP window no longer fills up or gets reset. The strange thing is that when I changed the window size it didn't have any impact.

Thank you.

rursw1
That's really odd. Typically disabling Nagle is only useful for real-time apps where you want very low latency at the expense of wasting a lot of bandwidth. Disabling it for bulk file transfer seems counter-intuitive. Have you actually tested and seen objectively that disabling Nagle is what makes the difference? Maybe some other change you made could be responsible?
Robert S. Barnes
@Robert S. Barnes: That's really odd, I agree. But this is the only change that was made, and it helped. Moreover, the receiver side has already disabled Nagle. I know that it may refer to an underlying fundamental problem that is hiding somewhere, waiting to jump out and bite at another time. But as a workaround it is good enough.
rursw1