I have a very simple Winsock2 TCP client - full listing below - which simply blasts a bunch of bytes. However, it's running very slowly over the network; the data just trickles by.

Here's what I've tried and found (both Windows PCs are on the same LAN):

  • Running this app from one machine to the other is slow - it takes ~50s to send 8MB.
  • Two different servers - netcat and a custom-written one (just as simple as the below client) - yielded the same results.
  • taskmgr shows both the CPU and the network being barely utilized.
  • Running this app with the server on the same machine is fast - it takes ~1-2s to send 8MB.
  • A different client, netcat, works just fine - it takes ~7s to send 20MB of data. (I used the nc that comes with Cygwin.)
  • Varying the buffer size (1*4096, 16*4096, and 128*4096) made little difference.
  • Running almost the same code on Linux boxes on a different LAN worked just fine.
  • Adding a bunch of print statements around the send call shows that we spend most of our time blocking on it.
  • On the server side, we see a bunch of receives of <= 4K chunks (regardless of what size buffers the sender is pushing). However, this happens with other clients as well, like netcat, which runs at full speed.
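To put the timings above side by side, here is a minimal sketch that converts them to average throughput. The helper name `throughput_mbps` and the constant `MiB` are made up for illustration, and it assumes that "MB" in the list means 2^20 bytes (which matches `bytecount = 8388608` in the listing below).

```cpp
#include <cstdint>

// Assumption: "MB" in the measurements means 2^20 bytes.
constexpr std::uint64_t MiB = 1024ull * 1024ull;

// Average throughput in megabits per second (1 Mbit = 1,000,000 bits).
double throughput_mbps(std::uint64_t bytes, double seconds) {
    return static_cast<double>(bytes) * 8.0 / seconds / 1e6;
}
```

Under those assumptions, `throughput_mbps(8 * MiB, 50.0)` comes to about 1.3 Mbit/s for this client, while `throughput_mbps(20 * MiB, 7.0)` comes to about 24 Mbit/s for netcat, so the slow case is roughly 18 times slower on the same link.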

Any ideas? Thanks in advance for any tips.

#include <winsock2.h>
#include <iostream>

using namespace std;

enum { bytecount = 8388608 };
enum { bufsz = 16*4096 };

int main(int argc, TCHAR* argv[])
{
  WSADATA wsaData;
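  // WSAStartup returns its error code directly (WSAGetLastError is not valid yet).
  int err = WSAStartup(MAKEWORD(2,2), &wsaData);
  if (err != 0) {
    cerr << "WSAStartup: " << err << endl;
    return 1;
  }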

  struct sockaddr_in sa;
  memset(&sa, 0, sizeof sa);
  sa.sin_family = AF_INET;
  sa.sin_port = htons(9898);
  sa.sin_addr.s_addr = inet_addr("157.54.144.70");
  if (sa.sin_addr.s_addr == INADDR_NONE) {
    cerr << "inet_addr: " << WSAGetLastError() << endl;
    return 1;
  }

  char *blob = new char[bufsz];
  for (int i = 0; i < bufsz; ++i) blob[i] = (char) i;

  SOCKET s = socket(AF_INET, SOCK_STREAM, IPPROTO_IP);
  if (s == INVALID_SOCKET) {
    cerr << "socket: " << WSAGetLastError() << endl;
    return 1;
  }

  int res = connect(s, reinterpret_cast<sockaddr*>(&sa), sizeof sa);
  if (res != 0) {
    cerr << "connect: " << WSAGetLastError() << endl;
    return 1;
  }

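  int sent;
  for (int j = 0; j < bytecount; j += sent) {
    // Never send past bytecount, even after a partial send.
    int chunk = (bytecount - j < bufsz) ? bytecount - j : bufsz;
    sent = send(s, blob, chunk, 0);
    if (sent == SOCKET_ERROR) {
      cerr << "send: " << WSAGetLastError() << endl;
      return 1;
    }
  }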

  closesocket(s);
  delete[] blob;
  WSACleanup();

  return 0;
}
+1  A: 

The application looks fine, and you said it works fine on Linux. I don't know whether this will help, but I would compare: 1) the MTU values on the Windows and Linux systems; 2) the TCP receive buffer size on Windows and Linux; 3) whether the network cards on both systems run at the same link speed.

Satish
+2  A: 

Here are the things you can do to get a better picture.

  • Check how much time the application spends inside the connect() and send() calls, to see whether connect() is the problem. A profiler will show this, but if the application is this slow you should be able to see it just by stepping through in a debugger.
  • Run Wireshark (formerly Ethereal) to capture the network traffic and see whether the TCP packets are transferred with noticeable latency. If responses come back quickly, the problem is on your own system. If you find delays, then it is a routing/network problem.
  • Run "route print" to check how your PC routes traffic to the destination machine (157.54.144.70). You will be able to see whether a gateway is used and check the routing priority of the different routes.
  • Try sending smaller chunks (for example, change "bufsz" to 1024). Is there any correlation between performance and buffer size?
  • Check whether any antivirus or firewall applications are installed, and make sure to turn them off. You can also try running the same app in safe mode with networking.
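If no profiler is at hand, the first suggestion can be approximated with a small timing wrapper. This is only a sketch: `CallTimer` and `measure` are hypothetical names, not part of Winsock, and the wrapped call must return a value (as send() and connect() do).

```cpp
#include <chrono>
#include <cstddef>

// Accumulates wall-clock time spent in a repeatedly invoked call.
struct CallTimer {
    std::size_t calls = 0;   // number of measured invocations
    double total_ms = 0.0;   // sum of all measured durations
    double max_ms = 0.0;     // longest single invocation

    // Runs f(), records how long it took, and returns f's result.
    template <typename F>
    auto measure(F&& f) -> decltype(f()) {
        const auto start = std::chrono::steady_clock::now();
        auto result = f();
        const auto stop = std::chrono::steady_clock::now();
        const double ms =
            std::chrono::duration<double, std::milli>(stop - start).count();
        ++calls;
        total_ms += ms;
        if (ms > max_ms) max_ms = ms;
        return result;
    }
};
```

In the client from the question, the send line could then become something like `sent = timer.measure([&] { return send(s, blob, bufsz, 0); });`, with `total_ms` and `max_ms` printed after the loop. A large `max_ms` compared with the average would hint at occasional long stalls rather than uniformly slow calls.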
AlexKR
It really only calls connect() once; it spends most of its time blocking on send() calls. Smaller chunks made no difference, and there was no AV/FW enabled. I followed your advice of using a network monitor (Microsoft's own netmon); see my full answer.
Yang
A: 

I watched packets going by using Microsoft Network Monitor (netmon) with the nice TCP Analyzer visualizer, and it turned out that tons of packets were getting lost and needing to be retransmitted - hence the slow speeds, because of retransmission timeouts (RTOs).

A colleague helped me debug this:

Well, from this trace on the receiver side, it definitely looks like some packets are not making it through to the receiver. I also see what appear to be some mangled packets (things like partial TCP headers, etc) in these traces.

Even in the “good” trace (the receiver's view of the netcat client), I see some mangled packets (wrong TCP data length, etc). The errors aren’t as frequent as in the other trace, however.

Given that these machines are on the same subnet, there is no router in the way which could be dropping packets. That leaves the two NICs, the Ethernet cables, and the Ethernet switches. You could try to isolate the bad machine by adding a third machine into the mix and try the same test with the new machine replacing first the sender and then the receiver. Use a different physical port for the third machine. If either of the original machines has a switch between it and the floor jack, try removing that switch from the equation. You could also try an Ethernet reversing cable between the original two machines (or a different Ethernet switch that you plug the two machines into directly) and see if the problem persists.

Since the problem appears to be packet content dependent, I doubt the problem is in the cabling. Given that the sender has an NVidia nForce chipset Ethernet and the receiver has a Broadcom Ethernet, my money is on the sender’s NIC being the culprit. If it does seem to be the fault of a particular NIC, try turning off special features of the NIC like checksum offloading or large-send offload.

I tried using a third box as the sender (identical to original sender, a Shuttle XPC with nForce chipset), and this worked smoothly - TCP Analyzer showed very smooth-running TCP sessions. This suggests to me that the problem was actually due to a buggy NIC/driver on the original sender box, or bad Ethernet cable.
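As background on the checksum offloading mentioned above: TCP segments carry a 16-bit ones'-complement checksum, and with offloading enabled the NIC computes it instead of the host. The sketch below shows the basic algorithm from RFC 1071 over a plain byte buffer. It is an illustration only, not the driver's or the Windows stack's code, and the function name `internet_checksum` is made up here. A real TCP checksum also covers a pseudo-header with the IP addresses, which is left out.

```cpp
#include <cstddef>
#include <cstdint>

// Ones'-complement Internet checksum (RFC 1071) over a byte buffer.
// The 32-bit accumulator is adequate for packet-sized buffers.
std::uint16_t internet_checksum(const std::uint8_t* data, std::size_t len) {
    std::uint32_t sum = 0;
    // Add the data as big-endian 16-bit words.
    for (std::size_t i = 0; i + 1 < len; i += 2) {
        sum += (static_cast<std::uint32_t>(data[i]) << 8) | data[i + 1];
    }
    // An odd trailing byte is treated as the high byte of a zero-padded word.
    if (len % 2 != 0) {
        sum += static_cast<std::uint32_t>(data[len - 1]) << 8;
    }
    // Fold the carries back into the low 16 bits.
    while (sum >> 16) {
        sum = (sum & 0xFFFFu) + (sum >> 16);
    }
    return static_cast<std::uint16_t>(~sum & 0xFFFFu);
}
```

For the example bytes in RFC 1071 (00 01 f2 03 f4 f5 f6 f7) this yields 0x220d. A receiver that sees a mismatching checksum drops the segment, which the sender then has to retransmit.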

Yang
Eventually, another symptom arose with the replacement box as well: copying C:\windows\memory.dmp - and *only* that data! - would cause the TCP packets to get dropped. Everything else transferred fine and fast. Replacing the Ethernet cable made the problem go away, so for now it seems the problem was indeed due to a bad cable. Bad hardware can cause the strangest problems.
Yang