I have a mission-critical real-time data application that uses a TCP connection between the client and server. In some cases, the connection periodically dies (SocketException). No problem - just reconnect and move on. However, the customers aren't thrilled with these intermittent drops in connectivity.

I'd like to know where to point the finger. Is it the client or server? Hardware or software? Is it something about the ethernet link? The end result would be to show the user an indicator of connection health, so that a bad link can be investigated and remedied.

Are there any metrics I can pull from the TcpClient, Socket, or anything else that will tell me about the health of the connection? Perhaps average time to ack, number of retries, etc?

I specifically want to know about a TCP connection - not just the ethernet connection as a whole (your LAN connection might be dandy, but there could be an issue going to an outside server).

Of course I could ping the remote host, but I don't think that would really give me the kind of stats I'm looking for. For one thing, I could be pinging a router if the server is hiding behind NAT.

+1  A: 

Perfmon is your friend: run a log of all the IP, TCP and networking counters. If you can tell when the connection died, you can look at the graph to see whether anything stands out, such as network errors, no transmission, or no IO bytes transferred.

Add some .NET counters too, like GC, memory and CPU usage.

The last thing you can do is increase the TCP timeout and related settings. They're in the registry.

You will have to monitor both ends if it's really a problem with the remote server, but start with looking at the counters and see if anything jumps out at you.

gbjbaanb
Thanks. However, I'm really looking for some way to do this on a small scale inside the application. If a customer reports lots of dropped connections, I can't tell them to ask IT to spend a day chasing it down (wouldn't that be nice, though).
Jon B
You can talk to the customer, though. I've done this a few times to trace memory, performance and networking issues. Often it's a problem with the network card. If you 'work with' the customer, it usually makes them feel warm and fuzzy.
gbjbaanb
+4  A: 

Firstly, you should inspect the details of the SocketExceptions you're getting. I don't know what they contain in .NET, but in Java the detailed message provides a useful hint, such as "Connection closed by peer" or "Connection reset".
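To illustrate what "inspecting the details" can look like, here is a minimal sketch in Python (not .NET, so treat it only as an illustration of the idea). The function name `describe_socket_error` and the hint strings are my own inventions; the exception classes are the standard Python ones.

```python
import errno
import socket


def describe_socket_error(exc):
    """Return a short hint for a socket-related exception (hypothetical helper)."""
    # Check timeouts first: they are OSError subclasses too.
    if isinstance(exc, (TimeoutError, socket.timeout)):
        return "timeout: nothing arrived in time; the link may still be alive"
    if isinstance(exc, ConnectionResetError):
        return "reset: the peer, or a device in between, sent RST"
    if isinstance(exc, ConnectionAbortedError):
        return "aborted: the local stack gave up on the connection"
    if isinstance(exc, BrokenPipeError):
        return "broken pipe: wrote to a connection that was already closed"
    if isinstance(exc, OSError) and exc.errno is not None:
        # Fall back to the symbolic errno name, e.g. EHOSTUNREACH.
        name = errno.errorcode.get(exc.errno, "unknown")
        return f"os error {exc.errno} ({name})"
    return f"unclassified: {exc!r}"
```

Logging this hint with a timestamp on every drop is often enough to tell a reset apart from a plain timeout.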

In my experience, a common cause of socket connections being dropped is a bug in the code where a read timeout exception is handled by the same catch clause as all other connection-related exceptions, thus usually resulting in the connection being closed for no good reason.
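A rough Python sketch of keeping the two cases apart (the names `read_loop`, `handle_data`, `on_idle` and `max_idle_timeouts` are made up for the example). The point is only that the timeout gets its own clause, placed before the general one, and does not close the socket.

```python
import socket


def read_loop(sock, handle_data, on_idle, max_idle_timeouts=None):
    """Read until the peer closes or a real error occurs (illustrative only)."""
    idle = 0
    while True:
        try:
            data = sock.recv(4096)
        except (TimeoutError, socket.timeout):
            # A read timeout only means nothing arrived in time.
            # The connection is left open and the loop keeps reading.
            idle += 1
            on_idle(idle)
            if max_idle_timeouts is not None and idle >= max_idle_timeouts:
                return "idle"
            continue
        except OSError:
            # Anything else (reset, abort, ...) is a genuine failure.
            sock.close()
            return "error"
        if not data:
            # An empty read means the peer closed the connection cleanly.
            sock.close()
            return "closed-by-peer"
        idle = 0
        handle_data(data)
```

If the timeout clause were merged into the `except OSError` branch, every quiet period would tear down a perfectly healthy connection, which is exactly the bug described above.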

In enterprise setups, the typical cause of long-lasting TCP connections being closed is a firewall appliance that closes TCP connections with no traffic, say, after 10 minutes, or closes connections after their age reaches, say, 30 minutes, regardless of the traffic. In general, it's best to assume that these things will happen, and be prepared to reestablish the connection gracefully.
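One possible shape for "reestablish the connection gracefully" is a retry with exponential backoff. This is only a sketch in Python; `connect_with_backoff` and its parameters are hypothetical, and the delays are arbitrary defaults rather than recommendations.

```python
import time


def connect_with_backoff(connect, max_attempts=5, base_delay=0.5,
                         max_delay=30.0, sleep=time.sleep):
    """Call connect() until it succeeds, doubling the wait after each failure."""
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except OSError:
            if attempt == max_attempts:
                # Out of attempts: let the caller see the last error.
                raise
            sleep(delay)
            delay = min(delay * 2, max_delay)
```

It could be called as `connect_with_backoff(lambda: socket.create_connection((host, port), timeout=5))`, where `host` and `port` stand for whatever your server address is.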

A good approach is to see whether there's a pattern in the connection closures, for example whether connections are closed periodically or after a certain time with no activity. You can also run a packet sniffer to see which side initiates the connection shutdown or sends the RST packet, and why.
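For the pattern hunting, something as simple as the following Python sketch may already help. It assumes the application records, for each drop, how old the connection was and how long it had been idle; the function names and the 10% tolerance are arbitrary choices of mine.

```python
from statistics import mean, pstdev


def looks_constant(values, tolerance=0.1):
    """True if values cluster tightly around their mean (relative spread <= tolerance)."""
    if len(values) < 3:
        return False  # too few samples to call anything a pattern
    avg = mean(values)
    return avg > 0 and pstdev(values) / avg <= tolerance


def guess_drop_pattern(drops):
    """drops: list of (connection_age_seconds, idle_seconds), one tuple per drop."""
    ages = [age for age, _ in drops]
    idles = [idle for _, idle in drops]
    if looks_constant(idles):
        return f"drops follow ~{mean(idles):.0f}s of inactivity"
    if looks_constant(ages):
        return f"drops occur at connection age ~{mean(ages):.0f}s"
    return "no obvious pattern"
```

A steady idle time before each drop hints at an idle timeout somewhere on the path, while a steady connection age hints at a maximum-lifetime rule. Neither is proof, so a packet capture is still the way to confirm it.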

Alexander
+1 for the sniffer. Wireshark, the sniffer formerly known as Ethereal, is very good.
Zan Lynx