I have a C# application that has been running fine for several years. It connects via a TCP/IP socket to a machine that sends me stock trade executions.

Recently, I tried to deploy it to some machines in a new data center that sits behind a hardware firewall, and I've started to see some weird disconnects.

When a disconnect happens, I see nothing unusual in my app (the client side) except that I stop receiving data over the socket. Wireshark confirms that no data is reaching the socket, and when I stop the application in the debugger, its receive thread is blocked in the Receive() call. The socket shows as ESTABLISHED in netstat.

But from the server side, it looks like my client is disconnecting. According to their logs, the socket on their end usually ends up with either (nRecvd=-1,errno=104) or (nRecvd=0,errno=11). (On Linux, errno 104 is ECONNRESET, connection reset by peer, and 11 is EAGAIN.)

The disconnect only seems to happen after a period of inactivity. I have worked around this for now by implementing a heartbeat between my client and their server: the client sends a short message every 20 seconds and gets a reply. This has caused the disconnects to drop to 0 over the past few days.

At first, I figured that the hardware firewall was the problem and that it was timing out the connection after a period of inactivity. But the person in charge of the firewall claims that the timeout for connects on this port (8887) is 2160 minutes.

I am running Windows Server 2003 and .NET 3.5. The trade server is a Linux machine (SLES 9, I believe, though I'm not sure).

Any ideas on what might be going on? What could I do to debug this more given that I don't have any access to the firewall logs and no ability to change the code on the trade server?

Thanks, Mike

A: 

I would set up Wireshark on both sides of the firewall to see what happens at the TCP level (and below). And when the admin says the "timeout for connects" is some value, is that the timeout for an idle, established connection? I don't think any other interpretation makes sense here.

Also, are you using the TCP KeepAlive option? And if so, does the firewall forward the keepalive probes or not?
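To make the KeepAlive question concrete, here is a minimal sketch of switching it on for a client socket. It is written in Python only to keep it short; the function name `enable_keepalive` and the timer values are illustrative assumptions, not anything from the original application. In .NET the corresponding switch is `SocketOptionName.KeepAlive`, set through `Socket.SetSocketOption`. Keep in mind that the default idle time before the first probe is typically two hours on both Windows and Linux, so enabling the option without shortening the timers may not help against a device that drops idle connections sooner.

```python
import socket

def enable_keepalive(sock, idle_s=20, interval_s=5, probes=3):
    """Turn on TCP keepalive and, where the platform allows, shorten its timers.

    idle_s, interval_s and probes are illustrative values, not recommendations.
    """
    # Portable part: ask the OS to send keepalive probes on an idle connection.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Per-socket timers are platform specific, so each one is guarded.
    if hasattr(socket, "TCP_KEEPIDLE"):
        # Seconds of idleness before the first probe is sent.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    if hasattr(socket, "TCP_KEEPINTVL"):
        # Seconds between unanswered probes.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    if hasattr(socket, "TCP_KEEPCNT"):
        # Unanswered probes before the connection is declared dead.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock
```

Keepalive probes carry no application data, so whether they reset an idle timer on a given firewall is something a capture on both sides would have to confirm.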

As I said, you probably want to run Wireshark on both sides of the firewall...

Cellfish
A: 

What you describe is common, and implementing a heartbeat to keep TCP sockets alive through such firewalls/gateways, as you did, is the usual fix.
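For readers who have not written one, an application-level heartbeat can be as small as the sketch below. It is in Python purely for brevity and rests on assumptions: `start_heartbeat` is a made-up name, and the `PING\n` payload is a placeholder for whatever short message the server's protocol actually accepts. It only covers the sending half; the heartbeat described in the question also gets a reply, which the normal receive path would consume.

```python
import socket
import threading

def start_heartbeat(sock, interval_s=20.0, payload=b"PING\n"):
    """Send `payload` every `interval_s` seconds on a background thread.

    Returns a threading.Event; call .set() on it to stop the heartbeat.
    The payload is a placeholder, not a real protocol message.
    """
    stop = threading.Event()

    def run():
        # Event.wait() returns False on timeout and True once stop is set,
        # so this loop sends once per interval until asked to stop.
        while not stop.wait(interval_s):
            try:
                sock.sendall(payload)
            except OSError:
                # A failed send means the connection is gone; stop sending.
                stop.set()

    threading.Thread(target=run, daemon=True).start()
    return stop
```

If the application also writes to the same socket from another thread, the sends should share a lock so a heartbeat cannot land in the middle of a real message. The interval has to be shorter than the shortest idle timeout on the path, which is why a value like 20 seconds is a safe choice when the real timeout is unknown.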

That hardware might well have a hard timeout of 2160 minutes (though in my experience 20-30 minutes is more common), but connections are usually dropped much more aggressively when the device is under any kind of load. Such firewalls have limited resources, and when they need room to track more connections, they tend to drop the oldest tracked connection without recent activity, regardless of the configured hard timeout.

If you want to debug this more, go sniff on the server side of the firewall and see what, if anything, happens when the server gets a disconnect.

nos
Thanks, just wanted to make sure that I was on track with the firewall hypothesis. They wouldn't capture anything for me on the path from the firewall to the trade server. In the end, it turned out to be the firewall. They had unblocked the wrong port despite me asking 10x to confirm the port number.
Michael Covelli