System Background: It's basically a client/server application. The server is an embedded device and the client is a Windows app developed in C++.

Issue: After about a week of runtime, communication between client and server breaks. Once that happens, the server cannot connect back to the client and needs a restart to recover. It looks like the system has a socket reconnection problem. The network also experiences intermittent failures.

  1. Abrupt Termination at remote end
  2. Port locking

I'd like some suggestions on how to clean up the socket or shut it down cleanly so that reconnection works properly. Are there any alternative solutions?

Thanks, Hussain

+2  A: 

It does not sound like you are in a position to easily write a stress test app to reproduce this more quickly out of band, which is what I would normally suggest. A pragmatic solution might be to periodically restart the server and client at a time when you think the system is least busy, or when problems arise. This sounds like cheating but many production systems I have been involved with take this approach to maximize system uptime.

My preferred solution here would be to abstract the server and client socket code (hopefully your design allows this without too much work) and use it to build client and server test apps that stress only the socket code by simulating a lot of normal socket traffic in a short space of time. This helps identify timing windows and edge cases that could cause problems over time, and might speed up the process of obtaining a debuggable repro. You can simulate network errors in your test code by dropping the socket on the client or server periodically.
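As a rough illustration of that kind of test app, here is a minimal single-process sketch that repeatedly connects over loopback, sends a message, and then drops the client socket so the server side has to handle the disconnect. It assumes POSIX sockets (on Winsock you would need WSAStartup and closesocket instead of close), and the names open_loopback_listener and run_churn are made up for this example. A real harness would drive your own abstracted socket code instead of raw calls.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstring>

// Hypothetical helper: opens a loopback listener on an OS-chosen port.
// Returns the listening descriptor (or -1) and stores the port in *port_out.
int open_loopback_listener(unsigned short* port_out) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return -1;
    int on = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof on);
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;  // let the OS pick a free port
    socklen_t len = sizeof addr;
    if (bind(fd, (sockaddr*)&addr, sizeof addr) != 0 ||
        listen(fd, 16) != 0 ||
        getsockname(fd, (sockaddr*)&addr, &len) != 0) {
        close(fd);
        return -1;
    }
    *port_out = ntohs(addr.sin_port);
    return fd;
}

// Hypothetical stress loop: runs `rounds` connect/send/drop cycles and
// returns how many of them behaved as expected.
int run_churn(int rounds) {
    unsigned short port = 0;
    int listener = open_loopback_listener(&port);
    if (listener < 0) return 0;
    int completed = 0;
    for (int i = 0; i < rounds; ++i) {
        int client = socket(AF_INET, SOCK_STREAM, 0);
        if (client < 0) break;
        sockaddr_in addr{};
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
        addr.sin_port = htons(port);
        if (connect(client, (sockaddr*)&addr, sizeof addr) != 0) {
            close(client);
            break;
        }
        int server = accept(listener, nullptr, nullptr);
        if (server < 0) {
            close(client);
            break;
        }
        const char msg[] = "ping";
        char buf[sizeof msg] = {};
        bool ok = send(client, msg, sizeof msg, 0) == (ssize_t)sizeof msg &&
                  recv(server, buf, sizeof buf, 0) == (ssize_t)sizeof msg &&
                  std::memcmp(buf, msg, sizeof msg) == 0;
        close(client);  // simulate the client dropping the connection
        char extra;
        // The server side should now observe end-of-stream (recv returns 0).
        ok = ok && recv(server, &extra, 1, 0) == 0;
        close(server);
        if (ok) ++completed;
    }
    close(listener);
    return completed;
}
```

Calling run_churn with a large round count compresses many connect/drop cycles into a short run. Swapping the raw calls for your production socket wrapper is where the real value would come from.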

A further step to take on the strategic front would be to ensure that you have good diagnostics in your socket handlers on both the client and server side. Track socket open and close, with special focus on your socket error and reconnect paths, given you know the network is unreliable. Make sure log entries are written sequentially, each with a timestamp. Something as simple as this might quickly show you what errors or conditions trigger your problems. You can quickly check that the logs are correct and complete using the test apps I mentioned above.
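To make the logging idea concrete, here is a small sketch of a timestamped log line for socket events. The function names, the field layout, and the event strings are all invented for illustration. gmtime_r is POSIX (Windows has gmtime_s with the arguments swapped), and the error code would be errno or WSAGetLastError() depending on the platform.

```cpp
#include <chrono>
#include <cstdio>
#include <ctime>
#include <string>

// Hypothetical formatter: builds one log line with a UTC timestamp
// (millisecond resolution), the socket id, an event name and an error code.
std::string format_socket_event(const char* event, int socket_id, int error_code) {
    using namespace std::chrono;
    auto now = system_clock::now();
    std::time_t secs = system_clock::to_time_t(now);
    long ms = (long)(duration_cast<milliseconds>(now.time_since_epoch()).count() % 1000);
    std::tm tm_utc{};
    gmtime_r(&secs, &tm_utc);  // POSIX; assumption: use gmtime_s on Windows
    char stamp[32];
    std::strftime(stamp, sizeof stamp, "%Y-%m-%dT%H:%M:%S", &tm_utc);
    char line[160];
    std::snprintf(line, sizeof line, "%s.%03ldZ sock=%d event=%s err=%d",
                  stamp, ms, socket_id, event, error_code);
    return line;
}

// Writes the line to stderr and flushes, so entries appear in order even
// if the process dies shortly afterwards.
void log_socket_event(const char* event, int socket_id, int error_code) {
    std::fprintf(stderr, "%s\n",
                 format_socket_event(event, socket_id, error_code).c_str());
    std::fflush(stderr);
}
```

Calling log_socket_event at every open, close, error and reconnect attempt on both ends gives you two timelines you can line up after a failure.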

One thing you might want to check is that you are not being hit by an inability to reuse addresses. Sometimes when a socket gets closed, its address cannot be immediately reused for a reconnect attempt because there is still residual activity on one end or the other. You may be able to get around this (based on my Windows/Winsock experience) by experimenting with SO_REUSEADDR and SO_LINGER on your sockets. However, my first focus in your case would be on ensuring the socket code on client and server handles all errors and mainline cases correctly, before worrying about this.
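For the "shut down cleanly" part of the question, a sketch along these lines may help. It shows setting SO_LINGER and an orderly close: stop sending, drain whatever the peer still has in flight, then close. It assumes POSIX sockets; on Winsock the rough equivalents would be shutdown with SD_SEND, WSAPoll or select, and closesocket. The helper names set_linger and graceful_close are made up here, and the timeout handling is deliberately simplistic.

```cpp
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

// Hypothetical helper: enables or disables SO_LINGER with the given timeout.
// With linger enabled, close() on a TCP socket may block for up to `seconds`
// while unsent data is flushed.
bool set_linger(int fd, bool enabled, int seconds) {
    struct linger lg{};
    lg.l_onoff = enabled ? 1 : 0;
    lg.l_linger = seconds;
    return setsockopt(fd, SOL_SOCKET, SO_LINGER, &lg, sizeof lg) == 0;
}

// Hypothetical orderly close. Returns true if the peer's end-of-stream was
// seen before closing, false if a wait timed out or an error occurred.
// Note: timeout_ms applies to each wait, not to the whole drain.
bool graceful_close(int fd, int timeout_ms) {
    shutdown(fd, SHUT_WR);  // we will send nothing more; peer sees EOF
    bool peer_closed = false;
    char buf[512];
    pollfd pfd{fd, POLLIN, 0};
    while (poll(&pfd, 1, timeout_ms) > 0) {
        ssize_t n = recv(fd, buf, sizeof buf, 0);
        if (n == 0) {  // peer has finished sending too
            peer_closed = true;
            break;
        }
        if (n < 0) break;  // error: give up and close anyway
        // n > 0: discard leftover data and keep draining
    }
    close(fd);
    return peer_closed;
}
```

The point of draining before close is that both sides get to see an orderly end-of-stream instead of a reset, which makes the reconnect path easier to reason about. Whether it cures the original problem is a separate question.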

Steve Townsend
+1  A: 

A common issue is that when a connection is dropped, the OS keeps it around in the TIME_WAIT state. If you then restart the server socket, it cannot immediately rebind the same port because the OS still considers it in use. To avoid that, set the SO_REUSEADDR option on the server socket so that the OS lets you reuse the port while it is in TIME_WAIT.

Example:

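int optval = 1;
// Set SO_REUSEADDR to true (1) on the listening socket s1, before calling bind().
// The cast is needed on Winsock, where the option value is a const char*.
setsockopt(s1, SOL_SOCKET, SO_REUSEADDR, (const char*)&optval, sizeof optval);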
Benoit Thiery
A: 

I'm experiencing something similar with encrypted connections. I believe in my case it is because the client dropped the connection and reconnected in less than the 4 minute FIN_WAIT period. The initial connection is recycled (by the OS) and the server doesn't see the dropout. The SSL authentication is lost when the client loses the connection, so the client tries to re-authenticate. This happens during what the server considers the middle of a conversation. The server then hangs up on the client. I think the server's SSL code considers this a man-in-the-middle attack, or just gets confused, and closes the connection.

Jay
None of that is possible. The FIN_WAIT period affects the end that closed first, not the end that received the close first. The server will see it as a new TCP connection because of the incoming SYN and also because of the new TCP sequence number. At the SSL level the SSL session can be resumed. Your problem lies elsewhere.
EJP
I think you're mistaken. I've debugged closing the client (a Windows service) and restarting it. The server never received a connection close event and the same connection was reused.
Jay