We have an async socket server written in C#, running on Windows Web Server 2008.

It works flawlessly until it stops accepting new connections for an unknown reason.

We have about 200 concurrent connections on average, and we keep a count of both connections created and connections dropped. These counts can reach as high as 10,000 or as low as 1,000 before it just stops. Sometimes it runs for up to around 8 hours before it stops, sometimes for only half an hour. At the moment it runs for about an hour, and then another application that can no longer connect restarts it automatically (not exactly ideal).

It doesn't look like we're running out of sockets, as we're closing them properly. We're also logging all errors, and nothing is logged immediately before it stops.

We can't figure this out. Does anyone have any ideas what might be going on?

I can paste code, but it's generally just the same old async BeginAccept/send code you see everywhere.

A: 

Without seeing code, it is almost impossible to hazard a guess. But I'll try anyway: one thing that comes to mind is that you might not be maintaining a reference to the listening socket, so at some point the GC collects the socket and your listening stops.

Now of course the fact that this sometimes runs for hours makes this an unlikely reason, but it is one that came to mind and I thought it worth mentioning.

Chris Taylor
This is interesting, we'll make the listening socket global just to make sure.
Rob
It's impossible for me to paste code here, there are so many lines of it, but we're genuinely using very generic code. It feels like a configuration issue of some sort, but we can't figure it out.
Rob
Not sure if anyone would be willing, but I could send a large chunk of code over for you to have a gander?
Rob
To give more info, `netstat -a` shows no `TIME_WAIT` sockets: quite a lot of `CLOSE_WAIT` and `ESTABLISHED`, but no `TIME_WAIT`.
Rob
A: 

Who initiates the active close, the client or the server? If it's the server then you may be accumulating sockets in the TIME_WAIT state on the server, and this may prevent you from accepting new connections. This is more likely if the client connections can be short-lived and you go through periods when lots of short-lived client connections occur.

Oh, and if you ARE accumulating sockets in TIME_WAIT then please don't just assume that changing the machine-wide time wait period length is the best or only solution.

Len Holgate
The actual problem that Len suspects is the exhaustion of ephemeral ports. The best solution is probably to increase the ephemeral port range, if this is the problem you're seeing.
Stephen Cleary
That's true, or to move the active close to the client so that the server doesn't accumulate `TIME_WAIT` in the first place...
Len Holgate
We have upped the range of ephemeral ports on the machine but that's made no difference. The client initiates the active close.
Rob
If the client is initiating the active close then you're unlikely to be collecting `TIME_WAIT` sockets on the server, so increasing the ephemeral port range won't have any effect.
Len Holgate
Hi Len, any other ideas? We're running out of them; all we know is it suddenly stops listening for no apparent reason. Earlier today it managed to create 12,000 connections without falling over, but tonight when it's busier it only lasts for about an hour and only 2-5k connections.
Rob
You're sure you're not swallowing an exception somewhere? Post a link to the code and I'll take a look...
Len Holgate
Just a bit of an update: running TCPView shows a lot of established connections but none are in TIME_WAIT, so that looks good. When the server stops accepting connections I can see that TCPView says the server app is still listening. Confused!
Rob
Not missing any exceptions (I don't think). http://tinyurl.com/3y5hkk6 Thanks Len, much appreciated!
Rob
What kind of event is it that you're using? Could there be a race condition between your Set() and WaitOne() which is causing the wait to wait forever? Personally I wouldn't structure the code like that. You can post a number of async accepts rather than one, and you can have each of them post a new one when it completes rather than having a listen thread at all. This will improve accept performance.
Len Holgate
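The structure suggested in the comment above (several outstanding accepts, each re-posting itself on completion, no listen thread) might look roughly like the following. This is a minimal sketch, not the poster's actual code: the class name `AcceptLoopServer`, the `onClient` callback and the logging are hypothetical, and the error handling is only there to make sure a failed accept never silently ends the chain.

```csharp
using System;
using System.Net;
using System.Net.Sockets;

// Hypothetical illustration; names are not from the original code.
public sealed class AcceptLoopServer
{
    // Kept in a field so the listening socket stays referenced for the server's lifetime.
    private readonly Socket _listener;
    private readonly Action<Socket> _onClient;

    public AcceptLoopServer(IPEndPoint endPoint, int backlog, Action<Socket> onClient)
    {
        _onClient = onClient;
        _listener = new Socket(endPoint.AddressFamily, SocketType.Stream, ProtocolType.Tcp);
        _listener.Bind(endPoint);
        _listener.Listen(backlog);
    }

    public int Port { get { return ((IPEndPoint)_listener.LocalEndPoint).Port; } }

    // Post several accepts up front instead of running a dedicated listen thread.
    public void Start(int pendingAccepts)
    {
        for (int i = 0; i < pendingAccepts; i++) PostAccept();
    }

    public void Stop() { _listener.Close(); }

    private void PostAccept()
    {
        try
        {
            _listener.BeginAccept(OnAccept, null);
        }
        catch (ObjectDisposedException)
        {
            // Listener was closed: the server is shutting down.
        }
        catch (SocketException ex)
        {
            // If this is swallowed without logging, one accept is lost for good.
            Console.Error.WriteLine("BeginAccept failed: " + ex.SocketErrorCode);
        }
    }

    private void OnAccept(IAsyncResult ar)
    {
        Socket client = null;
        try
        {
            client = _listener.EndAccept(ar);
        }
        catch (ObjectDisposedException)
        {
            return; // shutting down
        }
        catch (SocketException ex)
        {
            Console.Error.WriteLine("EndAccept failed: " + ex.SocketErrorCode);
        }

        // Re-post before doing any per-client work, even if this accept failed.
        PostAccept();

        if (client == null) return;
        try
        {
            _onClient(client);
        }
        catch (Exception ex)
        {
            Console.Error.WriteLine("Client handler failed: " + ex.Message);
            client.Close();
        }
    }
}
```

With this shape, `new AcceptLoopServer(endPoint, 100, handler).Start(10)` keeps ten accepts pending; the key property is that every path through `OnAccept` other than shutdown posts a replacement accept.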
Thanks for the advice. We've taken out the listening thread and we're calling BeginAccept as the first thing we do in AcceptCallback. When we first started it, it raced up to 130 connections as everyone signed back in immediately, then just stopped taking connections; not sure if this was due to a bottleneck or not. However, since then we've tried it again and it seems to be working (for the time being).
Rob
Well, if nothing else we've removed one potential problem, the wait on the event; though I can't really see how it could have been a problem. The only potential issue would have been if one of the BeginAccepts had failed.... Personally I'd use the AcceptAsync() model as it scales better.
Len Holgate
That's the thing, we don't get any exceptions at all if a BeginAccept fails. The issue is still happening, unfortunately. We've now created a timer which fires every 10 seconds and checks whether a connection can be made; if it can't, it tries to post a new BeginAccept. While this works on our test server (with just a few clients), on production it just seems to die and stop responding. This clearly isn't the greatest way of doing things. Is it possible that a client can connect, use up a BeginAccept and never reach AcceptCallback, so that a new BeginAccept is never called?
Rob
Will have to investigate the AcceptAsync() method, I've not seen many examples of how it works. By the way, thanks for taking the time to help, it's really appreciated!
Rob
Ah, is AcceptAsync part of the .NET 3.5 framework? We did try rewriting our server with the SocketAsyncEventArgs stuff but we came up against an issue with buffering: we couldn't work out how to maintain a buffer between calls of receiveCallBack (or the alternative func in 3.5). Is it worth writing it with the new framework?
Rob
I don't know, I tend to work in unmanaged code mostly. Try posting more than one BeginAccept() at the start (loop and post 10, perhaps). Also what is your listen backlog? AcceptAsync() has a fairly complete and fairly good (if simplistic) example linked to from the SocketAsyncEventArgs msdn pages.
Len Holgate
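For readers who have not seen the AcceptAsync() pattern referred to above, here is a rough sketch of an accept loop built on `SocketAsyncEventArgs`. It is an illustration under assumptions, not the MSDN sample or the poster's code: `AcceptAsyncServer` and `onClient` are hypothetical names. The two details that are easy to miss are that `AcceptSocket` must be cleared before an args object is reused, and that `AcceptAsync` returning `false` means the operation completed synchronously and `Completed` will not be raised.

```csharp
using System;
using System.Net;
using System.Net.Sockets;

// Hypothetical illustration; names are not from the original code.
public sealed class AcceptAsyncServer
{
    private readonly Socket _listener;
    private readonly Action<Socket> _onClient;

    public AcceptAsyncServer(IPEndPoint endPoint, int backlog, Action<Socket> onClient)
    {
        _onClient = onClient;
        _listener = new Socket(endPoint.AddressFamily, SocketType.Stream, ProtocolType.Tcp);
        _listener.Bind(endPoint);
        _listener.Listen(backlog);
    }

    public int Port { get { return ((IPEndPoint)_listener.LocalEndPoint).Port; } }

    public void Start(int pendingAccepts)
    {
        for (int i = 0; i < pendingAccepts; i++)
        {
            // One reusable args object per outstanding accept.
            var args = new SocketAsyncEventArgs();
            args.Completed += OnAcceptCompleted;
            StartAccept(args);
        }
    }

    public void Stop() { _listener.Close(); }

    private void StartAccept(SocketAsyncEventArgs e)
    {
        while (true)
        {
            e.AcceptSocket = null; // must be cleared before the args object is reused
            bool pending;
            try
            {
                pending = _listener.AcceptAsync(e);
            }
            catch (ObjectDisposedException)
            {
                return; // listener closed
            }
            if (pending) return;            // Completed will be raised later
            if (!ProcessAccept(e)) return;  // completed synchronously: handle it here
        }
    }

    private void OnAcceptCompleted(object sender, SocketAsyncEventArgs e)
    {
        if (ProcessAccept(e)) StartAccept(e);
    }

    // Returns false once the listener has been shut down.
    private bool ProcessAccept(SocketAsyncEventArgs e)
    {
        if (e.SocketError == SocketError.OperationAborted) return false;
        if (e.SocketError != SocketError.Success)
        {
            Console.Error.WriteLine("Accept failed: " + e.SocketError);
            if (e.AcceptSocket != null) e.AcceptSocket.Close();
            return true; // keep accepting after a failed attempt
        }
        Socket client = e.AcceptSocket;
        try
        {
            _onClient(client);
        }
        catch (Exception ex)
        {
            Console.Error.WriteLine("Client handler failed: " + ex.Message);
            client.Close();
        }
        return true;
    }
}
```

The loop in `StartAccept` handles synchronous completions without recursion, which keeps the stack flat when many connections are already queued in the backlog.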
Listening backlog is 100. What impact does it have to post more BeginAccepts than are perhaps needed? We're a bit dubious about scattering them around. We've read every example going of SocketAsyncEventArgs and pretty much all of them are just simple echo servers that don't properly demonstrate buffering.
Rob
Well, in unmanaged code, posting more AcceptEx() calls results in the server being able to accept more new connections simultaneously; if the listen backlog is exceeded then new connections are refused until you've accepted those in the listen backlog queue... I'm working on some example servers with SocketAsyncEventArgs to profile against my unmanaged server framework but they're not complete yet. BUT you can just use Offset and BytesTransferred to set up the buffer for another read into the same memory buffer, at an offset just after where the last read completed (if you're accumulating a message).
Len Holgate
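The Offset/BytesTransferred idea from the comment above can be sketched as follows. This assumes fixed-length messages purely to keep the example small (the thread does not say how the real protocol frames its messages), and `FixedLengthReader` and `onMessage` are hypothetical names. After a partial read, the args object is pointed at the unfilled tail of the same buffer so the next read continues where the last one stopped.

```csharp
using System;
using System.Net.Sockets;

// Hypothetical illustration; assumes every message is exactly messageLength bytes.
public sealed class FixedLengthReader
{
    private readonly Socket _socket;
    private readonly byte[] _buffer;
    private readonly Action<byte[]> _onMessage;
    private readonly SocketAsyncEventArgs _args = new SocketAsyncEventArgs();

    public FixedLengthReader(Socket socket, int messageLength, Action<byte[]> onMessage)
    {
        _socket = socket;
        _onMessage = onMessage;
        _buffer = new byte[messageLength];
        _args.SetBuffer(_buffer, 0, _buffer.Length);
        _args.Completed += OnCompleted;
    }

    public void Start()
    {
        // ReceiveAsync returns false when it completed synchronously,
        // so loop here instead of recursing.
        while (!_socket.ReceiveAsync(_args))
        {
            if (!ProcessReceive(_args)) return;
        }
    }

    private void OnCompleted(object sender, SocketAsyncEventArgs e)
    {
        if (ProcessReceive(e)) Start();
    }

    // Returns false when the connection is finished.
    private bool ProcessReceive(SocketAsyncEventArgs e)
    {
        if (e.SocketError != SocketError.Success || e.BytesTransferred == 0)
        {
            _socket.Close(); // error, or the peer closed the connection
            return false;
        }

        // Offset is where this read started; BytesTransferred is how much arrived.
        int filled = e.Offset + e.BytesTransferred;
        if (filled < _buffer.Length)
        {
            // Partial message: aim the next read at the unfilled tail of the same buffer.
            e.SetBuffer(filled, _buffer.Length - filled);
        }
        else
        {
            // Full message: hand over a copy, then reuse the buffer from the start.
            _onMessage((byte[])_buffer.Clone());
            e.SetBuffer(0, _buffer.Length);
        }
        return true;
    }
}
```

A length-prefixed or delimiter-based protocol would need extra bookkeeping, but the same `SetBuffer(offset, count)` call is what carries the partially filled buffer from one receive to the next.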
Ok, so we've spent today rewriting the server with 3.5 and AcceptAsync(), and we have all the buffering working etc. However, we've come up against another issue I'm hoping you might be able to shed some light on. For each new connection we add it to a dictionary. Every so often we loop through that collection and send a message to each client, but whenever we try to send we seem to hit this error: `An asynchronous socket operation is already in progress using this SocketAsyncEventArgs instance`. The code we're using is:
Rob
`c.ReadEventArgs.SetBuffer(byteData, 0, byteData.Length); Boolean willRaiseEvent = c.Client.SendAsync(c.ReadEventArgs); if (!willRaiseEvent) { this.ProcessSend(c.ReadEventArgs); }` where `c` is our object from the dictionary. Any ideas?
Rob
Ah okay, figured it out. We had to hold our sent data in a new SocketAsyncEventArgs object - oops.
Rob
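The fix described in the last comment, sending through a `SocketAsyncEventArgs` object that is separate from the one with a receive already pending, could look something like this. It is a sketch under assumptions: `BroadcastSender` is a hypothetical helper, and allocating a fresh args object per send is the simplest choice rather than the most efficient one (a pool of send args would avoid the allocations).

```csharp
using System;
using System.Net.Sockets;

// Hypothetical helper; not the poster's actual code.
public static class BroadcastSender
{
    public static void Send(Socket client, byte[] data)
    {
        // A dedicated args object for this send, so it can never collide with
        // the receive that is outstanding on the connection's read args.
        var sendArgs = new SocketAsyncEventArgs();
        sendArgs.SetBuffer(data, 0, data.Length);
        sendArgs.Completed += OnSendCompleted;

        // false means the send completed synchronously and Completed is not raised.
        if (!client.SendAsync(sendArgs))
        {
            OnSendCompleted(client, sendArgs);
        }
    }

    private static void OnSendCompleted(object sender, SocketAsyncEventArgs e)
    {
        if (e.SocketError != SocketError.Success)
        {
            Console.Error.WriteLine("Send failed: " + e.SocketError);
        }
        e.Dispose(); // this args object was created only for this one send
    }
}
```

In the broadcast loop this would be called as `BroadcastSender.Send(c.Client, byteData)` for each entry in the dictionary, leaving `c.ReadEventArgs` dedicated to receiving.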