We have a C# (.NET 3.5) socket server running as a console app. After about 3,000 connections (sometimes fewer; it's quite random), we get an unhandled exception that crashes the app completely.

We're really struggling to find out what's happening and where. The only information we get is below; can anyone shed any light? It should be noted that EVERYTHING is wrapped in try/catch blocks.

Description: Stopped working

Problem signature:
Problem Event Name: CLR20r3
Problem Signature 01: qrushrserver.exe
Problem Signature 02: 1.0.0.0
Problem Signature 03: 4bf56a0c
Problem Signature 04: System
Problem Signature 05: 2.0.0.0
Problem Signature 06: 49cc5ec9
Problem Signature 07: 2c0b
Problem Signature 08: 40
Problem Signature 09: System.Net.Sockets.Socket
OS Version: 6.0.6002.2.2.0.1296.17
Locale ID: 2057

Faulting application app_name.exe, version 1.0.0.0, time stamp 0x4bf56a0c, faulting module mscorwks.dll, version 2.0.50727.4200, time stamp 0x4a9ee32d, exception code 0xc0000005, fault offset 0x00000000001c89ca, process id 0x%9, application start time 0x%10.

.NET Runtime version 2.0.50727.4200 - Fatal Execution Engine Error (000007FEF8E4664E) (80131506)

+1  A: 

Your program is dying on an access violation. While there's no conclusive evidence, it most likely happened while the garbage collector or finalizer thread was running, which would make the exception uncatchable.
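A try/catch block only covers the thread it runs on, and asynchronous socket callbacks run on thread-pool threads. As a diagnostic step, a minimal sketch of a process-wide handler that logs any managed exception escaping those blocks is shown here. The class name `CrashLogging` and its helpers are hypothetical, and a fatal execution engine error such as the one in the log may never reach this handler at all.

```csharp
using System;

static class CrashLogging
{
    // Hypothetical helper: turns whatever the runtime reports into loggable text.
    public static string Describe(object exceptionObject)
    {
        Exception ex = exceptionObject as Exception;
        return ex != null
            ? ex.ToString()                       // includes type, message and stack trace
            : "Non-exception object: " + exceptionObject;
    }

    // Registers a handler for managed exceptions that no try/catch caught,
    // on any thread in the current AppDomain.
    public static void Install()
    {
        AppDomain.CurrentDomain.UnhandledException +=
            delegate(object sender, UnhandledExceptionEventArgs e)
            {
                Console.Error.WriteLine(
                    "Unhandled (terminating=" + e.IsTerminating + "): "
                    + Describe(e.ExceptionObject));
            };
    }
}
```

Calling `CrashLogging.Install()` as the first statement of `Main` is enough to wire it up. The handler cannot keep the process alive; it only gives a stack trace to work from if the failure is a managed exception.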

This is almost always caused by heap corruption. It is very unlikely that the .NET Socket code caused the corruption; that code has been put to the test billions of times, although you are giving it a good workout. The much more likely cause is some kind of unmanaged code used by your program: a COM server, or perhaps something you P/Invoke. It could also be some sort of add-in on the machine, like a virus scanner.

Finding the true cause is going to be difficult. Start with the environmental stuff: boot Windows in safe mode with network support, and run SysInternals' AutoRuns utility to disable anything that gets started automatically. Good luck with it, you'll need it.

Hans Passant
We're not P/Invoking anything and we're using pure .NET classes. There is a .NET hotfix for mscorwks.dll which we're thinking about installing. Annoyingly, we need to uninstall all the frameworks and then reapply them. Thanks for replying, this sounds like a right nightmare!
Rob
That's not what I recommended; re-read my last paragraph. Start with the virus scanner. They are notorious for hooking into sockets to "protect" the machine. Symantec and Panda have a particularly bad history.
Hans Passant
What's the hotfix for, Rob?
Len Holgate
We don't actually have any virus scanner on our server.
Rob
This is the hotfix. I have no idea whether it has anything to do with our issue: http://support.microsoft.com/kb/913384/en-us
Rob
@Rob: It doesn't. The revision version number of that hotfix is 63, a patch for number 42 which was the original release version of .NET 2.0. The revision you've got on that machine is 4200, published about 3 years after that hotfix. The best you can hope for is that it will refuse to install.
Hans Passant
You're right, we figured the same thing out shortly after I posted.
Rob
A: 

I guess you're using the Async methods rather than Begin/End.

Are you calling Dispose() on your SocketAsyncEventArgs when you're done with them? I find that if you don't, and keep allocating new ones, you chew through memory for some reason; they don't seem to be collected. I found this a real problem with a high number of read/write operations on a high number of connections.

Len Holgate
We're using a connection pool, so calling Dispose() on them wouldn't be the right behaviour for us; we just close the socket and then push it back on the stack. Or should we be disposing it too?
Rob
Your `SocketAsyncEventArgs` are 'per operation' data... What do you do with them after the read or write completes? I found that disposing of them and creating new for each operation was more efficient for high numbers of connections than pooling for reuse and the resulting contention for the pool lock.
Len Holgate
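The per-operation approach described in the comment above can be sketched roughly as follows: every receive gets a fresh `SocketAsyncEventArgs`, which is disposed as soon as that operation has completed. This is only an illustration under assumptions, not the code from either poster; the class name `PerOperationReceive`, the 4096-byte buffer size, and the `onData` callback are all hypothetical.

```csharp
using System;
using System.Net.Sockets;

static class PerOperationReceive
{
    // Starts one receive with a brand-new SocketAsyncEventArgs (no pooling).
    public static void StartReceive(Socket socket, Action<byte[], int> onData)
    {
        SocketAsyncEventArgs args = new SocketAsyncEventArgs();
        args.SetBuffer(new byte[4096], 0, 4096);   // buffer size is an arbitrary choice
        args.Completed += delegate(object sender, SocketAsyncEventArgs e)
        {
            Finish(socket, e, onData);
        };

        bool pending;
        try
        {
            pending = socket.ReceiveAsync(args);
        }
        catch (ObjectDisposedException)
        {
            args.Dispose();                        // socket already closed elsewhere
            return;
        }

        // false means the operation completed synchronously and Completed will not fire.
        // Note: a long run of synchronous completions would recurse here.
        if (!pending) Finish(socket, args, onData);
    }

    static void Finish(Socket socket, SocketAsyncEventArgs e, Action<byte[], int> onData)
    {
        try
        {
            if (e.SocketError == SocketError.Success && e.BytesTransferred > 0)
            {
                onData(e.Buffer, e.BytesTransferred);  // onData is assumed not to throw
                StartReceive(socket, onData);          // next operation, new args
            }
            else
            {
                socket.Close();                        // peer closed or an error occurred
            }
        }
        finally
        {
            e.Dispose();                               // this operation's args are done
        }
    }
}
```

The trade-off is more allocations per operation in exchange for no shared pool and therefore no pool lock to contend on, which is the point made in the comment.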
Len, would you be kind enough to have a quick butchers at what we're doing so far? Off topic in this thread, I know, but for some reason it can take a couple of attempts to log in successfully, and I can't find the issue: http://95.131.67.163/code.txt MUCH appreciated!
Rob
I see your code is based on the MSDN sample. I found that unreliable over time (after a period of lots of connections and lots of data transferred, performance dropped horribly for some reason that I didn't bother to look for), so I stripped out the whole clever memory management and buffer pooling...
Len Holgate
Indeed, we thought we couldn't go wrong with the MSDN example. Clearly we were wrong. OK, we'll strip the management pool out and go with a simpler model. Does everything else look about right? For some reason, when testing locally with a few clients, the number of clients in the dictionary is spot on; on production, with a lot more connections, it seems to be out by a considerable amount. We do have drifting sockets, but I'd be surprised if it's all down to that. Might have to put a timer in to check for disposed sockets, I guess.
Rob
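For the drifting dictionary count and the timer idea in the comment above, a rough sketch of one possible shape follows. It rests on a guess, since the linked code is not shown here: that the dictionary is updated from completion callbacks on several threads without a lock. The class `ClientRegistry`, the integer client id, and the `LooksDead` heuristic are all hypothetical names for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Sockets;

sealed class ClientRegistry
{
    private readonly object _gate = new object();
    private readonly Dictionary<int, Socket> _clients = new Dictionary<int, Socket>();

    // Every access goes through the same lock; Dictionary is not thread-safe.
    public void Add(int id, Socket socket) { lock (_gate) { _clients[id] = socket; } }
    public bool Remove(int id) { lock (_gate) { return _clients.Remove(id); } }
    public int Count { get { lock (_gate) { return _clients.Count; } } }

    // Heuristic only: readable with zero bytes available usually means the peer closed.
    public static bool LooksDead(Socket socket)
    {
        try
        {
            return !socket.Connected
                || (socket.Poll(0, SelectMode.SelectRead) && socket.Available == 0);
        }
        catch (SocketException) { return true; }
        catch (ObjectDisposedException) { return true; }   // already closed elsewhere
    }

    // Closes and removes entries that look dead; returns how many were removed.
    public int Sweep()
    {
        List<int> dead = new List<int>();
        lock (_gate)
        {
            foreach (KeyValuePair<int, Socket> pair in _clients)
            {
                if (LooksDead(pair.Value)) dead.Add(pair.Key);
            }
            foreach (int id in dead)
            {
                _clients[id].Close();      // Close is safe to call more than once
                _clients.Remove(id);
            }
        }
        return dead.Count;
    }
}
```

A `System.Threading.Timer` could call `Sweep()` every so often. Half-open connections where the peer vanished without sending anything will still look alive to this check, so an application-level timeout or keep-alive may be needed as well.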