views:

66

answers:

1

Hello, we are having latency issues in one of our network application. Most of the time requests are being handled within 100ms. But sometime it can take up to a few seconds for no apparent reason.

So I hooked up some monitoring tools and looked up what was happening (Wireshark to monitor the network externally through port replication and Process Monitor to see what was happening on the local machine).

I was able to match tcp packets and they usually where within a millisecond of eachother in both logs file. But in one occurence, the last packet of a series was delayed by more then 250ms in Process Monitor compared to wireshark (and the application erratic behavior - due to latency - was being observed).

Since Wireshark was hooked up on another computer I'm quite sure that what was being monitored was accurate : all the packed did reach the network card on time. As for Process Monitor I'm not totally sure about how it work : when is the network data being registered? Is it when it reach the network card? When it is made available to the application? When the application reads the data?

During these 250ms there were a few other events being registered which let me believe that Process Monitor was recording correctly and that this 250ms delay wasn't "created" by it.

Any help regarding the behavior of Process Monitor, the current method I use to dig down the problem or what you think could be the problem would be much appreciated.

+1  A: 

Option 2

Perhaps you're experiencing the infamous 250ms delays that the GC cause from time to time (link). You can accurately measure GC suspensions using a specialized CLR host (link)

Option 1 - was ruled out

Since you are using TCP, I'd suggest that you'll turn on the NoDelay option on your socket just to eliminate the possibility that you're suffering from a clash between Nagle's Algorithm and the Delayed ACKs Algorithm. If you're experiencing "batching" of packets while sometimes a packet is "delayed" for about 200ms, then it just might be the issue.
A more in-depth explanation of this behavior can be found here.

Liran
Nodelay is already enabled. Also since I was monitoring with wireshark and the packet went in without "any" delay from the sending part, I can only assume the problem is local to the machine. The packet went through the wire in time, why wasn't it "received" on time on the local computer?
Benoittr
@Benoittr, have you checked if those 250ms delays are caused by a GC collection? Perhaps you're experiencing the infamous 250ms delays that the GC cause from time to time (link: http://blogs.microsoft.co.il/blogs/sasha/archive/2009/07/31/garbage-collection-thread-suspension-delay-250ms-multiples.aspx). You can accurately measure GC suspensions using a specialized CLR host (link: http://blog.liranchen.com/2010/08/accurately-measuring-gc-suspensions.html)
Liran
@Liran I've been trying to measure the GC in my application following your blog post. I modified it slightly to use the new way of doing things in 4.0. At the moment I'm able to load some sample c# program from the host but as soon as my first garbage collection arise the method SetAppDomainManager is called in unmanaged code and the c# application crash with a System.ExecutionEngineException. Still working on it. Eventually I'd like to come up with a configurable host which would load any manage code and report on the GC durations. It might already exist but I couldn't find anything.
Benoittr
@Benoittr, As I recall, the sample code I've provided in the post should work just fine on v4.0 (just remember to update the CLR version number to the appropriate one). Actually, for my own profiling needs, I've written a CLR host that's similar to your description (it basically receives some parameters from the command-prompt and writes the suspensions to a file). Unfortunately, I won't be able to post the source code. However, the sample code in the post covers the "difficult" part of it.
Liran
@Liran I think I'll just stick to the deprecated methods for now. Even tough I still need to dig down the problem on my side, your answer was extremely useful and I'll go ahead and accept it. Thanks a lot.
Benoittr
@Benoittr, you're welcomed. I'm curios about your case, please update me if the GC suspensions are indeed the root cause of the delays.
Liran
@Liran I have yet to test on the problematic computer. But on my personnal computer (windows 7) I've been able to reproduce the situation but instead of noticing delays in multiples of 250ms, they are in multiples of 10ms. Nothing above 50ms for a few minutes run.
Benoittr
@Liran I've let my application run on my personnal computer overnight. Its a repetitive process that keeps doing the same task over and over again every 3 seconds. Gen2 Collections are generally between 10 and 30ms but at one point during the night it took 2 seconds to do a gen 1 collection. It doesn't seem to be the problem that you mentionned, but it still puzzle me.
Benoittr
@Benoittr, are those numbers based on the clr host's output? You could try to take a dump of the process when this phenomena occurs (perhaps by some external tool that receives ACK messages from your own application, and when it stops to receive it for X milliseconds, it takes a dump of the monitored process). Of course, this will require you to use the appropriate WinAPI methods (http://msdn.microsoft.com/en-us/library/ms680360%28VS.85%29.aspx). After obtaining the dump, you could investigate it to see what's going on in your process at that time
Liran
@Liran Yes the overnight numbers were extracted using the CLR Host. I'm not sure I'll head in the "dump" direction yet but it does seems to be a close future step. Thanks again.
Benoittr