Diagnosing pathological behavior of a piece of cluster software

views:

answers:

+1 Q:

Diagnosing pathological behavior of a piece of cluster software

I'm using a kind of load balancer over a small cluster that is able to achieve >2000rps on zero-duration requests (t.i. ones that are immediately satisfied by the worker nodes). But as soon as the requests stop being zero-duration and start taking even 1ms, performance immediately drops >10x. The data being transfered in both directions is identical and is about 2kb in size. This is for sure not related to saturation of the cluster or network throughput, because 200rps of 1ms requests is a very tiny load and the network is 10Gbit. Besides, the CPU load is just some 2-5% both on the load balancer and on the worker nodes.

I wonder whether that might be related to some pathological behavior of the OS scheduler, or the OS network stack (t.i. there is some special case behavior for very short interactions).

How might I diagnose the reason? Which perfcounters to watch? What tools or methodologies to use?

(Just in case someone simply knows the answer to my particular problem, I'm talking about the MS HPC Server 2008 R2's "WCF Broker", running on Windows Server 2008 R2 over Hyper-V)

I'm assuming that there are some shared resources with some kind of locking system in place? Is locking a bottleneck? It's hard to guess without seeing the system.

Do you have a way to profile the workers? What are they spending most of their time on, especially in the fast vs slow scenarios?

Chris Smith 2010-08-11 18:57:07

Everyone is doing nothing most of the time. The CPU load is close to zero.

jkff 2010-08-12 04:30:37

If there was a locking issue, it would manifest with very low CPU usage. The threads are waiting for something before they can complete their work items. You might try stack walking w/ xperf to see what they're doing, i.e. http://blogs.msdn.com/b/pigscanfly/archive/2009/08/06/stack-walking-in-xperf.aspx

Chris Smith 2010-08-12 19:51:08

+1 A:

One thing you can do is use ETW tracing to try and understand what the nodes are doing while your WCF job is running. On HPC server, I sometimes clusrun xperf to collect traces on all or specific nodes. There are a number of tools that you can use for analyzing ETW traces, including xperf itself. I haven't done any serious work using HPC SOA (WCF), but I did write a simple WCF raytracer app and then used xperf to profile it on several of the nodes.

Matt 2010-08-16 00:58:17

+1 A:

Turned out it was a completely network-unrelated issue having to do with peculiarities of the scheduling mechanism of HPC Server. I resolved the issue by tweaking a configuration option "serviceRequestPrefetchCount" to 0 in the loadBalancing section of the WCF service config file.

jkff 2010-08-16 09:05:12

ansaurus

tags:

views:

answers:

Diagnosing pathological behavior of a piece of cluster software

related questions