I have a troublesome problem which I'm at a loss to explain. To put it simply, the CPU use is inexplicably high on the web servers in my web farm.

I have a large number of users hitting two front-end web servers. 99% of the page loads are Ajax requests and serve a simple JSON-serialized object which the web servers retrieve from a backend using WCF. In the typical case (again, probably 99% of the requests), all the ASPX page is doing is making a WCF call to get this data, serializing it into a JSON string and returning it.
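Roughly, the hot path looks like this (a simplified sketch only -- the page, helper, and DTO names here are stand-ins, not the actual code, and DataContractJsonSerializer is used purely as a placeholder for whatever serializer the site really uses):

    using System;
    using System.Runtime.Serialization;
    using System.Runtime.Serialization.Json;
    using System.Web.UI;

    // Placeholder for the small object described above: a guid, a couple of
    // short strings, a few ints. The real type and members are not known.
    [DataContract]
    public class StatusDto
    {
        [DataMember] public Guid Id { get; set; }
        [DataMember] public string Name { get; set; }
        [DataMember] public string State { get; set; }
        [DataMember] public int Count { get; set; }
    }

    public partial class StatusPage : Page
    {
        protected void Page_Load(object sender, EventArgs e)
        {
            // One WCF call to the backend (BackendClient is a placeholder wrapper
            // around the proxy call shown further down), then the DTO goes
            // straight back out as JSON.
            StatusDto result = BackendClient.GetStatus(Request.QueryString["id"]);

            var serializer = new DataContractJsonSerializer(typeof(StatusDto));
            Response.ContentType = "application/json";
            serializer.WriteObject(Response.OutputStream, result);
        }
    }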

The object is pretty small-- a guid, a couple short strings, a few ints.

The non-typical case is the initial page load, which does the same thing (WCF request) but injects the response into different parts of the page using asp:literals.

All three machines (2 web servers, one backend) have the same hardware specs. I would expect the backend to do the majority of the work in this situation, since it's managing all the data, doing the lookups, etc. BUT: the load on the backend is much less than the load on the front ends. The backend sits at a nice, level 10-20% CPU load. The front ends run an average of 30%, but they're all over the map, sometimes hitting spikes of 100% for 10 seconds and taking 600ms to serve these very simple pages.

When I run the front-end in profiler (ANTS), it flags the WCF communication as taking 80% of the CPU time. That's the whole call on the .NET-generated WCF proxy.

WCF Setup: the service is fully parallel. I have instancing set to "single" and concurrency set to "multiple". I opened up the maxConnections and listenBacklog on the service to 256. Under heavy strain (500 requests/s) I see about 75 connections open between both front-end servers and the service, so it's not hitting that wall. I have security set to 'none' all around. Bandwidth use is about 1/20th of the potential (4Mb/s on a 100Mb/s network).
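In code terms, that setup corresponds roughly to the following (a sketch only; the contract, DTO, and address are placeholders, and it assumes the netTcp binding described in the clarification further down):

    using System;
    using System.ServiceModel;

    [ServiceContract]
    public interface IBackendService
    {
        [OperationContract]
        StatusDto GetStatus(string id);   // StatusDto is the small DTO from the earlier sketch
    }

    // "Single" instancing + "multiple" concurrency, as described above.
    [ServiceBehavior(InstanceContextMode = InstanceContextMode.Single,
                     ConcurrencyMode = ConcurrencyMode.Multiple)]
    public class BackendService : IBackendService
    {
        public StatusDto GetStatus(string id)
        {
            // Lookup logic elided.
            return new StatusDto();
        }
    }

    public static class BackendHost
    {
        public static ServiceHost Start()
        {
            // Security off, maxConnections/listenBacklog opened up to 256.
            var binding = new NetTcpBinding(SecurityMode.None)
            {
                MaxConnections = 256,
                ListenBacklog = 256
            };

            var host = new ServiceHost(typeof(BackendService));
            host.AddServiceEndpoint(typeof(IBackendService), binding,
                "net.tcp://backend:9000/BackendService");   // placeholder address
            host.Open();
            return host;
        }
    }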

On the client (the web servers), I create a static ChannelFactory for the service. Code to call the service looks like:

service = MyChannelFactory.CreateChannel();
try {
   service.Call();
   service.Close();
} catch {
   service.Abort();
}

(simplified, but you get the basic picture)

What I don't understand is where all this load on the front end is coming from. What's strange about it is that it's never in the 30%-90% range. It's either in panic mode (100%) or doing OK (30% or less). Given the load on the backend, though, I'd expect both of these machines to be 10% or less. Memory use, handles, etc., all seem reasonable.

To add one more wrinkle: when I log how long it takes to service these calls on the backend, I get times consistently less than 15ms (maybe one or two spikes to 30ms every minute). On the front end, these calls can take up to 1s to return. I guess that could be because of the CPU problems, but it seems off to me.

So... does anyone have any ideas on where to look on this kind of thing? I'm running short on things to explore.

Clarification: The WCF service is hosted in a Windows service, and is using a netTcp binding. Also, I have the maxConnections on the client set to 128, FWIW.
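Concretely, the client side is roughly equivalent to this if it were built in code rather than config (a sketch; the address is a placeholder and IBackendService is the contract from the earlier sketch):

    using System.ServiceModel;

    public static class Backend
    {
        // One factory per AppDomain (constructing it is expensive, creating
        // channels from it is cheap); net.tcp, security off, maxConnections = 128
        // as noted above. The address is a placeholder.
        public static readonly ChannelFactory<IBackendService> MyChannelFactory =
            new ChannelFactory<IBackendService>(
                new NetTcpBinding(SecurityMode.None) { MaxConnections = 128 },
                new EndpointAddress("net.tcp://backend:9000/BackendService"));
    }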

+4  A: 

It's hard to say what might be going on, but a wild guess would be that something is hitting a contention point and it's spinning (instead of doing a wait).

By any chance, have you increased the number of allowed outbound HTTP connections to the back-end server on the front-end servers? You can do it through the config file. One common issue I see with WCF clients is that the limit is left at the default value of 2, which severely limits concurrency at the client proxy level.
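For example, the same limit can be raised programmatically at application startup (the value of 48 is purely illustrative, and this limit only affects HTTP-based bindings):

    using System;
    using System.Net;

    public class Global : System.Web.HttpApplication
    {
        protected void Application_Start(object sender, EventArgs e)
        {
            // Default is 2 outbound HTTP connections per host (per the HTTP/1.1
            // spec recommendation); raise it so concurrent proxy calls aren't
            // serialized. Only relevant for HTTP-based bindings, not net.tcp.
            ServicePointManager.DefaultConnectionLimit = 48;
        }
    }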

tomasr
Right-- I get the impression that it's busy-waiting for something, but I can't figure out what. It appears that there is still room for it to make more connections to the service if it needs to. To elaborate a little-- the WCF service is hosted as a Windows service, not as part of IIS. I have maxConnections set to 128 on the binding (which, incidentally, is netTcp).
Moxen
The two things are not related. The problem I'm describing has nothing to do with the server (I agree you made the right changes there), but with the *client*. HttpWebRequest (which WCF uses internally) limits connections to a single remote server to 2 concurrent HTTP connections by default (which is what the HTTP spec recommends). That means that regardless of your server settings, your client might encounter contention anyway.
tomasr
Does the HttpWebRequest connection limit affect netTcp bindings? I don't think this is happening because I do see literally dozens of open connections between the two machines. During a stress test, the connections start at zero and eventually work their way up. And the sockets (according to netstat) are definitely in the 'ESTABLISHED' state.
Moxen
Oops! Didn't see you mention you were using Net.TCP :) In that case, no, it wouldn't apply. The Net.TCP binding does connection pooling internally, so it could be that under higher loads you're hitting the pool's upper limit and the code is spinning while waiting for connections to be released. Some of that connection pooling can be configured, but I seem to remember you have to use a custom binding to do so. Look at TcpTransportBindingElement.ConnectionPoolSettings: http://msdn.microsoft.com/en-us/library/system.servicemodel.channels.tcptransportbindingelement.connectionpoolsettings.aspx
tomasr
I should add that, besides just growing the connection pool size, it might also help to reduce the leaseTimeout in those TCP connection pool settings so that idle connections are released to the pool sooner.
tomasr
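A sketch of what that would look like if the pool were tuned in code (the contract name, address, and numbers are illustrative only):

    using System;
    using System.ServiceModel;
    using System.ServiceModel.Channels;

    public static class PooledBackendClient
    {
        public static ChannelFactory<IBackendService> Create(string address)
        {
            // Start from the existing netTcp settings (security off), then expose
            // the underlying binding elements so the TCP connection pool can be tuned.
            var netTcp = new NetTcpBinding(SecurityMode.None);
            BindingElementCollection elements = netTcp.CreateBindingElements();

            var tcpTransport = elements.Find<TcpTransportBindingElement>();
            tcpTransport.ConnectionPoolSettings.MaxOutboundConnectionsPerEndpoint = 64;   // pool size (default is 10)
            tcpTransport.ConnectionPoolSettings.LeaseTimeout = TimeSpan.FromMinutes(1);   // max lifetime of a pooled connection
            tcpTransport.ConnectionPoolSettings.IdleTimeout = TimeSpan.FromMinutes(1);    // how long an idle connection is kept

            var custom = new CustomBinding(elements);
            return new ChannelFactory<IBackendService>(custom, new EndpointAddress(address));
        }
    }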
+2  A: 

Have you considered and tested for the possibility of external factors?

  • Process recycles?
  • Is dynamic compression enabled?
keithwarren7
First thing I thought of was process recycling.
Cheeso
These machines don't have anything else running on them aside from ASP.NET on the web servers and the one WCF windows service on the backend. The "ASP.NET v2.0.50727/Application Restarts" counter doesn't show that any restarts are going on. And the application pool isn't set to auto-restart after X requests.
Moxen