views:

137

answers:

3

Given a relatively typical .NET 4 system in an SOA environment (e.g. Windows Server 2008 R2, RESTful web services on IIS 7, Windows services for NServiceBus messaging, SQL Server 2008 R2), what are the best practices or de facto solutions (without an enterprise price tag) for 24x7 performance monitoring in production?

I'm less interested in how much CPU/memory/disk I/O the system consumes and more in, for example, how many createAccount() calls per minute were made, what the average execution time of the generateResponse() method is, and how to detect unusual delta spikes between, say, generateResponseStarted and generateResponseComplete (i.e. between the method being invoked, which in turn may call a third party, and the response being ready to return).

After some googling, the options seem to be either low-level profilers (like dotTrace) or implementing performance counters and consuming them with PerfMon or some other OpManager-type product.

What would you recommend? Would implementing performance counters in a live application significantly degrade performance on a production system? If not, are there any good libraries that streamline the implementation in .NET? If so, how do people monitor their applications' performance beyond memory/disk/CPU?
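For reference, a minimal sketch of what implementing a custom Windows performance counter with System.Diagnostics looks like. The category and counter names here are made up for illustration, and category creation requires administrative rights, so it is normally done by an installer rather than by the application itself:

```csharp
using System.Diagnostics;

public static class Metrics
{
    const string Category = "MyApp";                     // hypothetical category
    const string CounterName = "createAccount calls/sec"; // hypothetical counter

    static readonly PerformanceCounter CallsPerSec;

    static Metrics()
    {
        // One-time category registration (needs admin rights; real
        // deployments usually do this at install time, not at startup).
        if (!PerformanceCounterCategory.Exists(Category))
        {
            var counters = new CounterCreationDataCollection();
            counters.Add(new CounterCreationData(
                CounterName,
                "Rate of createAccount() calls",
                PerformanceCounterType.RateOfCountsPerSecond32));

            PerformanceCounterCategory.Create(
                Category, "MyApp application metrics",
                PerformanceCounterCategoryType.SingleInstance, counters);
        }

        CallsPerSec = new PerformanceCounter(Category, CounterName, readOnly: false);
    }

    public static void RecordCreateAccountCall()
    {
        CallsPerSec.Increment();
    }
}
```

Once registered, the counter shows up in PerfMon under the "MyApp" category and can be logged or alerted on like any built-in counter. Counter updates are essentially interlocked increments on shared memory, which is why they are generally considered cheap enough to leave on in production.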


@Ryan Hayes

Thanks, I'm looking for a way to spot unusual slowdowns or spikes on production systems. For example, everything was fine during stress testing, but then the third party we rely on has problems, or the DB slows down due to thread locking, or the SAN is giving way, or some other unexpected scenario occurs. Low-level profiling carries too much overhead, while turning counters on only when there is a problem is already too late, and we'd be missing the historical data to compare against (I would need some sort of alerting for when a delta falls outside an acceptable threshold). I'm wondering how people monitor the performance of their production systems and, in their experience, what the best approach is for monitoring beyond memory/CPU/server metrics.
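One lightweight way to get the delta-threshold alerting described above, without a full APM product, is a rolling-baseline check around the instrumented call. This is only a sketch; the window size and threshold factor are arbitrary and would need tuning:

```csharp
using System.Collections.Generic;

public class SpikeDetector
{
    private readonly Queue<double> window = new Queue<double>();
    private readonly int windowSize;
    private readonly double thresholdFactor;
    private double sum;

    public SpikeDetector(int windowSize = 100, double thresholdFactor = 3.0)
    {
        this.windowSize = windowSize;
        this.thresholdFactor = thresholdFactor;
    }

    // Returns true when a sample exceeds thresholdFactor times the
    // rolling average of the previous samples (once the window is full).
    public bool IsSpike(double elapsedMs)
    {
        bool spike = window.Count == windowSize
                     && elapsedMs > thresholdFactor * (sum / window.Count);

        window.Enqueue(elapsedMs);
        sum += elapsedMs;
        if (window.Count > windowSize)
            sum -= window.Dequeue();

        return spike;
    }
}
```

Usage would be to wrap generateResponse() in a Stopwatch, feed Elapsed.TotalMilliseconds into IsSpike(), and log or raise an alert when it returns true; the same timings can also be written to a performance counter so the history is available in PerfMon.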

A: 

The question here is really what are you trying to learn from the performance monitoring?

  • Do you want to make your code faster? Then I would suggest using the profiling tools on a test environment to find out where you can improve your code.

  • Do you want to find out the maximum beating your system can handle? Then I would suggest performing load testing on a test environment. If you know exactly how hard you can push your system without destroying it, then you won't need to put monitoring into production.

For production, you probably want to maximize performance. To that end, it's common to push a test environment hard and gather solid metrics beforehand, so that you don't need performance monitors in place in production. In production you just want to know when you hit that peak, and then degrade gracefully or whatever you see fit. Generally, good logging is the best way to monitor system performance (beyond hardware) and to keep a record of exceptional performance quirks.

Every system is different though, and your mileage may vary. Take this as a suggestion rather than the way EVERYONE does it, because there are always exceptional cases where you may have to have profiling running in production.

Ryan Hayes
+2  A: 

You can try AlertGrid. It looks like it could be a solution to your problems.

You can send various parameters to AlertGrid from your application (a new account name, the execution time of some important piece of logic, and so on). The AlertGrid service can do a couple of things with your data. First of all, it can evaluate notification rules built on the parameters you've sent (e.g. if the time to do something important exceeds X seconds, send an SMS to the person in charge).

In two weeks AlertGrid is going to have a bunch of new features. The most important one for you will probably be the possibility to plot the parameters received from your system.

Please note that AlertGrid cannot detect parameters in your systems by itself; you need to send them. This might look like an additional piece of work, but we think it is comparable to the time required to install and configure a specialized tool. On the other hand, thanks to this approach AlertGrid overcomes some limitations: it can be integrated with anything that can send HTTP requests.

I believe it will be much easier to understand once you create an AlertGrid account and go through its interactive tutorial.

As you might have noticed, I'm a developer on the AlertGrid team :)

Disclaimer: at the moment of writing we know that AlertGrid's prices are going to be reduced in the near future, so don't look at them right now; you can contact our support line for more information on pricing. A free account is available and should be enough for the beginning.

Lukasz Dziedzia
A: 

We use Nagios for local monitoring (CPU, disk space, etc.) and AlertFox for web transaction monitoring (the "outside view"). Of course, the latter only makes sense if your website (?) is public.

Would implementing performance counters for a live application significantly degrade performance on production system?

We have the Nagios Windows server plugins in place and see no performance issues with them.
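For what it's worth, a typical Nagios service definition using the standard check_nt plugin for the Windows agent looks something like this (the host name and threshold values below are placeholders):

```
define service {
    use                  generic-service
    host_name            appserver01
    service_description  CPU Load
    check_command        check_nt!CPULOAD!-l 5,80,90
}
```

Here the arguments mean: average over 5 minutes, warn at 80% load, critical at 90%. Application-level metrics (like the call rates discussed above) can be exposed to Nagios the same way via custom plugins or by checking the performance counters you publish.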

Ruby8848