I am looking for suggestions on doing some simple monitoring of an ASP.Net web farm as close to real-time as possible. The objectives of this question are to:
- Identify the best way to monitor several Windows Server production boxes during short (minutes long) period of ridiculous load
- Receive near-real-time feedback on a few key metrics about each box. These are simple metrics available via WMI such as CPU, Memory and Disk Paging. I am defining my time constraints as soon as possible with 120 seconds delayed being the absolute upper limit.
- Monitor whether any given box is up (with "up" being defined as responding web requests in a reasonable amount of time)
Here are more details, things I've tried, etc.
- I am not interested in logging. We have logging solutions in place.
- I have looked at solutions such as ELMAH which don't provide much in the way of hardware monitoring and are not visible across an entire web farm.
- ASP.Net Health Monitoring is too broad, focuses too much on logging and is not acceptable for deep analysis.
- We are on Amazon Web Services and we have looked into CloudWatch. It looks great but messages in the forum indicate that the metrics are often a few minutes behind, with one thread citing 2 minutes as the absolute soonest you could expect to receive the feedback. This would be good to have for later analysis but does not help us real-time
- Stuff like JetBrains profiler is good for testing but again, not helpful during real-time monitoring.
- The closest out-of-box solution I've seen is Nagios which is free and appears to measure key indicators on any kind of box, including Windows. However, it appears to require a Linux box to run itself on and a good deal of manual configuration. I'd prefer to not spend my time mining config files and then be up a creek when it fails in production since Linux is not my main (or even secondary) environment.
Are there any out-of-box solutions that I am missing? Obviously a windows-based solution that is easy to setup is ideal. I don't require many bells and whistles.
In the absence of an out-of-box solution, it seems easy for me to write something simple to handle what I need. I've been thinking a simple client-server setup where the server requests a few WMI metrics from each client over http and sticks them in a database. We could then monitor the metrics via a query or a dashboard or something. If the client doesn't respond, it's effectively down.
Any problems with this, best practices, or other ideas?
Thanks for any help/feedback.
UPDATE: We looked into Cloudwatch a bit more and we may focus on trying it out. This forum post is the most official thing I can find. In it, an Amazon representative says that the offical delay window for data is 4 minutes. However, the user says that 2 minute old data is always reliable and 1 minute is sometimes reliable. We're going to try it out and hope it is enough for our needs.