We have developed a .NET web application that uses SQL Server as a backend. Now we would like to provide a monitoring dashboard app for the tech support team. The idea is that this monitoring app will show a global picture of the "health" of the web servers hosting the application and the database servers holding the data. This "health" measure should reflect the workload of each machine, and would be a number (between 0 and 100, let's say) computed from some inputs that I need to determine.

For the web servers, I imagine that HTTP requests per time unit must be considered, and perhaps bandwidth consumed.

For the database servers, I reckon that transactions per time unit and maybe locks or some other indicator of database concurrency should be used.

In addition, some other generic inputs, such as CPU load, memory usage and disk queue length should also be taken into account.

All these factors should be weighed as necessary to obtain the final "health" figure for each server.
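
A minimal sketch of such a weighted figure, assuming each input is first scaled to a 0..1 load fraction. The weights, the saturation limits, and the input names (`cpu_pct`, `memory_pct`, `disk_queue`) are hypothetical placeholders, not recommended values:

```python
# Hypothetical weights and saturation limits; tune these for the real servers.
WEIGHTS = {"cpu_pct": 4, "memory_pct": 3, "disk_queue": 3}
LIMITS = {"cpu_pct": 100.0, "memory_pct": 100.0, "disk_queue": 10.0}


def load_fraction(name, value):
    """Scale a raw reading to 0..1, where 1 means the input is saturated."""
    return min(max(value / LIMITS[name], 0.0), 1.0)


def health(readings):
    """Weighted health from 0 (fully loaded) to 100 (idle)."""
    total_weight = sum(WEIGHTS.values())
    load = sum(WEIGHTS[n] * load_fraction(n, readings[n]) for n in WEIGHTS)
    return round(100.0 * (1.0 - load / total_weight), 1)


print(health({"cpu_pct": 50.0, "memory_pct": 50.0, "disk_queue": 5.0}))  # 50.0
```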

Edit. The idea is that the "health" measure gives the technician a global picture view of a server's workload. If a server appears with low "health", the technician will be able to drill down and look at the details of the machine to see what specific inputs are causing the low "health".

My questions are:

  1. Do you think this "health" measure makes sense?
  2. I am thinking of using performance counters to capture the input data. Is this the best option?
  3. Can you suggest appropriate inputs for the web servers (IIS 7) and the database servers (SQL Server 2008)?

Thanks.

A: 

First of all, I think you are designing a different dashboard from the one you describe. Tech support wants to know whether machines are up or down and what to do when there is a problem.

Requests and transactions per second are useful for capacity planning and/or system and application tuning, not for tech support.

Also, I believe a single figure makes no sense and helps nobody. What would 87.75% mean?

So, I believe you want a dashboard for sysadmins and app developers, where this type of measurement makes sense, to tune the OS or know when to add a new machine or which query is bogging down SQL Server.

That said, performance counters already store much of the information you want to present, so using them does make sense. Additionally, you can use SQL Server traces to collect performance data about the queries. The traces should not run constantly, only at defined intervals.

Now, if you really wanted a dashboard for tech support, two types of monitors would be enough: server up/down and application responsive/unresponsive.

Vinko Vrsalovic
Thank you for your comments. Our tech support team will need to see if everything is running fine. If not, they will need to drill down and see exactly what's happening and with what machine. We would like to offer them a "global picture" view that lets them see the whole array of servers on a single screen (and hence the aggregate "health" measure I am proposing).
CesarGon
See Nagios. A simple status measurement is enough for tech support: green for OK, yellow for "something's up", red for a serious problem. A number is useless.
Vinko Vrsalovic
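
The green/yellow/red idea from the comment above could be sketched roughly as follows. This is not how Nagios itself is implemented; the check names and the warning/critical thresholds are hypothetical, and the only design choice shown is that a server is reported as being as bad as its worst check:

```python
from enum import IntEnum


class Status(IntEnum):
    GREEN = 0   # OK
    YELLOW = 1  # something's up
    RED = 2     # serious problem


# Hypothetical (warning, critical) thresholds per check.
THRESHOLDS = {"cpu_pct": (70.0, 90.0), "disk_queue": (2.0, 5.0)}


def classify(name, value):
    """Map one reading to a colour using its warning/critical thresholds."""
    warn, crit = THRESHOLDS[name]
    if value >= crit:
        return Status.RED
    if value >= warn:
        return Status.YELLOW
    return Status.GREEN


def server_status(readings):
    """Report the worst status among all checks for one server."""
    return max(classify(name, value) for name, value in readings.items())


print(server_status({"cpu_pct": 20.0, "disk_queue": 6.0}).name)  # RED
```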
A: 

Do you think this "health" measure makes sense?

No. The first thing someone will ask if your single number is off is "what's wrong?" Also, consider the fact that trend analysis can be very important for early error detection.

I am thinking of using performance counters to capture the input data. Is this the best option?

I think that would be an excellent starting point.

Can you suggest appropriate inputs for the web servers (IIS 7) and the database servers (SQL Server 2008)?

This is a big subject for a forum post, and the answer depends heavily on the details of your app. In broad terms, you want to look at things like the frequency of error conditions, some measure of throughput for each subsystem, counts of how often out-of-process calls exceed performance thresholds, etc. It's usually a good idea to show current numbers as well as historical values and trends.

You might want to have a look at Microsoft's product in this area, System Center Operations Manager (SCOM), to see the types of things they do.
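
As one possible way to put a number on a trend, here is a small sketch that fits a least-squares line through recent, evenly spaced samples of a counter and reports its slope. The function name and the sample values are made up for illustration:

```python
def slope(samples):
    """Least-squares slope of evenly spaced samples, in units per interval."""
    n = len(samples)
    if n < 2:
        return 0.0  # not enough data to talk about a trend
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


# Hypothetical error counts per minute over the last four minutes.
print(slope([10, 20, 30, 40]))  # 10.0
```

A positive slope on an error counter, or a negative one on throughput, can flag a problem before any single reading crosses a threshold.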

RickNZ
Thanks for your answer, RickNZ. As I said in a previous comment, my plans are to use the "health" measure as a global picture indicator, but let technicians drill down into the details of a particular machine whenever they see a low health value. I am editing my question now to explain this.
CesarGon
Part of the problem with trying to use a single number relates to scaling. Let's say you monitor 10 subsystems, with 10 points for each. If the disk is performing at zero, that looks the same as all 10 subsystems performing at 90%. If you want to keep it simple, maybe just have a red/green bad/good indicator instead of a number?
RickNZ
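
The scaling problem in the comment above is easy to check with a few lines. The two lists below are the two scenarios it describes (ten subsystems, up to 10 points each); the variable names are only for illustration:

```python
def total_score(points):
    """Aggregate health as a plain sum of per-subsystem points."""
    return sum(points)


disk_dead = [0] + [10] * 9   # disk at zero, nine subsystems perfect
all_degraded = [9] * 10      # every subsystem at 90%

print(total_score(disk_dead), total_score(all_degraded))  # 90 90
print(min(disk_dead), min(all_degraded))                  # 0 9
```

The sum cannot tell the two cases apart, while looking at the worst subsystem (the `min`) does, which is what a red/green indicator effectively reports.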
A: 

SQL Server 2008 comes with performance data collection and a data warehouse out of the box; see SQL Server 2008 Data Collections and the Management Data Warehouse. SQL Server 2005 also has a similar Performance Dashboard. I'm not saying you should necessarily use these as your dashboard (although you could), but you should look at these two SQL dashboards to see what the MS team considered important to put in a dashboard.

Remus Rusanu