
What, at a minimum, should an application health-monitoring system do for you (the developer) and/or your boss (the IT Manager) and/or the operations (on-call) staff?

What else should it do above the minimum requirements?

Is monitoring the 'infrastructure' applications (ms-exchange, apache, etc.) sufficient or do individual user applications, web sites, and databases also need to be monitored?

If the latter, what do you need to know about them?

ADDENDUM: thanks for the input, i was really looking for application-level monitoring not infrastructure monitoring, but it is good to know about both

+1  A: 

Minimum: make sure it is running :)

However, some other stuff would be very useful: for example, the CPU load, RAM usage and (in multiuser systems) which user is running what. Also, for applications that access the network, a list of network connections for each app. And (if you have access to the client computers) it would be cool to be able to see the 'window title' of the app - maybe check every 2-3 minutes whether it has changed and save it. A list of files opened by the application could also be very useful, but it is not a must.

Milan Babuškov
For monitoring Apache, Exchange and other common services, take a look at software like Nagios (open source), which already does the whole job. Just install, configure and enjoy.
Milan Babuškov
+1  A: 
  • Whether the application is running.
  • Unusual cpu/memory/network usage.
  • Report any unhandled exceptions.
  • Status of various modules (if applicable).
  • Status of external components (databases, webservices, fileservers, etc.)
  • Number of pending background tasks (if applicable).
  • Maybe track usage of the application and report statistics on the most and least used features, so you know where optimizations are most beneficial.
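
As an illustration of the list above, here is a minimal sketch of an in-process health report. Everything in it is an assumption made for the example: the names `register_check`, `run_checks` and `record_unhandled` are hypothetical, and a real application would plug in its own checks for modules, databases, web services and background queues.

```python
import sys
import traceback
from typing import Callable, Dict, List

# Hypothetical registry: check name -> function returning True when healthy.
CHECKS: Dict[str, Callable[[], bool]] = {}
# Formatted tracebacks of unhandled exceptions recorded so far.
UNHANDLED: List[str] = []


def register_check(name: str, check: Callable[[], bool]) -> None:
    """Add a named check (module status, database reachable, queue depth...)."""
    CHECKS[name] = check


def run_checks() -> Dict[str, str]:
    """Run every registered check; a check that raises is reported as an error."""
    report: Dict[str, str] = {}
    for name, check in CHECKS.items():
        try:
            report[name] = "ok" if check() else "failed"
        except Exception as exc:
            report[name] = f"error: {exc}"
    return report


def record_unhandled(exc_type, exc, tb) -> None:
    """Record an unhandled exception, then defer to Python's default hook."""
    UNHANDLED.append("".join(traceback.format_exception(exc_type, exc, tb)))
    sys.__excepthook__(exc_type, exc, tb)


def install_exception_hook() -> None:
    """Route unhandled exceptions in the main thread through record_unhandled."""
    sys.excepthook = record_unhandled
```

A monitoring agent could call `run_checks()` on a schedule and ship the resulting dictionary, together with anything in `UNHANDLED`, to whatever alerting channel is in use.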
David Thibault
how do you define "unusual"?
Steven A. Lowe
It depends on the application, but basically I'd get the average usage over a specific period (say 5 minutes), and if it's higher than X (90% cpu, 1 gig of memory, 200kbps... these values really depend on the app), report it.
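
A minimal sketch of that averaging idea, assuming samples are fed in by some other part of the system. The class name `RollingThreshold` and its parameters are made up for the example, and the limit values are placeholders that depend on the app, as noted above.

```python
import time
from collections import deque
from typing import Deque, Optional, Tuple


class RollingThreshold:
    """Flag a metric when its average over a time window exceeds a limit."""

    def __init__(self, limit: float, window_seconds: float = 300.0) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self._samples: Deque[Tuple[float, float]] = deque()  # (timestamp, value)

    def add(self, value: float, now: Optional[float] = None) -> bool:
        """Record a sample; return True if the window average exceeds the limit."""
        now = time.monotonic() if now is None else now
        self._samples.append((now, value))
        # Drop samples that have fallen out of the window.
        cutoff = now - self.window_seconds
        while self._samples and self._samples[0][0] < cutoff:
            self._samples.popleft()
        average = sum(v for _, v in self._samples) / len(self._samples)
        return average > self.limit
```

For example, `RollingThreshold(limit=90.0)` could be fed a CPU percentage every few seconds, with a report sent whenever `add` returns `True`.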
David Thibault
+2  A: 

The answer is 'it depends'. Why do you need to monitor? How large is your operations staff? Do you need reporting? What is the application environment? Who cares if the application fails? Who cares if an exception happens? Are any of the errors recoverable? I could ask questions like these for a long time.

David Medinets
please continue...
Steven A. Lowe
[@David Medinets]: as for "why do you need to monitor" the answer is: to be pro-active about support, i.e. to know when something goes wrong immediately, so we can fix it
Steven A. Lowe
+1  A: 

I think this is fairly simple - monitor so that you can be warned early enough before something goes wrong. That means monitor dependencies and the application itself.

It's really hard to provide specifics if you're not going to give details on the application you're monitoring, so I'd say use that as a general rule.

Steve M
my project is a system for monitoring .NET applications - of all types
Steven A. Lowe
+1  A: 

This is such an open ended question, but I would start with physical measurements.
1. Are all the machines I think are hosting this site pingable?
2. Are all the machines that should be serving content actually serving some content? (Ideally this would be checked from an external network.)
3. Is each expected service on each machine running?
3a. Have those services run recently?
4. Does each machine have hard drive space left? (Don't forget the db.)
5. Have these machines been backed up? When was the last time?
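
Two of the checks above can be sketched with the Python standard library alone. This is only an illustration: the function names are hypothetical, and a TCP connection attempt is used as a stand-in for "is the service answering", since a true ICMP ping usually needs elevated privileges.

```python
import shutil
import socket


def disk_free_fraction(path: str = "/") -> float:
    """Return the fraction of free space on the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:  # some virtual filesystems report no capacity
        return 0.0
    return usage.free / usage.total


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A scheduled script might alert when `disk_free_fraction` drops below some chosen value (say 0.1) or when `port_is_open` fails for a machine that should be serving content.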

Once one lays out the physical monitoring of the systems, one can address the checks specific to a system:

1. Can an automated script log in? How long did it take?
2. How many users are live? Have there been a million fake accounts added?
...
These sorts of questions get more nebulous, and can be very system specific. They can also usually be derived reactively when responding to physical measurements: a hard drive fills up, and it turns out the web server logs filled it because a bunch of agents created too many fake users. That kind of thing.

While plan A shouldn't necessarily be reactive, that is how many a site ends up setting up its monitoring system.

Nathan Feger
excellent points, but what about the applications running on each machine?
Steven A. Lowe
+1  A: 

At a minimum you want to know that the system is healthy. What counts as healthy is subjective: is it that the computers are up, that the needed resources exist, that data is flowing through the system, that the data is producing correct results, etc.?

In my project we monitor most of this and then some. It really comes down to the highest level at which you can verify that everything is working. In our case we need to know all the way down to the data output. If you only need to know whether the machines are up, you are spared from trying to show an inexperienced end user what is wrong.

There are also "off the shelf" tools that will do a lot of the hard work for you if you are not looking too deeply into data results. I particularly liked Nagios when I was looking around, but we needed more than it could easily show, so I wrote our own monitoring system. Basically we also watch for "peculiarities" in the system: memory/CPU spikes, etc.

Ryan P
nagios - like many others - monitors only 'infrastructure' applications, not individual applications. What do you need to make sure your users' programs are 'healthy'?
Steven A. Lowe
+2  A: 

thanks everyone for the input, i was really looking for application-level monitoring not infrastructure monitoring, but it is good to know about both

the difference is:

  • infrastructure monitoring would be servers plus MS Exchange Server, Apache, IIS, and so forth
  • application monitoring would be user machines and the specific programs that they use to do their jobs, and/or servers plus the data-moving/backend applications that they run to keep the data flowing

sometimes it's hard to draw the line - an oversimplified definition might be "if your team wrote it, it's an application; if you bought it, it's infrastructure"

i think in practice it is best to monitor both

Steven A. Lowe
+1  A: 

What you need to do is break down the business process of the application and then have the software emit events at major business components. In addition, you'll need to create end-to-end synthetic transactions (e.g. emulating end users clicking on a website). All that data would be fed into a monitoring tool. In the past, I've instrumented applications with JMX, which flowed into Tivoli Monitoring's JMX Adapter, and I've written scripts that implement a "fake user" and pipe the results into Tivoli Monitoring's Script Adapter. Tivoli Monitoring takes the raw data and creates application health and performance charts from it.
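
A rough sketch of the "fake user" idea in Python, not the JMX or Tivoli setup described above. The helper names `run_synthetic` and `emit` are hypothetical, and the transaction itself (logging in, clicking through a page) is left to the caller as a plain function.

```python
import json
import time
from typing import Callable, Dict, Optional


def run_synthetic(name: str, transaction: Callable[[], None]) -> Dict[str, object]:
    """Run one scripted transaction and describe the outcome as an event."""
    start = time.perf_counter()
    error: Optional[str] = None
    try:
        transaction()
        status = "ok"
    except Exception as exc:
        status = "failed"
        error = str(exc)
    duration_ms = round((time.perf_counter() - start) * 1000, 1)
    return {"event": name, "status": status, "error": error,
            "duration_ms": duration_ms}


def emit(event: Dict[str, object], sink: Callable[[str], None] = print) -> None:
    """Serialize the event as one JSON line and hand it to a sink."""
    sink(json.dumps(event))
```

The same `emit` function could be called by the application itself at major business steps, so that real and synthetic events arrive at the monitoring tool in one format. The default sink just prints; a real setup would forward the line to its collector.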

Albert T. Wong
interesting - but I'm not trying to simulate the results, I'm trying to monitor the actual result in real-time
Steven A. Lowe
The monitoring is real time... the emulation part was just to get data flowing into the real time monitoring dashboards.
Albert T. Wong
+2  A: 

Great question.

Some time ago we looked for an application-level monitoring solution for our needs, without any luck. Popular monitoring solutions are mostly aimed at monitoring infrastructure and - in my opinion - they are too complicated for the requirements of most small and mid-sized companies.

We mainly required the following features:

  • alerts - we wanted to know about an incident as fast as possible
  • painless management - a hosted service would be the best
  • visualizations - it's good to know what is going on and to learn something from the data

Because we didn't find a suitable solution, we started to write our own. We ended up with an up-and-running service called AlertGrid. (You can try it for free, of course.)

The idea behind it is to provide an easy way to handle custom monitoring scenarios. The integration API is very simple (one function with two required parameters). At the moment we and others are using it to:

  • monitor scheduled tasks (cron jobs)
  • monitor the execution of entire application logic
  • alert on errors in applications

We are also working on examples of basic infrastructure monitoring using AlertGrid.
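
To illustrate the scheduled-task case in general terms, here is a minimal heartbeat sketch. It is not AlertGrid's API, which is not shown in this thread; the class `HeartbeatMonitor` and its methods are hypothetical. Each job reports in when it runs, and any job that stays silent for longer than its allowed interval is listed as overdue.

```python
import time
from typing import Dict, List, Optional


class HeartbeatMonitor:
    """Track when scheduled jobs last reported in and list the overdue ones."""

    def __init__(self) -> None:
        self._max_interval: Dict[str, float] = {}  # job -> allowed silence (s)
        self._last_seen: Dict[str, float] = {}     # job -> last heartbeat time

    def expect(self, job: str, max_interval: float,
               now: Optional[float] = None) -> None:
        """Register a job and start its clock."""
        self._max_interval[job] = max_interval
        self._last_seen[job] = time.time() if now is None else now

    def beat(self, job: str, now: Optional[float] = None) -> None:
        """Called by the job itself each time it completes a run."""
        self._last_seen[job] = time.time() if now is None else now

    def overdue(self, now: Optional[float] = None) -> List[str]:
        """Return the registered jobs that have been silent for too long."""
        now = time.time() if now is None else now
        return sorted(job for job, limit in self._max_interval.items()
                      if now - self._last_seen[job] > limit)
```

A cron job would call `beat("nightly-backup")` as its last step, and a separate watcher would poll `overdue()` and raise an alert for every name it returns.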
Lukasz Dziedzia