When running any kind of server, there are several resources that one would like to monitor to make sure that the server is healthy. This is especially true when testing the system under load.

Some examples would be CPU utilization, memory usage, and perhaps disk space. What other resources should I be monitoring, and what tools are available to do so?

A: 

I typically watch 'top' and 'tail -f /var/log/auth.log'.

Geoffrey Chetwood
+6  A: 

As many as you can afford to collect and then graph, understand, and review. Monitoring resources is useful not only for capacity planning but also for anomaly detection, and anomaly detection significantly improves your ability to detect security events.

You have a decent start with your basic graphs. I'd also want to monitor the number of threads, number of connections, network I/O, disk I/O, page faults (arguably related to memory usage), and context switches.
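As a rough illustration of where a couple of these numbers come from on Linux, here is a minimal sketch that pulls cumulative counters out of text in the `/proc/stat` or `/proc/vmstat` format and turns two samples into per-second rates. The function names are my own, and the idea of sampling twice and dividing by the interval is an assumption about how you would want to graph them, not something any particular tool prescribes.

```python
def parse_counters(text, wanted):
    """Return {name: value} for 'name value ...' lines whose name is in wanted.

    Works on the contents of /proc/stat (e.g. 'ctxt', 'processes') and
    /proc/vmstat (e.g. 'pgfault', 'pgmajfault') on Linux.
    """
    found = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] in wanted:
            found[parts[0]] = int(parts[1])
    return found


def rate_per_second(before, after, interval_seconds):
    """Convert two samples of cumulative counters into per-second rates."""
    return {
        name: (after[name] - before[name]) / interval_seconds
        for name in before
        if name in after
    }
```

To use it on a live box you would read `/proc/stat` twice, a few seconds apart, and pass both parsed samples to `rate_per_second`.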

I really like munin for graphing things related to hosts.

Daniel Papasian
A: 

In addition to top and auth.log, I often look at mtop, enable MySQL's slow query log, and watch mysqldumpslow.

I also use Nagios to monitor CPU, memory, and logged-in users (on a VPS or dedicated server). The last one lets me know when someone other than me has logged in.

cori
+1  A: 

Use "df -h" to make sure that no partition fills up, which can lead to all kinds of funky problems. Watching the syslog is of course also useful; for that I recommend installing "logwatch" (Logwatch Website) on your server, which sends you an email if weird things start showing up in your syslog.
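If you would rather have a script do the watching, something along these lines could run from cron. It is only a sketch: the 90% threshold and the function names are assumptions of mine, and the percentage can differ slightly from what `df` prints because `df` accounts for reserved blocks.

```python
import shutil


def percent_used(total_bytes, used_bytes):
    """Used space as a percentage of total; 0.0 for an empty or pseudo filesystem."""
    if total_bytes <= 0:
        return 0.0
    return 100.0 * used_bytes / total_bytes


def nearly_full(mount_points, threshold_percent=90.0):
    """Return [(mount_point, percent)] for mounts at or above the threshold."""
    flagged = []
    for mount in mount_points:
        usage = shutil.disk_usage(mount)  # named tuple: total, used, free
        pct = percent_used(usage.total, usage.used)
        if pct >= threshold_percent:
            flagged.append((mount, round(pct, 1)))
    return flagged
```

For example, `nearly_full(["/", "/var"])` would return an empty list while both partitions are below 90% (assuming those mount points exist on your machine).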

tante
+1  A: 

Cacti is a good web-based monitoring/graphing solution. Very complete, very easy to use, with a large userbase including many large Enterprise-level installations.

If you want more 'alerting' and less 'graphing', check out Nagios.

As for 'what to monitor', you want to monitor at both the system and application level, so yes: network, memory, disk I/O, interrupts and such at the system level. The application level gets more specific, so a webserver might measure hits/second, errors/second (non-200 responses), etc., and a database might measure queries/second, average query fulfillment time, etc.
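To make the webserver example concrete, here is a hedged sketch that derives hits/second and errors/second from access log lines. It assumes the log is in Common Log Format (timestamp in square brackets, quoted request, then the status code) and it counts every non-200 response as an error, as described above; you may prefer to count only 4xx/5xx. The regex and function name are illustrative, not taken from any tool.

```python
import re
from datetime import datetime

# Matches: [10/Oct/2000:13:55:36 -0700] "GET / HTTP/1.0" 200
LOG_LINE = re.compile(r'\[(?P<ts>[^\]]+)\] "[^"]*" (?P<status>\d{3})')


def request_rates(lines):
    """Average hits/second and errors/second over the time span of the lines."""
    timestamps = []
    errors = 0
    for line in lines:
        match = LOG_LINE.search(line)
        if match is None:
            continue  # skip lines that are not in the assumed format
        timestamps.append(
            datetime.strptime(match.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
        )
        if match.group("status") != "200":
            errors += 1
    if not timestamps:
        return {"hits_per_second": 0.0, "errors_per_second": 0.0}
    span = (max(timestamps) - min(timestamps)).total_seconds()
    seconds = max(span, 1.0)  # avoid dividing by zero for a single-second burst
    return {
        "hits_per_second": len(timestamps) / seconds,
        "errors_per_second": errors / seconds,
    }
```

Feeding it the last few minutes of a log at a regular interval gives you a number you can graph or alert on.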

pjz
A: 

Network, of course :) Use MRTG to get some nice bandwidth graphs. They're just pretty most of the time, until a spammer finds a hole in your security and traffic suddenly increases.

Nagios is good for alerting as mentioned, and is easy to set up. You can then use the mrtg plugin to get alerts for your network traffic too.

I also recommend ntop as it shows where your network traffic is going.

A good link to get you going with Munin and Monit: link text

gbjbaanb
+3  A: 

I use Zabbix extensively in production, which comes with a stack of useful defaults. Some examples of the sorts of things we've configured it to monitor:

  • Network usage
  • CPU usage (% user, system, nice times)
  • Load averages (1m, 5m, 15m)
  • RAM usage (real, swap, shm)
  • Disc throughput
  • Active connections (by port number)
  • Number of processes (by process type)
  • Ping time from remote location
  • Time to SSL certificate expiry
  • MySQL internals (query cache usage, num temporary tables in RAM and on disc, etc)

Anything you can monitor with Zabbix, you can also attach triggers to, so it can restart failed services or page you about problems.

Collect the data now, before performance becomes an issue. When it does, you'll be glad of the historical baselines, and of being able to show the date and time problems started when you need to hunt down and punish exactly which developer made bad changes :)
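This is not how Zabbix does it internally, but as a sketch of what a check like "time to SSL certificate expiry" plus a trigger boils down to, the following uses Python's standard `ssl` module. The 30-day warning window and the function names are assumptions for illustration.

```python
import socket
import ssl

SECONDS_PER_DAY = 86400.0


def days_until_expiry(not_after, now_epoch):
    """Days between now_epoch and a certificate's notAfter string.

    not_after uses the format returned by getpeercert(),
    e.g. "Jan  5 09:34:43 2018 GMT".
    """
    return (ssl.cert_time_to_seconds(not_after) - now_epoch) / SECONDS_PER_DAY


def should_alert(days_left, warn_days=30):
    """A trivial 'trigger': fire once the certificate is within warn_days of expiry."""
    return days_left <= warn_days


def fetch_not_after(host, port=443, timeout=5.0):
    """Fetch the notAfter field of a server's certificate.

    With the default context, an already expired certificate fails the
    handshake and raises instead of returning a date.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

In practice you would call `fetch_not_after` for each host you care about, pass the result and `time.time()` to `days_until_expiry`, and page someone when `should_alert` returns true.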

Jon Topper
+1  A: 

Beware the aforementioned slow query log in MySQL. It should only be used when trying to figure out why some queries are slow. It has the side-effect of making ALL your queries slow while it's enabled. :P It's intended for debugging, not logging.

Think 'passive monitoring' whenever possible. For instance, sniff the network traffic rather than monitor it from your server -- have another machine watch the packets fly back and forth and record statistics about them.

(By the way, that's one of my favorites -- if you watch connections being established and note when they end, you can find a lot of data about slow queries or slow anything else, without putting any load on the server you care about.)
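A rough sketch of that idea, assuming the sniffing machine has already reduced the capture to a list of (timestamp, connection id, "open" or "close") events. That event format is hypothetical; how you extract it from the packets (e.g. from SYN and FIN packets) is up to your capture tool.

```python
def connection_durations(events):
    """Map connection id -> seconds between its 'open' and 'close' events.

    events: iterable of (timestamp_seconds, connection_id, kind) tuples,
    where kind is 'open' or 'close'. Connections that never close, or
    closes without a matching open, are ignored.
    """
    opened = {}
    durations = {}
    for timestamp, conn_id, kind in sorted(events, key=lambda event: event[0]):
        if kind == "open":
            opened[conn_id] = timestamp
        elif kind == "close" and conn_id in opened:
            durations[conn_id] = timestamp - opened.pop(conn_id)
    return durations


def slow_connections(durations, threshold_seconds):
    """Connection ids that stayed open for at least threshold_seconds."""
    return sorted(
        conn_id
        for conn_id, seconds in durations.items()
        if seconds >= threshold_seconds
    )
```

All of this runs on the watching machine, so the server being measured does no extra work.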

JBB
+1  A: 

I ended up using dstat, which is vmstat's nicer-looking cousin.

This will show most everything you need to know about a machine's health, including:

  • CPU
  • Disk
  • Memory
  • Network
  • Swap
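For the curious, the CPU figure such tools report on Linux can be approximated from two samples of the aggregate `cpu` line in `/proc/stat`. This is a minimal sketch with names of my own choosing, not dstat's actual code; it treats idle and iowait time as "not busy" and only sums the first eight fields, since guest time is already included in user time.

```python
def cpu_times(stat_text):
    """Return the counters from the aggregate 'cpu' line of /proc/stat text.

    Field order: user, nice, system, idle, iowait, irq, softirq, steal, ...
    """
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def busy_percent(before, after):
    """Percentage of CPU time spent busy between two cpu_times() samples."""
    deltas = [b - a for a, b in zip(before[:8], after[:8])]
    total = sum(deltas)
    if total == 0:
        return 0.0
    idle = deltas[3] + (deltas[4] if len(deltas) > 4 else 0)  # idle + iowait
    return 100.0 * (total - idle) / total
```

Sampling once a second and printing `busy_percent` gives a crude version of the first column group dstat shows.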
oneself