I'm currently monitoring a large network with Hobbit and have been tasked with reducing the number of false (or at least irrelevant) alarms. At the top of my list are the "http" and "conn" tests, initiated by bbtest-net. This command checks ping, ssh, etc., and if, for instance, a ping times out, it immediately sets the status to red. One minute later, the bbretest command kicks in, checks all the newly reddened hosts, and finds them green again. This happens all the time, and it clutters up my log.

Is there any way to make Hobbit report a red status only AFTER bbretest has run for the first time?

+2  A: 

First, this is a programming site, so you won't get many answers.

But... but...

If your server times out, isn't that a problem?

Sounds to me like Hobbit does the job it is designed for: Telling you that you have something that needs your attention.

Fix the timeout problem, and your log should be fine.

Lasse V. Karlsen
A fair point, but a connection check can time out for different reasons. Servers are not *required* to respond to ping, for instance. I do want to know when my servers are down, but I can wait one minute for the retest until I'm told. :)
Ace
Monitoring is part of programming a large system; monitoring often requires considerable programming effort - we spend quite a lot of time creating monitors for our SaaS app.
MarkR
A: 

I think your best bet is to shun the stock Hobbit service tests and write your own. It's not difficult.

It is a good idea for your test script not to go red unless several successive attempts fail.
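As a rough illustration of the "several successive attempts" idea, here is a minimal Python sketch. It is not part of Hobbit: `status_after_retries` and the `probe` callable are hypothetical names, and how the resulting colour gets reported back to the Hobbit server is left out entirely.

```python
import time
from typing import Callable


def status_after_retries(probe: Callable[[], bool],
                         attempts: int = 3,
                         delay: float = 0.0) -> str:
    """Return "green" as soon as one probe succeeds.

    Return "red" only if every one of `attempts` successive probes fails.
    `probe` is a hypothetical callable that returns True when the host
    or service answered (for example, a wrapper around a ping or HTTP check).
    """
    for attempt in range(attempts):
        if probe():
            return "green"
        # Wait before retrying, but not after the final failed attempt.
        if attempt < attempts - 1:
            time.sleep(delay)
    return "red"
```

With something like this, a single timed-out ping followed by a successful retry stays green, and only a host that fails every attempt in a row goes red.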

You can disable the standard Hobbit ones and use your own instead. Having said that, the default behaviour of the "conn" test seems fairly reasonable (going red immediately if the server doesn't ping).

Unfortunately, there's no option in the Hobbit alerting system to alert only if a problem persists for X minutes. That would be really useful, but I'm sure you could do that as well with a custom alerting script.
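The core of such a custom alerting script could look something like the sketch below. It assumes the script is handed the current colour and a timestamp each time the test runs; `PersistentAlert` and `should_alert` are hypothetical names made up for illustration, not anything Hobbit provides.

```python
from typing import Optional


class PersistentAlert:
    """Alert only once a status has stayed red for `hold_seconds`."""

    def __init__(self, hold_seconds: float) -> None:
        self.hold_seconds = hold_seconds
        # Time at which the current unbroken run of red results began.
        self.red_since: Optional[float] = None

    def should_alert(self, color: str, now: float) -> bool:
        if color != "red":
            # Any non-red result resets the timer.
            self.red_since = None
            return False
        if self.red_since is None:
            self.red_since = now
        return now - self.red_since >= self.hold_seconds
```

With `hold_seconds=60`, a host that goes red and recovers on the retest a minute later never triggers an alert, while a host that is still red 60 seconds after it first went red does.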

MarkR
A: 

For a server that doesn't respond to ping, you can use the following in bb-hosts:

<ip> <hostname> # noconn

Then test its aliveness through a service.