Hi,

I'm building a system for monitoring several large web sites (resources), using distributed web services controlled by a central controller.

I'm coming to a specific part of the design - the actual reporting of resources that are thought to have fallen over.

My problem is that there is always the chance that the monitor itself is at fault, or has lost its network connection to a resource, while the resource is actually fine. I don't want to report issues that are not really there.

My plan at the moment is to have the monitor request that all other monitors check the resource whenever it encounters a problem, and then decide whether the resource has really fallen over based on the collective results.
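The plan above amounts to a majority vote among monitors. A minimal sketch of that idea in Python, assuming each monitor can probe a resource over HTTP and report a boolean; all names here (`check_resource`, `confirmed_down`, `PEERS`) are illustrative, not part of any real API:

```python
# Peer-confirmation sketch: when this monitor sees a resource fail, it asks
# every other monitor to probe the same resource, then only raises an alert
# if a majority of all voters (itself included) agree the resource is down.

import urllib.request
import urllib.error

# Hypothetical peer monitor endpoints.
PEERS = ["http://monitor-b.example.com", "http://monitor-c.example.com"]

def check_resource(url, timeout=5):
    """Return True if the resource answers with an HTTP 2xx/3xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError):
        return False

def confirmed_down(peer_saw_down):
    """Majority vote over this monitor's own failed check plus peer results.

    peer_saw_down is a list of booleans, one per peer: True means that peer
    also saw the resource as down. This monitor's own True vote is included,
    and an outage is reported only when more than half of all voters agree.
    """
    votes = [True] + list(peer_saw_down)   # True = "saw it down"
    return sum(votes) > len(votes) / 2
```

With two peers, `confirmed_down([True, False])` reports an outage (2 of 3 votes), while `confirmed_down([False, False])` suppresses it, treating the lone failure as a probable local fault.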

I'm sure there's someone out there with more experience of this type of programming than myself.

Is there a common solution to this type of problem? Is my solution a decent way of looking at this?

+1  A: 

Your solution is one of the only pragmatic ones.

There is nothing new under the sun. The IETF Routing Information Protocol wasn't the first attempt at addressing this problem, but it is well documented and works.

Note well that there is no optimal (or perfect) solution to the class of problems you are facing: the best you can do with in-band monitoring is make good guesses about where the fault is. In systems that need a very high degree of accuracy in fault information (e.g. the public switched telephone network), a parallel out-of-band monitoring network is established, which itself must necessarily be monitored by humans.

msw
Thanks. That's a reassuring answer, even if the solution is tricky!
BombDefused