views:

389

answers:

4

I can think of a few hacks using ping, the box name, and the HA shared name, but I think they all lead to data leakage.

Should a box even know it's part of an HA cluster, or what that cluster's name is? Is this more a function of DNS? Is there some API exposed for boxes to join an HA cluster and request the id of the currently active node?

I want to differentiate between the inactive node and the active node in the alerting mechanisms of a running program. If the active node is alerting I want to hit a pager; on the inactive node I want to send an email. Pushing the determination into the alerting layer just moves the same problem elsewhere.

EASY SOLUTION: Polling the server from an external agent that connects through the network makes any shell game of who is the active node a moot point. To clarify: the only thing that will page is the remote agent monitoring the real server. Each box can send emails all day long for all I care.
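That external-agent approach can be sketched like this; everything here is illustrative (the shared address, the ping counts, and what you do on failure are all placeholders), with 127.0.0.1 as the default only so the example is runnable:

```shell
#!/bin/sh
# External-monitor sketch: probe the cluster's shared address from a box
# outside the cluster, and only page when nobody answers on it.
# CLUSTER_ADDR is a placeholder for the HA pair's shared address.
CLUSTER_ADDR="${CLUSTER_ADDR:-127.0.0.1}"

probe_cluster() {
    # two pings, two-second timeout: succeeds iff something holds the address
    ping -c 2 -W 2 "$1" >/dev/null 2>&1
}

if probe_cluster "$CLUSTER_ADDR"; then
    STATUS=up      # active node answering: emails at most
else
    STATUS=down    # shared address dead: this is the case that pages
fi
echo "cluster: $STATUS"
```

Run from cron on the monitoring box, this never has to know which physical node is active, only that *something* is answering on the shared address.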

A: 

One way is to get the box to export its idea of whether it is active into your monitoring. From there you can predicate paging/emailing on this status (with a race condition around failover), and alert when none, or too many, of the systems believe they are active.
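A minimal sketch of that predicate, assuming (my assumption, not part of the answer) that your failover scripts write the box's role into a status file such as the hypothetical `/var/run/ha-role`:

```shell
#!/bin/sh
# Map the box's self-reported HA role to an alert channel.
choose_channel() {
    case "$1" in
        active)   echo pager ;;  # active node: wake somebody up
        inactive) echo email ;;  # standby: quiet notification
        *)        echo pager ;;  # unknown/missing state: fail noisy
    esac
}

# /var/run/ha-role is hypothetical -- assume the failover scripts rewrite
# it with "active" or "inactive" on every state change.
ROLE=$(cat "${ROLE_FILE:-/var/run/ha-role}" 2>/dev/null || echo unknown)
echo "alert via: $(choose_channel "$ROLE")"
```

Treating an unknown role as pageable is deliberate: a box that has lost track of its own state is exactly the one you want to hear about.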

Another option is to monitor the active system via a DNS alias (or some other method to address the active system) and page on that. Then also monitor all the systems, both active and inactive, and email on that. This will cause duplicate alerts for the active system, but that's probably okay.

It's hard to be more specific without knowing more about your setup.

The box itself should have no knowledge of whether it is active. I don't want to visit all the nodes when I fail over. The scenario is simple and pervasive where I work: a box performs some critical business function, and just in case something goes wrong it has a partner. Upgrades occur on the inactive node.
ojblass
Try the second method I suggest, presumably a failover includes a DNS update of some form. I take it this is a stateless service such as apache? Any reason not to use active-active?
Active-active is more of a load-balancing approach. Updating machines while they are active does not let you test and/or upgrade without possibly impacting production systems. Some of the paired boxes are part of an active-active scenario, but that is another story.
ojblass
We have people that do the failover, and I guess speaking to them is the right thing to do. The decision of which box a request flows to is a function of the network. I still feel the box has to be ignorant of its participation in the cluster.
ojblass
A: 

As a rule, the machines in an HA cluster shouldn't really know which one is active. There's one exception, mind, and that's cronjobs. At work, we have an HA cluster on top of which some rather important services run. Some of those services have cronjobs, and we only want them running on the active box. To do that, we use this shell script:

#!/bin/sh
# Run the wrapped command only when this box holds the cluster's shared IP,
# i.e. only on the active node.
HA_CLUSTER_IP=0.0.0.0
if ip addr | grep -qw "$HA_CLUSTER_IP"; then
    eval "$@"
fi

(Note that this is running on Debian.) What this does is check to see if the current box is the active one within the cluster (replace 0.0.0.0 with the external IP of your HA cluster), and if so, executes the command passed in as arguments to the script. This ensures that one and only one box is ever actually executing the cronjobs.

Other than that, there's really no reasons I can think of why you'd need to know which box is the active one.

UPDATE: Our HA cluster uses Heartbeat to assign the cluster's external IP address as a secondary address to the active machine in the cluster. Programmatically, you can check to see if your machine is the current active box by calling gethostbyname(), and iterating over the data returned until you either get to the end or you find the cluster's IP in the list.
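That check - is the cluster's shared IP currently configured as a (secondary) address on this machine - can also be done from shell rather than via gethostbyname(). A sketch, with 127.0.0.1 standing in for the cluster IP purely so the example runs anywhere, and a fallback to ifconfig for boxes without iproute2:

```shell
#!/bin/sh
# is_active: succeed iff the given address is configured on some local
# interface, i.e. this box currently holds the cluster's shared IP.
is_active() {
    { ip -o addr show 2>/dev/null || ifconfig -a 2>/dev/null; } \
        | grep -qw "$1"
}

# CLUSTER_IP is a placeholder for your cluster's external address.
if is_active "${CLUSTER_IP:-127.0.0.1}"; then
    echo "this box holds the cluster address"
else
    echo "standby (or the address is on the other node)"
fi
```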

Keith Gaughan
Alright, consider my cron job to be something that determines whether a page or an email should be sent out. If whatever manages the associations decided to put this box into another cluster, I would be visiting this script often... would I not?
ojblass
Ah. Fair point. What you're looking to do is check whether the external IP of your HA cluster is a secondary address on the machine you're checking on.
Keith Gaughan
Is a heartbeat some sort of roundtrip mechanism? I think that gethostbyname is still data leakage to the nodes. I will look at Heartbeat-related stuff, because from what I think you are saying, Heartbeat could be a roundtrip operation.
ojblass
I'd really have to talk to the sysadmin who set those elements of the HA cluster up. I'll probably see him in the pub tomorrow, so if I get a chance, I'll ask him. I've added a link to the Heartbeat homepage, if it helps.
Keith Gaughan
I was actually thinking of how cool a website could be for you to buy a drink for someone far far away!
ojblass
Hmmm... beer over IP... :-)
Keith Gaughan
+2  A: 

It really depends on the HA system you're using.

For example, if your system uses a shared IP and the traffic is managed by some hardware box, then it can be hard to determine whether a given box is the master or the slave. That really depends on the specific solution... As long as you can add a custom script to the supervisor, you should be OK - for example, the controller can ping a daemon on the master server every second, and the alerting script simply checks whether the last ping was less than 2 seconds ago.
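That ping-freshness check can be sketched as follows, assuming (my assumption, not the answer's) that a daemon touches a timestamp file each time the controller's ping arrives; the path and threshold are placeholders, and `stat -c %Y` is GNU/Linux-specific:

```shell
#!/bin/sh
# Alert when the controller's last ping has gone stale.  HEARTBEAT_FILE is
# hypothetical -- assume a daemon touches it on every ping received.
HEARTBEAT_FILE="${HEARTBEAT_FILE:-/var/run/ha-heartbeat}"
MAX_AGE="${MAX_AGE:-2}"   # seconds, per the "< 2 sec" rule above

heartbeat_fresh() {
    now=$(date +%s)
    last=$(stat -c %Y "$1" 2>/dev/null || echo 0)  # mtime, 0 if missing
    [ $((now - last)) -le "$MAX_AGE" ]
}

if heartbeat_fresh "$HEARTBEAT_FILE"; then
    echo "master alive"
else
    echo "master silent - alert"
fi
```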

If your system doesn't have a supervisor/controller node, but each node tries to determine the state itself, you can have more problems. If a split brain occurs, you can end up with two slaves or two masters, so your alerting software will be wrong in both cases. Mechanisms that ensure only one live node (STONITH and the like) can help.

On the other hand, in the second scenario, if the HA software works on both hosts properly, you should be able to obtain the master/slave information straight from it. It has to know its own state at any time, because it's one of its main functions. In most HA solutions you should be able to either get the current state, or add some code to run when the state changes. Heartbeat offers both.

I wouldn't worry about the edge cases like a split brain though. Almost any situation when you lose connection between the clustered nodes will be more important than the stuff that happens on the separate nodes :)

If the thing you care about is really logging/alerting only, then ideally you could have a separate logger box which gets all the information about the current network/cluster status. An external box will probably have a better idea of how to deal with the situation. If your cluster gets DoS'ed, disconnected from the network, or loses power, you won't get any alert from it. A redundant pair of independent monitors can save you from that.

I'm not sure why you mentioned DNS - due to its refresh time it shouldn't be a source of any "real-time" cluster information.

viraptor
Understanding that corner cases are of little interest to me here is key. Also I am able to host the monitoring of the solution on the hardware managing the cluster itself. The information you provided me led me to a decent solution in under 4 hours. Thank you... enjoy your bounty!
ojblass
i am still in awe of the simplicity of it.
ojblass
A: 

Without hard-coding.... ? I assume you mean some native heartbeat query; not sure. However, you could use ifconfig: HA creates a virtual interface on whatever interface it is configured to run on. For instance, if HA was configured on eth0 then it would create a virtual interface of eth0:0, but only on the active node.

Therefore you could do a simple query of the ifconfig output to determine whether the server was the active node or not; for example, if eth0 was the configured interface:

ACTIVE_NODE=`ifconfig | grep -c 'eth0:0'`

That will set the $ACTIVE_NODE variable to 1 (for active) and 0 (if standby). Hope that may help.
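One caveat, not in the answer above: on newer Linux systems ifconfig may be absent, and addresses added via iproute2 don't necessarily show an `eth0:0` alias in ifconfig's format. Assuming Heartbeat still attaches that label to the shared address, an equivalent check with `ip` would be:

```shell
# Same check with iproute2 instead of ifconfig: count addresses carrying
# the eth0:0 label (the label Heartbeat applies to the shared address).
ACTIVE_NODE=$(ip -o addr show label 'eth0:0' 2>/dev/null | wc -l)
echo "$ACTIVE_NODE"
```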

http://www.of-networks.co.uk

earthgecko