We have two industrial controllers that are used to control critical systems. The idea is that on failure of one controller, the other controller will automatically take over. To ensure the switchover is seamless, the standby controller must mirror the state of the online controller at all times.

We have a solution, but it is poorly coded and documented. Is there a common design pattern for such a system, or open source software that does something similar, that we could use to build a generic solution? It should work for controllers or PCs and be extensible so that any number of controllers can act as standbys.

A: 

Maybe a shared SQLite database or something similar?

Blindy
+1  A: 

One approach is "cache coherence". Commercial products -- Tangosol, for example -- do this.

Another approach is a light-weight version of an Enterprise Service Bus (ESB) or Service Oriented Architecture (SOA). Almost all the SOA vendors have products for this. I'd start with Tibco, which has a lightweight component set that you can use for this.

Since SOA isn't that hard, you can roll your own over HTTP: the online controller POSTs its status to its shadow controllers.
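A minimal sketch of the roll-your-own HTTP idea, using only the Python standard library. The names `ShadowController` and `push_state`, the `/state` path and the JSON snapshot format are all assumptions made for illustration; nothing here comes from a particular product, and a real controller would also need authentication, sequencing and retry handling.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class ShadowController:
    """Standby side (hypothetical): accepts state snapshots over HTTP POST."""

    def __init__(self, host="127.0.0.1", port=0):
        self.state = {}
        self.lock = threading.Lock()
        shadow = self

        class Handler(BaseHTTPRequestHandler):
            def do_POST(self):
                # Read the JSON body and replace the mirrored state with it.
                length = int(self.headers.get("Content-Length", 0))
                snapshot = json.loads(self.rfile.read(length))
                with shadow.lock:
                    shadow.state = snapshot
                self.send_response(204)
                self.end_headers()

            def log_message(self, *args):
                pass  # keep the example quiet

        # port=0 lets the OS pick a free port; the real port is read back below.
        self.server = HTTPServer((host, port), Handler)
        self.url = "http://%s:%d/state" % self.server.server_address[:2]
        threading.Thread(target=self.server.serve_forever, daemon=True).start()

    def close(self):
        self.server.shutdown()
        self.server.server_close()


def push_state(state, shadow_urls, timeout=1.0):
    """Online side: POST the full state to every shadow; return URLs that failed."""
    body = json.dumps(state).encode("utf-8")
    # An empty ProxyHandler makes the controllers talk to each other directly.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    failed = []
    for url in shadow_urls:
        request = urllib.request.Request(
            url,
            data=body,
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        try:
            with opener.open(request, timeout=timeout):
                pass
        except OSError:  # URLError and socket errors are OSError subclasses
            failed.append(url)
    return failed
```

The online controller would call `push_state(current_state, urls)` after each state change. Sending the whole snapshot each time keeps the shadow logic trivial, at the cost of bandwidth when the state is large.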

S.Lott
A: 

There is the standard Master-Slave pattern used by almost all DBMSs that support clustering, distributed architectures and replication (http://en.wikipedia.org/wiki/Database_replication).

So, very basically, in your situation you could have the master machine maintaining state while the slave does nothing except update its own state from that of the master. If the master goes down, the slave sees that the master is no longer there and takes over control of the state. The master is only used again once it has updated its own state from that of the slave, which has maintained state while the master was inactive.
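The slave's side of this arrangement can be sketched roughly as follows. This is an illustration under stated assumptions, not a reference implementation: the `Standby` class, the idea that every state update doubles as a heartbeat, and the timeout value are all hypothetical choices.

```python
import time


class Standby:
    """Slave side (hypothetical): mirrors the master and promotes itself on silence."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s      # assumed: how long the master may stay silent
        self.clock = clock              # injectable so the logic can be tested
        self.state = {}
        self.active = False
        self.last_seen = clock()

    def on_update(self, state):
        """Called for every state update from the master; also counts as a heartbeat."""
        self.state = dict(state)
        self.last_seen = self.clock()

    def poll(self):
        """Call periodically; takes over once the master has been silent too long."""
        if not self.active and self.clock() - self.last_seen > self.timeout_s:
            self.active = True
        return self.active

    def hand_back(self):
        """Master is returning: give it the state kept here, then step down."""
        self.active = False
        self.last_seen = self.clock()
        return dict(self.state)
```

A timeout alone cannot tell a dead master from a broken link between the two units, so a real deployment needs some extra safeguard against both units becoming active at once.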

Kazar
+1  A: 

There is a difference between failover and transparent failover. Do you really have requirements for transparent failover? If so, you're going to end up paying for it (in both cost and complexity).

That being said, take a look at this post on Buddy Replication for an elegant solution to the problem.

cwash
A: 

The traditional approach taken in controlling realtime critical systems is to run the two units in lockstep. Tandem have been building some very impressive fault-tolerant machines using this technique for years.

However, lockstep is very much a hardware-level solution; I don't think you could implement classic lockstep purely at the software level. Or at least, not straightforwardly. Maybe using state machines synchronised by exchange of vector clocks or something equally propeller-headed?

Tom Anderson
A: 

There is an analogous situation with the space shuttle computers. In that situation, they used 5 computers and if one machine was late or different from the others, it was (in essence) voted off of the island.
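The voting idea can be sketched as a simple majority vote over the outputs of each unit for one control cycle. The function name `vote` and the convention that a late unit reports `None` are assumptions for this example; it is not a description of how the shuttle software actually worked.

```python
from collections import Counter


def vote(outputs):
    """Majority vote over one cycle.

    outputs maps a unit name to its (hashable) output, or None if it was late.
    Returns (agreed_value, sorted list of units that were late or disagreed).
    """
    counts = Counter(v for v in outputs.values() if v is not None)
    if not counts:
        raise RuntimeError("no unit produced an output")
    value, n = counts.most_common(1)[0]
    # Require a strict majority of all units, not just of those that answered.
    if n <= len(outputs) // 2:
        raise RuntimeError("no majority")
    outvoted = sorted(unit for unit, v in outputs.items() if v != value)
    return value, outvoted
```

With only two units, a disagreement produces no majority, which is why a pair of controllers needs some other way to decide which one has failed.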

In your situation, how do you determine which controller has gone bad? Is the machine that makes that decision itself a single point of failure?

What kind of communication link is available between the two controllers? Shared memory, Ethernet, or something even slower?

How fast does state information change between the two?

Is it possible to feed identical information to both controllers and would both controllers calculate the same state transitions?
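If the answer to that last question is yes, one possible check is to feed both controllers the same event stream and compare a digest of their states. The sketch below assumes a toy `Controller` with a made-up level/pump rule; the event format and the 80 threshold are purely illustrative.

```python
import hashlib
import json


class Controller:
    """Toy deterministic controller (hypothetical): state depends only on inputs."""

    def __init__(self):
        self.state = {"level": 0, "pump_on": False}

    def apply(self, event):
        """Apply one (kind, value) input event; no clocks or randomness involved."""
        kind, value = event
        if kind == "level":
            self.state["level"] = value
            self.state["pump_on"] = value > 80  # assumed rule for illustration
        elif kind == "reset":
            self.state = {"level": 0, "pump_on": False}

    def digest(self):
        """Stable hash of the state, cheap to exchange and compare between units."""
        encoded = json.dumps(self.state, sort_keys=True).encode("utf-8")
        return hashlib.sha256(encoded).hexdigest()
```

Exchanging digests rather than full state keeps the traffic small; a mismatch shows that the two units have diverged, but not which one is wrong.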