I operate an OLTP system that accepts SSL connections over the internet at multiple sites. I'm looking for an effective way to transparently and automatically reroute transaction connections when one site is down. Bonus points for treating a site as down not only when it is unreachable or refusing connections, but also when it is merely delayed, overloaded, or returning bad results.

For example, the user system would attach to www.abcdef.com or 123.234.56.7 and actually be redirected to one.abcdef.com/two.abcdef.com or 99.5.2.1/68.96.79.1, depending on which site is working. This sounds a lot like load balancing, but it's primarily about using the network to avoid a single point of failure rather than about spreading work between servers.

The advantages to the user are that (1) they only have to know one URL or one IP address to connect to and (2) their transactions keep working in several different failure scenarios: the public network near one of the sites fails or is misrouted, the local loop for that ISP fails, or in-house routers or servers fail. Of course, the transactions still fail if the problem is close to the user.
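
To make the delayed/overloaded/bad-results part concrete, the sort of check I have in mind looks roughly like this Python sketch (the /health path, the 2-second budget, and the host names from my example above are placeholders, not anything already built):

    import socket, ssl, time

    SITES = ["one.abcdef.com", "two.abcdef.com"]   # example hosts from above
    TIMEOUT = 2.0                                  # assumed latency budget

    def site_is_healthy(host, port=443):
        """Treat unreachable, slow, or wrong-answering sites all as down."""
        start = time.time()
        try:
            ctx = ssl.create_default_context()
            with socket.create_connection((host, port), timeout=TIMEOUT) as raw:
                with ctx.wrap_socket(raw, server_hostname=host) as s:
                    s.sendall(("GET /health HTTP/1.0\r\nHost: %s\r\n\r\n" % host).encode())
                    reply = s.recv(1024)
        except OSError:                            # covers TLS errors too
            return False
        if time.time() - start > TIMEOUT:
            return False                           # answers, but too slowly
        return reply.startswith((b"HTTP/1.0 200", b"HTTP/1.1 200"))  # bad result counts as down

    live_sites = [h for h in SITES if site_is_healthy(h)]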

A: 

Could use round-robin DNS but that still sends up to half of the traffic to a non-operating site.
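
One partial mitigation lives on the client: resolve the name, then walk every address it returns and fall through to the next one when a connect fails. A rough Python sketch (the timeout is an arbitrary choice):

    import socket

    def connect_any(hostname, port=443, timeout=2.0):
        """Try each address behind a round-robin DNS name until one answers."""
        last_err = None
        for family, socktype, proto, _, sockaddr in socket.getaddrinfo(
                hostname, port, type=socket.SOCK_STREAM):
            try:
                return socket.create_connection(sockaddr[:2], timeout=timeout)
            except OSError as err:
                last_err = err            # dead record: move on to the next one
        raise last_err or OSError("no usable address for %s" % hostname)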

A: 

Could manually update DNS, but that might take minutes or hours to propagate to all potential users, plus the time to notice the problem and edit the name server.

A: 

Could create one very well-connected, very bulletproof site and run a network load balancer or similar custom application, but there's no guarantee that (A) the user can actually reach that site or (B) the site will always be up. I'd rather it operate like RAID: redundant and inexpensive.

There is no way a distributed solution is going to be inexpensive. It's probably cheaper to do the single site with lots of redundancy in one place and multiple network connections.
tvanfosson
A: 

I'm thinking that one way to do this is to have redundant front-end servers that sit behind a load balancer. This front-end system simply responds to requests by redirecting them to your real servers, which are distributed in various locations. The front end can periodically check whether those servers are up and, if not, take the dead server out of the mix. Having redundant front ends behind the load balancer (or maybe just in a cluster) keeps the front end itself from becoming a single point of failure. You could also have multiple front ends in various locations, using the round-robin DNS solution. You'd probably need to take this architecture into account in your application.

You probably want redundant network links to all sites as well.
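
A rough sketch of that kind of redirecting front end in Python, assuming plain HTTP redirects, a bare TCP connect as the health probe, and made-up backend names; a real one would obviously need SSL and something smarter than a connect test:

    import http.server, socket, threading, time

    BACKENDS = ["one.abcdef.com", "two.abcdef.com"]   # assumed real-server names
    live = list(BACKENDS)                             # current view of healthy backends

    def probe_loop(interval=10):
        """Periodically re-test every backend and drop the dead ones from the mix."""
        global live
        while True:
            healthy = []
            for host in BACKENDS:
                try:
                    socket.create_connection((host, 443), timeout=2).close()
                    healthy.append(host)
                except OSError:
                    pass                              # leave it out until it recovers
            live = healthy or list(BACKENDS)          # never redirect into a void
            time.sleep(interval)

    class Redirector(http.server.BaseHTTPRequestHandler):
        def do_GET(self):
            target = live[0]                          # could rotate here for balancing
            self.send_response(302)
            self.send_header("Location", "https://%s%s" % (target, self.path))
            self.end_headers()

    threading.Thread(target=probe_loop, daemon=True).start()
    http.server.HTTPServer(("", 8080), Redirector).serve_forever()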

tvanfosson
This is the best solution so far (12/5/08 noon CT). It is another description of the "create one very well-connected, very bulletproof site and run a network load balancer" idea. But it doesn't completely fix the single point of failure in terms of reaching that front-end site.
It's beginning to sound like I'm describing some enhanced DNS system. But that only moves the failure point into DNS (and makes DNS complicated). Maybe it's just turtles all the way down. In other words, the problem may always exist somewhere no matter how many layers are used.
A: 

ifstated can be used as a front-end with pf (on OpenBSD and FreeBSD) to redirect traffic to online servers.

man ifstated

ifstated - Interface State daemon

The ifstated daemon runs commands in response to network state changes, which it determines by monitoring interface link state or running external tests. For example, it can be used with carp(4) to change running services or to ensure that carp(4) interfaces stay in sync, or with pf(4) to test server or link availability and modify translation or routing rules.
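
Since ifstated's external tests are just commands judged by their exit status, the test itself can be anything you like. A minimal probe it could run might look like this Python script (host, port, and timeout are made-up values):

    #!/usr/bin/env python3
    # Availability probe for ifstated's external tests: exit 0 when the backend
    # answers within the deadline, non-zero otherwise, so ifstated can flip state.
    import socket, sys

    HOST, PORT, TIMEOUT = "one.abcdef.com", 443, 2.0   # placeholders

    try:
        socket.create_connection((HOST, PORT), timeout=TIMEOUT).close()
        sys.exit(0)      # reachable: keep routing traffic to this server
    except OSError:
        sys.exit(1)      # unreachable or too slow: redirect elsewhere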

brunoqc
I think the thing that is going to doom this is SSL. The cert belongs to a particular system and redirecting to another system transparently will likely cause certificate issues. It might be ok if they are behind a load balancer and the load balancer handles SSL, but he wanted distributed systems.
tvanfosson
Good point on the SSL. However, in this situation, I think that can be worked around by having the clients accept the certificate anyway (don't validate it). We're not looking for authentication, just encryption during transmission.
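On the client side that works out to something like this Python sketch: the connection is still encrypted, but the certificate is deliberately not validated, so there is no protection against a man in the middle (host and port are placeholders):

    import socket, ssl

    ctx = ssl.create_default_context()
    ctx.check_hostname = False        # let the cert name a different host
    ctx.verify_mode = ssl.CERT_NONE   # encrypt only, skip authentication

    with socket.create_connection(("www.abcdef.com", 443), timeout=5) as raw:
        with ctx.wrap_socket(raw, server_hostname="www.abcdef.com") as conn:
            conn.sendall(b"GET / HTTP/1.0\r\nHost: www.abcdef.com\r\n\r\n")
            print(conn.recv(1024))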
A: 

I once asked a similar question years ago and the answer has a lot to do with how much money you are willing to spend.

There are hardware solutions: devices whose sole purpose is to sit in front of your servers A and B so that when server A goes down, they stop sending requests to it and only use server B. The advantage of doing this in hardware is performance and reliability.

It also helps to know the relative reliability of all your system components. If it's one component you are worried about, then you can make THAT piece redundant and design the rest of the system to fail over from one to the other. The reason I say this is that there is no one perfect answer.

Unless of course you are trying to build something like a credit card processing system or other similar financial transaction system where money is no object :P

The most common scenario is a fail-over setup where A fails over to B.

If you are looking for the 'perfect' failover, you can implement a system in between the customer and A and B that will automatically retry, etc., and return the answer without the customer seeing even one error. But that may bottleneck performance, and THEN you have the issue of another system that may or may not fail. And now we are back to my second paragraph.... :)
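
That retry-in-the-middle layer boils down to a wrapper like this sketch (the server list, retry count, and back-off are made up, and TLS is omitted to keep it short):

    import socket, time

    SERVERS = ["99.5.2.1", "68.96.79.1"]   # A and B from the question's example
    ROUNDS = 3

    def transact(payload):
        """Send one transaction, failing over between servers, so the caller
        only sees an error if every attempt against every server fails."""
        last_err = None
        for _ in range(ROUNDS):
            for host in SERVERS:
                try:
                    with socket.create_connection((host, 443), timeout=2) as conn:
                        conn.sendall(payload)
                        return conn.recv(4096)     # success: no error reaches the caller
                except OSError as err:
                    last_err = err                 # try the next server / next round
            time.sleep(1)                          # brief back-off between rounds
        raise last_err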

It's not a bad question, but knowing more about what you are trying to do (and whether you are already locked into a particular implementation) would help.