views: 689
answers: 3
I'm interested in cross-colo fail-over strategies for web applications, such that if the main site fails users seamlessly land at the fail-over site in another colo.

The application side of things looks to be mostly figured out with a master-slave database setup between the colos and services designed to recover and be able to pick up mid-stream. I'm trying to figure out the strategy for moving traffic from the main site to the fail-over site. DNS failover, even with low TTLs, seems to carry a fair bit of latency.

What strategies would you recommend for quickly moving traffic between colos, assuming the servers at the main colo are unreachable?

If you have other interesting experience / words of wisdom about cross-colo failover I'd love to hear those as well.

A: 

If you can, use Multicast - http://en.wikipedia.org/wiki/Multicast - or Anycast - http://en.wikipedia.org/wiki/Anycast

Brian Knoblauch
Multicast is no use here - the rest of the Internet will be oblivious to it.
Alnitak
Multicast would depend on the peering of the colos. Anycast works across the Internet as a whole. You might have missed that part of my post, though - I accidentally saved the post before I had finished... :-)
Brian Knoblauch
indeed - it wasn't there then. However anycast is normally used for stateless UDP services, and doesn't get on well with TCP (see caveats in the wikipedia article).
Alnitak
+2  A: 

DNS based mechanisms are troublesome, even if you put low TTLs in your zone files.

The reason for this is that many applications (e.g. MSIE) maintain their own caches which ignore the TTL. Other software will do a single gethostbyname() or equivalent call and store the result until the program is restarted.

Worse still, many ISPs' recursive DNS servers are known to ignore TTLs below their own preferred minimum and impose their own higher TTLs.
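One client-side mitigation (my sketch, not something from this thread - the function name and structure are my own) is to resolve fresh for every connection and try every address the resolver returns, rather than caching a single gethostbyname() result for the life of the process. Assuming the fail-over site's address is also published in DNS, a new A record then takes effect as soon as the caches allow:

```python
import socket

def connect_with_failover(host, port, timeout=3.0):
    """Resolve on every call and try each returned address in order,
    instead of holding on to one gethostbyname() result."""
    last_err = None
    for family, socktype, proto, _cname, addr in socket.getaddrinfo(
            host, port, type=socket.SOCK_STREAM):
        sock = socket.socket(family, socktype, proto)
        sock.settimeout(timeout)
        try:
            sock.connect(addr)
            return sock  # first reachable address wins
        except OSError as err:
            last_err = err
            sock.close()
    raise last_err if last_err else OSError("no addresses for %r" % host)
```

This only helps clients you control, of course - browsers and other people's software will still cache however they please.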

Ultimately if the site is to run from both data centers without changing its IP address then you need to look at arrangements for "Multihoming" via global BGP4 route announcements.

With multihoming you need at least a /24 netblock of "provider independent" (aka "PI") IP address space, and then arrange for it to be announced into the global routing table from the backup site only if the main site goes offline.
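As a rough illustration of this idea, here is a hypothetical bgpd (Quagga/FRR-style) fragment for the backup site. It uses the always-announce-but-deprioritize variant (AS-path prepending) rather than strictly conditional announcement, and every AS number, prefix, and peer address is a placeholder - your ISP assigns the real values and must agree to accept the announcement:

```
! Hypothetical bgpd fragment for the BACKUP site (placeholder values).
router bgp 64512
 network 203.0.113.0/24
 neighbor 198.51.100.1 remote-as 64511
 neighbor 198.51.100.1 route-map BACKUP-OUT out
!
route-map BACKUP-OUT permit 10
 ! Prepend our own AS so this path loses to the primary site's
 ! shorter announcement while the primary is still reachable.
 set as-path prepend 64512 64512 64512
```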

Alnitak
Reading the Multihoming wikipedia page now, much thanks. Any words of wisdom on the effectiveness of this technique and level of difficulty in setting it up?
Parand
It's pretty hairy - it really needs ISP-level advice, and also cooperation from the providers supplying connectivity at both sites. Not all ISPs will allow customers to announce their own routes.
Alnitak
+2  A: 

As for DNS, I like to reference "Why DNS Based Global Server Load Balancing Doesn't Work". For everything else -- use BGP.

Designing networks to load balance using BGP is still not an easy task, and I certainly am not an expert on this. It's also more complex than Wikipedia can tell you, but there are a couple of interesting articles on the web that detail how it can be done:

There is always more if you search for BGP and load balancing. There are also a couple of whitepapers on the net that describe how Akamai does its global load balancing (I believe it's BGP too), which are always interesting to read and learn from.

Beyond the obvious concepts you can implement with software and hardware, you might also want to check with your ISP/provider/colo to see whether they can set you up.

Also, no offense in regard to your choice of colo (who's the provider?), but most places should be set up to deal with downtime and so on, and they should not require you to take action. Of course floods or aliens can always strike, but in that case I guess there are more important issues. :-)

Till
In my experience, and in chatting with friends who use various colo providers, I couldn't find a single person who hadn't experienced downtime of one sort or another due to the colo provider. I'd love to find a provider that deals elegantly with the problem; please feel free to recommend one.
Parand
I know what you are saying. But, e.g., we never had any issues with ServerCentral. In Europe I'm colocating in a Telia POP - expensive, but no issues! With PEER1 (NYC) we've been mostly alright, though we did suffer minor downtime when their router's PSU failed. :(
Till