How do large server farms handle gracefully shutting down all or part of the farm? I'm thinking of planned and unplanned cases like:

  • "We need to shut down Rack 42"
  • "We need to do work on the power feeds to the whole block"
  • "Blackout! UPSes running out of juice! Aahh!"
  • "AC is down, air temp is 125F and climbing"

The issues I'm interested in are how people handle sequencing and how they kick the whole thing off. It also occurs to me that this could easily get mixed up with bringing services up and down and with the software upgrade system.

(At this point I'm more asking out of curiosity than anything.)

A: 

One method is to mirror the live machines on temporary hot-swaps and, assuming access is via network, cut over by reconfiguring the router to divert traffic to the mirrors. This process can be automated for unplanned outages.

For planned maintenance, some simply notify their users that the system will be unavailable during a certain window.

Redundant power supplies and gas-powered generators handle most power-related problems, again with automated failover.

Adam Liss
Good and interesting info, but not really what I'm asking about, i.e. how things get shut down, not how to avoid shutting service off.
BCS
+1  A: 

Computers can use a lot more power coming back online than they do running, since they have to get all of the platters and fans spinning, typically have heavy CPU activity starting all of the applications, and so on. Most shops will have a set sequence that staggers the startups, so they don't max out the circuit and have to start all over again. This is also important if you have a bunch of applications that expect to talk to a database, or a bunch of web servers that need to talk to the app servers. You usually start from the bottom up, and stagger the startups by 30 seconds to a minute, depending on how many boxes are on your circuit.
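The bottom-up, staggered ordering described above can be sketched in a few lines. This is only an illustration under stated assumptions: the host names, the dependency map, and the `startup_schedule` helper are all hypothetical, and the 30-second default is simply the low end of the range mentioned above. It uses Python's standard `graphlib` module (Python 3.9+) to make sure nothing starts before the boxes it depends on.

```python
from graphlib import TopologicalSorter


def startup_schedule(depends_on, stagger_seconds=30):
    """Return [(offset_seconds, host), ...] with dependencies powered on first.

    depends_on maps each host to the set of hosts that must already be up.
    Hosts are spaced stagger_seconds apart so their startup surges don't overlap.
    """
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()
    schedule = []
    offset = 0
    while sorter.is_active():
        # Everything whose dependencies are already scheduled; sorted for a stable order.
        ready = sorted(sorter.get_ready())
        for host in ready:
            schedule.append((offset, host))
            offset += stagger_seconds
        sorter.done(*ready)
    return schedule


# Hypothetical farm: web servers need app servers, app servers need the database.
hosts = {
    "db1": set(),
    "app1": {"db1"},
    "app2": {"db1"},
    "web1": {"app1", "app2"},
}

for offset, host in startup_schedule(hosts):
    print(f"t+{offset:>3}s  power on {host}")
```

Walking the same list in reverse would give one plausible shutdown order (front ends first, database last), though that is an assumption here rather than something stated in this answer.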

Tim Howland
I have a box with 5 HDDs that pull in 30W per drive on startup. I'm glad it staggers them or it would toast my UPS!
BCS
Any idea what kind of systems are used to effect the staggered startup and to pick the order?
BCS
In the implementations that I've done, it's usually been a human operator. Power outages are rare enough that when they occur, someone is there to deal with the emergency. If they are happening more than once a year, it's time for a new datacenter.
Tim Howland
A: 

Ah, now I understand your question more clearly.

Products such as the iBootBar from Dataprobe allow you to monitor and manage the power to remote devices. An intelligent system can monitor the current draw of each device to verify that it's functioning within nominal limits. If not, it can take the equipment offline and bring a spare online to replace it, watching for the initial surge and waiting for power to stabilize before switching the next device on.
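The "watch the surge, wait for it to settle, then switch on the next device" loop might look roughly like the sketch below. It does not use any real iBootBar or Dataprobe API: `power_on` and `read_amps` are hypothetical callables you would have to wire up to whatever your power controller actually exposes, and the window, tolerance, and sample limits are arbitrary illustrative values.

```python
import time
from collections import deque


def wait_until_stable(read_amps, window=3, tolerance=0.2, interval=1.0,
                      max_samples=60, sleep=time.sleep):
    """Poll read_amps() until the last `window` readings agree within `tolerance` amps.

    Returns True once the draw has settled, False if it never does
    within max_samples readings.
    """
    recent = deque(maxlen=window)
    for _ in range(max_samples):
        recent.append(read_amps())
        if len(recent) == window and max(recent) - min(recent) <= tolerance:
            return True
        sleep(interval)
    return False


def sequenced_power_on(outlets, power_on, read_amps, **wait_options):
    """Switch outlets on one at a time, waiting for the draw to settle after each.

    Returns the outlets that came up and stabilized. If one never settles,
    the remaining outlets are left off for an operator to look at.
    """
    started = []
    for outlet in outlets:
        power_on(outlet)
        if not wait_until_stable(read_amps, **wait_options):
            break
        started.append(outlet)
    return started


# Simulated run: canned readings stand in for a real current sensor.
readings = iter([6.0, 3.5, 2.0, 2.0, 2.0, 7.0, 4.0, 4.1, 4.0, 4.1])
switched = []
result = sequenced_power_on(
    ["outlet-1", "outlet-2"],
    power_on=switched.append,          # hypothetical: just records the request
    read_amps=lambda: next(readings),  # hypothetical: total draw on the circuit
    sleep=lambda _seconds: None,       # skip real delays in the simulation
)
print(result)
```

Passing `sleep` in as a parameter keeps the simulation instant; against real hardware you would leave the default `time.sleep` and tune `interval` and `tolerance` to your equipment.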

Adam Liss