I was asked the following question during an interview at a company working on cloud computing, and I did not answer it well. Any suggestions on how to analyze this question would be greatly appreciated.

Our company has hundreds of millions of users and we expect zero downtime in production. Explain techniques and programming practices that help improve redundancy and fail-over capabilities for front-end, middle-tier, and back-end services, including database services.

A: 

That's a pretty broad question. If they expect zero downtime, tell them to forget about it or turn all of their profits over to building redundancy. Now, if they just want "five nines", or 99.999% uptime, then we can talk. :)
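
To put the 99.999% figure in perspective, here is a minimal sketch of the arithmetic behind an uptime target. The function name and the choice of a 365.25-day year are assumptions made for illustration, not part of the original question.

```python
# Assumption: a year is taken as 365.25 days to average over leap years.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an uptime percentage."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.9, 99.99, 99.999):
    budget = downtime_minutes_per_year(target)
    print(f"{target}% uptime allows about {budget:.1f} minutes of downtime per year")
```

Five nines leaves a budget of only a few minutes per year, which is why the difference between "zero downtime" and "five nines" is worth raising with the interviewer.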

You can usually answer these kinds of questions with the usual canned blather about building a sustainable, automated build environment that includes extensive unit testing. Using design patterns like MVC or similar can help with testability. Perform regular security audits. This is much bigger than just a development question; it is a question about network and server architecture, maintaining secondary and tertiary data centers, etc. These kinds of questions really give you a chance to make the interviewer feel important.

BobbyShaftoe
+1  A: 

This question is very much along the lines of the "Impossible Question" from Joel. There is no right answer to this question.

I would start breaking this down into a list of all possible failure points:

  • Database Server
  • Database
  • Middle Tier
  • Middle Tier Server
  • Application
  • Web Server

Then, for each one, I would identify the reasons it could break and how to recover from each without downtime. For the ones I do not know the answers to, I would admit as much.
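
One way to show why every failure point on that list matters is to estimate availability across the tiers. The sketch below assumes the tiers sit in series (a request needs all of them) and that failures are independent, which is a simplification; the 0.999 figures and the function names are made up for illustration.

```python
from math import prod


def series_availability(availabilities):
    """Availability of a chain where every component must be up."""
    return prod(availabilities)


def redundant_availability(availability, copies):
    """Availability of N interchangeable copies where one working copy is enough."""
    return 1 - (1 - availability) ** copies


# Hypothetical per-tier availabilities, chosen only to illustrate the math.
tiers = {"web server": 0.999, "middle tier": 0.999, "database server": 0.999}

single = series_availability(tiers.values())
doubled = series_availability(redundant_availability(a, 2) for a in tiers.values())

print(f"one instance per tier:  {single:.6f}")
print(f"two instances per tier: {doubled:.6f}")
```

Under these assumptions, chaining tiers lowers the overall figure below that of any single tier, while duplicating each tier raises it. Real failures are often correlated (shared power, shared network), so this is an upper bound rather than a prediction.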

For example, let's build a list of reasons a database server goes down. Since we are looking for 100% uptime, we ignore nothing, no matter how far-fetched:

  • Hardware goes bad
  • Power goes down
  • Network card goes bad
  • Operating System unexpectedly crashes
  • O.S. Upgrades break system
  • Dumb System Admin or DBA
  • Dumb Janitor

Some possible solutions (assuming a SQL Server on Windows back end):

  • Lock on door
  • Database Mirroring (with regular failover testing)
  • Multiple NICs
  • Clustering (with regular failover testing)
  • Get better people
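
Mirroring and clustering only help if callers actually move to the surviving copy. As a rough sketch of that idea on the application side, the code below tries each replica in order and falls back when one is unreachable. It does not use any SQL Server API; `call_with_failover`, `AllReplicasFailed`, `primary`, and `mirror` are hypothetical names, and the replicas are stand-in functions rather than real database connections.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


class AllReplicasFailed(Exception):
    """Raised when no replica could serve the request."""


def call_with_failover(replicas: Sequence[Callable[[], T]]) -> T:
    """Try each replica in order and return the first successful result."""
    errors: List[Exception] = []
    for replica in replicas:
        try:
            return replica()
        except ConnectionError as exc:
            # Remember the failure and move on to the next replica.
            errors.append(exc)
    raise AllReplicasFailed(errors)


def primary() -> str:
    # Stand-in for a primary server that is currently unreachable.
    raise ConnectionError("primary is down")


def mirror() -> str:
    # Stand-in for a healthy mirror.
    return "row from mirror"


print(call_with_failover([primary, mirror]))
```

Calling this with a deliberately broken primary is a small-scale version of the "regular failover testing" mentioned in the list: the fallback path is exercised before a real outage does it for you.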

You can basically keep answering this question until the interviewer throws in the towel, because there really is no One Right Answer to it.

Raj More