a system design question

+3 Q:

I was asked the following question during an interview at a company working on cloud computing, and did not answer it well. Any suggestions on how to analyze this question would be greatly appreciated.

Our company has hundreds of millions of users and we expect zero downtime in production. Explain techniques and programming practices that help improve redundancy and fail-over capabilities for front-end, middle-tier and back-end services, including database services.

That's a pretty broad question. If they expect zero downtime, tell them to forget about it or turn all of their profits over to building redundancy. Now, if they just want "five nines" (99.999% uptime), then we can talk. :)
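To put the five-nines figure in perspective, here is a minimal sketch of the arithmetic: the fraction of the year you are allowed to be down, converted to minutes. The function name and the 365.25-day year are my own assumptions for illustration, not anything from the interview question.

```python
# Downtime allowed per year for a given availability target.
# Assumption: an average year of 365.25 days (525,960 minutes).
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99, 99.999):
    budget = downtime_minutes_per_year(target)
    print(f"{target}% uptime allows about {budget:.1f} minutes of downtime per year")
```

Under that assumption, five nines leaves a little over five minutes of downtime per year, which is why it is a realistic thing to negotiate while literal zero is not.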

You can usually answer these kinds of questions with the usual canned blather about building a sustainable, automated build environment that includes extensive unit testing. Using design patterns like MVC can help with testability. Perform regular security audits. But this is much bigger than a development question: it is a question about network and server architecture, maintaining secondary and tertiary data centers, and so on. These kinds of questions really give you a chance to make the interviewer feel important.

BobbyShaftoe 2010-06-28 20:11:28

+1 A:

This question is very much along the lines of the "Impossible Question" from Joel. There is no right answer to this question.

I would start breaking this down into a list of all possible failure points:

Database Server
Database
Middle Tier
Middle Tier Server
Application
Web Server

Then, for each of them, I would identify the reasons it can break and how to recover from each without downtime. Where I do not know the answer, I would admit as much.
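One way to show why every failure point on the list matters is the standard availability arithmetic: components that must all be up multiply their availabilities, while redundant copies only fail when all copies fail. The sketch below assumes failures are independent, which real systems rarely achieve, and the per-tier figures are made up purely for illustration.

```python
from math import prod


def series(*availabilities: float) -> float:
    """All components must be up at once: multiply the availabilities."""
    return prod(availabilities)


def redundant(availability: float, copies: int) -> float:
    """At least one of `copies` identical, independent replicas must be up."""
    return 1 - (1 - availability) ** copies


# Hypothetical per-tier availabilities, not measured values.
web, middle, database = 0.999, 0.999, 0.999

single = series(web, middle, database)
doubled = series(redundant(web, 2), redundant(middle, 2), redundant(database, 2))

print(f"one server per tier:  {single:.6f}")
print(f"two servers per tier: {doubled:.6f}")
```

The point to make in the interview is that a chain of tiers is always less available than its weakest link, so redundancy has to be added at every tier, not only at the database.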

For example, let's build a list of reasons a database server goes down. Since we are looking for 100% uptime, we ignore nothing, no matter how far-fetched:

Hardware goes bad
Power goes down
Network card goes bad
Operating System unexpectedly crashes
OS upgrades break the system
Dumb System Admin or DBA
Dumb Janitor

Some possible solutions (assuming SQL Server on a Windows back end):

Lock on door
Database Mirroring (with regular failover testing)
Multiple NICs
Clustering (with regular failover testing)
Get better people
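Mirroring and clustering only help if callers actually move to the surviving node. As a rough illustration of that idea, here is a generic client-side failover sketch. It is not SQL Server's own failover mechanism; the endpoint names, `call_with_failover`, and `fake_query` are all hypothetical stand-ins.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


class AllEndpointsFailed(Exception):
    """Raised when no endpoint in the list could serve the request."""


def call_with_failover(endpoints: Sequence[str], operation: Callable[[str], T]) -> T:
    """Try `operation` against each endpoint in order; return the first success."""
    errors = []
    for endpoint in endpoints:
        try:
            return operation(endpoint)
        except ConnectionError as exc:
            # Record the failure and fall through to the next endpoint.
            errors.append(f"{endpoint}: {exc}")
    raise AllEndpointsFailed("; ".join(errors))


# Hypothetical stand-in for a real database call: the primary is "down".
def fake_query(endpoint: str) -> str:
    if endpoint == "db-primary":
        raise ConnectionError("connection refused")
    return f"result from {endpoint}"


print(call_with_failover(["db-primary", "db-mirror"], fake_query))
```

Swapping `fake_query` for a function that deliberately fails is also a cheap way to rehearse the "regular failover testing" mentioned in the list.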

You can basically keep answering this question until the interviewer throws in the towel, because there really is no One Right Answer to it.

Raj More 2010-06-28 20:38:18
