views: 746

answers: 7

In general, what kinds of design decisions help an application scale well?

(Note: Having just learned about Big O Notation, I'm looking to gather more principles of programming here. I've attempted to explain Big O Notation by answering my own question below, but I want the community to improve both this question and the answers.)

Responses so far
1) Define scaling. Do you need to scale for lots of users, traffic, objects in a virtual environment?
2) Look at your algorithms. Will the amount of work they do scale linearly with the size of the input - i.e. the number of items to loop through, number of users, etc.?
3) Look at your hardware. Is your application designed such that you can run it on multiple machines if one can't keep up?

Secondary thoughts
1) Don't optimize too much too soon - test first. Bottlenecks may appear in unforeseen places.
2) Maybe the need to scale will not outpace Moore's Law, and maybe upgrading hardware will be cheaper than refactoring.

+2  A: 

Jeff and Joel discuss scaling in the Stack Overflow Podcast #19.

GateKiller
+4  A: 

Well, there's this blog called High Scalability that contains a lot of information on this topic. Some useful stuff.

Daud
+3  A: 

Often the most effective way to do this is by a well thought through design where scaling is a part of it.

Decide what scaling actually means for your project. Is it an infinite number of users? Is it being able to handle a Slashdotting of a website? Is it development cycles?

Use this to focus your development efforts.

svrist
+10  A: 

The only thing I would say is write your application so that it can be deployed on a cluster from the very start. Anything above that is a premature optimisation. Your first job should be getting enough users to have a scaling problem.
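
As a rough sketch of what "deployable on a cluster" tends to mean in practice - written in Python, with the SessionStore name and the in-memory backend being illustrative assumptions rather than anything from this answer - the idea is simply that per-user state lives behind an interface instead of in one process's memory, so a networked backend can be swapped in once a second machine is needed:

    # Minimal sketch: keep per-user state behind an interface instead of in
    # process memory, so the same code can run on one box or on a cluster.
    # Names here (SessionStore, InMemoryStore, handle_request) are illustrative.

    class SessionStore:
        """Abstracts where session data lives; swap the backend, not the app code."""
        def get(self, session_id):
            raise NotImplementedError
        def set(self, session_id, data):
            raise NotImplementedError

    class InMemoryStore(SessionStore):
        # Fine for a single process; breaks as soon as a second machine is added.
        def __init__(self):
            self._data = {}
        def get(self, session_id):
            return self._data.get(session_id)
        def set(self, session_id, data):
            self._data[session_id] = data

    # A cluster deployment would plug a network-backed store (memcached, Redis,
    # a database) in behind the same interface, so any machine can serve any request.

    def handle_request(store, session_id, name):
        store.set(session_id, {"user": name})
        return "hello, " + store.get(session_id)["user"]

    print(handle_request(InMemoryStore(), "abc123", "alice"))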

Build the code as simply as you can first, then profile the system, and optimise only when there is an obvious performance problem.

Often the figures from profiling your code are counter-intuitive; the bottlenecks tend to reside in modules you didn't think would be slow. Data is king when it comes to optimisation. If you optimise the parts you think will be slow, you will often optimise the wrong things.
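
A small Python sketch of the "profile first" idea - the workload functions below are placeholders made up for illustration, not anything from this answer - using the standard-library cProfile/pstats route to find out where the time actually goes:

    # Measure before optimising: profile the program and look at the report.
    # parse_items and summarise are stand-ins for real application code.
    import cProfile
    import pstats

    def parse_items(n):
        return [str(i) for i in range(n)]

    def summarise(items):
        return sum(len(s) for s in items)

    def main():
        return summarise(parse_items(200_000))

    profiler = cProfile.Profile()
    profiler.runcall(main)
    # Show the five functions with the largest cumulative time; the hot spots
    # are often not the ones you expected.
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)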

Simon Johnson
+1  A: 

One good idea is to determine how much work each additional task creates. This can depend on how the algorithm is structured.

For example, imagine you have some virtual cars in a city. At any moment, you want each car to have a map showing where all the cars are.

One way to approach this would be:

    for each car {
        determine my position;
        for each car {
            add my position to this car's map;
        }
    }

This seems straightforward: look at the first car's position, add it to the map of every other car. Then look at the second car's position, add it to the map of every other car. Etc.

But there is a scalability problem. When there are 2 cars, this strategy takes 4 "add my position" steps; when there are 3 cars, it takes 9 steps. For each "position update," you have to cycle through the whole list of cars - and every car needs its position updated.

Ignoring how many other things must be done to each car (for example, it may take a fixed number of steps to calculate the position of an individual car), for N cars, it takes N² "visits to cars" to run this algorithm. This is no problem when you've got 5 cars and 25 steps. But as you add cars, you will see the system bog down. 100 cars will take 10,000 steps, and 101 cars will take 10,201 steps!

A better approach would be to undo the nesting of the for loops.

    for each car {
        add my position to a list;
    }
    for each car {
        give me an updated copy of the master list;
    }

With this strategy, the number of steps is a multiple of N, not of N². So 100 cars will take 100 times the work of 1 car - NOT 10,000 times the work.

This concept is sometimes expressed in "big O notation" - the number of steps needed is "big O of N" or "big O of N²."

Note that this concept is only concerned with scalability - not optimizing the number of steps for each car. Here we don't care if it takes 5 steps or 50 steps per car - the main thing is that N cars take (X * N) steps, not (X * N²).
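
To make the contrast concrete, here is a small Python sketch of both strategies (the Car class and function names are mine, purely for illustration). With 100 cars, the nested version performs 10,000 map insertions, while the shared-list version does work proportional to 100:

    # Illustrative Python version of the two strategies above.
    class Car:
        def __init__(self, position):
            self.position = position
            self.map = {}

    def broadcast_positions_nested(cars):
        # O(N^2): every car's position is written into every car's map.
        for source in cars:
            for target in cars:
                target.map[id(source)] = source.position

    def broadcast_positions_shared(cars):
        # O(N): build the master list once, then give each car a reference to
        # it (one step per car) instead of rebuilding a per-car map.
        master = {id(car): car.position for car in cars}
        for car in cars:
            car.map = master

    cars = [Car((i, i)) for i in range(100)]
    broadcast_positions_shared(cars)   # ~100 + 100 steps, not 10,000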

Nathan Long
+1  A: 

FWIW, most systems will scale most effectively by ignoring this until it's a problem - Moore's Law is still holding, and unless your traffic is growing faster than Moore's Law, it's usually cheaper to just buy a bigger box (at $2K or $3K a pop) than to pay developers.

That said, the most important place to focus is your data tier; that is the hardest part of your application to scale out, as it usually needs to be authoritative, and clustered commercial databases are very expensive - the open-source variations are usually very tricky to get right.

If you think there is a high likelihood that your application will need to scale, it may be intelligent to look into systems like memcached or MapReduce relatively early in your development.
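
As a rough cache-aside sketch of the memcached idea - assuming the pymemcache Python client and a memcached server on localhost, with load_user_profile standing in for an expensive database query; none of these names come from the answer itself:

    # Check the cache before the database; populate it on a miss.
    from pymemcache.client.base import Client

    cache = Client(("localhost", 11211))

    def load_user_profile(user_id):
        # Placeholder for the slow, authoritative data-tier query.
        return "profile-for-" + user_id

    def get_user_profile(user_id):
        key = "user:" + user_id
        cached = cache.get(key)
        if cached is not None:
            return cached.decode("utf-8")
        profile = load_user_profile(user_id)
        cache.set(key, profile, expire=300)  # keep hot reads off the database
        return profile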

Tim Howland
Sorry, I strongly disagree with this. Often, by the time scalability becomes a problem, the only real cure is to rewrite the system almost from scratch, because the needed changes are often pervasive and deep.
RickNZ
"Premature Optimization is the root of all evil".
Tim Howland
+6  A: 
Ed Lucas