I really would like to learn how to attack problems of scale. Of course, the best way to tackle this is to get a job where I can deal with those sorts of issues. But they're apparently not terribly easy to get.

Obviously, I don't have any way to make any services that are going to handle millions of requests a day without a large investment of time. Are there any ways that I can learn enough to improve my chances of getting such a job in my spare time? For instance, what kinds of open source projects would be good to work on? And are there any books I can read on the subject?

+6  A: 

There's a good presentation here by one of eBay's architects, Randy Shoup, from QCon 2008 where he talks about how eBay have gone about scaling their solution. Well worth watching in my opinion. This site has the key points as bullet points.

There's also an article here on the principles of scalability, which is a good place to pick up the main pointers quickly in a to-the-point, not too waffly manner.

It is difficult, as you say, to learn about this in your own time because you don't have an environment that requires a real degree of scaling. But you can be prepared and show you understand the principles. Get into the frame of mind whereby you always think about the future when writing something: is it extensible? What if I had 100 million rows of data, not 1 million?

For example, one of the key points for scalability from a DB point of view is to partition your data. Speaking from a SQL Server 2005 context, you don't get built-in support for table partitioning unless you have Enterprise Edition, which I'd highly doubt you'd just have at home! So it's not as if you can simply play around with it at home to try it out; you'd have to implement your own custom partitioning mechanism. But you can at least look into it and know the approach.
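
If you want to experiment with the idea at home, a hand-rolled mechanism can be as simple as hashing a key to pick one of several physical tables. Here's a minimal sketch in Python with SQLite (the table layout and modulo scheme are my own illustration, not SQL Server's partitioning feature):

    import sqlite3

    N_PARTITIONS = 4
    conn = sqlite3.connect(":memory:")

    # One physical table per partition.
    for p in range(N_PARTITIONS):
        conn.execute(f"CREATE TABLE orders_p{p} "
                     "(id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")

    def partition_for(customer_id):
        # Hash partitioning: each customer always lands in the same partition.
        return customer_id % N_PARTITIONS

    def insert_order(order_id, customer_id, total):
        p = partition_for(customer_id)
        conn.execute(f"INSERT INTO orders_p{p} VALUES (?, ?, ?)",
                     (order_id, customer_id, total))

    def orders_for_customer(customer_id):
        p = partition_for(customer_id)
        return conn.execute(f"SELECT * FROM orders_p{p} WHERE customer_id = ?",
                            (customer_id,)).fetchall()

    insert_order(1, 42, 9.99)
    print(orders_for_customer(42))   # [(1, 42, 9.99)]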

AdaTheDev
+2  A: 

Hmm, maybe read some books that cover clustering or cloud models, or build a little Linux clustering farm in your spare time.

Scaling problems in open source projects usually result from bad design. For example, Thunderbird saves attached files inside its database rather than as files on disk, and then people wonder why the database is slow.
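
To illustrate the design point (a toy sketch with hypothetical names, not Thunderbird's actual code): keep the large blobs on the filesystem and put only a small metadata row in the database.

    import hashlib
    import sqlite3
    from pathlib import Path

    store = Path("attachments")
    store.mkdir(exist_ok=True)
    db = sqlite3.connect("mail.db")
    db.execute("CREATE TABLE IF NOT EXISTS attachments (sha1 TEXT PRIMARY KEY, path TEXT)")

    def save_attachment(data: bytes) -> str:
        # The blob goes to disk; the DB row stays a few bytes wide.
        digest = hashlib.sha1(data).hexdigest()
        path = store / digest
        path.write_bytes(data)
        db.execute("INSERT OR IGNORE INTO attachments VALUES (?, ?)",
                   (digest, str(path)))
        db.commit()
        return digest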

Tuukka
design - you can afford to reread and edit.
whatnick
+2  A: 

Assuming that you're interested in large-scale websites, scaling depends on what the website is trying to do. There aren't any easy answers. Personally, I would start by thinking about the transactions the website has to process, because those dictate your persistence and caching solutions (and most of your stack).

E.g. a large website like Yahoo may be trying to present lots of content to lots of users. Most of the interactions are read-only transactions, in which case some of the tips from High Performance Web Sites make lots of sense.

On the other hand, a large website could be handling lots of write transactions, like Twitter. In Twitter's case, of course, no money is lost if somebody's tweet disappears somewhere in the stack. In a day-trading financial system (or on eBay), that is not the case. While eBay can certainly afford to use lots of asynchronous messaging to improve performance and scalability, there are still cases where latency needs to be real-time. If you really need that, then those system transactions need to be write-behind instead of write-through (see this article).
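
A rough illustration of the trade-off (a toy sketch, not the linked article's code): a write-through cache makes the caller wait for the backing store, while a write-behind cache acknowledges immediately and flushes asynchronously, buying latency at the cost of durability.

    import queue
    import threading
    import time

    backing_store = {}   # stand-in for the database
    cache = {}

    def write_through(key, value):
        # Caller waits for the slow store write: durable, higher latency.
        cache[key] = value
        time.sleep(0.01)             # simulate store latency
        backing_store[key] = value

    write_queue = queue.Queue()

    def flusher():
        while True:
            key, value = write_queue.get()
            time.sleep(0.01)         # simulate store latency
            backing_store[key] = value
            write_queue.task_done()

    threading.Thread(target=flusher, daemon=True).start()

    def write_behind(key, value):
        # Caller returns immediately; a crash before the flush loses the write.
        cache[key] = value
        write_queue.put((key, value))

    write_behind("tweet:1", "hello")  # returns instantly
    write_queue.join()                # wait for the background flush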

Alan
+3  A: 

Obviously there will be websites and books that are a good starting point for the theory (though I can't recommend anything in particular there), and being able to point to a relevant open source project in your job interview and say "I helped write that" will improve your chances a lot.

Once you know how to approach the problem, you can use the same approach as for load testing: set up your server application with one or more mock "client" applications that bombard it with bursts of requests.
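
A minimal sketch of such a client, using only the Python standard library (the URL, request count, and concurrency level are placeholders for your own setup):

    import time
    import urllib.request
    from concurrent.futures import ThreadPoolExecutor

    URL = "http://localhost:8000/"   # placeholder: your server under test
    REQUESTS = 500
    CONCURRENCY = 50                 # size of the request "burst"

    def hit(_):
        start = time.perf_counter()
        with urllib.request.urlopen(URL) as resp:
            resp.read()
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        latencies = sorted(pool.map(hit, range(REQUESTS)))

    print(f"median: {latencies[len(latencies) // 2] * 1000:.1f} ms")
    print(f"p99:    {latencies[int(len(latencies) * 0.99)] * 1000:.1f} ms")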

You won't have to write a particularly complex system (server or client) to encounter many of the scalability issues you want experience with. Indeed, with a relatively simple system you can try setting it up in different ways to see what works best (e.g. multiple processes vs. multiple threads, caching approaches for your data, strategies for minimising the CPU load on the server, the number of requests or the bandwidth used per request, etc.).

You can simulate a cluster of servers by simply running many processes on a single PC, so you don't need to invest in lots of hardware. You just need to measure performance in CPU time used rather than elapsed time, so you can pretend that the processes are running on independent machines.
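
For example, in Python you can report CPU time and wall-clock time separately for each worker process (a sketch; the busy loop just stands in for request handling):

    import multiprocessing
    import os
    import time

    def worker(n):
        wall_start = time.perf_counter()
        cpu_start = time.process_time()
        sum(i * i for i in range(n))   # busy work standing in for request handling
        cpu = time.process_time() - cpu_start
        wall = time.perf_counter() - wall_start
        # CPU time approximates the cost on a dedicated machine;
        # wall time is inflated by contention with the other "servers".
        print(f"pid {os.getpid()}: cpu {cpu:.2f}s, wall {wall:.2f}s")

    if __name__ == "__main__":
        # Pretend each process is one node in a 4-node cluster.
        procs = [multiprocessing.Process(target=worker, args=(5_000_000,))
                 for _ in range(4)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()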

I find the best approach (for me) is usually to think of a realistic problem to solve (e.g. simulate a share trading system), or a useful tool that I could write for myself, and then just have a crack at it.

Jason Williams
+2  A: 

I've needed to come up to speed in this area for my job, and the resources I've found most helpful are Todd Hoff's highscalability.com, Steve Souders's High Performance Web Sites, Theo Schlossnagle's Scalable Internet Architectures, and Cal Henderson's Building Scalable Web Sites.

Jim Ferrans
+3  A: 

Learn the theory

Learn about scalability from a theoretical point of view: the methods, techniques, trade-offs, architectures, etc. You should then have a good view of concepts such as transactions, redundancy, concurrency, multi-threading, caching, garbage collection, fault tolerance, recovery, asynchrony, load balancing, partitioning, distributed hash tables, programming models (actors, object-oriented, Erlang, Clojure, etc.), and distributed algorithms (two-phase commit, group communication, consensus, etc.).
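
One of those concepts, distributed hash tables, is easy to play with at home. Here's a minimal consistent-hashing ring in Python (node names and replica count are illustrative):

    import bisect
    import hashlib

    def h(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    class HashRing:
        """Keys map to the next node clockwise on the ring, so adding or
        removing a node only remaps a small fraction of the keys."""

        def __init__(self, nodes, replicas=100):
            self.ring = sorted((h(f"{node}:{i}"), node)
                               for node in nodes for i in range(replicas))
            self.keys = [k for k, _ in self.ring]

        def node_for(self, key: str) -> str:
            idx = bisect.bisect(self.keys, h(key)) % len(self.keys)
            return self.ring[idx][1]

    ring = HashRing(["node-a", "node-b", "node-c"])
    print(ring.node_for("user:123"))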

Learn how existing systems work

Theory in this area cannot be applied as-is; there are always trade-offs. Learn how real systems work. Look at white papers in the field of relational database systems, clustering in popular app servers, the architecture of popular web sites (eBay, Amazon, FriendFeed, etc.), the Google platform papers (GFS, Chubby, BigTable, etc.), cloud-related architectures, and maybe NoSQL projects. Subscribe to a few websites or blogs of people in this area (Ricky Ho is one I like, but there are others).

Practice

The best thing is to have a real project with a real performance problem to improve. Sketch out a few architectures for how you could improve it. Otherwise, create your own problem: say, a little web site that provides a basic service (e.g. storing to-do lists), run it on Amazon EC2, and see if it scales. I don't know of any open source projects to join.
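
A toy version of such a service fits in a few lines of the Python standard library (endpoint and storage are illustrative; the in-memory list is exactly the thing that breaks once you run more than one instance behind a load balancer):

    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    todos = []   # in-memory state: the first casualty of scaling out

    class TodoHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            body = json.dumps(todos).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)

        def do_POST(self):
            length = int(self.headers["Content-Length"])
            todos.append(json.loads(self.rfile.read(length)))
            self.send_response(201)
            self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("", 8000), TodoHandler).serve_forever()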

IMHO, architecting a truly scalable system goes from the requirements (know what needs to be fast and what doesn't) down to the hardware. In between there are the network, the OS, the middleware, and the application architecture, all of which need to be considered.

ewernli
+1 for EC2, mention GAE as well just for fairness.
whatnick
A: 

My foray into scalability came through image processing. I found scalability concepts via OSSIM, which led me to use MPI for building clustered applications. Setting up cluster hardware didn't use to be trivial, but now it is easy via EC2 as mentioned above. You can build and test processing applications with virtual machines.
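
If you want to try the MPI route yourself, mpi4py makes the classic scatter/gather pattern compact (a sketch assuming MPI and mpi4py are installed; run with something like "mpiexec -n 4 python script.py"):

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    # Root splits the work, each rank processes its chunk, root gathers results.
    chunks = [list(range(i, 100, size)) for i in range(size)] if rank == 0 else None
    mine = comm.scatter(chunks, root=0)
    partial = sum(x * x for x in mine)
    totals = comm.gather(partial, root=0)

    if rank == 0:
        print("sum of squares 0..99:", sum(totals))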

The other option is to participate in one of the projects that uses your local machine as part of a globally distributed computing architecture, then study how they handle networking and computing tasks. My ATI drivers installed F@Home, which I found interesting.

whatnick