I'm building a webservice that is going to be under ridiculous load (thousands to tens of thousands of queries per second). My normal stack of Apache, PHP, memcached and some DB will be able to handle it with a nice load balancer in front and lots of machines, but I'm wondering if there are better solutions.

The endpoint will be hit by a beacon (via javascript on the client), I'll read the user's cookies, pull some small info on them from the DB, cache it, do some small calculation, send the response and if needed write to the DB and invalidate the cache.
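Roughly, the per-request flow I have in mind looks like this (a Python sketch using the python-memcached client; helper names like parse_cookie, db_fetch_user and compute_response are just placeholders, not real code):

    import memcache  # python-memcached client; assumes memcached on localhost

    mc = memcache.Client(['127.0.0.1:11211'])

    def handle_beacon(request):
        user_id = parse_cookie(request)             # read the user's cookie
        user = mc.get('user:%s' % user_id)          # check the cache first
        if user is None:
            user = db_fetch_user(user_id)           # small DB read on a cache miss
            mc.set('user:%s' % user_id, user, time=300)
        result = compute_response(user)             # the small calculation
        if result_needs_write(result):              # hypothetical check
            db_write(user_id, result)               # write to the DB...
            mc.delete('user:%s' % user_id)          # ...and invalidate the cache
        return result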

Any good technology choices and/or hardware recommendations?

+3  A: 

There is a lot to learn at http://highscalability.com/; you'll probably find your answer there.

Nicolas Dorier
My goal is not a large scalable system, just a simple technology stack. I'm not growing a DB, search, crawler, etc. Just a simple request, query, respond, and store. Any recommendations for a technology stack for my purpose?
Paul Tarjan
From what I've seen, you can build a scalable system with any technology stack. "Thousands to tens of thousands of queries per second" is really high, so to me that is a "large scalable system". Every technology stack has its success stories. If you want to support that load, you need to read this website (and maybe consider a key/value store such as CouchDB instead of a relational database).
Nicolas Dorier
+7  A: 

This isn't the kind of question that can be answered here in anything other than a broad overview. Some general pointers:

  • Hardware: the two choices are basically lots of small, cheap boxes or a smaller number of more powerful boxes. Cheaper boxes are, well, cheaper, but typically consume a lot more power for the same CPU or memory (whichever is important to you) than bigger boxes. People often forget about the sometimes significant cost of power consumption;
  • Backend: you have a few choices, from the big end of town (Oracle, SQL Server) to the commodity end (MySQL). MySQL is obviously cheaper and you can go far on MySQL, but there is no question that Oracle (which I'm more familiar with than SQL Server) has a better optimizer, is more capable and is more robust than MySQL. You will, however, pay for it;
  • Budget: this is a huge factor as it might be worth paying for good commercial software rather than paying development costs to use "free" software. Software development is one of the most expensive costs of all;
  • Vertical and horizontal scalability: the question you're basically seeking to answer here is whether you build up (bigger boxes, etc.) or build out (clustered environments). The most scalable solutions have near-linear horizontal scalability, but in the shorter term vertical scalability can be cheaper.

As for your normal stack, I'd stick with it unless you've got a particular requirement you haven't mentioned that prohibits it. After all, PHP is a proven technology that runs 4 or so of the top 20 sites on the Internet (Facebook, Wikipedia, Flickr and, I think, Yahoo). If it's good enough for them, it's good enough for you.

More importantly, you know it. Technology stacks you know trump technology stacks you don't in almost every case. Beware the "greener pasture" trap of the latest hyped-up technology stack.

memcached is good. The other thing you might want to consider adding to the mix is beanstalkd as a distributed work queue processor.
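For example, with beanstalkd the web tier can push slow work (such as deferred DB writes) onto a queue and respond immediately, while separate workers drain the queue. A minimal sketch, assuming the beanstalkc Python client and a beanstalkd daemon on its default port (the job payload and handler are made up for illustration):

    import beanstalkc  # assumes the beanstalkc client and a local beanstalkd daemon

    queue = beanstalkc.Connection(host='127.0.0.1', port=11300)

    # In the web tier: enqueue the deferred work and respond to the user right away.
    queue.put('write:user:123')

    # In a worker process: pull jobs off the queue and do the slow part out of band.
    job = queue.reserve()
    handle_job(job.body)   # hypothetical function that performs the deferred DB write
    job.delete()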

One important question to answer is: how well can you partition your application? Applications that easily lend themselves to partitioning are far easier to scale. Those that don't typically need to be modified in some way to make them easier to partition.

A good example of this is a simple sharetrading application. You could partition market information based on stock code (A-C on one server, D-F on another and so on). For many such applications that will work well.
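A toy illustration of that kind of range-based partitioning (the shard ranges and hostnames here are made up):

    # Map a stock code to the database server that owns its range.
    # The ranges and hostnames are illustrative only.
    SHARDS = [
        ('A', 'C', 'db1.example.com'),
        ('D', 'F', 'db2.example.com'),
        # ... further ranges covering the rest of the alphabet ...
    ]

    def shard_for(stock_code):
        first = stock_code[0].upper()
        for low, high, host in SHARDS:
            if low <= first <= high:
                return host
        raise ValueError('no shard configured for %r' % stock_code)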

cletus
A: 

Tornado looks like something I would try for this kind of problem: http://bret.appspot.com/entry/tornado-web-server. At least you know it's a tried and tested solution.
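For a feel of what that looks like, here is a minimal handler in the style of Tornado's own hello-world example (the actual beacon logic is left out; the cookie name and route are just placeholders):

    import tornado.ioloop
    import tornado.web

    class BeaconHandler(tornado.web.RequestHandler):
        def get(self):
            uid = self.get_cookie("uid")   # read the tracking cookie, if any
            # ... cache/DB lookup and the small calculation would go here ...
            self.write("ok")

    application = tornado.web.Application([(r"/beacon", BeaconHandler)])

    if __name__ == "__main__":
        application.listen(8888)
        tornado.ioloop.IOLoop.instance().start()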

A: 

I can contribute a good component for your stack: memcached.

jldupont
A: 

PHP, memcached + DB in general scales well, but there may be ways to do it at lower cost, i.e. a stack that's able to handle more concurrent requests per machine.

Given your comment here...

My goal is not a large scalable system, just a simple technology stack. I'm not growing a DB, search, crawler, etc. Just a simple request, query, respond, and store. Any recommendations for a technology stack for my purpose?

... it sounds like the DB part might be solvable by Amazon's S3 (what?!?), assuming you only need to locate items by key. That would also give you CloudFront (en.wikipedia.org/wiki/Amazon_CloudFront) for reads, if you don't mind the eventual consistency (www.infoq.com/news/2008/01/consistency-vs-availability).
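A sketch of what treating S3 as a key/value store might look like with the boto library (the bucket name, key layout, record format and credentials are all placeholders for illustration):

    from boto.s3.connection import S3Connection
    from boto.s3.key import Key

    # Illustrative only: bucket name, key layout and credentials are made up.
    conn = S3Connection('AWS_ACCESS_KEY', 'AWS_SECRET_KEY')
    bucket = conn.get_bucket('my-beacon-data')

    # Write a small per-user record under a predictable key.
    k = Key(bucket)
    k.key = 'user/123'
    k.set_contents_from_string('{"plan": "free", "count": 7}')

    # Read it back later by the same key (eventual consistency applies).
    record = bucket.get_key('user/123').get_contents_as_string()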

Meanwhile, on the server side, something using async IO (en.wikipedia.org/wiki/Asynchronous_I/O) to handle requests should significantly boost the number of concurrent requests each machine can handle. As another poster already said, Tornado (bret.appspot.com/entry/tornado-web-server) would be worth a look here; I haven't seen a friendlier API for async IO.

You'd probably still need memcached to keep reads fast, but you want to watch out that the memcached client doesn't end up blocking the server process while trying to make concurrent requests. PHP wouldn't normally have this problem, as each PHP (or Apache) process has its own memcached connection and is only ever doing one thing at a time. This Python client (code.google.com/p/python-libmemcached/) should support async IO; the underlying libmemcached has support for asynchronous requests.

Same goes for HTTP requests from server to S3 - how do you handle concurrent requests there? boto (code.google.com/p/boto/) seems to use a connection pool for that, each connection holding a different socket open. Memory use?

Disclaimer: I'm being an armchair architect here; I haven't actually done this, and the smartest advice might be to finish the project on time with the stack you know well and aren't going to fail with.

Sorry about the links; new users can only post a maximum of one hyperlink.

HarryF