views:

202

answers:

2

I'm going to build a high-performance web service. It should use a database (or any other storage system), some processing language (either scripting or not), and a web-server daemon. The system should be distributed to a large amount of servers so the service runs fast and reliable.

It should replicate data to achieve reliability and at the same time it must provide distributed computing features in order to process large amounts of data (primarily, queries on large databases that won't survive being executed on a single server with a suitable level of responsiveness). Caching techniques are out of the subject.

Which cluster/cloud solutions I should take for the consideration?

There are plenty of Single-System-Image (SSI), clustering file systems (can be a part of the design), projects like Hadoop, BigTable clones, and many others. Each has its pros and cons, and "about" page always says the solution is great :) If you've tried to deploy something that addresses the subject - share your experience!

UPD: It's not a file hosting and not a game, but something rather interactive. You can take StackOverflow as an example of a web-service: small pieces of data, semi-static content, intensive database operations.


Cross-Post on ServerFault

A: 

It's hard to make specific recommendations since you've been a bit vague, but I would recommend Google Appengine for basically any web service. It's reliable, easy to use, and is built on the google architecture so is fast and reliable.

Martin
I should imagine that the OP wanted to host it on their own infrastructure. Also cloudy stuff has (typically) no SLA, and poor latency and performance in general.
MarkR
Maybe they did, but I would still recommend Appengine over any kind of in house solution.
Martin
I'd argue that it's neither fast nor reliable. I can't argue with the "scales well" though. I wonder if it's cost-effective? We haven't really investigated this much.
MarkR
I don't know about costs, so someone else would have to comment on that.
Martin
+1  A: 

You really need a better definition of "big". Is "Big" an aspiration, or do you have hard numbers which your marketing department* reckon they'll have on board?

If you can do it using simple components, do so. The likes of Cassandra and Hadoop are neither easy to setup (especially the later) or develop for; developers who are going to be able to develop such an application effectively will be very expensive and difficult to hire.

So I'd say, start off using your favourite "Traditional" database, with an appropriate high-availability solution, then wait until you get close to the limit (You can always measure where the limit is on your real application, once it's built and you have a performance test system).

Remember that Stack Overflow uses pretty conventional components, simply well tuned with a small amount of commodity hardware. This is fine for its scale, but would never work for (e.g. Facebook), but the developers knew that the audience of SO was never going to reach Facebook levels.

EDIT:

When "traditional" techniques start failing, e.g. you reach the limit of what can be done on a single database instance, then you can consider sharding or doing functional partitioning into more instances (again with your choice of HA system).

The only time you're going to need one of these (e.g. Cassandra) "nosql" systems is if you have a homogeneous data store with very high write requirement and availability requirement; even then you could probably still solve it by sharding conventional systems - as others (even Facebook) have done at times.

MarkR
This is why I recommended Appengine, it's easier to develop for, and scales much more easily. Also the price scales very well too!
Martin