I want to create a system that delivers user interface response within 100ms, but which requires minutes of computation. Fortunately, I can divide it up into very small pieces, so that I could distribute this to a lot of servers, let's say 1500 servers. The query would be delivered to one of them, which then redistributes to 10-100 other servers, which then redistribute etc., and after doing the math, results propagate back again and are returned by a single server. In other words, something similar to Google Search.

The problem is, what technology should I use? Cloud computing sounds obvious, but the 1500 servers need to be prepared for their task by having task-specific data available. Can this be done using any of the existing cloud computing platforms? Or should I create 1500 different cloud computing applications and upload them all?

Edit: Dedicated physical servers do not make sense, because the average load will be very, very small. For the same reason, it also does not make sense for us to run the servers ourselves - it needs to be some kind of shared servers at an external provider.

Edit2: I basically want to buy 30 CPU minutes in total, and I'm willing to spend up to $3000 on it, equivalent to $144,000 per CPU-day. The only criterion is that those 30 CPU minutes are spread across 1500 responsive servers.

Edit3: I expect the solution to be something like "Use Google Apps, create 1500 apps and deploy them" or "Contact XYZ and write an asp.net script which their service can deploy, and you pay them based on the amount of CPU time you use" or something like that.

Edit4: A low-end web service provider offering asp.net at $1/month would actually solve the problem (!) - I could create 1500 accounts, the latency is ok (I checked), and everything would be fine - except that I need the 1500 accounts to be on different servers, and I don't know of any provider with enough servers that is able to distribute my accounts across different servers. I am fully aware that the latency will differ from server to server, and that some servers may be unreliable - but that can be solved in software by retrying on different servers.

Edit5: I just tried it and benchmarked a low-end web service provider at $1/month. They can do the node calculations and deliver results to my laptop in 15ms, if preloaded. Preloading can be done by making a request shortly before the actual computation is needed. If a node does not respond within 15ms, that node's part of the task can be redistributed to a number of other servers, of which one will most likely respond within 15ms. Unfortunately, they don't have 1500 servers, and that's why I'm asking here.
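The fan-out-with-retry scheme described in Edit4/Edit5 can be sketched roughly like this (a minimal sketch; the 15 ms budget comes from the benchmark above, while `compute_chunk` is a local stand-in for a real HTTP call to one of the hypothetical per-server accounts):

```python
import concurrent.futures
import random
import time

TIMEOUT_S = 0.015      # the 15 ms per-node budget measured in Edit5
RETRY_FANOUT = 3       # on timeout, re-issue the chunk to several other nodes

def compute_chunk(chunk_id: int) -> float:
    """Stand-in for one node's calculation (a real system would make an HTTP call)."""
    time.sleep(random.uniform(0.001, 0.010))  # simulated node latency
    return chunk_id * 2.0                     # dummy per-chunk result

def run_with_hedging(pool, chunk_id: int) -> float:
    """Try one node; if it misses the deadline, race several replacement nodes."""
    future = pool.submit(compute_chunk, chunk_id)
    try:
        return future.result(timeout=TIMEOUT_S)
    except concurrent.futures.TimeoutError:
        # Retry on several other servers and take whichever answers first.
        hedges = [pool.submit(compute_chunk, chunk_id) for _ in range(RETRY_FANOUT)]
        done, _ = concurrent.futures.wait(
            hedges, return_when=concurrent.futures.FIRST_COMPLETED)
        return next(iter(done)).result()

def scatter_gather(num_chunks: int) -> float:
    with concurrent.futures.ThreadPoolExecutor(max_workers=64) as pool:
        return sum(run_with_hedging(pool, i) for i in range(num_chunks))

print(scatter_gather(100))
```

The redistribution-on-timeout step is what makes individual unreliable $1/month servers tolerable, at the cost of some duplicated work.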

A: 

Sounds like you need to utilise an algorithm like MapReduce ("MapReduce: Simplified Data Processing on Large Clusters").

See also the Wikipedia entry on MapReduce.

Mitch Wheat
+1  A: 

Google does it by having a gigantic farm of small Linux servers, networked together. They use a flavor of Linux that they have custom-modified for their search algorithms. Costs are software development and cheap PCs.

Robert Harvey
I just added a comment to my question to note that the average load will be very, very small. It does not make sense to run these servers only for this purpose; that would be far too expensive.
Lars D
Google has its gigantic server farm in order to cope with high load - each individual request to Google in itself requires very little computation.
Kragen
+8  A: 

[in advance, apologies to the group for using part of the response space for meta-like matters]

From the OP, Lars D:
I do not consider [this] answer to be an answer to the question, because it does not bring me closer to a solution. I know what cloud computing is, and I know that the algorithm can be perfectly split into more than 300,000 servers if needed, although the extra costs wouldn't give much extra performance because of network latency.

Lars,
I sincerely apologize for reading and responding to your question at a naive and generic level. I hope you can see how both the lack of specificity in the question itself, particularly in its original form, and also the somewhat unusual nature of the problem (1) would prompt me to respond to the question in like fashion. This, and the fact that such questions on SO typically emanate from hypotheticals by folks who have put but little thought and research into the process, are my excuses for believing that I, a non-practitioner [of massively distributed systems], could help your quest. The many similar responses (some of which had the benefit of the extra insight you provided) and also the many remarks and additional questions addressed to you show that I was not alone with this mindset.

(1) Unusual problem: an [apparently] mostly computational process (no mention of distributed/replicated storage structures), very highly parallelizable (1,500 servers), split into fifty-millisecond-sized tasks which collectively provide a sub-second response (? for human consumption?). And yet, a process that would only be required a few times [daily..?].

Enough looking back!
In practical terms, you may consider some of the following to help improve this SO question (or move it into other/alternate questions), and hence foster help from experts in the domain.

  • re-posting as a distinct (more specific) question. In fact, probably several questions: e.g. on the [likely] poor latency and/or overhead of MapReduce processes, on the current prices (for specific TOS and volume details), on the rack-awareness of distributed processes at various vendors, etc.
  • Change the title
  • Add details about the process you have at hand (see the many questions in the notes of both the question and of many of the responses)
  • in some of the questions, add tags specific to a given vendor or technique (EC2, Azure...), as this may bring in the possibly not-quite-unbiased, but helpful all the same, commentary from agents at these companies
  • Show that you understand that your quest is somewhat of a tall order
  • Explicitly state that you wish for responses from effective practitioners of the underlying technologies (maybe also include folks that are "getting their feet wet" with these technologies, since, with the exception of the physics/high-energy folks and such, who BTW traditionally worked with clusters rather than clouds, many of the technologies and practices are relatively new)

Also, I'll be pleased to take the hint from you (with the implicit non-veto from other folks on this page), to delete my response, if you find that doing so will help foster better responses.

-- original response--

Warning: Not all processes or mathematical calculations can readily be split in individual pieces that can then be run in parallel...

Maybe you can check Wikipedia's entry on Cloud Computing, understanding, however, that cloud computing is not the only architecture which allows parallel computing.

If your process/calculation can effectively be chunked into parallelizable pieces, maybe you can look into Hadoop, or other implementations of MapReduce, for a general understanding of these parallel processes. Also (and I believe utilizing the same or similar algorithms), there exist commercially available frameworks such as EC2 from Amazon.

Beware, however, that the above systems are not particularly well suited for very quick response times. They fare better with hour-long (and then some) data/number-crunching and similar jobs, rather than minute-long calculations such as the one you wish to parallelize so that it provides results in 1/10 second.

The above frameworks are generic, in the sense that they could run processes of most any nature (again, ones that can at least in part be chunked), but there also exist various offerings for specific applications such as searching or DNA matching, etc. The search applications in particular can have very short response times (cf. Google, for example), and BTW this is in part tied to the fact that such jobs can very easily and quickly be chunked for parallel processing.

mjv
+1 for hadoop, although it's worth pointing out that that's just one implementation of map/reduce
Rob Fonseca-Ensor
-1 for Hadoop. Initial job deployment takes a minute. Don't expect a Hadoop cluster to give results within a range of 100ms. That's not going to happen.
mhaller
@mhaller, you are right; although hadoop is oft used for off-line matrix crunching and other tasks such as clustering and sorting, which support fast applications, these apps are themselves not running on hadoop. I'll alter my response accordingly. My only excuse for such an imprecise response is the vague and, frankly, probably naive nature of the OP's question.
mjv
This algorithm works very well in chunks, basically because it's about 350,000 completely independent calculations that need to be done. Which provider lets me use 1500 servers without paying a lot of money?
Lars D
Have you considered throwing hardware at the problem in the form of a GPU or three?
Tuure Laurinolli
@Lars, as hinted in the response, beware of the latency/overhead associated with the management of the parallel process. At the moment my understanding is that commercial vendors have both extra capacity and are actively in the process of securing market share/position, and the prices are therefore attractive. To get an idea you can maybe try http://calculator.s3.amazonaws.com/calc5.html .
mjv
@Tuure: +1 , maybe OpenCL is the solution... http://www.nvidia.com/object/cuda_opencl.html , http://ati.amd.com/technology/streamcomputing/opencl.html
Malkocoglu
@Tuure, right on, GPU kits are now less buggy and can provide a significant boost, in a more traditional fashion. Nonetheless, it appears that Lars D's calculation will still require parallel processing; it's just that with GPUs, the requirement for distinct servers will diminish. The management of parallel threads on the GPUs will also require management of sorts. Nevertheless, a very worthy option to look into.
mjv
@Tuure: This is basically about throwing hardware at the problem instead of spending a lot of time on programming. If we can make this work in a couple of weeks on a cloud system, it is much cheaper than if we spend a man-month on it and it then runs on a big server.
Lars D
@mjv: I do not consider your answer to be an answer to the question, because it does not bring me closer to a solution. I know what cloud computing is, and I know that the algorithm can be perfectly split into more than 300,000 servers if needed, although the extra costs wouldn't give much extra performance because of network latency.
Lars D
@Lars: If you figured everything out, why did you ask this question in the first place ? Did you compare the prices of a multi-GPU system and 1500 servers in the cloud ? BTW: A month is also a couple of weeks :-)
Malkocoglu
If the calculations involve floating point and can be parallelized very far, a GPU might indeed be the solution. You can get about ~2 TFlops for $1k.
drhirsch
@Malkocoglu: I basically want to buy 30 CPU minutes in total, and I'm willing to spend $3000 on it, equivalent to $144,000 per CPU-day. The only criterion is that those 30 CPU minutes are spread across 1500 responsive servers.
Lars D
@Lars: +1 for the GPUs. Worth looking into. http://www.securitiesindustry.com/issues/19_92/-23358-1.html?pg=1
Trevor Tippins
30 CPU minutes are spread across 1500 servers !?! This does not sound good/right. Just the network latency will kill the efficiency achieved by the processing power of those 1500 servers ! IMHO, mjv is correct and so is Tuure...
Malkocoglu
30 CPU minutes spread across 1500 servers gives 1.2 seconds for each to handle the request. Since I don't expect a single request to take more than max 50ms, this allows 24 runs - that should be more than enough if the system is easy to use.
Lars D
@Lars D, see my edit, I hope this helps. Do not hesitate to ask me to "self-destroy"; I'll be pleased to oblige, provided no one on this page makes the argument that the material is useful independently of its missing the effective needs of the OP.
mjv
@mjv: I never have any hard feelings, I'm happy that so many are trying to help out, but I'd wish that someone would just come up with a very, very simple solution. I guess cloud computing is still too much in its infancy, to be able to deliver solutions to these kinds of problems.
Lars D
@Lars, same here, no hard feelings. You are quite right that cloud computing is still defining itself in many ways (never mind implementing itself, and marketing itself...). This said, in view of my better understanding of your problem, "cloud" is likely not the eventual destination (but nonetheless maybe a good way to get educated on the broader issues associated with // processing). Your requirements make it so: very massive burst usage with a low latency requirement, yet relatively low usage overall (?24 times daily?), which probably makes it a moderately interesting "account" for providers.
mjv
You say "the algorithm works very well in chunks, basically because it's about 350,000 completely independent calculations that need to be done." From my understanding, "completely independent" means the calculations act on the same initially available parameters and none has to wait for another to return a result to be used as input. If that is true, then you can make a controller and 1500 web services. Each web service could be hosted on a separate server. The controller sends out the requests to the web services with the initial parameters, gets back the results and parses them.
Majid
@Majid: Exactly. But who can provide the service?
Lars D
A: 

Check out Parallel computing and related articles on Wikipedia: "Concurrent programming languages, libraries, APIs, and parallel programming models have been created for programming parallel computers." http://en.wikipedia.org/wiki/Parallel_computing

Kristoffer Bohmann
+1  A: 

It would seem that you are indeed expecting at least a 1000-fold speedup from distributing your job to a number of computers. That may be ok. Your latency requirement seems tricky, though.

Have you considered the latencies inherent in distributing the job? Essentially, the computers would have to be fairly close together in order not to run into speed-of-light issues. Also, the data center in which the machines are located would again have to be fairly close to your client so that you can get your request to them and back in less than 100 ms. On the same continent, at least.

Also note that any extra latency requires you to have many more nodes in the system. Losing 50% of available computing time to latency or anything else that doesn't parallelize requires you to double the computing capacity of the parallel portions just to keep up.
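That "double the capacity" observation generalizes: the fleet scales with the inverse of the fraction of node time that does useful work. A tiny sketch (the 50% figure is illustrative):

```python
import math

def required_nodes(base_nodes: int, useful_fraction: float) -> int:
    """Nodes needed when only `useful_fraction` of each node's time does
    real work (the rest is lost to latency, coordination, retries)."""
    return math.ceil(base_nodes / useful_fraction)

# Losing 50% of compute time to latency doubles the fleet:
print(required_nodes(1500, 0.5))
```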

I doubt a cloud computing system would be the best fit for a problem like this. My impression at least is that the proponents of cloud computing would prefer to not even tell you where your machines are. Certainly I haven't seen any latency terms in the SLAs that are available.

Tuure Laurinolli
Yes - if I set 10ms for sending 1kbyte from one server to another and starting to process it, then the answer should be back within 100ms. The actual server count depends on the latency, of course, but if my question for 1500 servers is answered by someone, then the same solution can be used for 500 servers or 2500 servers. If there are no servers on my continent, I would probably not be able to achieve 100ms in total, but that would probably be ok. One problem is that if just one server fails, the result is wrong - that's why I'm thinking about clouds.
Lars D
+2  A: 

MapReduce is not the solution! MapReduce is used at Google, Yahoo and Microsoft for creating indexes out of the huge data (the whole Web!) they have on their disks. This task is enormous, and MapReduce was built to make it happen in hours instead of years, but just starting a MapReduce master controller already takes 2 seconds, so for your 100ms this is not an option.

Now, from Hadoop you may get advantages out of the distributed file system. It may allow you to distribute the tasks close to where the data physically is, but that's it. BTW: setting up and managing a Hadoop Distributed File System means controlling your 1500 servers!

Frankly, within your budget I don't see any "cloud" service that will allow you to rent 1500 servers. The only viable solution is renting time on a grid computing solution like those Sun and IBM are offering, but from what I know they want you to commit to hours of CPU.

BTW: On Amazon EC2 you can have a new server up in a couple of minutes, but you need to keep it for an hour minimum!

Hope you'll find a solution!

Fred Simon
A: 

You'll find a lot about such questions on

http://highscalability.com/

RED SOFT ADAIR
+5  A: 

Sorry, but you are expecting too much.

The problem is that you are expecting to pay for processing power only. Yet your primary constraint is latency, and you expect that to come for free. That doesn't work out. You need to figure out what your latency budgets are.

The mere aggregation of data from multiple compute servers will take several milliseconds per level. There will be a Gaussian distribution here, so with 1500 servers the slowest server will respond only after about 3σ. And since there's going to be a need for a hierarchy, you'll have a second level of around 40 servers, where again you'll be waiting for the slowest server.

Internet roundtrips also add up quickly; that too should take 20 to 30 ms of your latency budget.
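The slowest-of-1500 effect above is easy to quantify with a quick simulation (the 10 ms mean and 2 ms standard deviation per node are illustrative assumptions, not measurements):

```python
import random

random.seed(1)
MEAN_MS, SIGMA_MS = 10.0, 2.0   # assumed per-node latency distribution
NODES, TRIALS = 1500, 200

def slowest_of_fleet() -> float:
    # A query is only done once the slowest of the NODES servers has answered.
    return max(random.gauss(MEAN_MS, SIGMA_MS) for _ in range(NODES))

avg_worst = sum(slowest_of_fleet() for _ in range(TRIALS)) / TRIALS
print(f"mean node: {MEAN_MS:.0f} ms, mean slowest-of-{NODES}: {avg_worst:.1f} ms")
```

With these numbers the query-level latency lands well past the 3σ mark (~16-17 ms), even though the average node answers in 10 ms.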

Another consideration is that these hypothetical servers will spend much of their time idle. That means they're powered on, drawing electricity yet not generating revenue. Any party with that many idle servers would turn them off, or at the very least put them in sleep mode, just to conserve electricity.

MSalters
That's all assuming the servers are all set up with the applications running, the needed pages all in memory, ready to receive the requests, having just finished serving the previous request.
Stephen Denne
I just added an edit to my question, please read it.
Lars D
Doesn't really help. No commercial service has 1500 idle servers.
MSalters
@MSalters: They don't need to.
Lars D
They'd better be all idle. You don't have the latency budget to deal with retries. Hence, every server that's assigned a workpackage needs to pick it up immediately. Even if 95% of those servers are idle, you still end up waiting for 75 servers.
MSalters
A: 

Although Cloud Computing is the cool new kid in town, your scenario sounds more like you need a cluster, i.e. how can I use parallelism to solve a problem in a shorter time. My solution would be:

  1. Understand that having a problem that can be solved in n time steps on one CPU does not guarantee that it can be solved in n/m steps on m CPUs. Actually, n/m is the theoretical lower limit. Parallelism usually forces you to communicate more, and therefore you'll hardly ever achieve this limit.
  2. Parallelize your sequential algorithm and make sure it is still correct and you don't get any race conditions
  3. Find a provider and see what he can offer you in terms of programming languages / APIs (no experience with that)
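Point 1 is essentially Amdahl's law; a few lines make the n/m ceiling concrete (the 1% serial fraction is an illustrative assumption):

```python
def speedup(serial_fraction: float, workers: int) -> float:
    """Amdahl's law: the serial_fraction of the job cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)

# Even if only 1% of the work (aggregation, network hops) is inherently
# serial, 1500 workers give nowhere near a 1500x speedup:
for m in (10, 100, 1500):
    print(f"{m:5d} workers -> {speedup(0.01, m):5.1f}x")
```

At 1500 workers the communication-free model already caps out below 100x; real communication overhead pushes it lower still.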
sebastiangeiger
A: 

What you're asking for doesn't exist, for the simple reason that doing this would require having 1500 instances of your application (likely with substantial in-memory data) sitting idle on 1500 machines - consuming resources on all of them. None of the existing cloud computing offerings bill on such a basis. Platforms like App Engine and Azure don't give you direct control over how your application is distributed, while platforms like Amazon's EC2 charge by the instance-hour, at a rate that would cost you over $2000 a day.

Nick Johnson
Some cheap providers allow you to host native C++ programs that are activated by a web server. Having that on 1500 servers would solve my problem.
Lars D
A: 

Sorry, I put this as a comment, but I wanted to offer an answer:

You say

"the algorithm works very well in chuncks, basically because it's about 350,000 completely independent calculations that need to be done."

From my understanding, "completely independent" means the calculations act on the same initially available parameters and none has to wait for another to return a result to be used as input. If that is true, then you can make a controller and 1500 web services. Each web service could be hosted on a separate server. The controller sends out the requests to the web services with the initial parameters, gets back the results and parses them.
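A minimal sketch of that controller/fan-out idea (the 350,000-task count comes from the quote above; `call_web_service` is a local stand-in for a real HTTP request to one of the hypothetical per-server endpoints):

```python
from concurrent.futures import ThreadPoolExecutor

NUM_TASKS = 350_000   # independent calculations, per the quote above
NUM_NODES = 1500      # hypothetical one-service-per-server endpoints

def call_web_service(node_id: int, task_ids: range) -> float:
    """Stand-in for POSTing this node's parameters and parsing its reply."""
    return float(sum(task_ids))          # dummy per-node computation

def controller() -> float:
    # Partition the tasks into contiguous chunks, roughly one per node.
    chunk = -(-NUM_TASKS // NUM_NODES)   # ceiling division: ~234 tasks per node
    slices = [range(i, min(i + chunk, NUM_TASKS))
              for i in range(0, NUM_TASKS, chunk)]
    with ThreadPoolExecutor(max_workers=100) as pool:
        partials = pool.map(call_web_service, range(len(slices)), slices)
    return sum(partials)                 # the gather/parse step

print(controller())
```

The controller is trivially correct precisely because the chunks share no state; all the difficulty is in finding 1500 hosts and keeping each request's latency bounded.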

Majid
XML over HTTP crawls and would introduce latencies. I would advise using native or custom protocols.
Nissan Fan
+1  A: 

You have conflicting requirements. Your requirement for 100ms latency is directly at odds with your desire to only run your program sporadically.

One of the characteristics of the Google-search type approach you mentioned in your question is that the latency of the cluster is dependent on the slowest node. So you could have 1499 machines respond in under 100ms, but if one machine took longer, say 1s - whether due to a retry, or because it needed to page your application in, or bad connectivity - your whole cluster would take 1s to produce an answer. It's inescapable with this approach.

The only way to achieve the kinds of latencies you're seeking would be to have all of the machines in your cluster keep your program loaded in RAM - along with all the data it needs - all of the time. Having to load your program from disk, or even having to page it in from disk, is going to take well over 100ms. As soon as one of your servers has to hit the disk, it is game over for your 100ms latency requirement.

In a shared server environment, which is what we're talking about here given your cost constraints, it is a near certainty that at least one of your 1500 servers is going to need to hit the disk in order to activate your app.

So you are either going to have to pay enough to convince someone to keep you program active and in memory at all times, or you're going to have to loosen your latency requirements.

sdtom
You are assuming that the program takes 100ms to run on the servers. That is not the case - on a heavily loaded shared webhost at $1 per month, one hosted app can produce the result and deliver it via my internet connection in 15-20ms. After 10-15ms without a response from a node, a retry can be made on a number of other servers. In case the servers need a preload, that can be done before running it; I'll add that to the original question.
Lars D
sdtom is just saying that a program (that runs 1 second/day) and its related data are not kept in memory the whole time by the OS. Just open some program like IE, minimize it, don't touch it for a while, do other stuff, then activate the IE window. This will take some time and you will see some disk activity. This means the OS has reclaimed the memory (code+data) that it had given to IE after a certain amount of inactivity...
Malkocoglu
I added an edit which explains, how this is actually possible on a low-end hosting provider, which just happens to have too few servers.
Lars D
+2  A: 

I don't get why you would want to do that, only because "Our user interfaces generally aim to do all actions in less than 100ms, and that criteria should also apply to this".

First, 'aim to' != 'have to' - it's a guideline, so why would you introduce this massive process just because of that? Consider 1500 x 100 ms = 150 secs = 2.5 mins. Reducing the 2.5 mins to a few seconds is a much healthier goal. There is a place for 'we are processing your request' along with an animation.

So my answer to this is: post a modified version of the question with reasonable goals - a few secs, 30-50 servers. I don't have the answer for that one, but the question as posted here feels wrong. It could even be done on 6-8 multi-processor servers.

eglasius
Lars D
Damn, I am moving to Denmark in the cargo bay of next oil tanker :-)
Malkocoglu
+1  A: 

Two trains of thought:

a) If those restraints really are, absolutely, truly founded in common sense, and doable in the way you propose in the nth edit, it seems the pre-supplied data is not huge. So how about trading storage for time via precomputation? How big would the table(s) be? Terabytes are cheap!

b) This sounds a lot like an employer/customer request that is not well founded in common sense. (from my experience)

Let's assume 15 minutes of computation time on one core. I guess that's what you are saying. For a reasonable amount of money, you can buy a system with 16 proper cores (32 with hyperthreading) and 48 GB RAM.

This should bring us into the 30-second range. Add a dozen terabytes of storage and some precomputation, and maybe a 10x speedup is reachable there: 3 secs. Are 3 secs too slow? If yes, why?

posipiet
I'm trying to reduce programming costs and shorten time-to-market by spending more computing power on the problem. If the concept proves to be successful, we can spend more programmer time on making it more energy-efficient ;-) Basically, I don't get the resources for implementing it properly until I can demonstrate that it works.
Lars D
A: 

I moved the question to serverfault.com.

Lars D