views: 470

answers: 5

I have written a stochastic simulation in Java, which loads data from a few CSV files on disk (totaling about 100MB) and writes results to another output file (not much data, just a boolean and a few numbers). There is also a parameters file, and for different parameters the distribution of simulation outputs would be expected to change. To determine the correct/best input parameters I need to run multiple simulations across multiple input parameter configurations and look at the distributions of the outputs in each group. Each simulation takes 0.1-10 min depending on parameters and randomness.

I've been reading about Hadoop and wondering if it can help me run lots of simulations; I may have access to about 8 networked desktop machines in the near future. If I understand correctly, the map function could run my simulation and spit out the result, and the reducer might be the identity.
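
For concreteness, here's a rough sketch of what I'm imagining, assuming each line of the job's input file is one "configId<TAB>parameterString" pair and that runSimulation() is just a placeholder for my existing simulation code:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Sketch only: one map call per input line = one simulation run.
    public class SimulationMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] parts = line.toString().split("\t", 2);
            String configId = parts[0];
            String parameters = parts[1];

            // Placeholder for the existing simulation code; returns the
            // boolean-plus-numbers result as a single string.
            String result = runSimulation(parameters);

            // Key by configuration so all runs of one configuration are
            // grouped together for looking at the output distribution.
            context.write(new Text(configId), new Text(result));
        }

        private String runSimulation(String parameters) {
            return ""; // ... call into the existing simulation here ...
        }
    }

    // The "identity" reducer: pass every (configId, result) pair straight
    // through, so the output ends up grouped by configuration.
    class IdentityResultReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text configId, Iterable<Text> results, Context context)
                throws IOException, InterruptedException {
            for (Text result : results) {
                context.write(configId, result);
            }
        }
    }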

The thing I'm worried about is HDFS, which seems to be meant for huge files, not a smattering of small CSV files (none of which would be big enough to even make up the minimum recommended block size of 64MB). Furthermore, each simulation would only need an identical copy of each of the CSV files.

Is Hadoop the wrong tool for me?

A: 

While you might be able to get by using MapReduce with Hadoop, it seems like what you're doing might be better suited for a grid/job scheduler such as Condor or Sun Grid Engine. Hadoop is more suited for doing something where you take a single (very large) input, split it into chunks for your worker machines to process, and then reduce it to produce an output.

Emil
You are correct that Hadoop was built with the "large data" problem in mind. What is it about Hadoop that makes it unsuited for simulations?
JD Long
A: 

Since you are already using Java, I suggest taking a look at GridGain which, I think, is particularly well suited to your problem.

antrix
A: 

Hadoop can be made to perform your simulation if you already have a Hadoop cluster, but it's not the best tool for the kind of application you are describing. Hadoop is built to make working on big data possible, and you don't have big data -- you have big computation.

I like Gearman (http://gearman.org/) for this sort of thing.

SquareCog
A: 

Simply put, though Hadoop may solve your problem here, it's not the right tool for your purpose.

Suraj Chandran
+5  A: 

I see a number of answers here that basically are saying, "no, you shouldn't use Hadoop for simulations because it wasn't built for simulations." I believe this is a rather short-sighted view and would be akin to someone saying in 1985, "you can't use a PC for word processing, PCs are for spreadsheets!"

Hadoop is a fantastic framework for constructing a simulation engine. I've been using it for this purpose for months and have had great success with small data / large computation problems. Here are the top 5 reasons I migrated to Hadoop for simulation (using R as my language for simulations, btw):

  1. Access: I can lease Hadoop clusters through Amazon Elastic MapReduce and I don't have to invest any time and energy into the administration of a cluster. This meant I could actually start doing simulations on a distributed framework without having to get administrative approval in my org!
  2. Administration: Hadoop handles job control issues, like node failure, invisibly. I don't have to code for these conditions. If a node fails, Hadoop makes sure the sims scheduled for that node get run on another node (there's a rough sketch of the relevant job settings after this list).
  3. Upgradeable: Being a rather generic map-reduce engine with a great distributed file system, Hadoop means that if you later have problems that do involve large data, you won't have to migrate to a new solution once you're used to using it. So Hadoop gives you a simulation platform that will also scale to a large-data platform for (nearly) free!
  4. Support: Being open source and used by so many companies, Hadoop has numerous resources, both online and off. Many of those resources are written with the assumption of "big data" but they are still useful for learning to think in a map reduce way.
  5. Portability: I have built analyses on top of proprietary engines using proprietary tools, which took considerable learning to get working. When I later changed jobs and found myself at a firm without that same proprietary stack, I had to learn a new set of tools and a new simulation stack. Never again. I traded in SAS for R and our old grid framework for Hadoop. Both are open source and I know that I can land at any job in the future and immediately have tools at my fingertips to start kicking ass.
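
To make point 2 a bit more concrete, here's a rough sketch (in Java, since that's what you're already using) of the kind of job-level settings I'd look at for long-running simulation tasks. The property names are from the 0.20-era configuration, so double-check them against your Hadoop version:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SimulationJobSettings {
        public static Job configure(Configuration conf) throws Exception {
            // A sim that crunches numbers for 10 minutes without reporting
            // progress would hit the default 10-minute task timeout, so raise
            // it (the value is in milliseconds).
            conf.setLong("mapred.task.timeout", 30 * 60 * 1000L);

            // Speculative execution re-runs "slow" tasks on another node; for
            // long stochastic sims that mostly wastes slots, so turn it off.
            conf.setBoolean("mapred.map.tasks.speculative.execution", false);

            // How many times a failed map task (e.g. one on a dead node) is
            // retried before the job gives up on it.
            conf.setInt("mapred.map.max.attempts", 4);

            return new Job(conf, "parameter-sweep-simulation");
        }
    }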

If you are interested in hearing more about my experiences using Hadoop with R, here's the video of a presentation I gave at the Chicago Hadoop User Group in May 2010:

JD Long
I think it's more like, you shouldn't use a spreadsheet for databases. Sure you *can* use a spreadsheet as a database, and many people do, but it may (or may not) introduce problems for you due to the mismatch between what it was designed to do and what you are using it for. That said, some people don't have access to databases so a spreadsheet is the best option for them.
Emil
That's a really good analogy Emil. Very good point. But what if using Hadoop for simulations is really like storing numbers in a database? One might say, "storing columns of numbers? That's a spreadsheet problem! Spreadsheets were purpose built for numbers!" But then when you have 1000mm numbers... uh oh, it's a database problem. But storing 1000mm numbers is NOT what databases were built for. DBs store text, blobs, etc. Just storing numbers is not what they are supposed to be used for! ;)
JD Long