So I've finally figured out how to get my R scripts to run on the Amazon EC2 cloud. I've been using an AMI with 26 ECUs, 8 cores, and 69 GB of RAM.

I then divide my code into multiple scripts and run each one in its own instance of R. With a server of this size, I can easily run 20-40 scripts simultaneously, each running several thousand simulations.

What I would like to know is whether R takes advantage of all this computing power natively. Should I install packages that specifically tell R to use the extra memory and multiple CPUs? I've seen this page, and some packages (at least from their descriptions) seem promising, but I am unable to figure out how to incorporate them into my code. Could anyone shed more light on this?

+10  A: 

You could look at the examples in my Intro to High-Performance Computing with R tutorials, of which a few versions are on this page.

The quickest way to use the multiple cores is the (excellent) multicore package. You should not have to do anything special to take advantage of the oodles of RAM you have there. multicore ties into foreach via doMC, but you can of course simply use the mclapply() function directly.
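As a rough illustration of calling mclapply() directly, here is a minimal sketch. It assumes a reasonably recent R, where mclapply() ships in the bundled parallel package (which absorbed multicore's functionality as of R 2.14.0). The `run_sim` function and the simulation count are made up for the example and stand in for whatever your scripts actually compute.

```r
library(parallel)  # bundled with R >= 2.14.0; provides mclapply()

# Hypothetical simulation: the mean of n standard-normal draws.
# Replace the body with your own simulation code.
run_sim <- function(i, n = 1000) {
  mean(rnorm(n))
}

# mclapply() relies on forking, which is not available on Windows,
# so fall back to a single core there. detectCores() may return NA.
n_cores <- if (.Platform$OS.type == "windows") {
  1L
} else {
  max(1L, detectCores(), na.rm = TRUE)
}

# L'Ecuyer-CMRG lets each forked worker draw from its own RNG stream.
RNGkind("L'Ecuyer-CMRG")
set.seed(42)

# Run 100 simulations, spread across the available cores.
results <- mclapply(1:100, run_sim, mc.cores = n_cores)
sim_means <- unlist(results)
```

With this approach a single R session fans the work out across the cores, so there may be no need to split the code into 20-40 separate scripts by hand.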

Dirk Eddelbuettel
Thanks Dirk for that excellent resource. I'll be spending my afternoon reading through your presentations. cheers!
Maiasaura
+9  A: 

Dirk's comments are spot on w.r.t multicore/foreach/doMC.

If you are doing thousands of simulations, you may want to consider Amazon's Elastic MapReduce (EMR) service. When I wanted to scale my simulations in R, I started with huge EC2 instances and the multicore package (just like you!). It went well, but I ran up a hell of an EC2 bill. I didn't really need all that RAM, yet I was paying for it. And my jobs would finish at 3 AM, but I would not get into the office until 8 AM, so I paid for 5 hours I didn't need.

Then I discovered that I could use the EMR service to fire up 50 cheap small Hadoop instances, run my simulations, and then have them automatically shut down! I've totally abandoned running my sims on EC2 and now use EMR almost exclusively. This worked so well that my firm is beginning to test ways to migrate more of our periodic simulation activity to EMR.

Here's a blog post I wrote when I first started using multicore on EC2. Then, when I discovered I could do this with Amazon EMR, I wrote a follow-up post.

JD Long
Thanks JD. My EC2 tab is running quite high too, so I will definitely look into EMR.
Maiasaura
If you hit any snags with EMR, be sure to ask follow-up questions on Stack Overflow. Good luck!
JD Long