I have a computationally intensive project that is highly parallelizable: basically, I have a function that I need to run on each observation in a large table (PostgreSQL). The function itself is a Python stored procedure.

Amazon EC2 seems like an excellent fit for the project.

My question is this: Should I make a custom image (AMI) that already contains the database? This would seem to have the advantage of minimizing data transfers and making parallelization simple: each image could get some assigned block of indices to compute, e.g., image 1 gets 1:100, image 2 101:200 etc. Splitting up the data and the instances (which most how-to guides suggest) doesn't seem to make sense for my application, but I'm very new to this so I'm not confident my intuition is right.
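To make the block-assignment idea concrete, here is a minimal sketch of how the index ranges might be split. It assumes the table has a contiguous integer key running from 1 to `n_rows`; the function name `assign_blocks` and its parameters are hypothetical, made up for illustration only.

```python
def assign_blocks(n_rows, n_instances):
    """Split row ids 1..n_rows into contiguous, near-equal (start, end) blocks.

    Assumes a dense integer key starting at 1 (an assumption, not something
    the table is known to have). Both bounds of each block are inclusive.
    """
    base, extra = divmod(n_rows, n_instances)
    blocks = []
    start = 1
    for i in range(n_instances):
        # The first `extra` instances each take one additional row.
        size = base + (1 if i < extra else 0)
        if size == 0:
            break  # more instances than rows; the rest get nothing
        end = start + size - 1
        blocks.append((start, end))
        start = end + 1
    return blocks


if __name__ == "__main__":
    for instance, (start, end) in enumerate(assign_blocks(200, 2), start=1):
        print(f"instance {instance}: rows {start} to {end}")
```

Each instance could then restrict its work to its own range, for example with a `WHERE id BETWEEN start AND end` clause, assuming the key column is called `id`.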

+1  A: 

You will definitely want to keep the data and the server instance separate, so that changes to your data persist after you are done with the instance. Your best bet is to start with a basic image that has the OS and database platform you want to use, customize it to suit your needs, and then mount one or more EBS volumes containing your data. You may also want to create your own image once you are finished with your customization, unless what you are doing is fairly straightforward.

some helpful links:

http://docs.amazonwebservices.com/AmazonEC2/gsg/2006-10-01/creating-an-image.html

http://developer.amazonwebservices.com/connect/entry.jspa?categoryID=100&externalID=1663

(You said Postgres, but this MySQL tutorial covers the same basic concepts you'll want to keep in mind.)

dudlheimer
Thanks. Would this advice still hold if my data absolutely will not change?
John Horton
If your data absolutely will not change, you could include it in your own image, but I'm not sure whether performance will be equivalent. This might be simpler to begin with, and you could always migrate to a mounted EBS volume if the need arises.
dudlheimer
Thanks. I played around with EC2 yesterday and you're definitely right about putting everything in an EBS volume. FYI for future searchers who find this topic: I found this step-by-step guide helpful: http://deadprogrammersociety.blogspot.com/2009/08/postgresql-on-ubuntu-on-ec2.html
John Horton
A: 

If you've already got the function implemented in Python, the simplest route might be to look at PiCloud, which just gives you a really easy interface for running a Python function on EC2, handling pretty much everything else for you. Whether it's economically sensible will depend on how much data has to get sent per function call vs how long computations take to run.
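As a rough way to think about that trade-off, the sketch below estimates what fraction of each call's wall-clock time would go to moving data rather than computing. It is only a back-of-the-envelope model: the function name `transfer_fraction` and all of its parameters are hypothetical, and the bandwidth and timing figures would have to come from your own measurements.

```python
def transfer_fraction(bytes_per_call, bandwidth_bytes_per_s, compute_s):
    """Fraction of total per-call time spent transferring data.

    A simplified model: ignores latency, serialization cost, and any
    overlap between transfer and computation.
    """
    transfer_s = bytes_per_call / bandwidth_bytes_per_s
    return transfer_s / (transfer_s + compute_s)


if __name__ == "__main__":
    # Made-up illustrative inputs, not measurements.
    frac = transfer_fraction(
        bytes_per_call=1_000_000,
        bandwidth_bytes_per_s=1_000_000,
        compute_s=9.0,
    )
    print(f"share of time spent on transfer: {frac:.0%}")
```

A value near zero suggests the computation dominates and shipping data per call is cheap by comparison; a value near one suggests most of the paid time would be spent on transfer.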

thraxil