Much has been written about deploying data-crunching applications on EC2/S3, but I would like to know: what is the typical workflow for developing such applications?
Let's say I start with 1 TB of time series data and have managed to store it on S3. How would I write applications and do interactive data analysis to build machine learning models, and then write large programs to test them? In other words, how does one set up a dev environment in such a situation? Do I boot up an EC2 instance, develop software on it, save my changes, and shut it down every time I want to do some work?
Typically, I fire up R or Pylab, read data from my local drives, and do my analysis. Then I create applications based on that analysis and let them loose on the data.
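For concreteness, my local loop looks roughly like the sketch below (the file path, column names, and model choice are just placeholders, not my actual setup):

```python
# Minimal sketch of my local interactive workflow.
# Path, column names, and the lag-1 model are placeholders.
import pandas as pd
from sklearn.linear_model import LinearRegression

# Load a slice of the time series from a local drive.
df = pd.read_csv("/data/timeseries/2012-01.csv", parse_dates=["timestamp"])

# Quick interactive experiment: predict each value from its lag.
df["lag1"] = df["value"].shift(1)
df = df.dropna()

model = LinearRegression()
model.fit(df[["lag1"]], df["value"])
print(model.score(df[["lag1"]], df["value"]))
```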
On EC2, I am not sure if I can do that. Do people keep data locally for analysis and only use EC2 when they have large simulation jobs to run?
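Naively, I imagine the EC2-side equivalent would be something like the following sketch (using boto3; the bucket and key names are made up), but I do not know whether this is how people actually work:

```python
# What I imagine the EC2 equivalent might look like: pull one object
# from S3 into memory on the instance and analyze it there.
# Bucket and key names are invented for illustration.
import io
import boto3
import pandas as pd

s3 = boto3.client("s3")
obj = s3.get_object(Bucket="my-timeseries-bucket", Key="2012/01.csv")
df = pd.read_csv(io.BytesIO(obj["Body"].read()), parse_dates=["timestamp"])
```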
I am very curious to know what other people are doing, especially startups whose entire infrastructure is based on EC2/S3.