On Amazon EC2, how often does a running instance crash? Has anyone experienced this?
I have used EC2 for about 6 months now. Last year instances crashed from time to time (I have 4 running, and one crashed on average once a month or so). In the last 3 months we have had no crashes at all. I would say Amazon has really beefed up their infrastructure now that EC2 is out of beta.
Bruce
We run our company infrastructure (corporate Web site, JIRA, Confluence and Subversion) on EC2; we've had no outages on any of the machines for about 6 months. Since EC2 came out of Beta last October, they have a proper SLA in place.
You can do a few things to mitigate EC2 outages:
- Create a machine image (AMI) of your exact configuration so that you can bring up a new instance right away in case of failure (and make sure you can instantiate your new AMI successfully before you need it!)
- Store critical data on Elastic Block Store volumes; these persist even if the EC2 instance goes down, and are more reliable than physical hard drives since their data is replicated.
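A rough sketch of both mitigations using boto3 (the AWS SDK for Python). The function names, the `recovery` name prefix, the 20 GiB size and the `/dev/sdf` device are placeholders chosen for illustration, not anything from the posts above. The functions take an already-built EC2 client, so region and credentials are up to you.

```python
from datetime import datetime, timezone


def backup_instance(ec2, instance_id, image_prefix="recovery"):
    """Create an AMI from an existing instance and return the new image ID.

    `ec2` is assumed to be a boto3 EC2 client, e.g. boto3.client("ec2").
    """
    # Timestamp keeps AMI names unique across repeated backups.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
    response = ec2.create_image(
        InstanceId=instance_id,
        Name=f"{image_prefix}-{instance_id}-{stamp}",
        # NoReboot=True avoids downtime, but the file system state
        # captured in the image is then not guaranteed to be consistent.
        NoReboot=True,
    )
    return response["ImageId"]


def add_data_volume(ec2, instance_id, availability_zone,
                    size_gib=20, device="/dev/sdf"):
    """Create an EBS volume, attach it to the instance, return the volume ID.

    The volume must be created in the same Availability Zone as the instance.
    """
    volume = ec2.create_volume(AvailabilityZone=availability_zone,
                               Size=size_gib)
    volume_id = volume["VolumeId"]
    # Block until the volume is ready before trying to attach it.
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])
    ec2.attach_volume(Device=device, InstanceId=instance_id,
                      VolumeId=volume_id)
    return volume_id


# Example usage (hypothetical instance ID and zone):
#   import boto3
#   ec2 = boto3.client("ec2", region_name="us-east-1")
#   ami_id = backup_instance(ec2, "i-0123456789abcdef0")
#   vol_id = add_data_volume(ec2, "i-0123456789abcdef0", "us-east-1a")
```

Whatever tooling you use, the point from the first bullet still applies: launch a test instance from the resulting AMI at least once, so you know it boots before you actually need it.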
I've had a Windows Server 2003 instance running for about 3 months now without any crashes.
I have had an Ubuntu 8.04 instance up for nearly a year (354 days today) with zero fuss. I use it as a test server for my web development projects. It has only disappeared once, and all I had to do was reboot it.
I will add my question to this thread.
I am writing a system to be hosted on EC2. It has been 4-5 months now and the product is almost complete. The investor has spent a fair amount of cash to make sure it is EC2-instance-crash-proof.
For 5 months we have had about 10 nodes up 24/7 (the production environment will probably be 10-20). Not a single machine went down.
Why do we have this urban legend of failing EC2 machines? I remember last year's outage (when heroku.com went down), but that was a DDoS, so anything would have gone down. Everyone is scared of waves and waves of dead machines.
Do you think that spending 3 times more money for an obviously better system is justified? To handle this kind of crash and keep the service available, the code base gets bigger, which means more bugs and harder maintenance.