We have created a product that could generate a very large number of requests for a data file that resides on our server. Currently we have a shared hosting server that runs a PHP script to query the DB and generate the data file for each user request. This is not efficient. It has not been a problem so far, but we want to move to a more scalable system, so we're looking into EC2. Our main concerns are being able to handle spikes of high traffic when they occur, and providing low latency to users downloading the data files.

I'm not 100% sure on how this is all going to work yet but this is the idea:

We use an EC2 instance to host our admin panel and to generate the files that are served to app users. When an admin makes a change that affects these data files (which are downloaded by users), we copy the regenerated files over to S3 and serve them through CloudFront. The idea here is to have the data cached and waiting on S3 so we can keep our compute times low, and to use CloudFront to get low latency for all users requesting the files.
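To make the idea concrete, here is a rough sketch of the publish step I have in mind: rebuild the file when an admin changes something, and upload it only if its contents actually changed. The `upload` callable, the `DataFilePublisher` name, and the key name are placeholders I made up; in a real setup `upload` would wrap whatever S3 client call we end up using, and the in-memory dict below just stands in for a bucket.

```python
import hashlib
import json
from typing import Callable, Dict, List

# Hypothetical uploader signature: (key, body) -> None.
# It is injected so this sketch runs without any AWS credentials.
Uploader = Callable[[str, bytes], None]


class DataFilePublisher:
    """Serializes records and uploads them only when the bytes changed."""

    def __init__(self, upload: Uploader) -> None:
        self._upload = upload
        self._last_digest: Dict[str, str] = {}

    def publish(self, key: str, records: List[dict]) -> bool:
        # Serialize deterministically so identical data yields identical bytes.
        body = json.dumps(records, sort_keys=True).encode("utf-8")
        digest = hashlib.sha256(body).hexdigest()
        if self._last_digest.get(key) == digest:
            return False  # nothing changed, skip the upload
        self._upload(key, body)
        self._last_digest[key] = digest
        return True


if __name__ == "__main__":
    bucket: Dict[str, bytes] = {}  # in-memory stand-in for an S3 bucket
    publisher = DataFilePublisher(lambda key, body: bucket.__setitem__(key, body))
    print(publisher.publish("data/v1.json", [{"id": 1, "name": "a"}]))  # True
    print(publisher.publish("data/v1.json", [{"id": 1, "name": "a"}]))  # False
```

The point is that user requests never trigger a DB query; the only compute happens when an admin saves a change.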

I am still learning the system and wanted to know if anyone had any feedback on this idea or insight into how it all might work. I'm also curious about the purpose of projects like Cassandra. My understanding is that simply putting our application on EC2 servers makes it scalable by the nature of the servers. Is Cassandra just about keeping resource usage low, or is there a reason to use a system like this even when on EC2?

CloudFront: http://aws.amazon.com/cloudfront/
EC2: http://aws.amazon.com/ec2/
Cassandra: http://cassandra.apache.org/

+2  A: 

Cassandra is a non-relational database engine. If that is what you need, you should first evaluate Amazon's SimpleDB, a hosted non-relational data store that is part of AWS.

If the file only needs to be updated on a schedule (daily, hourly, ...) then this seems like a reasonable solution. But you may want to consider placing a load balancer in front of two EC2 instances, each running a copy of your application. This would make it easier to scale later, and safer if one instance fails.
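As a toy illustration of what the load balancer buys you, here is a minimal round-robin sketch that skips an instance that is down. This is not how Amazon's load balancer is implemented; the `RoundRobinBalancer` class, the backend names, and the `is_healthy` check are all made up for the example.

```python
from typing import Callable, List


class RoundRobinBalancer:
    """Hands out backends in turn, skipping any that fail the health check."""

    def __init__(self, backends: List[str], is_healthy: Callable[[str], bool]) -> None:
        if not backends:
            raise ValueError("at least one backend is required")
        self._backends = list(backends)
        self._is_healthy = is_healthy
        self._next = 0  # index of the backend to try first on the next pick

    def pick(self) -> str:
        # Try each backend at most once, starting where the last pick left off.
        for _ in range(len(self._backends)):
            backend = self._backends[self._next]
            self._next = (self._next + 1) % len(self._backends)
            if self._is_healthy(backend):
                return backend
        raise RuntimeError("no healthy backend available")


if __name__ == "__main__":
    down = set()  # hypothetical record of failed instances
    lb = RoundRobinBalancer(["app-1", "app-2"], lambda b: b not in down)
    print([lb.pick() for _ in range(3)])  # ['app-1', 'app-2', 'app-1']
    down.add("app-2")                     # simulate one instance failing
    print([lb.pick() for _ in range(2)])  # ['app-1', 'app-1']
```

With two instances behind it, traffic keeps flowing to the survivor while you bring a replacement online.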

Some other services you should read up on:

http://aws.amazon.com/elasticloadbalancing/ -- Amazon's load balancer solution.

http://aws.amazon.com/sqs/ -- Used to pass messages between systems in your DA (distributed architecture), for example if you wanted the systems that create the data file to be different from the ones hosting the site.

http://aws.amazon.com/autoscaling/ -- Allows you to adjust the number of instances online based on traffic.
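To show the message-passing idea from the SQS item, here is a small sketch where the site only enqueues a "rebuild this file" message and a separate worker does the actual work. Python's in-process `queue.Queue` stands in for SQS here, and the message format, `request_rebuild`, and `worker` are assumptions for illustration, not the SQS API.

```python
import json
import queue
import threading
from typing import Dict


def request_rebuild(jobs: queue.Queue, file_key: str) -> None:
    """Called by the site/admin panel: enqueue a job and return immediately."""
    jobs.put(json.dumps({"file_key": file_key}))


def worker(jobs: queue.Queue, built: Dict[str, bytes]) -> None:
    """Runs on the file-building system: consume jobs until a None sentinel."""
    while True:
        message = jobs.get()
        try:
            if message is None:
                return  # sentinel: shut the worker down
            key = json.loads(message)["file_key"]
            # Placeholder for the real "query the DB and generate the file" step.
            built[key] = f"contents of {key}".encode("utf-8")
        finally:
            jobs.task_done()


if __name__ == "__main__":
    jobs: queue.Queue = queue.Queue()
    built: Dict[str, bytes] = {}
    thread = threading.Thread(target=worker, args=(jobs, built))
    thread.start()
    request_rebuild(jobs, "data/v1.json")
    request_rebuild(jobs, "data/v2.json")
    jobs.put(None)
    thread.join()
    print(sorted(built))  # ['data/v1.json', 'data/v2.json']
```

Because the two sides only share the queue, you can add more workers later without touching the site.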

Make sure to have a good backup process with EC2: snapshot your OS drive often and place any volatile data (e.g. database files) on an EBS volume. EC2 doesn't fail often, but when it does you don't have access to the hardware. If you have an up-to-date snapshot, you can just bring a new instance online.

eSniff
One other comment: CloudFront is most useful when your connections are coming in from overseas. If all your traffic is from US users, it may not be as useful. It basically turns S3 into a Content Delivery Network (CDN): http://bit.ly/2eILb
eSniff
A: 

Depending on the datasets, Cassandra can also significantly improve response times for queries.

There is an excellent explanation of the data structures used in NoSQL solutions that may help you decide whether this is an appropriate solution:

WTF is a Super Column

DBQ