Amazon Integration

I have my own CMS, which includes a file manager. Many of the files and formats that people can create are stored locally in a database. These are simple things like CSS files, basic content, etc.

The file manager can do everything that docs.google.com does. I actually based the entire methodology and design on the Google Docs browser.

Now, I am adding Amazon S3, so that my file manager will also display files uploaded to Amazon S3.

I have a few logistical questions.

All of my files and their hierarchical structure are stored in the assets and folders tables in my MySQL database. If I add Amazon S3, files will be uploaded to Amazon, and I want to know how I should integrate them.

I can do one of two things.

1. Going to Amazon every time

Whenever the user browses a particular folder, my script can also go off to Amazon and do something like:

$s3->listObjects();

Then I can merge the results of my database query with the S3 results. I could even cache them to avoid some of the performance issues.

2. Going to my database locally every time.

Alternatively, since I am following this structure for uploads (Client -> Server -> Amazon), I need to process the files on my server anyway. This means that I can store a lot of the details in my database. There would be very little need to go to Amazon to list the structure because I can look locally.
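
A rough sketch of option 2, again in Python for illustration only. It uses an in-memory SQLite database as a stand-in for MySQL, and the column names (`folder_id`, `size_bytes`, `s3_key`) are assumptions rather than my real schema. The idea is simply that the row is written when the server handles the upload, so browsing a folder never touches Amazon.

```python
import sqlite3


def open_db():
    """Create an in-memory database with a hypothetical assets table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        """CREATE TABLE assets (
               id INTEGER PRIMARY KEY,
               folder_id INTEGER NOT NULL,
               name TEXT NOT NULL,
               size_bytes INTEGER NOT NULL,
               s3_key TEXT UNIQUE
           )"""
    )
    return db


def record_upload(db, folder_id, name, size_bytes, s3_key):
    """Store the details of a file once it has been sent on to S3."""
    cur = db.execute(
        "INSERT INTO assets (folder_id, name, size_bytes, s3_key) "
        "VALUES (?, ?, ?, ?)",
        (folder_id, name, size_bytes, s3_key),
    )
    db.commit()
    return cur.lastrowid


def list_folder(db, folder_id):
    """List a folder from the local database only, sorted by name."""
    rows = db.execute(
        "SELECT name, size_bytes FROM assets WHERE folder_id = ? ORDER BY name",
        (folder_id,),
    )
    return rows.fetchall()
```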


What do you think is the best option? I think the second option. This has a few benefits.

Database Benefits

  • I am not querying Amazon constantly. (Cheaper as a result, as I think you pay for API usage per 1,000 requests.)
  • It will be faster
  • I do not have to merge the structure

Database Cons

  • I need to make sure that my database is always an exact copy of what is on Amazon. Could this be difficult?
  • I need to create a synchronisation script. This shouldn't be too hard?
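
The core of a synchronisation script could be as small as a set difference between the keys the database knows about and the keys actually in the bucket. This is a minimal sketch in Python for illustration; `diff_keys` is a hypothetical helper, and both inputs are assumed to be plain lists of key strings gathered elsewhere.

```python
def diff_keys(db_keys, s3_keys):
    """Report which keys exist on only one side."""
    db, s3 = set(db_keys), set(s3_keys)
    return {
        "missing_in_db": sorted(s3 - db),  # on S3 but not recorded locally
        "missing_in_s3": sorted(db - s3),  # recorded locally but not on S3
    }
```

What to do with each list (insert rows, re-upload, or delete) is a policy decision that the sketch leaves open.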
+1  A: 

I have a fair bit of experience using Amazon S3 for file storage for a website and you'll definitely want to go the database route.

S3 is way too slow to query all the time and, as you mentioned, you'll have the additional costs (albeit small). The speed becomes more of an issue the more files you have stored in a bucket, as listObjects() only returns 1000 at a time. The performance issues are easy to see simply by using any of the S3 tools (e.g. Bucket Explorer, CloudBerry, or even Amazon's own tools) to browse a bucket with lots of files.
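
To illustrate why the 1000-per-call limit hurts, here is a small sketch in Python (for illustration only; it does not use a real S3 client). `list_page` is a hypothetical callable that takes a continuation marker and returns one page of keys plus the next marker, and `make_fake_bucket` is a stand-in bucket so the loop can be run locally.

```python
def iter_all_keys(list_page):
    """Yield every key by following continuation markers until the
    listing is exhausted. Each call to list_page is one request."""
    marker = None
    while True:
        keys, marker = list_page(marker)
        yield from keys
        if marker is None:
            break


def make_fake_bucket(all_keys, page_size=1000):
    """Stand-in for a bucket: returns a list_page function that serves
    the sorted keys in pages of at most page_size."""
    ordered = sorted(all_keys)

    def list_page(marker):
        start = 0 if marker is None else ordered.index(marker) + 1
        page = ordered[start:start + page_size]
        more = start + page_size < len(ordered)
        return page, (page[-1] if more else None)

    return list_page
```

With 2,500 keys in the fake bucket this loop makes three sequential requests, and none of them can start until the previous one has returned.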

The extra effort required to ensure your database stays in sync with S3 is well worth it.

geoff
Thanks geoff. You have totally put my mind at rest. I thought that I was being a bit verbose and redundant going the database route, bloating my system for the sake of ease. Incidentally, do you upload directly to S3? I'm uploading to my server then to S3. I'm not sure how I should handle my batch uploads to S3. It seems like running it under the same PHP process as the page request would be insane. How would you recommend doing it?
Laykes
@Laykes - Firstly, my site is ASP.NET on Windows, so you'll have to adapt your process appropriately. I also upload to S3 via the server. I have a Windows service running that does the uploading - as soon as a browser upload occurs, I tell the service about it and leave it to do the uploading to S3. Our server is on EC2, so uploads to S3 are super fast. I guess you could also just have a process that runs every few minutes that looks for new files to upload. I certainly wouldn't do it in the PHP process though.
geoff
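
The "process that runs every few minutes" suggested above could look roughly like this. It is a minimal sketch in Python for illustration, not the Windows service described in the comment: `upload` is a hypothetical callable that sends one file to S3 and raises `OSError` on failure, and `max_attempts` is an arbitrary choice.

```python
from collections import deque


def process_pending(pending, upload, max_attempts=3):
    """Try to upload each queued path, requeueing failures until
    max_attempts is reached. Returns (done, failed) lists."""
    done, failed = [], []
    queue = deque((path, 1) for path in pending)
    while queue:
        path, attempt = queue.popleft()
        try:
            upload(path)
            done.append(path)
        except OSError:
            if attempt < max_attempts:
                queue.append((path, attempt + 1))  # retry after the others
            else:
                failed.append(path)
    return done, failed
```

Run from a scheduled job, this keeps the S3 transfer out of the page request; the list of pending paths would come from wherever the web process records new uploads.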
@Laykes - Part 2 - We also have a need to upload large (2GB) files to S3. We tried uploading directly to S3 using various third-party software or our own, but found reliability to be a problem (at least on our internet connection). Uploading to S3 doesn't have chunked or resume support, so if an upload failed 90% in, we'd have to start again. So those also go up via a server. We FTP to the server on EC2 and then move them up to S3 from there. Again, EC2 to S3 is really fast and reliable.
geoff
@geoff - That's reassuring. I'm currently in pre-pre-pre-alpha stages of development. I don't think I will be able to consider moving to EC2 until about February. I'm planning on being marketable around Christmas, but EC2 was always my final platform to go into production on. It doesn't seem that cheap though, unfortunately. Would you agree? Money isn't so much an issue, budget would be around $500-1000 per month, but I'm not sure if I can get something massively scalable for that price.
Laykes