views: 160

answers: 5

I have a file-hosting website that's burning through 2 Gbit/s of bandwidth, so I need to start adding secondary media servers to store the files. What would be the best way to manage a multi-server setup with a large number of files? Preferably through PHP only.

Currently I only have around 100 GB of files, so I could get a second server, mirror all content between them, and then round-robin the traffic 50/50, 33/33/33, etc. But once the total amount of files grows beyond the capacity of a single server, this won't work.
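For reference, the mirrored round-robin itself is trivial in PHP; a minimal sketch (hostnames are placeholders, and validation of the requested filename is omitted):

    <?php
    // Sketch of the fully mirrored setup: every server has every file, so any
    // request can go to any server. Hostnames below are placeholders.
    $mirrors = ['media1.example.com', 'media2.example.com', 'media3.example.com'];

    // Hashing the filename spreads traffic evenly and means a given file is
    // always served by the same mirror (cache-friendly).
    $file = $_GET['file'];
    $host = $mirrors[abs(crc32($file)) % count($mirrors)];
    header('Location: http://' . $host . '/' . rawurlencode($file));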

The idea I had was to keep a list of media servers in the DB along with the free space left on each. When a file is uploaded, PHP chooses which server the file actually goes to, spreading the files evenly among the servers.
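A minimal sketch of that idea, assuming a hypothetical media_servers table (id, hostname, free_bytes) and a files table mapping each file key to its server:

    <?php
    // Pick the server with the most free space that can hold the file.
    // Table and column names are placeholders for whatever schema you use.
    function pickUploadServer(PDO $db, int $fileSize): array {
        $stmt = $db->prepare(
            'SELECT id, hostname, free_bytes FROM media_servers
             WHERE free_bytes >= :size
             ORDER BY free_bytes DESC LIMIT 1'
        );
        $stmt->execute([':size' => $fileSize]);
        $server = $stmt->fetch(PDO::FETCH_ASSOC);
        if ($server === false) {
            throw new RuntimeException('No server has enough free space');
        }
        return $server;
    }

    // Remember where the file lives and decrement the cached free space.
    function recordUpload(PDO $db, array $server, string $key, int $fileSize): void {
        $db->prepare('INSERT INTO files (file_key, server_id) VALUES (?, ?)')
           ->execute([$key, $server['id']]);
        $db->prepare('UPDATE media_servers SET free_bytes = free_bytes - ? WHERE id = ?')
           ->execute([$fileSize, $server['id']]);
    }

The cached free_bytes value will drift as files come and go, so this would want a cron job that periodically polls each server's real disk usage and refreshes the column.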

I was hoping to get some more input/inspiration.

I can't use any 3rd-party services like Amazon. The files range from a few bytes to a gigabyte.

Thanks

A: 

Your best bet is really to get your files into some sort of storage that scales. Storing files locally should only be done with good reason (they are sensitive, private, etc.).

Moving your content into the cloud is the most direct way to do that. Mosso's CloudFiles or Amazon's S3 will both let you store an almost infinite number of files. All your content is then accessible through an API. If you want, you can use MySQL to track metadata for easy searching, and let the service handle the actual storage of the files.

Chris Henry
They are sensitive, and considering it's using a lot of bandwidth (for which I get a deal), storing them elsewhere isn't an option.
Yegor
A: 

I think your own idea is not the worst one: get a bunch of servers, and for every file store which server(s) it's on. If new files are uploaded, use most-free-space first*. Every server handles its own delivery (instead of piping through the main server).

pros:

  • Use multiple servers for a single file, e.g. for cutekitten.jpg: filepath="server1\cutekitten.jpg;server2\cutekitten.jpg", and then choose the server depending on the server load (or randomly, or alternating, ...); see the sketch after this list.

  • If you're careful you may be able to move files around automatically depending on the current load. So if your cute-kitten image gets reddited/slashdotted hard, move it to the server with the lowest load and update the entry.
    You could do this with a cron job: just log the downloads for the last xx minutes, try some formula like (downloads-per-minute * filesize * (product of server loads)) for weighting, and pick thresholds for increasing/decreasing the number of servers those files are distributed to.

  • If you add a new server, it's relatively painless (just add the address to the server pool)

cons:

  • Homebrew solutions are always risky

  • Your load-distribution algorithm must be well tested, otherwise bad things could happen (everything mirrored everywhere)

  • Constantly moving files around for balancing adds additional server load

* Or use a mixed weighting algorithm: free space, server load, file popularity
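A minimal sketch of the delivery side described in the first bullet, assuming the semicolon-separated filepath format above (forward slashes used here so the result doubles as a URL) and a hypothetical $serverLoads map (hostname => current load) kept fresh by some monitoring job:

    <?php
    // Given "server1/cutekitten.jpg;server2/cutekitten.jpg", pick the copy
    // on the least-loaded server and return a URL to it.
    function pickDownloadUrl(string $filepath, array $serverLoads): string {
        $best = null;
        $bestLoad = PHP_FLOAT_MAX;
        foreach (explode(';', $filepath) as $copy) {
            // The first path segment is the server name.
            [$server] = explode('/', $copy, 2);
            $load = $serverLoads[$server] ?? 0.0;
            if ($load < $bestLoad) {
                $bestLoad = $load;
                $best = $copy;
            }
        }
        return 'http://' . $best; // e.g. http://server1/cutekitten.jpg
    }

    // The rebalancing cron job could rank hot files by something like the
    // weighting formula above:
    // $weight = $downloadsPerMinute * $fileSize * array_product($serverLoads);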

Disclaimer: never been in the situation myself, just guessing.

Schnalle
A: 

Consider HDFS, which is part of Apache's Hadoop. It will integrate with PHP, but you'll be setting up a second application. It will also solve all your points about balancing among servers and handling things when your file-space usage exceeds one server's capacity. It's not purely PHP, but I don't think that's what you meant by "PHP only" anyway.

See http://hadoop.apache.org/core/docs/current/hdfs_design.html for the idea of it. It covers how HDFS handles large files, many files, replication, etc.
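As one rough sketch: newer Hadoop versions expose HDFS over HTTP (WebHDFS), which plain PHP can talk to without any extension. This assumes WebHDFS is enabled on the cluster; the namenode host, port, and path below are placeholders:

    <?php
    // Read a file straight out of HDFS via the WebHDFS REST interface.
    // Note the whole file is buffered in memory, so this is only reasonable
    // for smaller files.
    $namenode = 'http://namenode.example.com:50070';
    $path = '/files/cutekitten.jpg';

    // OPEN replies with a redirect to a datanode; file_get_contents follows it.
    $data = file_get_contents($namenode . '/webhdfs/v1' . $path . '?op=OPEN');
    if ($data === false) {
        die('HDFS read failed');
    }
    header('Content-Type: application/octet-stream');
    echo $data;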

Autocracy
+2  A: 

You could try MogileFS, a distributed file system with a good API for PHP. You can create categories and upload a file to a category; for each category you define how many servers it should be distributed across. You can then use the API to get a URL to that file on a random node.
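If you'd rather not depend on a client library, the tracker speaks a simple line-based text protocol that plain PHP sockets can handle. A rough sketch; the host, domain, and key are placeholders, and 7001 is the usual tracker port:

    <?php
    // Ask a MogileFS tracker for the URLs of a stored file's replicas.
    function mogileGetPaths(string $host, string $domain, string $key): array {
        $sock = fsockopen($host, 7001, $errno, $errstr, 5);
        if ($sock === false) {
            throw new RuntimeException("tracker connect failed: $errstr");
        }
        fwrite($sock, 'GET_PATHS domain=' . urlencode($domain)
                    . '&key=' . urlencode($key) . "\r\n");
        $line = fgets($sock);
        fclose($sock);

        if ($line === false || strpos($line, 'OK ') !== 0) {
            throw new RuntimeException('tracker error: ' . trim((string)$line));
        }
        // Response looks like: OK paths=2&path1=http://...&path2=http://...
        parse_str(trim(substr($line, 3)), $args);
        $paths = [];
        for ($i = 1; $i <= (int)($args['paths'] ?? 0); $i++) {
            $paths[] = $args["path$i"];
        }
        return $paths;
    }

    // Serve the file from a random replica, as suggested above.
    $urls = mogileGetPaths('tracker.example.com', 'media', 'cutekitten.jpg');
    header('Location: ' . $urls[array_rand($urls)]);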

Niels
+1  A: 

If you are doing as much data transfer as you say, whatever it is you are doing is evidently growing quite rapidly.

It might be worth your while to contact your hosting provider and see if they offer any shared-storage solutions via iSCSI, NAS, or other means. Ideally the storage would not only start out large enough to hold everything you have, but would also be able to grow dynamically beyond your needs. I know my hosting provider offers a solution like this.

If they do not, you might consider colocating your servers somewhere that either offers a service like that or would let you install your own storage server (which could be built cheaply from off-the-shelf components and software like FreeNAS or Openfiler).

Once you have a centralized storage platform, you can add web servers to your heart's content and load-balance them, all while they access the same central storage repository.

Not only is this the correct way to do it, it would give you much more redundancy and expandability in the future if your endeavor continues to grow at its current pace.

The other solutions offered, which use a database to track what is stored where, would work, but they add not only an extra layer of complexity but also an extra layer of processing between your visitors and the data they want to access.

What if you lost a hard disk? Would you lose 1/3 or 1/2 of all your data?

Should the heavy I/O of static content be on the same spindles as the rest of your operating-system and application data?

WerkkreW