Basically my dilemma is this. I have a list of x servers that host files. Another server hosts the site's MySQL db and application. When a file is uploaded (to the frontend server), the application checks which server has the most free space and moves the file there. This works fine if you start with 2+ empty servers with identical amounts of free space. But if you introduce another server into the mix at a later point, one with more free space than the current servers, this method isn't so effective: all the new files will be uploaded exclusively to the new server, which would overload it, since it will be handling most of the new traffic until it catches up with the rest of the boxes in terms of free space.

So I thought I'd introduce a weighting system as well, to help normalize the distribution of files. So if the 3 servers are weighted at 33% each, and 1 server has significantly more free space, it would still get more uploads than the other servers (even though it has the same weight), but the load would be spread out over all the servers.

Can anyone suggest a good php-only implementation of this?

+3  A: 

One approach would be to sum all available space on all of the servers that have the space to hold the file (so a server with available space but not enough to hold the file would obviously be excluded). Then determine the percentage of that space that each server accounts for (so a new server would account for a proportionally larger percentage). Use a random number and align it with the percentages to determine which server to pick.

For instance, consider having five servers with the following free space levels:

Server 1:   2048MB
Server 2:  51400MB
Server 3:   1134MB
Server 4: 140555MB

You need to store a 1500MB file. That knocks Server 3 out of the running, leaving us with 194003MB total free space.

Server 1:  1.1%
Server 2: 26.5%
Server 4: 72.4%

You then choose a random number between 0 and 100: 40

Numbers between 0 and 1.1 (inclusive) would go to Server 1
Numbers > 1.1 and <= 27.6 would go to Server 2
Numbers > 27.6 and <= 100 would go to Server 4

So in this case, 40 means it gets stored on Server 4.
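The approach above can be sketched in PHP roughly as follows. The server names and free-space figures come from the example; the function name `pickServer` is just an assumption for illustration.

```php
<?php
/**
 * Pick a server, weighted by free space, among servers that can
 * actually hold the file. (pickServer is a hypothetical name.)
 *
 * @param array $freeSpace map of server name => free space in MB
 * @param int   $fileSize  size of the file in MB
 * @return string|null     chosen server name, or null if none fits
 */
function pickServer(array $freeSpace, $fileSize) {
    // Exclude servers without enough room (Server 3 in the example).
    $eligible = array_filter($freeSpace, function ($free) use ($fileSize) {
        return $free >= $fileSize;
    });
    if (empty($eligible)) {
        return null;
    }

    // Sum the remaining free space (194003 MB in the example) and
    // draw a random point somewhere inside that total.
    $total = array_sum($eligible);
    $rand  = mt_rand() / mt_getrandmax() * $total;

    // Walk the cumulative distribution until we pass $rand, so each
    // server's chance is proportional to its share of free space.
    $cumulative = 0;
    foreach ($eligible as $server => $free) {
        $cumulative += $free;
        if ($rand <= $cumulative) {
            return $server;
        }
    }
    return $server; // guard against floating-point rounding at the top end
}

$servers = array(
    'Server 1' => 2048,
    'Server 2' => 51400,
    'Server 3' => 1134,
    'Server 4' => 140555,
);
echo pickServer($servers, 1500), "\n"; // usually "Server 4"
```

Because the weights are recomputed from live free space on every call, a new, emptier server automatically gets a larger share of uploads and the shares even out as it fills.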

Adam Robinson
Although on average a good approach, consider the small chance that 0 or 1 gets picked. Would you even want to leave the possibility open to store it on Server 1 at all?
Sev
And all the fractional numbers between 0 and 1 would go there too, since we're dealing with fractions here.
Sev
This is interesting, but it doesn't account for numeric weights that would also be assigned to each server.
Yegor
@Yegor: This creates the numeric weights. If you wanted to weight the servers by something other than available space, you could apply a similar concept of summing your weights and finding the proportional percentages, then use those percentages to apply the weighting to your free-space weights.
Adam Robinson
@Sev: I excluded server 3 because it had insufficient space. Obviously this is a little bit of an oversimplification. If a server was wholly disqualified, you could just leave it out. This approach just decides among the eligible servers (meaning ones you would actually be OK with storing the file on).
Adam Robinson
@Sev: I'm not sure what you mean about the fractions. That is taken care of in the approach.
Adam Robinson
+1  A: 

Traffic balancing is usually crucial. You can add some sort of weighting system to balance it (although, as you say, the new server will still be loaded more heavily than the others), or some other alternating method where one server never gets hit twice in a row, just as an example.

But I think I would probably artificially balance out the servers data so that they're almost equal to each other by moving content from one to the other, and then let the original or weighted/alternating algorithm do its job normally.

That's not a php-only implementation, but just some ideas to consider.

Sev
+1  A: 

A way to implement it is the following:

  1. Create an array of all the empty space, as a fraction, in your case { 0.5, 0.5, 1.0 }
  2. Create a second array of weights - the amount of space in the server divided by the total amount of space, as it is represented in the first array - { 0.25, 0.25, 0.5 }
  3. Get a random number, normalised to [0.0, 1.0], by calling 1.0*mt_rand()/mt_getrandmax()
  4. run the following loop:

    $total_weight = 0.0;
    for ($i = 0; $i < count($weights); $i++) {
      $total_weight += $weights[$i];
      if ($rand <= $total_weight) {
        return $i;
      }
    }
    

The returned value is the index of the server.
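Putting the four steps together, a complete runnable sketch might look like this. The function name `chooseServerIndex` is hypothetical, and the free-space fractions match the example arrays above.

```php
<?php
/**
 * Steps 1-4 above as one function. Takes free space per server as a
 * fraction (step 1's array) and returns the index of the chosen server.
 *
 * @param array $freeFractions e.g. array(0.5, 0.5, 1.0)
 * @return int
 */
function chooseServerIndex(array $freeFractions) {
    // Step 2: divide each entry by the total so the weights sum to 1.0
    // (array(0.5, 0.5, 1.0) becomes array(0.25, 0.25, 0.5)).
    $total = array_sum($freeFractions);
    $weights = array_map(function ($f) use ($total) {
        return $f / $total;
    }, $freeFractions);

    // Step 3: uniform random number in [0.0, 1.0].
    $rand = mt_rand() / mt_getrandmax();

    // Step 4: walk the cumulative weights until we pass $rand.
    $total_weight = 0.0;
    foreach ($weights as $i => $w) {
        $total_weight += $w;
        if ($rand <= $total_weight) {
            return $i;
        }
    }
    return count($weights) - 1; // guard against rounding at exactly 1.0
}

echo chooseServerIndex(array(0.5, 0.5, 1.0)); // 0, 1 or 2; index 2 twice as likely
```

The final `return` outside the loop just covers the floating-point edge case where rounding leaves the cumulative sum a hair under 1.0.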

David Rabinowitz
+1  A: 

You've entered the world of distributed filesystems -- a problem space larger than you likely anticipated.

A lot of work/research has been done in this field. You should consider using an available solution like MogileFS, or, at the very least, doing some research on how they solved the problems you've encountered (as well as the problems you haven't encountered yet)

For an example of what I mean by "problems you haven't encountered yet": shouldn't you actually be storing at least 2 copies of every file, so that if you lose one server, you haven't lost all the files on it? Of course, once you start doing that, shouldn't you be able to read parts of a single file from multiple servers simultaneously, for a performance gain? And of course, now you have to figure out how files are distributed, how they get redistributed when a server fails, when a new server comes online, and so on.

Doing this right is complicated. Don't reinvent the wheel if you can avoid it. And if you have to reinvent the wheel, at least spend some time looking at how other people built theirs.

Frank Farmer