Hi everybody,

I've been thinking about writing a somewhat lightweight, consistent-hashing-like PHP function to shard uploaded files between different servers.

Obviously, rand() would distribute the files among the servers fairly evenly, but when requesting a file later, there would be no way to know which server it lives on...

I know there are extensive libraries out there for consistent hashing, but I wonder how they work and how I can roll my own, very lightweight version?

Note: I'm not taking into account that servers will be removed, only that more will be added to the pool later on.

Thanks!

Update:

Here's a quick sketch in pseudocode:

$config['shards'] = array('192.168.1.1', '192.168.1.2');

function shard ($filename) {

    global $config; // the pool is defined outside the function's scope
    $servers = $config['shards'];

    // do lookup in some magic way to decide which server to return.

    return $appropriateserver;
}


echo shard('filename.jpg'); // returns the appropriate server to distribute the file.
+1  A: 

Well, one thing you could do would be to use crc32...

$crc = crc32($mykey);
// crc32() can return a negative int on 32-bit platforms, so normalize it
$serverNo = abs($crc) % count($servers);

It should be fairly consistent (meaning evenly balanced), and 100% reproducible...

ircmaxell
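A minimal sketch of this suggestion, dropped into the question's shard() signature (the IPs and filename are placeholders, and passing the pool as a parameter instead of using a global is an assumption on my part):

```php
<?php
// Hypothetical server pool; the IPs are placeholders.
$config['shards'] = array('192.168.1.1', '192.168.1.2', '192.168.1.3');

function shard($filename, $servers) {
    // crc32() may return a negative int on 32-bit builds, so normalize it.
    $crc = abs(crc32($filename));
    // The same filename always hashes to the same index,
    // so lookups are 100% reproducible without keeping a map.
    return $servers[$crc % count($servers)];
}

echo shard('filename.jpg', $config['shards']);
```

The distribution is only as even as crc32 is over your filenames, but for typical uploads it balances reasonably well.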
Hi ircmaxell. Thanks for your help! However, there's going to be an issue if I add a server to the pool, since the count changes and the modulo then maps almost every file to a different server.
Industrial
As far as I can see, short of keeping a map there is no way to do that. One possibility would be to re-distribute the files when you add a server, but that's non-trivial as well...
ircmaxell
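That remapping problem is exactly what a consistent-hashing ring addresses: each server is hashed to several points on a ring, and a file goes to the first server point at or above its own hash (wrapping around), so adding a server only moves the keys that fall into its new arcs rather than nearly all of them. A rough sketch of the idea; the virtual-node count and the use of crc32 as the ring hash are my assumptions, not something from this thread:

```php
<?php
// Build a ring: each server gets several virtual points so the
// load spreads more evenly around the ring.
function buildRing($servers, $replicas = 64) {
    $ring = array();
    foreach ($servers as $server) {
        for ($i = 0; $i < $replicas; $i++) {
            $ring[abs(crc32($server . '#' . $i))] = $server;
        }
    }
    ksort($ring); // order the points around the ring
    return $ring;
}

// Walk clockwise from the file's hash to the first server point.
function lookup($ring, $filename) {
    $hash = abs(crc32($filename));
    foreach ($ring as $point => $server) {
        if ($point >= $hash) {
            return $server;
        }
    }
    // Past the last point: wrap around to the first one.
    return reset($ring);
}

$ring = buildRing(array('192.168.1.1', '192.168.1.2'));
echo lookup($ring, 'filename.jpg');
```

When a third server is added, only the files whose hashes land in the new server's arcs change homes; everything else keeps resolving to its old server.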
Actually, I thought of another way, as long as you don't have a huge volume of files: store all the files on every server, then just pick the server you go to based on the crc. That way you're still sharding (because at any point in time, all requests for a particular file will go through a single server), but you're also avoiding the problem of adding or removing a server without breaking everything... Of course this depends upon your use case, but for some it may work...
ircmaxell
Hi again! Thanks a lot for your time and valuable thoughts, ircmaxell! That's definitely one idea, but it would take the total amount of data * the number of servers to store, which would be fine unless, as you say, there are huge amounts of data. At least reads would be distributed that way...
Industrial
It really depends upon your use case and your exact scenario. If you have very high traffic with low data requirements, this could be very effective. If you have high traffic with large data requirements, you may want to look into a distributed filesystem such as HDFS or MongoGrid. If you have low traffic with large data requirements, it may be better to just get a single server with a bunch of 2TB drives to handle all the requests (for now at least). The point is, the solution you come up with will need to be tailored to the needs of the application it's built for...
ircmaxell
Hi again! Yep, I can understand that. It's like everything else - a balance depending on what you're building and what you want to accomplish. Thanks a lot for your help. I have to put a lot of thought into this before I decide on something.
Industrial