tags:

views:

113

answers:

3

Hi -- Looking to shard a database and to assign different users to different home servers based on their user id. User IDs are 10 character strings, e.g., "f4gKUKkj91" ... each server has an ID of 1 - 1000. How can I create a hash function in php to uniquely and consistently assign each user id to a specific shard ? If the user id were an integer I could do userid % 1000 ... but since they are alphanumeric I'm not sure how to do this with even distribution in php.

Thank you!

+2  A: 

I'm not quite getting your problem, but it seems this link might be useful.

metrobalderas
+3  A: 

You could use crc32() which gives you a numerical hash of the alphanumeric userids.

allanmc
Thanks -- Could I do something like:$shard_num = NUM_SHARDS % abs(crc32($id));Would that give me even distribution?
ensnare
Well you would have to do it $shard_num = abs(crc32($id)) % NUM_SHARDS;I havn't tested it, but i think crc32 should provide even distribution.
allanmc
+1  A: 

This is not a perfect algorithm, as there will be a slight preference for smaller ID numbers. It assumes the user IDs are spread out fairly evenly, so to speak; if they're not, the distribution may not be good.

Figure out what your alphabet is and put it in a string like $str = '0123456789abcdefghijklmnopqrstuvwxxyzABCDEFGHIJKLMNOPQRSTUVXYZ'; This string has n characters. Now, we will essentially treat the user ID as a base n integer.

For each character, find its index in the string (0-based). Take this index and multiply it with nx, where x is the character position in your original string, starting with 0. Add all of these together, and take the modulo of the sum.

You probably only want to do this for a few characters - once you've read a few characters, the sum becomes quite big, and PHP can't handle it properly unless you resort to using functions suitable for large integer math (you can certainly use GMP and such, but it may not be ideal for your case). If you are using native integers, stop before the maximum possible sum goes beyond 2^31 (nx+nx+1+...+n).

You can use either start from the beginning or going backwards (going backwards corresponds to usual integer notation). One of them may be more suitable, depending on how the ID generation works.

Michael Madsen