views: 110

answers: 3

Hi,

I am implementing a memcached client library. I want it to support several servers, so I would like to add some load-balancing system.

Basically, you can do two operations on a server:

  • Store a value given its key.
  • Get a value given its key.

Let us say I have N servers (numbered 0 to N - 1). I'd like to have a repartition function which, given a key and the server count N, gives me an index in the [0, N) range.

unsigned int getServerIndex(const std::string& key, unsigned int serverCount);

The function should be as fast and simple as possible and must respect the following constraint:

getServerIndex(key, N) == getServerIndex(key, N); // i.e. no random return.

I wish I could do this without using an external library (like OpenSSL and its hashing functions). What are my options here?


Side notes:

Obviously, the basic implementation:

unsigned int getServerIndex(const std::string& key, unsigned int serverCount)
{
  return 0;
}

is not a valid answer, as it is not exactly a good repartition function :D


Additional information:

Keys will usually be any possible string within the ANSI charset (mostly [a-zA-Z0-9_-]). The length may be anything from a single character to whatever size you want.

A good repartition algorithm is one for which the probability of returning index a is equal (or close) to the probability of returning index b, for two different keys. The number of servers might change (rarely, though), and if it does, it is acceptable that the returned index for a given key changes as well.

A: 

What about something very simple like

hash(key) % serverCount
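
For illustration, here is a minimal sketch of this approach, assuming a hand-rolled 32-bit FNV-1a hash so that no external library is needed (any reasonable string hash could be substituted):

#include <cstddef>
#include <string>

unsigned int fnv1aHash(const std::string& key)
{
  // 32-bit FNV-1a: simple, fast, and a decent spread for short ASCII keys.
  unsigned int hash = 2166136261u;          // FNV offset basis
  for (std::size_t i = 0; i < key.size(); ++i)
  {
    hash ^= static_cast<unsigned char>(key[i]);
    hash *= 16777619u;                      // FNV prime
  }
  return hash;
}

unsigned int getServerIndex(const std::string& key, unsigned int serverCount)
{
  // Deterministic: the same key always yields the same index (serverCount > 0 assumed).
  return fnv1aHash(key) % serverCount;
}
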
Daniel Brückner
The main issue I have is... what `hash()` function should I use? ;)
ereOn
This has the major problem that adding a server requires redistributing every existing data item.
Nick Johnson
@Nick Johnson: It is acceptable. A caching system is by nature not reliable. If a server dies and is removed from the list, there will be cache misses anyway.
ereOn
@ereOn But it's not just an issue of cache misses (and bear in mind that this isn't just a few cache misses - this pretty much means invalidating almost your entire cache whenever a server is added or removed). Consider the following series of events: 1) I add an item with key 'a' and value '1'. It gets mapped to server 's1'. 2) Server 's2' joins. 3) I update 'a' with '2'. It's now mapped to server 's2'. 4) Server s2 parts. 5) I retrieve 'a' and get '1' from s1 - the item has reverted to an earlier value!
Nick Johnson
@Nick Johnson: Yes, sure. But by definition, the cache server is not reliable. In most cases, the values will be available. In worst-case scenarios (like the one you mentioned), they won't. Of course, having a solution that doesn't invalidate the entries when a server is added or removed would be great, but I think it is too much effort for too little gain. Most of my cache entries have a lifetime of 5 minutes, so a global invalidation is not that much of an issue.
ereOn
@ereOn That still doesn't address my point that using this strategy without some way to detect when a value is 'shadowed' by another server may lead to reversions. It's a given that cache is unreliable, but that's a far cry from being willing to accept reversions to previous values for elements!
Nick Johnson
@Nick Johnson: If the first value was still available in the cache, it means it had **not** expired yet anyway and that it is thus still **valid**. Many things can alter the cache: another piece of software might use the same key and store its own data. Memcached has a mechanism to help you detect when a value has been tampered with, which solves both cases.
ereOn
Then you have a use case very different from most people's. In general use, when you set a value in memcache, you expect it to replace any existing value with the same key, not merely shadow it. Having old values replace new ones at unpredictable times is thus a really bad idea - and I'm honestly not sure how you'd write an application that dealt with that gracefully.
Nick Johnson
+1  A: 

I think the hashing approach is the right idea. There are many simple hashing algorithms out there.

With the upcoming C++0x and the newly standardized unordered_map, hashing strings is becoming a standard operation. Many compilers already ship with a version of the STL that features a hash_map, and thus already provide a pre-implemented hash function.

I would start with those... but it would be better if we had more information on your strings: are they somehow constrained to a limited charset, or is it likely that there will be many similar strings?

The problem is that a "standard" hash might not produce a uniform distribution if the input is not uniformly distributed to begin with...

EDIT:

Given the information, I think the hash function already shipped with most STL implementations should work, since your keys do not seem to be concentrated in a narrow area of the input space. However, I am by no means an expert in probabilities, so take it with a grain of salt (and experiment).
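
As a rough sketch of what that could look like (assuming a compiler that already ships the C++0x std::hash; with older toolchains, std::tr1::hash or the vendor's hash_map hasher would play the same role):

#include <functional>
#include <string>

unsigned int getServerIndex(const std::string& key, unsigned int serverCount)
{
  // std::hash<std::string> is the same hasher that unordered_map uses internally.
  std::hash<std::string> hasher;
  return static_cast<unsigned int>(hasher(key) % serverCount);
}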

Matthieu M.
Thanks for your detailed answer. I edited my question to add the requested information.
ereOn
This answer only addresses the issue of how to hash the keys - which isn't actually the main thrust of the question.
Nick Johnson
`hash(key) % N` is the (trivial) missing step.
MSalters
+3  A: 

You're probably looking for something that implements consistent hashing. The easiest way to do this is to assign a random ID to each memcache server, and allocate each item to the memcache server which has the closest ID to the item's hash, by some metric.

A common choice for this - and the one taken by distributed systems such as Kademlia - would be to use the SHA1 hash function (though the exact hash function is not important), and to compare distances by XORing the hash of the item with the hash of the server and interpreting the result as a magnitude. All you need, then, is a way of making each client aware of the list of memcache servers and their IDs.

When a memcache server joins or leaves, it need only generate its own random ID, then ask its new neighbours to send it any items that are closer to its hash than to their own.
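
A very rough sketch of the XOR-metric selection described above; std::hash stands in for SHA1 here (the exact hash function is not important), and serverIds is a hypothetical list of the servers' random IDs that each client is assumed to know:

#include <cstddef>
#include <functional>
#include <limits>
#include <string>
#include <vector>

// Pick the server whose ID is "closest" to the key's hash under the XOR metric.
std::size_t pickServer(const std::string& key,
                       const std::vector<std::size_t>& serverIds)
{
  const std::size_t keyHash = std::hash<std::string>()(key);

  std::size_t bestIndex = 0;
  std::size_t bestDistance = std::numeric_limits<std::size_t>::max();

  for (std::size_t i = 0; i < serverIds.size(); ++i)
  {
    const std::size_t distance = keyHash ^ serverIds[i];  // XOR distance
    if (distance < bestDistance)
    {
      bestDistance = distance;
      bestIndex = i;
    }
  }
  return bestIndex;
}

With this scheme, when a server joins or leaves, only the keys whose closest ID changes get remapped, instead of almost every key moving as with a plain modulo.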

Nick Johnson
+1 For the highlights about "consistent hashing". I'll definitely give it a try. Thanks.
ereOn
Oh beautiful, you thus obtain an ad-hoc network of cache servers.
Matthieu M.