views: 110

answers: 3

Hi,

I am implementing a memcached client library. I want it to support several servers, so I would like to add some load-balancing system.

Basically, you can do two operations on a server:

  • Store a value given its key.
  • Get a value given its key.

Let us say I have N servers (numbered 0 to N - 1). I'd like to have a repartition function which, given a key and the server count N, gives me an index in the [0, N) range.

unsigned int getServerIndex(const std::string& key, unsigned int serverCount);

The function should be as fast and simple as possible and must respect the following constraint:

getServerIndex(key, N) == getServerIndex(key, N); // i.e. no random return.

I wish I could do this without using an external library (like OpenSSL and its hashing functions). What are my options here?


Side notes:

Obviously, the basic implementation:

unsigned int getServerIndex(const std::string& key, unsigned int serverCount)
{
  return 0;
}

is not a valid answer, as it is not exactly a good repartition function :D


Additional information:

Keys will usually be any possible string within the ANSI charset (mostly [a-zA-Z0-9_-]). The length may be anything from a single character to whatever size you want.

A good repartition algorithm is one for which the probability of returning index a is equal (or close) to the probability of returning index b, for two different keys. The number of servers might change (rarely, though), and if it does, it is acceptable that the returned index for a given key changes as well.

A: 

What about something very simple like

hash(key) % serverCount
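
For illustration, here is a minimal sketch of this approach, assuming a hand-rolled 32-bit FNV-1a hash so that no external library is needed (any reasonable string hash could be substituted):

#include <cstddef>
#include <string>

unsigned int fnv1aHash(const std::string& key)
{
  // 32-bit FNV-1a: simple, fast, and a decent spread for short ASCII keys.
  unsigned int hash = 2166136261u;          // FNV offset basis
  for (std::size_t i = 0; i < key.size(); ++i)
  {
    hash ^= static_cast<unsigned char>(key[i]);
    hash *= 16777619u;                      // FNV prime
  }
  return hash;
}

unsigned int getServerIndex(const std::string& key, unsigned int serverCount)
{
  // Deterministic: the same key always yields the same index (serverCount > 0 assumed).
  return fnv1aHash(key) % serverCount;
}
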
Daniel Brückner
The main issue I have is... what `hash()` function should I use? ;)
ereOn
This has the major problem that adding a server requires redistributing every existing data item.
Nick Johnson
@Nick Johnson: It is acceptable. A caching system is by nature not reliable. If a server dies and is removed from the list, there will be cache misses anyway.
ereOn
@ereOn But it's not just an issue of cache misses (and bear in mind that this isn't just a few cache misses - this pretty much means invalidating almost your entire cache whenever a server is added or removed). Consider the following series of events: 1) I add an item with key 'a' and value '1'. It gets mapped to server 's1'. 2) Server 's2' joins. 3) I update 'a' with '2'. It's now mapped to server 's2'. 4) Server s2 parts. 5) I retrieve 'a' and get '1' from s1 - the item has reverted to an earlier value!
Nick Johnson
@Nick Johnson: Yes, sure. But by definition, the cache server is not reliable. In most cases, the values will be available. In worst-case scenarios (like the one you mentioned), they won't. Of course, having a solution that doesn't invalidate the entries when a server is added or removed would be great, but I think it is too much effort for too little gain. Most of my cache entries have a lifetime of 5 minutes, so a global invalidation is not that much of an issue.
ereOn
@ereOn That still doesn't address my point that using this strategy without some way to detect when a value is 'shadowed' by another server may lead to reversions. It's a given that cache is unreliable, but that's a far cry from being willing to accept reversions to previous values for elements!
Nick Johnson
@Nick Johnson: If the first value was still available in the cache, it means it had **not** expired yet anyway and that it is thus still **valid**. Many things can alter the cache: another piece of software might use the same key and store its own data. Memcached has a mechanism to help you detect when a value has been tampered with, which solves both cases.
ereOn
Then you have a use case very different from most people's. In general use, when you set a value in memcache, you expect it to replace any existing value with the same key, not merely shadow it. Having old values replace new ones at unpredictable times is thus a really bad idea - and I'm honestly not sure how you'd write an application that dealt with that gracefully.
Nick Johnson
+1  A: 

I think the hashing approach is the right idea. There are many simple hashing algorithms out there.

With the upcoming C++0x and the newly standardized unordered_map, hashing strings is becoming a standard operation. Many compilers already ship with a version of the STL that features a hash_map, and thus already provide a pre-implemented hash function.

I would start with those... but it would be better if we had more information on your strings: are they somehow constrained to a limited charset, or is it likely that there will be many similar strings?

The problem is that a "standard" hash might not produce a uniform distribution if the input is not uniformly distributed to begin with...

EDIT:

Given the information, I think the hash function already shipped with most STL implementations should work, since your keys do not seem to be concentrated in a narrow area of the input space. However, I am by no means an expert in probabilities, so take it with a grain of salt (and experiment).
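
As a rough sketch of what that could look like (assuming a compiler that already ships the C++0x std::hash; with older toolchains, std::tr1::hash or the vendor's hash_map hasher would play the same role):

#include <functional>
#include <string>

unsigned int getServerIndex(const std::string& key, unsigned int serverCount)
{
  // std::hash<std::string> is the same hasher that unordered_map uses internally.
  std::hash<std::string> hasher;
  return static_cast<unsigned int>(hasher(key) % serverCount);
}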

Matthieu M.
Thanks for your detailed answer. I edited my question to add the requested information.
ereOn
This answer only addresses the issue of how to hash the keys - which isn't actually the main thrust of the question.
Nick Johnson
`hash(key) % N` is the (trivial) missing step.
MSalters
+3  A: 

You're probably looking for something that implements consistent hashing. The easiest way to do this is to assign a random ID to each memcache server, and allocate each item to the memcache server which has the closest ID to the item's hash, by some metric.

A common choice for this - and the one taken by distributed systems such as Kademlia - would be to use the SHA1 hash function (though the exact hash function is not important), and to compare distances by XORing the hash of the item with the hash of the server and interpreting the result as a magnitude. All you need, then, is a way of making each client aware of the list of memcache servers and their IDs.

When a memcache server joins or leaves, it need only generate its own random ID, then ask its new neighbours to send it any items that are closer to its hash than to their own.
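
A very rough sketch of the XOR-metric selection described above; std::hash stands in for SHA1 here (the exact hash function is not important), and serverIds is a hypothetical list of the servers' random IDs that each client is assumed to know:

#include <cstddef>
#include <functional>
#include <limits>
#include <string>
#include <vector>

// Pick the server whose ID is "closest" to the key's hash under the XOR metric.
std::size_t pickServer(const std::string& key,
                       const std::vector<std::size_t>& serverIds)
{
  const std::size_t keyHash = std::hash<std::string>()(key);

  std::size_t bestIndex = 0;
  std::size_t bestDistance = std::numeric_limits<std::size_t>::max();

  for (std::size_t i = 0; i < serverIds.size(); ++i)
  {
    const std::size_t distance = keyHash ^ serverIds[i];  // XOR distance
    if (distance < bestDistance)
    {
      bestDistance = distance;
      bestIndex = i;
    }
  }
  return bestIndex;
}

With this scheme, when a server joins or leaves, only the keys whose closest ID changes get remapped, instead of almost every key moving as with a plain modulo.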

Nick Johnson
+1 For the highlights about "consistent hashing". I'll definitely give it a try. Thanks.
ereOn
Oh beautiful, you thus obtain an ad-hoc network of cache servers.
Matthieu M.