I am looking for a hashcode with the following properties. I don't know of any such hashcode, and I don't know whether such a thing is even possible. I just wanted to put it out there and see what people say.

I have two databases (a loosely used term; don't think of SQL or anything of that sort), one master and one backup. I need to keep the two in sync and to detect when they are not. Instead of verifying every piece of data, it would be preferable to keep a hashcode for each that can be compared. But the backup does not necessarily see every modification: since changes from master to backup are batched, some intermediate modifications may be collapsed.

E.g., say the current state of the database has the elements A->X, B->Y, and C->Z. Now B gets modified so that B->Y1, and then later B->Y2. The only change that gets sent from master to backup is B->Y2; the intermediate B->Y1 is skipped.

Now, instead of looping over every element in each database to verify that they match, we would prefer to keep a running hashcode of the elements in both locations and just compare those. The hashcode would be computed something like this:

assuming a previous hashcode hm0:
hashcode hm1 = f(hm0, A->X, B->Y, C->Z)

When B changes now:
hashcode hm2 = f(hm1, B->Y1)
and then
hashcode hm3 = f(hm2, B->Y2)

So the master will have hashcode hm3. Now the backup will not receive the intermediate modification B->Y1, so if it computes a running hashcode, it will look like this:

hashcode hb1 = f(hb0, A->X, B->Y, C->Z)
hashcode hb2 = f(hb1, B->Y2)

Now we want hb2 and hm3 to match, since the current states of the databases are the same. But most (if not all) hashcodes don't work this way.

So what we want instead is to "remove" the contribution of B->Y from the hash first, then "add" the contribution of B->Y1, then remove the contribution of B->Y1 and add the contribution of B->Y2. So we want something like this:

Two functions, f and g: f modifies an existing hashcode by adding the contribution of a new element, while g modifies an existing hashcode by removing the contribution of an element.

On master:
hm1 = f(hm0, A->X, B->Y, C->Z)

when B gets modified to B->Y1:
hm2 = g(hm1, B->Y)
hm3 = f(hm2, B->Y1)

when B gets modified to B->Y2:
hm4 = g(hm3, B->Y1)
hm5 = f(hm4, B->Y2)

hm5 is the new hashcode for the current state of the database (A->X, B->Y2, C->Z).

On backup:
hb1 = f(hb0, A->X, B->Y, C->Z)

when B gets modified to B->Y2:
hb2 = g(hb1, B->Y)
hb3 = f(hb2, B->Y2)

Now hm5 and hb3 should match, as the current state of both databases is the same.
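
In code, the interface I am after might look something like this sketch (Python; the integer hash type, the Entry alias, and everything beyond the names f and g are just placeholders):

    from typing import Hashable, Tuple

    Entry = Tuple[Hashable, Hashable]  # one key->value pair, e.g. ("B", "Y2")

    def f(h: int, *entries: Entry) -> int:
        """Fold the contribution of new entries into the running hashcode h."""
        ...

    def g(h: int, *entries: Entry) -> int:
        """Remove the contribution of previously folded-in entries from h."""
        ...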

So: do such algorithms f and g exist? I hope I've made the question clear. Thanks.

+1  A: 

Just add and subtract your codes. With h(x) being any hash function:

hm2 = hm1 - h(B->Y)
hm3 = hm2 + h(B->Y1)
hm4 = hm3 - h(B->Y1) 
hm5 = hm4 + h(B->Y2)

hb2 = hb1 - h(B->Y)
hb3 = hb2 + h(B->Y2)

hm5 and hb3 are equal.

Note that it does not have to be addition and subtraction; any invertible operation will work. (Multiplication/division can also work in principle, but overflow is messier, and a hash value of 0 would destroy information.)
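
To make this concrete, a minimal sketch (Python; truncating SHA-256 to 64 bits for h and doing the arithmetic modulo 2**64 are illustrative choices, not part of the answer):

    import hashlib

    MASK = (1 << 64) - 1  # work modulo 2**64 so add/subtract wrap without overflow

    def h(key: str, value: str) -> int:
        """Hash a single key->value entry down to a 64-bit integer."""
        digest = hashlib.sha256(f"{key}->{value}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    def f(acc: int, key: str, value: str) -> int:
        """Add an entry's contribution to the running hashcode."""
        return (acc + h(key, value)) & MASK

    def g(acc: int, key: str, value: str) -> int:
        """Remove a previously added entry's contribution."""
        return (acc - h(key, value)) & MASK

    # Master sees every modification: B->Y, then B->Y1, then B->Y2.
    hm = f(f(f(0, "A", "X"), "B", "Y"), "C", "Z")
    hm = f(g(hm, "B", "Y"), "B", "Y1")
    hm = f(g(hm, "B", "Y1"), "B", "Y2")

    # Backup only sees the collapsed modification B->Y2.
    hb = f(f(f(0, "A", "X"), "B", "Y"), "C", "Z")
    hb = f(g(hb, "B", "Y"), "B", "Y2")

    assert hm == hb  # same final state, same running hashcode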

CookieOfFortune
Ah, good idea! You may want to use XOR instead of addition/subtraction so you don't need to worry about overflow.
Brian Campbell
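
A minimal illustration of that comment, reusing the 64-bit entry hash from the sketch above (again, purely illustrative): since XOR is its own inverse, f and g collapse into a single function.

    import hashlib

    def h64(key: str, value: str) -> int:
        """64-bit entry hash (truncated SHA-256), as in the sketch above."""
        digest = hashlib.sha256(f"{key}->{value}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    def toggle(acc: int, key: str, value: str) -> int:
        """XOR the entry's hash: applying it once adds, applying it again removes."""
        return acc ^ h64(key, value)

One caveat: with XOR, two identical entries cancel out, so this is only safe when each entry appears at most once, as it does in the key->value map here.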
A: 

Hmm. I'm not sure about a hash function that does exactly what you are asking for, but a structure similar to the one Git uses to store its revisions (itself inspired by how Monotone stored its revisions) might do what you need.

Git computes the SHA-1 sum of each file in the repository and uses these as blob identifiers. It then has a tree, which maps filenames to blob IDs (and to other sub-trees, for subdirectories). The identifier of a tree is its SHA-1 sum. (Though I don't believe it's relevant to your usage, trees are in turn referred to by revisions, which include things like author, date, and one or more parent revisions.)

This means you don't have to recompute the SHA-1 sum of every blob when you update one; you just recompute the SHA-1 of the blob that changed, and then recompute the SHA-1 of the tree.

You could do the same with your data. Compute the hash of each of your objects, put all the key->hash(value) mappings into one file, and compute the hash of that file. If that file is too big for you to want to re-hash it on every change, divide it into sections and keep a top-level listing of section->hash(section), where each section holds the key->hash(value) entries. One level of branching should be sufficient for most cases, but you can build a deeper tree structure out of these if you really need to.
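
A rough sketch of the one-level version (Python; the fixed section count and assigning keys to sections by hashing are assumptions, and a real implementation would cache section hashes and recompute only the sections an update touches rather than rebuilding everything as this sketch does):

    import hashlib

    NUM_SECTIONS = 16  # one level of branching, per the suggestion above

    def sha1_hex(data: bytes) -> str:
        return hashlib.sha1(data).hexdigest()

    def section_of(key: str) -> int:
        """Assign each key to a section by hashing the key."""
        return int(sha1_hex(key.encode()), 16) % NUM_SECTIONS

    def tree_hash(db: dict[str, str]) -> str:
        """Hash each section's key->hash(value) listing, then hash the
        listing of section hashes to get one top-level identifier."""
        sections: list[list[str]] = [[] for _ in range(NUM_SECTIONS)]
        for key in sorted(db):  # sort for a deterministic listing
            sections[section_of(key)].append(f"{key}->{sha1_hex(db[key].encode())}")
        section_hashes = [sha1_hex("\n".join(s).encode()) for s in sections]
        return sha1_hex("\n".join(section_hashes).encode())

    # The two stores agree exactly when their top-level hashes agree.
    assert tree_hash({"A": "X", "B": "Y2", "C": "Z"}) == tree_hash(
        {"C": "Z", "B": "Y2", "A": "X"}
    )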

Brian Campbell