Best Hash function for detecting data changes?

views:

512

answers:

+2 Q:

Best Hash function for detecting data changes?

We have a pricing dataset that changes the contained values or the number of records. The number of added or removed records is small compared to the changes in values. The dataset usually has between 50 and 500 items with 8 properties.

We currently use AJAX to return a JSON structure that represents the dataset and update a webpage using this structure with the new values and where necessary removing or adding items.

We make the request with two hash values, one for the values and another for the records. These are MD5 hashes returned with the JSON structure to be sent with a following request. If there is a change to the hashes we know we need a new JSON structure otherwise the hashes are just returned to save bandwidth and eliminate unnecessary client-side processing.

As MD5 is normally used with encryption is the best choice of hashing algorithm for just detecting data changes?

What alternative ways can we detect a change to the values and update as well as detecting added or removed items and manipulating the page DOM accordingly?

+5 A:

MD5 is a reasonable algorithm to detect changes to a set of data. However, if you're not concerned with the cryptographic properties, and are very concerned with the performance of the algorithm, you could go with a simpler checksum-style algorithm that isn't designed to be cryptographically secure. (though weaknesses in MD5 have been discovered in recent years, it's still designed to be cryptographically secure, and hence does more work than may be required for your scenario).

However, if you're happy with the computational performance of MD5, I'd just stick with it.

Jonathan 2009-04-16 14:37:24

I think that any commonly used hash function will do what you want - provide a unique representation of an entity.

For the problem you are trying to solve, my solution would be to have a backend table that records all changes. Not the changes themselves, but an identifier of the rows that have changed. On a periodic basis callback to the server and get a list of all the objects that have changed, and use this to decide on the client which rows need updating/deleting/adding.

Visage 2009-04-16 14:39:15

This is a common misunderstanding. Hash functions do not "provide a unique representation of an entity". In fact, it is guaranteed not to be the case for any hash function whose domain is larger than its range.

recursive 2009-04-16 14:44:35

What you're doing sounds pretty good to me.

If server-side capacity is cheap and minimising network usage is crucial, you could have the server remember, for each client, what it's last dataset was, and send only the differences (as a list of insertions, deletions and edits) on each request. If you sort your data rows first, these differences can be calculated fairly efficiently using a differencing algorithm such as that used by diff.

This approach is sensitive to network outages -- if one response is not received by the client, errors will accumulate. However this can be remedied by having the client sent the MD5 hash with each request: if it is different than what the server expects, an entire list will be sent instead of a list of changes.

j_random_hacker 2009-04-16 14:43:28

+1 A:

MD5 is just fine. Should it have too low performance, you can try fast checksum algorithm, such as for example Adler-32.

vartec 2009-04-16 14:47:47

I agree with Jonathan's answer regarding MD5. As for alternative ways to detect changes, if you are willing to store (or already store) on the server the time/date of the most recent change, you could pass that back and forth to the client. You avoid the computation entirely and you might even be able to use most of your existing code.

--
bmb

bmb 2009-04-16 15:14:22

ansaurus

tags:

views:

answers:

Best Hash function for detecting data changes?

related questions