I have a database schema for an integration project in which I need to be able to query for records that have changed, but only based on a given set of fields within that record.

So, for instance, here's an example table:

CUSTOMERS

  • ID
  • Name
  • Phone
  • Fax
  • Balance

I need to be able to query for records whose Name, Phone, or Fax fields have changed. Other fields should not be taken into account: if only the Balance field changes, my query should not pull that record in (so a timestamp field that updates automatically whenever the record is modified does not work on its own).

Also, this has to run on a number of different databases and platforms, so TRIGGERS or anything similar are not really an option unless they'll run on MySQL, PostgreSQL, SQL Server, and SQLite.

The fields are modified by a third-party application that I can't modify, so I can't just add a flag and have the third-party application set the flag to TRUE whenever it modifies a relevant field.

My initial solution is to calculate a HASH of the relevant fields and store it in a new field, 'LastHash' or something. Then I can calculate the hash of the relevant fields currently in the record, and if it doesn't match the stored LastHash, I know the record has changed.

That seems pretty messy... but it seems like it will work. Is there a better way? If not, is there a good way to implement that hash so it's efficient and not too time-consuming to extract those changed records?
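In case it helps to see it concretely, here's a minimal sketch of that hash-and-compare approach (SQLite used purely for illustration; the `LastHash` column and the `\x1f` field separator are assumptions, not anything the third-party app knows about):

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE CUSTOMERS (
    ID INTEGER PRIMARY KEY, Name TEXT, Phone TEXT, Fax TEXT,
    Balance REAL, LastHash TEXT)""")

def field_hash(name, phone, fax):
    # Hash only the fields whose changes matter. A separator byte avoids
    # ambiguity like ("ab", "c") vs ("a", "bc") hashing identically.
    payload = "\x1f".join("" if v is None else str(v) for v in (name, phone, fax))
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

def changed_records(conn):
    # Re-hash the relevant fields of every row and compare against the
    # stored LastHash; a mismatch means a tracked field changed.
    changed = []
    for rid, name, phone, fax, stored in conn.execute(
            "SELECT ID, Name, Phone, Fax, LastHash FROM CUSTOMERS"):
        if field_hash(name, phone, fax) != stored:
            changed.append(rid)
    return changed
```

A Balance-only update leaves the hash of (Name, Phone, Fax) unchanged, so it's invisible to `changed_records`, which is exactly the property the question asks for.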

EDIT

Some clarifications: Both my application and the other application update and insert into these tables. I can make my application calculate the initial hash. I can't make the other application calculate it though.

Timestamp columns that automatically update whenever a record changes are doable; those are easy enough to replicate in all of these database systems using different column types or very simple triggers.
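For instance, in SQLite an auto-updating timestamp takes a one-line trigger, since there's no `ON UPDATE` column clause the way MySQL's `TIMESTAMP ... ON UPDATE CURRENT_TIMESTAMP` provides (sketch; the `ModifiedAt` column name is an assumption):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE CUSTOMERS (
    ID INTEGER PRIMARY KEY, Name TEXT, Phone TEXT, Fax TEXT,
    Balance REAL, ModifiedAt INTEGER DEFAULT (strftime('%s','now')))""")

# SQLite fires this after any UPDATE and stamps the row with the current
# Unix time. (Recursive triggers are off by default, so the trigger's own
# UPDATE does not re-fire it.)
conn.execute("""CREATE TRIGGER customers_touch AFTER UPDATE ON CUSTOMERS
    BEGIN
        UPDATE CUSTOMERS SET ModifiedAt = strftime('%s','now')
        WHERE ID = NEW.ID;
    END""")
```

Each engine would get its own equivalent (MySQL's `ON UPDATE`, a PostgreSQL/SQL Server trigger), but the column the application reads stays the same.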

ADDITIONAL QUESTION

If hashing is the way to go... is there any sort of efficient hash algorithm that won't take forever to calculate on all of these records? MD5 or SHA1 might work, but they seem like they'd be sllloowwww.
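For a few short text fields per row, MD5 is typically much faster than it sounds; if even that is too slow, a non-cryptographic checksum like CRC32 is faster still, at the cost of a far higher collision rate (32 bits vs. 128). A quick comparison of the two, on a made-up row value:

```python
import hashlib
import zlib

# Illustrative row payload: Name, Phone, Fax joined with a separator byte.
row = "Acme Corp\x1f555-0100\x1f555-0199"

# 128-bit digest: collision risk is astronomically small.
md5_digest = hashlib.md5(row.encode("utf-8")).hexdigest()

# 32-bit checksum: cheapest to compute, but collisions are plausible at scale.
crc = zlib.crc32(row.encode("utf-8")) & 0xFFFFFFFF
```

Either way, the hashing itself is rarely the bottleneck; reading every row to hash it is (which is where the timestamp pre-filter mentioned below the fold helps).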

+2  A: 

That's a tough one. You're still going to have to table scan (or index scan), since YOU have to calculate the new hash and compare it to the stored one.

If triggers are not possible because of cross-platform concerns, you might be able to have the database engine calculate the current hash (i.e. a persisted computed column, effectively like a trigger). That has the same cross-platform problem, though! Then, if you index both the current hash and your stored hash, the search becomes relatively easy.

Can you at least use the timestamp field to reduce the number of hashes you need to check?
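Combining that timestamp pre-filter with the hash check might look like this sketch (SQLite for illustration; the `ModifiedAt` column, `LastHash` column, and separator byte are assumed names):

```python
import hashlib
import sqlite3

def field_hash(name, phone, fax):
    # Hash only the tracked fields, separator-joined to avoid ambiguity.
    parts = "\x1f".join("" if v is None else str(v) for v in (name, phone, fax))
    return hashlib.md5(parts.encode("utf-8")).hexdigest()

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE CUSTOMERS (
    ID INTEGER PRIMARY KEY, Name TEXT, Phone TEXT, Fax TEXT,
    Balance REAL, LastHash TEXT, ModifiedAt INTEGER)""")

def changed_since(conn, last_sync):
    # The timestamp is a cheap pre-filter: only rows touched since the
    # last sync get re-hashed, instead of scanning the whole table.
    return [rid for rid, name, phone, fax, stored in conn.execute(
                "SELECT ID, Name, Phone, Fax, LastHash FROM CUSTOMERS "
                "WHERE ModifiedAt > ?", (last_sync,))
            if field_hash(name, phone, fax) != stored]
```

An index on `ModifiedAt` keeps the pre-filter cheap; the hash comparison then only runs over the (hopefully small) set of recently touched rows.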

Another thing to remember is that there's no such thing as a perfect hash function, so you could potentially have false negatives (inadvertent hash collision leads to a change not being detected). Is that (astronomically small) risk worth taking?

Cade Roux
The idea of using the timestamps together with the hash is a good one, I like that. That should help performance quite a bit. I think the likelihood of finding a hash collision is pretty low. I could use SHA1 or something if I found MD5 wasn't enough.
Keith Palmer
A: 

I would standardize how your application checks for a difference, not how the database implements it. Try something like a view with a particular column that signifies a change, then use whatever tricks each database offers to make that view a reality. Code that depends on checking for this difference would then be identical everywhere, using the same view and column.
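One way that view idea could be realized (a sketch, not a prescription: the `FIELD_HASH` UDF and `CUSTOMER_CHANGES` names are illustrative, and each engine would supply its own hash function, e.g. `MD5()` in MySQL or `HASHBYTES()` in SQL Server):

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE CUSTOMERS (
    ID INTEGER PRIMARY KEY, Name TEXT, Phone TEXT, Fax TEXT,
    Balance REAL, LastHash TEXT)""")

def md5_fields(name, phone, fax):
    parts = "\x1f".join("" if v is None else str(v) for v in (name, phone, fax))
    return hashlib.md5(parts.encode("utf-8")).hexdigest()

# SQLite has no built-in hash function, so register one as a UDF; other
# engines would use their native function inside the same view definition.
conn.create_function("FIELD_HASH", 3, md5_fields)

# The application only ever queries this view and its Changed column,
# regardless of how each backend computes the hash.
conn.execute("""CREATE VIEW CUSTOMER_CHANGES AS
    SELECT ID, (FIELD_HASH(Name, Phone, Fax) <> LastHash) AS Changed
    FROM CUSTOMERS""")
```

The per-engine view definitions differ, but the calling code (`SELECT ID FROM CUSTOMER_CHANGES WHERE Changed = 1`) stays the same everywhere, which is the point of the suggestion.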

KM
Splitting the columns into two tables *is not* an option. I can't change the other third-party application to accommodate that database change.
Keith Palmer
@Keith Palmer, just reread the question
KM