views:

156

answers:

2

I am looking for a way to quickly compare the state of a database table with the results of a Web service call.

I need to make sure that all records returned by the Web service call exist in the database, and any records in the database that are no longer in the Web service response are removed from the table.

I have two problems to solve:

  1. How do I quickly compare a data structure with the results of a database table?
  2. When I find a difference, how do I quickly add what's new and remove what's gone?

For number 1, I was thinking of doing an MD5 of a data structure and storing it in the database. If the MD5 is different, then I'd move to step 2. Are there better ways of comparing response data with the state of a database?
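One pitfall with the checksum idea: hash/dict key order can vary between calls, so hash a canonical serialization of the structure rather than its default string form. A sketch in Python (the `fingerprint` helper name is made up):

```python
import hashlib
import json

def fingerprint(records):
    """Return an MD5 hex digest of a data structure.

    Serializing with sorted keys makes the digest stable across runs
    even when dictionary key order differs.
    """
    canonical = json.dumps(records, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()

# Same data with different key order yields the same digest.
a = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]
b = [{"name": "alice", "id": 1}, {"name": "bob", "id": 2}]
```

Store the digest alongside the record set; if the stored digest matches the digest of the fresh response, you can skip the sync entirely.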

I need more guidance on number 2. I can easily retrieve all records from a table (SELECT * FROM users WHERE user_id = 1) and then loop through an array, adding what's not in the DB and building another array of items to be removed in a subsequent call, but I'm hoping for a better (faster) way of doing this. What is the best way to compare and sync a data structure with a subset of a database table?
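For what it's worth, if each record can be reduced to a hashable tuple of its columns, plain set arithmetic gives you both lists in one pass instead of a nested loop. A sketch with made-up rows:

```python
# Made-up rows, each keyed as a (user_id, name) tuple.
db_rows = {(1, "alice"), (2, "bob"), (3, "carol")}   # current table contents
ws_rows = {(1, "alice"), (3, "carol"), (4, "dave")}  # web service response

to_insert = ws_rows - db_rows  # in the response but missing from the DB
to_delete = db_rows - ws_rows  # in the DB but gone from the response
```

Set difference is roughly O(n) on average, versus O(n*m) for looping over one array while searching the other.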

Thanks for any insight into these issues!

A: 

Don't kill yourself doing premature optimization. Go with the simple approach of inserting each row one at a time. If you find you're having transactional issues, like the table being locked for too long while you loop, you could insert the rows into a temporary table first and then do a single insert into the real destination table.

If you were using SQL Server you could do bulk inserts, or package the data into XML, but I'd still highly recommend implementing it the easy way first. Then test it, ideally with production data (or the same quantity of data), and look to optimize only if you need to.
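The staging-table idea looks roughly the same in any SQL database. A minimal sketch using Python's sqlite3 module, with a hypothetical users table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE users (user_id INTEGER, name TEXT)")
# TEMP tables are private to this connection, so the slow row-by-row
# loading never touches the real table.
cur.execute("CREATE TEMP TABLE staging (user_id INTEGER, name TEXT)")

rows = [(1, "alice"), (2, "bob"), (3, "carol")]
cur.executemany("INSERT INTO staging VALUES (?, ?)", rows)

# One statement moves everything over, so users is locked only briefly.
cur.execute("INSERT INTO users SELECT * FROM staging")
conn.commit()
```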

JoshBerke
+1  A: 

I've recently been caught up in a similar problem. Our--very simple--solution was to load the web service data into a table with the same structure as the DB table. The DB table keeps a hash of its most important columns, and the same hash function is applied to the corresponding columns in the web service table.

The "sync" logic then goes like this:

  1. Delete any rows from the DB table with hashes not found in the web service table. These are the records that have disappeared from the web service response.

    DELETE FROM db_table WHERE hash NOT IN (SELECT hash FROM ws_table);

  2. Delete any rows from the web service table with hashes that do exist in the DB table. This is duplicate data that doesn't need synchronizing. (The order of these two steps matters: if the duplicates were removed first, the NOT IN subquery in step 1 would no longer see the unchanged hashes and would delete those rows from the DB table as well.)

    DELETE FROM ws_table WHERE hash IN (SELECT hash FROM db_table);

  3. Anything left over in the web service table is new data, and should now be inserted into the DB table.

    INSERT INTO db_table SELECT ... FROM ws_table;
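Putting those steps together, here's a minimal runnable sketch using Python's sqlite3 module; the table and column names are hypothetical. Note that the DB-table deletion runs while the web service table still contains its duplicate rows, because the NOT IN subquery needs to see them:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE db_table (hash TEXT, data TEXT)")
cur.execute("CREATE TABLE ws_table (hash TEXT, data TEXT)")
cur.executemany("INSERT INTO db_table VALUES (?, ?)",
                [("h1", "alice"), ("h2", "bob")])    # h2 is stale
cur.executemany("INSERT INTO ws_table VALUES (?, ?)",
                [("h1", "alice"), ("h3", "carol")])  # h3 is new

# Run the whole sync in one transaction so readers never see a
# half-synced table.
with conn:
    # Remove stale DB rows first, while ws_table still holds the
    # duplicate hashes that this NOT IN subquery needs to see.
    cur.execute("DELETE FROM db_table WHERE hash NOT IN (SELECT hash FROM ws_table)")
    # Drop the duplicates from the web service table.
    cur.execute("DELETE FROM ws_table WHERE hash IN (SELECT hash FROM db_table)")
    # Whatever is left in ws_table is new; copy it over.
    cur.execute("INSERT INTO db_table SELECT hash, data FROM ws_table")

rows = sorted(h for (h,) in cur.execute("SELECT hash FROM db_table"))
```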

It's a pretty brute-force approach, and if done transactionally (at least the steps that modify the DB table) it locks up the DB table for the duration, but it's very simple.

One refinement would be to deal with changed records using UPDATE statements, but that adds a good deal of complexity, and may not be any faster than a DELETE followed by an INSERT.

Another possible optimization would be to set a flag instead of deleting rows. The rows could then be deleted later on. However, any logic using the DB table would have to ignore rows with a set flag.
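A sketch of the flag approach (the deleted column and the table names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Hypothetical schema with a soft-delete flag instead of hard deletes.
cur.execute("CREATE TABLE db_table (hash TEXT, data TEXT, deleted INTEGER DEFAULT 0)")
cur.executemany("INSERT INTO db_table (hash, data) VALUES (?, ?)",
                [("h1", "alice"), ("h2", "bob")])

# Flag rows that are gone from the response rather than deleting them.
cur.execute("UPDATE db_table SET deleted = 1 WHERE hash NOT IN ('h1')")

# Every reader must now filter on the flag.
live = cur.execute("SELECT hash FROM db_table WHERE deleted = 0").fetchall()

# The flagged rows can be purged later, outside the sync window.
cur.execute("DELETE FROM db_table WHERE deleted = 1")
```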

yukondude