views:

196

answers:

6

I'm being given a data source weekly that I'm going to parse and put into a database. The data will not change much from week to week, but I should be updating the database on a regular basis. Besides this weekly update, the data is static.

For now rebuilding the entire database isn't a problem, but eventually this database will be live and people could be querying the database while I'm rebuilding it. The amount of data isn't small (couple hundred megabytes), so it won't load that instantaneously, and personally I want a bit more of a foolproof system than "I hope no one queries while the database is in disarray."

I've thought of a few different ways of solving this problem, and was wondering what the best method would be. Here are my ideas so far:

  1. Instead of replacing entire tables, query for the difference between my current database and what I want to place in the database. This seems like it could be an unnecessary amount of work, though.

  2. Creating dummy data tables, then doing a table rename (or having the server code point towards the new data tables).

  3. Just telling users that the site is going through maintenance and put the system offline for a few minutes. (This is not preferable for obvious reasons, but if it's far and away the best answer I'm willing to accept that.)

Thoughts?

+1  A: 

Which database server are you using? SQL Server 2005 and above provides an isolation level called "Snapshot". It allows you to open a transaction, do all of your updates, and then commit, all while users of the database continue to view the pre-transaction data. Normally, your transaction would lock the tables and block their queries, but snapshot isolation would be perfect in your case.
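
Roughly, the setup could look like the sketch below (the database, schema, and table names are hypothetical; READ_COMMITTED_SNAPSHOT is what lets ordinary readers keep seeing the old rows without blocking):

    -- Run once: let the database use row versioning (hypothetical names throughout)
    ALTER DATABASE WeeklyDb SET ALLOW_SNAPSHOT_ISOLATION ON;
    ALTER DATABASE WeeklyDb SET READ_COMMITTED_SNAPSHOT ON;

    -- Weekly load: readers continue to see last week's rows until the COMMIT
    SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
    BEGIN TRANSACTION;
        DELETE FROM dbo.WeeklyData;
        INSERT INTO dbo.WeeklyData (id, value)
            SELECT id, value FROM dbo.WeeklyStaging;
    COMMIT TRANSACTION;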

More info here: http://blogs.msdn.com/craigfr/archive/2007/05/16/serializable-vs-snapshot-isolation-level.aspx

But it requires SQL Server, so if you're using something else....

rwmnau
+1  A: 

Several database systems (since you didn't specify yours, I'll keep this general) offer the SQL:2003 standard statement called MERGE, which basically allows you to:

  • insert new rows into a target table from a source which don't exist there yet
  • update existing rows in the target table based on new values from the source
  • optionally even delete rows from the target that don't show up in the import table anymore

SQL Server 2008 is the first Microsoft offering to have this statement - check out more here, here or here.

Other database systems will probably have similar implementations - it's a SQL:2003 standard statement, after all.
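
As a rough illustration (table and column names are made up, and the WHEN NOT MATCHED BY SOURCE clause is a SQL Server extension to the standard):

    -- live_data is the table users query; staging_weekly holds the freshly parsed file
    MERGE INTO live_data AS target
    USING staging_weekly AS source
        ON target.id = source.id
    WHEN MATCHED THEN
        UPDATE SET value = source.value
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (id, value) VALUES (source.id, source.value)
    WHEN NOT MATCHED BY SOURCE THEN
        DELETE;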

Marc

marc_s
+3  A: 

I can't speak for MySQL, but PostgreSQL has transactional DDL. This is a wonderful feature, and means that your second option, loading new data into a dummy table and then executing a table rename, should work great. If you want to replace the table foo with foo_new, you only have to load the new data into foo_new and run a script to do the rename. This script should execute in its own transaction, so if something about the rename goes bad, both foo and foo_new will be left untouched when it rolls back.
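
A minimal version of that rename script might look like this (assuming PostgreSQL, and reusing the foo/foo_new names from above):

    -- foo_new has already been created and fully loaded with this week's data
    BEGIN;
    ALTER TABLE foo RENAME TO foo_old;
    ALTER TABLE foo_new RENAME TO foo;
    DROP TABLE foo_old;    -- or keep it around as last week's backup
    COMMIT;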

The main problem with that approach is that it can get a little messy to handle foreign keys from other tables that key on foo. But at least you're guaranteed that your data will remain consistent.

A better approach in the long term, I think, is just to perform the updates on the data directly (your first option). Once again, you can stick all the updating in a single transaction, so you're guaranteed all-or-nothing semantics. Even better would be online updates, just updating the data directly as new information becomes available. This may not be an option for you if you need the results of someone else's batch job, but if you can do it, it's the best option.

kquinn
+1  A: 
BEGIN;
DELETE FROM my_table;                          -- my_table and staging are placeholder names
INSERT INTO my_table SELECT * FROM staging;    -- load the freshly parsed weekly data
COMMIT;

Users will see the changeover instantly when you hit commit. Any queries started before the commit will run against the old data, anything started afterwards will run against the new data. The database will actually clean up the old row versions once the last user is done with them. Because everything is "static" (you're the only one who ever changes it, and only once a week), you don't have to worry about lock issues or timeouts. For MySQL, this requires InnoDB (MyISAM tables aren't transactional). PostgreSQL does it, and SQL Server calls it "snapshotting", though I can't remember the details off the top of my head since I rarely use it.

If you Google "transaction isolation" + the name of whatever database you're using, you'll find appropriate information.

Autocracy
Yeah, but if loading the several hundred megs of data takes several minutes, your system will not be able to serve its clients during that time... this solution is great for REALLY small data sets only. Also, what if your data has referential integrity enabled? You might just not be able to totally wipe out your base table to begin with...
marc_s
Yes, you will be able to serve the client during that time. That's the point of transaction isolation. If you have referential integrity, then you need to delete things in the proper order, just as you would need to insert them in the proper order.
Autocracy
That's clearly the proper SQL solution, using transactions.
bortzmeyer
+1  A: 

Use week-stamped table names (mytable_[yyyy]_[wk]) and a view to provide a constant name (mytable). Once a new table is completely imported, update the view so that it points at that table.
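
For example (assuming a database that supports CREATE OR REPLACE VIEW, such as PostgreSQL or MySQL; the week-stamped name is made up):

    -- Load this week's data into a freshly created, week-stamped table
    CREATE TABLE mytable_2009_42 (id integer PRIMARY KEY, value text);
    -- ... bulk-load the parsed weekly file into mytable_2009_42 ...

    -- Flip the constant name over to the new table in one step
    CREATE OR REPLACE VIEW mytable AS SELECT * FROM mytable_2009_42;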

Marius Burz
+1  A: 

We solved this problem by using PostgreSQL's table inheritance/constraints mechanism. You create a trigger that auto-creates sub-tables partitioned based on a date field.

This article was the source I used.
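
A stripped-down sketch of the idea (assuming PostgreSQL inheritance-based partitioning; all names are made up, and unlike the article's trigger this one assumes the weekly child table already exists rather than creating it on the fly):

    -- Parent table that queries go against
    CREATE TABLE measurements (logdate date NOT NULL, value text);

    -- One child table per week, constrained to that week's date range
    CREATE TABLE measurements_2009_w42 (
        CHECK (logdate >= DATE '2009-10-12' AND logdate < DATE '2009-10-19')
    ) INHERITS (measurements);

    -- Trigger function that routes inserts on the parent into the right child
    CREATE OR REPLACE FUNCTION measurements_insert_trigger()
    RETURNS trigger AS $$
    BEGIN
        IF NEW.logdate >= DATE '2009-10-12' AND NEW.logdate < DATE '2009-10-19' THEN
            INSERT INTO measurements_2009_w42 VALUES (NEW.*);
        ELSE
            RAISE EXCEPTION 'No partition for date %', NEW.logdate;
        END IF;
        RETURN NULL;  -- the row has already been redirected into the child table
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER insert_measurements_trigger
        BEFORE INSERT ON measurements
        FOR EACH ROW EXECUTE PROCEDURE measurements_insert_trigger();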

Scott