I've got a DB2 ESE 9.7 non-DPF data warehouse, using data compression, with 20 TB of data. It takes in about 100 million rows a day via loads that run every 10 minutes, and roughly another million rows a day through about 50,000 imports. Additionally, there is a small amount of transactional data associated with those two large data sets.
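To give an idea of the two ingest paths, they look roughly like this (table names, file names and options below are simplified placeholders, not our actual jobs):

    -- every 10 minutes: bulk LOAD of the big feed (rows are not individually logged);
    -- since we can re-run loads for recovery, it's done NONRECOVERABLE
    LOAD FROM /staging/fact_feed.del OF DEL
        INSERT INTO DW.FACT_EVENTS
        NONRECOVERABLE

    -- ~50,000 times a day: small IMPORTs, i.e. fully logged inserts
    IMPORT FROM /staging/trickle.del OF DEL
        COMMITCOUNT 1000
        INSERT INTO DW.IMPORT_DETAIL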
Currently we're using application-level backups and, in the event of a recovery, would rely on loading previously exported summary tables or reloading the 100 million rows a day. But for the sake of the small amount of transactional data and the imports, I'd like online backups.
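As I understand it, getting there means first making the database recoverable (archive logging) and then taking tablespace-level online backups, something along these lines (database name, tablespace names and paths are placeholders):

    -- switch from circular to archive logging
    UPDATE DB CFG FOR DWPROD USING LOGARCHMETH1 DISK:/db2/archlogs/DWPROD/

    -- once the database is recoverable: per-tablespace online backups
    BACKUP DATABASE DWPROD
        TABLESPACE (TS_TRANS, TS_IMPORTS) ONLINE
        TO /backup/DWPROD COMPRESS INCLUDE LOGS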
However, it appears that online, tablespace-level backups require an initial offline backup, and that's the problem: even if I redirect the offline backup to /dev/null, it will take approximately 48 hours of downtime, which is unacceptable, and the same requirement may come up again at some point in the future.
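For reference, the /dev/null trick I mentioned would presumably be just the standard offline backup pointed at the null device (database name is a placeholder):

    -- offline backup whose image is discarded; DB2 still reads the full ~20 TB
    BACKUP DATABASE DWPROD TO /dev/null WITHOUT PROMPTING

Even with the image thrown away, the backup still has to read through the entire database, which is where the roughly 48 hours comes from.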
At some point we'll probably split this up into 8+ database partitions, which would help both this and the index builds during loads. But that may not happen for a while, and it's hard to justify for tasks that shouldn't be necessary in the first place.
EDIT: The reason we didn't initially go with DPF, and why it isn't a driving issue for our queries, is that over 99% of our queries hit summary tables, and the 1% that do have to dip into the detail table taking on 3+ billion rows per month can almost always take advantage of table partitioning, MDC and indexes to scan a much smaller amount of data. What this gets to is that the traditional heuristics about how much data per CPU don't always apply.
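To make that concrete, a simplified, hypothetical version of the detail table's layout (not our real DDL) would be something like:

    CREATE TABLE DW.FACT_EVENTS (
        EVENT_DATE    DATE          NOT NULL,
        REGION_ID     SMALLINT      NOT NULL,
        PRODUCT_ID    INTEGER       NOT NULL,
        METRIC_VALUE  DECIMAL(15,2)
    )
    PARTITION BY RANGE (EVENT_DATE)
        (STARTING '2011-01-01' ENDING '2011-12-31' EVERY 1 MONTH)
    ORGANIZE BY DIMENSIONS (REGION_ID, PRODUCT_ID)
    COMPRESS YES;

    CREATE INDEX DW.IX_FACT_EVENTS
        ON DW.FACT_EVENTS (PRODUCT_ID, EVENT_DATE);

Partition elimination on the date range plus MDC block skipping on the dimensions means those detail queries touch a small slice of the table rather than scanning the whole 3+ billion rows per month.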
Any way to get around that offline backup requirement? Any third-party tools that can help me out? Any other suggestions?