views: 377

answers: 4

I am designing a table in the database which will store log entries from the application. There are a few things that are making me think about this design more than usual.

  • These log entries will be used at runtime by the system to make decisions, so they need to be relatively fast to access.
  • There are also going to be a lot of them (12.5 million added per month is my estimate).
  • I only need the last 30 to 45 days at most for the decision processing.
  • I need to keep all of them for much longer than 45 days for support & legal issues, likely at least 2 years.
  • The table design is fairly simple: all simple types (no blobs or anything), default values supplied by the database engine where possible, and at most one foreign key.
  • If it makes any difference, the database will be Microsoft SQL Server 2005.

What I was thinking is having them written to a live table/database and then using an ETL solution to move "old" entries to an archive table/database, which is big and on slower hardware.
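To make that concrete, here is roughly what I picture the live table looking like (all names and column sizes below are placeholders, not a final design):

    -- Placeholder lookup table for the single foreign key:
    CREATE TABLE dbo.LogType (
        LogTypeID INT NOT NULL PRIMARY KEY,
        Name      VARCHAR(50) NOT NULL
    );

    CREATE TABLE dbo.LogEntry (
        LogEntryID BIGINT IDENTITY(1,1) NOT NULL PRIMARY KEY,
        LogTypeID  INT NOT NULL
            CONSTRAINT FK_LogEntry_LogType REFERENCES dbo.LogType (LogTypeID),
        -- Let the engine fill in the timestamp by default:
        LoggedAt   DATETIME NOT NULL
            CONSTRAINT DF_LogEntry_LoggedAt DEFAULT GETUTCDATE(),
        Message    VARCHAR(1024) NOT NULL
    );

    -- The decision logic only looks at the last 30-45 days, so index the date:
    CREATE INDEX IX_LogEntry_LoggedAt ON dbo.LogEntry (LoggedAt);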

My question is: do you know of any tips, tricks, or suggestions for the database/table design to make sure this works as well as possible? Also, if you think it's a bad idea, please let me know what you think a better idea would be.

+3  A: 

Some databases offer "partitions" (Oracle, for example). A partition is like a view that collects several tables with identical definitions into one. You can define criteria that sort new data into the different tables (for example, by month, or by week-of-year % 6).

From the user's point of view, this is just one table. From the database's point of view, it's several independent tables, so you can run full-table commands (TRUNCATE, DROP, unconditional DELETE, load/dump, etc.) against the individual tables efficiently.

If you can't have partitions, you can get a similar effect with views. In this case, you collect several tables in a single view and redefine this view, say, once a month to detach the table holding the oldest data from the rest. Now you can efficiently archive that table, clear it, and attach it to the view again once the heavy work is done. This should help greatly to improve performance.
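A minimal sketch of the view approach in T-SQL, assuming hypothetical monthly tables dbo.LogEntry_200901, dbo.LogEntry_200902, and so on with identical definitions:

    -- The CHECK constraint on the date column lets the optimizer skip
    -- tables that can't match a dated query:
    CREATE TABLE dbo.LogEntry_200901 (
        LogEntryID BIGINT NOT NULL PRIMARY KEY,
        LoggedAt   DATETIME NOT NULL
            CONSTRAINT CK_LogEntry_200901
            CHECK (LoggedAt >= '20090101' AND LoggedAt < '20090201'),
        Message    VARCHAR(1024) NOT NULL
    );
    -- ... dbo.LogEntry_200902 defined the same way ...
    GO
    CREATE VIEW dbo.LogEntryAll
    AS
        SELECT LogEntryID, LoggedAt, Message FROM dbo.LogEntry_200901
        UNION ALL
        SELECT LogEntryID, LoggedAt, Message FROM dbo.LogEntry_200902;
    GO
    -- Once a month: redefine the view without the oldest table, then
    -- archive and truncate that table at leisure.
    ALTER VIEW dbo.LogEntryAll
    AS
        SELECT LogEntryID, LoggedAt, Message FROM dbo.LogEntry_200902;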

[EDIT] SQL Server 2005 onwards (Enterprise Edition) supports partitions. Thanks to Mitch Wheat
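For illustration, a minimal sketch of a partitioned table on SQL Server 2005 Enterprise Edition; the boundary dates and all names here are made up:

    CREATE PARTITION FUNCTION pfLogMonth (DATETIME)
        AS RANGE RIGHT FOR VALUES ('20090101', '20090201', '20090301');
    GO
    CREATE PARTITION SCHEME psLogMonth
        AS PARTITION pfLogMonth ALL TO ([PRIMARY]);
    GO
    -- Rows are routed to a partition by LoggedAt; an aged partition can
    -- be switched out to an archive table almost instantly with
    -- ALTER TABLE ... SWITCH PARTITION.
    CREATE TABLE dbo.ApplicationLog (
        LogID    BIGINT IDENTITY(1,1) NOT NULL,
        LoggedAt DATETIME NOT NULL,
        Message  VARCHAR(1024) NOT NULL
    ) ON psLogMonth (LoggedAt);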

Aaron Digulla
SQL Server also supports partitioned tables
Mitch Wheat
I should say SQL Server 2005 onwards (Enterprise Edition)
Mitch Wheat
+1  A: 

Big tables slow down quickly, and using ETL to pull data from a big table by date and then delete the old rows carries a large performance overhead. The answer to this is to use multiple tables: probably one table per month, based on your figures. Of course, you'll need some logic to generate the table names within your queries, as in the sketch below.
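For example, the current month's table name could be generated like this (the AuditYYYYMM naming scheme here is just an assumption):

    DECLARE @table SYSNAME, @sql NVARCHAR(MAX);
    -- Style 112 formats a date as yyyymmdd; the first six characters are yyyymm.
    SET @table = N'Audit' + LEFT(CONVERT(NVARCHAR(8), GETDATE(), 112), 6);
    SET @sql = N'SELECT COUNT(*) FROM dbo.' + QUOTENAME(@table) + N';';
    EXEC sp_executesql @sql;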

I agree with using triggers to populate the CurrentMonthAudit table. At the end of the month, you can then rename that table to MonthAuditYYYYMM. Moving old tables off your main server using ETL will then be easy, and each of your tables will be manageable. Trust me, this is much better than trying to manage a single table with approx 250M rows.
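A sketch of the month-end swap (names are assumed, and in practice you would wrap this in a transaction and a maintenance window):

    -- Rename the live table out of the way...
    EXEC sp_rename 'dbo.CurrentMonthAudit', 'MonthAudit200901';
    GO
    -- ...and immediately recreate an empty live table with the same
    -- definition so logging can continue.
    CREATE TABLE dbo.CurrentMonthAudit (
        AuditID  BIGINT IDENTITY(1,1) NOT NULL PRIMARY KEY,
        LoggedAt DATETIME NOT NULL DEFAULT GETUTCDATE(),
        Message  VARCHAR(1024) NOT NULL
    );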

MrTelly
+1  A: 

Your first good decision is keeping everything as simple as possible.

I've had good luck with the pattern you describe: a simple write-only log table where the records are just laid down in chronological order. You then have several options for switching out aged data. Even having disparate monthly tables is manageable query-wise as long as you keep simplicity in mind. If you have any kind of replication in operation, your replicated tables can be rolled out and serve as the archive. Then start with a fresh, empty table on the first of each month.
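As a sketch of that pattern: clustering on an ever-increasing key makes every insert a physical append at the end of the table, which keeps the write path cheap (all names are placeholders):

    CREATE TABLE dbo.AppLog (
        AppLogID BIGINT IDENTITY(1,1) NOT NULL,
        LoggedAt DATETIME NOT NULL
            CONSTRAINT DF_AppLog_LoggedAt DEFAULT GETUTCDATE(),
        Message  VARCHAR(1024) NOT NULL,
        -- Clustered on the identity column: inserts always go at the end.
        CONSTRAINT PK_AppLog PRIMARY KEY CLUSTERED (AppLogID)
    );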

Normally I shudder at the relational design consequences of doing something like this, but I've found that write-only chronological log tables are an exception to the usual design patterns, for the reasons you are dealing with here.

But stay away from triggers, as far as possible. The simplest solution is a primary table of the type you're talking about here, plus a simple, robust, off-the-shelf, time-proven replication mechanism.

(BTW: large tables don't slow down quickly if they are well designed; they slow down slowly.)

le dorfier
A: 

If you do not need to search the recent log records, there is another option: don't use a database at all. Instead, write the log info to a file and rotate the file name every night. Once a file has been rotated out, a background job can import the data directly into the archive database.
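The nightly import can be as simple as a BULK INSERT of the rotated file; the file path, format, and target table below are assumptions:

    -- Load one day's rotated, tab-delimited log file into the archive:
    BULK INSERT ArchiveDb.dbo.LogEntryArchive
    FROM 'D:\logs\app-2009-01-01.log'
    WITH (FIELDTERMINATOR = '\t', ROWTERMINATOR = '\n', TABLOCK);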

Databases are not always the best option, especially for log files :)

Aaron Digulla
-1: You missed the very first requirement: "However these log entries will be used at runtime by the system to make decisions so they need to be relatively fast to access".
Robert MacLean
I know. But when someone else reads this, this tip might still be useful.
Aaron Digulla