In a lot of the databases I seem to be working on these days, I can't just delete a record, for any number of reasons: the record may need to be displayed later (say, a product that no longer exists), or I may just need to keep a history of what was.

So my question is how best to expire the record.

I have often added a date_expired column, which is a datetime field. Generally I query either where date_expired = 0, or where date_expired = 0 OR date_expired > NOW(), depending on whether the data can be set to expire in the future. Similarly, I have also added a field called expired_flag; when it is set to true/1, the record is considered expired. This is probably the easiest method, although you need to remember to include the expiry clause any time you only want the current items.
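To make the date_expired check concrete, here is a minimal sketch. It is runnable Python using the built-in sqlite3 module as a stand-in for MySQL, and it assumes NULL (rather than 0) marks "never expires"; the product table, its rows, and the current_products helper are made up for illustration. In MySQL the same filter would use NOW() instead of datetime('now').

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE product (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        date_expired TEXT  -- NULL means "never expires" (assumed convention)
    )
""")
conn.executemany(
    "INSERT INTO product (name, date_expired) VALUES (?, ?)",
    [
        ("widget", None),                   # current, no expiry set
        ("gadget", "2000-01-01 00:00:00"),  # already expired
        ("gizmo", "9999-12-31 00:00:00"),   # set to expire in the future
    ],
)


def current_products(conn):
    """Return the names of rows that have not expired yet."""
    rows = conn.execute(
        "SELECT name FROM product "
        "WHERE date_expired IS NULL OR date_expired > datetime('now') "
        "ORDER BY name"
    )
    return [name for (name,) in rows]


print(current_products(conn))
```

Wrapping the expiry clause in one helper (or a view) is one way to avoid forgetting it in individual queries.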

Another method I have seen is moving the record to an archive table, but this can get quite messy when there are a large number of tables that require history tables. It also makes the retrieval of the value (say country) more difficult as you have to first do a left join (for example) and then do a second query to find the actual value (or redo the query with a modified left join).

Another option, which I haven't seen done or fully attempted myself, is to have a single table that contains either all of the data from all of the expired records or some form of it (some kind of history table). In this case, retrieval would be even more difficult, as you would need to search a possibly massive table and then parse the data.

Are there other solutions or modifications of these that are better?

I am using MySQL (with PHP), so I don't know if other databases have better methods to deal with this issue.

+1  A: 

I think adding the date_expired column is the easiest and least invasive method. As long as your INSERTS and SELECTS use explicit column lists (they should be if they're not) then there is no impact to your existing CRUD operations. Add an index on the date_expired column and developers can add it as a property to any classes or logic that depend on the data in the existing table. All in all the best value for the effort. I agree that the other methods (i.e. archive tables) are troublesome at best, by comparison.

Dave Swersky
+3  A: 

I prefer the date expired field method. However, sometimes it is useful to have two dates: an initial date and a date expired. If data can expire, it is often useful to know when it was active, and that means also knowing when it started existing.

thursdaysgeek
Yes, quite useful in a case such as a product table or taxes.
Darryl Hein
+1  A: 

I usually don't like database triggers, since they can lead to strange "behind the scenes" behavior, but putting a trigger on delete to insert the about-to-be-deleted data into a history table might be an option.
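A rough sketch of what such a delete trigger could look like, written as runnable Python against SQLite rather than MySQL. The country and country_history tables and the trigger name are hypothetical; MySQL's CREATE TRIGGER syntax is similar but not identical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE country (id INTEGER PRIMARY KEY, name TEXT NOT NULL);

    -- Hypothetical history table: same columns plus a deletion timestamp.
    CREATE TABLE country_history (
        id INTEGER,
        name TEXT,
        deleted_at TEXT
    );

    -- Copy each row into the history table just before it is deleted.
    CREATE TRIGGER country_before_delete
    BEFORE DELETE ON country
    FOR EACH ROW
    BEGIN
        INSERT INTO country_history (id, name, deleted_at)
        VALUES (OLD.id, OLD.name, datetime('now'));
    END;

    INSERT INTO country (id, name) VALUES (1, 'Canada'), (2, 'Atlantis');
""")

# A plain DELETE now archives the row as a side effect.
conn.execute("DELETE FROM country WHERE id = 2")

live = conn.execute("SELECT name FROM country").fetchall()
archived = conn.execute("SELECT id, name FROM country_history").fetchall()
print(live, archived)
```

The application code keeps issuing ordinary deletes, which is exactly the "behind the scenes" behavior mentioned above, for better or worse.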

In my experience, we usually just use an "Active" bit, or a "DateExpired" datetime like you mentioned. That works pretty well, and is really easy to deal with and query.

There's a related post here that offers a few other options. Maybe the CDC option?

http://stackoverflow.com/questions/349524/sql-server-history-table-populate-through-sp-or-trigger

Andy White
A: 

A very nice approach by Oracle to this problem is partitions. I don't think MySQL has anything similar, though.

Pablo Santa Cruz
+1  A: 

May I also suggest adding a "Status" column that matches an enumerated type in the code you're using. Put an index on the column and you'll be able to very easily and efficiently narrow down your returned data via your where clauses.

Some possible enumerated values to use, depending on your needs:

  1. Active
  2. Deleted
  3. Suspended
  4. InUse (Sort of a pseudo-locking mechanism)

Set the column up as a tinyint (that's SQL Server...not sure of the MySQL equivalent). You can also set up a matching lookup table with the key/value pairs and a foreign key constraint between the tables if you wish.
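Something along these lines, as a minimal sketch in Python with SQLite standing in for the real database. The Status enum mirrors the list above; the account table, the status lookup table, and the active_accounts helper are assumptions for illustration only.

```python
import sqlite3
from enum import IntEnum


class Status(IntEnum):
    """Enumerated type in application code, matching the Status column."""
    ACTIVE = 1
    DELETED = 2
    SUSPENDED = 3
    IN_USE = 4


conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.executescript("""
    -- Lookup table holding the key/value pairs.
    CREATE TABLE status (id INTEGER PRIMARY KEY, label TEXT NOT NULL);

    CREATE TABLE account (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        status_id INTEGER NOT NULL REFERENCES status(id)
    );

    -- Index so that filtering by status stays cheap.
    CREATE INDEX idx_account_status ON account (status_id);
""")
conn.executemany(
    "INSERT INTO status (id, label) VALUES (?, ?)",
    [(s.value, s.name) for s in Status],
)
conn.executemany(
    "INSERT INTO account (name, status_id) VALUES (?, ?)",
    [("alice", Status.ACTIVE.value), ("bob", Status.DELETED.value)],
)


def active_accounts(conn):
    """Return the names of accounts whose status is Active."""
    rows = conn.execute(
        "SELECT name FROM account WHERE status_id = ? ORDER BY name",
        (Status.ACTIVE.value,),
    )
    return [name for (name,) in rows]


print(active_accounts(conn))
```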

Boydski
+2  A: 

I like the expired_flag option over the date_expired option, if query speed is important to you.

Scott Ferguson
A: 

There are some fields that my tables usually have: creation_date, last_modification, last_modifier (fk to user), is_active (boolean or number, depending on the database).

Sam
I used to do this, but got tired of it. Instead I now use a separate table where I insert every query (other than selects), which gives me a complete history, whereas last modified and who can be pretty much useless in most cases.
Darryl Hein
Great idea, I have to say. Another option would be to use the auditing options of databases instead of manually keeping track of changes, but yours is good: simple and effective.
Sam
+1  A: 

I've always used the ValidFrom, ValidTo approach where each table has these two additional fields. If ValidTo Is Null or > Now() then you know you have a valid record. In this way you can also add data to the table before it's live.
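As a sketch of how a lookup against such a table might go (Python with SQLite here, not MySQL; the tax_rate table, its rates, and the rate_as_of helper are invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE tax_rate (
        id INTEGER PRIMARY KEY,
        rate REAL NOT NULL,
        valid_from TEXT NOT NULL,
        valid_to TEXT            -- NULL means the row is still open-ended
    )
""")
conn.executemany(
    "INSERT INTO tax_rate (rate, valid_from, valid_to) VALUES (?, ?, ?)",
    [
        (0.05, "2000-01-01", "2010-01-01"),  # superseded rate
        (0.07, "2010-01-01", None),          # rate valid from 2010 onward
    ],
)


def rate_as_of(conn, day):
    """Return the rate valid on the given ISO date, or None if there is none."""
    row = conn.execute(
        "SELECT rate FROM tax_rate "
        "WHERE valid_from <= ? AND (valid_to IS NULL OR valid_to > ?)",
        (day, day),
    ).fetchone()
    return row[0] if row else None


print(rate_as_of(conn, "2005-06-01"))
print(rate_as_of(conn, "2015-06-01"))
```

Passing today's date gives the current record; a row whose valid_from lies in the future simply does not match until that date arrives.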

MrTelly
A: 

Look at the "Slowly Changing Dimension" SCD algorithms. There are several choices from the Data Warehousing world that apply here.

None is "best" -- each responds to different requirements.

Here's a tidy summary.

Type 1: The new record replaces the original record. No trace of the old record exists.

  • Type 4 is a variation on this that moves the history to another table.

Type 2: A new record is added into the customer dimension table. To distinguish the records, a "valid date range" pair of columns is required. It helps to have a "this record is current" flag.

Type 3: The original record is modified to reflect the change.

  • In this case, there are columns for one or more previous values of the columns likely to change. This has an obvious limitation because it's bound to a specific number of columns. However, it is often used in conjunction with other types.
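Of the types above, Type 2 is the one closest to the original question. A minimal sketch of a Type 2 change, in Python with SQLite as a stand-in database: the customer_dim table, its column names, and the change_city helper are hypothetical, and a real warehouse load would handle many attributes and rows at once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer_dim (
        row_id INTEGER PRIMARY KEY,       -- surrogate key, one per version
        customer_id INTEGER NOT NULL,     -- business key, shared by versions
        city TEXT NOT NULL,
        valid_from TEXT NOT NULL,
        valid_to TEXT,                    -- NULL while the version is current
        is_current INTEGER NOT NULL DEFAULT 1
    )
""")


def change_city(conn, customer_id, new_city, when):
    """Close the current version (if any) and add a new current one."""
    with conn:  # both statements commit together or not at all
        conn.execute(
            "UPDATE customer_dim SET valid_to = ?, is_current = 0 "
            "WHERE customer_id = ? AND is_current = 1",
            (when, customer_id),
        )
        conn.execute(
            "INSERT INTO customer_dim (customer_id, city, valid_from) "
            "VALUES (?, ?, ?)",
            (customer_id, new_city, when),
        )


change_city(conn, 42, "Calgary", "2008-01-01")
change_city(conn, 42, "Edmonton", "2009-06-01")

history = conn.execute(
    "SELECT city, valid_from, valid_to, is_current FROM customer_dim "
    "WHERE customer_id = 42 ORDER BY row_id"
).fetchall()
print(history)
```

Nothing is ever deleted or overwritten except the closing date and flag, so the full history stays queryable in the same table.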

You can read more about this if you search for "Slowly Changing Dimension".

http://en.wikipedia.org/wiki/Slowly_Changing_Dimension

S.Lott