tags:

views:

244

answers:

11

Recently I’ve found myself in a database tangle where management wants the ability to remove data from the database, but still wants that data to appear in other places. Example: They want to remove all instances of the product whizbang, but they still want whizbang to appear in sales reports. (if they ran one for a previous date).

Now I can add a field, say is_deleted, that will track whether that product has been deleted and thus still keep all my references, but over a period of time, I have the potential of housing a lot of dead data. (data that is never accessed again). How to handle this is not my question.

I’m curious to find out, in your experience what is the average life span of data? That is, on average how long is data alive or good for before it gets either replaced or deleted? I understand that this is relative to the type of data you are housing, but certainly all data has some sort of life span?

A: 

Usually it's better to move such data into seporate database (database warehouse) and keep working database clean. At data warehouse your data can be kept for many years without impacting your application.

Reference: Data Warehouse at Wikipedia

Koistya Navin
A: 

I've always gone by what is the ruling body looking for. Example the IRS wants you to keep 7 years of history or for security reasons we keep 3 years of log information, etc. So I guess you could do 2 things, determine what the life span of your data is I would say 3 years would be enough and then you could add the is_deleted flag along with a date that way you would be able to flag some data to delete sooner than later.

Robert
+4  A: 

Data lives forever...or often it should. One common practice is to have end and/or start dates for a record. So for your whizbang, you have a start date (so that it won't appear on sales reports before it's official launch), and an end date (so that it drops off of reports after it's been end-of-lifed). Using the proper dates as criteria for your reporting as well as your applications, you won't see the whizbang except for when you should, and the data still exists (which it should, theoretically infinitely).

As Koistya Navin mentions, moving data to a data warehouse at a certain point is also an option, but this depends in large part on how large your 'old' data is, and how long you need to keep it readily available for access.

Harper Shelby
A: 

Yes, all data has a lifespan. And yes, it is relative to the type of data you have.

Some data has a lifespan measured in seconds (authentication tokens, for instance), some other data virtual eternity (more than the medium and formats it is stored into, like for instance ownership records).

You will have to either be more specific as to the type of data you are envisioning, or do a census in your own organization as to the usual lifespan of stuff.

Varkhan
A: 

Our particular flavor varies. We have some data (a vast majority) which goes stale after 3 months (hard product limit) but can be revived at any later date.

We have other data that is effectively immortal.

In practice, most of the data we serve up is fresh and frequently requested for a few weeks, at most a month, before falling to sporadic use.

Jeff
A: 

How much is "a lot of dead data"?

With processing power and data storage so cheap, I wouldn't purge old data unless there's a really good reason to. You also need to consider the legal implications. Large (and even small) companies may have incredibly long retention policies for old data, to save themselves millions down the road when they are subpoenaed for it by a judge.

I would check with whatever legal department you have and find out how long the data needs to be stored. That's the safest bet.

Also, ask yourself what the benefit of removing the old data is. Is the only benefit a tidier database? If so, I wouldn't do it. Are you going to see a 10X performance increase? If so, I'd do it. This really is a complex question though, and it's tough for us to have all the information required to give you good advice.

William Brendel
A: 

I have a few projects where the customer wants all the historical data (going back over 19 years). Quite a bit of the really old data is malformed and is going to be a nightmare to import into the new system. We convinced them that they won't need records going back any further than 10 years, but like you said it's all relative to the type of data you're housing.

On a side note, data storage is extremely cheap right now, and if it isn't affecting the performance of your application, I would just leave it where it is.

Ryan Smith
A: 

[...] but certainly all data has some sort of life span?

Not any kind of life span we can talk about meaningfully. A lot of data is useless as soon as it's created or recorded. Such data could be discarded immediately with no effect. On the other hand, some data has enough value that it will outlive the current system that hosts it. If Amazon were to completely replace their current infrastructure, the customer histories they have stored would still be immensely valuable.

As you said, it's relative. Each type of data has its own life span that has no relation to another type of data's life span. There's no meaningful "average life span of data".

Derek Park
A: 

I have the potential of housing a lot of dead data. (data that is never accessed again).

But they will when they perform those reports then they are accessing that data.

Until then you'll need to keep the data in some form. Move to another table or have a switch like you mentioned.

Ólafur Waage
+1  A: 

Many of our customers keep data online for 2 years. After that it's moved to backup disks, but it can be put online if needed.

Consider adding a column "expiration" or "effective date". This will allow you mark a product as obsolete, but reports will return that product if the time range is satisfied.

Slidell4life
A: 

uh...at the risk of oversimplifying...it sounds like using DateDeleted instead of a bit would solve your how-long-to-keep issue.

Steven A. Lowe