ansaurus

Question

How to deal with old, obsolete database data of a long running system?

Answer 1

A:

One solution could be (assuming records referencing obsolete data are the oldest): archive those records and remove old reference data.

friol 2008-10-12 10:49:33

Archiving means that I cannot use old data from the programs as before, so for example querying Italian bank accounts of 1990 will not be possible.

rics 2008-10-12 11:43:38

Answer 2

+4 A:

With old data not used in most queries the best solution is to partition tables by the the key which differentiates stale from current data (such as date, currency_id or things like that). You can then put the stale data in separate tables, databases or even servers (depending on the configuration you have running).

The downside to this is that your application must become partition-aware to know where to find the data (though there are abstractions that help deal with sharding and partioning).

Eran Galperin 2008-10-12 12:01:10

It seems to me that partitioning mainly improves performance and simplifies maintenance but it does not solve the designer's problem.

rics 2008-10-14 07:42:10

Who is the designer, and what are his problems if not performance and maintenance?

Eran Galperin 2008-10-14 09:19:00

Answer 3

+1 A:

In several cases, I have had the old data and the old program duplicated with the appropriate read-only permissions set. Thus the users have the ability to see the old data and doing reporting using the old program. Then you are free to advance the modern program how you see fit, deleting columns or tables, migrating some data, etc.

Knox 2008-10-12 12:09:28

Answer 4

+1 A:

You really have to handle it case-by-case, since it is business rules that defines when a obsolete record is relevant or not. For example, in some historical resports it would make sense to include sales to USSR, in other cases you would leave it out.

A general pattern would be to have a "relevant from/to" datatime field on the records. In that case historical reports can include the types which is relevant for the period. (A simpler solution would be a a boolean "obsolete" flag on the records, but since this does not indicate when it was relevant, it wont be as helpful for historical report.)

JacquesB 2008-10-12 12:09:37

Answer 5

A:

In addition to what Eran said about paritioning, you could partly automate the process of deciding what to put into the archived parition by having a LastModified column or similar. Then by simply partioning based on LastModified < -1y or so, the system should learn about stale data itself.

AlexDuggleby 2008-10-12 12:10:05

But that would take a year from now to be effective ;)

Eran Galperin 2008-10-12 12:15:36

Answer 6

+1 A:

This is the standard Slowly-Changing Dimension problem. You have SCD's with status and/or date ranges.

"each of them means a case by case decision and it is hard to know what types of entities need this special handling"

Yep. Sorry about that. You have to analyze your data and think. No easy way around the thinking part of this.

S.Lott 2008-10-12 12:16:26

I know that there is no free lunch. In case of currency or country it is clear at first glance that they may change in time. But there are less obvious entities and if users define business logic rules _at the time_ when their first appear programmers have to deal with the situation.

rics 2008-10-14 07:34:42

Right. Each one is an ad hoc -- at the time -- unique business rule. No general solution. That's true for big data warehouses, as well. There are three different standard algorithms, depending on the level of ad-hocracy involved.

S.Lott 2008-10-14 09:59:55

Answer 7

+2 A:

For any entity which can have a limited lifetime, just add a time-component in its definition. E.g. your Italian Lira can be modeled as:

CREATE TABLE Currency (CurrencyID NUMBER, CurrencyStartDate DATETIME, CurrentEndDate DATETIME)

You can then exclude the expired currencies from any application functions related to current activity, and still maintain the relationship for historical data.

Andrew from NZSG 2008-10-12 12:26:28

I have called this solution as activity period in my question. The problem with this approach is that each query must be aware of this new property of the currency.

rics 2008-10-14 07:22:32

What you call 'activity period' is otherwise known as the 'valid time' and it's an important concept in temporal databases, that I wrote about here: http://stackoverflow.com/questions/310963/relational-schema-for-fowlers-temporal-expressions#312534 if you are interested in the details and research.

Henrik Gustafsson 2008-12-31 11:23:47

Answer 8

A:

Commercial DBMS (Informix, DB2, probably Oracle, ...) have partitioning or fragmentation capabilities such that you can place different data in different fragments, and the query optimizer will ignore fragments that it knows it does not need. You can sometimes use these to place the less frequently used data into storage areas only used for archaic data. The advantage of this is that the system deals with the placement (OK, the system plus the DBA), and the applications are completely oblivious to it.

Any scheme that requires changes to reporting applications is doomed to break at least some of those applications.

Jonathan Leffler 2008-10-12 14:34:03

Answer 9

+1 A:

I would suggest separating the operational system and the reporting system. Have one database for the operational - online system and another for the reporting. (Could be data warehouse, or a simple another database) based on how versatile you need the reporting system to be.

Move data from the operational system to the reporting system periodically. (frequency depends on the nature of your system). All historical reports would be based on the reporting database. The online database would also contain reports, but not (very) historical ones.

And, yes. You need to maintain dates or flags on the tables to decide whether an item has expired.

Nivas 2008-10-12 15:24:33

Answer 10

A:

I found a similar question: What is the best way to implement soft deletion? dealing with the activity flag solution.

And here is another one on activity flags `active’ flag or not? for mysql and postgresql.

Based on these two questions activity flags and/or table partitioning are the most common solutions to the problem.

rics 2008-12-31 09:55:45

Answer 11

A:

You can also update the old data. For example you can convert the Italian lira amounts to Euro amounts. But it is indeed a case-by-case decision. You know your system and the requirements best.

tuinstoel 2008-12-31 10:08:37

ansaurus

tags:

views:

answers:

How to deal with old, obsolete database data of a long running system?

related questions