views:

544

answers:

11

What are the possibilities of a programmer to handle data that are rarely used but cannot be simply deleted because at least reporting still requires it?

Some examples I am thinking of:

  • Discountinued funding types of older years of a university
  • Unused currencies (e.g. Italian lira)
  • Names of disappeared countries (e.g. Austro-Hungary, USSR)

Some partial solutions are activity flags, activity periods, priorities of visualization but each of them means a case by case decision and it is hard to know what types of entities need this special handling.

May be there is a design pattern for this problem.

Conclusions: (based on the answers so far)

  • If old data makes everyday work difficult on a huge database, partitioning would be helpful. Oracle's description on this subject is here.

  • From the point of view of the designer the taxonomy of Slowly changing dimension gives some background information.

A: 

One solution could be (assuming records referencing obsolete data are the oldest): archive those records and remove old reference data.

friol
Archiving means that I cannot use old data from the programs as before, so for example querying Italian bank accounts of 1990 will not be possible.
rics
+4  A: 

With old data not used in most queries the best solution is to partition tables by the the key which differentiates stale from current data (such as date, currency_id or things like that). You can then put the stale data in separate tables, databases or even servers (depending on the configuration you have running).

The downside to this is that your application must become partition-aware to know where to find the data (though there are abstractions that help deal with sharding and partioning).

Eran Galperin
It seems to me that partitioning mainly improves performance and simplifies maintenance but it does not solve the designer's problem.
rics
Who is the designer, and what are his problems if not performance and maintenance?
Eran Galperin
+1  A: 

In several cases, I have had the old data and the old program duplicated with the appropriate read-only permissions set. Thus the users have the ability to see the old data and doing reporting using the old program. Then you are free to advance the modern program how you see fit, deleting columns or tables, migrating some data, etc.

Knox
+1  A: 

You really have to handle it case-by-case, since it is business rules that defines when a obsolete record is relevant or not. For example, in some historical resports it would make sense to include sales to USSR, in other cases you would leave it out.

A general pattern would be to have a "relevant from/to" datatime field on the records. In that case historical reports can include the types which is relevant for the period. (A simpler solution would be a a boolean "obsolete" flag on the records, but since this does not indicate when it was relevant, it wont be as helpful for historical report.)

JacquesB
A: 

In addition to what Eran said about paritioning, you could partly automate the process of deciding what to put into the archived parition by having a LastModified column or similar. Then by simply partioning based on LastModified < -1y or so, the system should learn about stale data itself.

AlexDuggleby
But that would take a year from now to be effective ;)
Eran Galperin
+1  A: 

This is the standard Slowly-Changing Dimension problem. You have SCD's with status and/or date ranges.

"each of them means a case by case decision and it is hard to know what types of entities need this special handling"

Yep. Sorry about that. You have to analyze your data and think. No easy way around the thinking part of this.

S.Lott
I know that there is no free lunch. In case of currency or country it is clear at first glance that they may change in time. But there are less obvious entities and if users define business logic rules _at the time_ when their first appear programmers have to deal with the situation.
rics
Right. Each one is an ad hoc -- at the time -- unique business rule. No general solution. That's true for big data warehouses, as well. There are three different standard algorithms, depending on the level of ad-hocracy involved.
S.Lott
+2  A: 

For any entity which can have a limited lifetime, just add a time-component in its definition. E.g. your Italian Lira can be modeled as:

CREATE TABLE Currency (CurrencyID NUMBER, CurrencyStartDate DATETIME, CurrentEndDate DATETIME)

You can then exclude the expired currencies from any application functions related to current activity, and still maintain the relationship for historical data.

Andrew from NZSG
I have called this solution as activity period in my question. The problem with this approach is that each query must be aware of this new property of the currency.
rics
What you call 'activity period' is otherwise known as the 'valid time' and it's an important concept in temporal databases, that I wrote about here: http://stackoverflow.com/questions/310963/relational-schema-for-fowlers-temporal-expressions#312534 if you are interested in the details and research.
Henrik Gustafsson
A: 

Commercial DBMS (Informix, DB2, probably Oracle, ...) have partitioning or fragmentation capabilities such that you can place different data in different fragments, and the query optimizer will ignore fragments that it knows it does not need. You can sometimes use these to place the less frequently used data into storage areas only used for archaic data. The advantage of this is that the system deals with the placement (OK, the system plus the DBA), and the applications are completely oblivious to it.

Any scheme that requires changes to reporting applications is doomed to break at least some of those applications.

Jonathan Leffler
+1  A: 

I would suggest separating the operational system and the reporting system. Have one database for the operational - online system and another for the reporting. (Could be data warehouse, or a simple another database) based on how versatile you need the reporting system to be.

Move data from the operational system to the reporting system periodically. (frequency depends on the nature of your system). All historical reports would be based on the reporting database. The online database would also contain reports, but not (very) historical ones.

And, yes. You need to maintain dates or flags on the tables to decide whether an item has expired.

Nivas
A: 

I found a similar question: What is the best way to implement soft deletion? dealing with the activity flag solution.

And here is another one on activity flags `active’ flag or not? for mysql and postgresql.

Based on these two questions activity flags and/or table partitioning are the most common solutions to the problem.

rics
A: 

You can also update the old data. For example you can convert the Italian lira amounts to Euro amounts. But it is indeed a case-by-case decision. You know your system and the requirements best.

tuinstoel