Referential integrity in a relational data warehouse. Is it worth it? and what are the alternatives?

+1 A:

Oh, I most certainly would! What you have to remember is your database is your data store, it is not just the data store for the frontend. This is subtle difference, but it's important when you start to consider the future. Right now, you (presumably) own the management applications, but who is to say that in the future this will still be the case?

By off loading as much validation as you can onto the database, you somewhat future proof your application - at least if someone else tries to develop against your database, more of your assumptions will hold.

The drawbacks of having this on the database side is slower insertion, so you need to weigh up how heavy your application is against reading and writing. At work, we have much more demand for read than write, so referential integrity seems obvious. However, our tables are big (and freely available for importing), so we go down a multiple step import route of creating the tables, inserting the data, creating indexes, and then creating foreign keys and other constraints.

I hope this helps!

aCiD2 2010-10-11 08:22:51

@aCiD2, the OP is asking about a **datawarehouse**.

Mark Bannister 2010-10-11 12:15:01

@Mark, I understand that, but does that affect my anwser?

aCiD2 2010-10-11 13:44:06

@aCiD2, I don't think references to the frontend are relevant to datawarehouses - "source systems" would be more relevant. There is also the question of whether the ETL process should enforce referential integrity - normally, I would expect this, so it would be unnecessary to enforce it in the DB schema. But then, I wouldn't normally expect to use a fully normalised schema as the basis for a datawarehouse.

Mark Bannister 2010-10-11 14:03:15

@Mark, yes that makes sense, thanks for the clarification

aCiD2 2010-10-11 14:52:30

I'm thinking along the same line as ACiD2. Although we currently own the source systems and know there is referential integrity in them, that may not always be the case. I like the "set and forget"approach that foreign keys provide.

David 2010-10-12 22:02:40

Also.. Does the size of the warehouse affect the design with respect to referential integrity and how you deal with it?

David 2010-10-12 22:28:42

+2 A:

In the first place, I wouldn't build a data warehouse that (physically) conformed to a relational schema. Is the proposed data warehouse to be fully normalised, or does the word "relational" in the question simply indicate that it will be built in a SQL database?

Mark Bannister 2010-10-11 12:19:45

There will be a normalised layer that captures all of the detail. One or more layers will be built after that where summary data will be loaded for analytics.

David 2010-10-12 21:59:30

Because of the potential size of the warehouse, and the fact that we don't know how this (medical) data will be used in the future, we need to store the detail in the first layer in a normalised form.

David 2010-10-13 01:57:08

Star schemas are rarely *significantly* larger than the same data stored relationally - this is because the denormalisation involved in constructing a star schema is only applicable to the dimensions, which are normally orders of magnitude smaller than the facts tables.

Mark Bannister 2010-10-13 10:04:19

Having said all that, if you are committed to building a truly relational datawarehouse, then yes, I would definitely use foreign keys.

Mark Bannister 2010-10-13 10:05:55

+4 A:

Take a look at the similar question/answer.

Damir Sudarevic 2010-10-11 15:03:32

Not sure if I should add here or to the similar question but...If integrity is an issue you can always right integrity functions or stored procedures that look for "orphaned" facts. (Rows where the foreign keys don't make sense). You can then clean those up after/during/before the next cycle of loads on your database.

Markus 2010-10-12 18:23:14

+1 A:

Yes I would generally use foreign keys. That's important in any database but perhaps especially so if the warehouse is a complex one with many tables.

The reasons for using integrity constraints in the warehouse are much the same as in any other database: It minimises the risk of incorrect data getting into the database; It is very often the most economical and performant method of implementing such integrity rules; It means those constraints are available to the optimiser to improve query peformance; The constraints are also available to development tools and users who consume the data and need to interpret its structure.

dportas 2010-10-12 17:52:43

ansaurus

tags:

views:

answers:

Referential integrity in a relational data warehouse. Is it worth it? and what are the alternatives?

related questions