views:

88

answers:

4

Hi,

If you had to build a relational data warehouse of biblical proportions using SQL Server 2008, would you use foreign keys to enforce data integrity, or would you use some other means?

I like foreign keys because you only need to get them right once, and they are always there to protect integrity. I was thinking of going the disable, load, enable route.

Any thoughts?

Thanks in advance.

+1  A: 

Oh, I most certainly would! What you have to remember is your database is your data store, it is not just the data store for the frontend. This is subtle difference, but it's important when you start to consider the future. Right now, you (presumably) own the management applications, but who is to say that in the future this will still be the case?

By off loading as much validation as you can onto the database, you somewhat future proof your application - at least if someone else tries to develop against your database, more of your assumptions will hold.

The drawbacks of having this on the database side is slower insertion, so you need to weigh up how heavy your application is against reading and writing. At work, we have much more demand for read than write, so referential integrity seems obvious. However, our tables are big (and freely available for importing), so we go down a multiple step import route of creating the tables, inserting the data, creating indexes, and then creating foreign keys and other constraints.

I hope this helps!

aCiD2
@aCiD2, the OP is asking about a **datawarehouse**.
Mark Bannister
@Mark, I understand that, but does that affect my anwser?
aCiD2
@aCiD2, I don't think references to the frontend are relevant to datawarehouses - "source systems" would be more relevant. There is also the question of whether the ETL process should enforce referential integrity - normally, I would expect this, so it would be unnecessary to enforce it in the DB schema. But then, I wouldn't normally expect to use a fully normalised schema as the basis for a datawarehouse.
Mark Bannister
@Mark, yes that makes sense, thanks for the clarification
aCiD2
I'm thinking along the same line as ACiD2. Although we currently own the source systems and know there is referential integrity in them, that may not always be the case. I like the "set and forget"approach that foreign keys provide.
David
Also.. Does the size of the warehouse affect the design with respect to referential integrity and how you deal with it?
David
+2  A: 

In the first place, I wouldn't build a data warehouse that (physically) conformed to a relational schema. Is the proposed data warehouse to be fully normalised, or does the word "relational" in the question simply indicate that it will be built in a SQL database?

Mark Bannister
There will be a normalised layer that captures all of the detail. One or more layers will be built after that where summary data will be loaded for analytics.
David
Because of the potential size of the warehouse, and the fact that we don't know how this (medical) data will be used in the future, we need to store the detail in the first layer in a normalised form.
David
Star schemas are rarely *significantly* larger than the same data stored relationally - this is because the denormalisation involved in constructing a star schema is only applicable to the dimensions, which are normally orders of magnitude smaller than the facts tables.
Mark Bannister
Having said all that, if you are committed to building a truly relational datawarehouse, then yes, I would definitely use foreign keys.
Mark Bannister
+4  A: 

Take a look at the similar question/answer.

Damir Sudarevic
Not sure if I should add here or to the similar question but...If integrity is an issue you can always right integrity functions or stored procedures that look for "orphaned" facts. (Rows where the foreign keys don't make sense). You can then clean those up after/during/before the next cycle of loads on your database.
Markus
+1  A: 

Yes I would generally use foreign keys. That's important in any database but perhaps especially so if the warehouse is a complex one with many tables.

The reasons for using integrity constraints in the warehouse are much the same as in any other database: It minimises the risk of incorrect data getting into the database; It is very often the most economical and performant method of implementing such integrity rules; It means those constraints are available to the optimiser to improve query peformance; The constraints are also available to development tools and users who consume the data and need to interpret its structure.

dportas