Are Surrogate Primary Keys needed on a Fact table in a Data Warehouse?

views:

940

answers:

+1 Q:

Are Surrogate Primary Keys needed on a Fact table in a Data Warehouse?

When I asked our DB designers why our Fact table do not have a PK, I was told that there is no set of columns in the table that would uniquely identify a record, even if all the columns were selected. Whenb I suggested that we an an identity column in that case I was told that "I'd just be wasting space and that it wasn't needed."

My feeling is that every table in the source system should have a PK, even if it is an identity column. Given that the data warehouse (DW) is a recipient of data from other system-how would I otherwise be able to ensure that the data in the DW accurately reflects what is in the source system if there is no way to tie individual records? If you have a runaway load program that screws up data and has run for a week, how would you reconcile the differences with a live transaction source system w/o some sort of unique constraint to compare?

+3 A:

Database table without primary key seems like a poor design choice and making lots of room for different types of anomalies i.e. how would you delete or update single record in such table?

empi 2009-05-30 18:16:47

We don't, we just load them. But point taken!

Velika 2009-05-31 03:50:08

Fact tables do not need primary keys

adolf garlic 2009-06-03 08:49:49

Sorry, I meant in the sense of having an "ID" column.

adolf garlic 2009-06-03 08:55:13

+1 A:

You are correct--sort of. Without a primary key, a table does not meet the minimal definition of being relational. It's fundamental to being a relation that it must not permit duplicate rows. Tables in a Data Warehouse design should be relational, even if they're not strictly in normal form.

So there must be some column (or set of columns) in the row that serve to identify rows uniquely. But it doesn't necessarily have to be an identity column for a surrogate key.

If the Fact Table has no set of columns that can serve this role of being a candidate key, then more Dimension Tables are needed in this DW, and more columns are needed in the Fact Table.

This new Dimension alone may not be the primary key; it may be combined with existing columns in the Fact Table to create a candidate key.

Bill Karwin 2009-05-30 18:18:36

+1 A:

I would agree with you.

"I was told that there is no set of columns in the table that would uniquely identify a record, even if all the columns were selected." - this seems to break something fundamental about relational databases as I understand them.

A fact consists of additive values plus foreign keys to dimensions. Time is an obvious dimension that is common to every dimensional model that I know. If nothing else, a composite key that contains timestamp would certainly be unique enough.

I wonder if your DBAs have much knowledge about dimensional modeling. It's a different way of thinking from the normal relational, transactional style.

duffymo 2009-05-30 18:22:09

Fact table is dimensional database, not relational database

adolf garlic 2009-06-03 08:50:39

To be precise about it, a dimensional schema is a particular relational style that uses a combination of dimension and fact tables in a star or snowflake arrangement to represent data.

duffymo 2009-06-03 09:58:35

+3 A:

An identity type column is a "surrogate" key that replaces one of your "candidate" keys (simply put). Adding a surrogate key columns adds nothing if you can't identify a row without it. Which requires a candidate key.

gbn 2009-05-30 18:23:09

I think I agree. That suggests to me that the source transaction system should be passing the logical key (the keys that the end user views as uniquely identifying a record.)How else can you reconcile differences between your Warehouse and source transaction system?

Velika 2009-06-04 03:15:49

+1 A:

A data warehouse is not necessarily a relational data store, although you may choose to make it one, so relational definitions don't necessarily apply.

A primary key is only required if you want to do something with the data that requires a unique identifier (like trace it to a source, but that's not always required or necessary or even possible anyway); and data in a data warehouse can often be used in ways that don't require primary keys. Specifically, you may not need to distinguish rows from each other. Most often for constructing aggregate values.

Time is not a required dimension in constructing data warehouse tables.

It may be psychologically uncomfortable, and wasted space is a trivial issue, but your colleague is correct - PKs aren't necessary.

le dorfier 2009-05-30 18:33:09

Time isn't required, but I haven't seen many that aren't concerned with history.

duffymo 2009-05-30 18:42:18

Even for the ones that are, the resolution of the timestamp varies - often only as precise as a year - with many legitimate identical data points having the same value.

le dorfier 2009-05-30 21:08:33

I agree with le dorfier. Hence, time doesn't make transactions necessarily unique.

Velika 2009-05-31 03:54:00

Not having a unique identifier for each row is even worse than it first seems. Sure, it is precarious and it's easy to inadvertently delete some rows.

But performance is much worse too. Each time you end up asking the database to get you the rows for Employees with EmployeeType = 'Manager' you are doing a string comparison. Identifiers are just faster and better.

Besides, storage is cheap and in this case I imagine the impact on space will be less than a quarter percentage point if that--as a data warehouse you are probably designing for terabytes of data.

aleemb 2009-05-30 21:49:19

If the fact table is at the center of a star schema, then there is in reality a candidate key. If you take all the foreign keys in the fact table together, the ones that point to rows in the dimension tables, that's a candidate key.

It probably would not do much good to declare it as a primary key. The only thing it would do is protect you against a rogue ETL process. The folks who run the warehouse might have the ETL processing well in hand.

As far as indexing and query speed is concerned, that's a whole different issue with star schemas than it is with OLTP oriented databases. The people who run the warehouse may have that in hand as well.

When designing a database for OLTP use, it's unwise to have a table without a primary key. The same considerations don't carry over into warehouses.

Walter Mitty 2009-05-31 02:30:07

"The only thing it would do is protect you against a rogue ETL process"Isn't that consideration alone enough to warrant having a PK?

Velika 2009-05-31 03:56:36

A surrogate autogenerated PK doesn't protect against a rogue ETL process. Some warehouse DBAs trust the ETL programs more than they trust random application programs. That trust might be misplaced, so your point is worth considering.

Walter Mitty 2009-05-31 11:03:59

http://www.ralphkimball.com/html/controversies.html

Fable:

The primary key of a fact table consists of all the referenced dimension foreign keys.

Fact:

A fact table often has 10 or more foreign keys joining to the dimension tables’ primary keys. However, only a subset of the fact table’s foreign key references is typically needed for row uniqueness. Most fact tables have a primary key that consists of a concatenated/composite subset of the foreign keys.

adolf garlic 2009-06-03 08:54:24

ansaurus

tags:

views:

answers:

Are Surrogate Primary Keys needed on a Fact table in a Data Warehouse?

Fable:

Fact:

related questions