I would like to see an example of:

  • When this is appropriate
  • When this is not appropriate

Is there a time when the choice of database would make a difference to the above examples?

+1  A: 

A few examples...

Appropriate:

  • OLTP systems, in most cases where you are implementing a many-to-many relationship.

Inappropriate:

  • For dimension tables in OLAP systems -- you want to make your dimension key as small as possible so that your fact table is as small (and fast) as possible.

  • For times when you aren't sure if the combination is unique. Granted this is a pretty crummy example, but a "Person" table would be a bad choice for a multi-column PK.

Dave Markle
A: 

One example of when it's appropriate is when you have a linking table with foreign key fields connecting different tables.

In general, it's probably a good idea to use existing, identifying fields as your primary key when possible. If you don't have a natural id field, and you would have to combine a lot of fields to get a unique PK, it's probably better to use an auto number. Primary keys with more than 2 fields can get messy.

froadie
+2  A: 

You nearly always want a primary key, so I assume the choice is between making two existing columns the primary key, or adding a new auto-incrementing PK and putting an ordinary unique constraint on the two columns instead.

When you want a 2-column primary key:

  • If you have an intermediate table that references two other tables and consists only of the two foreign keys, i.e. a many-to-many relationship, then there is no point adding an extra column just to be a primary key. Use the two columns you already have as the primary key.
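
A minimal sketch of such a linking table, using Python's built-in sqlite3 module as a stand-in for whatever database you actually run. The student, course and enrollment names are made up for illustration; the point is only that the two foreign keys together form the primary key, so the same pair cannot be stored twice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.executescript("""
    CREATE TABLE student (student_id INTEGER PRIMARY KEY, name  TEXT NOT NULL);
    CREATE TABLE course  (course_id  INTEGER PRIMARY KEY, title TEXT NOT NULL);

    -- Hypothetical linking table: the two foreign keys together are the PK.
    CREATE TABLE enrollment (
        student_id INTEGER NOT NULL REFERENCES student(student_id),
        course_id  INTEGER NOT NULL REFERENCES course(course_id),
        PRIMARY KEY (student_id, course_id)
    );
""")
conn.execute("INSERT INTO student VALUES (1, 'Ann')")
conn.execute("INSERT INTO course VALUES (10, 'Databases')")
conn.execute("INSERT INTO enrollment VALUES (1, 10)")


def enroll_again():
    """Insert the same pair a second time; return True if the PK rejected it."""
    try:
        conn.execute("INSERT INTO enrollment VALUES (1, 10)")
    except sqlite3.IntegrityError:
        return True
    return False


duplicate_rejected = enroll_again()
row_count = conn.execute("SELECT COUNT(*) FROM enrollment").fetchone()[0]
```

No extra id column is needed here: the composite key already identifies each row and prevents duplicate links.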

When you want an auto-increment primary key:

  • If you are referencing a table from another table, you want the primary key of the target table to be small, because that data will be repeated as the foreign key in the referring table. You also want it to be fast to compare.
  • Every index you add to a table includes a copy of the clustering key (which is typically the same as the primary key). If your clustering key is larger than it needs to be, every index on that table will be larger than it needs to be as well.
Mark Byers
"Every index you add to a table includes a copy of the primary key. If your primary key is larger than it needs to be, every index on that table will be larger than it needs to be as well." What does an index on a table without a primary key reference? Do you know this to be true, or are you just assuming it?
Evan Carroll
Every index includes the clustering key, which may or may not be the primary key (it usually is).
Aaronaught
What if you create an index before a primary key? I think it is given that an index points to a row, but what makes you think the row pointer is simply the primary key? What does the primary key index point to? I think the statement was simply not true.
Evan Carroll
@Evan: If you have no clustering key (which, again, may or may not be the primary key), then every nonclustered index includes a copy of nothing, so the statement is still true. ;)
Aaronaught
@Aaronaught: Good point about the clustering key vs primary key, you're exactly right - I made an assumption here that may not be true.
Mark Byers
@Evan Carroll, "Do you know this to be true, or do you just assume it that way?" I know it to be true (after Aaronaught's correction) for some DBs. Obviously it could be DB dependent and there may be a DB that does it a different way, but I don't know of any DB that does it any other way. Here's a link to SQL Server's documentation about this: "If the table has a clustered index, or the index is on an indexed view, the row locator is the clustered index key for the row." http://msdn.microsoft.com/en-us/library/ms177484.aspx
Mark Byers
+1  A: 

I think it's almost always better (from an application developer standpoint, at least) to make the primary key an auto-generated key, and create a UNIQUE constraint and an index on the multiple columns.

  • With a single auto-generated primary key, you'll be able to easily add references to this table from other tables.
  • Auto-generated primary keys work more simply with ORM libraries.
  • Also, if your uniqueness constraints change in the future, you don't have to change the existing primary keys.
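
As a rough sketch of that layout, again using Python's sqlite3 module purely for illustration: the product and order_line tables and their columns are hypothetical. The surrogate key is the primary key, the natural column pair keeps a UNIQUE constraint, and other tables reference the single id column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    -- Hypothetical schema: surrogate key as PK, natural key kept unique.
    -- SQLite backs the UNIQUE constraint with an index automatically.
    CREATE TABLE product (
        product_id  INTEGER PRIMARY KEY AUTOINCREMENT,
        vendor_code TEXT NOT NULL,
        part_number TEXT NOT NULL,
        UNIQUE (vendor_code, part_number)
    );
    -- Referencing tables only need to carry the single-column key.
    CREATE TABLE order_line (
        order_line_id INTEGER PRIMARY KEY AUTOINCREMENT,
        product_id    INTEGER NOT NULL REFERENCES product(product_id),
        quantity      INTEGER NOT NULL
    );
""")
cur = conn.execute(
    "INSERT INTO product (vendor_code, part_number) VALUES (?, ?)",
    ("ACME", "X-100"),
)
product_id = cur.lastrowid  # the generated surrogate key
conn.execute(
    "INSERT INTO order_line (product_id, quantity) VALUES (?, ?)",
    (product_id, 3),
)

# The natural key is still protected by the UNIQUE constraint.
try:
    conn.execute(
        "INSERT INTO product (vendor_code, part_number) VALUES ('ACME', 'X-100')"
    )
    unique_enforced = False
except sqlite3.IntegrityError:
    unique_enforced = True
```

If the uniqueness rule changes later, only the UNIQUE constraint has to change; product_id and every reference to it stay as they are.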

I've run into several headache-inducing situations because a DBA thought that a multiple-column primary key would always be sufficient, and future requirements changes proved this incorrect.

Kaleb Brasee
A: 

We found great performance increases in our application when we used multi-column indexes and keys. It allowed us to create indexes for our most common queries, and the main table was not even accessed, since every column in the select clause could be read from the index. However, it depends on your app and data set.
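
A small sketch of the index-only access described above. It uses SQLite through Python's sqlite3 module rather than DB2, so treat it as an illustration of the idea only; the orders table and idx_orders_customer_date index are invented names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL,
        order_date  TEXT NOT NULL,
        notes       TEXT
    );
    -- Multi-column index holding every column the query below asks for.
    CREATE INDEX idx_orders_customer_date ON orders (customer_id, order_date);
""")

# Ask SQLite how it would run the query; the last column of each row
# is the human-readable description of that plan step.
plan_rows = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT customer_id, order_date FROM orders WHERE customer_id = ?",
    (42,),
).fetchall()
plan_text = " ".join(row[3] for row in plan_rows)
print(plan_text)
```

When the plan mentions a covering index, the query is answered from the index alone. Selecting the notes column as well would force a lookup in the table itself.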

rerun
Be aware that this shouldn't be taken as general advice for all databases. For example, multi-column indexes on Teradata only get used if all columns within the index are used in the query, since Teradata uses hashes for indexing.
lins314159
Yes, this is on an enterprise system with hundreds of millions of rows. That is why I said "our application": for most apps you would probably not see the benefits we have. Our indexes were tuned by DB2 engineers at IBM to get the maximum gain.
rerun
Yeah, but if you have a five-column primary key, any JOIN from a child table is going to be a royal mess! It needs five conditions just to establish the JOIN..... JOINs from HELL!
marc_s
+8  A: 

This really seems to be a question about surrogate keys, which are always either an auto-incrementing number or GUID and hence a single column, vs. natural keys, which often require multiple pieces of information in order to be truly unique. If you are able to have a natural key that is only one column, then the point is obviously moot anyway.

Some people will insist on only using one or the other. Spend sufficient time working with production databases and you'll learn that there is no context-independent best practice.

Some of these answers use SQL Server terminology but the concepts are generally applicable to all DBMS products:


Reasons to use single-column surrogate keys:

  • Clustered indexes. A clustered index always performs best when the database can merely append to it - otherwise, the DB has to do page splits. Note that this only applies if the key is sequential, i.e. either an auto-increment sequence or a sequential GUID. Arbitrary GUIDs will probably be much worse for performance.

  • Relationships. If your key is 3, 4, 5 columns long, including character types and other non-compact data, you end up wasting enormous amounts of space and subsequently reduce performance if you have to create foreign key relationships to this key in 20 other tables.

  • Uniqueness. Sometimes you don't have a true natural key. Maybe your table is some sort of log, and it's possible for you to get two of the same event at the same time. Or maybe your real key is something like a materialized path that can only be determined after the row is already inserted. Either way, you always want your clustered index and/or primary key to be unique, so if you have no other truly unique information, you have no choice but to employ a surrogate key.

  • Compatibility. Most people will never have to deal with this, but if the natural key contains something like a hierarchyid, it's possible that some systems can't even read it. In this case, again you must create a simple auto-generated surrogate key for use by these applications. Even if you don't have any "weird" data in the natural key, some DB libraries have a lot of trouble dealing with multi-column primary keys, although this problem is quickly going away.
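
To make the Relationships point concrete, here is a sketch with a three-column natural key, written against SQLite via Python's sqlite3 module. The shipment and shipment_item tables are hypothetical. Every referencing table has to repeat all three key columns, and every join needs one condition per column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    -- Hypothetical parent table with a three-column natural key.
    CREATE TABLE shipment (
        warehouse_code TEXT NOT NULL,
        carrier_code   TEXT NOT NULL,
        ship_date      TEXT NOT NULL,
        PRIMARY KEY (warehouse_code, carrier_code, ship_date)
    );
    -- The child table must carry a copy of all three key columns.
    CREATE TABLE shipment_item (
        warehouse_code TEXT NOT NULL,
        carrier_code   TEXT NOT NULL,
        ship_date      TEXT NOT NULL,
        sku            TEXT NOT NULL,
        FOREIGN KEY (warehouse_code, carrier_code, ship_date)
            REFERENCES shipment (warehouse_code, carrier_code, ship_date)
    );
""")
conn.execute("INSERT INTO shipment VALUES ('WH1', 'UPS', '2010-03-01')")
conn.execute(
    "INSERT INTO shipment_item VALUES ('WH1', 'UPS', '2010-03-01', 'SKU-1')"
)

# One join condition per key column.
rows = conn.execute("""
    SELECT i.sku
    FROM shipment AS s
    JOIN shipment_item AS i
      ON  i.warehouse_code = s.warehouse_code
      AND i.carrier_code   = s.carrier_code
      AND i.ship_date      = s.ship_date
""").fetchall()
```

With a single-column surrogate key on shipment, the child table would store one integer and the join would need one condition.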

Reasons to use multi-column natural keys:

  • Storage. Many people who work with databases never work with large enough ones to have to care about this factor. But when a table has billions or trillions of rows, you are going to want to keep the absolute minimum amount of data in this table that you possibly can.

  • Replication. Yes, you can use a GUID, or a sequential GUID. But GUIDs have their own trade-offs, and if you can't or don't want to use a GUID for some reason, a multi-column natural key is a much better choice for replication scenarios because it is intrinsically globally unique - that is, you don't need a special algorithm to make it unique, it's unique by definition. This makes it very easy to reason about distributed architectures.

  • Insert/Update Performance. Surrogate keys aren't free. If you have a set of columns that are unique and frequently queried on, and you therefore need to create a covering index on these columns, then the index ends up being almost as large as the table, which wastes space and requires that a second index be updated every time you make any modifications. If it is ever possible for you to have only one index (the clustered index) on a table, you should do it!


That's what comes to mind right off the bat. I'll update if I suddenly remember anything else.

Aaronaught
+1 - Very nice explanation.
Mark Brittingham
A: 

Sometimes composite natural keys make intuitive sense. For example, suppose you have a Company table (PK is CompanyId) with some details of the company in its columns, and a requirement to store the name of the company's CEO throughout its history. The natural invariant is that a company can have only one CEO at a time. It is then intuitive to create a CompanyCeo table with a composite PK of CompanyId (an FK to CompanyId in the Company table) + FromDate. Other columns in that table might be ToDate and CeoName. This way you can guarantee that, for a given company, at most one CEO can start on a particular date.
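
A minimal sketch of that schema, using SQLite through Python's sqlite3 module as a stand-in database. The Name column on Company and all sample values are assumptions added for illustration; dates are stored as ISO strings because SQLite has no dedicated date type.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE Company (
        CompanyId INTEGER PRIMARY KEY,
        Name      TEXT NOT NULL       -- assumed detail column
    );
    CREATE TABLE CompanyCeo (
        CompanyId INTEGER NOT NULL REFERENCES Company(CompanyId),
        FromDate  TEXT NOT NULL,      -- ISO date string
        ToDate    TEXT,               -- NULL while the CEO is in office
        CeoName   TEXT NOT NULL,
        PRIMARY KEY (CompanyId, FromDate)
    );
""")
conn.execute("INSERT INTO Company VALUES (1, 'Example Corp')")
conn.execute(
    "INSERT INTO CompanyCeo VALUES (1, '2005-01-01', '2009-12-31', 'A. Smith')"
)
conn.execute("INSERT INTO CompanyCeo VALUES (1, '2010-01-01', NULL, 'B. Jones')")

# A second CEO starting on the same date for the same company is rejected.
try:
    conn.execute(
        "INSERT INTO CompanyCeo VALUES (1, '2010-01-01', NULL, 'C. Brown')"
    )
    second_start_rejected = False
except sqlite3.IntegrityError:
    second_start_rejected = True

ceo_rows = conn.execute("SELECT COUNT(*) FROM CompanyCeo").fetchone()[0]
```

Note that the key only stops two rows from sharing a start date. It does not by itself prevent overlapping FromDate/ToDate ranges, which would need an additional check.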

Pratik