views:

360

answers:

14

It's fairly general question, but I'd like to know what do you use in determination of primary key of the table. Examples supplied with you reasoning are highly desired.

I noticed that many programmers add ID column and use it as a primary key. I think, it is flawed from design point of view, as ID in that case HAS NOTHING TO DO WITH THE TABLE.

+12  A: 

The role of a primary key is to uniquely identify each row in your table. If no column or set of columns matches this requirement, an column containing a unique id is often added as a primary key.

I do not agree with your comment about programmers adding an id that has nothing to do with table data. When you need to link data across several tables, a concise id is easier to use than a compound key.

Altherac
The `id` key is usually called a surrogate key: http://en.wikipedia.org/wiki/Surrogate_key
hhaamu
If a table does not have a column or set of columns which uniquely identify any given row then the data model is almost certainly flawed. Certainly a surrogate key is often more convenient than a compound key (especially when interacting with ORM tools) but it is a supplement to, not a substitute for, a properly identified business key.Note that the business key can be a monotically incrementing sequence e.g. Order Number.
APC
A: 

The primary key should always be an auto-increment integer that's unrelated to your data.

Edited to add that GUIDs are fine too. The important thing is that the key doesn't describe your data, so if your data changes, your PK doesn't. Always use an ID field.

Consider you're using an email as your primary key, and then the user changes their email address. You then have to cascade that change to every joined table. Using real data as your PK doesn't make sense.

Matt Grande
This is highly debatable. Some people think they should be guids. Some people think they should be based on the data in the db, and not be a seperate column. You'll need some reasoning behind your assertion to gain much support.
Beska
@Matt are you joking or is this flamebait?
Sinan Ünür
-1; far too broad of a rule. If the data inherently has uniquely-identifying information then it can be used; if you're concerned about replication or moving the data then a rowguid can be a very good idea. An auto-incrementing integer is simple and good for performance but doesn't scale well and is not a silver bullet.
STW
Primary Keys do not need to be surrogate keys, surrogate keys are useful but you should also be able to describe the natural key of the data. If you add a surrogate key without understanding what the natural key is (often if you're using a surrogate the natural includes a date/timestamp), you will have problems deciding what goes into the table.
SamH
Okay fine, use a guid instead of an int. The fact of the matter is, it shouldn't be based on your data. Consider you're using an email as your primary key, and then the user changes their email address. You then have to cascade that change to every joined table. Using real data as your PK doesn't make sense.
Matt Grande
Matt: Your comment has a good point that supports your answer...so edit your answer and put it in there, and you'll be more likely to increase your support.
Beska
Good point, thanks Beska.
Matt Grande
+3  A: 

Hollywood Couples: Relationships are harder than you think ... one of my favorites.

Sinan Ünür
Thanks for the funny link.
Tiberiu Ana
+2  A: 

A key should be a column where each entry is guaranteed to be unique. Examples might be things like a social insurance number or driver's license number. In theory you can tie multiple columns together into a compound key. So perhaps name and birth day might be unique together so they could be a key. However in practice nobody does that because crossing tables is a pain. The best solution is usually to add an autoincrementing value or GUID column.

stimms
+3  A: 

You pick anything that you know will be a unique value, preferably something numeric such as a customer ID or account number. Stay away from string based keys if at all possible. If nothing else, use a GUID value or an Auto-increment integer.

BBlake
+1 for GUIDs. traditionally, I have always used an incrementing integer for the key id, but I have found the GUID method to be much easier. This may not be the case for everyone, but I have jumped ship from integers to GUIDs :)
Anders
Good point but it wouldn't hurt to mention why. A numeric lookup is faster, and strings, especially those with varying lengths make for slower indexes in general. Read High Performance MySQL for a detailed explanation. Here's an excellent, advanced look at some indexing issues: http://www.xaprb.com/blog/2009/02/12/5-ways-to-make-hexadecimal-identifiers-perform-better-on-mysql/
joedevon
@joe not to mention string primary keys use up a ton of space in InnoDB.
orlandu63
+2  A: 
Tiberiu Ana
+1  A: 

A primary key doesn't necessarily have to be a single column, but can also be a combination of columns. As Altherac's answer states it's purpose is to uniquely identify each row.

For performance reasons it's better to have a small key, but depending on the needs of the system the type of key used can vary greatly.

STW
A: 

Well, in one of the systems we use (and I designed), each user has an auto-incremented primary key as their ID. Other tables that are related to that particular user uses their ID as its primary key (although obviously not auto-incremented) as well, so it does make sense if used correctly.

Brian
A: 

In theory any unique field could be used (e.g. Social Security Number, url, etc.), but in practice I don't think there's any big disadvantage to using an auto-generated ID. For example, some wacky mistake makes a duplicate SSN could be disastrous to your data.

Neko
+5  A: 

My thought process in determining a primary key goes like this.

"One record in this table will represent ...?"

"For distinct value of Col X, Col Y, Col Z.. there should only be one row in the table", What are Cols X Y and Z ?"

The CAR_MODEL table.

Hmm this table will store information about different types of cars, should the MANUFACTURER_NAME be the key ? No, I can have many rows identifying different car models from the same manufacturer. Hmm should the MANUFACTURER_NAME and MODEL_NAME be the key ? No, i want to have different rows with the same MANUFACTURER_NAME and MODEL_NAME but different release years in the table at the same time. Ok what about "MANUFACTURER_NAME", "MODEL_NAME" and "RELEASE_YEAR".

Is it possible for me to have two rows with the same MANUFACTURER_NAME, MODEL_NAME and RELEASE_YEAR at the same time? Hmmm no. That wouldn't make sense, they would be the same Car Model, and I only want 1 record per car model. Great, that's the key.

One record in this table will represent a particular model from a particular year from a particular manufacturer. I decide this when i create the table, that's why i created the table, if you can't describe what's going in the table in terms that help identify the key you don't really understand why you are creating it.

Horrible Changes Over Time!!! (surrogate keys, Natural Key, Slowly changing dimensions)

Ah but the information I am storing about a particular Car Model (from a particular Manufacturer and Release Year) may change. Initially I was told that it had two doors, now I find it has four, I want to have this correct information in my table but not lose the old record as people have reported off it and I need to be able to reproduce their old results.

Ok, I will add a new column "MODEL_ID" and make it the primary key of the table, so I can store multiple records with the same model name, manufacturer name and release year. I will also add a valid_from and valid_to timestamp.

This can work well, and indeed with my changes the Primary Key of the table is now MODEL_ID, a surrogate key. But the Natural Key, the Business Key, the key 'at any point in time', is still Model_Name, Manufacturer_Name and Release_Year, and I can't loose sight of that.

Note on Surrogate Keys :

A surrogate key is unique for each row, by definition! A surrogate key makes it easier to manipulate data sometimes, especially data that changes over time. But a surrogate key doesn't in any way replace a Natural Primary Key, you still need to know what the 'grain' of the table is.

If we said that every person in Australia will be assigned a Stack_Overflow_User_id what would we do when Jeff and Joel started giving Stack_Overflow_User_Id's to dogs and cats and multiple IDs to the same people ??

We would say, "hey Jeff and Joel, only give out 1 ID per First_Name, Last_Name, Date_of_Birth and Place_of_Birth!". *

We need to know the natural key or we can give anything a surrogate key!

(* what about people where all these are the same ? don't we need a passport number or some sort of surrogate ? In practice a surrogate is nice and clean, but where did it originate ? originally it came from a natural key.)

SamH
This answer could use a bit of cleanup, but is an easy +1 for explaining the thinking process behind the concept of multiple column natural keys.
Beska
+1 good, thorough thought process.
KG
+1  A: 

When I use surrogate keys it seems to me that performance increases. I usually use Int ID for performance.

THEn
A: 

Think of it as a possible unique identifier (single or multiple columns) for your records.

Think about finger prints. Do you think they are unique to an individual; it hasn't been proven yet, but it sure seems like a decent unique identifier until the population gets so large that redundancy creeps in. At present, this is like a primary key for records that identify you. [1 column]

In the case that our population explodes, and finger prints start showing their weaknesses, we can combine finger prints and iris scans to be a much stronger primary key. [2 columns]

The primary key is usually unique by design such as an ID number provided at the instantiation of the record in our database.

At the least, I hope this helps with the concept.

+1  A: 

Every Data that is needed to uniquely identify your Entry should be your table ID. If none such Data exists, you have to create one (most commonly, a running number is used).

I disagree on your point that all ID should have something to do with the table because sometimes it isn't enough to uniquely identify the record. Furthermore, you'd have to use multiple ID's which is much harder to work with than a simple running number as ID.

Primary Keys are relatively simple for single tables, but once you have your Entries spread out on multiple tables, things can get messy, especially with many to many connections. Working with foreign Keys also has to be thought through before implementing them.

If you want to work professionally with Databases (or want to do it by the book), it would be best to familiarize yourself with Data Structure Diagrams

EDIT: Unified Modeling Language should help you determine what to use as Primary Key

Mike
+1  A: 

Use natural keys wherever they work and can be trusted. If you analyze your subject matter into entities and relationships among entities (ER), you should come up with keys that identify the entities in the data itself. If there is an entity whose identity is confused in the data itself, invent an artificial key (usually called a surrogate key). Inventing a key is a last resort.

When you go to build tables, some tables describe entities, and others describe relationships. Entity tables get the same key as the entity. Relationship tables get a compound key with one component for each entity that participates in the relationship. Some relationships won't get a table of their own (many to one). Instead, they will be represented by adding foreign keys to existing tables, so they won't need a primary key of their own.

This will slow you down a little compared to using invented ID fields for every table. But it will result in better data management, which results in better data.

Walter Mitty