views:

743

answers:

7

So I've seen several mentions of a surrogate key lately, and I'm not really sure what it is and how it differs from a primary key.

I always assumed that ID was my primary key in a table like this:

Users
    ID, Guid
    FirstName, Text
    LastName, Text
    SSN, Int

however, wikipedia defines a surrogate key as "A surrogate key in a database is a unique identifier for either an entity in the modeled world or an object in the database. The surrogate key is not derived from application data."

According to Wikipedia, it looks like ID is my surrogate key, and my primary key might be SSN+ID? Is this right? Is that a bad table design?

Assuming that table design is sound, would something like this be bad, for a table where the data didn't have anything unique about it?

LogEntry
    ID, Guid
    LogEntryID, Int [sql identity field +1 every time]
    LogType, Int
    Message, Text
+8  A: 

No, your ID can be both a surrogate key (which just means it's not "derived from application data", e.g. an artificial key), and it should be your primary key, too.

The primary key is used to uniquely and safely identify any row in your table. It has to be stable, unique, and NOT NULL - an "artificial" ID usually has those properties.

I would normally recommend against using "natural" or real data for primary keys - are not REALLY 150% sure it's NEVER going to change?? The Swiss equivalent of the SSN for instance changes each time a woman marries (or gets divorced) - hardly an ideal candidate. And it's not guaranteed to be unique, either......

To spare yourself all that grief, just use a surrogate (artificial) ID that is system-defined, unique, and never changes and never has any application meaning (other than being your unique ID).

Scott Ambler has a pretty good article here which has a "glossary" of all the various keys and what they mean - you'll find natural, surrogate, primary key and a few more.

marc_s
So long story short, surrogate key can be the same as a primary key?
Nate Bross
Yes, it can. Subtle difference.
Oded
SURE ! Surrogate is really just whether you use natural data (SSN) or a surrogate (artificial) data for your key. Surrogate can be primary, sure, no problem there. Primary doesn't have to be natural data (I usually argue against natural primary keys, except in very few cases)
marc_s
I agree with marc_s +1.
JonH
OK, thats what I thought, but I'd never heard it called a surrogate key. Thanks!
Nate Bross
@Nate: The difference is that any key--primary or not--can either be surrogate (there are also many other names for this) or natural. It's not that the "surrogate key can be the same as the primary key", it's more like "the primary key can be a surrogate key or a natural key".
Adam Robinson
Absolutely not... Without a natural Key you risk corrupting the consistency of your data... See my answer below... If you care about data consistency having only a surrrogate key is a bad bad idea...
Charles Bretana
And Primary Key is kinda misleading.... All Keys are just that - Keys. What you use them for is more impoortant than whether they are the primary key or an alternate key. Any key ensures the uniqueness of the rows by the attribytes in the key. And any key can be used as the target of a FK or in a join...
Charles Bretana
"The primary key... has to be stable... REALLY 150% sure it's NEVER going to change??" -- stable <> immutable.
onedaywhen
No key has to be 100% sure it's never going to change.. Sure, the less likely it is to change the better, but unless you are using it as a foreign key in dependant tables (cause you wisely use an alternate surrogate key for that) then the impact of a change is .... trivial. -- You just ... change it..
Charles Bretana
+2  A: 

Wow, you opened a can of worms with this question. Database purists will tell you never to use surrogate keys (like you have above). On the other hand, surrogate keys can have some tremendous benefits. I use them all the time.

In SQL Server, a surrogate key is typically an auto-increment Identity value that SQL Server generates for you. It has NO relationship to the actual data stored in the table. The opposite of this is a Natural key. An example might be Social Security number. This does have a relationship to the data stored in the table. There are benefits to natural keys, but, IMO, the benefits to using surrogate keys outweigh natural keys.

I noticed in your example, you have a GUID for a primary key. You generally want to stay away from GUIDS as primary keys. The are big, bulky and can often be inserted into your database in a random way, causing major fragmentation.

Randy

Randy Minder
GUIds as primary key are okay in SQL Server - as long as you don't use them as CLUSTERING KEY - those don't have to be the same (are by default, but you can split that up)
marc_s
The index fragmentation and all those negative properties come from using a GUID as CLUSTERING KEY - that's different from the PRIMARY KEY
marc_s
Database purists have apparently never used real world data. I have used literally hundreds of databases covering hundreds of different types of information and never yet seen a good natural key. Every database I've seen that tries to use one has trouble eventually with the data not being unique or changing and the change needing to filter through hundreds or thousands or even millions of related records.
HLGEM
"...never yet seen a good natural key" Are you kidding me? You've never seen a many-to-many table? Its key is ( table_1_id, table_2_id ). The whole point of natural keys is that surrogate keys often *allow* duplication of data -- case in point, the email address example I posted below!
@user151841: A many-to-many table that uses two surrogate IDs as its primary key is not a "natural key" since it does not depend on a real-world model.
FreshCode
A: 
  • The primary key is whatever you make it. Whatever you define as the primary key is the primary key. Usually its an integer ID field.
  • The surrogate key is also this ID field. Its a surrogate for the natural key, which defines uniqueness in terms of your application data.

The idea behind having an integer ID as the primary key (even it doesnt really mean anything) is for indexing purposes. You would then probably define a natural key as a unique constraint on your table. This way you get the best of both worlds. Fast indexing with your ID field and each row still maintains its natural uniqueness.

That said, some people swear by just using a natural key.

Alex
A: 

There are actually three kinds of keys to talk about. The primary key is what is used to uniquely identify every row in a table. The surrogate key is an artificial key that is created with that property. A natural key is a primary key which is derived from the actual real life data.

In some cases the natural key may be unwieldy so a surrogate key may be created to be used as a foreign key, etc. For example, in a log or diary the PK might be the date, time, and the full text of the entry (if it is possible to add two entries at the exact same time). Obviously it would be a bad idea to use all of that every time that you wanted to identify a row, so you might make a "log id". It might be a sequential number (the most common) or it might be the date plus a sequential number (like 20091222001) or it might be something else. Some natural keys may work well as a primary key though, such as vehicle VIN numbers, student ID numbers (if they are not reused), or in the case of joining tables the PKs of the two tables being joined.

This is just an overview of table key selection. There's a lot to consider, although in most shops you'll find that they go with, "add an identity column to every table and that's our primary key". You then get all of the problems that go with that.

In your case I think that a LogEntryID for your log items seems reasonable. Is the ID an FK to the Users table? If not then I might question having both the ID and the LogEntryID in the same table as they are redundant. If it is, then I would change the name to user_id or something similar.

Tom H.
There are not 3 kinds of Keys. Primary Key vs Alternate Key is just a label to attach to one of the keys. All Keys (Prim/Alternate) are the same. What matters is what it is used for. ANY key (Primary or ALternate), can be used as target of FK constraint or in a Join. Only a natural Key, (whether it is Primary or ALternate) can be used to help ensure data integrity/consistency.
Charles Bretana
Is a Primary Key a key? Is an Alternate Key a key? Is a Natural Key a key? Are they all exactly the same? It's not rocket science Charles... 1 + 1 + 1 = 3
Tom H.
@Tom H.: I think what @Charles Bretana is trying to say is that a key is just a key and that 'primary key' is a deprecated concept in relational theory. Here's some light reading: http://consultingblogs.emc.com/davidportas/archive/2006/09/14/Down-with-Primary-Keys_3F00_.aspx
onedaywhen
I understand, but when the OP specifically asked about primary keys I think that it's appropriate to talk about that term, whether or not Charles thinks it's a relevant term today. It's certainly not worth a down-vote.
Tom H.
+2  A: 

First, a Surrogate key is a key that is artificially generated within the database, as a unique value for each row in a table, and which has no dependency whatsoever on any other attribute in the table.

Now, the phrase Primary Key is a red herring. Whether a key is primary or an alternate doesn't mean anything. What matters is what the key is used for. Keys can serve two functions which are fundementally inconsistent with one another.

  1. They are first and foremost there to ensure the integrity and consistency of your data! Each row in a table represents an instance of whatever entity that table is defined to hold data for. No *Surrogate* Key, by *definition*, can *ever* perform this function. Only a properly designed natural Key can do this. (If all you have is a surrogate key, you can always add another row with every other attributes exactly identical to an existing row, as long as you give it a different surrogate key value)
  2. Secondly they are there to serve as references (pointers) for the foreign Keys in other tables which are children entities of an entity in the table with the Primary Key. A Natural Key, (especially if it is a composite of multiple attributes) is not a good choice for this function because it would mean tha that A) the foreign keys in all the child tables would also have to be composite keys, making them very wide, and thereby decreasing performance of all constraint operations and of SQL Joins. and B) If the value of the key changed in the main table, you would be required to do cascading updates on every table where the value was represented as a FK.

So the answer is simple... Always (wherever you care about data integrity/consistency) use a natural key and, where necessaruy, use both! When the naatural key is composite, or long, or not stable enough, add an alternate Surrogate key (as auto-incrementing integer for example) for use as targets of FKs in child tables. But at the risk of losing data consistency of your table, DO NOT remove the natural key from the main table.

To make this crystal clear let's make an example. Say you have a table with Bank accounts in it... A natural Key might be the Bank Routing Number and the Account Number at the bank. To avoid using this twin composite key in every transaction record in the transactions table you could decide to put an artificially generated surrogate key on the BankAccount table which is just an integer. But you better keep the natural Key!, If you didn't, if you did not also have the composite natural key, you could quite easily end up with two rows in the table as follows

id  BankRoutingNumber BankAccountNumber   BankBalance
 1     12345678932154   9876543210123       $123.12
 2     12345678932154   9876543210123    ($3,291.62)

Now, which one is right?

To marc from comments below, What good does it do you to be able to "identify the *row*"?? No good at all, it seems to me, because what we need to be able to identify is which *bank account* the row represents! Identifying the row is only important for internal database technical functions, like joins in queries, or for FK constraint operations, which, if/when they are necessary, should be using a surrogate key anyway, not the natural key.

You are right in that a poor choice of a natural key, or sometimes even the best available choice of a natural key, may not be truly unique, or guaranteed to prevent duplicates. But any choice is better than no choice, as at least it will prevent duplicate rows for the same values in the attributes chosen as the key. These issues can be kept to a minimum by the appropriate choice of key attributes, but sometimees they are unavoidable and must be dealth with. But it is still better to do so than tpo allow incorrect inaccurate or redundant data into the database.

As to "ease of use" If all you are using the natural key for is to contrain the insertion of duplicate rows, and you are using another, surrogate, key as the target for FK constraints, I do not see any ease of use issues of concern.

Charles Bretana
I disagree - very strongly. A surrogate key e.g. an INT IDENTITY is **PERFECT** to uniquely and clearly identify a row. Any natural key more often than not might *look* like a good candidate, but in the end won't be neither truly always unique, nor easy to use.
marc_s
Also, yes - the primary key is just one of the possible candidate keys - a set of columns that will uniquely identify a row. Which one of all the possible candidate keys you end up picking for your primary key is somewhat arbitrary - but it makes good sense to pick **one** and stick with that one.
marc_s
@marc, keys are NOT there to "uniquely identify a row". They are there to guarantee data integrity and prevent the insertion of multiple rows that represent the SAME entity. What good does it do to be able to uniquely Identify a row if you have no idea what actual entity the row represents. or if the other data in the row is accurate or not ?? and if there are multiple rows for the same entity, with different data attributes, you have no idea which one has the right values in it.
Charles Bretana
@marc_s, No, a "primary" key is just another key.... what matters moer is whether it can accurately identify the real world entity the row represents (is it a surrogate or a natural?) and what are you using the key for... It matters nopt at all whether you label it as a "Primary" key, or just another unique Key...
Charles Bretana
+2  A: 

The reason that database purists get all up in arms about surrogate keys is because, if used improperly, they can allow data duplication, which is one of the evils that good database design is meant to banish.

For instance, suppose that I had a table of email addresses for a mailing list. I would want them to be unique, right? There's no point in having 2, 3, or n entries of the same email address. If I use email_address as my primary key ( which is a natural key -- it exists as data independently of the database structure you've created ), this will guarantee that I will never have a duplicate email address in my mailing list.

However, if I have a field called id as a surrogate key, then I can have any number of duplicate email addresses. This becomes bad if there are then 10 rows of the same email address, all with conflicting subscription information in other columns. Which one is correct, if any? There's no way to tell! After that point, your data integrity is borked. There's no way to fix the data but to go through the records one by one, asking people what subscription information is really correct, etc.

The reason why non-purists want it is because it makes it easy to use standardized code, because you can rely on refering to a single database row with an integer value. If you had a natural key of, say, the set ( client_id, email, category_id ), the programmer is going to hate coding around this instance! It kind of breaks the encapsulation of class-based coding, because it requires the programmer to have deep knowledge of table structure, and a delete method may have different code for each table. Yuck!

So obviously this example is over-simplified, but it illustrates the point.

yeah, but in this case, just put a unique constraint on the e-mail field - problem solved. PLUS: I **really** don't want a potentially 150-character field as my primary key!! Imagine having to reference that from a child table.......
marc_s
the second idea of having composite primary keys is even worse - if you need to reference that table from 3 or 4 child tables - imagine those queries and those messy statements! Avoid that like the plague - just use a single, surrogate ID and all your problems are solved.
marc_s
How do you do a primary key in a join table for a many-to-many relation?
either no PK at all, or then an INT IDENTITY - plain and simple
marc_s
Also, what if your table uses e-mail as PK (so no duplicates there, granted), but also has another field which has to be unique? YOu can't have two PK, nor does picking the e-mail as a natural PK make it any easier to avoid duplicates.
marc_s
I was just trying to illustrate sides of the argument. Do *you* ever use composite keys? Where? (If not, why do they exist in RDBMSes?) Do you always have a surrogate key? FWIW, I'm not against UNIQUE keys, so I don't see any problem with another column being unique.
"The reason that database purists get all up in arms about surrogate keys is..." -- that's not the whole story. You could use `UNIQUE (email_address)` and I'd still be "up in arms" if you used the surrogate for FKs because, now that the real key values no longer appear in the referencing tables, I have to use more joins... and wasn't using a surrogate meant to makes things more efficient?
onedaywhen
A well designed Delete method will take a dictionary of colName-colValue pairs as a primary key, and therefore even with natural keys, the code is the same for all tables.
ProfK
@marc_s, regards other unique fields: There is a reason one unique key is called primary, and others are just called unique. This isn't a symptom of a problem, but of the fact that it happens all the time, naturally,
ProfK
Right, the 'problem' with the "well-designed Delete method" is that the client programmer who calls the method has to construct the dictionary of the key. This required them to know the table structure, instead of just doing delete() or delete(id). It breaks encapsulation, which is why OO enthusiasts don't like it.
A: 

Users Table

Using a Guid as a primary key for your Users table is perfect.

LogEntry table

Unless you plan to expose your LogEntry data to an external system or merge it with another database, I would simply use an incrementing int rather than a Guid as the primary key. It's easier to work with and will use slightly less space, which could be significant in a huge log stretching several years.

FreshCode