views:

836

answers:

2

There's a healthy debate out there between surrogate and natural keys:

SO Post 1

SO Post 2

My opinion, which seems to be in line with the majority (it's a slim majority), is that you should use surrogate keys unless a natural key is completely obvious and guaranteed not to change. Then you should enforce uniqueness on the natural key. Which means surrogate keys almost all of the time.

Example of the two approaches, starting with a Company table:

1: Surrogate key: Table has an ID field which is the PK (and an identity). Company names are required to be unique by state, so there's a unique constraint there.

2: Natural key: Table uses CompanyName and State as the PK -- satisfies both the PK and uniqueness.

Let's say that the Company PK is used in 10 other tables. My hypothesis, with no numbers to back it up, is that the surrogate key approach would be much faster here.

The only convincing argument I've seen for natural key is for a many to many table that uses the two foreign keys as a natural key. I think in that case it makes sense. But you can get into trouble if you need to refactor; that's out of scope of this post I think.

Has anyone seen an article that compares performance differences on a set of tables that use surrogate keys vs. the same set of tables using natural keys? Looking around on SO and Google hasn't yielded anything worthwhile, just a lot of theorycrafting.


Important Update: I've started building a set of test tables that answer this question. It looks like this:

  • PartNatural - parts table that uses the unique PartNumber as a PK
  • PartSurrogate - parts table that uses an ID (int, identity) as PK and has a unique index on the PartNumber
  • Plant - ID (int, identity) as PK
  • Engineer - ID (int, identity) as PK

Every part is joined to a plant and every instance of a part at a plant is joined to an engineer. If anyone has an issue with this testbed, now's the time.

+1  A: 

Natural keys differ from surrogate keys in value, not type.

Any type can be used for a surrogate key, like a VARCHAR for the system-generated slug or something else.

However, most used types for surrogate keys are INTEGER and RAW(16) (or whatever type your RDBMS does use for GUID's),

Comparing surrogate integers and natural integers (like SSN) takes exactly same time.

Comparing VARCHARs make take collation into account and they are generally longer than integers, that making them less efficient.

Comparing a set of two INTEGER is probably also less efficient than comparing a single INTEGER.

On datatypes small in size this difference is probably percents of percents of the time required to fetch pages, traverse indexes, acquite database latches etc.

And here are the numbers (in MySQL):

CREATE TABLE aint (id INT NOT NULL PRIMARY KEY, value VARCHAR(100));
CREATE TABLE adouble (id1 INT NOT NULL, id2 INT NOT NULL, value VARCHAR(100), PRIMARY KEY (id1, id2));
CREATE TABLE bint (id INT NOT NULL PRIMARY KEY, aid INT NOT NULL);
CREATE TABLE bdouble (id INT NOT NULL PRIMARY KEY, aid1 INT NOT NULL, aid2 INT NOT NULL);

INSERT
INTO    aint
SELECT  id, RPAD('', FLOOR(RAND(20090804) * 100), '*')
FROM    t_source;

INSERT
INTO    bint
SELECT  id, id
FROM    aint;

INSERT
INTO    adouble
SELECT  id, id, value
FROM    aint;

INSERT
INTO    bdouble
SELECT  id, id, id
FROM    aint;

SELECT  SUM(LENGTH(value))
FROM    bint b
JOIN    aint a
ON      a.id = b.aid;

SELECT  SUM(LENGTH(value))
FROM    bdouble b
JOIN    adouble a
ON      (a.id1, a.id2) = (b.aid1, b.aid2);

t_source is just a dummy table with 1,000,000 rows.

aint and adouble, bint and bdouble contain exactly same data, except that aint has an integer as a PRIMARY KEY, while adouble has a pair of two identical integers.

On my machine, both queries run for 14.5 seconds, +/- 0.1 second

Performance difference, if any, is within the fluctuations range.

Quassnoi
The example that I see most often of a surrogate vs natural key is States (in the US). Surrogate key on that table would be an int, but the natural key would be a char(2).
jcollum
(in case you're wondering, I didn't downvote you)
jcollum
I wonder if SSN will change from an INT to a CHAR when the US goes past 1B people. It's not too far off really.
jcollum
oops, apparently the US's growth rate isn't even close to that of the world, we've got a way (in years) to go to hit 1B people.
jcollum
i'm still trying to figure what "percents of percents" might mean...
Javier
@Javier: that means "0.0001 or something aroung that" :)
Quassnoi
This post may not fully answer the question (no quantification) but it certainly doesn't deserve negative numbers. Upvoted.
BobMcGee
I certainly disagree that natural keys only differ in value, not type. A natural key could be a company's name by state, couldn't it? It would certainly be unique (assuming that company names are unique by state, which seems reasonable). So why not use that as a natural key? Or are you in favor of using natural keys only in cases where the natural key is some arbitrary level of un-complexity? Like a part number that's alphanumeric but only A-Z, 0-9? Surrogate keys are typically much less complex, usually an int, guid or bigint.
jcollum
@jcollum: Could you please give me the definition of the term `surrogate key` as you understand it for the purposes of this discussion. You may want to read this Wikipedia article: http://en.wikipedia.org/wiki/Surrogate_key This question is crucial for the further discussion so I'd really ask you to answer it before this discussion can continue. Thank you.
Quassnoi
@Quassnoi: I think my use of the term matches this: "the surrogate key can be the primary key, generated by the database management system and not derived from any application data in the database" (from the link you gave, which I'd already read)
jcollum
@jcollum: ok. So we have two tables that hold companies: first one uses a `VARCHAR(200)` column filled with `NEWID()` as a `PRIMARY KEY`. `NEWID` is certainly "generated by the `RDBMS`", and certainly "is not derived from any application data in the database". The second table uses a `VARCHAR(200)` column filled with company name as a `PRIMARY KEY`. Now we see two identical tables whose keys differ only in type, not value, but the first table uses surrogate key, while the second one uses natural key. Now, what is that you disagree with?
Quassnoi
Read above as "whose keys differ only in value, not type".
Quassnoi
Typically, in the uses of surrogate keys that I've seen, the surrogate key is an int, bigint or a guid. Your example uses a natural key (VARCHAR(200)) and a surrogate key (VARCHAR(200)). That seems artificial. Who would choose a varchar field for a surrogate key? Why 200 characters? The number of possible rows in that table would probably take up more storage than you can find on Earth (9 E 282 rows by my math).
jcollum
(yipes, I just assumed 26 ^ 200, the number would be a lot larger than that!!)
jcollum
@jcollum: when I asked you of the real world problems yesterday you answered me that it's not relevant since we are speaking about concepts, not real world problems. Now, when I give you a concept, you are starting to speak about real-world usage and change the subject of discussion. You didn't answer my question. Aren't the keys in the example above surrogate and natural, respectively? Don't they differ only by value, not by type? If they are and they do, what was that you disagree with? Please answer. Thank you.
Quassnoi
@Quassnoi: I was referring to not getting into specifics about business rules: how many Account Execs can be attached to a Company for instance. I think your example addresses natural keys in many-to-many tables. Which is fine, but not something I'm concerned about. Composite (i.e. int, int) keys in many-to-many tables seems like a reasonable thing to do. I'm more concerned about natural vs. surrogate keys in primary entity tables (Part, Company, Employee etc.).
jcollum
`@jcollum`: you are conserned about performance of `INT32` comparison or what? This will be the same. A database will compare `32-bit` integers filled with part numbers with exactly same efficiency as ones filled with identity. Performance of `CMP EAX` does not depend on the source of the number being compared.
Quassnoi
+1  A: 

Use both! Natural Keys prevent database corruption. When the "right" natural key, (to eliminate duplicate rows) would perform badly because of length, or number of columns involved, for performance purposes, a surrogate key can be added as well to be used as foreign keys in other tables instead of the natural key... But the natural key should remain as an alternate key or unique index to prevent data corruption and enforece database consistency...

Charles Bretana
I think you are misusing the term "corruption" here (which implies random values being scribbled over the database data) - you really just mean "consistency".
anon
Perhaps, w/o picking at nits, I guess I (mis?)use the word corruption in a broader sense than I should... If there are two rows in a Transaction table that actually represent the same transaction, is that corrupt data or inconsistent data? And I have thought inconsistency specifically meant when some invariant was not satisifed, like, for e.g., if the totalAmount column in an Invoice row was not equal to the sum of the Invoice Line item row amounts.
Charles Bretana