ansaurus

Question

surrogate vs natural key: hard numbers on performance differences?

Answer 1

+1 A:

Natural keys differ from surrogate keys in value, not type.

Any type can be used for a surrogate key, like a VARCHAR for the system-generated slug or something else.

However, most used types for surrogate keys are INTEGER and RAW(16) (or whatever type your RDBMS does use for GUID's),

Comparing surrogate integers and natural integers (like `SSN`) takes exactly same time.

Comparing VARCHARs make take collation into account and they are generally longer than integers, that making them less efficient.

Comparing a set of two INTEGER is probably also less efficient than comparing a single INTEGER.

On datatypes small in size this difference is probably percents of percents of the time required to fetch pages, traverse indexes, acquite database latches etc.

And here are the numbers (in MySQL):

CREATE TABLE aint (id INT NOT NULL PRIMARY KEY, value VARCHAR(100));
CREATE TABLE adouble (id1 INT NOT NULL, id2 INT NOT NULL, value VARCHAR(100), PRIMARY KEY (id1, id2));
CREATE TABLE bint (id INT NOT NULL PRIMARY KEY, aid INT NOT NULL);
CREATE TABLE bdouble (id INT NOT NULL PRIMARY KEY, aid1 INT NOT NULL, aid2 INT NOT NULL);

INSERT
INTO    aint
SELECT  id, RPAD('', FLOOR(RAND(20090804) * 100), '*')
FROM    t_source;

INSERT
INTO    bint
SELECT  id, id
FROM    aint;

INSERT
INTO    adouble
SELECT  id, id, value
FROM    aint;

INSERT
INTO    bdouble
SELECT  id, id, id
FROM    aint;

SELECT  SUM(LENGTH(value))
FROM    bint b
JOIN    aint a
ON      a.id = b.aid;

SELECT  SUM(LENGTH(value))
FROM    bdouble b
JOIN    adouble a
ON      (a.id1, a.id2) = (b.aid1, b.aid2);

t_source is just a dummy table with 1,000,000 rows.

aint and adouble, bint and bdouble contain exactly same data, except that aint has an integer as a PRIMARY KEY, while adouble has a pair of two identical integers.

On my machine, both queries run for 14.5 seconds, +/- 0.1 second

Performance difference, if any, is within the fluctuations range.

Quassnoi 2009-08-04 18:44:18

The example that I see most often of a surrogate vs natural key is States (in the US). Surrogate key on that table would be an int, but the natural key would be a char(2).

jcollum 2009-08-04 18:48:13

(in case you're wondering, I didn't downvote you)

jcollum 2009-08-04 18:50:19

I wonder if SSN will change from an INT to a CHAR when the US goes past 1B people. It's not too far off really.

jcollum 2009-08-04 18:53:25

oops, apparently the US's growth rate isn't even close to that of the world, we've got a way (in years) to go to hit 1B people.

jcollum 2009-08-04 18:59:58

i'm still trying to figure what "percents of percents" might mean...

Javier 2009-08-04 19:03:26

@Javier: that means "0.0001 or something aroung that" :)

Quassnoi 2009-08-04 19:11:31

This post may not fully answer the question (no quantification) but it certainly doesn't deserve negative numbers. Upvoted.

BobMcGee 2009-08-04 19:32:17

I certainly disagree that natural keys only differ in value, not type. A natural key could be a company's name by state, couldn't it? It would certainly be unique (assuming that company names are unique by state, which seems reasonable). So why not use that as a natural key? Or are you in favor of using natural keys only in cases where the natural key is some arbitrary level of un-complexity? Like a part number that's alphanumeric but only A-Z, 0-9? Surrogate keys are typically much less complex, usually an int, guid or bigint.

jcollum 2009-08-04 20:24:51

@jcollum: Could you please give me the definition of the term `surrogate key` as you understand it for the purposes of this discussion. You may want to read this Wikipedia article: http://en.wikipedia.org/wiki/Surrogate_key This question is crucial for the further discussion so I'd really ask you to answer it before this discussion can continue. Thank you.

Quassnoi 2009-08-04 21:11:05

@Quassnoi: I think my use of the term matches this: "the surrogate key can be the primary key, generated by the database management system and not derived from any application data in the database" (from the link you gave, which I'd already read)

jcollum 2009-08-04 21:45:43

@jcollum: ok. So we have two tables that hold companies: first one uses a `VARCHAR(200)` column filled with `NEWID()` as a `PRIMARY KEY`. `NEWID` is certainly "generated by the `RDBMS`", and certainly "is not derived from any application data in the database". The second table uses a `VARCHAR(200)` column filled with company name as a `PRIMARY KEY`. Now we see two identical tables whose keys differ only in type, not value, but the first table uses surrogate key, while the second one uses natural key. Now, what is that you disagree with?

Quassnoi 2009-08-04 22:13:58

Read above as "whose keys differ only in value, not type".

Quassnoi 2009-08-04 22:15:23

Typically, in the uses of surrogate keys that I've seen, the surrogate key is an int, bigint or a guid. Your example uses a natural key (VARCHAR(200)) and a surrogate key (VARCHAR(200)). That seems artificial. Who would choose a varchar field for a surrogate key? Why 200 characters? The number of possible rows in that table would probably take up more storage than you can find on Earth (9 E 282 rows by my math).

jcollum 2009-08-04 22:32:26

(yipes, I just assumed 26 ^ 200, the number would be a lot larger than that!!)

jcollum 2009-08-04 22:41:56

@jcollum: when I asked you of the real world problems yesterday you answered me that it's not relevant since we are speaking about concepts, not real world problems. Now, when I give you a concept, you are starting to speak about real-world usage and change the subject of discussion. You didn't answer my question. Aren't the keys in the example above surrogate and natural, respectively? Don't they differ only by value, not by type? If they are and they do, what was that you disagree with? Please answer. Thank you.

Quassnoi 2009-08-04 23:14:43

@Quassnoi: I was referring to not getting into specifics about business rules: how many Account Execs can be attached to a Company for instance. I think your example addresses natural keys in many-to-many tables. Which is fine, but not something I'm concerned about. Composite (i.e. int, int) keys in many-to-many tables seems like a reasonable thing to do. I'm more concerned about natural vs. surrogate keys in primary entity tables (Part, Company, Employee etc.).

jcollum 2009-08-10 16:42:42

`@jcollum`: you are conserned about performance of `INT32` comparison or what? This will be the same. A database will compare `32-bit` integers filled with part numbers with exactly same efficiency as ones filled with identity. Performance of `CMP EAX` does not depend on the source of the number being compared.

Quassnoi 2009-08-10 17:05:14

Answer 2

+1 A:

Use both! Natural Keys prevent database corruption. When the "right" natural key, (to eliminate duplicate rows) would perform badly because of length, or number of columns involved, for performance purposes, a surrogate key can be added as well to be used as foreign keys in other tables instead of the natural key... But the natural key should remain as an alternate key or unique index to prevent data corruption and enforece database consistency...

Charles Bretana 2009-08-04 19:18:46

I think you are misusing the term "corruption" here (which implies random values being scribbled over the database data) - you really just mean "consistency".

anon 2009-08-04 19:56:47

Perhaps, w/o picking at nits, I guess I (mis?)use the word corruption in a broader sense than I should... If there are two rows in a Transaction table that actually represent the same transaction, is that corrupt data or inconsistent data? And I have thought inconsistency specifically meant when some invariant was not satisifed, like, for e.g., if the totalAmount column in an Invoice row was not equal to the sum of the Invoice Line item row amounts.

Charles Bretana 2009-08-04 21:22:53

ansaurus

tags:

views:

answers:

surrogate vs natural key: hard numbers on performance differences?

Natural keys differ from surrogate keys in value, not type.

Comparing surrogate integers and natural integers (like `SSN`) takes exactly same time.

On my machine, both queries run for 14.5 seconds, +/- 0.1 second

related questions

ansaurus

tags:

views:

answers:

surrogate vs natural key: hard numbers on performance differences?

Natural keys differ from surrogate keys in value, not type.

Comparing surrogate integers and natural integers (like SSN) takes exactly same time.

On my machine, both queries run for 14.5 seconds, +/- 0.1 second

related questions

Comparing surrogate integers and natural integers (like `SSN`) takes exactly same time.