views:

314

answers:

9

Whenever I design a database I automatically start with an auto-generating GUID primary key for each of my tables (excepting look-up tables)

I know I'll never lose sleep over duplicate keys, merging tables, etc. To me it just makes sense philosophically that any given record should be unique across all domains, and that that uniqueness should be represented in a consistent way from table to table.

I realize it will never be the most performant option, but putting performance aside, I'd like to know if there are philosophical arguments against this practice?

Based on the responses let me clarify:

I'm talking about consistently using a GUID surrogate key as a primary key- irrespective of whether and how any natural or sequential keys are designed on a table. These are my assumptions:

  1. Data integrity based on natural keys can be designed for, but not assumed.
  2. A primary key's function is referential integrity, irrespective of performance, sequencing, or data.
+8  A: 

Jeff Atwood talks about this in great detail:
http://www.codinghorror.com/blog/2007/03/primary-keys-ids-versus-guids.html

Another resource:
http://databases.aspfaq.com/database/what-should-i-choose-for-my-primary-key.html

Guid Pros:
Unique across every table, every database, every server
Allows easy merging of records from different databases
Allows easy distribution of databases across multiple servers
You can generate IDs anywhere, instead of having to roundtrip to the database
Most replication scenarios require GUID columns anyway

Guid Cons:
It is a whopping 4 times larger than the traditional 4-byte index value; this can have serious performance and storage implications if you're not careful
Cumbersome to debug (where userid='{BAE7DF4-DDF-3RG-5TY3E3RF456AS10}')
The generated GUIDs should be partially sequential for best performance (eg, newsequentialid() on SQL 2005) and to enable use of clustered indexes

ewwwyn
If you don't do replication, then I've never seen a use of GUIDs to merge records because it's the contents that usually matter first and foremost. Additionally I rarely merge due to foreign keys, I'd love to be able to on a regular basis.I add them if my database is distributed or will be replicated, otherwise I rely on date timestamps.
databyte
anything not in your processors native int size will run massively slower for index operations. Also have to take into account that GUIDS generation typically has to lock something more global than a regular local sequence update.
Evan Carroll
@ewwwyn Good first link, and I'd add http://krow.livejournal.com/497839.htmlSeems like the cons all still in the performance camp, and that's becoming less relevant...
Yarin
+7  A: 

GUIDs may seem to be a natural choice for your primary key - and if you really must, you could probably argue to use it for the PRIMARY KEY of the table.

What I'd strongly recommend not to do is use the GUID column as the clustering key, which SQL Server does by default, unless you specifically tell it not to. The main reason for this is indeed performance, which will come and bite you down the road... (it will, trust me - just a matter of time) - plus also a waste of resources (disk space and RAM in your SQL Server machine) which is really not necessary.

You really need to keep two issues apart:

1) the primary key is a logical construct - one of the candidate keys that uniquely and reliably identifies every row in your table. This can be anything, really - an INT, a GUID, a string - pick what makes most sense for your scenario.

2) the clustering key (the column or columns that define the "clustered index" on the table) - this is a physical storage-related thing, and here, a small, stable, ever-increasing data type is your best pick - INT or BIGINT as your default option.

By default, the primary key on a SQL Server table is also used as the clustering key - but that doesn't need to be that way! I've personally seen massive performance gains when breaking up the previous GUID-based Primary / Clustered Key into two separate key - the primary (logical) key on the GUID, and the clustering (ordering) key on a separate INT IDENTITY(1,1) column.

As Kimberly Tripp - the Queen of Indexing - and others have stated a great many times - a GUID as the clustering key isn't optimal, since due to its randomness, it will lead to massive page and index fragmentation and to generally bad performance.

Yes, I know - there's newsequentialid() in SQL Server 2005 and up - but even that is not truly and fully sequential and thus also suffers from the same problems as the GUID - just a bit less prominently so.

Then there's another issue to consider: the clustering key on a table will be added to each and every entry on each and every non-clustered index on your table as well - thus you really want to make sure it's as small as possible. Typically, an INT with 2+ billion rows should be sufficient for the vast majority of tables - and compared to a GUID as the clustering key, you can save yourself hundreds of megabytes of storage on disk and in server memory.

Quick calculation - using INT vs. GUID as Primary and Clustering Key:

  • Base Table with 1'000'000 rows (3.8 MB vs. 15.26 MB)
  • 6 nonclustered indexes (22.89 MB vs. 91.55 MB)

TOTAL: 25 MB vs. 106 MB - and that's just on a single table!

Some more food for thought - excellent stuff by Kimberly Tripp - read it, read it again, digest it! It's the SQL Server indexing gospel, really.

Marc

marc_s
@marc_s - I've yet to see any such analysis on COMB guids (not newsequentialid) vs integers as the clustered key. Yes it eats up more space but frankly, space isn't really the primary constraint on most database systems.
Thomas
@THomas: it eats a lot more space, and **not** just on disk - but also in your server's main memory (RAM) - something a lot of folks don't take into account! RAM isn't quite as cheap as disk space, and more space = more I/O = less performance (generalized and simplified)
marc_s
@marc_s - Adding an int clustered key on all your tables (along with a GUID pk) would not solve the problem of merging databases. You would still deal with the horrid nightmare of syncing identity columns. In addition, you would be eating up 20 bytes of index space instead of 16 with just a guid PK since you'll need a unique constraint on the guid column. I'm not convinced that an additional int clustered key that you have to never use is better than just picking a surrogate pk strategy (comb guids or int pks).
Thomas
@marc_s - Again, 25MB vs 106MB on a 4-8 GB system is peanuts and that's assuming your table has a million rows.
Thomas
@Thomas: yes, but that's just one single table... imagine you have 50, 100 of those - and not just 1 million, but 100, 500 million rows. It's all a matter of scale - if you only deal with 50'000 rows - go nuts and do whatever you like - (almost) no problem. But if you need to scale, you better know what the heck you're doing....
marc_s
Also, 106MB vs. 25 MB also means you most like have 4 times as much I/O just on that one single table. Multiply that by 50, 100 tables in your system, and by 50, 200, 500, 5000 users on your system - it gets out of hand very quickly.
marc_s
@marc_s - I would definitely agree that the strategy you use for medium to large database (one expect to hit 100GB for medium or 1TB for large) is a different kettle of fish (even in your decision to use surrogates of any kind). In fact, I would also say that if the system is a single install design vs a product database (where you might get customer's merge databases) that too should effect your decision. I've seen guid pk systems grow well over 100 GB with no problems in performance. However, I'll grant you I've never seen anything in the terabyte range with guid pks.
Thomas
@marc_s - How do you save that 4x IO if you have both a int clustered key and a guid key and are always searching on the guid key? Using both an int and a guid key throughout the database seems like a crazy design. Am I misreading your suggestion?
Thomas
Index in memory? 106mb is a lot for one table.....
James Westgate
@James Westgate: of course, even index pages must be loaded into memory, and if you're wasting space on those pages, you're wasting space in your server memory....
marc_s
@marc_s - Yes I think this is an excellent point, thanks- although I would still put it in the performance, not philosophical, camp
Yarin
+3  A: 

Adding to ewwwn:

Pros

  • It makes it nearly impossible for developers to "accidentally" expose the surrogate key to users (unlike integers where it happens almost all the time).
  • Makes merging databases several orders of magnitude simpler than dealing with identity columns.

Cons

  • Fatter. The real problem with it being fatter is that it eats up more space per page and more space in your indexes making them slower. The additional storage space of Guids is frankly irrelevant in today's world.
  • You absolutely must be careful about how new values are created. Truly random values do not index well. You are compelled to use a COMB guid or some variant that adds a sequential element to the guid.
Thomas
"truly uber-unique values" : INT IDENTITY are probably the utmost unique you can get - and they handle REALLY well as clustering keys. It's not the uniqueness of GUIDs that cause the issues - it's the randomness.
marc_s
@marc_s - Bad choice of words on my part. By uber-unique I mean across time and space. "Random" is a more apt terms and I'll adjust.
Thomas
It seems to me that it makes it several orders of magnitude easier to create duplicate entities when merging databases -- not good. I like the point about making it harder to expose surrogates, though :)
onedaywhen
@onedaywhen - RE: creating duplicates, if by that you mean creating duplicates of the data itself, that is a different matter. No matter what surrogate key strategy you use, you *must* have a business key on other columns in the table for reasons that have nothing to do with database merges.
Thomas
Another point then: if you shun surrogate keys entirely and always use natural keys then it makes it impossible to expose surrogate keys to users :)
onedaywhen
"you must have a business key on other columns in the table for reasons that have nothing to do with database merges" -- sure but in my experience "surrogate key as primary key" advocates more often than not don't bother with candidate keys, business keys or otherwise, in ignorance or otherwise.
onedaywhen
@onedaywhen - RE: business key. No question the most common mistake with surrogate keys, regardless of type, is not having a unique constraint on some other combination of columns. Avoiding surrogate keys entirely is often impractical but forcing the use on all tables is also often not necessary but I see it all the time. It is as if the developer feels that adding a surrogate key somehow obviates the need for any thoughts about uniqueness.
Thomas
A: 

A potentially big reason, but one often not thought of, is if you might have to provide compatibility with an Oracle database in the future.

Since Oracle doesn't have a uniqueid column data type, it can lead to a bit of a nightmare when you have two different data types for the same primary key across two different databases, especially when an ORM is involved.

Coxy
+2  A: 

You still implement the natural key of each table as well don't you? - GUID keys alone obviously won't prevent duplicate data, redundancy and consequent loss of data integrity.

Assuming that you do enforce other keys then adding GUIDs to every table without exception is probably just adding unnecessary complexity and overhead. It doesn't really make it easier to merge data in different tables because you still have to modify/de-duplicate the other key(s) of the table anyway. I suggest you should evaluate the use of a GUID surrogate on a case-by-case basis. Having a blanket rule for every table isn't necessary or helpful because every table models a different thing after all.

dportas
Agreed that as a DB designer you always need to assess and enforce data integrity on a case by case basis, and that one should be wary of blanket rules. But in my view there's value in the GUID surrogate providing a common interface approach to recognizing record uniqueness. We can try to enforce data integrity, but shouldn't assume it, and the GUID key can at least provide fool-proof record uniqueness every time, even if the natural key rules have been violated.
Yarin
All keys are "fool-proof" if they are enforced with a uniqueness constraint. I can't agree that we should not assume data integrity! The first responsibility of the database designer is to create the correct data model - one that ensures the relevant facts about the business are recorded in a way that avoids inaccurate results. Unless and until you implement the natural keys you have not achieved that. A guid will not help you.
dportas
@David- I'm generally wary of assuming anything, let alone real-world data integrity- It's impossible to predict how a database will be used/abused by software devs and future designers/admins once the original author cashes in his 401k. Natural keys cannot be fool-proof, hence the appeal of a surrogate key that is arbitrary i.e. not reliant on the data- So guess I'm talking about guaranteeing referential integrity, instead of assuming data integrity
Yarin
"It's impossible to predict how a database will be used/abused by software devs and future designers/admins".Including removing the GUID column.Referential integrity is largely pointless, if there's no integrity to the data. The point of creating unique key constraints/indexes on natural keys is that the DBMS will enforce that uniqueness for the records identified by that key - you're not **assuming** real world data integrity, you're **enforcing** it within the database.
Mark Bannister
@Yarin, I agree with Mark. You aren't really enforcing referential integrity if you aren't also enforcing the natural keys. By allowing duplicate data you are creating the potential for deletion anomalies which means the referencing row could actually point to the wrong version of the data while the row that should have been referenced gets deleted. Enforcing the keys of interest to the business users ought to come before RI and is generally more important.
dportas
@Mark, David, OK I hear you on data integrity. But regardless of how data integrity is enforced, are we doing any damage by adding the surrogate key?
Yarin
@Yarin, if you do it every time you are possibly adding complexity and some performance overhead where it isn't really needed. For instance you are probably creating more joins in queries than would otherwise be the case (because you have to join to retrieve the meaningful attribute from the other table). Also if surrogates become an excuse not normalize your database then you are potentially creating redundancy, which is obviously bad for all the usual reasons.
dportas
@Yarin: What David said. An additional thought occurs to me, though: if you're implementing an Object/Relational Mapping (ie. using an RDBMS as a bucket to store object-oriented data), the usual relational data integrity rules just don't apply.
Mark Bannister
@Mark, David Good points- thanks for the give-and-take
Yarin
A: 

Also read this

Make sure to read all the comments

It seems GUID is not unique

http://blog.sqlauthority.com/2010/04/28/sql-server-guid-vs-int-your-opinion/

Madhivanan
A: 

I can see the case for a given application's or enterprise's own identifiers to be unique and be represented in a consistent way across all its own domains (i.e. because they may span more than one database) but a GUID is overkill for these purposes. I guess they are popular because they are available out of the box and designing and implementing an 'enterprise key' takes time and effort. The rule when designing an artifical identifier is to make it as simple as possible but no simpler. IDENTITY is too simple, a GUID isn't simple enough.

Entities that exist outside of the application/enterprise usually have their own identifiers (e.g. a car has a VIN, a book has an ISBN, etc) maintained by an external trusted source and in such cases the GUID adds nothing. So I guess the philosphical argument against I'm getting at here is that using a artifical identifier on every table is unnecessary.

onedaywhen
I would caution against relying on external identifiers. The first time you have to enter that recovered stolen vehicle with the VIN's filed of, or that self published "book" in your collection that does not have an ISBN, you will be heading down the slippery slope of "making up your own VIN, ISBN, ..." entries.
Peter Stuer
If a self published "book" has a different identifier as a published book book then they wouldn't exist in the same base table (perhaps combined in a `VIEW`).
onedaywhen
@Peter Stuer: are there cases for artificial identifiers? Yes. Does every table need an artificial identifier? No.
onedaywhen
I'm down voting this (Nvmd I can't, not enough rep) because you're making assumptions about domain integrity that don't apply in the real world. Assuming that a record will only be referenced within its current domain is not practical.
Yarin
@Yarin: I think you miss my point: if an entity is to be identified outside the application/entity then EITHER it will already have an universal identifier of the UPC/EAN/ISBN/VIN variety maintained by a trusted source and therefore a GUID is redundant OR your application is to be the trusted source and therefore a GUID is not the optimum format for a key.
onedaywhen
@Yarin: "Assuming that a record will only be referenced within its current domain" -- in DBMS terms, 'record' and 'reference' have very specific meanings which I assume (hope!) you didn't intend to imply.
onedaywhen
@onedaywhen: I get what you're saying now- good point, chewing on it
Yarin
@Yarin: If one were to create 64-bit random mini-GUID values by hashing GUIDs, how likely is it that there would ever be a conflict in a domain which is expected to have no more than e.g. a few hundred thousand records? Even if the domain grows a thousand-fold beyond what's expected a duplicate would be unlikely (if the domain grows too big, the Birthday Principle says duplication would become likely, but a lot of domains would never grow anywhere near large enough for that to be a danger).
supercat
+1  A: 

Simple answer: it's not relational.

The record (as defined by the GUID) may be unique, but none of the associated attributes can be said to be occuring uniquely with that record.

Using a GUID (or any purely surrogate key) is no more relational than declaring a flat file to be relational, on the basis that each record can be identified by its row number.

Mark Bannister
+1  A: 

I wonder why there's no standard "miniGUID" type? It would seem that performing a decent hash on a GUID should yield a 64-bit number which would have a trivial probability of duplication in any universe which doesn't have a billion or more things in it. Since the universe in which most GUID/miniGUID identifiers are used will never grow beyond a million things, much less a billion, I would think a smaller 8-byte miniGuid would be very useful.

That would not, of course, suggest that it should be used as a clustered index; that would greatly impede performance. Nonetheless, an 8-byte miniGUID would only waste a third the space of a full GUID (when compared to a 4-byte index).

supercat
Interesting point
Yarin