views:

499

answers:

6

In reading this and this and then reading this (which references the other two ironically) I find myself wondering just how big the discussion of this topic is? I am a SQL Server guy and so I tend to use an Identity that is auto generated in the form of an int. However, when I know that I will need some form of replication between server and server or synchronization between client and server I tend to use a GUID as my key.

Question: Should I just use a GUID as my primary key across all tables at all times just in case I might need this possible scalability down the road? Does this make my schema more flexible in that it can be migrated between platforms at all times? Does this help me maintain the flexibility of my ORM (regardless of flavor) by not embedding platform specific features?

Responses:

@David Archer: Per your comment I updated my post to not say "Natural Key". Your are correct in that natural key is defined as such. Thanks for the correction.

+1  A: 

I have been burned too many times by "natural keys" changing or being duplicated to ever consider using them. My decision whether to use sequences or GUIDs for keys is determined by whether I expect to ever read or speak one of them.

Dour High Arch
Sequences have the downside that they can be guessed. So so someone may edit the URL to see what record it returns.
jm
If someone can navigate to forbidden pages by editing URLs, your access security is broken. Using a GUID is not going to fix the problem.
Dour High Arch
+3  A: 

You should probably not use raw GUIDs as your primary key. Doing so leads to substantial fragmentation of your data. SQL Server has a function to give you a "sequential guid" to help mitigate this problem. There is a good discussion of this topic in depth here. Another excellent discussion of that is here ...

This shows the amount of fragmentation for random guids is very significant (It is recomended that "Fragmentation in Percent" should be as near to zero as possible). The number of pages used by random guid is 40% higher and the amount of space used on each page is less, therefore the disc space required will increase.

JP Alioto
In the answer above, the value of a Guid is not hitting the database, but if the function to generate a sequential guid is a t-sql function, you're still hitting the database. If you're recommending a sequential guid and it requires a db hit, why not just hit the DB and get a sequential number.
Stephanie Page
+1  A: 

I haven't had much experience with this but using a GUIDs to join makes me cringe. 4 bytes versus 36 seems icky.

However I have begun using GUIDs as public identifiers rather than the identity fields themselves. Take a look at the URL above, 1156712. If for some reason SO had to be merged with another similar application (say SU) these question ids would collide one or the other would have to change it's URL messing up any hard coded links and probably google stats as well. Whereas if the way every element was publicly identified was through the use of a GUID and internal joins used the int or bigint fields you could have the best of both worlds.

Merges are still possible too using this approach. If a conflict is found a new internal identifier can be generated on the fly without disrupting the rest of the application.

Spencer Ruport
GUIDs are 32 hex characters == 16 bytes, not 36. :)
David Archer
+1  A: 

I'd avoid GUIDS for Primary Keys unless you know you are really going to need it (i.e. for multi-system synchronization, etc).

In the land of SQL Server replication, a guid is added to rows in replicated tables to achieve uniqueness, so it's quite possible to establish this design later if you have the need.

As to fragmentation, also consider the cost to your disk space. If you are going to be under 10,000 rows (in a table) this is probably not a huge problem but if your system has to support above 10,000 rows (in a table) you'll find performance and disk storage cost (and index frangmentation) is better served by the use of Big Ints (large integers) + identity (autonumber) which scale well to volume.

I'd avoid natural keys altogether - even the risk of logic changing around them makes it too risky IMHO (e.g. if they suddenly become non-unique).

RobS
+3  A: 

I tend to prefer application-generated primary keys, typically using the lo/hi algorithm as implemented by NHibernate (when I'm using it on a project). Otherwise, sequential GUIDs work just as well. This isn't just my advice but rather of several folks who have been doing this whole development thing a lot longer than myself.

The problem I see with using DB generated primary keys is that you have to hit the database to get those identity values versus having everything set up before you persist it to the database. It typically breaks the Unit of Work pattern in NHibernate as well due to this fact. If you're not using the UoW pattern in your app, then obviously this drawback doesn't apply.

If you are using GUIDs for your PK, you definitely want to use sequential GUIDs to eliminate index fragmentation. This also gives you the "rough sort order" that another poster mentioned although I'd typically have a DateInserted column or similiar for those types of things.

Joining on a GUID column has been shown to have a fairly minimal performance overhead versus your 4-byte integer and I'd venture to say that for non-large datasets, the performance difference is trivial.

Natural keys are the spawn of the devil. :)

David Archer
+1 - spawn of the devil!!! Nice answer.
Andrew Siemer
+1  A: 

I support most of the other answerers in saying you should avoid GUIDs as your clustered key in SQL Server - if you really want to, you could use them as primary key, but don't cluster your table on it.

The primary key is the logical concept of a key to uniquely identify each row - here, a GUID can make sense since it's pretty much guaranteed to be unique.

But the clustered key is a physical concept which physically orders the rows in the table, and here due to their random nature, GUIDs are poorly suited. This will lead to massive index fragmentation and thus to poor performance, even if you keep reorganizing your index (and thus table data) over and over again.

Furthermore, since the clustered index key is being used as the lookup value to find the row in the table, it will be added to each and every entry of each and every non-clustered index on your table, too, and here the size of the GUID (16 bytes) vs. INT (4 bytes) comes into play - you potentially waste a lot of space just for keeping track of the lookup values.

The best discussion of primary / clustered indices and GUIDs I know of is the couple of article by Kim Tripp, the Queen of Indexing in SQL Server land - check them out!

Her ultimate requirements for a clustered index are: small, stable, unique, and hopefully ever-increasing. GUID's violate two of those (small and ever-increasing). Even the GUIDs generated by the NEWSEQUENTIALGUID() function in SQL Server aren't totally and truly sequential - so I wouldn't use those either.

Marc

marc_s