views:

5051

answers:

10

I have several tables whose only unique data is a uniqueidentifier (a Guid) column. Because guids are non-sequential (and they're client-side generated so I can't use newsequentialid()), I have made a non-primary, non-clustered index on this ID field rather than giving the tables a clustered primary key.

I'm wondering what the performance implications are for this approach. I've seen some people suggest that tables should have an auto-incrementing ("identity") int as a clustered primary key even if it doesn't have any meaning, as it means that the database engine itself can use that value to quickly look up a row instead of having to use a bookmark.

My database is merge-replicated across a bunch of servers, so I've shied away from identity int columns as they're a bit hairy to get right in replication.

What are your thoughts? Should tables have primary keys? Or is it ok to not have any clustered indexes if there are no sensible columns to index that way?

A: 

I too have always heard having an auto-incrementing int is good for performance even if you don't actually use it.

Kevin Sheffield
+22  A: 

When dealing with indexes, you have to determine what your table is going to be used for. If you are primarily inserting 1000 rows a second and not doing any querying, then a clustered index is a hit to performance. If you are doing 1000 queries a second, then not having an index will lead to very bad performance. The best thing to do when trying to tune queries/indexes is to use the Query Plan Analyzer and SQL Profiler in SQL Server. This will show you where you are running into costly table scans or other performance blockers.

As for the GUID vs ID argument, you can find people online that swear by both. I have always been taught to use GUIDs unless I have a really good reason not to. Jeff has a good post that talks about the reasons for using GUIDs: http://www.codinghorror.com/blog/archives/000817.html.

As with most anything development related, if you are looking to improve performance there is not one, single right answer. It really depends on what you are trying to accomplish and how you are implementing the solution. The only true answer is to test, test, and test again against performance metrics to ensure that you are meeting your goals.

[Edit] @Matt, after doing some more research on the GUID/ID debate I came across this post. Like I mentioned before, there is not a true right or wrong answer. It depends on your specific implementation needs. But these are some pretty valid reasons to use GUIDs as the primary key:

For example, there is an issue known as a "hotspot", where certain pages of data in a table are under relatively high currency contention. Basically, what happens is most of the traffic on a table (and hence page-level locks) occurs on a small area of the table, towards the end. New records will always go to this hotspot, because IDENTITY is a sequential number generator. These inserts are troublesome because they require Exlusive page lock on the page they are added to (the hotspot). This effectively serializes all inserts to a table thanks to the page locking mechanism. NewID() on the other hand does not suffer from hotspots. Values generated using the NewID() function are only sequential for short bursts of inserts (where the function is being called very quickly, such as during a multi-row insert), which causes the inserted rows to spread randomly throughout the table's data pages instead of all at the end - thus eliminating a hotspot from inserts.

Also, because the inserts are randomly distributed, the chance of page splits is greatly reduced. While a page split here and there isnt too bad, the effects do add up quickly. With IDENTITY, page Fill Factor is pretty useless as a tuning mechanism and might as well be set to 100% - rows will never be inserted in any page but the last one. With NewID(), you can actually make use of Fill Factor as a performance-enabling tool. You can set Fill Factor to a level that approximates estimated volume growth between index rebuilds, and then schedule the rebuilds during off-peak hours using dbcc reindex. This effectively delays the performance hits of page splits until off-peak times.

If you even think you might need to enable replication for the table in question - then you might as well make the PK a uniqueidentifier and flag the guid field as ROWGUIDCOL. Replication will require a uniquely valued guid field with this attribute, and it will add one if none exists. If a suitable field exists, then it will just use the one thats there.

Yet another huge benefit for using GUIDs for PKs is the fact that the value is indeed guaranteed unique - not just among all values generated by this server, but all values generated by all computers - whether it be your db server, web server, app server, or client machine. Pretty much every modern language has the capability of generating a valid guid now - in .NET you can use System.Guid.NewGuid. This is VERY handy when dealing with cached master-detail datasets in particular. You dont have to employ crazy temporary keying schemes just to relate your records together before they are committed. You just fetch a perfectly valid new Guid from the operating system for each new record's permanent key value at the time the record is created.

http://forums.asp.net/t/264350.aspx

JeremiahClark
+4  A: 

The primary key serves three purposes:

  • indicates that the column(s) should be unique
  • indicates that the column(s) should be non-null
  • document the intent that this is the unique identifier of the row

The first two can be specified in lots of ways, as you have already done.

The third reason is good:

  • for humans, so they can easily see your intent
  • for the computer, so a program that might compare or otherwise process your table can query the database for the table's primary key.

A primary key doesn't have to be an auto-incrementing number field, so I would say that it's a good idea to specify your guid column as the primary key.

Mark Harrison
+2  A: 

@Mark It's definitely not a good idea to have a guid column as a primary key, because primary keys are clustered and guids are random. That means that any time you insert a new row your table is essentially being restructured on disk. People normally advise that primary keys should be sequential, ever-growing types so that each new row gets tacked onto the end of the table.

Matt Hamilton
A primary key is by default backed by a clustered index but it can be removed (the clustered index).
Andrei Rinea
+1  A: 

A Primary Key needn't be an autoincrementing field, in many cases this just means you are complicating your table structure.

Instead, a Primary Key should be the minimum collection of attributes (note that most DBMS will allow a composite primary key) that uniquely identifies a tuple.

In technical terms, it should be the field that every other field in the tuple is fully functionally dependent upon. (If it isn't you might need to normalise).

In practice, performance issues may mean that you merge tables, and use an incrementing field, but I seem to recall something about premature optimisation being evil...

Matthew Schinckel
+1  A: 

@Jeremiah Fascinating. I will look into the "page splits and index rebuilds" option if performance becomes an issue. Thanks for that.

Matt Hamilton
A: 

re "...not a good idea to have a guid column as a primary key, because primary keys are clustered and guids are random"

to overcome this, you can use the "newsequentialid()" function on SQL 2005/2008

edit: found the requisite CodingHorror post that talks about this ;-)

Leon Bambrick
+1  A: 

Just jumping in, because Matt's baited me a bit.

You need to understand that although a clustered index is put on the primary key of a table by default, that the two concepts are separate and should be considered separately. A CIX indicates the way that the data is stored and referred to by NCIXs, whereas the PK provides a uniqueness for each row to satisfy the LOGICAL requirements of a table.

A table without a CIX is just a Heap. A table without a PK is often considered "not a table". It's best to get an understanding of both the PK and CIX concepts separately so that you can make sensible decisions in database design.

Rob

Rob Farley
A: 

Nobody answered actual question: what are pluses/minuses of a table with NO PK NOR a CLUSTERED index. In my opinion, if you optimize for faster inserts (especially incremental bulk-insert, e.g. when you bulk load data into a non-empty table), such a table: with NO clustered index, NO constraints, NO Foreign Keys, NO Defaults and NO Primary Key, in a database with Simple Recovery Model, is the best. Now, if you ever want to query this table (as opposed to scanning it in its entirety) you may want to add a non-clustered non-unique indexes as needed but keep them to the minimum.

zvolkov