views:

60

answers:

5

If I had a table with the columns:

  • Artist
  • Album
  • Song
  • NumberOfListens

...is it better to put a clustered primary key on Artist, Album, and Song or to have an autoincrementing id column and put a unique constraint on Artist, Album, and Song.

How important is database consistency? If half of my tables have clustered primary keys and the other half an id column with unique constraints, is that bad or does it not matter? Both ways seem the same to me but I do not know what the industry standard is or which is better and why.

+5  A: 

I would never put a primary key on columns of long text like: Artist, Album, and Song. Use an auto increment ID that is the clustered PK. If you want the Artist, Album, and Song to be unique, ad an Unique Index on the three. If you want to search by Album or Song, independent of independent Artist, you'll need an index for each, which pulls in the PK, so having a small PK saves you on each other index. The savings are not just disk space but in memory cache, and more keys on a page.

KM
A: 

Clustered indexes are great for range based queries. For example, a log date or order date. Putting one on Artist, Album, and Song will [probably] cause fragmentation when you insert new rows.

If your DB supports it, add a non-clustered primary key on Artist, Album, and Song and call it good. Or just add a unique key on Artist, Album, and Song.

Having an autoincrementing primary key would only really be useful if you had to had referential integrity to another table.

beach
A classic urban myth - the range query is only about the third characteristic of a clustered index. Much more important are the physical lookup of the actual data rows (thus the clustering key is added to all non clustered indices and therefore should be as small as possible), and the physical ordering of the table (which then - as a side-effect - makes certain range queries just a tad more efficient than with a non-clustered index). See Kim Tripp's blog post for a great discussion of this: http://www.sqlskills.com/BLOGS/KIMBERLY/post/GUIDs-as-PRIMARY-KEYs-andor-the-clustering-key.aspx
marc_s
@marc_s: I wouldn't call including one of the benefits or use cases an urban myth ;) I just didn't include all of the aspects of clustered vs non-clustered indexes to the discussion. Other people contributed those aspects after I answered. I provided what I would do, which is create a unique key.
beach
A: 

Without knowing the exact requirements, in general you would probably have an artist table, and possibly album table too. A song table would then be a unique combination of artist id, album id and then song. I'd enforce the uniqueness by an index or constraint depending on application, and use an id for a primary key.

Miles D
+1  A: 

You really need to keep two issues apart:

1) the primary key is a logical construct - one of the candidate keys that uniquely and reliably identifies every row in your table. This can be anything, really - an INT, a GUID, a string - pick what makes most sense for your scenario. You reference primary keys in your foreign key constraints, so those are crucial for the integrity of your database. Use them - always - period.

2) the clustering key (the column or columns that define the "clustered index" on the table) - this is a physical storage-related thing, and here, a small, unique, stable, ever-increasing data type is your best pick - INT or BIGINT as your default option.

By default, the primary key on a SQL Server table is also used as the clustering key - but that doesn't need to be that way, you can easily pick a column that is not your primary key to be your clustering key.

Then there's another issue to consider: the clustering key on a table will be added to each and every entry on each and every non-clustered index on your table as well - thus you really want to make sure it's as small as possible. Typically, an INT with 2+ billion rows should be sufficient for the vast majority of tables - and compared to a VARCHAR(20) or so as the clustering key, you can save yourself hundreds of megabytes of storage on disk and in server memory.

Some more food for thought - excellent stuff by Kimberly Tripp - read it, read it again, digest it! It's the SQL Server indexing gospel, really.

Marc

marc_s
+1, well put!!!
KM
Sorry I can't go too into detail about the actual database I am working on. It's got about 50 tables and it is too complex to type more detail. Are you saying that by clustering primary keys or even clustering columns that are not primary keys cause drain in server memory? Using my example, basically every table in this database has an Artist and Album column as a primary key, then 1 to 3 other columns as the primary key(s) as well. It's poorly designed and I'm reorganizing it.
Sarah
@Sarah: if your **clustering key** is fairly large (e.g. a VARCHAR(50) or so - or even worse, two or more columns), then YES - you are wasting disk and server memory space - and you're not really gaining anything from it, either.
marc_s
A: 

First of all, there's already a problem here because the data is not normalized. Creating any sort of index on a bunch of text columns is something that should be avoided whenever possible. Even if these columns aren't text (and I suspect that they are), it still doesn't make sense to have artist, album and song in the same table. A much better design for this would be:

Artists (
    ArtistID int NOT NULL IDENTITY(1, 1) PRIMARY KEY CLUSTERED,
    ArtistName varchar(100) NOT NULL)

Albums (
    AlbumID int NOT NULL IDENTITY(1, 1) PRIMARY KEY CLUSTERED,
    ArtistID int NOT NULL,
    AlbumName varchar(100) NOT NULL,
    CONSTRAINT FK_Albums_Artists FOREIGN KEY (ArtistID)
        REFERENCES Artists (ArtistID))

Songs (
    SongID int NOT NULL IDENTITY(1, 1) PRIMARY KEY CLUSTERED,
    AlbumID int NOT NULL,
    SongName varchar(100) NOT NULL,
    NumberOfListens int NOT NULL DEFAULT 0
    CONSTRAINT FK_Songs_Albums FOREIGN KEY (AlbumID)
        REFERENCES Albums (AlbumID))

Once you have this design, you have the ability to search for individual albums and artists as well as songs. You can also add covering indexes to speed up queries, and the indexes will be much smaller and therefore faster than the original design.

If you don't need to do range queries (which you probably don't), then you could replace the IDENTITY key with a ROWGUID if that suits your design better; it doesn't really matter much in this case, I would stick with the simple IDENTITY.

You have to be careful with clustering keys. If you cluster on a key that is completely not even remotely sequential (and an artist, album, and song name definitely qualify as non-sequential), then you end up with page splits and other nastiness. You don't want this. And as Marc says, a copy of this key gets added to every index, and you definitely don't want this when your key is 300 or 600 bytes long.

If you want to be able to quickly query for the number of listens for a specific song by the artist, album, and song name, it's actually quite simple with the above design, you just need to index properly:

CREATE UNIQUE INDEX IX_Artists_Name ON Artists (ArtistName)
CREATE UNIQUE INDEX IX_Albums_Artist_Name ON Albums (ArtistID, AlbumName)
CREATE UNIQUE INDEX IX_Songs_Album_Name ON Songs (AlbumID, SongName)
    INCLUDE (NumberOfListens)

Now this query will be fast:

SELECT ArtistName, AlbumName, SongName, NumberOfListens
FROM Artists ar
INNER JOIN Albums al
    ON al.ArtistID = ar.ArtistID
INNER JOIN Songs s
    ON s.AlbumID = al.AlbumID
WHERE ar.ArtistName = @ArtistName
AND al.AlbumName = @AlbumName
AND s.SongName = @SongName

If you check out the execution plan you'll see 3 index seeks - it's as fast as you can get it. We've guaranteed the exact same uniqueness as in the original design and optimized for speed. More importantly, it's normalized, so both an artist and an album have their own specific identity, which makes this a great deal easier to manage over the long term. It's much easier to search for "all albums by artist X." It's much much easier and faster to search for "all songs on album Y."

When designing a database, normalization should be your first concern, indexing should be your second. And you're likely to find that once you have a normalized design, the best indexing strategy becomes kind of obvious.

Aaronaught