views:

377

answers:

5

OK. I've read things here and there about SQL Server heaps, but nothing too definitive to really guide me. I am going to try to measure performance, but was hoping for some guidance on what I should be looking into. This is SQL Server 2008 Enterprise. Here are the tables:

Jobs

  • JobID (PK, GUID, externally generated)
  • StartDate (datetime2)
  • AccountId
  • Several more accounting fields, mainly decimals and bigints

JobSteps

  • JobStepID (PK, GUID, externally generated)
  • JobID FK
  • StartDate
  • Several more accounting fields, mainly decimals and bigints

Usage: Lots of inserts (hundreds/sec), usually 1 JobStep per Job. Estimate perhaps 100-200M rows per month. No updates at all, and the only deletes are from archiving data older than 3 months.

Do ~10 queries/sec against the data. Some join JobSteps to Jobs, some just look at Jobs. Almost all queries will range on StartDate, most of them include AccountId and some of the other accounting fields (we have indexes on them). Queries are pretty simple - the largest part of the execution plans is the join for JobSteps.

The priority is the insert performance. Some lag (5 minutes or so) is tolerable for data to appear in the queries, so replicating to other servers and running queries off them is certainly allowable.

Lookup based on the GUIDs is very rare, apart from joining JobSteps to Jobs.

Current Setup: No clustered index. The only one that seems like a candidate is StartDate. But, it doesn't increase perfectly. Jobs can be inserted anywhere in a 3 hour window after their StartDate. That could mean a million rows are inserted in an order that is not final.

Data size for a 1 Job + 1 JobStepId, with my current indexes, is about 500 bytes.

Questions:

  • Is this a good use of a heap?

  • What's the effect of clustering on StartDate, when it's pretty much non-sequential for ~2 hours/1 million rows? My guess is the constant re-ordering would kill insert perf.

  • Should I just add bigint PKs just to have smaller, always increasing keys? (I'd still need the guids for lookups.)

I read GUIDs as PRIMARY KEYs and/or the clustering key, and it seemed to suggest that even inventing a key will save considerable space on other indexes. Also some resources suggest that heaps have some sort of perf issues in general, but I'm not sure if that still applies in SQL 2008.

And again, yes, I'm going to try to perf test and measure. I'm just trying to get some guidance or links to other articles so I can make a more informed decision on what paths to consider.

+5  A: 

Yes, heaps have issues. Your data will logically fragment all over the show and can not be defragmented simply.

Imagine throwing all your telephone directory into a bucket and then trying to find "bob smith". Or using a conventional telephone directory with a clustered index on lastname, firstname.

The overhead of maintaining the index is trivial.

StartDate, unless unique, is not a good choice. A clustered index requires internal uniqueness for the non-clustered indexes. If not declared unique, SQL Server will add a 4 byte "uniquifier".

Yes, I'd use int or bigint to make it easier. As for GUIDs: see the questions at the right hand side of the screen.

Edit:

Note, PK and clustered index are 2 separate issues even if SQL Server be default will make the PK clustered.

gbn
Yep, I definately wanted to avoid GUIDs as the clustered key, as all the other questions show. StartDate would need an extra identity. I was just concerned that inserting essentially "random" startdates over a 2 hour period might mean lots of reordering or something. So, in short, go add a bigint PK to get it all nice and clustered?
MichaelGG
@MichaelGG: yes. it's narrow, numeric, strictly monotonically increasing, unique = good clustered index
gbn
Right, I understand that an int PK is much more suited than a guid. My question was if adding a new bigint column just for the sake of having clustered indexes, versus leaving it as a heap was a good idea. It seems that it is the right move.
MichaelGG
@MichaelGG: yes, it makes all NC indexes smaller...
gbn
+1  A: 

As a GUId is your primary and foreign key your database will still need to check the contraints on every insert you will probably need to index this. Indexing a GUId is not advisable due to it's randomness. Therefore I'd say absolutely you should go down the bigint (probably identity) route for your primary key and use it as a clustered index.

Robin Day
I still need the guid indexed, because i need to do occasional lookups on it. A nonclustered index with a lower fill factor should work out ok, no?
MichaelGG
I would make that call based on how often you need to do your "occasional" lookups. If as you say insert speed is the key then fewer indexes will help, especially complex ones like a GUId. If you're doing one GUId lookup a week then a slow table scan may be acceptable in comparison to maintaining the index on 100 inserts a second.... I suggest you profile this to be sure though.
Robin Day
OK - will do. Thanks!
MichaelGG
+3  A: 

Heap fragmentation isn't necessarily the end of the world. It sounds like you'll rarely be scanning the data, so that's not the end of the world.

Your non-clustered indexes are the things that will impact your performance. Each one will need to store the address of the row in the underlynig table (either a heap or a clustered index). Ideally, your queries never have to use the underlying table itself, because it stores all the information needed in the ideal way (including all columns, so that it's a covering index).

And yes, Kimberly Tripp's stuff is the best around for indexes.

Rob

Rob Farley
Some of the queries are aggregates and could be serviced off of an index alone. But many need to return almost every single column from both tables. Did I understand Kimberly's info that by switching from guid to clustered bigint PK, I'll save lots of space on my other indexes?
MichaelGG
Changing your structure (or what's in a clustered index) will affect every non-clustered index in the table. I would suggest that leaving your data as a heap would be the way to go. Spend your time getting your non-clustered indexes right. And actually, you may even find that non-clustered indexes with many included columns still work out better than getting a clustered index right. But yes, a bigint (8 bytes) is smaller than a guid (16 bytes), and that appears on every leaf row of your NCIX.
Rob Farley
Also... since you're not updating, the biggest differences between leaving it as a heap and putting a surrogate key are:1. Introducing a new field will grow each row, giving you fragmentation you don't need. 2. Your table will need to be rebuilt, and the NCIXs along with it. 3. The NCIXs will be slightly smaller, based the 'row address' will have shrunk a bit. But at the end of the day, get your NCIXs right and you shouldn't have to care what your underlying heap/CIX looks like.
Rob Farley
Something which may affect your decision - do you have data already pouring into your heap, or are you designing this empty?
Rob Farley
There is data in the heap, but it's small, and we're planning an upgrade to the system anyways. So, we have the ability to take it offline and redesign it as much as needed.
MichaelGG
+3  A: 

As your own research has shown, and as all the other answerers have mentioned, using a GUID as the clustered index on a table is a bad idea.

However, having a heap also isn't really a good choice, since heaps have other issues, mostly to do with fragmentation and other things that just don't work well with a heap.

My best practice advice would always be this:

  • do use a primary, clustered key on any data table (unless it's a temporary table, or a table used for bulk-loading)
  • try to make sure the clustered key is a INT IDENTITY or BIGINT IDENTITY

I would argue that the benefits you get by adding a INT/BIGINT - even just for the sake of having a good clustered index - far outweigh the drawbacks this has (as Kim Tripp also argues in her blog post you cited).

Marc

marc_s
A: 

We have a large distributed system and the ability to let the client generate the key instead of having to round-trip it from the server is significant. To combat the "randomness", we have stolen the last 32 bits of the GUID and injected a time-stamp. Each GUID now only has 96 random bits, but this is still significant enough (for 1 millisecond). It can now be generated at the client and is basically sequential (internet latency not withstanding).

I would be curious to hear if anyone has tried this or has an opinion on the approach.

Scott Serious