This is a FACT table in a data warehouse.

It has a composite clustered primary key, defined as follows:

ALTER TABLE [dbo].[Fact_Data] 
ADD  CONSTRAINT [PK_Fact_Data] 
PRIMARY KEY CLUSTERED 
(
    [Column1_VarChar_10] ASC,
    [Column2_VarChar_10] ASC,
    [Column3_Int] ASC,
    [Column4_Int] ASC,
    [Column5_VarChar_10] ASC,
    [Column6_VarChar_10] ASC,
    [Column7_DateTime] ASC,
    [Column8_DateTime] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, 
SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF, 
ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON
) ON [PRIMARY]
GO

In this structure, all of the VARCHAR(10) columns hold numeric values only. Would it be beneficial, in terms of querying and indexing, to change this 78-million-row table to hold BIGINT instead of VARCHAR?

Any other benefits / drawbacks that I should consider?

+8  A: 

You should DEFINITELY introduce a surrogate INT IDENTITY() primary key!! INT already gives you potentially up to 2 billion rows - isn't that enough??

This primary key / clustered key on SQL Server will be up to 64 bytes in size (instead of 4 for an INT) - which will bloat your clustered index AND all your non-clustered indexes beyond recognition. The whole clustering key (all 8 of your columns) is included on every single page of every single non-clustered index on that table - wasting lots and lots of space.

So on any given index page, you could fit up to 16 times more entries with a surrogate INT clustered key - that means a lot less I/O and a lot less time wasted reading index pages.

And just imagine trying to establish a foreign-key relationship to that table.... any child table would have to have all 8 columns of your primary key as foreign key columns, and specify all 8 columns in every join - what a nightmare!!

At 78 million rows, even just changing the clustering key to INT IDENTITY will save you up to 60 bytes per row - that alone comes out to over 4 GB of disk space (and RAM usage on your server). And that's before even beginning to calculate the savings on the non-clustered indexes...
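Something along these lines would do it - just a minimal sketch, assuming you can take a maintenance window (dropping the clustered PK rebuilds the 78-million-row table as a heap, and [Fact_Data_ID] is simply a name I made up):

-- Sketch: swap the wide 8-column clustered key for a surrogate IDENTITY
ALTER TABLE [dbo].[Fact_Data] DROP CONSTRAINT [PK_Fact_Data]
GO
ALTER TABLE [dbo].[Fact_Data] ADD [Fact_Data_ID] INT IDENTITY(1,1) NOT NULL
GO
ALTER TABLE [dbo].[Fact_Data]
ADD CONSTRAINT [PK_Fact_Data] PRIMARY KEY CLUSTERED ([Fact_Data_ID] ASC) ON [PRIMARY]
GO
-- Keep the original 8-column business key honest with a nonclustered
-- unique constraint, so the data rules stay enforced:
ALTER TABLE [dbo].[Fact_Data]
ADD CONSTRAINT [UQ_Fact_Data_BusinessKey] UNIQUE NONCLUSTERED
(
    [Column1_VarChar_10] ASC, [Column2_VarChar_10] ASC,
    [Column3_Int] ASC, [Column4_Int] ASC,
    [Column5_VarChar_10] ASC, [Column6_VarChar_10] ASC,
    [Column7_DateTime] ASC, [Column8_DateTime] ASC
)
GO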

And of course, yes, I would also change the VARCHAR(10) columns to INT or BIGINT - if it's a number, make the field type numeric; there's no point in leaving it at VARCHAR(10), really. That alone won't make a huge difference in speed or performance, but it makes working with the data that much easier (no more casting to numeric types when comparing values, and so forth).
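If you do convert, a hedged sketch of the per-column change (the LIKE check is just one way to verify the data first, and ALTER COLUMN will fail while the column is still part of an index or constraint):

-- Verify nothing non-numeric is hiding in the column first:
SELECT COUNT(*) AS BadRows
FROM [dbo].[Fact_Data]
WHERE [Column1_VarChar_10] LIKE '%[^0-9]%'
GO
-- Then change the type (repeat for each VARCHAR(10) column):
ALTER TABLE [dbo].[Fact_Data] ALTER COLUMN [Column1_VarChar_10] BIGINT NOT NULL
GO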

Marc

marc_s
this is a fact table in my data warehouse. there are no tables using the PK from this as an FK.
Raj More
@Raj: ah, ok, that explains a few things - you forgot to mention that. But still: do you have non-clustered indices on this table, too? Those would definitely benefit hugely from having a single INT or BIGINT primary/clustering key
marc_s
Bigint uses 64 BITS not bytes, i.e. it's twice as big, not 16 times.
Yrlec
@Yrlec: yes, BIGINT is 8 bytes and a GUID is 16 bytes - twice as big. But the OP's original clustered key, made up of 8 columns, was going to be up to 64 bytes in size - that's 16 times larger than a regular INT, and that's what I was referring to
marc_s
marc_s, thanks! When does it start to make sense to use bigint versus int for my identity columns? If I have ~700,000 rows in a table, and I delete and insert all those rows once per week (without reseed), I'm good for 60 years, but if I do it every day (not a requirement right now), I'm down to 8 years. I'm seriously considering using bigint!
JohnB
@JohnB: INT gives you at least 2 (or even 4) billion rows - you should be safe for quite some time, I think. :-) Plus : you could always reseed your table once a year as part of a year-end maintenance job.
marc_s
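(For reference, the reseed marc_s mentions is a one-liner - a sketch with a made-up table name, and only safe once the old rows are genuinely gone:)

-- Reset the IDENTITY seed so the next insert starts numbering at 1 again:
DBCC CHECKIDENT ('dbo.YourTable', RESEED, 0)
GO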
+1  A: 

Marc S is right in that the 64-byte primary key is going to be duplicated into every NC index, so you are going to pay an I/O cost, which will impact the amount of data that can be held in memory (since you are wasting space on every NC index page). So on that basis the question is not 'should I convert my varchars?' but 'should I consider converting my clustered index to something entirely different?'

In terms of VARCHAR vs. BIGINT, there is a good reason to convert if you can afford the time: quite apart from the 2-byte difference in storage per field, when you compare values of two different types SQL Server is forced to implicitly convert one of them. This happens on every single comparison, whether it is for an index join or a predicate within a WHERE clause.

Depending on what you select the data by and which dimension tables join to the fact table, you could be picking up that conversion overhead on every query, since one side of each join has to be converted.
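To make that concrete, a hypothetical example ([Dim_Customer] and [Customer_ID] are invented names; the point is the type mismatch):

-- If d.Customer_ID is BIGINT and the fact column stays VARCHAR(10),
-- SQL Server implicitly converts the VARCHAR side on every row it compares
-- (BIGINT has higher type precedence), which also rules out an index seek
-- on the fact column:
SELECT f.[Column3_Int], f.[Column4_Int]
FROM [dbo].[Fact_Data] AS f
JOIN [dbo].[Dim_Customer] AS d
    ON f.[Column1_VarChar_10] = d.[Customer_ID]
GO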

Andrew
+3  A: 

Two things that can affect index (and overall DB) performance:

1) Size of the index page
2) Comparison speed

So for the first one: in general, the smaller your index rows are, the more of them fit on a page, the more pages you can hold in memory, and the greater the likelihood that a given query will find the page in cache rather than on slow disk. Thus, you want the smallest datatype that comfortably fits your existing and proposed future needs.

BIGINT is 8 bytes; the VARCHARs can be smaller if the stored values are short, so it really depends on your data. However, a 10-character number may still fit in SQL Server's INT datatype (http://msdn.microsoft.com/en-us/library/ms187745.aspx) depending on its magnitude, so INT vs. BIGINT depends on your domain.
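One way to check which type your data actually needs - a sketch against the OP's first column, relying on the OP's statement that the columns only ever hold digits (INT tops out at 2,147,483,647):

-- Find the largest value actually stored, to decide INT vs. BIGINT:
SELECT MAX(CONVERT(BIGINT, [Column1_VarChar_10])) AS MaxValue
FROM [dbo].[Fact_Data]
GO
-- If MaxValue is at or below 2147483647, INT (4 bytes) suffices;
-- otherwise you need BIGINT (8 bytes).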

Also, if your entire row is of fixed length, there are certain optimizations SQL Server can make during scans, since it knows exactly where on disk the next row will be (assuming the rows are contiguous). An edge case, to be sure, but it can help.

For the second one, it is faster to compare integers than character strings. So if you are only storing numeric data, you should definitely switch to an appropriately sized numeric datatype.

Finally, Marc is correct that this becomes a very convoluted primary key. However, if your data warrants it -- such as these being your ONLY columns and you never running additional queries -- you may be perfectly fine making the optimized version (with BIGINTs etc.) your primary key. It's kind of a code smell, though, so I will echo his advice to really take a look at your data model and see whether this is correct.

Matt Rogish