views: 1441
answers: 7

I'm working on a table design that could involve many NULL values in about 10 fields; maybe 75% of the time those fields would be unused.

I just generated some fake data (a million records) and could not detect any impact on SQL Server 2005. The size difference was in the KB range. Performance: no measurable difference after adding an index to the 3 non-nullable columns.

I know SQL Server 2008 has the sparse columns feature (which I assume will be used for the next version of SharePoint's UserData table). I want my code to work on 2005, though. But lots of NULL values already exist in the design of the current SharePoint UserData table, so if it's good enough for Microsoft...

Any good articles, links, or white papers on the drawbacks or pain points of having many NULL values in a SQL Server table? Does anyone have experience with what happens as you scale to 10 million or 100 million records?

A: 

Don't make a table with 75% unused columns. Make it with the columns you're going to use all the time, and look into using something like EAV (entity-attribute-value) for the other columns, or put them in a different table.
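
For example, something along these lines (a rough sketch; the table and column names are just made up to illustrate the split):

    -- Core table: only the columns that are always used.
    CREATE TABLE dbo.Item
    (
        ItemID INT IDENTITY(1,1) PRIMARY KEY,
        Name   NVARCHAR(100) NOT NULL
        -- ...plus the other always-populated columns...
    );

    -- EAV table: a row exists only when an optional attribute has a value,
    -- so the mostly-empty fields never show up as NULL columns.
    CREATE TABLE dbo.ItemAttribute
    (
        ItemID         INT NOT NULL
            CONSTRAINT FK_ItemAttribute_Item REFERENCES dbo.Item (ItemID),
        AttributeName  VARCHAR(50)   NOT NULL,
        AttributeValue NVARCHAR(400) NULL,
        CONSTRAINT PK_ItemAttribute PRIMARY KEY (ItemID, AttributeName)
    );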

Andrew Clark
Thinking about the different-table idea. Leaning against EAV because of the amount of pivoting I would have to do constantly, and because the 10 fields never change. It's not a flexible schema like CouchDB, SimpleDB, and Notes use.
tyndall
If the 10 fields are never going to change or get added to, go with a separate table for sure.
Andrew Clark
+1  A: 

The problems I've had in the past deal with the programming implications of having NULL values: for example, issues on the client side, or NOT IN queries not returning the rows you expect because a NULL value was in the subquery.
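
A quick sketch of the NOT IN trap (the table names here are made up for illustration):

    -- If dbo.BlockedCustomers.CustomerID contains even one NULL, this query
    -- returns no rows at all: "x NOT IN (..., NULL)" never evaluates to TRUE.
    SELECT o.OrderID
    FROM   dbo.Orders AS o
    WHERE  o.CustomerID NOT IN (SELECT b.CustomerID FROM dbo.BlockedCustomers AS b);

    -- NOT EXISTS is not tripped up by NULLs in the subquery:
    SELECT o.OrderID
    FROM   dbo.Orders AS o
    WHERE  NOT EXISTS (SELECT 1
                       FROM   dbo.BlockedCustomers AS b
                       WHERE  b.CustomerID = o.CustomerID);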

JoshBerke
+1  A: 

Well, NULL is always a bit of an oddball in databases. I don't think it has too much of a performance impact in your case - but of course, you'll have to deal with all the NULL values separately.

Whenever possible, I strive to use a default value instead. So if you have, e.g., some ID value of type INT, you could use 0 or -1 as a "no value present" indicator. That way, a single check (field < 0) is enough, and you avoid having to test for NULL separately (field IS NULL or IS NOT NULL).
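
A small sketch of that approach (the table is just an example, not your actual schema):

    -- -1 means "no coupon applied", so the column can stay NOT NULL.
    CREATE TABLE dbo.Invoices
    (
        InvoiceID  INT IDENTITY(1,1) PRIMARY KEY,
        CustomerID INT NOT NULL,
        CouponID   INT NOT NULL
            CONSTRAINT DF_Invoices_CouponID DEFAULT (-1)
    );

    -- One predicate covers "has a coupon"; no separate IS NULL test is needed.
    SELECT COUNT(*) FROM dbo.Invoices WHERE CouponID > 0;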

Marc

marc_s
A: 

There's only one way to be sure: go ahead and insert 100 million records, then measure the end-to-end performance.
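
For instance, one way to bulk-generate the test rows (dbo.BigTable and its columns are placeholders; adjust TOP to the row count you want to test):

    ;WITH Numbers AS
    (
        SELECT TOP (10000000)
               ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
        FROM   sys.all_objects AS a
               CROSS JOIN sys.all_objects AS b
               CROSS JOIN sys.all_objects AS c
    )
    INSERT INTO dbo.BigTable (KeyCol, OptionalVal1, OptionalVal2)
    SELECT n,
           CASE WHEN n % 4 = 0 THEN 'some value' END,   -- NULL about 75% of the time
           CASE WHEN n % 4 = 0 THEN n END
    FROM   Numbers;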

James L
While I agree with this as a method, it is a relatively sloppy way to test what, on the surface, appears to be bad architecture.
Gregory A Beamer
Agreed, and adding another column in the future would be near impossible.
GateKiller
+4  A: 

I have never had a performance problem from multiple NULL columns, even on databases hundreds of gigabytes in size. I imagine you could end up with issues if you are indexing these fields and then using NULL in the query, but I have not seen this as a problem personally. Then again, I have not created database tables where every field except 3 was nullable.

On the other hand, I see an architecture problem when most of the data is NULL. The general reason is either a) an improperly normalized database or b) an attempt to let users stage data in the final table rather than creating separate tables to "build up" data prior to committing it to the database.

It is up to you to determine the best architecture of your database.

Gregory A Beamer
+1. Thanks for the advice.
tyndall
+3  A: 

What I do in this situation, which is very common, is to split the data up into two tables:

  • Required Data
  • Optional Data

For example, I'm currently writing a community website and one of the tables will obviously be a user table. I am recording a large amount of information about users and so I have split the data I collect into two tables:

  • Users
  • UserDetails

The Users table contains basic information that I will need all the time such as Username, Name and Session Information.

The UserDetails table contains extra information which I don't need as often, such as Profile Page, Email Address, Password, Website Address, Date of Birth and so on.

This is known as vertical partitioning.
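
A rough sketch of the two tables and the occasional 1-to-1 join (the column types are only an example):

    CREATE TABLE dbo.Users
    (
        UserID    INT IDENTITY(1,1) PRIMARY KEY,
        Username  NVARCHAR(50)  NOT NULL,
        Name      NVARCHAR(100) NOT NULL,
        SessionID UNIQUEIDENTIFIER NULL
    );

    -- Same key, at most one optional row per user.
    CREATE TABLE dbo.UserDetails
    (
        UserID         INT PRIMARY KEY
            CONSTRAINT FK_UserDetails_Users REFERENCES dbo.Users (UserID),
        ProfilePage    NVARCHAR(MAX) NULL,
        EmailAddress   NVARCHAR(256) NULL,
        WebsiteAddress NVARCHAR(256) NULL,
        DateOfBirth    DATETIME NULL
    );

    -- Join only when the full record is needed:
    SELECT u.Username, d.EmailAddress, d.DateOfBirth
    FROM   dbo.Users AS u
    LEFT JOIN dbo.UserDetails AS d ON d.UserID = u.UserID
    WHERE  u.UserID = 42;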

GateKiller
+1 Thanks for the new terminology. I'll have to go off and do some reading on that now. I wonder what the performance is like with this strategy when you get into the 100s of millions of records. I guess the 1-to-1 JOIN is not really that expensive if things are indexed correctly.
tyndall
No problem :) You should only be joining the information when you need to view the entire record. The Required Data should be used for searching, browsing, listing, etc. It may be a little slower than one large table, but it's much more scalable.
GateKiller
+1  A: 

The higher the probability of NULL in a column, the closer to the end of the record that column should be placed (i.e., make it one of the last columns in the table).
Trailing NULL values in variable-length columns are not allocated any space; whether a value is NULL is tracked by the NULL bitmap stored in each record (one bit per column in the row).

NULL values are not read from the column data itself; they are read from the NULL bitmap, and when a NULL is detected the read of the actual value is skipped.

The sparse columns feature should be used with caution, as it adds time and space overhead for non-NULL values. For performance, you may want to use a filtered index on the non-NULL part of a column.
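
For example (filtered indexes are SQL Server 2008 and later; the table and column names are only placeholders):

    -- Index only the rows where the mostly-NULL column actually has a value.
    CREATE NONCLUSTERED INDEX IX_BigTable_OptionalVal1
        ON dbo.BigTable (OptionalVal1)
        WHERE OptionalVal1 IS NOT NULL;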

vgv8