I need to store data in a SQL Server 2008 database from various data sources with different data types. The allowed data types are Bit, Numeric (1, 2 or 4 bytes), Real and String. Each stored row will have a value, a timestamp, an FK to the item the value belongs to, and some other information.

The most important points are read performance and data size. There might be a couple of thousand items, and each item may have millions of values.

I have 5 possible options:

  1. Separate tables for each data type (ValueBit, ValueTinyInt, ValueSmallInt, etc... tables)
  2. Separate tables with inheritance (Value table as base table, ValueBit table just for storing the Bit value, etc...)
  3. Single value table for all data types, with separate fields for each data type (Value table, with ValueBit BIT, ValueTinyInt TINYINT etc...)
  4. Single table and single value field using sql_variant
  5. Single table and single value field using UDT

With option 2, a PK is a must, and:

1,000 items * 10,000,000 values each > Int32.MaxValue, and
1,000 items * 10,000,000 values each * an 8-byte BIGINT PK is huge.
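(Spelled out: 1,000 * 10,000,000 = 10^10 rows, well above Int32.MaxValue = 2,147,483,647, so an INT surrogate key cannot work; and 10^10 rows * 8 bytes is roughly 80 GB spent on the BIGINT key column alone.)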

Setting that aside, I am considering option 1 or 3 with no PK. Will they differ in size?

I have no experience with options 4 or 5, and I doubt they will perform well in this scenario.

Which way shall I go?

+1  A: 

Your question is hard to answer because you seem to be using a relational database system for something it is not designed for. The data you want to keep in the database seems too unstructured to get much benefit from a relational database system. Designs built mostly on generic fields like "parameter type" and "parameter value", trying to cover every possible situation, are generally considered bad designs (the entity-attribute-value anti-pattern). Maybe you should consider using a non-relational database like BigTable. If you really want to use a relational database system, I'd strongly recommend reading Beginning Database Design by Clare Churcher. It's an easy read, but it gets you on the right track with respect to RDBMS design.

Whoever
A: 

I think it's a great question - this situation is fairly common, though it is awkward to design tables to support it.

In terms of performance, a table like the one indicated in #3 potentially wastes a huge amount of storage and RAM, because each row allocates space for a value of every type but uses only one. The new sparse-column feature in SQL Server 2008 could help with that, but there are other issues too: the design is hard to constrain/normalize, because exactly one of the value columns should be populated per row; a row with two values set would be an error, yet nothing in the schema reflects that (a CHECK constraint can enforce it; see the sketch after the [vals] table below). I'd cross that option off.

So if it were me, I'd be looking at option 1, 2 or 4, and the decision would be driven by this: do I typically need one query returning rows with a mix of value types in the same result set, or will I almost always ask for rows by item and by type? I ask because different value types imply to me some difference in the source or the use of that data (you are unlikely, for example, to compare a string with a real, or a string with a bit). This matters because having a table per type might actually be a significant performance/scalability advantage, if partitioning the data that way makes queries faster. Partitioning data into smaller sets of more closely related rows can give a performance advantage.

It's the difference between having all the data in one massive (albeit sorted) set and having it partitioned into smaller, related sets. The smaller sets favor some types of queries, and if those are the queries you will need, it's a win.

Details:

CREATE TABLE [dbo].[items](
    [itemid] [int] IDENTITY(1,1) NOT NULL,
    [item] [varchar](100) NOT NULL,
 CONSTRAINT [PK_items] PRIMARY KEY CLUSTERED 
(
    [itemid] ASC
)
) 

/* This table has the problem of allowing two values 
in the same row, plus allocates but does not use a 
lot of space in memory and on disk (bad): */

CREATE TABLE [dbo].[vals](
    [itemid] [int] NOT NULL,
    [datestamp] [datetime] NOT NULL,
    [valueBit] [bit] NULL,
    [valueNumericA] [numeric](2, 0) NULL,
    [valueNumericB] [numeric](8, 2) NULL,
    [valueReal] [real] NULL,
    [valueString] [varchar](100) NULL,
 CONSTRAINT [PK_vals] PRIMARY KEY CLUSTERED 
(
    [itemid] ASC,
    [datestamp] ASC
)
) 

ALTER TABLE [dbo].[vals]  WITH CHECK 
ADD  CONSTRAINT [FK_vals_items] FOREIGN KEY([itemid])
REFERENCES [dbo].[items] ([itemid])
GO

ALTER TABLE [dbo].[vals] CHECK CONSTRAINT [FK_vals_items]
GO
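
/* Not from the original answer: two hedged additions if you do keep
this one-column-per-type design. Sparse columns (SQL Server 2008)
cut the storage cost of all the NULLs -- declare e.g.
[valueReal] [real] SPARSE NULL. And a CHECK constraint (the name
below is my own) can enforce that exactly one value column is
populated per row: */

ALTER TABLE [dbo].[vals] WITH CHECK
ADD CONSTRAINT [CK_vals_exactly_one_value] CHECK (
    (CASE WHEN [valueBit]      IS NOT NULL THEN 1 ELSE 0 END
   + CASE WHEN [valueNumericA] IS NOT NULL THEN 1 ELSE 0 END
   + CASE WHEN [valueNumericB] IS NOT NULL THEN 1 ELSE 0 END
   + CASE WHEN [valueReal]     IS NOT NULL THEN 1 ELSE 0 END
   + CASE WHEN [valueString]   IS NOT NULL THEN 1 ELSE 0 END) = 1
)
GO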


/* This is probably better, though casting is required 
all the time. If you search with the variant as criteria, 
that could get dicey as you have to be careful with types, 
casting and indexing. Also everything is "mixed" in one 
giant set */

CREATE TABLE [dbo].[allvals](
    [itemid] [int] NOT NULL,
    [datestamp] [datetime] NOT NULL,
    [value] [sql_variant] NOT NULL
) ON [PRIMARY]

GO

ALTER TABLE [dbo].[allvals]  WITH CHECK 
ADD  CONSTRAINT [FK_allvals_items] FOREIGN KEY([itemid])
REFERENCES [dbo].[items] ([itemid])
GO

ALTER TABLE [dbo].[allvals] CHECK CONSTRAINT [FK_allvals_items]
GO
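
/* Hypothetical query (my illustration, not part of the original
answer) showing the casting issue: to read a variant back as its
real type, you filter on its base type and cast. Note the care
required: a bare CAST in the WHERE clause could be evaluated
against rows of other types and fail, since predicate evaluation
order is not guaranteed. */

SELECT [itemid], [datestamp], CAST([value] AS real) AS [realval]
FROM [dbo].[allvals]
WHERE SQL_VARIANT_PROPERTY([value], 'BaseType') = 'real'
GO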


/* This would be an alternative, but you trade multiple 
queries and joins for the casting issue. OTOH the implied
partitioning might be an advantage */

CREATE TABLE [dbo].[valsBits](
    [itemid] [int] NOT NULL,
    [datestamp] [datetime] NOT NULL,
    [val] [bit] NOT NULL
) ON [PRIMARY]

GO

ALTER TABLE [dbo].[valsBits]  WITH CHECK 
ADD  CONSTRAINT [FK_valsBits_items] FOREIGN KEY([itemid])
REFERENCES [dbo].[items] ([itemid])
GO

ALTER TABLE [dbo].[valsBits] CHECK CONSTRAINT [FK_valsBits_items]
GO

CREATE TABLE [dbo].[valsNumericA](
    [itemid] [int] NOT NULL,
    [datestamp] [datetime] NOT NULL,
    [val] numeric( 2, 0 ) NOT NULL
) ON [PRIMARY]

GO

... FK constraint ...

CREATE TABLE [dbo].[valsNumericB](
    [itemid] [int] NOT NULL,
    [datestamp] [datetime] NOT NULL,
    [val] numeric ( 8, 2 ) NOT NULL
) ON [PRIMARY]

GO

... FK constraint ...

etc...
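
/* Hypothetical sketch (mine, not the answerer's): if you do need a
mixed-type result set from the per-type tables, it costs a UNION
and a cast to a common type, e.g.: */

SELECT [itemid], [datestamp], CAST([val] AS varchar(100)) AS [val], 'bit' AS [valtype]
FROM [dbo].[valsBits]
UNION ALL
SELECT [itemid], [datestamp], CAST([val] AS varchar(100)), 'numericA'
FROM [dbo].[valsNumericA]
GO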
onupdatecascade
Thanks for the reminder about the sparse columns feature; it may come in handy in this situation. Your answer is very helpful; I'm evaluating your other points as well.
Recep
A: 

What are the usage scenarios? Start with sample queries and work out the necessary indexes. Consider data partitioning, as mentioned before. Try to understand your data and its relations better. I believe the decision should be based on the business meaning/usage of the data.
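For example (my illustration, reusing the [vals] table from the answer above): if the dominant query is "all values for one item over a time range", the clustered PK on (itemid, datestamp) already covers it; a query that also filters on a value would suggest an extra index, e.g.:

CREATE NONCLUSTERED INDEX [IX_vals_itemid_valueReal]
ON [dbo].[vals] ([itemid], [valueReal])
GO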

I do not want to leave this unanswered. What I ended up applying corresponds to your suggestion, so I'm accepting your answer. Option 3 was the way to go.
Recep