My simplified and contrived example is the following:

Let's say that I want to measure and store the temperature (and other values) of all the world's towns on a daily basis. I am looking for an optimal way of storing the data so that it is just as easy to get the current temperature in all the towns as it is to get the full temperature history for one town.

It is an easy enough problem to solve, but I am looking for the best solution.

The two main options I can think of are as follows:

Option 1 - Same table stores current and historical records

Store all the current and archive records in the same table.

i.e.

CREATE TABLE [dbo].[WeatherMeasurement](
  MeasurementID int IDENTITY(1,1) NOT NULL,
  TownID int NOT NULL,
  Temp int NOT NULL,
  [Date] datetime NOT NULL
)

This would keep everything simple, but what would be the most efficient query to get a list of towns and their current temperature? Would this scale once the table has millions of rows in it? Is there anything to be gained by having some sort of IsCurrent flag in the table?
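
For concreteness, the sort of query I mean is something like the following sketch against the table above (a join to the latest date per town); I make no claim that this is the most efficient form:

SELECT w.TownID, w.Temp, w.[Date]
FROM dbo.WeatherMeasurement AS w
JOIN (SELECT TownID, MAX([Date]) AS LatestDate
      FROM dbo.WeatherMeasurement
      GROUP BY TownID) AS latest
  ON latest.TownID = w.TownID
 AND latest.LatestDate = w.[Date];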

Option 2 - Store all archive records in a separate table

There would be a table to store the current live measurements in

CREATE TABLE [dbo].[WeatherMeasurement](
  MeasurementID int IDENTITY(1,1) NOT NULL,
  TownID int NOT NULL,
  Temp int NOT NULL,
  [Date] datetime NOT NULL
)

And a table to store the historical archived data (inserted by a trigger, perhaps; one possible shape for such a trigger is sketched after the table definition)

CREATE TABLE [dbo].[WeatherMeasurementHistory](
  MeasurementID int NOT NULL, -- copied from the live table, so no IDENTITY here
  TownID int NOT NULL,
  Temp int NOT NULL,
  [Date] datetime NOT NULL
)
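
A sketch of what I have in mind for the trigger (the trigger name is made up, and a separate job would still need to prune rows older than the current day from the live table):

CREATE TRIGGER trg_ArchiveWeatherMeasurement
ON dbo.WeatherMeasurement
AFTER INSERT
AS
BEGIN
  -- copy every newly inserted measurement straight into the history table
  INSERT INTO dbo.WeatherMeasurementHistory (MeasurementID, TownID, Temp, [Date])
  SELECT MeasurementID, TownID, Temp, [Date]
  FROM inserted;
END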

This has the advantage of keeping the main current table lean and very efficient to query, at the expense of a more complex schema and more expensive inserts.

Which is the best option? Are there better options I haven't mentioned?

NOTE: I have simplified the schema to help focus my question better, but assume there will be a lot of data inserted each day (100,000s of records), and that data is current for one day. The current data is just as likely to be queried as the historical.

A: 

I suggest keeping it all in the same table, since the historical data is queried just as often, unless you will be adding many more columns to the table.

When size becomes an issue, you can partition it out by decade and have a stored procedure union the requested rows.
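
A sketch of that idea (the per-decade table names and the procedure name here are made up for illustration):

CREATE PROCEDURE dbo.GetTownTemperatureHistory
  @TownID int
AS
BEGIN
  SELECT TownID, Temp, [Date]
  FROM dbo.WeatherMeasurement_1990s
  WHERE TownID = @TownID
  UNION ALL
  SELECT TownID, Temp, [Date]
  FROM dbo.WeatherMeasurement_2000s
  WHERE TownID = @TownID
  ORDER BY [Date];
END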

Gordon Bell
Do you have any views on what would be the most efficient query to get a list of towns and their current temperature?
Andrew Rimmer
+3  A: 

It DEPENDS on the application's usage patterns. If usage patterns indicate that the historical data will be queried more often than the current values, then put them all in one table. But if historical queries are the exception (or less than 10% of the queries), and the performance of the more common current-value query will suffer from putting all data in one table, then it makes sense to separate that data into its own table.

Charles Bretana
+1  A: 

Another alternative could be to go for one table for all data and have a view for the current temperature. This will not help performance but could well aid readability/maintainability. You could even go for an indexed view to improve performance if you have the appropriate version of SQL Server.
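
The plain-view version might look like this sketch (the view name is made up). Note that indexed views come with restrictions - no subqueries and no MAX, among others - so this particular definition could only ever be a regular view; an indexed view would need a different shape:

CREATE VIEW dbo.CurrentTemperature
AS
SELECT w.TownID, w.Temp, w.[Date]
FROM dbo.WeatherMeasurement AS w
WHERE w.[Date] = (SELECT MAX(w2.[Date])
                  FROM dbo.WeatherMeasurement AS w2
                  WHERE w2.TownID = w.TownID);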

PhilHoy
+1  A: 

I would keep the data in one table unless you have a very serious bias for current data (in usage) or history data (in volume). A compound index with DATE + TOWNID (in that order) would remove the performance concern in most cases (although clearly we don't have the data to be sure of this at this time).

The one thing I would wonder about is whether anyone will want data from both the current and history data for a town. If so, you just created at least one new view to worry about, and a possible performance problem in that direction.

This is unfortunately one of those things where you may need to profile your solutions against real world data. I personally have used compound indexes such as specified above in many cases, and yet there are a few edge cases where I have opted to break the history into another table. Well, actually another data file, because the problem was that the history was so dense that I created a new data file for it alone to avoid bloating the entire primary data file set. Performance issues are rarely solved by theory.

I would recommend reading up on query hints for index use, and "covering indexes" for more information about performance issues.
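
For example, the compound index described above, with Temp added as an included column so that it also covers the temperature lookup (INCLUDE requires SQL Server 2005 or later; the index name is made up):

CREATE NONCLUSTERED INDEX IX_WeatherMeasurement_Date_Town
ON dbo.WeatherMeasurement ([Date], TownID)
INCLUDE (Temp);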

Godeke
I'd slightly amend your statement to "Performance issues are rarely solved by theory *alone*." Knowing the theory is the only way to have good hunches to try while optimizing - otherwise you're just thrashing around and may never improve the performance. I gather that's what you meant. :)
Ian Varley
Proper indexing should eliminate any need for query hints. Query hints tend to hamstring the optimizer. In 12 years of SQL Server development and design I think I've had to use a query hint once - maybe twice. The problem is, if your data changes SQL Server can't adapt once you have the query hint.
Tom H.
I agree with both Ian and Tom. You need to understand the theory, but optimization is always hands on in the end. As far as query hints, I agree that they *shouldn't* be necessary, but if you hit a dead end with the built in optimizer (2005 fails where 2000 succeeds sometimes) then you use a hint.
Godeke
A: 

I would use a single table with indexed views to provide the latest information. SQL Server 2005 and 2008 are designed for data warehousing, so they should perform well under these conditions.

If you have a data pattern that requires writing to the db often, then the best choice would be to have an active table and an archive table that you batch-update at some interval.
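
The batch move might look something like this sketch, assuming the two-table schema from the question and the one-day currency window mentioned in the note:

-- move everything older than one day from the active table to the archive
DECLARE @cutoff datetime;
SET @cutoff = DATEADD(day, -1, GETDATE());

BEGIN TRANSACTION;

INSERT INTO dbo.WeatherMeasurementHistory (MeasurementID, TownID, Temp, [Date])
SELECT MeasurementID, TownID, Temp, [Date]
FROM dbo.WeatherMeasurement
WHERE [Date] < @cutoff;

DELETE FROM dbo.WeatherMeasurement
WHERE [Date] < @cutoff;

COMMIT TRANSACTION;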

Aaron Fischer
+1  A: 

Your table is very narrow and would probably perform well as a single, properly indexed table; it would never outstrip the capacity of SQL Server in a traditional normalized OLTP model, even with millions and millions of rows. Even the advantages of the dual-table model can be matched by using table partitioning in SQL Server, so it doesn't have much to recommend it over the single-table model. This would be an Inmon-style or "Enterprise Data Warehouse" scenario.
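
For reference, native table partitioning (SQL Server 2005 and later, Enterprise Edition) looks roughly like this sketch; the boundary dates are illustrative:

-- split rows into yearly partitions by the measurement date
CREATE PARTITION FUNCTION pfWeatherByYear (datetime)
AS RANGE RIGHT FOR VALUES ('20070101', '20080101');

CREATE PARTITION SCHEME psWeatherByYear
AS PARTITION pfWeatherByYear ALL TO ([PRIMARY]);

-- a table created ON psWeatherByYear([Date]) is then partitioned transparently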

In much bigger scenarios, I would transfer the data to a data warehouse (modeled with a Kimball-style dimensional model) on a regular basis and simply purge the live data - in some simple scenarios like yours, there might effectively be NO live data - it all goes straight into the warehouse. The dimensional model has a lot of advantages when slicing data different ways and storing huge numbers of facts with a variety of dimensions. Even in the data warehouse scenario, often fact tables are partitioned by date.

It might not seem like your data has this (Town and Date are your only explicit dimensions); however, in most data warehouses dimensions can snowflake, or there can be redundancy, so other attributes of the fact would be stored at load time instead of being snowflaked out, for efficiency - like State, Zip Code, WasItRaining, IsStationUrban (contrived).
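
A minimal sketch of that kind of star schema (all names and columns here are illustrative, including the contrived dimensions above):

CREATE TABLE dbo.DimTown (
  TownKey int IDENTITY(1,1) NOT NULL PRIMARY KEY,
  TownName varchar(100) NOT NULL,
  State varchar(50) NOT NULL,
  ZipCode varchar(10) NOT NULL,
  IsStationUrban bit NOT NULL
);

CREATE TABLE dbo.FactWeather (
  TownKey int NOT NULL REFERENCES dbo.DimTown (TownKey),
  DateKey int NOT NULL, -- would normally point at a DimDate table
  Temp int NOT NULL,
  WasItRaining bit NOT NULL
);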

This might seem silly, but when you start to mine the data for results in data warehouses, this makes asking questions like - on a day with rain in urban environments, what was the average temperature in Maine? - just that little bit easier to get at without joining a whole bunch of tables (i.e. it doesn't require a lot of expertise on your normalized model and performs very quickly). Kind of like useless stats in baseball - but some apparently turn out to be useful.
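
Against the sketch above, that rain-in-urban-Maine question becomes a single join (again, just an illustration):

SELECT AVG(f.Temp) AS AvgTemp
FROM dbo.FactWeather AS f
JOIN dbo.DimTown AS t
  ON t.TownKey = f.TownKey
WHERE f.WasItRaining = 1
  AND t.IsStationUrban = 1
  AND t.State = 'Maine';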

Cade Roux