We have 2 tables. One holds measurements, the other holds timestamps (one for every minute); every measurement holds a FK to a timestamp. We have 8M (million) measurements and 2M timestamps.
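
For concreteness, a minimal sketch of the schema as described (all table, column, and type names below are my own assumptions):

    -- Hypothetical schema matching the description above; names and types are assumptions.
    CREATE TABLE dbo.Timestamps (
        TimestampId    INT      NOT NULL PRIMARY KEY,  -- ~2M rows, one per minute
        TimestampValue DATETIME NOT NULL
    );

    CREATE TABLE dbo.Measurements (
        MeasurementId    BIGINT NOT NULL PRIMARY KEY,  -- ~8M rows
        TimestampId      INT    NOT NULL
            REFERENCES dbo.Timestamps (TimestampId),   -- FK to the timestamp table
        MeasurementValue FLOAT  NOT NULL
    );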

We are creating a report database via replication, and my first solution was this: when a new measurement is received via the replication process, look up the right timestamp and add its value to the measurement table. Yes, it's duplication of data, but it is for reporting, and since we have measurements every 5 minutes and users can query for yearly data (about 105,000 measurements), we have to optimize for speed.
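
Roughly, that first solution amounts to something like the following sketch (the extra column, the @-variables, and their values are placeholders, not the real replication code):

    -- Denormalized copy at insert time; column and variable names are assumptions.
    ALTER TABLE dbo.Measurements ADD TimestampValue DATETIME NULL;
    GO

    -- When a replicated measurement arrives, look up and copy the timestamp value:
    DECLARE @MeasurementId BIGINT, @TimestampId INT, @Value FLOAT;
    SELECT @MeasurementId = 1, @TimestampId = 42, @Value = 3.14;

    INSERT INTO dbo.Measurements
        (MeasurementId, TimestampId, MeasurementValue, TimestampValue)
    SELECT @MeasurementId, @TimestampId, @Value, t.TimestampValue
    FROM dbo.Timestamps AS t
    WHERE t.TimestampId = @TimestampId;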

But a co-developer said: you don't have to do that, we'll just query with a join (on the two tables); SQL Server is so fast, you won't see the difference.

My first reaction was: a join on two tables with 8M and 2M records can't make 'no difference'.

What is your first feeling on this?

EDIT: new measurements: 400 records per 5 minutes

EDIT 2: maybe the question is not so clear:

The first solution is to get the data from the timestamp table and copy it to the measurement table when the measurement record is inserted. In that case we have an extra action when the record is inserted AND an extra (duplicated) timestamp value. In this case we only query ONE table because it holds all the data.

The second solution is to join the two tables in a query.
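
In query form, the second solution would look something like this (same assumed names as in the schema sketch above):

    -- Normalized join; e.g. a yearly report over the assumed schema.
    SELECT m.MeasurementId, m.MeasurementValue, t.TimestampValue
    FROM dbo.Measurements AS m
    JOIN dbo.Timestamps   AS t ON t.TimestampId = m.TimestampId
    WHERE t.TimestampValue >= '20090101'
      AND t.TimestampValue <  '20100101';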

+3  A: 

With the proper index, the join will make no difference*. My initial thought is that if the report is querying over the entire dataset, the join might actually be faster, because there are literally 6 million fewer timestamps that it has to read from disk.

*This is just a guess based on my experience with tables with millions of records. Your results will vary based on your queries.
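
As a sketch of the kind of indexing this answer has in mind (index names and included columns are my guess, not something from the question):

    -- Support the join from the measurement side and the date filter on the timestamp side.
    CREATE INDEX IX_Measurements_TimestampId
        ON dbo.Measurements (TimestampId) INCLUDE (MeasurementValue);

    CREATE INDEX IX_Timestamps_TimestampValue
        ON dbo.Timestamps (TimestampValue);   -- clustering key (TimestampId) rides along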

tster
I would guess this to be true, but again it depends on the queries, and the OP may just have to set up a benchmark using both methods and see which performs better.
Earlz
+2  A: 

If the query just retrieves the data for given date ranges, there will be a merge join, that is, a range scan for each of the two tables. Since the timestamp table presumably contains only timestamps, this shouldn't be expensive.
On the other hand, if you have only one table and an index on the date column, the index itself becomes larger and more expensive to scan.

So, with properly constructed indexes and queries, I wouldn't expect a significant difference in performance.
I'd suggest you keep the properly normalized design until you start having performance problems that force you to change it. Then you need to carefully analyze query plans and measure performance with different options; there are lots of things that could matter in your particular case.
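
For example, one way to compare the two options is to run both query shapes with I/O and timing statistics on and look at the actual execution plans (a sketch over the assumed schema; it presumes an index on the date column exists for the denormalized variant):

    SET STATISTICS IO ON;
    SET STATISTICS TIME ON;

    -- Denormalized variant (solution 1): single-table range scan.
    SELECT m.MeasurementValue, m.TimestampValue
    FROM dbo.Measurements AS m
    WHERE m.TimestampValue >= '20090101'
      AND m.TimestampValue <  '20100101';

    -- Normalized variant (solution 2): the join from the sketch above.
    -- Check the actual plan to see whether a merge join / range scan is chosen.
    SELECT m.MeasurementValue, t.TimestampValue
    FROM dbo.Measurements AS m
    JOIN dbo.Timestamps   AS t ON t.TimestampId = m.TimestampId
    WHERE t.TimestampValue >= '20090101'
      AND t.TimestampValue <  '20100101';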

VladV
+2  A: 

I'd create an Indexed View (similar to a Materialized view in Oracle) which joins the tables using appropriate indexes.
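
A minimal sketch of what that could look like on the assumed schema (indexed views require SCHEMABINDING, two-part table names, and a unique clustered index; all object names here are made up):

    -- Materialize the join once; SQL Server keeps it up to date on every insert.
    CREATE VIEW dbo.vw_MeasurementReport
    WITH SCHEMABINDING
    AS
    SELECT m.MeasurementId, m.MeasurementValue, m.TimestampId, t.TimestampValue
    FROM dbo.Measurements AS m
    JOIN dbo.Timestamps   AS t ON t.TimestampId = m.TimestampId;
    GO

    CREATE UNIQUE CLUSTERED INDEX IX_vw_MeasurementReport
        ON dbo.vw_MeasurementReport (MeasurementId);

Note that on non-Enterprise editions the reporting queries need the WITH (NOEXPAND) hint to make use of the view's index.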

Winston Smith
If the plain old join (perhaps encapsulated in a view) doesn't give you what you want, the indexed view is a good thing to try. I personally think the plain old join will be fine, but if not, the indexed view will make *and maintain* a denormalized copy of the data without you having to deal with it.
Robert Calhoun
Alright, I'll check out the indexed view.
Michel
+2  A: 

Frankly, in this case your best bet is to try both solutions and see which one is better. Performance tuning is an art when you start talking about large data sets, and it is highly dependent not only on the database design you have but also on the hardware and on whether you are using partitioning, etc. Be sure to test both getting the data out and putting the data in. Since you have so many inserts, insert speed is critical, and the index you would need on the datetime field is critical to select performance, so you really need to test this thoroughly. Don't forget about dumping the cache when you test. And test multiple times and, if possible, test under a typical query load.
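
For reference, "dumping the cache" on a test server usually means something like this (these commands are server-wide, so only run them on a test box):

    CHECKPOINT;                -- flush dirty pages so clean buffers can be dropped
    DBCC DROPCLEANBUFFERS;     -- empty the buffer pool, forcing reads from disk
    DBCC FREEPROCCACHE;        -- throw away cached query plans
    -- then run each candidate query and compare cold-cache timings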

HLGEM
That's new to me: 'the cache'?
Michel