views: 75

answers: 4
I have a constant flux of data. All of it must be stored in the database with a timestamp. The data arrives at 5-minute intervals, and a select of the latest data is made within the same interval, in pseudo SQL:

SELECT * FROM TB_TABLE WHERE [TIMESTAMP] = (SELECT MAX([TIMESTAMP]) FROM TB_TABLE)

As this table grows really big (gigabytes), I did a premature optimization, splitting it into two tables: one for all data (inserts only), and another for just the latest data (inserts, deletes, and selects).

I wonder if this duplication is a good thing to do, since I have no metrics to prove that it improved my application's performance. As a general guideline, would you recommend what I did?

Update: BTW, I use MS SQL Server 2005 and .NET C# LINQ to SQL.
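
To illustrate, a simplified sketch of the split (the table and column names here are illustrative, not the real schema):

-- History table: insert-only, grows without bound.
CREATE TABLE TB_TABLE_ALL (
    SensorId    INT      NOT NULL,
    Value       FLOAT    NOT NULL,
    [TIMESTAMP] DATETIME NOT NULL
);

-- Latest table: holds only the most recent batch.
CREATE TABLE TB_TABLE_LATEST (
    SensorId    INT      NOT NULL,
    Value       FLOAT    NOT NULL,
    [TIMESTAMP] DATETIME NOT NULL
);

-- Every 5 minutes: append the batch to the history table...
INSERT INTO TB_TABLE_ALL (SensorId, Value, [TIMESTAMP])
VALUES (1, 42.0, GETDATE());

-- ...and replace the contents of the small latest table.
DELETE FROM TB_TABLE_LATEST;
INSERT INTO TB_TABLE_LATEST (SensorId, Value, [TIMESTAMP])
VALUES (1, 42.0, GETDATE());

-- The recurring select reads only the small table.
SELECT SensorId, Value, [TIMESTAMP] FROM TB_TABLE_LATEST;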

+2  A: 

Splitting a table with a high input volume into a write-optimized "recent" table and a read-optimized "archive" table is generally a pretty good optimization. It does increase complexity, so you don't want to do it where it's not needed, but it's reasonable if you're sure the table in question is going to get tons of data.

Ryan Brunner
+1  A: 

I wouldn't recommend the approach you've taken. If the intent was to improve application performance, it would have been more appropriate to collect performance metrics first. If a trend indicated decreasing performance as the amount of data grew, then it would be clear that some database change was appropriate.

Assuming your primary concern is the performance of selects against a large table, steps like applying good indexes and replacing "select *" with just the columns you want might be a better place to start than duplicating data across multiple tables. If your queries had a substantial number of joins, I could see that having a negative impact on your performance. In that case, creating an additional table that eliminated the need for joins in your queries would be a good optimization.
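As a rough sketch of that starting point, assuming a DATETIME column named TIMESTAMP and illustrative value columns SensorId and Value:

-- Index the timestamp so finding MAX([TIMESTAMP]) and filtering on it
-- no longer scans the whole table; INCLUDE covers the selected columns.
CREATE INDEX IX_TB_TABLE_TIMESTAMP
    ON TB_TABLE ([TIMESTAMP]) INCLUDE (SensorId, Value);

-- Select only the columns the application actually needs.
SELECT SensorId, Value, [TIMESTAMP]
FROM TB_TABLE
WHERE [TIMESTAMP] = (SELECT MAX([TIMESTAMP]) FROM TB_TABLE);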

Scott A. Lawrence
+1  A: 

You didn't mention what database you are using, but I can think of a couple of possible quick optimizations. How many gigabytes are we talking about?

1) Calculating MAX(TIMESTAMP) can be expensive given a large number of rows. You probably already know what this value is; store it in a separate table, a config file, or something similar (see the sketch after this list). This will probably be your biggest optimization.

2) Add another column to flag recent rows. When the next batch starts, run an UPDATE with SET recent = false WHERE recent = true, then write all the new records with recent = true. You may also be able to limit the size of the index by adding a WHERE condition to it (a partial index; syntax varies by database): CREATE INDEX foo_index ON TB_TABLE (recent) WHERE recent = true;

3) Make sure your database server itself is properly tuned: key and sort buffers should be sized appropriately for your dataset. Most open-source databases ship pre-tuned for a developer's workstation, not a production workload.

4) Reconsider your schema. Are you sure you need all those records? Are you recording all the data, and not just the data that changed? I've made good use of two timestamps in this situation: one for the last load and one for the last change.
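
A minimal sketch of suggestion 1, assuming an illustrative one-row helper table named TB_LATEST_TS (not part of the original schema):

-- Keep the most recent batch timestamp in a tiny one-row table so the
-- 5-minute read never has to compute MAX() over the big table.
CREATE TABLE TB_LATEST_TS (
    LatestTimestamp DATETIME NOT NULL
);

-- When a batch is written, record its timestamp (seed the row once, then UPDATE).
DECLARE @BatchTimestamp DATETIME;
SET @BatchTimestamp = GETDATE();
UPDATE TB_LATEST_TS SET LatestTimestamp = @BatchTimestamp;

-- The recurring read becomes an equality lookup instead of a MAX() aggregate.
SELECT t.*
FROM TB_TABLE t
JOIN TB_LATEST_TS l ON t.[TIMESTAMP] = l.LatestTimestamp;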

Trey
5 GB/month. SQL Server 2005
Jader Dias
+1  A: 

I wonder if table partitioning would be helpful. I haven't personally used it, so I can't speak from experience, but this sounds like an appropriate situation in which to use it.
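
For reference, a rough sketch of what range partitioning by month could look like in SQL Server 2005 (partitioning requires Enterprise Edition; the partition, table, and column names below are illustrative):

-- Partition function: maps a DATETIME to a monthly range.
CREATE PARTITION FUNCTION pf_ByMonth (DATETIME)
AS RANGE RIGHT FOR VALUES ('20090101', '20090201', '20090301');

-- Partition scheme: here every partition goes to the PRIMARY filegroup,
-- but older months could be placed on cheaper storage.
CREATE PARTITION SCHEME ps_ByMonth
AS PARTITION pf_ByMonth ALL TO ([PRIMARY]);

-- The history table is created on the scheme, partitioned by its timestamp.
CREATE TABLE TB_TABLE_PARTITIONED (
    SensorId    INT      NOT NULL,
    Value       FLOAT    NOT NULL,
    [TIMESTAMP] DATETIME NOT NULL
) ON ps_ByMonth ([TIMESTAMP]);

Queries that filter on the timestamp then touch only the relevant partitions, and old months can be switched out or archived without scanning the whole table.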

Nathan Koop
Never heard of it. I'll google it, thanks.
Jader Dias