views:

91

answers:

3

Let's say you're designing the DB schema for the next stack overflow and more specifically the part of the schema that handles question ratings.

I assume you'd use a table like:
ratings(question_id, user_id, rating)
... that will both record ratings and make sure no user votes twice on the same question.

That table alone could handle rating data but it might result in slow queries.

Taking performance into consideration, would you consider storing the sum of ratings for each question in the questions table, even though this data would be redundant since it's derivative from the data in the ratings table?

A: 

If you're planning to pre-aggregate tables it would be worth looking at materialised views (indexed views in T-SQL).

Steve Homer
+6  A: 

I would generally first start with a normalized model, not de-normalizing the sum of ratings in the question table.

Then, when the application is working well enough, I would do some performance testings, to determine whether the application handles load good enough -- compared to the load I expect to have in production.

If it doesn't handle load well enough, I would check for bottlenecks -- and correct the most important ones, until the application does well.

Once the application is in production, if the website has lots opf users, it'll be time to make some additionnal optimizations.


To make things simple :

  • Don't over-optimize
  • Get your application working
  • Once it works, benchmark it
  • If / when needed, optimize


In the end, yes, maybe, de-normalizing the sum of ratings to the questions table might help ; but do you need to do it ?

That is the real question ;-)

Pascal MARTIN
+1 - for the canonical approach
APC
+1 never optimize prematurely. You may (will probably) find that the query you thought would be a bottleneck gets optimized away by the database, while some other query you thought would run fast is a bottleneck. Once you've determined a bottleneck (in the real, implemented system), denormalizing is one thing you can do to speed up queries. See http://stackoverflow.com/questions/2161264/tips-for-improving-performance-of-db-that-is-above-size-40-gb-sql-server-2005-a and http://stackoverflow.com/questions/2079724/optimize-database-for-web-usage-lots-more-reading-than-writing
BlueRaja - Danny Pflughoeft
A: 

In general - it is valid approach to store aggregate values, if you know that data is read much more frequently then written. In this specific case I would also consider making phisical design of the answers table in the way, which makes aggregation cheap. To do so I would make clustered index defined on query_id, answer_id. As a result only several DB pages will be read from the disk to get all answers for the specific query.

David Gruzman