I'm curious about what would be the best way to model this for optimized performance... I'm not as concerned about real-time data integrity.

I'll continue with the stackoverflow example

Question
  id
  title
Votes
  id
  user
  question

A question has many votes

For many queries however, we're only concerned with the aggregate number of votes (e.g. to show next to the question).

Good relational db theory would create the two entities (Q and V) as separate relations, requiring a join and then a COUNT or SUM aggregate.

Another possibility is to break normal form and occasionally materialize the aggregate vote count as an attribute in Question (e.g. Question.votes). Performance is gained on reads; however, depending on how stale you are willing to let your "votes" data get, it requires a lot more writes to that Question record... in turn hindering performance.

Other techniques involving caching, etc. can be used. But I'm just wondering: performance-wise, what's the best solution? Let's say the site is high traffic and receives considerably more votes than questions.

Open to non-relational models as well.

+1  A: 

It's unlikely that a join will be too slow in this case, especially if you have an index on (question) in the Votes table.
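As a rough sketch (using the table and column names from the question; the index name is my own), the index and the aggregate query could look like:

 CREATE INDEX IX_Votes_Question ON Votes (question);

 SELECT q.id, q.title, COUNT(v.question) AS votecount
 FROM Questions q
 LEFT JOIN Votes v ON v.question = q.id
 GROUP BY q.id, q.title;

With that index in place, the count for a single question is a narrow index range scan rather than a table scan.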

If it is REALLY too slow, you can cache the vote count in the Question table:

 id - title - votecount

You can update the votecount whenever you record a vote. For example, from a stored procedure or directly from your application code.
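A minimal sketch of what that might look like, assuming the schema from the question plus @userId/@questionId parameters supplied by the application; wrapping both statements in one transaction keeps the cached count from drifting on the happy path:

 BEGIN TRANSACTION;

 INSERT INTO Votes ([user], question)
 VALUES (@userId, @questionId);

 UPDATE Questions
 SET votecount = votecount + 1
 WHERE id = @questionId;

 COMMIT TRANSACTION;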

Those updates are tricky, but since you're not that worried about consistency, I guess it's OK if the vote count is sometimes not exactly right. To fix any errors, you can periodically regenerate all cached counts like:

 UPDATE q
 SET q.votecount = agg.cnt
 FROM questions q
 JOIN (
     SELECT q2.id, COUNT(v.question) AS cnt
     FROM questions q2
     LEFT JOIN votes v ON v.question = q2.id
     GROUP BY q2.id
 ) agg ON agg.id = q.id

The aggregate COUNT(v.question) ignores the NULLs produced by the left join, so it returns 0 for a question with no votes, as opposed to COUNT(*), which would return 1.

If locks are an issue, consider using "WITH (NOLOCK)" or "SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED" to bypass locks (again, based on data integrity being a low priority.)
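For example (names taken from the question; this is a sketch of dirty reads, not a recommendation for data you need to be accurate):

 -- per-query hint: read without taking or honoring shared locks
 SELECT votecount
 FROM Questions WITH (NOLOCK)
 WHERE id = @questionId;

 -- or per session, applying to all subsequent reads on this connection
 SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

Both forms can return uncommitted or inconsistent rows, which is acceptable here only because a slightly wrong vote count is tolerable.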

As an alternative to nolock, consider "read committed snapshot", which is meant for databases with heavy read activity and comparatively few writes. You can turn it on with:

ALTER DATABASE YourDb SET READ_COMMITTED_SNAPSHOT ON;

It is available for SQL Server 2005 and higher. This is how Oracle works by default, and it's what Stack Overflow itself uses. There's even a Coding Horror blog entry about it.

Andomar
right. I addressed the materialize-the-vote-count suggestion in the question. I was wondering if there is another way, as this causes double the writes (locking out any reads). I know with proper indexing it should be alright, but if I'm retrieving a lot of questions and perhaps have several many-to-many relationships (e.g. votes and comment counts), the joins become nasty
nategood
Post edited. Be careful that you're not doing premature optimization; there has to be hard proof, backed by numbers, of performance issues before I'd move away from the normal join.
Andomar
+1  A: 

I used indexed views in SQL Server 2005 all over the place for this kind of thing on a social networking site. Our load was definitely a high ratio of reads to writes, so it worked well for us.
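A sketch of such an indexed view (SQL Server syntax; the view and index names are my own, table names from the question). Indexed views require SCHEMABINDING and COUNT_BIG(*), and the unique clustered index is what makes the aggregate materialized:

 CREATE VIEW dbo.QuestionVoteCounts
 WITH SCHEMABINDING
 AS
 SELECT question, COUNT_BIG(*) AS votecount
 FROM dbo.Votes
 GROUP BY question;
 GO

 CREATE UNIQUE CLUSTERED INDEX IX_QuestionVoteCounts
 ON dbo.QuestionVoteCounts (question);

Once indexed, SQL Server maintains the counts automatically as votes are inserted or deleted, so you get the cached-count read performance without hand-written update logic.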

I agree with hainstech. Create an indexed view over the Votes table, grouped by question with a count.
JD
A: 

I would suggest keeping the vote count in memory for the lifetime of the application. Why hit a db for something as simple as a count, when at some point you will have loaded the item once and asked what the initial amount was on a request basis? It also has a lot to do with how you are implementing repositories: if your question object lazy loads votes but eager loads the count of votes, then you can speed up the process without issues about keeping it in memory. Still keep the votes in the db; just maintain the count in your application.

Brandon Grossutti