views:

106

answers:

5

I'm fairly new to working with relational databases, but have read a few books and know the basics of good design.

I'm facing a design decision, and I'm not sure how to continue. Here's a very over simplified version of what I'm building: People can rate photos 1-5, and I need to display the average votes on the picture while keeping track of the individual votes. For example, 12 people voted 1, 7 people voted 2, etc. etc.

The normalization freak of me initially designed the table structure like this:

Table pictures
id* | picture | userID | 

Table ratings
id* | pictureID | userID | rating

With all the foreign key constraints and everything set as they shoudl be. Every time someone rates a picture, I just insert a new record into ratings and be done with it.

To find the average rating of a picture, I'd just run something like this:

SELECT AVG(rating) FROM ratings WHERE pictureID = '5' GROUP by pictureID 

Having it setup this way lets me run my fancy statistics to. I can easily find who rated a certain picture a 3, and what not.

Now I'm thinking if there's a crapload of ratings (which is very possible in what I'm really designing), finding the average will became very expensive and painful.

Using a non-normalized version would seem to be more efficient. e.g.:

Table picture
id | picture | userID | ratingOne | ratingTwo | ratingThree | ratingFour | ratingFive

To calculate the average, I'd just have to select a single row. It seems so much more efficient, but so much more uglier.

Can someone point me in the right direction of what to do? My initial research shows that I have to "find the right balance", but how do I go about finding that balance? Any articles or additional reading information would be appreciated as well.

Thanks.

+1  A: 

What would these ratingOne to ratingFive fields contain? The number of votes received? Then you won't know who cast the vote. If you really do need to denormalize, I'd just add an "average rating" field to the the picture table, and update that whenever a vote is cast (and keep the ratings table as is).

More generally, don't get caught in premature optimalisation. Try writing a test script which creates 100.000 pictures and 1 million ratings (or whatever figure you want to support), and see how long your AVG query takes. Chances are it will still be plenty fast. Make sure your "ratings" table has an index on pictureID, so the DB doesnt need to traverse the million rows.

Alexander Malfait
Thanks. I'll keep that in mind. I'll focus on writing test cases and see how it performs next time.
Foo
+3  A: 

Your normalized approach makes a lot of sense, the denormalized one doesn't.


In my experience (Telco Performance Management, hundreds of thousands of datapoints per 1/4 hour) we would do the following:

Table: pictures
id* | picture | userID | avg_rating | rating_count

Table: ratings
id* | pictureID | userID | rating

For the telco the pictures rating would be re-calculated once daily, you should do it periodical (e.g. hourly )or every time you insert (re-calc for the picture rated, not the entire table). This depends on the amounts of ratings you get.


In the telco we also keep the rating-date in what is your 'pictures' table and a 1/4h timestamp in the ratings table, but I don't think you need that level of detail.


The 'denormalization' is to move a calculateable fact (count (rating) and avg(rating)) to the pictures table. This saves CPU cycles, but costs more storage.

lexu
+1, I would have recommend the same ...
Nitin Midha
+1  A: 

In RDBMS world, denormalization means "I want to increase the query efficiency at the cost of increased maintenance while still retaining model correctness"

In your case, the efficiency will be slightly increased indeed (since all ratings are always retrieved from same data page).

But what about model correctness?

With this design, you, first, don't know who made the votes (this information is not stored anymore), and, second, cannot rate the picture more than five times.

Since you initial model didn't have any of these restrictions, I believe that this very kind of denormalization is not what you really want.

Quassnoi
+1  A: 

A nice way to enjoy both worlds is using Mysql Trigger. http://dev.mysql.com/doc/refman/5.0/en/triggers.html

Now add a trigger that when ever a user rate a picture it will update the avg_rating in the pictures tables. (using the same select you have stated)

Now when you select, you can select on one table only. And it is always updated. And if you wish to get the exact information of who rate which picture you can select from the rating table too.

aviv
+1  A: 

this is how i would approach the problem http://pastie.org/879604

drop table if exists picture;
create table picture
( 
 picture_id int unsigned not null auto_increment primary key,
 user_id int unsigned not null, -- owner of the picture, the user who uploaded it
 tot_votes int unsigned not null default 0, -- total number of votes 
 tot_rating int unsigned not null default 0, -- accumulative ratings 
 avg_rating decimal(5,2) not null default 0, -- tot_rating / tot_votes
 key picture_user_idx(user_id)
)engine=innodb;

insert into picture (user_id) values 
 (1),(2),(3),(4),(5),(6),(7),(1),(1),(2),(3),(6),(7),(7),(5);


drop table if exists picture_vote;
create table picture_vote
( 
 picture_id int unsigned not null,
 user_id int unsigned not null,-- voter
 rating tinyint unsigned not null default 0, -- rating 0 to 5
 primary key (picture_id, user_id)
)engine=innodb;

delimiter #

create trigger picture_vote_before_ins_trig before insert on picture_vote
for each row
begin
 declare total_rating int unsigned default 0;
 declare total_votes int unsigned default 0;

 select tot_rating + new.rating, tot_votes + 1 into total_rating, total_votes 
   from picture where picture_id = new.picture_id;

 -- counts/stats
 update picture set
    tot_votes = total_votes, tot_rating = total_rating, 
    avg_rating = total_rating / total_votes
 where picture_id = new.picture_id;

 end#
 delimiter ;

hope this helps :)

f00