ansaurus

Question

mySQL Efficiency Issue - How to find the right balance of normalization...?

Answer 1

+1 A:

What would these ratingOne to ratingFive fields contain? The number of votes received? Then you won't know who cast the vote. If you really do need to denormalize, I'd just add an "average rating" field to the picture table, and update that whenever a vote is cast (and keep the ratings table as is).

More generally, don't get caught up in premature optimization. Try writing a test script which creates 100,000 pictures and 1 million ratings (or whatever figure you want to support), and see how long your AVG query takes. Chances are it will still be plenty fast. Make sure your "ratings" table has an index on pictureID, so the DB doesn't need to traverse the million rows.
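
A rough sketch of such a test in MySQL. It assumes a ratings table with columns (id, pictureID, userID, rating) where id is auto_increment; the helper table "digits", the index name, the userID range and the picture id 42 are made up for illustration:

```sql
-- helper table with the digits 0..9 (hypothetical, only used to generate rows)
create table digits (d int unsigned not null primary key);
insert into digits (d) values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);

-- 10^6 random ratings: six-way cross join of the digits table
insert into ratings (pictureID, userID, rating)
select 1 + floor(rand() * 100000),  -- pictureID 1..100000
       1 + floor(rand() * 50000),   -- userID 1..50000 (arbitrary range)
       1 + floor(rand() * 5)        -- rating 1..5
  from digits d1, digits d2, digits d3, digits d4, digits d5, digits d6;

-- index so a lookup by pictureID does not scan the whole table
create index ratings_pictureID_idx on ratings (pictureID);

-- check the plan: the "key" column should name the index above
explain
select avg(rating) from ratings where pictureID = 42;

-- then time the real query
select avg(rating) from ratings where pictureID = 42;
```

Running the last query with and without the index should show whether denormalizing is worth the trouble at your data volume.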

Alexander Malfait 2010-03-21 08:43:08

Thanks. I'll keep that in mind. I'll focus on writing test cases and see how it performs next time.

Foo 2010-03-21 19:54:51

Answer 2

+3 A:

Your normalized approach makes a lot of sense; the denormalized one doesn't.

In my experience (Telco Performance Management, hundreds of thousands of datapoints per 1/4 hour) we would do the following:

Table: pictures
id* | picture | userID | avg_rating | rating_count

Table: ratings
id* | pictureID | userID | rating

For the telco, the equivalent of your picture ratings would be recalculated once daily. You should do it either periodically (e.g. hourly) or every time you insert a rating (recalculating only the picture that was rated, not the entire table). Which one is better depends on the number of ratings you get.

In the telco we also keep the rating-date in what is your 'pictures' table and a 1/4h timestamp in the ratings table, but I don't think you need that level of detail.

The 'denormalization' here is to move calculable facts (count(rating) and avg(rating)) into the pictures table. This saves CPU cycles, but costs more storage.
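
Both recalculation variants could look roughly like this in MySQL. This is only a sketch against the two tables outlined above (pictures.id, pictures.avg_rating, pictures.rating_count, ratings.pictureID, ratings.rating); the picture id 42 is a placeholder:

```sql
-- periodic job (e.g. hourly): refresh the stats of every rated picture
update pictures p
  join (select pictureID,
               avg(rating) as avg_rating,
               count(*)    as rating_count
          from ratings
         group by pictureID) r on r.pictureID = p.id
   set p.avg_rating   = r.avg_rating,
       p.rating_count = r.rating_count;

-- on insert: recalculate only the picture that was just rated
update pictures
   set avg_rating   = (select avg(rating) from ratings where pictureID = 42),
       rating_count = (select count(*)    from ratings where pictureID = 42)
 where id = 42;
```

Note that the periodic statement uses an inner join, so pictures without any ratings keep whatever values they already have.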

lexu 2010-03-21 08:43:42

+1, I would have recommended the same ...

Nitin Midha 2010-03-21 09:04:41

Answer 3

+1 A:

In the RDBMS world, denormalization means "I want to increase the query efficiency at the cost of increased maintenance while still retaining model correctness".

In your case, the efficiency will indeed increase slightly (since all ratings are always retrieved from the same data page).

But what about model correctness?

With this design you, first, no longer know who cast the votes (that information is not stored anymore) and, second, cannot rate a picture more than five times.

Since your initial model didn't have either of these restrictions, I believe that this particular kind of denormalization is not what you really want.

Quassnoi 2010-03-21 08:43:48

Answer 4

+1 A:

A nice way to enjoy the best of both worlds is to use a MySQL trigger. http://dev.mysql.com/doc/refman/5.0/en/triggers.html

Add a trigger so that whenever a user rates a picture, it updates the avg_rating in the pictures table (using the same select you have stated).

Now when you select, you can select from one table only, and it is always up to date. And if you wish to get the exact information about who rated which picture, you can select from the ratings table too.
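
A minimal sketch of such a trigger. It assumes tables named pictures (id, avg_rating) and ratings (pictureID, rating), and that the select in question is a plain AVG over the picture's ratings; the trigger name is made up:

```sql
delimiter #

create trigger ratings_after_ins_trig after insert on ratings
for each row
begin
  -- recompute the average for the picture that was just rated
  update pictures
     set avg_rating = (select avg(rating)
                         from ratings
                        where pictureID = new.pictureID)
   where id = new.pictureID;
end#

delimiter ;
```

An AFTER INSERT trigger is used here so that the newly inserted rating is already included in the average. If ratings can also be changed or removed, similar triggers would be needed for update and delete.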

aviv 2010-03-21 08:59:03

Answer 5

+1 A:

This is how I would approach the problem: http://pastie.org/879604

drop table if exists picture;
create table picture
( 
 picture_id int unsigned not null auto_increment primary key,
 user_id int unsigned not null, -- owner of the picture, the user who uploaded it
 tot_votes int unsigned not null default 0, -- total number of votes 
 tot_rating int unsigned not null default 0, -- accumulative ratings 
 avg_rating decimal(5,2) not null default 0, -- tot_rating / tot_votes
 key picture_user_idx(user_id)
)engine=innodb;

insert into picture (user_id) values 
 (1),(2),(3),(4),(5),(6),(7),(1),(1),(2),(3),(6),(7),(7),(5);


drop table if exists picture_vote;
create table picture_vote
( 
 picture_id int unsigned not null,
 user_id int unsigned not null,-- voter
 rating tinyint unsigned not null default 0, -- rating 0 to 5
 primary key (picture_id, user_id)
)engine=innodb;

delimiter #

create trigger picture_vote_before_ins_trig before insert on picture_vote
for each row
begin
 declare total_rating int unsigned default 0;
 declare total_votes int unsigned default 0;

 select tot_rating + new.rating, tot_votes + 1 into total_rating, total_votes 
   from picture where picture_id = new.picture_id;

 -- counts/stats
 update picture set
    tot_votes = total_votes, tot_rating = total_rating, 
    avg_rating = total_rating / total_votes
 where picture_id = new.picture_id;

 end#
 delimiter ;
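
As a quick check of the trigger above, something like the following could be run against the sample data (the voter ids 2 and 3 and the ratings are arbitrary):

```sql
-- two users rate picture 1; the trigger maintains the counters on picture
insert into picture_vote (picture_id, user_id, rating) values (1, 2, 5);
insert into picture_vote (picture_id, user_id, rating) values (1, 3, 4);

select picture_id, tot_votes, tot_rating, avg_rating
  from picture
 where picture_id = 1;
-- expected: tot_votes = 2, tot_rating = 9, avg_rating = 4.50
```

A second vote by the same user on the same picture is rejected by the primary key on (picture_id, user_id).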

Hope this helps :)

f00 2010-03-21 14:12:44
