views:

167

answers:

5

I am working on a website with a simple normalized database.

There is a table called Pages and a table called Views. Each time a Page is viewed, a unique record of that View is recorded in the Views table.

When displaying a Page on the site, I use a simple MySQL COUNT() to total up the number of Views for display.

Database design seems fine, except for this problem: I am at a loss for how to retrieve the top 10 most viewed pages among thousands.

Should I denormalize the Pages table by adding a Pages.views column to hold the total number of views for each page? Or is there an efficient way to query for the top 10 most viewed pages?

+1  A: 

I would probably include the views column in the Pages table.

It seems like a perfectly reasonable breaking of normalization to me. Especially since I can't imagine you deleting views so you wouldn't expect the count to get out of whack. Referential integrity doesn't seem super-critical in this case.

Dana
+6  A: 
   SELECT p.pageid, count(*) as viewcount FROM 
   pages p
   inner join views v on p.pageid = v.pageid
   group by p.pageid
   order by count(*) desc   
   LIMIT 10 OFFSET 0;

I can't test this, but something along those lines. I would not store the value unless I have to due to performance constraints (I just learned the term "premature optimization", and it seems to apply if you do).

cdonner
+1 for the mention of not storing the totals until a performance issue arises.
Thilo
+1  A: 

Database normalization is all about the most efficient / least redundant way to store data. This is good for transaction processing, but often directly conflicts with the need to efficiently get the data out again. The problem is usually addressed by having derived tables (indexes, materialized views, rollup tables...) with more accessible, pre-processed data. The (slightly dated) buzzword here is Data Warehousing.

I think you want to keep your Pages table normalized, but have an extra table with the totals. Depending on how recent those counts need to be, you can update the table when you update the original table, or you can have a background job to periodically recalculate the totals.

You also want to do this only if you really run into a performance problem, which you will not unless you have a very large number of records, or a very large number of concurrent accesses. Keep your code flexible to be able to switch between having the table and not having it.

Thilo
A: 

Denormalizing would definitely work in this case. Your loss is the extra storage room used up by the extra column.

Alternatively you could set up a scheduled job to populate this information on a nightly basis, whenever your traffic is low, x period of time.

In this case you would be losing the ability to instantly know your page counts unless you run this query manually.

Denormalization can definitely be employed to increase performance.

--Kris

KSimons
The loss is both the extra column and the need to keep it consistent. I agree that it is well justified in this case though.
thomasrutter
+3  A: 

It depends on the level of information you are trying to maintain. If you want to record who viewed when? Then the separate table is fine. Otherwise, a column for Views is the way to go. Also If you keep a separate column, you'll find that the table will be locked more often since each page view will try to update the column for its corresponding row.

Select pageid, Count(*) as countCol from Views
group by pageid order by countCol DESC
LIMIT 10 OFFSET 0;
fasih.ahmed