views:

67

answers:

4

Hi,

I was wondering how to best implement a "most viewed" featured (like youtube) on my database.

Let me explain the "most viewed" feature a little bit better: Basically I want to list the most visited pages/video/etc from this day/week/month, see http://www.youtube.com/charts/videos_views for an example.

So I was wondering how to best implement this feature as I can think of many many ways of doing it but all of them have their + and - to them.

Plus I also would love to hear the comments of various programmers on others programmers ideas. Also i would like to start a good conversation on this topic.

Cheers.

Ps. I'm looking specially on how to calculate the time, say most viewed in this month, without having a huge table saving every single view with the datetime. Any idea is welcome - thanks. Pps. I use Mysql and PHP, extra tips for those two are very welcome.

A: 

Off the top of my head, I would have a ItemViews table, that maps the number of views to a an item ID (assuming that there is only one item type). The table could have 2 columns: ItemId and ViewCount. When a new feature receives its first view, I'd insert a new row in that table and initialize ViewCount to 0. Then I'd increment the counter in that row every time I detect a new view.

I could then compute stats from that table. If there is a concept of categories (like in Youtube), I could join the category IDs with the ItemId and group my view counts that way.

svanryckeghem
Yes this is a simple idea but then you end up with a huge table, say for example if I have 1 million pages and every page has something like 10k views then 1mx10k=10billion rows. Now imagine calculating the time for each and every row to get the most viewed for this month. I would need a whole server just for that, surelly there must be a simpler way.
StfnoPad
I think the point is that you only do one insert for the new page / item. After that point you only increment the counter - so you'd only end up with 'x' rows - where 'x' is the number of items viewed. If you had 1m items, but only 10K had been viewed you'd only have 10K rows, not even the million items. A further saving. A great idea and something I'd think to implement in my solution too.
Paul Hadfield
Yes but if you only have a counter then you can't calculate the most viewed for say this week as you don't have date and time information. I'm sure there must be a simple way otherwise I wonder how youtube does it with their billion videos and million views per video. Anyhow, Many Thanks guys.
StfnoPad
+1  A: 

First I'd try to track users by using a cookie, giving them a unique ID as a visit and for that visit. That should help with identification.

I'd try to move the logic to update the most viewed item(s) out of the website codebase. So this just means each web request posts the visit info to a message queue - probably including the time/date, item being viewed, the tracked info from the cookie and maybe IP address / request header.

I'd then set up a service to read from that queue and process the information. That would allow you to upgrade / modify that logic without affecting the site. You'll probably want to handle multiple refreshes, etc to stop people messing with the results - either delibrately or by accident. Also by abstracting this processing out of the main site you're not slowing down your page request time by - you can make the logic to determine / update item views as complex as you like then.

Paul Hadfield
+1  A: 

Simple approach that comes to my mind increment the ViewCount field value on your table by 1 for each unique IP of the users which loaded that page.

Aayan
Just by using IP address you would be limited to logging one request per company if your site was accessed by multiple people behind a company proxy, etc.
Paul Hadfield
Yes but then you can't calculate the "most viewed" in time as you are not saving any of the time information. So by doing this I cant calculate the most viewed for this week as an example.
StfnoPad
Ohh yeah time information is needed as well. That was very simple approach thought just 10 seconds :)) anyway it will lead to the proper approach. Yeah also checking only unique IP will not be enough, cookie information is good. So add all these to the solution, unique IP, cookie, visit time, also have queue to update db, and have all the information is in seperate table rather than just field. It looks like quite sophisticated already.
Aayan
@Aayan: Yes - it does seem to get complex quickly which is why I'd say split the actual logic out into a separate job. Just have the website grab the information needed to record the view and post that to a queue, then have a service do all the grunt work of deduping / recording the visit, etc. It gets more complex when you think it may be updating most viewed per hour, day, week, month, etc. Then a what's hot and so on. If it's out of the main web processing thread it won't slow down the page request time, etc.
Paul Hadfield
why do you process the time when you update the db instead you can store just simple time information as ViewTime than process the time when you load the data as most viewed in day/week/month etc.
Aayan
+1  A: 

have the following tables : 1. views 2. views_hourly_summary 3. views_daily_summary 4. views_monthly_summary 5. views_alltime_summary

have a cron job run at the following intervals:

  1. run every hour and preaggregate the views for that hour from the views table and save the preaggregated result in the views_hourly_summary table, also update the views_alltime_summary table

  2. run at the end of everyday and preaggregate the views for that day from the hours table and save the preaggregated result in the views_daily_summary table

  3. run at the end of everymonth and preaggregate the views for that day from the hours table and save the preaggregated result in the views_daily_summary table

next when fetching results you will have to do some math as follows:

  1. For example, you want to fetch the views for last 4 hours, you would fetch the 3 whole hours data from the hourly table and for the remaining data fetch it from the views table as follows:

    select item_id, sum(views) as views from views_hourly_summary where hour between concat(left(now() - interval 3 hour, 14), '00:00') and concat(left(now(), 14), '00:00') group by item_id

    union

    select item_id, count(1) as views from views where datetime between (now() - interval 4 hour) and concat(left(now() - interval 3 hour, 14), '00:00') or datetime > concat(left(now(), 14), '00:00') group by item_id

ovais.tariq