views:

101

answers:

3

We have a website that contains a database of places. For each place our users are able to take one of the follow actions which we record:

  1. VIEW - View it's profile
  2. RATING - Rate it on a scale of 1-5 stars
  3. REVIEW - Review it
  4. COMPLETED - Mark that they've been there
  5. WISH LIST - Mark that they want to go there
  6. FAVORITE - Mark that it's one of their favorites

In our database table of places each place contains a count of the number of times each action above was taken as well as the average rating given by users.

  • views
  • ratings
  • avg_rating
  • completed
  • wishlist
  • favorite

What we want to be able to do is generate lists of the top places using the above information. Ideally, we would want to be able to generate this list using a relatively simple SQL query without needing to do any legwork to calculate additional fields or stack rank places against one another. That being said, since we only have about 50,000 places we could run a nightly cron job to calculate some fields such as rankings on different categories if it would make a meaningful difference in the overall results of our top places.

I'd appreciate if you could make some suggestions on how we should think about bubbling the best places to the top, which criteria we should weight more heavily, and given that information - suggest what the MySQL query would need to look like in order to select the top 10 places.

One thing to note is that at this time we are less concerned with the recency of a place being popular - meaning that looking at the aggregate information is fine and that more recent data doesn't need to be weighted more heavily.

Thanks in advance for your help & advice!

+1  A: 

The exact formula for the favorite would need to come from you, we will call it f(x).
For the actual implementation I would add a popularity_score field that I would calculate, as you said from a nightly cron job using f(x) for each row.

Then it is simply a case of doing a "select place name from table order by popularity_score desc".

Ok - Let's give it a stab popularity_score = (FAVORITE * 3 + COMPLETED * 2 + WISHLIST) * RATING * VIEW / AVG_VIEWS_OF_ALL_PROFILES

Romain Hippeau
@Romain - I was hoping was to get some input on more specifically what f(x) might look like. Any thoughts? I have mine but I would love some outside perspective as well.
Russell C.
@Russell C - I gave it a stab
Romain Hippeau
@Roman - Thanks! I'd love to better understand the logic of why you chose certain weights, etc. Also, how do you think this would compare versus stack ranking everything on each dimension and figuring out an average rank for each place across all categories which is one idea we've been toying with? Pros/cons of one approach over another? Thanks again!
Russell C.
@Russell C. If it is your favorite place then you probably have taken the time to go there and it is the one you liked the most out of all places you have been (RANK 3) If you have been there then it has been important enough for you to go there. (RANK 2) If it is in your wishlist then it is not important enough for you to have been yet. (RANK 1)
Romain Hippeau
@Roman - Thanks for the explanation. I've tweaked the algorithm a little bit but generally it seems to be working well. Thanks again for the suggestion.
Russell C.
@Russell C. You might also be able to do things dynamically as timdev says, removing the requirement for the cron job. Maybe a trigger on the criteria table to update the popularity_score on the fly ?
Romain Hippeau
+10  A: 

Use the lower bound of the Wilson score confidence interval for a Bernoulli parameter!

Paul Creasey
That's a very neat solution!
timdev
Damn... +1 that is very, very interesting
Ben
@Paul - I love the suggestion, very interesting read. Any idea on how to translate this into a MySQL query or some Perl code? Also, have you used this before and had success? Thanks for the great suggestion!
Russell C.
+1  A: 

I don't have an opinion on how to weigh things.

That said, why not just add a popularity column to the location table? All of a sudden, your SQL query is incredibly simple.

The tricky part, of course, is figuring out how and when to update that value. But since you're saving all of the activity data, you can always regenerate the popularity values from the log entries.

That way, you get nice fast queries for "most popular" locations, and if you want to change the way popularity is computed, you can do so at will.

If you're clever, you might be able to devise a simple enough formula so that popularity can be tracked in real time. For instance, if we only cared about average ratings, you can modify the average rating with just three variables: the current average rating, the number of times the object has been rated, and the new rating value.

Of course, things get more complex when you start mixing in how many times the object has been viewed, reviewed, favorited, etc ... but you might find that you can devise a method that's computationally cheap enough that you can update the overall popularity value on just about every action.

timdev
@Timdev - that was my thinking as well. What I was hoping was to get some input on more specifically what that might look like. Any thoughts?
Russell C.
+1 For suggesting to do it on the fly
Romain Hippeau