views:

469

answers:

4

I'd like to populate the homepage of my user-submitted-illustrations site with the "hottest" illustrations uploaded.

Here are the measures I have available:

  • How many people have favourited that illustration
    • votes table includes date voted
  • When the illustration was uploaded
    • illustration table has date created
  • Number of comments (not so good as max comments total about 10 at the moment)
    • comments table has comment date

I have searched around, but don't want user authority to play a part, but most algorithms include that.

I also need to find out if it's better to do the calculation in the MySQL that fetches the data or if there should be a PHP/cron method every hour or so.

I only need 20 illustrations to populate the home page. I don't need any sort of paging for this data.

How do I weight age against votes? Surely a site with less submission needs less weight on date added?

A: 

I've no useful ideas as far as the actual agorithm is concerned, but in terms of implementation, I'd suggest caching the result somewhere, with a periodic update - if the resulting computation results in an expensive query, you probably don't want to slow your response times.

Rob
A: 

Something like:

(count favorited + k) * / time since last activity

The higher k is the less weight has the number of people having it favorited.

You could also change the time to something like the time it first appeared + the time of the last activity, this would ensure that older illustrations would vanish with time.

Georg
+2  A: 

Obviously there is some subjectivity in this - there's no one "correct" algorithm for determining the proper balance - but I'd start out with something like votes per unit age. MySQL can do basic math so you can ask it to sort by the quotient of votes over time; however, for performance reasons, it might be a good idea to cache the result of the query. Maybe something like

SELECT images.url FROM images ORDER BY (NOW() - images.date) / COUNT((SELECT COUNT(*) FROM votes WHERE votes.image_id = images.id)) DESC LIMIT 20

but my SQL is rusty ;-)

Taking a simple average will, of course, bias in favor of new images showing up on the front page. If you want to remove that bias, you could, say, count only those votes that occurred within a certain time limit after the image being posted. For images that are more recent than that time limit, you'd have to normalize by multiplying the number of votes by the time limit then dividing by the age of the image. Or alternatively, you could give the votes a continuously varying weight, something like exp(-time(vote) + time(image)). And so on and so on... depending on how particular you are about what this algorithm will do, it could take some experimentation to figure out what formula gives the best results.

David Zaslavsky
I agree. The amount of measurable popularity should be measured over time, otherwise you'll just end up with a semi-stale homepage. Items that have earned most points over the past 24 hours for example.
Ross
+2  A: 

Many sites that use some type of popularity ranking do so by using a standard algorithm to determine a score and then decaying eternally over time. What I've found works better for sites with less traffic is a multiplier that gives a bonus to new content/activity - it's essentially the same, but the score stops changing after a period of time of your choosing.

For instance, here's a pseudo-example of something you might want to try. Of course, you'll want to adjust how much weight you're attributing to each category based on your own experience with your site. Comments are rare, but take more effort from the user than a favorite/vote, so they probably should receive more weight.

score = (votes / 10) + comments  
age = UNIX_TIMESTAMP() - UNIX_TIMESTAMP(date_created)

if(age < 86400) score = score * 1.5

This type of approach would give a bonus to new content uploaded in the past day. If you wanted to approach this in a similar way only for content that had been favorited or commented on recently, you could just add some WHERE constraints on your query that grabs the score out from the DB.

There are actually two big reasons NOT to calculate this ranking on the fly.

  1. Requiring your DB to fetch all of that data and do a calculation on every page load just to reorder items results in an expensive query.
  2. Probably a smaller gotcha, but if you have a relatively small amount of activity on the site, small changes in the ranking can cause content to move pretty drastically.

That leaves you with either caching the results periodically or setting up a cron job to update a new database column holding this score you're ranking by.

AvatarKava