views:

562

answers:

7

Hi,

I'm making a digg-like website that is going to have a homepage with different categories. I want to display the most popular submissions.

Our rating system is simply "likes", like "I like this" and whatnot. We basically want to display the submissions with the highest number of "likes" per time. We want to have three categories: all-time popularity, last week, and last day.

Does anybody know of a way to help? I have no idea how to go about doing this and making it efficient. I thought that we could use some sort of cron-job to run every 10 minutes and pull in the number of likes per the last 10 minutes...but I've been told that's pretty inefficient?

Help?

Thanks!

+5  A: 

Typically Digg and Reddit-like sites go by the date of the submission and not the times of the votes. This way all it takes is a simple SQL query to find the top submissions for X time period. Here's a pseudo-query to find the 10 most popular links from the past 24 hours using this method:

select * from submissions
 where (current_time - post_time) < 86400
 order by score desc limit 10

Basically, this query says to find all the submissions where the number of seconds between now and the time it was posted is less than 86400, which is 24 hours in UNIX time.

If you really want to measure popularity within X time interval, you'll need to store the post and time for every vote in another table:

create table votes (
 post foreign key references submissions(id),
 time datetime,
 vote integer); -- +1 for upvote, -1 for downvote

Then you can generate a list of the most popular posts between X and Y times like so:

select sum(vote), post from votes
 where X < time and time < Y
 group by post
 order by sum(vote) desc limit 10;

From here you're just a hop, skip, and inner join away from getting the post data tied to the returned ids.

Kyle Cronin
I was writing basically the same thing, you were faster than me. =)
Alix Axel
great answer...it looks like although the first method you describe is simpler, it doesn't handle the case where something that was posted a while back seeing a sudden resurgence of popularity (maybe due to a recent news event or something)? second method looks more robust, thanks I'll try it out!
Brian Armstrong
A: 

To complete nobody_'s answer I would suggest you read up on the documentation (if you are using MySQL of course).

Alix Axel
+1  A: 

Do you have a decent DB setup? Can we please hear about your CREATE TABLE details and indices? Assuming a sane setup, the DB should be able to pull the counts you require fast enough to suit your needs! For example (net of indices and keys, that somewhat depend on what DB engine you're using), given two tables:

CREATE TABLE submissions (subid INT, when DATETIME, etc etc)
CREATE TABLE likes (subid INT, when DATETIME, etc etc)

you can get the top 33 all-time popular submissions as

SELECT *, COUNT(likes.subid) AS score
FROM submissions
JOIN likes USING(subid)
GROUP BY submissions.subid
ORDER BY COUNT(likes.subid) DESC
LIMIT 33

and those voted for within a certain time range as

SELECT *, COUNT(likes.subid) AS score
FROM submissions
JOIN likes USING(subid)
WHERE likes.when BETWEEN initial_time AND final_time
GROUP BY submissions.subid
ORDER BY COUNT(likes.subid) DESC
LIMIT 33

If you were storing "votes" (positive or negative) in likes, instead of just counting each entry there as +1, you could simply use SUM(likes.vote) instead of the COUNTs.

Alex Martelli
+6  A: 

Why not browse the reddit source code?

Shawn Simon
+1 Very cool! Didn't know this existed.
Jas Panesar
That's very nice
ZelluX
A: 

For stable list like alltime, lastweek, because they are not supposed to change really fast so that I think you should save the list in your cache with expiration time is around 1 days or longer.

If you concern about correct count in real time, you can check at every page view by comparing the page with lowest page in the cache.

All you need to do is to care for synchronizing between the cache and actual database.

thethanghn

thethanghn
the target of my approach is to reduce as much database query as it can since you dont need to get the top from database all the time
thethanghn
+1  A: 

There's a brief, easy-to-follow explanation of a Digg-style algorithm (not actually the Digg algorithm) here. (Don't click the link on that page, just scroll down.)

Bill the Lizard
A: 

Queries where the order is some function of the current time can become real performance problems. Things get much simpler if you can bucket by calendar time and update scores for each bucket as people vote.

Jason Watkins