+6  A: 

On Hacker News, only the 210 newest stories and 210 most popular stories are paginated (7 pages worth * 30 stories each). My guess is that the reason for the limit (at least in part) is this problem.

Why not drop all the fancy SQL for the most popular stories and just keep a running list instead? Once you've established a list of the top 210 stories you only need to worry about reordering when a new vote comes in since relative order is maintained over time. And when a new vote does come in, you only need to worry about reordering the story that received the vote.

If the story that received the vote is not on the list, calculate the score of that story and of the least popular story that is on the list. If the newly-voted story is lower, you're done. If it's higher, calculate the current score for the second-least-popular story (story #209) and compare again. Continue working up until you find a story with a higher score, and then place the newly-voted-upon story right below that one in the rankings. Unless, of course, it reaches #1.

The benefit of this approach is that it limits the set of stories you have to look at to figure out the top stories list. In the absolute worst-case scenario, you have to calculate the ranking for 211 stories. So it's very efficient unless you have to establish the list from an existing data set - but that's just a one-time penalty, assuming you cache the list someplace.

Downvotes are another issue, but I can only upvote (at my karma level, anyway).
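A minimal sketch of that reinsertion step in plain Python, assuming each story is a dict carrying its vote count and submission hour (the `score` helper and the `limit` default are my stand-ins, not actual HN code):

```python
def score(story, now):
    # (votes - 1) / (hours_since_submission + 2) ** 1.5
    hours = now - story['created_hour']
    return (story['votes'] - 1) / (hours + 2) ** 1.5

def apply_vote(top, story, now, limit=210):
    """Re-rank `top` (a best-first list of stories) after `story` gets a vote.

    Only scores along the walk are recomputed; the relative order of the
    untouched entries is preserved, as described above.
    """
    story['votes'] += 1
    if story in top:
        top.remove(story)
    elif len(top) >= limit and score(story, now) <= score(top[-1], now):
        return top  # still not popular enough to make the list
    # walk up from the bottom until we find a story that outscores ours
    i = len(top)
    while i > 0 and score(top[i - 1], now) < score(story, now):
        i -= 1
    top.insert(i, story)
    del top[limit:]  # keep only the top `limit` entries
    return top
```

The walk from the bottom touches only as many entries as the story actually climbs, which is what keeps the worst case at 211 score calculations.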

John Debs
That's a pretty nifty idea. I never expect this site to get very large, so there would have to be some major tweaking of those numbers, but I like where it's going.
TheLizardKing
A few months late, but I'm back at it. You wouldn't mind elaborating on this further, would you? How would you keep track of your running list? Another database table? Would it be calculated on the fly on every HTTP request? Don't worry about the downvotes; I've come to the conclusion that downvoting is a bad idea in most cases. buildingreputation.com has some great reads.
TheLizardKing
+3  A: 
popular_links = Link.objects.select_related()
popular_links = popular_links.extra(
    select = {
        'karma_total': 'SUM(vote.karma_delta)',
        'popularity': '(karma_total - 1) / POW(2, 1.5)'
    },
    order_by = ['-popularity']
)

Or select some sane number, sort the selection using Python in any way you like, and cache it if it's going to be static for all users - which it looks like it will be. Set the cache expiration to a minute or so.

But the extra() approach will work better for paginated results in a highly dynamic setup.
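The Python-side sort-and-cache route might look like this sketch - the `popularity` helper uses the time-decayed formula from the question, and the tuples stand in for rows fetched from the database:

```python
def popularity(karma_total, age_hours):
    # (karma - 1) / (age + 2) ** 1.5 -- the time-decayed ranking formula
    return (karma_total - 1) / (age_hours + 2) ** 1.5

# (id, karma_total, age_hours) tuples standing in for fetched link rows
rows = [(1, 10, 1), (2, 50, 48), (3, 3, 0)]
ranked = sorted(rows, key=lambda r: popularity(r[1], r[2]), reverse=True)
# cache `ranked` for ~60 seconds (e.g. with Django's cache framework),
# since the ordering is identical for all users
```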

kibitzer
I really think this is what I am going for, although for some reason I keep getting the same damn error: `Caught an exception while rendering: column "karma_total" does not exist LINE 1: SELECT ((karma_total - 1) / POW(2, 1.5)) AS "popularity", (S...` Which is strange, because I thought karma_total was already defined just a line above it!
TheLizardKing
Well, just substitute it then, if you don't need it in the final result: `'popularity': '(SUM(vote.karma_delta) - 1) / POW(2, 1.5)'`
kibitzer
Man, this is the particular answer I wish would work the most. The other answers are great, but I want this one to work. My latest error is: `Caught an exception while rendering: missing FROM-clause entry for table "vote" LINE 1: SELECT ((vote.karma_total - 1) / POW(2, 1.5)) AS "popularity...` This wouldn't have anything to do with my usage of 1.1.1 instead of the development version, would it?
TheLizardKing
Your vote table is not added to the query by select_related(); you can add it manually by modifying extra() to include `tables = ['vote']`, if your votes table is called `vote`.
kibitzer
View my edit 3 code. I am getting so close I can taste it! I am getting an error referring to the fact that I do not have the appropriate values in my group by. Not sure how to add them considering django usually took care of that.
TheLizardKing
Why do you need links_link there? You can add a GROUP BY by appending this as the last line: `popular_links.query.group_by = 'links_link.id'`, if you are sure that's what you want.
kibitzer
+2  A: 

Seems like you could override the save() method of the Vote class and have it update the corresponding Link object. Something like this should work well:

from datetime import datetime, timedelta

class Link(models.Model):
    category = models.ForeignKey(Category)
    user = models.ForeignKey(User)
    created = models.DateTimeField(auto_now_add = True)
    modified = models.DateTimeField(auto_now = True)
    fame = models.PositiveIntegerField(default = 1)
    title = models.CharField(max_length = 256)
    url = models.URLField(max_length = 2048)

    # A field to keep the most recently calculated popularity; null=True
    # lets the column hold NULL before the first calculation.
    popularity = models.FloatField(null = True, default = None)

    def CalculatePopularity(self):
        """
        A shortcut to make life easier ... this is used by the overridden
        save() method and can be used in a management command to do a
        mass-update periodically.
        """
        ts = datetime.now() - self.created
        # timedelta.seconds only covers the part under one day,
        # so fold the days in as well
        th = ts.days * 24 + ts.seconds / 60 / 60
        self.popularity = (self.vote_set.count() - 1) / ((th + 2) ** 1.5)

    def save(self, *args, **kwargs):
        """
        Override save() to recalculate the popularity first.
        """
        self.CalculatePopularity()
        super(Link, self).save(*args, **kwargs)

    def __unicode__(self):
        return self.title

class Vote(models.Model):
    link = models.ForeignKey(Link)
    user = models.ForeignKey(User)
    created = models.DateTimeField(auto_now_add = True)
    modified = models.DateTimeField(auto_now = True)
    karma_delta = models.SmallIntegerField()

    def save(self, *args, **kwargs):
        """
        Override save() to update the popularity of the Link object.
        Save the vote first so it gets counted, then re-save the link,
        which recalculates and stores its popularity.
        """
        super(Vote, self).save(*args, **kwargs)
        self.link.save()

    def __unicode__(self):
        return str(self.karma_delta)

This way, every time you call link_o.save() or vote_o.save(), it will recalculate the popularity. You have to be a little careful, because bulk operations like Link.objects.all().update(...) bypass our overridden save() method. So when I use this sort of thing, I create a management command which updates all of the objects so they're not too out of date. Something like this will work wonderfully:

# save() recalculates and stores the popularity; iterator() avoids
# loading every Link into memory at once
for link in Link.objects.all().iterator():
    link.save()

This way it will only load a single Link object into memory at a time ... so if you have a giant database, it won't cause a memory error.
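To see why the periodic mass-update matters: a stored popularity goes stale through the passage of time alone. A quick check with the formula from CalculatePopularity:

```python
def popularity(votes, hours_old):
    # (votes - 1) / (hours_old + 2) ** 1.5
    return (votes - 1) / (hours_old + 2) ** 1.5

fresh = popularity(10, 0)   # right after submission
stale = popularity(10, 24)  # a day later, with no new votes
# the stored value from hour 0 now overstates the link's rank
assert stale < fresh
```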

Now to do your ranking all you have to do is:

Link.objects.all().order_by('-popularity')

It will be super-fast, since all of your Link items have already had their popularity calculated.

JudoWill
Does this mean that every time a link is voted on, every single link row is updated?
TheLizardKing
No ... it would only update the link that was voted on ... so it wouldn't be that processor-intensive.
JudoWill
A: 

Here was the final answer to my question although many months late and not exactly what I had in mind. Hopefully it will be useful to some.

def hot(request):
    links = Link.objects.select_related().annotate(votes=Count('vote')).order_by('-created')[:150]
    for link in links:
        delta_in_hours = (int(datetime.now().strftime("%s")) - int(link.created.strftime("%s"))) / 3600
        link.popularity = ((link.votes - 1) / (delta_in_hours + 2)**1.5)

    links = sorted(links, key=lambda x: x.popularity, reverse=True)

    links = paginate(request, links, 5)

    return direct_to_template(
        request,
        template = 'links/link_list.html',
        extra_context = {
            'links': links
        })

What's going on here is that I pull the latest 150 submissions (5 pages of 30 links each); if you need more, you can obviously go grab 'em by altering my slice `[:150]`. This way I don't have to iterate over my whole queryset, which might eventually become very large, and really, 150 links should be enough procrastination for anybody.

I then calculate the difference in time between now and when the link was created and turn it into hours (not nearly as easy as I thought).
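For what it's worth, the `strftime("%s")` trick in the code above is platform-dependent (`%s` isn't a standard format code), while plain timedelta arithmetic is portable. A sketch, with made-up dates:

```python
from datetime import datetime, timedelta

def hours_between(created, now):
    # timedelta.seconds alone ignores whole days, so fold them in
    delta = now - created
    return delta.days * 24 + delta.seconds // 3600

now = datetime(2010, 5, 2, 12, 0)
created = now - timedelta(days=1, hours=3)
hours_between(created, now)  # 27
```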

Apply the algorithm to a non-existent field (I like this method because I don't have to store the value in my database, and it isn't reliant on surrounding links).

The line immediately after the for loop was where I also had another bit of trouble. I can't order_by('popularity') because it's not a real field in my database and is calculated on the fly, so I have to convert my queryset into a list of objects and sort by popularity from there.

The next line is just my paginator shortcut; thankfully, pagination does not require a queryset, unlike some generic views (talking to you, object_list).

Spit everything out into a nice direct_to_template generic view and be on my merry way.

TheLizardKing