views:

364

answers:

2

Short version:

I have a similar setup to StackOverflow. Users get Achievements. I have many more achievements than SO, lets say on the order of 10k, and each user has in the 100s of achievements. Now, how would you recommend (to recommend) the next achievement for a user to try for?

Long version:

The objects are modeled like this in django (showing only important parts) :

class User(models.Model):
    alias = models.ForeignKey(Alias)

class Alias(models.Model):
    achievements = models.ManyToManyField('Achievement', through='Achiever')

class Achievement(models.Model):
    points = models.IntegerField()

class Achiever(models.Model):
    achievement = models.ForeignKey(Achievement)
    alias = models.ForeignKey(Alias)
    count = models.IntegerField(default=1)

and my algorithm is just to find every other user that has a shared achievement with the logged in user, and then go through all their achievements and sort by number of occurrences :

def recommended(request) :
    user = request.user.get_profile()

    // The final response
    r = {}

    // Get all the achievements the user's aliases have received 
    // in a set so they aren't double counted
    achievements = set()
    for alias in user.alias_set.select_related('achievements').all() :
        achievements.update(alias.achievements.all())

    // Find all other aliases that have gotten at least one of the same
    // same achievements as the user
    otherAliases = set()
    for ach in achievements :
        otherAliases.update(ach.alias_set.all())

    // Find other achievements the other users have gotten in addition to
    // the shared ones.
    // And count the number of times each achievement appears
    for otherAlias in otherAliases :
        for otherAch in otherAlias.achievements.all() :
            r[otherAch] = r.get(otherAch, 0) + 1

    // Remove all the achievements that the user has already gotten
    for ach in achievements :
        r.pop(ach)

    // Sort by number of times the achievements have been received
    r = sorted(r.items(), lambda x, y: cmp(x[1], y[1]), reverse=True)

    // Put in the template for showing on the screen
    template_values = {}
    template_values['achievements'] = r

But it takes FOREVER to run, and always returns the whole list, which is unneeded. A user would only need the top few achievements to go after.

So, I'm welcome to recommendations on other algorithms and/or code improvements. I'll give you an achievement in my system for coming up with the recommendation algorithm :)

+2  A: 

Hello,

One method you can recommend which achievements to go for is to see how many of your users already have those achievements and recommend those popular ones. When they have achieved those you go down the list and recommend slightly less popular ones. However, this has a naive assumption that everyone wants to go for popular achievements. It might cause popular achievements to be even more popular and less popular ones, well... A consolation is that this doesn't take up much resources and is likely to run very fast. (Just keep a list of achievements + number of times it's achieved)

Another method (which attempts to guess which achievements the user is likely to go after based on what achievements he already had) is to use some machine learning algorithms. I think the k-nearest neighbor algorithm will perform quite well here. Select a threshold and just output everything that is above this threshold. Now, I don't know if this will run faster than what you already have, but you should just run the recommendation engine once every time the user has made a new achievement, store the top (let's say) five, and just output it back to the user whenever a recommendation is needed.

I hope this helps. =)

wai
The first is a little too simplistic to me as there are many types of users getting different achievements. Any recommendations on a "fast" (non-bruteforce) kNN python library or code? All I found was an unanswered usenet post http://coding.derkeiler.com/Archive/Python/comp.lang.python/2007-08/msg01615.html
Paul Tarjan
How about improving the first method by having different general groups of users (and keeping a list for each type)? Like in MMORPGs, they have achievements based on whether the person is an explorer, a power-player (power-leveler?), a social person etc... You can guess which general category they fall on based on a few of the achievements they already had. Unfortunately, kNN is not going to be fast, other than the simplest of cases. However, I think it would give great recommendation and you can use it to help create lists for the first method. Sorry, I don't know Python, yet.
wai
That seems similar to clustering the aliases offline, and then showing them achievements from their nearest cluster and then walk down the clusters in order of proximity. Maybe i'll see if I can code that up and how fast it will be.
Paul Tarjan
I guess you're worried about the thing slowing your server down while you are generating the recommendations. Just an idea, how about performing this task on another machine (like an old laptop on a backup of live database) then have it upload those data into your server database? You can set a cron for it to run every 24 hours. That way, the performance of your web server is relatively unaffected and your list can be relatively up-to-date. Good luck! =)
wai
+1  A: 

I would suggest that you do the first three steps (achievements, otherAliases, count) as one single SQL statement. As it is now, you are issuing a lot of queries and summarising thousands of rows in Python which is a task you should delegate to the DB. For example the code

for otherAlias in otherAliases : #For every single other user
    for otherAch in otherAlias.achievements.all() : #execute a query
        r[otherAch] = r.get(otherAch, 0) + 1

Does thousands of huge queries.

Instead, you can use SQL to do this by joining Achiever on itself based on Alias id being different and achievement id being the same. You then group by achievement id and run a count.

In the query below, the table "B" is other user's achievements and "Achiever" is our achievements. If any other user shares an achievement, they appear once in "B" for each achievement they share. We then group those by alias_id and count the number of times they appeared so you get a nice id, count table out.

Very very rough code (no SQL available here)

SELECT B.Alias_id, COUNT(B.achievement_id) 
  FROM Achiever, Achiever as B 
  WHERE Achiever.achievement_id == B.achievement_id 
     AND Achiever.Alias_id == <insert current user alias here>;
  GROUP BY B.Alias_id

If that works the way I think it will, you will get a table of other user aliases, along with the number of achievements they share with the current user.

The next thing you do is an SQL statement that uses the one above as an "inner select" - call it users. You join that with your achievements table and your Achiever table for the current user. You might want to ignore all but the top 10 users who are similar to the current user.

I don't have time to write up a good query right now, but look at the JOIN statement for your DB that joins on achievement_id between the nominated 10 users and the current user - setting that id to NULL if it doesn't exist. The filter only to rows where it turned up NULL (unachieved achievements).

Tom Leys
Yes, this is exactly how I would do it if I was coding in PHP with an SQL connector. Is there a good way to do this in a django object model, or should I just fall back to SQL?
Paul Tarjan
I would follow the instructions in http://docs.djangoproject.com/en/dev/topics/db/sql/ to execute raw SQL and then encapsulate that in a method in User or Alias - This code is far too performance critical to do in Python (or any web layer) since it suits the DB so well.
Tom Leys