For finding trending topics, I use the Standard score in combination with a moving average:

z-score = ([current trend] - [average historic trends]) / [standard deviation of historic trends]

(Thank you very much, Nixuz)
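As a rough illustration of the formula above, here is a minimal Python sketch. The function name `z_score` and the sample numbers are made up for the example, and returning 0 for a flat history is an assumption, not something the formula prescribes.

```python
from statistics import mean, pstdev

def z_score(current_trend, historic_trends):
    """Standard score of the current window against a list of past windows."""
    avg = mean(historic_trends)
    std = pstdev(historic_trends)  # population standard deviation
    if std == 0:
        # Assumption: a perfectly flat history carries no signal.
        return 0.0
    return (current_trend - avg) / std

# Hypothetical hit counts for four past 24h windows and the current one.
history = [8, 12, 8, 12]
print(z_score(16, history))
```

A topic that is always hot has a high average, so its z-score stays near zero unless the current window is unusually high for that topic.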

Until now, I do it as follows:

Whatever the time is, for the historic trends I simply go back 24h. Assuming we have January 12, 3:45pm now:

current_trend = hits [Jan 11, 3:45 - Jan 12, 3:45]

historic_trends = hits [Jan 10, 3:45 - Jan 11, 3:45] + hits [Jan 9, 3:45 - Jan 10, 3:45] + hits [Jan 8, 3:45 - Jan 9, 3:45] + ...

But is this really adequate? Wouldn't it be better to always start at midnight (0:00)? For example, for the same moment (3:45pm):

current_trend = hits [Jan 11, 0:00 - Jan 12, 0:00]

historic_trends = hits [Jan 10, 0:00 - Jan 11, 0:00] + hits [Jan 9, 0:00 - Jan 10, 0:00] + hits [Jan 8, 0:00 - Jan 9, 0:00] + ...

I'm sure the results would be different. But which approach will give you better results?
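To make the difference between the two schemes concrete, here is a small Python sketch. The helper `window_counts`, the year, and the hit timestamps are all invented for illustration; the only thing that changes between the two calls is where the windows end.

```python
from datetime import datetime, timedelta

def window_counts(hit_times, end, days):
    """Count hits in `days` consecutive 24h windows going back from `end`.

    Index 0 is the most recent window, index 1 the one before it, and so on.
    Hits later than `end` are ignored.
    """
    counts = [0] * days
    one_day = timedelta(days=1)
    for t in hit_times:
        age = end - t
        if age < timedelta(0):
            continue  # hit happened after the end of the newest window
        idx = age // one_day  # number of whole days between hit and `end`
        if idx < days:
            counts[idx] += 1
    return counts

# Hypothetical data; the year is arbitrary.
now = datetime(2010, 1, 12, 15, 45)
midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
hits = [
    datetime(2010, 1, 12, 14, 0),
    datetime(2010, 1, 11, 23, 30),
    datetime(2010, 1, 11, 10, 0),
    datetime(2010, 1, 10, 20, 0),
]

rolling = window_counts(hits, now, 3)       # windows end at 3:45pm
aligned = window_counts(hits, midnight, 3)  # windows end at 0:00
print(rolling, aligned)
```

In this sample the hit at 2:00pm on Jan 12 is counted by the rolling windows but not by the midnight-aligned ones, because the aligned current window already closed at 0:00.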

I hope you've understood my question and you can help me. :) Thanks in advance!

+1  A: 

I think that the problem you may be seeing with your current implementation is that topics that were hot 23 hours ago are influencing your rankings right now. The problem I see with your new proposed implementation is that you're wiping the slate clean at midnight, so topics that were hot late last night won't seem hot early the next morning (but they should).

I suggest you look into implementing a Digg-style algorithm (sorry for linking to Digg) where the hotness of a topic decays with age. You could do this by counting the hits per hour for each of the last 24 one-hour periods, then dividing each period's score by how many hours ago that period took place. Add up the 24 weighted scores to get the total.

hotness = (score24 / 24) + (score23 / 23) + ... + (score2 / 2) + score1

Where score24 is the number of "hits" that a topic got in the one-hour period that occurred 24 hours ago (maybe not the hits exactly, but the normalized score for that hour).

This way topics that were hot 24 hours ago will still be counted in your algorithm, but not as heavily as topics that were hot an hour ago.
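A minimal Python sketch of that weighted sum, assuming the hourly scores are already available as a list with the most recent hour first (the function name and the list layout are just for illustration):

```python
def hotness(hourly_scores):
    """Age-weighted sum of hourly scores.

    hourly_scores[0] is the score for the most recent hour (score1),
    hourly_scores[23] the score for the hour 24 hours ago (score24).
    """
    return sum(
        score / hours_ago
        for hours_ago, score in enumerate(hourly_scores, start=1)
    )

# Hypothetical data: the same 24 hits, either an hour ago or 24 hours ago.
recent = [24] + [0] * 23
old = [0] * 23 + [24]
print(hotness(recent), hotness(old))
```

With these numbers the recent burst scores 24.0 and the day-old burst scores 1.0, which is the decay the formula is meant to produce.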

Bill the Lizard
Thank you, Bill the Lizard, for this tip. I didn't know this simple algorithm, but it's really cool. Unfortunately, it isn't suitable for my purpose, i.e. finding trending topics. My algorithm filters out the topics that are always hot. Your algorithm doesn't do that, does it? ;) It's still very useful for me, though, because I filter out trending links, too. For that purpose, it's useful. But your example concerning my algorithm and the time periods is very good. So do you recommend the first approach (simply going back 24h instead of starting at 0:00)?
After going back and re-reading the question you linked to, I see the problem with this suggestion. You're right, it doesn't filter out topics that are always hot. Digg and reddit work with this algorithm because it only applies to a single link, not an entire topic, which might be represented by many hits. Of your two choices, I would favor going back 24 hours, only because I can't imagine how the system will work at 1AM if you only go back to 0:00. Maybe you could split the difference (in a way) and only go back 12 hours?
Bill the Lizard
Yes, the second approach would probably fail if some topics are hot shortly before 0:00. But the disadvantage is that I can't store the data of the last days when I always go back 24h ...