views: 1588
answers: 10

Many sites offer statistics like "The hottest topics in the last 24h". For example, Topix.com shows this in its "News Trends" section, where you can see the topics with the fastest-growing number of mentions.

I want to compute such a "buzz" for a topic, too. How could I do this? The algorithm should give less weight to topics that are always hot. Topics that normally (almost) no one mentions should come out as the hottest ones.

Google offers "Hot Trends", topix.com shows "Hot Topics", fav.or.it shows "Keyword Trends" - all these services have one thing in common: They only show you upcoming trends which are abnormally hot at the moment.

Terms like "Britney Spears", "weather" or "Paris Hilton" won't appear in these lists because they're always hot and frequent. This article calls this "The Britney Spears Problem".

My question: How can you code an algorithm, or use an existing one, to solve this problem? Given a list of the keywords searched in the last 24h, the algorithm should show you the 10 (for example) hottest ones.

I know the article above mentions some kind of algorithm. I've tried to code it in PHP, but I don't think it will work. It just finds the majority, doesn't it?

I hope you can help me (coding examples would be great). Thanks in advance!

+2  A: 

A simple gradient of topic frequency would probably work -- a large positive gradient means a topic is growing quickly in popularity.

The easiest way would be to bin the number of searches each day, so you have something like

searches = [ 10, 7, 14, 8, 9, 12, 55, 104, 100 ]

and then find out how much it changed from day to day:

hot_factor = [ b-a for a, b in zip(searches[:-1], searches[1:]) ]
# hot_factor is [ -3, 7, -6, 1, 3, 43, 49, -4 ]

and just apply some sort of threshold so that days where the increase was > 50 are considered 'hot'. You could make this far more complicated if you'd like, too. Rather than the absolute difference you could take the relative difference, so that going from 100 to 150 is considered hot, but 1000 to 1050 isn't. Or use a more complicated gradient that takes into account trends over more than just one day to the next.
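
A minimal sketch of the relative-difference variant (my own illustration, not part of the original answer); the 0.5 threshold is an arbitrary choice:

searches = [10, 7, 14, 8, 9, 12, 55, 104, 100]

# relative day-over-day change, so 100 -> 150 scores higher than 1000 -> 1050
rel_change = [(b - a) / float(max(a, 1)) for a, b in zip(searches[:-1], searches[1:])]

# flag days whose searches grew by more than 50% over the previous day
hot_days = [i + 1 for i, r in enumerate(rel_change) if r > 0.5]
# hot_days is [2, 6, 7]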

Autoplectic
Thank you. But I don't know exactly what a gradient is and how I can work with it. Sorry!
Thanks. So I have to build a vector containing the daily frequency, right? The relative values would be better, I'm sure. Example: a growth from 100 to 110 is not as good as a growth from 1 to 9, I would say. But isn't there a vector function I could use to find the hottest topics? Only evaluating the relative values wouldn't be enough, would it? A growth from 100 to 200 (100%) is not as good as a growth from 20,000 to 39,000!?
What kind of web site are you adding this to? @Autoplectic's suggestion to count the change in searches day to day won't scale well for something like a popular forum, where you have thousands of topics with new ones being defined each day.
Quantum7
You're right, I need an algorithm for huge amounts of data, thousands of topics per hour.
+6  A: 
erickson
Thanks. The supposedly related question doesn't help me. The article seems to be about the same problem, but I can't find anything useful there - no algorithms or calculation examples.
This is exactly what Topix.com must be doing. The related question does not give any _code_, but it definitely gives an _algorithm_. Use Demaine's algorithm, cited near the bottom of page 4 of the article, to calculate the top ten (or whatever) searches from a log of the last 24 hours. If you want to rank them you need to loop over the log again and count the occurrences of each search. It's a long and rather technical article, but it does in fact contain the information you need to do hottest topics in a scalable manner.
Quantum7
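
For illustration only, here is a sketch of a Misra-Gries-style frequent-items pass (the family Demaine's algorithm generalizes), followed by the second counting pass described above. The function names and the k parameter are made up for this example, and the one-pass stage only guarantees catching terms that occur more than 1/(k+1) of the time:

from collections import Counter

def frequent_candidates(stream, k):
    # One pass, O(k) space: any term occurring more than len(stream)/k times
    # is guaranteed to survive in `counters` (Misra-Gries style).
    counters = {}
    for term in stream:
        if term in counters:
            counters[term] += 1
        elif len(counters) < k - 1:
            counters[term] = 1
        else:
            for t in list(counters):
                counters[t] -= 1
                if counters[t] == 0:
                    del counters[t]
    return set(counters)

def top_searches(log, k=10):
    candidates = frequent_candidates(log, k + 1)
    # Second pass: exact counts for the surviving candidates only.
    exact = Counter(term for term in log if term in candidates)
    return exact.most_common(k)
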
So Topix.com must be using the majority algorithm? Is the following approach correct? http://paste.bradleygill.com/index.php?paste_id=9117
Using Demaine's algorithm, do you really find the hottest topics? Or will the result be the topics which are always hot (Britney Spears, weather, ...)?
+3  A: 

Typically "buzz" is figured out using some form of exponential/log decay mechanism. For an overview of how Hacker News, Reddit, and others handle this in a simple way, see this post.

This doesn't fully address the things that are always popular. What you're looking for seems to be something like Google's "Hot Trends" feature. For that, you could divide the current value by a historical value and then subtract out ones that are below some noise threshold.
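
A minimal sketch of that ratio idea, assuming hypothetical per-topic count dictionaries and an arbitrary noise floor:

def hot_ratio(current, historical, noise_floor=50, top_n=10):
    # Score each topic by today's count relative to its own historical count,
    # ignoring low-volume topics below the noise floor.
    scores = {}
    for topic, count in current.items():
        if count < noise_floor:
            continue
        baseline = historical.get(topic, 1) or 1  # avoid division by zero
        scores[topic] = count / float(baseline)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]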

Jeff Moser
Yes, Google's Hot Trends is exactly what I'm looking for. What should the historical value be? The average value of the last 7 days for example?
It depends on how volatile your data is. You could start with a 30 day average. If it's a cyclical thing (e.g. Kentucky Derby) then it might make sense to do yearly comparisons. I'd experiment and see what works best in practice.
Jeff Moser
A: 

Perhaps not an answer to the question you asked, but you might be interested in this blog post on how to calculate an overall rating based on individual user ratings.

Gary McGill
Thanks. Unfortunately, that approach doesn't consider the time of the votes. So I can't use it.
+1  A: 

You could use log-likelihood-ratios to compare the current date with the last month or year. This is statistically sound (given that your events are not normally distributed, which is to be assumed from your question).

Just sort all your terms by logLR and pick the top ten.

public static void main(String... args) {
    TermBag today = ...
    TermBag lastYear = ...
    for (String each: today.allTerms()) {
        System.out.println(logLikelihoodRatio(today, lastYear, each) + "\t" + each);
    }
} 

public static double logLikelihoodRatio(TermBag t1, TermBag t2, String term) {
    double k1 = t1.occurrences(term); 
    double k2 = t2.occurrences(term); 
    double n1 = t1.size(); 
    double n2 = t2.size(); 
    double p1 = k1 / n1;
    double p2 = k2 / n2;
    double p = (k1 + k2) / (n1 + n2);
    double logLR = 2*(logL(p1,k1,n1) + logL(p2,k2,n2) - logL(p,k1,n1) - logL(p,k2,n2));
    if (p1 < p2) logLR *= -1;
    return logLR;
}

private static double logL(double p, double k, double n) {
    return (k == 0 ? 0 : k * Math.log(p)) + ((n - k) == 0 ? 0 : (n - k) * Math.log(1 - p));
}

PS: a TermBag is an unordered collection of words. For each document you create one bag of terms. Just count the occurrences of words. Then the method occurrences returns the number of occurrences of a given word, and the method size returns the total number of words. It is best to normalize the words somehow; typically toLowerCase is good enough. Of course, in the above example you would create one document with all queries of today, and one with all queries of the last year.
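
For readers unfamiliar with the idea, a TermBag can be as simple as a word-count wrapper. A rough sketch (in Python rather than Java, and not Adrian's actual class):

from collections import Counter

class TermBag:
    # Unordered bag of (lowercased) terms with counts.
    def __init__(self, text=""):
        self.counts = Counter(text.lower().split())
    def occurrences(self, term):
        return self.counts[term.lower()]
    def size(self):
        return sum(self.counts.values())
    def allTerms(self):
        return self.counts.keys()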

Adrian
Sorry, I don't understand the code. What are TermBags? It would be great if you could explain shortly what this code does.
A TermBag is a bag of terms, i.e. the class should be able to answer the total number of words in the text and the number of occurrences of each word.
Adrian
A: 

The idea is to keep track of such things and notice when they jump significantly as compared to their own baseline.

So, for queries that have more than a certain threshold, track each one, and when it jumps to some multiple (say, almost double) of its historical value, then it is a new hot trend.
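
A toy version of that check (the function name, thresholds, and factor are made up for illustration):

def is_new_hot_trend(todays_count, historical_baseline, min_volume=100, factor=1.8):
    # Only consider queries above a minimum volume, then flag those that have
    # jumped to (almost) double their own historical baseline.
    return todays_count >= min_volume and todays_count >= factor * historical_baseline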

Joshua
+35  A: 

You need an algorithm that measures the velocity of a topic - or in other words, if you graph it you want to show those that are going up at an incredible rate.

This is the first derivative of the trend line, and it is not difficult to incorporate as a weighted factor of your overall calculation.

Normalize

One technique you'll need to use is to normalize all your data. For each topic you are following, keep a very low-pass filter that defines that topic's baseline. Now every data point that comes in about that topic should be normalized - subtract its baseline and you'll get ALL of your topics near 0, with spikes above and below the line. You may instead want to divide the signal by its baseline magnitude, which will bring the signal to around 1.0 - this not only brings all signals in line with each other (normalizes the baseline), but also normalizes the spikes. A Britney spike is going to be magnitudes larger than someone else's spike, but that doesn't mean you should pay attention to it - it may be very small relative to her baseline.

Derive

Once you've normalized everything, figure out the slope of each topic. Take two consecutive points and measure the difference. A positive difference is trending up, a negative difference is trending down. Then you can compare the normalized differences and find out which topics are shooting upward in popularity compared to other topics - with each topic scaled appropriately to its own 'normal', which may be orders of magnitude different from other topics.

This is really a first-pass at the problem. There are more advanced techniques which you'll need to use (mostly a combination of the above with other algorithms, weighted to suit your needs) but it should be enough to get you started.
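
A rough sketch of the normalize-then-derive idea (my own illustration, not Adam's code): an exponential average stands in for the low-pass filter, the 0.95 decay is arbitrary, and each topic is assumed to have at least two days of counts.

def trend_scores(history, decay=0.95):
    # history: {topic: [daily counts, oldest first]} -> topics sorted by rising velocity
    scores = {}
    for topic, counts in history.items():
        baseline = float(counts[0]) or 1.0
        normalized = []
        for c in counts:
            baseline = decay * baseline + (1 - decay) * c  # slow-moving baseline (low-pass filter)
            normalized.append(c / baseline)                # spike size relative to the topic's own baseline
        # first derivative of the normalized series: latest day-over-day change
        scores[topic] = normalized[-1] - normalized[-2]
    return sorted(scores, key=scores.get, reverse=True)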

Regarding the article

The article is about topic trending, but it's not about how to calculate what's hot and what's not; it's about how to process the huge amount of information that such an algorithm must process at places like Lycos and Google. The space and time required to give each topic a counter, and to find each topic's counter when a search on it goes through, is huge. This article is about the challenges one faces when attempting such a task. It does mention the Britney effect, but it doesn't talk about how to overcome it.

As Nixuz points out this is also referred to as a Z or Standard Score.

Adam Davis
Good answer, +1
0xA3
I upvoted this before the edit, and come back and I wanted to upvote it again! Nice job
Simucal
Thanks! I would do pseudo code, but I don't have the time right now. Maybe later, or perhaps someone else will take these concepts and implement it...
Adam Davis
Thank you very much, Adam Davis! If Nixuz really described the same, I think I've got a solution in PHP: http://paste.bradleygill.com/index.php?paste_id=9206 Do you think this code is right?
+2  A: 

I think the key word you need to notice is "abnormally". In order to determine when something is "abnormal", you have to know what is normal. That is, you're going to need historical data, which you can average to find out the normal rate of a particular query. You may want to exclude abnormal days from the averaging calculation, but again that will require having enough data already, so that you know which days to exclude.

From there, you'll have to set a threshold (which would require experimentation, I'm sure), and if something goes outside the threshold, say 50% more searches than normal, you can consider it a "trend". Or, if you want to be able to find the "Top X Trendiest" like you mentioned, you just need to order things by how far (percentage-wise) they are away from their normal rate.

For example, let's say that your historical data has told you that Britney Spears usually gets 100,000 searches, and Paris Hilton usually gets 50,000. If you have a day where they both get 10,000 more searches than normal, you should be considering Paris "hotter" than Britney, because her searches increased 20% more than normal, while Britney's were only 10%.

God, I can't believe I just wrote a paragraph comparing "hotness" of Britney Spears and Paris Hilton. What have you done to me?
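
The same worked example in code, using the numbers from the paragraph above:

usual = {"britney spears": 100000, "paris hilton": 50000}
today = {"britney spears": 110000, "paris hilton": 60000}

pct_over_normal = dict((t, (today[t] - usual[t]) / float(usual[t])) for t in today)
trendiest = sorted(pct_over_normal, key=pct_over_normal.get, reverse=True)
# -> ['paris hilton', 'britney spears']  (up 20% vs. 10% over normal)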

Chad Birch
Thanks, but wouldn't it be a bit too simplistic to order them just by their percentage increase?
+9  A: 

This problem calls for a z-score or standard score, which takes into account the historical average, as other people have mentioned, but also the standard deviation of this historical data, making it more robust than just using the average.

In your case a z-score is calculated by the following formula, where the trend would be a rate such as views / day.

z-score = ([current trend] - [average historic trends]) / [standard deviation of historic trends]

When a z-score is used, the higher or lower the z-score, the more abnormal the trend; so, for example, if the z-score is highly positive then the trend is abnormally rising, while if it is highly negative it is abnormally falling. So once you calculate the z-score for all the candidate trends, the 10 highest z-scores will correspond to the most abnormally increasing trends.

Please see Wikipedia for more information about z-scores.

Code

from math import sqrt

def zscore(obs, pop):
    # Size of population.
    number = float(len(pop))
    # Average population value.
    avg = sum(pop) / number
    # Standard deviation of population.
    std = sqrt(sum(((c - avg) ** 2) for c in pop) / number)
    # Zscore Calculation.
    return (obs - avg) / std

Sample Output

>>> zscore(12, [2, 4, 4, 4, 5, 5, 7, 9])
3.5
>>> zscore(20, [21, 22, 19, 18, 17, 22, 20, 20])
0.0739221270955
>>> zscore(20, [21, 22, 19, 18, 17, 22, 20, 20, 1, 2, 3, 1, 2, 1, 0, 1])
1.00303599234
>>> zscore(2, [21, 22, 19, 18, 17, 22, 20, 20, 1, 2, 3, 1, 2, 1, 0, 1])
-0.922793112954
>>> zscore(9, [1, 2, 0, 3, 1, 3, 1, 2, 9, 8, 7, 10, 9, 5, 2, 4, 1, 1, 0])
1.65291949506

Notes

  • You can use this method with a sliding window (i.e. last 30 days) if you wish not to take too much history into account, which will make short-term trends more pronounced and can cut down on the processing time.

  • You could also use a z-score for values such as the change in views from one day to the next, to locate abnormal values for increasing/decreasing views per day. This is like using the slope or derivative of the views-per-day graph.

  • If you keep track of the current size of the population, the current total of the population, and the current total of x^2 of the population, you don't need to recalculate these values, only update them and hence you only need to keep these values for the history, not each data value. The following code demonstrates this.

    from math import sqrt
    
    
    class zscore:
        def __init__(self, pop = []):
            self.number = float(len(pop))
            self.total = sum(pop)
            self.sqrTotal = sum(x ** 2 for x in pop)
        def update(self, value):
            self.number += 1.0
            self.total += value
            self.sqrTotal += value ** 2
        def avg(self):
            return self.total / self.number
        def std(self):
            return sqrt((self.sqrTotal / self.number) - self.avg() ** 2)
        def score(self, obs):
            return (obs - self.avg()) / self.std()
    
  • Using this method your workflow would be as follows. For each topic, tag, or page create a floating point field for the total number of days, the sum of views, and the sum of views squared in your database. If you have historic data, initialize these fields using that data; otherwise initialize them to zero. At the end of each day, calculate the z-score using the day's number of views against the historic data stored in the three database fields. The topics, tags, or pages with the highest X z-scores are your X "hottest trends" of the day. Finally, update each of the 3 fields with the day's value and repeat the process tomorrow, as sketched below.
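
    A minimal sketch of that end-of-day loop, reusing the zscore class above (the storage layer and all names here are illustrative, not from the original answer):

    def end_of_day(stats, todays_views, top_x=10):
        # stats: {topic: zscore instance, persisted as the three fields}
        # todays_views: {topic: today's view count}
        scored = []
        for topic, views in todays_views.items():
            tracker = stats.setdefault(topic, zscore())
            if tracker.number > 1 and tracker.std() > 0:  # need some history first
                scored.append((tracker.score(views), topic))
            tracker.update(views)  # fold today into the running history
        scored.sort(reverse=True)
        return [topic for _, topic in scored[:top_x]]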

New Addition

Normal z-scores as discussed above do not take into account the order of the data and hence the z-score for an observation of '1' or '9' would have the same magnitude against the sequence [1, 1, 1, 1, 9, 9, 9, 9]. Obviously for trend finding, the most current data should have more weight than older data and hence we want the '1' observation to have a larger magnitude score than the '9' observation. In order to achieve this I propose a floating average z-score. It should be clear that this method is NOT guaranteed to be statistically sound but should be useful for trend finding or similar. The main difference between the standard z-score and the floating average z-score is the use of a floating average to calculate the average population value and the average population value squared. See code for details:

Code

class fazscore:
    def __init__(self, decay, pop = []):
        self.sqrAvg = self.avg = 0
        # The rate at which the historic data's effect will diminish.
        self.decay = decay
        for x in pop: self.update(x)
    def update(self, value):
        # Set initial averages to the first value in the sequence.
        if self.avg == 0 and self.sqrAvg == 0:
            self.avg = float(value)
            self.sqrAvg = float((value ** 2))
        # Calculate the average of the rest of the values using a 
        # floating average.
        else:
            self.avg = self.avg * self.decay + value * (1 - self.decay)
            self.sqrAvg = self.sqrAvg * self.decay + (value ** 2) * (1 - self.decay)
        return self
    def std(self):
        # Somewhat ad-hoc standard deviation calculation.
        return sqrt(self.sqrAvg - self.avg ** 2)
    def score(self, obs):
        if self.std() == 0: return 0
        else: return (obs - self.avg) / self.std()

Sample IO

>>> fazscore(0.8, [1, 1, 1, 1, 1, 1, 9, 9, 9, 9, 9, 9]).score(1)
-1.67770595327
>>> fazscore(0.8, [1, 1, 1, 1, 1, 1, 9, 9, 9, 9, 9, 9]).score(9)
0.596052006642
>>> fazscore(0.9, [2, 4, 4, 4, 5, 5, 7, 9]).score(12)
3.46442230724
>>> fazscore(0.9, [2, 4, 4, 4, 5, 5, 7, 9]).score(22)
7.7773245459
>>> fazscore(0.9, [21, 22, 19, 18, 17, 22, 20, 20]).score(20)
-0.24633160155
>>> fazscore(0.9, [21, 22, 19, 18, 17, 22, 20, 20, 1, 2, 3, 1, 2, 1, 0, 1]).score(20)
1.1069362749
>>> fazscore(0.9, [21, 22, 19, 18, 17, 22, 20, 20, 1, 2, 3, 1, 2, 1, 0, 1]).score(2)
-0.786764452966
>>> fazscore(0.9, [1, 2, 0, 3, 1, 3, 1, 2, 9, 8, 7, 10, 9, 5, 2, 4, 1, 1, 0]).score(9)
1.82262469243
Nixuz
Thank you very much! That approach looks very good. But I still have a question: :) Is the following PHP code correct? http://paste.bradleygill.com/index.php?paste_id=9205
No, your code has one small mistake, on the following line: $z_score = $hits_today-($average_hits_per_day/$standard_deviation); It should be: $z_score = ($hits_today-$average_hits_per_day)/$standard_deviation; Note the change in parentheses.
Nixuz
Ok, thanks! :) Is it correct now? http://paste.bradleygill.com/index.php?paste_id=9206
Yes, it looks correct.
Nixuz
Thanks for the edit concerning the moving average. I've tried to implement it in my script, too. Is it correct? http://paste.bradleygill.com/index.php?paste_id=9224
I would say it is not correct. First, if you change $old_average_value to $average_hits_per_day, you can get rid of the line $old_average_value = $average_hits_per_day;. Second, and more importantly, you need to compute a moving-average standard deviation as well as the normal moving average. To do this you need to use the update method of calculation (see note 3), and when you are summing the x^2's you weight them in the same way as your moving average. Third, starting at '1' for the moving averages might not be a good idea.
Nixuz
Just to make things clear, the standard deviation formula we use when updating is upload.wikimedia.org/math/9/… (see the right-hand side). When doing the moving-average z-score, the average comes from the moving average of x, and the average x^2 comes from the moving average of x^2. With these values the standard deviation formula becomes std = sqrt([average x^2] - [average x]^2). Once you have the moving average and the moving standard deviation, the z-score calculation is the same.
Nixuz
Thank you, you described the problem in great detail, but I still didn't really understand it. Your formula URL isn't displayed correctly. Do I have to change the following line? $standard_deviation_temp = array_sum($deviations_on_single_days)/count($deviations_on_single_days)
Please see the changes I made to your code: http://paste.bradleygill.com/index.php?paste_id=9230 The URL for the formula is http://upload.wikimedia.org/math/9/e/9/9e96be22dfa75c7a7cae85bf52bfbc11.png
Nixuz
Thank you very much, Nixuz! The edited code on codepaste works perfectly. I would like to choose your answer as the best, but unfortunately someone has merged both questions, so there is already a best answer. Is there any possibility to choose your answer anyway?
I am pretty sure stack overflow lets you change the "best answer". Just click the check beside your new selection to do this.
Nixuz
No, that's not possible. Once you've marked an answer as the best one, the checks beside the other answers disappear. Maybe the support, or a mod, can help here!?
+10  A: 

Chad Birch and Adam Davis are correct in that you will have to look backward to establish a baseline. Your question, as phrased, suggests that you only want to view data from the past 24 hours, and that won't quite fly.

One way to give your data some memory without having to query for a large body of historical data is to use an exponential moving average. The advantage of this is that you can update this once per period and then flush all old data, so you only need to remember a single value. So if your period is a day, you have to maintain a "daily average" attribute for each topic, which you can do by:

a_n = a_(n-1)*b + c_n*(1-b)

Where a_n is the moving average as of day n, b is some constant between 0 and 1 (the closer to 1, the longer the memory) and c_n is the number of hits on day n. The beauty is if you perform this update at the end of day n, you can flush c_n and a_(n-1).

The one caveat is that it will be initially sensitive to whatever you pick for your initial value of a.

EDIT

If it helps to visualize this approach, take n = 5, a_0 = 1, and b = .9.

Let's say the new values are 5,0,0,1,4:

a_0 = 1
c_1 = 5 : a_1 = .9*1 + .1*5 = 1.4
c_2 = 0 : a_2 = .9*1.4 + .1*0 = 1.26
c_3 = 0 : a_3 = .9*1.26 + .1*0 = 1.134
c_4 = 1 : a_4 = .9*1.134 + .1*1 = 1.1206
c_5 = 4 : a_5 = .9*1.1206 + .1*4 = 1.40854

Doesn't look very much like an average, does it? Note how the value stayed close to 1, even though our first input was 5. What's going on? If you expand out the math, what you get is:

a_n = (1-b)*c_n + (1-b)*b*c_(n-1) + (1-b)*b^2*c_(n-2) + ... + (leftover weight)*a_0

What do I mean by leftover weight? Well, in any average, all weights must add to 1. If n were infinity and the ... could go on forever, then all weights would sum to 1. But if n is relatively small, you get a good amount of weight left on the original input.

If you study the above formula, you should realize a few things about this usage:

  1. All data contributes something to the average forever. Practically speaking, there is a point where the contribution is really, really small.
  2. Recent values contribute more than older values.
  3. The higher b is, the less important new values are and the longer old values matter. However, the higher b is, the more data you need to water down the initial value of a.

I think the first two characteristics are exactly what you are looking for. To give you an idea of how simple this can be to implement, here is a Python implementation (minus all the database interaction):

>>> class EMA(object):
...  def __init__(self, base, decay):
...   self.val = base
...   self.decay = decay
...   print self.val
...  def update(self, value):
...   self.val = self.val*self.decay + (1-self.decay)*value
...   print self.val
... 
>>> a = EMA(1, .9)
1
>>> a.update(10)
1.9
>>> a.update(10)
2.71
>>> a.update(10)
3.439
>>> a.update(10)
4.0951
>>> a.update(10)
4.68559
>>> a.update(10)
5.217031
>>> a.update(10)
5.6953279
>>> a.update(10)
6.12579511
>>> a.update(10)
6.513215599
>>> a.update(10)
6.8618940391
>>> a.update(10)
7.17570463519
David Berger
This is also known as an infinite impulse response filter (IIR)
Adam Davis
Hey a better version of my answer.
Joshua
@Adam Really? I'm not familiar with them. Is it a special case of an IIR? The articles I'm skimming don't seem to be providing formulas that reduce down to an exponential moving average in the simple case.
David Berger
Or I could be tired and slow..........................
David Berger
Thank you very much, David Berger! If it works, it would be a great addition to the other answers! I have some questions, though. I hope you can answer them: 1) Does the factor b define how fast the old data loses weight? 2) Will this approach give approximately equivalent results compared to simply storing the old data and calculating the average? 3) Is this your formula in words? $average_value = $old_average_value * $smoothing_factor + $hits_today * (1-$smoothing_factor)
Points 1 and 3 are correct. See my edit for a bit of a nuanced discussion of 2.
David Berger
Maybe I am missing something, but I fail to understand how you can reasonably use a moving average to solve this problem. Once you have calculated your moving average for your trends, how do you know which trend is rising the fastest in comparison to the others? Could you add some more information about how this solves the initial problem asked. Thanks.
Nixuz
@Nixuz This contributes to solving the original problem; it does not solve it. It probably shouldn't have been so long an answer, but there were several responses which indicated the use of some trend average, and it seemed from the original question that the intent had been only to use data from the previous day. Call it a subproblem optimization.
David Berger
I incorporated the floating average idea into the z-score idea, in my answer. Thanks for the idea David.
Nixuz