views: 891
answers: 4
What's the rationale behind the formula used in the hive_trend_mapper.py program of this Hadoop tutorial on calculating Wikipedia trends?

There are actually two components: a monthly trend and a daily trend. I'm going to focus on the daily trend, but similar questions apply to the monthly one.

In the daily trend, pageviews is an array of number of page views per day for this topic, one element per day, and total_pageviews is the sum of this array:

# pageviews for most recent day
y2 = pageviews[-1]
# pageviews for previous day
y1 = pageviews[-2]
# Simple baseline trend algorithm
slope = y2 - y1
trend = slope * log(1.0 + int(total_pageviews))
error = 1.0/sqrt(int(total_pageviews))
return trend, error

I know what it's doing superficially: it just looks at the change over the past day (slope), and scales it by the log of 1+total_pageviews (log(1)==0, so this scaling factor is non-negative). It can be seen as treating the month's total pageviews as a weight, but one that is tempered as it grows - this way, the total pageviews stop making much of a difference for things that are "popular enough," but at the same time big changes on insignificant topics don't get weighed as much.
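For anyone who wants to run this outside Hadoop, here is a self-contained sketch of the snippet above (the wrapper function and the sample numbers are mine, not the tutorial's):

from math import log, sqrt

def daily_trend(pageviews):
    # pageviews: list of daily view counts for one topic
    total_pageviews = sum(pageviews)
    y2 = pageviews[-1]   # most recent day
    y1 = pageviews[-2]   # previous day
    slope = y2 - y1
    trend = slope * log(1.0 + int(total_pageviews))
    error = 1.0 / sqrt(int(total_pageviews))
    return trend, error

# same +8 one-day jump on a small topic and on a big topic
print(daily_trend([10, 12, 20]))        # trend ~30, error ~0.15
print(daily_trend([1000, 1200, 1208]))  # trend ~65, error ~0.017

The same +8 jump scores about 30 on the small topic and about 65 on the big one, which is exactly the weighting I'm asking about.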

But why do this? Why do we want to discount things that were initially unpopular? Shouldn't big deltas matter more for items that have a low constant popularity, and less for items that are already popular (for which the big deltas might fall well within a fraction of a standard deviation)? As a strawman, why not simply take y2-y1 and be done with it?

And what would the error be useful for? The tutorial doesn't really use it meaningfully again. Then again, it doesn't tell us how trend is used either - this is what's plotted in the end product, correct?

Where can I read up on the theory here (preferably at an introductory level)? Is there a name for this madness? Is this a textbook formula somewhere?

Thanks in advance for any answers (or discussion!).

+1  A: 

The code implements statistics (in this case the "baseline trend"); you should educate yourself on that, and everything becomes clearer. Wikibooks has a good introduction.

The algorithm takes into account that new pages are by definition more unpopular than existing ones (because - for example - they are linked from relatively few other places) and suggests that those new pages will grow in popularity over time.

error is the error margin the system expects for its predictions. The higher error is, the less likely it is that the trend will continue as expected.

Martin Hohenberg
Neither I nor Google could find where the baseline trend is introduced in that Wikibook. Do you have a pointer?
Yang
That book handles basic statistics, which one should understand before trying to work with the more esoteric concepts.
Martin Hohenberg
A: 

The reason for moderating the measure by the volume of clicks is not to penalise popular pages but to make sure that you can compare large and small changes with a single measure. If you just use y2 - y1 you will only ever see the click changes on large-volume pages. What this is trying to express is "significant" change. A 1000-click change when you attract 100 clicks is really significant; a 1000-click change when you attract 100,000 is less so. What this formula is trying to do is make both of these visible.

Try it out at a few different scales in Excel, you'll get a good view of how it operates.
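If you'd rather not fire up Excel, here is a quick Python sketch of the same experiment (the slope/total pairs are just made-up illustrations):

from math import log

# (one-day change, total pageviews) at a few different scales
cases = [(1000, 1000), (1000, 1000000), (50, 1000), (50, 1000000)]
for slope, total in cases:
    trend = slope * log(1.0 + total)
    print("%6d %8d %10.1f" % (slope, total, trend))

Varying the slope and the total independently gives a feel for how the log factor changes the picture at different scales.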

Hope that helps.

Simon
I don't follow. The log scale factor is clearly inflating the score of the popular item. In Python:
>>> [(d, d*math.log(1.+t)) for (d,t) in [(1000,100),(1000,100000)]]
[(1000, 4615.1205168412598), (1000, 11512.935464920229)]
Yang
+1  A: 

Another way to look at it is this:

Suppose your page and my page are made on the same day, and your page gets about ten million total views while mine gets about one million up to some point. Then suppose the slope at that point is a million for me and half a million for you. If you just use the slope, I win; but your page already had more views per day at that point. Yours was getting 5 million a day and mine 1 million, so a million more on mine still only makes it 2 million, while yours is 5.5 million for that day. So maybe this scaling concept is there to adjust the results so that your page also shows up as a good trend setter: its slope is smaller, but it was already more popular. And since the scaling is only a log factor, it doesn't seem too problematic to me.
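Plugging those rough numbers into the tutorial's formula (the totals and slopes are from the example above; how they map onto the variables is my guess):

from math import log

my_slope, my_total = 1000000, 1000000       # my page: +1M in a day, ~1M views total
your_slope, your_total = 500000, 10000000   # your page: +0.5M in a day, ~10M views total

print(my_slope * log(1.0 + my_total))       # about 13.8 million
print(your_slope * log(1.0 + your_total))   # about 8.1 million

The smaller page still comes out ahead on trend here, but the log weight narrows the gap compared with using the slope alone.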

+1  A: 

As the in-line comment says, this is a simple "baseline trend algorithm", which basically means that before you compare the trends of two different pages, you have to establish a baseline. In many cases the mean value is used; it's straightforward if you plot the pageviews against the time axis. This method is widely used in monitoring water quality, air pollutants, etc. to detect any significant changes w.r.t. the baseline.

In the OP's case, the slope of pageviews is weighted by the log of total_pageviews. This sorta uses total_pageviews as a baseline correction for the slope. As Simon put it, this strikes a balance between two pages with very different total_pageviews. For example, A has a slope of 500 over 1,000,000 total pageviews, while B's is 1000 over 1,000. The log basically means 1,000,000 is ONLY about twice as important as 1,000 (rather than 1000 times more). If you only consider the slope, A is less popular than B. But with the weight, the measure of popularity of A is now roughly the same as B's. I think it is quite intuitive: though A's slope is only 500 pageviews, that's because it's saturating, and you still gotta give it enough credit.
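To check that with the actual formula (using the made-up 500/1,000,000 and 1000/1,000 figures above):

from math import log

trend_a = 500 * log(1.0 + 1000000)   # big page, small slope
trend_b = 1000 * log(1.0 + 1000)     # small page, big slope
print(trend_a)   # about 6908
print(trend_b)   # about 6909

The two trends come out nearly identical, which is the balancing effect I mean.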

As for the error, I believe it comes from the (relative) standard error, which has a factor 1/sqrt(n), where n is the number of data points. In the code, the error is equal to (1/sqrt(n))*(1/sqrt(mean)). It roughly translates into: the more data points, the more accurate the trend. I don't see it as an exact math formula, just a brute trend analysis algorithm; anyway, the relative value is more important in this context.
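A quick sanity check of that factorization, with made-up daily counts:

from math import sqrt

pageviews = [120, 80, 100, 95, 105]   # made-up daily counts, n = 5
total = sum(pageviews)                # 500
n = len(pageviews)
mean = total / float(n)               # 100
print(1.0 / sqrt(total))                      # what the code computes: ~0.0447
print((1.0 / sqrt(n)) * (1.0 / sqrt(mean)))   # same value

Both prints give the same number, so the 1.0/sqrt(total_pageviews) in the code is indeed (1/sqrt(n))*(1/sqrt(mean)).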

In summary, I believe it's just an empirical formula. More advanced topics can be found in some biostatistics textbooks (this is very similar to monitoring the outbreak of a flu or the like).

Dingle
Right, I understood the mechanics of it. I just disagree that it's intuitive to say that B's growth should be weighed less than A's - although B isn't as popular, there's also something to be said about its relative and sudden surge in clicks; conversely, A's growth falls well within its standard deviation, and should be seen as less significant. I suppose this particular formula is really more of a measure of baseline popularity. As for books - I was really hoping for specific recommendations!
Yang