views: 891
answers: 4
What's the rationale behind the formula used in the hive_trend_mapper.py program of this Hadoop tutorial on calculating Wikipedia trends?

There are actually two components: a monthly trend and a daily trend. I'm going to focus on the daily trend, but similar questions apply to the monthly one.

In the daily trend, pageviews is an array of number of page views per day for this topic, one element per day, and total_pageviews is the sum of this array:

# pageviews for most recent day
y2 = pageviews[-1]
# pageviews for previous day
y1 = pageviews[-2]
# Simple baseline trend algorithm
slope = y2 - y1
trend = slope * log(1.0 + int(total_pageviews))
error = 1.0/sqrt(int(total_pageviews))
return trend, error

I know what it's doing superficially: it just looks at the change over the past day (slope), and scales it by the log of 1+total_pageviews (log(1)==0, so this scaling factor is non-negative). It can be seen as treating the month's total pageviews as a weight, but one that is tempered as it grows - this way, the total pageviews stop making much of a difference for things that are "popular enough," but at the same time big changes on insignificant topics don't get weighed as much.
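For anyone who wants to run this outside Hadoop, here is a self-contained sketch of the snippet above (the wrapper function and the sample numbers are mine, not the tutorial's):

from math import log, sqrt

def daily_trend(pageviews):
    # pageviews: list of daily view counts for one topic
    total_pageviews = sum(pageviews)
    y2 = pageviews[-1]   # most recent day
    y1 = pageviews[-2]   # previous day
    slope = y2 - y1
    trend = slope * log(1.0 + int(total_pageviews))
    error = 1.0 / sqrt(int(total_pageviews))
    return trend, error

# same +8 one-day jump on a small topic and on a big topic
print(daily_trend([10, 12, 20]))        # trend ~30, error ~0.15
print(daily_trend([1000, 1200, 1208]))  # trend ~65, error ~0.017

The same +8 jump scores about 30 on the small topic and about 65 on the big one, which is exactly the weighting I'm asking about.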

But why do this? Why do we want to discount things that were initially unpopular? Shouldn't big deltas matter more for items that have a low constant popularity, and less for items that are already popular (for which the big deltas might fall well within a fraction of a standard deviation)? As a strawman, why not simply take y2-y1 and be done with it?

And what would the error be useful for? The tutorial doesn't really use it meaningfully again. Then again, it doesn't tell us how trend is used either - this is what's plotted in the end product, correct?

Where can I read up on the theory here (preferably at an introductory level)? Is there a name for this madness? Is this a textbook formula somewhere?

Thanks in advance for any answers (or discussion!).

+1  A: 

The code implements statistics (in this case the "baseline trend"); you should educate yourself on that, and everything becomes clearer. Wikibooks has a good introduction.

The algorithm takes into account that new pages are by definition more unpopular than existing ones (because - for example - they are linked from relatively few other places) and suggests that those new pages will grow in popularity over time.

error is the error margin the system expects for its predictions. The higher error is, the less likely it is that the trend will continue as expected.

Martin Hohenberg
Neither I nor Google could find where the baseline trend is introduced in that Wikibook. Do you have a pointer?
Yang
That book handles basic statistics, which one should understand before trying to work with the more esoteric concepts.
Martin Hohenberg
A: 

The reason for moderating the measure by the volume of clicks is not to penalise popular pages but to make sure that you can compare large and small changes with a single measure. If you just use y2 - y1 you will only ever see the click changes on large-volume pages. What this is trying to express is "significant" change. A 1000-click change when you attract 100 clicks is really significant; a 1000-click change when you attract 100,000 is less so. What this formula is trying to do is make both of these visible.

Try it out at a few different scales in Excel, you'll get a good view of how it operates.
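If you'd rather not fire up Excel, here is a quick Python sketch of the same experiment (the slope/total pairs are just made-up illustrations):

from math import log

# (one-day change, total pageviews) at a few different scales
cases = [(1000, 1000), (1000, 1000000), (50, 1000), (50, 1000000)]
for slope, total in cases:
    trend = slope * log(1.0 + total)
    print("%6d %8d %10.1f" % (slope, total, trend))

Varying the slope and the total independently gives a feel for how the log factor changes the picture at different scales.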

Hope that helps.

Simon
I don't follow. The log scale factor is clearly inflating the score of the popular item. In Python:
>>> [(d, d*math.log(1.+t)) for (d,t) in [(1000,100),(1000,100000)]]
[(1000, 4615.1205168412598), (1000, 11512.935464920229)]
Yang
+1  A: 

Another way to look at it is this:

Suppose your page and my page are made on the same day, and your page gets about ten million total views while mine gets about one million up to some point. Then suppose the slope at that point is a million for me and half a million for you. If you just use the slope, I win; but your page already had more views per day at that point. Yours was getting 5 million a day and mine 1 million, so a million more on mine still only makes it 2 million, while yours is 5.5 million for that day. So maybe this scaling concept is there to adjust the results so that your page also shows up as a good trend setter: its slope is smaller, but it was already more popular. And since the scaling is only a log factor, it doesn't seem too problematic to me.
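Plugging those rough numbers into the tutorial's formula (the totals and slopes are from the example above; how they map onto the variables is my guess):

from math import log

my_slope, my_total = 1000000, 1000000       # my page: +1M in a day, ~1M views total
your_slope, your_total = 500000, 10000000   # your page: +0.5M in a day, ~10M views total

print(my_slope * log(1.0 + my_total))       # about 13.8 million
print(your_slope * log(1.0 + your_total))   # about 8.1 million

The smaller page still comes out ahead on trend here, but the log weight narrows the gap compared with using the slope alone.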

+1  A: 

As the in-line comment says, this is a simple "baseline trend algorithm", which basically means that before you compare the trends of two different pages, you have to establish a baseline. In many cases the mean value is used; it's straightforward if you plot the pageviews against the time axis. This method is widely used in monitoring water quality, air pollutants, etc. to detect any significant changes w.r.t. the baseline.

In the OP's case, the slope of pageviews is weighted by the log of total_pageviews. This sorta uses total_pageviews as a baseline correction for the slope. As Simon put it, this strikes a balance between two pages with very different total_pageviews. For example, A has a slope of 500 over 1,000,000 total pageviews, while B's is 1000 over 1,000. The log basically means 1,000,000 is ONLY about twice as important as 1,000 (rather than 1000 times more). If you only consider the slope, A is less popular than B. But with the weight, the measure of popularity of A is now roughly the same as B's. I think it is quite intuitive: though A's slope is only 500 pageviews, that's because it's saturating, and you still gotta give it enough credit.
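To check that with the actual formula (using the made-up 500/1,000,000 and 1000/1,000 figures above):

from math import log

trend_a = 500 * log(1.0 + 1000000)   # big page, small slope
trend_b = 1000 * log(1.0 + 1000)     # small page, big slope
print(trend_a)   # about 6908
print(trend_b)   # about 6909

The two trends come out nearly identical, which is the balancing effect I mean.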

As for the error, I believe it comes from the (relative) standard error, which has a factor 1/sqrt(n), where n is the number of data points. In the code, the error is equal to (1/sqrt(n))*(1/sqrt(mean)). It roughly translates into: the more data points, the more accurate the trend. I don't see it as an exact math formula, just a brute trend analysis algorithm; anyway, the relative value is more important in this context.
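A quick sanity check of that factorization, with made-up daily counts:

from math import sqrt

pageviews = [120, 80, 100, 95, 105]   # made-up daily counts, n = 5
total = sum(pageviews)                # 500
n = len(pageviews)
mean = total / float(n)               # 100
print(1.0 / sqrt(total))                      # what the code computes: ~0.0447
print((1.0 / sqrt(n)) * (1.0 / sqrt(mean)))   # same value

Both prints give the same number, so the 1.0/sqrt(total_pageviews) in the code is indeed (1/sqrt(n))*(1/sqrt(mean)).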

In summary, I believe it's just an empirical formula. More advanced topics can be found in some biostatistics textbooks (this is very similar to monitoring the outbreak of a flu or the like).

Dingle
Right, I understood the mechanics of it. I just disagree that it's intuitive to say that B's growth should be weighed less than A's - although B isn't as popular, there's also something to be said about its relative and sudden surge in clicks; conversely, A's growth falls well within its standard deviation, and should be seen as less significant. I suppose this particular formula is really more of a measure of baseline popularity. As for books - I was really hoping for specific recommendations!
Yang