views: 91
answers: 3

I'm working on a project where I need to sort a list of user-submitted articles by their popularity (last week, last month and last year).

I've been mulling this over for a while, but I'm not a great statistician, so I figured I could maybe get some input here.

Here are the variables available:

  • Time [date] the article was originally published
  • Time [date] the article was recommended by editors (if it has been)
  • Number of votes the article has received from users (total, in the last week, in the last month, in the last year)
  • Number of times the article has been viewed (total, in the last week, in the last month, in the last year)
  • Number of times the article has been downloaded by users (total, in the last week, in the last month, in the last year)
  • Comments on the article (total, in the last week, in the last month, in the last year)
  • Number of times a user has saved the article to their reading list (total, in the last week, in the last month, in the last year)
  • Number of times the article has been featured on a kind of "best we've got to offer" (editorial) list (total, in the last week, in the last month, in the last year)
  • Time [date] the article was dubbed 'article of the week' (if it has been)

Right now I'm applying a weight to each variable and dividing by the number of times the article has been read. That's pretty much all I could come up with after reading up on weighted means. My biggest problem is that some user-submitted articles are always at the top of the popularity list, probably because the author is "cheating".

I'm thinking of emphasizing the importance of the article being relatively new, but I don't want to "punish" articles that are genuinely popular just because they're a bit old.

Anyone with a more statistically adept mind than mine willing to help me out?

Thanks!

+1  A: 

There are any number of ways to do this, and what works for you will depend on your actual dataset and the outcomes you desire for specific articles. As a rough reworking, though, I would suggest moving the read count in with the weighted variables and dividing by the age of the article instead, since the older an article is, the more likely it is to have high numbers in each category.

For example

// x[i] = any given variable above
// w[i] = weighting for that variable
// age  = days since published OR
//        days since editor recommendation OR
//        average of both OR
//        ...
score = (x[1]*w[1] + ... + x[n]*w[n]) / age

Your problem of wanting to promote new articles more but not wanting to punish genuinely popular old articles requires consideration of how you can tell whether or not an article is genuinely popular. Then just use the "genuine-ness" algorithm to weight the votes or views rather than a static weighting. You can also change any of the other weightings to be functions rather than constants, and then have non-linear weightings for any variables you wish.

// Fw = some non-linear function
// (possibly multi-variable) that calculates
// a sub-score for the given variable(s)  
score = (Fw1(x[1]) + ... + FwN(x[n])) / FwAge(age)
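
For concreteness, here is a minimal sketch of that shape of scoring function in R. The variable names, the particular transforms, and the age penalty are all illustrative assumptions on my part, not recommended weights:

# Hedged sketch: non-linear sub-scores divided by an age function.
# Every name and transform here is an illustrative assumption.
score_article <- function(votes, views, saves, age_days) {
  f_votes <- sqrt(votes)          # diminishing returns damp inflated counts
  f_views <- log1p(views)
  f_saves <- 2 * sqrt(saves)      # saves treated as a stronger signal
  f_age   <- 1 + log1p(age_days)  # gentle penalty, so old hits survive
  (f_votes + f_views + f_saves) / f_age
}

score_article(votes = 120, views = 4500, saves = 30, age_days = 14)

The square roots and logs give each additional vote or view less marginal impact, which by itself blunts crude vote-stuffing.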
jball
Thanks. I'll look into the age parameter and see if I get better results. I'm inclined to agree with Mark that the most difficult part of this is how I'll weight each variable. I'll see what results I come up with, and look into the "genuineness" of the articles by inspecting the logging procedures.
AmITheRWord
+2  A: 

I think the weighted means approach is a good one. But I think there are two things you need to work out.

  1. How to weigh the criteria.
  2. How to prevent "gaming" of the system.

How to weigh the criteria

This question falls under the domain of Multi-Criteria Decision Analysis. Your approach is the Weighted Sum Model. In any computational decision making process, ranking the criteria is often the most difficult part of the process. I suggest you take the route of pairwise comparisons: how important do you think each criterion is compared to the others? Build yourself a table like this:

        c1    c2    c3   ...
c1       1     4     2
c2      1/4    1    1/2
c3      1/2    2     1
...

This shows that C1 is 4 times as important as C2, which is half as important as C3. Use a finite pool of weight, say 1.0, since that's easy. Distributing it over the criteria (taking C2's share as w) gives 4w + 2w + w = 1, so roughly C1 = 4/7, C3 = 2/7, C2 = 1/7. Where discrepancies arise (for instance, if you think C1 = 2*C2 = 3*C3, but also C3 = 2*C2), that's a good error indication: it means your relative rankings are inconsistent, so go back and reexamine them. This procedure of deriving weights from pairwise comparisons is essentially the Analytic Hierarchy Process (AHP), and it is well documented.
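
For what it's worth, the weights can also be pulled out of such a matrix mechanically via its principal eigenvector, which is the standard AHP computation. A quick R sketch using the example table above:

# Pairwise comparison matrix from the table above
A <- matrix(c(  1,   4,   2,
              1/4,   1, 1/2,
              1/2,   2,   1), nrow = 3, byrow = TRUE)
n <- nrow(A)
e <- eigen(A)
# Principal eigenvector, normalised to sum to 1, gives the weights
w <- Re(e$vectors[, 1])
w <- w / sum(w)   # 4/7, 1/7, 2/7 for this matrix
# Consistency index: 0 for a perfectly consistent matrix
CI <- (Re(e$values[1]) - n) / (n - 1)

A consistency index well above zero is the formal version of the "go back and reexamine your rankings" advice.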

Now, this all probably seems a bit arbitrary at this point; the numbers are, for the most part, pulled out of your own head. So I'd suggest taking a sample of maybe 30 articles and ranking them the way "your gut" says they should be ordered (often you're more intuitive than you can express in numbers). Then finagle the weights until they produce something close to that ordering, as in the sketch below.
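
To make "close to that ordering" measurable, you can compare your gut ranking with the ranking your current weights produce, for example with Spearman's rank correlation. A toy sketch in R (the data here are fabricated placeholders):

set.seed(1)
gut_rank <- 1:30                               # your manual ordering, 1 = best
scores   <- 31 - gut_rank + rnorm(30, sd = 3)  # stand-in for computed scores
computed_rank <- rank(-scores)                 # higher score = better rank
cor(gut_rank, computed_rank, method = "spearman")  # 1 means perfect agreement

Tune the weights until this correlation gets close to 1.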

Preventing gaming

This is the second important aspect. No matter what system you use, if you can't prevent "cheating", it will ultimately fail. You need to be able to limit voting (should a single IP be able to recommend a story twice?) and to prevent spam comments. The more important the criterion, the more you need to protect it from being gamed.

Mark Peters
Thanks! I'll get on building that table. :) There's already a system in place to prevent gaming (to a degree), but I'll look into improving that too.
AmITheRWord
+1  A: 

You can use outlier theory to detect anomalies. A very naive way of looking for outliers is the Mahalanobis distance. This measure takes the spread of your data into account and calculates the relative distance of each point from the center; it can be interpreted as the number of standard deviations an article lies from the center. It will, however, also flag genuinely very popular articles, but it gives you a first indication that something is odd.
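
For reference, the squared Mahalanobis distance of an observation x from the center mu with covariance matrix S is D^2(x) = (x - mu)' S^-1 (x - mu); in one dimension this reduces to the squared z-score, which is where the "standard deviations from the center" interpretation comes from.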

A second, more general approach is to build a model. You could regress the variables that can be manipulated by users on those controlled by editors. One would expect users and editors to agree to some extent; if they don't, that's again an indication that something is odd.

In both cases you'll need to define some threshold and find a weighting based on it. A possible approach is to use the square root of the Mahalanobis distance as an inverse weight: if an article is far from the center, its score gets pulled down. The same can be done using the residuals from the model, and here you can even take the sign into account. If the editor score is lower than what the user score would predict, the residual is negative; if the editor score is higher than the user score would predict, the residual is positive, and it's very unlikely that the article is gamed. This lets you define some rules to reweigh the given scores.

An example in R:

[Figure: two side-by-side histograms, "Mahalanobis distance" and "Residuals model", with the gamed test article marked in red and the genuinely good one in dark green.]

Code:

# Test data frame generated at random
test <- data.frame(
  quoted   = rpois(100, 12),
  seen     = rbinom(100, 60, 0.3),
  download = rbinom(100, 30, 0.3)
)
# Create some link between the user variables and the editorial score
test <- within(test, {
  editorial <- round((quoted + seen + download) / 10 + rpois(100, 1))
})
# Add two test cases
test[101, ] <- c(20, 18, 13, 0) # bad article, hyped by a few spammers
test[102, ] <- c(20, 18, 13, 8) # genuinely good article

# Mahalanobis distances (squared) from the multivariate center
mah <- mahalanobis(test, colMeans(test), cov(test))
# Simple linear model of the editorial score on the user variables
mod <- lm(editorial ~ quoted * seen * download, data = test)

# The plots
op <- par(mfrow = c(1, 2))
hist(mah, breaks = 20, col = "grey", main = "Mahalanobis distance")
points(mah[101], 0, col = "red", pch = 19)
points(mah[102], 0, col = "darkgreen", pch = 19)
legend("topright", legend = c("high rated by editors", "gamed"),
  pch = 19, col = c("darkgreen", "red"))

hist(resid(mod), breaks = 20, col = "grey", main = "Residuals model",
  xlim = c(-6, 4))
points(resid(mod)[101], 0, col = "red", pch = 19)
points(resid(mod)[102], 0, col = "darkgreen", pch = 19)

par(op)
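
Building on the objects above, here is one purely illustrative way to turn this into a reweighting along the lines described earlier: pull a score down by the Mahalanobis distance, but only when the residual is negative, i.e. when editors rate the article worse than the user numbers would predict. Note that mahalanobis() returns squared distances, so sqrt(mah) is the distance itself; the raw score below is an arbitrary stand-in.

# Illustrative assumption: a simple raw score to be adjusted
raw_score <- with(test, quoted + seen + download + 5 * editorial)
# Only articles rated worse by editors than the user numbers
# predict (negative residual) are treated as suspect
suspect   <- resid(mod) < 0
adj_score <- raw_score
adj_score[suspect] <- raw_score[suspect] / (1 + sqrt(mah[suspect]))
adj_score[c(101, 102)]  # the gamed row (101) should now fall well below row 102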
Joris Meys