views:

77

answers:

3

Under the user generated posts on my site, I have an Amazon-like rating system:

   Was this review helpful to you: Yes | No

If there are votes, I display the results above that line like so:

   5 of 8 people found this reply helpful.

I would like to sort the posts based upon these rankings. If you were ranking from most helpful to least helpful, how would you order the following posts?

   a) 1/1 = 100% helpful
   b) 2/2 = 100% helpful
   c) 999/1000 = 99.9% helpful
   b) 3/4 = 75% helpful
   e) 299/400 = 74.8% helpful

Clearly, its not right to sort just on the percent helpful, somehow the total votes should be factored in. Is there a standard way of doing this?

UPDATE:

Using Charles' formulas to calculate the Agresti-Coull lower range and sorting on it, this is how the above examples would sort:

   1) 999/1000 (99.9%) = 95% likely to fall in 'helpfulness' range of 99.2% to 100%
   2) 299/400 (74.8%) = 95% likely to fall in 'helpfulness' range of 69.6% to 79.3%
   3) 3/4 (75%) = 95% likely to fall in 'helpfulness' range of 24.7% to 97.5%
   4) 2/2 (100%) = 95% likely to fall in 'helpfulness' range of 23.7% to 100%
   5) 1/1 (100%) = 95% likely to fall in 'helpfulness' range of 13.3% to 100%

Intuitively, this feels right.

UPDATE 2:

From an application point of view, I don't want to be running these calculations every time I pull up a list of posts. I'm thinking I'll either update and store the Agresti-Coull lower bound either on a regular, cron-driven schedule (updating only those posts which have received a vote since the last run) or update it whenever a new vote is received.

+4  A: 

This question is probably better asked on http://stats.stackexchange.com .

I guess you still want to order by increasing of 'helpfulness'.

If you want to know how precise a given number is, the simplest way is to use the square root of the variance of the Binomial distribution with n equal to the total number of responses and p the fraction of responses which were 'helpful'.

Andre Holzner
+1 for stats.stackexchange.com
Thilo
+1  A: 

A very simple solution would be to ignore everything with less than a cut-off amount of votes, and then sort by percentage.

For example (require at least five votes)

   1.  99.9% (1000 votes)
   2.  74.8%  (400 votes)
   3-5.  waiting for five votes
Thilo
+3  A: 

For each post, generate bounds on how helpful you expect it to be. I prefer to use the Agresti-Coull interval. Pseudocode:

float AgrestiCoullLower(int n, int k) {
  //float conf = 0.05;  // 95% confidence interval
  float kappa = 2.24140273; // In general, kappa = ierfc(conf/2)*sqrt(2)
  float kest=k+kappa^2/2;
  float nest=n+kappa^2;
  float pest=kest/nest;
  float radius=kappa*sqrt(pest*(1-pest)/nest);
  return max(0,pest-radius); // Lower bound
  // Upper bound is min(1,pest+radius)
}

Then take the lower end of the estimate and sort on this. So the 2/2 is (by Agresti-Coull) 95% likely to fall in the 'helpfulness' range 23.7% to 100%, so it sorts below the 999/1000 which has range 99.2% to 100% (since .237 < .992).

Charles
For ties (especially those at 0), I suggest breaking in favor of largest number of upvotes, then smallest number of downvotes.
Charles
wow, Charles, this is hard core. very impressive. i'll run it on my examples and see how they sort (after i spend a few minutes educating myself on Agresti-Coull at wikipedia!)
mitchf
Let me know how it goes. I can give more information and/or references as needed.
Charles
+1 for this elegant solution (ordering by the lower end of the confidence interval). Just out of curiosity: how does the size of the interval behave for number of upvotes = 0 or = number of answers ? (the plain Binomial variance goes to zero in these cases)
Andre Holzner
@Andre: Asymptotically, it decreases like 1/n, or rather C/n where C depends on the chosen confidence.
Charles