I have music items that users score from 1 to 5, and I need a formula to pick the 5 top-scored items.

But obviously an item that gets a 3.5 average score from 1000 different users should rank higher than an item that gets a 4.9 average score from only 5 users. In other words, I think that if an item gets enough attention for people to score it, that indicates the item is interesting. So the votesCount parameter needs to carry some weight in the calculation. (How much weight? I'm not sure, and I'm asking you for ideas.)

I think that we need the following parameters in the function: votesAverage, votesCount.

+4  A: 

A simple way to balance the system is to add a fixed number of hypothetical users (say the count is H) who all vote for the long-term average A of all your pieces. Say that average is 3; then the formula becomes

Score = (votesCount x votesAverage + H x A) / (votesCount + H)

Now when votesCount grows, the relative impact of the hypothetical average-voters diminishes.

You can set H experimentally, or by reasoning about it. For example, if you think that 20 votes is sufficient to establish a relatively strong rating, you could set H = 5, say.
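As a rough illustration of the formula above, here is a minimal Python sketch. The function name `smoothed_score` and its parameter names are made up for this example, and the sample numbers are the two items from the question with A assumed to be 3.

```python
def smoothed_score(votes_count: int, votes_average: float, h: float, a: float) -> float:
    """Average after adding h hypothetical votes, each equal to a.

    Implements Score = (votesCount x votesAverage + H x A) / (votesCount + H).
    Assumes h > 0, so the denominator is never zero.
    """
    return (votes_count * votes_average + h * a) / (votes_count + h)


A = 3.0  # assumed long-term average over all items
for h in (5, 20):
    few = smoothed_score(5, 4.9, h, A)      # 5 votes, average 4.9
    many = smoothed_score(1000, 3.5, h, A)  # 1000 votes, average 3.5
    print(f"H={h}: 5 votes at 4.9 -> {few:.2f}, 1000 votes at 3.5 -> {many:.2f}")
```

With H = 5 the 5-vote item still comes out ahead (3.95 against roughly 3.50); with H = 20 it drops to 3.38 and the 1000-vote item leads at about 3.49. So the choice of H decides how many votes an item needs before its own average dominates.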

antti.huima
+1 for a very interesting answer. I don't think it fits my case, though, because I don't need to show a rating; what I need is to pick the 5 items that should win.
Mendy
Well, you can sort according to this modified score and show the 5 highest.
antti.huima
+3  A: 

The reddit scoring algorithm is probably the best bet if you really want to do it the right way. It's explained in detail here and at a high level by xkcd author Randall here.

The problem is it doesn't really work for five-star ratings which is what you're going for. You should be able to generalize reddit's sorting system to use ratings. Heck, it's probably done somewhere already. I'm going to look for it.

Welbog
Since Robert has provided a good example of a five-star rating sorting system (and since I can't find one based on statistical confidence), I'm just going to leave this here. Worst case, you count ratings of 3 and higher as a positive and ratings of 2 and lower as a negative and use those results as your inputs to the Wilson score interval.
Welbog
The point of the reddit algo is to find the lower bound of a 90% confidence interval on the actual rating. It ought to be fairly easy to generalise this from yes/no ratings to a 5 star system.
Nick Johnson
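The fallback described in the comments (count ratings of 3 and higher as positive, 2 and lower as negative, and feed those counts to the Wilson score interval) could be sketched as follows. This is only an illustration: the function names are invented here, and the default `z = 1.96` (about 95% two-sided confidence) is an assumption; use roughly 1.645 for a 90% interval.

```python
import math


def wilson_lower_bound(positive: int, negative: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for the positive fraction.

    z is the normal quantile for the chosen confidence level (assumed 1.96).
    """
    n = positive + negative
    if n == 0:
        return 0.0  # no votes: nothing is known, so rank it last
    p = positive / n
    z2 = z * z
    centre = p + z2 / (2 * n)
    margin = z * math.sqrt((p * (1 - p) + z2 / (4 * n)) / n)
    return (centre - margin) / (1 + z2 / n)


def stars_to_counts(ratings):
    """Split 1-5 star ratings into (positive, negative) counts, with 3+ as positive."""
    positive = sum(1 for r in ratings if r >= 3)
    return positive, len(ratings) - positive


# Five votes, all positive, versus 1000 votes with 900 positive.
print(wilson_lower_bound(5, 0))
print(wilson_lower_bound(900, 100))
```

The item with 5 unanimous votes gets a lower bound of only about 0.57, while the item with 900 positive votes out of 1000 scores well above that, so sorting by this bound favours items with many votes. The cost is that collapsing stars to yes/no throws away the difference between a 3 and a 5.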
+6  A: 

Weighted voting for 5-star systems with lots of voters

You can use Bayesian estimates to calculate weighted voting.

IMDb (Internet Movie Database) uses this calculation to determine its IMDb Top 250. (Note: IMDb uses 10 stars but the formulas are identical using 5 stars).

The formula for calculating the Top Rated 250 Titles gives a true Bayesian estimate:

weighted rating (WR) = (v ÷ (v+m)) × R + (m ÷ (v+m)) × C

where:

  • R = average for the movie (mean) = (Rating)
  • v = number of votes for the movie = (votes)
  • m = minimum votes required to be listed in the Top 250 (currently 3000)
  • C = the mean vote across the whole report (currently 6.9)
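To show how the weighted rating can pick a top-N list, here is a small Python sketch. The `Item` type, the function names, and the sample data are all hypothetical, and computing C as the vote-weighted mean over all items is an assumption about what "the mean vote across the whole report" means.

```python
from typing import List, NamedTuple


class Item(NamedTuple):
    name: str
    votes: int      # v
    average: float  # R


def weighted_rating(v: int, r: float, m: float, c: float) -> float:
    """WR = (v / (v + m)) * R + (m / (v + m)) * C; assumes v + m > 0."""
    return (v / (v + m)) * r + (m / (v + m)) * c


def top_rated(items: List[Item], m: float, n: int = 5) -> List[Item]:
    """Return the n items with the highest weighted rating."""
    total_votes = sum(it.votes for it in items)
    if total_votes == 0:
        return []
    # C: mean of every individual vote (assumed interpretation).
    c = sum(it.votes * it.average for it in items) / total_votes
    ranked = sorted(
        items,
        key=lambda it: weighted_rating(it.votes, it.average, m, c),
        reverse=True,
    )
    return ranked[:n]


items = [
    Item("a", 5, 4.9),
    Item("b", 1000, 3.5),
    Item("c", 800, 4.2),
    Item("d", 1200, 3.9),
]
for it in top_rated(items, m=500, n=2):
    print(it.name)
```

With m = 500, item "a" (5 votes at 4.9) is pulled almost all the way to the global mean and falls behind "c" and "d", whose averages are backed by many votes. A smaller m would let it climb back up, so m is the knob that sets how much votesCount matters.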

IMDb Reference

Wikipedia Reference

Robert Cartaino
+1 I'll take a look at it, but what do you think, is it a good fit for my case? I have about 35,000 votes, 700 to 1800 per item.
Mendy
It sounds like an *ideal* match to me. Try it out with some sample (or real) data and see if the results meet your requirements.
Robert Cartaino
Thanks! I'll try this.
Mendy
Just a note for completeness that here WR = (Rv + Cm) / (v+m), which is exactly my solution also (below) when you set H=m
antti.huima
@Robert when m=0 the formula becomes WR = R = votesAverage. But I said that I want to have votesCount in the formula too...
Mendy
A: 

The term for this is a Bayesian estimate.

One common example:

Bayesian rating = (v*R + m*C)/(v+m)
where:
R = average rating of song
v = number of votes for the song
m = minimum votes required to be listed (ex. 10)
C = average vote across all songs

BlueRaja - Danny Pflughoeft
But when `m=0`, `Bayesian rating = R`. And I'm looking to keep `v` in the function.
Mendy
@Mendy... so don't set m to 0. The whole point is that you want to list the top-10 rated songs; a song with only 5 or 6 votes does not have enough votes to decide (statistically) if it is better or worse than a song with 1000 votes, even if the second one has an average of 3.0 stars and the first has all 5's
BlueRaja - Danny Pflughoeft