I'm trying to sort a bunch of products by customer ratings using a 5-star system. The site I'm setting this up for does not have a lot of ratings and continues to add new products, so it will usually have a few products with a low number of ratings.

I tried using the average star rating, but that approach falls apart when there is a small number of ratings.

For example, a product that has 3x 5-star ratings would show up higher than a product that has 100x 5-star ratings and 2x 2-star ratings.

Shouldn't the second product show up higher because its larger number of ratings makes it statistically more trustworthy?

+4  A: 

You could sort by median instead of arithmetic mean. In this case both examples have a median of 5, so both would have the same weight in a sorting algorithm.

You could use the mode to the same effect, but the median is probably a better idea.

If you want to assign additional weight to the product with 100 5-star ratings, you'll probably want to go with some kind of weighted mode, assigning more weight to products with the same median but more overall votes.
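
If it helps, here is a minimal C# sketch of the median-first idea, breaking ties by mean as the comments below suggest; the Product type and sample data are invented for illustration:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    class Product
    {
        public string Name;
        public List<int> Ratings = new List<int>();

        public double Median()
        {
            var sorted = Ratings.OrderBy(r => r).ToList();
            int mid = sorted.Count / 2;
            return sorted.Count % 2 == 1
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2.0;
        }

        public double Mean() { return Ratings.Average(); }
    }

    class MedianSortDemo
    {
        static void Main()
        {
            var products = new List<Product>
            {
                new Product { Name = "A", Ratings = { 5, 5, 5 } },
                new Product { Name = "B", Ratings = { 5, 5, 5, 5, 2, 2 } }
            };

            // Sort by median first, then break ties with the mean.
            var ranked = products.OrderByDescending(x => x.Median())
                                 .ThenByDescending(x => x.Mean());

            foreach (var p in ranked)
                Console.WriteLine("{0}: median {1}, mean {2:0.00}", p.Name, p.Median(), p.Mean());
        }
    }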

Welbog
If I were to use the median method, how would you determine which should be rated better: 5x 5-star ratings with 4x 2-star ratings, or 5x 5-star ratings with 4x 1-star ratings? Both would come up with a median of 5.
Vizjerai
That would be up to you at that point. It depends on which you think is superior. Maybe you sort first by median, then by mean. Or maybe first by median, then by total number of votes.
Welbog
Weighted median: Sort by median first, then by mean. Total number of votes improves the reliability (confidence level) of the score, but says nothing about the score itself.
richardtallent
+18  A: 

For its Top 250 movies list, IMDb uses a Bayesian estimate. This is a nice way to take the number of voters into consideration.

From IMDb's own explanation:

The formula for calculating the Top Rated 250 Titles gives a true Bayesian estimate:

weighted rating (WR) = (v ÷ (v+m)) × R + (m ÷ (v+m)) × C

where:

* R = average for the movie (mean) = (Rating)
* v = number of votes for the movie = (votes)
* m = minimum votes required to be listed in the Top 250 (currently 1300)
* C = the mean vote across the whole report (currently 6.8)

For the Top 250, only votes from regular voters are considered.
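
If it's useful, here is a small C# sketch of that formula applied to the product-rating case. The values of m and C below are placeholder assumptions (your own minimum-vote threshold and your site-wide mean rating), not IMDb's:

    using System;

    class BayesianRatingDemo
    {
        // weighted rating (WR) = (v / (v + m)) * R + (m / (v + m)) * C
        static double WeightedRating(double R, double v, double m, double C)
        {
            return (v / (v + m)) * R + (m / (v + m)) * C;
        }

        static void Main()
        {
            double m = 10;   // assumed minimum number of votes to be listed
            double C = 3.5;  // assumed mean vote across the whole catalogue

            // The question's example: 3x 5-star vs. 100x 5-star + 2x 2-star
            Console.WriteLine("{0:0.00}", WeightedRating(5.0, 3, m, C));             // 3.85
            Console.WriteLine("{0:0.00}", WeightedRating(504.0 / 102.0, 102, m, C)); // 4.81
        }
    }

With even a modest prior like this, the heavily rated product now sorts above the product with only three ratings.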

Martin Harris
A: 

Obviously, the low number of ratings puts this problem at a statistical handicap. Nevertheless...

A key element in improving the quality of an aggregate rating is to "rate the rater", i.e. to keep tabs on the ratings each particular rater has supplied (relative to others). That allows you to weight their votes during the aggregation process.

Another solution, more of a cop-out, is to supply end users with a count (or a range indication thereof) of the votes for the underlying item.
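
As a rough sketch of the weighted-aggregation part (the per-rater weights below are assumed inputs; deriving them from each rater's history is the hard part this answer alludes to):

    using System;
    using System.Linq;

    class RateTheRaterDemo
    {
        static void Main()
        {
            // (rating, weight) pairs; the weights are assumed to come from
            // how trustworthy each rater's past votes have been.
            var votes = new[]
            {
                new { Rating = 5.0, Weight = 1.0 },
                new { Rating = 5.0, Weight = 0.2 },  // low-trust rater counts for less
                new { Rating = 2.0, Weight = 0.9 }
            };

            double weighted = votes.Sum(v => v.Rating * v.Weight)
                            / votes.Sum(v => v.Weight);

            Console.WriteLine("Weighted average: {0:0.00}", weighted);  // 3.71
        }
    }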

mjv
A: 

It could be something like this:

weight = rating * (votes + 100000000000000)

Rating 5 with 1000 votes:   weight: 500000000005000
Rating 5 with 1 vote:       weight: 500000000000005
Rating 4 with 100000 votes: weight: 400000000400000
Rating 2 with 100 votes:    weight: 200000000000200
Rating 1 with 50000 votes:  weight: 100000000050000
Rating 1 with 100 votes:    weight: 100000000000100

But, as you can see, a single 5-star vote would immediately put an item above all 4-star items, no matter if they were voted on 1,000,000 times. It probably needs some adaptation to take that into account, something like a minimum number of votes required to be listed at all, let's say 300 votes.
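
As a rough illustration of that adaptation, here is a C# sketch combining the weight formula above with the suggested minimum-vote cutoff; the item data and the 300-vote threshold are just the examples from this answer:

    using System;
    using System.Linq;

    class BigConstantWeightDemo
    {
        // weight = rating * (votes + BigConstant), plus a minimum-vote cutoff
        const long BigConstant = 100000000000000;
        const int MinVotes = 300;

        static void Main()
        {
            var items = new[]
            {
                new { Name = "A", Rating = 5, Votes = 1000 },
                new { Name = "B", Rating = 5, Votes = 1 },      // dropped by the cutoff
                new { Name = "C", Rating = 4, Votes = 100000 }
            };

            var ranked = items
                .Where(i => i.Votes >= MinVotes)
                .OrderByDescending(i => i.Rating * (i.Votes + BigConstant));

            foreach (var i in ranked)
                Console.WriteLine("{0}: weight {1}", i.Name, i.Rating * (i.Votes + BigConstant));
        }
    }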

Havenard
A: 

I'd highly recommend the book Programming Collective Intelligence by Toby Segaran (O'Reilly, ISBN 978-0-596-52932-1), which discusses how to extract meaningful data from crowd behaviour. The examples are in Python, but they're easy enough to convert.

Andiih
Even though I can recommend that book to everyone who is interested in the field, your answer does not provide a solution to the question asked.
Christian Stade-Schuldt
+1  A: 

Well, depending on how complex you want to make it, you could have ratings additionally be weighted based on how many ratings the person has made, and what those ratings are. If the person has only made one rating, it could be a shill rating, and might count for less. Or if the person has rated many things in category a, but few in category b, and has an average rating of 1.3 out of 5 stars, it sounds like category a may be artificially weighed down by the low average score of this user, and should be adjusted.

But enough of making it complex. Let’s make it simple.

Assuming we're working with just two values, ReviewCount and AverageRating, for a particular item, it makes sense to me to look at ReviewCount as essentially being the "reliability" value. But we don't just want to bring scores down for low-ReviewCount items: a single 1-star rating is probably as unreliable as a single 5-star rating. So what we want to do is average towards the middle: 3.

So, basically, I'm thinking of an equation something like X * AverageRating + Y * 3 = the rating we want. To make this come out right we need X + Y to equal 1. We also need X to increase as ReviewCount increases: with a review count of 0, X should be 0 (giving us an equation that just evaluates to 3), and with an infinite review count X should be 1 (which makes the equation equal AverageRating).

So what are the X and Y equations? For the X equation we want the dependent variable to asymptotically approach 1 as the independent variable approaches infinity. A good pair of equations is something like Y = 1/(factor^RatingCount) and (using the fact that X must equal 1 - Y) X = 1 - (1/(factor^RatingCount)).

Then we can adjust "factor" to fit the range that we're looking for.

I used this simple C# program to try a few factors:

    using System;

    class AdjustedRatingDemo
    {
        static void Main()
        {
            // We can adjust this factor to adjust our curve.
            double factor = 1.5;

            // Here's some sample data
            double RatingAverage1 = 5;
            double RatingCount1 = 1;

            double RatingAverage2 = 4.5;
            double RatingCount2 = 5;

            double RatingAverage3 = 3.5;
            double RatingCount3 = 50000; // 50000 is not infinite, but it's probably plenty to closely simulate it.

            // Do the calculations: 3 / factor^count pulls towards the middle,
            // the remaining weight goes to the raw average.
            double modfactor = Math.Pow(factor, RatingCount1);
            double modRating1 = (3 / modfactor)
                + (RatingAverage1 * (1 - 1 / modfactor));

            double modfactor2 = Math.Pow(factor, RatingCount2);
            double modRating2 = (3 / modfactor2)
                + (RatingAverage2 * (1 - 1 / modfactor2));

            double modfactor3 = Math.Pow(factor, RatingCount3);
            double modRating3 = (3 / modfactor3)
                + (RatingAverage3 * (1 - 1 / modfactor3));

            Console.WriteLine(String.Format("RatingAverage: {0}, RatingCount: {1}, Adjusted Rating: {2:0.00}",
                RatingAverage1, RatingCount1, modRating1));
            Console.WriteLine(String.Format("RatingAverage: {0}, RatingCount: {1}, Adjusted Rating: {2:0.00}",
                RatingAverage2, RatingCount2, modRating2));
            Console.WriteLine(String.Format("RatingAverage: {0}, RatingCount: {1}, Adjusted Rating: {2:0.00}",
                RatingAverage3, RatingCount3, modRating3));

            // Hold up for the user to read the data.
            Console.ReadLine();
        }
    }

To save you copying and running it, here's the output it gives:

RatingAverage: 5, RatingCount: 1, Adjusted Rating: 3.67
RatingAverage: 4.5, RatingCount: 5, Adjusted Rating: 4.30
RatingAverage: 3.5, RatingCount: 50000, Adjusted Rating: 3.50

Something like that? You could obviously adjust the "factor" value as needed to get the kind of weighting you want.

Beska
+1  A: 

You can look at this page to get a good analysis:

http://www.evanmiller.org/how-not-to-sort-by-average-rating.html

Basically, you want to estimate the probability that, given the ratings you have, the "real" score (the one you'd get with infinite ratings) is greater than some quantity (like, say, the corresponding number for some other item you're sorting against).

See the article for the details, but the conclusion is that you want to use the lower bound of the Wilson score confidence interval. The article gives the equation and sample Ruby code (easily translated to another language).
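
For convenience, here is a C# sketch of that lower bound. Note it works on positive/negative counts, so a 5-star scale would first have to be mapped to up/down votes (e.g. counting 4-5 stars as positive); that mapping is my assumption, not part of the article:

    using System;

    class WilsonScoreDemo
    {
        // Lower bound of the Wilson score confidence interval for a
        // Bernoulli parameter, at roughly 95% confidence (z = 1.96).
        static double WilsonLowerBound(int positive, int total, double z = 1.96)
        {
            if (total == 0) return 0;
            double phat = (double)positive / total;
            return (phat + z * z / (2 * total)
                    - z * Math.Sqrt((phat * (1 - phat) + z * z / (4 * total)) / total))
                   / (1 + z * z / total);
        }

        static void Main()
        {
            Console.WriteLine(WilsonLowerBound(3, 3));     // ~0.44: few votes, low confidence
            Console.WriteLine(WilsonLowerBound(100, 102)); // ~0.93: many votes, high confidence
        }
    }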

Greg
A: 

If you just need a fast and cheap solution that will mostly work without a lot of computation, here's one option (assuming a 1-5 rating scale):

SELECT Products.id, Products.title, AVG(Ratings.score), etc
FROM Products
INNER JOIN Ratings ON Products.id = Ratings.product_id
GROUP BY Products.id, Products.title
ORDER BY (SUM(Ratings.score) + 25.0) / (COUNT(Ratings.id) + 20.0) DESC,
         COUNT(Ratings.id) DESC

By adding 25.0 to the sum of scores and 20.0 to the count, you're effectively mixing in 20 phantom ratings with an average score of 1.25, then sorting on that adjusted average (with the raw rating count as a tie-breaker).

This does have known issues. For example, it unfairly rewards low-scoring products with few ratings (as the chart linked below shows, a product with an average score of 1 and a single rating comes out around 1.2, while a product with an average score of 1 and 1,000+ ratings comes out just above 1.0). You could also argue it unfairly punishes high-quality products with few ratings.

This chart shows what happens for all 5 ratings over 1-1000 ratings: http://www.wolframalpha.com/input/?i=Plot3D%5B%2825%2Bxy%29/%2820%2Bx%29%2C%7Bx%2C1%2C1000%7D%2C%7By%2C0%2C6%7D%5D

You can see the dip upwards at the very bottom ratings, but overall it's a fair ranking, I think. You can also look at it this way:

http://www.wolframalpha.com/input/?i=Plot3D%5B6-%28%2825%2Bxy%29/%2820%2Bx%29%29%2C%7Bx%2C1%2C1000%7D%2C%7By%2C0%2C6%7D%5D

If you drop a marble on most places in this graph, it will automatically roll towards products with both higher scores and higher ratings.

Jordan Reiter
A: 

One option is something like Microsoft's TrueSkill system, where the score is given by mean - 3 * stddev, and the constants can be tweaked.

Yuliy