This text describes the ranking strategy used in the post Revisiting the Michelin Stars. This last aims at analyzing whether the Michelin-starred restaurants did actually get good user reviews in community-based platforms such as Google Places, TripAdvisor and Verema.

While the following ranking description must be taken as a part of the restaurants study, it has been written so it can be digested standalone. Nevertheless, I encourage you to read the original post as well in order to help you get into the context.

### Ratings, Scores and Rankings

A *ranking* is a list of items sorted by a certain criterion. In the case of restaurants, a *score* that represents users opinions of each dining place usually implements this criterion: the better the restaurant, the higher the score, and hence the higher the restaurant is ranked. For instance, a restaurant with a score of 4 is ranked higher than a restaurant with 3.5, and hence considered better. A score can be obtained using different techniques, but most of them are based on processing the user *ratings*. A rating is the grade that they assign to the restaurant usually in a 1-to-5 scale.

Note that the crucial point is how the score is computed as the value must be a fair representation of the users’ opinion about that particular restaurant. There are several ways to compute such value and the most common are described following.

### Simple Average

The simplest way to compute the restaurant score is to average the ratings of all the users that evaluated it. For instance, if 2 users rate a restaurant with 3 and 5 star, the score of the restaurant is (3+5)/2 = 4. In the formal definition, the score *S* is the average of the ratings *rn* from *N* users:

As described in Evan Miller’s post, the simple average is very easy to implement but it has an important limitation because it does not take into account the number of ratings. For example, let’s assume that restaurant A has a score of 4.5 from 100 users and that restaurant B has a score of 5 from only 1 user. In a ranking system that uses the simple average to compute the scores, restaurant B would be rated better than restaurant A, and only with the opinion of 1 user compared to the 100 users of the first. As you might have realized, this ranking is not completely fair.

### Lower Bound of the Confidence Intervals

In order to solve the simple average limitation, one can assign the restaurants score to the lower bound of the confidence interval around the average. This way the influence of the number of ratings can be introduced into the computation of the score.

A confidence interval around the average can be understood as a ‘credible’ range of values where the true average lies. For instance, if the range [3.9, 4.7] is the confidence interval of 95% around the average value of 4.3, it means that there is a 95% chance that the confidence interval range really contains the true average value. In consequence, the smaller the interval, the more accurate the average value is. In this case the score *S* is computed by substracting the confidence interval from the average *R*.

Note that what interests us most is the fact that the size of the interval is related to the number of samples: a high number of samples results in a small interval and hence an accurate average calculation. In our particular case, a restaurant with a large number of ratings will have a small confidence interval around the average. On the contrary, a restaurant with a few ratings will have a large confidence interval.

If the technique sorts by the lower bound of the confidence interval, those restaurants with a few ratings will be ranked lower even if the rating is high. In the example of the previous section, restaurant A with 100 ratings is now ranked higher than restaurant B with only 1 rating because A’s confidence interval is much smaller than B’s.

Using the lower bound of the confidence interval as the score to construct the ranking is fine when the number of ratings is large (as happens with restaurant A). The main problem, though, arises in those restaurants with a few ratings because they are ranked very low alongside other ‘bad’ restaurants with many ratings. This could be the case of restaurant B that has only one, but high, rating. If the lower bound is used as the restaurant score, B will be ranked at the bottom of the restaurants list because its confidence interval is very large. Note that, eventually restaurant B gets more ratings, hence the confidence interval becomes smaller and the score gets closer to the real user ratings average. This effect is depicted in the next diagram.

In conclusion, ranking by lower bound is a good solution for items with a high number of ratings. However, at any time there will be restaurants with a few ratings so scoring by the lower bound of the confidence interval seems not to be the right solution neither.

### True Bayesian Average

Although the previous lower bound scoring technique works well for restaurants with a large number of user ratings (like A), it still represents a bad score calculation for those restaurants less evaluated (like B). Recall that, conceptually, the effect of the lower bound technique can be represented as B’s score starting with a very low value and growing up towards the real B’s average as more user opinions are received.

Intuitively, however, you might want to rank a restaurant with a few ratings near the global average score instead of sending it at the bottom of the list. This can be achieved using the True Bayesian Average technique as the restaurant score calculation. Easily explained, the score for restaurants with a few ratings is set close to the average of all restaurants ratings (restaurant B in next picture). Instead, for restaurants with a high number of ratings, the score is set to its real average (restaurant A in next picture).

This technique actually treats the score of a given restaurant as a weighted aggregation of (1) the contribution from its own ratings and (2) the contribution from the ratings of all the restaurants. This way, it can balance between (1) giving a bigger importance to its own ratings (i.e. restaurant A with 100 ratings), and (2) shifting the score towards the global ratings average (i.e. restaurant B with 1 rating). For instance, observe how in the next picture restaurant B with only 1 rating gets a score near the average of all restaurants. As more user ratings for restaurant B are received, its score slowly shifts towards the real ratings average.

Formally, this is the calculation used to calculate a restaurant score is a weighted combination of the averages *R* and *C*:

*S*= score of the restaurant*R*= average of user ratings for the restaurant*C*= average of user ratings for all restaurants*w*= weight assigned to*R*and computed as*v/(v+m)*, where v is the number of user ratings for that restaurant, and*m*is average number of reviews for all restaurants.

Observe that when there are a lot of ratings for a restaurant (large *v*), *w* gets close to 1 and the score pays more attention to *R* rather than *C*. On the contrary, in a restaurant with a few ratings, *v* becomes small and *w* gets close to 0, which implies that the score places a lot of weight on *C*.

Let’s now review our example with restaurants A and B to see how this Bayesian technique scores and ranks them. Let’s first assume that the average of user ratings for all restaurants (*C*) is 4.2 and the average number of reviews for all restaurants (*m*) is 40. Then recall that restaurant A has an average rating of 4.5 coming from 100 user opinions; and restaurant B has an average rating of 5 coming from only 1 user opinion.

First, observe the effect of the weighting in both cases: in A, the score 4.4 is mainly a contribution of *Ra*; in B, instead, the score primarily comes from *C*. Also note how now restaurant A is indeed ranked higher than restaurant B (4.4 > 4.2), which is fairer and more representative than with the other techniques described.

The only limitation of this Bayesian-based scoring technique is that it assumes that the restaurant ratings come from a normal distribution centered at their average. The restaurants data I am considering seems to be normally distributed (see the post Retrieving Restaurants Ratings for further details on the data sources used). I have conducted some normality tests that were based on checking the *68-95-99.7* rule. While more accurate normality tests should be performed, so far the normality assumption seems a fair enough estimate that allows the use of the True Bayesian Average technique.

However, in order to overcome the normality restriction, some alternatives would be extending the Bayesian approach and use a technique based on a multinomial distribution over ratings using Dirichlet priors.

Hi. How do the value of C=4.2 and m=40 calculated.??

Hi Narayan,

these were just example values.

– C=4.2 represents the average of all the rating of all restaurants (e.g. arithmetic mean of all the values)

– m=40 represents the average number of reviews per restaurant (e.g. 100 restaurants and 500 reviews, m=500/100=5)

edu

Thanks. That made it clear enough. But what if its a new website and the above 2 are the only restaurants being graded. The values will be m = (100+1)/2 = 50.5 and C = (4.5+5)/2 = 4.75

Then in that case , Restaurant A scores = 4.58 and restaurant B scores = 4.75

Restaurant A still ranks lower than restaurant B in-spite of having more ratings.

Thoughts?

Hi Narayanan, good point!

First, C would not be the average of the two averages Ra and Rb, but the average of all the 101 ratings (in the original example I assumed it was 4.2; in the case with only two restaurants I’ll assume C=4.6, somewhere between Ra and Rb, but much closer to Ra because A has 100 reviews).

With this, Sa=4.534 and Sb=4.603. As you said, A is still rated worse than B. Note that the bayesian technique introduces a kind of magnet towards C (representing the global average of the entire system), and B score gets reduced from 5 to 4.6 (much more fair but not enough). In the case of having only two restaurants I am afraid this won’t ever work because C will always be slightly larger than Ra.

Maybe you can add a term to S calculation that penalizes harder the low number of reviews so you end up with a Sb

So what if the the number of restaurants is greater than 2 ?…..say 3,4 , etc. Will the bayesian technique give results as per the logic mentioned in the main example? What i’ve observed using my own reserach is is one cannot be sure. For example check IMDB top 250 for e,g Sometimes some movies despite having higher number of ratings end up ranking lower than movies that are having fewer ratings. But there are enough movies in the top 250 where the technique normalises and places movies with higher ratings higher than those with lower ratings. So my conclusion is it works in some cases, while in some it doesnt.

Your thoughts ?

http://www.imdb.com/chart/top?ref_=nv_mv_250_6

Also ,

Since your earlier reply were a little late, i posted the question on the following website. Here are the links.

There are 3 questions i had asked.

My concern was the reliability of the technique in terms of the question i posted here

Q1. Why is a restaurant with more ratings still ranks lower than a restaurant with fewer ratings ?

My other concerns were,

Q2.What should be the minimum number of products in the database for the technicue to yield results as in question 1.

Q3.What should be the optimal value of m.i i.e the threshold given a set of products. This set could be anywhere between 2 and infinity. What if we get a very low value of m?

http://stats.stackexchange.com/questions/250920/implementation-of-bayes-weighted-average-formula-for-my-new-website

http://stats.stackexchange.com/questions/248555/bayes-weighted-average-understanding-with-an-example-and-2-additional-questions/249755?noredirect=1#comment475428_249755

http://stats.stackexchange.com/questions/248555/bayes-weighted-average-understanding-with-an-example-and-2-additional-questions/249755#249755

A gentleman called tim told me that the technique is as it is and is reliable as long as there are atleast 2 restaurants in the database. The value of c CAN be the average of the two. Also, there is nothing more than this logic that one can work on.

continued…..

As regards the value of optimal m, i didnt get a definite answer….

I tried another technique to normalise the ratings. But the gap between the final scores are quite apart. But it does normalize and place products with higher ratings higher than those with lower ratings.

Here is the link.

http://stats.stackexchange.com/questions/247221/recalculate-ratings?noredirect=1#comment470655_247221

your thoughts on the above ?

Hi Narayan! I have been, and I still am, quite busy and I do not have the time I would like to reply. I’ll check all the links you provide (but note the last one is broken, 404…)

Would like to connect with you to discuss the above. I’m working on a gigantic project w=for which my understanding of the bayesian estimate is critical.

my mail – id : narayanan.iyer86@gmail.com