This text describes the ranking strategy used in the post Revisiting the Michelin Stars. This last aims at analyzing whether the Michelin-starred restaurants did actually get good user reviews in community-based platforms such as Google Places, TripAdvisor and Verema.
While the following ranking description must be taken as a part of the restaurants study, it has been written so it can be digested standalone. Nevertheless, I encourage you to read the original post as well in order to help you get into the context.
Ratings, Scores and Rankings
A ranking is a list of items sorted by a certain criterion. In the case of restaurants, a score that represents users opinions of each dining place usually implements this criterion: the better the restaurant, the higher the score, and hence the higher the restaurant is ranked. For instance, a restaurant with a score of 4 is ranked higher than a restaurant with 3.5, and hence considered better. A score can be obtained using different techniques, but most of them are based on processing the user ratings. A rating is the grade that they assign to the restaurant usually in a 1-to-5 scale.
Note that the crucial point is how the score is computed as the value must be a fair representation of the users’ opinion about that particular restaurant. There are several ways to compute such value and the most common are described following.
The simplest way to compute the restaurant score is to average the ratings of all the users that evaluated it. For instance, if 2 users rate a restaurant with 3 and 5 star, the score of the restaurant is (3+5)/2 = 4. In the formal definition, the score S is the average of the ratings rn from N users:
As described in Evan Miller’s post, the simple average is very easy to implement but it has an important limitation because it does not take into account the number of ratings. For example, let’s assume that restaurant A has a score of 4.5 from 100 users and that restaurant B has a score of 5 from only 1 user. In a ranking system that uses the simple average to compute the scores, restaurant B would be rated better than restaurant A, and only with the opinion of 1 user compared to the 100 users of the first. As you might have realized, this ranking is not completely fair.
Lower Bound of the Confidence Intervals
In order to solve the simple average limitation, one can assign the restaurants score to the lower bound of the confidence interval around the average. This way the influence of the number of ratings can be introduced into the computation of the score.
A confidence interval around the average can be understood as a ‘credible’ range of values where the true average lies. For instance, if the range [3.9, 4.7] is the confidence interval of 95% around the average value of 4.3, it means that there is a 95% chance that the confidence interval range really contains the true average value. In consequence, the smaller the interval, the more accurate the average value is. In this case the score S is computed by substracting the confidence interval from the average R.
Note that what interests us most is the fact that the size of the interval is related to the number of samples: a high number of samples results in a small interval and hence an accurate average calculation. In our particular case, a restaurant with a large number of ratings will have a small confidence interval around the average. On the contrary, a restaurant with a few ratings will have a large confidence interval.
If the technique sorts by the lower bound of the confidence interval, those restaurants with a few ratings will be ranked lower even if the rating is high. In the example of the previous section, restaurant A with 100 ratings is now ranked higher than restaurant B with only 1 rating because A’s confidence interval is much smaller than B’s.
Using the lower bound of the confidence interval as the score to construct the ranking is fine when the number of ratings is large (as happens with restaurant A). The main problem, though, arises in those restaurants with a few ratings because they are ranked very low alongside other ‘bad’ restaurants with many ratings. This could be the case of restaurant B that has only one, but high, rating. If the lower bound is used as the restaurant score, B will be ranked at the bottom of the restaurants list because its confidence interval is very large. Note that, eventually restaurant B gets more ratings, hence the confidence interval becomes smaller and the score gets closer to the real user ratings average. This effect is depicted in the next diagram.
In conclusion, ranking by lower bound is a good solution for items with a high number of ratings. However, at any time there will be restaurants with a few ratings so scoring by the lower bound of the confidence interval seems not to be the right solution neither.
True Bayesian Average
Although the previous lower bound scoring technique works well for restaurants with a large number of user ratings (like A), it still represents a bad score calculation for those restaurants less evaluated (like B). Recall that, conceptually, the effect of the lower bound technique can be represented as B’s score starting with a very low value and growing up towards the real B’s average as more user opinions are received.
Intuitively, however, you might want to rank a restaurant with a few ratings near the global average score instead of sending it at the bottom of the list. This can be achieved using the True Bayesian Average technique as the restaurant score calculation. Easily explained, the score for restaurants with a few ratings is set close to the average of all restaurants ratings (restaurant B in next picture). Instead, for restaurants with a high number of ratings, the score is set to its real average (restaurant A in next picture).
This technique actually treats the score of a given restaurant as a weighted aggregation of (1) the contribution from its own ratings and (2) the contribution from the ratings of all the restaurants. This way, it can balance between (1) giving a bigger importance to its own ratings (i.e. restaurant A with 100 ratings), and (2) shifting the score towards the global ratings average (i.e. restaurant B with 1 rating). For instance, observe how in the next picture restaurant B with only 1 rating gets a score near the average of all restaurants. As more user ratings for restaurant B are received, its score slowly shifts towards the real ratings average.
Formally, this is the calculation used to calculate a restaurant score is a weighted combination of the averages R and C:
- S = score of the restaurant
- R = average of user ratings for the restaurant
- C = average of user ratings for all restaurants
- w = weight assigned to R and computed as v/(v+m), where v is the number of user ratings for that restaurant, and m is average number of reviews for all restaurants.
Observe that when there are a lot of ratings for a restaurant (large v), w gets close to 1 and the score pays more attention to R rather than C. On the contrary, in a restaurant with a few ratings, v becomes small and w gets close to 0, which implies that the score places a lot of weight on C.
Let’s now review our example with restaurants A and B to see how this Bayesian technique scores and ranks them. Let’s first assume that the average of user ratings for all restaurants (C) is 4.2 and the average number of reviews for all restaurants (m) is 40. Then recall that restaurant A has an average rating of 4.5 coming from 100 user opinions; and restaurant B has an average rating of 5 coming from only 1 user opinion.
First, observe the effect of the weighting in both cases: in A, the score 4.4 is mainly a contribution of Ra; in B, instead, the score primarily comes from C. Also note how now restaurant A is indeed ranked higher than restaurant B (4.4 > 4.2), which is fairer and more representative than with the other techniques described.
The only limitation of this Bayesian-based scoring technique is that it assumes that the restaurant ratings come from a normal distribution centered at their average. The restaurants data I am considering seems to be normally distributed (see the post Retrieving Restaurants Ratings for further details on the data sources used). I have conducted some normality tests that were based on checking the 68-95-99.7 rule. While more accurate normality tests should be performed, so far the normality assumption seems a fair enough estimate that allows the use of the True Bayesian Average technique.
However, in order to overcome the normality restriction, some alternatives would be extending the Bayesian approach and use a technique based on a multinomial distribution over ratings using Dirichlet priors.