Using Bayesian Averages for User Reviews

Oscar Syu
Published in Building Niche
Dec 1, 2021


A look behind the scenes at creating a fair system for aggregating user reviews.

At Niche.com, one of our standout features is the ability for users to leave reviews on colleges, K-12 schools, and neighborhoods. It’s part of what allows us to fulfill our mission of helping people find where they belong: real reviews inform our users’ understanding of what it’s really like to study or live somewhere. Collecting reviews, taking an average, and posting them on the site should, logically, be pretty easy to do. But accurately representing that information in a way that is fair to all of our colleges, schools, and neighborhoods is more challenging, especially as we have entities (our term for colleges, schools, and neighborhoods) of varying sizes and ages.

Niche’s reviews are collected from users who want to leave feedback on their college, K-12 school, or neighborhood. Users rate their entity on a scale from 1 to 5, with higher values indicating a perception of better quality. We then aggregate those reviews and use them in our rankings as noted on our methodology pages (under Methodologies, you can find the factor breakdowns by percentage for each of our rankings). When Niche first started collecting reviews and using them in our rankings, we quickly ran into the question of how best to represent those ratings on site, as we realized that taking a simple mean of all the reviews for an entity wouldn’t be enough. Chiefly, large entities, like a big historical university, are more likely to have many reviews compared to a smaller, newer college. For smaller colleges, there is more uncertainty about students’ opinions because the sample size is smaller, making the impact of each individual review on the college’s score greater than it would be for a large university.

Imagine a relatively new and small college with a theoretical average rating of 3.5 on a 5-point scale if every possible review over time were collected. If only 4 students who overwhelmingly loved their experience actually responded on Niche, all with scores of 5, this school would be rated much higher than it should be. This puts big, historical schools at a disadvantage, as they would need many more reviews to be rated comparably. Conversely, a small school’s rating would be greatly impacted if a few negative reviews came in and dragged its average far below what it should be, putting it at a disadvantage compared to large schools, which can rely on a large number of existing reviews as a buffer. As a result, simply using the arithmetic average, where we divide the sum of all the user ratings for an entity by the number of user ratings for that entity, is not enough. It’s a problem that mirrors many challenges in statistics: how do we build something that retains the information we gain from raw data, but also reflects common sense?

To mitigate this issue, Niche tried a couple of approaches before settling on our current one. The first approach involved setting a threshold for the number of reviews needed in order for an entity’s reviews to be considered in the rankings process. This approach, while easy to implement, introduced a few problems. If the threshold were too high, it would exclude a large number of entities that we otherwise have enough data points to reasonably ascertain user opinion about. Conversely, if it were too low, we would see substantial volatility year over year in ranking eligibility and consequently in our rankings. For example, if a neighborhood had 4 five-star reviews in 2020, given a 5-review threshold, it wouldn’t be ranked. But if it received 2 more five-star reviews in 2021, not only would the neighborhood’s reviews be eligible for consideration in the rankings process, it would also put this entity near the top of the list. A situation where an unranked entity potentially becomes a top-ranked entity introduces a level of uncertainty that makes ranking lists difficult to compare from year to year and erodes trust in our rankings, as constantly changing lists do not reflect reality.

As a result, Niche had to move on to a different solution that takes into account response count in addition to average score. So, how do we solve this issue? With a Bayesian average! As a quick and very simplified overview, the driving philosophy behind Bayesian statistics, as opposed to frequentist thinking, is that a pre-existing belief (a prior) is incorporated into a calculation, and beliefs are updated as more information comes in (a more detailed guide is here for more information). This idea models how people in general make decisions when they have limited information; we start with a general idea of what the answer will be and refine that notion when we learn more facts.

The Bayesian average, x̄ᵦ, for any one entity is defined as follows:

x̄ᵦ = (C·m + Σᵢ₌₁ⁿ xᵢ) / (C + n)

Here, C is a constant, m is a prior mean, xᵢ is the rating user i gives an entity, and n is the total number of reviews submitted for an entity. To use the Bayesian average, we plug our prior belief about what the entity’s score may be into m, and choose a reasonable value for C, which essentially controls how much effect the prior belief has on the overall Bayesian average. The choice of C can also be interpreted as adding C reviews with a score of m to the average. Initially, Niche chose its constant based on a percentile of the number of reviews submitted per entity for colleges and K-12 schools. While it does make some sense to tie the value of C to the quantity of reviews we have, it ultimately made comparing rankings year over year difficult, as the value of C naturally changes over the course of a year as more reviews are submitted. As a result, the current value of C was selected by analyzing the respective review distributions of colleges, K-12 schools, and neighborhoods, and is seldom changed, for the sake of consistency. That said, we recognize it as a lever for creating a more equitable platform and consequently reassess it regularly. For more information on choosing C, check out the Wikipedia page for a quick overview.
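As a quick, purely illustrative calculation (reusing the hypothetical small college from earlier, along with the values C = 10 and m = 3.5 that the simulations later in this post also use), the prior keeps a handful of perfect early reviews from pinning the score at 5:

# Illustrative values only; C = 10 and m = 3.5 match the simulation settings later in this post.
C, m = 10, 3.5
reviews = [5, 5, 5, 5]  # four enthusiastic early reviews

arithmetic_avg = sum(reviews) / len(reviews)              # 5.0
bayes_avg = (C * m + sum(reviews)) / (C + len(reviews))   # (35 + 20) / 14 ≈ 3.93

With only four reviews, the prior dominates; as real reviews accumulate, their weight grows and the Bayesian average drifts toward the arithmetic average.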

No worries if this looks confusing! The Bayesian average certainly looks a little different from the typical mean we take. Let’s break down this equation from its original form and understand its components. Rewriting this equation, we get:

x̄ᵦ = C·m / (C + n) + (Σᵢ₌₁ⁿ xᵢ) / (C + n)

Or, since the sum of all the scores is also equal to the mean of the scores multiplied by the number of reviews (Σᵢ xᵢ = n·x̄, where x̄ is the arithmetic average of the submitted reviews), we can write:

x̄ᵦ = C·m / (C + n) + n·x̄ / (C + n)

And one more step! Bringing the m and x̄ terms outside of their respective fractions, we get:

x̄ᵦ = m · [C / (C + n)] + x̄ · [n / (C + n)]

We can clearly see that the Bayesian average is a weighted average of m and x̄. If we treat C as the number of reviews of score m we pre-populate the entity with, we would in effect have C + n reviews. This means we are essentially weighting m by C/(C + n), the proportion of the C + n total reviews that carry the prior score m, and x̄ by n/(C + n), the proportion of the C + n total reviews that are user-submitted.
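As a quick numeric sanity check of this algebra (the scores below are arbitrary illustrative values), the original form and the weighted form agree:

import numpy as np

C, m = 10, 3.5
scores = np.array([4, 5, 3, 4, 5], dtype=float)  # arbitrary example ratings
n = len(scores)

original_form = (C * m + scores.sum()) / (C + n)
weighted_form = m * (C / (C + n)) + scores.mean() * (n / (C + n))

print(original_form, weighted_form)  # both ≈ 3.73
assert np.isclose(original_form, weighted_form)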

The Bayesian average provides a number of advantages compared to using a simple threshold. For one, it allows all entities to have their reviews considered in the rankings process. It also gives us extra confidence that a few extreme reviews won’t have a drastic effect on an entity’s rating. It more accurately represents students’ and users’ experiences and creates a fairer platform for our schools, universities, and neighborhoods. It essentially strikes a balance between what we observe and what we expect, and updates that statistic as more information becomes available. In addition, it provides extra flexibility, allowing us to add complexity, such as a time decay to de-emphasize older reviews.
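To make that last point concrete, here is one rough sketch of how a time decay could be layered onto the Bayesian average. This is a hypothetical illustration using exponential decay weights and an invented half_life parameter, not a description of Niche’s actual implementation:

import numpy as np

def time_decayed_bayesian_average(C, m, scores, ages_in_years, half_life=2.0):
    """
    Hypothetical sketch: weight each review by 0.5 ** (age / half_life), so a review
    that is half_life years old counts half as much as a brand-new one, then fold
    the weighted reviews into the usual Bayesian average.
    """
    scores = np.asarray(scores, dtype=float)
    weights = 0.5 ** (np.asarray(ages_in_years, dtype=float) / half_life)
    # The effective number of reviews becomes the sum of the weights rather than the raw count.
    return (C * m + np.sum(weights * scores)) / (C + np.sum(weights))

# Example: two fresh 5s and two four-year-old 2s
print(time_decayed_bayesian_average(10, 3.5, [5, 5, 2, 2], [0, 0, 4, 4]))  # ≈ 3.68

Down-weighting a review by its age effectively shrinks n for entities with stale reviews, so the prior regains influence until fresher reviews arrive.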

Our implementation of Bayesian averages was partly inspired by Evan Miller’s series of posts on Bayesian averages and other techniques for ranking using ratings. We would highly recommend reading them for a detailed look at the math behind these techniques.

Putting it in Practice

Let’s try some examples out with a Bayesian average. One dimension we can analyze is the relationship between the arithmetic mean and the Bayesian average, with all else controlled. Below, we can see 5 plots of varying m and x̄ combinations that show the distance between the arithmetic average and the Bayesian average (y-axis) as the number of reviews increases (x-axis).

Different scenarios for different combinations of m and the arithmetic mean

There are a few things to note here:

  • As expected, we see that as the number of reviews increases, the distance between the arithmetic average and the Bayesian average decreases.
  • The closer the two initial averages are, the faster they converge together.
  • For choices of m that lean towards the extremes of 1 or 5, naturally, more reviews are needed for values of x̄ that are far from m to converge. While that’s a simple observation, it’s an important one that acts as a sanity check: it confirms that using the average of all the entities’ ratings as the prior (which for Niche is towards the center of the 1–5 scale) is a good choice, since it keeps the gap the Bayesian average needs to close small. The short derivation after this list makes that precise.
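
This behavior falls directly out of the weighted form above. Subtracting the Bayesian average from the arithmetic average gives:

x̄ − x̄ᵦ = (x̄ − m) · C / (C + n)

so the gap between the two averages is proportional to how far the prior m sits from the arithmetic average x̄, and it shrinks roughly like 1/(C + n) as reviews come in.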

Above, we looked at the relationship between m and x̄, but in the real world, we have control over m and no ability to set x̄. Let’s try to simulate what would happen in the real world by running 1000 trials of users submitting up to 500 reviews for an entity. We’ll look at what happens to the Bayesian average as more reviews are generated and what happens when we tweak our parameters.

First, let’s define our variables. We’ve already defined C, m, x̄ᵦ, and n, and now we’ll add ε, which is the neighborhood of the arithmetic average we deem acceptable. Essentially, we’re saying, “if the Bayesian average is within ε of the arithmetic average, they are similar enough.” Note: it is possible, albeit extremely unlikely, that the Bayesian average leaves the epsilon range after entering it, but for the purposes of this exercise we will choose to ignore that possibility. The choice of ε depends on use case and personal preference; we’ll use 0.25, which is 5% of our scale. This creates a buffer region of 0.5 around the arithmetic average, or 10% of the 1 to 5 scale.

Another factor to consider is the standard deviation of responses. The standard deviation, as a quick recap, is a measure of how varied our data is. The smaller the standard deviation, the more we expect to get values close to the mean, and the larger it is, the more spread out the values are. We’ll set the standard deviation to 0.5. Let’s also set C as 10, n ranging from 0 to 500 reviews, and we’ll use m = 3.5, a reasonable estimate for the average rating for entities.

To simulate our experiment, let’s first define a few functions. choose_scores simulates a set of scores an entity may receive. The ordering of the array also represents the order in which the reviews are submitted. bayesian_average calculates the Bayesian average of our reviews.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


def choose_scores(review_mean, review_sd, num_reviews):
    """
    Returns an array of num_reviews review scores drawn from a normal distribution N(review_mean, review_sd)
    """
    return np.random.normal(review_mean, review_sd, num_reviews)


def bayesian_average(C, m, scores):
    """
    Calculates the Bayesian average for an array of review scores
    """
    return (C * m + np.sum(scores)) / (C + len(scores))

run_one_simulation is the function that will perform one trial of an entity receiving a set of reviews.

def run_one_simulation(C, pop_mean, epsilon, review_mean, review_sd, num_reviews):
    """
    Runs one simulation of an entity receiving a set of num_reviews scores
    """
    bayesian_averages = np.array([])
    reviews = choose_scores(review_mean, review_sd, num_reviews)
    cumulative_arithmetic_average = np.cumsum(reviews) / np.arange(1, len(reviews) + 1)
    arithmetic_mean = np.mean(reviews)
    for n in np.arange(num_reviews):
        reviews_in = reviews[:n + 1]  # all reviews submitted so far
        avg = bayesian_average(C, pop_mean, reviews_in)
        bayesian_averages = np.append(bayesian_averages, avg)
    # index of the first Bayesian average that falls within epsilon of the arithmetic mean
    cross_threshold = np.nonzero((bayesian_averages > arithmetic_mean - epsilon) & (bayesian_averages < arithmetic_mean + epsilon))[0][0]
    return bayesian_averages, cross_threshold, review_sd, arithmetic_mean, cumulative_arithmetic_average

Let’s run this one time. Take for example a school that has an arithmetic average of 4.5 if all 500 reviews were collected. Running a simulation, we could get something like this:

epsilon = 0.25  # the epsilon we chose above
one_run, threshold, sd, arithmetic_mean, cumulative_arith_avg = run_one_simulation(10, 3.5, epsilon, 4.5, 0.5, 500)

fig, ax = plt.subplots(figsize=(15, 15))
ax.plot(np.arange(1, 501), one_run, label="Bayesian Average")
ax.plot(np.arange(1, 501), cumulative_arith_avg, label="Running Arithmetic Mean")
ax.plot([0, 499], [arithmetic_mean, arithmetic_mean], label="True Arithmetic Mean", color="steelblue")
ax.axhspan(arithmetic_mean - epsilon, arithmetic_mean + epsilon, alpha=0.25, color="skyblue", label="epsilon range")
ax.legend()
A single simulation of a school receiving a set of reviews over time

Here, we can see that the Bayesian average approaches the arithmetic average as more reviews are submitted, ending with a Bayesian average of 4.49. At 38 reviews, the Bayesian average also crosses into the epsilon region with a score of 4.27. What we see is that within relatively few reviews, the two averages can be considered the same: the Bayesian average gets quickly pulled into the epsilon range before leveling off as it approaches the arithmetic average.

Now that we have done a one-trial simulation, let’s repeat this process 1000 times, playing around with our different parameters. For the three plots underneath, we’ve averaged our metrics across 1000 trial runs and plotted them against the number of reviews submitted, from 1 to 500. As we can see, averaging over this many runs creates a very smooth-looking plot. We’ve chosen to use a non-parametric, simulation-based approach as it doesn’t assume any model (other than generating values from a normal distribution) and gives us a sense of how reviews are submitted in the real world. We’ve plotted three different types of plots: the first shows the Bayesian average against the number of reviews (compared against the arithmetic average, with the epsilon range highlighted in light blue); the second shows the absolute distance between the Bayesian average and the arithmetic average against the number of reviews submitted; and the third shows the percentage of trials where the Bayesian average is within epsilon of the arithmetic average.
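For reference, here is a minimal sketch of how such an aggregation loop could look, reusing run_one_simulation and the numpy import from above. The helper name run_many_simulations and the exact metrics returned are our own additions for illustration, not necessarily how these plots were generated:

def run_many_simulations(C, pop_mean, epsilon, review_mean, review_sd, num_reviews, num_trials=1000):
    """
    Runs num_trials independent simulations and averages the per-review-count metrics across
    trials: the Bayesian average, its absolute distance from each trial's arithmetic mean,
    and the fraction of trials currently within epsilon.
    """
    bayes_curves = np.zeros((num_trials, num_reviews))
    distances = np.zeros((num_trials, num_reviews))
    in_epsilon = np.zeros((num_trials, num_reviews))
    for t in range(num_trials):
        bayes, _, _, arith_mean, _ = run_one_simulation(C, pop_mean, epsilon, review_mean, review_sd, num_reviews)
        bayes_curves[t] = bayes
        distances[t] = np.abs(bayes - arith_mean)
        in_epsilon[t] = distances[t] < epsilon
    return bayes_curves.mean(axis=0), distances.mean(axis=0), in_epsilon.mean(axis=0)

# Example: sweep over a few values of C between 10 and 50
for C in (10, 20, 30, 40, 50):
    avg_bayes, avg_distance, pct_in_epsilon = run_many_simulations(C, 3.5, 0.25, 4.5, 0.5, 500)

The same loop can be repeated while varying m or review_sd instead of C to produce experiments like the other two below.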

Below, we’ve held everything constant from the one-trial example, except that C now ranges between 10 and 50. As expected, the Bayesian average takes longer to converge to the arithmetic average as C increases. The choice of C makes a big difference, since a C of 50 is equivalent to pre-populating 50 reviews at the prior mean that our data has to overcome. Striking a balance between protecting the average from outliers and allowing the Bayesian average to reflect reality is a tough problem. While there are tips and formulas for picking C, many times it’s more of an art than a science.

Bayesian average performance against the arithmetic average for different values of C

Below, we now vary m between 0 and 5 to see what its effect would be. While we do see that values of m farther from our arithmetic mean of 4.5 take longer to reach the epsilon region, the differences between settings of m are not very large, and all of them converge to the arithmetic average. It’s only when we go very far from 4.5, around m = 1 and m = 1.5, that we see a substantial need for more reviews before the two averages can be considered the same. We also see (here and in all three of our experiments) a large initial jump in the percentage of trials that enter the epsilon region very quickly. The biggest takeaway from these plots is that our choice of m depends upon our prior expectation of the entity’s rating, which, without more information, is the average across all entities. And as a sanity check, we see that by picking this value, the Bayesian average accurately reflects an entity’s score within relatively few reviews.

Bayesian average performance against the arithmetic average for different values of m

Finally, we try varying the standard deviation of the reviews from 0.1 to 1. Quickly, we see that our choice of SD, when averaged across 1000 trials, doesn’t meaningfully change how the Bayesian average converges to the arithmetic average. The SD does affect any individual trial to a substantial extent, but from a macro perspective it clearly isn’t the driving force: the Bayesian average and distance-from-arithmetic-mean graphs plot essentially the same line over each other, while the % of trials in the epsilon region does show some variation. As the SD increases, it becomes somewhat harder for the Bayesian average to converge to the arithmetic average. Overall, the main takeaway is that while we have to acknowledge the effect of the standard deviation, at the end of the day we have no control over this aspect of the review-submission process, nor is it a driving cause of differences between the Bayesian and arithmetic averages.

Bayesian average performance against the arithmetic average for different values of SDs

Summary

From our exploration, we demonstrate a few key findings:

  • In our experiments with C, m, and the standard deviation, we see that the most important parameter to determine is C, as it effectively sets how many reviews of value m we pre-populate our Bayesian average with.
  • By deconstructing the Bayesian average formula, we see that using this tool in an effective manner is an act of balancing m, our prior belief, with x̄, our observed data.
  • While C is a difficult number to pick, what we do have to keep in mind is our data, our use case, and the importance of being consistent with our choice.
  • As we expect, the more reviews we have, the better and more accurate our Bayesian average is.

As a result, the Bayesian average is a very useful tool for our problem of balancing raw user data with our expectations to build a fair platform for our entities. While the Bayesian average can be tweaked to fit the use cases we are interested in (for example, with time decays), in its simplest form it is an easy-to-use and intuitive tool that better serves our users and entities by providing accurate representations of people’s experiences. As is often the case, difficult and challenging problems sometimes require only a simple solution.
