Using Euclidean Distances to Cluster Similar Graduate Programs: A Niche.com Case Study

Victoria Zhang
Published in Building Niche
6 min read · Nov 4, 2019


Our Goals and Mission

At Niche, we try to help people discover the schools that are right for them. We’re still in the early stages, and suggesting other schools that may interest users based on what they’re currently looking at is a good first step. One feature currently on our Colleges and K-12 verticals is the “similar entities” block, which shows users schools similar to the one they are viewing, and users have told us they like it. This is not only a user benefit but also an SEO benefit: interlinking between pages helps Google bots crawl our website and improves our organic search standings.

An example of our similar schools block on our graduate schools vertical

Our methods for selecting similar schools in the past relied heavily on user behavior, as well as grouping categories of schools together when user behavior was not enough. This led us to the eventual question: what do we do when we know next to nothing about the entities?

We spent a lot of time collecting proprietary data on graduate schools, the programs they offer, and which specific business, law, and medical schools existed. We named these schools “Level 2 graduate schools,” since their Level 1 counterparts were reported by the government. For example, The University of Pittsburgh is the Level 1 school, and the Katz School of Business is the Level 2 school. We also matched the programs that each school offers, so we would know that the Katz School of Business offers 90 business degrees. The problem was this: we had a group of around 1,000 entities that we knew next to nothing about. How could we get them into groups of similar entities to show on the website?

The Method Identification Process

Every Monday, a group of product analysts gets together for an “analyst meeting” where we discuss the different quantitative analyses we are thinking through and working on. One problem we discussed was adding groups of similar entities to profiles that we knew very little about. I thought about the data we currently had: user data telling us which colleges were similar, associations between the colleges and graduate schools, and of course, the information we painstakingly collected, namely each of the programs offered at each school.

After discussing the information we had, I brought this question to the group, and we settled on trying a straight Euclidean distance on quantitative factors. One difficulty of using Euclidean distances on sparse matrices was just that: the sparseness of the data. We needed to do some tricky data manipulation to make the information we collected on programs specific enough to count but general enough to surface meaningful similarities between graduate schools. To make this work, I rolled each program a given graduate school offered up into its corresponding “sublevel,” a mapping we maintain to help users who are browsing at a more general level. For example, Business Analytics rolls up into the sublevel of Business.

Code snippets for program rollups
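The snippets themselves were screenshots, so here is a minimal sketch of the roll-up in R with dplyr, under some assumptions: a hypothetical programs frame with level2_school_id, program, and sublevel columns, and program counts as the denominator (the production metric may have differed).

    library(dplyr)

    # Hypothetical input: one row per program offered at a Level 2 school,
    # already mapped to its sublevel (e.g., "Business Analytics" -> "Business").
    programs <- tibble::tribble(
      ~level2_school_id, ~program,             ~sublevel,
      "katz",            "Business Analytics", "Business",
      "katz",            "Finance",            "Business",
      "pitt-law",        "Tax Law",            "Law"
    )

    # Roll programs up to sublevel counts, then express each sublevel
    # as a share of the school's total offerings.
    sublevel_pcts <- programs %>%
      count(level2_school_id, sublevel, name = "n_programs") %>%
      group_by(level2_school_id) %>%
      mutate(pct_of_programs = n_programs / sum(n_programs)) %>%
      ungroup()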

A colleague also suggested another way to improve the accuracy of the results: selecting candidates first.

The Candidate Selection Process

For a given graduate school, we looked at which colleges were similar to its corresponding college. We then compared the graduate schools associated with that college to the graduate schools associated with those similar colleges. For example, if we know that college A is similar to college B based on the college data, we can compare all of their associated Level 2 entities X, Y, Z, R, W, and V to each other.

Something I found in my notebook to visually describe what’s going on in my head.

Once we know which Level 1 school a Level 2 school is associated with, we immediately know which candidates to select from. For example, for the Level 2 entities associated with The University of Michigan-Ann Arbor, there were 65 candidates to select from.
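As a rough illustration (with made-up table and column names, not our production schema), the candidate pool can be built by walking from each Level 2 school up to its Level 1 school, across to that school's similar Level 1 schools, and back down to their Level 2 schools:

    library(dplyr)

    # Hypothetical inputs: one row per Level 2 school and its parent Level 1
    # school, and one row per similar Level 1 pair from the college data.
    level2_to_level1 <- tibble::tribble(
      ~level2_school_id, ~level1_school_id,
      "X", "A",  "Y", "A",  "Z", "A",
      "R", "B",  "W", "B",  "V", "B"
    )
    similar_level1 <- tibble::tribble(
      ~level1_school_id, ~similar_level1_school_id,
      "A", "B"
    )

    # Walk Level 2 -> its Level 1 -> similar Level 1s -> their Level 2s.
    candidates <- level2_to_level1 %>%
      inner_join(similar_level1, by = "level1_school_id") %>%
      inner_join(level2_to_level1,
                 by = c("similar_level1_school_id" = "level1_school_id"),
                 suffix = c("", "_candidate")) %>%
      select(level2_school_id,
             candidate_level2_school_id = level2_school_id_candidate)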

When we tried this candidate selection process, however, it turned out that the pool of candidates was too narrow to provide accurate or sensible results to users. The University of Michigan School of Social Work was being matched to schools that were not recognizably related, like the Weinberg College of Arts and Sciences. The tradeoff was that the similarity in prestige shared by candidates chosen through the selection process was missing from the general approach.

The Data Manipulation Process

I wrote out a couple of steps so that I would keep all the data manipulations straight:

  1. First, I had to get all the Level 2 entities associated with the Level 1 entities.
  2. I had to roll up major counts into sublevels and express those as percentages of each student body.
  3. Then I had to associate every Level 2 entity with one row of program data (sketched after this list).
  4. Then I joined the similar Level 2 schools to the Level 1 schools based on the relationship between those Level 1 colleges and their similar colleges.
  5. Finally, with that data, I could use a Euclidean distance to match on the percentage of each major and get a matrix that compares every school in each candidate group with every other school in that group.
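Step 3 is essentially a pivot: turning the long sublevel percentages into one row per Level 2 school, with one column per sublevel and zeros where a school offers nothing in that sublevel. A minimal sketch with dplyr and tidyr, reusing the hypothetical sublevel_pcts frame from the earlier sketch:

    library(dplyr)
    library(tidyr)

    # One row per Level 2 school, one column per sublevel, zero-filled.
    # This wide frame is what the Euclidean distance is computed over.
    program_matrix <- sublevel_pcts %>%
      select(level2_school_id, sublevel, pct_of_programs) %>%
      pivot_wider(names_from = sublevel,
                  values_from = pct_of_programs,
                  values_fill = 0)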

After getting the data frame with all of the programs and Level 2 schools, I ran Philentropy’s Euclidean distance function on it.

Some of the “resourceful” code I ended up writing.
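The real code lived in that screenshot; the distance call itself is roughly the sketch below, where program_matrix is the hypothetical wide frame from the previous sketch:

    library(philentropy)

    # Drop the id column so only the numeric sublevel percentages remain,
    # then compute pairwise Euclidean distances between schools (rows).
    dist_matrix <- distance(program_matrix[, -1], method = "euclidean")

    # Carry the school ids along as row and column names for readability.
    rownames(dist_matrix) <- program_matrix$level2_school_id
    colnames(dist_matrix) <- program_matrix$level2_school_id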

It returned a matrix of every school compared to every other school, with the distance between each school and itself being zero.

The last step amalgamates this data and returns a data frame where every school gets matched with its six most similar schools.
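That snippet was also a screenshot; a sketch of the same amalgamation with dplyr and tidyr, assuming the dist_matrix object from the sketch above, might look like this:

    library(dplyr)
    library(tidyr)

    # Reshape the square distance matrix into long form, drop self-matches,
    # and keep the six nearest schools for every school.
    top_six <- as.data.frame(dist_matrix) %>%
      mutate(school = rownames(dist_matrix)) %>%
      pivot_longer(-school, names_to = "similar_school", values_to = "distance") %>%
      filter(school != similar_school) %>%
      group_by(school) %>%
      slice_min(distance, n = 6, with_ties = FALSE) %>%
      ungroup()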

Timeframe & Constraints

We’re a small company committed to doing a ton of things, and as a product manager and product analyst, I knew this analysis did not have much time to come to fruition. I threw it together in a couple of hours because our goal was to improve interlinking on these pages while giving users useful content; it wasn’t about being perfect. That said, I still tried both methods and found that skipping candidate selection, especially since the group was not overly large, worked. Additionally, this method gave us a better distribution of links among the Level 2 graduate schools, which we assume gives us a better organic search boost.

There are plenty of other clustering or machine learning approaches I could have used. I had thought about encoding categorical variables and computing cosine similarities. However, the Euclidean distance calculation was computationally simple and fairly straightforward to implement.

The Results

We got a list! We put it in the data, and we put it on the site! In the future, I would definitely want to come up with more comprehensive candidate selection measures so that we can improve these similar entities.
