A Case for Visual Regression Testing

Brendan Murray · Published in Building Niche · Dec 8, 2017 · 6 min read


A visual diff of my alma mater’s scatterplot

The Niche platform is always expanding; our sitemap contains more than a million pages and is still growing. Every time we push a code or data deployment, our team is responsible for making sure nothing breaks, and depending on how smoothly we are moving, we might do this multiple times a week.

Deploying a web application of this size can seem intimidating. We have always had a comprehensive regression test plan, but brevity has never been its strong suit.

A year ago, when we launched the new Niche platform, every deployment required about three hours of manual testing.

For context on how repetitive this process is: in the first three months of the platform, we deployed code changes to production dozens of times. Doing this manually was becoming a liability: the time to verify deployments was increasing as we added new features. With this in mind, we began to leverage automated tools, and today manual testing plays a significantly reduced role in our deployment process.

Verifying that a million pages are ready to push is impossible unless you can focus on a representative set of them. Thankfully, we have found a strategy for doing this programmatically, and visual regression testing helps make this difficult process routine.

Waldo

We built Waldo, a visual regression test suite named after Where's Waldo, whose tests are dynamically generated at run time.

This tool was built with wdio-visual-regression-service, which is part of the WebdriverIO project.
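To give a flavor of what that looks like, here is a minimal sketch of wiring the service into a WebdriverIO config. This is illustrative rather than Waldo's actual setup: the screenshot paths and the getScreenshotName helper are assumptions.

```js
// wdio.conf.js: a minimal sketch of hooking up wdio-visual-regression-service.
// The paths and the getScreenshotName helper are illustrative, not Waldo's real config.
const path = require('path');
const VisualRegressionCompare = require('wdio-visual-regression-service/compare');

// Build a screenshot file name from the test context (test title + browser).
function getScreenshotName(basePath) {
  return function (context) {
    const testName = context.test.title.replace(/\s+/g, '-');
    const browserName = context.browser.name;
    return path.join(basePath, `${testName}_${browserName}.png`);
  };
}

exports.config = {
  // ...specs, capabilities, framework, etc.
  services: ['visual-regression'],
  visualRegression: {
    compare: new VisualRegressionCompare.LocalCompare({
      referenceName: getScreenshotName('screenshots/reference'),
      screenshotName: getScreenshotName('screenshots/current'),
      diffName: getScreenshotName('screenshots/diff'),
      misMatchTolerance: 0.01, // roughly 1 mismatched pixel per 10,000
    }),
  },
};
```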


Waldo leverages a form of metaprogramming first implemented by our VP of Engineering, Geoff Misek, and was pair-programmed with our Senior Software Engineer, Shawn Rancatore.

Stay tuned for more details on the actual architecture of Waldo in a future post; for now we will dive into why we built it.

Niche profiles

University of Pittsburgh

Niche profile pages change more often than any other type of page on our site. Profiles are built from “blocks” of content; each block has its own theme.

For example: college profiles can have a report card, a section on campus life, and another on the price of admission. Some of these blocks draw their data from persistent databases that change infrequently; others have content that can change dynamically when users submit new information.

We often update profiles with new informational fields, and each time we do, thousands of pages change. Whenever we make updates like this, we want to make sure we get them right. Visual regression tests are an excellent safety net to help us identify exactly what changes.

What does a visual diff look like?

These tests are simple: capture an image of an environment and compare it to a reference image. If any difference between the two exists, flag the test and generate a new image that highlights any changes.
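As a rough sketch (not our actual test code), a single check using the service's checkDocument command looks something like this. The profile URL is illustrative, and we assume WebdriverIO's synchronous Mocha style:

```js
// A minimal visual regression test, assuming WebdriverIO's sync Mocha mode.
// The profile URL is illustrative.
const assert = require('assert');

describe('college profile', function () {
  it('should match its reference image', function () {
    browser.url('https://www.niche.com/colleges/university-of-pittsburgh/');

    // Capture the full document, compare it to the stored reference image,
    // and write a highlighted diff image if the two differ.
    const results = browser.checkDocument();

    results.forEach(function (result) {
      assert.ok(
        result.isWithinMisMatchTolerance,
        `Mismatch of ${result.misMatchPercentage}%; review the generated diff image`
      );
    });
  });
});
```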

The following animation shows three versions of an Academics block:
1. A reference image of Niche.com
2. A screenshot of our staging environment
3. A generated image highlighting their differences

Tredyffrin-Easttown School District’s Academics block

This diff shows a few changes: in the time between our staging “snapshot” of production and the time our reference image was taken, users submitted new SAT and ACT scores. A few users from this district also expressed interest in schools in the “Popular Colleges” list.

These are the sort of changes we call false failures; nonetheless, they assure us that nothing in this section changed unexpectedly.

How can we leverage diffs?

If a page passes a pixel-perfect image comparison, we can deploy it with a high level of confidence.

If it does not, we have a generated diff image to look over, and any changes on the page are immediately clear.

There are two invaluable wins with this strategy: we can focus our attention only on pages that change and see exactly where they change.

Verifying profile updates

This is a section of a full document profile diff generated during a deploy that included significant profile updates.

It is clear that something on this page changed. Looking closer, the Admissions block has changed in two ways: it has a new fact (Early Decision/Early Action) and a new call to action (the Improve Your Test Scores link). These additions also increase the total height of the block and are easier to parse in this block level diff:

These changes are what we were expecting. Unfortunately, because they increase the height of the block, they obscure any potential changes below it.

This illustrates a weakness of full document diffs, which are generally only useful for finding the first y-axis difference on a page.

Generating meaningful diffs

One way we deal with the brittleness of full document diffs is by creating redundant tests that represent a subset of existing coverage.

If a full document profile diff passes, our test runner moves on to the next profile.

If that test fails, the test runner goes through each block on the page and diffs them individually.

With these somewhat redundant tests, we can generate meaningful diffs on elements near the bottom of the page even if elements above them change significantly. This is helpful for pinpointing the precise location of any changes, but it increases run time significantly.
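In spirit, the pattern looks like the sketch below. This is a simplification of our runner rather than its real code, and the block selectors and profile URL are hypothetical:

```js
// Sketch of the "full document first, block-level fallback" pattern.
// The block selectors and profile URL are hypothetical.
const assert = require('assert');

const blockSelectors = ['#report-card', '#academics', '#admissions' /* ...one per block */];

describe('Carnegie Mellon University profile', function () {
  it('matches its reference images', function () {
    browser.url('https://www.niche.com/colleges/carnegie-mellon-university/');

    // First pass: one full-document comparison.
    const documentResults = browser.checkDocument();
    if (documentResults.every(result => result.isWithinMisMatchTolerance)) {
      return; // Pixel-identical; the runner moves on to the next profile.
    }

    // Second pass: the document diff failed, so diff each block individually
    // to pinpoint changes even below blocks whose height shifted.
    blockSelectors.forEach(function (selector) {
      browser.checkElement(selector).forEach(function (result) {
        assert.ok(
          result.isWithinMisMatchTolerance,
          `Block ${selector} changed by ${result.misMatchPercentage}%`
        );
      });
    });
  });
});
```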

For example, if Carnegie Mellon’s full document profile diff fails, our test runner will end up running 19 additional tests. Here’s what the reference images for this single profile look like:

Costs

While we do generate false failures, they are not hard to spot; spot-checking highlighted diffs is significantly easier than trying to stumble across subtle regressions. If manual testing is like looking for a needle in a haystack, visual regression tests are like using a metal detector. False failures are worth this inversion of scope: instead of scanning every page, we only review what the tests flag.

With a mismatch tolerance of 1 out of every 10,000 pixels, our tests have to ignore any elements on the page that are inherently highly variable. For this reason, we omit scatterplots from diffs entirely, which is why one of the reference images above is blank.
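Concretely, ignoring an element can be as simple as hiding it before the screenshot is captured. The sketch below assumes the hide option exposed by the underlying wdio-screenshot library and a hypothetical scatterplot selector; it is not what Waldo literally does:

```js
// Hide inherently volatile elements (here, a scatterplot) before capturing,
// so they cannot trip the tight mismatch tolerance. The selectors are
// hypothetical; the `hide` option comes from the underlying wdio-screenshot options.
const assert = require('assert');

it('Admissions block matches, ignoring the scatterplot', function () {
  const results = browser.checkElement('#admissions', {
    hide: ['.profile-scatterplot'],
  });
  results.forEach(result => assert.ok(result.isWithinMisMatchTolerance));
});
```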

Learnings

We initially tried implementing visual regression testing on maps and quickly realized our profiles were amenable to these tests as well. After expanding our coverage to include full document profile diffs, we found we could benefit from granular, block level tests.

Waldo has over 500 tests, and writing these manually would have created an absurd amount of boilerplate code. Instead, we wrote a few data-driven test templates and implemented a test runner that uses these templates to build tests for each profile at run time.
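In spirit, the approach looks like the sketch below; the profile entries are hypothetical stand-ins for the real data that drives Waldo:

```js
// Sketch of data-driven test generation: one template, many profiles.
// The profile entries are hypothetical stand-ins for real data.
const assert = require('assert');

const profiles = [
  {
    name: 'University of Pittsburgh',
    url: '/colleges/university-of-pittsburgh/',
  },
  {
    name: 'Carnegie Mellon University',
    url: '/colleges/carnegie-mellon-university/',
  },
  // ...hundreds more, loaded from data rather than written by hand
];

// The same template is stamped out for every profile at run time.
profiles.forEach(function (profile) {
  describe(`${profile.name} profile`, function () {
    it('full document matches its reference image', function () {
      browser.url(`https://www.niche.com${profile.url}`);
      browser.checkDocument().forEach(function (result) {
        assert.ok(result.isWithinMisMatchTolerance);
      });
    });
  });
});
```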

Thanks to this strategy, we effectively wrote more tests than lines of code!

Implementing tests with such comprehensive coverage was possible because we already had robust, granular, and focused test suites. Adding visual regression tests alongside them has been a great learning experience and as a result we are now more confident in our deployments than ever.

If we’re speaking your language, check out our hiring page.
