Testing Blog
Where do our flaky tests come from?
Monday, April 17, 2017
author: Jeff Listfield
When tests fail on code that was previously tested, this is a strong signal that something is newly wrong with the code. Before, the tests passed and the code was correct; now the tests fail and the code is not working right. The goal of a good test suite is to make this signal as clear and directed as possible.
Flaky (nondeterministic) tests, however, are different. Flaky tests are tests that exhibit both a passing and a failing result with the same code. Given this, a test failure may or may not mean that there's a new problem. And trying to recreate the failure, by rerunning the test with the same version of code, may or may not result in a passing test. We start viewing these tests as unreliable and eventually they lose their value. If the root cause is nondeterminism in the production code, ignoring the test means ignoring a production bug.
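To make the definition concrete, here is a minimal sketch of how a flaky test could be identified from run records. It is an illustration only, not Google's actual tooling; the function name `find_flaky_tests` and the `(test_name, code_version, passed)` record shape are assumptions made for this example.

```python
from collections import defaultdict


def find_flaky_tests(runs):
    """Return the names of tests that both passed and failed at one code version.

    `runs` is an iterable of (test_name, code_version, passed) tuples
    (a hypothetical record shape chosen for this sketch).
    """
    outcomes = defaultdict(set)  # (test, version) -> set of observed results
    for test_name, code_version, passed in runs:
        outcomes[(test_name, code_version)].add(bool(passed))
    # Flaky: some single version saw both a pass (True) and a failure (False).
    return {test for (test, _version), seen in outcomes.items() if len(seen) == 2}


runs = [
    ("login_test", "v1", True),
    ("login_test", "v1", False),   # same code, different result: flaky
    ("search_test", "v1", True),
    ("search_test", "v2", False),  # failure follows a code change: not flaky
]
print(find_flaky_tests(runs))  # {'login_test'}
```

Note that `search_test` is not reported: its failure coincides with a code change, which is exactly the clear signal a good test suite is supposed to give.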
Flaky Tests at Google
Google has around 4.2 million tests that run on our continuous integration system. Of these, around 63 thousand have a flaky run over the course of a week. While this represents less than 2% of our tests, it still causes significant drag on our engineers.
If we want to fix our flaky tests (and avoid writing new ones) we need to understand them. At Google, we collect lots of data on our tests: execution times, test types, run flags, and consumed resources. I've studied how some of this data correlates with flaky tests and believe this research can lead us to better, more stable testing practices. Overwhelmingly, the larger the test (as measured by binary size, RAM use, or number of libraries built), the more likely it is to be flaky. The rest of this post will discuss some of my findings.
For a previous discussion of our flaky tests, see John Micco's post from May 2016.
Test size - Large tests are more likely to be flaky
We categorize our tests into three general sizes: small, medium and large. Every test has a size, but the choice of label is subjective. The engineer chooses the size when they initially write the test, and the size is not always updated as the test changes. For some tests it doesn't reflect the nature of the test anymore. Nonetheless, it has some predictive value. Over the course of a week, 0.5% of our small tests were flaky, 1.6% of our medium tests were flaky, and 14% of our large tests were flaky [1]. There's a clear increase in flakiness from small to medium and from medium to large. But this still leaves open a lot of questions. There's only so much we can learn looking at three sizes.
The larger the test, the more likely it will be flaky
There are some objective measures of size we collect: test binary size and RAM used when running the test [2]. For these two metrics, I grouped tests into equal-sized buckets [3] and calculated the percentage of tests in each bucket that were flaky. The numbers below are the r² values of the linear best fit [4].
Correlation between metric and likelihood of test being flaky

| Metric | r² |
| --- | --- |
| Binary size | 0.82 |
| RAM used | 0.76 |
The tests that I'm looking at are (for the most part) hermetic tests that provide a pass/fail signal. Binary size and RAM use correlated quite well when looking across our tests and there's not much difference between them. So it's not just that large tests are likely to be flaky, it's that the larger the tests get, the more likely they are to be flaky.
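The bucketing and best-fit procedure described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the analysis code used for the study: the function names, the `(metric_value, is_flaky)` input shape, and the synthetic data are all hypothetical. It uses equal-width buckets and an ordinary least-squares line.

```python
def bucket_flaky_rates(tests, num_buckets):
    """Group tests into equal-width buckets by metric value.

    `tests` is a list of (metric_value, is_flaky) pairs. Returns a list of
    (bucket_midpoint, fraction_flaky) for every non-empty bucket.
    """
    values = [value for value, _ in tests]
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets or 1.0  # avoid zero width if all values are equal
    totals = [0] * num_buckets
    flaky = [0] * num_buckets
    for value, is_flaky in tests:
        idx = min(int((value - lo) / width), num_buckets - 1)  # clamp the maximum value
        totals[idx] += 1
        flaky[idx] += int(is_flaky)
    return [
        (lo + (i + 0.5) * width, flaky[i] / totals[i])
        for i in range(num_buckets)
        if totals[i]
    ]


def linear_fit(points):
    """Least-squares line through (x, y) points; returns (slope, intercept, r2).

    Assumes at least two distinct x values and non-constant y values.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual: distance between an observed point and the fitted line.
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in points)
    ss_tot = sum((y - mean_y) ** 2 for _, y in points)
    return slope, intercept, 1.0 - ss_res / ss_tot


# Synthetic data in which flakiness rises with size (made up for illustration).
tests = [(size, (size % 10) < (size // 20)) for size in range(100)]
points = bucket_flaky_rates(tests, num_buckets=5)
slope, intercept, r2 = linear_fit(points)
print(points, r2)
```

An r² close to 1 means the per-bucket flaky rate lies close to a straight line in the metric; it says nothing about why larger tests are flakier.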
I have charted the full set of tests below for those two metrics. Flakiness increases with increases in binary size [5], but we also see increasing linear fit residuals [6] at larger sizes.
The RAM use chart below has a clearer progression and only starts showing large residuals between the first and second vertical lines.
While the bucket sizes are constant, the number of tests in each bucket is different. The points on the right with larger residuals include far fewer tests than those on the left. If I take the smallest 96% of our tests (which ends just past the first vertical line) and then shrink the bucket size, I get a much stronger correlation (r² is 0.94). This may indicate that RAM and binary size are much better predictors than the overall charts show.
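A small sketch shows why trimming the largest tests and shrinking the buckets can help. With constant-width buckets over a skewed distribution, almost every test lands in the first bucket and the right-hand buckets hold only a handful. The helper names and the made-up size values below are assumptions for illustration only.

```python
def smallest_fraction(values, fraction):
    """Keep the given fraction of values, taking the smallest ones."""
    ordered = sorted(values)
    keep = round(len(ordered) * fraction)
    return ordered[:keep]


def bucket_counts(values, num_buckets):
    """Count how many values fall in each of `num_buckets` equal-width buckets."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets or 1.0  # avoid zero width if all values are equal
    counts = [0] * num_buckets
    for value in values:
        counts[min(int((value - lo) / width), num_buckets - 1)] += 1
    return counts


# Hypothetical, heavily skewed sizes: most tests are tiny, a few are huge.
sizes = [1] * 90 + [2] * 6 + [50, 80, 90, 100]

print(bucket_counts(sizes, 4))                           # [96, 1, 0, 3]
print(bucket_counts(smallest_fraction(sizes, 0.96), 4))  # [90, 0, 0, 6]
```

In the first call the four largest tests stretch the range so far that the other 96 collapse into one bucket. After trimming, the same number of buckets covers a much narrower range, so each bucket is narrower and the bulk of the tests is resolved in more detail.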
Certain tools correlate with a higher rate of flaky tests
Some tools get blamed for being the cause of flaky tests. For example, WebDriver tests (whether written in Java, Python, or JavaScript) have a reputation for being flaky [7]. For a few of our common testing tools, I determined the percentage of all the tests written with that tool that were flaky. Of note, all of these tools tend to be used with our larger tests. This is not an exhaustive list of all our testing tools, and represents around a third of our overall tests. The remainder of the tests use less common tools or have no readily identifiable tool.
Flakiness of tests using some of our common testing tools

| Category | % of tests that are flaky | % of all flaky tests |
| --- | --- | --- |
| All tests | 1.65% | 100% |
| Java WebDriver | 10.45% | 20.3% |
| Python WebDriver | 18.72% | 4.0% |
| An internal integration tool | 14.94% | 10.6% |
| Android emulator | 25.46% | 11.9% |
All of these tools have higher than average flakiness. And given that 1 in 5 of our flaky tests are Java WebDriver tests, I can understand why people complain about them. But correlation is not causation, and given our results from the previous section, there might be something other than the tool causing the increased rate of flakiness.
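The two percentages in the table answer different questions, which a short sketch makes explicit. The function name, the `(tool, is_flaky)` record shape, and the sample counts are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def tool_flakiness(tests):
    """Compute two percentages per tool from (tool, is_flaky) records.

    Returns {tool: (pct_of_this_tools_tests_that_are_flaky,
                    pct_of_all_flaky_tests_that_use_this_tool)}.
    """
    total_flaky = sum(1 for _, is_flaky in tests if is_flaky)
    per_tool = {}  # tool -> (test count, flaky count)
    for tool, is_flaky in tests:
        count, flaky = per_tool.get(tool, (0, 0))
        per_tool[tool] = (count + 1, flaky + int(is_flaky))
    return {
        tool: (
            100.0 * flaky / count,
            100.0 * flaky / total_flaky if total_flaky else 0.0,
        )
        for tool, (count, flaky) in per_tool.items()
    }


# Made-up sample: 4 tests use tool A (2 flaky), 16 use tool B (2 flaky).
sample = (
    [("tool_a", True)] * 2 + [("tool_a", False)] * 2
    + [("tool_b", True)] * 2 + [("tool_b", False)] * 14
)
print(tool_flakiness(sample))
# {'tool_a': (50.0, 50.0), 'tool_b': (12.5, 50.0)}
```

Both tools contribute half of the flaky tests in this sample, yet tool A's tests are four times as likely to be flaky. A widely used tool can dominate the second column while having a comparatively modest rate in the first.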
Size is more predictive than tool
We can combine tool choice and test size to see which is more important. For each tool above, I isolated tests that use the tool and bucketed those based on memory usage (RAM) and binary size, similar to my previous approach. I calculated the line of best fit and how well it correlated with the data (r²). I then computed the predicted likelihood a test would be flaky at the smallest bucket [8] (which is already the 48th percentile of all our tests) as well as the 90th and 95th percentile of RAM used.
Predicted flaky likelihood by RAM and tool

| Category | r² | Smallest bucket (48th percentile) | 90th percentile | 95th percentile |
| --- | --- | --- | --- | --- |
| All tests | 0.76 | 1.5% | 5.3% | 9.2% |
| Java WebDriver | 0.70 | 2.6% | 6.8% | 11% |
| Python WebDriver | 0.65 | -2.0% | 2.4% | 6.8% |
| An internal integration tool | 0.80 | -1.9% | 3.1% | 8.1% |
| Android emulator | 0.45 | 7.1% | 12% | 17% |
This table shows the results of these calculations for RAM. The correlation is stronger for the tools other than Android emulator. If we ignore that tool, the differences in predicted flakiness between tools at similar RAM use are around 4-5 percentage points. The differences from the smallest bucket to the 95th percentile are 8-10 percentage points. This is one of the most useful outcomes from this research: tools have some impact, but RAM use accounts for larger deviations in flakiness.
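Reading a predicted likelihood off a line of best fit at a given percentile can be sketched like this. The slope, intercept, and RAM values are invented for illustration, and the nearest-rank percentile is one common definition chosen here as an assumption; the post does not say which definition was used.

```python
def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending list; `pct` is an integer 1..100."""
    rank = max(1, -(-pct * len(sorted_values) // 100))  # integer ceiling division
    return sorted_values[rank - 1]


def predicted_flaky_rate(slope, intercept, metric_value):
    """Evaluate the fitted line at a metric value (e.g. RAM used)."""
    return slope * metric_value + intercept


# Hypothetical RAM values (arbitrary units) and a hypothetical fitted line.
ram_values = list(range(1, 101))
slope, intercept = 0.001, -0.02

for pct in (48, 90, 95):
    ram = percentile(ram_values, pct)
    print(pct, ram, predicted_flaky_rate(slope, intercept, ram))

# A linear fit is not bounded at zero, so small inputs can yield a negative
# "likelihood". That is an artifact of the model, not a real rate.
print(predicted_flaky_rate(slope, intercept, 10))
```

The last call returns a negative value for a small test, which mirrors the negative entries in the smallest-bucket column.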
Predicted flaky likelihood by binary size and tool

| Category | r² | Smallest bucket (33rd percentile) | 90th percentile | 95th percentile |
| --- | --- | --- | --- | --- |
| All tests | 0.82 | -4.4% | 4.5% | 9.0% |
| Java WebDriver | 0.81 | -0.7% | 14% | 21% |
| Python WebDriver | 0.61 | -0.9% | 11% | 17% |
| An internal integration tool | 0.80 | -1.8% | 10% | 17% |
| Android emulator | 0.05 | 18% | 23% | 25% |
There's virtually no correlation between binary size and flakiness for Android emulator tests. For the other tools, you see greater variation in predicted flakiness between tools at similar sizes compared to RAM: up to 12 percentage points. But you also see wider differences from the smallest size to the largest: 22 percentage points at the max. This is similar to what we saw with RAM use and is another of the most useful outcomes of this research: binary size accounts for larger deviations in flakiness than the tool you use.
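The comparison between "tool effect" and "size effect" can be checked directly against the published numbers. The sketch below copies the predicted percentages from the RAM and binary size tables (Android emulator omitted, as in the text) and computes two spreads; the function names and the choice to include the "All tests" row are assumptions of this example.

```python
# (smallest bucket, 90th percentile, 95th percentile), in percent,
# copied from the two prediction tables; Android emulator is left out.
RAM = {
    "All tests": (1.5, 5.3, 9.2),
    "Java WebDriver": (2.6, 6.8, 11.0),
    "Python WebDriver": (-2.0, 2.4, 6.8),
    "Internal integration tool": (-1.9, 3.1, 8.1),
}
BINARY_SIZE = {
    "All tests": (-4.4, 4.5, 9.0),
    "Java WebDriver": (-0.7, 14.0, 21.0),
    "Python WebDriver": (-0.9, 11.0, 17.0),
    "Internal integration tool": (-1.8, 10.0, 17.0),
}


def spread_between_categories(table):
    """Largest gap between categories at the same size point (tool effect)."""
    return max(round(max(col) - min(col), 1) for col in zip(*table.values()))


def spread_within_category(table):
    """Largest gap from smallest bucket to 95th percentile (size effect)."""
    return max(round(row[-1] - row[0], 1) for row in table.values())


for name, table in (("RAM", RAM), ("binary size", BINARY_SIZE)):
    print(name, spread_between_categories(table), spread_within_category(table))
# RAM 4.6 10.0
# binary size 12.0 21.7
```

For both metrics the size spread is roughly twice the tool spread, which is the basis for saying that size accounts for larger deviations in flakiness than the tool does.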
Conclusions
Engineer-selected test size correlates with flakiness, but within Google there are not enough test size options to be particularly useful.
Objectively measured test binary size and RAM have strong correlations with whether a test is flaky. This is a continuous function rather than a step function. A step function would have sudden jumps and could indicate that we're transitioning from one type of test to another at those points (e.g. unit tests to system tests or system tests to integration tests).
Tests written with certain tools exhibit a higher rate of flakiness. But much of that can be explained by the generally larger size of these tests. The tool itself seems to contribute only a small amount to this difference.
We need to be more careful before we decide to write large tests. Think about what code you are testing and what a minimal test would look like. And we need to be careful as we write large tests. Without additional effort aimed at preventing flakiness, there is a strong likelihood you will have flaky tests that require maintenance.
Footnotes
[1] A test was flaky if it had at least one flaky run during the week.
[2] I also considered number of libraries built to create the test. In a 1% sample of tests, binary size (0.39) and RAM use (0.34) had stronger correlations than number of libraries (0.27). I only studied binary size and RAM use moving forward.
[3] I aimed for around 100 buckets for each metric.
[4] r² measures how closely the line of best fit matches the data. A value of 1 means the line matches the data exactly.
[5] There are two interesting areas where the points actually reverse their upward slope. The first starts about halfway to the first vertical line and lasts for a few data points and the second goes from right before the first vertical line to right after. The sample size is large enough here that it's unlikely to just be random noise. There are clumps of tests around these points that are more or less flaky than I'd expect only considering binary size. This is an opportunity for further study.
[6] Distance between the observed point and the line of best fit.
[7] Other web testing tools get blamed as well, but WebDriver is our most commonly used one.
[8] Some of the predicted flakiness percents for the smallest buckets end up being negative. While we can't have a negative percent of tests be flaky, it is a possible outcome using this type of prediction.
Code Health: Google's Internal Code Quality Efforts
Monday, April 03, 2017
By Max Kanat-Alexander, Tech Lead for Code Health and Author of Code Simplicity
There are many aspects of good coding practices that don't fall under the normal areas of testing and tooling that most Engineering Productivity groups focus on in the software industry. For example, having readable and maintainable code is about more than just writing good tests or having the right tools—it's about having code that can be easily understood and modified in the first place. But how do you make sure that engineers follow these practices while still allowing them the independence that they need to make sound engineering decisions?
Many years ago, a group of Googlers came together to work on this problem, and they called themselves the "Code Health" group. Why "Code Health"? Well, many of the other terms used for this in the industry (engineering productivity, best practices, coding standards, code quality) have connotations that could lead somebody to think we were working on something other than what we wanted to focus on. What we cared about was the processes and practices of software engineering in full: any aspect of how software was written that could influence the readability, maintainability, stability, or simplicity of code. We liked the analogy of having "healthy" code as covering all of these areas.
This is a field that many authors, theorists, and conference speakers touch on, but not an area that usually has dedicated resources within engineering organizations. Instead, in most software companies, these efforts are pushed by a few dedicated engineers in their extra time or led by the senior tech leads. However, every software engineer is actually involved in code health in some way. After all, we all write software, and most of us care deeply about doing it the "right way." So why not start a group that helps engineers with that "right way" of doing things?
This isn't to say that we are prescriptive about engineering practices at Google. We still let engineers make the decisions that are most sensible for their projects. What the Code Health group does is work on efforts that universally improve the lives of engineers and their ability to write products with shorter iteration time, decreased development effort, greater stability, and improved performance. Everybody appreciates their code getting easier to understand, their libraries getting simpler, etc. because we all know those things let us move faster and make better products.
But how do we accomplish all of this? Well, at Google, Code Health efforts come in many forms.
There is a Google-wide Code Health Group composed of 20% contributors who work to make engineering at Google better for everyone. The members of this group maintain internal documents on best practices and act as a sounding board for teams and individuals who wonder how best to improve practices in their area. Once in a while, for critical projects, members of the group get directly involved in refactoring code, improving libraries, or making changes to tools that promote code health.
For example, this central group maintains Google's code review guidelines, writes internal publications about best practices, organizes tech talks on productivity improvements, and generally fosters a culture of great software engineering at Google.
Some of the senior members of the Code Health group also advise engineering executives and internal leadership groups on how to improve engineering practices in their areas. It's not always clear how to implement effective code health practices in an area—some people have more experience than others making this happen broadly in teams, and so we offer our consulting and experience to help make simple code and great developer experiences a reality.
In addition to the central group, many products and teams at Google have their own Code Health group. These groups tend to work more closely on actual coding projects, such as addressing technical debt through refactoring, making tools that detect and prevent bad coding practices, creating automated code formatters, or making systems for automatically deleting unused code. Usually these groups coordinate and meet with the central Code Health group to make sure that we aren't duplicating efforts across the company and so that great new tools and systems can be shared with the rest of Google.
Throughout the years, Google's Code Health teams have had a major impact on the ability of engineers to develop great products quickly at Google. But code complexity isn't an issue that only affects Google—it affects everybody who writes software, from one person writing software on their own time to the largest engineering teams in the world. So in order to help out everybody, we're planning to release articles in the coming weeks and months that detail specific practices that we encourage internally—practices that can be applied everywhere to help your company, your codebase, your team, and you. Stay tuned here on the Google Testing Blog for more Code Health articles coming soon!