BRO SRE

Reliability practices, infrastructure, automation

The Real Cost of CI Flakiness and How We Got It Under 0.5%

2026-04-12 · CI/CD, Developer Productivity, Testing

Eighteen months ago our CI flake rate was 7.3%. About one in fourteen pipeline runs would fail for reasons unrelated to the change being tested. Engineers had a Pavlovian response to red builds — they would simply hit "retry" without investigating. The integrity of the signal had collapsed. We did not know it then, but a flaky CI is one of the most expensive forms of technical debt a team can carry.

This article documents the program that took us from 7.3% to 0.4% over four quarters, what we measured, what worked, and what was an expensive distraction.

What Flakiness Actually Costs

The obvious cost of flakiness is wasted compute. We measured it: at 7.3% flake rate, with 4,200 PR pipelines per week and an average pipeline cost of $0.41, we burned $52,000 per year on reruns. Real money, but not enough to justify a serious investment.

The hidden cost is engineering time. A flake mid-pipeline means a developer is interrupted from whatever they were doing. They check the failure, decide it is unrelated, hit retry, and try to remember what they were doing. We measured this with a pop-up survey: average context-switch cost per flake was 11 minutes. Across the org that was 8.4 FTE-years annually.

The deepest cost is signal degradation. When developers expect failures to be flakes, they treat real failures as flakes too. We found seven instances of actual production-breaking bugs where the failing test had been "retried" four or more times before someone investigated. The flaky CI had trained the team to ignore the alert. This was the cost that finally made us invest.

Measurement First

Before fixing anything, we needed to know what was flaking and how often. Our CI emitted only pass/fail at the job level. We instrumented test-runner output to emit per-test results, then attributed failures using the following rule:

A test is flaky if it fails on commit C and then passes on the same commit C
without any change to the test, application code, or test dependencies.

A pipeline run is flaky if any of its tests are flaky AND the pipeline
would have failed without retries.

The distinction is important. A test that legitimately fails because the developer broke something is not flaky. A test that fails because of resource contention with another test on the same runner is flaky. The data pipeline we built classifies each failure into exactly one of these two buckets.
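The attribution rule can be sketched in TypeScript as follows. This is a minimal illustration under assumed names, not the pipeline described here: the `TestRun` shape and the `inputsHash` field (a hypothetical fingerprint of the test, application code, and test dependencies) are inventions for the example, and the sketch ignores the order of the fail and the pass.

```typescript
// Hypothetical record for one execution of one test.
interface TestRun {
  testId: string;
  commit: string;
  inputsHash: string; // assumed fingerprint of test, app code, and test dependencies
  passed: boolean;
}

type Verdict = "flaky" | "legitimate";

// Classify every failure into exactly one bucket. A failure is "flaky" when the
// same test also passed on the same commit with identical inputs; otherwise it
// is treated as a legitimate failure.
function classifyFailures(runs: TestRun[]): Map<string, Verdict> {
  const key = (r: TestRun): string => `${r.testId}@${r.commit}#${r.inputsHash}`;
  const passedKeys = new Set(runs.filter((r) => r.passed).map(key));
  const verdicts = new Map<string, Verdict>();
  for (const r of runs) {
    if (r.passed) continue;
    verdicts.set(key(r), passedKeys.has(key(r)) ? "flaky" : "legitimate");
  }
  return verdicts;
}
```

A failure with no matching pass stays "legitimate" until a later run proves otherwise, so verdicts are best computed after retries have finished.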

The output is a weekly leaderboard of the top 20 flakiest tests, ranked by total developer-time-cost (frequency × estimated context-switch cost). This single artifact drove most of the subsequent work.
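The ranking itself is simple arithmetic. A minimal sketch, assuming a hypothetical `FlakeStat` input and using the 11-minute survey average as the default cost when no per-test estimate exists:

```typescript
// Hypothetical weekly input: how often each test flaked.
interface FlakeStat {
  testId: string;
  flakes: number;
  minutesPerFlake?: number; // optional per-test estimate of context-switch cost
}

interface LeaderboardRow {
  testId: string;
  flakes: number;
  costMinutes: number;
}

// Assumption: fall back to the survey average quoted in this article.
const DEFAULT_MINUTES_PER_FLAKE = 11;

// Rank tests by estimated developer-time cost (frequency × context-switch cost)
// and keep the top N.
function buildLeaderboard(stats: FlakeStat[], topN = 20): LeaderboardRow[] {
  return stats
    .map((s) => ({
      testId: s.testId,
      flakes: s.flakes,
      costMinutes: s.flakes * (s.minutesPerFlake ?? DEFAULT_MINUTES_PER_FLAKE),
    }))
    .sort((a, b) => b.costMinutes - a.costMinutes)
    .slice(0, topN);
}
```

Ranking by cost rather than raw count means a rare flake in a slow, late pipeline stage can outrank a frequent flake that fails in the first minute.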

Categories of Flake

Once we had data, the same patterns kept appearing. We classified every fix by root cause; the categories behind the fixes below are timing, shared state, and external dependencies.

Fixes That Worked

Eliminate Fixed Sleeps

For the timing category, the fix was almost always replacing a fixed sleep such as sleep(2000) with a poll-until-condition pattern with a generous upper bound:

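// Before: sleep a fixed 2 seconds, then assert once
await sleep(2000);
expect(queueProcessed).toBe(true);

// After: poll every 50 ms until the condition holds, giving up only after 30 s
await waitFor(() => queueProcessed === true, { timeout: 30_000, interval: 50 });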

The 30-second upper bound is intentionally much larger than the expected duration. In normal cases the test finishes as soon as the condition holds, and it tolerates slow CI runners without flaking.
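For readers without such a helper, here is a minimal sketch of a predicate-polling `waitFor` matching the call shape above. It is an assumption about how the helper behaves, not a copy of any particular library; some libraries' `waitFor` retries until the callback stops throwing instead of until it returns true.

```typescript
interface WaitOptions {
  timeout: number; // upper bound in milliseconds
  interval: number; // delay between checks in milliseconds
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Resolve as soon as `condition` returns true; reject once `timeout` has elapsed.
// The condition is always checked at least once.
async function waitFor(
  condition: () => boolean,
  { timeout, interval }: WaitOptions,
): Promise<void> {
  const deadline = Date.now() + timeout;
  while (true) {
    if (condition()) return;
    if (Date.now() >= deadline) {
      throw new Error(`waitFor: condition not met within ${timeout} ms`);
    }
    await sleep(interval);
  }
}
```

Because the helper returns on the first successful check, the large timeout only costs time when the test is genuinely failing.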

Database Per Test, Not Per Suite

For shared-state flakes, we adopted "transaction rollback at end of test" for databases that support it (Postgres, MySQL) and "fresh schema per test class" for those that do not. The cost is some test runtime (about 8% in our case); the benefit is the elimination of an entire flake category.
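The rollback pattern can be sketched as a wrapper around each test body. The `SqlClient` interface below is a deliberately tiny, hypothetical shape chosen for illustration; a real driver's client would be adapted to it.

```typescript
// Minimal client shape assumed for this sketch.
interface SqlClient {
  query(sql: string): Promise<unknown>;
}

// Run a test body inside a transaction and always roll it back, whether the
// body passes or throws, so nothing the test wrote is visible to the next test.
async function withRollback<T>(
  client: SqlClient,
  body: (client: SqlClient) => Promise<T>,
): Promise<T> {
  await client.query("BEGIN");
  try {
    return await body(client);
  } finally {
    await client.query("ROLLBACK");
  }
}
```

One caveat: this only isolates code that uses the same connection and does not commit on its own. Code under test that opens its own connections or commits explicitly needs the fresh-schema approach instead.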

The harder part was applying the same discipline to non-database shared state. Tests that wrote to a temp directory, tests that registered metrics with a global Prometheus registry, tests that mutated environment variables — each needed an explicit cleanup that was not being run reliably. We added a test framework hook that asserts a list of "global state" properties are unchanged after each test, and fails the test if they are not.
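A minimal sketch of such a guard, assuming a hypothetical `StateProbe` abstraction: each probe serializes one piece of global state, a before-each hook takes a snapshot, and an after-each hook fails the test if any probe's value has changed.

```typescript
// A probe reads one piece of global state and returns a comparable string.
interface StateProbe {
  name: string;
  read: () => string;
}

// Call before the test: capture the current value of every probe.
function snapshotState(probes: StateProbe[]): string[] {
  return probes.map((p) => p.read());
}

// Call after the test: return the names of probes whose value changed.
// A non-empty result means the test leaked global state and should fail.
function changedProbes(probes: StateProbe[], before: string[]): string[] {
  return probes.filter((p, i) => p.read() !== before[i]).map((p) => p.name);
}
```

In a Node.js test suite, a probe for environment variables might read `JSON.stringify(process.env)`, and one for a temp directory might list its entries. Both are illustrations, not a description of the hook we shipped.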

Aggressively Mock External Dependencies

Every test that hit an external service became a candidate for either mocking or moving to a separate "integration" suite that runs on a different cadence. The unit/integration split is religious in our codebase — a "unit test" cannot make a network call, period.

For services we own, we provide mockable interfaces with a fake implementation maintained alongside the real one. For third-party services, we use WireMock or VCR to record-and-replay. The replay files are checked into the repo, so the test is deterministic across machines and time.
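The replay idea is independent of any particular tool. The sketch below is not the WireMock or VCR API; it is a hypothetical in-memory illustration of the principle that a replayed test answers only from recorded data and treats any unrecorded request as an error instead of reaching the network.

```typescript
interface RecordedResponse {
  status: number;
  body: string;
}

// Recorded responses keyed by "METHOD url"; in practice loaded from a file
// checked into the repo.
type Cassette = Record<string, RecordedResponse>;

// Build a request function that replays recorded responses and throws on
// anything that was never recorded.
function makeReplayClient(cassette: Cassette) {
  return function request(method: string, url: string): RecordedResponse {
    const key = `${method.toUpperCase()} ${url}`;
    const hit = cassette[key];
    if (hit === undefined) {
      throw new Error(`no recorded response for ${key}`);
    }
    return hit;
  };
}
```

Failing loudly on an unrecorded request is the important design choice: a silent fall-through to the live service would reintroduce exactly the nondeterminism the recording was meant to remove.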

Quarantine, Don't Ignore

When a test became chronically flaky and could not be fixed within a sprint, we moved it to a quarantine job. The quarantine job runs the same tests but does not block the PR. The tests are still executed and tracked; failures generate tickets but do not stop merges.

Quarantine is explicitly temporary. A test in quarantine for more than 60 days is automatically deleted with a notification to the team that owned it. This sounds harsh but is necessary. A quarantined test that nobody fixes provides no value, and a hard deadline keeps people from forgetting about it.
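The expiry check is a small date comparison. A minimal sketch, assuming a hypothetical `QuarantinedTest` record that stores when each test entered quarantine:

```typescript
// Hypothetical record of a test sitting in the quarantine job.
interface QuarantinedTest {
  testId: string;
  owner: string;
  quarantinedAt: Date;
}

const MAX_QUARANTINE_DAYS = 60;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Return tests that have been in quarantine for more than 60 days as of `now`.
// These are the candidates for deletion and an owner notification.
function expiredQuarantines(tests: QuarantinedTest[], now: Date): QuarantinedTest[] {
  return tests.filter((t) => {
    const days = (now.getTime() - t.quarantinedAt.getTime()) / MS_PER_DAY;
    return days > MAX_QUARANTINE_DAYS;
  });
}
```

Passing `now` in as a parameter keeps the check itself deterministic and easy to test.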

What Did Not Work

Two interventions sounded good in theory and produced minimal results:

Generic retries at the job level. Configuring CI to auto-retry failed jobs masks flakes from developers but does not fix them. We tried it for two months and the underlying flake rate continued to climb, hidden from us because the data pipeline was now classifying retried jobs as "passed." We rolled it back and added retries only at the individual-test level, with mandatory metadata explaining why a test is retry-eligible.

Beefier CI runners. We hypothesized that resource contention would mostly go away if we doubled the CPU and RAM per runner. It moved the needle by maybe 0.4 percentage points and increased CI cost by 60%. The problem was almost always test design, not host capacity.
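The test-level retries with mandatory metadata mentioned above can be sketched as follows. The `RetryPolicy` shape and the returned `attempts` count are assumptions for illustration. The count is there so a measurement pipeline can still record a pass on the second attempt as a flake, which is the signal job-level retries hid from us.

```typescript
// Hypothetical metadata every retry-eligible test must declare.
interface RetryPolicy {
  reason: string; // why this test is allowed to retry
  maxAttempts: number;
}

interface RetryResult<T> {
  value: T;
  attempts: number; // attempts > 1 means the test flaked before passing
}

// Run a test up to `maxAttempts` times, refusing to retry without a stated reason.
async function runWithRetry<T>(
  policy: RetryPolicy,
  test: () => Promise<T>,
): Promise<RetryResult<T>> {
  if (policy.reason.trim() === "") {
    throw new Error("retry-eligible tests must state a reason");
  }
  if (!Number.isInteger(policy.maxAttempts) || policy.maxAttempts < 1) {
    throw new Error("maxAttempts must be a positive integer");
  }
  let lastError: unknown;
  for (let attempt = 1; attempt <= policy.maxAttempts; attempt++) {
    try {
      return { value: await test(), attempts: attempt };
    } catch (err) {
      lastError = err; // remember the failure and try again
    }
  }
  throw lastError;
}
```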

The Cultural Piece

Tooling and measurement got us most of the way. The last piece was cultural: changing how the team responded to flakes.

We introduced a weekly 30-minute "flake review" attended by a rotating engineer from each team. The agenda is the top 10 flakes from the leaderboard. The output is owner assignments and target fix dates. Attendance is mandatory; missing it twice in a quarter triggers a conversation with the engineering manager.

This meeting was unpopular for the first quarter. By the third quarter the leaderboard was sparse enough that the meeting often ended in 10 minutes. By the fourth quarter we cut the cadence to every other week.

The 0.4% flake rate is not steady-state — every new feature introduces some flakes and the curve will trend upward if we stop investing. The team understands the maintenance cost now, and the cost is much lower than the cost of the alternative.