June 5, 2026
What to Measure in CI When You Want to Catch Test Instability Before Merge
Learn which CI metrics reveal test instability early, how to track flake rate, merge confidence, and pipeline observability, and how to stop noisy tests from reaching main.
If you only look at pass or fail in CI, you miss the real signal. A test suite can be “green” and still be unstable, slow to recover, expensive to rerun, and unreliable enough that engineers stop trusting it. The useful question is not just whether tests pass, but whether the system can tell you, early and with enough confidence, that a test is becoming flaky before it damages the merge process.
That is where measurement matters. When teams talk about test instability in CI, they often jump straight to “fix the flaky test.” That is necessary, but it is not the first step. First you need a small set of metrics that show whether instability is increasing, where it lives, and how much it affects merge confidence. If those metrics are visible before merge, they let you act while the change is still cheap to reverse.
This article is a practical guide to what to measure, how to interpret it, and how to wire those measurements into a CI pipeline without creating more noise.
What test instability in CI actually means
Test instability in CI is not the same as a test failure caused by a real product regression. Instability is when the result varies for reasons unrelated to the code under test, or when the test environment makes the result unreliable. Common causes include timing issues, shared state, test data collisions, race conditions, network dependence, and environment drift.
A useful mental model is this:
- A stable test produces the same outcome under the same conditions.
- A flaky test produces different outcomes under similar conditions.
- A brittle test passes only when the environment is unusually favorable.
A flaky suite does not just create red builds. It erodes the meaning of green builds.
This distinction matters because the right metrics are different. You are not measuring quality alone; you are measuring trust in the pipeline itself.
The metrics that matter most
There are many things you could measure in CI, but only a few are directly useful for catching instability before merge. The best metrics are the ones that answer one of three questions:
- How often do tests fail without a code change that should have broken them?
- How quickly can we identify unstable behavior?
- How much does instability reduce our confidence in a merge decision?
Here are the core metrics that serve those goals.
1. Flake rate
Flake rate is the single most important metric for test instability in CI. It answers, “How often does this test or suite fail non-deterministically?”
A simple definition:
- Run a test multiple times under similar conditions.
- Count failures that disappear on rerun without code changes.
- Divide flaky failures by total executions.
You can track it at several levels:
- per test
- per file or suite
- per branch
- per build pipeline
The exact formula depends on your setup, but the idea is consistent. If a test passes on retry, that does not prove it is flaky, but it is a strong signal worth tracking.
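As a minimal sketch of the definition above, assume each execution of a test is recorded together with the outcome of every attempt. The `Execution` shape and the function names are illustrative and not taken from any particular CI tool. An execution counts as flaky here when an attempt failed and the final attempt passed on the same commit:

```typescript
// Hypothetical record: one execution of a test on one commit,
// with the outcome of each attempt in order.
interface Execution {
  testId: string;
  commit: string;
  attempts: Array<"passed" | "failed">;
}

// Flaky execution: at least one failed attempt, final attempt passed,
// and no code change in between (same commit).
function isFlaky(e: Execution): boolean {
  const last = e.attempts[e.attempts.length - 1];
  return last === "passed" && e.attempts.indexOf("failed") !== -1;
}

// Flake rate per test: flaky executions divided by total executions.
function flakeRate(executions: Execution[]): Record<string, number> {
  const totals = new Map<string, { flaky: number; total: number }>();
  for (const e of executions) {
    const t = totals.get(e.testId) ?? { flaky: 0, total: 0 };
    t.total += 1;
    if (isFlaky(e)) t.flaky += 1;
    totals.set(e.testId, t);
  }
  const rates: Record<string, number> = {};
  totals.forEach((t, id) => {
    rates[id] = t.flaky / t.total;
  });
  return rates;
}
```

A test that fails on every attempt is deliberately not counted as flaky in this sketch, since a consistent failure looks more like a real regression.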
Why it matters:
- It tells you where to spend debugging time.
- It identifies tests that are poisoning merge confidence.
- It helps you prioritize fixes by impact, not by loudness.
A useful rule is to watch both absolute flake counts and flake rate percentage. A suite with 2 flakes out of 20 runs (10%) is obviously more alarming than one with 2 flakes out of 2,000 runs (0.1%), but a large suite with a low percentage can still cause many developer interruptions.
2. Retry frequency
Retry frequency measures how often CI had to rerun a test, job, or step before concluding success or failure. This is not identical to flake rate. A flaky test may never be retried if your pipeline fails fast, and a stable but slow job may be retried due to infrastructure problems.
Track:
- number of manual retries
- number of automated retries
- percentage of jobs that required any retry
- average retries per build
Retry data is useful because it reflects operational reality. Even if the final build is green, retries cost time, create uncertainty, and often hide problems that should have been visible.
A pipeline that needs retries to stay green is not reliable. It is masked.
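Two of the numbers listed above can be derived from per-job attempt counts. This is a rough sketch under the assumption that your CI exposes how many attempts each job needed. The `JobRun` shape is hypothetical, and the sketch does not distinguish manual from automated retries:

```typescript
// Hypothetical job record; `attempts` is 1 when no retry was needed.
interface JobRun {
  buildId: string;
  job: string;
  attempts: number;
}

interface RetryStats {
  pctJobsRetried: number; // share of jobs that needed any retry, 0..100
  avgRetriesPerBuild: number; // total retries divided by distinct builds
}

function retryStats(runs: JobRun[]): RetryStats {
  if (runs.length === 0) return { pctJobsRetried: 0, avgRetriesPerBuild: 0 };
  const builds: Record<string, true> = {};
  let jobsRetried = 0;
  let totalRetries = 0;
  for (const r of runs) {
    builds[r.buildId] = true;
    const retries = Math.max(0, r.attempts - 1);
    if (retries > 0) jobsRetried += 1;
    totalRetries += retries;
  }
  return {
    pctJobsRetried: (jobsRetried / runs.length) * 100,
    avgRetriesPerBuild: totalRetries / Object.keys(builds).length,
  };
}
```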
3. First-failure signal quality
A useful CI system should tell you quickly whether a failure is probably real, probably flaky, or probably environmental. First-failure signal quality measures how useful the first failure event is before rerun noise starts.
This is harder to reduce to one formula, but you can assess it with questions like:
- Does the failure log point to a consistent assertion or timeout?
- Are environment details captured automatically?
- Can you tell whether the failure was local to one test or systemic across many tests?
- Does the failure happen in the same stage repeatedly?
If first-failure logs are poor, your team spends more time investigating and less time fixing. That makes instability more expensive even when the flake rate is modest.
4. Failure clustering
Failure clustering looks at whether unstable tests fail together, in the same branch, on the same runner, or after the same code paths. Clusters usually reveal shared root causes such as:
- shared fixtures
- data collisions
- parallel execution contention
- a bad environment image
- a flaky external dependency
This is one of the most actionable metrics because it helps you decide whether the problem is in a test, a test group, or the pipeline environment itself.
Track clusters by:
- test name
- module or directory
- job type
- runner image
- time window
- git commit or changed files
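One simple way to look for clusters is to count failures along a single dimension and sort by frequency. The sketch below assumes a hypothetical `Failure` record with only three of the dimensions listed above. A real pipeline would likely carry more fields:

```typescript
// Hypothetical failure record with a few clustering dimensions.
interface Failure {
  testId: string;
  runner: string;
  module: string;
}

// Count failures per value of one dimension, most frequent first.
function clusterBy(
  failures: Failure[],
  key: keyof Failure,
): Array<[string, number]> {
  const counts = new Map<string, number>();
  for (const f of failures) {
    counts.set(f[key], (counts.get(f[key]) ?? 0) + 1);
  }
  const clusters: Array<[string, number]> = [];
  counts.forEach((count, value) => clusters.push([value, count]));
  return clusters.sort((a, b) => b[1] - a[1]);
}
```

If one runner image or one module dominates the result, that is a hint to look at the environment or a shared fixture before debugging individual tests.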
5. Time to instability detection
This measures how long it takes from the introduction of instability to the point where the pipeline flags it.
The shorter this time is, the better.
Why it matters:
- catching instability after merge means more people are affected
- catching it before merge limits blast radius
- shorter detection windows make root cause analysis easier
In practice, you want instability to show up during the same CI run that introduced it, not hours later in a nightly job or a downstream environment.
6. Merge confidence score
Merge confidence is not a standard industry metric, but it is a useful concept for engineering teams. It represents how much trust you should place in a “green” result before merging.
A simple merge confidence model can incorporate:
- current run result
- recent flake history for changed tests
- retry count
- environmental anomalies
- test duration variance
- whether the test touched high-risk code
You do not need a perfect score. You need a policy that says: “Green on a test with repeated flake history is not the same as green on a stable test.”
This is especially useful for PR gates, where the decision is binary but the evidence is not.
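To make the idea concrete, here is one possible scoring sketch. The `GateEvidence` fields and every weight are arbitrary placeholders chosen for illustration, not recommended values, and duration variance is left out to keep the example short:

```typescript
// Hypothetical evidence collected for one PR gate decision.
interface GateEvidence {
  passed: boolean; // current run result
  recentFlakeRate: number; // 0..1, for tests relevant to the change
  retryCount: number; // retries needed in this run
  environmentAnomaly: boolean;
  touchesHighRiskCode: boolean;
}

// Returns a value in [0, 1]. The weights are illustrative assumptions
// and should be tuned against your own history.
function mergeConfidence(e: GateEvidence): number {
  if (!e.passed) return 0;
  let score = 1;
  score -= Math.min(Math.max(e.recentFlakeRate, 0), 1) * 0.5;
  score -= Math.min(Math.max(e.retryCount, 0), 3) * 0.1;
  if (e.environmentAnomaly) score -= 0.1;
  if (e.touchesHighRiskCode) score -= 0.1;
  return Math.max(0, score);
}
```

The exact number matters less than the ordering: a green run with retries and flake history should score lower than a clean green run.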
7. Pipeline duration variance
When test instability appears, duration often becomes noisy before failures become obvious. A test that alternates between 30 seconds and 90 seconds may be suffering from intermittent waits, resource contention, or unstable dependencies.
Track:
- median duration
- p95 duration
- duration variance over time
- duration by test and by stage
If duration spikes correlate with flakiness, you have a strong clue that the issue is not a product assertion but a timing or environment problem.
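The duration figures above can be computed from a list of recorded run times. This sketch assumes durations in milliseconds, uses the nearest-rank method for percentiles, and reports population variance. All names are illustrative:

```typescript
interface DurationStats {
  medianMs: number;
  p95Ms: number;
  varianceMs2: number; // population variance, in ms squared
}

// Nearest-rank percentile on an ascending, non-empty array.
function percentile(sortedAsc: number[], p: number): number {
  const rank = Math.ceil((p / 100) * sortedAsc.length);
  const index = Math.min(sortedAsc.length - 1, Math.max(0, rank - 1));
  return sortedAsc[index] ?? NaN;
}

function durationStats(durationsMs: number[]): DurationStats {
  if (durationsMs.length === 0) {
    throw new Error("durationStats needs at least one sample");
  }
  const sorted = durationsMs.slice().sort((a, b) => a - b);
  const mean = sorted.reduce((sum, d) => sum + d, 0) / sorted.length;
  const variance =
    sorted.reduce((sum, d) => sum + (d - mean) * (d - mean), 0) /
    sorted.length;
  return {
    medianMs: percentile(sorted, 50),
    p95Ms: percentile(sorted, 95),
    varianceMs2: variance,
  };
}
```

A wide gap between the median and p95 for a single test is the pattern described above: most runs are quick, but some wait on something intermittent.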
8. Quarantine volume and age
Quarantined tests are a reality in many orgs. The metric that matters is not just how many tests are quarantined, but how long they stay there.
Track:
- number of quarantined tests
- age of each quarantine item
- percentage of flaky failures that get quarantined within a defined SLA
- percentage of quarantined tests reintroduced successfully
A large quarantine backlog is a strong indicator that the team is managing instability, not reducing it.
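Quarantine age is straightforward to report if each quarantined test records when it was set aside. The `QuarantineItem` shape and the age limit below are assumptions made for this sketch:

```typescript
// Hypothetical quarantine record.
interface QuarantineItem {
  testId: string;
  quarantinedAt: string; // ISO 8601 timestamp
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Return quarantined tests older than maxAgeDays, oldest first.
function overdueQuarantines(
  items: QuarantineItem[],
  maxAgeDays: number,
  now: Date = new Date(),
): Array<{ testId: string; ageDays: number }> {
  return items
    .map((q) => ({
      testId: q.testId,
      ageDays: Math.floor(
        (now.getTime() - Date.parse(q.quarantinedAt)) / MS_PER_DAY,
      ),
    }))
    .filter((q) => q.ageDays > maxAgeDays)
    .sort((a, b) => b.ageDays - a.ageDays);
}
```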
What to measure at each stage of the pipeline
Not every metric belongs everywhere. Good pipeline observability means collecting the right data at the right stage.
During unit tests
Measure:
- failure rate by test
- retry count
- duration variance
- module-level clustering
Unit tests should be the cleanest signal in CI. If they are unstable, your merge gate is weaker than it looks.
During integration tests
Measure:
- flake rate by environment
- external dependency errors
- resource contention signals
- setup and teardown failures
- test order sensitivity
Integration tests often fail because of the environment, not because the application code is wrong. That means observability should include service health, container startup time, database readiness, and network error categories.
During end-to-end tests
Measure:
- step-level failure points
- locator or assertion retries
- page load and API latency correlation
- environment-specific failures
- cross-browser or cross-device variance
End-to-end suites are usually slower and more expensive, so even a small amount of instability can have a big effect on merge confidence.
Build the metrics from raw signals, not just status labels
CI systems often give you pass, fail, and maybe retry. That is not enough. To detect test instability in CI early, collect structured events from the pipeline.
Useful raw signals include:
- test name or ID
- commit SHA
- branch or PR number
- job name
- runner image or host
- start and end times
- attempt count
- failure category
- error message hash
- stack trace signature
- changed files in the PR
This allows you to compute metrics over time instead of relying on anecdotal reports.
A basic event schema might look like this:
{
  "testId": "checkout.spec.ts::submits-payment",
  "commit": "a1b2c3d",
  "branch": "feature/payment-retry",
  "job": "e2e-chrome",
  "attempt": 2,
  "status": "passed",
  "durationMs": 48213,
  "failureCategory": "timeout",
  "runner": "ubuntu-22.04"
}
With this kind of structure, you can build dashboards, alerts, and trend analysis that distinguish flaky tests from one-off failures.
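If the events are produced from TypeScript tooling, the schema can be expressed as a type. The sketch below mirrors the example event and adds a hypothetical `errorSignature` field, along with one possible way to derive it by stripping volatile details from the error message. The normalization rules are assumptions and will need tuning for real logs:

```typescript
// Mirrors the example event; `errorSignature` is an added, hypothetical field.
interface TestEvent {
  testId: string;
  commit: string;
  branch: string;
  job: string;
  attempt: number;
  status: "passed" | "failed" | "skipped";
  durationMs: number;
  failureCategory?: string;
  runner: string;
  errorSignature?: string;
}

// Collapse volatile details (hex addresses, numbers, extra whitespace)
// so the same underlying failure yields the same signature across runs.
function errorSignature(message: string): string {
  return message
    .replace(/0x[0-9a-f]+/gi, "<hex>")
    .replace(/\d+/g, "<n>")
    .replace(/\s+/g, " ")
    .trim();
}

// Emit one JSON object per line so a log shipper can parse events.
function toLogLine(event: TestEvent): string {
  return JSON.stringify(event);
}
```

Grouping by signature instead of by raw message keeps two timeouts with different millisecond values in the same bucket.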
How to distinguish flaky tests from real regressions
One of the hardest parts of test instability in CI is not collecting data but interpreting it.
A real regression often shows these patterns:
- failure is reproducible on rerun
- failure maps to the changed code path
- failure affects related tests consistently
- failure persists across environments
A flaky test often shows these patterns:
- failure disappears on rerun
- failure appears in unrelated changes
- failure is clustered around timing or environment variation
- failure shifts between assertions without code changes
That said, retries can be misleading. A test that passes on rerun may still be revealing a legitimate race condition in production code. So the right response is not “rerun means flaky.” The right response is “rerun means unstable enough to investigate.”
Treat retries as a diagnostic clue, not as proof of harmlessness.
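The patterns above can be turned into a first-pass classifier. This is a deliberately small sketch with hypothetical field names. It uses only three of the listed signals, and a pass on rerun alone is never treated as proof of flakiness:

```typescript
type Verdict = "likely-regression" | "likely-flaky" | "needs-investigation";

// Hypothetical observations about one failure.
interface FailureObservation {
  reproducedOnRerun: boolean;
  touchesChangedCodePath: boolean;
  seenOnUnrelatedChanges: boolean;
}

function classify(o: FailureObservation): Verdict {
  if (o.reproducedOnRerun && o.touchesChangedCodePath) {
    return "likely-regression";
  }
  if (!o.reproducedOnRerun && o.seenOnUnrelatedChanges) {
    return "likely-flaky";
  }
  // Passing on rerun by itself is only a clue, so everything else
  // is routed to a human.
  return "needs-investigation";
}
```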
A practical threshold strategy
Teams often ask for a precise threshold, such as, “At what flake rate should we block merge?” There is no universal number. Thresholds should reflect team size, test volume, and cost of false confidence.
A more practical strategy is to define tiers:
- Informational: one-off failure, no historical pattern
- Warning: repeated failure pattern or retry needed
- Action required: test has a recent flake history or multiple failures in a rolling window
- Merge blocked: a critical path test is unstable enough that the confidence in the gate is low
This approach avoids overreacting to noise while still preventing unstable tests from silently becoming part of your definition of “green.”
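The tiers can be encoded as a small function over a test's recent history. The field names and the default threshold of three failures are assumptions for illustration only. Choose values that fit your own test volume:

```typescript
type Tier = "informational" | "warning" | "action-required" | "merge-blocked";

// Hypothetical summary of one test's recent behavior.
interface TestHistory {
  failuresInWindow: number; // failures in the rolling window, including now
  retriedInCurrentRun: boolean;
  criticalPath: boolean;
}

// actionThreshold is an illustrative default, not a recommendation.
function tierFor(h: TestHistory, actionThreshold = 3): Tier {
  if (h.failuresInWindow >= actionThreshold) {
    return h.criticalPath ? "merge-blocked" : "action-required";
  }
  if (h.failuresInWindow >= 2 || h.retriedInCurrentRun) {
    return "warning";
  }
  return "informational";
}
```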
Sample implementation: collecting flaky-test telemetry in CI
A simple CI setup can emit test metadata into logs or a metrics sink after each job. The goal is to make flaky behavior visible without adding much overhead.
Here is an example GitHub Actions workflow that stores test results as artifacts and allows later analysis:
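name: ci

on:
  pull_request:

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      # Assumes a Jest-based "test" script; arguments after "--" go to the test runner.
      - run: npm test -- --json --outputFile=test-results.json
      # Upload even when tests fail, so failed runs can be analyzed later.
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: test-results
          path: test-results.json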
If you use Playwright, you can also capture retry information and trace output to help classify failures:
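import { defineConfig } from '@playwright/test';

export default defineConfig({
  // Retry each failed test once so a pass on retry is recorded.
  retries: 1,
  use: {
    // Record a trace only when a test is retried for the first time.
    trace: 'on-first-retry',
  },
  reporter: [
    ['list'],
    // Write machine-readable results for later analysis.
    ['json', { outputFile: 'playwright-results.json' }],
  ],
});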
The point is not the specific tool. The point is that your pipeline should preserve enough context to answer:
- what failed
- whether a retry was needed
- whether the failure repeats
- whether the same failure appears across commits or runners
Using change-based analysis to raise merge confidence
The most useful instability metrics are the ones that are aware of change scope.
If a PR only touches frontend styling, but your API tests start failing, that is a useful clue. If the PR modifies a shared test fixture, and unrelated tests become flaky, that is even more important. Change-based analysis helps you connect instability to the files, modules, or services that could plausibly have caused it.
You can track:
- failures in tests associated with changed files
- failures in tests that depend on changed shared fixtures
- failures in dependent suites after a library or API change
- instability introduced after a runner or dependency image update
This helps answer a critical merge question, “Is this failure likely caused by the change under review, or is it background noise?”
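A basic version of this check needs only two inputs: the files changed in the PR and a mapping from tests to the files they depend on. How you build that mapping (coverage data, import graphs, manual tags) is outside this sketch, and the `TestDeps` shape is hypothetical:

```typescript
// Hypothetical mapping from a test to the files it depends on.
interface TestDeps {
  testId: string;
  dependsOn: string[]; // source files or fixtures the test exercises
}

// Split failed tests into those plausibly related to the change
// and those that look like background noise.
function splitByChangeScope(
  failedTestIds: string[],
  deps: TestDeps[],
  changedFiles: string[],
): { related: string[]; unrelated: string[] } {
  const changed = new Set(changedFiles);
  const related: string[] = [];
  const unrelated: string[] = [];
  for (const id of failedTestIds) {
    const entry = deps.find((d) => d.testId === id);
    const hit =
      entry !== undefined && entry.dependsOn.some((f) => changed.has(f));
    (hit ? related : unrelated).push(id);
  }
  return { related, unrelated };
}
```

Tests with no known dependencies land in `unrelated` here, which is a simplification. A stricter version might put them in a third "unknown" bucket.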
Alerting without creating alert fatigue
Pipeline observability is only useful if engineers trust the alerts. If every intermittent failure becomes a pager event, people stop paying attention.
Good alerting for test instability in CI should be:
- aggregated over time, not per event only
- scoped by severity and critical path
- aware of recent history
- separated from product runtime alerts
For example, a nightly report might summarize:
- tests with rising flake rate
- jobs with increased retry frequency
- environments with unusual failure clustering
- quarantines older than a defined threshold
This is usually better than interrupting people on every transient failure.
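The first item in such a report, tests with a rising flake rate, can be computed by comparing two time windows. This sketch assumes you already have per-test flake rates for a previous and a current window. The function name and the 0.05 default are illustrative:

```typescript
interface FlakeTrend {
  testId: string;
  previous: number;
  current: number;
}

// Report tests whose flake rate rose by at least minIncrease
// between two windows, largest increase first.
function risingFlakeRates(
  previous: Record<string, number>,
  current: Record<string, number>,
  minIncrease = 0.05,
): FlakeTrend[] {
  const rising: FlakeTrend[] = [];
  for (const testId of Object.keys(current)) {
    const now = current[testId] ?? 0;
    const before = previous[testId] ?? 0; // unseen tests start at 0
    if (now - before >= minIncrease) {
      rising.push({ testId, previous: before, current: now });
    }
  }
  return rising.sort(
    (a, b) => b.current - b.previous - (a.current - a.previous),
  );
}
```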
Common mistakes when measuring instability
Measuring only suite pass rate
A suite can keep passing because retries hide instability. Pass rate alone is too blunt.
Ignoring retries
Retries are evidence. If you do not track them, you miss the most obvious sign of a shaky gate.
Treating all failures equally
A timeout, a data collision, and a genuine assertion failure should not be lumped together. Failure category matters.
Not segmenting by environment
If instability only appears on one runner image or one browser, the fix may be infrastructure-related, not test-related.
Letting quarantine become permanent
Quarantine should be a temporary containment strategy, not a storage system for unresolved problems.
A minimal scorecard for engineering teams
If you only have room for a small dashboard, start with this set:
- flake rate by test and by suite
- retry frequency by job
- failure clustering by runner and by module
- duration variance for critical tests
- quarantine age and count
- merge confidence indicators for changed tests
This is enough to spot whether your pipeline is becoming less trustworthy before merge.
If you want a slightly more mature version, add:
- failure category breakdown
- environment health signals
- change-based failure correlation
- time to detection for instability
Example policy for PR gating
A practical PR policy might look like this:
- Block merge if any critical-path test has failed twice in the last N runs without a known product regression.
- Warn if retry count exceeds a threshold for the current PR.
- Flag if the test touched by the PR has a recent flake history.
- Require quarantine or fix assignment for tests that exceed an age limit.
This kind of policy turns metrics into decisions. Without that final step, dashboards are easy to ignore.
How QA, DevOps, and SRE can split responsibilities
Managing test instability in CI is a shared job, but the responsibilities differ.
- QA leads usually own test design, flaky test triage, and quarantine policy.
- DevOps teams usually own runner stability, execution environment consistency, and CI observability.
- SREs usually help with reliability patterns, alert design, and production-like dependency health.
- Engineering managers usually decide what threshold is acceptable and what gets blocked.
The best outcome is when no single team has to “own” the flake problem alone. Instability often sits at the boundary between test design and infrastructure.
A useful operational loop
If you want this to be sustainable, use a recurring loop:
- Detect instability early with telemetry.
- Classify the failure pattern.
- Route the issue to the right owner.
- Quarantine only when needed.
- Fix the root cause.
- Remove quarantine and verify stability over time.
This turns flaky test handling from reactive cleanup into a reliability practice.
Final takeaway
If your goal is to catch test instability in CI before merge, the most valuable metrics are the ones that improve trust, not just visibility. Start with flake rate, retry frequency, failure clustering, duration variance, and quarantine age. Add merge confidence and change-based analysis so your pipeline can distinguish a real regression from a noisy test.
The practical test is simple: when a PR turns a green pipeline red, can you tell quickly whether the merge is unsafe, the test is unstable, or the environment is lying? If the answer is no, you do not have enough observability yet.
Measure the instability that affects decisions, not just the failures that happen to be visible.