What to Measure in CI When You Want to Catch Test Instability Before Merge

If you only look at pass or fail in CI, you miss the real signal. A test suite can be “green” and still be unstable, slow to recover, expensive to rerun, and unreliable enough that engineers stop trusting it. The useful question is not just whether tests pass, but whether the system can tell you, early and with enough confidence, that a test is becoming flaky before it damages the merge process.

That is where measurement matters. When teams talk about test instability in CI, they often jump straight to “fix the flaky test.” That is necessary, but it is not the first step. First you need a small set of metrics that show whether instability is increasing, where it lives, and how much it affects merge confidence. If those metrics are visible before merge, they let you act while the change is still cheap to reverse.

This article is a practical guide to what to measure, how to interpret it, and how to wire those measurements into a CI pipeline without creating more noise.

What test instability in CI actually means

Test instability in CI is not the same as a test failure caused by a real product regression. Instability is when the result varies for reasons unrelated to the code under test, or when the test environment makes the result unreliable. Common causes include timing issues, shared state, test data collisions, race conditions, network dependence, and environment drift.

A useful mental model is this:

A stable test produces the same outcome under the same conditions.
A flaky test produces different outcomes under similar conditions.
A brittle test passes only when the environment is unusually favorable.

A flaky suite does not just create red builds. It erodes the meaning of green builds.

This distinction matters because the right metrics are different. You are not measuring quality alone; you are measuring trust in the pipeline itself.

The metrics that matter most

There are many things you could measure in CI, but only a few are directly useful for catching instability before merge. The best metrics are the ones that answer one of three questions:

How often do tests fail without a code change that should have broken them?
How quickly can we identify unstable behavior?
How much does instability reduce our confidence in a merge decision?

Here are the core metrics that serve those goals.

1. Flake rate

Flake rate is the single most important metric for test instability in CI. It answers, “How often does this test or suite fail non-deterministically?”

A simple definition:

Run a test multiple times under similar conditions.
Count failures that disappear on rerun without code changes.
Divide flaky failures by total executions.

You can track it at several levels:

per test
per file or suite
per branch
per build pipeline

The exact formula depends on your setup, but the idea is consistent. If a test passes on retry, that does not prove it is flaky, but it is a strong signal worth tracking.
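As a minimal sketch of the definition above, assume each CI attempt is recorded with a test ID, a commit SHA, and a status. The `Attempt` shape and the `flakeRates` name are hypothetical, not taken from any specific tool. A failure counts as flaky here only when the same test also passed on the same commit, which is one possible reading of "disappears on rerun without code changes":

```typescript
// Hypothetical record of one test attempt; field names are illustrative.
interface Attempt {
  testId: string;
  commit: string;
  status: "passed" | "failed";
}

// Flake rate per test: failures that also have a pass on the same commit,
// divided by the total executions of that test.
function flakeRates(attempts: Attempt[]): Record<string, number> {
  const passedOnCommit: Record<string, boolean> = {};
  for (const a of attempts) {
    if (a.status === "passed") passedOnCommit[a.testId + "@" + a.commit] = true;
  }
  const totals: Record<string, number> = {};
  const flaky: Record<string, number> = {};
  for (const a of attempts) {
    totals[a.testId] = (totals[a.testId] || 0) + 1;
    if (a.status === "failed" && passedOnCommit[a.testId + "@" + a.commit]) {
      flaky[a.testId] = (flaky[a.testId] || 0) + 1;
    }
  }
  const result: Record<string, number> = {};
  for (const id of Object.keys(totals)) {
    result[id] = (flaky[id] || 0) / totals[id];
  }
  return result;
}
```

A test that fails on every attempt for a commit scores zero under this definition, which keeps consistent failures out of the flake count.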

Why it matters:

It tells you where to spend debugging time.
It identifies tests that are poisoning merge confidence.
It helps you prioritize fixes by impact, not by loudness.

A useful rule is to watch both absolute flake counts and flake rate percentage. A suite with 2 flakes in 20 runs (10%) is obviously more alarming than one with 2 flakes in 2,000 runs (0.1%), but a large suite with a low percentage can still cause many developer interruptions.

2. Retry frequency

Retry frequency measures how often CI had to rerun a test, job, or step before concluding success or failure. This is not identical to flake rate. A flaky test may never be retried if your pipeline fails fast, and a stable but slow job may be retried due to infrastructure problems.

Track:

number of manual retries
number of automated retries
percentage of jobs that required any retry
average retries per build

Retry data is useful because it reflects operational reality. Even if the final build is green, retries cost time, create uncertainty, and often hide problems that should have been visible.
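Two of the figures listed above can be derived from per-job attempt counts. The sketch below assumes a hypothetical `JobRun` record in which `attempts` counts every run of a job, including the first. It does not separate manual from automated retries, which would need an extra field:

```typescript
// Hypothetical per-job summary; `attempts` includes the first run.
interface JobRun {
  buildId: string;
  job: string;
  attempts: number;
}

// Percentage of jobs that needed any retry, and average retries per build.
function retryStats(jobs: JobRun[]): { pctJobsRetried: number; avgRetriesPerBuild: number } {
  if (jobs.length === 0) return { pctJobsRetried: 0, avgRetriesPerBuild: 0 };
  let retriedJobs = 0;
  let totalRetries = 0;
  const builds: Record<string, boolean> = {};
  for (const j of jobs) {
    const retries = Math.max(0, j.attempts - 1);
    if (retries > 0) retriedJobs += 1;
    totalRetries += retries;
    builds[j.buildId] = true;
  }
  return {
    pctJobsRetried: (retriedJobs / jobs.length) * 100,
    avgRetriesPerBuild: totalRetries / Object.keys(builds).length,
  };
}
```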

A pipeline that needs retries to stay green is not reliable; its unreliability is only masked.

3. First-failure signal quality

A useful CI system should tell you quickly whether a failure is probably real, probably flaky, or probably environmental. First-failure signal quality measures how useful the first failure event is before rerun noise starts.

This is harder to reduce to one formula, but you can assess it with questions like:

Does the failure log point to a consistent assertion or timeout?
Are environment details captured automatically?
Can you tell whether the failure was local to one test or systemic across many tests?
Does the failure happen in the same stage repeatedly?

If first-failure logs are poor, your team spends more time investigating and less time fixing. That makes instability more expensive even when the flake rate is modest.

4. Failure clustering

Failure clustering looks at whether unstable tests fail together, in the same branch, on the same runner, or after the same code paths. Clusters usually reveal shared root causes such as:

shared fixtures
data collisions
parallel execution contention
a bad environment image
a flaky external dependency

This is one of the most actionable metrics because it helps you decide whether the problem is in a test, a test group, or the pipeline environment itself.

Track clusters by:

test name
module or directory
job type
runner image
time window
git commit or changed files
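One simple way to look for clusters is to count failures along a single dimension and sort the result. This is a sketch only: the `FailureEvent` fields and the `clusterBy` helper are assumptions for illustration, and it covers only four of the dimensions listed above.

```typescript
// Hypothetical failure record; only a subset of the listed dimensions is modeled.
interface FailureEvent {
  testId: string;
  module: string;
  job: string;
  runner: string;
}

type Dimension = "testId" | "module" | "job" | "runner";

// Count failures per value of one dimension, most frequent first.
function clusterBy(failures: FailureEvent[], dim: Dimension): Array<[string, number]> {
  const counts: Record<string, number> = {};
  for (const f of failures) {
    counts[f[dim]] = (counts[f[dim]] || 0) + 1;
  }
  return Object.keys(counts)
    .map((k): [string, number] => [k, counts[k]])
    .sort((a, b) => b[1] - a[1]);
}
```

If one runner image or one module dominates the top of the list, that points toward a shared cause rather than several independent flaky tests.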

5. Time to instability detection

This measures how long it takes from the introduction of instability to the point where the pipeline flags it.

The shorter this time is, the better.

Why it matters:

catching instability after merge means more people are affected
catching it before merge limits blast radius
shorter detection windows make root cause analysis easier

In practice, you want instability to show up during the same CI run that introduced it, not hours later in a nightly job or a downstream environment.

6. Merge confidence score

Merge confidence is not a standard industry metric, but it is a useful concept for engineering teams. It represents how much trust you should place in a “green” result before merging.

A simple merge confidence model can incorporate:

current run result
recent flake history for changed tests
retry count
environmental anomalies
test duration variance
whether the test covers high-risk code

You do not need a perfect score. You need a policy that says: “Green on a test with repeated flake history is not the same as green on a stable test.”

This is especially useful for PR gates, where the decision is binary but the evidence is not.
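A merge confidence model does not have to be sophisticated to be useful. The sketch below is one hypothetical scoring function. The `MergeEvidence` fields and every weight are placeholders to tune, and duration variance and code risk are left out for brevity.

```typescript
// Hypothetical evidence gathered for one PR run.
interface MergeEvidence {
  passed: boolean;             // current run result
  recentFlakeRate: number;     // 0..1, for tests related to the change
  retryCount: number;          // retries needed in this run
  environmentAnomaly: boolean; // e.g. a runner image changed recently
}

// Returns a score in [0, 1]. All weights are arbitrary placeholders.
function mergeConfidence(e: MergeEvidence): number {
  if (!e.passed) return 0;
  let score = 1;
  score -= Math.min(e.recentFlakeRate, 1) * 0.5; // flake history costs up to 0.5
  score -= Math.min(e.retryCount, 3) * 0.1;      // each retry costs 0.1, capped at 3
  if (e.environmentAnomaly) score -= 0.2;
  return Math.max(0, score);
}
```

The exact numbers matter less than the property that a green run with flake history and retries scores lower than a green run without them.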

7. Pipeline duration variance

When test instability appears, duration often becomes noisy before failures become obvious. A test that alternates between 30 seconds and 90 seconds may be suffering from intermittent waits, resource contention, or unstable dependencies.

Track:

median duration
p95 duration
duration variance over time
duration by test and by stage

If duration spikes correlate with flakiness, you have a strong clue that the issue is not a product assertion but a timing or environment problem.
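The duration figures listed above can be computed from a plain list of durations for one test or stage. This sketch assumes nearest-rank percentiles (so the median of an even-sized sample is the lower middle value) and population variance; the function names are illustrative.

```typescript
interface DurationStats {
  median: number;
  p95: number;
  variance: number;
}

// Nearest-rank percentile on an ascending-sorted, non-empty array.
function percentile(sorted: number[], p: number): number {
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// Median, p95, and population variance, in the same unit as the input.
function durationStats(durations: number[]): DurationStats {
  if (durations.length === 0) throw new Error("no durations recorded");
  const sorted = durations.slice().sort((a, b) => a - b);
  const mean = sorted.reduce((sum, d) => sum + d, 0) / sorted.length;
  const variance =
    sorted.reduce((sum, d) => sum + (d - mean) * (d - mean), 0) / sorted.length;
  return { median: percentile(sorted, 50), p95: percentile(sorted, 95), variance };
}
```

For a test that usually takes 30 seconds but occasionally takes 90, the median stays at 30 while p95 and variance jump, which is the pattern described above.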

8. Quarantine volume and age

Quarantined tests are a reality in many orgs. The metric that matters is not just how many tests are quarantined, but how long they stay there.

Track:

number of quarantined tests
age of each quarantine item
percentage of flaky failures that get quarantined within a defined SLA
percentage of quarantined tests reintroduced successfully

A large quarantine backlog is a strong indicator that the team is managing instability, not reducing it.
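Quarantine count and age are straightforward to report once each quarantined test carries a timestamp. The sketch below assumes a hypothetical `QuarantineItem` record with an ISO 8601 `quarantinedAt` field and a team-defined age limit in days.

```typescript
// Hypothetical quarantine record.
interface QuarantineItem {
  testId: string;
  quarantinedAt: string; // ISO 8601 timestamp
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Count, oldest age in whole days, and tests quarantined longer than maxAgeDays.
function quarantineReport(items: QuarantineItem[], now: Date, maxAgeDays: number) {
  const ages = items.map((i) => ({
    testId: i.testId,
    ageDays: Math.floor((now.getTime() - new Date(i.quarantinedAt).getTime()) / DAY_MS),
  }));
  return {
    count: ages.length,
    oldestAgeDays: ages.reduce((max, a) => Math.max(max, a.ageDays), 0),
    overdue: ages.filter((a) => a.ageDays > maxAgeDays).map((a) => a.testId),
  };
}
```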

What to measure at each stage of the pipeline

Not every metric belongs everywhere. Good pipeline observability means collecting the right data at the right stage.

During unit tests

Measure:

failure rate by test
retry count
duration variance
module-level clustering

Unit tests should be the cleanest signal in CI. If they are unstable, your merge gate is weaker than it looks.

During integration tests

Measure:

flake rate by environment
external dependency errors
resource contention signals
setup and teardown failures
test order sensitivity

Integration tests often fail because of the environment, not because the application code is wrong. That means observability should include service health, container startup time, database readiness, and network error categories.

During end-to-end tests

Measure:

step-level failure points
locator or assertion retries
page load and API latency correlation
environment-specific failures
cross-browser or cross-device variance

End-to-end suites are usually slower and more expensive, so even a small amount of instability can have a big effect on merge confidence.

Build the metrics from raw signals, not just status labels

CI systems often give you pass, fail, and maybe retry. That is not enough. To detect test instability in CI early, collect structured events from the pipeline.

Useful raw signals include:

test name or ID
commit SHA
branch or PR number
job name
runner image or host
start and end times
attempt count
failure category
error message hash
stack trace signature
changed files in the PR

This allows you to compute metrics over time instead of relying on anecdotal reports.

A basic event schema might look like this:

{
  "testId": "checkout.spec.ts::submits-payment",
  "commit": "a1b2c3d",
  "branch": "feature/payment-retry",
  "job": "e2e-chrome",
  "attempt": 2,
  "status": "passed",
  "durationMs": 48213,
  "failureCategory": "timeout",
  "runner": "ubuntu-22.04"
}

With this kind of structure, you can build dashboards, alerts, and trend analysis that distinguish flaky tests from one-off failures.
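If events are consumed in TypeScript, the schema above could be typed and checked at the ingestion boundary. This is a sketch under assumptions: the `TestEvent` name, the set of allowed `status` values, and treating `failureCategory` as optional are all choices made for illustration, not part of any CI system's format.

```typescript
// Typed version of the event shown above. The status values are assumed.
interface TestEvent {
  testId: string;
  commit: string;
  branch: string;
  job: string;
  attempt: number;
  status: "passed" | "failed" | "skipped";
  durationMs: number;
  failureCategory?: string; // assumed optional: absent when nothing failed
  runner: string;
}

// Runtime check for untrusted input such as parsed log lines.
function isTestEvent(value: unknown): value is TestEvent {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  for (const key of ["testId", "commit", "branch", "job", "runner"]) {
    if (typeof v[key] !== "string") return false;
  }
  const status = v["status"];
  return (
    typeof v["attempt"] === "number" &&
    typeof v["durationMs"] === "number" &&
    (status === "passed" || status === "failed" || status === "skipped")
  );
}
```

Rejecting malformed events early keeps later aggregates from silently skewing when a job emits partial data.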

How to distinguish flaky tests from real regressions

One of the hardest parts of dealing with test instability in CI is not collecting data but interpreting it.

A real regression often shows these patterns:

failure is reproducible on rerun
failure maps to the changed code path
failure affects related tests consistently
failure persists across environments

A flaky test often shows these patterns:

failure disappears on rerun
failure appears in unrelated changes
failure is clustered around timing or environment variation
failure shifts between assertions without code changes

That said, retries can be misleading. A test that passes on rerun may still be revealing a legitimate race condition in production code. So the right response is not “rerun means flaky.” The right response is “rerun means unstable enough to investigate.”

Treat retries as a diagnostic clue, not as proof of harmlessness.
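The patterns above can be turned into a first-pass heuristic. The sketch below uses three hypothetical boolean signals and deliberately returns "investigate" whenever the evidence is mixed, including when a failure disappears on rerun but has no history on unrelated changes.

```typescript
// Hypothetical signals collected for one failure.
interface FailureObservation {
  failedOnRerun: boolean;          // did the same failure reproduce on rerun?
  touchesChangedCode: boolean;     // does the test cover code changed in the PR?
  seenOnUnrelatedChanges: boolean; // has this failure appeared on unrelated PRs?
}

type Verdict = "likely-regression" | "likely-flaky" | "investigate";

// First-pass triage only; a passing rerun alone never yields "likely-flaky".
function classifyFailure(o: FailureObservation): Verdict {
  if (o.failedOnRerun && o.touchesChangedCode) return "likely-regression";
  if (!o.failedOnRerun && o.seenOnUnrelatedChanges) return "likely-flaky";
  return "investigate";
}
```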

A practical threshold strategy

Teams often ask for a precise threshold, such as, “At what flake rate should we block merge?” There is no universal number. Thresholds should reflect team size, test volume, and cost of false confidence.

A more practical strategy is to define tiers:

Informational: one-off failure, no historical pattern
Warning: repeated failure pattern or retry needed
Action required: test has a recent flake history or multiple failures in a rolling window
Merge blocked: a critical path test is unstable enough that the confidence in the gate is low

This approach avoids overreacting to noise while still preventing unstable tests from silently becoming part of your definition of “green.”
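The tiers could be encoded as a small function over a test's recent history. The `TestHistory` fields and the cut-offs of 2 and 3 failures are placeholders chosen for illustration; each team would substitute its own.

```typescript
type Tier = "informational" | "warning" | "action-required" | "merge-blocked";

// Hypothetical rolling-window summary for one test.
interface TestHistory {
  failuresInWindow: number; // failures in the rolling window, including the current one
  retriedThisRun: boolean;
  criticalPath: boolean;
}

// Placeholder thresholds: 2 failures or a retry warn, 3 require action,
// and 3 on a critical-path test block the merge.
function tierFor(h: TestHistory): Tier {
  if (h.criticalPath && h.failuresInWindow >= 3) return "merge-blocked";
  if (h.failuresInWindow >= 3) return "action-required";
  if (h.failuresInWindow >= 2 || h.retriedThisRun) return "warning";
  return "informational";
}
```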

Sample implementation: collecting flaky-test telemetry in CI

A simple CI setup can emit test metadata into logs or a metrics sink after each job. The goal is to make flaky behavior visible without adding much overhead.

Here is an example GitHub Actions workflow that stores test results as artifacts and allows later analysis:

name: ci

on:
  pull_request:

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npm test -- --json --outputFile=test-results.json
      # Upload results even when the test step fails, so failures can be analyzed later
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: test-results
          path: test-results.json

If you use Playwright, you can also capture retry information and trace output to help classify failures:

import { defineConfig } from '@playwright/test';

export default defineConfig({
  retries: 1,
  use: {
    trace: 'on-first-retry',
  },
  reporter: [
    ['list'],
    ['json', { outputFile: 'playwright-results.json' }],
  ],
});

The point is not the specific tool. The point is that your pipeline should preserve enough context to answer:

what failed
whether a retry was needed
whether the failure repeats
whether the same failure appears across commits or runners

Using change-based analysis to raise merge confidence

The most useful instability metrics are the ones that are aware of change scope.

If a PR only touches frontend styling, but your API tests start failing, that is a useful clue. If the PR modifies a shared test fixture, and unrelated tests become flaky, that is even more important. Change-based analysis helps you connect instability to the files, modules, or services that could plausibly have caused it.

You can track:

failures in tests associated with changed files
failures in tests that depend on changed shared fixtures
failures in dependent suites after a library or API change
instability introduced after a runner or dependency image update

This helps answer a critical merge question, “Is this failure likely caused by the change under review, or is it background noise?”
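A minimal form of change-based analysis splits failed tests by whether they depend on anything the PR changed. The sketch assumes a hypothetical, hand-maintained mapping from test IDs to path prefixes; real projects might derive it from coverage data or a build graph instead.

```typescript
// Hypothetical mapping from a test to the source path prefixes it depends on.
interface TestDependencies {
  testId: string;
  dependsOn: string[];
}

// Split failed tests into those plausibly related to the change and the rest.
function failuresRelatedToChange(
  failedTestIds: string[],
  deps: TestDependencies[],
  changedFiles: string[]
): { related: string[]; unrelated: string[] } {
  const related: string[] = [];
  const unrelated: string[] = [];
  for (const id of failedTestIds) {
    const entry = deps.filter((d) => d.testId === id)[0];
    const prefixes = entry ? entry.dependsOn : [];
    const hit = changedFiles.some((f) => prefixes.some((p) => f.indexOf(p) === 0));
    (hit ? related : unrelated).push(id);
  }
  return { related, unrelated };
}
```

Failures in the `unrelated` list are candidates for background noise, though a test with no known dependencies also lands there, so the mapping's completeness limits how much the split can be trusted.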

Alerting without creating alert fatigue

Pipeline observability is only useful if engineers trust the alerts. If every intermittent failure becomes a pager event, people stop paying attention.

Good alerting for test instability in CI should be:

aggregated over time, not per event only
scoped by severity and critical path
aware of recent history
separated from product runtime alerts

For example, a nightly report might summarize:

tests with rising flake rate
jobs with increased retry frequency
environments with unusual failure clustering
quarantines older than a defined threshold

This is usually better than interrupting people on every transient failure.
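The first item in such a report, tests with a rising flake rate, can be found by comparing two time windows. In this sketch the `WindowedRate` shape and the `minIncrease` parameter are assumptions; the window lengths are left to whatever produces the two rates.

```typescript
// Hypothetical per-test flake rates for two consecutive windows.
interface WindowedRate {
  testId: string;
  previousRate: number; // flake rate in the earlier window, 0..1
  currentRate: number;  // flake rate in the most recent window, 0..1
}

// Tests whose flake rate rose by at least minIncrease, largest increase first.
function risingFlakeRates(rates: WindowedRate[], minIncrease: number): string[] {
  return rates
    .filter((r) => r.currentRate - r.previousRate >= minIncrease)
    .sort((a, b) => (b.currentRate - b.previousRate) - (a.currentRate - a.previousRate))
    .map((r) => r.testId);
}
```

Reporting on the change between windows rather than on individual failures is what keeps a transient blip from generating a notification.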

Common mistakes when measuring instability

Measuring only suite pass rate

A suite can keep passing because retries hide instability. Pass rate alone is too blunt.

Ignoring retries

Retries are evidence. If you do not track them, you miss the most obvious sign of a shaky gate.

Treating all failures equally

A timeout, a data collision, and a genuine assertion failure should not be lumped together. Failure category matters.

Not segmenting by environment

If instability only appears on one runner image or one browser, the fix may be infrastructure-related, not test-related.

Letting quarantine become permanent

Quarantine should be a temporary containment strategy, not a storage system for unresolved problems.

A minimal scorecard for engineering teams

If you only have room for a small dashboard, start with this set:

flake rate by test and by suite
retry frequency by job
failure clustering by runner and by module
duration variance for critical tests
quarantine age and count
merge confidence indicators for changed tests

This is enough to spot whether your pipeline is becoming less trustworthy before merge.

If you want a slightly more mature version, add:

failure category breakdown
environment health signals
change-based failure correlation
time to detection for instability

Example policy for PR gating

A practical PR policy might look like this:

Block merge if any critical-path test has failed twice in the last N runs without a known product regression.
Warn if retry count exceeds a threshold for the current PR.
Flag if a test touched by the PR has a recent flake history.
Require a quarantine decision or an assigned fix for flaky tests that stay unresolved past an age limit.

This kind of policy turns metrics into decisions. Without that final step, dashboards are easy to ignore.
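The first three rules of that policy might be evaluated like this. It is a sketch only: `GateInput`, `GateDecision`, and `evaluateGate` are hypothetical names, and the inputs are assumed to be precomputed from the telemetry described earlier.

```typescript
// Hypothetical inputs, precomputed from pipeline telemetry.
interface GateInput {
  // Highest failure count in the last N runs among critical-path tests,
  // excluding failures explained by a known product regression.
  maxCriticalPathFailuresInLastN: number;
  retryCountThisPr: number;
  changedTestsWithFlakeHistory: string[];
}

interface GateDecision {
  blockMerge: boolean;
  warnings: string[];
}

// Block on two or more unexplained critical-path failures; warn on the rest.
function evaluateGate(input: GateInput, retryThreshold: number): GateDecision {
  const warnings: string[] = [];
  if (input.retryCountThisPr > retryThreshold) {
    warnings.push("retry count " + input.retryCountThisPr + " exceeds " + retryThreshold);
  }
  for (const t of input.changedTestsWithFlakeHistory) {
    warnings.push("recent flake history: " + t);
  }
  return { blockMerge: input.maxCriticalPathFailuresInLastN >= 2, warnings };
}
```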

How QA, DevOps, and SRE can split responsibilities

Managing test instability in CI is a shared job, but the responsibilities differ.

QA leads usually own test design, flaky test triage, and quarantine policy.
DevOps teams usually own runner stability, execution environment consistency, and CI observability.
SREs usually help with reliability patterns, alert design, and production-like dependency health.
Engineering managers usually decide what threshold is acceptable and what gets blocked.

The best outcome is when no single team has to “own” the flake problem alone. Instability often sits at the boundary between test design and infrastructure.

A useful operational loop

If you want this to be sustainable, use a recurring loop:

Detect instability early with telemetry.
Classify the failure pattern.
Route the issue to the right owner.
Quarantine only when needed.
Fix the root cause.
Remove quarantine and verify stability over time.

This turns flaky test handling from reactive cleanup into a reliability practice.

Final takeaway

If your goal is to catch test instability in CI before merge, the most valuable metrics are the ones that improve trust, not just visibility. Start with flake rate, retry frequency, failure clustering, duration variance, and quarantine age. Add merge confidence and change-based analysis so your pipeline can distinguish a real regression from a noisy test.

The practical test is simple: when a PR turns a green pipeline red, can you tell quickly whether the merge is unsafe, the test is unstable, or the environment is lying? If the answer is no, you do not have enough observability yet.

Measure the instability that affects decisions, not just the failures that happen to be visible.
