Flaky UI tests are usually not random. They are just hard to observe after the fact. A test fails in CI, reruns cleanly, and leaves behind a screenshot that only proves the page looked wrong at one moment in time. What you really need is a timeline, one that shows what the browser saw, when the DOM changed, which network calls were slow, and whether the test clicked too early, found the wrong element, or got stuck behind a transition.

That is where browser session replay for flaky UI tests becomes useful. Session replay gives you a time-based reconstruction of the browser session, which you can combine with logs, traces, and timing data to build a repeatable debugging workflow. The goal is not just to watch a replay, but to answer a specific question: what caused this browser test failure, and how do I prevent it from happening again?

The best flaky-test workflow is not “rerun until green.” It is “collect enough evidence on the first failure to make the next failure cheaper.”

This guide walks through a practical workflow for debugging intermittent browser failures in Playwright, Cypress, Selenium, and CI pipelines. It also shows where replay helps, where it does not, and how editable test flows and step-by-step run history can complement replay-based debugging.

What session replay is, and what it is not

Session replay is a recording of a browser session that can show page state changes over time, user actions, network timing, DOM updates, console logs, and sometimes screenshots or full visual frames. In test automation, replay is most valuable when it is tied to a specific automated run, because that lets you correlate what the test did with what the application did.

It is important to separate three related artifacts:

  • Test logs, which show the sequence of steps or assertions
  • Execution traces, which capture timing, DOM snapshots, network activity, and browser state
  • Session replay, which provides a visual and temporal narrative of the run

Replay alone does not explain why an element was not clickable, why a selector broke, or why an animation caused a race condition. It becomes useful when paired with evidence from the test runner and the app itself.

The workflow: collect, correlate, classify, confirm, fix

A good debugging workflow for flaky UI tests has five phases.

1. Collect the right artifacts on every failure

If the pipeline only keeps a stack trace, you are debugging blind. When a browser test fails, capture as much of the following as is practical:

  • Console logs
  • Network failures and response timing
  • DOM snapshots or trace files
  • Screenshot at failure
  • Video or session replay
  • Test step history with timestamps
  • Browser/version, viewport, and environment metadata

For Playwright, the trace viewer is especially useful because it gives a timeline of actions and snapshots. For Cypress, videos and command logs can help. For Selenium, you often need to assemble the evidence yourself using logs, screenshots, and grid metadata.

A useful rule: if a test can fail in CI without leaving enough evidence to explain the failure, that test is under-instrumented.
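One way to enforce that rule is to write a small manifest next to the artifacts and check it for gaps. The following is a minimal sketch, not a feature of any test runner: the `FailureManifest` class, the `GIT_COMMIT` environment variable, and the artifact paths are all hypothetical names chosen for illustration.

```python
import json
import os
import platform
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone


@dataclass
class FailureManifest:
    """Index of everything captured for one failed test run (hypothetical helper)."""
    test_name: str
    browser: str
    viewport: str
    # GIT_COMMIT is an assumed CI variable; substitute whatever your pipeline exposes.
    commit_sha: str = field(default_factory=lambda: os.environ.get("GIT_COMMIT", "unknown"))
    host_os: str = field(default_factory=platform.system)
    captured_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())
    artifacts: dict = field(default_factory=dict)  # artifact kind -> file path

    def attach(self, kind: str, path: str) -> None:
        self.artifacts[kind] = path

    def missing(self, required=("screenshot", "console_log", "trace")) -> list:
        """Return the required artifact kinds that were not captured."""
        return [kind for kind in required if kind not in self.artifacts]

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2, sort_keys=True)


manifest = FailureManifest("checkout submits the form", browser="chromium", viewport="1280x720")
manifest.attach("screenshot", "artifacts/checkout/failure.png")
manifest.attach("console_log", "artifacts/checkout/console.log")
print(manifest.missing())  # ['trace']
```

A CI step could fail the upload stage, or at least warn, whenever `missing()` returns a non-empty list, which makes under-instrumented tests visible before anyone needs the evidence.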

2. Correlate the replay with the test timeline

Open the replay and answer four questions in order:

  1. Did the page load the expected route?
  2. Did the test interact with the intended element?
  3. Did the UI change before or after the assertion?
  4. Did any network call, rendering delay, or overlay explain the failure?

You are not trying to “watch the test” for entertainment. You are aligning the visible browser state with the automation timeline.

A practical trick is to annotate timestamps from the test logs, then jump to those moments in the replay. If the failure happened at step 12, and the replay shows a modal opening at 8.4 seconds while the click happened at 8.2 seconds, you already have a likely race condition.
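The timestamp comparison described above can be automated once both timelines use the same clock. This sketch assumes you have already exported test steps and replay observations as simple records; the `Event` type and the labels are illustrative, not a format produced by any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    at: float      # seconds since the start of the run
    source: str    # "test" for automation steps, "replay" for observed UI changes
    label: str


def ui_changes_near(events, step_label, window=0.5):
    """Return replay events that occurred within `window` seconds of a test step."""
    step = next(e for e in events if e.source == "test" and e.label == step_label)
    return [
        (e.label, round(e.at - step.at, 3))
        for e in sorted(events, key=lambda e: e.at)
        if e.source == "replay" and abs(e.at - step.at) <= window
    ]


events = [
    Event(7.9, "replay", "spinner hidden"),
    Event(8.2, "test", "step 12: click Save"),
    Event(8.4, "replay", "modal opened"),
    Event(9.6, "replay", "toast shown"),
]

for label, offset in ui_changes_near(events, "step 12: click Save"):
    print(f"{label}: {offset:+.1f}s relative to the step")
```

A positive offset means the UI changed after the test acted, which is the pattern to look for when you suspect a race.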

3. Classify the failure type

Most flaky browser test failures fall into a few repeatable categories:

  • Locator instability: the selector points to the wrong element or no longer resolves
  • Timing race: the test clicks before the UI is ready
  • Animation or transition interference: the element exists but is not interactable yet
  • Network variance: data arrives later than expected, changing page state
  • Test data drift: the setup state is different from what the test assumes
  • Environment mismatch: viewport, browser, or auth state differs between local and CI

Replay helps you distinguish between these. For example, a locator problem often looks like the test targeting the wrong row, wrong card, or hidden duplicate element. A timing problem often looks like the right element appearing after the click or assertion.
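A first-pass classification can be scripted from the error text before anyone opens the replay. The rules below are a rough sketch: the patterns are assumptions based on common wording in browser automation errors, and the category names are shorthand for the list above. Treat the result as a starting hypothesis to check against the replay, not a diagnosis.

```python
import re

# Ordered (pattern, category) rules; the first match wins.
# The patterns are illustrative guesses, so tune them to the messages your runner emits.
RULES = [
    (r"intercept|not clickable|obscured", "animation_or_overlay"),
    (r"no such element|not found|strict mode violation|resolved to \d+ elements", "locator"),
    (r"\b(401|403)\b|unauthorized|forbidden", "environment_or_auth"),
    (r"timed? ?out", "timing_or_network"),
]


def classify(error_message: str) -> str:
    """Map an error message to a likely flakiness category."""
    text = error_message.lower()
    for pattern, category in RULES:
        if re.search(pattern, text):
            return category
    return "unclassified"


print(classify("ElementClickInterceptedException: element click intercepted"))  # animation_or_overlay
print(classify("NoSuchElementException: no such element"))                      # locator
print(classify("TimeoutError: locator.click: Timeout 30000ms exceeded"))        # timing_or_network
```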

4. Confirm the root cause with a minimal reproduction

Once you have a theory, reproduce the issue with one targeted change. For example:

  • Add a longer wait or better synchronization, if you suspect timing
  • Replace an unstable selector, if you suspect locator drift
  • Freeze test data or mock the API, if you suspect state variance
  • Run in the same viewport and browser as CI, if you suspect environment mismatch

The goal is to confirm the cause, not to patch the test blindly. If you make multiple changes at once, you lose the evidence chain.
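Because a flaky test can pass by chance, a single green run does not confirm a theory. A more convincing check is to run the baseline and the single-change variant many times and compare failure rates. The sketch below uses a simulated test with random render latency as a stand-in for a real browser run; `failure_rate` and `simulated_test` are hypothetical helpers.

```python
import random


def failure_rate(test_fn, runs=200, seed=0):
    """Run test_fn repeatedly and return the fraction of runs that raised AssertionError."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(runs):
        try:
            test_fn(rng)
        except AssertionError:
            failures += 1
    return failures / runs


def simulated_test(wait_for_render):
    """Stand-in for a browser test; render latency varies between 0.05s and 0.40s."""
    def run(rng):
        render_done_at = rng.uniform(0.05, 0.40)
        # Baseline clicks at a fixed 0.30s; the variant waits for the render to finish.
        click_at = render_done_at if wait_for_render else 0.30
        assert click_at >= render_done_at, "clicked before the element rendered"
    return run


baseline = failure_rate(simulated_test(wait_for_render=False))
candidate = failure_rate(simulated_test(wait_for_render=True))
print(f"baseline {baseline:.0%}, with synchronization {candidate:.0%}")
```

With a real suite, `test_fn` would invoke the test through your runner. The point is that only one variable differs between the two measurements, so a drop in the failure rate can be attributed to that change.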

5. Fix the test and improve observability

A debugging workflow should end with a better test, not just a passing rerun. Add the instrumentation or guardrails that would make the next incident easier to diagnose:

  • Better locator strategy
  • Explicit waits on meaningful conditions
  • More stable test data setup
  • Trace or replay capture on failure only, if storage is a concern
  • Step annotations in the test code or test management system

Building the evidence stack

The strongest browser test debugging setup combines replay with logs and execution traces.

Playwright example, capture trace on failure

Playwright makes this straightforward. A common pattern is to record a trace for every test but keep it only when the test fails.

import { test, expect } from '@playwright/test';

test('adds item to cart', async ({ page }) => {
  await page.goto('https://example.com/shop');
  await page.getByRole('button', { name: 'Add to cart' }).click();
  await expect(page.getByText('Added to cart')).toBeVisible();
});

In playwright.config.ts, keep the failure artifacts:

import { defineConfig } from '@playwright/test';

export default defineConfig({
  use: {
    // Keep heavy artifacts only for failing tests to limit storage.
    trace: 'retain-on-failure',
    screenshot: 'only-on-failure',
    video: 'retain-on-failure',
  },
});

This gives you enough to compare the trace timeline with your browser session replay.

Cypress example, surface timing clues

Cypress command logs are helpful, but you still want the app state around the failure.

describe('checkout', () => {
  it('submits the form', () => {
    cy.visit('/checkout');
    cy.get('[data-testid="submit-order"]').click();
    cy.contains('Order confirmed').should('be.visible');
  });
});

If this fails intermittently, add logging around network activity or UI state changes, then compare those timestamps with the replay.

Selenium example, log browser state explicitly

Selenium often needs extra observability because it does not include the same built-in trace tooling.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://example.com/login')

# Wait up to 10 seconds for the login button to become clickable, then click it.
WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.CSS_SELECTOR, '[data-testid="login-button"]'))
).click()

If this is flaky, record the URL, browser version, viewport, console output, and any visible overlay or modal state when the failure happens.
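A small helper can gather most of that state in one call. This is a sketch under assumptions: it expects an object with the Selenium Python `WebDriver` attributes used here (`current_url`, `title`, `capabilities`, `get_window_size`, `get_log`, `save_screenshot`), and `get_log("browser")` is only supported by some drivers, so the helper tolerates its absence. The name `capture_failure_state` is hypothetical.

```python
import json


def capture_failure_state(driver, screenshot_path):
    """Collect browser state from a Selenium-style driver at the moment of failure."""
    state = {
        "url": driver.current_url,
        "title": driver.title,
        "browser": driver.capabilities.get("browserName"),
        "browser_version": driver.capabilities.get("browserVersion"),
        "window_size": driver.get_window_size(),
    }
    try:
        # Console logs are only exposed by some drivers (for example Chromium-based ones).
        state["console"] = driver.get_log("browser")
    except Exception as exc:
        state["console"] = f"unavailable: {exc}"
    state["screenshot_saved"] = bool(driver.save_screenshot(screenshot_path))
    return state


# Usage inside a test, where run_login_test is a placeholder for your own test body:
# try:
#     run_login_test(driver)
# except Exception:
#     print(json.dumps(capture_failure_state(driver, "failure.png"), indent=2))
#     raise
```

Overlay or modal state is application-specific, so it is left out here. A check for your own modal selector could be added to the returned dictionary.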

How to read replay like a debugger

Watching replay efficiently is a skill. Do not scrub from start to finish unless you have to.

Start with the failure moment

Jump to the failure timestamp and inspect what the browser showed just before the assertion or interaction failed. Look for:

  • Late-loading content
  • Toasts, tooltips, or modals
  • Layout shifts
  • Duplicate elements with the same text
  • Disabled or covered controls
  • Pending network requests

Compare visible UI with DOM assumptions

A common reason for flaky UI tests is that the DOM does not match the mental model in the test code. For example, a selector may match the first “Save” button on the page, while the test author intended the one inside an active panel.

If the replay shows the wrong card, wrong row, or hidden element being targeted, the fix is usually selector design, not more waiting.

Use the network timeline to explain waiting failures

If the replay shows the browser waiting on data, inspect the associated API calls. Slow, variable, or conditional API responses often explain why the same test passes locally but fails in CI. This is especially true when tests depend on seeded data or backend jobs finishing in time.
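If you can export request timings from a trace or HAR-like log, two quick questions cover most waiting failures: which requests were still in flight at the failure timestamp, and which completed requests were slowest. The record shape below (`url`, `started`, `finished`) is an assumption made for this sketch, not a standard format.

```python
def requests_in_flight(network_log, failure_at):
    """Return URLs of requests that had started but not finished at the failure timestamp."""
    return [
        r["url"] for r in network_log
        if r["started"] <= failure_at and (r["finished"] is None or r["finished"] > failure_at)
    ]


def slowest(network_log, top=3):
    """Return (url, duration) pairs for the slowest completed requests."""
    done = [
        (r["url"], round(r["finished"] - r["started"], 3))
        for r in network_log
        if r["finished"] is not None
    ]
    return sorted(done, key=lambda pair: pair[1], reverse=True)[:top]


network_log = [
    {"url": "/api/session", "started": 0.2, "finished": 0.5},
    {"url": "/api/cart", "started": 1.0, "finished": 4.8},
    {"url": "/api/recommendations", "started": 1.1, "finished": None},  # never completed
]

print(requests_in_flight(network_log, failure_at=4.0))  # ['/api/cart', '/api/recommendations']
print(slowest(network_log, top=1))                       # [('/api/cart', 3.8)]
```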

Identify animation and layout shift problems

Modern frontends often animate panels, slides, and menus. A test may locate the element before the transition completes, then fail because the element is still moving or temporarily covered. Replay makes these cases obvious because you can see the UI state change frame by frame.

If an element is present in the DOM but not yet visible in the replay at click time, treat it as a synchronization problem until proven otherwise.

A practical classification table

When you are triaging browser test failures, it helps to map the symptom to the most likely cause.

Symptom                  | Likely cause                          | First thing to inspect
-------------------------|---------------------------------------|------------------------------------
Element not found        | Locator drift, conditional rendering  | DOM snapshot, selector specificity
Click intercepted        | Overlay, animation, hidden layer      | Replay at interaction time
Assertion timeout        | Slow data fetch, late render          | Network timing, app state
Pass locally, fail in CI | Environment mismatch, timing variance | Browser version, viewport, CPU load
Rerun passes immediately | Race condition, transient state       | Step timestamps, replay, trace

This table is not a diagnosis tool by itself, but it keeps the investigation focused.

What to change in your tests after the first failure

A debugging workflow should feed back into test design. If the same class of failure keeps recurring, the test is missing one or more of these qualities.

Prefer user-facing locators, but make them stable

Use roles, labels, and stable test IDs when appropriate. Avoid selectors based on generated classes or brittle DOM structure unless there is no alternative.

The right selector strategy depends on the UI. A button with a clear accessible name is often better located by role. A list item with repeated text may need a scoped test ID.

Synchronize on behavior, not arbitrary timeouts

A wait like “sleep for 2 seconds” does not explain anything and often hides the real problem. Prefer waiting for a visible outcome, a network response, or a state change that matters to the user.

Examples include:

  • Element becomes visible and enabled
  • Spinner disappears
  • API call returns 200
  • Navigation completes
  • Toast message appears
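All of these conditions reduce to the same mechanism: poll for a meaningful state with a deadline, rather than sleeping for a fixed time. Test frameworks ship their own versions of this (Selenium's `WebDriverWait` is one). The helper below is a framework-free sketch of the idea, and `wait_until` is a name invented for this example.

```python
import time


def wait_until(condition, timeout=5.0, interval=0.1, description="condition"):
    """Poll `condition` until it returns a truthy value or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            # The description ends up in the failure message, which helps later triage.
            raise TimeoutError(f"{description} not met within {timeout}s")
        time.sleep(interval)


# Stand-in for application state: the spinner disappears after 0.3 seconds.
started = time.monotonic()
spinner_gone = lambda: time.monotonic() - started > 0.3

wait_until(spinner_gone, timeout=2.0, description="spinner hidden")
print("spinner gone, safe to assert")
```

Unlike a fixed sleep, this returns as soon as the condition holds, and when it fails it says which condition was never met.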

Make test data deterministic

If a test depends on backend state, make that state explicit. Seed the data, reset it between runs, and avoid shared accounts or random fixtures unless the workflow is intentionally testing those conditions.
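One way to get data that is both isolated and reproducible is to derive it from the test name and a run identifier. This is a minimal sketch: `build_fixture`, the email pattern, and the SKU values are hypothetical, and a real suite would pass the result to its own seeding API.

```python
import hashlib
import random


def build_fixture(test_name: str, run_id: str) -> dict:
    """Derive isolated, reproducible test data from the test name and run id."""
    digest = hashlib.sha256(f"{run_id}:{test_name}".encode()).hexdigest()
    rng = random.Random(digest)  # the same inputs always produce the same data
    return {
        # Unique per test and run, so no two tests share an account.
        "account_email": f"qa+{digest[:10]}@example.com",
        "cart_items": rng.sample(["sku-101", "sku-202", "sku-303", "sku-404"], k=2),
        "seed": digest[:10],
    }


fixture = build_fixture("checkout submits the form", run_id="build-1234")
print(fixture["account_email"], fixture["cart_items"])
```

Logging the `seed` value alongside the failure means the exact data can be regenerated when you reproduce the run locally.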

Log important state at the step level

At minimum, log route, selected item, account, response status, and visible application mode. When replay is not available, these logs become your reconstruction tool.
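Step-level logging is easiest to correlate with a replay when each entry is structured and carries an elapsed timestamp. The sketch below emits one JSON line per step; the `StepLog` class and the field names are illustrative assumptions.

```python
import json
import time


class StepLog:
    """Collects one JSON line per test step, with elapsed time since the run started."""

    def __init__(self):
        self._start = time.monotonic()
        self.lines = []

    def step(self, name, **state):
        # "t" is seconds since the log was created, so it lines up with replay offsets.
        record = {"t": round(time.monotonic() - self._start, 3), "step": name, **state}
        self.lines.append(json.dumps(record, sort_keys=True))
        return record


log = StepLog()
log.step("open checkout", route="/checkout", account="qa-user-1", mode="guest")
log.step("submit order", route="/checkout", selected_item="sku-101", response_status=200)
print("\n".join(log.lines))
```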

Using replay to debug intermittent failures in CI

CI is where flakiness becomes expensive. Tests run under different CPU, memory, network, and browser conditions than they do on your laptop. A workflow built around replay and traces helps you stop treating each CI failure as a unique event.

A good CI setup usually includes:

  • Failure-only artifact capture to control storage
  • Run metadata, commit SHA, branch, browser, worker, and viewport
  • Artifact retention long enough to compare repeated failures
  • Clear links from the test report to replay or trace files

If the same test fails three times in a week, compare the replays side by side. Look for the same interaction happening at different states, or the same state appearing at different times. Patterns often emerge quickly.
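Before comparing replays by hand, it helps to know which failures are actually the same failure. A rough approach is to normalize volatile details such as durations and group by the result. This sketch assumes failure records with `test` and `error` fields, which is a made-up shape for illustration.

```python
import re
from collections import Counter


def signature(test_name: str, error: str) -> tuple:
    """Normalize volatile details (numbers, case) so similar failures share one signature."""
    normalized = re.sub(r"\d+", "N", error.lower()).strip()
    return (test_name, normalized)


def recurring(failures, min_count=3):
    """Return signatures seen at least `min_count` times, most frequent first."""
    counts = Counter(signature(f["test"], f["error"]) for f in failures)
    return [(sig, n) for sig, n in counts.most_common() if n >= min_count]


failures = [
    {"test": "checkout", "error": "Timeout 30000ms waiting for 'Order confirmed'"},
    {"test": "checkout", "error": "Timeout 30012ms waiting for 'Order confirmed'"},
    {"test": "login", "error": "Element not found: #login"},
    {"test": "checkout", "error": "Timeout 29998ms waiting for 'Order confirmed'"},
]

for (test, error), count in recurring(failures):
    print(f"{count}x {test}: {error}")
```

Signatures that cross the threshold are the ones worth a side-by-side replay comparison.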

A simple triage checklist

When a browser test fails, use this order:

  1. Confirm whether the failure reproduces consistently or only intermittently
  2. Open the replay or trace at the failure timestamp
  3. Check whether the right UI element was present and interactable
  4. Inspect network calls and console errors around the failure
  5. Compare local and CI environment metadata
  6. Classify the failure as locator, timing, data, or environment related
  7. Apply the smallest fix that makes the root cause unlikely
  8. Add observability so the next failure is faster to diagnose

This checklist sounds basic, but it works because it prevents premature patching.

When replay is not enough

Replay is excellent for visual and timing diagnosis, but it is not a substitute for deeper test design work. It will not fix a bad architecture.

Replay alone cannot solve:

  • Tests that overreach into too many unrelated UI behaviors
  • Shared state between tests
  • Poorly isolated fixtures
  • Hard-coded assumptions about backend speed
  • Brittle selectors copied across the suite

If you keep seeing the same failure shape, the fix may be in test boundaries, application testability, or data setup, not the interaction step itself.

Where editable test flows help

Replay is strongest when the evidence lives next to the test definition. That is why step-by-step run history and editable test flows are useful complements to replay-based debugging. They let reviewers inspect the failed step, adjust the interaction, and preserve the reasoning behind the change.

For teams that prefer low-code or mixed-code workflows, Endtest’s self-healing tests are one practical option to consider. Its agentic AI approach can recover from broken locators by selecting a new match from surrounding context, while keeping the run history visible so you can see what changed. The documentation also describes how the platform reduces maintenance when UI structure shifts, which is useful when flaky failures are caused by locator drift rather than timing.

That kind of tool is not a replacement for replay analysis, but it can reduce the number of failures that need manual root-cause work in the first place.

Building a durable workflow for the team

The real payoff comes when debugging is standardized across the team, not improvised per incident. A strong workflow usually looks like this:

  • Every UI test failure produces a trace, screenshot, and log bundle
  • CI links directly to replay or execution artifacts
  • Engineers know how to jump from failure to timeline quickly
  • The team classifies flakiness by root cause, not just symptom
  • Fixes include both code changes and observability improvements
  • Repeated failures feed into a maintenance backlog

If you want the process to stick, make it part of the definition of done for browser tests. A test is not truly finished unless it is debuggable when it fails.

Final take

Browser session replay for flaky UI tests is most useful when it sits inside a broader debugging workflow. Replay shows what the browser did, traces show when it did it, logs explain what the test believed, and timing clues reveal why the two diverged. Together, they turn intermittent browser test failures from guesswork into a structured investigation.

If you build one thing after reading this, build the evidence pipeline first. Once failure artifacts are collected consistently, you can debug faster, fix the right layer, and spend less time rerunning tests just to learn the same lesson twice.