Writing Playwright Tests with Claude Code: What Works, What Breaks

A practitioner guide to writing, debugging, and shipping Playwright tests with Claude Code. Patterns that work, patterns that break, and when to graduate to a dedicated tool.

Himanshu Saleria
Claude Code · Playwright · AI Testing · How-to · Dev Guide

Published 2026-06-14 · Last updated 2026-06-14 · 13-minute read

Most "Claude Code for testing" tutorials show you a happy-path demo against a todo app, then close the tab. They don't tell you what happens when the test has to log in, the form is dynamic, or the agent writes four different selectors across four runs.

We see the wire. Our open-source playwright-mcp server logged 355,654 tool calls from 980 distinct Claude Code users. This guide is what those numbers mean for the engineer asking Claude Code to write a Playwright test that ships.

TL;DR

  • Claude Code is the most-adopted MCP client driving browser tests: 980 distinct users and 355,654 events on our open-source playwright-mcp server.
  • Four patterns work cleanly: writing a new test from a plain-English description, debugging a failing test, refactoring selectors after a UI change, and generating API tests against an OpenAPI spec.
  • Four patterns break: flows beyond the context window, dynamic data, conditional branching, and OAuth or any redirect-heavy auth flow.
  • Agents drive by sight. Telemetry shows 2.4 screenshots per browser session vs ~0.4 DOM snapshots. Write tests that assert on visual outcomes, not deep DOM state.
  • Graduate when the suite crosses ~50 tests, runs on CI, or starts failing more from drift than from real bugs. That is the moment Claude Code becomes a worse fit, not a better one.

Bottom line. Claude Code paired with the Playwright MCP server is the fastest way to author and debug a Playwright test in 2026, and the most-adopted client doing it (980 users, 355,654 events on our open-source server). It works cleanly for first-pass test authoring, selector refactors, debug loops, and API tests. It breaks on long flows, dynamic data, deep conditionals, and OAuth. After ~50 tests, graduate to a dedicated AI testing tool.

This guide is written for the engineer who already uses Claude Code for code, wants to extend it to tests, and would rather know the failure modes up front than discover them in CI at 11pm.


Why Claude Code + Playwright is a thing in 2026

Because the install base is real, and because Claude Code is what most developers already have open when they decide to write a test.

Our open-source playwright-mcp server has handled 1.42 million agent tool calls since November 2025. The single biggest named client is Claude Code: 355,654 events from 980 distinct users, more headcount than any other coding agent. Opencode logs more volume per user (sustained autonomous loops), but Claude Code has the broadest practitioner base. Full breakdown in our Claude Code vs Cursor vs Opencode comparison.

Claude Code users average 363 tool calls per user, the signature of "verify my change inside a coding session" rather than "run a regression suite overnight." Claude Code is great at the dev loop. The autonomous-suite case is a different shape of work.

"230,000 developers pulled an open-source MCP server in twelve months. Most of them never came back." — Playwright MCP: 230K downloads, what we learned

Across all clients combined, 41% of users logged 5 events and never returned. Claude Code makes the first test easy. The work this guide protects against is the moment after.


How do you set up Claude Code with Playwright MCP?

Install the MCP server as a global npm package, register it with Claude Code, then point Claude Code at your repo and let it run the loop. Three steps, none require leaving your terminal.

# 1. install the MCP server
npm install -g playwright-mcp

# 2. register it with Claude Code
claude mcp add playwright-mcp playwright-mcp

# 3. open Claude Code in your repo
cd ~/your-app && claude

Confirm registration with claude mcp list. Then ask Claude Code: "write a Playwright test that adds an item to the cart on http://localhost:3000."

If your dev server is running, the agent uses the MCP to launch a browser, screenshot the page, decide on selectors, write the spec into your tests/ folder, run it, and iterate until it passes. Playwright docs live at playwright.dev/docs; the Claude Code MCP docs cover full configuration.

One detail worth knowing: your localhost traffic is the single most-tested URL in our telemetry. 272 distinct users hit 127.0.0.1 with 18,762 events, more than any public domain. The dev loop is the killer use case, not staging.
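
Because the dev loop runs against localhost, it can help to pin that address in the Playwright config so generated specs can use relative paths such as '/checkout'. A minimal sketch, assuming the app is served on http://localhost:3000 (the address from the prompt above) and is started with npm run dev, which is a placeholder for your own start command:

```typescript
// playwright.config.ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  testDir: './tests', // the folder Claude Code writes specs into in this guide
  use: {
    // Assumed dev-server address; lets specs call page.goto('/checkout').
    baseURL: 'http://localhost:3000',
  },
  webServer: {
    command: 'npm run dev', // hypothetical start script, replace with yours
    url: 'http://localhost:3000',
    // Reuse an already-running dev server locally; start a fresh one in CI.
    reuseExistingServer: !process.env.CI,
  },
});
```

With webServer set, npx playwright test starts the app when it is not already running, so the agent's runs and your own start from the same state.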


The four patterns that work cleanly

These four are where Claude Code earns the time it saves. Each pattern has a one-shot prompt that, in our experience and across the 980-user dataset, produces something close to publishable in one or two iterations.

Pattern 1: Write a Playwright test from a plain-English description

The single highest-leverage use case. You describe the flow, Claude Code writes the spec, runs it, fixes the failures.

A working prompt:

"Write a Playwright test for /checkout that adds two items to the cart, applies promo code SAVE10, and asserts the total reads $18.00. Use TypeScript, save to tests/checkout.spec.ts."

What Claude Code actually does: calls init_browser to launch a browser, takes a screenshot, locates the add-to-cart button by visible text, types the promo code into the visible input, and asserts on the rendered total. The typical output:

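import { test, expect } from '@playwright/test';

test('checkout with promo code', async ({ page }) => {
  // Relative path: resolves against baseURL in playwright.config.ts.
  await page.goto('/checkout');
  // Add two items by clicking the first two "Add to cart" buttons.
  await page.getByRole('button', { name: 'Add to cart' }).first().click();
  await page.getByRole('button', { name: 'Add to cart' }).nth(1).click();
  // Apply the promo code through the labelled input.
  await page.getByLabel('Promo code').fill('SAVE10');
  await page.getByRole('button', { name: 'Apply' }).click();
  // Web-first assertion: retries until the total matches or the timeout expires.
  await expect(page.getByTestId('order-total')).toHaveText('$18.00');
});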

Notice the locator strategy: getByRole, getByLabel, getByTestId. Claude Code defaults to Playwright's recommended user-facing locators when the page exposes them. That matters for stability. A senior practitioner told us once that the difference between a 90% pass rate and a 99% pass rate is whether the team learned to stop writing CSS selectors. Claude Code skips that lesson by default.

Pattern 2: Debug a failing Playwright test

The second-best use. Paste the failing test, paste the error, ask Claude Code to fix it.

A working prompt:

"This test fails with TimeoutError: locator.click: Timeout 30000ms exceeded. Read the test, open the page in MCP, find the actual element, and fix the selector. Don't change the assertion."

What it does: opens the URL, screenshots the page, compares the failing selector to what's actually rendered, rewrites the locator, runs the test again, reports back. The agent typically catches three common patterns: the element moved into a modal, the selector hardcoded a class that changed, or the page needs an explicit wait for a network call.

The "don't change the assertion" framing is load-bearing. Without it, Claude Code will sometimes "fix" a failing test by weakening the assertion to make it pass. That is The Green-Pipeline Lie in our state-of-AI-QA report: self-healing that silently waters down what's checked. Constrain the agent's solution space explicitly and the failure mode mostly goes away.

Pattern 3: Refactor selectors after a UI change

Designers pushed a redesign. Your suite is suddenly red. This is the cleanest single-batch use of Claude Code we see across real teams.

A working prompt:

"I redesigned the navbar. Selectors in tests/nav/ are failing. Run each failing test, screenshot the new navbar, update the selectors to the new structure, keep all assertions intact, and commit each fix as a separate commit."

What it does: walks the failing tests one by one, screenshots the current state, rewrites the locators, runs the test, moves on. The per-commit framing matters because it gives you a reviewable diff history instead of one monster commit.

This pattern is where Claude Code looks most like a productivity unlock. In our 41-team mid-market SaaS study, locator maintenance ate 20–30% of total automation time, and a single UI change typically triggered a 4–5 hour batched fix. Claude Code compresses that batch. We unpack the maintenance math in Playwright vs QAby.AI.

Pattern 4: Generate API tests against an OpenAPI spec

The agent reads the spec, writes the test. No browser involved.

A working prompt:

"Read openapi.yaml. Generate Playwright API tests for the /orders endpoints. Cover happy path, validation errors, and one auth failure case per endpoint."

Output: a tests/api/ folder with files per endpoint, using request fixtures. This is the fastest way we have found to bootstrap an API test suite from scratch. It uses no MCP browser tools, so the playwright-mcp install isn't strictly required. The Claude Code core is sufficient.


The four patterns where Claude Code breaks

Now the honest part. These are the failure modes our customer interviews surface most often, and they line up with the activation-cliff data: 41% of agent users drop off after 5 events, and in the interviews the usual story is that they hit one of these patterns and the agent flailed.

Failure 1: Long flows beyond the context window

Anything beyond a 15–20 step end-to-end flow starts to lose state. Claude Code's working memory is wide, not infinite. A test that logs in, navigates to settings, changes a payment method, navigates to checkout, redeems a coupon, completes payment, and verifies the confirmation email will start dropping steps or writing the wrong assertion by step 12.

The honest workaround: split into smaller specs and use test.describe.serial to chain them, or move the long-flow definition into a separate orchestration file the agent doesn't need to hold in memory at once. For comparison, the median test on QAby.AI's own platform is 8 steps, per The Anatomy of an AI-Authored Test. Short tests are not a coincidence. They're how AI authoring stays reliable.

Failure 2: Dynamic data

If the test needs a unique email per run, a fresh order ID, or a timestamp inside a date picker, Claude Code will sometimes hardcode the value from the last successful run. The agent's instinct is to read the screen, see "user-2026-06-14@test.dev", and write that into the test without realizing the value should be generated.

The workaround: tell the agent explicitly. "Use crypto.randomUUID() for the email. Don't hardcode any value you see on screen." Constrained prompts hold; vague prompts drift.
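
One way to make that constraint concrete is to give the agent a helper to call instead of a value to copy. A minimal sketch, assuming Node's built-in crypto module; the function names, the user prefix, and the test.dev domain are illustrative, not part of any library:

```typescript
import { randomUUID } from 'node:crypto';

// Builds an address that is unique on every call, so nothing seen on
// screen during authoring ends up hardcoded in the spec.
export function uniqueEmail(prefix = 'user', domain = 'test.dev'): string {
  return `${prefix}-${randomUUID()}@${domain}`;
}

// Same idea for any other per-run identifier (hypothetical naming scheme).
export function uniqueOrderRef(): string {
  return `order-${randomUUID().slice(0, 8)}`;
}
```

In a spec, the fill call would then take uniqueEmail() rather than a literal string, and the prompt can say "use the uniqueEmail helper" instead of describing the rule each time.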

Failure 3: Conditional branching

A test that has to behave differently when a feature flag is on vs off, or when the user has zero vs one items in cart, defeats Claude Code most of the time. The agent writes the test for the state it observed during authoring and assumes the same state on replay. The conditional branch goes untested.

The data backs this up. In real authoring on our platform, conditional steps are only 1.0% of all steps written, vs 39.7% click and 14.8% type (Anatomy of an AI-Authored Test). Authors avoid the pattern because the tooling, including AI, handles it badly.
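
A common way around this is to enumerate the branches up front so each one gets its own test, instead of relying on whichever state the agent happened to observe. A sketch of that enumeration; the flag and cart dimensions come from the example above, while the type and function names are made up for illustration:

```typescript
interface BranchScenario {
  title: string;
  flagEnabled: boolean;
  cartItems: number;
}

// Cross product of flag states and cart sizes: one scenario per branch.
function branchScenarios(flagStates: boolean[], cartSizes: number[]): BranchScenario[] {
  const scenarios: BranchScenario[] = [];
  for (const flagEnabled of flagStates) {
    for (const cartItems of cartSizes) {
      scenarios.push({
        title: `checkout with flag ${flagEnabled ? 'on' : 'off'} and ${cartItems} item(s) in cart`,
        flagEnabled,
        cartItems,
      });
    }
  }
  return scenarios;
}

// Flag on/off crossed with zero/one items gives four branches.
const scenarios = branchScenarios([true, false], [0, 1]);
```

In a spec, each entry would typically become its own test() call inside a for...of loop, with the flag and cart state seeded before the flow runs (how you seed it is app-specific), so every branch is exercised on replay.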

Failure 4: OAuth and redirect-heavy auth flows

The single most-reported breakage in our customer interviews and the State of AI QA 2026 report. Claude Code drives the OAuth start, hits the third-party login (Google, Microsoft, Auth0), and bounces off a captcha, a 2FA prompt, or a cross-domain cookie boundary it can't navigate.

Our MCP telemetry shows the scar: 64 distinct users targeting accounts.google.com, 49 targeting login.microsoftonline.com, plus Atlassian, Intuit, PingOne. Real demand, broken in practice. The clean workaround is to programmatically inject a session token (Playwright's storageState pattern) before the test starts, then write the rest of the flow against the authenticated state. Skip the login UI entirely. Anyone who has tried to write an end-to-end test against an SSO flow knows why.
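
The storageState workaround can be sketched as a small helper that turns a session token into the JSON shape Playwright loads. This is a sketch under assumptions: the cookie name session, the localhost domain, and the playwright/.auth/user.json path are illustrative, and how you mint the token (a seeded user, a test-only endpoint) depends on your app and is not shown.

```typescript
import { mkdirSync, writeFileSync } from 'node:fs';
import { dirname } from 'node:path';

// Shape of the file Playwright reads through the storageState option.
interface StorageStateCookie {
  name: string;
  value: string;
  domain: string;
  path: string;
  expires: number; // -1 marks a session cookie
  httpOnly: boolean;
  secure: boolean;
  sameSite: 'Strict' | 'Lax' | 'None';
}

interface StorageState {
  cookies: StorageStateCookie[];
  origins: { origin: string; localStorage: { name: string; value: string }[] }[];
}

// Wraps a pre-issued token in a single session cookie.
// 'session' is a hypothetical cookie name; use the one your app sets.
function buildStorageState(token: string, domain: string): StorageState {
  return {
    cookies: [
      {
        name: 'session',
        value: token,
        domain,
        path: '/',
        expires: -1,
        httpOnly: true,
        secure: false, // assumes plain http on localhost
        sameSite: 'Lax',
      },
    ],
    origins: [],
  };
}

// Writes the state to disk, creating the parent folder if needed.
function saveStorageState(state: StorageState, file: string): void {
  mkdirSync(dirname(file), { recursive: true });
  writeFileSync(file, JSON.stringify(state, null, 2));
}
```

A setup script would call saveStorageState(buildStorageState(token, 'localhost'), 'playwright/.auth/user.json') once, and the config would point use.storageState at the same path, so every spec starts already signed in.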

Key takeaways

  • Claude Code wins on the dev-loop pattern: write a test, debug it, refactor selectors after a UI change. That is where the 980-user adoption is concentrated.
  • It breaks on long flows, dynamic data, deep conditionals, and OAuth. All four are predictable. Plan around them.
  • 41% of agent users drop off after 5 events. Activation, not adoption, is the unsolved problem. Pick patterns that survive the first hour.
  • Median test on a real AI testing platform is 8 steps. Short tests are how AI authoring stays reliable.

What "agents drive by sight" means in practice

The most-cited finding from our MCP data: 643,424 screenshots vs 264,268 browser inits = 2.4 screenshots per session. Combined DOM-snapshot calls totaled ~100K, less than a sixth of screenshot volume. Across 1.42M tool calls, agents preferred pixels to the DOM tree, by a 6x margin.

The implication for how you write tests with Claude Code is real and load-bearing.

If the agent makes its decisions from screenshots, your assertions should hit what the screenshot shows. That means asserting on visible text, visible roles, visible counts. Not on aria-hidden markup, not on internal class names, not on DOM structure the user never sees.

A test like:

await expect(page.getByText('Order confirmed')).toBeVisible();
await expect(page.getByTestId('order-total')).toHaveText('$42.00');

ages well. A test like:

await expect(page.locator('.OrderConfirmation_root__a8b3c')).toBeAttached();

is one stylesheet rebuild away from breaking (the hashed suffix is generated by the build, not written by hand) and doesn't match how the agent sees the page in the first place. Claude Code will often "fix" the second pattern by guessing at the new class name. It works once and breaks the next time. Write the test the agent can read.

For visual outcomes that need pixel comparison rather than text matching, Playwright's toHaveScreenshot is the right primitive. Reserve it for outcomes a screenshot reveals and structured text doesn't, such as layout or rendering regressions. The full guidance lives in evaluate AI testing tools.
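
Pixel comparisons tend to be sensitive to small rendering differences between machines, so a tolerance is often set once in config rather than per assertion. A sketch of that config fragment; the 1% threshold is an arbitrary starting point, not a figure from our data:

```typescript
// playwright.config.ts (fragment)
import { defineConfig } from '@playwright/test';

export default defineConfig({
  expect: {
    toHaveScreenshot: {
      // Allow up to 1% of pixels to differ before the assertion fails.
      maxDiffPixelRatio: 0.01,
    },
  },
});
```

An assertion such as await expect(page).toHaveScreenshot('order-confirmed.png') would then pick up this tolerance; the file name is illustrative.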


When do you stop using Claude Code and graduate to a dedicated AI testing tool?

Three signals say it's time: the suite crosses about 50 tests, CI run-time exceeds your local dev loop, or flaky failures are now from drift instead of real bugs.

At ~50 tests. Below 50, Claude Code is faster than any vendor flow. Above 50, the tax compounds. The agent has to re-read the suite to fit each new test into the existing pattern. Setup files, fixtures, and helpers proliferate. The same prompt produces inconsistent code because the agent has too much repo state to hold. Mid-market teams in our State of AI QA 2026 study hit this around 40–60 tests.

When CI run-time exceeds the dev loop. If your push-to-green is 15 minutes and your local loop is 2 minutes, Claude Code authors faster than CI runs. Flake triage, parallelization, and selective retries become the bottleneck, none of which are Claude Code's strong suit. Dedicated tools that own the runner heal during the run, retry intelligently, and report which tests are flaky over time.

When drift is the dominant failure mode. The deepest signal. When 8 of 10 red builds are because the UI changed (not because the code is broken), maintenance has overtaken authoring. Claude Code helps with batched refactors as Pattern 3 above shows. It does not run continuously between commits, watching for drift, healing tests as the app evolves. Tools that do are the next layer.
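
The three signals are simple enough to check mechanically. A rough sketch, where the 50-test cutoff comes from this guide and the "more than half of red builds" reading of drift-dominant is an assumption; all names are hypothetical:

```typescript
interface SuiteStats {
  testCount: number;
  ciMinutes: number;          // push-to-green time
  localLoopMinutes: number;   // local edit-run cycle
  redBuilds: number;          // failed builds in the review window
  redBuildsFromDrift: number; // of those, failures caused by UI change
}

// Returns the names of the graduation signals that currently fire.
function graduationSignals(stats: SuiteStats): string[] {
  const signals: string[] = [];
  if (stats.testCount >= 50) signals.push('suite-size');
  if (stats.ciMinutes > stats.localLoopMinutes) signals.push('ci-slower-than-dev-loop');
  // Assumed threshold: drift is "dominant" when it causes over half of red builds.
  if (stats.redBuilds > 0 && stats.redBuildsFromDrift / stats.redBuilds > 0.5) {
    signals.push('drift-dominant');
  }
  return signals;
}
```

For the numbers used in this section (60 tests, 15-minute CI against a 2-minute local loop, 8 of 10 red builds from drift), all three signals fire.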

That is where our pitch sits: AI agents discover your flows, build the tests, run them on every merge, and heal them when your UI changes. Release confidence at engineering velocity. It is the next step once Claude Code starts to feel like work. Detail in Playwright vs QAby.AI and our definitive AI testing guide.


Production setup checklist

A short list, ordered the way a real team adopts it.

1. CI/CD integration. Run your Claude-Code-authored tests on every PR, not just locally. The GitHub Actions workflow from the Playwright CI docs (install dependencies, run npx playwright install --with-deps, then npx playwright test) works without modification; the older microsoft/playwright-github-action is deprecated in favor of the CLI. Make sure the playwright-mcp server is not running in CI: CI runs the tests Claude Code wrote, it does not re-invoke the agent.

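2. Secrets and storage state. Save an auth state once (or fetch it from a vault) instead of asking Claude Code to log in fresh each run. Keep the file outside test-results/, which Playwright clears at the start of every run; playwright/.auth/user.json is the location Playwright's auth docs use, and Playwright reads it via use: { storageState: 'playwright/.auth/user.json' }. The file holds live session cookies, so treat it as a secret and gitignore it. Anyone who has tried OAuth in CI knows why this matters.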

3. Browser binaries. Run npx playwright install --with-deps in CI to pull the browser builds that match your installed Playwright version. Cache the install path between runs to cut 2–3 minutes off cold builds. The browser is bigger than the test code, by an order of magnitude.

4. Retries and flake quarantine. Set retries: 2 in playwright.config.ts for CI runs only. Flag any test that retried but eventually passed; that is a flake-in-progress and Claude Code did not catch it during authoring. Review weekly.

5. Test reports. Use Playwright's HTML reporter on PRs and JUnit XML for CI dashboards. The HTML reporter includes screenshots and traces; Claude Code can read them when you ask it to debug the next failure.

6. Update cadence. Run npm update playwright @playwright/test playwright-mcp monthly. The playwright-mcp package is developed in the qabyai/playwright-mcp GitHub repository; breaking changes are documented in the changelog.

7. The 50-test sanity check. Once your suite passes 40 tests, schedule the graduation conversation. By 50 you want a plan. By 60 the tax is real.
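
Items 4 and 5 can be expressed directly in config. A sketch, assuming your CI provider sets the CI environment variable (GitHub Actions does) and treating results/junit.xml as a made-up output path:

```typescript
// playwright.config.ts (fragment)
import { defineConfig } from '@playwright/test';

export default defineConfig({
  // Retry only in CI so local failures surface immediately.
  retries: process.env.CI ? 2 : 0,
  reporter: [
    // Browsable report with screenshots and traces, never auto-opened.
    ['html', { open: 'never' }],
    // Machine-readable results for CI dashboards (hypothetical path).
    ['junit', { outputFile: 'results/junit.xml' }],
  ],
  use: {
    // Record a trace on the first retry so a flaky pass leaves evidence.
    trace: 'on-first-retry',
  },
});
```

Tests that fail and then pass on a retry show up as flaky in the HTML report, which gives the weekly review a list to work from.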


Frequently asked questions

What is the best way to write Playwright tests with Claude Code?

Describe the flow in plain English, name the URL, name the expected end state, and let Claude Code use the Playwright MCP server to drive the browser. The cleanest pattern is a one-paragraph prompt that names the route, the actions, and the assertion, then iterate based on the agent's first attempt. Avoid asking Claude Code to write very long tests in one shot; 8–15 steps is the sweet spot.

Does Claude Code support Playwright out of the box?

Yes, via the Model Context Protocol. Install the playwright-mcp package, register it with claude mcp add playwright-mcp playwright-mcp, and Claude Code can launch a browser, screenshot pages, locate elements, and write Playwright specs into your repo. The Claude Code MCP setup docs at docs.claude.com cover the full configuration surface.

Can Claude Code debug a failing Playwright test?

Yes, and this is one of its strongest patterns. Paste the failing test plus the error, ask Claude Code to open the page, inspect the actual element, and update the selector without changing the assertion. The "don't change the assertion" framing is critical; without it the agent will sometimes weaken the test to make it pass. Run npx playwright test --debug or add page.pause() to step through interactively.

What's the difference between Claude Code and Cursor for Playwright testing?

Different shapes of use. Claude Code averages 363 MCP tool calls per user, with 980 distinct users in our telemetry, the signature of in-session verification inside a coding loop. Cursor-vscode averages 79 events per user across 601 users, lighter editor-integrated automation. Pick Claude Code if your loop is "verify each change before commit"; pick Cursor if the MCP is one tool alongside your IDE's other AI features. Full comparison in Claude Code vs Cursor vs Opencode.

Why do my Claude Code Playwright tests work locally but fail in CI?

Three common causes: missing storageState for auth (the test logs in locally but CI has no session), unpinned browser versions (run npx playwright install --with-deps in CI against a pinned Playwright version), or timing assumptions that hold on a dev machine and not on a slower CI runner. Add explicit waits, use storage state for auth, and pin your playwright and browser versions. The Playwright CI docs cover the canonical setup.

How many Playwright tests can Claude Code maintain before it starts breaking down?

Around 50 tests is the soft ceiling we see across mid-market SaaS teams in our State of AI QA 2026 study. Below that, Claude Code is the fastest authoring tool available. Above that, the agent struggles to fit new tests into existing patterns, fixtures multiply, and maintenance drift starts outpacing authoring speed. Graduate to a dedicated AI testing platform that runs continuously and heals tests as the UI evolves.

Is Claude Code with Playwright MCP free?

The playwright-mcp server is open source and free (GitHub). Claude Code is a paid subscription product from Anthropic; pricing sits at claude.com. For agentic browser testing specifically, the MCP server is the part you wire up; the LLM inference cost is what you pay Anthropic.


About the author

Himanshu Saleria — Co-founder & CEO, QAby.AI. Background in QA-led product engineering at scale; running QAby.AI's customer research, telemetry analysis, and product. LinkedIn.

QAby.AI's open-source playwright-mcp server is the source of every first-party number in this guide: 1.42M agent tool calls, 6,687 distinct IDs, 5,904 distinct domains, 187 distinct MCP clients. The full methodology lives in The State of AI QA in Mid-Market SaaS 2026.


If your Playwright suite is past the 50-test mark, drift is your top failure mode, and you'd rather skip the SDET hire than ship slower, a 30-minute audit of your current QA gap is the next move. We will show you which patterns above match your suite, where the biggest leak is, and what changes if AI agents discover, build, run, and heal the tests on every merge.


