"Just Use ChatGPT" Creates More QA Work, Not Less

41 QA teams later, the "just use ChatGPT to write your tests" advice fails on review burden, accuracy ceiling, and activation. Here is what we found.

Himanshu Saleria

Published June 13, 2026 · 22 min read

AI Testing · ChatGPT · Test Automation · Contrarian POV · QA

Published 2026-06-13 · Last updated 2026-06-13 · 11-minute read

The advice is everywhere now. "Just paste your spec into ChatGPT and ask it to write the Playwright tests." LinkedIn posts, Reddit threads, conference Q&As. Engineering managers who have never written a test in their lives nod along. So do CTOs trying to defer the SDET hire one more quarter.

We talked to 41 mid-market SaaS QA and engineering leaders over nine months. A non-trivial chunk had already tried it. Some were still trying it. The pattern in the data is not what the LinkedIn posts promise.

TL;DR

"Just use ChatGPT to write the tests" sounds like a one-shot productivity win. In our 41-team dataset, the teams that tried it ended up with more review, re-prompt, and re-run cycles than the manual baseline.
The accuracy ceiling is real and measurable. A Sr. Director of SDET at a US AI-agent platform put it plainly: scripts are 75 to 80% accurate with a human in the loop, but autonomously they never get beyond 60%. The 40% gap is where the work hides.
Activation data backs it up. 41% of users of an open-source agentic-testing MCP server tried five events and never returned. The "ChatGPT writes my tests" workflow is rough enough that most people quit before week two.
The pain frame this whole post sits inside: devs ship faster than QA tests. ChatGPT helps with the first draft. It does not close the gap.

Bottom line. Telling a QA team to "just use ChatGPT" to generate Playwright or Selenium tests reliably increases workload rather than reducing it. The pattern across 41 interviews and one open-source MCP server with 6,687 distinct users is consistent: a 75 to 80% accuracy ceiling with humans in the loop, 60% autonomously, a 41% activation cliff, and a review-and-re-prompt cycle that consumes the time the generation step was supposed to save.

This post is contrarian by design. It is not anti-LLM. It is anti the specific workflow where a QA engineer is told to outsource their judgment to a chat window and ship the output. We will explain what we saw, anchor the claims to the State of AI QA in Mid-Market SaaS 2026 dataset, and end with what actually closes the gap.

Does ChatGPT actually save QA time when you use it to write tests?

In most of the teams we interviewed, no. The teams that tried "ChatGPT writes my tests" reported review-and-re-prompt cycles that consumed the time the generation step saved.

The clearest verbatim came from a QA leader at a Japanese language SaaS company. She runs a 200-case regression suite, 85 of those automated, on a biweekly release cycle. Six weeks into a leadership-mandated ChatGPT experiment, she answered our check-in with one sentence we wrote down word-for-word:

"My work increased. I review, re-prompt, re-run, review again." — A QA leader at a Japanese language SaaS, in the State of AI QA 2026 dataset

That is the contrarian thesis in nine words. The teams telling you to "just use ChatGPT" are usually engineering leaders who have not sat in the review chair. We heard the pattern enough times across the 41 calls that we stopped treating it as an outlier.

If your engineer spends 20 minutes prompting, 40 minutes reviewing, 15 minutes re-prompting, and 25 minutes fixing selectors that broke on the real app, the "AI accelerated" run is 100 minutes. A competent SDET writes the same Playwright test from scratch in 90 to 120 minutes. The accelerator is a wash on a good day and a tax on a bad one. If your bottleneck was junior-engineer authoring speed, you saved something. If your bottleneck was senior review capacity, you bought yourself more of it.
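The arithmetic in that example is easy to rerun with your own numbers. The sketch below is a minimal illustration, not a measurement tool: the `AssistedCycle` shape and both function names are hypothetical, and the stage durations are the ones quoted above.

```typescript
// Minutes spent on each stage of one AI-assisted test (figures from the example above).
interface AssistedCycle {
  prompting: number;
  reviewing: number;
  rePrompting: number;
  fixingSelectors: number;
}

// Total minutes for one AI-assisted test.
function assistedTotal(cycle: AssistedCycle): number {
  return cycle.prompting + cycle.reviewing + cycle.rePrompting + cycle.fixingSelectors;
}

// Minutes saved versus a manual baseline; a negative result means the assisted run cost more.
function minutesSaved(cycle: AssistedCycle, manualMinutes: number): number {
  return manualMinutes - assistedTotal(cycle);
}

const example: AssistedCycle = { prompting: 20, reviewing: 40, rePrompting: 15, fixingSelectors: 25 };

console.log(assistedTotal(example));     // 100
console.log(minutesSaved(example, 90));  // -10: a tax against a fast SDET
console.log(minutesSaved(example, 120)); // 20: a modest win against a slow one
```

Swapping in your own stage timings and manual baseline shows quickly whether the review stage, not the generation stage, is what decides the outcome.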

What's the 40% gap nobody is talking about?

The 40% gap is the share of an AI-generated test that humans still have to write, fix, or reason about: assertion logic, conditional branching, and business-rule knowledge the LLM does not have.

The cleanest articulation belongs to a Sr. Director of SDET at a US AI-agent platform whose team has been shipping AI-generated automation longer than most:

"AI-generated scripts are 75 to 80% accurate, but autonomously, we never get beyond 60%. That 40% is the gap." — A Sr. Director of SDET at a US AI-agent platform, in the State of AI QA 2026 dataset

Two numbers do real work. The 75 to 80% is human-in-the-loop accuracy: a senior engineer prompting, reviewing, and correcting lands the output at roughly 80% useful on first cut. The 60% is the autonomous number. Drop the human and the quality falls 15 to 20 points. That delta is what the SDET hire was paid to close.
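The two figures measure different things, and it helps to keep them apart. A small sketch of the arithmetic, using only the percentages quoted above (the variable names are ours, for illustration):

```typescript
// Accuracy figures quoted above, in percent.
const supervised = { low: 75, high: 80 };
const autonomous = 60;

// Share of the work still left to humans when the agent runs unsupervised.
const remainingGap = 100 - autonomous;

// Points of accuracy lost when the human reviewer is removed.
const supervisionDelta = {
  low: supervised.low - autonomous,
  high: supervised.high - autonomous,
};

console.log(remainingGap);     // 40
console.log(supervisionDelta); // { low: 15, high: 20 }
```

The 40 is what remains undone at the autonomous ceiling; the 15 to 20 is what a senior reviewer adds on top of it.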

A QA practitioner at a publicly-traded enterprise observability SaaS said the same thing in different words. His team built their own MCP-based test generator internally. It worked for happy-path. It broke on conditional logic, multi-tab interactions, and flows requiring real session state. The 15% of his suite still manual is exactly the 15% where the business logic lives.

This pattern repeats in our own product telemetry. We dug into it in the anatomy of an AI-authored test: of 9,103 step events from real users, 8.4% are AI-driven assertions (assert-ai) and 1% are open-ended AI actions (ai-magic). Click and type still account for 54.5%. The AI is not authoring the whole test. It is filling the roughly 9% (8.4% plus 1%) that is hardest to maintain.

Why does the review burden eat the productivity gain?

Because LLM-generated tests fail in ways that are harder to catch than human-written ones. They look correct. They run. They assert against the wrong thing. Reviewing them takes longer than reviewing a junior engineer's code.

A QA engineer reviewing a teammate's Playwright pull request knows roughly what the teammate was trying to do. There is a Jira ticket, a Slack thread, a hallway chat. Context is shared. The reviewer scans for selector quality, race conditions, and whether the assertion matches the acceptance criteria. The review is fast because the intent is known.

A QA engineer reviewing a ChatGPT-generated test has none of that context. The LLM read the spec, made assumptions about what mattered, generated a plausible-looking selector against a button name it inferred, and asserted against a state it guessed. The reviewer has to reverse-engineer the model's assumptions and verify each one. Anyone who has reviewed a junior contractor's code without knowing them has felt this. It is slower than writing it yourself.

One QA Lead at a US AI-notes startup (we will call her Sarah) described her workflow as "babysitting." She generates a test, reads every line, runs it locally, finds three things the model got wrong, re-prompts to fix two of them, and fixes the third by hand. Forty minutes per test. Her manual baseline was 45 minutes per test. She saved five minutes and added 20 minutes of context-switching between her IDE and the chat window.

The failure modes are predictable enough that we could write them on a card:

Failure mode	What the LLM does	What the human catches
Asserts presence, not behavior	`expect(button).toBeVisible()`	The test never clicks the button or verifies what it does
Wrong selector strategy	Class-based selector on a styled component	Selector breaks on the next CSS refactor
Skips the side effect	Tests the form submitted	Misses that the email was never sent
Happy path only	Generates the success case	Misses 4 of the 6 error states
Hallucinated button	References a `Submit Order` button	The actual button text is `Place Order`

Each one is recoverable. Each one is a review-cycle tax. Stack the five failure modes across 50 tests in a regression suite and the cumulative tax dwarfs the upfront generation speed.
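Part of that card can be turned into a crude pre-review check. The sketch below is a hypothetical helper, not a Playwright feature: it scans the text of a generated test and flags the first failure mode (asserts presence, not behavior). The matcher and action names in the lists are real Playwright calls, but the lists are deliberately short, the `/checkout` and `/confirmation` paths are made up, and a text scan like this will miss plenty that a human reviewer catches.

```typescript
// Hypothetical review helper, not part of Playwright: a rough text scan that
// flags generated tests which check presence but never exercise behaviour.
const PRESENCE_MATCHERS = ["toBeVisible", "toBeAttached", "toBeEnabled"];
const ACTION_CALLS = [".click(", ".fill(", ".press(", ".check(", ".selectOption("];

interface ReviewFlags {
  assertionCount: number;
  presenceOnly: boolean;  // has assertions, and all of them are presence checks
  noInteraction: boolean; // never clicks, types, or selects anything
}

function reviewGeneratedTest(source: string): ReviewFlags {
  // Collect every matcher name of the form .toSomething( in the source.
  const matcherPattern = /\.(to[A-Z]\w*)\(/g;
  const matchers: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = matcherPattern.exec(source)) !== null) {
    matchers.push(match[1]);
  }

  const presenceOnly =
    matchers.length > 0 &&
    matchers.every((name) => PRESENCE_MATCHERS.indexOf(name) !== -1);
  const noInteraction = ACTION_CALLS.every((call) => source.indexOf(call) === -1);

  return { assertionCount: matchers.length, presenceOnly, noInteraction };
}

// A generated draft in the style of the first table row: visible, never clicked.
const draft = `
  await page.goto("/checkout");
  await expect(page.getByRole("button", { name: "Place Order" })).toBeVisible();
`;

// A reviewed version that clicks the button and asserts on the outcome.
const reviewed = `
  await page.goto("/checkout");
  await page.getByRole("button", { name: "Place Order" }).click();
  await expect(page).toHaveURL(/\\/confirmation/);
`;

console.log(reviewGeneratedTest(draft));    // { assertionCount: 1, presenceOnly: true, noInteraction: true }
console.log(reviewGeneratedTest(reviewed)); // { assertionCount: 1, presenceOnly: false, noInteraction: false }
```

A check like this only triages. It tells the reviewer which drafts to open first; it does not replace reading the test against the acceptance criteria.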

This is the same review-burden pattern that shows up when teams adopt mediocre self-healing tooling. We unpacked the related anti-pattern (auto-skipped failures making the suite lie) in the Green-Pipeline Lie. The shape is the same: cheap authoring, expensive verification, false confidence.

Key takeaways

The "ChatGPT writes my tests" workflow trades upfront authoring time for downstream review and re-prompt time. The trade is usually a wash or a tax.

The 40% gap is the share of an AI-generated test that humans still have to write or fix. Autonomous accuracy plateaus at 60%; human-supervised at 75 to 80%.

LLM-generated tests fail in subtle ways that are harder to review than human-written ones. The review burden is the hidden cost.

The 41% activation cliff on agentic tooling tells us most teams quit before the workflow pays off.

What does the activation data say about teams who try this?

It says most teams quit before the workflow pays off. The activation curve on agentic testing tools is brutally power-law, and the "just use ChatGPT" workflow sits at the rough end of it.

This is data from our own open-source playwright-mcp server, the package that lets coding agents like Claude Code and Cursor drive a real browser. 230,105 developers pulled it from npm in the 12 months ending June 2026. 1.42 million agent tool calls ran through it. The numbers look healthy at the top of the funnel.

The activation curve does not. 41% of users (2,752 of 6,687) tried 5 events and never returned. The median user ran 8 tool calls across 3 sessions and quit. The top 1% drove 73% of the traffic. We named it in the State of AI QA 2026 report and it changed how we think about the category.
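For readers who want to compute the same cliff on their own telemetry, here is a minimal sketch. The `UserActivity` shape and the function names are hypothetical, and reading "tried 5 events" as "sent at most five events" is our assumption for the example; only the 2,752 and 6,687 figures come from the data above.

```typescript
// One row per user: how many events they sent and whether they ever came back.
interface UserActivity {
  events: number;
  returned: boolean;
}

// Share of users who sent at most `maxEvents` events and never returned.
function activationCliff(users: UserActivity[], maxEvents: number): number {
  if (users.length === 0) return 0;
  const bounced = users.filter((u) => u.events <= maxEvents && !u.returned).length;
  return bounced / users.length;
}

// Percentage rounded to one decimal place.
function toPercent(share: number): number {
  return Math.round(share * 1000) / 10;
}

// The headline figure from the telemetry above.
console.log(toPercent(2752 / 6687)); // 41.2

// A made-up four-user cohort, purely to show the function in use.
const sample: UserActivity[] = [
  { events: 3, returned: false },
  { events: 5, returned: false },
  { events: 40, returned: true },
  { events: 8, returned: true },
];
console.log(toPercent(activationCliff(sample, 5))); // 50
```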

Translate that to the "ChatGPT writes your tests" advice. The engineer who tries it hits the review burden by test number three, decides this is not the productivity win the LinkedIn post promised, and goes back to writing Playwright by hand. That engineer is in the 41%. They tell their team it does not work. The team writes the workflow off as overhyped, when what actually happened was that the chosen tool was too rough to survive contact with real work.

The number that matters is not "how many developers downloaded the agentic testing tool." It is "how many were still using it on day 30." On that metric, the gap between "agentic testing is real" and "agentic testing is productive" is wider than any vendor wants to admit. We are a vendor. We are admitting it.

The same activation cliff is why "just use ChatGPT" advice spreads faster than it works. The advice is cheap to give and feels right in the abstract. The teams who try it and bounce off rarely write the public post-mortem. The teams who succeed have usually invested in supporting infrastructure (custom prompts, a review rubric, a senior reviewer) that the advice never mentioned. The visible signal is selection bias.

So if not ChatGPT, what does actually close the gap?

What closes the gap is owning the full regression lifecycle, not just authoring. Discovery, build, run, and heal. The "just use ChatGPT" workflow only attempts step two of those four, and it does step two badly.

The pattern we kept hearing across the dataset: teams want a thing that knows what to test, builds the test, runs the test on every code change, and fixes the test when the UI moves. ChatGPT, used as a chat window, does none of those four well. It can help draft the second step if a senior engineer is in the loop. The other three are out of scope for the workflow.

This is the contrarian frame that drove how we built QAby.AI. We did not start from "let's wrap an LLM and call it a test generator." We started from the four-step lifecycle and asked which steps an agent can genuinely own end-to-end.

Lifecycle step	What "just use ChatGPT" does	What we built instead
Discover what flows to test	Nothing	Agents crawl the app and surface flows worth covering
Build the tests	Generates a first draft	Agents author tests with the same step types real users build
Run the tests	Nothing	Tests run on every merge, not three sprints later
Heal when UI changes	Nothing	Agents detect selector drift and self-update the locators

The honest framing: we are not pitching "AI writes your tests faster." We are pitching that the unit of work moves from "writing one test in a chat window" to "owning regression at the rate your team ships code." Devs ship faster than QA tests. We close the gap. Not by being a better chat window. By being a different thing.

The buyer question that matters is not "can ChatGPT generate a Playwright test." It can. The buyer questions that matter are:

Who reviews the generated test, and what is their time worth?
What happens when the UI changes and the test breaks?
How does the test get re-run when the next sprint ships?
Who is on the hook when production breaks and the regression test was never updated?

The "ChatGPT generates the test" answer to all four is "the QA engineer, the same QA engineer, the same QA engineer, the same QA engineer." That is not a productivity gain. That is the same workflow with a fancier first draft.

We unpack the broader buyer-side checklist in How to evaluate AI testing tools. The framework underneath is in the What-to-Test Gap and the Debugging Ladder.

Is there any version of "ChatGPT for testing" that does work?

Yes, but it is narrower than the LinkedIn version. Use the LLM for the boilerplate first draft of one test at a time, with a senior reviewer in the loop, on a codebase where the model has clear context. Do not use it as your regression strategy.

A few patterns from our dataset actually paid off:

Generating the boilerplate for a single new test against a known fixture. This works because the model has a tight scope, a clear input, and a human reviewer who knows the acceptance criteria. The Sr. Director of SDET at the US AI-agent platform uses this pattern. It gets him a draft that is roughly 75 to 80% accurate, the figure he quoted. The 20 to 25% he fixes by hand is still cheaper than writing from scratch, in his specific workflow.

Drafting test descriptions and step names. LLMs are good at the prose-y parts of a test. "What is this test doing in plain English" is a one-shot prompt with no review burden because the only consumer is a human reading the test report.

Translating test cases between frameworks. Selenium to Playwright. Cypress to Playwright. Codeception to Playwright. The structural mapping is mechanical, the assertions stay roughly the same, and the LLM does a credible job on the syntax conversion.

What does not work in our dataset: telling a junior engineer with no test-design experience to "just use ChatGPT to write the suite." We saw three variants of this and all three ended with the QA leader rewriting the suite from scratch six weeks later.

The principle is consistent. LLMs amplify the judgment of the person using them. They do not substitute for judgment. If your team's bottleneck is "we have senior QA judgment but no junior authoring speed," AI authoring helps. If your team's bottleneck is "we do not have anyone who knows what to test, let alone how to assert it," the chat window is not where you find that capability.

Frequently asked questions

Can ChatGPT actually write a working Playwright test?

Yes, ChatGPT can generate a syntactically valid Playwright test for a happy-path flow about 75 to 80% of the time when a senior engineer prompts it well. The remaining 20 to 25% needs human correction: selector strategy, race conditions, assertion logic, and side effects. The output is a first draft, not a finished test. Autonomous accuracy drops to roughly 60% with no human in the loop.

What is the "40% gap" in AI-generated test automation?

The 40% gap is what remains when fully autonomous AI accuracy tops out at around 60% in test generation. Human supervision lifts accuracy to 75 to 80%, a 15 to 20 point improvement, but the 40% is the share of a test that requires human judgment: business-rule assertions, conditional logic, multi-tab interactions, and session-state handling. The phrase comes from a Sr. Director of SDET at a US AI-agent platform in our State of AI QA 2026 dataset.

Why do QA teams say ChatGPT increases their work?

Because review and re-prompt cycles cost more than the generation step saves. A QA leader at a Japanese language SaaS, quoted verbatim in our dataset, summarized it as: "My work increased. I review, re-prompt, re-run, review again." The LLM produces plausible output fast, but verifying it correctly takes longer than reviewing a human teammate's code because the reviewer has no shared context for the model's assumptions.

Is "AI-generated tests" the same as "agentic testing"?

No. AI-generated tests means an LLM writes a one-shot test for you to run later, the same way it writes any code. Agentic testing means an autonomous agent discovers flows, authors tests, runs them on every merge, and heals them when the UI changes. The first is a one-shot authoring help. The second is a regression strategy. Most "just use ChatGPT" advice conflates them.

How does QAby.AI differ from "asking ChatGPT to write Playwright tests"?

QAby.AI ships agents that own the regression lifecycle: discover flows, build tests, run them on every merge, and heal them when selectors drift. ChatGPT, used as a chat window, helps draft step two only, and produces a first draft a senior engineer still has to review. The pain frame: devs ship faster than QA tests. We close the gap. ChatGPT helps with one step; closing the gap takes the other three lifecycle steps.

Should I tell my engineers to stop using ChatGPT for tests?

No. The narrower workflow (one test at a time, senior reviewer in the loop, on a codebase the model has clear context for) genuinely saves time on a specific kind of work. The broader advice (replace your test strategy with a chat window) does not. The line is: use it for boilerplate authoring, not for regression strategy. The 41% activation cliff on agentic-testing tools shows what happens when teams confuse the two.

What activation problem does AI testing have in 2026?

The activation problem is that 41% of users on an open-source agentic-testing MCP server tried 5 events and never returned, based on telemetry from our playwright-mcp package in the State of AI QA 2026 dataset. The median user runs 8 calls across 3 sessions and quits. The agentic-testing category has real adoption at the top of the funnel and brutal drop-off at activation. Tools rough enough to produce review burden on day one lose the user.

About the author

Himanshu Saleria is the founder of QAby.AI. He runs customer research, telemetry analysis, and product for the company. Before QAby.AI he was in QA-led product engineering at scale, watching the same review-burden pattern from the other side of the desk.

So what do you do with this?

Frame	Detail
Pain	Devs ship faster than QA tests. We close the gap.
Outcome	Release confidence at engineering velocity.
Mechanism	AI agents discover your flows, build the tests, run them on every merge, and heal them when your UI changes.
Hooks	Skip the SDET hire · Run regression on every merge · Beyond generated scripts

If "just use ChatGPT" has been your team's QA strategy for the last six months and you are tired of the review cycle, the next move is a 30-minute audit of where the review burden is hiding in your current workflow. We will show you which of the 41-team patterns match your team, where the lifecycle is broken, and what changes when agents own discovery, build, run, and heal.


Dig in further:

The State of AI QA in Mid-Market SaaS 2026: the 41-call dataset and the activation cliff
How to evaluate AI testing tools: buyer-side checklist
Anatomy of an AI-authored test: step-type telemetry from 9,103 real test steps
The Debugging Ladder: screenshots, video, trace, in that order
The What-to-Test Gap: why test design beats test execution

External cross-validation:

playwright-mcp on npm-stat: the download curve and activation data we cite above
GitHub Copilot research on AI code review burden: for the broader review-cost pattern across LLM-generated code