The What-to-Test Gap

Coined framework: the QA bottleneck is not writing tests, it is knowing what to test. Diagnostic, math, and a fix anchored in 41 real conversations.

Himanshu Saleria

Published June 12, 2026 · 17 min read

Framework · AI Testing · QA Strategy · Test Design

Published 2026-06-12 · Last updated 2026-06-12 · 13-minute read

Every QA tool on the market sells you faster test creation. Record. Generate. Convert simple steps to code. Get your first script in minutes.

It's the wrong promise. The bottleneck isn't writing tests. It's knowing which tests to write. We've heard it on enough calls now that we named the pattern: The What-to-Test Gap.


TL;DR

The What-to-Test Gap is the structural problem where QA teams know how to write tests but can't articulate what to test. Faster authoring tools don't fix it. They make the gap more expensive.
4 of 26 structured QA calls in our dataset named test design (not execution) as the bottleneck. It's the second-largest QA pain after locator maintenance.
Locator maintenance is #1 at 9 of 26 (35%). The What-to-Test Gap is right behind. Together they are the first pain named in 13 of 26 calls (50%).
Tools that speed up writing widen the gap. Agents that discover your flows close it, by inventorying what your app actually does before anyone writes a single test.
If you can't tell us, in three sentences, what your suite covers and what it doesn't, you are in the gap. The fix is a map, not another framework.

Bottom line. The What-to-Test Gap is the structural problem where QA teams know HOW to write tests but cannot articulate WHAT to test. The bottleneck isn't running tests, it's deciding which tests to write. Faster authoring tools widen the gap by making the wrong tests faster. Agents that discover your flows before anyone writes a test close it by giving the team a current inventory of the app's surface area.

The What-to-Test Gap is the structural problem where a QA team knows how to write a test but cannot articulate what to test. The work moves the moment someone says "we need a regression on the new release." It stops the moment someone has to list which flows that regression should cover.

A QA Lead at an AP/payments SaaS put it plainly:

"Writing and figuring out what to test is where the problem is."

A senior QA leader at a high-trust enterprise SaaS, with two decades of Staff and Principal experience at enterprise infrastructure SaaS, put it more sharply:

"Writing test cases was never my problem. Knowing which test cases to write was."

A fast-growing Indian SaaS scaling 30 parallel sprints over a 5-week release window described the same shape: the team kept up with writing, but every release they discovered after the fact that whole flows had drifted untested.

Three teams, three sectors, same pain. They weren't asking for a better script generator. They were asking for someone to tell them where to look.

That is the gap. No tool's marketing names it because no roadmap targets it. Buyers shop on test creation speed. Vendors ship authoring. The map stays missing.

The math on the bottleneck

We mined 26 structured QA conversations across mid-market SaaS for unprompted pain (what the QA Lead names first, without us pointing them at it).

Rank	Unprompted pain	Mentions	Share
1	Locator / selector maintenance	9 of 26	35%
2	The What-to-Test Gap	4 of 26	15%
3	Flake / triage	3 of 26	12%
4	Coverage reporting honesty	2 of 26	8%
5	Tool maturity / vendor trust	2 of 26	8%
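The shares in the table are simple arithmetic, and they are easy to re-derive. A minimal sketch, using only the counts from the table and the 26-call total; the variable names are ours, and the "outside the five rows" figure assumes each call is counted once under its first-named pain.

```python
# Counts copied from the table above; 26 is the number of structured calls.
TOTAL_CALLS = 26
mentions = {
    "Locator / selector maintenance": 9,
    "The What-to-Test Gap": 4,
    "Flake / triage": 3,
    "Coverage reporting honesty": 2,
    "Tool maturity / vendor trust": 2,
}


def share(count, total=TOTAL_CALLS):
    """Return the share of calls as a whole-number percentage."""
    return round(100 * count / total)


shares = {pain: share(n) for pain, n in mentions.items()}
top_two = mentions["Locator / selector maintenance"] + mentions["The What-to-Test Gap"]
# Assumption: each call appears in at most one row of the table.
outside_top_five = TOTAL_CALLS - sum(mentions.values())

for pain, pct in shares.items():
    print(f"{pain}: {mentions[pain]} of {TOTAL_CALLS} ({pct}%)")
print(f"Top two combined: {top_two} of {TOTAL_CALLS} ({share(top_two)}%)")
print(f"Calls outside the five rows: {outside_top_five}")
```

The top two rows add up to 13 of 26 calls, which is half the dataset.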

Locator maintenance is #1 (full cost math in The State of AI QA in Mid-Market SaaS 2026). The What-to-Test Gap is right behind it, and is the more uncomfortable finding. Selector pain is mechanical, fixable with a better authoring model. Test-design pain is judgment work. No new framework helps a Lead who can't see the surface area they're defending.

4 of 26 named it unprompted. Many more nodded when it came up. The Leads who didn't name it were running suites under 25 percent automated. The gap is invisible until the suite is mature enough to expose how little of the app it covers.

Key takeaways

The What-to-Test Gap is the second-largest QA pain in our 26-call dataset (4 of 26, 15%), right behind locator maintenance.

Faster authoring tools widen the gap by making the wrong tests faster. They never name the gap because no roadmap targets it.

Tier 3 of the QA stack (mapping the app's surface area) is mostly empty. Risk matrices rank what you can already see.

The fix is a current inventory of flows. The first verb of agentic testing (discover) is the one most categories skip.

Why faster test creation doesn't fix it

Every authoring tool sits on the same assumption: the bottleneck is keystrokes. Make the keystrokes faster, the suite grows faster, the team wins. Three failure modes break that.

Faster authoring makes the wrong tests faster. A team that doesn't know what to test, writing tests 4x faster, gets a suite that misses the right things 4x faster. Coverage goes up on paper. Real risk doesn't move. One QA Lead running their own ChatGPT-powered test generator put it bluntly: "my work increased. I review, re-prompt, re-run, review again."

The gap stays invisible to the buyer. A Lead shopping on "test creation speed" never asks the vendor "how do I know what to test?" because that is a question they have privately stopped asking themselves. The shopping criterion never names the real problem. The category never has to solve it.

The suite ages into a coverage lie. The senior leader we spoke to quantified it: "real coverage is 40 percent. The tool reports 80." Reported coverage is what was instrumented. Real coverage is what was exercised. The team is surprised by every production bug. Pipeline green. Bug shipped.

The State of AI QA 2026 report calls this the second-most-felt and least-named QA pain in the mid-market dataset. "Least-named" is the trap. Buyers can't ask for what they have not named.
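The coverage lie is easier to see with numbers. A minimal sketch, assuming a hypothetical ten-flow inventory where each flow records whether a test is attached to it (what a tool might report) and whether a run actually exercised it. The flow names and flags are invented for illustration; they are chosen only to mirror the 80-versus-40 split in the quote, not taken from any real suite.

```python
# Hypothetical inventory: flow -> (has_test, exercised_in_last_run).
# has_test: a test is tagged to the flow, so a tool counts it as covered.
# exercised: a run actually drove the flow end to end.
flows = {
    "login": (True, True),
    "signup": (True, True),
    "checkout": (True, True),
    "refund": (True, True),
    "invoice export": (True, False),
    "bulk upload": (True, False),
    "role change": (True, False),
    "password reset": (True, False),
    "admin audit log": (False, False),
    "webhook retry": (False, False),
}


def coverage(inventory):
    """Return (reported_pct, real_pct) measured against the full inventory."""
    total = len(inventory)
    reported = sum(1 for has_test, _ in inventory.values() if has_test)
    real = sum(1 for has_test, ran in inventory.values() if has_test and ran)
    return 100 * reported / total, 100 * real / total


reported_pct, real_pct = coverage(flows)
# Flows that look covered on paper but were never actually driven.
paper_only = sorted(f for f, (has_test, ran) in flows.items() if has_test and not ran)

print(f"Reported: {reported_pct:.0f}%  Real: {real_pct:.0f}%")
print("Covered on paper only:", paper_only)
```

The gap between the two numbers is the list of flows in `paper_only`. Neither number means anything without the inventory as the denominator.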

The hierarchy of the QA stack: where the gap lives

Most tools fight in one tier. The gap is in another.

Tier	What the tier does	Tools that play here
Tier 1: Authoring	Speed up writing a test	Playwright, Cypress, record-and-play, AI codegen
Tier 2: Organizing	Tag, group, label what you wrote	TestRail, Zephyr, test-management platforms
Tier 3: The What-to-Test Gap	Map what the app does so you can decide what to test	Almost no one (where the gap lives)

Tier 1 is crowded. Tier 2 is mature. Tier 3 is the floor of the stack, and the floor is mostly empty. Risk-based methodologies exist: ISO/IEC/IEEE 29119 specifies risk considerations as integral to test planning, and the classic risk-matrix approach is how mature orgs prioritize. But a methodology is not a map. A risk matrix ranks what you already see. It doesn't surface the flows you didn't know existed.

The gap is structural, not methodological. The team is missing a current inventory of what the app does (screens, paths, branches, side-effects the running build supports). Without that inventory, every prioritization framework prioritizes the wrong list.

How to know if you're in the gap

A five-question diagnostic.

1. Can you, in three sentences, describe what your test suite covers and what it does not? If the answer is "happy paths" or "I'd have to check," you are in the gap. Teams out of it name the boundary precisely ("all of checkout, partial on settings, zero on the admin console") because someone built the map.

2. When a customer files a bug, is your first reaction "we should have caught that" more than half the time? If yes, your suite is shaped by post-mortems, not pre-release planning. The customer is your integration test.

3. When a developer refactors a shared component, can you tell them which tests will fail? Most teams can't. One AP/payments SaaS described a screen running 15,000 to 20,000 lines of code where a refactor in one place broke three others and the regression suite caught one. That is a "what was supposed to be tested here?" problem, not a flake problem.

4. If your senior QA quit tomorrow, would your suite still represent your app's surface area? If the map lives in one person's head (the Single-Throat Bottleneck) the answer is no. Your suite is one resignation letter from coverage fiction.

5. Can you tell, before a release, which automated tests are stale versus accurate? A stale test that still passes is the worst kind. It says "this flow works" when no one has confirmed the flow even exists.

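Score one point for each yes on questions 1, 3, 4, and 5, and one point for a no on question 2 (the only question where yes is the warning sign). Fewer than three points, you are in the gap. The fix is not another framework. The fix is a current map. Run My Audit → and we'll build one for your app.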
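The diagnostic can be scored mechanically. A minimal sketch, assuming a yes is the healthy answer on questions 1, 3, 4, and 5 and a no is the healthy answer on question 2, since question 2 is the one phrased so that yes is the bad sign. The function names and the sample answers are hypothetical.

```python
# Healthy answer per question number. True means "yes".
# Assumption: question 2 is inverted, because a yes there is the warning sign.
HEALTHY_ANSWER = {1: True, 2: False, 3: True, 4: True, 5: True}


def healthy_count(answers):
    """Count how many of the five answers match the healthy answer."""
    return sum(1 for q, healthy in HEALTHY_ANSWER.items() if answers[q] == healthy)


def in_the_gap(answers, threshold=3):
    """Fewer than `threshold` healthy answers puts the team in the gap."""
    return healthy_count(answers) < threshold


# Hypothetical team: cannot describe coverage, learns from customer bugs,
# cannot predict failures, map lives in one head, but can spot stale tests.
team = {1: False, 2: True, 3: False, 4: False, 5: True}

print("Healthy answers:", healthy_count(team))
print("In the gap:", in_the_gap(team))
```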

The fix: discover before you build

The discover/build/run/heal stack is the verb sequence we built QAby.AI around, and the first verb is the one most categories skip. Discover means an agent inventories your app's flows before anyone writes the first test. It walks the surface area, names the screens, identifies conditional branches, and surfaces what real users touch in production.

The output is a map. Not a suite. The suite comes next.
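To make "a map, not a suite" concrete, here is a toy version of the walk. This is a sketch under loud assumptions: the app is a hand-written dictionary of screens and actions rather than a running build, and the traversal is a plain breadth-first search. It is not how QAby.AI's agent works internally; it only shows the shape of the output.

```python
from collections import deque

# Hypothetical app graph: screen -> {action label: screen it leads to}.
APP = {
    "home": {"click login": "login", "click pricing": "pricing"},
    "login": {"submit valid": "dashboard", "submit invalid": "login"},
    "pricing": {"click signup": "signup"},
    "signup": {"submit": "dashboard"},
    "dashboard": {"open settings": "settings", "open admin": "admin"},
    "settings": {},
    "admin": {},
}


def discover(app, start):
    """Breadth-first walk: return the shortest action path to every screen."""
    paths = {start: []}
    queue = deque([start])
    while queue:
        screen = queue.popleft()
        for action, next_screen in app.get(screen, {}).items():
            if next_screen not in paths:  # first visit is the shortest path
                paths[next_screen] = paths[screen] + [action]
                queue.append(next_screen)
    return paths


flow_map = discover(APP, "home")
for screen, path in flow_map.items():
    print(screen, "<-", " > ".join(path) or "(start)")
```

The result names every reachable screen and one way to get there. No test has been written yet, which is the point.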

Once the map exists, build/run/heal does what every authoring tool has always done: generate tests, run them on every merge, repair them when the UI changes. The difference is the tests are written against an inventory, not a hunch. You stop asking "did we cover this?" and start asking "the map says we cover 70 percent of admin flows. Is that the 70 percent that matters?"

That second question has an answer. The first does not.

The 2026 report has the step-type breakdown that shows what real teams build on the platform: 1 in 8 steps is now an AI assertion rather than a click. The same model that asserts the cart contains three items can walk the app and tell you the cart even exists.

Devs ship faster than QA tests. We close the gap. Skip the SDET hire. Run regression on every merge. Beyond generated scripts. The outcome is release confidence at engineering velocity. The mechanism is discover/build/run/heal, and discover is the verb nobody else runs.

Run My Audit. We walk your app the way we walk every prospect's app: inventory the flows, flag the gaps, show you what your current suite is missing. Half an hour, no slide deck. Start here →

What this means for buyers

If you are shopping AI testing tools, here is the buyer-side checklist this framework forces:

How do you decide what to test? "The user writes the prompt" means Tier 1 (faster authoring). The gap follows you in.
Do you discover flows, or only execute them? Missing from the demo means the gap is unaddressed.
When my app changes, how do I know my map is still accurate? A credible answer means ongoing discovery. Most cannot answer.
When a test fails, do you repair it or skip it? Self-healing that silently deletes failing tests (the Green-Pipeline Lie we unpacked in the 2026 report) is the gap's worst-case outcome. A green pipeline against a stale map is theater.
Can I see per-flow coverage, not per-test count? Per-test count is vanity. Per-flow coverage tells you whether the map and the suite agree.

If the demo can't answer those, you will buy faster authoring and discover, six months later, that the gap moved with you. Full buyer-side treatment in How to evaluate AI testing tools. Cost math against the SDET hire in The SDET You Don't Have to Hire Next Quarter and Playwright vs QAby.AI.
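Two of the checklist questions (is the map still accurate, and per-flow coverage versus per-test count) reduce to set arithmetic once a map exists. A minimal sketch, assuming two hypothetical snapshots of a flow inventory and a suite where each test declares the flow it covers. All names and data are invented for illustration.

```python
# Hypothetical flow inventories from two releases.
old_map = {"checkout/pay", "checkout/coupon", "settings/profile", "admin/users"}
new_map = {"checkout/pay", "settings/profile", "admin/users", "admin/roles"}

# Hypothetical suite: test name -> flow it claims to cover.
suite = {
    "test_pay": "checkout/pay",
    "test_coupon": "checkout/coupon",
    "test_profile": "settings/profile",
}

added = new_map - old_map      # flows the app gained since the last map
removed = old_map - new_map    # flows the app no longer has

# A stale test targets a flow that is not in the current map.
stale_tests = sorted(t for t, flow in suite.items() if flow not in new_map)

covered = set(suite.values()) & new_map
uncovered = sorted(new_map - covered)
per_flow_pct = 100 * len(covered) / len(new_map)

print(f"Tests in suite: {len(suite)}")
print(f"Per-flow coverage: {len(covered)} of {len(new_map)} ({per_flow_pct:.0f}%)")
print("Stale tests:", stale_tests)
print("Uncovered flows:", uncovered)
```

The test count says three. The per-flow view says two of four flows, one stale test, and two flows nobody is defending.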

What we're not saying

Three honest disclaimers.

Authoring tools aren't useless. They speed up work after you know what the work is. Playwright and Cypress are competent frameworks. The authoring layer is real. It is just not where the gap lives.

Not every team is in the gap. Mature 50-plus-QA orgs running 5,000-case suites at 85 percent coverage (like the publicly traded enterprise observability SaaS team in the 2026 report) have largely solved the map problem with headcount and time. The gap is a mid-market problem, sharpest at teams of 1 to 10 QA engineers in a 30-to-200-engineer org.

Our tool isn't the only way to close it. A QA Lead with three weeks of runway can walk an app and build the same map by hand. We just don't think most Leads have three weeks every quarter, and the manual version is exactly the work that should not be a human's full-time job in 2026.

If your suite is accurate, your map is current, and your coverage report matches reality, this post is not for you. If you felt the pattern, the diagnostic above is the next step.


About this post

Author: Himanshu Saleria, Co-founder & CEO, QAby.AI. Background in QA-led product engineering at scale; running QAby.AI's customer research, telemetry analysis, and product.


Frequently asked questions

What is The What-to-Test Gap?

The What-to-Test Gap is the structural problem where a QA team knows how to write tests but cannot articulate what to test. In our 26-call dataset, 4 teams named test design as the bottleneck unprompted, making it the second-largest QA pain after the locator tax. Faster authoring tools widen it. Agents that discover flows close it.

How do you decide what to test in QA?

Most mid-market teams don't decide, formally. The pattern across 41 conversations: real test cases come from production incident tickets, customer complaints, and post-mortems, not pre-release planning. Mature teams layer a risk matrix on top of an inventory of flows. Less mature teams let the customer be the integration test.

Isn't risk-based testing the answer to The What-to-Test Gap?

Partially. ISO/IEC/IEEE 29119 and the classic risk matrix tell you how to rank what you can already see. They don't show you the flows you didn't know existed. Risk-based testing is the second step. Discovery is the first. Skip the first and the matrix prioritizes the wrong list.

What's the difference between The What-to-Test Gap and the Locator Tax?

The Locator Tax is mechanical (selectors break when the UI changes and the team pays 20 to 30 percent of automation time to fix them). The What-to-Test Gap is judgment work (the team doesn't know which flows the suite should defend). Locator Tax shows up as broken tests. The gap shows up as missing tests.

How does AI fix The What-to-Test Gap?

By inventorying flows before tests get written. An agent walks the app, names the screens, identifies branches, and produces a map of the surface area. The suite is then built against the map, not a hunch. In QAby.AI we call this the discover verb (the first of discover, build, run, heal). Authoring tools skip discover and start at build.

How do I know if my team is in The What-to-Test Gap?

If you can't describe your suite's coverage and gaps in three sentences, you're in it. Other signals: more than half your bugs are "we should have caught that," your senior QA holds the map in their head, and you can't tell which automated tests are stale. The diagnostic in the body of this post has the full five-question check.

Does AI test generation create more work or less?

It depends on whether you closed the gap first. A team that doesn't know what to test, generating tests faster, makes the wrong work faster. One QA Lead we spoke to abandoned a ChatGPT generator because the review-and-re-prompt loop ate more time than writing tests by hand. Generate against a current flow inventory, not against a hunch.