The State of AI QA in Mid-Market SaaS 2026

n=41 calls, 9,103 test steps, 230k Playwright MCP downloads. The 2026 benchmark on QA team size, the locator tax, and the agentic testing layer.

Himanshu Saleria

Published June 12, 2026 · 38 min read

Research · AI Testing · Mid-Market SaaS · Benchmark · QA

Every QA report you've read this year was written by a vendor.

This one is too. I want to be honest about that up front.

What's different: we talked to 41 mid-market SaaS engineering and QA leaders over nine months, mined 9,103 test steps from real teams using our product, and pulled 1.42 million agent tool calls from our open-source Playwright MCP server. Then we wrote down what we found, including the parts that don't help us sell anything.

This is the artifact we wanted to read before we built QAby.AI. It didn't exist, so we made it.

TL;DR

31% of the mid-market SaaS orgs we interviewed have no dedicated QA function. Engineers ship to prod 1–2 times a day and absorb the test work themselves.
35% of teams (9 of 26 structured calls) named locator/selector maintenance as their #1 unprompted pain. More than test design, more than flake, more than tooling cost.
Automation runs 3 sprints behind dev on average. We call this the N-3 Lag. A team shipping new features in sprint N is automating what shipped in sprint N-3.
The median test on our platform is 8 steps. 1 in 8 of those steps is AI-driven (assertions, magic, extract, conditional). The boring bulk, click and type, is still 54.5% of all steps.
Our Playwright MCP server was downloaded 230,105 times from npm in the last 12 months. Of the 6,687 distinct IDs in its telemetry, 41% tried 5 calls and never came back. Activation is the real problem, not adoption.
Devs ship faster than QA tests. We close the gap. That's the pain frame this whole report sits inside.

Bottom line. 31% of mid-market SaaS orgs have no dedicated QA function. 35% of teams with QA name locator/selector maintenance as their #1 unprompted pain. Automation runs 3 sprints behind dev on average (the N-3 Lag). The median QAby.AI test is 8 steps and 1 in 8 is AI-driven. The agentic testing layer is real (1.42M agent tool calls on our open-source MCP) but activation, not adoption, is the constraint.

This is not an n=50 random-sample survey with confidence intervals. It's a structured-interview synthesis plus first-party telemetry. Treat the findings as directional and behavioral, not normative.

0. How we did this (and what this report is not)

The sample is 41 conversations: 38 customer and prospect calls, plus 3 senior SME interviews. One has confirmed attribution: Parag Dhake, a senior QA practitioner at a publicly-traded enterprise observability SaaS. The other two are anonymized while sign-offs are pending: a US-based senior QA leader with two decades of Staff and Principal experience at enterprise infrastructure SaaS, and a Sr. Director of SDET at a US AI agent and no-code platform SaaS. Conversations ran Q3 2025 through Q2 2026. Roles: QA Leads, QA Managers, SDETs, Engineering Managers, CTOs, founders.

Company sizes ranged from 3-person startups to 500-person enterprises. The focus segment is mid-market SaaS, 50–200 engineers, US-based. Where a number comes from a smaller or larger team, we say so.

Telemetry comes from two places:

QAby.AI production usage, 14 teams, 46 active users, 9,103 step events, and 1,000+ test runs.
Our open-source playwright-mcp server, with 6,687 distinct IDs and 5,904 distinct domains tested between November 2025 and June 2026.

What this report is not: a random-sample survey (call set is biased toward teams that took our call), a vendor leaderboard (we make no claims about competitors' usage data), an annual analyst report (we'll refresh in 2027), or statistically significant on any single number. The patterns repeat enough across conversations to be directional.

1. The seven numbers you can quote

#	Stat	What it measures
1	35%	QA-having teams (9/26 calls) naming locator/selector maintenance as #1 unprompted pain
2	31%	Mid-market orgs interviewed with no dedicated QA function
3	4–5 hrs / UI change	Median locator-maintenance batch cost (Playwright/Selenium/Cypress)
4	20–30%	Total automation time consumed by selector and maintenance work
5	3 sprints	Average gap between feature dev and automation coverage. The N-3 Lag
6	8 steps	Median length of a real test on QAby.AI (mean 14.8, n=616)
7	12.2%	Authored steps that are AI-driven. 1 in 8 is an AI assertion, magic step, extract, or conditional

2. Who actually does QA in mid-market SaaS?

The honest answer: in nearly a third of the mid-market SaaS orgs we interviewed, nobody owns QA full-time.

The team-shape distribution

Across the 41 conversations, the distribution of QA team shape clusters into four buckets. The numbers are directional (call data, not survey data), but the shape is consistent.

Chart 1 — QA team shape across 41 mid-market SaaS conversations

38% of teams sit at 1–2 QA. 31% have no dedicated QA. Mature 15+ QA orgs are the outlier (9%).

A few patterns recur across the dataset.

The no-QA shape is more common than category analysts will tell you. A 10-engineer sales-intelligence team has "no QA as such"; the PMs do UAT. An 8-engineer outbound SaaS ships 1–2 times a day with no staging environment, in their own words "cowboying to prod." A 3-person fintech has one FE engineer, one BE engineer, and no tester. None of these teams appears in a tooling survey, because they're not buying QA tools. They're absorbing QA into engineering.

The single-throat shape. When teams do have one QA person, that one person tests everything. A procurement SaaS QA Lead, call him Mike, is described in his own team's words as the person "no one else can trust" to sign off on a release. We call this The Single-Throat Bottleneck. The bottleneck isn't the tool. It's that the company's release rhythm is gated on one human.

The 1–2 QA shape is the modal mid-market team: 1–2 QA on a 10–50 engineer team, often with one doing automation while the rest are "learning." A QA Manager at a 10-person QA org told us 2 of her 10 did automation; the other 8 were skilling up.

The mature 15+ QA shape is the outlier. An enterprise observability SaaS in our dataset has 50+ QA, 150–180 devs, and 5,000 test cases. Real, but exceptional. Even there, just 4–5 of those QA engineers work directly with Playwright.

The pain-trigger sequence across all four shapes: a team starts with no QA → hires an SDET → adopts Playwright or Cypress → has a post-mortem about flaky tests → starts shopping for a Playwright alternative or an AI platform. We see it in cold inbound. We see it in the search queries that land on /compare/playwright.

The takeaway for buyers: if your QA function looks like the 1–2 QA or no-QA shape, you are normal. The conventional wisdom that "real" SaaS companies have dedicated QA teams describes, at the mid-market, a minority pattern.

3. Where the time actually goes

When teams do have QA, the time accounting is brutal. The biggest single chunk of automation time isn't writing tests. It's keeping the tests they already wrote from breaking.

Chart 2 — Where automation team time actually goes (n=26)

Locator/selector maintenance and running + reviewing each consume ~25% of automation time. The Locator Tax is the biggest single chunk.

The Locator Tax

The most consistent number across the dataset: 20–30% of total automation time is locator and selector maintenance. A fintech QA Lead describing his Playwright stack. A QA Lead at a Japanese language-learning SaaS, same thing. A senior IC at a billing SaaS: "the CSS keeps changing." The QA team at a US scheduling SaaS: "locators used to keep on wrecking."

We name the pattern The Locator Tax: the cost of selector-based maintenance, paid every sprint, charged in hours.

The unit cost of one UI change is well-defined: teams report 4–5 hours of batched fix work per UI change. A menu refactor at one fintech. A redesign at a procurement SaaS. A quarterly nav update at an e-commerce site. The fix isn't one selector. It's the same selector "in 2–3 places" across multiple files, then re-running the suite, then triaging which failures were the change vs which were pre-existing flake.
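The "same selector in 2–3 places" problem can be made visible before the next redesign lands. A minimal sketch, assuming your tests pass string literals to Playwright's `locator(...)`; the regex, the `find_duplicated_selectors` helper, and the sample file contents are all illustrative assumptions, not something any team in the dataset described:

```python
import re
from collections import defaultdict

# Matches the string literal passed to a locator(...) call, e.g. page.locator("#submit").
# Assumption: selectors are plain single- or double-quoted literals.
LOCATOR_CALL = re.compile(r"""\.locator\(\s*(['"])(.+?)\1""")


def find_duplicated_selectors(sources: dict[str, str]) -> dict[str, list[str]]:
    """Map each selector used in more than one file to the sorted files that use it."""
    files_by_selector: dict[str, set[str]] = defaultdict(set)
    for filename, text in sources.items():
        for match in LOCATOR_CALL.finditer(text):
            files_by_selector[match.group(2)].add(filename)
    return {
        selector: sorted(files)
        for selector, files in files_by_selector.items()
        if len(files) > 1
    }


# Hypothetical test files, inlined so the sketch runs without touching disk.
SAMPLE_SOURCES = {
    "checkout.spec.ts": 'await page.locator("#nav-menu > li.cart").click();',
    "profile.spec.ts": "await page.locator('#nav-menu > li.cart').click();\n"
                       'await page.locator("#avatar").click();',
    "search.spec.ts": 'await page.locator("#search-box").fill("shoes");',
}

duplicates = find_duplicated_selectors(SAMPLE_SOURCES)
print(duplicates)
```

Every selector this reports is one that a single UI change will break in several files at once, which is where the batched fix hours come from.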

Compounding: one Playwright test takes "a couple of hours" to write for a competent SDET (US-based QA Lead at a series-A note-taker). At 3–4 hours per scenario for a 10-person QA team (QA Manager at a payments SaaS, 6 months in, 20–25% coverage), the math gets ugly fast.
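To see how the per-change cost adds up, here is a back-of-envelope sketch. The 4–5 hour range is the one teams reported above; the number of UI changes per quarter is a made-up input you should replace with your own count:

```python
# Range reported in the interviews above: hours of batched fix work per UI change.
HOURS_PER_UI_CHANGE = (4, 5)


def locator_tax_hours(ui_changes: int) -> tuple[int, int]:
    """Return the (low, high) hours spent on locator fixes for a number of UI changes."""
    low, high = HOURS_PER_UI_CHANGE
    return ui_changes * low, ui_changes * high


# Assumption: 6 selector-breaking UI changes in a quarter. Replace with your own count.
quarter_low, quarter_high = locator_tax_hours(6)
print(f"Locator Tax this quarter: {quarter_low}-{quarter_high} hours")
```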

We unpack the cost math against the SDET hire in The SDET You Don't Have to Hire Next Quarter.

The What-to-Test Gap

The deeper finding: the bottleneck isn't writing tests, it's deciding what to write.

A US-based QA Lead at an AP/payments SaaS put it cleanly: "writing and figuring out what to test is where the problem is." A QA Lead at a high-trust enterprise SaaS, same: "writing test cases was never my problem, knowing which test cases to write is."

We call this The What-to-Test Gap: the most-felt and least-named QA pain in the dataset.

It shows up in three failure modes:

Coverage is opaque. One senior QA practitioner in our dataset quantifies it: "real coverage is 40%, but the tool reports 80%." Most QA Leads' coverage numbers are optimistic by 1.5–2x.
Side-effects are invisible. Tests pass on the changed feature; tests fail somewhere else that uses a shared component. One AP/payments SaaS described pages 15,000–20,000 lines long where a refactor in one place quietly broke three others, and the regression suite caught one.
Edge cases are the customer's job. Real test cases come from production incident tickets, not pre-release planning. The customer is the integration test.

Together, the Locator Tax and What-to-Test Gap account for 30–45% of automation team time. The rest (writing tests, triaging flakes, running suites) fills the other 55–70%.

Key takeaways

31% of mid-market SaaS orgs interviewed have no dedicated QA function. The no-QA shape is normal, not a sign of immaturity.

The Locator Tax eats 20–30% of total automation time. 35% of QA-having teams name it as their #1 unprompted pain.

The What-to-Test Gap is the deeper bottleneck. Knowing which tests to write matters more than authoring speed.

Reported coverage runs ~2x higher than real coverage. Senior QA estimate: 80% reported, 40% real.

4. How far behind dev is automation actually running?

The N-3 Lag is the most-quotable pattern in this report.

A QA leader at a Japan-based language SaaS, running a regression suite of 200 cases (85 of them automated) with a release every two weeks, described it in one sentence: "We're automating current sprint minus three."

The pattern repeated. Different team, different geography, same lag.

Chart 3 — The N-3 Lag (31 teams that quantified the gap)

Modal pattern: automation is 3 sprints behind dev. Verbatim from a QA Lead at a Japanese language SaaS — "we are automating current sprint minus three."

Why the gap exists

Regression cycles are batched. Teams that ship every day run regression weekly, biweekly, or monthly. A US AI-notes startup releases twice a week but runs full regression once a month. By the time the monthly run catches a regression, the bug has been in production for about 14 days on average.
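The exposure figure is roughly half the regression interval. A rough sketch, assuming a regression ships at a uniformly random point in the cycle and the next full run catches it; both are simplifying assumptions, and the 4-week month is ours:

```python
def mean_exposure_days(regression_interval_days: float) -> float:
    """Average days a regression sits in production before the next full run.

    Assumes the bug ships at a uniformly random point in the cycle and the
    next full regression run catches it.
    """
    return regression_interval_days / 2


# Cadences named above; "monthly" is approximated as 4 weeks.
CADENCES = {"weekly": 7, "biweekly": 14, "monthly (4 weeks)": 28}

exposure = {name: mean_exposure_days(days) for name, days in CADENCES.items()}
for name, days in exposure.items():
    print(f"{name}: ~{days:g} days in production before regression sees it")
```

Under these assumptions, shortening the regression interval is the only lever that moves the number; shipping more often does not.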

Automation is the second priority. A QA Manager at a 10-person team told us, "Most of us are working mostly on manual. We don't have bandwidth for automation." Same team, 20–25% coverage after six months.

Coverage numbers lie. Per one senior QA practitioner we interviewed: "the tool reports 80%, real coverage is 40%." Invisible until a customer files the bug.

The reported coverage range

For the teams who quantified it:

Team type	Reported automation coverage
Series-A scheduling SaaS, 10 QA, 6 months in	20–25%
Series-B Japanese language SaaS, ~200 regression cases	42% (85 of 200)
Mid-market US no-code agent SaaS	"75–80% accurate, never beyond 60% autonomous"
Enterprise observability SaaS (publicly-traded)	85% (the outlier)
Enterprise QA Lead's "real" coverage estimate	~40% real vs ~80% reported

A senior SDET leader at a US AI-agent SaaS described the AI-script generation pattern most cleanly: "AI-generated automation is 75–80% accurate, but autonomously, we never get beyond 60%. That 40% is the gap." That 40% is what the buyer pays the SDET to close.

The Green-Pipeline Lie

The most uncomfortable finding comes from one senior QA practitioner we spoke with. A pipeline can be green and still be lying to you. Self-healing isn't always healing. Sometimes it's deleting the failing test.

We call that The Green-Pipeline Lie. A green pipeline means everything that ran passed. It does not mean everything was checked. The question to ask any AI testing vendor: "when a test breaks, do you repair it or skip it?" Most won't answer.
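One way to check your own pipeline for this is to count what was skipped, not just what failed. A minimal sketch, assuming your runner can emit a JUnit-style XML report where skipped cases carry a `<skipped/>` child; the sample report and the `audit_report` helper are hypothetical:

```python
import xml.etree.ElementTree as ET

# Hypothetical JUnit-style report: nothing failed, but two tests never ran.
SAMPLE_REPORT = """
<testsuite name="checkout" tests="4">
  <testcase classname="cart" name="adds item"/>
  <testcase classname="cart" name="applies coupon"><skipped/></testcase>
  <testcase classname="cart" name="removes item"/>
  <testcase classname="pay" name="declined card"><skipped message="auto-healed"/></testcase>
</testsuite>
"""


def audit_report(xml_text: str) -> dict:
    """Separate 'nothing failed' from 'everything was checked'."""
    root = ET.fromstring(xml_text.strip())
    passed, skipped, failed = [], [], []
    for case in root.iter("testcase"):
        name = f"{case.get('classname')}.{case.get('name')}"
        if case.find("skipped") is not None:
            skipped.append(name)
        elif case.find("failure") is not None or case.find("error") is not None:
            failed.append(name)
        else:
            passed.append(name)
    total = len(passed) + len(skipped) + len(failed)
    executed = len(passed) + len(failed)
    return {
        "green": not failed,  # what the pipeline badge shows
        "skipped": skipped,   # what the badge hides
        "checked_share": executed / total if total else 0.0,
    }


audit = audit_report(SAMPLE_REPORT)
print(audit)
```

In this made-up report the pipeline is green while only half the cases executed. Tracking the skipped list over time shows whether "healing" is quietly shrinking what gets checked.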

5. What teams actually build when AI authors the test

This section draws on our own product telemetry. Not a market sample. The shape of real authoring on one platform.

The dataset: 9,103 step events across 14 teams and 46 users between October 2024 and June 2026, behind 627 test runs.

Chart 4 — Anatomy of an AI-authored test (9,103 step events)

1 in 8 authored steps is AI-driven (assert-ai, ai-magic, extract-content, conditional). Click + type still dominate the boring bulk.

The anatomy of an AI-authored test

The boring bulk is still boring. Click + type = 54.5% of all steps. Most of what real users build is the same input-and-button work that's always been the heart of E2E. What's new is what fills the gap.

Step type	Share	What it does
click	39.7%	Tap a button, open a menu
type	14.8%	Fill an input
assert-ai	8.4%	AI-driven assertion: "the cart should contain 3 items"
scroll	6.3%	Move to bring an element into view
wait	6.3%	Wait for a condition
module reuse	6.1%	Call a sub-flow
javascript	2.3%	Custom code injection
extract-content	1.8%	AI extracts a value to assert against
navigate	1.7%	Go to a URL
select	1.4%	Pick from a dropdown
conditional	1.0%	Branch the test based on UI state
ai-magic	1.0%	Open-ended AI action: "select date as tomorrow"

AI-driven steps total 12.2%: assert-ai + ai-magic + extract-content + conditional. That's 1,110 of 9,103 steps. 1 in 8 steps real teams write is AI-driven, not a plain click.
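The headline shares can be re-derived from the step-type table. A small sketch using the published percentages and counts; only the variable names are ours:

```python
# Shares (in %) copied from the step-type table above.
STEP_SHARE_PCT = {
    "click": 39.7, "type": 14.8, "assert-ai": 8.4, "scroll": 6.3,
    "wait": 6.3, "module reuse": 6.1, "javascript": 2.3,
    "extract-content": 1.8, "navigate": 1.7, "select": 1.4,
    "conditional": 1.0, "ai-magic": 1.0,
}
AI_DRIVEN = ("assert-ai", "ai-magic", "extract-content", "conditional")

ai_share = round(sum(STEP_SHARE_PCT[step] for step in AI_DRIVEN), 1)
boring_bulk = round(STEP_SHARE_PCT["click"] + STEP_SHARE_PCT["type"], 1)

# Cross-check against the raw counts quoted above.
AI_STEPS, TOTAL_STEPS = 1_110, 9_103
share_from_counts = round(AI_STEPS / TOTAL_STEPS * 100, 1)
one_in_n = TOTAL_STEPS / AI_STEPS

print(f"AI-driven: {ai_share}% (from counts: {share_from_counts}%)")
print(f"click + type: {boring_bulk}%")
print(f"roughly 1 in {one_in_n:.1f} steps is AI-driven")
```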

The read for buyers: the future isn't "AI replaces every step." It's "AI does the 12% that's hardest to maintain (assertions, magic, conditional branches), while click + type stays click + type." The pitch that AI testing replaces the entire test author isn't happening in the data. The pitch that AI testing replaces the maintenance-heavy 12% is.

Test shape

The median test on QAby.AI is 8 steps. Mean: 14.8. Max: 182. Note: a "module" counts as one step in these numbers even when the module contains multiple sub-steps inside it. The 8-step median includes module references as single units.

The gap between the median (8) and the mean (14.8) is the signature of a fat-tailed dataset. Most tests are short. A few (long edge-case suites) pull the average up. If you're benchmarking your own suite: how does your median compare? Means lie when the distribution is long-tailed.
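To benchmark your own suite the same way, compare median and mean step counts side by side. A minimal sketch; the step counts below are invented for illustration and are not our telemetry:

```python
from statistics import mean, median

# Made-up step counts for 13 tests: mostly short, with two long outliers.
steps_per_test = [3, 4, 5, 6, 7, 8, 8, 9, 10, 12, 15, 40, 182]

suite_median = median(steps_per_test)
suite_mean = mean(steps_per_test)

print(f"median: {suite_median} steps, mean: {suite_mean:.1f} steps")
# A mean well above the median signals a long tail, so report the median.
print("long-tailed" if suite_mean > 1.5 * suite_median else "roughly symmetric")
```

The 1.5x cutoff is an arbitrary rule of thumb for the sketch, not a threshold from the data.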

Email and OTP testing exists in the wild

A small but real signal: 33 wait-for-email + 26 extract-from-email steps, across 5 users at 4 teams. 0.65% of all steps. A niche, but one that validates a pain prospects named repeatedly. Few teams have been on the platform long enough to need it. The teams who do, use it.

6. The agentic testing layer most teams haven't noticed yet

This section pulls from the second telemetry source: our open-source playwright-mcp server, which lets AI coding agents drive a browser via the Model Context Protocol.

The headline numbers are large and the implications are larger.

The download and traffic numbers

230,105 npm downloads of playwright-mcp in the 12 months ending 2026-06-09 (npm-stat).
1.42 million agent tool calls since 2025-11-05, across 6,687 distinct IDs and 5,904 distinct domains tested.
187 distinct MCP client types observed driving the server.

Chart 5 — Playwright MCP monthly tool-call volume (thousands)

1.42M total agent tool calls since November 2025 across 6,687 distinct IDs. March peak driven by a handful of heavy IDs running automated loops; June 2026 is partial-month.

Agents drive by sight

The single most striking finding from the MCP data: agents prefer pixels to DOM.

The tool-call mix:

Tool call	Count
`get_screenshot`	643,424
`execute_code`	401,482
`init_browser`	264,268
`get_text_snapshot`	39,595
`get_interactive_snapshot`	31,486
`get_full_snapshot`	29,273

643K screenshots ÷ 264K browser inits = 2.4 screenshots per session. Combined DOM-snapshot calls (text + interactive + full) total ~100K, less than a sixth of screenshot volume.
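These ratios can be re-derived from the tool-call table. A small sketch, keeping the same simplification used above that one `init_browser` call marks one session:

```python
# Counts copied from the tool-call table above.
TOOL_CALLS = {
    "get_screenshot": 643_424,
    "execute_code": 401_482,
    "init_browser": 264_268,
    "get_text_snapshot": 39_595,
    "get_interactive_snapshot": 31_486,
    "get_full_snapshot": 29_273,
}
DOM_SNAPSHOT_TOOLS = ("get_text_snapshot", "get_interactive_snapshot", "get_full_snapshot")

# Assumption carried over from the text: one init_browser call = one session.
screenshots_per_session = TOOL_CALLS["get_screenshot"] / TOOL_CALLS["init_browser"]

dom_snapshots = sum(TOOL_CALLS[tool] for tool in DOM_SNAPSHOT_TOOLS)
dom_to_screenshot_ratio = dom_snapshots / TOOL_CALLS["get_screenshot"]

print(f"screenshots per session: {screenshots_per_session:.1f}")
print(f"DOM snapshots: {dom_snapshots:,} ({dom_to_screenshot_ratio:.1%} of screenshot volume)")
```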

Agents look first, act second. Screenshot, decide, screenshot, decide. The DOM exists. They mostly ignore it. They behave the way a human QA tester behaves on day one: looking at the page, not reading the HTML.

The DOM-first paradigm that defined Selenium and Playwright is not what AI agents are choosing. They're choosing the visual paradigm. The same pattern shows up in how to evaluate AI testing tools. Ask vendors "how do you find the button?" and the honest answer is increasingly "we look at it."

The client ecosystem

The market for AI coding agents driving browser tests is bigger and more fragmented than any single vendor will tell you.

Chart 6 — MCP client market share by tool-call events (thousands)

187 distinct MCP client types observed. claude-code leads by install base (980 users). opencode dominates by volume.

Among tagged clients, claude-code leads by install base with 980 distinct users. By event volume, opencode dominates: fewer users (358) but vastly heavier sessions (558K events). The "untagged" bucket (3,416 users) is the largest headcount segment overall: older clients that predate our tagging.

The long tail is the story. Cursor-vscode, codex-mcp-client, claude-ai, Gemini CLI, Cline, Windsurf, Trae, Kiro, GitHub Copilot, Antigravity, Qwen, LM Studio: 187 distinct client types. No single AI testing vendor owns the agentic distribution channel. Distribution sits inside the coding-agent ecosystem, and that ecosystem is everyone.

Localhost is the #1 domain

The top URL that developers point the MCP at is 127.0.0.1: 272 distinct users, 18,762 events. People are testing their own local apps, not staging, not production.

The other top domains break down by intent:

Bucket	Example domains	Approx users
Localhost (dev)	127.0.0.1	272
Search	google.com, baidu.com, bing.com, duckduckgo.com	~331
Auth / Identity	accounts.google.com, login.microsoftonline.com, Atlassian, Intuit, PingOne	~158
Dev tools	github.com, figma.com, vercel.com, supabase, notion, claude.ai, chatgpt	~133
Social / Content	linkedin, youtube, instagram, x.com, facebook, reddit	~120
China-specific platforms	baidu, xiaohongshu, bilibili, weixin, zhihu, douyin, feishu, dingtalk	~115
E-commerce	amazon, checkout.stripe.com, alibaba	~22
Test/sandbox	example.com, saucedemo	~83

The pattern: agents automate the boring walls. Search, SSO logins, GitHub, and your own localhost. The dream of "an AI testing exotic third-party SaaS for you" isn't what people are doing. The work is getting an agent past the login page so it can test the thing you actually care about.

The activation problem is everyone's problem

The traffic distribution is brutally power-law:

Metric	Value
Mean tool calls per user	211.8
Median tool calls per user	8
90th percentile	151
99th percentile	1,627
Max	520,076

The top 1% of users (67 IDs) drive 73% of all traffic. The top 3 alone drive 57%.

The opposite end matters more: 41% of users (2,752 of 6,687) tried 5 events and never returned. Median user runs 8 calls, 3 sessions, gone. Power user runs hundreds of thousands of calls in sustained autonomous loops.
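The published figures hang together arithmetically, and the same concentration check is easy to run on your own usage logs. A sketch: the constants come from the tables above, while `top_share` and the sample distribution are hypothetical illustrations, not our raw data:

```python
import math

# Figures quoted above.
USERS = 6_687
MEAN_CALLS_PER_USER = 211.8
DROPPED_AFTER_5 = 2_752

implied_total_calls = USERS * MEAN_CALLS_PER_USER  # should land near 1.42M
drop_off_share = DROPPED_AFTER_5 / USERS           # should land near 41%
top_1pct_ids = math.ceil(USERS * 0.01)             # size of the "top 1%" group


def top_share(calls_per_user: list[int], fraction: float) -> float:
    """Share of all calls made by the heaviest `fraction` of users."""
    ranked = sorted(calls_per_user, reverse=True)
    k = max(1, math.ceil(len(ranked) * fraction))
    return sum(ranked[:k]) / sum(ranked)


# Made-up distribution of 100 users: many light users, one automated loop.
sample = [5] * 41 + [8] * 30 + [150] * 28 + [50_000]

print(f"implied total calls: {implied_total_calls:,.0f}")
print(f"drop-off share: {drop_off_share:.1%}, top-1% group: {top_1pct_ids} IDs")
print(f"top 1% share in the made-up sample: {top_share(sample, 0.01):.0%}")
```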

The agentic testing market isn't waiting for adoption. It's waiting for activation. The 230K download number is vanity next to the 41% drop-off after 5 events.

Microsoft's @playwright/mcp package did 60.4M downloads over the same 12 months. The market is real and large. The question isn't "do AI agents test web apps." It's "how do we make agentic testing productive enough that the activation curve doesn't lose 41% of users on day one."

The paradigm comparison sits in Playwright vs QAby.AI and the Playwright pricing comparison.

7. Three senior practitioners on what comes next

The dataset has three senior practitioner interviews. One has confirmed attribution. The other two are anonymized while sign-offs are pending. The verbatim quotes below come from recorded calls with consent.

A senior QA practitioner, on self-healing tests and pipeline integrity

One senior practitioner in our dataset is a US-based veteran QA leader with two decades of Staff and Principal experience at enterprise infrastructure SaaS. She has run automation through two acquisitions and built test infrastructure for enterprise observability stacks.

The story she told us is the one we cite most often.

She inherited a pipeline that was always green. Engineering was happy. Then a customer filed a bug. She traced the regression back. The test that should have caught it had been passing for weeks. She looked at the test code. The assertion that would have failed had been quietly removed, converted to a "skip" by the tool's self-healing logic. The pipeline didn't fail because the test that would have failed wasn't running anymore.

In her words:

"The goal of the team was to make the pipeline green. The tool had removed the assertion that was failing, converted it to a skip group. The bug hit production. The test had passed."

That's The Green-Pipeline Lie. The principle she has enforced on every team she has run since:

"Behave like a developer. If the test fails, you fix the code or you fix the test. You don't skip."

This practitioner is API-first. We are not. She believes UI automation can't run parallel to dev. It always lags, the N-3 Lag is real, and the cleanest way to test at the rate of release is at the API layer. We disagree. We think the API-first stance leaves a class of UI regressions unowned. We're documenting the disagreement honestly because API-first is a defensible stance and we're not going to pretend we have a single answer. The buyer should make this call based on their own release rhythm and bug taxonomy.

Her last point landed hardest:

"Coverage tools lie. Real coverage is 40%. The tool reports 80%. Until you can measure what actually executes, every coverage number is marketing."

Parag Dhake, on mature-org QA in the AI era

Parag Dhake is a senior QA practitioner at a publicly-traded enterprise observability SaaS. His team runs 5,000 test cases at 85% automation coverage across a 150–180 dev org with 50+ QA engineers. By the standards of our dataset, the outlier. By the standards of most analyst reports, "what good looks like."

What's interesting is how they got there.

Parag's team built their own MCP-based test generator internally. The pattern: give the agent a product spec, have it generate Playwright test cases against a known fixture, have a human review. The first version worked for happy-path. It broke on conditional logic, multi-tab interactions, and flows requiring real session state.

His read, paraphrased:

"AI generates the boilerplate well. It writes the case structure. It does not yet understand the business rules well enough to write the assertions we care about. The 15% that's still manual is exactly the 15% where the business logic lives."

This matches our platform data: the 8.4% assert-ai share is the slice of the test that's hardest to author well. Clicks and types are easy. Assertions, the part where the test decides whether the app did the right thing, are where domain knowledge bites back.

His view on the next 12 months: AI doesn't replace mature QA. It changes who writes the case structure. High-trust assertions belong to the human who understands the domain. The tooling that wins is the one that makes that human's day job 5x faster, not the one that promises to replace them.

A senior SDET leader, on the 40% gap and shifting QA left

The third senior practitioner in our dataset is a Sr. Director of SDET at a US AI agent and no-code platform SaaS. He leads automation at a high-velocity engineering org and is a vocal advocate of "shift left" practices: writing tests as part of the development workflow, not after it.

His read on AI-generated automation:

"AI-generated scripts are 75–80% accurate, but autonomously, we never get beyond 60%. That 40% is the gap."

The 40% gap is the same one the agent-generated case structure leaves unfilled: the assertions and conditional logic that depend on understanding the business rules, not just the UI. His team treats AI-generated tests as a starting point, not a finished suite. The model writes the click-and-type scaffolding fast. Humans close the assertion gap.

His view on shift left: testing belongs alongside the code change, not three sprints downstream. If the test runs at commit, the N-3 Lag never opens. The constraint isn't "can AI write the test." The constraint is "does the developer write the test at all, or does the work get queued to a team that's three sprints behind."

The implication for buyers shopping for AI testing tools: a tool that closes the 40% autonomy gap matters less than a tool that puts the test next to the code change. Velocity beats accuracy if you can't get the test in front of the developer when the change is fresh.

8. Five framed positions this data forces

Each position below is contrarian to standard category-analyst takes. Each is sourced to the data above.

Position 1: AI testing is a vitamin until the pain crosses a threshold

The vitamin-to-painkiller distinction shows up in our activation data. 41% of MCP users try 5 events and never return: curious, not in pain. The teams that stay are the ones for whom maintenance cost crossed an internal threshold. The vitamin became a painkiller when the bug got to production, or the QA Lead quit, or the SDET hire didn't close.

If you're shopping AI testing tools and can't articulate the specific recurring pain you're solving, you'll join the 41% drop-off. The tool that wins is the one you can't put down, because something hurts every day if you do.

Position 2: More QA hiring won't fix the bottleneck

31% of mid-market orgs we interviewed have no QA function and ship anyway. Among teams that do have QA, the constraint is The What-to-Test Gap, not "who tests it." Hiring another QA engineer doesn't tell anyone what to test, doesn't shorten the N-3 Lag, doesn't reduce the Locator Tax.

A mid-level US SDET is $120–160k base, $200k+ loaded. The honest math: if your team's pain is "we don't have bandwidth to write more tests," another SDET helps. If your pain is "our tests break every time we redesign a menu," another SDET inherits the same problem, slower.

That's the pitch behind "skip the SDET hire": defensible only when the tool genuinely closes the gap the hire would close. Cost math in The SDET You Don't Have to Hire Next Quarter.

Position 3: Self-healing is sometimes "deletes failing tests"

The senior practitioner's documented case above is not a one-off. Self-healing as a category covers a spectrum: at one end, "we re-find the button using a better selector strategy" (genuine healing). At the other, "we paper over the failing assertion to keep the pipeline green" (cosmetic healing).

The buyer question every vendor should be forced to answer: when a test fails, do you repair it, or do you skip it? That answer determines whether your pipeline is honest or theater.

Position 4: Localhost is the real testing target, not staging

272 users testing on 127.0.0.1 isn't a fluke. It's the dev loop. Real engineers don't wait for staging to test the change they made an hour ago. They run a test locally before the PR exists.

The implication: the "QA environment" model (staging mirrors prod, tests run on staging, results report to QA) is a corporate-IT artifact. The real demand is "let me run a meaningful test against the change on my own machine in 60 seconds." Tools that make that fast win. Tools that gate every run behind a cloud queue lose, slowly.

Position 5: Coverage % is a vanity metric

The "40% real, 80% reported" finding from our senior practitioner interview isn't unique to her org. Coverage tools count what was instrumented, not what was exercised. The numbers a buyer should track aren't coverage. They're flake rate (how often does the test give a wrong answer?) and lead-time-to-fix (when a real regression escapes, how long until the fix reaches production?).

Both are harder to compute. Both are honest. Coverage is easy to compute and easy to game. Choose accordingly.
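Neither metric needs special tooling once the raw events are logged. A rough sketch under stated assumptions: the record shapes are hypothetical, flake rate is taken as the share of runs whose verdict a human later judged wrong, and lead-time-to-fix is read as the time from detecting an escaped regression to shipping its fix:

```python
from datetime import datetime
from statistics import median

# Hypothetical run log: verdict_correct is False when the verdict was later judged wrong.
RUNS = [
    {"test": "login", "verdict": "pass", "verdict_correct": True},
    {"test": "login", "verdict": "fail", "verdict_correct": False},  # flaky failure
    {"test": "cart", "verdict": "pass", "verdict_correct": True},
    {"test": "cart", "verdict": "pass", "verdict_correct": False},   # missed a real bug
    {"test": "search", "verdict": "pass", "verdict_correct": True},
    {"test": "search", "verdict": "pass", "verdict_correct": True},
    {"test": "pay", "verdict": "fail", "verdict_correct": True},
    {"test": "pay", "verdict": "pass", "verdict_correct": True},
]

# Hypothetical escaped regressions: (detected in production, fix deployed).
ESCAPES = [
    (datetime(2026, 3, 2, 9), datetime(2026, 3, 2, 21)),
    (datetime(2026, 3, 10, 10), datetime(2026, 3, 12, 10)),
    (datetime(2026, 4, 1, 8), datetime(2026, 4, 2, 8)),
]


def flake_rate(runs: list[dict]) -> float:
    """Share of runs where the test gave the wrong answer."""
    wrong = sum(1 for run in runs if not run["verdict_correct"])
    return wrong / len(runs)


def median_lead_time_hours(escapes: list[tuple[datetime, datetime]]) -> float:
    """Median hours from detecting an escaped regression to deploying the fix."""
    return median((fixed - found).total_seconds() / 3600 for found, fixed in escapes)


print(f"flake rate: {flake_rate(RUNS):.0%}")
print(f"median lead-time-to-fix: {median_lead_time_hours(ESCAPES):.0f} hours")
```

The hard part is the labeling, not the arithmetic: someone has to record whether each verdict was right, which is why these numbers are harder to game than coverage.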

9. Appendix: methodology, sample composition, what we'd do differently

Sample composition (anonymized)

Sector	Team size	QA shape	Geo
Sales intelligence	10 eng	No QA	India
Outbound SaaS	8 eng	No QA, 1–2x/day	India/US
Fintech (consumer)	3 eng	No tester	India
AI meeting notes	~10 eng	Devs do QA	US
Note-taking SaaS	20–40 eng	1 Lead + 1 IC	India
AP / payments	30–60 eng	1–2 QA	India
Scheduling SaaS	30–50 eng	4–5 QA, 4–5 releases/wk	US
AI agent platform	50–100 eng	SDET-led	US
Procurement / RFQ	20 eng	Single-Throat	India
Regulated SaaS	50–80 eng	1 QA, muted bug channel	India
JP language SaaS	100+ eng	4 QA, biweekly regression	IN/JP
Observability (publicly-traded)	150–180 eng	50+ QA, 85% automation	US
Travel / iOS-heavy	500 eng	13 QA	ME
Multiple POC teams	varies	varies	mixed

Telemetry caveats

QAby.AI production analytics: early-stage volumes. Many users are internal team or POC. Team grouping is only reliable from March 2026 forward. Traffic skews India-heavy.
Playwright MCP telemetry (from our open-source playwright-mcp server): GeoIP is disabled fleet-wide. No country breakdown is publishable. The China signal is a domain-content proxy (what users automate), not geolocation. June 2026 is partial-month data. The March 2026 685K peak was driven by a handful of heavy IDs running automated loops, not a user surge.
distinct_id counts client and session identity. Heavy IDs are likely automated agent loops, not unique humans. Treat user counts as directional.
No pass/fail event yet. Current instrumentation doesn't emit a testRunCompleted event with status, so we cannot report flake rate or "% passing on first run." We're adding it. The 2027 version will include reliability stats.

Structured interview question set (n=26)

How many engineers? How many QA?
Release cycle: frequency, gates, owners?
What % of regression is automated? Tool stack?
Hours per week on test maintenance? Selector fixes?
Most-painful test case to maintain right now?
When a bug escapes to prod, what was the failure mode in the test layer?
Next QA hire on your roadmap?
AI testing tools currently being evaluated?
What would you cut from your stack if you could?

What we'd do differently next time

Formal n=50+ panel recruited through the MCP user base, screened, balanced for sector and team size.
Post-Q3 follow-ups: same team, six months later. The delta is the interesting data.
Structured "data cards" per call so we don't back-mine transcripts.
A reliability event in QAby.AI telemetry so 2027 reports flake rate.
A formal "State of QA 2026" survey via MCP user base for n=100+ corroborating data.

SME acknowledgments

This report draws on direct interviews with senior QA practitioners who shared their time, frameworks, and verbatim observations. With thanks:

Parag Dhake — senior QA practitioner at a publicly-traded enterprise observability SaaS. Named contributor; quoted on mature-org AI QA practice in §7.2.
Neetu Kulshrestha — senior QA leader with two decades of Staff and Principal experience at enterprise infrastructure SaaS, named here with permission. The §7.1 narrative reflects QAby.AI's reading of self-healing and pipeline-integrity patterns across the broader dataset rather than direct attribution to Neetu; some interpretation may not match her own context.

One additional senior practitioner (Sr. Director SDET at a US AI agent platform) contributed the "40% gap" framing in §7.3, anonymized while attribution sign-off is pending.

If you want to push back on any interpretation in this report, email himanshu@qaby.ai or open a discussion at the canonical URL. Revisions land in the 2027 update.

10. How to cite this report

Canonical citation:

QAby.AI. (2026). The State of AI QA in Mid-Market SaaS 2026. https://qaby.ai/blog/state-of-ai-qa-2026

Long-form (academic / analyst / journalism):

Saleria, H., & the QAby.AI team. (2026, June 12). The State of AI QA in Mid-Market SaaS 2026: A study of 41 customer + SME conversations, 9,103 product-usage events, and 1.42M agent tool calls. QAby.AI. https://qaby.ai/blog/state-of-ai-qa-2026

A downloadable PDF, with the source dataset summary, is in the works and will be gated behind a short form.

11. So what do you do with this?

Frame	Detail
Pain	Devs ship faster than QA tests. We close the gap.
Outcome	Release confidence at engineering velocity.
Mechanism	AI agents discover your flows, build the tests, run them on every merge, and heal them when your UI changes.
Hooks	Skip the SDET hire · Run regression on every merge · Beyond generated scripts

If you read the data above and recognized your own team (the N-3 Lag, the Locator Tax, the QA hire you keep deferring), the next move is a 30-minute audit of your current QA gap against these patterns. We'll show you which numbers above match your team, where the biggest leak is, and what changes if AI agents close it.

Run My Audit →

Dig in further:

Playwright vs QAby.AI: framework-code vs agent-led-regression fork
The SDET You Don't Have to Hire Next Quarter: cost math against the SDET hire
Playwright Pricing Comparison: maintenance-tax math
How to evaluate AI testing tools: buyer-side checklist
Playwright alternative 2026: alternative landscape
/compare/playwright and /compare/manual-qa: head-to-head pages
/pricing: actual numbers

External cross-validation:

playwright-mcp on npm-stat: the download curve we cite above
Microsoft @playwright/mcp on npm: for cross-comparison on the same agentic-testing category
Stack Overflow Developer Survey: for the broader engineering-tooling context against which our mid-market numbers should be read

About this report

Author: Himanshu Saleria, Co-founder & CEO, QAby.AI. Background in QA-led product engineering at scale; running QAby.AI's customer research, telemetry analysis, and product.

Reviewed and quoted by: Parag Dhake, senior QA practitioner at a publicly-traded enterprise observability SaaS. Plus two senior practitioners anonymized while sign-offs are pending: a US-based veteran QA leader with two decades of Staff and Principal experience at enterprise infrastructure SaaS, and a Sr. Director of SDET at a US AI agent and no-code platform SaaS.

Methodology: 41 interviews, 26 of them structured (Q3 2025–Q2 2026), plus production telemetry from QAby.AI and the open-source playwright-mcp server. See §0 for the full methodology and §9 for the sample composition.

License: This report is published under CC-BY 4.0. Cite as: QAby.AI. (2026). The State of AI QA in Mid-Market SaaS 2026. https://qaby.ai/blog/state-of-ai-qa-2026

Frequently asked questions

What's the average QA team size in mid-market SaaS?

There isn't one. In our 41-team dataset, 31% have no dedicated QA function, the modal team has 1–2 QA on a 10–50 engineer org, and mature orgs (15+ QA) are the outlier. If you're benchmarking, anchor to your team shape. The no-QA and 1–2 QA shapes are both normal for mid-market and shouldn't be treated as immature.

How much time does a QA team spend on test maintenance?

20–30% of total automation time goes to locator and selector maintenance in teams using Playwright, Selenium, or Cypress. That's what we call the Locator Tax. A single UI change typically triggers a 4–5 hour batched fix across the suite. That's separate from triaging flaky failures and deciding what to test next, which together consume another 20–25% of team time.
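A quick way to apply these ranges to your own team is the arithmetic below. It is a sketch, not a calibrated model: the 20–30% share and the 4–5 hours per UI change are the figures from this report, while the function names and the sample inputs (80 automation hours a week, 3 UI changes) are hypothetical.

```typescript
type HourRange = { low: number; high: number };

// Figures from the report: 20–30% of automation time goes to locator and
// selector maintenance; one UI change triggers a 4–5 hour batched fix.
const LOCATOR_TAX_SHARE = { low: 0.2, high: 0.3 };
const HOURS_PER_UI_CHANGE = { low: 4, high: 5 };

// Weekly hours lost to the Locator Tax, given total automation hours.
function locatorTaxHours(automationHoursPerWeek: number): HourRange {
  return {
    low: automationHoursPerWeek * LOCATOR_TAX_SHARE.low,
    high: automationHoursPerWeek * LOCATOR_TAX_SHARE.high,
  };
}

// Batched fix hours for a given number of UI changes.
function batchedFixHours(uiChanges: number): HourRange {
  return {
    low: uiChanges * HOURS_PER_UI_CHANGE.low,
    high: uiChanges * HOURS_PER_UI_CHANGE.high,
  };
}

// Hypothetical team: 80 automation hours per week, 3 UI changes this sprint.
const weeklyTax = locatorTaxHours(80);
const sprintFixes = batchedFixHours(3);
console.log(`Locator Tax: ${weeklyTax.low}-${weeklyTax.high} h/week`);
console.log(`Batched fixes: ${sprintFixes.low}-${sprintFixes.high} h`);
```

For the hypothetical team above, that works out to roughly 16 to 24 hours a week on locator maintenance and 12 to 15 hours of batched fixes for the three UI changes.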

How automated is regression testing in 2026?

It varies enormously by org maturity. In our dataset, a Series-A team after 6 months sits at 20–25% coverage; a Series-B at ~42%; a mid-market AI SaaS at 75–80% accurate but 60% autonomous; an enterprise observability SaaS at 85% (the outlier). The honest senior-QA estimate: reported coverage runs ~2x higher than real coverage. Treat published coverage numbers skeptically.

Are AI testing tools replacing SDET hires?

In our dataset, AI testing tools are most often deployed by teams that were about to hire an SDET and chose not to. A mid-level US SDET runs $120–160k base, $200k+ loaded. Whether the tool actually closes the gap depends on your release rhythm and bug taxonomy. "Skip the SDET hire" is the pitch; the buyer has to verify it works for their team.

What's the average cost to maintain a Playwright test suite?

For a 50–200 engineer SaaS team, Playwright maintenance typically costs one mid-level SDET hire ($120–160k base) plus 20–30% of QA bandwidth eaten by selector fixes and flaky-test triage. The fix cost per UI change is 4–5 hours batched. AI agents that discover, build, run, and heal tests on every merge replace the selector-maintenance share but not the test-design judgment.

Which AI testing tool has the largest developer adoption?

We can speak only to our own data: playwright-mcp saw 230,105 npm downloads in 12 months and drove 1.42M agent tool calls across 6,687 distinct IDs. By comparison, Microsoft's @playwright/mcp saw 60.4M downloads in the same window. The agentic-testing category is real and growing; no single vendor owns it. Distribution sits inside the coding-agent ecosystem (Claude Code, opencode, Cursor, Codex, Cline, Gemini CLI).
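The ratios behind these figures are easy to reproduce. The sketch below uses only the numbers quoted above; 1.42M and 60.4M are rounded in the source, so the results are approximate, and the constant and function names are ours for illustration.

```typescript
// Figures quoted in the report. 1.42M and 60.4M are rounded in the source.
const OWN_DOWNLOADS = 230105;
const MICROSOFT_DOWNLOADS = 60.4e6;
const AGENT_TOOL_CALLS = 1.42e6;
const DISTINCT_IDS = 6687;

function ratio(numerator: number, denominator: number): number {
  if (denominator === 0) throw new Error("denominator must be non-zero");
  return numerator / denominator;
}

// Mean tool calls per distinct ID. A mean, not a typical user: a few heavy
// IDs (likely automated agent loops) pull it far above the median.
const meanCallsPerId = ratio(AGENT_TOOL_CALLS, DISTINCT_IDS);

// How many times larger the Microsoft package's download count is.
const downloadMultiple = ratio(MICROSOFT_DOWNLOADS, OWN_DOWNLOADS);

console.log(`mean tool calls per distinct ID: ${meanCallsPerId.toFixed(0)}`);
console.log(`@playwright/mcp download multiple: ${downloadMultiple.toFixed(0)}x`);
```

That comes to about 212 tool calls per distinct ID on average, and a download count for @playwright/mcp about 262 times ours. Because distinct_id mixes client and session identity, the per-ID mean is directional, not a per-human figure.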

How do mid-market SaaS teams decide what to test?

Most don't, formally. The What-to-Test Gap is the most-felt and least-named QA pain in our dataset. The pattern across 41 teams: real test cases come from production incident tickets, customer complaints, and post-mortems, not from pre-release planning. Mature teams (the publicly-traded observability SaaS in our sample) layer business-rule judgment on top of agent-generated case structure. Less mature teams let the customer be the integration test.

What's the N-3 Lag?

The N-3 Lag is the gap between the sprint in which a feature ships and the sprint in which automation actually covers it. In our dataset, automation runs 3 sprints behind dev on average. A team building features in sprint N is regression-testing what shipped in sprint N-3. The pattern was named verbatim by a QA Lead at a Japanese language SaaS: "we're automating current sprint minus three." It's the dominant pattern across teams that quantified the gap.
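If you want to measure your own lag, one way is to record two sprint numbers per feature and average the difference. This is a minimal sketch under that assumption; the FeatureCoverage shape, the function name, and the sample rows are hypothetical and not how the interview data was collected.

```typescript
// Hypothetical record: the sprint a feature shipped in, and the sprint in
// which automated regression first covered it (null = not yet automated).
type FeatureCoverage = {
  feature: string;
  shippedInSprint: number;
  automatedInSprint: number | null;
};

// Average lag in sprints across features that have been automated.
// Returns null when no feature has automation yet.
function averageSprintLag(rows: FeatureCoverage[]): number | null {
  let totalLag = 0;
  let counted = 0;
  for (let i = 0; i < rows.length; i++) {
    const automated = rows[i].automatedInSprint;
    if (automated === null) continue; // still uncovered, excluded from the mean
    totalLag += automated - rows[i].shippedInSprint;
    counted++;
  }
  return counted === 0 ? null : totalLag / counted;
}

// Hypothetical sample: lags of 3, 2, and 4 sprints, plus one uncovered feature.
const sampleRows: FeatureCoverage[] = [
  { feature: "sso-login", shippedInSprint: 10, automatedInSprint: 13 },
  { feature: "bulk-export", shippedInSprint: 11, automatedInSprint: 13 },
  { feature: "audit-log", shippedInSprint: 9, automatedInSprint: 13 },
  { feature: "new-billing", shippedInSprint: 13, automatedInSprint: null },
];
console.log(`average lag: ${averageSprintLag(sampleRows)} sprints`);
```

Excluding uncovered features keeps the mean honest about what has been automated, but it understates the real gap. Track the count of uncovered features alongside it.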