
The Anatomy of an AI-Authored Test
9,103 real test steps from 14 mid-market SaaS teams decoded. Median test is 8 steps. 1 in 8 is an AI assertion. What AI testing actually looks like.
Published 2026-06-12 · Last updated 2026-06-12 · 15-minute read
Pull up the last regression test someone on your team wrote. Count the steps. Count the assertions.
We did this for 9,103 real test steps built by 14 mid-market SaaS teams on QAby.AI between October 2024 and June 2026. The shape of an AI-authored test is not what the marketing said it would be.
TL;DR
- The median test is 8 steps. The mean is 14.8. A handful of 50–182-step tests drag the average up.
- Click + type = 54.5% of all authored steps. The boring bulk of testing is still pressing buttons.
- 1 in 8 steps is AI-driven: assert-ai + ai-magic + extract-content + conditional = 12.2%. That's where regression actually lives now.
- Email and OTP testing exists in real usage: 59 wait-for-email + extract-from-email steps across 5 users at 4 teams.
- 6.1% of steps are module reuse: the silent quality metric nobody talks about.
If your tests are 30+ steps long and zero modules, the patterns below tell you where the leak is.
Bottom line. A real AI-authored test on our platform has 8 steps on the median. 54.5% of steps are clicks and types. 12.2% are AI-driven (assertions, conditionals, extracts, open-ended actions). 6.1% are module reuse. AI doesn't replace the click. It replaces the selector-based assertion that broke every sprint.
How we pulled this data
Source: our production analytics for QAby.AI authoring. Every step authored fires a stepAdded event with its stepType. We aggregated all-time across the platform.
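As a rough illustration of that aggregation, here is a minimal sketch that counts events per step type. Only the `stepAdded` event name and the `stepType` property come from the description above; the `testId` field, the function name, and the sample rows are hypothetical.

```typescript
// Only `stepAdded` and `stepType` come from the article.
// `testId` is a hypothetical field added for illustration.
interface StepAddedEvent {
  event: "stepAdded";
  stepType: string; // e.g. "click", "type", "assert-ai"
  testId: string;
}

// Count how many authored steps fall under each step type.
function countByStepType(events: StepAddedEvent[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const e of events) {
    counts[e.stepType] = (counts[e.stepType] ?? 0) + 1;
  }
  return counts;
}

// Made-up sample events, not rows from the real dataset.
const sampleEvents: StepAddedEvent[] = [
  { event: "stepAdded", stepType: "click", testId: "t1" },
  { event: "stepAdded", stepType: "type", testId: "t1" },
  { event: "stepAdded", stepType: "click", testId: "t2" },
  { event: "stepAdded", stepType: "assert-ai", testId: "t2" },
];

const sampleCounts = countByStepType(sampleEvents);
console.log(sampleCounts);
```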
The dataset: 9,103 step events, 14 teams, 46 users, 616 tests with steps, 627 test runs, 109 modules, 73 test plans. 1,124,146 total events project-wide, October 2024 → June 2026.
This is early-stage telemetry. Some users are internal team or POC. Don't read the absolute counts as a market benchmark. Read the shape (the median test length, the step-type mix, the AI share) as directional. The broader context (team shapes, the Locator Tax, the N-3 Lag) is in The State of AI QA in Mid-Market SaaS 2026.
What does a real AI-authored test look like?
A real AI-authored test on our platform is shorter, leaner, and more click-heavy than any vendor demo.
The median test is 8 steps
Across 616 tests with at least one step, the median is 8 steps. Mean: 14.8. Max: 182.
That gap is the signature of a fat-tail distribution. Most tests are short, focused, and check one user journey. A handful of 50–182-step monsters pull the average up. Benchmark against the median, not the mean.
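To see how a single long test moves the mean but not the median, here is a small sketch. The step counts are invented for illustration (nine short tests plus one 182-step outlier) and are not taken from the dataset.

```typescript
// Arithmetic mean of a list of numbers.
function mean(xs: number[]): number {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

// Median: middle value of the sorted list (average of the two middle
// values when the list has an even length).
function median(xs: number[]): number {
  const sorted = xs.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0
    ? (sorted[mid - 1] + sorted[mid]) / 2
    : sorted[mid];
}

// Hypothetical step counts: nine short tests and one 182-step outlier.
const stepCounts = [5, 6, 7, 8, 8, 8, 9, 10, 12, 182];

console.log(median(stepCounts)); // 8
console.log(mean(stepCounts));   // 25.5
```

One outlier more than triples the mean here while the median stays at 8, which is why the median is the better benchmark for a fat-tailed suite.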
Eight steps is enough room for: navigate, log in, click a thing, fill a field, submit, assert. That's a real test. That's most real tests.
The broader industry guidance lands in the same place. The E2E test performance benchmarks at testdino and the DEV community discussion on long vs short tests both reach the same conclusion: scope each test to a single critical user journey, and parallelize. A 30-step test is two tests pretending to be one.
Click + type still dominates
The full step-type mix from 9,103 real steps:
| Step type | Count | Share | What it does |
|---|---|---|---|
| click | 3,618 | 39.7% | Tap a button, open a menu |
| type | 1,347 | 14.8% | Fill an input |
| assert-ai | 762 | 8.4% | AI assertion: "the cart should contain 3 items" |
| scroll | 578 | 6.3% | Bring an element into view |
| wait | 578 | 6.3% | Wait for a condition |
| module (reuse) | 553 | 6.1% | Call a shared sub-flow |
| javascript | 210 | 2.3% | Custom code injection |
| extract-content | 165 | 1.8% | AI extracts a value to assert against |
| navigate | 153 | 1.7% | Go to a URL |
| comment | 151 | 1.7% | Inline note for humans |
| select | 123 | 1.4% | Pick from a dropdown |
| conditional | 92 | 1.0% | Branch based on UI state |
| ai-magic | 91 | 1.0% | Open-ended AI action: "select date as tomorrow" |
| file-upload | 77 | 0.8% | Attach a file |
| wait-for-email | 33 | 0.4% | Wait for an inbox message |
| extract-from-email | 26 | 0.3% | Pull an OTP or token out of an email |
Click + type = 54.5%. More than half of all authoring is the same input-and-button work that has defined E2E testing since Selenium. AI doesn't change that.
1 in 8 authored steps is AI-driven (assert-ai, ai-magic, extract-content, conditional). Click + type still dominate the boring bulk.
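The two headline shares can be re-derived from the table counts. A minimal sketch: the counts are copied from the table, the denominator is the article's stated total of 9,103 steps, and the function and variable names are illustrative.

```typescript
// Counts copied from the step-type table (only the rows used below).
const stepTypeCounts: Record<string, number> = {
  click: 3618,
  type: 1347,
  "assert-ai": 762,
  "extract-content": 165,
  conditional: 92,
  "ai-magic": 91,
};

// Total authored steps as reported in the article.
const TOTAL_STEPS = 9103;

// Combined share of the given step types, as a percentage rounded to
// one decimal place.
function sharePct(types: string[]): number {
  const n = types.reduce((sum, t) => sum + (stepTypeCounts[t] ?? 0), 0);
  return Math.round((n / TOTAL_STEPS) * 1000) / 10;
}

console.log(sharePct(["click", "type"])); // 54.5
console.log(
  sharePct(["assert-ai", "ai-magic", "extract-content", "conditional"])
); // 12.2
```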
Key takeaways
- Median real test = 8 steps. The mean (14.8) is dragged up by 50–182-step outliers that should probably be split.
- Click + type still dominates at 54.5% of all authored steps. AI doesn't replace the click.
- AI-driven steps total 12.2%. The biggest slice is assert-ai at 8.4%, which replaces the selector-based assertion that broke every sprint.
- 6.1% of steps are module reuse. A mature suite hits 15–25%. Most teams under-use this lever.
What the 12.2% AI slice is actually doing
AI-driven steps total 12.2%: assert-ai (8.4%) + ai-magic (1.0%) + extract-content (1.8%) + conditional (1.0%), or 1,110 of 9,103 steps. One in eight steps is an AI step, not a click. The mix inside that 12.2% is the read.
A note on unit: a "module" counts as one step in these numbers even if it contains multiple sub-steps inside it. The 8-step median includes module references as single units.
Assertions are the biggest AI slice
assert-ai alone is 8.4%: the third-most-used step type, behind only click and type. Higher than scroll, wait, or reused modules.
Instead of expect(page.locator('.cart-count')).toHaveText('3'), the user writes "the cart should contain 3 items." The agent figures out which element holds the count.
The implication: the part of a test that breaks first when your UI changes (the selector-based assertion) is the part teams are happiest to hand to AI. The click was always cheap. The assertion carried the maintenance burden, and that's the slice AI is taking.
A senior QA practitioner at a publicly traded enterprise observability SaaS in our dataset put it well: AI generates the case structure well, but business-rule assertions are where domain knowledge bites back. Teams use assert-ai to write assertions in the language of the business, not in CSS selectors.
ai-magic is the wildcard slice
ai-magic is 1.0% (91 steps). Small. It's the open-ended slice: a step that says "select date as tomorrow" or "complete checkout with a $10 promo code." The agent figures out the steps.
The fact that it's only 1.0% is the read. Real teams are not handing over their entire test to an agent and walking away. They use AI for the parts that benefit from judgment (assertion, conditional branch, extract) and stay in control of the clicks and types.
That gap between vendor marketing and real authoring matters. The pitch that AI replaces every step isn't happening in the data. The pitch that AI does the 12% you couldn't reliably automate before is happening, every day.
Extract and conditional close the loop
- extract-content (1.8%): pull a value out of the page for a later step. Order numbers, IDs, computed prices.
- conditional (1.0%): branch the test on UI state. If the cookie banner appears, dismiss it; if a modal shows, click through.
Both close the same gap: tests fail because the page isn't always the same. A signed-in user sees one thing, a new visitor sees another, the modal shows on Tuesdays. AI conditionals and extracts let one test handle both branches without forking.
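For comparison, this is roughly what the same branch looks like when written by hand. It is a sketch only: `PageLike` is a hypothetical minimal interface defined here so the example is self-contained (it is neither Playwright's API nor QAby.AI's implementation), and the selectors are placeholders.

```typescript
// Hypothetical minimal page interface, defined here so the sketch is
// self-contained. Not Playwright's API and not QAby.AI's.
interface PageLike {
  isVisible(selector: string): Promise<boolean>;
  click(selector: string): Promise<void>;
}

// Click an optional element only if it is currently visible.
// Returns true when the element was present and dismissed.
async function dismissIfPresent(
  page: PageLike,
  selector: string
): Promise<boolean> {
  if (await page.isVisible(selector)) {
    await page.click(selector);
    return true;
  }
  return false;
}

// One flow that handles both branches: banner shown or banner absent.
// Both selectors are illustrative placeholders.
async function checkoutFlow(page: PageLike): Promise<void> {
  await dismissIfPresent(page, "#cookie-banner");
  await page.click("#checkout");
}
```

The point of the design is that the optional element never becomes a second test: the branch lives inside one flow, and the main path runs either way.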
When you evaluate an AI testing tool, ask how it handles "the modal that sometimes shows." If the answer is "you write two tests," you're paying twice.
The module-reuse signal nobody talks about
6.1% of steps are module calls: 553 of 9,103.
A module is a reusable sub-flow: "log in," "add to cart," "create a new project." Other tests call it. When the login flow changes, you update one module, not 50 tests.
Teams that reuse modules survive UI changes. Teams that don't reuse them build the same login into every test, then spend 4–5 hours fixing 50 tests when login changes (the Locator Tax documented in State of AI QA 2026).
We want that 6.1% number to climb. A mature suite reuses 15–25% of its steps via modules. Most teams are still under-using them.
The audit question: how many of your tests log in fresh, click the same nav, set up the same fixture? Each one is a missed module. Track your reuse rate, not your test count.
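Tracking the reuse rate is simple arithmetic once steps are exported. A minimal sketch, assuming a hypothetical step model in which each module call is a single step of type "module" (matching the counting unit used in this post); the type names and sample tests are invented.

```typescript
// Hypothetical step model: a module call counts as one step.
interface Step {
  type: string; // e.g. "click", "type", "module"
}

interface TestCase {
  name: string;
  steps: Step[];
}

// Fraction of all steps (0..1) that are module calls.
function moduleReuseRate(tests: TestCase[]): number {
  let total = 0;
  let moduleCalls = 0;
  for (const t of tests) {
    for (const s of t.steps) {
      total++;
      if (s.type === "module") moduleCalls++;
    }
  }
  return total === 0 ? 0 : moduleCalls / total;
}

// Helper to build step lists from type names.
const stepsOf = (...types: string[]): Step[] => types.map((type) => ({ type }));

// Invented sample: one test calls a login module, the other rebuilds
// the login inline with navigate/type/type/click.
const sampleSuite: TestCase[] = [
  { name: "add to cart", steps: stepsOf("module", "click", "type", "assert-ai") },
  { name: "edit profile", steps: stepsOf("navigate", "type", "type", "click", "click", "assert-ai") },
];

console.log(moduleReuseRate(sampleSuite)); // 0.1
```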
Email and OTP testing: the niche with telemetry
33 wait-for-email + 26 extract-from-email steps = 59 total, across 5 users at 4 teams. 0.65% of all steps. A rounding error in volume. A real pain point in real usage.
Email testing is the workflow most platforms quietly skip. Sign-up needs an OTP. Magic-link login reads an inbox. Password resets, billing receipts, and transactional notifications are all routed through email. The classic Playwright suite either mocks them (and doesn't test the real flow) or hits a real inbox (and gets flaky).
Mailosaur, MailSlurp, and Bugbug all document the same failure modes: shared test inboxes pile up emails, parser logic scatters, polling loops break when templates change.
4 of our 14 teams found this painful enough to use first-class step types for it. That's directional. The long tail of "hard to automate" flows is real and named, and the teams that hit it once use the feature constantly afterward.
For teams shopping AI testing tools: ask how the vendor handles email OTPs. The honest answer is rare. The right one is "we wait for the real email and extract the code."
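A hand-rolled version of "wait for the real email and extract the code" might look like the sketch below. Everything here is an assumption for illustration: the `EmailMessage` shape, the `fetchInbox` callback (a real suite would back it with its email provider's API), the timeouts, and the six-digit OTP format. It is not how QAby.AI implements these steps.

```typescript
// Hypothetical message shape and inbox-fetch callback.
interface EmailMessage {
  subject: string;
  body: string;
  receivedAt: number; // epoch milliseconds
}
type FetchInbox = () => Promise<EmailMessage[]>;

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Poll the inbox until a matching message that arrived at or after
// `since` shows up, or throw once the timeout has elapsed. Filtering
// on `since` ignores stale mail piled up in a shared test inbox.
async function waitForEmail(
  fetchInbox: FetchInbox,
  since: number,
  matches: (m: EmailMessage) => boolean,
  timeoutMs: number = 30000,
  intervalMs: number = 1000
): Promise<EmailMessage> {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    const inbox = await fetchInbox();
    const hit = inbox.find((m) => m.receivedAt >= since && matches(m));
    if (hit) return hit;
    if (Date.now() >= deadline) {
      throw new Error("Timed out waiting for email");
    }
    await sleep(intervalMs);
  }
}

// Extract the first standalone six-digit code from an email body.
// Assumes the OTP is exactly six digits; returns null if none is found.
function extractOtp(body: string): string | null {
  const match = body.match(/\b\d{6}\b/);
  return match ? match[0] : null;
}
```

A test would record a timestamp before triggering the sign-up, call `waitForEmail` with that timestamp, and pass the message body to `extractOtp`. Keeping the parser in one function is what stops template changes from breaking every test separately.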
The 4-question audit
If you compress the data into a buyer-side audit:
- What's your median test length? Much over 15 steps = combining multiple flows into single tests. The teams shipping confidently are at 8.
- What's your AI-driven step share? If you're at 0%, your assertions are still selector-based, carrying the maintenance burden. The platform-wide share is 12.2%.
- What's your module reuse rate? Below 5% = every test rebuilding the same login. The 4–5 hour cost-per-UI-change comes from here.
- Can your suite test an email OTP flow without mocking? If no, your signup, magic-link, and password-reset flows aren't being tested.
If three or four of these questions flag a problem, the gap isn't your tool. It's the design model your tool encourages. The forks are in Mabl vs QAby.AI and Playwright vs QAby.AI. The buyer-side rubric is in How to evaluate AI testing tools.
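The four questions can be turned into a small checklist function. This is a sketch under stated assumptions: the thresholds (over 15 steps, 0% AI share, below 5% reuse) are the ones named in the audit list, while the `SuiteMetrics` shape and the message wording are hypothetical.

```typescript
// Hypothetical summary of a suite; shares are fractions in 0..1.
interface SuiteMetrics {
  medianSteps: number;
  aiStepShare: number;
  moduleReuseRate: number;
  testsRealEmailFlow: boolean;
}

// Return one flag per audit question the suite fails.
function auditSuite(m: SuiteMetrics): string[] {
  const flags: string[] = [];
  if (m.medianSteps > 15) {
    flags.push("Median test length over 15 steps: tests likely combine multiple flows");
  }
  if (m.aiStepShare === 0) {
    flags.push("No AI-driven steps: assertions are still selector-based");
  }
  if (m.moduleReuseRate < 0.05) {
    flags.push("Module reuse below 5%: tests rebuild the same setup");
  }
  if (!m.testsRealEmailFlow) {
    flags.push("Email/OTP flows are mocked or untested");
  }
  return flags;
}

// Invented example suite that fails three of the four checks.
const exampleFlags = auditSuite({
  medianSteps: 22,
  aiStepShare: 0,
  moduleReuseRate: 0.08,
  testsRealEmailFlow: false,
});
console.log(exampleFlags.length); // 3
```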
The pain frame
Devs ship faster than QA tests. We close the gap.
The step-mix tells you how. Click + type stays. The 12% that broke every sprint (assertions and conditionals) gets handed to an agent that fixes itself when your UI changes. Release confidence at engineering velocity.
If your current suite is mostly selector-based assertions, fragile email-flow tests, and zero modules, the patterns above tell you where the leak is.
Frequently asked questions
What does an AI-authored test actually look like?
It's mostly the same shape as a traditional test: 8 steps on the median, dominated by click (39.7%) and type (14.8%). The difference is the 12.2% AI-driven slice: assert-ai, ai-magic, extract-content, and conditional steps that replace selector-based assertions and branching logic. AI does the 1-in-8 hardest part; the rest stays normal.
How many steps should a regression test have?
The data says 8 on the median, mean of 14.8 across 616 real tests. Industry benchmarks reach the same conclusion: scope each test to a single critical user journey. If a test is over 25 steps, it's usually two tests pretending to be one. Split it and parallelize for faster feedback.
What share of test steps in AI testing tools are actually AI?
In our telemetry of 9,103 real steps, 12.2% (1 in 8). The biggest slice is assert-ai at 8.4%, then extract-content at 1.8%, then conditional and ai-magic at 1.0% each. Most of the test is still clicks and types. The AI's value sits in the assertion and branching, not in replacing the entire authoring loop.
Why is email testing so painful to automate?
Real inboxes are flaky: shared addresses pile up, templates evolve, parser logic scatters, polling loops break. Most teams either mock the email or fight the real inbox. In our data, 4 teams used first-class wait-for-email and extract-from-email steps to handle it. The fix is platform-level email handling, not custom code per test.
What is module reuse and why does it matter?
A module is a reusable sub-flow (log in, add to cart, create project) that other tests call. 6.1% of steps in our data are module calls. Teams with high reuse survive UI changes; teams with low reuse rewrite 50 tests when login changes. A mature suite reuses 15–25% of its steps via modules.
Should I trust the mean or median test length?
Median. The mean (14.8) is dragged up by a small number of 50–182-step tests that should probably be split. The median (8) is what most real tests actually look like. Anchoring to the mean makes you think your tests are too short.
How does AI handle conditional UI state like a cookie banner that sometimes shows?
In our data, conditional steps (1.0% of all steps) let one test branch on UI state ("if the banner is present, dismiss it"). Without this, teams write two separate tests or live with flakiness. Ask any AI testing vendor how it handles "the modal that sometimes shows." If the answer is "write two tests," you're paying twice.
About this post
Author: Himanshu Saleria, Co-founder & CEO, QAby.AI. Background in QA-led product engineering at scale; running QAby.AI's customer research, telemetry analysis, and product. LinkedIn.
Sources and further reading
Internal:
- The State of AI QA in Mid-Market SaaS 2026: the parent artifact, n=41 calls + telemetry context
- Playwright vs QAby.AI: framework-code vs agent-led authoring fork
- The SDET You Don't Have to Hire Next Quarter: cost math against the SDET hire
- Mabl vs QAby.AI: QA-Lead-platform vs engineer-owned-agents fork
- How to evaluate AI testing tools: buyer-side checklist
- /compare/manual-qa: head-to-head page
External:
- PostHog product analytics documentation: the event-based instrumentation model behind this dataset
- E2E test performance benchmarks (testdino): industry test-length benchmarks that triangulate the median = 8 finding
