What 230,000 Playwright MCP Downloads Taught Us About AI Agents in CI/CD

230,105 npm downloads, 1.42M agent tool calls, 187 MCP clients, 5,904 domains tested. The activation cliff, the screenshot habit, and the localhost truth.

Himanshu Saleria

Published June 13, 2026 · 24 min read

Playwright MCP · AI Agents · CI/CD · Data · Agentic Testing

Published 2026-06-12 · Last updated 2026-06-12 · 18-minute read

Looking to install playwright-mcp? The /playwright-mcp landing page has 30-second setup snippets for Claude Code, Cursor, and Claude Desktop, plus the MCP tool reference. This post is the data POV behind it.

An open-source MCP server was downloaded 230,000 times in twelve months. Most of the people who tried it never came back.

That sentence is the real story of agentic browser testing in 2026, not the headline download count, not the launch-day spike, not the LinkedIn posts about "AI testing is here." The story is what the curve underneath the curve says about how AI agents actually test in CI/CD when nobody is watching.

We run the playwright-mcp server on npm. It's open source. It's been instrumented since November 2025. The data below is what we see: 1.42 million agent tool calls, 6,687 distinct IDs, 5,904 domains tested, 187 different MCP clients on the other end of the wire. Treat the user counts as directional (a distinct_id is a client session, not a verified human), but the patterns hold.

This post is not a comparison and it's not a launch announcement. It's a data POV. The conclusions are uncomfortable in places, including for us.

TL;DR

230,105 downloads in 12 months. 41% drop off after 5 tool calls. Activation, not adoption, is the unsolved problem.
Agents drive by sight. 643K screenshots vs ~100K DOM snapshots. Pixels are how AI agents decide, not the accessibility tree.
Localhost is the #1 site agents test. 272 users running tests against 127.0.0.1. The dev loop is the killer use case, not staging.
187 distinct MCP clients. Claude Code leads by users (980). Opencode dominates by volume (558K events from 358 users). No single coding agent owns this channel.
Devs ship faster than QA tests, and the same pattern repeats here. Agents can build a Playwright run in minutes, but the top 1% of users generate 73% of the traffic while the median user stops at 8 tool calls. That gap is what most vendor decks ignore.

Bottom line. 230,105 npm downloads of playwright-mcp over 12 months produced 1.42M agent tool calls, but 41% of users stopped at five or fewer events. Agents called get_screenshot about 6.4 times as often as all DOM-snapshot tools combined (643K vs ~100K), which works out to roughly 2.4 screenshots per browser session. Localhost (127.0.0.1) was the single most-tested URL. The agentic testing category is real and growing, but activation, not adoption, is the unsolved problem.

For the broader mid-market QA context behind these numbers, see the State of AI QA 2026 report.

1. How big is "230,000 downloads" actually?

The honest answer: it's a respectable utility-package number that re-accelerated after a dip. It's also tiny compared to Microsoft's package.

Both things are true. Don't pick the one you like.

The download curve

playwright-mcp is the npm package that ships our agent-ready MCP server. The figures below were pulled via npm-stat on 2026-06-10 for the window 2025-06-09 → 2026-06-09:

Month	Downloads
2025-07 (launch)	35,973
2025-08	23,722
2025-09	15,968
2025-10	9,792
2025-11	10,491
2025-12	9,313
2026-01	15,762
2026-02	19,102
2026-03	26,167
2026-04	25,974
2026-05	25,898

The peak day was 3,561 downloads on 2025-07-25, and the launch month did 35,973. Then came the usual launch-spike cool-off: monthly downloads fell to 9,792 in October 2025 and bottomed at 9,313 in December. By March 2026 they were back to 26,167, and May 2026 came in at 25,898. The last 30 days held ~823/day.

So: 12 months in, monthly traffic is ~70% of launch month, but the shape is a recovery curve, not a flat line. Recoveries are interesting because they aren't bought with marketing. Somebody using the package told somebody else.
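The recovery shape can be checked directly from the monthly table. The sketch below is only arithmetic on the published figures; the variable names are ours and nothing here queries npm.

```python
# Monthly npm downloads, copied from the table above (full months only).
monthly_downloads = {
    "2025-07": 35_973,
    "2025-08": 23_722,
    "2025-09": 15_968,
    "2025-10": 9_792,
    "2025-11": 10_491,
    "2025-12": 9_313,
    "2026-01": 15_762,
    "2026-02": 19_102,
    "2026-03": 26_167,
    "2026-04": 25_974,
    "2026-05": 25_898,
}

launch = monthly_downloads["2025-07"]

# Lowest month in the window.
trough_month = min(monthly_downloads, key=monthly_downloads.get)
trough = monthly_downloads[trough_month]

# "YYYY-MM" keys sort chronologically, so max() gives the latest month.
latest_month = max(monthly_downloads)
latest = monthly_downloads[latest_month]

share_of_launch = latest / launch      # latest month relative to launch month
recovery_multiple = latest / trough    # latest month relative to the trough

print(f"trough: {trough_month} ({trough:,})")
print(f"{latest_month} as share of launch month: {share_of_launch:.0%}")
print(f"{latest_month} vs trough: {recovery_multiple:.1f}x")
```

On these figures the trough is December 2025, the latest full month sits at about 72% of launch month, and it is roughly 2.8x the trough.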

Chart 5. Playwright MCP monthly tool-call volume (thousands). 1.42M total agent tool calls since November 2025 across 6,687 distinct IDs. March peak driven by a handful of heavy IDs running automated loops; June 2026 is partial-month.

Chart 6. MCP client market share by tool-call events (thousands). 187 distinct MCP client types observed. claude-code leads by install base (980 users). opencode dominates by volume.

The category comparison

Same window, Microsoft's @playwright/mcp package shipped 60.4 million downloads. We did 226K in the comparable count. That's a ~267x gap.

Honest framing: we're not Microsoft. The point of citing it isn't to claim parity. It's to confirm the category is real. @playwright/mcp going from 332K downloads/month a year ago to 13.4M/month today is proof enough that AI agents driving browsers is now a default workflow inside coding agents. Our re-accelerating curve sits inside that broader 40x-per-year category growth.
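For readers who want to reproduce the two ratios in this comparison, here is the arithmetic as a short sketch. The inputs are the rounded figures quoted above, so the outputs are approximate by construction.

```python
# Rounded 12-month download totals quoted in the text.
microsoft_12mo = 60_400_000   # @playwright/mcp
ours_12mo = 226_000           # playwright-mcp, comparable count

# Rounded monthly download figures for @playwright/mcp quoted in the text.
ms_monthly_year_ago = 332_000
ms_monthly_now = 13_400_000

gap = microsoft_12mo / ours_12mo                        # size gap between packages
category_growth = ms_monthly_now / ms_monthly_year_ago  # year-over-year multiple

print(f"download gap: ~{gap:.0f}x")
print(f"category growth: ~{category_growth:.0f}x per year")
```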

The bigger story is what happens inside the downloads. Downloads are a vanity number. Tool calls are the truth.

2. What 1.42 million agent tool calls actually look like

We tag every tool call coming through the MCP server with an mcp_client ID. Since 2025-11-05 the server has handled 1,420,000+ tool calls from 6,687 distinct IDs, touching 5,904 distinct domains.

The shape of those calls is where the surprises live.

Agents drive by sight, not by DOM

Tool call	Count
`get_screenshot`	643,424
`execute_code`	401,482
`init_browser`	264,268
`get_text_snapshot`	39,595
`get_interactive_snapshot`	31,486
`get_full_snapshot`	29,273

643K screenshots ÷ 264K browser inits = 2.4 screenshots per session. The three DOM-snapshot calls combined total ~100K, less than a sixth of the screenshot volume.

This is the most counterintuitive finding in the dataset.
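The ratios in this section follow from the tool-call table. A minimal sketch, using only the six tools listed there (the table may not cover every tool the server exposes, so the shares are relative to these six):

```python
# Tool-call counts, copied from the table above.
tool_calls = {
    "get_screenshot": 643_424,
    "execute_code": 401_482,
    "init_browser": 264_268,
    "get_text_snapshot": 39_595,
    "get_interactive_snapshot": 31_486,
    "get_full_snapshot": 29_273,
}
dom_tools = ("get_text_snapshot", "get_interactive_snapshot", "get_full_snapshot")

listed_total = sum(tool_calls.values())
dom_total = sum(tool_calls[name] for name in dom_tools)

# Screenshots per browser session, treating each init_browser as one session.
screenshots_per_session = tool_calls["get_screenshot"] / tool_calls["init_browser"]

# How many screenshots are taken for every DOM-snapshot call.
screenshots_per_dom_call = tool_calls["get_screenshot"] / dom_total

print(f"DOM-snapshot calls combined: {dom_total:,}")
print(f"screenshots per session: {screenshots_per_session:.1f}")
print(f"screenshots per DOM-snapshot call: {screenshots_per_dom_call:.1f}")
for name in ("get_screenshot", "execute_code", "init_browser"):
    print(f"{name}: {tool_calls[name] / listed_total:.0%} of listed calls")
print(f"DOM snapshots: {dom_total / listed_total:.0%} of listed calls")
```

Note the two different ratios: about 2.4 screenshots per browser session, and about 6.4 screenshots for every DOM-snapshot call.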

The conventional architectural advice from the MCP-as-accessibility-tree camp says agents should consume structured page representations because they're smaller, faster, cheaper in tokens, and more deterministic. That's the right advice from a token-economics standpoint. It's not what's happening in production.

Agents prefer pixels. They look at the page, decide what to do, act, look again. Closer to a human QA tester clicking around on day one than to a Selenium engineer querying the DOM.

Two reads on why:

Visual is more robust to UI changes than CSS selectors. A screenshot doesn't care if the developer renamed the class. Find-the-button-by-pixel is more forgiving than find-the-button-by-data-test-id.
Frontier multimodal models got cheap enough. When the marginal cost of a vision token dropped, the architectural cost-benefit of "give the agent a DOM tree" flipped. Vision is now the default mode of comprehension.

If you're building AI testing tooling (and we are), the implication is uncomfortable: the test runner's API surface needs to assume the agent will mostly use the camera. The DOM is a fallback.

Chart 2. Tool-call mix. Donut chart of get_screenshot (45%) vs execute_code (28%) vs init_browser (19%) vs all DOM-snapshot calls combined (~7%). Screenshot calls dominate playwright-mcp traffic over DOM-snapshot calls, at 2.4 screenshots per session.

The cross-client ecosystem

Every tool call carries which MCP client is on the other end. We see 187 distinct clients. The top ten:

MCP client	Events	Distinct users
opencode	558,343	358
claude-code	355,654	980
mcp (generic)	176,539	189
(null, untagged early adopters)	139,096	3,416
codex-mcp-client	62,589	316
cursor-vscode	47,721	601
Visual Studio Code	10,505	126
claude-ai	6,732	104
mcporter	3,993	200
qwen-cli-mcp-client	3,056	37

Plus a long tail of Gemini CLI, Cline, Windsurf, Trae, Kiro, GitHub Copilot, Antigravity, LM Studio.

Two read-outs that matter:

Claude Code wins on install base. Opencode wins on session intensity. Claude Code has 2.7x the users (980 vs 358) but opencode pushes 1.6x the volume (558K vs 356K). Different shapes of work. Claude Code users tend to be developers using the MCP for verification inside a coding loop (short bursts of "did this work?"). Opencode users skew toward longer autonomous loops (fewer humans, more agent miles).

No single coding agent owns this channel. If you were building product around "we're the AI testing tool for Cursor" or "we're the AI testing tool for Claude Code," you'd be wrong. The agentic-testing distribution channel is the entire coding-agent ecosystem at once. That's an opportunity for an independent player and a constraint at the same time. You have to be neutral.

The "untagged" bucket is interesting on its own. 3,416 distinct IDs predate our client-tagging instrumentation. By raw headcount it's the largest single segment, meaning early adopters of playwright-mcp were ahead of the tagging standard. The category was bottoms-up; the tooling around the category caught up later.
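The "install base vs session intensity" contrast is easier to see as events per distinct ID. This is a rough sketch over four rows of the table above; it computes a mean per client, which a few heavy IDs can skew, so treat it as directional.

```python
# (events, distinct users) per MCP client, copied from the table above.
clients = {
    "opencode": (558_343, 358),
    "claude-code": (355_654, 980),
    "codex-mcp-client": (62_589, 316),
    "cursor-vscode": (47_721, 601),
}

# Mean events per distinct ID for each client.
intensity = {name: events / users for name, (events, users) in clients.items()}

user_ratio = clients["claude-code"][1] / clients["opencode"][1]
volume_ratio = clients["opencode"][0] / clients["claude-code"][0]

# Print clients from most to least intensive.
for name, per_user in sorted(intensity.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:18s} {per_user:8.0f} events per distinct ID")
print(f"claude-code has {user_ratio:.1f}x the users of opencode")
print(f"opencode has {volume_ratio:.1f}x the events of claude-code")
```

On these numbers an opencode ID averages roughly 1,560 events against roughly 363 for a claude-code ID, which is the "fewer humans, more agent miles" pattern in numeric form.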

Key takeaways

230,105 downloads sounds like adoption. 41% drop-off after 5 tool calls is the real story: activation is the unsolved problem.

Agents drive the browser by sight: 643K screenshots vs ~100K DOM snapshots. The accessibility-tree paradigm is losing to vision.

Localhost (127.0.0.1) is the single most-tested URL, ahead of every public domain. The dev loop, not staging, is the killer use case.

No single coding agent owns the channel. 187 distinct MCP clients drive the server. Neutrality is the distribution strategy.

3. The activation cliff nobody talks about

This is the section that will not flatter anyone in the AI testing category, us included.

The distribution is brutally power-law

Metric	Value
Mean tool calls per user	211.8
Median tool calls per user	8
90th percentile	151
99th percentile	1,627
Max single ID	520,076

The mean is 26x the median. That's the signature of a distribution where the average is a lie.

The top 1% of users (67 IDs) drove 1,030,031 events. That's 73% of all traffic. The top three IDs alone drove 57%. The median user ran 8 tool calls and 3 browser sessions, then disappeared.
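A quick consistency check on the power-law figures, using only the numbers published above. The total is implied from mean × users rather than taken from raw logs, and the "everyone else" mean is derived here, not a reported metric.

```python
users = 6_687            # distinct IDs
mean_calls = 211.8       # mean tool calls per user
median_calls = 8         # median tool calls per user
top1_events = 1_030_031  # events driven by the top 1% of IDs

top1_ids = round(users * 0.01)        # size of the top-1% cohort
implied_total = users * mean_calls    # total events implied by the mean

mean_over_median = mean_calls / median_calls
top1_share = top1_events / implied_total

# Mean for the remaining 99% once the top-1% cohort is removed (derived).
rest_mean = (implied_total - top1_events) / (users - top1_ids)

print(f"top 1% cohort size: {top1_ids} IDs")
print(f"implied total events: {implied_total:,.0f}")
print(f"mean / median: {mean_over_median:.1f}x")
print(f"top 1% share of traffic: {top1_share:.0%}")
print(f"mean for the other 99%: {rest_mean:.0f} tool calls")
```

Removing 67 IDs drops the mean from about 212 to under 60, which is why the median is the more honest summary here.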

Chart 3. Activation cliff. Histogram of users by tool-call count, log scale on x. Modal bucket at ≤5 events (2,752 users, 41%). Long thin tail out to 520K events for the top user.

What the cohorts look like

Cohort (of 6,687 users)	Users	Share
Tried exactly once	876	13%
≤ 5 events total	2,752	41%
≥ 100 events	895	13%
≥ 1,000 events	118	1.8%
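The shares in this table are each cohort divided by the 6,687 distinct IDs. The sketch below recomputes them and also derives the middle band (6 to 99 events), which the table does not list; that band is our subtraction, not a reported figure.

```python
total_users = 6_687

# Cohort sizes, copied from the table above. Cohorts overlap:
# "tried exactly once" is inside "<= 5 events", and ">= 1,000" is inside ">= 100".
cohorts = {
    "tried exactly once": 876,
    "<= 5 events": 2_752,
    ">= 100 events": 895,
    ">= 1,000 events": 118,
}

shares = {name: count / total_users for name, count in cohorts.items()}

# Users between the drop-off bucket and the activated bucket (derived).
middle_band = total_users - cohorts["<= 5 events"] - cohorts[">= 100 events"]
middle_share = middle_band / total_users

for name, share in shares.items():
    print(f"{name:20s} {share:6.1%}")
print(f"{'6 to 99 events':20s} {middle_share:6.1%}  ({middle_band:,} users)")
```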

The 41% drop-off is the number to remember. Forty-one percent of the distinct IDs that reached the server ran five or fewer tool calls and never came back.

That's not the failure mode you read about in MCP launch posts. Launch posts measure stars and downloads. The honest measure is: did the developer get to a successful first test in their first session, and did they come back the next day?

For most of the 41%, the answer is no.

We have hypotheses about why:

The first browser session takes too long. init_browser failing on a developer's first try (corporate proxy, missing system dep, wrong Node version, headless config) eats their patience. They never see the wow moment.
The agent runs out of context before it finds anything useful. A get_screenshot of a heavy SPA is 1–2k tokens. Three of those in a row blow past the context cap of cheaper models. The agent loop fails silently.
The killer use case isn't obvious. Developer installs, reads README, tries one thing, doesn't see the connect-to-my-real-test-suite path. Closes terminal.

The deeper read: the AI testing category in 2026 is one where adoption is essentially free (everyone runs npm install) and activation is essentially everything. The vendor that figures out how to convert the median 8-call developer into a 100+ call developer wins the category. That's not been figured out yet, by us or anyone else we can see in the data.

The corollary: if you're shopping AI testing tools and you can't get a successful test running in your first session, you'll be the 41%. Insist on a working first-session story before you put it in CI. (Same buyer-side reasoning we outlined in how to evaluate AI testing tools.)

4. Where agents actually point the browser

Agent traffic emits a pageUrl field on execute_code calls (we get it for ~81% of the data). 1,674 distinct users contributed URL data. What they automate is illuminating.

Localhost is the #1 site

The single biggest URL bucket is 127.0.0.1: 272 users, 18,762 events.

That's 16% of all URL-emitting users testing against their own local dev server. More than Google, more than GitHub, more than every Chinese platform combined.

This is the dev-loop story. The architectural assumption that AI testing happens against staging or production (the place where the test cases live, far from the developer's laptop) is wrong by the numbers. Real agents test against the localhost build of the change the developer made an hour ago.

The implication for the build of an agentic testing product: localhost as a first-class target matters more than "we have a hosted cloud grid." The dev loop is the killer use case. CI/CD orchestration is downstream.
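For anyone running a similar analysis on their own telemetry, here is one way to separate dev-loop traffic from public-web traffic. It is a sketch under assumptions: `is_loopback` is a hypothetical helper of ours, and it folds `localhost` and `::1` in with 127.0.0.1, which may differ from how the bucket above was actually counted.

```python
import ipaddress
from urllib.parse import urlsplit


def is_loopback(page_url: str) -> bool:
    """Return True if the URL points at the local machine (hypothetical helper)."""
    host = urlsplit(page_url).hostname  # lowercased, brackets stripped for IPv6
    if host is None:
        return False  # e.g. about:blank has no host
    if host == "localhost" or host.endswith(".localhost"):
        return True
    try:
        # Covers 127.0.0.0/8 and ::1.
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # a regular DNS name such as github.com


# Share of URL-emitting users in the localhost bucket, from the figures above.
localhost_users = 272
url_emitting_users = 1_674
localhost_share = localhost_users / url_emitting_users

print(is_loopback("http://127.0.0.1:3000/checkout"))
print(is_loopback("https://github.com/"))
print(f"localhost share of URL-emitting users: {localhost_share:.0%}")
```

Parsing the host rather than string-matching the URL avoids counting a public page whose path or query merely contains "127.0.0.1".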

The top public domains break down by intent

Bucket	Approx users	Example domains
Search engines	~331	google (150), baidu (93), bing (32), duckduckgo (24)
Auth / identity	~158	accounts.google (64), login.microsoftonline (49), Atlassian, Intuit, PingOne
Dev tools	~133	github (75), figma (18), vercel (8), supabase (6), notion (6), chatgpt (10), claude.ai (6)
Social / content	~120	linkedin (26), youtube (22), instagram (16), x (15), facebook (13), reddit (10)
China-specific platforms	~115	baidu, xiaohongshu, bilibili, weixin, zhihu, douyin, feishu, dingtalk
Test / sandbox	~83	example.com (71), saucedemo (5)
Government / jobs	~34	governmentjobs (10), workday (7), Epoint procurement (5)
E-commerce	~22	amazon (8), checkout.stripe (6), alibaba (7)

A few reads:

Search and auth dominate the public web traffic. Agents Google things to find pages, then get past login walls. The killer near-term capability isn't "the AI tested an exotic checkout flow", it's "the AI got past your Single Sign-On so it could test the thing behind the login."

GitHub is the top public domain outside search. 75 users ran automation against github.com (opening PRs, scanning diffs, navigating repos), behind only google (150) and baidu (93). Dev-loop automation again, but on the SaaS side: agents automating the developer's own GitHub workflow.

The China cluster is real but not measurable as geography. GeoIP is disabled fleet-wide; we cannot publish a country breakdown. What we can see is that ~13.4% of URL-emitting users automate Chinese platforms (baidu, xiaohongshu, bilibili, weixin, zhihu, douyin, feishu, dingtalk, csdn, epoint). Read it as a domain-content signal, not a geo signal: these users are testing Chinese-language web flows.

example.com is 71 users. This is the "I'm kicking the tires on this MCP server, let me see if it does anything" cohort. It overlaps heavily with the 41% drop-off bucket. Test-on-example-dot-com is a leading indicator that the developer never crossed the activation cliff.

The takeaway for buyers: if your AI testing tool can't handle Single Sign-On and can't run against localhost in one shell command, it can't handle the two highest-volume use cases in the actual data.

For more on what people then build on top of this raw access, Playwright vs QAby.AI and Playwright pricing comparison walk through the framework-vs-agent fork.

5. What "agentic testing in CI/CD" actually means in 2026

Pulling the data together: what does it tell us about how AI agents test in pipelines today?

Five positions, each forced by the data

Position 1: The CI/CD agent is mostly a verification loop, not a regression suite. The 2.4-screenshots-per-session signature looks nothing like a 1,000-step regression run. It looks like "open the page, did the change work, take a screenshot, decide." That's the developer's git push loop, not nightly regression. The vendor pitch of "AI replaces your full regression suite" is downstream of a more honest reality: AI currently augments the inner-loop verification step that used to be a manual click-through.

Position 2: The accessibility-tree paradigm is losing to the screenshot paradigm. If you're betting product on "we serve clean DOM trees to agents because they're token-efficient," the user data doesn't agree. Multimodal vision tokens got cheap enough that agents prefer to look at the page. Build for that.

Position 3: Localhost-first beats cloud-grid-first. The single most-tested URL is 127.0.0.1. Cloud test grids are a real category, but the modal use case in MCP-driven testing is the developer's own dev server. Build the localhost path before the cloud path.

Position 4: The cross-client ecosystem is the distribution channel. Claude Code, opencode, Cursor, Codex, Cline, Windsurf, Gemini CLI, Trae, Antigravity: 187 clients in our data. The product that wins distribution wins by being neutral across coding agents, not by partnering with one. If you're building, treat your MCP server as the API and the coding agent as the IDE.

Position 5: Activation is the constraint, not adoption. This is the one that matters most to anyone building or buying. Downloads are free. Stars are free. The constraint that determines whether your AI testing investment pays off is whether the developer crosses the cliff into the 13% cohort that runs ≥100 tool calls. Everything else (pricing, marketing, sales) is downstream of that single conversion rate.

How this lines up with broader mid-market data

The activation cliff in MCP usage mirrors a pattern from our customer interviews: AI testing is a vitamin until the team's pain crosses a threshold. See the State of AI QA 2026 report for the full mid-market QA context behind n=41 conversations, including the N-3 Lag and the Locator Tax that explain why most teams should be in the 13% cohort but aren't.

The vendor problem is the same as the buyer problem: the gap between "this could change how we ship" and "this changed how we ship today" is wide. Closing it is the work.

6. What this means if you're the buyer

Devs ship faster than QA tests. We close the gap.

If you're evaluating an AI testing tool (ours or anyone's) the question isn't "does it use an MCP server" or "does it support Claude Code." Those are checkbox features the entire category now has. The question is the activation question:

Can your developer get a successful test running in the first 30 minutes?
Does that test run against 127.0.0.1 without a cloud-grid detour?
Does it survive an SSO login on the second flow?
Does the run come back with a screenshot diff a human can act on?
Does it work the next morning, against the new build, without a re-prompt?

Those five questions are our best explanation for why 41% of MCP users drop off after five events, and they're the same five questions that determine whether you join the 13% cohort that gets value out of it.

The outcome we promise: release confidence at engineering velocity, without hiring SDETs. The pitch is only honest if the tool clears the activation bar above.

If your team's pain is "we can't keep up with the dev velocity" (the N-3 Lag pattern we documented across 41 SaaS conversations) and you'd rather skip the SDET hire than ship slower, an audit of your current QA gap against these patterns is the first call. We'll show you which numbers above match your team, where the biggest leak is, and what changes if AI agents close it.

Run My Audit →

7. Caveats and what we can't tell you

A few honest disclosures.

distinct_id is a session identity, not a verified human. Heavy IDs are almost certainly automated agent loops, not real developers. Treat user counts as directional.
pageUrl is only emitted on execute_code (about 81% coverage). The 19% gap means top-domain counts are floors, not ceilings.
June 2026 is partial-month data. The Mar 2026 685K spike in monthly tool calls was driven by a handful of heavy IDs running autonomous loops, not by a 5x user surge.
GeoIP is disabled fleet-wide. Only 228 of 1.42M events were enriched. There is no country breakdown to publish.
MCP events carry no pass/fail field. We can't tell you the reliability story from MCP usage. Closing that instrumentation gap is on the roadmap.
Public reliability data caps at 90 days. The longer historical comparison would help; we don't have it yet.
Microsoft @playwright/mcp and our playwright-mcp are not the same product. Cross-comparison is on the category, not feature parity.

We're publishing this anyway because the directional signal (activation cliff, screenshot-over-DOM, localhost-first, cross-client distribution) holds across every cut of the data we ran. Tighter measurement is on the way.

About this post

Author: Himanshu Saleria, Co-founder & CEO, QAby.AI. Background in QA-led product engineering at scale; running QAby.AI's customer research, telemetry analysis, and product. LinkedIn.


Cross-reads from this dataset

State of AI QA 2026: full mid-market QA report behind these numbers
The SDET You Don't Have to Hire Next Quarter: cost math vs the SDET hire
Playwright vs QAby.AI: framework-code vs agent-led-regression fork
Playwright alternative 2026: landscape view
Playwright pricing comparison: maintenance-tax math
How to evaluate AI testing tools: buyer-side checklist

External cross-validation:

playwright-mcp on npm-stat: the download curve we cite
Microsoft @playwright/mcp on npm: cross-comparison for category sizing
Model Context Protocol spec: the protocol both servers implement

Frequently asked questions

What is Playwright MCP and what is it used for?

Playwright MCP is a Model Context Protocol server that gives AI coding agents (Claude Code, opencode, Cursor, Codex, Cline, and 180+ other clients) a way to drive a real browser as part of their tool loop. In the wild, agents use it mostly for inner-loop verification (did my change work?) and for getting past login walls so they can test the thing behind authentication.

Why did 41% of Playwright MCP users drop off after 5 tool calls?

The data shows 2,752 of 6,687 users ran five or fewer events total. The likely causes: first-session init_browser failures, agent context blowouts on heavy screenshots, and the killer use case not being obvious from a README. Activation, not adoption, is the unsolved problem in agentic browser testing in 2026.

Do AI agents prefer screenshots or DOM snapshots for browser testing?

Screenshots, decisively. In 1.42M tool calls we logged, agents fired get_screenshot 643K times (about 2.4 per browser session). All three DOM-snapshot calls combined totaled ~100K, less than a sixth of screenshot volume. Multimodal vision tokens got cheap enough that agents prefer to look at the page rather than parse the DOM tree.

Which MCP client is most popular for browser testing: Claude Code or opencode?

Different shapes. Claude Code leads on install base (980 distinct users vs opencode's 358). Opencode leads on volume (558K events vs Claude Code's 356K). Claude Code users skew to short verification bursts; opencode users skew to longer autonomous loops. No single client dominates; we count 187 distinct MCP clients driving the server.

What websites do developers test most with Playwright MCP?

Localhost (127.0.0.1) is the #1 target (272 users, 18,762 events). After that: search engines (~331 users on google/baidu/bing), auth and SSO walls (~158 users on Google/Microsoft/Atlassian logins), GitHub (75 users), and a long tail of dev tools, social platforms, and Chinese-language platforms. The dev loop, not staging, is the modal use case.

How does Playwright MCP fit into a CI/CD pipeline?

In practice, today, it's mostly used as a per-commit verification loop inside a coding agent's session, not as a nightly regression suite. The 2.4-screenshots-per-session signature looks like "did this change work?" (a developer's git push loop) not a 1,000-step regression run. CI/CD integration via tools that wrap the agent loop in a pipeline runner is the emerging pattern.

What's the difference between playwright-mcp and Microsoft's @playwright/mcp?

Different packages, different vendors, same protocol. Microsoft's @playwright/mcp shipped 60.4M npm downloads in the last 12 months (a near-vertical adoption curve riding inside official Playwright distribution). Our playwright-mcp shipped 230,105 in the same window. The bigger story is the category itself: agentic browser automation grew ~40x year over year, validating the workflow for both packages.