Browser agents vs the desktop layer
The 5,750 tasks / 452 websites benchmark, and the layer it can't measure
If you searched for a benchmark with 5,750 tasks across 452 websites, you are looking for Web Bench. It is the most honest map of browser-agent ability published so far. It is also drawn entirely from the public web, which means its edges trace the exact line where browser agents stop and desktop automation has to take over.
Direct answer · verified 2026-06-19
The 5,750-task, 452-website benchmark is Web Bench, an evaluation dataset for AI browser agents built by Skyvern and Halluminate and published on May 29, 2025. It holds 5,750 tasks across 452 websites, of which 2,454 tasks are open-sourced. It expands the older WebVoyager benchmark, which had 643 tasks across 15 websites. At launch the strongest performer was Anthropic's Sonnet 3.7 computer-use agent.
Source: Skyvern's Web Bench announcement. Dataset and code: Halluminate/WebBench on GitHub.
What the two numbers actually encode
The headline figures are a deliberate jump in scope. WebVoyager, the benchmark Web Bench replaced, ran 643 tasks against 15 hand-picked sites. That was enough to rank agents, but narrow enough that an agent could be tuned to those 15 sites and look better than it was. Web Bench went the other direction: 452 websites sampled from the top 1,000 sites globally by traffic, spread across roughly 17 categories, with 5,750 tasks layered on top.
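The size of that jump is easier to see as ratios. The short sketch below uses only the counts quoted in this article; the variable and function names are ours, chosen for illustration.

```python
# Counts quoted in this article for each benchmark.
WEBVOYAGER = {"tasks": 643, "sites": 15}
WEB_BENCH = {"tasks": 5750, "sites": 452}
OPEN_SOURCED_TASKS = 2454  # Web Bench tasks released publicly


def tasks_per_site(benchmark: dict) -> float:
    """Average number of tasks written against each site."""
    return benchmark["tasks"] / benchmark["sites"]


site_growth = WEB_BENCH["sites"] / WEBVOYAGER["sites"]
task_growth = WEB_BENCH["tasks"] / WEBVOYAGER["tasks"]
open_share = OPEN_SOURCED_TASKS / WEB_BENCH["tasks"]

print(f"sites grew {site_growth:.1f}x, tasks grew {task_growth:.1f}x")
print(f"tasks per site: {tasks_per_site(WEBVOYAGER):.1f} -> {tasks_per_site(WEB_BENCH):.1f}")
print(f"open-sourced share of tasks: {open_share:.1%}")
```

Sites grow about 30-fold while tasks grow about 9-fold, so the average number of tasks per site falls from roughly 43 to roughly 13. The expansion buys breadth across many sites, not depth on a few.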
The tasks themselves split into two families. READ tasks ask the agent to navigate to a piece of information and pull it out. WRITE tasks ask it to change state: fill a form, log in, solve a 2FA challenge, download a file, create or update or delete a record. Across agents, READ scores sit well above WRITE scores, which is the single most useful thing the benchmark tells you. Reading the web is close to solved. Acting on it, reliably, is not.
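A minimal sketch of how a READ-versus-WRITE gap could be tallied from a run log. The records below are invented placeholders, not Web Bench results, and the field names (`family`, `passed`) are assumptions made for the example, not the dataset's schema.

```python
from collections import defaultdict

# Hypothetical run log: placeholder values only, NOT real benchmark results.
results = [
    {"task_id": "t1", "family": "READ", "passed": True},
    {"task_id": "t2", "family": "READ", "passed": True},
    {"task_id": "t3", "family": "READ", "passed": True},
    {"task_id": "t4", "family": "READ", "passed": False},
    {"task_id": "t5", "family": "WRITE", "passed": True},
    {"task_id": "t6", "family": "WRITE", "passed": False},
    {"task_id": "t7", "family": "WRITE", "passed": True},
    {"task_id": "t8", "family": "WRITE", "passed": False},
]


def success_by_family(records):
    """Return the pass rate for each task family found in the records."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for record in records:
        totals[record["family"]] += 1
        passes[record["family"]] += int(record["passed"])
    return {family: passes[family] / totals[family] for family in totals}


rates = success_by_family(results)
gap = rates["READ"] - rates["WRITE"]
print(rates, f"gap={gap:.2f}")
```

Reporting the two families separately is the point: a single blended score would hide the difference between reading a page and changing state on it.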
That gap is the reason a number this large still leaves room for a product like Mediar. But the more important detail is hiding in the word "websites."
The 452 sites were sampled from public web traffic. That is the whole story.
Read the methodology and one line decides everything: the 452 sites were drawn from the top 1,000 websites globally by traffic. To enter that pool a system needs a URL and a measurable amount of public traffic. That is a perfectly reasonable way to build a browser-agent benchmark. It is also a filter that removes, by construction, every system most enterprise automation work is actually stuck on.
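The filter can be written down as a predicate. This is a sketch of the eligibility rule as the paragraph above describes it, not Skyvern's or Halluminate's actual sampling code; the `System` type, the example site, and its rank are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class System:
    name: str
    url: Optional[str]           # None when the system has no web address
    traffic_rank: Optional[int]  # None when there is no public traffic rank


def eligible_for_web_sample(system: System, pool_size: int = 1000) -> bool:
    """A system can enter the pool only with a URL and a rank inside the pool."""
    return (
        system.url is not None
        and system.traffic_rank is not None
        and system.traffic_rank <= pool_size
    )


candidates = [
    System("example public site", "https://example.com", 120),  # hypothetical rank
    System("SAP Business One window", None, None),
    System("Jack Henry core-banking green-screen", None, None),
    System("Oracle EBS form", None, None),
    System("mainframe terminal", None, None),
]

eligible = [s.name for s in candidates if eligible_for_web_sample(s)]
print(eligible)
```

Nothing in the predicate is hostile to desktop software. The desktop systems simply have no value to put in either field, so they drop out before any task is written.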
A SAP Business One window has no URL. A Jack Henry core-banking green-screen has no public traffic rank. An Epic chart, an Oracle EBS form, a mainframe terminal: none of them can appear in a ranking of websites because none of them are websites. So the desktop line-of-business surface is not under-represented in Web Bench. It is absent. It was never eligible.
This is why Mediar does not drive a DOM. It reads what an application exposes through the operating system's accessibility APIs, the same interfaces a screen reader uses. A window that has no web address still has an accessibility tree, and that tree is what the agent locates targets in. Here is the difference made concrete, using the open-source Terminator engine inspecting a SAP window:
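What follows is an illustrative sketch only: it is not Terminator's API and not captured output from a real SAP window. The `AXNode` type, the role and name values, and the `find` helper are all hypothetical stand-ins, meant to show the shape of locating a target by role and name in an accessibility tree.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class AXNode:
    """Hypothetical stand-in for one accessibility node (not a real SDK type)."""
    role: str
    name: str
    children: List["AXNode"] = field(default_factory=list)


def walk(node: AXNode) -> Iterator[AXNode]:
    """Depth-first traversal over the node and all of its descendants."""
    yield node
    for child in node.children:
        yield from walk(child)


def find(root: AXNode, role: str, name: str) -> Optional[AXNode]:
    """Return the first node matching role and name, or None if absent."""
    return next((n for n in walk(root) if n.role == role and n.name == name), None)


# Hand-written tree; the roles and names are invented for illustration.
window = AXNode("Window", "SAP Business One", [
    AXNode("Pane", "Sales Order", [
        AXNode("Edit", "Customer"),
        AXNode("Edit", "Posting Date"),
        AXNode("Button", "Add"),
    ]),
])

target = find(window, "Button", "Add")
print(target.role, target.name)
```

The lookup never touches a URL, a selector path, or rendered HTML. It needs only a tree of roles and names, which a desktop window can supply without being a web page.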
There is no headless endpoint and no page to render. The targets are accessibility nodes, not DOM elements. A browser-agent benchmark has no way to express this surface, which is precisely why a high score on one says nothing about it.
“A benchmark of 452 websites is a map of the public web. The systems enterprise RPA stalls on are not on that map, because they are not websites.”
The boundary in one sentence
What a web benchmark covers vs where the desktop work lives
This is not a knock on Web Bench. It measures what it set out to measure, and it does it at real scale. The point is to read the score for what it is, then look at the column it cannot reach.
| Feature | Web Bench (452 public websites) | Mediar (desktop accessibility layer) |
|---|---|---|
| What it can navigate | 452 public websites with a URL and a rendered DOM | Any Windows app that exposes an accessibility tree, URL or not |
| How a target is located | Vision and DOM elements on a rendered web page | OS-level accessibility nodes (the interface screen readers use) |
| SAP GUI, Oracle EBS, mainframe terminals | Not in the dataset (no URL, no traffic rank) | Primary use case, runs against the desktop window directly |
| Jack Henry / Fiserv / FIS core-banking screens | Not in the dataset | Supported on the desktop layer with audit logs |
| Epic / Cerner / eClinicalWorks | Not in the dataset | Supported, HIPAA-compliant deployment |
| What a strong score proves | The agent drives the public web well | The agent completes the desktop workflow you watched it learn |
If your workflows live entirely in the browser, a browser agent is the right tool and Web Bench is the right scoreboard. The split only matters when the work sits in desktop apps.
None of these could appear in the 452
SAP GUI, Oracle EBS, Jack Henry, Fiserv, and FIS core-banking screens, Epic, Cerner, eClinicalWorks, mainframe terminals: each is a place real teams run hundreds of repetitive workflows a week. Not one of them has a URL or a public traffic rank, so not one of them is eligible for a benchmark built from web traffic. This is the surface Mediar was built for.
The honest counterargument: isn't everything moving to the browser?
Yes, new software ships as SaaS, and for those workflows a browser agent measured by Web Bench is genuinely the better fit. We say this plainly: if your data lives in a modern web app, you do not need an accessibility-API approach, and a browser agent will serve you well.
But the systems of record in banking, insurance, healthcare, and manufacturing are not migrating off SAP, Oracle, Jack Henry, or Epic this decade. Those platforms were rolled out over the past twenty years and carry regulatory and integration weight that makes replacement a multi-year program, not a one-quarter project. The desktop layer is not a shrinking legacy footnote. It is where the unautomated work has collected, precisely because browser tools cannot reach it. A benchmark that grows from 15 to 452 websites is real progress on web coverage and zero progress on the desktop layer.
How to read the benchmark before you buy
Use Web Bench for what it is good for. A high score is solid evidence an agent can navigate and act on public websites, and the READ-versus-WRITE gap is an honest signal about how far reliable web acting still has to go. When a vendor cites it, that is a fair claim about browser work.
Then ask the question the benchmark cannot answer: where is my stuck work actually running? If the answer is a SAP transaction, a green-screen, an EHR chart, or a mainframe form, the right test is not a website score. It is whether the agent can watch that desktop workflow once and then execute it through the accessibility tree, self-healing when a label or layout shifts, with audit logs the compliance team accepts. That is the test Mediar is built to pass, and it is the one a 452-website benchmark was never designed to run. The engine behind it is open source at github.com/mediar-ai/terminator.
Bring the workflow a website benchmark can't score
Show us the SAP, banking, or EHR screen your current tools can't reach. We'll tell you on the call whether the accessibility-API approach handles it.
Frequently asked questions
What is the 5,750 tasks 452 websites benchmark called?
It is Web Bench, an evaluation dataset for AI browser agents built by Skyvern and Halluminate and published on May 29, 2025. It contains 5,750 tasks spread across 452 different websites, of which 2,454 tasks are open-sourced.
How is Web Bench different from WebVoyager?
WebVoyager, the prior standard, had 643 tasks across 15 websites. Web Bench widens that to 5,750 tasks across 452 websites sampled from the top 1,000 sites globally by traffic, across roughly 17 categories. The point was breadth: testing agents on many real sites instead of a handful.
What kinds of tasks does Web Bench include?
Tasks split into a READ family (navigate to information and extract it) and a WRITE family (enter data, fill forms, log in, solve 2FA, download files, and create, update, or delete records). READ tasks tend to score higher than WRITE tasks across agents.
Which agent is state of the art on Web Bench?
At publication, Skyvern reported Anthropic's Sonnet 3.7 computer-use agent (CUA) as the strongest performer, especially on read-heavy tasks. The live leaderboard is hosted at eval.skyvern.com.
Does Web Bench measure desktop or legacy enterprise systems?
No. By construction it samples public websites with URLs and a traffic rank. SAP GUI, Oracle EBS, Jack Henry and Fiserv core-banking screens, Epic and Cerner, and mainframe terminals are desktop applications with no URL and no public traffic ranking, so they cannot appear in the dataset at all.
Why does that matter for choosing automation software?
A high Web Bench score tells you how well an agent drives the public web. It says nothing about whether it can drive a SAP transaction or a green-screen, because those were never in scope. If your stuck workflows live in desktop apps, you need automation that reads the OS accessibility tree, not the DOM.