Browser agents vs the desktop layer

The 5,750 tasks / 452 websites benchmark, and the layer it can't measure

If you searched for a benchmark with 5,750 tasks across 452 websites, you are looking for Web Bench. It is the most honest map of browser-agent ability published so far. It is also drawn entirely from the public web, which means its edges trace the exact line where browser agents stop and desktop automation has to take over.

Matthew Diakonov, Written with AI

Published June 19, 2026 · 7 min read

Direct answer · verified 2026-06-19

The 5,750-task, 452-website benchmark is Web Bench, an evaluation dataset for AI browser agents built by Skyvern and Halluminate and published on May 29, 2025. It holds 5,750 tasks across 452 websites, of which 2,454 tasks are open-sourced. It expands the older WebVoyager benchmark, which had 643 tasks across 15 websites. At launch the strongest performer was Anthropic's Sonnet 3.7 computer-use agent.

Source: Skyvern's Web Bench announcement. Dataset and code: Halluminate/WebBench on GitHub.

What the two numbers actually encode

The headline figures are a deliberate jump in scope. WebVoyager, the benchmark Web Bench expands on, ran 643 tasks against 15 hand-picked sites. That was enough to rank agents, but narrow enough that an agent could be tuned to those 15 sites and look better than it was. Web Bench went the other direction: 452 websites sampled from the top 1,000 sites globally by traffic, spread across roughly 17 categories, with 5,750 tasks layered on top.
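The size of that jump can be checked from the published counts alone. The short Python calculation below uses only the four figures quoted above; the variable names are ours.

```python
# Published counts quoted in this article.
webvoyager = {"tasks": 643, "sites": 15}
web_bench = {"tasks": 5750, "sites": 452}

# How much each dimension grew from WebVoyager to Web Bench.
task_growth = web_bench["tasks"] / webvoyager["tasks"]
site_growth = web_bench["sites"] / webvoyager["sites"]

# Average number of tasks per site in each benchmark.
tasks_per_site_old = webvoyager["tasks"] / webvoyager["sites"]
tasks_per_site_new = web_bench["tasks"] / web_bench["sites"]

print(f"tasks grew {task_growth:.1f}x, sites grew {site_growth:.1f}x")
print(f"tasks per site: {tasks_per_site_old:.1f} -> {tasks_per_site_new:.1f}")
```

Sites grew about 30x while tasks grew about 9x, so the average dropped from roughly 43 tasks per site to roughly 13. The benchmark traded depth on a few sites for breadth across many, which is what makes tuning to individual sites harder.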

The tasks themselves split into two families. READ tasks ask the agent to navigate to a piece of information and pull it out. WRITE tasks ask it to change state: fill a form, log in, solve a 2FA challenge, download a file, create or update or delete a record. Across agents, READ scores sit well above WRITE scores, which is the single most useful thing the benchmark tells you. Reading the web is close to solved. Acting on it, reliably, is not.
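To make the READ-versus-WRITE comparison concrete, here is a minimal sketch of how a per-family success rate could be computed from a run log. The record layout and the six sample records are hypothetical and exist only to show the calculation. They are not Web Bench results and do not reflect the dataset's actual schema.

```python
from collections import defaultdict

# Hypothetical run records for illustration only. These are NOT Web Bench results.
results = [
    {"task_id": "t1", "family": "READ", "passed": True},
    {"task_id": "t2", "family": "READ", "passed": True},
    {"task_id": "t3", "family": "READ", "passed": False},
    {"task_id": "t4", "family": "WRITE", "passed": True},
    {"task_id": "t5", "family": "WRITE", "passed": False},
    {"task_id": "t6", "family": "WRITE", "passed": False},
]


def success_rate_by_family(records):
    """Return {family: fraction of that family's tasks that passed}."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        total[record["family"]] += 1
        passed[record["family"]] += int(record["passed"])
    return {family: passed[family] / total[family] for family in total}


rates = success_rate_by_family(results)
gap = rates["READ"] - rates["WRITE"]  # positive when reading outperforms acting
print(rates, gap)
```

Reporting the two families separately, rather than one blended score, is what exposes the gap. A single average would hide how much weaker the acting side is.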

That gap is the reason a number this large still leaves room for a product like Mediar. But the more important detail is hiding in the word "websites."

The 452 sites were sampled from public web traffic. That is the whole story.

Read the methodology and one line decides everything: the 452 sites were drawn from the top 1,000 websites globally by traffic. To enter that pool a system needs a URL and a measurable amount of public traffic. That is a perfectly reasonable way to build a browser-agent benchmark. It is also a filter that removes, by construction, every system most enterprise automation work is actually stuck on.

A SAP Business One window has no URL. A Jack Henry core-banking green-screen has no public traffic rank. An Epic chart, an Oracle EBS form, a mainframe terminal: none of them can appear in a ranking of websites because none of them are websites. So the desktop line-of-business surface is not under-represented in Web Bench. It is absent. It was never eligible.
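The eligibility argument can be written as a filter. The sketch below models the sampling rule as this article describes it (a URL plus a rank inside the top 1,000). It is not Web Bench's actual sampling code, and the `System` type, the function name, and the example rank of 120 are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class System:
    name: str
    url: Optional[str]           # None for a desktop app with no web address
    traffic_rank: Optional[int]  # None when there is no public traffic rank


def eligible_for_web_sample(system: System, top_n: int = 1000) -> bool:
    """A system can enter the pool only with a URL and a rank inside the top N."""
    return (
        system.url is not None
        and system.traffic_rank is not None
        and system.traffic_rank <= top_n
    )


candidates = [
    System("example public site", "https://example.com", 120),  # hypothetical rank
    System("SAP Business One window", None, None),
    System("core-banking green-screen", None, None),
    System("mainframe terminal", None, None),
]

eligible = [s.name for s in candidates if eligible_for_web_sample(s)]
print(eligible)
```

Only the public site survives the filter. The desktop systems are not ranked low, they are rejected before ranking starts, which is the sense in which they were never eligible.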

This is why Mediar does not drive a DOM. It reads what an application exposes through the operating system's accessibility APIs, the same interfaces a screen reader uses. A window that has no web address still has an accessibility tree, and that tree is what the agent locates targets in. Here is the difference made concrete, using the open-source Terminator engine inspecting a SAP window:

terminator — inspecting a no-URL desktop app

There is no headless endpoint and no page to render. The targets are accessibility nodes, not DOM elements. A browser-agent benchmark has no way to express this surface, which is precisely why a high score on one says nothing about it.
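For readers who have not worked with accessibility trees, the following is a minimal, self-contained Python sketch of the idea: a window is modeled as a tree of nodes, each with a role and a name, and a target is located by walking that tree. This is not the Terminator API. The `AxNode` type, the `find` helper, and the role and name values are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class AxNode:
    """Hypothetical stand-in for one accessibility node (not a real library type)."""
    role: str
    name: str
    children: List["AxNode"] = field(default_factory=list)


def walk(node: AxNode) -> Iterator[AxNode]:
    """Yield the node and all of its descendants, depth first."""
    yield node
    for child in node.children:
        yield from walk(child)


def find(root: AxNode, role: str, name: str) -> Optional[AxNode]:
    """Return the first node matching both role and name, or None."""
    for node in walk(root):
        if node.role == role and node.name == name:
            return node
    return None


# Invented tree for a desktop window. Note that no URL appears anywhere.
window = AxNode("Window", "SAP Business One", [
    AxNode("Pane", "Sales Order", [
        AxNode("Edit", "Customer"),
        AxNode("Edit", "Posting Date"),
        AxNode("Button", "Add"),
    ]),
])

target = find(window, "Button", "Add")
print(target)
```

The lookup keys are a role and a name exposed by the operating system, not a CSS selector or a page address, so the same approach applies whether or not the application has a web front end.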

452 sites / 0 desktop LOB apps

“A benchmark of 452 websites is a map of the public web. The systems enterprise RPA stalls on are not on that map, because they are not websites.”

The boundary in one sentence

What a web benchmark covers vs where the desktop work lives

This is not a knock on Web Bench. It measures what it set out to measure, and it does it at real scale. The point is to read the score for what it is, then look at the column it cannot reach.

Feature	Web Bench (452 public websites)	Mediar (desktop accessibility layer)
What it can navigate	452 public websites with a URL and a rendered DOM	Any Windows app that exposes an accessibility tree, URL or not
How a target is located	Vision and DOM elements on a rendered web page	OS-level accessibility nodes (the interface screen readers use)
SAP GUI, Oracle EBS, mainframe terminals	Not in the dataset (no URL, no traffic rank)	Primary use case, runs against the desktop window directly
Jack Henry / Fiserv / FIS core-banking screens	Not in the dataset	Supported on the desktop layer with audit logs
Epic / Cerner / eClinicalWorks	Not in the dataset	Supported, HIPAA-compliant deployment
What a strong score proves	The agent drives the public web well	The agent completes the desktop workflow you watched it learn

If your workflows live entirely in the browser, a browser agent is the right tool and Web Bench is the right scoreboard. The split only matters when the work sits in desktop apps.

None of these could appear in the 452

Every system below is a place where real teams run hundreds of repetitive workflows a week. Not one of them has a URL or a public traffic rank, so not one of them is eligible for a benchmark built from web traffic. This is the surface Mediar was built for.

SAP GUI, SAP Business One, Oracle EBS, Jack Henry, Fiserv, FIS, Epic, Cerner, eClinicalWorks, Mainframe terminals, AS/400 green-screens, Legacy Win32 line-of-business apps

The honest counterargument: isn't everything moving to the browser?

Yes, new software ships as SaaS, and for those workflows a browser agent measured by Web Bench is genuinely the better fit. We say this plainly: if your data lives in a modern web app, you do not need an accessibility-API approach, and a browser agent will serve you well.

But the systems of record in banking, insurance, healthcare, and manufacturing are not migrating off SAP, Oracle, Jack Henry, or Epic this decade. Those platforms were installed over the course of twenty years and carry regulatory and integration weight that makes replacement a multi-year program, not a one-quarter project. The desktop layer is not a shrinking legacy footnote. It is where the unautomated work has collected, precisely because browser tools cannot reach it. A benchmark that grows from 15 to 452 websites is real progress on the web axis and zero progress on the desktop one.

How to read the benchmark before you buy

Use Web Bench for what it is good for. A high score is solid evidence an agent can navigate and act on public websites, and the READ-versus-WRITE gap is an honest signal about how far reliable web acting still has to go. When a vendor cites it, that is a fair claim about browser work.

Then ask the question the benchmark cannot answer: where is my stuck work actually running? If the answer is a SAP transaction, a green-screen, an EHR chart, or a mainframe form, the right test is not a website score. It is whether the agent can watch that desktop workflow once and then execute it through the accessibility tree, self-healing when a label or layout shifts, with audit logs the compliance team accepts. That is the test Mediar is built to pass, and it is the one a 452-website benchmark was never designed to run. The engine behind it is open source at github.com/mediar-ai/terminator.

Bring the workflow a website benchmark can't score

Show us the SAP, banking, or EHR screen your current tools can't reach. We'll tell you on the call whether the accessibility-API approach handles it.

Frequently asked questions

What is the 5,750 tasks 452 websites benchmark called?

It is Web Bench, an evaluation dataset for AI browser agents built by Skyvern and Halluminate and published on May 29, 2025. It contains 5,750 tasks spread across 452 different websites, of which 2,454 tasks are open-sourced.

How is Web Bench different from WebVoyager?

WebVoyager, the prior standard, had 643 tasks across 15 websites. Web Bench widens that to 5,750 tasks across 452 websites sampled from the top 1,000 sites globally by traffic, across roughly 17 categories. The point was breadth: testing agents on many real sites instead of a handful.

What kinds of tasks does Web Bench include?

Tasks split into a READ family (navigate to information and extract it) and a WRITE family (enter data, fill forms, log in, solve 2FA, download files, and create, update, or delete records). READ tasks tend to score higher than WRITE tasks across agents.

Which agent is state of the art on Web Bench?

At publication, Skyvern reported Anthropic's Sonnet 3.7 computer-use agent (CUA) as the strongest performer, especially on read-heavy tasks. The live leaderboard is hosted at eval.skyvern.com.

Does Web Bench measure desktop or legacy enterprise systems?

No. By construction it samples public websites with URLs and a traffic rank. SAP GUI, Oracle EBS, Jack Henry and Fiserv core-banking screens, Epic and Cerner, and mainframe terminals are desktop applications with no URL and no public traffic ranking, so they cannot appear in the dataset at all.

Why does that matter for choosing automation software?

A high Web Bench score tells you how well an agent drives the public web. It says nothing about whether it can drive a SAP transaction or a green-screen, because those were never in scope. If your stuck workflows live in desktop apps, you need automation that reads the OS accessibility tree, not the DOM.