Dataset reference, verified 2026-06-21

WebBench is 5,750 tasks across 452 websites. None of them is a desktop app.

If you searched for the WebBench GitHub dataset and the 5,750-task number, here are the exact figures, the repo, and the leaderboard. Then the part no other write-up mentions: what an agent benchmark built entirely from top-traffic websites cannot tell you about the systems most enterprise work actually lives in.

Direct answer

WebBench is 5,750 tasks across 452 live websites. Of those, 2,454 tasks are open-sourced in the public repository at github.com/Halluminate/WebBench under the MIT license. The websites are pulled from the global top-1000 by traffic, so every task is a browser task.

Source: Halluminate/WebBench README and the leaderboard at webbench.ai. Re-checked 21 June 2026.

What is actually in the 2,454 open-source tasks

The full 5,750 are split into a public set and a private set. The private split is held back so agents cannot be trained directly on the answers. The 2,454 published tasks ship as two CSV files (webbenchfinal.csv and webbench_hitl_final.csv) and a results folder. The README labels them as roughly 2.5k READ and ACTION based tasks. Here is the exact composition.

| Task type | What it tests | Count | Share |
| --- | --- | --- | --- |
| READ | Navigation and data extraction | 1,580 | 64.4% |
| CREATE | Form filling, posting, new records | 512 | 20.9% |
| UPDATE | Editing existing records | 173 | 7.0% |
| DELETE | Removing records | 149 | 6.1% |
| FILE_MANIPULATION | Uploads and downloads | 40 | 1.6% |
| Total open-source tasks | | 2,454 | 100% |

READ alone is 1,580 tasks, almost two thirds of the public set. The four write categories (CREATE, UPDATE, DELETE, FILE_MANIPULATION) add up to 874 tasks, 35.6%.
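The composition can be rechecked with a few lines of Python. This is a minimal sketch, not the repository's own tooling: the counts are copied from the breakdown above, and the column name `Category` passed to `tally_categories` is a hypothetical header, so check the real CSV header before pointing it at the published files.

```python
import csv
import io
from collections import Counter

# Published counts for the 2,454 open-source tasks, copied from the breakdown above.
PUBLISHED_COUNTS = {
    "READ": 1580,
    "CREATE": 512,
    "UPDATE": 173,
    "DELETE": 149,
    "FILE_MANIPULATION": 40,
}


def tally_categories(csv_file, column="Category"):
    """Count tasks per category in an open CSV file.

    `column` is a hypothetical header name (an assumption, not taken
    from the repository); replace it with the real one.
    """
    return Counter(row[column] for row in csv.DictReader(csv_file))


def shares(counts):
    """Return each category's share of the total, as a percentage."""
    total = sum(counts.values())
    return {name: 100 * n / total for name, n in counts.items()}


# Print the published counts with their recomputed shares.
for name, pct in shares(PUBLISHED_COUNTS).items():
    print(f"{name:<18}{PUBLISHED_COUNTS[name]:>6}{pct:>7.1f}%")

# Sum of the four write categories (everything except READ).
write_total = sum(n for name, n in PUBLISHED_COUNTS.items() if name != "READ")
print(f"write categories: {write_total}")
```

To tally a real file, open it with `open(path, newline="", encoding="utf-8")` and pass the handle to `tally_categories`, then compare the result with `PUBLISHED_COUNTS`.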

The leaderboard, in one glance

WebBench scores computer-use agents on completing the tasks end-to-end in a real browser. These are the public standings at webbench.ai.

| Rank | Agent | Score |
| --- | --- | --- |
| #1 | Anthropic Sonnet 3.7 CUA | 66.0% |
| #2 | Skyvern 2.0 | 64.4% |
| #3 | Skyvern 2.0 on Browserbase | 60.7% |
| #4 | OpenAI CUA | 59.8% |
| #5 | Browser Use Cloud | 43.9% |

Even the leader clears just under two thirds of tasks, so roughly one task in three still fails. The harder write actions, logins, and multi-step flows are where the scores drop, which tells you these tasks are not trivial even on the open web.
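To put the percentages in task terms, here is a rough sketch that converts each score into an implied task count. It assumes each score is a simple completion rate over the full 5,750 tasks; the leaderboard may weight or subset tasks differently, so treat the counts as approximations.

```python
from decimal import Decimal

TOTAL_TASKS = 5750  # full benchmark size stated above

# Leaderboard scores in percent, copied from the standings above.
SCORES = {
    "Anthropic Sonnet 3.7 CUA": Decimal("66.0"),
    "Skyvern 2.0": Decimal("64.4"),
    "Skyvern 2.0 on Browserbase": Decimal("60.7"),
    "OpenAI CUA": Decimal("59.8"),
    "Browser Use Cloud": Decimal("43.9"),
}

TWO_THIRDS_PCT = Decimal(200) / 3  # two thirds, expressed in percent


def implied_tasks(score_pct, total=TOTAL_TASKS):
    """Approximate number of tasks a percentage score corresponds to.

    Assumption: the score is a plain completion rate over `total` tasks.
    """
    return int((score_pct * total / 100).to_integral_value())


for agent, score in SCORES.items():
    print(f"{agent:<28}{score:>6}%  ~{implied_tasks(score)} tasks")

# The leader sits just below the two-thirds mark.
leader_score = max(SCORES.values())
print(f"leader below two thirds: {leader_score < TWO_THIRDS_PCT}")
```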

The one thing 452 websites cannot measure

Look again at how the sites were chosen: the global top-1000 by traffic. That is a deliberate, sensible call for a browser-agent benchmark, and it is also the boundary of what the score means. Every single one of the 452 sites has a URL and a DOM. The agent can read the page source, query elements, and act on them.

Now picture where a regional bank, an insurance carrier, or a hospital actually keeps its work. It is not on a top-1000 website. It is in SAP GUI, in Oracle EBS, in a Jack Henry or Fiserv or FIS green-screen, in Epic or Cerner desktop clients, on a mainframe terminal. None of those has a URL. None of those has a DOM. A browser agent that scores 66% on WebBench has zero handle on any of them, because there is nothing in the browser to grab.

This is not a knock on WebBench. It is a precise statement of scope. If your automation problem is on the open web, the leaderboard is a real signal. If your automation problem is on a no-API desktop system, the benchmark is silent, and a high score is not evidence of anything.

Two different problems, two different surfaces

WebBench covers one surface: 5,750 tasks on 452 browser-reachable websites from the global top-1000 by traffic. The agent works against a URL and a DOM.

  • Top-traffic public web apps
  • Pages have a URL and queryable elements
  • READ, CREATE, UPDATE, DELETE, file transfer in a browser
  • Best public agent clears 66.0%

How you measure (and automate) the layer WebBench skips

If there is no DOM, you need a different source of truth about the screen. That source already exists on every Windows machine: the accessibility tree, the same interface a screen reader uses to read out a SAP field or a green-screen cell. Mediar reads what an app exposes there, then executes through those same APIs. No pixel matching, no brittle selectors, so it does not break when a label moves or a layout shifts.
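Why a tree lookup tolerates layout changes can be shown with a toy model. The sketch below is not Mediar's implementation and does not call any real accessibility API: `Node`, `walk`, `find`, and the field names are all hypothetical, standing in for the role and name properties an accessibility tree typically exposes.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Node:
    """Toy stand-in for one element of an accessibility tree."""
    role: str
    name: str = ""
    value: str = ""
    children: List["Node"] = field(default_factory=list)


def walk(node: Node) -> Iterator[Node]:
    """Yield the node and all of its descendants, depth first."""
    yield node
    for child in node.children:
        yield from walk(child)


def find(root: Node, role: str, name: str) -> Optional[Node]:
    """Return the first element matching role and name, or None."""
    for node in walk(root):
        if node.role == role and node.name == name:
            return node
    return None


# Hypothetical screen, version 1: the field sits directly under the window.
layout_v1 = Node("window", "Invoice entry", children=[
    Node("edit", "Vendor ID", value="V-1001"),
    Node("button", "Post"),
])

# Version 2: the same field moved inside a new pane, after the button.
layout_v2 = Node("window", "Invoice entry", children=[
    Node("button", "Post"),
    Node("pane", "Header", children=[
        Node("edit", "Vendor ID", value="V-1001"),
    ]),
])

for layout in (layout_v1, layout_v2):
    vendor = find(layout, "edit", "Vendor ID")
    print(vendor.value if vendor else "not found")
```

The lookup keys on what the element is (role and name) rather than where it is drawn, so moving the field does not change the result. If the accessible name itself changes, a lookup this simple returns `None`, which a real system would have to detect and handle.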

Figure: Mediar drives a no-API desktop app through the accessibility tree. Desktop apps (SAP GUI, mainframe terminal, Jack Henry, Epic / Cerner) expose their screens to the OS accessibility tree, through which Mediar reads field values, clicks and types entries, validates against rules, and self-heals on UI change.
The same approach is open source: the Terminator SDK on GitHub lets teams script accessibility-API automation against the desktop apps a browser benchmark never touches.

70%

An F&B chain moved off UiPath to Mediar and the CFO told the board they are now saving 70% on costs.

Mediar customer deployment, SAP Business One

What the desktop layer is worth when you actually automate it

  • 70% cost cut vs UiPath at an F&B SAP B1 chain
  • 30→2 min per claim, before and after
  • Annual savings on insurance claims intake
  • Annual savings on healthcare patient intake

Numbers from named Mediar deployments. Claims intake went from 30 minutes to 2 minutes per claim; bank onboarding went from 8 weeks to 2 weeks. None of these workflows would appear on a browser benchmark.

How to use the WebBench number well

When you read that an agent scores 66.0% on 5,750 tasks, ask one question first: are my workflows in the browser, or on the desktop? For open-web automation, treat the leaderboard as a fair comparison. For SAP, banking cores, EHRs, or mainframes, the score is the wrong tool, and you should be evaluating accessibility-API automation instead.

Browser-based AI agents are genuinely good at new SaaS. If your data lives in SAP GUI or a Jack Henry green-screen, they will not help. That gap is the whole reason the accessibility-API approach exists.

Your workflows are on the desktop, not the leaderboard

Book a call and we will scope one no-API desktop workflow (SAP, banking core, or EHR) and show what accessibility-API automation does with it.

Frequently asked questions

How many tasks are in the WebBench dataset?

WebBench is 5,750 tasks across 452 live websites. Of those, 2,454 tasks are open-sourced in the public GitHub repository at github.com/Halluminate/WebBench. The remaining tasks are held back as a private split so the benchmark cannot be trained against directly.

Where is the WebBench GitHub dataset and what license is it under?

The open data lives at github.com/Halluminate/WebBench, released by Halluminate under the MIT license. The repo ships two CSV files (webbenchfinal.csv and webbench_hitl_final.csv) plus a results folder. The README describes it as roughly 2.5k READ and ACTION based tasks.

What is the READ vs WRITE split in the open-source set?

Inside the 2,454 published tasks, the breakdown is READ 1,580 (64.4%), CREATE 512 (20.9%), UPDATE 173 (7.0%), DELETE 149 (6.1%), and FILE_MANIPULATION 40 (1.6%). READ is navigation and data extraction; the rest are write actions like form filling, editing, deleting, and file transfer.

Which agent currently leads the WebBench leaderboard?

On the public leaderboard at webbench.ai, Anthropic's Sonnet 3.7 CUA leads at 66.0%, followed by Skyvern 2.0 at 64.4%, Skyvern 2.0 on Browserbase at 60.7%, OpenAI CUA at 59.8%, and Browser Use Cloud at 43.9%.

Does WebBench cover desktop applications like SAP GUI or mainframes?

No. All 452 sites are drawn from the global top-1000 websites by traffic, so every task is a browser task with a URL and a DOM. SAP GUI, Oracle EBS, Jack Henry, Fiserv, and FIS green-screen banking cores, Epic and Cerner desktop clients, and mainframe terminals are not browser sites, so they are out of scope by construction. A high WebBench score says nothing about whether an agent can drive those systems.

How does Mediar relate to WebBench?

Mediar does not compete on WebBench, because Mediar automates the desktop layer the benchmark cannot reach. Instead of a DOM, Mediar reads what apps expose through OS-level accessibility APIs (the same interfaces screen readers use), then executes via those APIs with no pixel matching and no selectors. That is why it works on SAP GUI and banking green-screens where browser agents have nothing to grab.

Matthew Diakonov
6 min read
