Benchmark explainer

WebBench: 2,454 tasks, 452 websites, and the layer the benchmark cannot score

WebBench is the most honest public look we have at where browser agents actually stand. It also quietly draws the boundary of what a browser agent can never reach. Both facts live in the same dataset.

Matthew Diakonov, Written with AI

Published June 20, 2026. 7 min read.

Direct answer (verified 2026-06-20)

WebBench is Halluminate’s open, MIT-licensed benchmark for browser agents: 2,454 tasks across 452 live websites sampled from the global top-1000 by traffic. The code and dataset live at github.com/Halluminate/WebBench. In the published results, 5 of the 7 agents tested clear 70% on read tasks, state-of-the-art agents reach only 46.6% on write tasks, and the best fully automated agent finishes 66.0% of all tasks.

Note: you will see ~5,750 quoted in the same repo. That is the raw candidate pool. 2,454 is the runnable set after infeasible tasks were pruned. Use 2,454 when you mean the scored benchmark.

2,454 scored tasks

452 live websites

64.4% are read tasks

46.6% write-task ceiling

What the 2,454 tasks actually contain

WebBench grew out of WebVoyager (15 sites, 642 tasks). Halluminate widened coverage to 452 sites across 17 traffic-ranked categories, generated roughly 5,750 candidate tasks, then trimmed to 2,454 after dropping tasks that broke when the live sites changed. Every task is tagged by what it asks the agent to do. The split is the whole story:

Task mix (2,454 total)

READ - 1,580 tasks (64.4%): extract information from a page
CREATE - 512 tasks (20.9%): make a new record or submission
UPDATE - 173 tasks (7.1%): modify existing data
DELETE - 149 tasks (6.1%): remove data
FILE_MANIPULATION - 40 tasks (1.6%): download or handle files

Read is nearly two-thirds of the set. The four action categories, collapsed together, are the other 35.6% (874 tasks). That ratio is not an accident of sampling: write tasks are simply harder to author, harder to verify, and far harder for an agent to complete. Which is the first thing the scores confirm.
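The read/action split is simple arithmetic over the five category counts, and it is easy to re-derive. The snippet below is a minimal sketch that uses only the counts quoted in this article; the `Category` type and the `share` helper are illustrative names, not part of the WebBench repository.

```typescript
// Category counts as quoted in the task mix above.
interface Category {
  name: string;
  count: number;
  isRead: boolean;
}

const categories: Category[] = [
  { name: "READ", count: 1580, isRead: true },
  { name: "CREATE", count: 512, isRead: false },
  { name: "UPDATE", count: 173, isRead: false },
  { name: "DELETE", count: 149, isRead: false },
  { name: "FILE_MANIPULATION", count: 40, isRead: false },
];

// Percentage of `part` in `whole`, rounded to one decimal place.
function share(part: number, whole: number): number {
  return Math.round((part / whole) * 1000) / 10;
}

const total = categories.reduce((sum, c) => sum + c.count, 0);
const actionTotal = categories
  .filter((c) => !c.isRead)
  .reduce((sum, c) => sum + c.count, 0);

const readShare = share(total - actionTotal, total);
const actionShare = share(actionTotal, total);

console.log({ total, actionTotal, readShare, actionShare });
```

Collapsing the four action categories into a single bucket is what makes the later read-versus-write comparison possible.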

Reads are basically solved. Writes are not.

On read tasks, 5 of the 7 agents Halluminate tested cleared 70%, and most landed above 75%. If your job is “find this fact on this page,” a 2026 browser agent does it. The headline 66.0% overall score for the best fully automated agent (Anthropic CUA) is carried mostly by that read-heavy majority.

46.6%

“Browser agents perform poorly on write-heavy tasks (e.g. logging in, filling out forms, downloading files), with SOTA agents only reaching 46.6% success rate.”

Halluminate, Web Bench: The Current State of Browser Agents

Read that again with an operations hat on. The tasks an enterprise actually pays people to do all day are write tasks: post the invoice, key the claim, update the customer record, pull the statement. WebBench puts a hard number on how well agents do those today, on the public web where they at least have a DOM and a URL to hold onto: fewer than half. Logging in, form-filling, and downloads are named explicitly as the weak spots.
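The claim that the 66.0% overall score is carried by reads can be sanity-checked with a weighted average. The sketch below rests on one loud assumption: that the 66.0% overall rate and the 46.6% write rate describe the same agent. Halluminate's summary, as quoted here, does not say so, so treat the result as a back-of-envelope estimate and not a published figure. The function name is illustrative.

```typescript
// Figures quoted in this article.
const totalTasks = 2454;
const readTasks = 1580;
const writeTasks = totalTasks - readTasks; // 874 action tasks
const overallRate = 0.66; // best fully automated agent, all tasks
const writeRate = 0.466; // state of the art on write tasks

// ASSUMPTION: both rates belong to one agent, so that
//   overall * total = readRate * reads + writeRate * writes
// Solving for readRate gives the read success rate those numbers imply.
function impliedReadRate(
  overall: number,
  write: number,
  reads: number,
  writes: number,
): number {
  const solvedOverall = overall * (reads + writes);
  const solvedWrites = write * writes;
  return (solvedOverall - solvedWrites) / reads;
}

const readRate = impliedReadRate(overallRate, writeRate, readTasks, writeTasks);

// Fraction of all completed tasks that would be read tasks.
const readShareOfSolved = (readRate * readTasks) / (overallRate * totalTasks);

console.log(readRate.toFixed(3), readShareOfSolved.toFixed(3));
```

Under that assumption the implied read rate comes out near 77%, in line with "most landed above 75%", and roughly three quarters of the completed tasks would be reads.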

The constraint nobody writes down: every target has a URL

Here is the part the other write-ups skip. WebBench samples from the global top-1000 websites by traffic. That selection rule decides, in advance, what the benchmark can ever say something about. It can score an agent on Amazon, on a SaaS dashboard, on a government portal. It cannot score an agent on the systems where a community bank, a regional carrier, or a hospital actually runs its back office, because those are not websites.

In WebBench (has a URL)

Top-1000 consumer and SaaS websites
E-commerce, travel, media, social portals
Anything with a DOM a browser agent can parse
452 live sites, 17 traffic-ranked categories

Not in WebBench (no URL, no DOM)

SAP GUI and Oracle EBS desktop clients
Mainframe green-screen terminals
Jack Henry, Fiserv, FIS banking cores
Epic, Cerner, eClinicalWorks desktop EHRs

This is not a knock on WebBench. It is a careful, useful benchmark for exactly what it claims to measure. The mistake is reading a browser-agent score and concluding something about enterprise automation in general. The hardest, highest-value back-office work sits on desktop apps with no API and no browser surface, and no public benchmark of websites will ever tell you how an agent does there.

Where the 46.6% problem and the no-URL problem both go away

A browser agent struggles with write tasks partly because it is reverse-engineering intent from a rendered DOM and, when that fails, from pixels. On a legacy desktop app there is no DOM to reverse-engineer at all. Mediar takes a different input entirely: the OS-level accessibility tree, the same structured interface a screen reader consumes. Controls have stable roles and names whether or not the app ever shipped a web front end.

sap-write-task.ts

The engine underneath is the open-source Terminator SDK (github.com/mediar-ai/terminator, MIT, Windows-only). Because the agent locates a field by its accessibility role and name instead of a pixel position or a brittle selector, a write task on SAP GUI is the same shape of operation as a write task in any other Windows app: find the control, set the value. It also self-heals when a label or layout moves, because there is no coordinate or selector to break.
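The "find the control, set the value" idea can be sketched without a real desktop. The block below is a toy model and not the Terminator SDK's API: `UiNode`, `findByRoleAndName`, and `setField` are hypothetical names, and the tree is an in-memory stand-in for what the operating system's accessibility layer would expose. It only illustrates why a role-and-name lookup keeps working when a control moves within the layout.

```typescript
// Hypothetical stand-in for one node of an OS accessibility tree.
interface UiNode {
  role: string;
  name: string;
  value?: string;
  children: UiNode[];
}

// Depth-first search for the first node with the given role and name.
function findByRoleAndName(
  root: UiNode,
  role: string,
  name: string,
): UiNode | undefined {
  if (root.role === role && root.name === name) return root;
  for (const child of root.children) {
    const hit = findByRoleAndName(child, role, name);
    if (hit) return hit;
  }
  return undefined;
}

// A write task in this model: find the edit control, set its value.
function setField(root: UiNode, name: string, value: string): boolean {
  const field = findByRoleAndName(root, "Edit", name);
  if (!field) return false;
  field.value = value;
  return true;
}

// Made-up window layout for illustration only.
const vendorField: UiNode = { role: "Edit", name: "Vendor", children: [] };
const amountField: UiNode = { role: "Edit", name: "Amount", children: [] };
const header: UiNode = {
  role: "Pane",
  name: "Header",
  children: [vendorField, amountField],
};
const invoiceWindow: UiNode = {
  role: "Window",
  name: "Post Invoice",
  children: [header, { role: "Button", name: "Post", children: [] }],
};

const firstWrite = setField(invoiceWindow, "Amount", "1250.00");

// Simulate a layout change: the Amount field moves out of the Header pane.
header.children = header.children.filter((c) => c !== amountField);
invoiceWindow.children.push(amountField);

// The same lookup still succeeds because it never depended on position.
const secondWrite = setField(invoiceWindow, "Amount", "980.00");
const missingWrite = setField(invoiceWindow, "Tax Code", "V1");

console.log({ firstWrite, secondWrite, missingWrite });
```

A real accessibility tree is queried through OS APIs and is far larger, but the lookup has the same shape: the locator carries a role and a name, never a coordinate.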

We have watched this play out in production, not in a benchmark. An F&B chain moved off UiPath to Mediar and their CFO told the board they are saving 70% on costs. At one mid-market insurance carrier, claims intake went from 30 minutes per claim to 2 minutes, which is about $750K a year on their own headcount math. Those are write tasks, on legacy desktop systems, the exact intersection WebBench is structurally unable to put a number on.

Have a write task stuck on a no-URL desktop app?

Book a working session. We will look at one real workflow on SAP, a banking core, or an EHR and tell you honestly whether the accessibility-API approach clears it.

WebBench, answered

What is WebBench and where is it on GitHub?

WebBench is an open, task-oriented benchmark built by Halluminate to measure how well browser agents handle realistic web workflows. It contains 2,454 tasks across 452 live websites sampled from the global top-1000 by traffic. The repository is at github.com/Halluminate/WebBench and the dataset is mirrored on Hugging Face under Halluminate/WebBench. It is released under the MIT license.

Why 2,454 tasks and not the larger number I have seen quoted?

Halluminate originally generated roughly 5,750 candidate tasks across the 452 sites. During evaluation they trimmed the set down to 2,454 by removing tasks that had become infeasible because the underlying websites changed. So 2,454 is the runnable, scored set; ~5,750 is the raw pool before pruning. The benchmark builds on the earlier WebVoyager work, which covered 15 sites and 642 tasks.

How do the tasks break down by type?

The 2,454 tasks split into READ 1,580 (64.4%) and ACTION 874 (35.6%). The action group is CREATE 512 (20.9%), UPDATE 173 (7.1%), DELETE 149 (6.1%), and FILE_MANIPULATION 40 (1.6%). Read means extracting information from a page; the action categories mean changing state: creating, editing, deleting records, or downloading files.

What scores did browser agents get on WebBench?

Read tasks are close to solved: 5 of the 7 agents tested cleared 70% on reads, and most reached above 75%. Write-heavy tasks are not solved. Per Halluminate, state-of-the-art agents only reach 46.6% on write tasks (logging in, filling forms, downloading files). The best fully automated agent overall, Anthropic CUA, completed 66.0% of all tasks, helped largely by the read-heavy majority.

Does WebBench measure desktop automation at all?

No. Every one of the 452 targets is a live website with a URL, sampled from the most-trafficked sites on the public web. There is no SAP GUI, no mainframe green-screen, no Jack Henry or Fiserv core, no Epic or Cerner client in the set, because none of those are public websites. WebBench is a browser-agent benchmark by construction, so it is silent on the legacy desktop layer where a large share of enterprise back-office work runs.

What does the 46.6% write ceiling imply for enterprise automation?

The tasks that move the needle in finance, claims, and banking ops are almost all write tasks: posting an invoice, keying a claim, updating a customer record, downloading a statement. WebBench shows those are exactly where browser agents are weakest, on the public web where they at least have a DOM and a URL to work with. On a desktop app with no API and no DOM, a browser-based agent has nothing to grab at all, which is the reason the accessibility-API approach exists.

How is Mediar's approach different from a browser agent?

Mediar's agents read what an application exposes through OS-level accessibility APIs, the same interfaces a screen reader uses, rather than parsing a web DOM or matching pixels. That works on legacy Windows desktop systems that have no browser surface and no API: SAP GUI, Oracle EBS, mainframe terminals, Jack Henry, Fiserv, FIS, Epic, Cerner. The open-source Terminator SDK (github.com/mediar-ai/terminator, MIT, Windows-only) is the same locator-by-role-and-name engine underneath.