Benchmark teardown

Skyvern’s WebVoyager benchmark, line by line: 85.85%, 15 sites, and the surface the benchmark does not measure.

Skyvern 2.0 reports 85.85% on the WebVoyager benchmark, up from roughly 45% on Skyvern 1.0. That number is real. It also describes a very specific surface: 643 natural-language tasks across exactly 15 public consumer websites. Most write-ups quote the headline. This page lists the 15 sites with their task counts (counted from the official JSONL on 2026-05-08), shows where Skyvern sits on the public cross-agent leaderboard, walks through the planner-actor-validator loop the benchmark scores, and is honest about the architectural gap between “works on Booking.com in a Chromium tab” and “works in a SAP GUI window inside a Citrix shell.”

Matthew Diakonov
9 min

Direct answer · verified 2026-05-08

Skyvern 2.0’s WebVoyager score is 85.85%, on a benchmark of 643 tasks across 15 public consumer websites.

Source: Skyvern’s own launch post, reproduced on the Steel.dev cross-agent leaderboard (snapshot dated 30 April 2026). Benchmark dataset: github.com/MinorJerry/WebVoyager, released alongside the WebVoyager paper by He et al. in January 2024.

The progression from 1.0 to 2.0 was architectural, not a model swap. Skyvern 1.0 used a single agent and scored about 45%. Adding a planner agent took it to roughly 68.7%. Adding a validator agent that re-reads the post-action screenshot took it to 85.85%. All three numbers come from the same Skyvern launch post.

85.85%

Skyvern 2.0 achieved a WebVoyager accuracy of 85.85%. All tests were run in Skyvern Cloud with an async cloud browser and GPT-4o plus GPT-4o-mini as the primary decision-making LLMs.

Skyvern launch post, 'Skyvern 2.0: State of the Art in Evals'

~45% · Skyvern 1.0, single-agent baseline
~68.7% · Skyvern 1.5, planner added
85.85% · Skyvern 2.0, planner + validator
643 · WebVoyager tasks across 15 sites

The 15 sites WebVoyager actually covers, with task counts.

The next table is what most coverage of Skyvern’s WebVoyager number leaves out. The 643 figure breaks down into a per-site distribution that is easy to count yourself: fetch the file at raw.githubusercontent.com/MinorJerry/WebVoyager/main/data/WebVoyager_data.jsonl and group by web_name, as in the sketch below. Counted on 2026-05-08, the per-site totals follow.
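
A minimal sketch of that count, assuming only what the public dataset actually provides: one JSON object per line, each carrying a web_name field.

```python
# Count WebVoyager tasks per site from the official dataset.
# The URL and the web_name field come straight from the public
# MinorJerry/WebVoyager repository; nothing else is assumed.
import json
from collections import Counter
from urllib.request import urlopen

URL = ("https://raw.githubusercontent.com/MinorJerry/WebVoyager/"
       "main/data/WebVoyager_data.jsonl")

with urlopen(URL) as resp:
    lines = resp.read().decode().splitlines()

tasks = [json.loads(line) for line in lines if line.strip()]
counts = Counter(task["web_name"] for task in tasks)

for site, n in counts.most_common():
    print(f"{site:22s} {n}")
print(f"{'Total':22s} {sum(counts.values())}")  # expected: 643
```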

| # | Site | Category | Tasks |
|---|------|----------|-------|
| 1 | Wolfram Alpha | Computation | 46 |
| 2 | Allrecipes | Consumer content | 45 |
| 3 | Booking | Travel SaaS | 44 |
| 4 | ESPN | Consumer content | 44 |
| 5 | Apple | Consumer ecommerce | 43 |
| 6 | ArXiv | Academic search | 43 |
| 7 | Cambridge Dictionary | Reference | 43 |
| 8 | Google Search | Search engine | 43 |
| 9 | Huggingface | Developer SaaS | 43 |
| 10 | BBC News | Consumer content | 42 |
| 11 | Coursera | Consumer SaaS | 42 |
| 12 | Google Flights | Travel SaaS | 42 |
| 13 | Amazon | Consumer ecommerce | 41 |
| 14 | GitHub | Developer SaaS | 41 |
| 15 | Google Map | Maps | 41 |
| | Total | 15 sites | 643 |

Two facts worth pulling out before moving on. First, the distribution is roughly flat: every site contributes between 41 and 46 tasks. There is no single site dominating the 643. Second, every one of the 15 is a public consumer-facing web property reachable from a stock Chromium browser. There is no site on the list that requires a corporate VPN, a Citrix client, or an installed Win32 thick-client. That second fact is the load-bearing one for how to read Skyvern’s 85.85%.

Where Skyvern sits on the public leaderboard.

The Steel.dev leaderboard at leaderboard.steel.dev collects published WebVoyager scores for every major web-agent framework. The snapshot below was pulled on 30 April 2026. The leaderboard itself notes that “rows may use different evaluation settings and are not always strict apples-to-apples,” which is honest: most of the variance between adjacent rows is grader configuration, dataset patches, and time-sensitive task drift, not a fundamental capability gap.

| Agent | Org | Score | Note |
|-------|-----|-------|------|
| Jina | Om Labs | 98.9% | |
| Alumnium | Alumnium | 98.6% | |
| Surfer 2 | H Company | 97.1% | |
| Magnitude | Magnitude | 93.9% | |
| AIME Browser-Use | Aime | 92.34% | |
| Surfer-H + Holo1 | H Company | 92.2% | |
| Browserable | Browserable | 90.4% | |
| Browser Use | Browser Use | 89.1% | |
| GLM-5V-Turbo | Z.ai | 88.5% | |
| Agent Kura | Kura | 87.0% | 602/643 tasks (41 removed for invalid/auth issues) |
| Operator | OpenAI | 87.0% | |
| Skyvern 2.0 | Skyvern | 85.85% | This page |
| Project Mariner | Google | 83.5% | |
| Agent-E | Emergence AI | 73.1% | |
| WebSight | Academic research | 68.0% | |
| WebVoyager | Academic research | 59.1% | The original 2024 baseline reported in the WebVoyager paper |
| Anthropic Computer Use 3.5 | Anthropic | 56.0% | Sampled 50 / 602 tasks; not directly comparable to full-suite runs |
| GPT-4 (All Tools) | OpenAI | 30.8% | |

Two rows on this leaderboard deserve a footnote in any honest read. Anthropic’s Computer Use 3.5 row at 56.0% is sampled on 50 of 602 tasks, not the full set; the headline number does not translate one-for-one to a full-suite run. The original WebVoyager baseline at 59.1% is the 2024 academic-paper number and uses the un-patched dataset, including the time-sensitive Booking and Google Flights tasks that have since drifted.

The loop the benchmark scores: plan, act, validate.

The way Skyvern 2.0 went from roughly 45% to 85.85% on WebVoyager is architecturally simple to describe. The framework adds a planner and a validator on either side of the actor agent. The planner decides what sub-step to attempt; the actor fires the click; the validator re-reads the post-action screenshot and decides whether the page actually went where the planner expected. If not, control loops back to the planner with the error context. WebVoyager’s GPT-4V grader watches the entire trajectory and either passes or fails the task at the end.

WebVoyager task lifecycle in Skyvern 2.0

1. Goal: A natural-language instruction lands in the planner, e.g. 'Look up the closest Apple store to ZIP 90038 and check Smart Folio pickup availability.' This is the WebVoyager task shape.

2. Plan: Skyvern's planner agent decomposes the goal into web-page sub-steps. The plan is grounded in the rendered viewport plus the DOM, both of which exist because the surface is a Chromium tab.

3. Act: The actor agent fires a click or a type or a scroll. The action targets a labeled DOM node selected by a vision LLM reading the screenshot. WebVoyager's grader watches the trajectory.

4. Validate: The validator agent reads the post-action screenshot and judges whether the page changed in the expected way. If not, control flows back to the planner with the error. This is what took Skyvern from ~68.7% to 85.85%.

5. Score: GPT-4V scores the trajectory against the WebVoyager reference answer. Pass or fail per task, across all 643 tasks.

Two things follow from this loop being the thing that gets graded. One, every step happens against a rendered Chromium viewport. The planner reads the DOM and screenshot, the actor selects a labeled DOM node, the validator reads a screenshot of the same Chromium tab. Two, the validator is what made the 17-point jump from 1.0 to 2.0 possible: a single-agent loop has no way to know it just took a wrong step, so any error compounds; a separate validator can catch and re-plan around it. The validator is also why the framework feels solid on portals it has not seen before.
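
A minimal sketch of that control shape. Every class and method name here is a hypothetical stand-in, not Skyvern's actual API (its real agent interfaces live in its open-source repo); the point is only to show why a separate validator stops a wrong step from compounding.

```python
# Illustrative plan-act-validate loop. All names are hypothetical
# stand-ins for whatever planner/actor/validator objects a framework
# actually ships; the structure is what matters.
from dataclasses import dataclass

@dataclass
class Verdict:
    ok: bool
    error: str = ""

def run_task(goal: str, planner, actor, validator, browser,
             max_steps: int = 30) -> bool:
    error_context = ""
    for _ in range(max_steps):
        # The planner grounds the next sub-step in the live DOM plus
        # screenshot, and in whatever the validator flagged last time.
        step = planner.next_step(goal, browser.dom(),
                                 browser.screenshot(), error_context)
        if step is None:  # planner believes the goal is complete
            return True
        actor.execute(step, browser)  # click / type / scroll on a labeled DOM node
        verdict: Verdict = validator.judge(step, browser.screenshot())
        # On failure, loop back to the planner with the error instead of
        # blindly continuing -- the single-agent baseline had no such check,
        # which is where the ~45% ceiling came from.
        error_context = "" if verdict.ok else verdict.error
    return False
```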

What WebVoyager is and is not measuring.

A benchmark’s honest contract is “the score I publish carries weight on the surface I tested.” WebVoyager tested 15 public consumer-web sites with an automated GPT-4V grader. The score earns weight on that surface. It earns no weight on a surface the benchmark did not see. The next table is the boundary, side by side. The left column is what WebVoyager does cover (and where Skyvern’s 85.85% is real evidence). The right column is the enterprise legacy desktop work where Mediar lives and where the benchmark is silent by construction.

| Feature | WebVoyager scope (browser-tab agents) | Mediar (desktop accessibility-tree agent) |
|---------|---------------------------------------|-------------------------------------------|
| Public sign-up flow on a consumer SaaS | Yes (e.g. Coursera, Huggingface enrollment). | Not in the dataset; Mediar's target user is already inside a logged-in desktop session. |
| SAP GUI window or SAP Business One | Not in the 15 sites. Has no DOM for a browser agent to read. | First-class. Mediar reads the SAP control surface through Windows UI Automation. |
| Mainframe terminal, AS/400 green-screen | Not in the 15 sites. Renders no DOM to score against. | Supported as a Win32 + accessibility-tree surface, not a screenshot. |
| Jack Henry / Fiserv / FIS core banking | Not in the 15 sites. | Documented in production at community-bank scale. |
| Epic Hyperspace inside a Citrix shell | Not in the 15 sites. The Citrix container makes the rendered DOM invisible to a Chromium-based agent. | Supported via the Windows accessibility tree the Citrix client exposes. |
| Google Flights date search | Yes, 42 tasks. Skyvern 2.0 scores well here. | Out of scope. This is exactly the surface a browser agent like Skyvern is sized for. |
| Reading an Excel spreadsheet the user is editing in place | Not in the 15 sites. | Supported as a desktop surface, not a tab. |
| Filling a form on a consumer travel site | Yes, this is the WRITE-task category Skyvern is strongest at. | Out of scope. |

The boundary is sharp. A SAP GUI window is not a Chromium tab. An Oracle Forms session is not a tab. A Jack Henry green-screen terminal is not a tab. An Epic Hyperspace patient chart inside a Citrix shell renders no DOM the agent can read. None of those surfaces is in WebVoyager, and none of them is in any other browser-agent benchmark either, because the input layer the benchmarks assume (a Chromium-rendered viewport with a stable DOM) is the input layer those surfaces are missing.

How a CFO should read Skyvern’s 85.85% before signing the RPA-replacement contract.

Two questions, in this order. First: of the workflow hours per week your team is trying to take off the floor, what percentage lives inside a Chromium tab on a public or near-public consumer web property? Vendor portal logins, lead enrichment, public-site downloads, payer claim-status checks on a hardened extranet, the long tail of B2B SaaS form-fills. That percentage is the slice of your portfolio where the WebVoyager number transfers cleanly. The Skyvern launch post is not exaggerating: a Chromium-tab agent that scores 85.85% on a 643-task suite of consumer-web tasks is a real piece of capability for that slice.

Second: of the same hours per week, what percentage lives on a Win32 thick-client, a Citrix-shelled enterprise app, a mainframe terminal, or a desktop spreadsheet the user is editing in place? That percentage is the slice where the WebVoyager number says nothing useful. There is no row in the 643 tasks that exercised this surface, no architectural transfer is implied, and no Skyvern tier currently ships a Windows desktop runtime sized for it. A buyer who reads the headline and assumes the score generalizes is quietly importing a benchmark scope mismatch into the contract.

The honest answer for most enterprise teams running existing UiPath or Power Automate workloads is a portfolio split, not a single pick. Browser-tab work goes to the agent that is strong on browser tabs. Desktop work goes to the agent that has a desktop runtime. Mediar’s side of that split prices in wall-clock minutes on Windows ($0.75 per minute of agent time, drawn against a $10,000 turn-key prepay that converts to credits with a small bonus) and drives the desktop through the Windows accessibility tree, the same interface screen readers use. That is a different unit on a different surface. Both can be true on the same purchase order.
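
To make the split concrete, a back-of-envelope sketch. Only the $0.75-per-minute rate comes from the paragraph above; the total hours, the browser/desktop split, and the agent-minutes-per-displaced-hour ratio are loud placeholder assumptions you would replace with your own portfolio numbers.

```python
# Back-of-envelope portfolio split. Only MEDIAR_RATE comes from the
# text above; every other number is an illustrative assumption.
WEEKLY_HOURS = 400        # assumption: total workflow hours/week in scope
BROWSER_SHARE = 0.5       # assumption: half the hours live in a Chromium tab
AGENT_MIN_PER_HOUR = 12   # assumption: agent minutes per human hour displaced
MEDIAR_RATE = 0.75        # $/minute of desktop agent time (from the text)

desktop_hours = WEEKLY_HOURS * (1 - BROWSER_SHARE)
desktop_agent_minutes = desktop_hours * AGENT_MIN_PER_HOUR
weekly_desktop_cost = desktop_agent_minutes * MEDIAR_RATE

print(f"Desktop-side hours/week:    {desktop_hours:.0f}")
print(f"Desktop agent minutes/week: {desktop_agent_minutes:.0f}")
print(f"Desktop agent spend/week:   ${weekly_desktop_cost:,.2f}")
# The browser-tab half is priced in whatever unit the browser agent
# sells (per task, per run), so it is deliberately left out here.
```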

Bring a workflow portfolio that crosses the browser-tab boundary.

If half your weekly hours are in Chromium and half are in SAP GUI, Jack Henry, Oracle Forms, or Epic Hyperspace, the WebVoyager number prices the first half cleanly and is silent on the second half. We can walk a real workflow on each side in twenty minutes and put real numbers behind both.

Frequently asked questions

What is Skyvern's WebVoyager benchmark score?

Skyvern 2.0 reports 85.85% on WebVoyager. The number was published by Skyvern on its own launch blog in January 2025 (skyvern.com/blog/skyvern-2-0-state-of-the-art-web-navigation-with-85-8-on-webvoyager-eval/) and is reproduced on the public Steel.dev leaderboard. Skyvern 1.0 scored roughly 45% on the same set; adding the planner agent took it to about 68.7%; adding the validator agent took it to 85.85%. The runs were executed in Skyvern Cloud against an asynchronous cloud-browser pool, with GPT-4o and GPT-4o-mini as the decision LLMs.

What exactly is the WebVoyager benchmark?

WebVoyager is the academic benchmark released in January 2024 by He et al. (arxiv.org/abs/2401.13919). It contains 643 natural-language tasks distributed across 15 popular public websites. The grader is GPT-4V acting as a 'virtual annotator' judging an agent's trajectory against a ground-truth reference. The original WebVoyager paper reported 59.1% on its own benchmark; modern frameworks now sit between 56% and 99% depending on architecture and dataset adjustments. The dataset and scorer are open-source on github.com/MinorJerry/WebVoyager.

Which 15 websites are in WebVoyager?

Counted directly from the official JSONL (raw.githubusercontent.com/MinorJerry/WebVoyager/main/data/WebVoyager_data.jsonl) on 2026-05-08, the per-site task counts add up to 643: Wolfram Alpha (46), Allrecipes (45), Booking (44), ESPN (44), Apple (43), ArXiv (43), Cambridge Dictionary (43), Google Search (43), Huggingface (43), BBC News (42), Coursera (42), Google Flights (42), Amazon (41), GitHub (41), Google Map (41). Every one is a public consumer-facing web property. None of them is an enterprise system of record.

How does Skyvern's 85.85% compare to other agents on the same benchmark?

On the Steel.dev leaderboard snapshot dated 30 April 2026, the field above Skyvern 2.0 includes Operator at 87.0% (within about a percentage point of Skyvern), Browser Use at 89.1%, Browserable at 90.4%, Magnitude at 93.9%, Surfer 2 at 97.1%, and Jina at 98.9%. Below it, Project Mariner sits at 83.5% and the original 2024 WebVoyager baseline at 59.1%. The leaderboard's own caveat says the rows 'may use different evaluation settings and are not always strict apples-to-apples,' which is honest. Most of the variance between published scores is grader configuration and dataset patches, not a double-digit capability gap.

Does the WebVoyager benchmark cover SAP, Oracle, Jack Henry, or Epic?

No. Every one of the 15 sites in WebVoyager is a public consumer web property reachable from a stock Chromium browser without a corporate VPN, without a Citrix shell, and without an installed thick-client. SAP GUI, SAP B1, Oracle EBS, Oracle Forms, Jack Henry SilverLake, Fiserv DNA, FIS Horizon, Epic Hyperspace, Cerner Millennium, eClinicalWorks, AS/400 green-screens, and mainframe terminals are all out of scope by construction. They render no DOM the agent can read, the screenshot pipeline does not see the same Chromium-rendered viewport, and the WebVoyager task shape does not generalize to a logged-in Win32 thick-client. A high WebVoyager score is honest evidence about a browser-tab surface; it carries no architectural claim about the desktop surface.

If I'm shopping for an RPA replacement, how should I read the 85.85% number?

Read it as 'Skyvern is genuinely strong at goal-directed automation inside a Chromium tab on consumer-grade web properties.' That is exactly what the benchmark measures and exactly what Skyvern 2.0 was built for. If your target workflows live there (vendor portals, lead enrichment from public sites, SaaS form-fills, public-site downloads), the WebVoyager number is good evidence and it deserves its weight in the buying decision. If your target workflows live on the desktop side (SAP GUI, Oracle, Jack Henry, Epic Hyperspace, Excel running locally, mainframe terminals), the number does not transfer. The benchmark does not measure that surface, and no Skyvern tier ships a Windows desktop runtime. The honest read is a portfolio split: which percentage of your hours per week is in a Chromium tab and which percentage is not, and pick the unit that prices each side.

How is Skyvern's WebBench (their own benchmark) different from WebVoyager?

WebVoyager is a 2024 academic benchmark with 643 tasks across 15 sites, reporting accuracy under GPT-4V automated judging. WebBench is a Skyvern-published benchmark from May 2025 with 5,750 tasks across 452 live websites (drawn from the global top-1000 by traffic), of which 2,454 are open-sourced. WebBench separates READ tasks (navigate, fetch data) from WRITE tasks (enter data, log in, download, solve 2FA). Skyvern reports 64.4% overall accuracy on WebBench and is strongest in the WRITE category; Anthropic's Sonnet-3.7 Computer Use leads on READ-only tasks. Both numbers are useful, both have caveats, and both are about browser-tab work.

Can WebVoyager be re-run privately on my own data?

Yes. The grader, scorer, and 643 task definitions are open-source under the WebVoyager repository on GitHub. The scorer requires a GPT-4V API key, and the original benchmark assumed a stock Chromium with a residential IP. Live-site decay is the practical concern: between January 2024 and 2026, several time-sensitive tasks (notably on Booking and Google Flights) drifted, and the CAPTCHA fabric on several Booking and Apple flows changed. Skyvern's reported 85.85% explicitly notes 'eight outdated tasks removed' and 'flight and hotel dates updated.' If you re-run the suite in 2026, expect a similar maintenance pass before the numbers are comparable to a 2024 baseline; a minimal triage sketch follows.
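
A minimal sketch of that triage pass, assuming only the web_name field from the public JSONL. Which individual task IDs count as 'outdated' was a per-run judgment call in Skyvern's report, so this filter keys on the two known drift-prone sites rather than a published ID list.

```python
# Split the suite into stable vs. time-sensitive tasks before a re-run.
# Site names match the web_name values in the public JSONL; treating
# Booking and Google Flights as the drift-prone set follows the text above.
import json

DRIFT_PRONE = {"Booking", "Google Flights"}

stable, needs_review = [], []
with open("WebVoyager_data.jsonl", encoding="utf-8") as f:
    for line in f:
        if not line.strip():
            continue
        task = json.loads(line)
        (needs_review if task["web_name"] in DRIFT_PRONE else stable).append(task)

print(f"stable: {len(stable)}, needs date/availability review: {len(needs_review)}")
# Expected from the 2026-05-08 counts: 643 total, 86 drift-prone (44 + 42).
```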