Skyvern: a guide

Skyvern, plainly: a Playwright-compatible runtime, three cooperating agents, and a Chromium-tab edge.

Skyvern is an open-source browser-automation framework, AGPL-3.0, built by Skyvern AI Inc. (Y Combinator). The runtime is a Playwright-compatible SDK over a managed Chromium tab. Three cooperating LLM agents decide what to click: a Planner that decomposes the goal, an Actor that fires the browser interaction, and a Validator that confirms the page changed the way the Planner expected. The latest release on github.com/Skyvern-AI/skyvern is v1.0.32 (1 May 2026). Below is what the brand pages skim past: the agent decomposition, the integration matrix, the WebBench score by task category, and the surface boundary the architecture inherits from the browser tab.

Matthew Diakonov · 11 min read

Direct answer, verified 2026-05-06

Skyvern is an open-source (AGPL-3.0) AI browser automation framework from Y Combinator-backed Skyvern AI Inc. It drives a Playwright-controlled Chromium tab through three cooperating LLM agents (Planner, Actor, Validator), supports OpenAI, Anthropic, Azure OpenAI, AWS Bedrock, Gemini, Ollama, and OpenRouter as the model provider, and ships a managed cloud tier with credits-based 2026 pricing on top of the open core. Source: Skyvern-AI/skyvern README.

The agent loop, drawn out.

Most explanations of Skyvern stop at "an LLM looks at the page", which is true and not useful. The actual loop has three named agents, each doing one job, and the value of the architecture is in how they pass control. The Planner takes the plain-English goal and turns it into an ordered list of steps. The Actor takes the next step, picks the element on the live page that matches it, and fires a Playwright click or type. The Validator reads the resulting page and decides whether the step succeeded. If yes, the Actor moves on. If no, the Validator can ask for a local retry, or kick the failure back to the Planner for a new plan. That triangle is the self-correction story: layout drift becomes a Validator failure, not a script break.
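The hand-offs read cleanly as a control loop. Here is a minimal sketch with stubbed agents standing in for the real LLM calls; every name below (plan_steps, act, validate, MAX_LOCAL_RETRIES) is a hypothetical stand-in for illustration, not Skyvern's actual API:

```python
# Illustrative control flow for the Planner/Actor/Validator triangle.
# All names here are hypothetical stand-ins, not Skyvern's real API.

MAX_LOCAL_RETRIES = 2

def plan_steps(goal, page):
    # Planner stub: a real Planner asks an LLM to decompose the goal.
    return goal.split(", ")

def act(step, page):
    # Actor stub: a real Actor fires a Playwright click or type.
    page["done"].append(step)

def validate(step, page):
    # Validator stub: a real Validator reads the next screenshot.
    return step in page["done"]

def run(goal):
    page = {"done": []}
    steps = plan_steps(goal, page)
    i = 0
    while i < len(steps):
        for _ in range(1 + MAX_LOCAL_RETRIES):
            act(steps[i], page)            # fire the step
            if validate(steps[i], page):   # page changed as expected?
                break                      # yes: move on
        else:
            steps = plan_steps(goal, page) # retries exhausted: replan
            i = 0
            continue
        i += 1
    return page["done"]

print(run("open portal, log in, download invoice"))
```

The point of the sketch is the triangle of control: success advances the Actor, local failure retries without replanning, and exhausted retries hand control back to the Planner.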

[Diagram: Skyvern's per-step agent loop, hub view. Inputs to the Planner / Actor / Validator hub: the goal in plain English, the page DOM, and a page screenshot. Outputs: a click in Chromium, extracted JSON, and a validation result.]

The hub does not run once per workflow. It runs once per step. Every click, every type, every dropdown choice goes through a vision-capable LLM call that takes the live screenshot and the rendered DOM, scores candidate elements against the current step, and returns a decision. That property is what makes Skyvern more resilient to layout drift than XPath, and it is also the reason the credits-based pricing has to absorb a variable retry budget.
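In spirit, the per-step decision is a ranking problem: given the current step and the candidate elements on the page, pick the best match. A toy sketch, with a keyword-overlap scorer standing in for the vision-LLM call (score, pick_element, and the element dicts are illustrative, not Skyvern's internals):

```python
# Toy stand-in for the per-step element decision. The real scorer is a
# vision-capable LLM that sees the screenshot and the rendered DOM;
# keyword overlap here only makes the shape of the decision concrete.

def score(step, element):
    step_words = set(step.lower().split())
    elem_words = set((element["text"] + " " + element["aria_label"]).lower().split())
    return len(step_words & elem_words)

def pick_element(step, candidates):
    # Score every candidate against the current step, take the best.
    return max(candidates, key=lambda el: score(step, el))

dom = [
    {"id": "btn-1", "text": "Sign in", "aria_label": "sign in button"},
    {"id": "btn-2", "text": "Create account", "aria_label": "register"},
    {"id": "lnk-1", "text": "Forgot password", "aria_label": "reset"},
]

best = pick_element("click the sign in button", dom)
print(best["id"])  # the element whose text best matches the step
```

Because the scoring happens against the live page on every step, a renamed button or moved form field degrades the score of the wrong candidates rather than breaking a hard-coded selector.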

Goal to output, in five honest stages.

The README describes a multi-agent swarm inspired by BabyAGI and AutoGPT. Read at the level of one workflow run, the path from your goal to a returned result is shorter than the swarm framing suggests, and the boundaries between Planner, Actor, and Validator do most of the work. The Goal lands as a string, the Planner orders it, the Actor fires it, the Validator checks it, and an Output (a downloaded file, a JSON object, a completion event) goes back to the calling code or workflow builder.

One run, end to end

1. Goal: plain English instruction
2. Planner: decomposes into ordered steps
3. Actor: clicks, types, navigates
4. Validator: confirms each step
5. Output: JSON, file, or completion

The thing this picture hides, deliberately, is the inner retry loop. A real run on a hardened portal does not look like a clean five-stop walk. Plan a step, fire it, the Validator notices that the page did not change the way it expected, retry, fire again, validate, move on. Each retry is another vision-LLM call. The credits unit you are billed in absorbs that variance instead of charging it to you per call.
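The cost shape is easy to state: each attempt at a step is at least one more vision-LLM call, so the call count for a run is the step count plus the retries. A back-of-envelope sketch (the one-call-per-attempt pattern is an illustrative assumption; the credits unit exists precisely so this variance is Skyvern's problem, not yours):

```python
# Back-of-envelope: vision-LLM calls for one run, given retries per step.
# The one-call-per-attempt accounting is an illustrative assumption.

def llm_calls(retries_per_step):
    # one call for each first attempt, plus one per retry
    return sum(1 + r for r in retries_per_step)

clean_run = llm_calls([0, 0, 0, 0, 0])   # five steps, no retries
hardened  = llm_calls([0, 2, 1, 3, 0])   # same five steps on a hardened portal

print(clean_run, hardened)  # same workflow, very different inference bills
```

The same five-step workflow can more than double its inference cost on a hostile page, which is why per-call pricing would be hard to forecast and the billing unit is a credit instead.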

3 agents per credit: Planner, Actor, Validator.

"Different workflows consume different amounts of credits depending on runtime, page complexity, retries, and anti-bot measures (CAPTCHA, proxies, geo-targeting)."

Skyvern, on what a credit measures (Day 5 launch post, 30 January 2026).

What the runtime supports, beyond "clicks a button".

The README lists six primitives the runtime supports today, plus loops and conditionals listed as coming soon. Each one is a different shape of work the agent loop is sized for. The ones with the largest commercial gravity are form filling (where Skyvern leads on the WebBench WRITE category) and data extraction (where the JSON-schema interface lets you bypass building a parser).

Browser tasks

A goal in plain language (login to portal X, download invoices for January, save as PDF) gets decomposed by the Planner and executed step by step by the Actor. The repo README lists this as the primary primitive.

Data extraction

JSON-schema-driven extraction. You describe the shape of the data you want, the agent reads the rendered DOM plus the screenshot, and returns a typed object. Skyvern positions this as the cheaper alternative to bespoke parsers.
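The schema-in, typed-object-out contract is the whole interface. A sketch of what an invoice schema might look like, with a minimal shape check standing in for the agent's extraction (the schema content and check_shape are illustrative examples, not Skyvern-provided code; a real deployment would use a proper JSON-schema validator):

```python
# A JSON-schema-shaped extraction request, plus a minimal shape check.
# The schema below is an illustrative example, not from Skyvern's docs.

invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total": {"type": "number"},
        "line_items": {"type": "array"},
    },
    "required": ["invoice_number", "total"],
}

def check_shape(obj, schema):
    # Tiny stand-in for a real validator (e.g. the jsonschema package).
    kinds = {"string": str, "number": (int, float), "array": list, "object": dict}
    for key in schema["required"]:
        if key not in obj:
            return False
    return all(
        isinstance(obj[k], kinds[spec["type"]])
        for k, spec in schema["properties"].items()
        if k in obj
    )

extracted = {"invoice_number": "INV-1042", "total": 311.50, "line_items": []}
print(check_shape(extracted, invoice_schema))  # True for a well-shaped object
```

You describe the shape once; the agent does the reading. The parser you did not have to write is the product.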

Form filling

The category Skyvern is strongest at on the WebBench leaderboard. The Actor pairs labeled fields against your input data, handles dropdowns, accepts dates in the format the page wants, and confirms via the Validator before submitting.

Validation checks

After every Actor step the Validator agent verifies the page changed in the way the Planner expected. If not, it can either retry locally or feed the error back to the Planner for a new plan, which is where the self-correction story comes from.

Loops and conditionals

Listed in the README as coming soon. The current shape is a workflow DSL where loops and branches are added at the workflow-builder level, not inside an individual agent step.

Email and HTTP, plus custom code

Workflows can call out to email (send a notification) and arbitrary HTTP endpoints, and embed custom code blocks for anything the agent shouldn't decide on its own. Useful glue around the agent loop.

The integration matrix, and what each row buys you.

A useful read of any AI agent product is the list of things it plugs into out of the box, because that list tells you what the team has decided is core. Skyvern's integrations cluster in three places: model providers (so you bring your own inference), credentials (so secrets stay in a vault), and orchestration (so a Skyvern run can be a node in an existing automation graph). The MCP server matters separately because it lets a coding agent in your IDE drive Skyvern as a tool.

LLM providers

OpenAI, Anthropic, Azure OpenAI, AWS Bedrock, Gemini, Ollama, OpenRouter. The runtime is provider-agnostic; you bring your own key and pick the model that prices the throughput you need.

Password managers

Bitwarden, 1Password, LastPass. Credentials never round-trip through the LLM as plaintext.

2FA

TOTP, email, and SMS factors. The Actor can read codes that arrive on the same surface and replay them into the login form.

Workflow platforms

Zapier, Make.com, n8n. The cloud product exposes runs as triggers and actions in standard automation graphs.

Model Context Protocol

Skyvern ships an MCP server, which means a coding agent in your IDE can drive Skyvern as a tool without bespoke glue code.

The provider-agnostic LLM list is the most consequential row. A regulated workload that needs an on-prem or VPC inference path can route through Bedrock or Ollama. A team that wants the cheapest plausible model for a high-volume workflow can route through OpenRouter. The runtime does not lock you to one vendor, which is unusual for an agent product and worth naming.
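The consequence of that row is that the model is a pluggable backend, selected by configuration rather than baked in. A hypothetical routing sketch (provider names, env-var keys, and model strings below are illustrative; consult Skyvern's docs for the real configuration variables):

```python
# Hypothetical provider-routing table: the point is that inference is a
# pluggable backend. Keys and model names here are illustrative only.

import os

PROVIDERS = {
    "openai":    {"env_key": "OPENAI_API_KEY",    "example_model": "gpt-4o"},
    "anthropic": {"env_key": "ANTHROPIC_API_KEY", "example_model": "claude-sonnet"},
    "bedrock":   {"env_key": "AWS_ACCESS_KEY_ID", "example_model": "anthropic.claude"},
    "ollama":    {"env_key": None,                "example_model": "llama3"},  # local, no key
}

def pick_provider(name):
    # Fail fast if the chosen vendor's credential is missing.
    cfg = PROVIDERS[name]
    if cfg["env_key"] and not os.environ.get(cfg["env_key"]):
        raise RuntimeError(f"set {cfg['env_key']} before routing through {name}")
    return cfg["example_model"]

print(pick_provider("ollama"))  # the local inference path needs no vendor key
```

The ollama row is the one that matters for regulated workloads: a local model means no page content ever leaves the machine on the inference path.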

The WebBench numbers, by task category.

Skyvern published WebBench on 29 May 2025 as an open benchmark for browser agents: 5,750 tasks across 452 live websites drawn from the global top-1000 by traffic, with 2,454 tasks open-sourced for replication. The headline 64.4% accuracy is the overall figure. The more useful read is the split: WebBench separates READ tasks (navigate and fetch data) from WRITE tasks (enter data, log in, download files, solve 2FA). On WRITE, Skyvern leads the published comparisons. On READ, Anthropic's Sonnet-3.7 Computer Use posts the strongest scores.

64.4% Skyvern, overall WebBench accuracy
5,750 WebBench tasks across 452 sites
2,454 Tasks open-sourced for replication
v1.0.32 Latest release tag, 1 May 2026

The category split matters because WRITE tasks are what enterprise RPA actually buys. Logging into vendor portals, filling out government forms, downloading invoices, solving 2FA on a payer site. Read-only research workflows are interesting; the category that pays the bill is WRITE, and that is where the architecture is currently strongest. None of which removes the boundary the next section is about.

The boundary the architecture inherits.

Every property described above is a property of one specific surface: a managed Chromium tab. Playwright drives the tab. The vision LLM scores elements rendered inside the tab. The Validator reads a screenshot of the tab. Anti-bot tooling (CAPTCHA, proxies, geo-targeting) operates at the tab edge. That tight coupling is what lets the architecture do what it does well, and it is the reason the architecture stops where the tab stops.

A SAP GUI window is not a tab. It is a Win32 process that publishes its UI through Microsoft UI Automation, a separate accessibility surface a Chromium-bound agent has no way to read. An Oracle Forms session is not a tab. A Jack Henry green-screen terminal is not a tab. An Epic Hyperspace patient chart inside a Citrix shell renders no DOM the Skyvern Actor can score against, even when the user is looking at it on a Windows desktop. For all of those, the agent's input surface is missing, so the credits unit has nothing to measure on.

The honest framing is not "Skyvern is wrong". The honest framing is that Skyvern is sized for the open web, where the surface it reads (Chromium DOM plus screenshot) is the surface the work is actually on. For workflows that cross out of the browser tab into closed Windows desktop apps, the surface is Microsoft UI Automation and the natural unit of work is wall-clock time on the OS. That is the gap we (Mediar) cover at $0.75 per minute of runtime. The tools are adjacent, not interchangeable. Most enterprise workflows actually need both.

If your workflow leaves the browser tab, the unit of work changes.

Twenty minutes is enough to walk through where a Chromium-bound agent works for you and where it stops. We will be honest about the line, including the workflows where Skyvern is the right answer.

Frequently asked questions about Skyvern

What is Skyvern and who builds it?

Skyvern is an open-source AI browser automation framework built by Skyvern AI Inc., a Y Combinator-backed company. The codebase lives at github.com/Skyvern-AI/skyvern under an AGPL-3.0 license, the latest release is v1.0.32 (1 May 2026), and the company also runs a managed cloud at skyvern.com that adds proprietary anti-bot tooling, residential proxies, and CAPTCHA solving on top of the open core. The product is positioned as a Playwright-compatible SDK that adds an AI layer for picking the right element on a page, instead of relying on hand-authored XPath or CSS selectors.

How does Skyvern actually decide what to click on a page?

On every step, Skyvern captures the rendered DOM and a screenshot of the live Chromium viewport. Both reach a vision-capable LLM, which scores candidate elements based on their visual rendering, surrounding text, and the goal the Planner is currently working on. The Actor agent fires the resulting click or type through Playwright. The Validator agent reads the next screenshot and decides whether the page changed in the way the Planner expected. The loop runs once per step, which is why the credits-based 2026 pricing is sized in browser-execution credits rather than per LLM call.

What are the three agents inside Skyvern, and what does each one do?

Skyvern uses a multi-agent swarm inspired by BabyAGI and AutoGPT, documented in the README. The Planner takes a high-level goal (log into portal X, pull all invoices for January, save as PDFs) and orders it into a sequence of steps. The Actor agent does the actual browser interactions: clicks, typing, navigation, downloads, file upload, dropdown selection. The Validator agent verifies that each Actor step succeeded; if not, it can either retry locally or kick the error back to the Planner for a new plan. That triangle is the self-correction story: when a portal redesigns its layout overnight, the Validator catches the failure and the Planner adapts, instead of a brittle selector script breaking silently.

What is Skyvern's WebBench score, and what does it mean?

Skyvern reports 64.4% overall accuracy on WebBench, the open benchmark Skyvern itself published on 29 May 2025. WebBench includes 5,750 tasks across 452 live websites (drawn from the global top-1000 by traffic), with 2,454 of them open-sourced for replication. Tasks split into READ (navigate and fetch data) and WRITE (enter data, log in, download files, solve 2FA). Skyvern has the strongest published results on WRITE tasks, which is the category that overlaps most with traditional RPA workloads. Anthropic's Sonnet-3.7 Computer Use leads on READ-only tasks. The headline number is useful for sizing what works today; it is not the same as a guarantee that any specific workflow finishes first try.

Is Skyvern open source, and what does the AGPL-3.0 license actually let me do?

Yes, the runtime is open source under AGPL-3.0 at github.com/Skyvern-AI/skyvern. You can install it with pip install skyvern, run skyvern quickstart, or use docker compose up -d. AGPL-3.0 lets you self-host, modify, and run it on your own infrastructure. The catch is that AGPL extends copyleft to network use: if you offer a modified Skyvern as a hosted service to others, you have to publish your modifications under AGPL too. The cloud product's anti-bot fabric (CAPTCHA solving, residential proxies, geo-targeting) stays proprietary and is not in the public repository, so a self-hosted deployment covers the core agent loop but not those bundled cloud features.

Which LLMs does Skyvern support, and does the choice matter?

The README lists OpenAI, Anthropic, Azure OpenAI, AWS Bedrock, Gemini, Ollama, and OpenRouter. The runtime is provider-agnostic; you supply the key and Skyvern routes the calls. The choice matters in three places. Cost: a hardened portal that needs many vision passes burns inference budget quickly, and the per-1M token rate of the chosen model is the biggest knob you have. Latency: a heavy reasoning model adds wall-clock time per step, which compounds across long flows. Compliance: a regulated workload that requires data residency or an on-prem inference path will pick Bedrock or Ollama for that reason alone. Skyvern itself does not lock you to one vendor.

What integrations does Skyvern ship with out of the box?

Three categories. Password managers: Bitwarden, 1Password, LastPass, so credentials live in your existing vault and are not stored in the workflow file. Authentication: TOTP, email, and SMS 2FA, so an Actor step can fetch and replay a code arriving on the same surface. Workflow platforms: Zapier, Make.com, n8n, so a Skyvern run can be a node in an automation graph elsewhere. Skyvern also ships an MCP (Model Context Protocol) server, which lets a coding agent in your IDE call Skyvern as a tool without writing custom glue.

Where does Skyvern's architecture genuinely fit, and where does it not?

It fits when the workflow lives entirely inside a Chromium tab and the binding constraint is layout drift. Vendor portal logins, lead enrichment from public sites, payer claim status checks, document downloads from a hardened extranet, the long tail of B2B SaaS form fills. The vision-LLM-on-screenshots approach is genuinely more robust to redesigns than XPath, and the WebBench WRITE-task lead is real. It does not fit when the workflow has to leave the browser tab. A SAP GUI window is not a tab. An Oracle Forms session is not a tab. A Jack Henry green-screen terminal is not a tab. An Epic Hyperspace patient chart inside a Citrix shell renders no DOM the agent can read. The agent loop has nothing to score against, because the surface it knows how to read is not the surface the work is on.

What is the practical difference between Skyvern and an OS-level RPA tool like Mediar?

Skyvern reads pixels and DOM inside a managed Chromium instance, decides each click with a vision LLM, and prices in browser-execution credits. Mediar reads the Windows UI Automation accessibility tree directly (the same interface a screen reader uses), records once with a model and then replays without a model in the loop, and prices in wall-clock minutes the agent spends driving Windows controls. The two tools are not competitors on the same surface. Skyvern is sized for the open web; Mediar is sized for the closed desktop apps the open web is glued to (SAP, Oracle Forms, Jack Henry, Fiserv, FIS, Epic, Cerner). Most enterprise workflows actually live across both, which is why the honest answer is usually one of each, not one or the other.

How do I try Skyvern without paying anything?

Two paths. Self-hosted: clone github.com/Skyvern-AI/skyvern, install with pip, supply your own LLM key, and you have the full runtime locally. You pay your inference provider directly, you size your own concurrency, and you do not get the cloud product's bundled anti-bot fabric. Cloud free tier: skyvern.com/pricing publishes a Free tier at $0/month with roughly 1,000 credits (about 170 actions on the published tier averages), which includes basic CAPTCHA solving and is enough to validate whether the runtime fits a specific workflow before committing to Hobby ($29/mo) or Pro ($149/mo).