Browser automation, honestly compared
Skyvern's pros and cons, and the one line every comparison leaves out
Skyvern is one of the better browser-automation agents you can run today. It reads pages with a vision model instead of brittle selectors, it is open source, and it is cheap for simple web tasks. It also shares one hard limit with every tool in its category: it lives inside a browser tab. Here is the practical version of the trade-offs, including where it loses to Browser Use and Browserbase and where all three stop dead.
Short answer
Pros
- Vision-AI reads the page by context, so it self-heals when a site is redesigned (no selectors to maintain).
- Open source (AGPL-3.0), over 20,000 GitHub stars, self-hostable.
- Plain-English and SOP-driven workflows, usable by non-engineers.
- Low per-step cost for short, well-bounded web tasks.
Cons
- Browser-only. It cannot touch a native Windows desktop app (SAP GUI, mainframe terminals, EHR thick clients).
- AGPL-3.0 copyleft can complicate embedding it in a commercial product you ship.
- Per-step pricing climbs as workflows get longer and more complex.
- Highly customized workflows still need real workflow-logic tuning.
Verified 2026-06-16 against github.com/Skyvern-AI/skyvern and skyvern.com/pricing.
What Skyvern is genuinely good at
Most of the criticism of older browser automation is about maintenance. Selenium and Playwright scripts are keyed to selectors, and selectors rot every time a front-end team ships a redesign. Skyvern's answer is to look at the rendered page with a vision model and act on what it sees. That is a real improvement, and it is why the tool earns its place. The honest strengths:
It reads pages the way a person does
Skyvern pairs a vision model with the page so it acts on what is visually on screen, not on a CSS selector or XPath. When a site redesigns a form or moves a button, a script keyed to selectors breaks; a vision agent usually still finds the field. That is the single biggest reason teams pick it over Selenium or Playwright scripts.
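To make the selector-rot point concrete, here is a minimal sketch using only Python's standard library. It is not how Skyvern works internally: looking a field up by its visible label is a crude stand-in for a vision model reading the page by context. The two HTML snippets, the `claim_no` id, and the helper names are all hypothetical.

```python
from html.parser import HTMLParser


class FieldIndex(HTMLParser):
    """Collects input ids and the visible label text that points at each input."""

    def __init__(self):
        super().__init__()
        self.input_ids = set()
        self.label_for = {}       # lowercased label text -> input id
        self._current_for = None  # "for" target of the label being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and "id" in attrs:
            self.input_ids.add(attrs["id"])
        elif tag == "label":
            self._current_for = attrs.get("for")
            self._text = []

    def handle_data(self, data):
        if self._current_for is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "label" and self._current_for is not None:
            key = "".join(self._text).strip().lower()
            self.label_for[key] = self._current_for
            self._current_for = None


def index_fields(html):
    parser = FieldIndex()
    parser.feed(html)
    parser.close()
    return parser


def find_by_id(html, element_id):
    """Selector-style lookup: works only while the id stays the same."""
    return element_id if element_id in index_fields(html).input_ids else None


def find_by_label(html, label_text):
    """Context-style lookup: follows the text a person would read."""
    return index_fields(html).label_for.get(label_text.strip().lower())


# Hypothetical portal form before and after a front-end redesign.
OLD = '<form><label for="claim_no">Claim number</label><input id="claim_no"></form>'
NEW = '<form><div><label for="fld-7f3a">Claim number</label><input id="fld-7f3a"></div></form>'

old_hit = find_by_id(OLD, "claim_no")           # found
new_miss = find_by_id(NEW, "claim_no")          # id was regenerated, lookup fails
new_hit = find_by_label(NEW, "Claim number")    # label text survived the redesign
```

The id-based lookup fails as soon as the redesign regenerates ids, while the lookup keyed to what the user reads still resolves. A vision agent goes further than label matching, but the maintenance argument is the same.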
Open source, real community
The core is on GitHub under AGPL-3.0 with over 20,000 stars. You can run it yourself, read the code, and self-host the workflow engine. That matters if you do not want a black box.
Natural-language and SOP-driven workflows
You describe the task in plain English, upload an existing standard operating procedure, or record yourself doing it. Non-engineers can author a run without writing Playwright by hand.
Per-step usage pricing
Skyvern publishes a free tier plus paid plans and has historically priced per step, so a simple web task is cheap. For light, well-bounded web flows the bill stays small.
Login, 2FA, and structured extraction
It handles authentication (including 2FA), pulls structured data out with JSON schemas, downloads files, and chains tasks in a workflow builder. The web-automation feature set is mature.
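As a rough illustration of what schema-driven extraction buys you, the sketch below defines a small JSON-Schema-style description of a claim record and checks an extracted payload against it. This is not Skyvern's API. The field names, the `CLAIM_SCHEMA` dictionary, and the tiny `check` helper are assumptions for illustration, and the checker covers only `required` and a few basic types, not the full JSON Schema specification.

```python
import json

# Hypothetical schema for the record we want the agent to return.
CLAIM_SCHEMA = {
    "type": "object",
    "required": ["claim_number", "status", "amount"],
    "properties": {
        "claim_number": {"type": "string"},
        "status": {"type": "string"},
        "amount": {"type": "number"},
    },
}


def _matches(value, type_name):
    """True if value fits the named JSON type (only three types supported here)."""
    if type_name == "string":
        return isinstance(value, str)
    if type_name == "number":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if type_name == "boolean":
        return isinstance(value, bool)
    raise ValueError(f"unsupported type in this sketch: {type_name}")


def check(payload, schema):
    """Return a list of problems; an empty list means the payload fits."""
    problems = []
    for key in schema["required"]:
        if key not in payload:
            problems.append(f"missing field: {key}")
    for key, rule in schema["properties"].items():
        if key in payload and not _matches(payload[key], rule["type"]):
            problems.append(f"wrong type for {key}: expected {rule['type']}")
    return problems


good = json.loads('{"claim_number": "C-1042", "status": "open", "amount": 1250.5}')
bad = {"claim_number": "C-1042", "amount": "1250"}

good_problems = check(good, CLAIM_SCHEMA)
bad_problems = check(bad, CLAIM_SCHEMA)
```

Declaring the shape up front means a run that returns a string where a number belongs fails loudly at the boundary instead of silently polluting a downstream system.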
The cons that actually bite in production
None of these make Skyvern a bad tool. They make it a tool with a shape, and the shape matters when you scope a real automation program.
1. It only works inside a browser
This is the one most comparisons skip. Skyvern is built on Playwright and controls Chrome through remote debugging. Its entire automation surface is whatever renders in a web tab. A SAP GUI window, an AS/400 green screen, an Epic Hyperspace client, a VB6 line-of-business app: none of them are web pages, so none of them are visible to Skyvern. If a meaningful slice of your work lives in those apps, a browser agent simply does not reach it.
2. AGPL-3.0 is a strong copyleft
The open-source core is AGPL-3.0. For internal automation that is usually fine. But AGPL's network clause means a modified version offered to users over a network triggers a source-release obligation. If you intend to fork Skyvern and embed it in a product you sell, that is a legal conversation, not a footnote. A permissively licensed tool sidesteps the question.
3. Per-step cost scales with complexity
Usage pricing looks cheap on a five-step demo. A real multi-page workflow with retries, validation, and extraction can be dozens of steps, each one a model call. The per-step model that wins on simple tasks needs to be forecast against your actual step counts before you commit a budget.
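A back-of-the-envelope forecast makes the scaling visible. The sketch below is an assumption-laden model, not Skyvern's billing logic: the price of 0.05 per step is a made-up placeholder (check skyvern.com/pricing for real numbers), and retries are modeled simply as a fraction of extra steps.

```python
def expected_steps(base_steps, retry_rate=0.0):
    """Steps billed per run if a fraction `retry_rate` of steps is retried once."""
    if base_steps < 0 or retry_rate < 0:
        raise ValueError("base_steps and retry_rate must be non-negative")
    return base_steps * (1 + retry_rate)


def monthly_cost(base_steps, price_per_step, runs_per_month, retry_rate=0.0):
    """Forecast monthly spend, in the same currency as price_per_step."""
    if price_per_step < 0 or runs_per_month < 0:
        raise ValueError("price_per_step and runs_per_month must be non-negative")
    return expected_steps(base_steps, retry_rate) * price_per_step * runs_per_month


PRICE_PER_STEP = 0.05   # hypothetical placeholder, not a published price
RUNS_PER_MONTH = 1000   # hypothetical volume

# The five-step demo versus a 60-step production flow with 10% retries.
demo = monthly_cost(5, PRICE_PER_STEP, RUNS_PER_MONTH)
real = monthly_cost(60, PRICE_PER_STEP, RUNS_PER_MONTH, retry_rate=0.10)

print(f"demo flow: {demo:.2f} per month")
print(f"real flow: {real:.2f} per month ({real / demo:.1f}x the demo)")
```

Swap in your own measured step counts, retry rate, and the current published price. The point is the multiplier between the demo and the real flow, not the absolute numbers.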
4. Complex workflows still need tuning
Skyvern's own materials note that optimizing complicated automations takes an understanding of workflow logic, and that heavily customized flows need some technical input. The natural-language front door is real, but production reliability on a gnarly flow is still engineering work.
Skyvern vs the services it competes with
The usual comparison set is Browser Use and Browserbase. Both are worth knowing alongside Skyvern, but notice that the entire comparison happens inside one category: the browser. The last column of the table is a different category entirely, and that is the point.
| Dimension | Skyvern | Browser Use | Browserbase | Accessibility-API agent |
|---|---|---|---|---|
| Automation surface | Browser tab | Browser tab | Browser tab (cloud) | Any Windows app + browser |
| How it sees elements | Vision model on the page | LLM + DOM | Your code (DOM) | OS accessibility tree |
| Reaches SAP GUI / green screens | No | No | No | Yes |
| License | AGPL-3.0 | MIT | Commercial SaaS | MIT (Terminator SDK) |
| Best fit | AI-native web workflows, end to end | Build-your-own web agent | Managed browser fleet | Legacy desktop + web together |
Sources: Skyvern, Mediar Terminator. Licenses and capabilities verified 2026-06-16.
The boundary, made concrete
The same intake task, with the same vision-AI quality, looks very different on each side of the boundary: the moment the workflow leaves the web tab, the tool that drives the browser has nothing to drive.
Where a browser agent ends
A vendor portal in Chrome. Skyvern's vision model reads the rendered fields, fills the claim number, uploads the PDF, and submits. This is exactly what it is built for, and it does it well.
- Renders in a browser tab
- Vision model finds fields by context
- Self-heals when the portal is redesigned
The anchor: what each tool actually controls
You can verify this yourself in a minute. Skyvern's repository (github.com/Skyvern-AI/skyvern) states it is built on Playwright and connects to Chrome over remote debugging. Playwright drives browsers. That is the whole story of its reach: if a thing is not a browser, Playwright does not address it, and neither does Skyvern. This is not a knock on Skyvern. It is what a browser agent is.
The other side of the boundary uses a different OS interface. Mediar built and open-sourced the Terminator SDK (github.com/mediar-ai/terminator, MIT licensed) to read the Windows accessibility tree directly. That is the same data a screen reader consumes: every window, button, edit field, and value an app exposes, named and typed, with no pixels and no selectors. Because it reads structure rather than appearance, it keeps working when a label or layout shifts. That is how a SAP GUI maintenance screen, which Skyvern cannot see at all, becomes something an agent can read field by field.
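To show why reading structure survives a layout change, here is a toy model of an accessibility tree in plain Python. It does not use the Terminator SDK or any real Windows API. The `Node` class, the `find` helper, and the "Change Material" screen are hypothetical, and a real tree would be read from the operating system rather than built by hand. The sketch covers a layout shift only; a renamed label would need fuzzier matching than this exact-name lookup.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Node:
    """One element in a toy accessibility tree: typed, named, optionally valued."""
    role: str
    name: str
    value: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def walk(node: Node) -> Iterator[Node]:
    """Yield the node and every descendant, depth first."""
    yield node
    for child in node.children:
        yield from walk(child)


def find(root: Node, role: str, name: str) -> Optional[Node]:
    """Return the first element with this role and name, wherever it sits."""
    for node in walk(root):
        if node.role == role and node.name == name:
            return node
    return None


# Hypothetical desktop screen, before and after a layout change.
before = Node("window", "Change Material", children=[
    Node("edit", "Material", value="MAT-100"),
    Node("button", "Save"),
])
after = Node("window", "Change Material", children=[
    Node("button", "Save"),
    Node("pane", "Header data", children=[
        Node("edit", "Material", value="MAT-100"),
    ]),
])

field_before = find(before, "edit", "Material")
field_after = find(after, "edit", "Material")
```

The lookup asks for an edit field named "Material" and never mentions position or pixels, so moving the field into a nested pane does not change the result.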
“Browser-based AI agents are great for new SaaS, but if your data lives in SAP GUI or a Jack Henry green-screen they will not help. That is the whole reason the accessibility-API approach exists.”
So which should you pick?
The honest recommendation is not "use Mediar instead of Skyvern." It is to match the tool to where the work lives.
- Your workflows are entirely web-based. Skyvern is a strong choice. Its vision approach beats selector scripts on maintenance, and per-step pricing keeps simple flows cheap. Compare it against Browser Use (if you want to assemble your own agent) and Browserbase (if you mainly need managed browser infrastructure).
- You want to ship a modified version inside a commercial product. Check the AGPL-3.0 obligations first, or pick an MIT-licensed building block.
- Your work is in SAP, a banking core, an EHR, or any native desktop app. No browser agent reaches it. This is the case Mediar exists for: AI agents that watch a desktop workflow once and then run it through Windows accessibility APIs, at roughly 20% of UiPath's cost, with self-healing because there are no pixel matchers or selectors.
- You have both. Run a browser agent for the web slice and an accessibility-API agent for the desktop slice. Forcing one tool across the boundary is where automation programs stall.
Skyvern, in practice
What is the single biggest pro of Skyvern?
It uses a vision model to act on what is rendered on the page rather than on brittle CSS selectors or XPath. When a website redesigns a form or moves a button, selector-based scripts break, but a vision agent usually still locates the field. That self-healing behavior on web UIs is Skyvern's strongest advantage over Selenium and Playwright scripts.
What is the single biggest con of Skyvern?
It only automates inside a browser. Skyvern is built on Playwright and controls Chrome through remote debugging, so its entire reach is bounded by what renders in a web tab. If your workflow touches a native Windows application (SAP GUI, a Jack Henry green screen, an Epic thick client, a VB6 form), Skyvern cannot see it at all. No browser agent can.
How does Skyvern compare to Browser Use and Browserbase?
All three live inside the browser. Browser Use is an open-source library that connects an LLM to a browser and leaves the orchestration to you. Browserbase sells reliable cloud browser infrastructure, but you write the automation logic. Skyvern bundles the vision-AI automation with a workflow builder and a hosted option. The differences are real, but they are differences within the same web-only category. None of them reaches a native desktop app.
Does Skyvern's AGPL-3.0 license matter for my company?
It can. AGPL-3.0 is a strong copyleft license: if you modify Skyvern and offer it to users over a network, the license requires you to release your modified source. For internal use that is usually fine. If you plan to embed a modified Skyvern inside a commercial product you ship to customers, run it past legal first. A permissively licensed option (MIT, for example) avoids that question entirely.
Is per-step pricing cheaper than traditional RPA?
For short, well-bounded web tasks, yes. The catch is that complex multi-page workflows take many steps, and vision-model calls add up, so the per-step model that looks cheap on a 5-step task can climb on a 60-step one. Forecast against your real step counts, not a demo. It is still typically far below a six-figure UiPath implementation for the web-only slice of work.
If I have both web and desktop workflows, what should I use?
Use Skyvern (or a peer) for the web-only flows where its vision approach shines, and an accessibility-API tool for the native desktop apps it cannot reach. Mediar's open-source Terminator SDK (MIT, github.com/mediar-ai/terminator) reads the Windows accessibility tree the way a screen reader does, so it drives SAP GUI, mainframe terminals, and EHR thick clients. Many teams run a browser agent and a desktop agent side by side rather than forcing one tool to do both.