Browser agents, one decision

Skyvern vs Browser Use, and the question both comparisons skip: where does the agent stop?

Every head-to-head on these two counts GitHub stars, names the license, and argues computer vision against the DOM. All true, all below. But Skyvern and Browser Use share one ceiling that none of those pages name: the edge of the browser window. This page gives you the honest pick between the two, then shows you exactly where both of them stop.

Matthew Diakonov, Written with AI

Published June 17, 2026. 7 min read.

Direct answer (verified 2026-06-17)

Pick Browser Use to build agents into your own codebase. Pick Skyvern for hosted, no-code web workflows. Browser Use is an MIT-licensed Python library you import and drive with your own LLM keys. Skyvern is AGPL-3.0 and ships as a platform with a no-code workflow builder, a managed cloud, and computer-vision page reading.

Both are browser-only. Each drives a Chromium tab through the Chrome DevTools Protocol, so neither can act on anything outside the browser window. Verify the licenses at github.com/browser-use/browser-use (MIT) and github.com/Skyvern-AI/skyvern (AGPL-3.0).

The honest head-to-head, inside the browser

On their home turf, the pure-web workflow, these are both good tools and the choice is a matter of fit, not winner-and-loser. Here is the comparison on the dimensions that actually change which one you reach for.

Browser Use

License: MIT, one of the most permissive of the common open-source licenses
Shape: A Python library you import into your own code
Page reading: Accessibility and DOM tree of the page, with visual context
Runtime: You manage the browser infrastructure
Best fit: Developers embedding an agent in an application
Community: The larger GitHub star count of the two by a wide margin

Skyvern

License: AGPL-3.0, copyleft that reaches networked use
Shape: A platform with a no-code workflow builder and managed cloud
Page reading: Computer vision plus LLM, reads the rendered page visually
Runtime: Self-host or use the managed Skyvern cloud
Best fit: Non-developers recording and running web workflows
Community: Smaller star count, with workflow features layered on top

Star counts move daily, so treat community size as a direction, not a number. The license and the shape are the stable facts: MIT library versus AGPL platform.

The ceiling they share

Strip away the license and the vision-versus-DOM debate and both tools are the same architecture: an LLM deciding actions, a real Chromium instance, and the Chrome DevTools Protocol carrying those actions to one browser tab. CDP is a powerful protocol, but it has no concept of the desktop around the browser. That is not a bug in either project. It is the boundary of the entire browser-agent category.
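The shared architecture can be sketched as a short observe, decide, act loop. This is a minimal illustration, not code from either project: FakeTab, TAB_ACTIONS, run_agent, and scripted_policy are hypothetical names, and the stub policy stands in for the LLM. The point it shows is that every action the loop can take is addressed to the tab, so a step that needs an OS window has nowhere to go.

```python
from dataclasses import dataclass, field

# Hypothetical action vocabulary: every entry targets the page inside one tab.
TAB_ACTIONS = {"click", "type", "navigate", "done"}


@dataclass
class FakeTab:
    """Stand-in for a CDP-controlled tab; it only records what it was asked to do."""
    log: list = field(default_factory=list)

    def observe(self) -> str:
        # A real agent would return DOM, accessibility data, or a screenshot here.
        return f"page state after {len(self.log)} actions"

    def perform(self, action: str, target: str) -> None:
        if action not in TAB_ACTIONS:
            # There is no tab-level verb for anything outside the page.
            raise ValueError(f"no tab-level action named {action!r}")
        self.log.append((action, target))


def scripted_policy(plan):
    """Stub standing in for the LLM: replays a fixed plan, then says done."""
    steps = iter(plan)

    def decide(_observation):
        return next(steps, ("done", ""))

    return decide


def run_agent(tab, decide, max_steps=10):
    """Observe the tab, ask the policy for an action, apply it, repeat."""
    for _ in range(max_steps):
        action, target = decide(tab.observe())
        if action == "done":
            break
        tab.perform(action, target)
    return tab.log
```

Feeding the loop a plan of navigate, type, and click steps runs cleanly. Feeding it a made-up step such as ("select_file_in_os_dialog", "claim.pdf") raises, because the vocabulary has no verb for a window outside the tab.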

The clearest way to feel the ceiling is to watch a workflow hit it. Below, the same claims-intake task is written for a browser agent and for Mediar's accessibility-tree runtime. Both fill the web form fine. Then the form says "Attach" and opens the native Windows file picker.

The same workflow, at the moment it leaves the tab

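# browser-use or skyvern: the agent lives inside the browser
import asyncio

from browser_use import Agent


async def run_claims_intake(llm):
    # llm: the chat-model object you configure with your own keys
    agent = Agent(
        task="open the claims portal, fill the new-claim form, "
             "then attach claim.pdf and submit",
        llm=llm,
    )
    await agent.run()

# run it with: asyncio.run(run_claims_intake(llm))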

# the form fills fine. then "Attach" opens the native
# Windows file picker. it is not a DOM element, it is an
# OS window with its own process. the agent's world ends
# at the browser tab. Skyvern, Browser Use, and the
# Playwright/CDP layer underneath them all speak to one
# tab. there is no path out of it.

The anchor: where the boundary is enforced in code

The reason Mediar crosses the edge and a browser agent cannot is one decision in one file. Mediar's open Terminator runtime binds to the operating system accessibility tree (Microsoft UI Automation on Windows), the same interface a screen reader uses, which exposes both the browser DOM and the desktop around it through one selector schema. When a recorded step lands inside Chrome, the converter dispatches it as a fast browser script because prefer_browser_scripts defaults to true in apps/desktop/src-tauri/src/mcp_converter.rs. When the click target leaves the browser process, the same function returns None at line 862 and the executor falls through to UI Automation on the same workflow line.

A browser agent has no equivalent fall-through because it has no binding to fall through to. There is no second surface in CDP. You can check the fall-through yourself in a clone of github.com/mediar-ai/terminator; that source is open, so the claim is verifiable rather than a slide.
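The fall-through described above can be sketched in a few lines. This is a Python illustration of the idea only, not the Rust source in mcp_converter.rs: Step, BROWSER_PROCESSES, to_browser_script, and dispatch are hypothetical names, the selector strings are invented, and only the prefer_browser_scripts flag and the "return None, then fall through" behavior come from the description.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption for illustration: which process names count as "inside the browser".
BROWSER_PROCESSES = {"chrome.exe", "msedge.exe"}


@dataclass(frozen=True)
class Step:
    process: str   # process that owns the click target
    selector: str  # hypothetical selector string


def to_browser_script(step: Step, prefer_browser_scripts: bool = True) -> Optional[str]:
    """Return a browser script for the step, or None if it cannot run in the tab."""
    if not prefer_browser_scripts:
        return None
    if step.process.lower() not in BROWSER_PROCESSES:
        # Target is outside the browser process: nothing to script.
        return None
    return f"document.querySelector({step.selector!r}).click()"


def dispatch(step: Step) -> tuple:
    """Try the fast browser path first; fall through to the OS layer on None."""
    script = to_browser_script(step)
    if script is not None:
        return ("browser_script", script)
    return ("ui_automation", step.selector)
```

Under these assumptions, dispatch(Step("chrome.exe", "#attach")) takes the browser path, while a step owned by a desktop process such as saplogon.exe lands on the ui_automation branch. A browser agent has only the first branch.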

What lives past the browser window edge (and outside both agents)

The native Windows file picker for every upload and Save As dialog
SAP GUI, Oracle EBS, and other thick-client desktop apps with no web layer
Citrix and remote-desktop sessions where the app is delivered as a pixel stream
Jack Henry, Fiserv, and FIS green-screen terminal emulators in banking back offices
Epic Hyperspace and Cerner clinical desktops in regional healthcare
Excel formulas, Acrobat, and the OS clipboard moving data between two desktop apps

None of these are DOM nodes, so none are reachable through the Chrome DevTools Protocol that Skyvern and Browser Use both ride on. They are OS windows, thick clients, and pixel streams, exactly the surfaces an accessibility-tree binding was built to reach.

When each one is the right answer

This is not an argument that browser agents are bad. They are the right tool for an entire class of work, and pretending otherwise would be the same dishonesty as the pages that never mention the edge.

Reach for Browser Use when you are a developer building an agent into a codebase, you want the most permissive license, and the workflow lives inside web apps. Reach for Skyvern when non-developers need to record and run web workflows without writing Python, or when you want a hosted runtime and computer-vision page reading for messy or unfamiliar sites. Reach for Mediar when the workflow crosses the browser window edge even once: a file upload through the OS picker, SAP GUI, a Citrix session, a Jack Henry green screen, an Epic desktop, or any step that ends in a native dialog.

The cheap test is to walk through your highest-volume workflow and count how many times focus leaves the browser. Zero means a browser agent is the lighter choice. One or more means you are either going to wire a second tool in next to it, or use a binding that reaches both surfaces from one recorded file.
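If you already have a rough trace of which application held focus at each step, the count is easy to automate. A minimal sketch, assuming the trace is a list of process names you noted by hand or exported from a recorder; count_off_browser_stretches and the BROWSER_PROCESSES set are hypothetical, and a leading desktop step is counted as a stretch too.

```python
# Assumption for illustration: process names treated as "the browser".
BROWSER_PROCESSES = frozenset({"chrome.exe", "msedge.exe", "firefox.exe"})


def count_off_browser_stretches(focus_trace, browsers=BROWSER_PROCESSES):
    """Count maximal runs of consecutive steps where focus is outside the browser."""
    stretches = 0
    # Start as "in browser" so a workflow that begins on the desktop still counts.
    previous_in_browser = True
    for process in focus_trace:
        in_browser = process.lower() in browsers
        if previous_in_browser and not in_browser:
            stretches += 1
        previous_in_browser = in_browser
    return stretches


trace = [
    "chrome.exe", "chrome.exe",   # fill the web form
    "explorer.exe",               # a step outside the browser
    "chrome.exe",                 # back in the tab
    "excel.exe", "excel.exe",     # a second desktop stretch
    "chrome.exe",
]
off_browser = count_off_browser_stretches(trace)
```

For the trace above the result is 2. A result of 0 is the browser-only case.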

The five-second decision

Building agents in code, browser-only? Browser Use (MIT).

No-code hosted web workflows? Skyvern (AGPL-3.0).

Workflow touches a file picker, SAP, Citrix, or any desktop app? Accessibility-tree runtime.

Getting this wrong looks like a working demo on a clean web form and a stalled rollout the first time a step opens a native window.
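The three questions above reduce to a small decision function. A sketch only: pick_tool and its two boolean inputs are hypothetical names, and the desktop check runs first because a single step outside the browser overrides the other two answers.

```python
def pick_tool(leaves_browser: bool, wants_no_code: bool) -> str:
    """Encode the five-second decision; the desktop question is asked first."""
    if leaves_browser:
        # File picker, SAP, Citrix, or any desktop app anywhere in the workflow.
        return "accessibility-tree runtime"
    if wants_no_code:
        return "Skyvern (AGPL-3.0)"
    return "Browser Use (MIT)"
```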

Bring the workflow. We will count where it leaves the browser.

A 30-minute walkthrough on a real workflow. We show the recorder cross the file picker into a desktop app, and the mcp_converter.rs fall-through that makes it possible. No slides, just the running artifact.

Frequently asked questions

Skyvern or Browser Use, which should I pick?

It comes down to how you want to ship. Browser Use is an MIT-licensed Python library you import into your own codebase and drive with your own LLM keys; it is the better fit when you are building agents into an application and want full control with the most permissive license. Skyvern is AGPL-3.0 and ships as a platform with a no-code workflow builder and a managed cloud, plus computer-vision page understanding that reads a page visually rather than only through the DOM; it is the better fit when non-developers need to record and run workflows without writing Python, or when you want a hosted runtime instead of self-managed browser infrastructure. The AGPL license matters if you plan to embed Skyvern in a closed-source product, because AGPL obligations reach networked use. Both are strong inside their lane. The lane is the browser tab.

Are Skyvern and Browser Use actually that different under the hood?

Less than the marketing suggests. Both drive a real Chromium instance, both call an LLM to decide the next action, and both ultimately speak the Chrome DevTools Protocol to one browser context. Skyvern leans harder on computer vision to interpret the rendered page and bundles a workflow layer on top; Browser Use leans on the accessibility and DOM tree of the page and stays a lean library. But the transport is the same family, and so is the boundary. Neither can act on a window that is not the browser, because CDP has no concept of the desktop around the browser.

What is the browser window edge and why does it decide the comparison?

A browser agent can see and act on anything inside a browser tab: inputs, buttons, links, ARIA regions, the rendered pixels. The moment an interaction leaves the tab, it leaves the agent's world. Clicking Attach opens the native OS file picker, which is a separate window owned by the operating system, not a DOM node. Printing to PDF, an OS sign-in prompt, the downloads bar, a desktop policy-admin app, SAP GUI, a Citrix session, a mainframe terminal: none of these are reachable through CDP. For a workflow that lives entirely inside one modern web app, this never bites. For most enterprise back-office workflows, it bites on step three. That is why the edge, not the star count, is the dimension that decides which tool survives in production.

How does Mediar reach past the browser when Skyvern and Browser Use cannot?

Mediar's open Terminator runtime binds to the operating system accessibility tree, the same interface a screen reader uses: Microsoft UI Automation on Windows. That tree exposes the browser DOM and the desktop around it through one selector schema. When a recorded step lands inside Chrome, the runtime dispatches it as a fast browser script (the prefer_browser_scripts flag defaults to true in apps/desktop/src-tauri/src/mcp_converter.rs). When the click target leaves the browser process, the same converter returns None at line 862 and the executor falls through to UI Automation on the same workflow line. One file, two execution layers, no second tool and no human hand-off. You can clone github.com/mediar-ai/terminator and grep for that fall-through yourself.

If my workflow is browser-only, do I even need Mediar?

No, and we will say so on a call. If your highest-volume workflow runs entirely inside one or more web apps, never opens a file picker, never crosses to Excel for a calculation, never touches a desktop tool, and never logs into a thick client, then a browser agent is the lighter, faster choice. Browser Use is excellent for embedding in code; Skyvern is excellent for hosted no-code web workflows. Mediar earns its place the moment a workflow crosses the browser window edge even once, which is the common case in finance, insurance, banking, and healthcare back offices.

Does Skyvern's computer vision let it click the native file picker?

No. Computer vision changes how Skyvern interprets the rendered page, not what surface it can act on. The capture and the action both run through the browser context. A screenshot of the screen is not the same as having an automation handle on an OS window. To click an item in the native file picker you need an OS-level automation API (UI Automation on Windows), which is a different binding than anything a browser agent ships with. Vision helps Skyvern read messy pages; it does not extend the boundary past the tab.

What about licensing if I want to build a commercial product on top?

Browser Use is MIT, which is one of the most permissive of the common open-source licenses and among the easiest to embed in a closed-source commercial product. Skyvern is AGPL-3.0, whose copyleft obligations extend to software accessed over a network, so building a hosted product on a modified Skyvern can trigger source-disclosure obligations; many teams use Skyvern's managed cloud specifically to sidestep that. Mediar's Terminator runtime is open source at github.com/mediar-ai/terminator, and the commercial product is sold as a turn-key program rather than per-seat licensing. Read each license yourself before you build a business on it; this is a summary, not legal advice.

Can I keep Browser Use or Skyvern and add Mediar only for the desktop steps?

Yes, and some teams do exactly that during a migration. Mediar reaches the browser through the accessibility tree too, so it can run the whole workflow, but there is no rule that you rip out a working browser agent. A common pattern is to let the existing browser agent own the pure-web stretch and let Mediar own the steps that cross into the file picker, the desktop admin app, or the green-screen terminal, then consolidate once the desktop coverage proves out. We would rather you migrate the painful steps first than do a big-bang rewrite.
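The split-ownership pattern amounts to grouping consecutive steps by the surface they run on. A minimal sketch, assuming each recorded step is a (name, process) pair; plan_handoffs, the owner labels, and the BROWSER_PROCESSES set are hypothetical and not part of any of the three tools.

```python
from itertools import groupby

# Assumption for illustration: process names treated as "the browser".
BROWSER_PROCESSES = frozenset({"chrome.exe", "msedge.exe"})


def plan_handoffs(steps, browsers=BROWSER_PROCESSES):
    """Group consecutive (name, process) steps into stretches with one owner each."""

    def owner(step):
        _name, process = step
        return "browser_agent" if process.lower() in browsers else "desktop_runtime"

    # groupby only merges adjacent items, which is what a hand-off plan needs.
    return [(who, [name for name, _ in run]) for who, run in groupby(steps, key=owner)]


workflow = [
    ("open portal", "chrome.exe"),
    ("fill form", "chrome.exe"),
    ("pick file", "explorer.exe"),
    ("submit", "chrome.exe"),
]
handoffs = plan_handoffs(workflow)
```

Each boundary between stretches is a hand-off between tools, so the length of the list is a rough measure of how much glue the hybrid setup needs before consolidating.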

Adjacent guides on the same boundary

Keep reading

Comparison

Skyvern alternative: which one your workflow needs depends on where it runs

The real fork in any Skyvern alternative list is the browser window edge. If your workflow touches SAP GUI, a file picker, or Citrix, the comparison changes.

Architecture

RPA, Selenium vs accessibility tree: where the workflow boundary sits

Selenium owns the page DOM. The accessibility tree owns the OS. The same per-step dispatch logic that decides browser script versus UI Automation, source-cited.

Legacy systems

AI agents on legacy desktop systems with no API

Why the accessibility tree is the only mainstream way to reach SAP GUI, Jack Henry, Fiserv, Epic, Cerner, Oracle EBS, and mainframe terminals.
