RPA architecture, one decision

Selenium owns the page DOM. The accessibility tree owns the OS. Pick based on where your workflow boundary actually sits.

Most pages on this comparison frame it as web automation vs desktop automation and tell you to pick one. That is the wrong frame for any real RPA workflow, which crosses the boundary multiple times in one run. The honest version is below, with the source files that implement it.

Matthew Diakonov, Written with AI

Published May 19, 2026, 6 min read

Direct answer (verified 2026-05-19)

Selenium when every step lives inside one page DOM. The accessibility tree (UIA on Windows, AT-SPI on Linux, NSAccessibility on macOS) when any step leaves the page. That includes file pickers, OS sign-in dialogs, the downloads bar, native print preview, Excel, SAP GUI, mainframe terminals, anything with a Win32 or Cocoa window underneath.

The reason most RPA shops end up with two tools wired together (Selenium plus AutoIt or pywinauto) is that Selenium cannot reach past the page. A single accessibility-tree binding reaches both. Microsoft documents the binding at learn.microsoft.com/en-us/windows/win32/winauto/entry-uiauto-win32. Mediar's implementation is open at github.com/mediar-ai/terminator.

Where the boundary actually sits in a real workflow

A claims-intake workflow we shipped last quarter has eleven steps. Two are inside the page DOM of a Chrome tab. Three are inside SAP GUI. Two cross the Windows file picker. One is a print-to-PDF dialog. One copies from Acrobat. Two are inside a desktop policy admin app whose vendor never shipped an API. Selenium reaches step 1 and step 8. Everything else is invisible to it. The accessibility tree reaches all eleven through one selector schema because the browser publishes its DOM through UIA the same way the file picker does.

The diagram below is what the Mediar runtime sees when it loads a recorded workflow. Browser DOM events, native dialog events, and desktop app events all flow into one executor that resolves them against the same live tree. There is no second tool to wire in. The file is one TypeScript file emitted by the authoring layer; the recorder normalizes input into six event types regardless of where on screen the click happened.

[Diagram: one runtime, three input layers]

The same workflow, written twice

Two snippets. The first is Selenium against a real banking core web app, then the moment the click opens a native file picker the script ends and a human takes over (or you wire AutoIt). The second is a Terminator selector against the same workflow. The second snippet crosses the boundary inside one file. The runtime decides per step whether to dispatch as a browser JS click (through the Terminator Bridge Chrome extension, ID ajnnhmahjbbohenflacjkbdeohicnbmg) or as a UIA action.

Selenium vs Terminator on the same workflow

// selenium-webdriver, the boundary is the page DOM
import { Builder, By, until } from "selenium-webdriver";

const driver = await new Builder().forBrowser("chrome").build();
await driver.get("https://core.bank/customer/new");

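// in-page input, this works; wait for the field to render before typing
const customerId = await driver.wait(
  until.elementLocated(By.css("input[name=customer_id]")),
  10_000,
);
await customerId.sendKeys("00112233");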

// "Save and export PDF" opens the native file picker.
// selenium does not see it. WebDriver only owns the page DOM.
await driver.findElement(By.css("button#save-export")).click();

// the next 90 seconds of every workflow we have ever shipped:
// switch to AutoIt, pywinauto, sikulix, or "ask the user to do it".

How the runtime decides per step

The dispatch logic lives in apps/desktop/src-tauri/src/mcp_converter.rs. A click event arrives with the focused UI element captured at recording time. The converter looks at the process name. If the process matches a known browser binary, the click is converted into a browser script and dispatched through the Terminator Bridge extension. If it does not, the function returns None and the executor falls through to the UIA path on the same workflow line.
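The decision described above can be sketched in a few lines. This is an illustration in TypeScript, not a port of mcp_converter.rs: the type names, the browser list, the selector strings, and the rule that a browser click needs a captured CSS selector are all assumptions made for the example. Only the shape (browser process produces a script, anything else returns nothing and falls through to UIA) comes from the description.

```typescript
// Hypothetical types for illustration; the real converter is Rust.
type RecordedClick = {
  processName: string;   // process that owned the focused element at recording time
  uiaSelector: string;   // accessibility-tree selector (format here is illustrative)
  cssSelector?: string;  // assumed to be present only for in-page targets
};

type Dispatch =
  | { kind: "browser_script"; script: string }
  | { kind: "uia_action"; selector: string };

// Assumed list; the source says the real list lives in the converter.
const BROWSER_PROCESSES = new Set(["chrome.exe", "msedge.exe"]);

// Returns null where the Rust function is described as returning None.
function toBrowserScript(
  click: RecordedClick,
  preferBrowserScripts = true,
): string | null {
  const isBrowser = BROWSER_PROCESSES.has(click.processName.toLowerCase());
  // Assumption: without a CSS selector there is nothing to script.
  if (!preferBrowserScripts || !isBrowser || click.cssSelector === undefined) {
    return null;
  }
  return `document.querySelector(${JSON.stringify(click.cssSelector)}).click()`;
}

// One entry point per step: try the browser path, otherwise use UIA.
function dispatch(click: RecordedClick): Dispatch {
  const script = toBrowserScript(click);
  return script !== null
    ? { kind: "browser_script", script }
    : { kind: "uia_action", selector: click.uiaSelector };
}
```

The point of the null return is that the caller never branches on tooling. The same step record goes in, and the fallback is chosen on the same line.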

Grep is the cheapest way to verify this against a vendor claim. Below is the test you can run yourself in a clone of github.com/mediar-ai/terminator.

grep test: where the browser-vs-UIA decision happens

“mcp_converter.rs:443-446 sets prefer_browser_scripts to true by default. Line 862 returns None when the click target leaves the browser process, which sends the executor down the UIA path. The same recorded workflow file runs both layers without a switch in tooling, language, or selector schema.”

apps/desktop/src-tauri/src/mcp_converter.rs, github.com/mediar-ai/terminator

Side by side, on the parts that decide a workflow's lifetime

This is not a checklist of every feature in each tool. It is the seven dimensions that determine whether a workflow holds in production after a year of UI patches, framework migrations, and patch-Tuesday churn.

Feature	Selenium	Accessibility tree (Mediar / Terminator)
What it can see	The DOM of one tab. Cross-origin iframes are sandboxed. Browser chrome (address bar, downloads bar) is invisible.	Every element exposed to assistive technology: page DOM, browser chrome, OS file picker, native dialogs, SAP GUI, Excel, mainframe terminals.
Transport	W3C WebDriver protocol or CDP. JSON over HTTP to a driver binary (chromedriver, geckodriver). Browser is the only consumer.	Microsoft UI Automation COM interface on Windows (the uiautomation crate), AT-SPI on Linux, NSAccessibility on macOS. Same API a screen reader uses.
Where it breaks first	Anything that escapes the page: file uploads via OS dialog, print to PDF, download notification, browser sign-in popup, OS auth prompts, downloads cleared by IT.	Apps that explicitly opt out of accessibility (a small handful of custom-rendered canvas apps). Rare in regulated industries because accessibility opt-out fails compliance review.
Selector stability	CSS or XPath against the DOM. Class refactor, virtualized list, framework migration, all of these break selectors silently.	Four-strategy match cascade in apps/desktop/src-tauri/src/focus_state.rs:168-196. Automation id, then window plus bounds, then visible text, then window focus.
What a recorded workflow looks like	A test script. Selenium IDE emits one .side file scoped to one site. Cross-app workflows are not in scope by design.	A TypeScript file emitted from a recording. The recorder normalizes input into six events in recording_processor.rs: button_click, browser_click, text_input_completed, browser_tab_navigation, application_switch, file_opened.
Failure mode when the UI changes	NoSuchElementException. The script halts. A developer rewrites the selector. Maintenance burden scales with the number of pages automated.	Cascade falls to the next strategy. Label rewrite is absorbed by automation id. Dark-mode reskin is invisible to the tree. AutomationId churn is absorbed by Name plus role.
Workflow file ownership	Source is open (Apache 2.0). But each workflow is yours; the maintenance contract is implicit.	Source is open at github.com/mediar-ai/terminator. The executor crate is 871 lines. You can read what runs on your Windows VM line by line.

Selenium is the right tool for end-to-end testing of a single web app, and for RPA when the workflow truly never leaves the page DOM. The accessibility tree wins as soon as a file picker, native dialog, or desktop app enters the workflow. Most enterprise RPA in the wild lives in the second category.

When Selenium is still the right answer

There is a real case for Selenium in RPA. If the workflow runs entirely inside one modern SaaS app, never opens a file picker, never downloads anything that needs to be moved, never crosses to Excel for a calculation, and never touches a desktop tool, then Selenium (or better, Playwright) is faster to write and faster to run than a UIA agent. New-SaaS-only workflows are the home turf of the browser automation generation: Skyvern, Browser Use, CloudCruise.

The honest test is the boundary count. Walk through your highest-volume workflow with a stopwatch. Count how many times the focus leaves the page. Zero means Selenium is fine. One or more means you are either going to wire a second tool in (Selenium plus AutoIt, Selenium plus pywinauto, Selenium plus a person) or you are going to use a binding that reaches both. The accessibility tree is that binding.
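If you would rather count on paper than with a stopwatch, the boundary count is easy to compute from a step list. The sketch below is one way to do it. The Step shape and the sample workflow are hypothetical, and a run of consecutive off-page steps is counted as a single departure.

```typescript
// Hypothetical step model: "page" is inside one tab's DOM, "native" is anything else.
type Surface = "page" | "native";

interface Step {
  label: string;
  surface: Surface;
}

// Counts how many times the workflow enters a non-page surface.
// A workflow that starts outside the page counts as one departure.
function countBoundaryCrossings(steps: Step[]): number {
  let crossings = 0;
  let previous: Surface = "page";
  for (const step of steps) {
    if (step.surface === "native" && previous === "page") crossings += 1;
    previous = step.surface;
  }
  return crossings;
}

// Illustrative workflow, not a recording from a real system.
const claimsIntake: Step[] = [
  { label: "Read PDF in Acrobat", surface: "native" },
  { label: "Paste into policy admin app", surface: "native" },
  { label: "Open portal in Chrome", surface: "page" },
  { label: "Attach PDF via file picker", surface: "native" },
  { label: "Confirm Save As dialog", surface: "native" },
];

const crossings = countBoundaryCrossings(claimsIntake); // 2
```

Any result above zero puts the workflow in the second category.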

The five-second decision

Workflow lives inside one tab forever? Selenium / Playwright.

Workflow touches a file picker, native dialog, Excel, SAP, or any desktop app, even once? Accessibility tree.

The cost of getting this wrong is a year of maintenance bills against a tool that cannot see the half of the workflow it needs.

Bring the workflow on the call. We will count the boundaries with you.

A 30-minute walkthrough on a real workflow. We show the recorder, the dispatch logic in mcp_converter.rs, and the UIA fallback against your actual screens. No slides, the running artifact.

Frequently asked questions

Why is Selenium the wrong tool for RPA in the first place?

Selenium was built for one job: end-to-end testing of a single web app. The W3C WebDriver protocol owns the page DOM and nothing else. RPA workflows in production span more than one app. A claims intake reads a PDF, pastes data into a desktop policy admin system, copies a quote ID, opens a portal in Chrome, attaches the PDF through the native file picker, and ends in a Save As dialog. Selenium sees one of those six steps. The other five need a different tool, which means you wire AutoIt or pywinauto next to Selenium and now the workflow file is two languages with no shared selector model. The accessibility tree approach handles all six steps in one binding because Microsoft UI Automation surfaces the browser DOM and the OS chrome through the same tree.

Doesn't UiPath solve this by bundling WebAutomation plus UIA?

UiPath does ship both, but as two separate activity libraries (UIPath.WebAutomation and UIPath.UIAutomation) that emit different selector formats. A workflow that crosses the browser-desktop boundary has to switch activity sets, which means two selector schemas, two debugging surfaces, and two maintenance code paths. A modern accessibility-tree agent reads the tree directly through Microsoft UI Automation, which already covers the browser content because Edge and Chrome expose their DOM through UIA for screen readers. One binding, one selector schema. The fragility of UiPath cross-boundary workflows is what drives a lot of the migrations we see; see the breakdown in the UiPath alternative guide linked below.

Can the accessibility tree see what Selenium sees inside a Chrome tab?

Yes, because Chromium publishes its render tree through the OS accessibility binding. On Windows, Chrome exposes a UIA tree mirroring the DOM (the same tree NVDA and Narrator read). Every input, button, link, and ARIA-labelled region is reachable from the desktop-side tree. The Mediar runtime prefers browser scripts when the focused element is in a browser process (mcp_converter.rs:443-446, the prefer_browser_scripts flag is true by default) because JS clicks are faster and more deterministic for in-page interaction. When the click target leaves the browser (file picker, OS dialog, downloads bar), the same converter falls through to UIA at mcp_converter.rs:862. Same workflow file, two execution layers, no developer hand-off.

Is the accessibility tree slower than Selenium?

Per call, yes, slightly. A UIA tree walk on a complex Windows app takes tens of milliseconds; a Selenium findElement on a focused page takes single-digit milliseconds. At workflow scale this rarely matters because the bottleneck is the app under automation, not the binding. A claims-intake step that types into SAP, waits for a server round-trip, and tabs through five fields spends 200 ms in UIA calls and 12 seconds waiting for SAP, so the binding accounts for under 2% of the step. The latency that does matter, an LLM call on every step, is the one to watch. Mediar's executor at crates/executor/src/services/typescript_executor.rs has zero LLM calls in the replay loop; grep for openai, gemini, anthropic, or claude in that file and the count is 0.
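The same check works against any vendor's executor source, not only this one. Here is a minimal sketch of the count, assuming you have the file contents as a string (for example from Node's readFileSync). The function name is made up for the example, and a case-insensitive substring match is a blunt instrument: it will also count mentions in comments.

```typescript
// Counts case-insensitive occurrences of each term in a source string.
function countVendorMentions(
  source: string,
  terms: string[],
): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const term of terms) {
    // Escape regex metacharacters so terms are matched literally.
    const escaped = term.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    const matches = source.match(new RegExp(escaped, "gi"));
    counts[term] = matches === null ? 0 : matches.length;
  }
  return counts;
}

// Illustrative input, not taken from any real file.
const sample = "let client = OpenAI::new(); // openai chat call";
const mentions = countVendorMentions(sample, ["openai", "gemini", "anthropic", "claude"]);
// mentions.openai is 2, the other three are 0
```

A replay loop that is free of model calls should report zero for every term.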

What about Playwright? Is it different from Selenium for this comparison?

Architecturally, no. Playwright is faster and has nicer ergonomics, but the boundary is the same: the page DOM through CDP or a browser-specific driver. The moment the workflow leaves the browser, Playwright is in the same position Selenium is. The accessibility tree is the only mainstream Windows binding that crosses the boundary in one selector schema. If your workflow is genuinely browser-only and never touches a file picker, downloads bar, OS sign-in dialog, or desktop app, Playwright is the better web tool. If any step leaves the browser, you want UIA.

How does the recorder decide whether a click is a browser click or a desktop click?

The recorder reads the process name of the focused window at the moment of the click. If the process is chrome.exe, msedge.exe, or another known browser binary (the list is in apps/desktop/src-tauri/src/mcp_converter.rs around line 105), the click is normalized as a browser_click event with both UIA coordinates and a CSS selector emitted by the Terminator Bridge extension (Chrome Web Store ID ajnnhmahjbbohenflacjkbdeohicnbmg, registered in chrome_extension.rs:9). Otherwise it is a button_click event with UIA selector only. The list of normalized events is in recording_processor.rs around line 248: button_click, browser_click, text_input_completed, browser_tab_navigation, application_switch, file_opened.
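For readers who think in types, the six normalized events map naturally onto a discriminated union. The sketch below is a TypeScript illustration of the recording-side rule, not the Rust in recording_processor.rs. The event names come from the text above, while every payload field, the RawClick shape, and the two-entry browser list are assumptions.

```typescript
// Event names are from the article; payload fields are hypothetical.
type WorkflowEvent =
  | { type: "button_click"; uiaSelector: string }
  | { type: "browser_click"; x: number; y: number; cssSelector: string }
  | { type: "text_input_completed"; uiaSelector: string; text: string }
  | { type: "browser_tab_navigation"; url: string }
  | { type: "application_switch"; processName: string }
  | { type: "file_opened"; path: string };

// Hypothetical raw input captured at the moment of the click.
interface RawClick {
  processName: string;
  x: number;
  y: number;
  uiaSelector: string;
  cssSelector?: string; // assumed to be supplied by the browser extension
}

// Assumed subset of the known-browser list.
const KNOWN_BROWSERS = ["chrome.exe", "msedge.exe"];

function normalizeClick(raw: RawClick): WorkflowEvent {
  const inBrowser = KNOWN_BROWSERS.includes(raw.processName.toLowerCase());
  if (inBrowser && raw.cssSelector !== undefined) {
    // Browser click: coordinates plus the CSS selector.
    return { type: "browser_click", x: raw.x, y: raw.y, cssSelector: raw.cssSelector };
  }
  // Everything else: UIA selector only.
  return { type: "button_click", uiaSelector: raw.uiaSelector };
}
```

Because both click variants live in one union, a single replay loop can switch on the type field without knowing which tool produced the event.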

Does this work on SAP GUI, Jack Henry, Fiserv, Epic, Cerner?

Yes. Those systems expose UIA elements because they have to for accessibility compliance. SAP GUI publishes its controls through a UIA provider; Jack Henry and Fiserv terminal emulators run inside Win32 host windows that expose Edit and Button controls; Epic Hyperspace exposes its panels through the same provider. The Mediar binding reads them directly. Selenium cannot reach any of these systems because none of them are browsers. This is the whole reason the accessibility tree approach exists for enterprise RPA rather than for new SaaS workflows.

If Selenium IDE records a workflow, why can't I just use that for RPA?

Selenium IDE records DOM events: click on selector, type into selector, select option. A real RPA workflow records intent across processes. A claims-intake recording captures a PDF open in Acrobat, a copy from a specific field, a paste into a desktop admin system, a tab switch to Chrome, a click on a Save button that opens a file picker. Selenium IDE sees zero of those events because none of them happen inside one page. A purpose-built RPA recorder sees all of them because it hooks the OS accessibility events, not the browser DOM events. The Mediar recorder, apps/desktop/src-tauri/src/workflow_recorder.rs, is built on this distinction.

What's the maintenance difference once the workflow is shipped?

Selenium scripts break on framework upgrades, virtualization changes, A/B tests, and class refactors. Maintenance hours scale linearly with the number of pages automated. Accessibility-tree workflows mostly survive a label rewrite, dark mode, panel reorder, and AutomationId churn because the four-strategy match cascade in focus_state.rs:168-196 walks automation id, then window plus bounds, then visible text, then window focus before failing. The maintenance hours go down as the cascade absorbs more drift. The first thing a buyer should compute is the maintenance line in their existing RPA contract; that is the line that compresses when the binding changes.
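A cascade like this is simple to picture as an ordered list of predicates where the first strategy that finds a match wins. The sketch below is an assumption-heavy illustration, not the code in focus_state.rs. The strategy order is from the text, but the node shape and what each strategy actually compares (exact bounds, exact text, window title) are guesses made for the example.

```typescript
type Rect = { x: number; y: number; width: number; height: number };

// Hypothetical snapshot of an element, used for both recorded and live nodes.
interface UiNode {
  windowTitle: string;
  automationId?: string;
  text?: string;
  bounds: Rect;
}

type Strategy = "automation_id" | "window_bounds" | "visible_text" | "window_focus";

const sameRect = (a: Rect, b: Rect): boolean =>
  a.x === b.x && a.y === b.y && a.width === b.width && a.height === b.height;

// Ordered as described: automation id, window plus bounds, visible text, window focus.
const CASCADE: Array<[Strategy, (rec: UiNode, live: UiNode) => boolean]> = [
  ["automation_id", (r, l) => r.automationId !== undefined && r.automationId === l.automationId],
  ["window_bounds", (r, l) => r.windowTitle === l.windowTitle && sameRect(r.bounds, l.bounds)],
  ["visible_text", (r, l) => r.text !== undefined && r.text === l.text],
  ["window_focus", (r, l) => r.windowTitle === l.windowTitle],
];

// Returns the first match and the strategy that produced it, or null if all four fail.
function resolve(
  recorded: UiNode,
  liveTree: UiNode[],
): { strategy: Strategy; node: UiNode } | null {
  for (const [strategy, matches] of CASCADE) {
    const node = liveTree.find((live) => matches(recorded, live));
    if (node !== undefined) return { strategy, node };
  }
  return null;
}
```

Returning the winning strategy alongside the node is a deliberate choice in this sketch: logging how often a workflow resolves on a later strategy is a cheap way to see drift before it becomes a failure.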

Adjacent guides on the same topic

UiPath alternative, accessibility API agents: a five-question test

Most vendors selling AI agents on accessibility APIs are still selector-based underneath. Five questions to ask, with the Mediar source files that answer each one.
