What the agent actually says to Windows

RPA, legacy desktop, accessibility agents: the wire format

Every page on this topic says the same thing: “we use accessibility APIs instead of pixel matching.” None of them show you the call. This one does. Eight MCP tool names, the selector grammar, the three-fallback cascade, and the file in the open-source runtime where each piece lives.

Matthew Diakonov, Written with AI

Published May 22, 2026 · 7 min

Direct answer (verified 2026-05-22)

An RPA agent that drives a legacy Windows desktop app via accessibility APIs is an agent that emits structured tool calls against the OS accessibility tree (Windows UI Automation) instead of pixel matching or fragile control-tree paths. The recorder captures a single human walkthrough, the model converts events to a deterministic TypeScript workflow, and the executor replays it by firing one of eight MCP tools (click_element, type_into_element, press_key_global, activate_element, scroll_element, navigate_browser, execute_browser_script, run_command) at the OS, with a three-fallback selector cascade per call.

Reference implementation, MIT-licensed: github.com/mediar-ai/terminator. The MCP converter that emits the JSON below is apps/desktop/src-tauri/src/mcp_converter.rs.

What one tool call actually looks like

Below is a click against the “Customer” field of the SAP Customer Master window. The shape is the same for Oracle EBS, Jack Henry, Epic Hyperspace, and a 30-year-old VB6 internal app. The process name comes from the OS, and the selector compiles to a UIA TreeWalker call. The fallback list is a comma-separated string rather than an array; that is by design, so that a workflow file diffs cleanly in git.

workflow.ts (excerpt: one of N steps)
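A minimal sketch of the shape such a step might take. The tool name, the selector syntax, and the three default values come from this article; the envelope keys (`tool_name`, `arguments`, `process`, `selector`, `fallback_selectors`), the choice of primary versus fallback selectors, and the process name `saplogon.exe` are assumptions for illustration and should be checked against mcp_converter.rs.

```typescript
// Hypothetical step envelope: key names are assumptions, not the converter's confirmed schema.
interface ClickStep {
  tool_name: "click_element";
  arguments: {
    process: string;            // process name as reported by the OS (illustrative value below)
    selector: string;           // primary selector
    fallback_selectors: string; // comma-separated string, not an array
    verify_timeout_ms: number;
    timeout_ms: number;
    highlight_before_action: boolean;
  };
}

const clickCustomerField: ClickStep = {
  tool_name: "click_element",
  arguments: {
    process: "saplogon.exe",
    selector: "role:Edit && text:Customer",
    fallback_selectors:
      "role:Window && text:Customer Master >> role:Edit && text:Customer,role:Edit",
    verify_timeout_ms: 2000,
    timeout_ms: 3000,
    highlight_before_action: true,
  },
};

console.log(JSON.stringify(clickCustomerField, null, 2));
```

Because the fallbacks are one string, adding or removing a fallback changes a single line of the file, which is the diff-friendliness the paragraph above describes.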

Typing the value into that same field is the next step, same envelope, different tool:

workflow.ts (next step)
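A sketch of what the follow-up step could look like, under the same assumptions about envelope keys. The argument name `text_to_type` and the customer number are hypothetical; only the tool name and the default values are taken from this article.

```typescript
// Hypothetical follow-up step: same envelope, different tool.
// `text_to_type` is an assumed argument name; the customer number is made up.
const typeCustomerNumber = {
  tool_name: "type_into_element",
  arguments: {
    process: "saplogon.exe",
    selector: "role:Edit && text:Customer",
    fallback_selectors:
      "role:Window && text:Customer Master >> role:Edit && text:Customer,role:Edit",
    text_to_type: "100234",
    verify_timeout_ms: 2000,
    timeout_ms: 3000,
    highlight_before_action: true,
  },
};

console.log(typeCustomerNumber.tool_name, "->", typeCustomerNumber.arguments.text_to_type);
```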

The defaults you see (verify_timeout_ms: 2000, timeout_ms: 3000, highlight_before_action: true) are set by helper functions in mcp_converter.rs around line 111. The default ceiling on the fallback cascade is three (max_fallback_strategies = 3, in the same file around line 30). Nothing here is invented for the page; you can grep the repo.
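To make the defaults concrete, here is a small sketch of how a converter might fill them in and cap the fallback list. The three values and the ceiling of three are from the paragraph above; `DEFAULTS`, `MAX_FALLBACK_STRATEGIES`, `StepArgs`, and `applyDefaults` are hypothetical names, not the helpers in mcp_converter.rs.

```typescript
// Values from the article; every identifier here is hypothetical.
const DEFAULTS = {
  verify_timeout_ms: 2000,
  timeout_ms: 3000,
  highlight_before_action: true,
};
const MAX_FALLBACK_STRATEGIES = 3;

interface StepArgs {
  selector: string;
  fallback_selectors: string; // comma-separated
  verify_timeout_ms?: number;
  timeout_ms?: number;
  highlight_before_action?: boolean;
}

// Fill in missing fields and cap the fallback list at the default ceiling.
function applyDefaults(args: StepArgs) {
  const capped = args.fallback_selectors
    .split(",")
    .map((s) => s.trim())
    .filter((s) => s.length > 0)
    .slice(0, MAX_FALLBACK_STRATEGIES)
    .join(",");
  return { ...DEFAULTS, ...args, fallback_selectors: capped };
}

const step = applyDefaults({
  selector: "role:Edit && text:Customer",
  fallback_selectors: "a,b,c,d", // placeholder selectors; the fourth is dropped
  timeout_ms: 5000,              // an explicit value overrides the default
});
console.log(step);
```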

The eight MCP tools the recorder emits

The whole vocabulary an accessibility-API agent uses against a legacy Windows desktop app fits in eight tool names. Every step of a workflow recording becomes one of these, plus a selector, plus a fallback list. There is no “and then the model decides what to click next” in the hot path; the decisions were made by the human walking the workflow, frozen into the .ts file, and now replayed.

click_element

Resolve a control on the accessibility tree, then drive a click at the center of its bounding box. The primary verb. ~70% of an enterprise workflow is clicks against this tool.

type_into_element

Find an Edit or ComboBox by selector, focus it, send a Unicode string. The text goes through the same input pipeline a human keyboard would, so SAP, Oracle, and Win32 all validate it the way they validate a clerk.

press_key_global

Send a key sequence (F4, Ctrl+S, Tab, Enter) without targeting an element. Required for legacy apps where the form is driven entirely by keyboard shortcuts and the menus are decorative.

activate_element

Bring a window or pane to the foreground. SAP GUI, Citrix-published clients, and most banking cores require the right window to be active before the next click registers.

navigate_browser

Drive a URL change in a Chromium-based browser when one is actually in the workflow (e.g. an internal portal sandwiched between SAP and a billing system). Optional, not the main verb on a legacy desktop run.

scroll_element

Scroll a Pane, List, DataGrid, or Document node. Required for line-item grids in SAP B1, the Epic patient list, and the long-form Citrix-published forms on Jack Henry.

execute_browser_script

Inject and run JavaScript when the workflow crosses into a browser surface. Out of scope for a pure desktop run; included for hybrid flows.

run_command

Shell out to a process (PowerShell, cmd, a helper binary). Used for the edges: copying a file, calling a CLI, or invoking a Python script the workflow needs at one step.
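Since the vocabulary is closed, it can be expressed as a type. The sketch below encodes the eight names from this article as a TypeScript union with a guard that rejects anything else; the `surface` grouping is a hypothetical helper that follows the descriptions above, not part of the runtime.

```typescript
// The eight tool names listed in this article, as a closed set.
const TOOL_NAMES = [
  "click_element",
  "type_into_element",
  "press_key_global",
  "activate_element",
  "scroll_element",
  "navigate_browser",
  "execute_browser_script",
  "run_command",
] as const;

type ToolName = (typeof TOOL_NAMES)[number];

// Type guard: rejects any step whose tool is outside the eight-name vocabulary.
function isToolName(value: string): value is ToolName {
  return (TOOL_NAMES as readonly string[]).indexOf(value) !== -1;
}

// Hypothetical grouping, following the article's description of each tool.
function surface(tool: ToolName): "desktop" | "browser" | "shell" {
  switch (tool) {
    case "navigate_browser":
    case "execute_browser_script":
      return "browser";
    case "run_command":
      return "shell";
    default:
      return "desktop";
  }
}

console.log(isToolName("click_element"), isToolName("decide_next_click"));
console.log(surface("scroll_element"));
```

A workflow loader built this way can fail fast on an unknown tool name at load time instead of discovering it mid-run.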

The call graph end to end

Six participants, no LLM in the loop after recording. The model ran once, on the screen-share that produced the workflow file. Production replays go straight from the executor through UI Automation to the legacy app and back.

Recording once, replaying deterministically

The verifier is the last hop on every step. Each tool call carries a verify_element_exists / verify_element_not_exists predicate with a 2000ms timeout. If the next expected state never appears, the step is logged as a soft failure and the fallback cascade activates. If every fallback misses, the run is flagged for human review rather than silently posting against the wrong field. The conservative default exists because the buyers are CFOs running general ledgers and CIOs running clinical charts.
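The control flow of that paragraph can be sketched in a few lines. This is an illustration under assumptions, not the executor's code: `attempt` is a hypothetical callback standing in for "run the action with this selector and confirm the expected post-state within the verify timeout", and the outcome shape is invented for the example.

```typescript
type Outcome =
  | { status: "ok"; used: string; fallbackIndex: number } // -1 means the primary succeeded
  | { status: "needs_human_review"; tried: string[] };

// Try the primary selector, then each fallback in order; stop at the first
// attempt whose post-state verifies. `attempt` is a stand-in for the real
// action-plus-verification call.
function runWithCascade(
  primary: string,
  fallbackSelectors: string, // comma-separated, as in the workflow file
  attempt: (selector: string) => boolean,
  maxFallbacks = 3,
): Outcome {
  const fallbacks = fallbackSelectors
    .split(",")
    .map((s) => s.trim())
    .filter((s) => s.length > 0)
    .slice(0, maxFallbacks);
  const candidates = [primary, ...fallbacks];
  let index = -1;
  for (const selector of candidates) {
    if (attempt(selector)) {
      return { status: "ok", used: selector, fallbackIndex: index };
    }
    index++;
  }
  // Every selector missed: stop rather than act on a guess.
  return { status: "needs_human_review", tried: candidates };
}

// Simulated UI after a patch: the scoped path no longer matches,
// but the unscoped field still exists.
const live = new Set(["role:Edit && text:Customer"]);
const result = runWithCascade(
  "role:Window && text:Customer Master >> role:Edit && text:Customer",
  "role:Edit && text:Customer,role:Edit",
  (s) => live.has(s),
);
console.log(result);
```

Returning a distinct `needs_human_review` outcome, rather than throwing or picking the closest match, mirrors the conservative default described above.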

The selector grammar in one screen

Three primitives compose every selector on the page. They map one-to-one to properties UIA already publishes for every Windows control, so screen readers, accessibility test tools, and an agent are all reading the same surface.

role:X

Match by UIA ControlType: Edit, Button, ComboBox, Window, Pane, Tab, ListItem, Document, DataItem. Survives label rewrites because the underlying control kind does not change in a normal patch.

role:X && text:Y

Narrow by the UIA Name property, which is the human-readable label the form definition published. For SAP GUI, Oracle Forms, mainframe terminals, and Win32 apps, Name comes from the form designer and stays stable across themes and DPI.

role:Window && text:Customer Master >> role:Edit && text:Customer

Scoped search. The >> operator restricts the inner match to descendants of the outer match. Prevents the agent from clicking a Customer field in a popup modal that happened to steal focus, or in the wrong tab of a tabbed SAP screen.

The generators that produce these are generate_primary_selector and generate_scoped_selector in mcp_converter.rs (around lines 1723 and 1772). The full path syntax with chained >> steps is logged as “chained selector with N levels.”
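To show how the three primitives compose, here is a toy parser that splits a selector into one level per `>>` step. It handles only `role:` and `text:` clauses joined by `&&`, assumes every level names a role, and does not handle names that themselves contain `&&` or `>>`; the runtime's actual parser may differ on all of these points.

```typescript
interface Level {
  role: string;
  text?: string;
}

// Parse "role:X && text:Y >> role:Z" into one Level per ">>" step.
// Covers only the three primitives described in this article.
function parseSelector(selector: string): Level[] {
  return selector.split(">>").map((part) => {
    const level: Level = { role: "" };
    for (const clause of part.split("&&")) {
      const trimmed = clause.trim();
      if (trimmed.startsWith("role:")) level.role = trimmed.slice(5).trim();
      else if (trimmed.startsWith("text:")) level.text = trimmed.slice(5).trim();
      else throw new Error(`Unsupported clause: "${trimmed}"`);
    }
    if (level.role === "") throw new Error(`Missing role in: "${part.trim()}"`);
    return level;
  });
}

const levels = parseSelector(
  "role:Window && text:Customer Master >> role:Edit && text:Customer",
);
console.log(`chained selector with ${levels.length} levels`, levels);
```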

$0.75 / min

“The runtime is priced per minute of execution, not per seat. A 5-minute claim intake costs $3.75. The $10K turn-key program fee converts to credits with a bonus, so it is effectively prepaid usage. No certified-developer line item, no $250K annual maintenance bill for selector breaks.”

Mediar pricing
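The arithmetic behind the quoted figure, as a quick check. The rate and the $10,000 fee are from this page; the function names are hypothetical, and the size of the credit bonus is not stated, so the second figure is computed before any bonus.

```typescript
// Rate from the article, held in integer cents to avoid floating-point drift.
const RATE_CENTS_PER_MINUTE = 75;

// Cost in US dollars for a run of the given length.
function runCostUsd(minutes: number): number {
  return Math.round(minutes * RATE_CENTS_PER_MINUTE) / 100;
}

// Whole minutes of runtime a prepaid amount covers, before any bonus.
function minutesCovered(prepaidUsd: number): number {
  return Math.floor((prepaidUsd * 100) / RATE_CENTS_PER_MINUTE);
}

console.log(runCostUsd(5));         // 3.75
console.log(minutesCovered(10000)); // 13333
```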

Why this shape works on legacy apps specifically

The selector grammar above maps directly to properties that Windows accessibility APIs have exposed since the late 1990s, first through Microsoft Active Accessibility and later through UI Automation. That was the period when Section 508 made accessibility a procurement-blocking regulatory bar. Win32, MFC, WinForms, WPF, and even modern UWP surfaces all publish a UIA tree. SAP GUI does. Jack Henry's green-screen-in-a-window does. Epic Hyperspace does, including inside a Citrix-published session. Reflection, BlueZone, and PuTTY do, with patches as old as twenty years still publishing a clean tree.

What this means for a buyer: the surface area an accessibility-API agent can reach is identical to the surface area a blind employee can use through a screen reader. The regulatory bar sets the floor. If your clerk can read it to a customer over the phone using JAWS, the agent can drive it.

The contrast with browser-based AI agents (the Skyvern/CloudCruise/Browser Use generation) is the part most procurement decks miss. Those tools are excellent for web SaaS. They reach exactly zero of the surfaces above, because none of those surfaces render into a Chromium DOM. A clean way to see the difference for yourself: open the SAP GUI logon screen and try to point any headless-browser agent at it. There is nothing for it to attach to.

Have a legacy desktop workflow you want quoted in minutes, not months?

Book 25 minutes. We will record one pass of your workflow on a screen-share, show you the authored TypeScript file, and price the runtime per minute. No slides.

Common questions

What is an RPA agent that runs on legacy desktop apps via accessibility APIs?

An agent that watches a workflow once, then replays it by emitting structured tool calls (click_element, type_into_element, press_key_global, activate_element, scroll_element, navigate_browser, execute_browser_script, run_command) against the OS accessibility tree, not against pixels and not against fragile control-tree paths. On Windows the surface is UI Automation (UIA), the same surface NVDA, JAWS, and Narrator use to read SAP, Oracle, Jack Henry, and Epic to a blind user. The agent reads each control by role and name, attempts the primary selector, walks a short fallback cascade if the UI moved, and only fails the run if every fallback misses.

How is this different from UiPath or Automation Anywhere?

UiPath, Automation Anywhere, and Blue Prism record a stored path through the control tree for every click. The path includes element indices, automation ids, and class names that were correct at recording time. The path breaks every time the UI shifts: a SAP support pack renumbers a control, a WinForms recompile churns ids, a theme change repaints a panel. An accessibility-API agent does not store a path. It stores the role and name (role:Edit && text:Customer), and resolves fresh on every run. The fallback cascade is part of the tool argument (fallback_selectors), not a separate maintenance job.
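A toy illustration of that difference on an in-memory tree. The `UiNode` type and both lookup functions are hypothetical stand-ins: real UIA trees are queried through the OS, and real stored paths carry more than child indices. The point is only what happens to each strategy when a patch inserts a field.

```typescript
interface UiNode {
  role: string;
  name: string;
  children: UiNode[];
}

// Stored-path style: follow child indices captured at recording time.
function byIndexPath(root: UiNode, path: number[]): UiNode | undefined {
  let node: UiNode | undefined = root;
  for (const i of path) node = node?.children[i];
  return node;
}

// Role-and-name style: search the current tree on every run (depth-first).
function byRoleAndName(root: UiNode, role: string, name: string): UiNode | undefined {
  if (root.role === role && root.name === name) return root;
  for (const child of root.children) {
    const hit = byRoleAndName(child, role, name);
    if (hit) return hit;
  }
  return undefined;
}

const edit = (name: string): UiNode => ({ role: "Edit", name, children: [] });
const recorded: UiNode = {
  role: "Window",
  name: "Customer Master",
  children: [edit("Customer"), edit("Region")],
};
// After a patch, a new field is inserted ahead of Customer.
const patched: UiNode = {
  role: "Window",
  name: "Customer Master",
  children: [edit("Company Code"), edit("Customer"), edit("Region")],
};

console.log(byIndexPath(recorded, [0])?.name);                 // Customer
console.log(byIndexPath(patched, [0])?.name);                  // Company Code (wrong field)
console.log(byRoleAndName(patched, "Edit", "Customer")?.name); // Customer
```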

How is this different from a vision-based or pixel-matching agent?

Vision and pixel approaches read the rendered screen and infer where to click. They clear the surface bar (an LLM can see a SAP window) but stall on three things: latency (30-60 seconds per UI step from a model call), cost (tokens per step times steps per workflow), and determinism (the model can pick a different button on the second run). An accessibility-API agent reads a structured tree of role/name/value/state directly from the OS in milliseconds, with no inference call, and the runtime that replays the workflow contains no LLM calls at all. The model is in the recorder, not in the hot path.

What does the selector grammar look like in practice?

Three building blocks. role:X targets a UIA control type (Edit, Button, ComboBox, Window, Pane, ListItem, Tab, Document). role:X && text:Y narrows by the human-readable name the form definition published to UIA. The >> operator scopes the search: role:Window && text:Customer Master >> role:Edit && text:Customer finds the Customer field inside the Customer Master window, not in some other modal that happens to have the same field. Every selector compiles down to a UIA TreeWalker call against the live process; nothing is stored as a coordinate.

What is the fallback cascade and why does it matter?

Every tool call carries a primary selector and a comma-separated list of fallback selectors. The default ceiling is three fallbacks (max_fallback_strategies = 3 in the converter's default config). If the primary misses, the runtime tries each fallback in order. Typical cascade: scoped selector (window >> element) first, unscoped element second, role-only third. Most quarterly SAP support packs, WinForms rebuilds, and Epic upgrades are absorbed by one of the first two fallbacks. A change that breaks all three flags the run for human review instead of silently posting to the wrong field, which is the conservative default for anything touching a general ledger or a clinical chart.
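The typical cascade described above can be generated mechanically from three facts about a control. The sketch below does that and joins the result into the comma-separated form; `Target` and `buildFallbacks` are hypothetical names, and the ordering simply follows the paragraph rather than the converter's code.

```typescript
interface Target {
  windowName: string; // UIA Name of the containing window
  role: string;       // UIA ControlType of the control
  name: string;       // UIA Name of the control
}

// Build the typical cascade: scoped first, unscoped second, role-only last.
function buildFallbacks(t: Target, max = 3): string {
  const element = `role:${t.role} && text:${t.name}`;
  const cascade = [
    `role:Window && text:${t.windowName} >> ${element}`, // scoped
    element,                                             // unscoped
    `role:${t.role}`,                                    // role-only
  ];
  // Joined with commas, so a name that contains a comma would need escaping;
  // how the real converter handles that case is not covered here.
  return cascade.slice(0, max).join(",");
}

console.log(
  buildFallbacks({ windowName: "Customer Master", role: "Edit", name: "Customer" }),
);
```

Ordering the list from narrowest to widest match means a later fallback is only reached when every more specific selector has already failed.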

Where does the workflow actually live between recording and execution?

As a TypeScript file in a git repo. The recorder converts each captured event into an MCP tool step, the model converts that sequence into a reviewable .ts file, and the executor reads the .ts file at runtime. The model runs once, at authoring time, on a screen-share with the analyst who owns the workflow. The runtime is a Rust crate (see github.com/mediar-ai/terminator under MIT). Grep the runtime for openai, claude, or anthropic and you find zero matches. The architectural commitment is: the model is in the recorder, not in the hot path that types into your general ledger at 4am.

What apps does an accessibility-API agent actually reach?

Anything that publishes a UIA tree on Windows. That includes the systems where UiPath stalls and where browser agents cannot reach at all: SAP GUI, SAP Business One, Oracle EBS, Jack Henry SilverLake and Symitar, Fiserv DNA and Premier, FIS IBS, Epic Hyperspace, Cerner (now Oracle Health), eClinicalWorks, MEDITECH, Citrix-published thick clients, VB6 and Delphi internal apps, MFC and WinForms apps, mainframe terminal emulators (Reflection, BlueZone, PuTTY). The reach is identical to whatever a screen reader can read aloud, which is the regulatory bar Windows enforces.

Is the SDK open source?

Yes. The runtime that resolves selectors against UIA, walks fallbacks, and dispatches clicks is published under MIT at github.com/mediar-ai/terminator. The MCP server, the selector parser, and the Rust crate are in the same repo. The recorder and the no-code web app at app.mediar.ai/web sit on top of it. Teams that want to extend the agent (a custom selector primitive, a process-specific verifier, a new MCP tool) work in the same SDK.

What is the cost model once this is in production?

$0.75 per minute of runtime, billed against credits. The $10,000 turn-key program fee converts to credits with a bonus, so it is effectively prepaid usage. No per-seat licensing, no developer certification, no $250K annual maintenance line for selector breaks. Published deployments: an LG-supplier F&B chain reported a 70% reduction vs UiPath to their board, a mid-market insurance carrier saved roughly $750K a year on claims intake (30 minutes per claim down to 2), a community bank compressed customer onboarding from 8 weeks to 2 weeks on Jack Henry. The numbers are workflow-specific; we will price yours per minute on a screen-share before quoting.

Adjacent reading

Economics

Legacy desktop apps with no API: the moat

The business reason the accessibility-API approach exists. Why the missing endpoint is the vendor's pricing power, and why every cheaper automation layer bounces off the legacy desktop surface.

Deep dive

AI agents on legacy desktop systems with no API

The tree format the model reads at authoring time, the four-strategy resolver, and the source-code grep that separates tree-based from vision-based architectures.

Reference

OS-level accessibility automation for enterprise

The UIA properties the selector grammar maps to (Name, AutomationId, ControlType, BoundingRectangle, FrameworkId) and how the runtime composes them.
