A short essay on a wrong default

Legacy desktop systems already have an API. It is the accessibility tree.

The current default answer for how AI agents should drive a SAP GUI, a Jack Henry teller window, an Epic chart, or a mainframe terminal is to point a vision model at the pixels and let it reason. That is a real solution. It is also the wrong default. The same operating system that paints those pixels also publishes a structured tree of every visible control, the same surface a screen reader reads. The tree is faster to read, deterministic to act on, and survives UI changes that break a pixel matcher. Most of the production-ready AI desktop agents that do not stall in audit are built on it.

Matthew Diakonov
8 min

Direct answer (verified 2026-05-01)

An AI agent that has to drive a legacy desktop system with no API reads the OS accessibility tree. On Windows the surface is UI Automation (UIA); on Linux it is AT-SPI; on macOS it is the AX accessibility API. Every visible control becomes a structured node a screen reader can read, and an agent reads the same surface as a substitute for the missing API.

A reference implementation is open source under MIT at github.com/mediar-ai/terminator. The function that flattens a SAP GUI window into the format the model sees is at recording_processor.rs:1014.

The pixel-first default is a category error

Read the current top results for this question and you get a consistent story: legacy systems have no API, vision-language models can read screens, therefore point a VLM at the screen and have it click. Microsoft Copilot Studio's Computer Use, Bytebot, OpenAI's computer-use agents, and most of the “agentic automation” pitches all sit on this premise. The premise is missing one fact.

Legacy desktop systems do have a machine-readable surface. The OS attaches one because screen readers need it. JAWS and Narrator do not look at the pixels of an SAP transaction code; they ask the window for its tree of controls and read it aloud. That tree contains the role of every node (Window, Button, Edit, ComboBox, CheckBox, Tree, TreeItem), the human-readable name (the same label a sighted user reads), and a small set of attributes that capture state (value, checked, selected). A 30-year-old Win32 form published the same tree the day it was compiled, because the accessibility surface lives at the OS layer, not in the application code.
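
To make that surface concrete, here is a rough sketch of what one node in such a tree carries, written as a TypeScript interface. The field names are simplified for illustration and are not the literal UIA, AT-SPI, or AX property names.

```typescript
// Illustrative shape of an accessibility-tree node; property names are
// simplified, not the literal UIA / AT-SPI / AX identifiers.
interface AccessibilityNode {
  role: string;                    // "Window", "Button", "Edit", "ComboBox", ...
  name: string;                    // the human-readable label a screen reader speaks
  attributes: {
    value?: string;                // current text of an Edit control
    checked?: boolean;             // state of a CheckBox
    selected?: boolean;            // state of a list or combo item
  };
  children: AccessibilityNode[];   // the same structure, all the way down the window
}
```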

A pixel-first agent ignores all of this. It sends a screenshot to a model and pays for inference to recover information that the OS would have handed it for free.

What the model actually reads from a SAP GUI window

The Mediar recorder serializes a captured window into a flat, indented list, one node per line, in this exact format. The serializer is generate_simplified_ui_tree_string at apps/desktop/src-tauri/src/recording_processor.rs:1014. A real Customer Master Data window from SAP comes out looking like this:

1.  I.    [Window] 'Customer Master Data Maintenance'
2.   II.  [Toolbar] 'Standard'
3.    III. [Button] 'Save' {focusable=true}
4.    III. [Button] 'Display'
5.   II.  [Group] 'General Data'
6.    III. [Edit] 'Customer' {value="0000470192"}
7.    III. [Edit] 'Title' {value="Mr."}
8.    III. [Edit] 'Name 1' {value="Imperial Treasure Pte Ltd"}
9.    III. [Edit] 'Search Term' {value=""}
10.   III. [Edit] 'Country' {value="SG"}
11.   III. [Edit] 'Postal Code' {value="048619"}
12.   III. [ComboBox] 'Reconciliation Account' {selected=true}
13.   III. [CheckBox] 'Posting Block' {checked=false}

That is what gets fed to the model at authoring time. No image, no OCR, no pixel coordinates. A SAP transaction with 80 controls becomes about 80 lines of text. The model reads the tree, decides which nodes the workflow touched in which order, and writes a TypeScript file that targets each node by role and name. The authored file is what runs in production.
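
As a sketch of what such an authored file can look like, here is a hypothetical workflow step in TypeScript. The selector API, helper names, and window-attachment function are invented for illustration and are not the actual Terminator SDK surface.

```typescript
// Hypothetical authored workflow step; the selector API and helper names are
// invented for illustration and are not the actual Terminator SDK surface.
interface Selector { role: string; name: string }
interface UiElement {
  setValue(value: string): Promise<void>;
  click(): Promise<void>;
}
interface AppWindow { element(sel: Selector): UiElement }
declare function attachToWindow(title: string): Promise<AppWindow>;

// Every target is addressed by role and name from the tree, never by pixels.
export async function updateCustomerMaster(): Promise<void> {
  const win = await attachToWindow("Customer Master Data Maintenance");
  await win.element({ role: "Edit", name: "Search Term" }).setValue("IMPERIAL");
  await win.element({ role: "Edit", name: "Postal Code" }).setValue("048619");
  await win.element({ role: "Button", name: "Save" }).click();
}
```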

The two architectures, side by side

The pixel-first architecture keeps the model in the hot path. On every UI step, the agent screenshots the window and asks a vision model what to do next. The model returns coordinates or an action, the agent fires it, captures a new screenshot, and asks again. Inference runs in the hot path of every workflow execution.

  • 30 to 60 seconds per UI step (model latency dominates)
  • Token bill scales with workflow length and queue volume
  • Two identical runs can pick different buttons (non-deterministic)
  • Audit trail is a chain of model decisions, not a code artifact
  • UI changes that survive vision still cost a fresh inference

The tree-based architecture keeps the model out of the hot path. It runs once at authoring time to turn a recording into a workflow file; the runtime replays that file by reading the accessibility tree directly.

  • Microsecond tree reads, no model call per step
  • Cost bounded by runtime minutes, not tokens
  • Two identical runs take the same path (deterministic replay)
  • Audit trail is a TypeScript file you can diff in git
  • UI changes are absorbed by the selector fallback described next

The four-strategy fallback when the UI shifts

The classic complaint about RPA is that selectors break the moment someone renames a label or reorders a panel. The classic complaint about vision agents is that they hallucinate when the screenshot looks different. The tree-based approach gets a third option: when the recorded element does not match cleanly, walk a cascade. The cascade in apps/desktop/src-tauri/src/focus_state.rs lines 168 to 196 has four steps:

  1. Match by automation or accessibility ID. Stable across most UI changes because it is the property the OS uses to identify the control to a screen reader. Survives label rewrites, panel reorders, and theme changes.

  2. Match by window plus bounds. If the ID has changed (a new SAP support pack can do this), find the window, walk the tree, and pick the node whose rectangle is closest to the recorded one. Survives most layout-stable patches.

  3. Match by visible text. If the rectangle moved too, fall back to the human-readable name. The label on the field is what the analyst recorded against, so this is the closest analog to what they would do if asked to find the field again themselves.

  4. Window-only focus. If none of the above land, focus the parent window so the next event in the workflow has a chance to fire against the right surface. The run is flagged for review, not killed silently.

A SAP support pack that nudges a field down by 12 pixels triggers nothing because the id still matches. A pack that changes the control id falls through to strategy two and is caught by bounds. A pack that changes the id and moves the field still gets caught by visible-text matching. A pack that breaks all four flags the run rather than silently typing into the wrong field, which is the conservative default for anything posting to a general ledger.
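
A minimal sketch of such a cascade in TypeScript, with hypothetical helper functions standing in for the tree lookups; the real implementation is the Rust code in focus_state.rs.

```typescript
// Illustrative sketch only; all names are hypothetical, not the Terminator SDK API.
interface Rect { x: number; y: number; width: number; height: number }
interface UiNode { role: string; name: string; automationId?: string; bounds: Rect }
interface RecordedElement { automationId?: string; bounds: Rect; name: string }

// Hypothetical lookups over a captured accessibility tree.
declare function findByAutomationId(win: UiNode, id: string): UiNode | null;
declare function findClosestByBounds(win: UiNode, r: Rect): UiNode | null;
declare function findByName(win: UiNode, name: string): UiNode | null;
declare function focusWindow(win: UiNode): void;

type Resolution =
  | { kind: "element"; node: UiNode }
  | { kind: "window-only"; flaggedForReview: true };

function resolveTarget(win: UiNode, rec: RecordedElement): Resolution {
  // Strategy 1: automation / accessibility ID survives label and layout changes.
  if (rec.automationId) {
    const byId = findByAutomationId(win, rec.automationId);
    if (byId) return { kind: "element", node: byId };
  }
  // Strategy 2: closest bounding rectangle when the id changed.
  const byBounds = findClosestByBounds(win, rec.bounds);
  if (byBounds) return { kind: "element", node: byBounds };

  // Strategy 3: human-readable name, the label the analyst recorded against.
  const byName = findByName(win, rec.name);
  if (byName) return { kind: "element", node: byName };

  // Strategy 4: focus the parent window and flag the run for review.
  focusWindow(win);
  return { kind: "window-only", flaggedForReview: true };
}
```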

Where the accessibility tree fits in a working pipeline

Pipeline: PDF invoice, mainframe screen, SAP GUI window, Jack Henry teller, or Epic chart → OS accessibility tree → recorded events → workflow file (.ts) → deterministic replay.

The runtime is the proof

A claim that the model only runs at authoring time is testable. Read the runtime. The Mediar production executor is a single Rust file at crates/executor/src/services/typescript_executor.rs. It is 871 lines. Its job is mechanical: pull a workflow off a Postgres queue, build MCP arguments (file URL, secrets, trace id), and call the MCP execute_sequence tool against a Windows session. The MCP server wraps the Terminator SDK, which is the part that talks to UI Automation.

Grep that file for gemini, openai, claude, or anthropic and you find zero matches. That is the architectural commitment. The model is in the recorder pass, where it has time and only sees one workflow once. The runtime that types into a customer's SAP system at 4am is plain code reading the tree.
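
A rough sketch of that mechanical job, with hypothetical names for the queue, secrets, and MCP plumbing; the real executor is Rust and its payload shape will differ.

```typescript
// Illustrative executor loop; names, payload shape, and MCP plumbing are
// assumptions, not the contents of typescript_executor.rs.
interface QueuedWorkflow { id: string; fileUrl: string; traceId: string }
declare function dequeueNextWorkflow(): Promise<QueuedWorkflow | null>;
declare function loadSecrets(workflowId: string): Promise<Record<string, string>>;
declare function callMcpTool(name: string, args: unknown): Promise<void>;

// Note what is absent: no screenshot, no model call, no token spend.
async function runOnce(): Promise<void> {
  const job = await dequeueNextWorkflow();   // pull one workflow off the queue
  if (!job) return;

  const secrets = await loadSecrets(job.id);
  await callMcpTool("execute_sequence", {    // MCP server drives UI Automation
    file_url: job.fileUrl,
    secrets,
    trace_id: job.traceId,
  });
}
```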


An F&B chain on SAP Business One moved off UiPath onto a tree-based agent and reported a 70% reduction in automation cost to their board.

Mediar customer reference, full disclosure on request

When the tree does not have what you need

The accessibility surface is rich enough for almost every Windows enterprise app, but it is not magic. Two known gaps. First, vintage custom-paint apps that bypass standard controls and render directly to a device context. These are rare in regulated workloads because they break screen readers too, and most enterprises retired them in the last accessibility audit cycle, but a few survive in manufacturing and trading desks. Second, partially-bridged Java AWT clients where some controls publish to UIA and some do not.

For both, the right answer is a hybrid. Read the tree where it is rich, fall back to OCR plus a small vision call for the residual controls. The hybrid pays for inference only on the controls that earn it, which is a couple of cents per run rather than a model call per UI step. The same hybrid pattern appears for browser-rendered legacy apps: when the target is Internet Explorer or Edge in compatibility mode, Mediar prefers DOM capture via a Chrome extension, then falls back to UIA. The handler that combines them is in apps/desktop/src-tauri/src/mcp_converter.rs starting at line 1034.
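
A sketch of that hybrid decision, assuming hypothetical helpers for the tree lookup, the scoped screenshot, and the OCR-plus-vision call.

```typescript
// Illustrative hybrid: read the tree first, pay for OCR + a vision call only
// for the controls the tree cannot see. All helpers are hypothetical.
interface Rect { x: number; y: number; width: number; height: number }
interface ResolvedControl { name: string; bounds: Rect }

declare function findInTree(windowTitle: string, name: string): ResolvedControl | null;
declare function screenshotRegion(r: Rect): Promise<Uint8Array>;
declare function locateByOcrAndVision(img: Uint8Array, label: string): Promise<Rect | null>;

async function resolveControl(
  windowTitle: string,
  label: string,
  searchRegion: Rect,
): Promise<Rect | null> {
  // Cheap, deterministic path: the accessibility tree.
  const node = findInTree(windowTitle, label);
  if (node) return node.bounds;

  // Residual path: a small vision call scoped to the custom-painted region.
  const img = await screenshotRegion(searchRegion);
  return locateByOcrAndVision(img, label);
}
```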

The point is not that the tree wins everywhere. It is that the tree wins so often, and the cost of reading it is so close to zero, that it should be the default and pixels should be the fallback, not the other way around.

What this means in practice

If you are evaluating an AI agent for a workload that lives in a Windows desktop system without an API (and that covers most of SAP GUI, Oracle EBS, Jack Henry, Fiserv, FIS, Epic, Cerner, eClinicalWorks, mainframe terminal emulators, and the long tail of internal Windows apps), one question separates the production-ready tools from the demo-ready ones: where does the model live in your execution architecture?

If the answer is “the model decides the next action on every step from a screenshot,” you have bought a vision-based agent. Real products are built this way, and they work. Plan for the per-step latency and per-run token cost in your ROI math, and plan for the audit conversation where you explain why the action sequence is non-deterministic.

If the answer is “the model runs once at authoring time, the runtime is plain code reading the OS accessibility tree, and the workflow file lives in git,” you have bought a tree-based agent. The bill is bounded by runtime minutes, the audit trail is a code file you can diff, and a UI rev does not require regenerating the workflow with a fresh model session.

Have a SAP, Jack Henry, Epic, or mainframe workflow you want quoted?

Book 25 minutes. We will record one pass of the workflow on a screenshare, show you the authored TypeScript file, and price the runtime. No slides, no decks.

Common questions

How can an AI agent automate a legacy desktop app that has no API?

It reads the operating system's accessibility tree. Every visible Windows control publishes a structured node through UI Automation (UIA), the same surface a screen reader reads. SAP GUI, Jack Henry teller windows, Epic charting screens, mainframe terminal emulators, even 1990s VB6 apps expose their fields, buttons, and tables to UIA. The accessibility tree is the API. Reading it gives you a structured, machine-readable representation of the screen without a single pixel-vision LLM call. AT-SPI on Linux and AX on macOS are the equivalent surfaces on those platforms.

How is this different from a vision-based computer-use agent?

A vision-based agent screenshots the screen and asks an LLM to reason over the pixels every step. That works in demos and stalls in production for three reasons: latency (one LLM call per UI action stacks into 30 to 60 seconds per step), cost (token bills compound on a queue of 5,000 invoices), and determinism (the model can pick a different button on the second run, which is unacceptable for an audited SAP post). A tree-based agent reads a structured representation that already exists on the machine, takes microseconds, and returns the same answer every run. The model still helps, but it lives at authoring time, converting one recording into a deterministic workflow file. The runtime that replays the file has zero LLM calls in its hot path.

What does the structured representation actually look like?

Mediar serializes a window into a flat indented list, one node per line, in this format: {LineNumber}. {RomanNumeralIndent}. [Role] 'Name' {Attributes}. The function is generate_simplified_ui_tree_string in apps/desktop/src-tauri/src/recording_processor.rs at line 1014, in the open-source monorepo at github.com/mediar-ai/terminator. A SAP customer master screen comes out as a few dozen lines of text the model can read directly. There is no image involved. The same approach trivially extends to mainframe terminal emulators, COBOL forms-mode UIs running in Reflection or PuTTY, and Citrix-published apps, because the OS-level accessibility surface is what the emulator publishes regardless of how old the underlying software is.
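
A toy version of such a serializer in TypeScript, written against a simplified node shape; the real generate_simplified_ui_tree_string is Rust and handles more roles and attribute types.

```typescript
// Toy serializer for the "{LineNumber}. {Roman}. [Role] 'Name' {Attrs}" format;
// the node shape is illustrative, not the Rust implementation.
interface Node {
  role: string;
  name: string;
  attributes: Record<string, string | boolean>;
  children: Node[];
}

// Depth is capped at eight levels for this toy example.
const ROMAN = ["I", "II", "III", "IV", "V", "VI", "VII", "VIII"];

function serializeTree(root: Node): string {
  const lines: string[] = [];
  const walk = (node: Node, depth: number): void => {
    // Strings are quoted ({value="SG"}), booleans are bare ({checked=false}).
    const attrs = Object.entries(node.attributes)
      .map(([k, v]) => `${k}=${typeof v === "string" ? `"${v}"` : v}`)
      .join(", ");
    const indent = " ".repeat(depth);
    lines.push(
      `${lines.length + 1}. ${indent}${ROMAN[depth]}. [${node.role}] '${node.name}'` +
        (attrs ? ` {${attrs}}` : ""),
    );
    node.children.forEach((child) => walk(child, depth + 1));
  };
  walk(root, 0);
  return lines.join("\n");
}
```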

What happens when the UI changes and the recorded element is not where it was?

Most of the legacy-system literature stops at this point because pixel-matching breaks here. A tree-based agent has more options. The Mediar runtime walks four strategies in apps/desktop/src-tauri/src/focus_state.rs lines 168 to 196: match by automation or accessibility ID, match by window plus bounds, match by visible text, and finally fall back to window-only focus. Most enterprise UI patches (a label rename, a panel reorder, a control id change) are absorbed by one of the first three. A change that breaks all four flags the run for review rather than silently posting against the wrong field.

Does this only work for new apps written with accessibility in mind?

No, and that is the point. Windows applications written in the 1990s already expose their controls because the accessibility surface was added at the OS layer, not by the app developer. SAP GUI, Oracle Forms, Win32 banking core systems, mainframe terminal emulators, and Citrix-published apps all surface controls through UI Automation today. The places this breaks are vintage apps that paint to a custom HDC instead of using standard controls (rare in regulated workloads because it breaks the screen reader story too) and a handful of poorly bridged Java AWT clients. For 95% of legacy desktop estates, the tree is already there.

What if the legacy app does not expose a useful tree?

Two fallbacks. First, the runtime can hybridize: read the tree where it is rich, fall back to OCR plus pixel match for the controls the tree misses. Second, on browser-rendered legacy apps (a startling amount of "legacy" enterprise software is now Internet Explorer or Edge in compatibility mode), Mediar prefers DOM capture via the Chrome extension, then UIA fallback. The relevant code is in apps/desktop/src-tauri/src/ui_tree_capture.rs and apps/desktop/src-tauri/src/mcp_converter.rs, where the browser-click-with-UIA-fallback handler is at line 1034.

What is the runtime that actually drives the desktop?

An 871-line Rust file at crates/executor/src/services/typescript_executor.rs in the same open-source repo. Its job is mechanical: pull a workflow off a Postgres queue, build the MCP arguments, and call the MCP execute_sequence tool against the target Windows session. The MCP server wraps the Terminator SDK, which is the part that talks to UI Automation. You can grep that runtime file for 'gemini', 'openai', 'claude', or 'anthropic' and find zero matches. That is the architectural commitment to the tree-not-pixels approach: the model only runs at authoring time, when a human walks the workflow once.

Can I see this work end-to-end?

The open-source Terminator SDK at github.com/mediar-ai/terminator includes the recorder, the executor, and a runnable example. A pilot through Mediar runs the same components plus the cloud authoring pipeline, the queue, and SOC 2 Type II controls. Mediar's pricing for the cloud product is $0.75 per minute of executor runtime plus a $10K turn-key program fee that converts to credits. A team that wants to extend the SDK themselves can do that under the MIT license without paying anything.

Where does the tree-based approach not win?

Two places. First, browser-only flows on modern SaaS where a stable HTTP API exists. If the data lives behind an OpenAPI spec, an API integration beats both tree-based and vision-based agents on cost and reliability. Second, workflows that need free-form planning that a recording cannot capture (multi-day case investigations, fraud triage, novel exception handling). For those, agentic systems with the model at runtime are still the right fit. The boundary where this tree-based approach is the right default is recording-shaped repetitive work on legacy desktop systems with no API, which is exactly the place RPA was invented for and exactly the place vision-only agents have not stuck.