Computer use agents
What is a CUA, and why do most of them break on SAP?
CUA stands for computer use agent. It is an AI agent that drives software through the screen, looking at the interface and then clicking and typing the way a person would, instead of calling an API. That single design choice (read pixels, or read structure) is the whole story of why some CUAs sail through enterprise desktops and others stall.
Direct answer · verified 2026-06-22
A computer use agent (CUA) is an AI agent that operates a computer through its graphical interface, observing the screen and acting like a human user rather than calling an API. OpenAI uses the term for the model behind Operator, which it calls CUA. A search for "cuaai" usually points at one of two things: the general category, or cua.ai, the open-source project (trycua/cua) that runs sandboxed desktops for these agents. Mediar is a CUA built for the legacy enterprise case, and it reads the accessibility tree instead of pixels.
The CUA category, briefly
A handful of efforts launched the term into common use. They share the same core loop (perceive the screen, reason about it, take an action) and differ mostly in how they perceive.
- OpenAI CUA / Operator. GPT-4o vision plus reinforcement learning; clicks by coordinate.
- cua.ai (trycua/cua). Open-source sandboxes and SDKs to train and run CUAs on full desktops.
- Anthropic computer use. Vision-driven tool use that moves a virtual mouse and keyboard.
- Mediar. Reads the Windows UI Automation tree; acts by role and name, not pixels.
How a screenshot-based CUA actually runs a step
Most CUAs you read about today are vision-based. Each step is a tiny loop: capture the screen, ask a vision model where the target is, and click a coordinate. It is general and impressive on fresh interfaces. On a dense enterprise window that re-renders, the last hop is the weak link.
A vision CUA clicking the Post button in SAP
The error on the last line is not hypothetical. It is the dominant failure mode when the interface is busy and changes often, which describes almost every legacy enterprise system.
The benchmark reality nobody puts in the headline
On OSWorld, the standard test for full computer-use tasks, the best general CUA model still finishes well under half of open-ended desktop tasks unattended.
- OpenAI CUA on OSWorld (state of the art at launch): 38.1 percent
- Human performance on the same benchmark: 72.4 percent
- Previous best, before CUA: 22 percent
The gap between 38 percent and a human is exactly the gap an enterprise cannot ship into a financial posting workflow. The number climbs when the task is narrow and the agent reads structure instead of guessing from a picture.
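To make that gap concrete, here is a small arithmetic sketch. It uses the OSWorld figures quoted in the FAQ further down this page (38.1, 72.4, and 22 percent); the variable names are just for illustration.

```python
# OSWorld success rates in percent, as quoted in this page's FAQ.
CUA_SCORE = 38.1
HUMAN_SCORE = 72.4
PREVIOUS_BEST = 22.0

# Percentage points still separating the model from a human operator.
gap_to_human = round(HUMAN_SCORE - CUA_SCORE, 1)

# Percentage points gained over the previous best result.
gain_over_previous = round(CUA_SCORE - PREVIOUS_BEST, 1)

# Fraction of human-level performance the model reaches.
share_of_human = round(CUA_SCORE / HUMAN_SCORE, 3)

print(f"gap to human: {gap_to_human} points")
print(f"gain over previous best: {gain_over_previous} points")
print(f"share of human performance: {share_of_human:.1%}")
```

The jump over the previous best is large, but the model still reaches only a little over half of human performance on the same tasks.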
The other kind of CUA: read the tree, not the pixels
Every Windows application already publishes a structured map of itself: the UI Automation tree, the same data a screen reader consumes. Each control carries a role (Button, Edit, ComboBox) and a name ("Post", "Vendor"). An accessibility-first CUA queries that tree and acts on an element by identity, so it never has to know where the control sits on screen.
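A toy model helps show what "acting by identity" means. The sketch below is not the Windows UI Automation API; `Node`, `walk`, and `find` are hypothetical names, and the window contents are invented for illustration. It only shows that a lookup keyed on role and name needs no coordinates at all.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Node:
    """One control in a simplified accessibility tree (illustrative only)."""
    role: str
    name: str = ""
    children: List["Node"] = field(default_factory=list)


def walk(node: Node) -> Iterator[Node]:
    """Yield the node and all of its descendants, depth first."""
    yield node
    for child in node.children:
        yield from walk(child)


def find(root: Node, role: str, name: str) -> Optional[Node]:
    """Return the first control whose role and name both match, else None."""
    return next((n for n in walk(root) if n.role == role and n.name == name), None)


# An invented window, loosely shaped like a posting screen.
window = Node("Window", "Invoice entry", [
    Node("ToolBar", "Main", [Node("Button", "Post"), Node("Button", "Cancel")]),
    Node("Pane", "Header", [Node("Edit", "Vendor"), Node("ComboBox", "Currency")]),
])

post_button = find(window, "Button", "Post")
vendor_field = find(window, "Edit", "Vendor")
print(post_button, vendor_field)
```

Nothing in `Node` records a position, which is the point: the lookup keeps working wherever the control is drawn.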
Two ways to click the same button
while not done:
    img = screenshot()          # capture the whole screen
    plan = vision_model(img)    # "Post button is near (812, 540)"
    click(plan.x, plan.y)       # click the remembered pixels
# vendor moves the field 40px -> the click lands on nothing

This is the anchor of the whole approach, and it is checkable. Mediar's open-source engine, Terminator (MIT licensed, written in Rust, described as "Playwright for Windows computer use"), targets a control with a selector like role:Button && text:Post pulled straight from the accessibility tree. No screenshot is saved. No x/y coordinate is saved. When the vendor ships a layout update, the element is still found because its role and name did not change.
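To show how a selector string of that shape can resolve to a control, here is a minimal sketch. It is a toy parser written for this page, not Terminator's implementation, and the real selector grammar may differ; `parse_selector`, `matches`, and the element dictionaries are all hypothetical.

```python
from typing import Dict, List


def parse_selector(selector: str) -> Dict[str, str]:
    """Split 'key:value && key:value' into a dict of required attributes."""
    conditions: Dict[str, str] = {}
    for part in selector.split("&&"):
        key, sep, value = part.strip().partition(":")
        if not sep or not key.strip() or not value.strip():
            raise ValueError(f"malformed selector part: {part!r}")
        conditions[key.strip()] = value.strip()
    return conditions


def matches(element: Dict[str, str], conditions: Dict[str, str]) -> bool:
    """An element matches only when every required attribute is equal."""
    return all(element.get(key) == value for key, value in conditions.items())


# Invented elements standing in for controls read from an accessibility tree.
elements: List[Dict[str, str]] = [
    {"role": "Button", "text": "Cancel"},
    {"role": "Edit", "text": "Vendor"},
    {"role": "Button", "text": "Post"},
]

conditions = parse_selector("role:Button && text:Post")
hits = [e for e in elements if matches(e, conditions)]
print(conditions, hits)
```

The selector carries only attributes of the control itself, so nothing in it goes stale when the layout changes.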
What a structured run looks like on a window that moved
Here is the same SAP posting after the toolbar shifted down by 38 pixels from the original recording. A pixel matcher would have missed step three. Identity-based resolution does not care.
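The effect of that 38-pixel shift can be reproduced in a toy model. The sketch below is an assumption-laden illustration, not a trace from a real SAP window: the `Control` class, the coordinates, and the button sizes are invented. It replays one recorded click two ways against a layout that moved down by 38 pixels.

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Control:
    """A control with identity (role, name) and a bounding box (illustrative)."""
    role: str
    name: str
    x: int
    y: int
    width: int
    height: int

    def contains(self, px: int, py: int) -> bool:
        return self.x <= px < self.x + self.width and self.y <= py < self.y + self.height

    def center(self) -> Tuple[int, int]:
        return (self.x + self.width // 2, self.y + self.height // 2)


def hit_test(controls: List[Control], px: int, py: int) -> Optional[Control]:
    """Pixel-style resolution: whatever control sits under the point, if any."""
    return next((c for c in controls if c.contains(px, py)), None)


def find(controls: List[Control], role: str, name: str) -> Optional[Control]:
    """Identity-style resolution: match on role and name, ignore position."""
    return next((c for c in controls if c.role == role and c.name == name), None)


# Layout at recording time (invented numbers).
recorded = [
    Control("Button", "Save", 700, 100, 80, 24),
    Control("Button", "Post", 800, 100, 80, 24),
]
recorded_point = find(recorded, "Button", "Post").center()

# Layout at replay time: the toolbar has moved down by 38 pixels.
live = [replace(c, y=c.y + 38) for c in recorded]

pixel_target = hit_test(live, *recorded_point)   # stale coordinate
identity_target = find(live, "Button", "Post")   # role and name

print("pixel replay found:", pixel_target)
print("identity replay found:", identity_target)
```

Because the button is only 24 pixels tall in this model, a 38-pixel shift moves it entirely out from under the recorded point, while the role and name lookup still resolves it.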
“we moved an LG-customer F&B chain from UiPath to Mediar; their CFO told the board they're now saving 70 percent on costs”
Mediar deployment, SAP Business One automation
Which kind of CUA you actually want
This is not a claim that pixels are always wrong. The two approaches win in different places, and pretending otherwise would be dishonest.
- Vision wins on new and unstructured surfaces. A modern web app or a fresh interface that exposes little accessibility metadata may only be reachable by reading pixels. Browser-based CUAs are genuinely good here.
- Structure wins on legacy desktop systems. If your data lives in SAP GUI, a mainframe terminal, a Jack Henry or Fiserv core, or an Epic or Cerner client, those apps have rich accessibility trees and no useful API. Reading the tree is faster to stand up and far more stable than guessing from a screenshot.
- The split is the buying decision. Most enterprise ops teams have both kinds of work. The mistake is forcing one engine onto both. Mediar exists for the legacy desktop half, where a general vision CUA stalls.
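The split above can be written down as a simple routing rule. This is a sketch under stated assumptions: the coverage measure (share of controls that expose a usable name) and the 0.8 threshold are invented for illustration, not a rule Mediar or any other vendor publishes.

```python
from enum import Enum


class Engine(Enum):
    STRUCTURE = "accessibility tree"
    VISION = "vision"


def choose_engine(named_controls: int, total_controls: int,
                  min_coverage: float = 0.8) -> Engine:
    """Pick the structured engine when enough controls expose role and name.

    named_controls: controls with a usable accessible name (assumed metric).
    total_controls: all interactive controls found on the surface.
    min_coverage:   hypothetical threshold; tune it per application.
    """
    if total_controls <= 0:
        # Nothing is exposed at all, so only pixels are available.
        return Engine.VISION
    coverage = named_controls / total_controls
    return Engine.STRUCTURE if coverage >= min_coverage else Engine.VISION


legacy_desktop = choose_engine(named_controls=95, total_controls=100)
canvas_web_app = choose_engine(named_controls=3, total_controls=40)
print(legacy_desktop, canvas_web_app)
```

Making the choice per surface, rather than per vendor, is what lets one team cover both kinds of work.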
See a CUA run your actual workflow
Record a workflow once in the no-code web app and watch it replay against the live accessibility tree, or wire it up yourself with the open-source SDK.
Bring the workflow a vision CUA keeps fumbling
Book a working session and we will run your hardest legacy desktop task against the accessibility tree, live.
CUA questions, answered
What is a CUA in one sentence?
A CUA (computer use agent) is an AI agent that operates a computer through its graphical interface, observing the screen and then clicking, typing, scrolling, and navigating the way a person would, instead of calling an API. Because it works at the GUI layer, it can drive software that exposes no automation API at all, which is why the term shows up most around legacy desktop apps and enterprise systems.
Is cua.ai the same thing as a computer use agent?
Related but not identical. "Computer use agent" (CUA) is the general category. cua.ai (the open-source project trycua/cua) is one specific implementation: open-source infrastructure that spins up sandboxed macOS, Linux, Windows, and Android desktops so you can train and run computer-use agents against them. OpenAI's Operator, Anthropic's computer use, and Microsoft's agents are other implementations of the same category.
How well do CUAs actually work today?
On OSWorld, the standard benchmark for full computer-use tasks, OpenAI's CUA model scored 38.1 percent, which was state of the art and far above the previous best of 22 percent. For reference, human performance on the same benchmark is 72.4 percent. So a general vision-based CUA completes well under half of open-ended desktop tasks unattended today. The number rises sharply when the task is narrow, the app is stable, and the agent reads structure instead of guessing from pixels.
Why do screenshot-based CUAs struggle on SAP GUI, mainframes, and banking core systems?
Those interfaces are dense, text-heavy, and change layout often. A vision model has to locate a target visually each step and then act on a coordinate, so a 40-pixel shift, a re-rendered grid, or a theme change can send the click to the wrong place. There is also no margin for error when the action posts a financial transaction. Reading the accessibility tree sidesteps this because the target is identified by its role and name, not its position.
What is the accessibility-tree approach, and how is it different?
Every Windows app exposes a UI Automation tree, the same structured data a screen reader uses. Each control has a role (Button, Edit, ComboBox) and a name ("Post", "Vendor"). An accessibility-first CUA queries that tree for something like role:Button && text:Post and acts on the element directly. It never stores a screenshot or an x/y coordinate, so when the layout moves the element is still found by identity. Mediar's open-source Terminator SDK (MIT licensed, github.com/mediar-ai/terminator) takes this approach and bills itself as "Playwright for Windows computer use."
When is a vision-based CUA the better choice?
When the target is a modern web app or a brand-new interface that exposes little or no accessibility metadata, vision is sometimes the only thing that works, because the model can read pixels even when the structure is missing. Vision is also useful as a fallback layer. The honest split is: new SaaS and unstructured surfaces favor vision, while legacy Windows desktop systems with rich accessibility trees favor the structured approach.
Can a CUA learn a workflow without me writing code?
Yes. Mediar records a workflow once by watching you perform it, reading the accessibility node under each interaction rather than filming your screen, then replays those steps against the live tree. Non-developers record and run automations from the no-code web app at app.mediar.ai/web, and developers extend the same engine through the open-source Terminator SDK.
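A rough sketch of what "reading the accessibility node under each interaction" could look like follows. It is a toy model, not Mediar's recorder: `Node`, `deepest_at`, `record_click`, and the window geometry are hypothetical. The idea it illustrates is that the pointer position is used once, to find the control, and then discarded.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """A control with a bounding box and child controls (illustrative only)."""
    role: str
    name: str
    x: int
    y: int
    width: int
    height: int
    children: List["Node"] = field(default_factory=list)

    def contains(self, px: int, py: int) -> bool:
        return self.x <= px < self.x + self.width and self.y <= py < self.y + self.height


def deepest_at(node: Node, px: int, py: int) -> Optional[Node]:
    """Return the innermost control under the point, or None if outside."""
    if not node.contains(px, py):
        return None
    for child in node.children:
        hit = deepest_at(child, px, py)
        if hit is not None:
            return hit
    return node


def record_click(root: Node, px: int, py: int) -> Dict[str, str]:
    """Turn a raw click into a step keyed on identity, dropping the pixels."""
    node = deepest_at(root, px, py)
    if node is None:
        raise LookupError(f"no control under ({px}, {py})")
    return {"action": "click", "role": node.role, "name": node.name}


# Invented geometry for a window with one toolbar button.
window = Node("Window", "Invoice entry", 0, 0, 1000, 700, [
    Node("ToolBar", "Main", 0, 80, 1000, 40, [
        Node("Button", "Post", 800, 88, 80, 24),
    ]),
])

step = record_click(window, 812, 100)
print(step)
```

The stored step contains a role and a name but no x or y, so a later replay can resolve it against whatever layout is live at that moment.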
What does a CUA cost to run in production?
Mediar charges 0.75 dollars per minute of runtime with no per-seat licensing. There is a 10,000 dollar turn-key program fee that converts to credits with a bonus, so it functions as prepaid usage. That model exists because the value is in execution time on real workflows, not in seats sitting idle.
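The pricing above reduces to simple arithmetic. In the sketch below, the rate and fee come from the paragraph above; the credit bonus is not stated on this page, so `bonus_fraction` is a placeholder that defaults to zero and should be replaced with the real figure.

```python
RATE_PER_MINUTE = 0.75   # dollars per minute of runtime, from the text
PROGRAM_FEE = 10_000     # dollars, converts to prepaid credits


def runtime_cost(minutes: float) -> float:
    """Dollar cost of a given number of runtime minutes."""
    return minutes * RATE_PER_MINUTE


def prepaid_minutes(fee: float = PROGRAM_FEE, bonus_fraction: float = 0.0) -> float:
    """Runtime minutes covered by the fee.

    bonus_fraction is an assumption: the size of the bonus is not given
    here, so 0.0 yields a conservative lower bound.
    """
    credits = fee * (1 + bonus_fraction)
    return credits / RATE_PER_MINUTE


print(f"one hour of runtime: ${runtime_cost(60):.2f}")
print(f"minutes covered by the fee, no bonus: {prepaid_minutes():.0f}")
print(f"hours covered by the fee, no bonus: {prepaid_minutes() / 60:.1f}")
```

With no bonus applied, the program fee covers roughly 13,333 minutes, or about 222 hours, of execution.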