A definition, then a disagreement
CUA AI: what a computer-using agent is, and why the term hides two very different products
Most explainers stop at the definition. A computer-using agent is an AI system that drives a computer the way a human would, using mouse, keyboard, and screen context. True, useful, and where most articles end. The interesting part starts after that, because the same name covers two architectures that share almost nothing at runtime. One of them benchmarks at 38.1% on OSWorld and is what most people picture when they hear the term. The other is what gets signed into production at a community bank or an SAP-running F&B chain. They look identical in a demo, and the choice between them is the difference between a pilot and a contract.
Direct answer (verified 2026-05-06)
A computer-using agent (CUA) is an AI system that drives a computer's graphical interface the way a human does, using mouse, keyboard, and screen context. The term was popularized by OpenAI's January 2025 launch of the model behind Operator, but the category covers any agent that fits that pattern.
Two architectures share the name. A vision-loop CUA invokes a vision-language model on every UI step (OpenAI's Operator, Anthropic's Computer Use beta, the open-source trycua/cua). A tree-based CUA reads the OS accessibility tree at runtime and invokes the model only at authoring time (Mediar; the open-source reference is github.com/mediar-ai/terminator). The first benchmarks at 38.1% on OSWorld; the second is what ships in audited workloads.
Why the term feels new and is not
The acronym CUA entered the popular vocabulary when OpenAI launched Operator in January 2025, and most of the explainers published since then trace the lineage back to that launch. The underlying idea is older. Anything that drives a graphical interface using something other than an API qualifies. UiPath was a CUA before the abbreviation existed. So were AppleScript, AutoIt, the Selenium IDE recorder, and every screen-scraping tool the bank operations team was running in 2008. What changed in 2024-2025 is that vision-language models got good enough to substitute for the part of those systems that a developer used to do by hand: read a screen, decide which control to touch, and emit the touch.
That substitution is the actual claim. The marketing version of CUA collapses it into a one-liner: "the model uses your computer." The accurate version is more useful. It says the model has been promoted from the design loop into the perception-action loop, and lets you ask the next question. Promoted to where, exactly? On every step? On some steps? Once at the start, never again? Each answer corresponds to a different product with different cost, latency, and audit properties.
Most pages on this topic do not ask that question. So the rest of this one is about it.
The two architectures, side by side
The split is not about which model the agent uses. GPT-4o, Claude 3.7, Gemini 2.5, and UI-TARS all show up on both sides. The split is about where the model runs in the lifecycle of a workflow.
Where the model lives
Vision-loop CUA
On every UI step, the agent screenshots the screen and sends it to a vision-language model. The model returns a click coordinate or a typed string. The agent fires it, captures a fresh screenshot, and asks again. The model lives in the runtime hot path.
Examples: OpenAI Operator (the consumer surface for the company's CUA model), Anthropic's Computer Use beta, trycua/cua (open source infrastructure for cloud desktops), and Microsoft Copilot Studio's Computer Use feature. The published OSWorld score for the OpenAI agent is 38.1% across 369 tasks, well above the prior 22.0% state of the art and well below the 72.4% human baseline.
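To make the per-step loop concrete, here is a minimal sketch in TypeScript. The helper names (captureScreenshot, callVisionModel, dispatch) are hypothetical stand-ins, not any vendor's SDK; the point is only that one inference call sits inside every iteration of the hot path.

```typescript
// Minimal sketch of a vision-loop CUA's hot path (illustrative only).
type Action =
  | { kind: "click"; x: number; y: number }
  | { kind: "type"; text: string }
  | { kind: "done" };

// Hypothetical helper signatures (assumed, not a real vendor SDK):
declare function captureScreenshot(): Promise<Uint8Array>;
declare function callVisionModel(input: {
  goal: string;
  screenshot: Uint8Array;
  history: Action[];
}): Promise<Action>;
declare function dispatch(action: Action): Promise<void>;

async function runTask(goal: string): Promise<void> {
  const history: Action[] = [];
  while (true) {
    const screenshot = await captureScreenshot();                        // pixels only, no structured tree
    const action = await callVisionModel({ goal, screenshot, history }); // one inference per UI step
    if (action.kind === "done") return;                                  // the model decides when the task is finished
    await dispatch(action);                                              // fire the click or keystroke
    history.push(action);
  }
}
```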
Tree-based CUA
A vision-language model runs once, when a human walks the workflow on a screenshare. It reads a flattened version of the OS accessibility tree, infers the intent, and emits a TypeScript file that targets each control by role and name. The runtime that replays the file at 4 a.m. is plain code reading UI Automation. There is no model in the per-step loop.
Mediar is one example. The reference implementation is open source under MIT at github.com/mediar-ai/terminator. The runtime that drives a customer's SAP GUI session at night is a single Rust file, crates/executor/src/services/typescript_executor.rs; grep it for openai, claude, anthropic, or gemini and you will find zero matches.
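As a rough illustration of what the authored artifact can look like (the step shape and field names below are illustrative, not the exact schema the repo emits), a workflow file targets each control by role and name and contains no model call:

```typescript
// Sketch of an authored workflow file (illustrative step shape, not the exact
// schema the repo emits). Each step targets a control by role and name, so
// replay is a deterministic accessibility-tree lookup, not a model decision.
interface UiaStep {
  action: "click" | "set_value";
  selector: { role: string; name: string };  // resolved against the OS accessibility tree
  value?: string;                            // "{{...}}" placeholders filled from inputs/secrets at run time
}

export const postInvoiceWorkflow: UiaStep[] = [
  { action: "click",     selector: { role: "MenuItem", name: "Accounts Payable" } },
  { action: "set_value", selector: { role: "Edit",     name: "Vendor Number" },  value: "{{vendor_id}}" },
  { action: "set_value", selector: { role: "Edit",     name: "Invoice Amount" }, value: "{{amount}}" },
  { action: "click",     selector: { role: "Button",   name: "Post" } },
];
```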
A short detour into benchmarks
The number a buyer should ask about first is the OSWorld success rate. OSWorld is the benchmark that matters for this category because it tests full computer-use tasks against real desktop environments rather than scripted web-only flows. OpenAI's CUA was the first agent to push past the previous state of the art on it.
OpenAI CUA on OSWorld: 38.1%
Prior state of the art: 22.0%
Human baseline: 72.4%
Source: OpenAI CUA system card, January 2025. OSWorld measures 369 real-world computer-use tasks across operating systems.
Six tasks in ten fail. That is a research result, and a real one, and worth celebrating in the context of where the field was a year earlier. It is also not a number you can stage a hundred-step claims-intake flow on. A workflow's success probability is the product of its step probabilities: a hundred steps at 99.3% per-step accuracy is already a coin flip across the whole run, and at 95% per step the end-to-end rate falls below 1%. An agent succeeding on 38% of whole OSWorld tasks is effectively at zero percent end-to-end on a hundred-step flow without retries.
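The compounding is easy to check with a back-of-the-envelope calculation; nothing below is specific to any particular agent.

```typescript
// End-to-end success of an n-step flow where each step succeeds independently
// with probability p. A back-of-the-envelope model, not a benchmark result.
const endToEnd = (p: number, steps: number): number => Math.pow(p, steps);

console.log(endToEnd(0.993, 100).toFixed(3)); // ~0.495: roughly a coin flip
console.log(endToEnd(0.95, 100).toFixed(4));  // ~0.0059: under one percent
```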
A June 2025 paper called OSWorld-Human adds the time picture. A task a human finishes in two minutes can take an agent over twenty, with 75 to 94 percent of that time spent in the LLM's planning and reflection calls. The cost picture tracks the time picture: every UI step is at least one inference call, sometimes several.
The vision-loop architecture pays this tax by design. The tree-based architecture does not, because the model is not in the loop. The trade is that you give up open-ended in-the-moment reasoning at runtime in exchange for a deterministic replay of a recorded workflow. Whether that trade is good depends on what the workflow is.
Inside a tree-based CUA: dual-channel input fusion
Here is one implementation detail that does not appear in any of the popular CUA explainers, and that turns out to matter for any agent that has to handle a real enterprise mix of native windows and embedded browsers.
When a user clicks a control inside a Chromium browser, two events fire in parallel. A Chrome extension running in the page publishes a BrowserClick event carrying the CSS selector for the element. Independently, the Windows UI Automation layer publishes a Click event carrying the accessibility-tree selector for the same element. The same physical click produces two structured records of itself, viewed from two different layers of the stack.
A vision-only CUA never sees either of these. It only sees the screenshot. A tree-based CUA can take both, fuse them inside a short time window, and emit a single run_command that carries both selectors. At replay time, the runtime tries the higher-fidelity CSS selector first; if the page is rendered through Internet Explorer compatibility mode, an iframe boundary, or any other surface where the DOM is not addressable, it falls back to the UI Automation selector. The fusion logic is in apps/desktop/src-tauri/src/workflow_recorder.rs lines 176 to 194, and the runtime fallback is in apps/desktop/src-tauri/src/mcp_converter.rs starting at line 1034, both in the open-source repo.
One click, two channels, one fused command
That single design choice is why the same agent that drives an SAP GUI window can also drive a hybrid app where some surfaces are native Win32 and others are embedded Chromium. Each surface gets the most precise selector available; neither needs a vision model. Pixel-vision agents cannot do this because the structured input has already been discarded by the time anything reaches the model.
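A minimal sketch of the fused record and the replay fallback, with simplified field names and hypothetical driver functions rather than the repo's actual schema:

```typescript
// Sketch of a fused click record carrying both selectors. Field names are
// simplified; the real fusion and fallback logic live in the repo files
// cited above (workflow_recorder.rs and mcp_converter.rs).
interface FusedClick {
  cssSelector: string | null;                    // from the Chrome extension's BrowserClick event
  uiaSelector: { role: string; name: string };   // from the Windows UI Automation Click event
}

// Hypothetical driver functions (assumed, not the repo's actual API):
declare function tryClickByCss(selector: string): Promise<boolean>;
declare function clickByUia(sel: { role: string; name: string }): Promise<void>;

async function replayClick(cmd: FusedClick): Promise<void> {
  if (cmd.cssSelector) {
    const ok = await tryClickByCss(cmd.cssSelector);  // higher-fidelity channel first
    if (ok) return;
  }
  await clickByUia(cmd.uiaSelector);                  // fallback when the DOM is not addressable
}
```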
The runtime grep is the architecture audit
You can disambiguate the two architectures with a one-line test on the runtime executable. If the runtime imports an LLM SDK, the model is in the loop. If it does not, the model has been confined to authoring time.
For Mediar, the runtime executable is one Rust file at crates/executor/src/services/typescript_executor.rs. It is 871 lines. Its job is mechanical: pull a workflow off a Postgres queue, build the MCP arguments (workflow URL, secrets, trace id), and call the MCP execute_sequence tool against the target Windows session. The model only enters the picture in a completely separate file at authoring time, in the analyze_step function of apps/desktop/src-tauri/src/recording_processor.rs around line 1197, where Gemini Pro reads the captured tree once for each recorded step.
You can run the test yourself. Clone github.com/mediar-ai/terminator, open the executor file, and grep for openai, claude, anthropic, or gemini. You will find zero matches. That is the architectural commitment in source form: the model is allowed to think slowly about one workflow at a time during recording and is not allowed to be in the loop when the workflow is replaying against a customer's general ledger at 4 a.m.
The same grep against a vision-loop CUA finds the LLM client import in the runtime, because that is where the architecture puts it. Neither answer is wrong; they are different products. The grep just makes the choice legible.
When each architecture is the right call
The vision-loop architecture is the right default when the workload is exploratory or one-shot, when the surface genuinely has no structured channel (a third-party PDF report you cannot script, a CAPTCHA-laden site, a video frame), and when the team has the budget for per-step inference. Operator booking a flight, a researcher asking Claude Computer Use to summarize a Notion page, or a developer prototyping with trycua/cua all fit. The per-step model call is a feature in those settings because the system needs to react to whatever shows up on screen.
The tree-based architecture is the right default when the workload is repetitive, the target surface publishes an accessibility tree (which covers SAP GUI, Oracle EBS, Jack Henry, Fiserv, FIS, Epic, Cerner, eClinicalWorks, mainframe terminal emulators, and most of the Windows desktop estate), and the team needs deterministic replay for audit. A claims intake flow at a mid-market carrier, a customer onboarding flow at a regional bank, a patient verification flow in a clinic, and a master data update on SAP B1 all fit.
The honest summary is that the two architectures address adjacent but different problems. The vision-loop CUA is the right answer for novel exploration; the tree-based CUA is the right answer for unattended replay. A team buying for the second and shown a demo of the first will be disappointed when the OSWorld-level success rate shows up in production. A team buying for the first and handed a tree-based runtime will find it inflexible. The confusion comes from the shared name.
What this means if you are evaluating a CUA
The first question to ask a vendor is not which model they use. The first question is where the model runs. If the answer is "on every step from a screenshot," you have a vision-loop product. Plan for the per-step latency, the per-run token bill, and the audit conversation about why the action sequence is non-deterministic. If the answer is "once at recording time, the runtime is plain code reading the OS accessibility tree, and the workflow file lives in git," you have a tree-based product. The bill is bounded by runtime minutes, the audit artifact is a diffable code file, and a UI rev does not require regenerating the workflow under a fresh model session.
The second question is what the runtime executable looks like. Ask to see it. If the vendor cannot point at a single file (or will not), the architecture probably does not support the determinism story they are telling.
Have a desktop workflow you want priced as a tree-based CUA?
Book 25 minutes. We will record one pass of the workflow on a screenshare, show you the authored TypeScript file, and quote the runtime. No slides.
Common questions
What does CUA stand for?
Computer-Using Agent. The term was popularized by OpenAI's January 2025 system card for the model that powers Operator, but the category is older: any AI system that drives a computer's graphical interface using mouse, keyboard, and screen context qualifies. Anthropic's Computer Use beta, Google's Gemini 2.5 Computer Use model, the trycua/cua project, Microsoft Copilot Studio's Computer Use, and the desktop side of Manus all fit the definition.
Is CUA the same as RPA?
Not quite, and the difference is where the intelligence lives. Classic RPA tools (UiPath, Automation Anywhere, Blue Prism, Power Automate) ship with a designer where a developer hand-authors selectors and a runtime that replays them. A CUA replaces the designer with a model. The two flavors of CUA differ on whether they also replace the runtime: a vision-loop CUA does (the model is in the runtime), and a tree-based CUA does not (the model is only in the designer; the runtime is still deterministic code).
How well do vision-loop CUAs work today?
On the OSWorld benchmark, OpenAI's CUA scored 38.1% across 369 real desktop tasks. The previous best was 22.0%; the human baseline is 72.4%. Translated for a buyer: roughly six in ten attempts fail. WebArena (web-only tasks) and WebVoyager are easier benchmarks; the same agent reaches 58.1% and 87% on them, respectively. The OSWorld-Human follow-up paper from June 2025 measures the time cost too: a task a human finishes in two minutes can take an agent over twenty, with 75 to 94 percent of that time spent in LLM planning and reflection calls.
Why does that matter for a regulated workload?
Two reasons. First, a 38% per-task success rate is a research result; running a hundred-step claims-intake flow at that error rate compounds to near-certain failure. Second, the per-step model call is non-deterministic. Two identical inputs can produce different actions, which is acceptable for a research benchmark and unacceptable for a posting against a general ledger or a chart in Epic. The audit conversation that asks why the agent clicked button B on Tuesday and button C on Wednesday is one most CFOs will not have a second time.
How does a tree-based CUA avoid that?
It moves the non-deterministic part out of the runtime. The model is invoked once, at recording time, when a human walks the workflow on a screenshare. The model's job is small: read the structured accessibility tree of each screen and emit a TypeScript file that targets each control by role and name. That file is checked into git, code-reviewed, and replayed by a runtime that reads UI Automation directly. Two identical replays click the same control because the control is identified by its role and name, not by a model decision.
Where does Mediar invoke the model and where does it not?
Mediar invokes Gemini Pro at one place: the analyze_step function in apps/desktop/src-tauri/src/recording_processor.rs at line 1197, which fires while a human is recording. The runtime that replays workflows is crates/executor/src/services/typescript_executor.rs, 871 lines of Rust whose only job is to pull a workflow off a Postgres queue and call the MCP execute_sequence tool. Grep that file for openai, claude, anthropic, or gemini and you find zero matches. The architectural commitment is in the source.
What is the dual-channel recording trick?
When a user clicks a control inside a Chromium browser, two events fire in parallel. The Chrome extension publishes a BrowserClick event with the CSS selector. The Windows UI Automation layer publishes a Click event with the accessibility-tree selector. Both reach the recorder. Mediar fuses them into one run_command that carries both selectors (apps/desktop/src-tauri/src/workflow_recorder.rs lines 176-194 and the merge handler in apps/desktop/src-tauri/src/mcp_converter.rs starting at line 1034). The runtime tries the CSS selector first and falls back to the UIA selector when the page is rendered in compatibility mode or when the DOM has shifted. A vision-only CUA cannot do this because it never sees structured input in the first place.
When is a vision-loop CUA the right choice?
Three places. First, when the workload is exploratory: a researcher who wants an agent to book a flight, summarize a Notion page, or run a one-off data pull. Second, when the target surface genuinely has no other channel: a screenshot of a third-party report you cannot script, a video frame, a CAPTCHA-laden site. Third, when the team has the budget for the per-step inference and the freedom to retry until it works. None of those describe a regulated enterprise workload running unattended overnight.
Does the tree-based approach cover every workflow?
No. Three known gaps. First, vintage custom-paint apps that bypass standard controls (rare in regulated workloads because they break screen readers too). Second, partially bridged Java AWT clients where some controls publish to UIA and some do not. Third, novel exception handling that a recording cannot capture (multi-day case investigations, fraud triage). For all three, a hybrid is correct: read the tree where it is rich, fall back to OCR plus a small vision call for the residual controls. The hybrid pays for inference only on the controls that earn it, not on every step.
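A sketch of what that hybrid resolution chain can look like; every helper name below is hypothetical, and the fallback order (tree first, OCR plus vision only for residual controls) is the point.

```typescript
// Hybrid resolution sketch: read the accessibility tree where it is rich, and
// pay for OCR plus a small vision call only on the controls the tree misses.
// Every helper below is hypothetical.
interface Target { role: string; name: string }

declare function findInAccessibilityTree(t: Target): Promise<object | null>;
declare function clickElement(el: object): Promise<void>;
declare function captureScreenshot(): Promise<Uint8Array>;
declare function locateWithOcrAndVision(img: Uint8Array, label: string): Promise<{ x: number; y: number }>;
declare function clickAt(p: { x: number; y: number }): Promise<void>;

async function resolveAndClick(target: Target): Promise<void> {
  const el = await findInAccessibilityTree(target);   // cheap, deterministic path
  if (el) return clickElement(el);

  // Residual control: OCR plus a small vision call, for this step only.
  const screenshot = await captureScreenshot();
  const point = await locateWithOcrAndVision(screenshot, target.name);
  return clickAt(point);
}
```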
How do I try a tree-based CUA without buying anything?
Clone github.com/mediar-ai/terminator. The repo includes the recorder, the executor, and a runnable example. The license is MIT. A team that wants to extend the SDK in-house can do that without paying. The Mediar cloud product layers the authoring pipeline, the queue, secret management, and SOC 2 Type II controls on top, with pricing at $0.75 per minute of executor runtime plus a $10K turn-key program fee that converts to credits.
Adjacent reading
AI agents on legacy desktop systems with no API
The structured tree that screen readers read is also the surface a tree-based CUA reads. The exact format Mediar feeds the model from a SAP GUI window and the four-strategy fallback when the UI shifts.
AI agents replacing UiPath RPA: the boundary line
Where the model lives in a working replacement: at authoring time, not in the runtime hot path. The shape of the deterministic artifact and the source-code grep that disambiguates the architecture.
Enterprise AI agent governance for legacy systems
What a code-file workflow buys you in an audit conversation, and why the audit story is the part that decides whether a CUA pilot becomes a contract.