Guide

Workflow assistant: the test isn't whether it can click and type

A workflow assistant watches you do a task once and rebuilds the recording into something a machine can replay later. The honest implementations read what the OS already exposes through accessibility APIs (the same interfaces a screen reader uses) so the recording survives layout changes, and they label each step using the surrounding context, not the pixels in isolation.

Matthew Diakonov
11 min read

Direct answer, verified 2026-04-30

An AI workflow assistant is a program that watches you complete a task once and turns the recording into something a machine can replay later, without you writing a script.

What separates a useful one from a demo: it reads the accessibility tree the OS already exposes (so it survives layout changes), it labels each step in the recording using the steps that came before and after it (not just the click in isolation), and it produces an output you can read, edit, and version (a file, not a black box). Source code is at github.com/mediar-ai/terminator.

Where the term has drifted

The phrase "workflow assistant" now covers three different products in three different layers, and most pages on the subject blur the distinction. The blur is fine for a shopping comparison and bad for an architecture decision, so it is worth pulling apart up front.

The first layer is the API-plumbing assistant. Zapier, Make, n8n, Gumloop, and most of the platforms that surface in "AI workflow tools 2026" roundups live here. You describe a trigger, a condition, and an action; the platform translates that into HTTP calls against a fixed catalog of integrations. The assistant part is usually a chat that drafts the configuration. The runtime is just an iPaaS.

The second layer is the screen-watching assistant. Adobe Firefly, Microsoft Copilot inside Office, the Operator-style browser agents, and the new wave of vision-based desktop agents like Simular and Skywork live here. They look at the screen (often as pixels through a vision model) and try to act on what they see. They are often impressive in a demo and brittle in a production loop, because the perception layer is non-deterministic and the recordings are not durable artifacts.

The third layer is the recording-and-replay assistant. Mediar lives here, and so does any RPA platform that reads the OS accessibility tree honestly (UiPath at its best, parts of Power Automate Desktop, Blue Prism). The user does the task once. The assistant captures structured events and the accessibility tree. Output is a workflow file that can be re-run on a schedule or on demand. This is the layer where "workflow assistant" stops being a chat veneer and starts being something you can deploy.

Inside the recording loop

The rest of this page documents the recording-and-replay layer using Mediar's desktop agent as the worked example. The source files cited live in the apps/desktop/src-tauri/src/ directory of the Mediar product monorepo.

What the assistant sees, processes, and emits

Screenshot diff + accessibility tree + event stream → recording_processor.rs → step analysis → context-aware label → workflow file

The six events the assistant pays attention to

The first thing the assistant decides is what not to think about. A live screen produces hundreds of low-level events per second: mouse moves, focus changes, scroll wheels, hovers, property updates inside the accessibility tree. Treat them all as candidates for a step and the LLM bill alone makes the assistant unaffordable, never mind the noise in the labels.

Mediar narrows this firehose at one named function:

fn is_meaningful_event_type(event_type: &str) -> bool, at recording_processor.rs:250

Meaningful event types (everything else is recorded but not analyzed)

  • button_click
  • browser_click
  • text_input_completed
  • browser_tab_navigation
  • application_switch
  • file_opened

Six is a small number on purpose. Each one is an action the user deliberately took: pressing a button, clicking inside a browser, finishing a text input, switching tabs, switching applications, opening a file. Hovers, scrolls, and focus events get captured for replay accuracy, but they never trigger a Gemini call. This is the difference between $0.75/minute of runtime and a bill that scales with cursor movement.
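The shipped filter is the Rust function cited above. A minimal TypeScript sketch of the same gate, using the six event-type strings from the list (illustrative only, not the production code):

  // The six event types that trigger a Stage 1 analysis call. Everything else
  // is still captured for replay fidelity, just never sent to the model.
  const MEANINGFUL_EVENT_TYPES = new Set<string>([
    "button_click",
    "browser_click",
    "text_input_completed",
    "browser_tab_navigation",
    "application_switch",
    "file_opened",
  ]);

  function isMeaningfulEventType(eventType: string): boolean {
    return MEANINGFUL_EVENT_TYPES.has(eventType);
  }

  // Hovers, scrolls, and focus changes fall through here: recorded, not analyzed.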

The four stages, in the order they actually run

Stage 1 starts in parallel with the recording. Stages 2, 3, and 4 run after the user clicks stop. The labeling gate (Stage 2) is the one that decides whether the labels read as descriptions of clicks or as descriptions of intent.


Stage 1, step analysis

For every meaningful event, the assistant sends Gemini the screenshot before, the screenshot after, the accessibility tree before, the accessibility tree after, and the recent low-level events. The model returns eight named fields: step_title, step_summary, events_that_happened, how_content_changed, results_if_any, what_was_clicked, what_was_typed, user_intent.
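Those eight fields are what the later stages consume. A TypeScript sketch of the shape, with field names taken from the list above and string types assumed for illustration:

  // Sketch of a Stage 1 analysis result for one meaningful event.
  // Field names are from the pipeline description; the types are assumptions.
  interface StepAnalysis {
    step_title: string;            // short name for the step
    step_summary: string;          // what happened, in a sentence or two
    events_that_happened: string;
    how_content_changed: string;   // before/after difference on screen
    results_if_any: string;
    what_was_clicked: string;
    what_was_typed: string;
    user_intent: string;           // why the user did it, inferred from context
  }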

Why the labeling gate exists

The labeling gate at check_labeling_gate on line 315 of recording_processor.rs enforces three conditions before step N is allowed to receive its second-pass label. The step itself must have a Stage 1 analysis. The five steps before it must each have an analysis. The five steps after it (or as many as exist if N is near the end of the recording) must each have an analysis. Until all eleven slots are filled, the labeler waits.

The reason is the prompt that runs once the gate opens. Read the prompt verbatim from recording_prompts.rs: "For example, if the target step's summary is just 'Clicked button Submit', but the neighboring steps show the user filling out a registration form, the new label should be 'Submitted the New User Registration form'."

Without the gate, the label is correct and useless. With the gate, the label is correct and useful. The synthesis stage that runs next consumes those labels as the names of the steps in the hierarchical workflow, and a workflow named "Click Submit" is much harder to debug than one named "Submit New User Registration" when it breaks in production three months later.


Step N gets its context-aware label only after the five steps before it and the five steps after it have all been analyzed. Below the threshold, the labeler waits.

apps/desktop/src-tauri/src/recording_processor.rs:315
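A TypeScript sketch of the gate's three conditions (the shipped implementation is the Rust check_labeling_gate cited above; this only mirrors the logic):

  // Returns true once step N and its [N-5, N+5] neighborhood all have a
  // Stage 1 analysis. Analyses can arrive out of order, so the gate is
  // re-checked as results come in.
  function canLabelStep(
    stepIndex: number,
    totalSteps: number,
    analyzed: Set<number>, // indices of steps that already have an analysis
  ): boolean {
    // Condition 1: the step itself has an analysis.
    if (!analyzed.has(stepIndex)) return false;

    // Condition 2: the five steps before it (or as many as exist).
    for (let i = Math.max(0, stepIndex - 5); i < stepIndex; i++) {
      if (!analyzed.has(i)) return false;
    }

    // Condition 3: the five steps after it (or as many as exist).
    const last = Math.min(totalSteps - 1, stepIndex + 5);
    for (let i = stepIndex + 1; i <= last; i++) {
      if (!analyzed.has(i)) return false;
    }

    return true;
  }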

What synthesis is allowed to do, and what it isn't

Stage 3 takes the full timeline of labeled steps and produces a hierarchical workflow definition: workflows containing steps containing substeps, where each substep declares its inputs, outputs, and business_logic. The schema is enforced by WORKFLOW_SYNTHESIS_SCHEMA in the prompts file, not by post-hoc validation, so the model either returns the right shape or the call fails.
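Sketched in TypeScript, the enforced shape looks roughly like this; the substep fields (inputs, outputs, business_logic) come from the text, while the surrounding nesting and types are assumptions for illustration:

  // Approximate shape of the output WORKFLOW_SYNTHESIS_SCHEMA enforces.
  interface SynthesizedWorkflow {
    name: string;
    steps: SynthesizedStep[];
  }

  interface SynthesizedStep {
    name: string;           // the context-aware label from Stage 2
    substeps: Substep[];
  }

  interface Substep {
    name: string;
    inputs: string[];       // values the substep consumes, as observed
    outputs: string[];      // values the substep produces, as observed
    business_logic: string; // why the substep exists, in plain language
  }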

The most load-bearing line of the synthesis prompt is the last rule: "Do Not Hallucinate. Base all synthesized information directly on the provided context and event data. Do not invent steps, inputs, or outputs that are not supported by the evidence." This is a real constraint, not a marketing line. If the user only typed a vendor ID once in the recording, synthesis cannot fabricate a "retry on failure" substep that was never observed; the assistant has no evidence for it. The output is a faithful structured rendering of what actually happened, and nothing else.

That sounds like a limitation, and it is. But it is the same limitation a screen reader has: it can only narrate what is actually on screen. When the workflow needs branching, retry logic, or error handling that the recording did not include, the recording-and-replay assistant is not the right place to add it. The output file (TypeScript via the @mediar-ai/workflow SDK) is, because it is a real file you can edit.

Three perception mechanisms, one outcome that matters

The choice of perception mechanism is the single decision that determines whether your workflow assistant survives a software update. There are three families in the wild today.

Pixel matching

The assistant takes a screenshot, finds the "Submit" button by template-matching its image, and clicks the matched coordinates. Cheap, easy to start with, and fragile. The recording binds to a specific layout, font, theme, and DPI. Move the button, change the icon, ship a dark theme, and the playback breaks. Some of the older RPA tools and most of the "AI clicks for you" demo videos are doing this under the hood.

Vision model on raw pixels

An LLM with vision (GPT-4V class, Gemini Vision, Claude with image input) looks at the screenshot and decides what to click in natural language. Flexible across surfaces, non-deterministic across runs. The same recording can play back differently on Tuesday than it did on Monday because the model's interpretation of an ambiguous icon changed. Useful for exploration, hard to deploy as a production loop without a separate verification layer.

Accessibility tree

The assistant queries the OS-level accessibility API (the same one screen readers use) and gets back the structured tree of UI elements with roles, names, and properties. The "Submit" button is a node with role=button and name=Submit, regardless of where it is on screen or what it looks like. Mediar reads from this tree. So does UiPath's strict-selector mode and Power Automate Desktop on Win32 targets. The recording survives layout changes because the tree did not change.
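To make the difference concrete, here is a hypothetical sketch of what a tree node looks like to the assistant and how a step targets it by identity rather than by coordinates (the types and function are illustrative, not the actual terminator API):

  // Hypothetical node shape: identity lives in role + name, not in pixels.
  interface AccessibilityNode {
    role: string;                        // "button", "edit", "combobox", ...
    name: string;                        // "Submit", "Vendor ID", ...
    properties: Record<string, string>;
    children: AccessibilityNode[];
  }

  // A recorded step that targets { role: "button", name: "Submit" } keeps
  // resolving after a layout change, because the tree node did not change.
  function findByRoleAndName(
    root: AccessibilityNode,
    role: string,
    name: string,
  ): AccessibilityNode | undefined {
    if (root.role === role && root.name === name) return root;
    for (const child of root.children) {
      const hit = findByRoleAndName(child, role, name);
      if (hit) return hit;
    }
    return undefined;
  }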

The trade-off is honest: the accessibility tree only exists if the OS-level API is wired into the application. On Windows that is most desktop apps. In a browser tab the tree is exposed through the same DOM-based accessibility hooks. On a mainframe terminal emulator, on a Citrix-published session, on a VMware Horizon view of a remote desktop, the assistant has to fall back to other surfaces. That is where the "AI works on anything" pitch starts to need an asterisk, and the asterisk is worth asking about before signing.

Five questions to ask before buying any workflow assistant

  1. Does it read the OS-level accessibility tree, or does it match pixels and selectors? Pixel matching breaks when the icon moves.
  2. When it labels a step, does it look at the steps before and after, or only at the one in front of it? Out-of-context labels are how recordings turn into garbage.
  3. What does it do when the recording is finished? Does it hand you a file you can read, version, and edit, or does it stash the workflow in a black box on the vendor's server?
  4. Can the assistant work on a Windows desktop app that has no API? SAP GUI, Jack Henry, Epic, mainframe terminal emulators. The honest answer is yes or no, not "it depends".
  5. What is the unit of billing? Per seat, per minute of execution, per workflow, per agent-hour? The wrong unit makes a cheap assistant expensive at scale.

A vendor that cannot answer these in plain language is selling a chat veneer over an API catalog, not a workflow assistant. The Mediar answers, in order: yes (accessibility tree, no pixels), yes (the [N-5, N+5] gate), yes (a TypeScript file using @mediar-ai/workflow), yes (SAP GUI, Jack Henry, Epic, Cerner, Oracle EBS, mainframe terminals over Win32 emulators), and per-minute of runtime at $0.75/min with no per-seat cost. The source files for everything above are in the open desktop agent at github.com/mediar-ai/terminator.

Watch the pipeline run on a workflow you actually do

A 30-minute call. We record one of your existing tasks (SAP, Oracle, Epic, Jack Henry, Excel, you pick) and you walk away with the labeled trace, the synthesized workflow, and the TypeScript file the four stages produced.

Frequently asked questions

What is an AI workflow assistant

A program that watches you complete a task once and rebuilds the recording into something a machine can replay later. The honest implementations read what the OS already exposes through accessibility APIs (the same interfaces a screen reader uses) so the recording survives layout changes, and they label each step using the surrounding context, not the pixels in isolation. Mediar processes the recording through four stages: step analysis, context-aware labeling, synthesis, code generation. The output is a TypeScript file against an open SDK at github.com/mediar-ai/terminator, not a closed executable.

How is a workflow assistant different from workflow automation software

Workflow automation software (Zapier, Make, n8n) is plumbing between APIs. You describe a trigger, a condition, an action, and the platform fires it. A workflow assistant builds the description for you by watching you do the task once. The two categories are complementary on systems that have APIs. The categories diverge on systems that do not, which is most of the legacy desktop world. There, the only surface available is the accessibility tree, and only assistants that read it can produce a runnable recording at all.

Why does it matter whether the assistant uses accessibility APIs or pixel matching

Pixel matching binds the recording to the layout. Move the Submit button five pixels to the right, change a font, swap a light theme for a dark theme, and the playback breaks. Accessibility APIs return the structured tree of UI elements with their roles and names, the same way a screen reader sees the screen. The role and name of the Submit button do not change when the layout shifts. That is why Mediar reads from the tree and not the pixels, and it is why the same recording continues to run when the underlying app ships an update.

How does Mediar narrow what the assistant pays attention to

The function is_meaningful_event_type at apps/desktop/src-tauri/src/recording_processor.rs:250 returns true for exactly six event types: button_click, browser_click, text_input_completed, browser_tab_navigation, application_switch, file_opened. Mouse moves, focus events, scroll events, and the rest of the firehose are recorded but never analyzed. Without this filter the assistant would spend most of a recording explaining how the user moved the cursor between buttons. With it, every analysis call has a chance of producing a useful label.

What does the labeling gate actually check

check_labeling_gate at recording_processor.rs:315 enforces three conditions before step N is allowed to receive a context-aware label. The step itself must have an analysis. The five steps before it must each have an analysis. The five steps after it (or as many as exist if N is near the end of the recording) must each have an analysis. Until all eleven slots are filled, the labeler waits. The reason is that labels are not just descriptions of one click; they are the names that synthesis uses in the hierarchical workflow. A label assigned without context tends to read like "User clicked Submit", which is correct but useless. With context the same step becomes "Submitted the New User Registration form".

What does the assistant produce at the end

A TypeScript file that uses the @mediar-ai/workflow SDK. The file contains one createWorkflow block, a zod input schema derived from the inputs the assistant detected during recording, and one step entry per substep produced during synthesis. The file is generated against a documented template (WORKFLOW_TS_TEMPLATE in recording_prompts.rs). Because the output is a file, you can read it, edit it, version it in git, and review the changes when the assistant produces a revised version against a new recording. The same surface area is also available as YAML on the no-code web app at app.mediar.ai/web for non-developer users who want a runnable workflow without touching TypeScript.
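As a hedged illustration of the output's shape only (the real file follows WORKFLOW_TS_TEMPLATE, and the actual createWorkflow options and step signature belong to the @mediar-ai/workflow SDK, so the field names and step API below are assumptions), a generated file might look roughly like:

  import { z } from "zod";
  import { createWorkflow } from "@mediar-ai/workflow";

  // Illustrative sketch: the exact createWorkflow options and step API come
  // from the SDK's template and may differ from what is shown here.
  export default createWorkflow({
    name: "Submit New User Registration",
    // zod schema derived from the inputs detected during the recording
    input: z.object({
      vendorId: z.string(),
      registrationDate: z.string(),
    }),
    steps: [
      // one entry per substep produced by synthesis; names are the Stage 2 labels
      { name: "Open the New User Registration form", run: async () => { /* ... */ } },
      { name: "Fill in vendor details", run: async () => { /* ... */ } },
      { name: "Submit the New User Registration form", run: async () => { /* ... */ } },
    ],
  });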

How long does a recording take to become a runnable workflow

Stage 1 (step analysis) runs in parallel as the recording happens, so the assistant is mostly done analyzing by the time you stop recording. Stage 2 (labeling) runs as soon as the [N-5, N+5] window for each step closes. Stages 3 and 4 (synthesis, code generation) happen after recording stops and are typically the only stages the user waits on. For a recording with thirty meaningful events, total processing is a few minutes. The faster the source LLM (Mediar uses gemini-pro-latest by default per the GEMINI_MODEL constant), the faster the wait.

What workflows is the assistant actually good at

Workflows that recur more than a hundred times per week and that involve at least one Windows desktop application without a usable API. The recurring shape is what makes the upfront cost of recording pay back. The desktop-app shape is what differentiates this category from the API-plumbing category. Specific examples on Mediar deployments: order entry from a POS system into SAP B1, claims intake into mid-market insurance carriers, customer onboarding across Jack Henry, Fiserv, and FIS, patient intake into Epic and Cerner, and PDF data extraction into Excel and Oracle EBS.

What is the cost shape of running a workflow assistant

Mediar charges $0.75 per minute of runtime (the time the assistant spends executing the workflow, not the time you spend recording it). There is no per-seat licensing. The $10,000 turn-key program fee converts to credits with a bonus, so it functions as prepaid usage during the first deployment. A workflow that runs five hundred times per week at one minute per run costs about $1,500 per month in runtime. The same workflow on UiPath (per the F&B chain comparison Mediar publishes) ran about five times that. The reason the unit is minutes of runtime, not seats, is that the assistant does the work; people do not.
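The arithmetic behind that estimate, using roughly four weeks to a month:

  // Back-of-envelope for the runtime-minutes pricing described above.
  const ratePerMinuteUsd = 0.75;
  const runsPerWeek = 500;
  const minutesPerRun = 1;
  const weeksPerMonth = 4; // rough month used in the estimate

  const monthlyRuntimeCostUsd =
    ratePerMinuteUsd * runsPerWeek * minutesPerRun * weeksPerMonth; // 1500
  console.log(`about $${monthlyRuntimeCostUsd} per month in runtime`);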