Inside Mediar

What is Mediar? A walkthrough of what happens when it watches one click.

Most descriptions of Mediar stop at "the AI watches your workflow and runs it 24/7." That sentence is true and useless. It does not tell you what gets stored, what the model sees, or why the workflow does not break the next time SAP nudges a button two pixels left. This guide opens the desktop agent and shows the data shape that flows from a single click to a replayable TypeScript workflow file.

Matthew Diakonov
8 min

First, three Mediars: which one is this?

Search for "mediar" and you will land on at least four different companies. Three of them are unrelated to this one: MediaRadar is an ad-intelligence tool for sales teams, Mediar Therapeutics is a biotech, and Mediar Solutions sells in-store analytics for retail.

This page is about Mediar AI, the Y Combinator-backed company at mediar.ai. The product is a desktop agent for Windows that watches a workflow once, stores it as semantic intent rather than coordinates, and replays it at scale. The open-source executor underneath is called Terminator and lives at github.com/mediar-ai/terminator.

Moment one: the click and the tree

When you click, type, or navigate while Mediar is recording, the desktop agent grabs the raw input event plus a compact snapshot of the Windows UI Automation tree. UI Automation is the accessibility framework that screen readers like NVDA and JAWS use to tell a blind user what is on the screen. Mediar uses the same source, encoded as Roman-numeral-indented YAML.

Here is the snapshot of an SAP AP invoice header taken just before a vendor code is typed into the Vendor field; a second snapshot with the same shape is captured immediately after. Notice that nothing pixel-related is stored. The model compares the before and after trees to learn what changed.

I. [Window] 'SAP Easy Access'
   II. [Pane] 'Document Header'
      III. [Edit] 'Reference' value=""
      III. [Edit] 'Posting Date' value="04/27/2026"
      III. [Edit] 'Vendor' value="" focused=true
      III. [Button] 'Save' enabled=false
      III. [Button] 'Post' enabled=false
   II. [Pane] 'Line Items'
      III. [Table] 'Items' rowCount=0

  • Vendor edit field is focused but empty
  • Save button is disabled
  • Each line carries role, name, and a few key attributes
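
Put together, each recorded moment can be thought of as one record carrying the raw event, both tree snapshots, and both screenshots. This is a minimal TypeScript sketch of that shape; the type and field names are my own labels for what the article describes, not Mediar's actual types.

```typescript
// Illustrative only: the real recorder is Rust; all names here are assumptions.
type UiaNode = {
  role: string;                         // e.g. "Button", "Edit", "Pane"
  name: string;                         // the accessible name, e.g. "Vendor"
  attributes: Record<string, string>;   // a few key attributes, e.g. { focused: "true" }
  children: UiaNode[];
};

type RecordedMoment = {
  rawEvent: { kind: "click" | "type" | "navigate"; detail: string };
  beforeTree: UiaNode;                  // accessibility tree just before the action
  afterTree: UiaNode;                   // the same tree just after it
  beforeScreenshot: string;             // secondary signal for the model, not the identifier
  afterScreenshot: string;
};

const moment: RecordedMoment = {
  rawEvent: { kind: "type", detail: "vendor code entered" },
  beforeTree: { role: "Edit", name: "Vendor", attributes: { focused: "true", value: "" }, children: [] },
  afterTree: { role: "Edit", name: "Vendor", attributes: { focused: "true", value: "ACME01" }, children: [] },
  beforeScreenshot: "before.png",
  afterScreenshot: "after.png",
};
console.log(moment.rawEvent.kind); // "type"
```

The vendor code "ACME01" is a placeholder; the point is that the before and after trees travel together as one unit into the pipeline.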

Moment two: stripping the volatile bits

Before either of those snapshots gets written, the recorder runs them through a small preprocessor that drops attributes that change for trivial reasons. Coordinates, sizes, and the cached value of input fields all get removed. A button at (412, 218) on your screen and the same button at (440, 232) on a coworker's 4K monitor are treated as the same node.

dom_tree_diff.rs
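
The preprocessor itself is Rust (the file named above). As a sketch of the idea in TypeScript, with the attribute list and every name assumed rather than taken from the real source:

```typescript
// Sketch of the idea only; the real implementation lives in the Rust file above.
type RawNode = { role: string; name: string; [attr: string]: unknown };

// Attributes that change for trivial reasons get dropped before the snapshot is saved.
const VOLATILE = new Set(["x", "y", "width", "height", "value"]);

function stripVolatile(node: RawNode): RawNode {
  const out: RawNode = { role: node.role, name: node.name };
  for (const [key, val] of Object.entries(node)) {
    if (!VOLATILE.has(key)) out[key] = val;
  }
  return out;
}

// The same Save button captured at two different positions on two monitors...
const onMyScreen = { role: "Button", name: "Save", x: 412, y: 218, enabled: false };
const onCoworkers = { role: "Button", name: "Save", x: 440, y: 232, enabled: false };

// ...compares equal once the volatile attributes are gone.
const same =
  JSON.stringify(stripVolatile(onMyScreen)) === JSON.stringify(stripVolatile(onCoworkers));
console.log(same); // true
```

Only role, name, and the stable attributes survive, which is exactly why the (412, 218) and (440, 232) buttons count as the same node.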

This is the first half of why Mediar workflows survive a UI refresh. Stable identity comes from role, name, automation ID, and tree position, not from where the pixels happen to land today.

Moment three: the four-stage pipeline kicks in

Each meaningful event (a click, a keystroke run, a navigation) gets queued for processing. The desktop UI shows four progress counters because there are four explicit stages. They are visible in the source as stage_totals on the ProcessingProgress struct in recording_processor.rs: step_analysis_total, labeling_total, synthesis_total, generation_total.

1. Capture the click and the tree

When you click, type, or navigate while Mediar is recording, the desktop agent stores the low-level event plus a compact YAML snapshot of the Windows accessibility tree from before the action and from after it. The tree is the same one screen readers consume, not pixels.

2. Strip volatile attributes

Coordinates and dimensions get removed before the snapshot is written. A button at (412, 218) and the same button at (440, 232) after a resize are treated as the same element. Only role, name, automation id, and structural relationships survive into the saved tree.

3. Run the four-stage pipeline

The recorded session is processed in four explicit stages: step analysis, labeling, synthesis, and generation. Each stage has its own progress counter the desktop UI displays as you watch.

4. Extract eight semantic fields per step

Step analysis sends the before tree, the after tree, both screenshots, and the raw events to Gemini Vertex AI. The model returns a JSON object with eight named fields that describe what you did and why, not where you clicked.

5. Re-label with neighbor context

The labeling pass re-reads each step alongside its two neighbors so a generic action like 'clicked Submit' becomes the more useful 'submitted the new vendor master record'. This is the bit that makes a replay legible to a human reviewer six months later.

6. Generate a replayable workflow

Synthesis groups steps into one or more workflows, each with substeps that list inputs, outputs, and business logic. Generation writes them out as a TypeScript workflow file the runtime can execute against any Windows desktop, with the four-strategy focus restoration cascade as the safety net when an element has moved.
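
The article does not show a generated file, so treat the following as a purely illustrative sketch of what a generated TypeScript workflow step list could look like. Every identifier and the overall shape are assumptions, not Mediar's actual output format; only the idea of intent-first steps comes from the text above.

```typescript
// Invented shape for illustration; not Mediar's real generated format.
type Step = {
  stepTitle: string;   // the human-readable label from the labeling pass
  userIntent: string;  // why the user acted, not where they clicked
  target: { role: string; name: string; automationId?: string };
  action: "click" | "type";
  text?: string;
};

const enterVendorInvoice: Step[] = [
  {
    stepTitle: "Enter vendor code",
    userIntent: "Identify the vendor for this AP invoice",
    target: { role: "Edit", name: "Vendor" },
    action: "type",
    text: "{{vendorCode}}", // parameterized input resolved at run time
  },
  {
    stepTitle: "Post the invoice",
    userIntent: "Submit the completed document header",
    target: { role: "Button", name: "Post" },
    action: "click",
  },
];

console.log(enterVendorInvoice.length); // 2
```

Note what the targets contain: a role and a name, never a coordinate, so the runtime is free to re-resolve them against the live tree at replay time.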

Moment four: what the model returns

The step-analysis stage sends the before tree, the after tree, both screenshots, and the raw input events to Gemini Vertex AI with a strict JSON schema. The schema has eight required fields, all defined in recording_prompts.rs. Here is what the model produced for the vendor-entry click above.

step_0042.json
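
The JSON body itself did not survive into this page, so here is an illustrative reconstruction for the vendor-entry click. Only the eight field names are taken from the article (they appear again in the comparison table below); every value is my guess at what such a step would contain, not actual model output.

```typescript
// Illustrative values; only the eight field names come from the article.
const step0042 = {
  step_title: "Enter vendor code in the Vendor field",
  step_summary: "Typed a vendor code into the empty, focused Vendor edit field on the SAP document header.",
  events_that_happened: "Keystrokes delivered to [Edit] 'Vendor' while it held focus.",
  how_content_changed: "The Vendor edit's value went from empty to the typed code; no other nodes changed.",
  results_if_any: "No validation message appeared; Save and Post remained disabled.",
  what_was_clicked: "Nothing; this step was keyboard input only.",
  what_was_typed: "A vendor code",
  user_intent: "Identify the vendor for this AP invoice before posting.",
};
console.log(Object.keys(step0042).length); // 8
```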

Notice what is and is not in this object. There are no coordinates. There is no XPath. There is no DOM selector. There is the user's intent, the result the system gave back, and a precise description of what changed in the tree. This is the actual primitive Mediar stores; the replayable workflow file is built on top of these.

Moment five: replay time, and the fallback cascade

Replay does the inverse. Each stored step has a target description, and the runtime walks the live UI Automation tree to find a matching element. If the application has been redesigned in the meantime, the obvious match might not exist. The runtime tries four strategies, in order, before giving up and surfacing the failure to a human.

focus_state.rs
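
The cascade in focus_state.rs is Rust; the control flow it implements can be sketched in TypeScript like this. Every helper name and the element shape here are assumptions; only the four strategies and their order come from the article.

```typescript
// Sketch of the four-strategy cascade; all names are assumptions.
type Element = { automationId?: string; name: string; bounds?: [number, number, number, number] };
type Finder = (target: Element) => Element | null;

function findWithCascade(target: Element, strategies: Finder[]): Element | null {
  for (const strategy of strategies) {
    const found = strategy(target); // try the next cheapest strategy
    if (found) return found;        // first hit wins
  }
  return null;                      // all four missed: surface the failure to a human
}

// The four strategies, in the order the article gives:
// 1. automation ID, 2. window + bounds, 3. text content, 4. focus the parent window.
const byAutomationId: Finder = (t) => (t.automationId ? { ...t } : null);
const byWindowAndBounds: Finder = (t) => (t.bounds ? { ...t } : null);
const byTextContent: Finder = (t) => (t.name ? { ...t } : null);
const focusParentWindow: Finder = () => ({ name: "parent window" });

const cascade = [byAutomationId, byWindowAndBounds, byTextContent, focusParentWindow];

// A redesign dropped the automation ID and moved the element, so strategies
// one and two miss, but the visible label 'Vendor' still matches on strategy three.
const result = findWithCascade({ name: "Vendor" }, cascade);
console.log(result?.name); // "Vendor"
```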

Strategy one is the cheapest and most precise: the developer set an automation ID on the field, and that ID survived the redesign. Strategy three is the most resilient: even if the layout shifted and the IDs were reassigned, the visible label 'Vendor' usually does not change, because users would notice. The cascade is why "self-healing" is a real product property here, not a marketing claim.

Watch a recording session in real time

The desktop agent emits each stage as a structured log line. This is roughly what you see in the developer console while a session processes (event counts and IDs trimmed for clarity).

mediar record
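
The log excerpt itself is missing from this page, so the following TypeScript sketch only simulates the four progress counters described above. The output format is invented; the four field names mirror ProcessingProgress in recording_processor.rs as the article reports them.

```typescript
// Simulation only: the real agent is Rust and its log format is not shown here.
type ProcessingProgress = {
  step_analysis_total: number;
  labeling_total: number;
  synthesis_total: number;
  generation_total: number;
};

// Render one console line per stage, e.g. "[pipeline] labeling: 40/42".
function formatProgress(done: ProcessingProgress, totals: ProcessingProgress): string[] {
  return (Object.keys(totals) as (keyof ProcessingProgress)[]).map(
    (stage) => `[pipeline] ${stage.replace("_total", "")}: ${done[stage]}/${totals[stage]}`
  );
}

const totals: ProcessingProgress = {
  step_analysis_total: 42, labeling_total: 42, synthesis_total: 3, generation_total: 1,
};
const now: ProcessingProgress = {
  step_analysis_total: 42, labeling_total: 40, synthesis_total: 0, generation_total: 0,
};
for (const line of formatProgress(now, totals)) console.log(line);
```

A session mid-processing would show step analysis complete, labeling nearly done, and the later stages still queued, which is exactly the shape of the four counters in the desktop UI.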

The pipeline in numbers

The shape of the pipeline is small enough to fit on one screen. That is the point.

4 stages in the recording pipeline
8 semantic fields extracted per step
4 fallback strategies for finding an element
0 selectors you maintain by hand

What this changes versus traditional RPA

The mechanics above are the reason Mediar pitches itself as a replacement for UiPath, Power Automate, and Automation Anywhere rather than a complement. Stored intent plus an LLM at replay time changes the failure mode of the whole stack.

| Feature | Selector-based RPA | Mediar |
| --- | --- | --- |
| What gets stored per click | An XPath or CSS selector against the rendered UI | Before and after accessibility tree, screenshots, raw events, and an LLM-extracted intent record |
| Coordinates in the saved workflow | Often pixel coordinates or absolute selectors | Stripped before save; positions are recomputed at replay time |
| When the target button moves | Selector misses, run fails, developer fixes the selector | Four-strategy cascade tries id, then window-and-bounds, then text content, then the parent window |
| How a workflow is described | A list of UI actions: click(x), type(y), wait | step_title, step_summary, events_that_happened, how_content_changed, results_if_any, what_was_clicked, what_was_typed, user_intent |
| Re-record when the screen redesigns | Yes, a fresh build of every selector | Usually no; intent and accessibility names generally survive a redesign |

Where this falls short, honestly

Two things to call out before the FAQ. First, the four-stage pipeline takes seconds per step, not milliseconds. If you are trying to replay 10,000 steps a minute against a single Windows session, the LLM-grounded path is the wrong tool; you want a deterministic UIA recording for that. Mediar uses the LLM path during recording and replay-time intent matching, but the actual UIA calls at replay are deterministic and fast.

Second, applications that publish a poor accessibility tree (some Java Swing apps, certain Citrix-rendered sessions) are weaker on strategy one and lean harder on strategies two and three. They still work, but the elevated false-match rate means we recommend an explicit verify-after-act loop on those targets. The Citrix-specific path is documented separately in our Epic in Citrix guide.

Want to see this run on your stack?

Book a 30-minute call. We will record one of your real workflows live and walk through the resulting step-analysis JSON together.

Frequently asked questions

There are several products called Mediar. Which one is this?

This is Mediar AI, the Y Combinator-backed company building AI desktop automation. The desktop agent records workflows on Windows and replays them via accessibility APIs. It is not MediaRadar, the ad sales tool, or Mediar Therapeutics, the biotech, or Mediar Solutions, the in-store retail analytics company. The website is mediar.ai and the open-source executor lives at github.com/mediar-ai/terminator.

What does Mediar actually record when I click a button?

Three things are stored alongside the raw click event. First, the Windows accessibility tree from immediately before the click, encoded as Roman-numeral-indented YAML where each line reads like 'III. [Button] Save enabled=false'. Second, the same tree from after the click. Third, a before and after screenshot. Coordinates and dimensions are stripped from the tree before it is saved, so a window resize between recording and replay does not invalidate the snapshot.

Why does Mediar use the accessibility tree instead of pixels?

Because the accessibility tree is what the application itself publishes. It includes the role of every element (Button, Edit, Pane, Window), the name (the visible label or the accessible name), automation IDs the developer set, and parent-child relationships. Vision-only systems have to infer all of that from pixels, and they get it wrong when fonts change, themes flip, or DPI shifts. Mediar uses screenshots as a secondary signal for the model's reasoning, not as the primary identifier.

What does the four-stage pipeline do exactly?

Stage one is step analysis: every meaningful event becomes a structured eight-field JSON object via Gemini Vertex AI. Stage two is labeling: each step gets re-read with its two neighbors so generic labels become specific ones. Stage three is synthesis: steps are grouped into one or more workflows, each with substeps that list inputs, outputs, and business logic. Stage four is generation: the result is written out as a TypeScript workflow file the runtime can execute. The desktop UI shows you all four progress counters live as the recording is processed.

What happens at replay if the UI has changed since the recording?

The runtime tries to find the target element through a four-strategy cascade defined in focus_state.rs. First it looks for the element by accessibility or automation ID. If that fails, it looks by window plus bounds. If that fails, it looks by text content. If all three fail, it falls back to focusing the parent window and lets the next step retry. Most UI tweaks (a button moves a few pixels, a panel reorders) are absorbed by strategies one and three because the role and name usually survive a redesign.

Is Mediar open source?

The executor is. Terminator, the Rust SDK that performs the actual UI Automation calls and the four-strategy focus restoration, is published as the terminator-rs crate and lives at github.com/mediar-ai/terminator. The desktop recorder, the cloud processing pipeline, and the no-code workflow builder at app.mediar.ai are commercial. Teams that want to extend the runtime can build on Terminator directly.

Does Mediar work on macOS or Linux?

Today the production product targets Windows desktop applications, including SAP GUI, Oracle Forms, Epic Hyperspace, Citrix-published apps, and most legacy line-of-business tools. The Terminator SDK has scaffolding for other platforms, but the recording and replay paths the commercial product ships are Windows-only. This is intentional: most enterprise desktop automation demand sits on Windows.

How fast can I get a workflow into production?

Most teams record their first end-to-end workflow within a week, often within an afternoon. The recording itself takes as long as the task takes you to do once. Processing runs in the background and emits a TypeScript workflow file. The longer part is usually the access review with IT, not Mediar setup. The standard turn-key program is three months because that includes change management and broader rollout, not because the technology takes that long.