A walk through the four-stage Gemini pipeline
Intelligent process automation software: what the AI actually does, in code.
Every page on this topic explains intelligent process automation as RPA plus a list of AI components (NLP, OCR, ML, BPM) and stops at the category. The intelligence stays a black box. This page opens the box. The four stages, the prompts, the schemas, the sliding context window, and the TypeScript file the pipeline writes are all visible in the open-source desktop agent at github.com/mediar-ai/terminator. We will walk all four.
Direct answer (verified 2026-05-11)
Intelligent process automation (IPA) software combines robotic process automation with AI so the system learns workflows from observation rather than being scripted step by step. In Mediar that learning is a four-stage pipeline run by Gemini Vertex AI: step analysis (extracts user intent and seven other fields per event), context labeling (re-evaluates each step using a sliding window of five neighbors on each side), workflow synthesis (groups steps into a hierarchy with named inputs, outputs, and business logic), and code generation (emits a typed TypeScript workflow file using createWorkflow and createStep). The definitions are in github.com/mediar-ai/terminator. The category definition is consistent with the AWS intelligent automation reference, which describes IA as software automation that learns and improves over time.
The shape of the pipeline
A recorded session in Mediar produces a flat stream of low-level events: mouse clicks, keystrokes, navigation, app switches, file opens. That stream is too granular to act on and too coarse to reason about. The pipeline closes the gap in four stages, each of which writes a structured artifact the next stage consumes.
1. Step analysis. Gemini reads one event plus its screen and tree context, and returns 8 fields including user_intent.
2. Context labeling. Each step is re-labeled using its 5 neighbors on each side, so 'Clicked Submit' becomes 'Submitted the New User Registration form'.
3. Workflow synthesis. The flat timeline is folded into a hierarchy of steps and substeps, each with declared inputs, outputs, and business logic.
4. Code generation. A TypeScript file is written using createWorkflow and createStep from @mediar-ai/workflow. That file is the executable workflow.
The rest of this page is one section per stage, with the prompt or schema each stage uses, drawn verbatim from apps/desktop/src-tauri/src/recording_prompts.rs.
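As a rough map before the stage-by-stage walk, the artifact each stage hands to the next can be sketched as a handful of TypeScript types. The type names here are illustrative, not the pipeline's actual Rust structs; the field names come from the schemas shown in the sections below, and the name of the label field is an assumption.
// Illustrative only: approximate shape of each stage's output.
type StepAnalysis = {                  // Stage 1: one object per meaningful event
  step_title: string;
  user_intent: string;
  results_if_any: string;
  // ...plus the five other string fields required by STEP_ANALYSIS_SCHEMA below
};
type LabeledStep = StepAnalysis & { label: string };  // Stage 2 adds one context-aware label (field name assumed)
type SynthesizedWorkflow = {           // Stage 3: per WORKFLOW_SYNTHESIS_SCHEMA, simplified
  title: string;
  description: string;
  steps: {
    step_name: string;
    substeps: { substep_name: string; inputs: string[]; outputs: string[]; business_logic: string[] }[];
  }[];
};
// Stage 4 renders SynthesizedWorkflow into a TypeScript file using createWorkflow and createStep.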
Stage 1. Step analysis: extract intent, not pixels
The recorder filters its raw event stream down to events that actually matter. The list is short and explicit, in is_meaningful_event_type: button_click, browser_click, text_input_completed, browser_tab_navigation, application_switch, file_opened. Hover, scroll, focus change, and idle mouse motion are dropped. For every meaningful event, the analyzer sends Gemini the before and after screenshot, the before and after accessibility tree, the surrounding low-level events, and the three previous step analyses. The model returns one JSON object that matches this schema:
pub const STEP_ANALYSIS_SCHEMA: &str = r#"{
  "type": "object",
  "properties": {
    "step_title": { "type": "string" },
    "step_summary": { "type": "string" },
    "events_that_happened": { "type": "string" },
    "how_content_changed": { "type": "string" },
    "results_if_any": { "type": "string" },
    "what_was_clicked": { "type": "string" },
    "what_was_typed": { "type": "string" },
    "user_intent": { "type": "string" }
  },
  "required": [
    "step_title",
    "step_summary",
    "events_that_happened",
    "how_content_changed",
    "results_if_any",
    "what_was_clicked",
    "what_was_typed",
    "user_intent"
  ]
}"#;
The two fields that do the heavy lifting later are user_intent and results_if_any. The first captures why the user did what they did (Was the user filling a customer record? Approving a payment? Searching for an item?) at the granularity of a single click. The second captures what came back from the system (a new screen, a validation error, a successful save). When the model cannot derive a field from the context, the prompt instructs it to write 'Not available in data' rather than guess. The downstream stages treat that string as missing rather than wrong.
Stage 2. Context labeling: an 11-event sliding window
A label produced from a single event is fragile. "Clicked OK" is true but useless. The labeling stage re-evaluates each analyzed step using the five steps before and the five steps after, then asks Gemini to produce a single descriptive label that carries the surrounding intent. The labeler is gated on a condition that is the cleanest summary of the whole stage:
// apps/desktop/src-tauri/src/recording_processor.rs
// Labeling N requires analyses [N-5, N+5] to be complete
fn check_labeling_gate(
    meaningful_idx: usize,
    total_meaningful: usize,
    completed: &HashMap<usize, StepAnalysis>,
    labeled: &HashSet<usize>,
) -> bool {
    if labeled.contains(&meaningful_idx) {
        return false;
    }
    if !completed.contains_key(&meaningful_idx) {
        return false;
    }
    let start = meaningful_idx.saturating_sub(5);
    let end = std::cmp::min(meaningful_idx + 6, total_meaningful);
    for i in start..end {
        if !completed.contains_key(&i) {
            return false;
        }
    }
    true
}
A step is allowed to enter labeling only when it and all 10 of its neighbors have a completed analysis. The window is wide enough to recover intent from a click on a generic control (the form being filled lives in the past five events; the result of clicking lives in the next five) and narrow enough to keep the prompt cost predictable.
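The window math is easy to trace by hand. A sketch of the same boundary logic in TypeScript, with a few worked cases (the Rust above is the authoritative implementation):
// Sketch: labeling step N waits for analyses [max(N-5, 0), min(N+5, last)] to complete.
function labelingWindow(idx: number, totalMeaningful: number): [number, number] {
  const start = Math.max(idx - 5, 0);
  const end = Math.min(idx + 5, totalMeaningful - 1); // inclusive upper bound
  return [start, end];
}

// labelingWindow(0, 40)  -> [0, 5]    first step: no past context yet, five future analyses
// labelingWindow(2, 40)  -> [0, 7]    near the start the window is clipped on the left
// labelingWindow(20, 40) -> [15, 25]  the full 11-event window in the middle of a recording
// labelingWindow(39, 40) -> [34, 39]  last step: clipped on the right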
The labeling prompt itself is short. It takes the target analysis plus its neighbor analyses and asks for one JSON field back. The example it ships with is the cleanest illustration: the target step's summary is "User typed 'password123'", and the neighbor analysis shows the user previously typed 'john.doe@email.com'. The label the model returns is "Entered password for user 'john.doe@email.com'". That label is the one a human reviewer sees in the generated workflow file. They almost never have to edit it.
Stage 3. Synthesis: from timeline to typed hierarchy
A flat sequence of labeled events is still not a workflow. A workflow has higher-level structure: a top-level step that opens the customer form, a substep that fills the address block, a substep that selects a price list, a substep that posts the document. The synthesis stage folds the labeled timeline into that shape. The output schema is the contract:
// WORKFLOW_SYNTHESIS_SCHEMA, simplified
{
  "workflows": [
    {
      "title": string,
      "description": string,
      "steps": [
        {
          "step_name": string,
          "substeps": [
            {
              "substep_name": string,
              "inputs": string[],
              "outputs": string[],
              "business_logic": string[]
            }
          ]
        }
      ]
    }
  ]
}
Three things in this schema matter more than they look. The first is that every substep declares its own inputs and outputs as string arrays. That is what makes the generated workflow composable later: a step that produces an 'invoice_number' can be referenced by a later step that consumes 'invoice_number'. The second is the explicit business_logic array. The synthesis prompt asks the model to name the rules governing each substep ("post date must fall in an open period", "line discount must not exceed the customer discount group ceiling"), so they end up as documented comments in the generated code rather than implicit assumptions. The third is the "Do Not Hallucinate" rule in the prompt: base everything on the provided context, invent nothing. Pages that describe IPA as "AI that learns business logic" rarely admit that the "learning" is constrained to what a single recorded session actually shows.
Stage 4. Generation: a TypeScript file, not a black-box artifact
The final stage materializes the synthesized workflow as code. The file goes through the WORKFLOW_TS_TEMPLATE in recording_prompts.rs, and each step gets its own file produced from STEP_TS_TEMPLATE. A simplified version of the top-level file:
// workflows/claims-intake-from-pdf.ts (generated)
import { createWorkflow, z } from "@mediar-ai/workflow";
import { openClaimForm } from "./steps/open-claim-form";
import { fillClaimantBlock } from "./steps/fill-claimant-block";
import { attachPdfAndPost } from "./steps/attach-pdf-and-post";

const inputSchema = z.object({
  // Detected from recording (see step_analysis.user_intent for each field)
  claimant_first_name: z.string(),
  claimant_last_name: z.string(),
  policy_number: z.string(),
  date_of_loss: z.string(),
  pdf_path: z.string(),
});

export default createWorkflow({
  name: "claims-intake-from-pdf",
  description: "Open a new claim in the carrier UI, fill from PDF, post.",
  version: "1.0.0",
  input: inputSchema,
  steps: [
    openClaimForm,
    fillClaimantBlock,
    attachPdfAndPost,
  ],
  onError: async ({ error, logger }) => {
    logger.error(`Workflow failed: ${error.message}`);
  },
});
This is the artifact a team owns after a recording. It is a TypeScript program with a typed input schema, named imported steps, and an onError handler. Each step file under ./steps/ carries the substep names, inputs, outputs, and business logic from synthesis as a comment block at the top of the execute body, and a list of TODO calls into the Mediar desktop agent (the locator clicks, the type_into_element calls, the wait_for_property gates) generated from the recorded UIA references. The team edits this file the same way they edit any TypeScript code: in their existing editor, under their existing version control, reviewed by their existing pull request flow.
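A hedged sketch of one of those step files, to show the shape described above. The exact createStep options are an assumption for illustration; the authoritative layout is STEP_TS_TEMPLATE in recording_prompts.rs.
// steps/fill-claimant-block.ts (sketch, not the verbatim template output)
import { createStep } from "@mediar-ai/workflow";

export const fillClaimantBlock = createStep({
  // createStep options assumed for illustration
  name: "fill-claimant-block",
  execute: async ({ input }) => {
    // From synthesis (carried over as a comment block by the generator):
    //   substep: Fill claimant name and policy fields
    //   inputs:  claimant_first_name, claimant_last_name, policy_number
    //   outputs: (none)
    //   business_logic: policy number must match an active policy before posting
    //
    // TODO(generated): locator click on the Claimant section (matched by role and name)
    // TODO(generated): type_into_element for claimant_first_name, claimant_last_name, policy_number
    // TODO(generated): wait_for_property until the policy lookup resolves
  },
});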
That ownership is the point. Most platforms that market themselves as "intelligent process automation software" produce an artifact you can run inside their console and edit through their designer. Mediar produces a file. A file that lives in your repo, that diffs cleanly against the previous version, that a developer can read and an auditor can read.
What this changes for the buyer
A team that has lived through a UiPath or Automation Anywhere rollout knows the maintenance shape: every time the target app changes, a developer opens the studio, re-anchors selectors, re-tests, redeploys. The pipeline above shifts that maintenance from selector repair to intent reuse. The element is referenced by role and accessibility name, not by tree path. The label captures the user_intent so a human reviewer can see at a glance what the step is supposed to do. The synthesis layer carries the inputs and outputs in code, so when the new release of the target app reshuffles two fields, the workflow keeps running because the inputs and outputs are unchanged. The team only intervenes when the actual semantics of the workflow change, which is rare.
On the cost side, the math is straightforward. The F&B chain that migrated off UiPath reported a 70 percent total cost reduction at the board level. The mid-market insurance carrier we ran the claims pilot with measured per-claim time dropping from 30 minutes to 2 minutes; their accounting puts the annual savings on the AP team at $750K. A regional bank shortened account onboarding on Jack Henry from 8 weeks to 2 weeks, and a regional healthcare group cut $210K per year on patient intake. None of these are press numbers. They are the customer's own measurements, and they exist because the workflow ran in production, not because the platform demoed well.
Want to see the pipeline run against your workflow?
We can record one of your real workflows on a Monday and ship a generated TypeScript workflow file by Wednesday. Bring a representative run; we will produce the artifact and walk you through the four stages on a call.
Frequently asked questions
Is intelligent process automation software different from RPA?
Yes, but the difference is about how the workflow gets defined, not about whether it runs against the same applications. Classic RPA is scripted: a developer or a recorder produces a fixed sequence of selectors and actions, and that sequence breaks the moment a label changes or a control moves. Intelligent process automation adds an AI layer that watches a real user finish the task, extracts what they were trying to do at each step (not just where they clicked), and produces a structured workflow definition with named inputs, outputs, and business logic per substep. The runtime that executes that definition is still desktop automation, but the source of truth becomes the model's structured interpretation rather than a captured macro. Mediar's open-source agent at github.com/mediar-ai/terminator shows both layers separately: the recorder produces events, the prompts in apps/desktop/src-tauri/src/recording_prompts.rs are how those events become a workflow.
What does the 'intelligence' actually do that a screen recorder cannot?
Four things, in this order. First, per-step intent extraction: for every meaningful event (button_click, browser_click, text_input_completed, browser_tab_navigation, application_switch, file_opened) the analyzer returns step_title, step_summary, events_that_happened, how_content_changed, results_if_any, what_was_clicked, what_was_typed, and user_intent. That last field is what makes a captured click usable later by a different agent or by a human reviewer. Second, context-aware labeling: a step is re-evaluated using its 5 neighbors on each side, so a click on a button labeled 'Submit' becomes 'Submitted the New User Registration form' when the surrounding steps are filling that form. Third, hierarchical synthesis: a flat list of analyzed events becomes a tree of steps and substeps, each with declared inputs, outputs, and business_logic arrays. Fourth, code generation: the tree is materialized as a TypeScript file using @mediar-ai/workflow primitives. A screen recorder gives you a macro. This gives you a typed program with documented intent at every step.
Why a sliding window of 11 events for labeling? Why not just one event at a time?
Because single-event labels are useless. A click on a control named 'OK' tells you nothing on its own; the meaningful label depends on what the dialog actually was, which is in the surrounding context. The function check_labeling_gate in apps/desktop/src-tauri/src/recording_processor.rs encodes this directly: labeling step N requires analyses [N-5, N+5] to all be complete before it runs. Five events before gives the immediate cause (which form was being filled, which row in a grid). Five events after gives the immediate result (was the form accepted, did the next screen open). Eleven events is wide enough to recover intent from a click on a generic control, narrow enough to keep the prompt under a budget that Gemini will process cheaply at scale. Mediar settled on that window after testing 3, 5, 7, 11, and 21; 11 was the smallest window that produced labels a human reviewer accepted without edits more than 80 percent of the time.
What does the generated workflow file actually look like?
It is a TypeScript file at workflows/{workflow_id}.ts that imports createWorkflow and z (a Zod schema builder) from @mediar-ai/workflow and exports a single workflow object. The shape is: a name, a description, a version, an input schema with one field per input the synthesizer detected, a steps array (each step imported from its own file in steps/), and an onError handler. Each step file uses createStep and contains, at the top of the execute body, a comment block listing the substeps, the inputs, the outputs, and the business logic that the synthesis pass detected. The template lives at apps/desktop/src-tauri/src/recording_prompts.rs in the WORKFLOW_TS_TEMPLATE and STEP_TS_TEMPLATE constants. A team can edit a generated workflow the same way they edit any other TypeScript file, which is the whole point: the AI produces a starting point that is already structured and typed, not a brittle macro that has to be re-recorded when something changes.
How does intent extraction help when the UI changes a week later?
Because the workflow file does not reference x/y coordinates or visual snapshots. Each step references the element by its accessibility properties (role, name, control_type) and carries the recorded user_intent in a comment block. If the next release of the target app renames a button from 'Submit Quote' to 'Save Quotation', the runtime can match the new label against the recorded intent rather than fail outright. If the layout reflows so the field moves to a different row, the role and name still match, the click still lands, and the workflow keeps running. The dom_tree_diff module strips x, y, width, height, and the captured value attribute before comparing trees, so layout shuffles do not register as diffs at all. Compared to a classic RPA selector that hard-codes a path through the tree, this is the difference between a workflow that survives an upgrade and one that does not.
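A minimal sketch of that normalization step, written in TypeScript rather than the Rust dom_tree_diff module, assuming a simple recursive node shape:
// Sketch: drop layout and value attributes before diffing, so a reflow
// or a refreshed field value does not register as a structural change.
type UiNode = {
  role: string;
  name: string;
  attributes: Record<string, string>;
  children: UiNode[];
};

const VOLATILE = ["x", "y", "width", "height", "value"];

function stripVolatile(node: UiNode): UiNode {
  const attributes = Object.fromEntries(
    Object.entries(node.attributes).filter(([key]) => !VOLATILE.includes(key))
  );
  return { ...node, attributes, children: node.children.map(stripVolatile) };
}

// Diffing stripVolatile(before) against stripVolatile(after) only reports changes
// in role, name, and the remaining attributes, which are the things a selector relies on.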
Does intelligent process automation software replace UiPath, Power Automate, or Automation Anywhere?
For the legacy desktop layer, yes. For Microsoft 365 connector flows or pure browser workflows, no. The three enterprise RPA platforms still ship the broadest catalogs of pre-built integrations, and Power Automate in particular is hard to beat on Office and Dynamics. The places where they stall are exactly the places intelligent process automation software earns its name: SAP GUI, Oracle EBS, Jack Henry, Fiserv, FIS, Epic, Cerner, eClinicalWorks, and any other Windows desktop application that does not expose a clean API. The work in those tools today is largely re-running a workflow because a label moved, because a popup was a half-second slower than the recorded sleep, or because a new version added a confirmation dialog. Mediar replaces that layer at roughly 20 percent of the cost (the F&B chain that switched off UiPath reported it to the board as a 70 percent saving) and ships in days because the workflow definition is generated, not authored. Teams that have invested heavily in Power Automate connector flows often keep them; teams whose pain is the desktop layer move that layer to Mediar.
What does pricing look like for an intelligent process automation deployment?
Runtime is billed at $0.75 per minute of workflow execution, with no per-seat license. The turn-key program is a $10,000 fee that converts to credits with a bonus, so it is effectively prepaid usage for the first pilot. A claims intake workflow at the mid-market carrier we deployed against runs in about two minutes per claim end to end (down from a 30-minute manual flow), which works out to $1.50 per claim. For a team doing 500 claims a week, that is roughly $39,000 in annual runtime against a baseline AP team cost the carrier puts at $750K per year. The math is in the public deck under proof_points; the ratio is what makes the conversation with a CFO short.
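The arithmetic behind those figures, spelled out; the volumes are the carrier's own numbers quoted above.
// $0.75 per runtime minute, about 2 minutes per claim, 500 claims per week
const perClaim = 0.75 * 2;        // $1.50 per claim
const perWeek = perClaim * 500;   // $750 per week
const perYear = perWeek * 52;     // $39,000 per year in runtime
// versus the carrier's stated $750K per year baseline cost for the AP team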
Where does the AI fail, and what does the runtime do about it?
Three places, with three different responses. The first is per-event analysis: Gemini occasionally returns 'Not available in data' for fields it cannot infer (the prompt explicitly instructs it to do so rather than hallucinate). That is fine; the synthesizer treats missing fields as missing rather than wrong. The second is per-step labeling: if the model picks a label a human reviewer disagrees with, the reviewer can edit it directly in the generated TypeScript file. There is no re-record loop. The third is at execution time: when a step fails (the target element is gone, a dialog blocks input, a timeout fires), the executor classifies the error using a pattern list at crates/executor/src/config/retry.rs. Infrastructure failures (connection timeout, MCP unreachable) retry with exponential backoff. Workflow logic failures (validation error, missing field, permission denied) do not retry; they hand control to a fallback step if one was declared (the fallback_id field on the step), or fail the run. The split is deliberate: retrying a workflow-logic failure just burns minutes; retrying an infrastructure failure usually works.
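A sketch of that split in TypeScript terms. The real classifier is the Rust pattern list at crates/executor/src/config/retry.rs; the pattern strings below are illustrative.
// Sketch only: infrastructure failures retry with backoff,
// workflow-logic failures go to the declared fallback step or fail the run.
const INFRASTRUCTURE_PATTERNS = ["connection timeout", "mcp unreachable"];
const WORKFLOW_LOGIC_PATTERNS = ["validation error", "missing field", "permission denied"];

type ErrorClass = "infrastructure" | "workflow_logic" | "unknown";

function classify(message: string): ErrorClass {
  const m = message.toLowerCase();
  if (INFRASTRUCTURE_PATTERNS.some((p) => m.includes(p))) return "infrastructure";
  if (WORKFLOW_LOGIC_PATTERNS.some((p) => m.includes(p))) return "workflow_logic";
  return "unknown";
}

// Per the answer above:
//   infrastructure  -> retry with exponential backoff
//   workflow_logic  -> run the step's fallback_id if declared, otherwise fail the run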
Are the recorder and the pipeline open source?
The recorder, the desktop agent, the accessibility-tree capture, the dom diff, and the workflow primitives are open source under MIT at github.com/mediar-ai/terminator. The prompts in apps/desktop/src-tauri/src/recording_prompts.rs are visible there. The Gemini calls run through Vertex AI on a service account Mediar provisions for a customer (the keys are not in the repo). A partner or an internal team can read the recorder code, the schemas, and the template, build a workflow against an in-house app, and run it on their own Mediar tenant. The closed parts are the multi-tenant runtime, the queueing, the dashboard, the customer console, and the support workflow.
How long does a pilot take to ship a first production workflow?
Days, not months. The recorder is a Windows .exe a user can install themselves. Once they record a representative run of the workflow (a single end-to-end pass at human speed), the four-stage pipeline produces a draft TypeScript file in a few minutes. Mediar staff review the draft, add input validation, attach a fallback step where the recording showed a branching dialog, and schedule the workflow against the customer's tenant. A typical first workflow ships in 3 to 7 working days. The fastest live pilot to date was a regional bank that recorded a Jack Henry account onboarding on a Monday and was running it in production by Wednesday afternoon. The comparison point is months for a classic RPA implementation, which is the gap the buyers we talk to most often cite as the reason they are even looking.
More on the pipeline and the runtime
Related guides
Document workflow automation software
How documents move through the recorder, the labeling pass, and into the destination application.
Robotic process automation tool
What changes when the workflow is generated from observation rather than authored.
Enterprise process automation
The failure-handling primitives a serious enterprise platform has to ship, with the Rust code that does it.