An argument about what RPA actually is
RPA is a bet made in two regexes. If you want to know whether a bot will hold up, read those regexes.
Every other definition of robotic process automation on the open web ends at the abstract category. Software bots, repetitive tasks, rule-based, mimics a human. None of those sentences tells you the one thing that decides whether a bot will still be running next month: what the bot considers the same screen across two captures. That decision is two regular expressions in 144 lines of Rust. They are public. They are short. They are the spine of the category, and reading them is the closest thing to reading the contract you sign when you buy RPA.
The thesis
A robotic process automation system has one mechanical job: take a recording made yesterday, walk the live screen today, and decide whether what is on the screen now is the same thing the recording was made against. If yes, run the recorded step. If not, pause. Everything else (the recorder, the executor, the queue, the retry policy) is downstream of that one decision.
The decision is not made on raw screen captures. The screens are captured as accessibility trees, the way a screen reader sees them, and the trees are compared. The interesting question is how. Two trees are never byte-identical between runs. Window positions shift. Internal ids get recomputed. The visible content of an input field changes the moment a user types into it. Without a rule for which differences matter, every comparison would scream change and every bot would pause every run.
So the rule is: throw away the bytes that are known to be volatile before comparing. What remains is the substrate of the bet. RPA is a bet that what remains is enough.
The bet, in two regexes
Mediar publishes its desktop runtime as the Terminator project at github.com/mediar-ai/terminator under MIT. The diff layer lives in crates/terminator/src/ui_tree_diff.rs. The compact YAML path is the most legible. Here is the function that runs before any two captures are compared.
ui_tree_diff.rs
// crates/terminator/src/ui_tree_diff.rs
pub fn remove_ids_and_bounds_from_compact_yaml(yaml_str: &str) -> String {
    // Remove #id patterns (e.g., #12345, #abc-def-123)
    let id_re = Regex::new(r" #[\w\-]+").unwrap();
    let result = id_re.replace_all(yaml_str, "");
    // Remove bounds patterns: "bounds: [x,y,w,h]"
    let bounds_re = Regex::new(r"bounds: \[[^\]]+\],?\s*").unwrap();
    bounds_re.replace_all(&result, "").to_string()
}

// Before: - [Button] Submit #id123 (bounds: [10,20,100,30], focusable)
// After:  - [Button] Submit (focusable)
The first regex finds a leading space, then a hash, then a run of word characters or hyphens. It deletes them. That is the rule for stripping the unique id token the platform attaches to every UI element. The second regex finds the four-coordinate bounds block in square brackets and deletes that too, along with any trailing comma and whitespace.
“A line that says '- [Button] Submit #id123 (bounds: [10,20,100,30], focusable)' becomes '- [Button] Submit (focusable)' before it is compared to anything else. That short line is the entire spec of what RPA considers the same screen.”
remove_ids_and_bounds_from_compact_yaml, ui_tree_diff.rs
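To see the transform in isolation, here is a dependency-free re-creation of the same strip. This is an illustrative sketch, not the Terminator code: the real function uses the two regexes from the `regex` crate shown above, while this version does the equivalent string surgery with the standard library only.

```rust
// Regex-free sketch of the strip performed by
// remove_ids_and_bounds_from_compact_yaml (illustrative only; the real
// code uses the `regex` crate).
fn strip_volatile(line: &str) -> String {
    let mut out = String::new();
    let mut rest = line;
    // Drop " #<run of word chars or hyphens>" id tokens.
    while let Some(pos) = rest.find(" #") {
        out.push_str(&rest[..pos]);
        let tail = &rest[pos + 2..];
        let end = tail
            .find(|c: char| !(c.is_alphanumeric() || c == '_' || c == '-'))
            .unwrap_or(tail.len());
        rest = &tail[end..];
    }
    out.push_str(rest);
    // Drop "bounds: [x,y,w,h]" blocks plus any trailing comma/whitespace.
    while let Some(pos) = out.find("bounds: [") {
        let tail = &out[pos..];
        let close = tail.find(']').map(|i| i + 1).unwrap_or(tail.len());
        let mut end = pos + close;
        let bytes = out.as_bytes();
        if end < out.len() && bytes[end] == b',' {
            end += 1;
        }
        while end < out.len() && bytes[end] == b' ' {
            end += 1;
        }
        out.replace_range(pos..end, "");
    }
    out
}

fn main() {
    let before = "- [Button] Submit #id123 (bounds: [10,20,100,30], focusable)";
    assert_eq!(strip_volatile(before), "- [Button] Submit (focusable)");
    println!("{}", strip_volatile(before));
}
```

Running it reproduces the before/after pair from the listing: id token and bounds block gone, role, name, and flag intact.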
The companion JSON path makes the same bet a different way. Instead of regex, it walks the parsed tree and drops two map keys.
ui_tree_diff.rs (JSON path)
// crates/terminator/src/ui_tree_diff.rs
pub fn remove_ids(value: &Value) -> Value {
    match value {
        Value::Array(arr) =>
            Value::Array(arr.iter().map(remove_ids).collect()),
        Value::Object(obj) => {
            let mut new_obj = serde_json::Map::new();
            for (key, val) in obj.iter() {
                if key != "id" && key != "element_id" {
                    new_obj.insert(key.clone(), remove_ids(val));
                }
            }
            Value::Object(new_obj)
        }
        _ => value.clone(),
    }
}
Two regexes for the line-oriented format, two key drops for the tree-oriented format. Both express the same bet. Read them once and you understand more about what an RPA system can and cannot do than you would learn from any vendor brochure.
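The key-drop walk can be exercised against a toy stand-in for the JSON value type. The enum below is illustrative only: the real remove_ids operates on serde_json::Value, but the recursion and the two dropped keys are the same.

```rust
// Toy mirror of serde_json::Value, just enough to show the key-drop
// walk (the real code operates on serde_json's type).
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Str(String),
    Array(Vec<Value>),
    Object(Vec<(String, Value)>),
}

fn remove_ids(value: &Value) -> Value {
    match value {
        Value::Array(arr) => Value::Array(arr.iter().map(remove_ids).collect()),
        Value::Object(obj) => Value::Object(
            obj.iter()
                // The entire bet, in one filter: drop the volatile keys.
                .filter(|(k, _)| k.as_str() != "id" && k.as_str() != "element_id")
                .map(|(k, v)| (k.clone(), remove_ids(v)))
                .collect(),
        ),
        other => other.clone(),
    }
}

fn main() {
    let node = Value::Object(vec![
        ("role".into(), Value::Str("Button".into())),
        ("name".into(), Value::Str("Submit".into())),
        ("id".into(), Value::Str("12345".into())),
        ("children".into(), Value::Array(vec![Value::Object(vec![
            ("element_id".into(), Value::Str("abc-def".into())),
            ("role".into(), Value::Str("Text".into())),
        ])])),
    ]);
    let cleaned = remove_ids(&node);
    // Both volatile keys are gone at every depth; role and name survive.
    println!("{:?}", cleaned);
}
```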
The diff layer in flight
The two regexes do not run on their own. They are step one of a short pipeline that runs every time the executor needs to decide whether a recorded step still matches the live screen. The diagram below is the call shape.
ui_tree_diff.rs at runtime
The Compare actor is the simple_ui_tree_diff function in the same file. It uses the similar crate to do a line-based diff over the cleaned trees and returns None when there are no differences and Some(diff) otherwise. Replay reads that result and either continues or pauses for human review. The whole loop is deterministic. There is no model in it.
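The shape of that loop fits in a few lines. Everything below is a simplified stand-in: the cleaning step only strips trailing id tokens, and the diff is a naive line-by-line comparison, where the real code applies the full strip and uses the similar crate.

```rust
// Minimal sketch of the compare shape: clean both trees, diff the
// lines, return None when nothing differs. Illustrative only.
fn clean(tree: &str) -> String {
    // Stand-in for remove_ids_and_bounds_from_compact_yaml; this
    // version only drops a trailing " #id" token from each line.
    tree.lines()
        .map(|l| l.split(" #").next().unwrap_or(l))
        .collect::<Vec<_>>()
        .join("\n")
}

fn simple_diff(recorded: &str, live: &str) -> Option<String> {
    let (a, b) = (clean(recorded), clean(live));
    if a == b {
        None // same screen: replay continues
    } else {
        // semantic change: carry the differing lines up for human review
        let diff: Vec<String> = a.lines().zip(b.lines())
            .filter(|(x, y)| x != y)
            .map(|(x, y)| format!("- {x}\n+ {y}"))
            .collect();
        Some(diff.join("\n"))
    }
}

fn main() {
    let rec = "- [Button] Submit #id123";
    let live = "- [Button] Submit #id999";
    assert!(simple_diff(rec, live).is_none()); // id churn is ignored
    let renamed = "- [Button] Send #id999";
    assert!(simple_diff(rec, renamed).is_some()); // renamed label pauses the run
}
```

The two assertions are the whole contract: id churn produces no diff, a renamed label does.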
Three consequences of that bet
The two-regex strip is one design choice, but the choice cascades. Here is what flows downstream of it.
1. The recorder is allowed to be sloppy about ids
If the diff layer is going to throw away id tokens anyway, the recorder does not have to fight the OS to capture a stable one. It can take the id that is exposed today and let the next capture take a different id without ever raising a diff. This is not laziness; it is the specification.
2. Layout shuffles do not produce a diff
The four-coordinate bounds block is the bytes that change every time a window resizes, every time a DPI setting changes, every time a vendor reflows a panel. Stripping that block before comparing trees is what lets a recorded run survive a Tuesday-morning UI release that nudges a field down a row.
3. The bot is forced to identify elements by role and name
What survives the strip is the role token in square brackets ([Button]), the visible label after it (Submit), and the focusable flag in parentheses. Those are exactly the attributes a screen reader reads aloud. The bot is constrained to find elements the same way an accessibility user does: by saying their name. That constraint is the entire reason the same recording works against SAP GUI, Jack Henry, Epic, and a Chrome tab.
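As a concrete illustration of what survives, a parser over a cleaned line pulls out exactly those three attributes and nothing else. The Element struct and field names below are hypothetical, not Terminator's types; the point is that role, name, and flags are all the information the line still carries.

```rust
// Hypothetical parser for a cleaned compact-YAML line. Everything the
// strip left behind fits in these three fields.
#[derive(Debug, PartialEq)]
struct Element {
    role: String,       // the token in square brackets, e.g. Button
    name: String,       // the visible label a screen reader would say
    flags: Vec<String>, // the parenthesized attributes, e.g. focusable
}

fn parse_line(line: &str) -> Option<Element> {
    let line = line.trim_start_matches("- ").trim();
    let role_end = line.find(']')?;
    let role = line.get(1..role_end)?.to_string();
    let rest = line.get(role_end + 1..)?.trim();
    let (name, flags) = match rest.find('(') {
        Some(p) => {
            let flags = rest[p + 1..]
                .trim_end_matches(')')
                .split(", ")
                .map(str::to_string)
                .collect();
            (rest[..p].trim().to_string(), flags)
        }
        None => (rest.to_string(), vec![]),
    };
    Some(Element { role, name, flags })
}

fn main() {
    let e = parse_line("- [Button] Submit (focusable)").unwrap();
    assert_eq!(e.role, "Button");
    assert_eq!(e.name, "Submit");
    assert_eq!(e.flags, vec!["focusable"]);
}
```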
The bet is not a hack. It is the operating spec of the category. Once you accept that the diff layer trusts role and name and distrusts ids and coordinates, the rest of the bot design follows. A recorder that stored XPath would be making a different bet. A recorder that stored bitmap hashes would be making a worse one.
The counterargument
The fair counterargument is that vision-based RPA, the variety that watches a screen as pixels and asks a model what is on it, does not need this bet at all. It can match a button by appearance and reach back into the screen at runtime regardless of whether the underlying tree exposes a useful name. That is true, and it is also the wrong way to think about it.
Vision RPA makes a different bet. It bets that a button looks roughly the same across releases, that font rendering has not changed, that anti-aliasing is stable, that an icon set has not been refreshed, that the viewport DPI matches. Those bets are also sometimes wrong. The maintenance shape is different (the failures cluster around visual refreshes rather than id renumberings) but the shape exists. The honest claim is not that one bet is free and the other is expensive. The honest claim is that the accessibility-tree bet is what survives the failure modes that actually occur in legacy desktop ERPs, banking cores, and EHRs, because those vendors are required by their customers and by regulation to maintain the accessibility surface across releases.
On a brand-new SaaS app with no accessibility commitments and a visual designer who reflows the canvas every quarter, neither bet is great. That is the right place for an LLM-driven browser agent, not for traditional RPA at all.
What this changes for the buyer
The reason to pay attention to those 144 lines is not academic. The dollars on the contract are downstream of them. A bot that survives a vendor release runs the workflow. A bot that does not survive pages a human and cuts a maintenance ticket. The difference between those two outcomes is whether the diff layer chose to throw away the bytes that the vendor was free to change.
Closed-source RPA hides this decision. UiPath, Automation Anywhere, and Blue Prism do not publish the equivalent of ui_tree_diff.rs. There is no way for a buyer to read which attributes the diff layer trusts and which it strips. The implementation is opaque, the bet is implicit, and the failure mode is whatever bug the vendor has not caught yet. The pitch is reliability and the proof is a slide deck.
Open source flips that. The buyer can read the file, run the tests, fork the repo, and build the desktop binary themselves. Mediar ships the runtime that wraps this file as a desktop app and charges $0.75 per minute of runtime, but the file itself is MIT licensed and small enough to be read in one sitting. The reason the company is comfortable shipping this in the open is exactly that the bet is correct: when you see what attributes the diff layer keeps and which it throws away, the design defends itself.
Bring one workflow you cannot keep alive and we will show you the diff layer in flight
On a 30-minute call we point a recording at your actual screen, run it twice across a layout shuffle you control, and show you exactly which bytes the diff layer chose to ignore. You leave with a TypeScript file you can run yourself.
Frequently asked questions
Does this mean Mediar's RPA cannot tell the difference between a real change and a layout shuffle?
It can. The point of stripping ids and bounds is that the diff layer ignores changes that are not semantic. A real change (a new field, a renamed label, a removed button) produces a diff because it survives the strip. A layout shuffle (the same Submit button moved 12 pixels left because a panel grew) does not produce a diff because everything that moved was inside the stripped bytes. The bot is not blind to change; it is deliberately blind to noise.
What about recordings that depend on the exact id, like a workflow that targets element id 'submit_btn_42' specifically?
A recording does not store the volatile id. The recorder captures role plus visible name plus the path through the accessibility tree, and the diff layer is given trees with ids already stripped before comparison. If an automation needs to disambiguate between two same-named buttons on one screen, the disambiguator is the parent path (the WorkCenter title, the panel header) plus tab order, not a numeric id that the host application is free to recompute on every render.
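The disambiguation described above can be sketched as a match over stable attributes. All type and field names here are hypothetical, not Terminator's API; the sketch only shows that parent path plus tab order is enough to separate two same-named buttons without ever touching a numeric id.

```rust
// Illustrative sketch of disambiguating two same-named buttons by
// parent path and tab order. Not Terminator's actual types.
#[derive(Debug)]
struct Candidate {
    role: &'static str,
    name: &'static str,
    parent_path: Vec<&'static str>, // e.g. WorkCenter title, panel header
    tab_order: u32,
}

fn pick<'a>(
    cands: &'a [Candidate],
    role: &str,
    name: &str,
    parent_path: &[&str],
    tab_order: u32,
) -> Option<&'a Candidate> {
    cands.iter().find(|c| {
        c.role == role
            && c.name == name
            && c.parent_path == parent_path
            && c.tab_order == tab_order
    })
}

fn main() {
    let cands = [
        Candidate { role: "Button", name: "Submit",
                    parent_path: vec!["Sales Orders", "Header"], tab_order: 3 },
        Candidate { role: "Button", name: "Submit",
                    parent_path: vec!["Sales Orders", "Items"], tab_order: 9 },
    ];
    // Two Submit buttons on one screen; parent path plus tab order picks one.
    let hit = pick(&cands, "Button", "Submit", &["Sales Orders", "Items"], 9);
    assert_eq!(hit.unwrap().tab_order, 9);
}
```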
Why two regexes and not a more sophisticated parser?
Because the compact YAML format the recorder ships is line-oriented and the only volatile tokens in that line are the id (a hash followed by word characters) and the bounds (the four-coordinate block in square brackets). Two regular expressions are a complete spec for what a line of that format can look like. A heavier parser would not strip more; it would only spend more cycles. The companion JSON path uses an explicit tree walk that drops the id and element_id keys, which is the JSON-equivalent of the same two-regex transform.
What if the application I am automating exposes nothing useful to the accessibility tree?
Then the bet does not pay off and the right answer is not a Mediar recording. Some Java Swing apps in their default theme, some custom OpenGL surfaces, and a small handful of legacy terminal emulators expose almost nothing through Windows UI Automation. Computer-vision RPA can sometimes work on those, at the cost of a different and worse bet (that pixel patterns survive). The honest answer is to admit the failure mode and pick the right tool. Most legacy desktop systems Mediar's customers care about (SAP GUI, Oracle EBS, Jack Henry, Fiserv, FIS, Epic, Cerner) expose dense and stable accessibility trees because they were built to be screen-readable.
Where does this leave selector-based RPA in the UiPath, Blue Prism, and Automation Anywhere style?
Selector-based RPA stores a single id-or-XPath per step and asks the live UI to find that exact selector at replay time. It is making a different bet: that the chosen selector survives. That bet pays off when the application owners are disciplined about ids and breaks when they ship a refactor that renumbers everything. The maintenance tax on selector RPA is the cost of that bet failing on a schedule. The accessibility-tree approach trades one volatile selector for a captured tree at recording time and a regex strip at compare time, and pushes the bet onto attributes that vendors maintain across releases because they have to (screen readers and assistive technology certifications depend on them).
How is this different from a desktop macro recorder?
A macro recorder stores keystrokes and mouse coordinates and replays them at the same coordinates. The bet a macro recorder makes is that screen geometry is stable between record and replay, which is wrong even on the same monitor over the same week. RPA in the modern sense stores a description of what was clicked (role and name plus a tree path) and finds a matching element at replay time. The two-regex strip in ui_tree_diff.rs is the operationalization of that distinction at the diff layer.
Can I read this code myself?
Yes. The file is at crates/terminator/src/ui_tree_diff.rs in the open-source Terminator repository at github.com/mediar-ai/terminator. It is 144 lines including tests. The two regexes are in remove_ids_and_bounds_from_compact_yaml; the JSON walk that drops id and element_id keys is in remove_ids; the actual line-based comparison is in simple_ui_tree_diff. The license is MIT.
What is the practical consequence of this design for someone running 100 recordings a week?
Two consequences. First, a routine release of the host application that nudges the layout produces zero false positives in the diff. The runs continue. Second, when a release does change something semantic (a renamed field, a removed button), exactly that change shows up in the diff, the run pauses, and the human is asked to update one recording rather than triage a hundred. The maintenance shape moves from constant low-grade attention to occasional batched updates.
Pricing for runs that depend on this design?
Runtime is billed at $0.75 per minute regardless of outcome. A typical recorded step that opens a screen, types a few fields, and confirms takes 25 to 60 seconds against a desktop ERP. The $10K turn-key program fee converts to credits with a bonus, so it is effectively prepaid usage that covers the first pilot. Per-seat licensing is not part of the model.
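The per-step arithmetic follows directly from those numbers. The sketch below only restates the pricing quoted in this answer; it introduces no figures beyond the $0.75-per-minute rate and the 25-to-60-second step.

```rust
// Back-of-envelope cost check using the numbers above:
// $0.75 per minute of runtime, a recorded step taking 25 to 60 seconds.
fn step_cost_usd(seconds: f64) -> f64 {
    0.75 * seconds / 60.0
}

fn main() {
    // A 25-second step costs about $0.31; a 60-second step exactly $0.75.
    assert!((step_cost_usd(25.0) - 0.3125).abs() < 1e-9);
    assert!((step_cost_usd(60.0) - 0.75).abs() < 1e-9);
    // 100 recordings a week, one 60-second step each: a $75/week ceiling.
    assert!((100.0 * step_cost_usd(60.0) - 75.0).abs() < 1e-9);
}
```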
Adjacent reading
What robotic process automation is, in three numbers: six event types, four stages, four match strategies
The companion piece to this one. Where this page is about the diff layer's bet, the sibling is about the four-strategy match cascade in focus_state.rs that the bet enables at runtime.
The meaning of robotic process automation: a side-by-side decomposition of the modern bot vs the 2003 selector recorder
Two definitions of RPA, scored against each other on what the robot is, what a process is, and what the automation actually guarantees.
SAP Business ByDesign order-to-cash automation: where the OData line ends and the WorkCenter line begins
The bet from this page applied to a specific tenant. Seven O2C touchpoints scored API-only, UI-only, or mixed, and the four screens where the recorded line earns its keep.