Intelligent process automation

What makes an intelligent process automation tool actually intelligent

Every roundup defines the category the same way: robotic process automation plus AI. That is correct and it is also the least useful part. The thing that decides whether a tool survives contact with your actual systems is narrower and almost never discussed: how it reads the screen it is driving.

Matthew Diakonov

Direct answer

An intelligent process automation tool is software that combines robotic process automation with AI, so it can automate workflows that involve unstructured inputs and decisions, not only fixed, rule-based steps. The capability that actually determines whether one works on your stack is its perception layer: tools that read a screen through pixels or auto-generated selectors break on legacy desktop apps, while Mediar reads the operating system accessibility tree (the same structured element data screen readers use), so it can drive SAP GUI, Oracle EBS, mainframe terminals, and EHRs that have no API.

Category definition cross-checked against the AWS intelligent automation reference on June 16, 2026.

The definition everyone agrees on, and where it stops being helpful

Pull up the popular explainers and the listicles and they converge. Intelligent process automation is RPA with a layer of machine learning, natural language processing, or generative AI on top so it can deal with messy inputs. Then each article turns into a list of vendor names. None of them tells you the one thing that predicts whether a given tool will run on a Tuesday in your environment six months from now.

Here is the uncomfortable part. The word "intelligent" usually refers to the bolt-on AI: document classification, an LLM that drafts an email, a model that routes a case. But that intelligence sits downstream of a much older problem. Before any model can reason about a step, the tool has to perceive the screen and act on the right control. If perception is brittle, the smartest model in the world is reasoning over a guess.

Two ways to read a screen

There are essentially two ways for an automation tool to know what is on screen and where to click. The first is to work from the rendered surface: a pixel coordinate, a cropped screenshot to template-match, or an auto-generated selector. The second is to ask the operating system for a structured description of the interface. The difference is the whole ballgame on legacy desktop software.

Coordinates and crops, versus roles and names

# What a pixel/selector tool sees
click(x=842, y=311)        # a coordinate
template_match("btn.png")  # a screenshot crop
selector(".x-form-field-7") # an auto-generated id

# When the window moves, the theme changes,
# or SAP renumbers the field, all three break.
# The "AI" bolted on top never gets a clean
# description of the screen to reason about.

The snippet above is how most tools that call themselves intelligent still perceive a desktop app. It works in a demo and degrades the first time the window moves, the theme changes, or the vendor renumbers a field. Reading the accessibility tree is stable across those same changes because it identifies a control by what it is (a role and a name, such as [Edit] 'Customer Code'), not where it happened to be.
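A minimal sketch of that contrast, under stated assumptions: the Element type, the two lookup helpers, and the sample screens are all hypothetical and are not Mediar's or Terminator's API. It replays the same recorded click against a screen whose layout shifted, once by coordinate and once by role and name.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Element:
    """Hypothetical stand-in for one node of an accessibility tree."""
    role: str
    name: str
    x: int
    y: int
    width: int
    height: int

    def contains(self, px: int, py: int) -> bool:
        return (self.x <= px < self.x + self.width
                and self.y <= py < self.y + self.height)


def element_at(elements: List[Element], px: int, py: int) -> Optional[Element]:
    """Pixel-style lookup: return whatever currently sits under the point."""
    return next((el for el in elements if el.contains(px, py)), None)


def find_by_role_and_name(elements: List[Element], role: str, name: str) -> Optional[Element]:
    """Structural lookup: return the element that is this control."""
    return next((el for el in elements if el.role == role and el.name == name), None)


# Screen as it looked when the workflow was recorded.
before = [
    Element("Edit", "Customer Code", 800, 300, 120, 24),
    Element("Button", "Post", 800, 340, 60, 24),
]

# Same screen after an update: a Cancel button now occupies the old spot
# and the Customer Code field moved down by 40 pixels.
after = [
    Element("Button", "Cancel", 800, 300, 120, 24),
    Element("Edit", "Customer Code", 800, 340, 120, 24),
]

recorded_click = (842, 311)
pixel_hit = element_at(after, *recorded_click)          # lands on Cancel
structural_hit = find_by_role_and_name(after, "Edit", "Customer Code")
```

The coordinate lookup does not fail loudly after the update. It returns a different control, which is the silent-misclick case the article warns about.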

What the tree actually looks like inside Mediar

This is the part you cannot get from a category overview. When Mediar records a workflow, it does not save a video. It captures the accessibility tree before and after every action, then represents each element in a simplified format. The desktop recorder spells the format out in a single line of context it hands to the model (in recording_processor.rs):

UI_TREE_STRUCTURE: The UI tree is a simplified
representation of the accessibility tree. Each line
has the format:
'LineNumber. RomanNumeralIndentation. [Role] 'Name' {Attributes}'.
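To make that line format concrete, here is a minimal sketch of a serializer that emits lines in the same shape. It is not Mediar's code. Three things are assumptions made for illustration: that RomanNumeralIndentation means nesting depth written as a Roman numeral starting at I, that attributes render as key=value pairs, and that a node is a plain dictionary.

```python
from typing import Dict, List, Optional

ROMAN = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
         (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
         (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]


def to_roman(n: int) -> str:
    """Convert a positive integer to a Roman numeral."""
    out = []
    for value, symbol in ROMAN:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)


def format_tree(node: Dict, depth: int = 1,
                lines: Optional[List[str]] = None) -> List[str]:
    """Flatten a nested tree into one line per element, depth-first."""
    if lines is None:
        lines = []
    attrs = ", ".join(f"{k}={v}"
                      for k, v in sorted(node.get("attributes", {}).items()))
    # Assumed layout: LineNumber. Depth-as-Roman. [Role] 'Name' {Attributes}
    lines.append(f"{len(lines) + 1}. {to_roman(depth)}. "
                 f"[{node['role']}] '{node['name']}' {{{attrs}}}")
    for child in node.get("children", []):
        format_tree(child, depth + 1, lines)
    return lines


# Hypothetical screen used only to exercise the formatter.
tree = {
    "role": "Window",
    "name": "Sales Order",
    "children": [
        {"role": "Edit", "name": "Customer Code",
         "attributes": {"focused": "true"}},
        {"role": "Button", "name": "Post"},
    ],
}

tree_lines = format_tree(tree)
```

Under these assumptions the Customer Code field serializes as a line starting with its line number and depth, followed by [Edit] 'Customer Code' and its attributes, which is the kind of text a model can reason over directly.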

Two details matter. First, the capture is paired: a before tree and an after tree surround each action. The diff between them is how the tool learns what a step actually did, which means a step is recorded as intent ("the Customer Code field gained focus and then held a value") rather than as a raw click at a pixel. Second, the representation is a role-and-name hierarchy, not an image. A model reasoning over [Edit] 'Customer Code' has a real description of the screen to work with. That is what the word "intelligent" should point at: the agent reasons over structure, not over a screenshot.
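The before/after pairing can be sketched as a plain attribute diff. This is an illustration under assumptions, not the recorder's actual algorithm: each tree is flattened to a dictionary keyed by (role, name), the attribute names are hypothetical, and an element missing from one side is treated as having no attributes.

```python
from typing import Any, Dict, List, Tuple

ElementKey = Tuple[str, str]                 # (role, name)
Tree = Dict[ElementKey, Dict[str, Any]]      # flattened tree snapshot
Change = Tuple[ElementKey, str, Any, Any]    # (element, attribute, before, after)


def diff_trees(before: Tree, after: Tree) -> List[Change]:
    """List every attribute whose value differs between two snapshots."""
    changes: List[Change] = []
    for key in sorted(set(before) | set(after)):
        old = before.get(key, {})
        new = after.get(key, {})
        for attr in sorted(set(old) | set(new)):
            if old.get(attr) != new.get(attr):
                changes.append((key, attr, old.get(attr), new.get(attr)))
    return changes


# Hypothetical snapshots taken around a single "type into field" action.
tree_before: Tree = {
    ("Edit", "Customer Code"): {"focused": False, "value": ""},
    ("Button", "Post"): {"enabled": False},
}
tree_after: Tree = {
    ("Edit", "Customer Code"): {"focused": True, "value": "C-1042"},
    ("Button", "Post"): {"enabled": False},
}

step_changes = diff_trees(tree_before, tree_after)
```

The diff names the Customer Code field and says it gained focus and then held a value. Nothing in it mentions a coordinate, which is what recording intent rather than a raw click amounts to.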

The same recorder and the same accessibility primitives are open source under the Terminator SDK, so teams that want to verify or extend how the tree is read can do it directly rather than taking a marketing page on faith.

"Self-healing" is a fallback cascade, not a slogan

Every intelligent automation vendor claims its bots adapt when a UI changes. Most of the time that claim has no mechanism behind it. In Mediar it is a specific, ordered sequence of strategies for relocating an element at runtime. When the agent goes to act, it tries these in order and stops at the first one that succeeds:

Strategy 1: accessibility / automation ID

First it tries to relocate the element by its stable identifier. When an app exposes a real automation ID, this is the cleanest match and nothing downstream has to run.

Strategy 2: window context plus bounds

If the ID is missing or changed, it walks the window tree for the owning process and matches on position and shape inside that window, not on absolute screen coordinates.

Strategy 3: text content

If layout shifted too, it falls back to the element's visible text. A button still labeled 'Post' is still the button, even after a re-theme that moved and recolored it.

Strategy 4: window focus fallback

If the specific control genuinely cannot be found, it restores focus to the right window and logs the miss, instead of blindly clicking a coordinate that now belongs to something else.

The point is that no single brittle selector is load-bearing. A field that sits 200 pixels away on screen because its window moved resolves on strategy two, which matches relative to the window. A button that got moved and recolored in a theme update still matches its text on strategy three. Only when a control genuinely no longer exists does the run degrade, and even then it logs the miss against the right window instead of clicking whatever now sits at the old coordinate. That is the difference between graceful degradation and a 2 a.m. page.
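The ordered cascade can be sketched as a list of strategies tried in turn. This is a simplified illustration, not Mediar's implementation: the Element fields, the function names, the 8-pixel tolerance, and the choice to store bounds relative to the owning window are all assumptions.

```python
import logging
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

log = logging.getLogger("relocate")


@dataclass(frozen=True)
class Element:
    automation_id: Optional[str]
    window: str
    bounds: Tuple[int, int, int, int]  # x, y, width, height relative to the window
    text: str


def by_automation_id(target: Element, candidates: List[Element]) -> Optional[Element]:
    if not target.automation_id:
        return None
    return next((c for c in candidates
                 if c.automation_id == target.automation_id), None)


def by_window_and_bounds(target: Element, candidates: List[Element],
                         tolerance: int = 8) -> Optional[Element]:
    tx, ty, tw, th = target.bounds
    for c in candidates:
        cx, cy, cw, ch = c.bounds
        same_shape = (cw, ch) == (tw, th)
        close = abs(cx - tx) <= tolerance and abs(cy - ty) <= tolerance
        if c.window == target.window and same_shape and close:
            return c
    return None


def by_text(target: Element, candidates: List[Element]) -> Optional[Element]:
    if not target.text:
        return None
    return next((c for c in candidates
                 if c.window == target.window and c.text == target.text), None)


Strategy = Callable[[Element, List[Element]], Optional[Element]]
STRATEGIES: List[Tuple[str, Strategy]] = [
    ("automation_id", by_automation_id),
    ("window_bounds", by_window_and_bounds),
    ("text", by_text),
]


def relocate(target: Element, candidates: List[Element]) -> Tuple[str, Optional[Element]]:
    """Try each strategy in order and stop at the first one that matches."""
    for name, strategy in STRATEGIES:
        match = strategy(target, candidates)
        if match is not None:
            return name, match
    # Last resort: report the miss so the caller can restore window focus
    # instead of clicking a stale coordinate.
    log.warning("could not relocate %r in window %r", target.text, target.window)
    return "window_focus_fallback", None


recorded = Element("btnPost", "Sales Order", (400, 520, 80, 24), "Post")
```

Because the strategies live in an ordered list, adding or reordering one is a one-line change, and the returned strategy name records which fallback actually rescued a given step.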

Record once, then run

Put the pieces together and the lifecycle is short. You do not script the workflow; you perform it once and the perception layer captures the structure underneath.

From one recording to a self-healing run

1. Record once

You run the workflow yourself. The recorder snapshots the accessibility tree before and after each action.

2. Diff the trees

The before/after pair tells the model which element changed and why, so the step is captured as intent, not as a coordinate.

3. Emit a workflow file

Steps become an inspectable, version-controlled workflow rather than an opaque recording you cannot audit.

4. Run with self-healing

At execution time each element is relocated through the fallback cascade, so a UI change degrades gracefully instead of snapping the run.
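As a rough illustration of why an emitted workflow file is easier to audit than a recording, the sketch below serializes recorded steps to JSON. The schema is invented for this example and is not Mediar's workflow format. The field names, the workflow name, and the step contents are all hypothetical.

```python
import json
from typing import Any, Dict, List


def build_workflow(name: str, steps: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Wrap recorded steps in a small document and number them from 1."""
    return {
        "name": name,
        "version": 1,
        "steps": [{"index": i, **step} for i, step in enumerate(steps, start=1)],
    }


def dump_workflow(workflow: Dict[str, Any]) -> str:
    """Serialize with sorted keys and fixed indentation.

    Identical workflows always produce identical text, so a version-control
    diff shows only the steps that really changed.
    """
    return json.dumps(workflow, indent=2, sort_keys=True) + "\n"


# Steps expressed as intent against a role and name, not as coordinates.
recorded_steps = [
    {"action": "set_value",
     "target": {"role": "Edit", "name": "Customer Code"},
     "value": "C-1042"},
    {"action": "invoke",
     "target": {"role": "Button", "name": "Post"}},
]

workflow_text = dump_workflow(build_workflow("post_sales_order", recorded_steps))
```

A reviewer can read a file like this line by line and see which control each step targets, which is not possible with a screen recording.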

Where this actually pays off

The perception argument is not abstract. It maps directly to the systems that have defeated traditional RPA programs: SAP GUI and SAP Business One, Oracle EBS, mainframe terminals, banking core systems like Jack Henry, Fiserv, and FIS, and EHRs like Epic and Cerner. These share a trait. They expose no clean API and no stable web DOM, and their layouts and field identifiers drift between versions. A tool that reads structure rather than pixels has a fighting chance there.

An F&B chain moved from UiPath to Mediar and reported a 70 percent cost reduction to its board, with no six-figure platform license and billing at $0.75 per minute of runtime.

Mediar customer deployment, F&B chain migrating off UiPath

To be honest about the boundary: if your work lives entirely in modern web apps with clean APIs, a browser-based agent is often the simpler choice and you may not need any of this. The accessibility-tree approach earns its keep specifically on the no-API desktop layer, the place browser agents cannot reach. Picking an intelligent process automation tool is really a question of which of those two worlds your hardest workflows live in.

How to evaluate an intelligent process automation tool

When a vendor demos, the demo always works. The questions that separate tools are the ones that probe perception and failure:

  • How do you identify an element? If the answer is coordinates, image matching, or auto-generated selectors, expect maintenance every time a screen changes. Ask whether it reads the accessibility tree.
  • What happens when the UI updates? Push for the mechanism, not the adjective. A real answer describes an ordered fallback, like ID, then bounds, then text. "Our AI adapts" is not a mechanism.
  • Can it run on my legacy desktop apps at all? Name your worst system out loud, SAP GUI or a green-screen terminal, and ask for a recorded run against it, not a slide.
  • Can I audit what it captured? A workflow you can read and version-control is reviewable. An opaque recording is not. Mediar emits an inspectable workflow file rather than a black box.

You can record your own first workflow in the no-code web app at app.mediar.ai/web and watch the before/after tree capture happen on a screen you already know.

Bring your worst legacy screen to the call

Book 30 minutes with the founders, name the desktop app that has defeated your RPA program, and we will show the accessibility-tree approach run against it live, with the open-source code on a screen share.

Frequently asked questions

What is an intelligent process automation tool?

It is software that pairs robotic process automation (RPA) with AI so it can handle workflows that involve unstructured inputs and judgment, not just fixed rules. Plain RPA follows a recorded script step by step; an intelligent process automation tool adds machine learning, document understanding, or an agent layer on top so it can read a PDF, classify a case, or adapt a step. The practical question that decides whether one works on your systems is narrower than the category sounds: how does it perceive the screen it is driving?

How is intelligent process automation different from RPA?

RPA is the execution engine: it clicks, types, and reads fields by following a defined path. Intelligent process automation is RPA plus a reasoning layer that can deal with inputs a fixed script cannot, such as a scanned invoice in an unfamiliar layout or a claim that needs routing. The two are not rivals; almost every intelligent process automation tool has an RPA core inside it. What varies a lot between tools is how reliably that core reads the application, which is where most of the real-world failures come from.

Why do intelligent automation tools break on legacy desktop apps?

Most read the screen through pixels, screenshot template matching, or auto-generated selectors. SAP GUI, Oracle EBS, mainframe terminals, Jack Henry and Fiserv green screens, and EHRs like Epic and Cerner do not offer a clean web DOM or a stable API, and their layouts and field IDs shift between versions. A pixel or selector match that worked yesterday silently points at the wrong control after an update. Mediar reads the operating system accessibility tree instead, the same structured interface screen readers use, so a field is identified by its role and name rather than its coordinates.

What does 'self-healing' actually mean in practice?

In Mediar it is a concrete fallback cascade, not a slogan. When the agent goes to act on an element, it first tries to relocate it by its accessibility or automation ID, then by window context and bounds, then by visible text content, and finally falls back to restoring focus to the correct window and logging the miss. Because no single brittle selector is load-bearing, a re-theme or a moved field usually resolves through one of the later strategies instead of breaking the run.

Do I need to write code to use an intelligent process automation tool?

Not with Mediar for the common case. You record a workflow once in the no-code web app at app.mediar.ai/web, and the recorder turns what you did into a reusable workflow. Teams that want to extend or script behavior can use the open-source Terminator SDK at github.com/mediar-ai/terminator, but that is optional. The recording flow exists so an operations person, not only a developer, can build an automation.

How much does Mediar cost compared with traditional RPA?

Mediar bills at $0.75 per minute of runtime with no per-seat licensing, plus a $10,000 turn-key program fee that converts to usage credits. There is no six-figure platform license. One F&B chain that moved from UiPath to Mediar reported a 70 percent cost reduction to its board. The pricing model rewards workflows that run quickly and reliably rather than charging for every named user who might touch the console.
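A quick worked example of the per-minute arithmetic. Only the $0.75 rate and the $10,000 fee come from the figures above. The workload numbers are hypothetical, and the sketch assumes the fee converts into credits one-for-one at the same rate, which the pricing description does not spell out.

```python
from decimal import Decimal, ROUND_DOWN

RATE_PER_MINUTE = Decimal("0.75")   # from the pricing above
PROGRAM_FEE = Decimal("10000")      # from the pricing above


def runtime_cost(minutes: int) -> Decimal:
    """Cost in dollars for a given number of runtime minutes."""
    return (Decimal(minutes) * RATE_PER_MINUTE).quantize(Decimal("0.01"))


def minutes_covered(credits: Decimal) -> Decimal:
    """Runtime minutes a credit balance buys, rounded down to 0.01 minute."""
    return (credits / RATE_PER_MINUTE).quantize(Decimal("0.01"),
                                                rounding=ROUND_DOWN)


# Hypothetical workload: 20 runs a day, 6 minutes each, 22 working days.
monthly_minutes = 20 * 6 * 22
monthly_cost = runtime_cost(monthly_minutes)

# Assumes the program fee converts to credits one-for-one.
credit_minutes = minutes_covered(PROGRAM_FEE)
```

Under those assumptions the sample workload uses 2,640 minutes a month, which costs $1,980, and the converted fee covers roughly 13,333 minutes of runtime.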

When is a browser-based AI agent the better choice instead?

When the work lives entirely in modern web apps with clean APIs or a stable DOM. Browser agents are a good fit for new SaaS tools. They do not help when your data sits in SAP GUI, a Jack Henry green screen, or a Windows desktop application with no integration surface. That gap, the no-API legacy desktop layer, is exactly the case the accessibility-tree approach was built for, so the two approaches tend to win in different places rather than competing head to head.
