A field guide, not a pitch

Where RPA stalls on legacy apps is a short, nameable list

If you have run UiPath, Power Automate, Automation Anywhere, or Blue Prism against a desktop estate, you already know the feeling: a queue that was moving at 2am is frozen by 6am, and nobody touched the workflow. RPA does not fail at random on legacy apps. It stalls at the same handful of spots, over and over. This is the catalog.

Matthew Diakonov

The short answer

RPA stalls on legacy desktop apps at seven recurring points: dynamic element ids, coordinate drift after a UI change, unexpected modal dialogs, Citrix and virtualized sessions, slow screens and race conditions, session timeouts, and exception cases the recording never saw. Every one traces to a single root cause: a selector-based or pixel-based bot has no durable handle on the screen, so the moment the UI shifts shape or timing, it has nothing to grab. The fix is not better selectors. It is reading the same accessibility tree a screen reader reads, then classifying each stall so the runtime knows whether to retry it, re-resolve it, or stop and get a human.

The trap in most RPA postmortems is treating every stall as one problem to harden against. It is not. The seven failure points below fall into different families, and they need different handling. A bot that cannot tell them apart either retries forever or dies on a transient.

Why legacy apps are where RPA goes to stall

A web app hands a bot a stable DOM: ids, classes, a real document model the bot can query. A legacy Windows desktop app hands it none of that. No API. Often no stable control ids. Frequently a window that, as far as the bot can tell, is just pixels. SAP GUI, Oracle EBS, mainframe terminal emulators, Jack Henry teller screens, Epic charting windows: every one of them predates the idea that a program other than a human would need to read it.

So a selector-based RPA platform invents a handle. It records an XPath, an automation id, a window title, an image template, a screen coordinate. Each of those is a guess about the screen's current shape, and each one is fragile in a different way. The bot is not bound to your workflow. It is bound to a snapshot of the UI taken on the day someone recorded it.

That is the whole story behind "where RPA stalls." The stall points are not exotic. They are the predictable ways a snapshot stops matching reality. Here is each one, why a selector or pixel bot breaks there, and what an agent that reads the operating system's accessibility tree does instead.

The seven places legacy RPA stalls

Sorted roughly by how often each one shows up in a stalled overnight queue.

01

Dynamic element IDs

selector decay

Why the bot stalls

The app regenerates a control's automation id or XPath on every build or session. The bot recorded usr/txtFIELD-3819; next run that node is txtFIELD-4102 and the selector resolves to nothing. SAP support packs, VB6 recompiles, and .NET WinForms rebuilds all churn ids this way.

What an accessibility-tree agent does

Match on what the operating system uses to name a control for a screen reader: its accessibility name, its role, and its parent window. Those survive an id churn because they describe what a human sees, not a build artifact.
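A minimal sketch of that matching rule, assuming a simplified `Node` type. The struct, the `Locator` type, and `resolve` are hypothetical names for illustration, not Mediar's API; a real tree such as Windows UI Automation exposes far more properties.

```rust
// Hypothetical, simplified accessibility node. Illustration only.
#[derive(Debug, Clone, PartialEq)]
pub struct Node {
    pub automation_id: String, // build artifact, may change on every release
    pub name: String,          // what a screen reader announces
    pub role: String,          // e.g. "edit", "button"
    pub window: String,        // title of the parent window
}

/// A durable locator: it deliberately carries no automation id.
pub struct Locator<'a> {
    pub name: &'a str,
    pub role: &'a str,
    pub window: &'a str,
}

/// Return the first node whose name, role, and parent window all match.
pub fn resolve<'n>(tree: &'n [Node], loc: &Locator<'_>) -> Option<&'n Node> {
    tree.iter()
        .find(|n| n.name == loc.name && n.role == loc.role && n.window == loc.window)
}
```

Because the locator never mentions the id, a node that was txtFIELD-3819 yesterday and is txtFIELD-4102 today resolves the same way both times.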

02

Coordinate drift after a UI change

pixel decay

Why the bot stalls

A patch moves a panel, a new monitor changes the resolution, Windows DPI scaling shifts to 125 percent, or a theme update repaints the screen. Image-template and click-at-XY bots now land on empty space or on the wrong field.

What an accessibility-tree agent does

Never store a coordinate. Resolve the element fresh from the accessibility tree on every run, then click its reported center. A moved panel is still the same node in the tree.
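As a rough sketch, assuming each resolved element reports a bounding rectangle: the click point is computed from the bounds found on this run and is never persisted. `Rect`, `Element`, and `click_point` are hypothetical names for illustration.

```rust
// Hypothetical bounding rectangle as reported by the accessibility tree.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Rect {
    pub left: i32,
    pub top: i32,
    pub width: i32,
    pub height: i32,
}

impl Rect {
    /// Center of the rectangle in screen coordinates.
    pub fn center(&self) -> (i32, i32) {
        (self.left + self.width / 2, self.top + self.height / 2)
    }
}

// Hypothetical element: a name plus the bounds it has right now.
pub struct Element {
    pub name: String,
    pub bounds: Rect,
}

/// Look the element up by name in the current tree and derive the click
/// point from its current bounds. Nothing here is stored between runs.
pub fn click_point(tree: &[Element], name: &str) -> Option<(i32, i32)> {
    tree.iter().find(|e| e.name == name).map(|e| e.bounds.center())
}
```

If a patch moves the panel, the same lookup simply returns a different point on the next run.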

03

Unexpected modal dialogs

wrong surface

Why the bot stalls

A save-as-draft confirmation, a license-renewal nag, or a Windows security prompt opens on top of the screen the bot expected. The bot keeps clicking the spot where the field used to be, or types a customer name into the dialog.

What an accessibility-tree agent does

In the accessibility tree a dialog is a new top-level window. An agent that walks the tree sees a window it did not expect, and routes to a fallback branch instead of driving the wrong surface.
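One way to picture the check, assuming the agent can read the title of the foreground top-level window before each step. `Surface`, `check_surface`, and `route` are hypothetical names, and matching on the window title alone is a simplification.

```rust
/// Result of comparing the window in front with the one the step expects.
#[derive(Debug, PartialEq, Eq)]
pub enum Surface<'a> {
    Expected,
    Unexpected(&'a str), // title of the window that is in the way
}

/// Compare the foreground window's title with the title the step expects.
pub fn check_surface<'a>(foreground_title: &'a str, expected_title: &str) -> Surface<'a> {
    if foreground_title == expected_title {
        Surface::Expected
    } else {
        Surface::Unexpected(foreground_title)
    }
}

/// Pick the step to run next: the planned step on the expected surface,
/// the fallback branch on anything else.
pub fn route<'s>(surface: &Surface<'_>, step_id: &'s str, fallback_id: &'s str) -> &'s str {
    match surface {
        Surface::Expected => step_id,
        Surface::Unexpected(_) => fallback_id,
    }
}
```

The point of the sketch is the ordering: the surface is checked before any click or keystroke, so input never lands in a dialog the step did not plan for.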

04

Citrix and virtualized sessions

no tree to read

Why the bot stalls

The legacy app runs inside a Citrix or RDP window. Run locally it would publish accessibility nodes; through the remoting channel the host machine sees one flat bitmap. Selector bots fall back to OCR, which needs constant recalibration.

What an accessibility-tree agent does

Read the accessibility tree inside the session host, where the app actually lives, instead of scraping the bitmap the remote viewer paints. The tree is intact on the machine running the app.

05

Slow screens and race conditions

timing

Why the bot stalls

A legacy core system loads at inconsistent speed. A fixed Sleep(3000) is wasteful on a good day and too short on a bad one, when the bot types into a field that has not rendered yet. Skipped steps and premature entries follow.

What an accessibility-tree agent does

Wait for the element to actually appear in the tree, not for a clock. The agent polls for the node and proceeds the moment it is present, with a real timeout as the ceiling.
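A minimal sketch of wait-for-element using only the Rust standard library. The `probe` closure stands in for whatever lookup the agent performs against the tree; `wait_for` and its parameters are hypothetical names for illustration.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Poll `probe` until it returns `Some`, or give up once `timeout` elapses.
/// `probe` is a stand-in for a lookup against the accessibility tree.
pub fn wait_for<T>(
    mut probe: impl FnMut() -> Option<T>,
    poll_every: Duration,
    timeout: Duration,
) -> Option<T> {
    let deadline = Instant::now() + timeout;
    loop {
        // Proceed the moment the element is present.
        if let Some(found) = probe() {
            return Some(found);
        }
        // The timeout is a ceiling, not a fixed wait.
        if Instant::now() >= deadline {
            return None;
        }
        sleep(poll_every);
    }
}
```

On a fast day the call returns on the first probe; on a slow day it waits only as long as the screen actually needs, up to the timeout.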

06

Session timeouts and re-auth

wrong surface

Why the bot stalls

Mid-run, an ERP or banking-core session expires and a login screen takes its place. The bot keeps driving a dead session, posting nothing, until a human notices the queue stopped moving hours later.

What an accessibility-tree agent does

The login window is a surface the agent did not expect. It detects the mismatch, re-authenticates through a fallback step, and resumes the run from the step that stalled.

07

Exceptions the recording never saw

real exception

Why the bot stalls

A record that does not exist, a field that rejects the input, a validation rule the analyst never hit while recording. This one is not a brittleness bug. The workflow genuinely cannot proceed.

What an accessibility-tree agent does

This is the one stall a runtime should not paper over. It gets classified as a workflow-logic failure, the run stops cleanly with the step and reason logged, and a human picks it up.

70%

We moved an LG-customer F&B chain off UiPath onto Mediar. Their CFO told the board they are now saving 70 percent on costs, because the automation self-heals when SAP B1 screens change instead of needing a developer every patch cycle.

Mediar deployment, F&B chain on SAP Business One

A stall is a classification problem

Look at the seven failure points again. Most of them are recoverable if the runtime knows what kind of stall it hit. A coordinate drift needs a re-resolve. A slow screen or a connection blip needs a wait or a retry. A modal dialog needs a branch. Only the last one, a genuine workflow exception, needs a human. Traditional RPA collapses all seven into one outcome, "the bot stalled," and pages an engineer for every one of them.

Mediar's workflow executor does not. Before it decides what to do with a failure, it classifies it. The function that does this lives in crates/executor/src/config/retry.rs and it is called classify_error(). It buckets every failure into one of three categories by matching the error text against two pattern lists.

Infrastructure

Retried automatically. The workflow is fine, the plumbing hiccuped.

  • connection timeout
  • connection refused
  • 503 service
  • 504 gateway timeout
  • mcp not responding
  • vm is down
  • health check failed
  • deadline exceeded

Workflow logic

Never retried automatically. Retrying cannot fix a record that does not exist.

  • validation failed
  • invalid input
  • record not found
  • permission denied
  • step failed
  • assertion failed
  • file not found
  • parse error

Anything that matches neither list is Unknown, and the executor is deliberately conservative: unknown failures are not retried. Infrastructure failures retry with exponential backoff: three attempts, a 30-second initial delay, a 2x multiplier, and a 600-second cap, so the delays run 30s, 60s, 120s. That backoff math is the default RetryConfig in the same file.
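To make the classification and the backoff arithmetic concrete, here is a sketch built only from the pattern lists and numbers above. It is not the code in retry.rs. The `ErrorCategory` enum, the case-insensitive substring matching, checking the infrastructure list first, and the `should_retry` and `backoff_delays` helpers are all assumptions made for illustration.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ErrorCategory {
    Infrastructure,
    WorkflowLogic,
    Unknown,
}

// The two pattern lists as given in this article.
const INFRASTRUCTURE_PATTERNS: &[&str] = &[
    "connection timeout", "connection refused", "503 service", "504 gateway timeout",
    "mcp not responding", "vm is down", "health check failed", "deadline exceeded",
];
const WORKFLOW_PATTERNS: &[&str] = &[
    "validation failed", "invalid input", "record not found", "permission denied",
    "step failed", "assertion failed", "file not found", "parse error",
];

/// Bucket an error message by substring match.
/// Assumption: matching is case-insensitive and infrastructure is checked first.
pub fn classify_error(message: &str) -> ErrorCategory {
    let text = message.to_lowercase();
    if INFRASTRUCTURE_PATTERNS.iter().any(|p| text.contains(*p)) {
        ErrorCategory::Infrastructure
    } else if WORKFLOW_PATTERNS.iter().any(|p| text.contains(*p)) {
        ErrorCategory::WorkflowLogic
    } else {
        ErrorCategory::Unknown
    }
}

/// Only infrastructure failures are retried; logic and unknown are not.
pub fn should_retry(category: ErrorCategory) -> bool {
    category == ErrorCategory::Infrastructure
}

/// Delays in seconds: start at `initial_secs`, multiply each time, clamp at `cap_secs`.
pub fn backoff_delays(initial_secs: u64, multiplier: u64, cap_secs: u64, attempts: u32) -> Vec<u64> {
    let mut delay = initial_secs;
    let mut delays = Vec::new();
    for _ in 0..attempts {
        delays.push(delay.min(cap_secs));
        delay = delay.saturating_mul(multiplier);
    }
    delays
}
```

With the defaults quoted above, `backoff_delays(30, 2, 600, 3)` yields 30, 60, and 120 seconds. The 600-second cap only starts to bite if the attempt count is raised to six or more.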

This is the part no general guide on RPA and legacy systems mentions, and it is the part that matters. "The bot stalled" is not an outcome. It is an unclassified event. A runtime that can name the category of a stall can recover from six of the seven failure points above without a human ever seeing them.

Every step declares what to do when it stalls

Classification answers "is this worth retrying." The second half is "what should this specific step do." In a Mediar workflow, each step carries its own error strategy. The enum is small and it is explicit.

// crates/executor/src/models/workflow.rs
pub enum ErrorStrategy {
    Stop,      // halt the run, log the step
    Continue,  // record it, move to next step
    Retry,     // re-run this step (retry_count)
    Fallback,  // jump to fallback_id branch
}

A read step from a slow SAP screen might be Retry with a retry count. A step that hits an optional confirmation dialog might be Fallback, jumping to a branch that dismisses the dialog and rejoins. A step that posts a financial transaction is Stop: if it fails, the run halts cleanly rather than guessing. And because the executor supports partial re-execution, a fixed run resumes from the step that stalled instead of replaying the whole workflow from the top.
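A sketch of how a per-step strategy might be applied. The four variants come from the enum above. The `Step` struct, the `NextAction` type, and the rule that an exhausted retry or a missing fallback halts the run are assumptions for illustration, not Mediar's implementation.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ErrorStrategy {
    Stop,
    Continue,
    Retry,
    Fallback,
}

// Hypothetical step shape; the field names echo the comments in the enum above.
pub struct Step {
    pub id: &'static str,
    pub on_error: ErrorStrategy,
    pub retry_count: u32,
    pub fallback_id: Option<&'static str>,
}

/// What the executor does next after a step fails (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum NextAction {
    Halt { resume_from: &'static str }, // stop cleanly, remember where to resume
    Advance,                            // record the failure, move to the next step
    RunAgain,                           // re-run the same step
    JumpTo(&'static str),               // switch to the fallback branch
}

pub fn on_failure(step: &Step, attempts_so_far: u32) -> NextAction {
    match step.on_error {
        ErrorStrategy::Stop => NextAction::Halt { resume_from: step.id },
        ErrorStrategy::Continue => NextAction::Advance,
        ErrorStrategy::Retry if attempts_so_far < step.retry_count => NextAction::RunAgain,
        // Assumption: once retries are exhausted, the run halts at this step.
        ErrorStrategy::Retry => NextAction::Halt { resume_from: step.id },
        ErrorStrategy::Fallback => match step.fallback_id {
            Some(id) => NextAction::JumpTo(id),
            // Assumption: a fallback strategy with no branch behaves like Stop.
            None => NextAction::Halt { resume_from: step.id },
        },
    }
}
```

Carrying the step id in `Halt` is what makes partial re-execution possible in this sketch: a fixed run can pick up at `resume_from` instead of starting over.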

What happens when a step stalls

1. A step stalls: element missing, dialog, timeout
2. classify_error(): infrastructure, logic, or unknown
3. Apply on_error: stop, continue, retry, or fallback
4. Resume from step: no full re-run of the workflow

The difference from a selector-based bot is that none of this is glue code an RPA developer hand-wrote per workflow. It is how the runtime behaves by default, the same way for SAP GUI, a Jack Henry teller window, an Epic chart, or a mainframe terminal.

The one stall you should not automate away

Failure point seven, a genuine workflow exception, is the honest limit. If a claim references a policy number that does not exist, no amount of retrying, re-resolving, or branching fixes it. The right behavior is to stop, log the step and the reason, and surface it to a person. A runtime that quietly retries a real exception is worse than one that stalls visibly, because it hides the problem.

This is also the honest limit of browser-based AI agents on legacy work. They are good on modern SaaS. If your data lives in SAP GUI or a green-screen teller app, a browser agent cannot reach it at all. The accessibility-tree approach exists precisely because that is where the hard, unglamorous, high-volume work still sits.

Bring the workflow that stalls on you

A 30 minute call. Pick the legacy workflow you have lost the most overnight runs to, and we will walk through which of the seven stall points is hitting it and how the classifier would route each one.

Frequently asked questions

Why does RPA break on legacy desktop apps more than on web apps?

A web page hands a bot a stable DOM with ids, classes, and a real document model. A legacy Windows desktop app hands it nothing comparable: no API, often no stable control ids, and frequently a window that only renders as pixels. A selector-based or pixel-based bot has to invent a handle on the screen, and any shift in shape or timing breaks that handle. The accessibility tree is the one stable, structured surface a desktop app does publish, because screen readers depend on it.

Can RPA work through Citrix or RDP?

Partly, and that is the trap. Inside a Citrix or RDP session the legacy app still publishes accessibility nodes on the machine that runs it, but the remote viewer on the bot's side usually receives one flat bitmap. Selector matching fails, so most RPA platforms fall back to image recognition and OCR, which need constant recalibration when resolution, DPI, or theme changes. The reliable approach is to read the accessibility tree inside the session host where the app actually lives, not to scrape the painted bitmap.

What is the most common reason an RPA bot stalls overnight?

An unexpected window. A session timeout drops a login screen in front of the workflow, a vendor patch shows a one-time notice, or a confirmation dialog the recording never captured opens on top of the screen. A coordinate or selector bot keeps driving the surface underneath, posts nothing, and the queue sits frozen until a human checks it in the morning. An agent that walks the accessibility tree sees a top-level window it did not expect and can branch instead of pushing into a dead surface.

Does an accessibility-tree agent stall at the same points?

It removes most of them and changes the rest. Dynamic ids and coordinate drift stop mattering because the element is resolved fresh from the tree on every run rather than from a stored id or pixel. Slow screens become a wait-for-element check instead of a fixed sleep. Unexpected dialogs and session timeouts become a detected window mismatch that routes to a fallback step. The one stall that remains real is a genuine workflow exception, like a record that does not exist, and that one should stop and reach a human rather than be retried.

Why do RPA bots stall after a vendor patch even when the workflow did not change?

Because the bot was never bound to the workflow; it was bound to the screen's current shape. An SAP support pack can rename a field label or churn control ids. A WinForms rebuild reassigns automation ids. A UI refresh moves a panel. The business process is identical, but every selector and every recorded coordinate now points at the old shape. Binding instead to the accessibility name, role, and parent window keeps the automation pointed at the same control through a patch.

If a step genuinely fails, does Mediar just retry forever?

No. The executor classifies every failure before deciding. Its classify_error function buckets a failure as infrastructure (a connection timeout, a 503, a VM that is down), workflow logic (validation failed, record not found, step failed), or unknown. Infrastructure failures retry with exponential backoff: three attempts, starting at 30 seconds and capped at 600 seconds. Workflow-logic and unknown failures are not retried automatically. On top of that, every step carries its own on_error strategy: stop, continue, retry, or fallback. Retrying forever is exactly the behavior the classifier exists to prevent.
