A 2026 reliability read

Enterprise process automation in 2026: the fail-safe layer no vendor talks about.

Enterprise process automation (EPA) is the coordinated automation of repetitive workflows across multiple systems and departments at organization scale. The 2026 shift, summarized at Appian World on 30 April 2026 and reported by SiliconANGLE, is that EPA buyers now grade platforms on reliability and auditability, not on what their agents can technically do. Building agents is the easy part. Keeping them accurate, auditable, and useful on live business workflows is the part that decides whether the program lasts a year. This piece walks through the reliability primitive that decides it (a code-defined boundary between transient infrastructure failures and permanent workflow logic failures) and shows the actual Rust the Mediar executor uses, line by line.

Matthew Diakonov · 11 min

Direct answer (verified 2026-05-01)

Enterprise process automation in 2026 is the orchestration of repetitive workflows across multiple systems and departments at organization scale, judged less on agent capability and more on the platform's fail-safe layer. The platforms that survive production share one architectural choice: a code-defined classifier separates transient infrastructure failures (retry with exponential backoff) from workflow logic failures (stop, alert, never retry), with a hard ceiling on retries and an auto-cancel rule for stuck schedules.

Source: SiliconANGLE coverage of Appian World 2026 plus the published source of Mediar's executor at github.com/mediar-ai/terminator.

1. The 2026 shift: reliability over capability

For five years the EPA story was about capability. Could the agent read a PDF, drive an SAP GUI, fill an Oracle Forms screen, hand a row to Excel, escalate to a human, sign a JWT? By 2026 the answer to all of those is yes, from at least four serious vendors. The buying conversation has moved.

Adam Glaser, Appian's VP of product, summarized the shift this way at Appian World 2026: building agents is no longer the hard part; keeping them accurate, auditable, and useful on live business workflows is. The 40% Gartner forecast (40% of enterprise applications will embed task-specific AI agents by end of 2026, up from less than 5% in 2025) only matters if the embedded agents stay reliable past the demo. The platforms whose 2025 pilots are now in production all share the same property. They have a fail-safe layer. The platforms whose 2025 pilots stalled at the first quarterly review do not.

The fail-safe layer is the part nobody puts on a feature page, because it sounds like the absence of capability rather than the presence of one. A lot of vendor copy talks about scale, intelligence, orchestration, and self-healing. Almost none of it shows the SQL query that decides when a stuck workflow gets killed, or the substring list that decides whether an error is transient. The system property that decides whether an EPA program survives is the system property most marketing pages refuse to surface.

40%: Gartner's 2026 forecast for enterprise applications embedding task-specific AI agents, up from less than 5% in 2025, via SiliconANGLE coverage of Appian World 2026.

"As generative AI matures, building agents is no longer the hard part. Keeping them accurate, auditable, and useful on live business workflows is." (Adam Glaser, Appian World 2026)

2. Three failure modes that kill EPA programs

The post-mortems on stalled EPA programs land on the same three failure modes, in roughly the same order.

One: every flake becomes a ticket. The bots get built, the workflows ship, and the failure layer does not. A 503 from the target system, a VPN reconnect that drops the session, a window-focus loss while a user moves the mouse: all of these mark the workflow as failed. Each one creates an ops ticket that turns out to be transient and self-resolved within a minute. By month two, ops is spending 30 percent of its time triaging tickets that did not need triage. The platform looks unreliable; the underlying infrastructure is fine.

Two: the retry layer corrupts data. Ops asks for retries and the platform adds them. Now the 503 self-heals. So does the validation error: the bot resubmits the same bad input five times before failing, and downstream systems record five duplicate rows. Or the permission-denied error retries until the lockout policy locks the service account, taking down every other workflow using it. Retrying everything is worse than retrying nothing, because the failures it does not heal it amplifies.

Three: the dashboard goes red and stays red. Both of the above get fixed. The classifier exists; retries respect it; the failure rate drops 95 percent. Then the target system goes down for a Sunday-night maintenance window, the cron-scheduled workflow fires every minute against the dead endpoint, and by Monday morning the dashboard has 1,440 identical red rows. The team learns to ignore the dashboard. The next real incident, two weeks later, sits in the noise for six hours before someone notices. The fail-safe layer is incomplete unless it knows when to stop.

3. What the fail-safe layer actually contains

Three primitives, one for each of the three failure modes above. None of them are exotic. All of them have to be code-reviewable, in the path of every execution, with constants the buyer can read before signing.

Primitive A: an error classifier. Every error message that comes off an executed step gets categorized into one of a small fixed set: infrastructure (transient, retry), workflow logic (deterministic, do not retry), unknown (be conservative, do not retry). The classifier is the boundary between modes one and two above. Without it, you either retry everything or retry nothing.

Primitive B: a retry policy bound to the category. Infrastructure errors get a bounded retry: a small number of attempts, exponential backoff with a hard ceiling, then permanent failure. Workflow logic errors get zero retries. The constants are the system property a buyer should grep for: how many retries, what initial delay, what max delay, what multiplier. Vendors that decline to publish those numbers are publishing a marketing claim, not a system property.
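To make those constants concrete, here is a minimal sketch of the backoff arithmetic they imply, using the defaults Mediar ships (shown in full in section 4). The function name is illustrative, not any vendor's API.

// Sketch of bounded exponential backoff: the delay doubles per attempt,
// clamps at the ceiling, and stops entirely after the retry budget is spent.
// Constants match the defaults discussed in section 4; names are illustrative.
fn backoff_delay_secs(attempt: u32, initial_secs: u64, max_secs: u64, multiplier: f64) -> u64 {
    let raw = initial_secs as f64 * multiplier.powi(attempt as i32);
    (raw as u64).min(max_secs)
}

fn main() {
    let (initial, ceiling, multiplier, max_retries) = (30u64, 600u64, 2.0f64, 3u32);
    for attempt in 0..max_retries {
        // attempt 0 -> 30s, attempt 1 -> 60s, attempt 2 -> 120s
        println!("retry {} after {}s", attempt + 1, backoff_delay_secs(attempt, initial, ceiling, multiplier));
    }
    // attempt 3 never happens: the execution is marked permanently failed
}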

Primitive C: consecutive-failure auto-cancel. A background pattern check runs against the execution history. If the same workflow has failed N times with the same error in the last M minutes, the next scheduled tick is skipped and the schedule itself is marked auto-cancelled until a human acks. This is the only primitive that prevents mode three. Without it, every prolonged target-system outage trains your team to ignore the dashboard.

Failure cascade in the Mediar queue processor

Executor → classify_error → Decision → Retry path / Permanent fail. The executor executes a step, the step throws an error, classify_error(msg) returns Infrastructure | WorkflowLogic | Unknown. If Infrastructure and retries < 3: schedule a retry at +30s, +60s, +120s. If WorkflowLogic, or retries == 3: mark permanently failed.
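Rendered as code, the same cascade compresses to one decision. ErrorCategory mirrors the enum from crates/executor/src/config/retry.rs shown in the next section; Decision and decide are hypothetical names for this sketch, not the executor's actual dispatch types.

// Illustrative sketch of the cascade above, not the executor's real dispatch code.
// ErrorCategory mirrors the enum in retry.rs; Decision and decide() are hypothetical.
#[derive(Debug, Clone, PartialEq)]
enum ErrorCategory { Infrastructure, WorkflowLogic, Unknown }

#[derive(Debug, PartialEq)]
enum Decision { Retry { delay_secs: u64 }, PermanentFail }

fn decide(category: ErrorCategory, retries_so_far: u32) -> Decision {
    const MAX_RETRIES: u32 = 3;
    const DELAYS: [u64; 3] = [30, 60, 120]; // +30s, +60s, +120s

    match category {
        ErrorCategory::Infrastructure if retries_so_far < MAX_RETRIES => {
            Decision::Retry { delay_secs: DELAYS[retries_so_far as usize] }
        }
        // WorkflowLogic, Unknown, or retries exhausted: mark permanently failed
        _ => Decision::PermanentFail,
    }
}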

4. Reading the actual code: classifier and retry config

The Mediar executor is open source. The reliability primitives are in two files inside github.com/mediar-ai/terminator: the retry config and classifier at crates/executor/src/config/retry.rs, and the dispatch handler at crates/executor/src/services/execution_handler.rs. The first 44 lines of retry.rs are the entire retry policy. The defaults are worth reading directly.

// crates/executor/src/config/retry.rs (lines 5-44)
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct RetryConfig {
    pub max_infrastructure_retries: u32,
    pub initial_delay_secs: u64,
    pub max_delay_secs: u64,
    pub backoff_multiplier: f64,
    pub enabled: bool,
}

impl Default for RetryConfig {
    fn default() -> Self {
        Self {
            max_infrastructure_retries: 3,
            initial_delay_secs: 30,   // first retry after 30s
            max_delay_secs: 600,      // cap at 10 minutes
            backoff_multiplier: 2.0,  // double each time
            enabled: true,
        }
    }
}

From crates/executor/src/config/retry.rs, lines 5 to 44. Three retries, 30s initial delay, 600s ceiling, double each time. Configurable per deployment, not hard-coded; this file is the default a fresh install ships with.
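The per-deployment claim is easy to make concrete. A minimal sketch, assuming only that RetryConfig derives Deserialize as shown above and that serde_json is available; the override file name and load_retry_config are illustrative, not the executor's actual configuration path.

// Hypothetical per-deployment override. Assumes only that RetryConfig derives
// Deserialize (as shown above) and that serde_json is a dependency; the file
// name and this function are illustrative, not the executor's API.
fn load_retry_config() -> RetryConfig {
    std::fs::read_to_string("retry_override.json")
        .ok()
        .and_then(|contents| serde_json::from_str(&contents).ok())
        // missing or malformed override: fall back to the shipped defaults
        .unwrap_or_default()
}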

The next 120 lines are the classifier itself. Two named substring lists, in priority order. Anything that matches the infrastructure list returns Infrastructure. Anything that does not match infra but matches the workflow-logic list returns WorkflowLogic. Anything that matches neither falls through to Unknown, and the conservative default is to not retry the unknown. The conservatism is deliberate: an infrastructure pattern the classifier missed costs you one extra failed run; a logic pattern the classifier incorrectly retries can corrupt downstream data.

// crates/executor/src/config/retry.rs (lines 46-164)
#[derive(Debug, Clone, PartialEq)]
pub enum ErrorCategory {
    /// Infrastructure failure (VM down, network issue, MCP unreachable)
    /// These SHOULD be retried automatically
    Infrastructure,

    /// Workflow logic failure (step failed, validation error, business logic)
    /// These SHOULD NOT be retried automatically
    WorkflowLogic,

    /// Unknown/ambiguous error
    Unknown,
}

pub fn classify_error(error_message: &str) -> ErrorCategory {
    let error_lower = error_message.to_lowercase();

    let infrastructure_patterns = [
        "connection refused", "connection reset", "connection timeout",
        "503 service", "504 gateway timeout",
        "mcp service unavailable", "mcp connection failed",
        "vm is down", "machine not responding", "health check failed",
        "could not resolve host", "deadline exceeded",
        "out of memory", "resource temporarily unavailable",
        // (full list: 30+ patterns)
    ];

    let workflow_logic_patterns = [
        "validation failed", "invalid input", "missing required",
        "record not found", "permission denied", "unauthorized",
        "step failed", "assertion failed", "condition not met",
        "file not found", "parse error",
        // (full list: 18 patterns)
    ];

    for pattern in &infrastructure_patterns {
        if error_lower.contains(pattern) {
            return ErrorCategory::Infrastructure;
        }
    }
    for pattern in &workflow_logic_patterns {
        if error_lower.contains(pattern) {
            return ErrorCategory::WorkflowLogic;
        }
    }

    // Conservative default: do not retry the unknown
    ErrorCategory::Unknown
}

Excerpt from lines 46 to 164. The full file lists roughly 30 infrastructure substrings and 18 workflow-logic substrings. Run grep -o '"[^"]*"' crates/executor/src/config/retry.rs | wc -l on the open-source repo to count the quoted pattern strings yourself.
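For a quick feel of the priority order, here is a minimal test one could drop next to classify_error. The sample messages are invented for illustration and only exercise the excerpted patterns above, not the repo's full lists.

// Illustrative test against the excerpted lists above; the sample
// messages are invented, not taken from the repo's test suite.
#[cfg(test)]
mod classifier_sketch {
    use super::{classify_error, ErrorCategory};

    #[test]
    fn infrastructure_patterns_win_and_unknown_falls_through() {
        // "503 service" is on the infrastructure list: retried
        assert_eq!(classify_error("HTTP 503 Service Unavailable"), ErrorCategory::Infrastructure);
        // "validation failed" is on the workflow-logic list: never retried
        assert_eq!(classify_error("Validation failed: missing required field"), ErrorCategory::WorkflowLogic);
        // Matches neither excerpted list: Unknown, and Unknown is not retried
        assert_eq!(classify_error("widget exploded unexpectedly"), ErrorCategory::Unknown);
    }
}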

3 max infrastructure retries before permanent failure
30s initial backoff delay before the first retry
600s ceiling on backoff delay (10 minutes)
10 min consecutive-failure window before cron auto-cancel

5. The auto-cancel rule, in one SQL query

The third primitive (consecutive-failure auto-cancel) is one SQL query inside the queue processor. Before the executor claims an execution off the queue, if the trigger source is the cron scheduler, it asks the database one question: did this workflow already fail at least 3 times in the last 10 minutes with the same exact error message? If yes, the execution is auto-cancelled with the reason “Auto-cancelled due to consecutive failures”. The next scheduled tick is also cancelled.

// crates/executor/src/db/queries.rs (line 358)
// 3 identical errors within 10 minutes -> kill the cron schedule
SELECT COUNT(*) as count
FROM (
    SELECT error_message
    FROM workflow_executions
    WHERE workflow_id = $1
      AND status = 'failed'
      AND completed_at > NOW() - INTERVAL '10 minutes'
    ORDER BY completed_at DESC
    LIMIT 3
) recent_failures
WHERE error_message IS NOT NULL
GROUP BY error_message
HAVING COUNT(*) >= 3;

From crates/executor/src/db/queries.rs, line 358. Manual and web-triggered executions skip the check; the user is at the keyboard and can decide what to do. Cron-triggered executions, which run unattended, get the auto-cancel.
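To show where the check sits, here is a sketch of a cron-only gate around that query. It assumes sqlx with Postgres and a uuid workflow id; should_auto_cancel and CONSECUTIVE_FAILURE_SQL are illustrative names rather than the executor's actual functions, and the embedded SQL is the query above.

// Illustrative gate for cron-triggered executions; assumes sqlx/Postgres and a
// uuid workflow id. Names here are hypothetical; the SQL is the query above.
const CONSECUTIVE_FAILURE_SQL: &str = r#"
SELECT COUNT(*) as count
FROM (
    SELECT error_message
    FROM workflow_executions
    WHERE workflow_id = $1
      AND status = 'failed'
      AND completed_at > NOW() - INTERVAL '10 minutes'
    ORDER BY completed_at DESC
    LIMIT 3
) recent_failures
WHERE error_message IS NOT NULL
GROUP BY error_message
HAVING COUNT(*) >= 3
"#;

/// True when the last three failures inside the window share one identical
/// error message, i.e. when the next cron tick should be skipped.
async fn should_auto_cancel(
    pool: &sqlx::PgPool,
    workflow_id: uuid::Uuid,
) -> Result<bool, sqlx::Error> {
    let hit: Option<i64> = sqlx::query_scalar(CONSECUTIVE_FAILURE_SQL)
        .bind(workflow_id)
        .fetch_optional(pool)
        .await?;
    Ok(hit.is_some())
}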

The threshold (3 inside 10 minutes) is intentionally tight. Two identical failures inside 10 minutes can still be coincidence; three is unambiguous evidence that the target system is unreachable, the credentials have rotated, or the schedule itself has gone wrong. Continuing to fire is worse than stopping. Past that bar, the next firing is silenced and a single alert is raised. Every EPA platform should have something like this; the ones whose dashboards are already a wall of unread red evidently do not.

6. The buyer's view: what to ask before signing

A short list of questions worth asking any 2026 EPA vendor. None of them are gotchas; all of them have an honest answer. The pattern is that vendors with a strong fail-safe layer answer in minutes, while vendors without one route the question through three layers of sales engineering and come back with a deflection.

  • How many times does a failed step retry, and with what initial delay, ceiling, and multiplier?
  • Which error strings are treated as transient infrastructure failures, and which as workflow logic failures?
  • What threshold stops a stuck schedule from firing all night, and who has to acknowledge it?
  • What counts as "up" in the published uptime figure?

The conversation a 2026 EPA buyer should be having

What vendor copy leads with instead: capability claims, scale claims, AI claims. The retry behavior, error categorization, and auto-cancel logic are described as 'enterprise-grade reliability' or 'self-healing automation' without naming the constants, the substring lists, or the cancel threshold. The buyer cannot grep what they are about to sign. The recurring phrases:

  • Self-healing (no rule shown)
  • Enterprise-grade reliability (no constants)
  • AI-powered orchestration (no failure handling)
  • 99.9% uptime (without naming what counts as up)

Mediar publishes the answers to all four under AGPL-3.0 in the Terminator repository because we believe the only reliability claim worth making is one a customer can grep. The same approach is available to any vendor that wants to take it; that none of the large incumbents has is itself a piece of information for the 2026 buyer.

Bring an EPA program that has to survive past the demo.

Twenty minutes is enough to walk one of your stuck workflows against the classifier and the auto-cancel rule, live. We will name the constants you should be auditing on every vendor you talk to in 2026, ours included.

Frequently asked questions

What is enterprise process automation in 2026?

Enterprise process automation (EPA) is the coordinated automation of repetitive workflows across multiple systems and departments at organization scale. It is broader than RPA (which automates one task on one screen) and broader than departmental workflow tools (which connect SaaS APIs inside one team). The 2026 shift, called out at Appian World on 30 April 2026 and reported by SiliconANGLE, is that buyers no longer score EPA platforms by what their agents can technically do. Building agents is the easy part now. Keeping them accurate, auditable, and useful on live business workflows is the hard part. The platforms that survive production have a code-defined fail-safe layer: explicit error categories, retry policy bound to category, and an auto-cancel rule for stuck schedules. The platforms that do not have those primitives produce demo-quality reliability and ops-team escalations.

What is the difference between EPA and RPA?

RPA automates one structured task by mimicking a human at one screen: clicking, typing, copying values from one box to another. It is a per-task tool. EPA orchestrates an end-to-end business process across many screens, many systems, and often many departments, with a fail-safe layer that decides what happens when one step breaks. RPA is a component inside an EPA platform, not a replacement for it. The distinction shows up most sharply when something fails: a pure RPA bot stops, logs, and waits for a human. An EPA platform classifies the failure, decides whether to retry it (transient infrastructure issue) or stop the schedule entirely (workflow logic issue or repeated failure), and routes the right alert to the right team. Without that decision layer, RPA at enterprise scale becomes an alert firehose nobody reads.

Why do most enterprise process automation programs stall after the first quarter?

Three reasons, in order. First, the bots get built but the failure layer does not. Every transient flake (a 503, a VPN reconnect, a window-focus loss) marks the workflow failed, and ops drowns in tickets that would have self-healed if a retry policy existed. Second, the failure layer gets built but it retries everything, including business-logic errors, so a bad input gets resubmitted 50 times and corrupts data. Third, both of those problems are covered, but a stuck schedule (a cron workflow whose target system is down) silently fails 1,440 times overnight, fills the dashboard with red, and trains everyone to ignore the dashboard. The platforms that do not stall have separate code paths for these three classes of failure. The fail-safe layer is the program. Everything else is wiring.

What does a working error classifier actually look like in code?

Mediar publishes its error classifier in the open-source Terminator repository at github.com/mediar-ai/terminator. The file is crates/executor/src/config/retry.rs. It defines an ErrorCategory enum with three variants (Infrastructure, WorkflowLogic, Unknown) and a classify_error function that pattern-matches the error message string against two named lists. Roughly 30 substrings indicate infrastructure trouble (connection refused, 503, mcp service unavailable, vm is down, deadline exceeded). Roughly 18 indicate workflow logic trouble (validation failed, record not found, permission denied, step failed, file not found). Anything that matches neither list returns Unknown, and the conservative default is to not retry. The conservatism is deliberate. An infrastructure pattern that the classifier missed costs you one extra failed run; a logic pattern that the classifier incorrectly retries can corrupt downstream data.

How does a fail-safe EPA platform stop a stuck cron workflow from failing all night?

Mediar's executor checks a SQL pattern on every cron-triggered execution, defined at crates/executor/src/db/queries.rs:358. The query asks: in the last 10 minutes, did this workflow fail 3 or more times with the exact same error message? If yes, the workflow is auto-cancelled with the reason 'Auto-cancelled due to consecutive failures' and the next scheduled tick is skipped. Manual and web-triggered executions are exempt (the user is right there, they can decide what to do). This is the difference between an EPA platform that wakes ops at 3 AM with one actionable alert and one that fills the dashboard with 360 identical red rows. The threshold is intentionally tight: 3 failures inside 10 minutes is unambiguous evidence the target system is unreachable, and continuing to hammer it is worse than stopping.

What retry math does Mediar use for transient infrastructure failures?

Default RetryConfig: max_infrastructure_retries is 3, initial_delay_secs is 30, max_delay_secs is 600 (10 minutes), backoff_multiplier is 2.0. So the first retry runs 30 seconds after a failure, the second runs 60 seconds after that, the third runs 120 seconds after that. Past the third infrastructure retry, the execution is marked permanently failed. The cap matters: 600 seconds is the ceiling because past that, a transient infrastructure issue has stopped looking transient and the platform should stop pretending. These numbers are configurable per-deployment, not hard-coded; the file is the default for a fresh install. Pages that quote a vendor's retry policy without showing the configuration knob are quoting a marketing claim, not a system property.

How is this different from how UiPath, Automation Anywhere, or Power Automate handle failures?

All three large RPA platforms ship retry frameworks. The structural difference is that Mediar's classifier is a code-reviewable artifact in an open-source repository (github.com/mediar-ai/terminator, AGPL-3.0 for the executor), versioned in git, with the substring lists visible to anyone deciding whether to bet a workflow on the platform. UiPath, Automation Anywhere, and Power Automate handle this inside proprietary runtimes whose retry behavior is documented but not auditable, and whose error-category boundaries are implementation details. The practical consequence is that a Mediar customer's compliance team can grep the executor for the exact substrings that decide whether their healthcare workflow gets retried; a UiPath customer's compliance team has to take the vendor's word for it. Both approaches can be reliable in practice. Only one approach can be inspected.

Do I need a fail-safe layer if my workflows are simple?

If you run fewer than 100 executions per week and your team can babysit every failure manually, no. The argument for a fail-safe layer is volume: at 100 to 1,000 executions per week the failure rate is small enough that ops can keep up, but starts wasting senior time on the same 5 transient errors. At 10,000 to 100,000 executions per week, the volume of transient flakes alone overwhelms a manual triage queue, and the only way to keep the dashboard signal-bearing is automated category-based handling. Above 100,000 the fail-safe layer is the program; without it the workflows do not run reliably enough to be load-bearing for the business. Most EPA buying decisions in 2026 are happening at the 1,000 to 100,000 weekly volume range, which is exactly the range where the fail-safe layer is the deciding factor and the agent capability is not.