Enterprise AI agent governance

The audit log is the input layer

You cannot govern an AI agent acting on SAP GUI, Oracle EBS, or a Jack Henry green screen any better than its input layer lets you log what it did. Pixel agents leave a video. Accessibility-tree agents leave a structured per-step record: step_id, element role, application_name, field name, before and after value, duration_ms, retry_count, error_category. That shape is what maps cleanly onto SOX, HIPAA, and SOC 2 controls. The control plane sits on top of it; it cannot replace it.

Matthew Diakonov

Direct answer (verified 2026-05-05)

Three things have to be true for AI agent governance on legacy systems to work: every action carries an identity (caller, org, execution), every action is structured (step id, element role, application, field, before and after value, duration, retry count), and every failure is attributable (Infrastructure vs WorkflowLogic vs Unknown). The first two are not policy choices; they are determined by the agent's input layer. Agents driving Windows accessibility APIs get all three by construction; pixel agents and pure-LLM “see and click” loops cannot without a redesign.

The thesis

Most writing about enterprise AI agent governance treats the control plane as the answer: identity for agents, policy engines, threat response, an enterprise “agent control plane” that sits above whatever the agents actually do. That layer is real and necessary. It is not sufficient.

The reason it is not sufficient on legacy systems is mundane. SAP GUI, Oracle EBS, Jack Henry, Fiserv, FIS, Epic, Cerner, eClinicalWorks, and mainframe terminals do not ship the kind of audit logs you need to reconstruct an automated session from the system side. SAP's CDPOS / CDHDR change documents help for some objects and not others. Banking-core green screens are usually mute. So the agent itself becomes the only complete record of what happened on the screen. And the shape of that record is fixed by the agent's input layer, not by the policy engine wrapped around it.

A pixel-matching or screenshot-based agent saw a 1920x1080 image, predicted a coordinate, and dispatched a click. What it can log is that image and that coordinate. A “see and click” LLM agent can log its own prose, but the link from the prose to the actual control is whatever the model chose to say, not a stable identifier. An accessibility-API agent, in contrast, walked a tree of named elements that the OS exposes for screen readers. It always knew the role, the application, the automation_id, the field name, and (for text input) the value before and after. The structured record is a side effect of how the agent had to find the control in the first place.
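
To make that concrete, here is a deliberately simplified sketch. It is not the Terminator SDK's real API and not Mediar source; every type and function name in it is illustrative. The point is that the identifiers the agent needed in order to act at all are the same ones the audit record wants, so the record is built from values already in hand.

// Illustrative only: not Mediar source. The real lookup goes through the
// Windows UI Automation tree; this sketch just shows why the structured
// record is a by-product of that lookup rather than extra work.
#[derive(Debug, Clone)]
struct UiElement {
    role: String,          // e.g. "Edit", as a screen reader would announce it
    automation_id: String, // stable UI Automation identifier
    name: String,          // the field's label
    application: String,   // e.g. "SAP GUI"
    value: Option<String>, // current text for an Edit control
}

#[derive(Debug)]
struct AuditRecord {
    step_id: String,
    application_name: String,
    element_role: String,
    field_name: String,
    value_before: Option<String>,
    value_after: String,
}

fn set_value(step_id: &str, element: &mut UiElement, new_value: &str) -> AuditRecord {
    let before = element.value.clone();
    // The action: in a real agent this would dispatch a UIA SetValue;
    // here we only mutate the illustrative struct.
    element.value = Some(new_value.to_string());
    // The record: every identifier was already resolved by the tree walk.
    AuditRecord {
        step_id: step_id.to_string(),
        application_name: element.application.clone(),
        element_role: element.role.clone(),
        field_name: element.name.clone(),
        value_before: before,
        value_after: new_value.to_string(),
    }
}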

What lands in the log, exactly

This is the StepResult struct that every Mediar step writes into workflow_executions.execution_logs, alongside the columns on the row itself. It is unedited from the executor source on disk.

// crates/executor/src/models/execution.rs (lines 75-95)
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct StepResult {
    pub step_id: String,
    pub tool_name: String,
    pub status: StepStatus,            // Pending | Running | Success | Failed | Skipped | Retrying
    pub result: Option<Value>,
    pub error: Option<String>,
    pub duration_ms: Option<u64>,
    pub retry_count: Option<u32>,
}

// On the workflow_executions row itself:
//   trace_id           OpenTelemetry trace, joins to ClickHouse logs
//   client_id          who called the agent
//   client_ip          where from
//   execution_params   the inputs passed in
//   results            the structured output
//   execution_logs     jsonb step-by-step
//   screenshots        per-step image array
//   error_message      on failure
//   error_category     Infrastructure | WorkflowLogic | Unknown
//   started_at, completed_at, execution_duration_seconds

The recorder side, which captures the workflow once before the agent runs it deterministically forever after, emits one log line per UI event with the element identity attached. The OS gives the recorder the role and application name for free; that is the same API a screen reader uses to read controls aloud.

// apps/desktop/src-tauri/src/workflow_recorder.rs (lines 620-700, abridged)
fn log_event_verbose(event: &TerminatorWorkflowEvent, n: i32) {
    match event {
        TerminatorWorkflowEvent::TextInputCompleted(e) => {
            // "field_name", "field_type", "text_value", "keystroke_count"
            // captured from the OS accessibility tree, not from a screenshot
            info!("TEXT INPUT {}: \"{}\" ({} keystrokes) field=\"{}\" type={}",
                n, e.text_value, e.keystroke_count, e.field_name, e.field_type);
        }
        TerminatorWorkflowEvent::Click(e) => {
            // role(), application_name() - the same identifiers a screen reader uses
            info!("CLICK {}: \"{}\" role={} app={}",
                n, e.element_text, e.metadata.ui_element.role(),
                e.metadata.ui_element.application_name());
        }
        // ...
    }
}

Source files: crates/executor/src/models/execution.rs and apps/desktop/src-tauri/src/workflow_recorder.rs in the Mediar product repo. The recorder uses the Windows UI Automation tree, the same surface area exposed in the open-source Terminator SDK.

Where the audit fields come from

The legacy desktop apps in the figure below have no API. They expose themselves to screen readers via OS accessibility, and that is the only structured surface there is. Mediar treats that surface as the ground truth for both finding the control and logging what was done to it. The same tree walk that locates a SAP customer field also names it for the audit row.

One walk of the tree, two outputs: an action and a record

[Figure: SAP GUI window, Jack Henry teller, Epic chart, Oracle EBS form, and mainframe terminal all feed the OS accessibility tree, which in turn yields step_id and tool_name, element role and automation_id, field name with before and after values, duration_ms and retry_count, and error_category plus trace_id.]

Failures are attributable, not narrated

Half of governance is “what happened when something went wrong, and was the agent allowed to recover by itself”. Mediar splits failures into three categories at runtime via a static classifier in the executor; only Infrastructure failures auto-retry, WorkflowLogic ones halt and escalate.

// crates/executor/src/config/retry.rs (lines 46-164, abridged)
#[derive(Debug, Clone, PartialEq)]
pub enum ErrorCategory {
    Infrastructure,   // VM down, MCP unreachable, network reset; retry automatically
    WorkflowLogic,    // Validation failed, record not found, permission denied; do NOT retry
    Unknown,          // Ambiguous; surface for human review
}

pub fn classify_error(error_message: &str) -> ErrorCategory {
    let infrastructure_patterns = [
        "connection refused", "connection reset", "503 service",
        "vm is down", "machine not responding", "deadline exceeded",
        // 30+ patterns
    ];
    let workflow_logic_patterns = [
        "validation failed", "invalid input", "missing required",
        "record not found", "permission denied", "unauthorized",
        "step failed", "assertion failed",
        // 18 patterns
    ];
    // ...
}

The implication for a control owner: a “permission denied” or “validation failed” in a legacy app will not get retried into a thousand events that hammer the system. It surfaces, with its category and step, on the workflow_executions row. The retry budget itself (max_infrastructure_retries: 3, exponential backoff starting at 30s, capped at 10 minutes) lives in retry.rs and is part of the workflow definition the auditor reads.
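
As a sketch of that budget, and with the caveat that the real constants and function shape live in retry.rs rather than here:

// Hedged sketch of the retry budget described above; names are illustrative,
// the real values are defined in crates/executor/src/config/retry.rs.
use std::time::Duration;

const MAX_INFRASTRUCTURE_RETRIES: u32 = 3;
const INITIAL_BACKOFF: Duration = Duration::from_secs(30);  // first retry after 30s
const MAX_BACKOFF: Duration = Duration::from_secs(10 * 60); // capped at 10 minutes

fn next_retry_delay(is_infrastructure: bool, retry_count: u32) -> Option<Duration> {
    if !is_infrastructure || retry_count >= MAX_INFRASTRUCTURE_RETRIES {
        return None; // WorkflowLogic/Unknown, or budget exhausted: halt and escalate
    }
    // Exponential: 30s, 60s, 120s, clamped to the 10-minute ceiling.
    let delay = INITIAL_BACKOFF
        .checked_mul(2u32.saturating_pow(retry_count))
        .unwrap_or(MAX_BACKOFF);
    Some(delay.min(MAX_BACKOFF))
}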

Side by side: the audit log a control owner gets

This is the comparison that matters when you write the controls document. Same workflow on SAP GUI, two different agent architectures, two very different review experiences.

What the audit log shows for a single field write
  • Pixel or vision agent: screenshot before, screenshot after, mouse coordinates, key codes
  • Accessibility-API agent (Mediar): step_id, tool_name, application_name, element role, automation/accessibility id, field name, prior value, new value, duration_ms, retry_count, status

Failure attribution
  • Pixel or vision agent: model output text plus a stack trace
  • Accessibility-API agent (Mediar): error_category auto-classified as Infrastructure vs WorkflowLogic by classify_error() against ~50 patterns; only Infrastructure failures auto-retry

Mapping to a SOX/HIPAA control
  • Pixel or vision agent: reviewer must read the screenshot and the prose to figure out which field changed and on which record
  • Accessibility-API agent (Mediar): control owner can query workflow_executions by application_name, field name, or value pattern in SQL; trace_id joins to ClickHouse for the request-level log

Replay and step-by-step debugging
  • Pixel or vision agent: re-run the prompt and hope the model picks the same target, or replay coordinates against a possibly-changed screen
  • Accessibility-API agent (Mediar): start_from_step / end_at_step / follow_fallback fields on the execution row; partial replay is a first-class API

When the legacy UI changes
  • Pixel or vision agent: selectors break silently or the vision agent picks a similar-looking control; failures surface as 'something changed'
  • Accessibility-API agent (Mediar): match by automation_id, then by window plus bounds, then by visible text; whichever matched is recorded so a human reviewer sees which strategy fired

Identity of the caller
  • Pixel or vision agent: depends on which orchestrator is in front; sometimes a service account writes everything
  • Accessibility-API agent (Mediar): client_id and client_ip on every execution row; secrets loaded per-org from org_secrets via load_org_secrets(pool, org_id, execution_id) with execution_id stamped into the log line

Joining the agent log to the system-of-record log
  • Pixel or vision agent: model says it 'updated SAP'; you trust the screenshot, then check SAP CDPOS / change documents separately
  • Accessibility-API agent (Mediar): the agent action and its target field are both logged with the same field name and application; SAP CDPOS or Epic audit-log entries line up by record id and timestamp

Browser and vision agents have legitimate uses on modern SaaS where the DOM and underlying system already provide audit. The comparison above is specifically for legacy Windows desktop systems with no API.
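
To make the SQL claim in that comparison concrete, here is a hedged sketch of a control-owner query. It assumes the entries in execution_logs carry the recorder's application_name and field_name keys (the exact json shape inside that column is an assumption), and the use of sqlx and the casts to text are illustrative.

// Hedged sketch: list executions whose step log touched a given application
// and field. Column names come from the row layout shown earlier; the json
// shape inside execution_logs is an assumption.
use sqlx::{PgPool, Row};

pub async fn executions_touching_field(
    pool: &PgPool,
    app: &str,   // e.g. "SAP GUI" as reported by application_name()
    field: &str, // e.g. the customer-master field under review
) -> sqlx::Result<()> {
    let rows = sqlx::query(
        r#"
        SELECT id, client_id::text AS client_id, started_at,
               error_category::text AS error_category
        FROM workflow_executions
        WHERE execution_logs @> jsonb_build_array(
                  jsonb_build_object('application_name', $1::text,
                                     'field_name',       $2::text))
        ORDER BY started_at DESC
        "#,
    )
    .bind(app)
    .bind(field)
    .fetch_all(pool)
    .await?;

    for row in rows {
        let client: String = row.try_get("client_id")?;
        let category: Option<String> = row.try_get("error_category")?;
        println!("execution by {client}, error_category={category:?}");
    }
    Ok(())
}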

Mediar is SOC 2 Type II certified and HIPAA compliant. Self-hosted deployments keep workflow_executions in the customer's Postgres and OpenTelemetry traces in the customer's collector via OTEL_EXPORTER_OTLP_ENDPOINT, so the audit log never leaves the perimeter.

Mediar product brief, public/llms.txt

Counterargument: the policy layer still matters

A reasonable objection: agent identity, policy enforcement, and a centralized control plane are real governance work, and a structured per-step log does not by itself give you any of that. That is correct. The argument here is not that Forrester's AEGIS-style frameworks or a Google-Cloud-style agent control plane are wrong. They are necessary. They also rest on a substrate.

The substrate is the per-action log. A policy engine that says “an agent may write to customer master data only if X” works only when you can later prove the agent did or did not write to customer master data. On a legacy system whose own log is silent, that proof has to come from the agent. If the agent's log is a screenshot and a coordinate, the proof is weak. If it is a step record naming the application, the field, and the value, the proof is the same kind of evidence a system admin would write in a manual process.

So the practical recommendation is to pick the input layer first, then put the policy plane on top. Not the other way around. You can always add an AEGIS-style governance layer on top of an accessibility-tree agent. You cannot retrofit a structured per-step log onto an agent whose input layer never carried element identity.

What a review session actually looks like

For a workflow on SAP B1 or a Jack Henry teller, a Mediar review session is roughly:

  1. Read the workflow. It is a TypeScript or YAML file stored on its deployed_workflows row, usually under 200 lines for one business process. The steps name the windows and fields they touch, in plain language.
  2. Pull recent runs. Open the dashboard, filter workflow_executions by workflow_id, look at status counts, error_category breakdown, and the median duration_ms. Anything in Unknown gets read in full.
  3. Spot-check a successful run. For one execution, read the StepResult array. Each step has tool_name, the element role, the field name, and the value. Check that the field names line up with the legacy app's data dictionary.
  4. Spot-check a failed run. Pull the error_category and the failing StepResult. Confirm the failure was halted (WorkflowLogic) or retried within budget (Infrastructure). For each retry, the retry_count and duration_ms are on the step.
  5. Tie back to the system of record. For SAP, the field name and timestamp on the step line up with CDPOS entries on the same record. For Epic, with the audit log entry. The agent log is a superset of what the system itself recorded; the two should agree on what changed.

RPA Center-of-Excellence leads finish first-pass review in a couple of hours per workflow. The bottleneck is auditor familiarity with the legacy app, not log parsing.

Picking the input layer for a legacy environment

A short, blunt guide for picking the agent architecture before you pick the governance framework on top:

  • If the workload is a modern SaaS app with a stable DOM and a real audit log, browser or vision agents are fine; the audit trail you keep on top of them is supplementary.
  • If the workload is a Windows desktop legacy system with no API (SAP GUI, Oracle EBS, Jack Henry, Fiserv, FIS, Epic, Cerner, eClinicalWorks, mainframe terminal emulators), pick an accessibility-tree agent. The legacy app's own log is too thin to govern from, and the agent's log has to carry the weight.
  • If the workload mixes both, treat the legacy hop as the constraint and let the agent there decide the architecture. A browser agent calling out to an accessibility-tree agent for the SAP step is a normal shape.

Want to see the StepResult log on a real workflow?

We can run one of your legacy desktop processes against Mediar and walk through the per-step audit log on the actual screens you care about (SAP, Oracle EBS, Jack Henry, Epic, mainframe). 30 minutes, no slides.

FAQ

Frequently asked questions

Direct answer: how do you govern AI agents on legacy systems?

Three things have to be true at once. First, the agent's identity has to be on every action, not just every session — Mediar puts client_id, client_ip, and (when present) execution_id on every workflow_executions row. Second, the action itself has to be structured, not just visual — every step lands as a StepResult with step_id, tool_name, status, duration_ms, and retry_count, and every UI event the recorder captures carries the element's role, application, and field name. Third, failures have to be attributable — the executor's classify_error() function in crates/executor/src/config/retry.rs splits errors into Infrastructure (auto-retry), WorkflowLogic (do not retry, escalate), and Unknown (surface for review). Pixel and screenshot agents cannot give you 1 or 2 by construction; that is why most enterprise AI agent governance writeups punt on the legacy-systems part.

Why is this harder for legacy systems than for SaaS apps?

The SaaS layer comes with audit logs. Salesforce ships field-history tracking, Workday ships its audit hub, NetSuite ships saved searches over its system notes table. Legacy desktop systems mostly do not. SAP GUI has CDPOS/CDHDR for change documents on some objects but nothing for the steps that produced them. Jack Henry, Fiserv, FIS, Epic, Cerner, Oracle EBS, and eClinicalWorks each have their own audit surface, often partial. So when an AI agent acts in those systems, your only complete log is whatever the agent itself emits. That is the whole governance question for legacy: not 'what policy fired in the cloud control plane' but 'what did the agent do on the screen and what record can we keep'.

What does Mediar actually log per step?

Every workflow execution is a workflow_executions row in Postgres with status, started_at, completed_at, execution_duration_seconds, error_message, error_category, screenshots[], execution_logs jsonb, results jsonb, retry_count, max_retries, and a trace_id that joins to ClickHouse for the OTLP-exported request log. Inside that, every step is a StepResult struct (crates/executor/src/models/execution.rs lines 75-95) with step_id, tool_name, status, result, error, duration_ms, retry_count. The recording side (apps/desktop/src-tauri/src/workflow_recorder.rs) emits one log entry per UI event with the element's role(), application_name(), and field_name from the Windows accessibility tree. None of those fields exist on a pixel-matching agent because the agent never saw the element's identity, only its appearance.
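
For illustration, one entry in that execution_logs array might serialize roughly like the value below. The step_id, tool_name, and result payload are invented for the example, and the string form of the status is an assumption about the default serde representation of StepStatus.

// Illustrative only: a plausible shape for one execution_logs entry, built
// from the StepResult fields shown earlier. Values are made up.
use serde_json::{json, Value};

fn sample_step_entry() -> Value {
    json!({
        "step_id": "set_customer_address",
        "tool_name": "set_value",
        "status": "Success",
        "result": { "field_name": "Address", "value_before": "", "value_after": "221B Baker St" },
        "error": null,
        "duration_ms": 840,
        "retry_count": 0
    })
}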

How does that map to SOX, HIPAA, and SOC 2 controls?

The shape of the controls is 'who did what to which record, when, and was it authorized'. Mediar's executor log gives a control owner enough columns to write that as a single SQL query: client_id (who), tool_name + element role + field name (what), record id from the result column or screenshot OCR (which record), started_at and duration_ms (when), error_category and retry_count (what happened). For HIPAA specifically, the recorder also captures the application_name so you can isolate Epic or Cerner activity from everything else. Mediar itself is SOC 2 Type II certified and HIPAA compliant; the audit log structure is what makes that certification carry through to the agent's actions instead of stopping at the platform.

Are there controls for what the agent is allowed to do?

Yes, three kinds. Workflow scope: a workflow is a code artifact (TypeScript or YAML) deployed under deployed_workflows; it can only call the steps it was authored to call. Secrets scope: secrets are decrypted per-org in load_org_secrets() using AES-256-GCM with the master key in SECRETS_ENCRYPTION_KEY, so a workflow running for org A cannot read org B's credentials. Retry scope: max_infrastructure_retries defaults to 3 in retry.rs, and only Infrastructure-classified errors retry, so a 'permission denied' failure halts immediately rather than hammering the legacy system.
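
A minimal sketch of the secrets-scope piece, assuming the aes-gcm and hex crates and a nonce-prefixed ciphertext layout; the real load_org_secrets() may store and decrypt differently, so treat every name here except SECRETS_ENCRYPTION_KEY and org_secrets as illustrative.

// Hedged sketch: decrypt one ciphertext pulled from the calling org's
// org_secrets row. AES-256-GCM with the master key from SECRETS_ENCRYPTION_KEY,
// as described above; the crate choice and byte layout are assumptions.
use aes_gcm::{
    aead::{Aead, KeyInit},
    Aes256Gcm, Key, Nonce,
};

fn decrypt_org_secret(ciphertext: &[u8]) -> Result<String, Box<dyn std::error::Error>> {
    // 32-byte master key, assumed hex-encoded in the environment;
    // Key::from_slice panics if the length is wrong.
    let key_hex = std::env::var("SECRETS_ENCRYPTION_KEY")?;
    let key_bytes = hex::decode(key_hex)?;
    let cipher = Aes256Gcm::new(Key::<Aes256Gcm>::from_slice(&key_bytes));

    // Assumed layout: 12-byte nonce followed by ciphertext + GCM tag.
    let (nonce, body) = ciphertext.split_at(12);
    let plaintext = cipher
        .decrypt(Nonce::from_slice(nonce), body)
        .map_err(|e| format!("decryption failed: {e}"))?;
    Ok(String::from_utf8(plaintext)?)
}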

How is this different from putting an AI agent behind UiPath orchestrator?

UiPath orchestrator gives you queue-level governance: who can submit a job, who can see results, what schedule it runs on. That is real and useful. The gap is below the queue: once a UiPath bot is running, the per-step record is whatever the developer wrote into the activity log, plus selectors that may or may not have matched. Mediar's per-step record is structural by default because the input layer is the OS accessibility tree, not a screenshot or a brittle selector. You can put orchestrator-style governance on top of either, but you cannot retrofit a structured per-action log onto an agent whose input layer never carried the element identity.

What about LLM-based 'see and click' agents?

Browser-based and vision-based agents are excellent for new SaaS where the underlying app already has good audit logs and a stable DOM. They are a poor fit for legacy desktop systems for the same reason that makes them flexible: they interpret pixels with a probabilistic model and dispatch coordinates. The audit trail you can keep is a video plus the model's prose. If your reviewer has to watch a screen recording to figure out whether the agent typed a customer's address into the right field, you do not have a governance regime, you have a forensic exercise. Pick the input layer that produces the log shape you need.

Where does the data live and who can query it?

workflow_executions and the surrounding tables live in Postgres with row-level security enabled; the default policy restricts SELECT to the row's client_id (matched against the JWT's client_id claim), and a service-role policy lets the platform manage all rows for support. The OpenTelemetry traces and logs are exported via OTLP to a ClickHouse-backed collector, with trace_id as the join key. For self-hosted deployments, both stores run inside the customer's perimeter; the connection between agent and collector uses the customer's OTLP endpoint set via OTEL_EXPORTER_OTLP_ENDPOINT.
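
As a hedged sketch of what those policies might look like as migration SQL: the PostgREST/Supabase-style current_setting('request.jwt.claims', ...) mechanism, the service_role name, and client_id being a text column are assumptions for illustration, not the deployed policy text.

// Illustrative migration SQL for the access rules described above;
// not the deployed policy text.
pub const WORKFLOW_EXECUTIONS_RLS: &str = r#"
ALTER TABLE workflow_executions ENABLE ROW LEVEL SECURITY;

-- Callers see only their own executions: the row's client_id must match the
-- client_id claim in the request JWT.
CREATE POLICY executions_by_caller ON workflow_executions
    FOR SELECT
    USING (client_id = current_setting('request.jwt.claims', true)::jsonb ->> 'client_id');

-- The platform's service role manages all rows for support.
CREATE POLICY executions_service_role ON workflow_executions
    FOR ALL TO service_role
    USING (true) WITH CHECK (true);
"#;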

What does 'self-healing when UIs change' have to do with governance?

Direct connection. When a SAP support pack rewrites a label or moves a panel, a selector-based bot fails or, worse, clicks the wrong control. Mediar tries match-by-automation_id first, then match-by-window-plus-bounds, then match-by-visible-text, then window-only focus. Whichever strategy actually matched is recorded on the step. So a reviewer can see 'this step ran but matched by visible text instead of automation_id', flag it, and update the workflow before the next run. Self-healing is not 'the agent guessed', it is 'the agent fell back deterministically and told you which fallback fired'.
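
A minimal sketch of that fallback chain, with made-up names (the real matching lives in the executor); the part that matters for governance is the second element of the returned tuple, which is what lands on the step record.

// Illustrative sketch of the fallback order described above; not Mediar source.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MatchStrategy {
    AutomationId,     // preferred: stable UI Automation id
    WindowPlusBounds, // same window, same on-screen region
    VisibleText,      // last structured resort before window-only focus
}

fn resolve_target<T>(
    by_automation_id: impl Fn() -> Option<T>,
    by_window_bounds: impl Fn() -> Option<T>,
    by_visible_text: impl Fn() -> Option<T>,
) -> Option<(T, MatchStrategy)> {
    if let Some(el) = by_automation_id() {
        return Some((el, MatchStrategy::AutomationId));
    }
    if let Some(el) = by_window_bounds() {
        return Some((el, MatchStrategy::WindowPlusBounds));
    }
    if let Some(el) = by_visible_text() {
        return Some((el, MatchStrategy::VisibleText));
    }
    None // caller drops to window-only focus and records that fallback as well
}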

How long does a governance review of a Mediar workflow actually take?

On the deployments we run today the review is roughly: read the workflow file (TypeScript or YAML, usually under 200 lines for a single business process), open three or four sample workflow_executions rows in the dashboard, spot-check the per-step element identities against the application's expected fields. RPA Center-of-Excellence leads we work with finish first-pass review in a couple of hours per workflow. The bottleneck is the auditor's familiarity with the legacy app, not the log format.