The half that breaks
Data entry automation has two halves. Almost every guide stops at the first one.
Read the top ten articles on data entry automation and you will see the same shape. A flowchart with OCR, then an IDP box, then a JSON blob, and the JSON has an arrow pointing at a CRM logo or a happy stick figure. The arrow is the part that lies. In any regulated finance, claims, banking, or healthcare shop, the JSON does not auto-arrive in the destination system. Something has to type it in. That something is the half that breaks. This piece is about that half.
Direct answer (verified 2026-05-07)
Data entry automation is software that types data into business systems instead of a human. It has two halves. The first half is extract: turning a document, email, or feed into structured fields. OCR, IDP, vision models, EDI parsers all live here, and this half is well-covered. The second half is enter: those fields actually landing in the destination form, on the destination screen, in the destination app. On legacy desktop systems with no API (SAP GUI, Jack Henry, Fiserv, FIS, Epic, Cerner, Oracle EBS, mainframes), this is where most projects stall. The honest answer to “does this automate my data entry” is whether it can reliably do the second half on the systems your data actually lives in.
The two halves, drawn honestly
Every guide we read for the research on this page used some variant of the same diagram. Document on the left, OCR or AI in the middle, a green check on the right. The check is doing a lot of work. Here is the same diagram with the step that the green check actually stands in for, named.
Data entry automation, with the enter step drawn
1. Document arrives: PDF, email, scan, EDI feed
2. Extract: OCR, vision model, IDP returns JSON
3. Validate: schema check, business rules
4. Enter: type into destination UI, save, capture confirmation
The first three boxes are commodity. There are five honest options for any of them, more if you count the free ones. The fourth box is where every project we touch has stalled at least once. The destination is a Windows desktop app. There is no public API. There is a person sitting in front of it whose calendar will tell you that they spend three hours a day typing the JSON from the third box into the fourth box.
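Before the field-level detail, here are the four boxes as typed stages, in Rust since the runtime under discussion is a Rust binary. Every name below is invented for illustration; nothing here is from the Terminator SDK.

// The four boxes as types. Illustrative only; no SDK names.
struct RawDocument { bytes: Vec<u8> }                    // 1. document arrives
struct ExtractedFields { pairs: Vec<(String, String)> }  // 2. extract
struct ValidatedFields { pairs: Vec<(String, String)> }  // 3. validate
struct EntryReceipt { confirmation_text: String }        // 4. enter, with proof

#[derive(Debug)]
enum EntryError {
    ElementNotFound(String), // a selector resolved to nothing
    SaveNotConfirmed,        // typed and saved, but no confirmation captured
}

// The fourth box is the only stage that must drive a live UI,
// which is why it, not extraction, is where projects stall.
fn enter(fields: &ValidatedFields) -> Result<EntryReceipt, EntryError> {
    let _ = fields;
    Err(EntryError::SaveNotConfirmed) // stub: the real work is UI driving
}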
What does an entered field actually look like?
Most coverage of data entry automation talks about the destination step in marketing terms (“writes back to your system of record”) without ever showing what the bot is doing at the field level. There are two common shapes. The first is what most legacy RPA stacks shipped through 2024; the second is what an accessibility-tree-bound recorder emits today. The brittle version is shown first, the recorder-bound version after it.
Same field, two ways to bind it
// What most "automate this form" stacks emit:
// image template-match + XPath fallback
{
"step": "data_entry",
"image_template": "company_code_field@1920x1080.png",
"image_threshold": 0.92,
"xpath_fallback":
"/html/body/div[3]/form/div[7]/div/div[2]/input[1]",
"value": "1000"
}
// Two ways this dies:
// 1. The screen renders at 2560x1440 on the Citrix farm and the
// template mismatches by 4 pixels.
// 2. The vendor ships a UI patch and the XPath now points at
// a hidden div that no longer accepts input. The bot types
// into nothing, the script reports success, the journal is
// never posted, and you find out at month-close.

The recorder-bound version is not theoretical. The struct is in the open-source workflow recorder at terminator/crates/terminator-workflow-recorder/src/events.rs, as a Rust struct named TextInputCompletedEvent with eight fields; the sketch below approximates its shape. The conversion to a runtime MCP step lives in mcp_converter.rs, which builds the type_into_element step with clear_before_typing set true and a 5000ms timeout. Anyone can clone the repo and read the lines for themselves.
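Here is that recorder-bound shape as a sketch. The field names are reconstructed from what this page describes (accessible name, control type, input method, focus method), not copied from events.rs, so read them as an approximation of the real struct rather than the struct itself.

// Approximation of the recorder-bound event for the same field.
// Field names are reconstructed from the prose on this page, not
// copied from terminator-workflow-recorder/src/events.rs, where the
// real TextInputCompletedEvent lives.
pub enum InputMethod { Typed, Pasted, AutoFilled, Suggestion, Mixed }
pub enum FocusMethod { MouseClick, KeyboardNav, Programmatic }

pub struct TextInputSketch {
    pub element_name: String,          // accessible name: "Company Code"
    pub element_role: String,          // UIA control type: "Edit"
    pub automation_id: Option<String>, // stable id, when the app exposes one
    pub window_title: String,          // parent-window scope for the selector
    pub text_value: String,            // what landed in the field: "1000"
    pub input_method: InputMethod,     // how it got there (see next section)
    pub focus_method: FocusMethod,     // how the field gained focus
    pub duration_ms: u64,              // how long the entry took
}

Note what is absent: no screenshot, no pixel coordinates, no DOM path. Every binding signal is a fact the OS exposes about the element, which is why the selector survives a resolution change or a layout reflow.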
Why the input method field exists
One detail in that struct is worth its own paragraph. The recorder does not just save what was typed. It saves how it was typed. The input_method field is one of five values:
- Typed — character-by-character keystrokes. The runtime can replay this with a per-character delay so apps with debounced input handlers (SAP GUI, some Java Swing forms) see the same event stream they expect.
- Pasted — a large block arrived quickly. The runtime should paste, not type, so a 240-character chart-note does not take 30 seconds.
- AutoFilled — the field was completed by the app itself (a downstream value derived from a prior field). The runtime should skip it, because re-entering it can clobber the derivation.
- Suggestion — the user took an autocomplete choice. The runtime needs to type the prefix and then arrow-down into the dropdown, not paste the final string.
- Mixed — some combination, treated conservatively.
The top-ranking guides on data entry automation do not mention any of this, because most of them describe the extract step and treat the destination as an HTTP POST. Once you accept that the destination is a real Windows process and not a JSON endpoint, the shape of an entered field has to carry this much detail. Otherwise the playback either takes too long, crashes the destination, or looks correct and silently corrupts a derived value.
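As a sketch of the dispatch that list implies, in the same Rust as the recorder. The helper functions are hypothetical stand-ins for whatever keyboard and clipboard layer a runtime uses; this is our illustration of the logic, not Mediar's converter.

// Illustrative replay dispatch on input_method. The helpers are
// hypothetical stubs, not APIs from the Terminator SDK.
enum InputMethod { Typed, Pasted, AutoFilled, Suggestion, Mixed }

fn send_key(_c: char) { /* hypothetical: one OS-level keystroke */ }
fn paste_text(_s: &str) { /* hypothetical: one clipboard paste */ }
fn pick_suggestion(_v: &str) { /* hypothetical: type prefix, arrow-down, enter */ }

fn replay(method: &InputMethod, value: &str) {
    match method {
        // character-by-character, so debounced handlers (SAP GUI,
        // Java Swing forms) see the event stream they expect
        InputMethod::Typed => value.chars().for_each(send_key),
        // one paste, so a 240-character chart-note does not take 30 seconds
        InputMethod::Pasted => paste_text(value),
        // the app derived this field itself; re-entering can clobber it
        InputMethod::AutoFilled => {}
        // the final string must come from the dropdown, not a paste
        InputMethod::Suggestion => pick_suggestion(value),
        // unknown mix: fall back to the slowest, safest path
        InputMethod::Mixed => value.chars().for_each(send_key),
    }
}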
Anchor fact
One line is the difference between a brittle bot and a durable one.
In mcp_converter.rs, the function generate_element_selector ends with this fallback chain: try the recorded automation id, then the parent-window-scoped name, then the visible text label, then the element role plus a position hint. Three of the four are position-independent. Pixel coordinates are the option of last resort, used only when nothing else identifies the element. Most RPA stacks that vendors call “AI” have that order inverted.
Source: apps/desktop/src-tauri/src/mcp_converter.rs in the Mediar product repo, around line 1826. Same locator strategy is exposed in the open-source Terminator SDK at github.com/mediar-ai/terminator.
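In sketch form, the chain reads like this. The struct and function names are ours, chosen to mirror the order described above; this is not the actual body of generate_element_selector.

// Sketch of the fallback order described above. Illustrative names,
// not the real code at mcp_converter.rs line 1826.
struct RecordedElement {
    automation_id: Option<String>,
    window_scoped_name: Option<String>,
    visible_label: Option<String>,
    role: String,
    position: (i32, i32),
}

fn selector_sketch(el: &RecordedElement) -> String {
    if let Some(id) = &el.automation_id {
        return format!("automationid:{id}"); // 1. stable id, strongest
    }
    if let Some(name) = &el.window_scoped_name {
        return format!("name:{name}");       // 2. parent-window-scoped name
    }
    if let Some(label) = &el.visible_label {
        return format!("text:{label}");      // 3. visible text label
    }
    // 4. last resort, and the only position-dependent option
    format!("role:{} near:{},{}", el.role, el.position.0, el.position.1)
}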
What this still does not solve
A page that only describes the easy cases is a brochure. Three honest limits.
- Some destination apps render their fields as images of text with no accessibility tree. A small set of niche field-service tools and a few industrial HMIs are built this way. On those, the runtime falls back to image-based locators that are slower and less reliable. We have not seen this on any common ERP, EHR, banking core, or mainframe terminal emulator we have shipped against, but it does exist.
- If the destination form changes shape (new required field, new tab, different navigation path), the recording has to be redone. The schema is fixed at recording time. Self-healing covers label and layout drift, not new fields the recording never saw.
- Handwriting, multi-language carbon copies, and severely degraded scans still need a human review at the extract step. The enter step does not improve recognition quality of the input documents.
Within those limits, the round trip is what most ops, finance, claims, and intake teams are actually trying to buy when they search for data entry automation. They are not trying to buy a smarter extractor. They have an extractor. They are trying to get the values into the form without a human paste step.
“No per-seat license, no implementation tax. The $10K turn-key program fee converts to credits, so it is effectively prepaid runtime.”
Mediar pricing, public on www.mediar.ai
Want this round trip running on one of your forms?
Bring a document you actually receive and the destination form you actually post into. We will record the form once, generate the destination-shaped schema, and show you the typed result on a real workflow run.
Frequently asked questions
What is data entry automation, in one sentence?
Software that types data into business systems instead of a human, covering both the extract step (turning a document into structured fields) and the enter step (those fields landing in the destination UI). Most coverage online describes only the extract half.
Why does the enter step matter more than the extract step?
Because extract has many honest answers (Tesseract, Textract, GPT-4 vision, Klippa, Rossum, Docparser). The destination is where the project actually has to land, and on legacy desktop systems like SAP GUI, Jack Henry, Fiserv, FIS, Epic, Cerner, or Oracle EBS there is no API to post into. A clean JSON payload with no honest way to type it into the destination is not automation, it is a queue for a human.
Why do most RPA bots break at the typing step?
Because they bind the typing event to either a pixel template (the screenshot of the field at the time of recording) or a CSS or XPath selector (the DOM path). Both move when the destination ships a UI patch, when the user runs at a different screen resolution, or when the app re-renders a section. The bot then types into nothing, often without raising an error.
What does an accessibility-tree selector look like in practice?
The selector is the field's UIA control type plus its accessible name, e.g. `role:Edit && text:Company Code`. It is exactly what Windows screen readers consume, which means the field is addressed by what the OS exposes about it, not by where it happens to render today. The same selector resolves after the layout reflows, as long as the label survives.
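A toy resolver makes the point concrete. Only the selector grammar comes from the example above; the tree type and the matching code are invented for illustration.

// Toy resolver for a role + accessible-name selector against a
// hypothetical accessibility tree. Illustrative only.
struct UiNode {
    role: &'static str,   // UIA control type, e.g. "Edit"
    name: &'static str,   // accessible name, e.g. "Company Code"
    children: Vec<UiNode>,
}

fn find<'a>(node: &'a UiNode, role: &str, text: &str) -> Option<&'a UiNode> {
    if node.role == role && node.name == text {
        return Some(node);
    }
    node.children.iter().find_map(|c| find(c, role, text))
}

fn main() {
    let window = UiNode {
        role: "Window",
        name: "Post Journal",
        children: vec![UiNode { role: "Edit", name: "Company Code", children: vec![] }],
    };
    // Resolves by what the OS exposes, not by where the field renders today.
    assert!(find(&window, "Edit", "Company Code").is_some());
}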
How does Mediar know which selector to record?
When you record a workflow once, the recorder captures a TextInputCompletedEvent for every field you type into. The event stores the field's accessible name, control type, input method (typed vs pasted vs autofilled), and focus method (mouse click vs keyboard nav vs programmatic). At conversion time, the strongest stable signal in that event becomes the selector. The full struct is in the open-source SDK at terminator/crates/terminator-workflow-recorder/src/events.rs around line 977.
Does this work for browser-based forms or only desktop apps?
Both, but the edge is on the desktop side. Browser forms expose an accessibility tree via the DOM and a separate UIA proxy, so the same selector pattern works. The reason this approach exists is that legacy desktop apps (SAP GUI, mainframe terminals, banking core green screens, EHRs) do not expose a DOM at all. Browser-only automation tools cannot reach them. This is where the pure browser agents stop and Mediar starts.
What about handwriting, image-only PDFs, or apps that render fields as images?
Handwriting and image-only documents need an OCR or vision pass before any structured field exists; that part is no different from any other extract step. A small set of niche desktop apps draw their fields as images of text without any accessibility tree at all (some field-service tools, some industrial HMIs). On those, the runtime falls back to image-based locators that are slower and less reliable. The mainframe terminal emulators we have shipped against (3270 sessions through Reflection or Rocket, AS/400 through IBM iAccess) all expose a tree.
What is the actual price?
$0.75 per minute of workflow runtime. There is no per-seat license. The $10K turn-key program fee converts to credits with a bonus, so it is effectively prepaid usage of the same per-minute meter.
Adjacent walkthroughs
Same architecture, different starting points and destinations.
AI data entry from PDF, traced past the JSON
The same round trip starting from the document side. How the destination form's recording shapes the extraction schema before the vision pass even runs.
RPA agent UI input layer: accessibility tree vs pixels
Deeper on the input layer that this page references. Why a Windows UIA selector is a stable fact, while a pixel match is a guess.
Where the AI in Mediar AI actually lives
The model writes the workflow once during recording. The runtime is a Rust binary calling Windows accessibility APIs, with zero LLM calls in the hot path.