What happens after the extractor returns JSON
AI data entry from PDF, traced past the JSON: how the destination form gets typed.
Almost every guide on AI data entry from PDF stops at the same place. A vision model reads the PDF, returns a JSON blob with fields named the way the model thinks an invoice should look, and the article ends. In a real ops team, that JSON still has to land inside SAP, or Epic, or Jack Henry, or Oracle EBS, where the field labels do not match the model's invoice schema and where there is no public API to post into. This page traces what happens next, with the file and line numbers that make it concrete.
Direct answer (verified 2026-04-29)
AI data entry from a PDF works in three stages, but the order matters. First, the destination form (the SAP screen, the Epic chart, the Excel cell range) defines the schema by being recorded once, which produces a Zod input definition with the exact field names the form uses. Second, a vision-capable model parses the PDF into JSON whose shape is constrained by that schema, so the extractor returns the destination's field names rather than its own guessed invoice schema. Third, a Windows desktop runtime types each value into the form via OS-level accessibility APIs, the same interfaces screen readers use, with no human paste step in between. The whole round trip, including the workflow file, is audit-reviewable.
The detail every guide leaves out
If you read the existing playbooks for AI data entry from PDF, the shape is consistent. A flowchart shows a PDF entering a box labeled OCR, then a box labeled NLP, then a box labeled JSON, and the JSON box has an arrow pointing at a happy stick figure or at a CRM logo. The arrow is the part that lies. In every regulated finance, claims, banking, and EHR shop we work with, the JSON does not auto-arrive in the destination system. It gets pasted into a form by a human, or mapped through a brittle middleware layer that breaks the next time the form vendor ships a UI patch.
The reason is structural. The extractor uses one schema (its built-in invoice or receipt or claims schema), and the destination form uses another (Belegnummer instead of invoice_number, payer_id instead of insurer, chart_id instead of patient_id). Somebody has to map between them. That map is fragile, undocumented, and usually owned by the same person who is also typing the JSON into the form.
The fix is not a smarter extractor. The fix is letting the destination form decide what the extraction schema is, and then actually typing the values in.
The five-stage round trip
One PDF, one destination form, traced end to end. Each stage maps to something a reviewer can open in the open-source repo or in the generated workflow file.
1. The PDF arrives
A new document lands in OneDrive, an inbox, a SharePoint folder, or a network share. The trigger is a file watch, not a model call. Nothing is extracted yet, because the extractor does not know what fields it is looking for.
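A minimal sketch of that trigger stage, assuming a Node runtime watching a synced folder; the directory path and the onNewPdf handler are illustrative, not the product's API:

import { watch } from "node:fs";
import { extname, join } from "node:path";

const WATCHED_DIR = "C:\\Shares\\claims-intake"; // hypothetical synced share

// Only a file arrival fires here. No model call happens at this stage,
// because the extractor does not yet know which fields the form needs.
watch(WATCHED_DIR, (event, filename) => {
  if (event === "rename" && filename && extname(filename) === ".pdf") {
    onNewPdf(join(WATCHED_DIR, filename));
  }
});

declare function onNewPdf(path: string): void; // downstream pipeline entry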
2. The schema comes from the destination, not the document
Mediar's recording processor walks every substep of the previously recorded destination workflow, collects every field the operator typed into during the recording, and emits a Zod input schema with those exact field names. The extraction prompt is shaped from that schema.
apps/desktop/src-tauri/src/recording_processor.rs iterates the workflow's substeps, builds a detected_inputs map, and writes one z.string().optional().describe(...) line per field into a generated terminator.ts. That same schema then becomes the structured-output target for the vision pass on the PDF.
3. The vision pass extracts only those fields
A vision-capable model reads the PDF (text plus rendered pages), and is asked to return JSON that conforms to the destination's schema. Not a generic invoice schema, the recorded form's schema. If the form has a journal_entry_type field that is one of three values, the extractor returns one of those three values.
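A minimal sketch of that structured-output constraint, assuming a generated Zod schema and the zod-to-json-schema package; callVisionModel is a hypothetical stand-in, not a named SDK's API:

import { z } from "zod";
import { zodToJsonSchema } from "zod-to-json-schema";

// Destination-shaped schema (illustrative fields; the real one is
// generated from the recording, as the next section shows).
const inputSchema = z.object({
  journalEntryType: z.string().optional().describe("Journal Entry Type"),
  documentDate: z.string().optional().describe("Document Date"),
});

// Hypothetical client call: the schema travels with the PDF as the
// structured-output target, so the model answers in the destination's
// field names instead of a generic invoice shape.
declare function callVisionModel(
  pdfPath: string,
  responseJsonSchema: object
): Promise<string>;

async function extract(pdfPath: string) {
  const raw = await callVisionModel(pdfPath, zodToJsonSchema(inputSchema));
  return inputSchema.parse(JSON.parse(raw)); // a shape mismatch throws and stops the run
}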
4. Each value is typed via Windows accessibility APIs
A Rust runtime walks the same accessibility tree screen readers use, finds each field by its UI Automation locator, and emits a type_into_element MCP call with the value. There is no model in the hot path, no pixel matching, and no SAP-side ABAP change.
5. The audit trail is the workflow file
The generated TypeScript file is the contract. A reviewer can diff it the way they diff a stored procedure, redline a step, and re-run the deterministic replay. The PDF, the extracted JSON, and every keystroke are logged with a workflow run id.
Why the schema starts at the destination, not at the document
The mechanically interesting part of the pipeline is in one file: apps/desktop/src-tauri/src/recording_processor.rs. When the recording pass finishes, the processor walks every substep of the workflow and collects the fields the operator typed into. Those field names get hashed into a deduplicated map called detected_inputs, and one Zod line gets written for each:
for input in detected_inputs.keys() {
    let field_name = to_camel_case(input);
    input_fields.push_str(&format!(
        " {}: z.string().optional().describe(\"{}\"),\n",
        field_name, input
    ));
}

The result is a generated terminator.ts whose input schema names the fields the destination form named, in the order the destination form asked for them. For an SAP F-02 journal entry it is four lines. For an Epic patient registration it is closer to twenty. The contents are the destination's vocabulary, not an extractor's.
That same schema gets handed to the vision pass on the PDF as the structured-output target. The model is asked to return JSON of that shape, which means the JSON is already in the destination's vocabulary by the time it leaves the extractor. There is no manual mapping step, because the mapping was done at recording time by the form itself.
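For the four-field SAP F-02 case described in the FAQ below, the generated block could look like this; the surrounding file layout is illustrative, and only the one-line-per-field pattern comes from the processor above:

import { z } from "zod";

// Generated at recording time: one line per field the operator typed into,
// in the order the destination form asked for them.
export const inputSchema = z.object({
  companyCode: z.string().optional().describe("Company Code"),
  journalEntryType: z.string().optional().describe("Journal Entry Type"),
  documentDate: z.string().optional().describe("Document Date"),
  postingDate: z.string().optional().describe("Posting Date"),
});

// The extractor is constrained to exactly this shape, so its JSON arrives
// already in the destination's vocabulary.
export type JournalEntryInput = z.infer<typeof inputSchema>;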
Where most AI data entry from PDF stops, and where Mediar keeps going
The extractor returns a JSON blob with whatever fields its built-in invoice schema thinks the document has. The consumer of the JSON is a human in a chair, or a brittle script that maps invoice_number to the SAP field called Belegnummer. When SAP rearranges Belegnummer, the script breaks and the human takes over until somebody fixes the mapping.
- Generic invoice or receipt schema, not the destination form
- Manual remapping from extractor field names to destination field names
- Human still types the values into the destination app
- Breaks when the destination UI changes; nobody knows until rows pile up
What the typing pass looks like at the field level
Once the vision pass returns the destination-shaped JSON, the values go through the executor crate (an open-source Rust binary in the Terminator SDK), which calls the Windows UI Automation tree. Each field is one MCP step. For the SAP company-code field, the serialized step looks like this:
{
  tool_name: "type_into_element",
  arguments: {
    text_to_type: "1000",
    clear_before_typing: true,
    timeout_ms: 5000,
    selector: "name:Company Code|role:Edit",
    process_name: "saplogon.exe"
  },
  description: "Type '1000' into Company Code"
}

The selector is the locator that the recording captured. It is read from the same UI Automation tree that powers Windows screen readers, which means the field is addressed by what the OS exposes (role plus accessible name), not by where the field happens to render on screen today. When SAP repaints F-02 after a support pack and the field shifts a row, the locator still resolves.
The runtime tries up to four locator strategies in order before giving up: recorded automation id, window-handle plus bounds, visible text label, and parent window. Three of the four are position-independent. When all four miss, the runtime stops the sequence and surfaces the failed step rather than retry with a guessed click.
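A sketch of that fallback order; the strategy names mirror the order above, but the types and resolve functions are illustrative, not the executor crate's API:

type Strategy = "automationId" | "windowBounds" | "textLabel" | "parentWindow";
type ElementHandle = { runtimeId: string };

// Order matters: the recorded automation id is tried first, and three of
// the four strategies do not depend on where the field renders today.
const ORDER: Strategy[] = ["automationId", "windowBounds", "textLabel", "parentWindow"];

declare function tryResolve(strategy: Strategy, selector: string): ElementHandle | null;

function resolveElement(selector: string): ElementHandle {
  for (const strategy of ORDER) {
    const el = tryResolve(strategy, selector);
    if (el) return el;
  }
  // All four missed: stop the sequence and surface the failed step,
  // rather than retrying with a guessed click.
  throw new Error(`locator failed: ${selector}`);
}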
What this still does not solve
A page that only describes the easy cases is a brochure. Three honest limits:
- Handwritten cursive English, multi-language carbon copies, and severe ink bleed still need human review at the extraction step. The schema constraint helps the model stay honest about what it could not read, but it does not improve the underlying recognition quality.
- A few destination apps render entirely as images of text without an accessibility tree. The mainframe terminal emulators we have seen all expose a tree (3270 sessions through Reflection or Rocket, AS/400 through IBM iAccess), but a small set of niche field-service apps do not. On those, the runtime falls back to image-based locators that are slower and less reliable.
- Workflows that depend on values which change every run and were not present at recording time need to be re-recorded. The schema is fixed by the recording, so a new field on the destination form means a new recording, not a config change.
Within those limits, the round trip is what most ops, finance, and claims teams are actually trying to buy when they search for AI data entry from PDF. They are not trying to buy a smarter extractor. They have an extractor. They are trying to get the values into the form without a human paste step.
Want this round trip running on one of your forms?
Bring a PDF you actually receive and the destination form you actually post into. We will record the form once, generate the destination-shaped schema, run the extraction, and show you the typed result on a real workflow run.
Frequently asked questions
What does the destination-shaped schema actually look like?
When the recording pass finishes, Mediar writes a file called terminator.ts that includes a Zod inputSchema generated from the substeps of the recording. For an SAP F-02 journal entry the schema declares company_code, journal_entry_type, document_date, and posting_date as four optional strings. For an Epic patient registration the same machinery emits chart_id, dob, address_line, payer, and the rest. The extractor for the PDF is then constrained to that exact schema: it returns those keys, with values of the declared types, or it returns null and the workflow stops before it types anything wrong.
Does this need OCR, or does the model read the PDF directly?
It depends on the document. Modern vision-capable models (GPT-4o class, Gemini 1.5 Pro and 2.0, Claude with PDF support) read native text plus rendered page images directly, so a born-digital invoice or claims form does not need a separate OCR pre-pass. Scanned PDFs and faxed forms are still handled, but the pipeline runs an OCR pass first (Tesseract for cheap, AWS Textract or Azure Document Intelligence when handwriting is involved) and feeds the recognized text plus the page image into the vision pass. The destination schema is the same in both cases.
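A sketch of that routing decision; hasTextLayer, runOcr, and visionPass are hypothetical helpers standing in for the engines named above:

declare function hasTextLayer(pdfPath: string): Promise<boolean>;
declare function runOcr(pdfPath: string): Promise<{ text: string; confidence: number }>;
declare function visionPass(pdfPath: string, ocrText?: string): Promise<unknown>;

async function extract(pdfPath: string) {
  if (await hasTextLayer(pdfPath)) {
    // Born-digital: the vision model reads native text plus rendered pages.
    return visionPass(pdfPath);
  }
  // Scanned or faxed: OCR first, then feed recognized text plus the page
  // image into the same vision pass against the same destination schema.
  const ocr = await runOcr(pdfPath);
  return visionPass(pdfPath, ocr.text);
}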
Why not just use Docparser or Nanonets and stop at JSON?
Docparser, Parseur, Nanonets, Klippa, Affinda, and Rossum are good at PDF to JSON. The honest part most of their guides leave out is what happens to the JSON after that. If the destination is an SAP GUI window, an Epic chart, a Jack Henry green-screen, or an Oracle EBS form, you still need someone or something to type the JSON into the form. That last leg is the leg that costs $750K a year in claims intake at a mid-market carrier we work with. Stopping at JSON is fine if your destination has a public API. Most of the legacy systems we see do not.
What happens when the PDF has fields the destination form does not?
They are dropped before the workflow runs. The extractor's structured-output target is the destination schema, so anything outside it is never returned in the first place. In the other direction, if the destination form expects a field the model could not find with confidence, the workflow refuses to type that value and surfaces the gap to a human queue. The model is allowed to say 'not present' for a field; it is not allowed to invent one.
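A sketch of that gate, assuming the extractor returns null for any field it could not read with confidence (the names are illustrative):

type Extracted = Record<string, string | null>;

function partitionForTyping(extracted: Extracted) {
  const toType: Record<string, string> = {};
  const needsReview: string[] = [];
  for (const [field, value] of Object.entries(extracted)) {
    if (value === null) needsReview.push(field); // the model said "not present"
    else toType[field] = value;
  }
  return { toType, needsReview };
}

// If needsReview is non-empty, the run stops before any keystroke and the
// gap surfaces to a human queue; no value is invented to fill it.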
How does this handle handwriting and bad scans?
The same way good document AI pipelines handle them: pre-process with an OCR engine that does well on poor scans (Textract, Azure Document Intelligence, or Google Document AI), then feed both the OCR text and the rendered page into the vision pass. Confidence scores from the OCR step are passed through, so a low-confidence field can be routed to human review instead of being typed. The Mediar-specific part is not the OCR. The Mediar-specific part is what the schema looks like and what the runtime does once a value is accepted.
Can the same workflow accept structured input that did not come from a PDF?
Yes. The Zod schema does not care where the values originated. A Stripe webhook, a Slack slash command, a daily Snowflake query, a CSV upload, and a PDF extraction all produce the same shape and feed the same execute_sequence call. The PDF path is one input lane out of several. We have customers running the same SAP posting workflow from a OneDrive PDF watcher and from a database trigger, with a single workflow definition.
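A sketch of the many-lanes-one-shape idea; executeSequence and the lane functions are illustrative names, not the product's API:

interface JournalEntryInput {
  companyCode?: string;
  documentDate?: string;
}

declare function executeSequence(input: JournalEntryInput): Promise<void>;

// PDF lane: the extraction already produced the destination shape.
async function fromPdfWatcher(extracted: JournalEntryInput) {
  await executeSequence(extracted);
}

// Webhook lane: a different source normalizes to the same shape and
// feeds the same single workflow definition.
async function fromWebhook(payload: { company: string; docDate: string }) {
  await executeSequence({ companyCode: payload.company, documentDate: payload.docDate });
}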
Does a model decide what to click during runtime?
No. After the offline recording pass writes the workflow file, the runtime is deterministic Rust that walks the accessibility tree and emits MCP tool calls. A grep of the executor crate at github.com/mediar-ai/terminator finds zero references to openai, anthropic, gemini, or any other inference SDK. The model writes the workflow once. The runtime replays it. Two identical inputs produce two identical action sequences, which is what makes the file an audit artifact rather than a probabilistic agent trace.
What does this cost per PDF processed?
Pricing is $0.75 per minute of runtime, and the runtime is the typing pass into the destination form, not the extraction. A four-field SAP journal entry typically lands at 25 to 60 seconds of runtime, which works out to roughly $0.31 to $0.75 per document; a 30-field insurance claim form intake at 90 to 180 seconds, or about $1.13 to $2.25. The vision-pass cost on the PDF itself sits inside the same per-minute envelope for short documents and is itemized separately for very long PDFs. The $10K turn-key program fee converts to credits with a bonus, so it is effectively prepaid usage.
Where does this fall down?
Three places. First, on documents that genuinely cannot be parsed by a vision model with current quality (handwritten cursive English, multi-language carbon-copy forms, severe ink bleed), the extraction step still requires human review. Second, on destination apps that render entirely as images of text without any UI Automation tree, the accessibility-API path does not work and we fall back to image-based locators that are slower and less reliable. Third, on workflows that depend on values that change every run and were not present at recording time, the recording needs to be redone. We do not pretend the schema-shaped pipeline solves the first or third case, and the image-based fallback for the second is a workaround, not a fix.
Is this open source?
The execution layer is. The Terminator SDK that performs the UI Automation calls and locator resolution lives at github.com/mediar-ai/terminator under MIT, including the type_into_element, click_element, set_value, and get_text MCP tools. The recording processor that synthesizes the destination schema and the orchestration layer that runs the queue are commercial. A team that wants to wire PDF data entry into their own queue can build directly on Terminator without paying for the cloud product.
Adjacent walkthroughs
Same architecture, different starting points and destinations.
SAP data entry automation: one journal entry, traced field by field
The same destination-shaped pipeline applied to a four-field SAP F-02 journal entry. File and line numbers a reviewer can open.
AI tools for filling complex compliance forms
Why the form-fill products sold for SIG and CAIQ cannot fill the screen inside SAP, Epic, or Fiserv where the regulated record actually lives.
Where the AI in Mediar AI actually lives, and where it does not
The model writes the workflow once during recording. The runtime is a Rust binary calling Windows accessibility APIs with zero LLM calls in the hot path.