A reference, not a listicle
Tools for robotic process automation: the 28 named primitives an RPA runtime is actually built from.
Every page that ranks for this topic answers the question with a ranked list of vendor names. This page answers it differently, because the engineer scoping a build is not asking which logo to buy; they are asking what operations a runtime has to expose. The honest answer is a catalogue of named primitives. Mediar publishes its full catalogue, 28 tools across five categories, as MCP tools inside the open-source Terminator agent. The rest of this page enumerates them.
Direct answer · verified 2026-04-30
An RPA runtime needs a small catalogue of named primitive operations, on the order of thirty of them, in five categories: element interaction, vision and detection, window and application management, browser automation, and workflow execution.
The Mediar runtime ships 28 of them, listed below by name. The full source is at github.com/mediar-ai/terminator under the terminator-mcp-agent crate. Vendor activity counts (UiPath at 200+, Automation Anywhere at 600+) are larger because those vendors add a new activity for every common workflow shape; the catalogue here is small on purpose, with composition doing the work that activities do elsewhere.
What other vendors publish as their tool count
A larger published count is easy to demo and hard to maintain. Mediar's catalogue is on the small side because the runtime is built around composition, not a per-shape activity for every workflow.
Category 1 of 5: element interaction (12 tools)
The largest category, and the one a buyer should read most closely. Every step a recorder captures and every step a runner replays ultimately resolves to one of these twelve tools. The shared contract is a selector (role plus visible name plus optional tree path), and the receipt is a state change in the OS, not a pixel match.
| Tool name | What it does |
|---|---|
| click_element | Single, double, or right click. Tries focus first, falls back to a synthetic click event when the platform refuses to focus. |
| type_into_element | Aggregated text entry against a focused element. Honors clear_before_typing so the recorder does not have to emit a separate Ctrl+A and Delete pair. |
| press_key | Standalone Enter, Tab, Escape, Delete, plus any modifier combo. The four bare keys are the same four codes the recorder treats as meaningful (0x0D, 0x09, 0x1B, 0x2E). |
| invoke_element | The accessibility-API equivalent of a click. On a Submit button this fires the same Win32 invoke pattern a screen reader would, regardless of where the visible pixels are. |
| activate_element | Bring an element to the foreground without changing its content. Used when the bot needs to focus a field that is rendered but covered by another window. |
| scroll_element | Scroll a specific element by pixels or pages. The unit is the element, not the screen, so a scroll inside a sidebar does not move the page underneath it. |
| select_option | Pick a value out of a combo box or list box by visible label, with the accessibility role enforced so a button with the same label is never selected by mistake. |
| set_selected | Toggle a checkbox or radio button to a target state. Idempotent: if the box is already checked, the call is a no-op. |
| validate_element | Returns whether an element exists, plus the element handle when it does. The replay loop calls this before every action so a missing element fails the step instead of clicking somewhere wrong. |
| wait_for_element | Poll until an element appears, disappears, or hits a target state. The condition language is small on purpose: visible, hidden, enabled, disabled, focused. |
| highlight_element | Draw a colored outline around an element on screen. Used during recording so the user can confirm the agent picked the right control before it tries to click it. |
| stop_highlighting | Clear all outlines drawn by highlight_element. Required before screenshot capture so the highlight does not bleed into a downstream OCR pass. |
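The shared selector contract can be sketched as TypeScript shapes. Everything below is illustrative: the real argument schemas live in the terminator-mcp-agent crate, and the field names here (selector, clickType, treePath) are assumptions, not the published JSON schema.

```typescript
// Hypothetical shapes, for illustration only; the real schemas are
// defined by the terminator-mcp-agent crate and may differ.
interface Selector {
  role: string;      // accessibility role, e.g. "Button"
  name: string;      // visible name, e.g. "Submit"
  treePath?: string; // optional captured ancestry, used for disambiguation
}

interface ClickElementArgs {
  selector: Selector;
  clickType: "single" | "double" | "right";
}

// A recorder step for clicking a Submit button might serialize to:
const step: ClickElementArgs = {
  selector: { role: "Button", name: "Submit" },
  clickType: "single",
};

console.log(JSON.stringify(step));
```

The point of the sketch is the contract, not the field names: one selector type shared by all twelve tools, so a recording is a list of (tool, selector, arguments) triples rather than twelve different argument grammars.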
Category 2 of 5: vision and detection (1 tool, 5 sources)
The smallest category by tool count and the most interesting one architecturally. There is one published tool, get_window_tree, and behind it sits a five-value VisionType enum that names every source the runtime is allowed to read from. The enum is at lines 113-128 of node_modules/@mediar-ai/terminator/index.d.ts in the Mediar product monorepo. Each source is a different bet about what a screen actually exposes.
| Tool / source | What it reads |
|---|---|
| get_window_tree (tool) | Returns the accessibility tree for a window. The five vision sources below feed back into this tree as alternative element sources, keyed by the VisionType enum. |
| VisionType::UiTree (source) | Default. The Windows UI Automation tree the OS publishes. Cheapest, deterministic, the source screen readers read from. |
| VisionType::Dom (source) | The DOM exposed by an attached Chromium tab. Used when the foreground process is a browser, with the same selector grammar applied to a different tree. |
| VisionType::Ocr (source) | Tesseract output run against the captured pixels. The fallback when the accessibility tree is empty (Java Swing, raw OpenGL, some Citrix sessions). |
| VisionType::Omniparser (source) | Microsoft Research's screen parser model. Bounding boxes plus role guesses for screens that have no useful tree at all. Highest cost, highest coverage. |
| VisionType::Gemini (source) | Multimodal Gemini Vision. Used inside the gemini_computer_use loop when no other source has the answer and a model has to decide where to click next. |
“The five-source enum is the buyer's audit trail. If the runtime is honest, the trace tells you which source resolved the click, and you know how brittle that step is going to be at three in the morning.”
VisionType, node_modules/@mediar-ai/terminator/index.d.ts
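As a sketch, the five-source enum reads roughly like this in TypeScript. The member names follow the sources described above, but the string values and the exact declaration shape (enum versus string union) in index.d.ts are not reproduced here, so treat the details as illustrative.

```typescript
// Illustrative sketch of the five-value source enum described above.
// String values are invented; the real declaration may differ.
enum VisionType {
  UiTree = "ui_tree",        // OS accessibility tree (default)
  Dom = "dom",               // attached Chromium tab
  Ocr = "ocr",               // Tesseract over captured pixels
  Omniparser = "omniparser", // screen-parser model
  Gemini = "gemini",         // multimodal-model fallback
}

// Cheapest-first ordering, which is how an audit reviewer reads a
// trace: the further down the list a step resolved, the more brittle it is.
const costOrder: VisionType[] = [
  VisionType.UiTree,
  VisionType.Dom,
  VisionType.Ocr,
  VisionType.Omniparser,
  VisionType.Gemini,
];

console.log(costOrder.length); // five sources, one enum
```

The ordering is the useful part: a deployment where most clicks resolve at the top of the list is deterministic; one that leans on the bottom two is paying model-inference cost per click.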
Category 3 of 5: window and application (5 tools)
The orchestration layer between the OS and the workflow. These tools are how the runner decides where the next step is going to land. A workflow that opens SAP GUI, focuses an Outlook compose window, and screenshots a confirmation dialog uses one of these five primitives at every transition.
| Tool name | What it does |
|---|---|
| get_applications | Enumerate the running processes that the OS reports as having a top-level window. Used to confirm SAP GUI is up before the recording starts. |
| open_application | Launch an application by name or by full path. Returns when the main window is focusable, not just when the process exists. |
| activate_application | Bring a window to the foreground. The receipt is a focus change in the OS, not a hover, so a stale highlight does not count as success. |
| capture_screenshot | Capture a window or an entire monitor. Lossless PNG by default. The recorder takes a before-and-after pair on every meaningful event for the analyser to reason over. |
| list_monitors | Multi-monitor metadata: id, name, scale factor, primary flag. The replay loop needs this to translate captured coordinates into the runtime's display geometry. |
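The coordinate translation that list_monitors exists to support can be sketched in a few lines. The field names (scaleFactor, x, y) and the translation itself are assumptions for illustration; the real replay loop's geometry handling is not reproduced here.

```typescript
// Illustrative only: translating a logical coordinate captured at
// record time into another display geometry using per-monitor scale
// metadata of the kind list_monitors is described as returning.
interface Monitor {
  id: string;
  scaleFactor: number; // e.g. 1.5 for 150% display scaling
  x: number;           // monitor's top-left in virtual-screen space
  y: number;
}

function toPhysical(m: Monitor, logicalX: number, logicalY: number) {
  return {
    x: m.x + Math.round(logicalX * m.scaleFactor),
    y: m.y + Math.round(logicalY * m.scaleFactor),
  };
}

const primary: Monitor = { id: "0", scaleFactor: 1.5, x: 0, y: 0 };
console.log(toPhysical(primary, 100, 40)); // { x: 150, y: 60 }
```

This is why the metadata matters: a click recorded on a 100% laptop display and replayed on a 150% monitor lands 50% off target unless the runtime does exactly this arithmetic first.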
Category 4 of 5: browser automation (6 tools)
A surface inside a surface. Six tools that let the runtime drop out of the OS-level world and into a Chromium tab when the workflow needs the DOM rather than the accessibility tree. The tools mirror the structure of the OS-level catalogue (open, navigate, capture, list, close), which is deliberate: the same mental model carries over so the recordings stay legible.
| Tool name | What it does |
|---|---|
| open_url | Open a URL in the default browser, or in a named one. The receipt is the browser tab landing in a focusable state, not just the OS handing the request off. |
| navigate_browser | Navigate the active tab. Equivalent to typing in the omnibox and pressing Enter, but without losing the tab handle. |
| execute_browser_script | Run a piece of JavaScript inside the active tab and return its stringified result. Used for scraping and for setting fields the accessibility tree refuses to expose. |
| capture_dom | Return a stripped DOM tree (ids, roles, names, bounds) for the active tab. The browser-side equivalent of get_window_tree. |
| get_browser_tabs | List open tabs across attached Chromium windows. Each tab carries a stable handle so the bot can switch back without re-finding it. |
| close_tab | Close a tab by handle. Distinct from navigating away because some flows need the tab gone before the next step starts. |
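The execute_browser_script contract (run JavaScript, get back a stringified result) implies a pattern on both sides of the call. The sketch below is illustrative: the snippet, the selector, and the result shape are all invented, not taken from the tool's real schema.

```typescript
// Illustrative argument for execute_browser_script: a snippet that
// reads a field value out of the DOM. Because the tool returns the
// script's stringified result, the snippet ends in an expression.
const script = `
  const input = document.querySelector('input[name="invoice_total"]');
  JSON.stringify({ value: input ? input.value : null });
`;

// The downstream step's half of the contract: parse the stringified
// result back into data before reasoning over it.
function readScriptResult(raw: string): { value: string | null } {
  return JSON.parse(raw);
}

const result = readScriptResult('{"value":"1,204.50"}');
console.log(result.value); // "1,204.50"
```

The stringification is the deliberate narrow waist: the tool never hands a live DOM handle across the process boundary, only serialized data a later step can log, audit, and replay.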
Category 5 of 5: workflow execution (4 tools)
Four tools that turn the catalogue into a runnable program. Two executors (YAML and TypeScript) let humans and authoring agents compose the primitives above into named sequences. run_command is the escape hatch for steps that are genuinely shell calls. gemini_computer_use is the model-driven tool that the runtime falls into when a recording is incomplete.
| Tool name | What it does |
|---|---|
| execute_sequence | Run a YAML sequence of steps with retry, timeout, and fallback per step. The default executor for the no-code recordings produced at app.mediar.ai/web. |
| execute_ts_workflow | Run a TypeScript workflow file authored against the @mediar-ai/workflow SDK. Gives engineers loops, conditionals, and typed inputs the YAML form does not. |
| run_command | Run a shell command on the host. The escape hatch when a step is genuinely a CLI invocation (a PowerShell script, a curl against an internal endpoint). |
| gemini_computer_use | Hand a goal and a process name to a Gemini-driven loop that picks the next click using the vision tools above. Used when the recording is incomplete or the target UI changed since record time. |
What a workflow built from these tools actually looks like
Reading the catalogue in the abstract is not enough. The following YAML shows the shape of a real Mediar workflow that books a patient appointment in Epic Hyperspace. Every step names a tool from the catalogue above, every selector resolves through the four-strategy match cascade, and the entire program is ten arguments distributed across six steps.
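A sketch in that shape, with invented selectors and argument names (the real recording schema produced by the recorder is not reproduced here, and the `{{ patient_mrn }}` templating syntax is an assumption):

```yaml
# Illustrative sketch: six steps, ten arguments, every step a tool
# from the catalogue above. Selectors and schema details are invented.
steps:
  - tool: open_application
    app: "Epic Hyperspace"
  - tool: wait_for_element
    selector: "window:Appointment Desk"
    condition: visible
  - tool: click_element
    selector: "button:New Appointment"
  - tool: type_into_element
    selector: "edit:Patient MRN"
    text: "{{ patient_mrn }}"
    clear_before_typing: true
  - tool: select_option
    selector: "combobox:Visit Type"
    option: "Follow-up"
  - tool: click_element
    selector: "button:Schedule"
```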
A run of this workflow against a live system produces a trace that looks like the terminal below. Every step shows which tool was called, which selector it resolved against, and how long the OS took to confirm the change. The trace is the audit artifact; it is what gets handed to a compliance review and what gets replayed when a step starts failing two months later.
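In rough shape, that trace reads like the lines below. The column layout is invented for illustration; the content per line (tool called, selector resolved, outcome, OS confirmation latency, vision source) follows the description above.

```
step 1/6  open_application   "Epic Hyperspace"         ok  2,140 ms
step 2/6  wait_for_element   window:Appointment Desk   ok    380 ms  (UiTree)
step 3/6  click_element      button:New Appointment    ok     95 ms  (UiTree)
step 4/6  type_into_element  edit:Patient MRN          ok    210 ms  (UiTree)
step 5/6  select_option      combobox:Visit Type       ok    160 ms  (UiTree)
step 6/6  click_element      button:Schedule           ok    120 ms  (UiTree)
```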
Mapping the catalogue to vendor activity names
Every RPA vendor publishes its own labels for the same operations. The grid below names the equivalents so a buyer comparing options can read across the row, not just down the column.
| Operation | Vendor activity equivalents | Mediar primitive |
|---|---|---|
| click_element | UiPath: Click activity. Power Automate Desktop: Click UI element in window. Blue Prism: Press button stage. Automation Anywhere: Mouse Click action. | click_element. The same primitive, callable from every supported transport: MCP, the TypeScript SDK, the YAML executor, the no-code recorder. |
| type_into_element | UiPath: Type Into. Power Automate Desktop: Populate text field. Blue Prism: Write stage. Automation Anywhere: Set Text. | type_into_element with clear_before_typing as a required boolean. The contract is one named primitive, not four overlapping activities split across menus. |
| wait_for_element | UiPath: Element Exists, Find Element, Wait Element Appear, Wait Element Vanish. Power Automate Desktop: Wait for element. Blue Prism: Wait stage. | wait_for_element with a small condition vocabulary (visible, hidden, enabled, disabled, focused). One primitive, five conditions. |
| get_window_tree | UiPath: UI Explorer (designer-only, not callable from a workflow). Power Automate Desktop: no equivalent at the activity layer. Blue Prism: Application Modeller (designer-only). | get_window_tree, callable at runtime from any step. The accessibility tree is data, not a designer tab. |
| execute_browser_script | UiPath: Inject JS Script (browser activity package). Power Automate Desktop: Run JavaScript on web page. Automation Anywhere: Execute JavaScript. | execute_browser_script, scoped to the active tab handle. Returns the stringified result so a downstream step can read it. |
| gemini_computer_use | UiPath: Autopilot (closed source, vendor-hosted). Power Automate Desktop: Copilot in Power Automate (closed source, vendor-hosted). Blue Prism: Decipher IDP (document-only). | gemini_computer_use as a named runtime tool, source visible at github.com/mediar-ai/terminator. The model and the runtime are decoupled so the model is replaceable. |
Why the catalogue is small, on purpose
A buyer comparing vendor counts (28 versus 200 versus 600) will reflexively rank larger as better. That intuition is backwards on this surface. Every activity in a vendor library is a maintained surface area: the vendor has to keep it working across OS versions, application versions, and release cycles, and the buyer has to learn it. A library of 600 actions is a warehouse, not a toolbox.
The opposite bet is to ship a small set of OS-level primitives and let composition do the rest. Send-an-email is not a tool; it is a four-step sequence. Read-a-cell-from-Excel is not a tool; it is a click_element on the cell plus a get_text on the editor. The composition layer is the YAML executor, which is itself one of the 28 tools (execute_sequence). The tradeoff is that the recordings are slightly longer; the win is that the vendor surface is small enough to read end to end and the runtime survives an OS update without a release.
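The composition bet can be made concrete. Here is "send an email" as a sequence of the catalogue's own primitives; selectors and schema details are invented for illustration, but the four-step shape is the one described above:

```yaml
# Illustrative composition, not a shipped activity: "send an email"
# built from four catalogue primitives. Selectors are invented.
steps:
  - tool: open_application
    app: "Outlook"
  - tool: click_element
    selector: "button:New Email"
  - tool: type_into_element
    selector: "document:Message body"
    text: "{{ body }}"
  - tool: click_element
    selector: "button:Send"
```

Nothing in that sequence is email-specific at the runtime layer, which is the whole argument: when Outlook ships a redesign, the selectors in the recording change, but no vendor activity pack needs a release.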
The clearest evidence for this bet is the same one that drove an F&B chain off UiPath earlier this year: their CFO told the board they were saving 70 percent on costs after switching to Mediar. Most of that saving was not the per-minute price; it was the engineering hours that stopped being spent on activity-pack maintenance.
Reading the catalogue yourself
The full source for the 28 tools listed on this page is in the open-source Terminator repository, under the terminator-mcp-agent crate. Each tool is a registered handler with a name, a JSON schema for its arguments, and a Rust implementation that calls into the Terminator desktop library. The Mediar product binds the same crate as a sidecar binary (referenced at apps/desktop/src-tauri/src/mcp_server.rs:97) so that the desktop agent and the open-source SDK share one catalogue.
A team running an internal review can clone the repo, run the agent, and call list_tools over MCP to get the live catalogue back as JSON. Two years from now there will be more entries on that list than there are today; the discipline that keeps the number small is documented in the contributing guide. The discipline matters. Most of what feels like "another tool would be useful" is, on closer inspection, a sequence the author has not yet tried to compose.
Want to see this catalogue running on your own legacy system?
A 30-minute call where we record one of your real workflows on the spot, replay it, and walk through which tools resolved which steps. No deck.
Frequently asked questions
What does the phrase "tools for robotic process automation" actually refer to?
Two different things, depending on who is asking. A buyer at a CFO offsite uses the phrase to mean a vendor product, and the listicles that rank for the phrase return UiPath, Automation Anywhere, Power Automate, and Blue Prism. An engineer scoping a build uses the phrase to mean the catalogue of named primitive operations the runtime publishes: click_element, type_into_element, get_window_tree, and so on. Most listicles answer the first version of the question. This page answers the second one, because the second one is the layer at which any RPA program is actually graded. A vendor with a smaller, sharper toolset that you can read end to end will outperform a vendor with a larger toolset that you cannot.
Why does counting tools matter when every vendor publishes a different number?
The count is a tell. UiPath ships over 200 activities, Power Automate Desktop ships over 400 actions, Automation Anywhere ships over 600. The Mediar runtime ships 28 named tools. The first three numbers grow because the vendor adds a new activity for every common workflow shape they want to sell support for: send an email, post to Slack, query a SQL database, parse a CSV. Each new activity is a new surface to maintain and a new selector grammar to learn. The 28-tool catalogue takes the opposite bet: a small set of OS-level primitives plus a YAML executor, and the workflow itself composes the higher-order behavior. "Send an email" stops being an activity and becomes a four-step sequence (open_application Outlook, click_element compose, type_into_element body, click_element send). The smaller surface is the durable one, because the OS primitives change less often than the SaaS connectors do.
Why does Mediar's tool catalogue include five vision sources, not one?
Because no single vision source covers every screen a buyer brings. The cheapest and most reliable source is the Windows UI Automation tree, which works for SAP GUI, Oracle EBS, Jack Henry, Fiserv, FIS, Epic, Cerner, Office, and most line-of-business apps. The browser DOM source covers Chromium tabs. OCR catches Java Swing, Citrix sessions, and old terminal emulators. Omniparser catches custom OpenGL surfaces and games. Gemini Vision is the fallback for anything that none of the others can describe, used inside gemini_computer_use to pick a next-click target from a screenshot. The honest tool says which source it used for a given click in the trace, so a buyer auditing a deployment can see when the runtime fell off the deterministic path. The five-source enum is in node_modules/@mediar-ai/terminator/index.d.ts under the VisionType type.
How does this catalogue compare to UiPath's activity library?
UiPath organises its activity library around use cases (UI Automation, Mail, Excel, PDF, System, etc.) and ships hundreds of activities, many of them paid add-on packages. The Mediar catalogue is organised around runtime primitives. Excel is not a category; it is the application Mediar drives via click_element and type_into_element. Mail is not a category; it is a sequence against the active mail client. The activity-library approach is friendlier to a designer who wants to drag-and-drop a complete flow without reading a tree. The primitive-catalogue approach is friendlier to anyone who has to debug a flow at three in the morning when the screen has changed and the activity is still passing. Both shapes are real choices; the right one depends on whether your team is sized for designer-led builds or for engineer-led debugging.
Are these tools the same as the activities in UiPath Studio?
Functionally overlapping, structurally different. A UiPath Studio activity is an XAML node that wraps a .NET assembly. The activity definition lives in a vendor-published package, the runtime is the UiPath Robot, and the catalogue is closed source. A Mediar tool is a named handler exposed by the terminator-mcp-agent process over the Model Context Protocol. The handler source code is in the public repo at github.com/mediar-ai/terminator under the terminator-mcp-agent crate. An RPA team can read what click_element actually does, fork it, swap the underlying find strategy, and pin a custom build. UiPath does not allow this. Whether that matters depends on whether your security or compliance review needs to inspect the runtime end to end.
What about "AI agents" and computer-use models that promise zero tools?
Computer-use models still use tools. They just hide them. An OpenAI computer-use call ultimately calls click(x,y), type(text), screenshot(), and key(name) inside the harness. The tools are there; they are just unnamed and unloggable from the buyer's side. The honest computer-use product publishes the toolset it is calling and lets the buyer audit it. Inside Mediar's runtime, the model lives behind gemini_computer_use, and every click the model issues is captured as a click_element call with a selector synthesised from the screenshot. The trace shows which tool the model picked, what selector it gave, and which vision source it used. "Zero tools" is a marketing posture, not an architecture; if you cannot see the tool list, the product is not actually toolless, it is just opaque.
How do these primitives map to the four match strategies the runtime uses at replay time?
Element interaction tools (click_element, type_into_element, set_selected, validate_element) all consume a selector. The selector is resolved at runtime by a four-strategy match cascade: exact role-plus-name match against the live tree, fuzzy name match within the same role, tree-path match using the captured ancestry, and visual fallback through the vision sources above. The cascade is the reason the catalogue does not need a separate find_by_role, find_by_name, find_by_xpath, find_by_image quartet of activities, the way UiPath does. One selector type, four strategies under the hood, one named primitive on the public side. The cascade lives in apps/desktop/src-tauri/src/focus_state.rs in the Mediar product monorepo.
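The cascade's control flow is simple enough to sketch. The real implementation is Rust in focus_state.rs and is not reproduced here; the strategy names below follow the four strategies listed above, and everything else is illustrative.

```typescript
// Sketch of the four-strategy cascade described above. The resolver
// tries each strategy in order and reports which one matched, which
// is what lets the replay trace say how a selector was resolved.
type Strategy = "exact" | "fuzzy-name" | "tree-path" | "visual";

const cascade: Strategy[] = ["exact", "fuzzy-name", "tree-path", "visual"];

function resolve(tryStrategy: (s: Strategy) => boolean): Strategy | null {
  for (const s of cascade) {
    if (tryStrategy(s)) return s; // first strategy to match wins
  }
  return null; // no strategy matched: the step fails loudly
}

// Example: a renamed control misses the exact match but is caught by
// the fuzzy name match within the same role.
console.log(resolve((s) => s === "fuzzy-name")); // fuzzy-name
```

The fall-through order is the audit property: a step that resolved at "exact" is deterministic, while one that fell to "visual" is the step to watch when the target application updates.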
Is the published list of 28 tools final, or does it grow?
It grows, but slowly, and additions get pushed back hard. The default move when a customer asks for a new behaviour is to compose it from the existing primitives rather than to ship a new tool. A new tool gets added only when the existing ones cannot reach the behaviour at all (the addition of capture_dom, for example, was forced by browser flows where the accessibility tree was strictly weaker than the DOM). The discipline is deliberate. A larger catalogue is easier to demo, harder to maintain, and a worse experience for an engineer who has to remember which of 600 actions to pick. The current 28 is a long way from the 600+ that Automation Anywhere ships, and the gap is the point.
More on the same surface
Keep reading
Workflow automation tools split by surface, not by feature count
The companion piece on the broader "workflow automation tools" category. Names the four runtime surfaces a buyer is actually choosing between, and shows the keystroke-level spec that decides which surface a given workflow lives on.
What robotic process automation is, in three numbers: six event types, four stages, four match strategies
Sibling piece on the runtime architecture. Walks the four-strategy match cascade in focus_state.rs that the tools on this page resolve their selectors against.
The meaning of robotic process automation: a side-by-side decomposition of the modern bot vs the 2003 selector recorder
The architectural argument behind why the catalogue is built around the accessibility tree rather than around recorded selectors. Two definitions of RPA, scored against each other.