AI · Agents & automation

AI agents engineered to reach production.

Most agent projects stall before they ship, and the model is rarely the reason. We build agents with the orchestration, guardrails, human oversight and observability that get them into production, on open-weight models inside the EU.


◆ Supervisor-worker · MCP tools ◆ Human-in-the-loop ◆ Observable · EU-resident

An AI agent is a system that completes a task end to end: it plans the steps, calls tools and your data, and acts, with a human overseeing where it matters. The hard part in 2026 is not making one reason; it is making one that ships, because most agents stall in the orchestration and the operations around the model rather than the model itself. Argus Root builds that part first — bounded autonomy, guardrails and observability — on open-weight models hosted inside the EU.

In short

Gartner expects over 40% of agentic AI projects to be cancelled by the end of 2027 — the cause is escalating cost, unclear ROI and weak risk controls, not the model.
Beware agent washing: of the thousands of vendors claiming agentic products, Gartner estimates only around 130 offer genuine autonomous capability; the rest is automation with an agentic price tag.
The pattern that ships is bounded autonomy: allowlisted tools, a task with a measurable number on it, and production-grade logging — governed by the agent's autonomy level, not one uniform rule.
Four controls are non-negotiable before you loosen the reins: a cost ceiling per agent, human approval for irreversible actions, a kill switch, and a measured groundedness score.
Distinguish an agent's ability to act from its scope of access — conflating them is the governance gap Gartner expects to send 40% of enterprises rolling agents back by 2027.

The model is the easy part.

By most counts only about one agent use case in ten reaches production, and the cause is rarely model quality. The orchestration layer between the steps is where it breaks: a failed tool call with no retry, cost that runs away, context that overflows, no record of what the agent did. The binding constraint is usually how much a person can review, not how many agents you can spin up. Building for production means designing the failure handling, the cost ceilings and the oversight before the demo, rather than after it stalls.

What an agent needs to ship

The layers a production AI agent needs and the failure each one prevents.
Layer	What it does	Failure it prevents
Orchestration	Supervisor decomposes goals, delegates, retries	Tasks that stall on the first error
Tools (MCP)	Connects the agent to your systems and data	An agent that cannot do anything real
Memory	Working context, long-term store, episodic logs	An agent that forgets and repeats
Guardrails	Allow and deny lists, cost ceilings, residency	Runaway cost and unsafe actions
Human-in-the-loop	Escalates below a confidence threshold	Autonomous mistakes that matter
Observability	Traces every step, end to end	A system you cannot debug or audit

The failure rate is the headline of 2026, and it is stark. Depending on whose survey you read, somewhere between 86 and 88% of agent pilots never reach production, and Gartner expects more than 40% of agentic projects to be cancelled outright by the end of 2027. The cause is rarely the model, which is the part that works; it is everything around it, and the projects that die share a profile: no governance, no way to trace what the agent did, costs that ran away at scale, and no single person accountable for the outcome.

This is why the gap, rather than the model, is where we start. A demo agent reasoning well in a notebook is a long way from one that retries a failed step, stays inside a budget, keeps its context from overflowing, escalates when it is unsure, and leaves a record of everything it touched. Those are engineering problems with known answers, and designing them in before building the clever part is the difference between the one in ten that ships and the nine that impress once and quietly disappear. The model is the easy part precisely because the labs already solved it; the operations around it are still yours to get right.

An agent that can act needs rails.

Once an agent can take actions, governance is what makes it safe to deploy: allow and deny lists for the tools and domains it can reach, cost ceilings and rate limits, and a confidence threshold below which it escalates to a person instead of acting. These are design decisions, not settings you add after an incident.

The EU dimension lines up with good engineering here. Article 12 of the AI Act expects high-risk systems to log their activity, and an agent built on the Model Context Protocol produces that record as it runs, because every tool call is captured in the trace. We keep the models open-weight and inside the EU, so the traces and the data the agent touches stay in your jurisdiction. The monitoring sits on our observability work, and the classification on compliance.

Most organisations are not yet ready on this front: only about a fifth have a mature governance model for autonomous agents, which is the same gap the cancellations come from. We treat an agent as an accountable system rather than a clever feature, with the controls a regulator and a board now expect: an audit trail of its actions, a kill switch to stop it, and human sign-off on the consequential steps, all built in rather than retrofitted after the first expensive mistake. Governance is not the brake on an agent; it is what lets you let it act at all.

How do we build them?

Scoped to one workflow with a payback, engineered for the failure modes that sink the rest.

Scope to a workflow

We start from one process with a measurable return and a named owner, rather than a broad agent that impresses in a demo and stalls in a month.

Supervisor-worker orchestration

A supervisor decomposes the goal and delegates to specialist steps, with retries, checkpointing and fallback routing so a single failure does not end the run.

MCP tools & RAG grounding

Tools connected over the Model Context Protocol and retrieval over your own data, so the agent acts on your systems and answers from your knowledge.

Human-in-the-loop

Checkpoints at the confidence and risk thresholds you set, so consequential actions get a person's sign-off before they happen.

Observability & evaluation

Every step traced and outputs scored from day one, because an agent you cannot see into is one you cannot trust in production.

Sovereign deployment

Open-weight models hosted inside the EU, on your environment or ours, with prompts, outputs and traces kept in-region.

We run agents in our own operation.

We build and run automation agents ourselves, on open-weight models on our own hardware, with the same tracing and guardrails we would put around yours. That keeps us honest about the part most pitches skip: a production agent is a real investment to build and to operate, with ongoing cost for maintenance and oversight. We give you that total cost upfront, so the decision is made on the real number rather than the demo.

The loop, with the rails that make it shippable: plan, call an allowlisted tool through a guardrail, hold irreversible actions for human approval, observe, and loop — all bounded by a step limit, a cost ceiling and a kill switch, and traced end to end. The model proposes; the boundaries decide what it is actually allowed to do.

agent.yaml — allowlisted tools, autonomy bounds, human approval

# bounded autonomy: the tools and limits define what it may do
goal: resolve_refund_request
tools:
  - name: lookup_order     # read-only
    scope: read
  - name: issue_refund     # mutating, capped + gated
    scope: write
    max_amount: 200        # over 200 → human approval
require_human_approval:
  - any: [delete, external_email, payment]
bounds:
  max_steps:    12           # no runaway loops
  cost_ceiling: 2.00         # hard cap per run
  kill_switch:  enabled
observability: otel/gen_ai     # every step traced + costed
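
A minimal sketch of how a config like the one above might be enforced at run time. It assumes the YAML has already been parsed into a plain dict. The `decide` function, the `Decision` values and the `category` tag supplied by the tool layer are hypothetical names for illustration, not part of any framework, and the rule that a capped refund under the limit runs unattended unless it is tagged with an approval category is our reading of the config rather than something it states.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    NEEDS_APPROVAL = "needs_approval"
    DENY = "deny"


# Hypothetical in-memory form of agent.yaml after parsing (assumption).
POLICY = {
    "tools": {
        "lookup_order": {"scope": "read"},
        "issue_refund": {"scope": "write", "max_amount": 200},
    },
    "require_human_approval": {"delete", "external_email", "payment"},
}


def decide(tool, category=None, amount=0.0, policy=POLICY):
    """Return what the guardrail would do with one proposed tool call."""
    spec = policy["tools"].get(tool)
    if spec is None:
        return Decision.DENY  # not on the allowlist
    if category in policy["require_human_approval"]:
        return Decision.NEEDS_APPROVAL  # tagged as an irreversible action
    cap = spec.get("max_amount")
    if cap is not None and amount > cap:
        return Decision.NEEDS_APPROVAL  # over the cap: hold for a person
    return Decision.ALLOW
```

The model only proposes the call; a function of this shape, sitting outside the model, decides whether it runs.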

We operate: ◆ MCP ◆ Supervisor-worker ◆ RAG ◆ Human-in-the-loop ◆ OTel tracing ◆ Open-weight models

Why do agent pilots die before production?

The agents that fail rarely fail loudly. They work in the demo, impress a stakeholder, and then meet the messiness of real use: a tool times out and nothing retries, a long task overflows the context window and the agent forgets its own plan, a loop runs up a bill nobody capped, an output is wrong and there is no trace to explain why. Each of these is mundane, and together they are why the large majority of pilots never make it to a system anyone depends on.

None of them is a model problem, and all of them are predictable. We design the failure modes out before building the capability: retries and fallback routing so a single failed step does not end the run, checkpointing so a long task can resume rather than restart, cost ceilings and rate limits so a runaway is caught in minutes, and tracing so every run can be explained. Building for the bad day first is unglamorous and it is exactly the work the failed projects skipped, which is why their agents looked finished and were not.
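
As an illustration of retries and checkpointing, here is a small sketch in which a long task is a list of named steps, each completed step is written to a JSON file, and a rerun resumes from that file instead of starting over. The function name `run_with_checkpoints`, the file layout and the fixed attempt count are assumptions made for the example; back-off delays between attempts are left out to keep it short.

```python
import json
import os


def run_with_checkpoints(steps, path, max_attempts=3):
    """Run named steps in order, saving progress after each one.

    `steps` is a list of (name, callable) pairs; each callable receives the
    dict of results so far and returns a JSON-serialisable value.
    """
    done = {}
    if os.path.exists(path):
        with open(path) as f:
            done = json.load(f)  # resume: earlier results are reused
    for name, fn in steps:
        if name in done:
            continue  # finished in a previous run, do not repeat it
        last_error = None
        for _attempt in range(max_attempts):
            try:
                done[name] = fn(done)
                last_error = None
                break
            except Exception as exc:  # sketch only: real code would narrow this
                last_error = exc
        if last_error is not None:
            raise RuntimeError(
                f"step {name!r} failed after {max_attempts} attempts"
            ) from last_error
        with open(path, "w") as f:
            json.dump(done, f)  # checkpoint after every completed step
    return done
```

If the second of two steps keeps failing, the run stops with an error that names the step, and the next run picks up at that step with the first result already in hand.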

An agent needs an owner as well as a builder.

The single strongest predictor of whether an agent reaches production is unglamorous: a named person accountable for it. Organisations with a named agent owner convert pilots to production at well over twice the rate of those without, and the ones without are heavily over-represented among the deployments that lose money. An agent with no owner is a science project; an agent with an owner is a function with a target, a budget and someone who answers for the result.

We build so that ownership is possible rather than theoretical. The agent is scoped to a workflow with a measurable return, instrumented so its owner can see what it is doing and what it costs, and bounded so its actions are accountable rather than mysterious. The owner does not have to be technical, but they do have to exist, and we will say plainly that an agent nobody owns is one we would not advise putting into production, because the data is clear about where those end up. Accountability is not a governance nicety here; it is the thing most correlated with the agent working at all.

Scope to a workflow with a number on it.

The agents that pay back are narrow. A specific workflow with a measurable return (an SDR agent that books meetings, a support agent that resolves a ticket type, a finance agent that reconciles a ledger) has a clear before and after, and the payback shows up fast: median time-to-value across functions runs around five months, with the sharpest cases paying back in three or four. A broad, do-everything agent has no number to hit and no way to prove it worked, which is how it ends up cancelled.

So we start from one process, not a platform ambition. We pick a workflow with documented value and a clear owner, build the agent for that, prove the return, and only then widen the scope, because the data is unambiguous that governed pilots in areas with known ROI succeed and sprawling ones fail. It is a less impressive starting point than a universal assistant, and it is the one that ships and earns its keep. The discipline is to resist the demo that wows and build the agent that pays, which are rarely the same thing.

A team of agents, not a single genius.

Complex work goes better as a team than a soloist. Rather than one agent trying to hold an entire goal in its head, a supervisor decomposes the goal and delegates each part to a specialist worker, then assembles the results, which keeps each step focused, its context small and its output checkable. Multi-agent orchestration of three or more cooperating agents is moving from a fifth of deployments toward roughly half over the next year, because it is how the harder workflows genuinely get done reliably.

The architecture is also what makes the whole thing resilient and observable. With retries, checkpointing and fallback routing between the steps, one worker failing is a recoverable event rather than a collapsed run, and because each agent has a defined job, a failure is traceable to a step rather than lost in a monolith's reasoning. We build supervisor-worker systems for the workflows that warrant them and keep a single agent where that is genuinely enough, because more agents is more to operate and the right number is the smallest that does the job well.
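
The supervisor-worker shape can be sketched in a few lines. In this illustration the plan is a fixed list of step names (in a real system the supervisor model would produce it), each step has an ordered list of workers to try, and every attempt is logged against its step. `Supervisor`, `StepResult` and the worker signature are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class StepResult:
    step: str
    worker: str
    ok: bool
    output: object = None
    error: str = ""


@dataclass
class Supervisor:
    # step name -> ordered list of (worker_name, callable); assumption for the sketch
    workers: dict
    log: list = field(default_factory=list)

    def run(self, plan, payload):
        """Delegate each planned step; fall back to the next worker on failure."""
        results = {}
        for step in plan:
            for worker_name, worker in self.workers[step]:
                try:
                    out = worker(payload, results)
                except Exception as exc:
                    self.log.append(StepResult(step, worker_name, False, error=str(exc)))
                    continue  # fallback routing: try the next worker for this step
                self.log.append(StepResult(step, worker_name, True, out))
                results[step] = out
                break
            else:
                # no worker succeeded: the failure is attributed to one named step
                raise RuntimeError(f"all workers failed for step {step!r}")
        return results
```

Because each worker sees only the payload and the results of earlier steps, its context stays small, and the log shows which worker on which step failed.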

What is MCP, and why does it matter?

An agent is only useful if it can reach your tools and data, and the Model Context Protocol has become the standard way to connect it. Now stewarded by the Linux Foundation and adopted across the major labs, MCP lets an agent call tools through one open interface rather than a tangle of bespoke integrations, which reduces vendor lock-in and makes multi-vendor agent ecosystems normal. Alongside agent-to-agent protocols, it is the plumbing that lets the pieces of an agentic system be swapped without rebuilding the whole.

There is a governance dividend and a caution. Because every tool call passes through the protocol, it is captured, which gives you the activity record the AI Act expects of high-risk systems almost for free. But an open protocol abstracts risk rather than removing it: an agent that can call tools can call them wrongly, so the same connection needs centralised authorisation, scoped permissions and security monitoring around it. We build on MCP for the openness and the audit trail, and we put the controls around it rather than assume the standard makes the agent safe by itself.
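
To make "controls around the protocol" concrete, here is a sketch of a single choke point that authorises each tool call against scoped grants and records every attempt, including refused ones. It deliberately does not use any MCP SDK: `ToolGateway`, the scope strings and the audit entry fields are assumptions for illustration, standing in for whatever sits in front of the real tool servers.

```python
import time


class ToolGateway:
    """Single choke point for tool calls: authorise, execute, record."""

    def __init__(self, tools, grants):
        self.tools = tools    # tool name -> (required_scope, callable)
        self.grants = grants  # agent id -> set of scopes it holds
        self.audit = []       # append-only record of every attempt

    def call(self, agent_id, tool, **args):
        entry = {"ts": time.time(), "agent": agent_id, "tool": tool,
                 "args": args, "outcome": "denied"}
        self.audit.append(entry)  # recorded even if the call is refused
        if tool not in self.tools:
            raise PermissionError(f"unknown tool {tool!r}")
        scope, fn = self.tools[tool]
        if scope not in self.grants.get(agent_id, set()):
            raise PermissionError(f"{agent_id} lacks scope {scope!r}")
        try:
            result = fn(**args)
        except Exception:
            entry["outcome"] = "error"
            raise
        entry["outcome"] = "ok"
        return result
```

The audit list is the activity record; the grants table is what keeps an agent's ability to act separate from its scope of access.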

Grounded in your data, not the model's guesses.

An agent that answers from the model alone will, sooner or later, answer confidently and wrongly. Grounding it in your own data through retrieval changes that: before the agent acts or answers, it retrieves the relevant passages from your documents, records and systems, so the response rests on your knowledge rather than the model's training. This is what makes an agent trustworthy on your business rather than merely fluent about the world in general.

The retrieval also keeps the sensitive data where it belongs. The documents the agent draws on stay in your sovereign store, searched in place rather than shipped to a third party, which matters as much for compliance as for accuracy. We connect the agent to your data over the same governed retrieval we build in our RAG and knowledge systems work, with the vector store inside the EU, so the agent is grounded in current, authoritative, in-jurisdiction information rather than a plausible guess. An agent acting on your systems has to be right about them, and retrieval is how it stays right as the underlying data changes.
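
A toy sketch of the grounding step, under loud assumptions: word overlap stands in for a real vector search, the documents live in an in-memory dict rather than a sovereign store, and the function names are invented for the example. What it illustrates is the control flow: retrieve first, build the prompt from the retrieved passages, and decline to answer when nothing relevant is found.

```python
import re


def tokens(text):
    """Lower-cased word set; a crude stand-in for an embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, documents, k=2, min_overlap=1):
    """Rank documents by words shared with the query (assumption: not real vector search)."""
    q = tokens(query)
    scored = []
    for doc_id, text in documents.items():
        overlap = len(q & tokens(text))
        if overlap >= min_overlap:
            scored.append((overlap, doc_id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:k]]


def grounded_prompt(query, documents):
    """Build a prompt that cites sources, or return None when nothing supports an answer."""
    hits = retrieve(query, documents)
    if not hits:
        return None  # caller should escalate rather than let the model guess
    context = "\n".join(f"[{d}] {documents[d]}" for d in hits)
    return f"Answer using only the sources below.\n{context}\n\nQuestion: {query}"
```

Returning `None` instead of an ungrounded prompt is the design choice that matters here: an empty retrieval becomes an escalation, not a confident guess.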

Human-in-the-loop, set to your risk.

Autonomy is a dial, not a switch. The question is never whether a human is involved but where, and the answer follows the stakes: a low-risk, reversible action can run unattended, while a consequential or irreversible one (moving money, sending an external message, changing a record that matters) waits for a person's sign-off. We set those checkpoints at the confidence and risk thresholds you choose, so the agent acts freely where it is safe and pauses where it is not.

This is what lets an agent be useful without being reckless. An agent held back by a human on every trivial step is no faster than doing the work yourself; one let loose on every step is a liability waiting to happen. The right design puts the person exactly where their judgement is worth the interruption and nowhere else, and adjusts that line as the agent earns trust on a given task. The threshold is yours to set and ours to enforce, because how much autonomy you are comfortable with is a business decision rather than a technical default.
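
The dial can be sketched as a small routing rule. This is an illustration under stated assumptions: the `AutonomyPolicy` class, the three outcome strings, the default threshold of 0.8 and the set of irreversible action types are all hypothetical, and how a confidence score is obtained is outside the sketch.

```python
from dataclasses import dataclass


@dataclass
class AutonomyPolicy:
    # Confidence below this value escalates to a person (illustrative default).
    threshold: float = 0.8
    # Action types that always wait for sign-off (assumed examples).
    irreversible: frozenset = frozenset({"payment", "external_email", "delete"})

    def route(self, action_type, confidence):
        """Decide whether an action runs unattended or waits for a person."""
        if action_type in self.irreversible:
            return "hold_for_approval"  # irreversible: always a human sign-off
        if confidence < self.threshold:
            return "escalate"           # unsure: hand over instead of acting
        return "run"
```

Lowering `threshold` for a task on which the agent has earned trust moves the line without touching the rule for irreversible actions.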

Guardrails that stop the costly mistake.

An agent that can act can act badly, and the damage is bounded only by what you let it reach. Guardrails are the limits drawn before it runs: allow and deny lists for the tools and domains it may touch, so it cannot wander into systems it has no business in; cost ceilings and rate limits, so a loop or a runaway cannot spend without bound; and a kill switch that stops it cleanly when something is wrong. These are not reactions to an incident; they are the conditions under which acting autonomously is acceptable at all.

The point of the limits is to make the worst case survivable rather than catastrophic. A guarded agent that hits a problem burns a capped amount, touches only sanctioned systems, and can be halted; an unguarded one can run up a bill, take an action it should never have been able to take, and leave you discovering the scope afterwards. We design the guardrails to your risk tolerance and the workflow's stakes, so the freedom the agent has is freedom you decided to grant rather than freedom it took because nobody fenced it in.
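
A sketch of the runtime side of those limits: one object per run that counts steps and spend, and refuses the next step once a limit would be crossed or the kill switch is set. `RunGuard`, `BudgetExceeded` and the default limits are illustrative assumptions; the per-step cost would come from whatever metering the deployment has.

```python
import threading


class BudgetExceeded(Exception):
    """Raised when a run may not take another step."""


class RunGuard:
    """Tracks steps and spend for one run and refuses to continue past the limits."""

    def __init__(self, max_steps=12, cost_ceiling=2.00):
        self.max_steps = max_steps
        self.cost_ceiling = cost_ceiling
        self.steps = 0
        self.spent = 0.0
        self.killed = threading.Event()  # may be set from another thread

    def kill(self):
        """Engage the kill switch; every later charge() is refused."""
        self.killed.set()

    def charge(self, cost):
        """Call before each step with its estimated cost; raises if it may not run."""
        if self.killed.is_set():
            raise BudgetExceeded("kill switch engaged")
        if self.steps + 1 > self.max_steps:
            raise BudgetExceeded("step limit reached")
        if self.spent + cost > self.cost_ceiling:
            raise BudgetExceeded("cost ceiling reached")
        self.steps += 1
        self.spent += cost
```

Checking before the step rather than after it is what keeps the worst case capped: the step that would break the ceiling never runs.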

An agent you cannot see into is one you cannot trust.

The single most common gap in the failed projects is the inability to say what the agent did and why. A system that takes actions on your behalf has to be observable end to end: every step traced, every tool call recorded, every decision attributable, so that when something goes wrong you have a record to investigate rather than a black box to shrug at. Without that, the agent cannot be trusted in production, debugged when it misbehaves, or evidenced to a regulator, and that is precisely where most pilots stall.

Tracing is half of it; evaluation is the other. We score the agent's outputs from day one against what good looks like, so quality is measured rather than assumed, and so the slow drift, where an agent that worked at launch quietly degrades as data and prompts change, is caught as a falling number rather than a customer complaint. This runs on the same observability footing as the rest of the infrastructure, because an agent is a production system like any other, and a production system you cannot see into is one waiting to fail quietly. Visibility is what turns an agent from a hopeful experiment into something you can stand behind.
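
Catching drift "as a falling number" can be as simple as comparing a rolling mean of recent evaluation scores with a launch baseline. The sketch below assumes scores in the range 0 to 1 from some evaluator that is not shown; `DriftMonitor`, the window size and the tolerance are illustrative choices, not a recommended configuration.

```python
from collections import deque
from statistics import mean


class DriftMonitor:
    """Compare a rolling mean of recent scores with a launch baseline."""

    def __init__(self, baseline, window=20, tolerance=0.05):
        self.baseline = baseline
        self.tolerance = tolerance
        self.recent = deque(maxlen=window)  # oldest scores fall off automatically

    def record(self, score):
        """Add one evaluation score; return True if quality has drifted."""
        self.recent.append(score)
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough data yet to judge
        return mean(self.recent) < self.baseline - self.tolerance
```

A `True` from `record` is the signal to alert the agent's owner, ideally well before a customer would have noticed.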

Sovereign by default: open-weight, in the EU.

An agent sees your most sensitive material: the prompts, the documents it retrieves, the data it acts on, and the traces of everything it did. Sending all of that to a model hosted by a foreign provider is a sovereignty decision made by accident, and for regulated data it is the wrong one. We deploy open-weight models on infrastructure inside the EU, on your environment or ours, so the prompts, the outputs and the traces stay in your jurisdiction rather than crossing it on every call.

Open weights matter beyond residency. A model you host is one no vendor can deprecate, reprice or change underneath you, which removes a dependency that a per-call API quietly builds in, and it makes air-gapped deployment possible where the data demands it. The capability gap between the best open-weight models and the proprietary frontier has narrowed to the point where, for most agentic workflows, the open option is more than enough and brings control the hosted one cannot. We default to it because sovereignty, cost predictability and freedom from lock-in all point the same way, and reach for a hosted model only where a workload genuinely needs the frontier.

What does an agent really cost at scale?

A demo costs almost nothing, which is exactly what makes the real bill a shock. A production agent carries a build cost and then an ongoing one: the model inference, which multiplies with every step and every run; the orchestration that keeps it reliable; and the human oversight the consequential actions require. Underestimating the cost of running agents at scale is one of the named reasons Gartner gives for the coming wave of cancellations, because the per-run economics that looked trivial in testing become real money at volume.

So we model the total cost of ownership before you commit, not after the first month's invoice. That means the inference cost per run at expected volume, the orchestration and oversight, and the maintenance an agent needs as models and data shift, set against the measurable return of the workflow it serves. Self-hosting open-weight models on EU infrastructure changes those economics in your favour at scale, which is the same crossover our cloud cost optimization work models for any heavy inference workload. The business case for an agent should rest on its real running cost, because an agent that works but loses money is still a failed project.
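
The arithmetic behind such a model is simple enough to sketch. The two functions below are a simplified illustration, not our costing method: the parameter names are assumptions, orchestration is folded into the maintenance figure, and any numbers passed in are the caller's own estimates.

```python
def monthly_net(runs_per_month, steps_per_run, cost_per_step,
                oversight_hours, hourly_rate, maintenance, value_per_run):
    """Net monthly return of an agent: value delivered minus running cost."""
    inference = runs_per_month * steps_per_run * cost_per_step  # grows with every step and run
    oversight = oversight_hours * hourly_rate                   # human review time
    running_cost = inference + oversight + maintenance
    return runs_per_month * value_per_run - running_cost


def payback_months(build_cost, net_per_month):
    """Months to recover the build cost, or None if the agent never pays back."""
    if net_per_month <= 0:
        return None  # works but loses money: still a failed project
    return build_cost / net_per_month
```

Because inference scales with both steps and runs, doubling either one doubles that line of the bill, which is why a per-run cost that looked trivial in testing deserves a place in the model.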

What is, and isn't, an agent?

The word agent has been stretched to cover almost anything with a model in it, and a good deal of what is sold as agentic is a chatbot with a new label or a fixed script with a language model bolted on. The distinction that matters is whether the system truly plans and acts, choosing steps and carrying them out toward a goal, or merely responds. Calling a deterministic workflow an agent does not make it one, and it does set up the disappointment that follows when it cannot do what the name implied.

We are deliberately plain about this, including against our own interest. Plenty of problems are better solved by a simple, predictable automation than by an agent, and where that is true we will tell you, because a fixed workflow is cheaper to build, cheaper to run and easier to trust than an agent doing the same job with more moving parts. An agent earns its complexity only where the task genuinely needs planning, judgement and adaptation; everywhere else, the honest answer is the simpler one, and selling you an agent you did not need would be the kind of agent washing we are warning you about.

Who needs an agent, and when should you not build one?

An agent fits a workflow that is repetitive enough to be worth automating, varied enough to need judgement rather than a fixed rule, and valuable enough that the build and running cost pays back, with a person available to own it. Customer service, sales development, software work and operational reconciliation are where the returns are most consistently proven, but the test is the shape of the workflow rather than the industry: a process with a clear goal, real variation, and a measurable outcome is a candidate; a vague ambition to add AI is not.

When those conditions are absent, the honest answer is to wait or to build something simpler. A workflow that never varies wants a script; a problem with no measurable return wants a clearer definition before any agent; an organisation with no one to own the result is not ready regardless of the use case. We would rather tell you a workflow is not an agent than build one that joins the large majority that never ship, because our interest is in the agents that reach production and earn their keep, not in the count of agents started. Name the workflow, and we will tell you honestly whether an agent is the right tool for it.

Questions buyers ask.

What is an AI agent?

A system that completes a task end to end rather than answering a question. It plans the steps, calls tools and data to carry them out, and acts, with a human overseeing the consequential parts. The difference from a chatbot is that an agent does the work instead of telling you how to.

Why do most agent projects fail?

Not for model reasons. Only around one use case in ten reaches production, and the cause is the orchestration and operations layer: no retry on a failed step, cost that runs away, context overflow, no trace of what happened. We design those failure modes out before building the rest.

What is MCP, and why does it matter?

The Model Context Protocol is the standard way to connect an agent to tools and data, now governed by the Linux Foundation and adopted across the major labs. It reduces vendor lock-in, and because every tool call is captured, it also produces the activity record the AI Act expects of high-risk systems.

How do you stop an agent doing something costly or wrong?

With guardrails designed in from the start: allow and deny lists for the tools and domains it can reach, cost ceilings and rate limits, and a confidence threshold below which it escalates to a person rather than acting. The human-in-the-loop checkpoint is set to your risk tolerance.

Can the models run inside the EU?

Yes. We deploy open-weight models on infrastructure inside the EU, on your environment or ours, including air-gapped setups. Prompts, outputs and the agent's traces stay in your jurisdiction, which avoids the cross-border transfer questions GDPR raises for sensitive data and keeps the records the AI Act expects of high-risk systems under your control.

What does an agent really cost to run?

More than a demo suggests. A production agent carries build cost plus a monthly operating cost for the model, the orchestration and the human oversight, and maintenance adds to that over time. We model the total cost of ownership before you commit, so the business case rests on the real figure.

How long until an AI agent pays back?

For a well-scoped workflow, faster than most expect: median time-to-value across functions is around five months, with sharper cases like sales-development agents paying back in three to four. The key is scoping to one process with a measurable return rather than a broad assistant, because a vague agent has no number to hit and tends to be the one that gets cancelled.

Do we need a dedicated owner for the agent?

Yes, and it is the strongest predictor of success. Organisations with a named agent owner convert pilots to production at over twice the rate of those without, and the ones without are over-represented among deployments that lose money. The owner need not be technical, but they must exist; an agent nobody owns is one we would not advise putting into production.

What is multi-agent orchestration?

An architecture where a supervisor decomposes a goal and delegates each part to a specialist worker agent, then assembles the results, rather than one agent holding the whole task. It keeps each step focused and traceable and is moving from about a fifth of deployments toward roughly half over the next year. We use it where a workflow warrants it and keep a single agent where that is enough.

How do you keep an agent grounded and accurate?

By grounding it in your own data through retrieval: before it acts or answers, it pulls the relevant passages from your documents and systems, so the response rests on your knowledge rather than the model's guesses. The data stays in your sovereign store, searched in place, which keeps it accurate and compliant at once. It connects to the retrieval we build in our RAG and knowledge systems work.

Can an agent run on open-weight models in the EU?

Yes, and we default to it. We deploy open-weight models on infrastructure inside the EU, on your environment or ours, including air-gapped setups, so prompts, outputs and traces stay in your jurisdiction. Open weights also remove the risk of a vendor deprecating or repricing the model underneath you, and for most agentic workflows they are more than capable enough.

Is what we want really an agent, or just automation?

A fair question, and often the answer is automation. An agent plans and acts toward a goal with judgement; a fixed, predictable process is better served by a simple script, which is cheaper to build, run and trust. We will tell you when a workflow does not need an agent, because selling you one you do not need is the agent washing the industry is full of.

How do you make an agent auditable for the AI Act?

By building it so every step is traced and every tool call recorded, which the Model Context Protocol produces as it runs. That gives the activity log Article 12 expects of high-risk systems, kept in-region alongside a kill switch and human sign-off on consequential actions. The classification of whether your use is high-risk follows our compliance work.

Agent scoping

Name the workflow. We'll tell you if an agent ships it.

Bring the process you want an agent to take on. We assess whether it can reach production, what the orchestration and oversight would take, and the real cost to build and run it, before you commit to anything. If it is not ready to be an agent yet, we will say so.


◆ Built for production, not the demo ◆ Models run inside the EU ◆ One named operator, answerable