AI agents engineered to reach production.
Most agent projects stall before they ship, and the model is rarely the reason. We build agents with the orchestration, guardrails, human oversight and observability that get them into production, on open-weight models inside the EU.
An AI agent is a system that completes a task end to end: it plans the steps, calls tools, queries your data, and acts, with a human overseeing where it matters. The hard part in 2026 is not making an agent that reasons; it is making one that ships, because most agents stall in the orchestration and the operations around the model rather than in the model itself. Argus Root builds that part first (bounded autonomy, guardrails and observability) on open-weight models hosted inside the EU.
In short
- Gartner expects over 40% of agentic AI projects to be cancelled by the end of 2027 — the cause is escalating cost, unclear ROI and weak risk controls, not the model.
- Beware agent washing: of the thousands of vendors claiming agentic products, Gartner estimates only around 130 offer genuine autonomous capability; the rest is automation with an agentic price tag.
- The pattern that ships is bounded autonomy: allowlisted tools, a task with a measurable number on it, and production-grade logging — governed by the agent's autonomy level, not one uniform rule.
- Four controls are non-negotiable before you loosen the reins: a cost ceiling per agent, human approval for irreversible actions, a kill switch, and a measured groundedness score.
- Distinguish an agent's ability to act from its scope of access — conflating them is the governance gap Gartner expects to send 40% of enterprises rolling agents back by 2027.
The model is the easy part.
By most counts only about one agent use case in ten reaches production, and the cause is rarely model quality. The orchestration layer between the steps is where it breaks: a failed tool call with no retry, cost that runs away, context that overflows, no record of what the agent did. The binding constraint is usually how much a person can review, not how many agents you can spin up. Building for production means designing the failure handling, the cost ceilings and the oversight before the demo, rather than after it stalls.
| Layer | What it does | Failure it prevents |
|---|---|---|
| Orchestration | Supervisor decomposes goals, delegates, retries | Tasks that stall on the first error |
| Tools (MCP) | Connects the agent to your systems and data | An agent that cannot do anything real |
| Memory | Working context, long-term store, episodic logs | An agent that forgets and repeats |
| Guardrails | Allow and deny lists, cost ceilings, residency | Runaway cost and unsafe actions |
| Human-in-the-loop | Escalates below a confidence threshold | Autonomous mistakes that matter |
| Observability | Traces every step, end to end | A system you cannot debug or audit |
The failure rate is the headline of 2026, and it is stark. Depending on whose survey you read, somewhere between 86 and 88% of agent pilots never reach production, and Gartner expects more than 40% of agentic projects to be cancelled outright by the end of 2027. The cause is rarely the model, which is the part that works; it is everything around it, and the projects that die share a profile: no governance, no way to trace what the agent did, costs that ran away at scale, and no single person accountable for the outcome.
This is why the gap, rather than the model, is where we start. A demo agent reasoning well in a notebook is a long way from one that retries a failed step, stays inside a budget, keeps its context from overflowing, escalates when it is unsure, and leaves a record of everything it touched. Those are engineering problems with known answers, and designing them in before building the clever part is the difference between the one in ten that ships and the nine that impress once and quietly disappear. The model is the easy part precisely because the labs already solved it; the operations around it are still yours to get right.
An agent that can act needs rails.
Once an agent can take actions, governance is what makes it safe to deploy: allow and deny lists for the tools and domains it can reach, cost ceilings and rate limits, and a confidence threshold below which it escalates to a person instead of acting. These are design decisions, not settings you add after an incident.
The EU dimension lines up with good engineering here. Article 12 of the AI Act requires high-risk systems to keep automatic logs of their activity, and an agent built on the Model Context Protocol makes that record straightforward to produce, because every tool call passes through one interface where it can be captured in the trace. We keep the models open-weight and inside the EU, so the traces and the data the agent touches stay in your jurisdiction. The monitoring builds on our observability work, and the risk classification on our compliance work.
Most organisations are not yet ready on this front: only about a fifth have a mature governance model for autonomous agents, which is the same gap the cancellations come from. We treat an agent as an accountable system rather than a clever feature, with the controls a regulator and a board now expect: an audit trail of its actions, a kill switch to stop it, and human sign-off on the consequential steps. These are built in rather than retrofitted after the first expensive mistake. Governance is not the brake on an agent; it is what lets you let it act at all.
How do we build them?
Scoped to one workflow with a payback, engineered for the failure modes that sink the rest.
Scope to a workflow
We start from one process with a measurable return and a named owner, rather than a broad agent that impresses in a demo and stalls in a month.
Supervisor-worker orchestration
A supervisor decomposes the goal and delegates to specialist steps, with retries, checkpointing and fallback routing so a single failure does not end the run.
MCP tools & RAG grounding
Tools connected over the Model Context Protocol and retrieval over your own data, so the agent acts on your systems and answers from your knowledge.
Human-in-the-loop
Checkpoints at the confidence and risk thresholds you set, so consequential actions get a person's sign-off before they happen.
Observability & evaluation
Every step traced and outputs scored from day one, because an agent you cannot see into is one you cannot trust in production.
Sovereign deployment
Open-weight models hosted inside the EU, on your environment or ours, with prompts, outputs and traces kept in-region.
We run agents in our own operation.
We build and run automation agents ourselves, on open-weight models on our own hardware, with the same tracing and guardrails we would put around yours. That keeps us honest about the part most pitches skip: a production agent is a real investment to build and to operate, with ongoing cost for maintenance and oversight. We give you that total cost upfront, so the decision is made on the real number rather than the demo.

    # bounded autonomy: the tools and limits define what it may do
    goal: resolve_refund_request
    tools:
      - name: lookup_order        # read-only
        scope: read
      - name: issue_refund        # mutating, capped + gated
        scope: write
        max_amount: 200           # over 200 → human approval
    require_human_approval:
      - any: [delete, external_email, payment]
    bounds:
      max_steps: 12               # no runaway loops
      cost_ceiling: 2.00          # hard cap per run
      kill_switch: enabled
    observability: otel/gen_ai    # every step traced + costed

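One way to read the `bounds` section of the config above is as limits enforced by the loop that drives the agent. The sketch below is illustrative only and does not describe any particular runtime: `Bounds`, `BoundExceeded` and `run_bounded` are hypothetical names, and each step is assumed to report its own cost when it returns.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple


@dataclass
class Bounds:
    max_steps: int = 12         # mirrors bounds.max_steps in the config
    cost_ceiling: float = 2.00  # mirrors bounds.cost_ceiling (per run)


class BoundExceeded(RuntimeError):
    """Raised when a run hits a hard limit and must stop."""


def run_bounded(steps: Iterable[Callable[[], float]],
                bounds: Bounds,
                kill_switch: Callable[[], bool] = lambda: False) -> Tuple[int, float]:
    """Run steps in order; each step returns its cost. Stop at any limit."""
    spent = 0.0
    done = 0
    for step in steps:
        if kill_switch():
            raise BoundExceeded("kill switch engaged")
        if done >= bounds.max_steps:
            raise BoundExceeded(f"step limit {bounds.max_steps} reached")
        spent += step()
        done += 1
        # Checked after each step, so the overshoot is at most one step's cost.
        if spent > bounds.cost_ceiling:
            raise BoundExceeded(f"cost ceiling exceeded: {spent:.2f}")
    return done, spent
```

The kill switch is checked before every step rather than once at the start, so an operator can halt a run that is already in progress.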
Why do agent pilots die before production?
The agents that fail rarely fail loudly. They work in the demo, impress a stakeholder, and then meet the messiness of real use: a tool times out and nothing retries, a long task overflows the context window and the agent forgets its own plan, a loop runs up a bill nobody capped, an output is wrong and there is no trace to explain why. Each of these is mundane, and together they are why the large majority of pilots never make it to a system anyone depends on.
None of them is a model problem, and all of them are predictable. We design the failure modes out before building the capability: retries and fallback routing so a single failed step does not end the run, checkpointing so a long task can resume rather than restart, cost ceilings and rate limits so a runaway is caught in minutes, and tracing so every run can be explained. Building for the bad day first is unglamorous and it is exactly the work the failed projects skipped, which is why their agents looked finished and were not.
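The retry and fallback behaviour described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a definitive implementation: `call_with_retry` is a hypothetical helper, the backoff schedule is arbitrary, and a real system would usually retry only on errors known to be transient.

```python
import time
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def call_with_retry(primary: Callable[[], T],
                    fallbacks: Sequence[Callable[[], T]] = (),
                    attempts: int = 3,
                    base_delay: float = 0.5,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Try `primary` up to `attempts` times with exponential backoff,
    then try each fallback once. Re-raise the last error if all fail."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            return primary()
        except Exception as err:
            last_error = err
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    for fallback in fallbacks:
        try:
            return fallback()
        except Exception as err:
            last_error = err
    assert last_error is not None
    raise last_error
```

Passing `sleep` as a parameter keeps the helper testable without real delays; the fallbacks stand in for whatever alternative route the workflow allows, such as a cached result or a second provider.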
An agent needs an owner as well as a builder.
The single strongest predictor of whether an agent reaches production is unglamorous: a named person accountable for it. Organisations with a named agent owner convert pilots to production at well over twice the rate of those without, and the ones without are heavily over-represented among the deployments that lose money. An agent with no owner is a science project; an agent with an owner is a function with a target, a budget and someone who answers for the result.
We build so that ownership is possible rather than theoretical. The agent is scoped to a workflow with a measurable return, instrumented so its owner can see what it is doing and what it costs, and bounded so its actions are accountable rather than mysterious. The owner does not have to be technical, but they do have to exist, and we will say plainly that an agent nobody owns is one we would not advise putting into production, because the data is clear about where those end up. Accountability is not a governance nicety here; it is the thing most correlated with the agent working at all.
Scope to a workflow with a number on it.
The agents that pay back are narrow. A specific workflow with a measurable return (an SDR agent that books meetings, a support agent that resolves a ticket type, a finance agent that reconciles a ledger) has a clear before and after, and the payback shows up fast: median time-to-value across functions runs around five months, with the sharpest cases paying back in three or four. A broad, do-everything agent has no number to hit and no way to prove it worked, which is how it ends up cancelled.
So we start from one process, not a platform ambition. We pick a workflow with documented value and a clear owner, build the agent for that, prove the return, and only then widen the scope, because the data is unambiguous that governed pilots in areas with known ROI succeed and sprawling ones fail. It is a less impressive starting point than a universal assistant, and it is the one that ships and earns its keep. The discipline is to resist the demo that wows and build the agent that pays, which are rarely the same thing.
A team of agents, not a single genius.
Complex work goes better as a team than a soloist. Rather than one agent trying to hold an entire goal in its head, a supervisor decomposes the goal and delegates each part to a specialist worker, then assembles the results, which keeps each step focused, its context small and its output checkable. Multi-agent orchestration of three or more cooperating agents is moving from a fifth of deployments toward roughly half over the next year, because it is how the harder workflows genuinely get done reliably.
The architecture is also what makes the whole thing resilient and observable. With retries, checkpointing and fallback routing between the steps, one worker failing is a recoverable event rather than a collapsed run, and because each agent has a defined job, a failure is traceable to a step rather than lost in a monolith's reasoning. We build supervisor-worker systems for the workflows that warrant them and keep a single agent where that is genuinely enough, because more agents is more to operate and the right number is the smallest that does the job well.
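A supervisor with checkpointing can be very small. The sketch below assumes the plan has already been decomposed into ordered `(worker_name, task)` pairs and that workers are plain functions; `supervise` and the checkpoint dictionary are hypothetical names, and a real system would persist the checkpoint outside the process.

```python
from typing import Callable, Dict, List, Tuple

Worker = Callable[[str], str]


def supervise(plan: List[Tuple[str, str]],
              workers: Dict[str, Worker],
              checkpoint: Dict[int, str]) -> List[str]:
    """Run each (worker_name, task) in order, skipping steps already
    stored in `checkpoint` so an interrupted run resumes, not restarts."""
    for index, (name, task) in enumerate(plan):
        if index in checkpoint:
            continue  # finished in an earlier attempt
        try:
            checkpoint[index] = workers[name](task)
        except Exception as err:
            # The failure is attributable to one step and one worker.
            raise RuntimeError(f"step {index} ({name}) failed: {err}") from err
    return [checkpoint[i] for i in range(len(plan))]
```

Calling `supervise` again with the same checkpoint after a failure re-runs only the steps that did not complete, which is the difference between a recoverable event and a collapsed run.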
What is MCP, and why does it matter?
An agent is only useful if it can reach your tools and data, and the Model Context Protocol has become the standard way to connect it. Now stewarded by the Linux Foundation and adopted across the major labs, MCP lets an agent call tools through one open interface rather than a tangle of bespoke integrations, which reduces vendor lock-in and makes multi-vendor agent ecosystems normal. Alongside agent-to-agent protocols, it is the plumbing that lets the pieces of an agentic system be swapped without rebuilding the whole.
There is a governance dividend and a caution. Because every tool call passes through the protocol, it can be captured at a single point, which gives you the activity record the AI Act expects of high-risk systems at little extra cost. But an open protocol abstracts risk rather than removing it: an agent that can call tools can call them wrongly, so the same connection needs centralised authorisation, scoped permissions and security monitoring around it. We build on MCP for the openness and the audit trail, and we put the controls around it rather than assume the standard makes the agent safe by itself.
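The idea of centralised authorisation plus an audit trail can be shown without any protocol library. The sketch below is plain Python and deliberately does not use an MCP SDK: `ToolGateway` and its methods are hypothetical, and the in-memory log stands in for durable storage.

```python
import json
import time
from typing import Any, Callable, Dict, List, Set


class ToolGateway:
    """Single choke point for tool calls: authorise, execute, record."""

    def __init__(self, permissions: Dict[str, Set[str]]):
        self.permissions = permissions           # agent id -> tools it may call
        self.tools: Dict[str, Callable[..., Any]] = {}
        self.audit_log: List[str] = []           # JSON lines; persist in practice

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        self.tools[name] = fn

    def _record(self, agent: str, tool: str, args: dict, outcome: str) -> None:
        entry = {"ts": time.time(), "agent": agent, "tool": tool,
                 "args": args, "outcome": outcome}
        self.audit_log.append(json.dumps(entry, default=str))

    def call(self, agent: str, tool: str, **args: Any) -> Any:
        # Deny by default: an agent may only call tools it was granted.
        if tool not in self.permissions.get(agent, set()):
            self._record(agent, tool, args, "denied")
            raise PermissionError(f"{agent} may not call {tool}")
        try:
            result = self.tools[tool](**args)
        except Exception as err:
            self._record(agent, tool, args, f"error: {err}")
            raise
        self._record(agent, tool, args, "ok")
        return result
```

Denied and failed calls are logged as well as successful ones, since the attempts an agent was not allowed to make are often the most useful entries in an audit.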
Grounded in your data, not the model's guesses.
An agent that answers from the model alone will, sooner or later, answer confidently and wrongly. Grounding it in your own data through retrieval changes that: before the agent acts or answers, it retrieves the relevant passages from your documents, records and systems, so the response rests on your knowledge rather than the model's training. This is what makes an agent trustworthy on your business rather than merely fluent about the world in general.
The retrieval also keeps the sensitive data where it belongs. The documents the agent draws on stay in your sovereign store, searched in place rather than shipped to a third party, which matters as much for compliance as for accuracy. We connect the agent to your data over the same governed retrieval we build in our RAG and knowledge systems work, with the vector store inside the EU, so the agent is grounded in current, authoritative, in-jurisdiction information rather than a plausible guess. An agent acting on your systems has to be right about them, and retrieval is how it stays right as the underlying data changes.
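The retrieve-then-answer pattern can be illustrated with a toy retriever. This sketch scores passages by bag-of-words cosine similarity purely to stay self-contained; it is an assumption for illustration, and a production system would typically use embeddings and a vector store instead. `retrieve` and `build_prompt` are hypothetical names.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def _vector(text: str) -> Counter:
    """Lowercased word counts for a piece of text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query: str, documents: List[str], k: int = 2) -> List[Tuple[float, str]]:
    """Return up to k passages with non-zero similarity, best first."""
    q = _vector(query)
    scored = [(_cosine(q, _vector(d)), d) for d in documents]
    scored = [pair for pair in scored if pair[0] > 0]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


def build_prompt(question: str, documents: List[str]) -> str:
    """Put the retrieved passages in front of the question."""
    context = "\n".join(f"- {text}" for _, text in retrieve(question, documents))
    return ("Answer only from the passages below; say so if they do not "
            f"cover the question.\n{context}\n\nQuestion: {question}")
```

The documents never leave the process in this sketch: only the few retrieved passages are placed in the prompt, which is the property that matters when the store and the model both sit in-region.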
Human-in-the-loop, set to your risk.
Autonomy is a dial, not a switch. The question is never whether a human is involved but where, and the answer follows the stakes: a low-risk, reversible action can run unattended, while a consequential or irreversible one (moving money, sending an external message, changing a record that matters) waits for a person's sign-off. We set those checkpoints at the confidence and risk thresholds you choose, so the agent acts freely where it is safe and pauses where it is not.
This is what lets an agent be useful without being reckless. An agent held back by a human on every trivial step is no faster than doing the work yourself; one let loose on every step is a liability waiting to happen. The right design puts the person exactly where their judgement is worth the interruption and nowhere else, and adjusts that line as the agent earns trust on a given task. The threshold is yours to set and ours to enforce, because how much autonomy you are comfortable with is a business decision rather than a technical default.
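A checkpoint of this kind reduces to a small decision function. The sketch below is one possible shape, under stated assumptions: the action kinds and the 200 limit echo the example config earlier on this page, while the 0.8 confidence threshold and the names `Action`, `Decision` and `gate` are hypothetical placeholders for values you would set yourself.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    AUTO = "run unattended"
    APPROVAL = "wait for human sign-off"


@dataclass(frozen=True)
class Action:
    kind: str            # e.g. "lookup_order", "issue_refund", "payment"
    confidence: float    # the agent's own estimate, 0.0 to 1.0
    amount: float = 0.0  # monetary value, if any


# Kinds that always need sign-off, whatever the confidence.
ALWAYS_APPROVE = {"delete", "external_email", "payment"}


def gate(action: Action,
         min_confidence: float = 0.8,
         max_amount: float = 200.0) -> Decision:
    """Decide whether an action may run unattended."""
    if action.kind in ALWAYS_APPROVE:
        return Decision.APPROVAL
    if action.confidence < min_confidence:
        return Decision.APPROVAL
    if action.amount > max_amount:
        return Decision.APPROVAL
    return Decision.AUTO
```

Because the thresholds are parameters rather than constants, the line can be moved per task as the agent earns trust, without touching the agent itself.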
Guardrails that stop the costly mistake.
An agent that can act can act badly, and the damage is bounded only by what you let it reach. Guardrails are the limits drawn before it runs: allow and deny lists for the tools and domains it may touch, so it cannot wander into systems it has no business in; cost ceilings and rate limits, so a loop or a runaway cannot spend without bound; and a kill switch that stops it cleanly when something is wrong. These are not reactions to an incident; they are the conditions under which acting autonomously is acceptable at all.
The point of the limits is to make the worst case survivable rather than catastrophic. A guarded agent that hits a problem burns a capped amount, touches only sanctioned systems, and can be halted; an unguarded one can run up a bill, take an action it should never have been able to take, and leave you discovering the scope afterwards. We design the guardrails to your risk tolerance and the workflow's stakes, so the freedom the agent has is freedom you decided to grant rather than freedom it took because nobody fenced it in.
An agent you cannot see into is one you cannot trust.
The single most common gap in the failed projects is the inability to say what the agent did and why. A system that takes actions on your behalf has to be observable end to end: every step traced, every tool call recorded, every decision attributable, so that when something goes wrong you have a record to investigate rather than a black box to shrug at. Without that, the agent cannot be trusted in production, debugged when it misbehaves, or evidenced to a regulator, and that is precisely where most pilots stall.
Tracing is half of it; evaluation is the other. We score the agent's outputs from day one against what good looks like, so quality is measured rather than assumed, and so the slow drift, where an agent that worked at launch quietly degrades as data and prompts change, is caught as a falling number rather than a customer complaint. This runs on the same observability footing as the rest of the infrastructure, because an agent is a production system like any other, and a production system you cannot see into is one waiting to fail quietly. Visibility is what turns an agent from a hopeful experiment into something you can stand behind.
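To make "a falling number" concrete, here is a deliberately crude groundedness score: the share of answer sentences whose words mostly appear in the retrieved passages. It is a lexical proxy chosen so the example stays self-contained, not a recommended metric; real evaluations usually rely on an entailment model or a judged rubric. The function name and the 0.6 overlap threshold are assumptions.

```python
import re
from typing import List, Set


def _words(text: str) -> Set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def groundedness(answer: str, passages: List[str], min_overlap: float = 0.6) -> float:
    """Share of answer sentences whose words are mostly found in the passages."""
    source: Set[str] = set()
    for passage in passages:
        source |= _words(passage)
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    if not sentences:
        return 0.0
    supported = 0
    for sentence in sentences:
        words = _words(sentence)
        if words and len(words & source) / len(words) >= min_overlap:
            supported += 1
    return supported / len(sentences)
```

Logged per run alongside the trace, a score like this turns slow drift into a trend on a chart, which is easier to act on than an anecdote.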
Sovereign by default: open-weight, in the EU.
An agent sees your most sensitive material: the prompts, the documents it retrieves, the data it acts on, and the traces of everything it did. Sending all of that to a model hosted by a foreign provider is a sovereignty decision made by accident, and for regulated data it is the wrong one. We deploy open-weight models on infrastructure inside the EU, on your environment or ours, so the prompts, the outputs and the traces stay in your jurisdiction rather than crossing it on every call.
Open weights matter beyond residency. A model you host is one no vendor can deprecate, reprice or change underneath you, which removes a dependency that a per-call API quietly builds in, and it makes air-gapped deployment possible where the data demands it. The capability gap between the best open-weight models and the proprietary frontier has narrowed to the point where, for most agentic workflows, the open option is more than enough and brings control the hosted one cannot. We default to it because sovereignty, cost predictability and freedom from lock-in all point the same way, and reach for a hosted model only where a workload genuinely needs the frontier.
What does an agent really cost at scale?
A demo costs almost nothing, which is exactly what makes the real bill a shock. A production agent carries a build cost and then an ongoing one: the model inference, which multiplies with every step and every run, the orchestration that keeps it reliable, and the human oversight the consequential actions require. Underestimating the cost of running agents at scale is one of the named reasons Gartner gives for the coming wave of cancellations, because the per-run economics that looked trivial in testing become real money at volume.
So we model the total cost of ownership before you commit, not after the first month's invoice. That means the inference cost per run at expected volume, the orchestration and oversight, and the maintenance an agent needs as models and data shift, set against the measurable return of the workflow it serves. Self-hosting open-weight models on EU infrastructure changes those economics in your favour at scale, which is the same crossover our cloud cost optimization work models for any heavy inference workload. The business case for an agent should rest on its real running cost, because an agent that works but loses money is still a failed project.
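The arithmetic behind such a model is simple enough to write down. The sketch below is a simplified illustration, not a pricing tool: the field names are hypothetical, and every number you feed it (token counts, prices, review time, value per run) is an input you would need to measure for your own workflow.

```python
from dataclasses import dataclass


@dataclass
class RunCost:
    steps_per_run: int
    tokens_per_step: int
    price_per_million_tokens: float  # inference price, any currency
    review_minutes_per_run: float    # human oversight time per run
    reviewer_rate_per_hour: float

    def per_run(self) -> float:
        """Inference plus oversight cost for a single run."""
        inference = (self.steps_per_run * self.tokens_per_step
                     * self.price_per_million_tokens / 1_000_000)
        oversight = self.review_minutes_per_run / 60 * self.reviewer_rate_per_hour
        return inference + oversight


def payback_months(build_cost: float,
                   runs_per_month: int,
                   value_per_run: float,
                   cost: RunCost,
                   fixed_monthly: float = 0.0) -> float:
    """Months until cumulative net return covers the build cost.
    Returns infinity when the agent loses money on every month."""
    monthly_net = runs_per_month * (value_per_run - cost.per_run()) - fixed_monthly
    if monthly_net <= 0:
        return float("inf")
    return build_cost / monthly_net
```

Two things tend to stand out when real inputs go in: inference cost scales with steps per run, not just with the number of runs, and human review time can outweigh inference entirely, which is why the oversight line belongs in the model from the start.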
What is, and isn't, an agent?
The word agent has been stretched to cover almost anything with a model in it, and a good deal of what is sold as agentic is a chatbot with a new label or a fixed script with a language model bolted on. The distinction that matters is whether the system truly plans and acts, choosing steps and carrying them out toward a goal, or merely responds. Calling a deterministic workflow an agent does not make it one, and it does set up the disappointment that follows when it cannot do what the name implied.
We are deliberately plain about this, including against our own interest. Plenty of problems are better solved by a simple, predictable automation than by an agent, and where that is true we will tell you, because a fixed workflow is cheaper to build, cheaper to run and easier to trust than an agent doing the same job with more moving parts. An agent earns its complexity only where the task genuinely needs planning, judgement and adaptation; everywhere else, the honest answer is the simpler one, and selling you an agent you did not need would be the kind of agent washing we are warning you about.
Who needs an agent, and when should you not build one?
An agent fits a workflow that is repetitive enough to be worth automating, varied enough to need judgement rather than a fixed rule, and valuable enough that the build and running cost pays back, with a person available to own it. Customer service, sales development, software work and operational reconciliation are where the returns are most consistently proven, but the test is the shape of the workflow rather than the industry: a process with a clear goal, real variation, and a measurable outcome is a candidate; a vague ambition to add AI is not.
When those conditions are absent, the honest answer is to wait or to build something simpler. A workflow that never varies wants a script; a problem with no measurable return wants a clearer definition before any agent; an organisation with no one to own the result is not ready regardless of the use case. We would rather tell you a workflow is not an agent than build one that joins the large majority that never ship, because our interest is in the agents that reach production and earn their keep, not in the count of agents started. Name the workflow, and we will tell you honestly whether an agent is the right tool for it.
Questions buyers ask.
- What is an AI agent?
- Why do most agent projects fail?
- What is MCP, and why does it matter?
- How do you stop an agent doing something costly or wrong?
- Can the models run inside the EU?
- What does an agent really cost to run?
- How long until an AI agent pays back?
- Do we need a dedicated owner for the agent?
- What is multi-agent orchestration?
- How do you keep an agent grounded and accurate?
- Can an agent run on open-weight models in the EU?
- Is what we want really an agent, or just automation?
- How do you make an agent auditable for the AI Act?
Name the workflow. We'll tell you if an agent ships it.
Bring the process you want an agent to take on. We assess whether it can reach production, what the orchestration and oversight would take, and the real cost to build and run it, before you commit to anything. If it is not ready to be an agent yet, we will say so.