AI · Production integration

Production AI integration that holds up under real traffic.

A pilot that works in a demo often falls over in production: cost spirals, latency balloons, and one provider outage takes the feature down. We bring the LLMOps discipline that keeps AI reliable and affordable inside your real systems, with the option to route to models hosted in the EU.

Book a production review What we build →

◆ Model gateway & routing ◆ Guardrails & eval gates ◆ EU routing

Production AI integration is the work of putting a model into your real systems and keeping it reliable, safe and affordable as traffic grows. A production AI feature is rarely one prompt: it is a compound system of routing, retrieval, guardrails, evaluation and fallback, each with its own way of failing, which is why a pilot that impresses in a demo so often stalls on the way to production. Argus Root builds and operates that system, with the option to route sensitive work to models hosted inside the EU, for teams that need an AI feature they can rely on rather than another stalled prototype.

In short

By MIT's 2025 review of 300+ deployments, about 95% of enterprise GenAI pilots delivered zero measurable ROI, and Gartner puts the share that never reach production near 85%. The model is rarely the problem.
The gap is context, not intelligence: querying raw data scores below 20% accuracy, while a governed, retrieval-grounded setup exceeds 95%.
A pilot succeeds because it is insulated — narrow scope, curated data, a human filling gaps; production removes those buffers, which is the system we build for.
From August 2026 the EU AI Act's high-risk obligations apply, so input and output guardrails move from optional to mandatory for many use cases.
Sovereignty is decided in the routing: sensitive requests can be sent to models hosted inside the EU, while general work uses the best available model.

The pilot works. Production is a different system.

Language models do not behave like ordinary software. A feature that looks finished in a demo meets production, and the cost spirals, the latency balloons under load, and a single provider outage takes it offline. The 2026 deployment stack has six layers, and skipping any one of them is paid back in incidents and lost trust. The discipline that closes the gap between pilot and production is engineering, not a better model.


The six layers of a production LLM deployment and what breaks without each.
Layer	What it does	What breaks without it
Gateway & routing	One entry point across models, routed by cost and latency, with caching	Runaway cost, no fallback on outage
Guardrails	Block prompt injection and PII in, bad output out	Unsafe answers reach users
Versioning	Version prompts, retrieval and guardrails; A/B with rollback	Every change is a coin flip
Evaluation gates	Score groundedness and safety in CI/CD	Regressions ship unseen
Observability	Trace every call, end to end	A system you cannot debug or audit
Fallback & load	Failover and caching, tested under load	The feature falls over at peak

The gap is not in the model, and that is the part most teams get wrong. The pilot works because the model is good; what is missing in production is everything around it, the operational infrastructure that turns a clever prototype into a feature your systems can lean on. As one bank put it, the failure is rarely an LLM problem: it is a retrieval problem, a cost problem, a governance problem, the unglamorous layers a demo never needs. The model quality that impressed the stakeholder is precisely the thing that was never the bottleneck.

A production AI feature is a compound system rather than a prompt. It has retrieval that finds the right context, routing that sends each request to the right model, guardrails on the way in and out, caching to control cost, evaluation to catch regressions, and fallback for when a provider fails, each a component with its own way of breaking. Building one means thinking as a systems architect rather than a model specialist, and the cost structure is inverted from traditional machine learning: the heavy spend is not the upfront training but the inference on every single request, which is what makes the bill move under real traffic in a way no pilot predicts.

Routing is also a sovereignty control.

Putting provider calls behind a gateway buys you cost control through routing, semantic caching and spend ceilings, and resilience through fallback when a model fails. It is also where residency happens. The gateway can send sensitive requests to open-weight models hosted in the EU and enforce data-residency rules in the guardrails, so the prompts and outputs that must stay in-region do, while less sensitive traffic can still use a hyperscaler model where that makes sense.

That puts the placement decision in one controllable place rather than scattered through your codebase. The same gateway feeds traces to our observability layer, and the residency rules follow compliance. Where the feature is an agent or a retrieval system, it builds on our agents and RAG work.

Centralising all of this in the gateway is what makes a production AI feature operable at all. Without it, routing logic, cost controls, residency rules and fallback are scattered through the application code, changed in a dozen places and consistent in none. With it, the operational concerns of running AI (where requests go, what they cost, what they are allowed to return, what happens when a model fails) live in one layer you can see, control and change, rather than tangled into the feature itself. The gateway is the difference between an AI feature you operate and one that operates you.
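As a rough illustration of routing as the placement decision, the sketch below picks a model per request from its data sensitivity. It is a minimal sketch under stated assumptions, not our gateway: the model identifiers are placeholders, the regex detectors are deliberately crude stand-ins for a real PII classifier, and `choose_route` and `Route` are hypothetical names.

```python
import re
from dataclasses import dataclass

# Placeholder model identifiers (assumptions for illustration only).
EU_MODEL = "eu/llm-sovereign"
FRONTIER_MODEL = "frontier/large"

# Deliberately crude detectors; a real deployment would use a proper PII classifier.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
IBAN = re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b")


@dataclass(frozen=True)
class Route:
    model: str
    reason: str  # recorded so the placement decision can be audited later


def choose_route(prompt: str, tagged_sensitive: bool = False) -> Route:
    """Pick a model for one request based on data sensitivity."""
    if tagged_sensitive:
        return Route(EU_MODEL, "caller tagged request as sensitive")
    if EMAIL.search(prompt) or IBAN.search(prompt):
        return Route(EU_MODEL, "personal data detected in prompt")
    return Route(FRONTIER_MODEL, "no sensitive data detected")
```

Because the decision lives in one function, changing where sensitive traffic goes is a one-line edit rather than a search through application code, and the recorded reason gives each request an explanation for its route.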

What do we build?

The six layers, wired into your systems and operated, not handed over as a diagram.

Model gateway & routing

One entry point across providers and open-weight models, routed by cost and latency, with semantic caching and spend ceilings that keep the bill predictable.

Input & output guardrails

Prompt-injection and PII checks on the way in, and hallucination, leakage and format checks on the way out, the minimum for anything customer-facing.

Prompt & config versioning

Prompts, retrieval settings and guardrail configs versioned and rolled out per user with automatic rollback, so a change is testable rather than a gamble.

Evaluation gates

Groundedness, relevance and safety scored in your CI/CD pipeline, blocking a deploy when the numbers regress. See observability →

Cost & latency control

Caching, routing and load testing that find the breaking point before your users do, with an alert when spend crosses your baseline.

EU routing & fallback

Sensitive requests routed to EU-hosted open-weight models, with tested failover so an outage degrades gracefully instead of going dark.

We run the gateway and the guardrails ourselves.

We route across open-weight models on our own infrastructure, with caching, guardrails and tracing in front of them, because we run AI features ourselves rather than only advising on them. That keeps us honest about the parts that hurt in production: the inference bill, the latency under load, and the failure mode when a provider has a bad day. We build for those before they become your incident.

The compound system we operate: a gateway that routes by sensitivity (regulated data to an EU-hosted model, general work to the best frontier model), retrieval and guardrails around the model, an evaluation gate that blocks a bad deploy, and observability over all of it. The routing split is where sovereignty is enforced, not bolted on.

gateway/policy.yaml — route by sensitivity, guard, and eval-gate the deploy

# route by data sensitivity — this split is the sovereignty control
routes:
  - match: pii            # carries personal / regulated data
    model:    eu/llm-sovereign   # hosted in-EU, in-jurisdiction
    fallback: eu/llm-small
  - match: default
    model:    frontier/large
    fallback: eu/llm-small
guardrails:
  input:  [pii-redaction, jailbreak-filter]
  output: [grounding-check, schema-validate]
eval_gate:
  block_deploy_if: pass_rate < 0.95
  suite: [faithfulness, safety, regression]

We operate ◆ Model gateway ◆ Guardrails ◆ Prompt versioning ◆ Eval gates ◆ OTel tracing ◆ EU open-weight

LLMOps is the discipline, not an add-on.

The work of running language models in production has a name, LLMOps, and treating it as an optional extra once the model is chosen is how projects stall. It is the production discipline that decides whether an AI feature built by capable engineers genuinely holds up at scale: the versioning, the evaluation, the guardrails, the observability and the cost control that a model on its own has none of. Calling it an add-on is like calling the foundations an add-on to the house.

The shift it demands is from model specialist to systems architect. A traditional machine-learning project versioned model weights; an LLM feature versions prompt templates, retrieval settings and guardrail configurations, and tests them against fuzzy, semantic criteria rather than exact outputs. We bring that discipline as the substance of the service, because the teams that invest in it early ship more reliable features for less, and the ones that skip it spend their time debugging production by hand. The model is the ingredient; LLMOps is the kitchen that turns it into something you can serve.

What does an AI gateway do?

An LLM gateway sits between your application and the models, presenting one interface across many providers and open-weight models behind it. Through that single layer pass the routing, the authentication, the rate limiting, the cost tracking, the caching, the fallback and the guardrails, so your application code is freed of all of it and you gain one place to control how AI behaves. The alternative, calling providers directly from a dozen points in the codebase, scatters every operational concern and makes none of them consistent.

The unified layer is also what makes the model a swappable choice rather than a permanent commitment. Because the application talks to the gateway and the gateway talks to the models, you can change which model serves a request, add a new provider, or move sensitive work to an EU-hosted model, without touching the feature. We build the gateway as the spine of the integration, because it is the one component that turns the operational concerns of production AI from scattered code into a controllable layer, and it is the difference between an AI feature you can change in one place and one you have to chase through the whole application.

Not paying twice for the same answer.

A great deal of real traffic is repetitive. Different users ask the same question in different words, and a naive integration pays the full inference cost every time, as if each were novel. Semantic caching recognises that "what are your opening hours" and "when are you open" are the same question and returns the cached answer rather than making a second call, which on a high-repetition workload removes a large slice of the inference bill outright.

The saving compounds with routing. Not every request needs the largest, most expensive model: a simple classification can go to a small fast one and only the hard cases to the frontier, so the spend tracks the difficulty of the work rather than defaulting to the priciest option for everything. We build caching and right-sized routing into the gateway, with spend ceilings and an alert when usage crosses your baseline, so cost is a controlled variable rather than a monthly surprise. The cheapest inference is the call you did not have to make, and the second cheapest is the one you sent to the right-sized model.
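To make the caching idea concrete, here is a minimal sketch of a semantic cache: it compares the embedding of a new question with stored ones and reuses an answer when they are close enough. The `toy_embed` function is a stand-in we invented so the example runs without a model; a real system would call an embedding model, and the 0.9 threshold is an assumption that would need tuning against real traffic.

```python
import math
from typing import Callable, Optional

Vector = list[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.9):
        self.embed = embed
        self.threshold = threshold
        self.entries: list[tuple[Vector, str]] = []

    def get(self, question: str) -> Optional[str]:
        query = self.embed(question)
        best = max(self.entries, key=lambda e: cosine(query, e[0]), default=None)
        if best is not None and cosine(query, best[0]) >= self.threshold:
            return best[1]
        return None

    def put(self, question: str, answer: str) -> None:
        self.entries.append((self.embed(question), answer))


def answer(question: str, cache: SemanticCache, call_model: Callable[[str], str]) -> str:
    cached = cache.get(question)
    if cached is not None:
        return cached          # no inference call, no inference cost
    fresh = call_model(question)
    cache.put(question, fresh)
    return fresh


# Toy stand-in for an embedding model: maps a few words onto two "topic" axes.
TOPICS = {"hours": 0, "open": 0, "opening": 0, "refund": 1, "return": 1}


def toy_embed(text: str) -> Vector:
    vec = [0.0, 0.0]
    for word in text.lower().replace("?", "").split():
        if word in TOPICS:
            vec[TOPICS[word]] += 1.0
    return vec
```

With `SemanticCache(toy_embed)`, asking "What are your opening hours?" and then "When are you open?" reaches the model once; a refund question lands on a different axis and is treated as new. A linear scan over entries is fine for a sketch, while a production cache would use a vector index and an expiry policy.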

What guardrails does customer-facing AI need?

A model put in front of customers without guardrails is an incident waiting to happen. On the way in, the checks catch prompt injection and strip or flag personal data before it reaches the model; on the way out, they screen the response for hallucination, data leakage, toxic content and format compliance before it ever reaches the user. Output guardrails in particular are not a refinement but the baseline requirement for any customer-facing deployment, because the alternative is shipping whatever the model happened to produce, unchecked.

For a system that can act, a third layer guards the actions themselves, so a generated instruction to do something consequential is checked before it executes rather than after. We build the guardrails at input, output and action levels, scaled to the stakes of the feature and tuned so they catch the real failure modes without smothering the model in false positives. The point is not to distrust the model but to bound what it can send and do, because a probabilistic system in front of customers needs deterministic limits around it, and those limits are the cheapest insurance in the whole stack.
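The deterministic limits described above can be sketched as two small functions, one on the way in and one on the way out. This is an illustration under assumptions rather than a complete guardrail set: the phrase blocklist and email regex are placeholders for real injection and PII detection, the output is assumed to be JSON, and `guard_input`, `guard_output` and `GuardrailViolation` are hypothetical names.

```python
import json
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
# Illustrative phrases only; real injection detection needs more than a blocklist.
INJECTION_HINTS = ("ignore previous instructions", "reveal your system prompt")


class GuardrailViolation(Exception):
    """Raised when a request or response must not pass the gateway."""


def guard_input(prompt: str) -> str:
    """Reject likely injection attempts, then redact emails before the model sees them."""
    lowered = prompt.lower()
    if any(hint in lowered for hint in INJECTION_HINTS):
        raise GuardrailViolation("possible prompt injection")
    return EMAIL.sub("[EMAIL]", prompt)


def guard_output(raw: str, required_keys: set[str]) -> dict:
    """Check format and leakage before a response reaches the user."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise GuardrailViolation("output is not valid JSON") from exc
    if not isinstance(data, dict) or not required_keys.issubset(data):
        raise GuardrailViolation("output is missing required fields")
    if EMAIL.search(raw):
        raise GuardrailViolation("output leaks an email address")
    return data
```

Raising an exception rather than silently repairing the output keeps the failure visible to the fallback and tracing layers, which can then decide whether to retry, degrade or alert.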

Prompts are code now.

A prompt is not a throwaway string; it is logic that determines how the feature behaves, and a careless edit to it can break the system as surely as a bad code change. Treating prompts, retrieval settings and guardrail configurations as versioned artifacts, changed deliberately, tested before release and traceable after, is what turns an AI feature from a thing people tweak in production into a system you can manage. Without it, nobody can say which version of a prompt produced last week's good results or this week's bad ones.

We version those artifacts and roll changes out progressively, to a fraction of traffic first, with automatic rollback when the numbers move the wrong way, the same canary discipline mature teams apply to code. A change becomes something measured and reversible in minutes rather than a release everyone holds their breath through, and an immutable record ties every output back to the exact configuration that produced it. That trail is also what an audit needs, because tracing a decision to a specific version of the stack is precisely what a regulator asks of a system that affects people.
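A progressive rollout with automatic rollback can be sketched in a few lines. The example below is an assumption-laden illustration, not our release tooling: version labels such as `prompt-v12` are made up, users are bucketed by hashing their id so each one sees a stable version, and the 0.95 quality bar is a placeholder.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Rollout:
    stable: str                   # version label serving most traffic
    candidate: str                # version label under test
    percent: int                  # share of users on the candidate, 0 to 100
    min_pass_rate: float = 0.95   # assumed quality bar for the candidate

    def version_for(self, user_id: str) -> str:
        # Hash the user id so each user lands in the same bucket on every request.
        digest = hashlib.sha256(user_id.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % 100
        return self.candidate if bucket < self.percent else self.stable

    def record_pass_rate(self, candidate_pass_rate: float) -> None:
        # Automatic rollback: pull the candidate if quality drops below the bar.
        if candidate_pass_rate < self.min_pass_rate:
            self.percent = 0
```

A rollout created as `Rollout("prompt-v12", "prompt-v13", percent=10)` sends roughly a tenth of users to the candidate; calling `record_pass_rate(0.90)` returns everyone to the stable version without a redeploy. Logging the returned version label with each output is what ties a result back to the configuration that produced it.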

How do you test AI that has no single right answer?

The hard part of testing AI is that there is no single right answer to assert against, so traditional exact-match tests do not work. The answer is fuzzy, semantic evaluation: an automated suite that scores groundedness, relevance and safety on every prompt and model change, combining a model-as-judge with deterministic checks like format and citation presence and a rotating sample of human spot-checks. Wired into the deployment pipeline, it blocks a release when the scores regress, so a quality drop is caught before it ships rather than after a user finds it.

This matters because AI quality degrades in ways code does not. A model update, a prompt tweak, a shift in the underlying data can each quietly lower quality with nothing throwing an error, and evaluation is the only reliable early-warning system for that drift. We build the eval suite as a gate in CI/CD rather than a report nobody reads, so the question "is it still good enough?" is answered by a number on every change instead of a hope. A feature that cannot fail its own tests is a feature whose quality is whatever the last edit happened to leave it at.
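The gate itself is simple once the scoring exists. The sketch below assumes each evaluation case has already been judged pass or fail (by a model-as-judge, a deterministic check or a human), and only shows the blocking decision; `EvalResult`, `pass_rate` and `gate` are hypothetical names, and the 0.95 threshold mirrors the sample policy rather than a universal standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    case_id: str
    suite: str      # e.g. "faithfulness", "safety", "regression"
    passed: bool


def pass_rate(results: list[EvalResult]) -> float:
    if not results:
        return 0.0  # no evidence counts as a failure, not a pass
    return sum(r.passed for r in results) / len(results)


def gate(results: list[EvalResult], threshold: float = 0.95) -> bool:
    """Return True when the deploy may proceed."""
    return pass_rate(results) >= threshold
```

In a pipeline, a step would run the suite, call `gate`, and exit non-zero when it returns `False`, so a regression stops the release the same way a failing unit test would.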

Metrics tell you something broke; traces tell you why.

Running AI in production needs two kinds of sight. Metrics track the system's health in aggregate (latency percentiles, cost per request, error rate, token usage), so a statistical anomaly surfaces as a number moving; traces go down to the individual request, recording every step so you can see why one specific output was wrong. Metrics tell you something is off; traces tell you what and where, and a production system needs both, because an average hides the failing case and a single trace cannot show a trend.

We instrument the integration so both exist from day one rather than bolted on after the first incident. Every request is traceable end to end, the cost and latency are visible per route and per model, and the signals that predict trouble are watched rather than discovered. This runs on the same observability footing as the rest of the infrastructure, because a production AI feature is a production system like any other, and one you cannot see into is one you cannot keep reliable. Visibility is not a luxury here; it is the precondition for trusting the feature at all.
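The two kinds of sight can be illustrated with a toy tracer and one aggregate metric. This is a standard-library sketch for explanation only: it is not the OpenTelemetry API, the `Tracer` and `Span` classes are hypothetical, and a real deployment would export spans to a tracing backend rather than keep them in memory.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class Span:
    trace_id: str
    name: str
    attributes: dict = field(default_factory=dict)
    duration_ms: float = 0.0


class Tracer:
    def __init__(self) -> None:
        self.spans: list[Span] = []

    @contextmanager
    def span(self, trace_id: str, name: str, **attributes):
        record = Span(trace_id, name, dict(attributes))
        start = time.perf_counter()
        try:
            yield record
        finally:
            # Record the span even if the wrapped step raised.
            record.duration_ms = (time.perf_counter() - start) * 1000
            self.spans.append(record)

    def trace(self, trace_id: str) -> list[Span]:
        """Per-request view: every recorded step of one request."""
        return [s for s in self.spans if s.trace_id == trace_id]


def p95(values: list[float]) -> float:
    """Aggregate view: nearest-rank 95th percentile."""
    if not values:
        raise ValueError("p95 needs at least one value")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100   # ceil(0.95 * n) in integer arithmetic
    return ordered[rank - 1]
```

Wrapping retrieval and the model call in `tracer.span(request_id, ...)` gives the per-request story, while `p95` over the recorded durations for a route gives the trend that no single trace can show.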

An outage should degrade, not go dark.

AI providers have outages, rate limits and bad minutes like any other dependency, and a feature wired to a single model with no plan for its absence simply stops when that model does. Resilience is a fallback chain: when the primary model is unavailable or too slow, the gateway routes to a secondary, and where a full answer is impossible the feature degrades to a reduced one rather than an error page. The goal is that a provider's bad day is a quieter feature rather than a dead one.

Building this in is the difference between an AI feature that is a hard dependency on one vendor's uptime and one that survives the vendor failing. We design the fallback so the degradation is graceful and tested, not theoretical, because a failover path nobody has exercised tends to fail when it is finally needed. Running open-weight models inside the EU as part of the chain also means there is a fallback you operate rather than one you only rent, so the feature is not wholly at the mercy of a provider you do not control. An AI feature that cannot survive its model having an outage is one outage away from an outage of its own.
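A fallback chain reduces to trying models in order and degrading when all of them fail. The sketch below assumes each model is wrapped in a callable that raises on outage, rate limit or timeout; the function name, the chain structure and the degraded message are illustrative assumptions rather than a fixed interface.

```python
from typing import Callable, Sequence

ModelCall = Callable[[str], str]

DEGRADED_REPLY = "We can't answer that in full right now; please try again shortly."


def complete_with_fallback(
    prompt: str, chain: Sequence[tuple[str, ModelCall]]
) -> tuple[str, str]:
    """Try each model in order; return (model_name, reply)."""
    for name, call in chain:
        try:
            return name, call(prompt)
        except Exception:
            # Treat any provider error (outage, rate limit, timeout) as "try the next one".
            continue
    # Every model failed: degrade to a reduced answer instead of an error page.
    return "degraded", DEGRADED_REPLY
```

Returning the name of the model that actually answered lets traces and dashboards show how often the secondary path is exercised, which is also how you find out whether the failover works before an outage does.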

Find the breaking point before your users do.

A feature that is fast and cheap with ten users can be slow and ruinous with ten thousand, and the inverted cost structure of AI is what makes this bite. Because the heavy cost is the inference on every request, scale multiplies both the latency and the bill in ways a small pilot never reveals, and the breaking point, where the feature falls over or the spend becomes untenable, is far better found in a test than in production at peak. The demo that delighted at low volume is silent about what happens under load.

So we load test the integration against realistic traffic before it meets real traffic, finding where latency degrades and where cost climbs faster than value, and tuning the caching, routing and limits to push that point out. Peak load becomes a planned case with known behaviour rather than the day the feature and the budget both fail at once. It is the same discipline as capacity planning anywhere else, applied to a system whose costs scale per call, and skipping it is how a successful launch becomes an expensive emergency the week it gets popular.

The cost flip: inference is the bill now.

Traditional machine learning front-loaded its cost: heavy training upfront, then cheap inference on commodity hardware. Language-model features invert that. There is little upfront training, and instead a per-request inference cost that recurs on every single call and grows directly with usage, which means a feature's economics are decided not at build time but every day it runs. A business case built on the pilot's cost is built on the cheapest the feature will ever be.

Controlling that ongoing cost is therefore central rather than incidental. Caching removes repeat calls, routing sends easy work to cheap models, and at sufficient volume self-hosting an open-weight model on EU infrastructure beats paying per call to a provider, the crossover our cloud cost optimization work models for any heavy inference workload. We design the integration so the per-request economics are understood and managed from the start, because an AI feature that works technically and loses money on every call is a failure with good reviews. The question is never just whether it works, but whether it works at a cost the business can carry at scale.
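The crossover between paying per call and self-hosting can be estimated with simple arithmetic. The sketch below is a simplification under loud assumptions: it treats self-hosting as a flat monthly cost with no per-request component, ignores operations time and capacity limits, and every number in the usage note is hypothetical rather than a real price.

```python
def monthly_api_cost(
    requests: int, tokens_per_request: int, price_per_million_tokens: float
) -> float:
    """Pay-per-call cost for one month of traffic."""
    return requests * tokens_per_request * price_per_million_tokens / 1_000_000


def break_even_requests(
    fixed_monthly_cost: float, tokens_per_request: int, price_per_million_tokens: float
) -> float:
    """Monthly request volume at which a flat self-hosting cost equals the API bill."""
    per_request = tokens_per_request * price_per_million_tokens / 1_000_000
    return fixed_monthly_cost / per_request
```

For example, at an assumed 2,000 tokens per request and 5.00 per million tokens, each call costs about 0.01, so an assumed self-hosting cost of 3,000 per month breaks even at roughly 300,000 requests a month. Below that volume the per-call provider is cheaper; above it the flat cost wins, which is why the answer depends on traffic rather than on the pilot.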

Context and governance: the real production blocker.

Even teams with a well-built operational stack are finding their AI stalls, and the reason is no longer orchestration. It is context and governance: the gateway logs everything but cannot connect those logs to who is accountable, the retrieval is wrong because the data feeding it was never curated, the model information lives in a slide deck nobody can query. The missing layer is the connection between the running system and the organisation's knowledge and rules, and it is where the more advanced projects now get stuck.

We treat that layer as part of the integration rather than someone else's problem. The traces connect to ownership, the retrieval is grounded in governed, current data through our knowledge systems work, and the whole feature is built so its behaviour can be explained to the people accountable for it. An AI feature that runs but cannot be governed is one a regulated business cannot keep, regardless of how well the model performs, because the question that stops a deployment is rarely "does it work?" and usually "can you stand behind it?". Getting the context and governance right is what moves a feature from technically live to genuinely deployable.

Sovereign by default, decided in the routing.

Every request to a hosted model sends your data to wherever that model runs, and at production volume that is a continuous flow of prompts and outputs crossing a jurisdiction on every call. The gateway is where that becomes a decision rather than an accident: sensitive requests routed to open-weight models hosted inside the EU, residency enforced in the guardrails, traces kept in-region, while genuinely non-sensitive traffic can still use a hyperscaler model where that is the right tool. The placement is set once, in one layer, rather than scattered through code that each engineer wires differently.

This also produces the audit trail the AI Act expects of higher-risk systems almost as a by-product, because every request, its route and its output pass through a layer that records them. We default to the EU-hosted, open-weight option for the sensitive path, both for residency and because it removes the dependence on a provider that can deprecate or reprice a model underneath a live feature. The frontier hosted models remain available through the same gateway for the work that genuinely needs them, so sovereignty is the default rather than a constraint, and the exceptions are deliberate. The classification of what counts as sensitive follows our compliance work.

Who needs this, and when is a prototype enough?

This work is for AI that has to be relied on: a customer-facing feature, anything touching regulated data, a feature whose cost matters at scale, or any AI that is becoming part of how the business runs rather than an experiment on the side. The signal is the moment a prototype is asked to carry real weight, because that is exactly when the absence of the operational layers stops being hypothetical and starts being the reason it fails: the cost runs away, a provider outage takes it down, an unguarded output reaches a customer.

Where the AI is genuinely low-stakes, we will say so. An internal tool used by a handful of people, a throwaway experiment, a feature with no sensitive data and no real cost exposure may not warrant the full production stack, and building it would be cost without payoff. The honest test is what breaks if this fails and what it costs if it scales; where the answers are little and little, a simpler build is right, and where they are not, the production discipline is what stands between a promising pilot and the large majority of AI features that never reliably ship. Show us the pilot that will not ship, and we will tell you which of these layers it is missing.

How does it start?

Most engagements begin with a prototype that works in the demo and stalls on the way to production, and the first job is to find which of the layers it is missing. We take the working pilot, put it under realistic conditions, and surface where it breaks: the cost that spirals under traffic, the absent guardrails, the lack of a fallback, the prompt nobody versioned, the quality nobody is measuring. That diagnosis is usually quick, because the gaps follow a familiar pattern across the projects that fail.

From there we build the missing layers in priority order: the gateway and cost control, the guardrails, the versioning and evaluation, the observability and fallback. Each is added and tested rather than assembled in one pass, with the EU routing in place for the sensitive path. The pilot becomes a feature your systems can lean on, operated rather than merely launched. The aim is the unglamorous outcome that most AI projects miss: not a more impressive demo, but the same feature still working, affordably and safely, the day after real users arrive and the month after that.

Questions buyers ask.

What is production AI integration?

It is the engineering that turns a working AI prototype into a feature your systems can rely on: routing across models, guardrails, versioning, evaluation, observability and fallback. The practice is often called LLMOps, and it is what determines whether an AI feature keeps working once real users hit it.

Why does a working pilot fail in production?

Because a production AI feature is a compound system, not a single prompt. Under real traffic the inference cost spirals, latency rises, and a provider outage with no fallback takes the feature down. The 2026 deployment stack has six layers precisely because each one prevents a failure that demos never reveal.

What is an LLM gateway?

A layer that sits between your application and the models, handling routing, authentication, rate limiting, cost tracking, caching, fallback and guardrails so your application code does not have to. It centralises the operational concerns of running AI and gives you one place to control cost, resilience and where requests are sent.

How do you keep AI cost and latency under control?

Through the gateway: routing each request to the right-sized model, semantic caching to avoid repeat calls, and spend ceilings with alerts when usage crosses your baseline. We load test to find the breaking point ahead of time, so peak traffic is a planned case rather than an outage.

How do you stop a change from breaking the feature?

Prompts, retrieval settings and guardrail configs are versioned and rolled out per user with automatic rollback, and an evaluation suite in your pipeline blocks a deploy when scores regress. A change becomes something you can measure and revert in minutes rather than a release you hope holds.

Can sensitive requests stay inside the EU?

Yes. The gateway routes sensitive requests to open-weight models hosted in the EU and enforces residency rules in the guardrails, while less sensitive traffic can use a hyperscaler model where that is appropriate. The placement decision sits in one controllable layer, with the traces kept in-region.

What is LLMOps?

The production discipline of running language models in live systems: versioning prompts and configs, evaluation, guardrails, observability, routing, caching and cost control. It is not a separate track from AI engineering but the part that decides whether a feature built by capable engineers genuinely holds up at scale. Treating it as an optional add-on is how projects stall after a good pilot.

Why do production AI costs surprise people?

Because the cost structure is inverted from traditional machine learning. There is little upfront training and instead a per-request inference cost that recurs on every call and grows with usage, so the bill is decided every day the feature runs rather than at build time. A business case built on the pilot's cost rests on the cheapest the feature will ever be.

How does semantic caching save money?

By recognising that differently-worded questions are the same question, so "what are your hours" and "when are you open" return one cached answer rather than two API calls. On a high-repetition workload that removes a large slice of inference cost. Combined with routing easy requests to small models and only hard ones to the frontier, spend tracks the difficulty of the work rather than defaulting to the priciest option.

How do you test something with no single right answer?

With fuzzy, semantic evaluation rather than exact-match tests. An automated suite scores groundedness, relevance and safety on every change, combining a model-as-judge with deterministic checks like format and citation presence and a rotating sample of human spot-checks, wired into the pipeline to block a deploy when scores regress. Evaluation is the only reliable early-warning system for quality drift.

What happens if the AI provider has an outage?

The gateway falls back to a secondary model, and where a full answer is not possible the feature degrades to a reduced one rather than an error. We build and test the fallback chain so a provider's bad day is a quieter feature rather than a dead one, and running open-weight models in the EU as part of the chain means there is a fallback you operate rather than only rent.

Do we need all of this for an internal tool?

Often not. A low-stakes internal tool with no sensitive data and little cost exposure may not warrant the full production stack, and building it would be cost without payoff. The honest test is what breaks if it fails and what it costs if it scales; where both answers are small, a simpler build is right, and we will tell you so rather than over-engineer.

Why do well-instrumented AI projects still stall?

Because the blocker is no longer orchestration but context and governance: logs that cannot be connected to who is accountable, retrieval grounded in uncurated data, model information nobody can query. We treat that layer as part of the integration, connecting traces to ownership and grounding retrieval in governed data, because a feature that runs but cannot be governed is one a regulated business cannot keep.

Why treat prompts and configs as versioned code?

Because a prompt is logic that decides how the feature behaves, and a careless edit can break it like any bad code change. Versioning prompts, retrieval settings and guardrail configs makes a change testable, traceable and reversible, rolled out to a fraction of traffic first with automatic rollback. The immutable record also ties every output to the exact configuration that produced it, which is what an audit needs.

Production review

Show us the pilot that won't ship. We'll tell you what's missing.

Bring the AI feature stuck before production. We map it against the six layers, show where it would break under real traffic and cost, and lay out what getting it live takes, before you commit to anything.

Book a production review Back to AI →

◆ Built to survive real traffic ◆ Sensitive work routed to the EU ◆ One named operator, answerable