Observability and sovereign AI, run from one place inside the EU.
Most AI in production runs without anyone watching how it behaves. We monitor your networks, servers and AI agents from one place, and host private models inside the EU, so the systems making decisions stay visible and stay yours.
AI observability is the ability to see what a model-based system is doing and explain why: tracing every model call, tool use and retrieval step, then scoring whether the output was actually right. It matters because traditional monitoring assumes the same input gives the same output, and language models break that assumption, so an unwatched AI fails quietly rather than loudly. Argus Root runs the monitoring and can host the models, inside the EU, so the traces, the prompts and the evaluation never leave your jurisdiction.
In short
- Traditional monitoring assumes determinism; LLMs break it, so you have to capture the exact input, model and parameters of each call to reproduce a problem at all.
- The standard is the OpenTelemetry GenAI semantic conventions (vendor-neutral gen_ai.* attributes for model, tokens, latency and cost), so you instrument once and avoid lock-in.
- Cost and latency track tokens, not requests, so AI monitoring has to be token-aware to attribute spend and catch a creeping bill.
- OpenTelemetry captures traces, metrics and logs but not output quality; a separate evaluation layer has to score faithfulness, safety and policy compliance.
- Prompts carry PII, so the Collector redacts and filters before telemetry leaves your network — an observability control that is also a residency and privacy one.
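To make the token-aware point concrete, here is a minimal Python sketch of attributing spend from token counts rather than request counts. The prices, the record fields and the `call_cost` and `cost_by_tenant` helpers are hypothetical and exist only for illustration; real rates depend on the model and where it is hosted.

```python
from collections import defaultdict

# Hypothetical prices in EUR per 1,000 tokens (an assumption, not a real tariff).
PRICE_PER_1K = {
    "eu/llm-sovereign": {"input": 0.0005, "output": 0.0015},
}

def call_cost(model, input_tokens, output_tokens):
    """Cost of one call, derived from its token counts."""
    price = PRICE_PER_1K[model]
    return (input_tokens / 1000) * price["input"] + (output_tokens / 1000) * price["output"]

def cost_by_tenant(calls):
    """Sum per-call cost for each tenant so spend can be attributed."""
    totals = defaultdict(float)
    for c in calls:
        totals[c["tenant"]] += call_cost(c["model"], c["input_tokens"], c["output_tokens"])
    return dict(totals)

# Two requests from the same tenant: same request count, very different token volume.
calls = [
    {"tenant": "acme", "model": "eu/llm-sovereign", "input_tokens": 1843, "output_tokens": 211},
    {"tenant": "acme", "model": "eu/llm-sovereign", "input_tokens": 12000, "output_tokens": 400},
]
totals = cost_by_tenant(calls)
```

Both calls count as one request each, yet the second costs several times the first, which is the kind of creep a request-based dashboard would not show.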
Most AI in production runs blind.
73% of enterprises now require AI agent monitoring in production, yet 63.4% say they lack the tooling to do it. Traditional monitoring tracks latency and errors. It cannot tell you whether an agent made the right call, which tool it reached for, or why it went wrong.
| Signal | Traditional APM | What an AI system needs |
|---|---|---|
| Latency & errors | Yes | Baseline |
| Model calls & tokens | Partial | Full trace |
| Tool invocations | No | Every call, with arguments |
| Reasoning & decisions | No | Nested spans, plan drift |
| Output quality | No | Scored against expectations |
| Where traces live | Third-party SaaS | In-region / self-hosted |
The trouble is that AI fails in ways traditional monitoring cannot see. Application monitoring tells you a request succeeded, how long it took and what it cost; an AI call can return in fifty milliseconds, report success, and still hand the user an answer that is wrong, unsafe or off-policy. Latency and error rate say nothing about whether the model picked the right tool, retrieved the right document, or reasoned its way to a sensible conclusion. The signal that matters, whether the output was any good, is invisible to the dashboards most teams already run.
Agents make this harder, because they are not deterministic and their failures build across turns rather than in a single call. A conversation drifts from its context, a hallucination compounds on an earlier one, an agent picks the wrong tool and the mistake cascades through the steps that follow. Watching only the final reply hides where the behaviour broke. Seeing it takes trace-level spans across every model call, tool invocation and hand-off, stitched into one execution tree so the step that introduced the problem is visible rather than guessed at.
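As an illustration of what stitching spans into one execution tree means, the sketch below rebuilds a tree from flat span records and points at the earliest failed step. The record shape (`id`, `parent`, `start`, `ok`) and the span names are assumptions made for the example, not the format of any particular exporter, and in practice the `ok` flag would come from errors or evaluation results.

```python
from collections import defaultdict

# Hypothetical flat span records for one agent run.
spans = [
    {"id": "a", "parent": None, "name": "agent.run", "start": 0, "ok": True},
    {"id": "b", "parent": "a", "name": "retrieval.search", "start": 1, "ok": True},
    {"id": "c", "parent": "a", "name": "tool.lookup_order", "start": 2, "ok": False},
    {"id": "d", "parent": "a", "name": "chat eu/llm-sovereign", "start": 3, "ok": True},
]

def build_tree(spans):
    """Group spans by parent id, with children ordered by start time."""
    children = defaultdict(list)
    for s in spans:
        children[s["parent"]].append(s)
    for kids in children.values():
        kids.sort(key=lambda s: s["start"])
    return children

def render(children, parent=None, depth=0):
    """Return the tree as indented lines, marking failed steps."""
    lines = []
    for s in children.get(parent, []):
        flag = "" if s["ok"] else "  <- failed"
        lines.append("  " * depth + s["name"] + flag)
        lines.extend(render(children, s["id"], depth + 1))
    return lines

def first_failure(spans):
    """Earliest failed span by start time, or None if every step succeeded."""
    failed = [s for s in spans if not s["ok"]]
    return min(failed, key=lambda s: s["start"]) if failed else None

tree = build_tree(spans)
print("\n".join(render(tree)))
```

Looking only at the final model call would show a successful span; the tree shows that the tool call before it is where the run went wrong.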
On top of the trace sits the part that separates monitoring from observability: evaluation. Each output is scored for faithfulness to its sources, hallucination, safety and relevance, so quality becomes a number you can watch and alert on rather than a complaint you hear from a user later. Drift detection then tracks those scores across prompt versions, model updates and user segments, so a quality regression after a model change shows up at its source instead of as a slow, unexplained decline in how well the system works.
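Drift detection can be sketched in a few lines: group evaluation scores by model version and flag a version whose mean falls too far below the baseline. The records, the version labels and the `max_drop` tolerance of 0.03 are hypothetical; a real setup would choose the metric and tolerance per use case and would usually need far more samples.

```python
from statistics import mean

# Hypothetical evaluation records: one faithfulness score per traced output.
records = [
    {"model_version": "v1", "faithfulness": 0.96},
    {"model_version": "v1", "faithfulness": 0.94},
    {"model_version": "v1", "faithfulness": 0.95},
    {"model_version": "v2", "faithfulness": 0.88},
    {"model_version": "v2", "faithfulness": 0.86},
    {"model_version": "v2", "faithfulness": 0.90},
]

def mean_score(records, version, metric="faithfulness"):
    """Average score for one model version."""
    return mean(r[metric] for r in records if r["model_version"] == version)

def regressed(records, baseline, candidate, max_drop=0.03):
    """True when the candidate's mean falls more than max_drop below the baseline's."""
    return mean_score(records, baseline) - mean_score(records, candidate) > max_drop

alert = regressed(records, "v1", "v2")
```

The same comparison works across prompt versions or user segments by swapping the grouping key.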
Sovereign models, traces that stay in-region.
Sending production traces and prompts to a third-party AI service can put you on the wrong side of your own data-governance rules. For teams under GDPR, NIS2 or the AI Act, self-hosted observability and EU-hosted models are often the only setup that holds up in a review.
Europe is building its own AI capacity, from the Franco-German sovereign initiative with Mistral and SAP to national AI units. Your stack can follow the same principle at your scale: open-weight models on infrastructure we run inside the EU, with prompts, outputs and traces that never leave the jurisdiction.
The sovereignty problem is sharper than it first appears, because observability itself moves your most sensitive data. Shipping prompts, outputs and logs to a US-headquartered monitoring service such as Datadog, Splunk or New Relic sends the content of your AI system, and often personal data inside it, to a provider reachable under US law. For a regulated workload that quietly undoes the in-region story the rest of the stack tells, in the same way a foreign cloud underneath you would.
Self-hosted observability avoids it. Open standards such as OpenTelemetry keep the instrumentation portable rather than tied to one vendor, so the traces can land in a store we run inside the EU instead of a foreign platform, paired with open-weight models hosted on the same in-region infrastructure. The prompts, the outputs, the traces and the evaluation records all stay in your jurisdiction, which is the setup that survives a GDPR, NIS2 or AI Act review rather than complicating it.
| | US observability SaaS | Self-hosted, in the EU |
|---|---|---|
| Traces & prompts | On a US provider's cloud | Inside the EU, on our metal |
| CLOUD Act exposure | Through the provider | None |
| Lock-in | Vendor-specific agents | OpenTelemetry-portable |
| Data-volume cost | Priced per GB ingested | Capacity we run for you |
| AI Act log custody | A third party's | Yours, in-region |
What do we run for you?
Infrastructure and AI watched together, on tooling you can keep in-house.
Unified monitoring
Networks, servers and AI agents in one view, so an incident in any layer surfaces in the same place rather than three disconnected dashboards.
Host intrusion detection
Wazuh HIDS across your fleet: file integrity, log analysis and active response, with geo-blocking and rate limiting at the edge.
AI agent tracing & evaluation
Every model call, tool invocation and decision captured as nested spans, with outputs scored so you see quality drift before users do.
Sovereign LLM hosting
Open-weight models hosted on infrastructure we run inside the EU, so prompts and outputs stay in your jurisdiction.
Alerting & response
Thresholds tuned to your baseline, routed to where your team already works, so signal reaches a person and noise does not.
AI Act logging & evidence
The traceability and record-keeping high-risk AI systems need, produced as a by-product of the monitoring rather than a separate scramble.
Argus had a hundred eyes.
We run this for ourselves before we run it for you: Wazuh across our own fleet, open-weight models hosted on our own hardware, and monitoring that watches the watchers. The name is the promise. Argus Panoptes never closed every eye at once, and neither does the monitoring we put behind your systems. Because the same OpenTelemetry pipeline carries both your infrastructure logs and your model traces, one incident review can follow a single request from the load balancer through the retrieval step to the model's answer, instead of stitching three disconnected tools together by hand at the worst possible moment.
```yaml
# one LLM call, instrumented to the OTel GenAI conventions
span: chat eu/llm-sovereign
attributes:
  gen_ai.system: argus-gateway
  gen_ai.request.model: eu/llm-sovereign   # in-EU routing
  gen_ai.usage.input_tokens: 1843
  gen_ai.usage.output_tokens: 211
  gen_ai.response.latency_ms: 920
  gen_ai.response.finish_reason: stop
  app.tenant: acme
  app.data_residency: eu
  eval.faithfulness: 0.97   # scored by the eval layer
# prompt text is a span EVENT, so the Collector can redact/drop it
```
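The span example notes that prompt text travels as a span event so it can be redacted or dropped before it leaves the network. The Python sketch below shows that logic on a plain dictionary. It is not Collector configuration, and the event name `prompt`, the `app.user` attribute and the email pattern are assumptions chosen for illustration.

```python
import re

# Deliberately simple email pattern; real PII detection needs broader rules.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def redact_span(span, drop_events=("prompt",)):
    """Return a copy of the span with named events dropped and emails masked in string attributes."""
    events = [e for e in span.get("events", []) if e["name"] not in drop_events]
    attributes = {
        k: EMAIL.sub("[redacted-email]", v) if isinstance(v, str) else v
        for k, v in span.get("attributes", {}).items()
    }
    return {**span, "events": events, "attributes": attributes}

# Hypothetical span carrying personal data in an attribute and in the prompt event.
span = {
    "name": "chat eu/llm-sovereign",
    "attributes": {
        "app.tenant": "acme",
        "app.user": "jane.doe@example.com",
        "gen_ai.usage.input_tokens": 1843,
    },
    "events": [{"name": "prompt", "body": "My email is jane.doe@example.com"}],
}
clean = redact_span(span)
```

Token counts and tenant labels survive, so cost and latency analysis still works on the redacted record.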
One view over the infrastructure and the AI on top of it.
An AI failure rarely respects the boundary between the model and the machine it runs on. A spike in latency might be the model, or it might be the server starving for memory underneath it; a tool call that times out might be the agent's logic, or a network path that degraded. When the AI traces sit in one product and the infrastructure metrics in another, an incident becomes a hunt across disconnected dashboards while the cause hides in the gap between them.
We keep both in one picture. The server, network and host signals sit alongside the agent traces and their evaluation scores, correlated on a common timeline and built on OpenTelemetry so the instrumentation is consistent as your stack changes. When something breaks, the trail runs from the user-facing symptom down to the layer that caused it without a handoff between tools, which is the difference between an incident closed in minutes and one chased for an afternoon.
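Correlating on a common timeline can be as simple as asking which host samples overlap a slow span. The sketch below assumes a shared clock, a `host` label on both the span and the metrics, and a 90% memory cut-off. All of these, along with the sample data, are hypothetical.

```python
# Hypothetical data; timestamps are seconds on a shared clock.
slow_span = {"name": "chat eu/llm-sovereign", "host": "gpu-01", "start": 100.0, "end": 104.5}
host_metrics = [
    {"host": "gpu-01", "ts": 95.0, "mem_used_pct": 71.0},
    {"host": "gpu-01", "ts": 101.0, "mem_used_pct": 97.5},
    {"host": "gpu-02", "ts": 101.0, "mem_used_pct": 98.0},
    {"host": "gpu-01", "ts": 110.0, "mem_used_pct": 70.0},
]

def samples_during(span, metrics, slack=2.0):
    """Host samples from the span's own host that fall within the span window, plus some slack."""
    lo, hi = span["start"] - slack, span["end"] + slack
    return [m for m in metrics if m["host"] == span["host"] and lo <= m["ts"] <= hi]

# Samples that overlap the slow call and show memory pressure.
suspects = [m for m in samples_during(slow_span, host_metrics) if m["mem_used_pct"] > 90]
```

Here the only suspect is the memory spike on the host that served the call, which suggests the machine rather than the model.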
The same telemetry is your security and your evidence.
The signals that tell you whether a system is healthy are the same ones that tell you whether it is under attack, and the same ones an auditor asks to see. Host intrusion detection through Wazuh watches file integrity, log activity and process behaviour across the fleet, with geo-blocking and rate limiting at the edge, so the monitoring is also the security operation that NIS2 and DORA expect to find running rather than described in a policy. The detail of the security side lives in our vulnerability management and server management work.
For AI, the traces do double duty as compliance evidence. High-risk systems under the AI Act carry logging and traceability duties, and the record of what a model was asked, what it answered and how it scored is exactly what those duties require. Produced as a by-product of the observability rather than a separate exercise, that record is current when an auditor asks for it instead of reconstructed under pressure, which ties this pillar directly into the compliance and sovereignty work.
What does running AI unwatched cost?
Unwatched AI fails quietly, which is what makes it expensive. A model update ships and a subtle drop in answer quality goes unnoticed for weeks because nothing in the dashboards measures quality; a retrieval step starts pulling stale documents and the system keeps returning confident, wrong answers; an agent's tool-selection degrades and users get worse help without anyone seeing a single error in the logs. By the time the complaints arrive, the damage to trust has already been done and the cause is buried in traces nobody captured.
Cost is the other blind spot. A fast, cheap response that hallucinates is more expensive than a slower, accurate one, yet infrastructure monitoring cannot make that connection because it sees the tokens and the latency but not the quality. Watching cost and quality together is what turns AI from a line item that only grows into one you can reason about: which steps are expensive, which are worth it, and where a cheaper model would do the job without a drop in the scores that matter.
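One rough way to put cost and quality in the same number is cost per answer that passes evaluation. The figures below are invented, and the measure counts only the spend wasted on failed answers, not the downstream damage a wrong answer does.

```python
def cost_per_good_answer(cost_per_call, pass_rate):
    """Expected spend for each answer that passes evaluation."""
    if pass_rate <= 0:
        raise ValueError("pass_rate must be positive")
    return cost_per_call / pass_rate

# Hypothetical figures: a cheaper model that often fails versus a dearer one that rarely does.
cheap = cost_per_good_answer(cost_per_call=0.002, pass_rate=0.60)
accurate = cost_per_good_answer(cost_per_call=0.003, pass_rate=0.95)
```

With these numbers the model that is dearer per call is cheaper per good answer. With a higher pass rate on the cheap model the comparison flips, and that is the case where switching to it makes sense.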
Who needs this?
Teams running agents or retrieval systems in production are the clearest case: once an AI is making decisions for real users, the gap between shipping it and watching it becomes a standing risk rather than a detail. Regulated organisations under the AI Act, NIS2 or DORA need the logging and traceability as an obligation, not a nicety. Companies already paying a US observability SaaS, and uneasy about the prompts and personal data it carries abroad, reach this when sovereignty stops being theoretical. And operations teams drowning in disconnected dashboards come to it for the single picture rather than the AI angle.
The common thread is that the AI or the infrastructure has become important enough that not seeing it clearly is a liability. For a prototype or an internal experiment, basic logging is fine, and we will say so. The teams that benefit are the ones for whom a silent failure reaches a customer, a regulator or a budget, and who would rather catch it in a trace than in a complaint.
How does an engagement start?
We begin by instrumenting what you already run rather than asking you to rebuild it. Using OpenTelemetry, the existing services, agents and infrastructure are wired to emit traces, metrics and logs into a store we operate inside the EU, with the AI calls captured as nested spans and the host signals gathered through Wazuh. Nothing about your application has to change for the picture to appear; the instrumentation sits around it and reads what it is doing.
From there the evaluation layer is shaped to your system. We define what a good output looks like for your use case, set the scores that matter (faithfulness, hallucination, safety, relevance), and tune the alert thresholds to your baseline so the signals reflect your reality rather than a generic default. The result is a view that is live within days for the technical layers and refined over the following weeks as the evaluation criteria settle against real traffic, rather than a six-month integration project before anything is visible.
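Tuning a threshold to your baseline can be sketched as deriving the alert floor from scores observed on real traffic rather than picking a fixed number. The baseline values and the choice of three standard deviations are assumptions made for the example; other rules, such as percentiles, are equally reasonable.

```python
from statistics import mean, stdev

def baseline_threshold(scores, k=3.0):
    """Alert floor set k sample standard deviations below the baseline mean."""
    return mean(scores) - k * stdev(scores)

def should_alert(score, floor):
    """True when a new score falls below the floor derived from the baseline."""
    return score < floor

# Hypothetical faithfulness scores collected during a quiet baseline period.
baseline = [0.95, 0.96, 0.94, 0.97, 0.95, 0.96, 0.94]
floor = baseline_threshold(baseline)
```

A system whose scores normally vary more would get a wider band from the same function, so the alert would track its own behaviour instead of a generic default.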
What you get once it is running.
In steady state, the value shows up as problems caught before anyone outside the team notices them. A quality score that slips after a model update raises an alert at the source rather than surfacing weeks later as a vague sense that the product got worse. A retrieval step that starts pulling weak context is visible in the traces before users complain about wrong answers. An infrastructure fault under an AI service is correlated to the symptom it caused rather than investigated separately. The recurring fire drills that come with running AI blind become routine signals handled early.
The same record answers the questions that arrive from outside engineering. When a regulator or a customer's auditor asks how a high-risk system behaves and what it was asked, the logs and traces are already there. When finance asks why the AI bill moved, the cost is broken down by step and tied to the quality it bought. The observability stops being a dashboard nobody has time to read and becomes the place the team, the auditor and the budget owner all get a straight answer.
We watch what you build, wherever it runs.
Seeing AI clearly is a different job from building it. If you need the agents, retrieval systems or production integrations themselves designed and stood up, that is our AI work, together with the production AI integration that puts a model safely behind real traffic. This pillar sits across whatever has been built, by us or by your own engineers, and reads how it behaves once it is live.
That independence is deliberate. We can instrument an AI system we had no hand in building, one assembled from open-weight models on our infrastructure, or one calling external services, and give each the same trace-level view and evaluation scoring. The observability is tied to no particular framework or builder, because OpenTelemetry keeps the instrumentation neutral. You get a clear view of the system you have in front of you, rather than one that only works if we wrote it for you.
What about the rest of the stack — Prometheus, Grafana, Loki?
The same pipeline runs them. The AI tracing is the demanding end of the work, but the foundation underneath is the standard open observability stack, and we run all of it: Prometheus for metrics, Grafana for dashboards and alerting, Loki for logs, and OpenTelemetry tying the traces together — open tools, hosted inside the EU on infrastructure we operate, with no per-host or per-gigabyte meter ticking against you. That last point matters more than it sounds, because SaaS observability platforms have a habit of billing aggressively for ingest and host count, to the point where the bill for watching the system can rival the bill for running it. Built on open components we host, the observability stops being a line item that grows with your success.
Two shifts in 2026 make the open stack the better foundation rather than just the cheaper one. The first is that observability is converging with cost: the most useful dashboards now put performance next to spend, so a slow query or an oversized service is seen against what it is costing, which is the same picture our FinOps practice works from. The second is that observability is the bedrock of site reliability engineering — service-level objectives and error budgets are meaningless without the metrics and traces to measure them, so the instrumentation we lay down is also what makes the reliability work in our platform engineering possible. One open, EU-hosted stack serves the AI, the infrastructure, the cost view and the reliability practice at once, rather than four tools and four bills.
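The dependence of error budgets on measured data shows up in the arithmetic itself. The sketch below uses an invented 99.9% objective and invented request counts. The `error_budget` helper is hypothetical, and both of its count inputs have to come from the metrics and traces.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Allowed failures for an SLO, and how much of that allowance has been used."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    allowed = (1.0 - slo_target) * total_requests
    return {
        "allowed_failures": allowed,
        "remaining": allowed - failed_requests,
        "consumed_fraction": failed_requests / allowed,
    }

# Hypothetical month: 1,000,000 requests against a 99.9% success objective.
budget = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
```

A 99.9% objective over a million requests allows about 1,000 failures, so 400 failures use roughly 40% of the budget. Without reliable counts of total and failed requests, neither number can be computed.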
Questions buyers ask.
What is AI or LLM observability?
Why isn't traditional monitoring enough for AI?
Can I keep AI traces and data inside the EU?
What is sovereign LLM hosting?
Do you monitor infrastructure as well as AI?
How does this connect to the AI Act?
What is the difference between LLM monitoring and observability?
What does OpenTelemetry have to do with it?
What exactly gets scored in an AI evaluation?
Can you trace multi-step agents across a whole conversation?
Why not simply use Datadog or Splunk?
Which models do you host?
How do you keep alerts from becoming noise?
Do you observe RAG and retrieval systems specifically?
Does the instrumentation slow our systems down?
How does this connect to a SOC or SIEM?
Can you observe AI that calls third-party APIs like OpenAI or Anthropic?
Does this replace our existing APM?
Can we start with only the AI part, or only the infrastructure?
Which observability tools do you use?
Point us at your stack. We'll show you what you can't see.
Tell us what you run and where your AI sits. We map the blind spots across infrastructure and agents, and show what it takes to keep the traces in the EU. If a fix is yours to make in-house, we will say so.