Pillar · Observability & AI

Observability and sovereign AI, run from one place inside the EU.

Most AI in production runs without anyone watching how it behaves. We monitor your networks, servers and AI agents from one place, and host private models inside the EU, so the systems making decisions stay visible and stay yours.

Book an observability review · See how we run it →

◆ Network · server · AI-agent monitoring ◆ Wazuh HIDS ◆ Sovereign LLM hosting

AI observability is the ability to see what a model-based system is doing and explain why: tracing every model call, tool use and retrieval step, then scoring whether the output was actually right. It matters because traditional monitoring assumes the same input gives the same output, and language models break that assumption, so an unwatched AI fails quietly rather than loudly. Argus Root runs the monitoring and can host the models, inside the EU, so the traces, the prompts and the evaluation never leave your jurisdiction.

In short

Traditional monitoring assumes determinism; LLMs break it, so you have to capture the exact input, model and parameters of each call to reproduce a problem at all.
The standard is the OpenTelemetry GenAI semantic conventions: vendor-neutral gen_ai.* attributes for model, token usage and finish reason, with latency read from the span duration and cost derived from the token counts, so you instrument once and avoid lock-in.
Cost and latency track tokens, not requests, so AI monitoring has to be token-aware to attribute spend and catch a creeping bill.
OpenTelemetry captures traces, metrics and logs but not output quality; a separate evaluation layer has to score faithfulness, safety and policy compliance.
Prompts carry PII, so the Collector redacts and filters before telemetry leaves your network — an observability control that is also a residency and privacy one.
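To make the token-aware point concrete, here is a minimal sketch of attributing spend per step from token counts rather than request counts. The prices, the step names and the `Call` record are illustrative assumptions, not real tariffs and not part of any standard.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical prices in EUR per 1,000 tokens; real figures depend on model and host.
PRICE_PER_1K = {"input": 0.002, "output": 0.006}

@dataclass
class Call:
    step: str            # illustrative step name, e.g. "answer" or "summarise"
    input_tokens: int
    output_tokens: int

def call_cost(call: Call) -> float:
    """Cost of one model call, computed from its token counts."""
    return (call.input_tokens * PRICE_PER_1K["input"]
            + call.output_tokens * PRICE_PER_1K["output"]) / 1000

def cost_by_step(calls):
    """Sum the cost of all calls, grouped by the step that made them."""
    totals = defaultdict(float)
    for call in calls:
        totals[call.step] += call_cost(call)
    return dict(totals)

calls = [
    Call("answer", 1843, 211),
    Call("summarise", 12000, 400),
    Call("answer", 900, 150),
]
spend = cost_by_step(calls)
```

Here one "summarise" call costs more than two "answer" calls combined, which a per-request count would never show.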

Most AI in production runs blind.

73% of enterprises now require AI agent monitoring in production, yet 63.4% say they lack the tooling to do it. Traditional monitoring tracks latency and errors. It cannot tell you whether an agent made the right call, which tool it reached for, or why it went wrong.

what monitoring sees · APM vs AI systems

What traditional application monitoring captures versus what an AI system needs.
Signal	Traditional APM	What an AI system needs
Latency & errors	Yes	Baseline
Model calls & tokens	Partial	Full trace
Tool invocations	No	Every call, with arguments
Reasoning & decisions	No	Nested spans, plan drift
Output quality	No	Scored against expectations
Where traces live	Third-party SaaS	In-region / self-hosted

The trouble is that AI fails in ways traditional monitoring cannot see. Application monitoring tells you a request succeeded, how long it took and what it cost; an AI call can return in fifty milliseconds, report success, and still hand the user an answer that is wrong, unsafe or off-policy. Latency and error rate say nothing about whether the model picked the right tool, retrieved the right document, or reasoned its way to a sensible conclusion. The signal that matters, whether the output was any good, is invisible to the dashboards most teams already run.

Agents make this harder, because they are not deterministic and their failures build across turns rather than in a single call. A conversation drifts from its context, a hallucination compounds on an earlier one, an agent picks the wrong tool and the mistake cascades through the steps that follow. Watching only the final reply hides where the behaviour broke. Seeing it takes trace-level spans across every model call, tool invocation and hand-off, stitched into one execution tree so the step that introduced the problem is visible rather than guessed at.
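The execution tree described above can be sketched in a few lines. This is an illustration under stated assumptions, not our production code: the `Span` record, the span names and the `ok` flag (which in practice might come from an error status or an evaluation score) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    name: str
    start: float                      # seconds since the trace began
    ok: bool = True                   # hypothetical flag: did this step succeed?
    children: list = field(default_factory=list)

def build_tree(spans):
    """Link each span to its parent and return the root spans."""
    by_id = {s.span_id: s for s in spans}
    roots = []
    for s in sorted(spans, key=lambda s: s.start):
        parent = by_id.get(s.parent_id) if s.parent_id else None
        if parent is None:
            roots.append(s)
        else:
            parent.children.append(s)   # children end up in start order
    return roots

def first_failure(span, path=()):
    """Depth-first, in start order: path to the first and deepest failing span."""
    path = path + (span.name,)
    for child in span.children:
        found = first_failure(child, path)
        if found:
            return found
    return path if not span.ok else None

spans = [
    Span("1", None, "agent run", 0.0, ok=False),
    Span("2", "1", "chat plan", 0.1),
    Span("3", "1", "tool search_docs", 0.5, ok=False),
    Span("4", "1", "chat answer", 1.2, ok=False),
]
roots = build_tree(spans)
culprit = first_failure(roots[0])
```

The final answer failed too, but the walk stops at the tool call that failed first, which is the step worth investigating.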

On top of the trace sits the part that separates monitoring from observability: evaluation. Each output is scored for faithfulness to its sources, hallucination, safety and relevance, so quality becomes a number you can watch and alert on rather than a complaint you hear from a user later. Drift detection then tracks those scores across prompt versions, model updates and user segments, so a quality regression after a model change shows up at its source instead of as a slow, unexplained decline in how well the system works.
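As a rough sketch of drift detection, the comparison can be as simple as the mean score before and after a change. The version labels, the scores and the 0.05 tolerance below are made-up values for illustration; a real setup would likely use more samples and a statistical test.

```python
from statistics import mean

def score_drift(scores_by_version, baseline, candidate, max_drop=0.05):
    """Compare mean evaluation scores between two versions and flag a regression."""
    base = mean(scores_by_version[baseline])
    cand = mean(scores_by_version[candidate])
    drop = base - cand
    return {
        "baseline": base,
        "candidate": cand,
        "drop": drop,
        "regressed": drop > max_drop,   # alert only when the drop exceeds the tolerance
    }

# Hypothetical faithfulness scores collected per model version.
faithfulness = {
    "model-v1": [0.96, 0.97, 0.95, 0.98],
    "model-v2": [0.90, 0.88, 0.91, 0.87],
}
report = score_drift(faithfulness, "model-v1", "model-v2")
```

With these figures the mean falls from 0.965 to 0.89, so the report marks the change as a regression at the point it was introduced.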

Sovereign models, traces that stay in-region.

Sending production traces and prompts to a third-party AI service can put you on the wrong side of your own data-governance rules. For teams under GDPR, NIS2 or the AI Act, self-hosted observability and EU-hosted models are often the only setup that holds up in a review.

Europe is building its own AI capacity, from the Franco-German sovereign initiative with Mistral and SAP to national AI units. Your stack can follow the same principle at your scale: open-weight models on infrastructure we run inside the EU, with prompts, outputs and traces that never leave the jurisdiction.

The sovereignty problem is sharper than it first appears, because observability itself moves your most sensitive data. Shipping prompts, outputs and logs to a US-headquartered monitoring service such as Datadog, Splunk or New Relic sends the content of your AI system, and often personal data inside it, to a provider reachable under US law. For a regulated workload that quietly undoes the in-region story the rest of the stack tells, in the same way a foreign cloud underneath you would.

Self-hosted observability avoids it. Open standards such as OpenTelemetry keep the instrumentation portable rather than tied to one vendor, so the traces can land in a store we run inside the EU instead of a foreign platform, paired with open-weight models hosted on the same in-region infrastructure. The prompts, the outputs, the traces and the evaluation records all stay in your jurisdiction, which is the setup that survives a GDPR, NIS2 or AI Act review rather than complicating it.

where your AI telemetry lives

A US observability SaaS compared with self-hosted, in-EU observability run by us.
	US observability SaaS	Self-hosted, in the EU
Traces & prompts	On a US provider's cloud	Inside the EU, on our metal
CLOUD Act exposure	Through the provider	None
Lock-in	Vendor-specific agents	OpenTelemetry-portable
Data-volume cost	Priced per GB ingested	Capacity we run for you
AI Act log custody	A third party's	Yours, in-region

What do we run for you?

Infrastructure and AI watched together, on tooling you can keep in-house.

Unified monitoring

Networks, servers and AI agents in one view, so an incident in any layer surfaces in the same place rather than three disconnected dashboards.

Host intrusion detection

Wazuh HIDS across your fleet: file integrity, log analysis and active response, with geo-blocking and rate limiting at the edge.

AI agent tracing & evaluation

Every model call, tool invocation and decision captured as nested spans, with outputs scored so you see quality drift before users do.

Sovereign LLM hosting

Open-weight models hosted on infrastructure we run inside the EU, so prompts and outputs stay in your jurisdiction.

Alerting & response

Thresholds tuned to your baseline, routed to where your team already works, so signal reaches a person and noise does not.

AI Act logging & evidence

The traceability and record-keeping high-risk AI systems need, produced as a by-product of the monitoring rather than a separate scramble.

Argus had a hundred eyes.

We run this for ourselves before we run it for you: Wazuh across our own fleet, open-weight models hosted on our own hardware, and monitoring that watches the watchers. The name is the promise. Argus Panoptes never closed every eye at once, and neither does the monitoring we put behind your systems. Because the same OpenTelemetry pipeline carries both your infrastructure logs and your model traces, one incident review can follow a single request from the load balancer through the retrieval step to the model's answer, instead of stitching three disconnected tools together by hand at the worst possible moment.

The pipeline behind the hundred eyes: every model and tool call emits a standard OpenTelemetry GenAI span, the Collector strips personal data before it leaves your network, and traces, metrics and an evaluation layer land in one view over both the infrastructure and the AI — with the same telemetry doubling as your security evidence.

OpenTelemetry GenAI span — gen_ai.* semantic conventions

# one LLM call, instrumented to the OTel GenAI conventions
span: chat eu/llm-sovereign
duration_ms: 920                       # latency is the span's duration, not a gen_ai.* attribute
attributes:
  gen_ai.operation.name:      chat
  gen_ai.system:              argus-gateway
  gen_ai.request.model:       eu/llm-sovereign   # in-EU routing
  gen_ai.usage.input_tokens:  1843
  gen_ai.usage.output_tokens: 211
  gen_ai.response.finish_reasons: [stop]
  app.tenant:                 acme
  app.data_residency:         eu
  eval.faithfulness:          0.97     # scored by the eval layer
# prompt text is a span EVENT, so the Collector can redact/drop it
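The redaction step can be illustrated in plain Python. In practice this would be Collector configuration rather than application code; the sketch only shows the logic. The span layout as a dict, the event name `prompt` and the two patterns are assumptions for the example, and real PII detection needs far broader coverage.

```python
import re

# Deliberately simple patterns, for illustration only.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IBAN = re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b")

def redact(text):
    """Replace e-mail addresses and IBAN-like strings with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return IBAN.sub("[IBAN]", text)

def scrub_span(span, drop_prompt=False):
    """Return a copy of the span with prompt events redacted or dropped."""
    events = []
    for event in span.get("events", []):
        if event.get("name") == "prompt":        # hypothetical event name
            if drop_prompt:
                continue                         # strictest policy: never export prompts
            event = {**event, "text": redact(event.get("text", ""))}
        events.append(event)
    return {**span, "events": events}

span = {
    "name": "chat eu/llm-sovereign",
    "events": [{"name": "prompt",
                "text": "Mail anna@example.com about DE89370400440532013000"}],
}
clean = scrub_span(span)
```

The function returns a copy, so the original span stays untouched inside the network while only the scrubbed version is exported.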

We operate: Wazuh HIDS · Ollama · Open WebUI · nftables · Fail2Ban · Agent tracing

One view over the infrastructure and the AI on top of it.

An AI failure rarely respects the boundary between the model and the machine it runs on. A spike in latency might be the model, or it might be the server starving for memory underneath it; a tool call that times out might be the agent's logic, or a network path that degraded. When the AI traces sit in one product and the infrastructure metrics in another, an incident becomes a hunt across disconnected dashboards while the cause hides in the gap between them.

We keep both in one picture. The server, network and host signals sit alongside the agent traces and their evaluation scores, correlated on a common timeline and built on OpenTelemetry so the instrumentation is consistent as your stack changes. When something breaks, the trail runs from the user-facing symptom down to the layer that caused it without a hand-off between tools, which is the difference between an incident closed in minutes and one chased for an afternoon.

The same telemetry is your security and your evidence.

The signals that tell you whether a system is healthy are the same ones that tell you whether it is under attack, and the same ones an auditor asks to see. Host intrusion detection through Wazuh watches file integrity, log activity and process behaviour across the fleet, with geo-blocking and rate limiting at the edge, so the monitoring is also the security operation that NIS2 and DORA expect to find running rather than described in a policy. The detail of the security side lives in our vulnerability management and server management work.

For AI, the traces do double duty as compliance evidence. High-risk systems under the AI Act carry logging and traceability duties, and the record of what a model was asked, what it answered and how it scored is exactly what those duties require. Produced as a by-product of the observability rather than a separate exercise, that record is current when an auditor asks for it instead of reconstructed under pressure, which ties this pillar directly into the compliance and sovereignty work.

What does running AI unwatched cost?

Unwatched AI fails quietly, which is what makes it expensive. A model update ships and a subtle drop in answer quality goes unnoticed for weeks because nothing in the dashboards measures quality; a retrieval step starts pulling stale documents and the system keeps returning confident, wrong answers; an agent's tool selection degrades and users get worse help without anyone seeing a single error in the logs. By the time the complaints arrive, the damage to trust has already been done and the cause is buried in traces nobody captured.

Cost is the other blind spot. A fast, cheap response that hallucinates is more expensive than a slower, accurate one, yet infrastructure monitoring cannot make that connection because it sees the tokens and the latency but not the quality. Watching cost and quality together is what turns AI from a line item that only grows into one you can reason about: which steps are expensive, which are worth it, and where a cheaper model would do the job without a drop in the scores that matter.
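One way to put cost and quality on the same axis is cost per acceptable answer. The sketch below uses invented figures and an assumed quality threshold of 0.8; it only shows the arithmetic.

```python
def cost_per_good_answer(total_cost, scores, threshold=0.8):
    """Total spend divided by the number of answers that met the quality bar."""
    good = sum(1 for s in scores if s >= threshold)
    if good == 0:
        return float("inf")     # money spent, nothing usable produced
    return total_cost / good

# Hypothetical batches of ten answers each.
cheap_scores = [0.9, 0.9, 0.9, 0.9, 0.3, 0.2, 0.4, 0.5, 0.1, 0.6]
accurate_scores = [0.9] * 9 + [0.5]

cheap = cost_per_good_answer(1.00, cheap_scores)        # 4 good answers
accurate = cost_per_good_answer(1.50, accurate_scores)  # 9 good answers
```

The cheaper model costs less per request but 0.25 per good answer, against roughly 0.17 for the dearer one, which is the connection infrastructure monitoring alone cannot make.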

Who needs this?

Teams running agents or retrieval systems in production are the clearest case: once an AI is making decisions for real users, the gap between shipping it and watching it becomes a standing risk rather than a detail. Regulated organisations under the AI Act, NIS2 or DORA need the logging and traceability as an obligation, not a nicety. Companies already paying a US observability SaaS, and uneasy about the prompts and personal data it carries abroad, reach this when sovereignty stops being theoretical. And operations teams drowning in disconnected dashboards come to it for the single picture rather than the AI angle.

The common thread is that the AI or the infrastructure has become important enough that not seeing it clearly is a liability. For a prototype or an internal experiment, basic logging is fine, and we will say so. The teams that benefit are the ones for whom a silent failure reaches a customer, a regulator or a budget, and who would rather catch it in a trace than in a complaint.

How does an engagement start?

We begin by instrumenting what you already run rather than asking you to rebuild it. Using OpenTelemetry, the existing services, agents and infrastructure are wired to emit traces, metrics and logs into a store we operate inside the EU, with the AI calls captured as nested spans and the host signals gathered through Wazuh. Nothing about your application has to change for the picture to appear; the instrumentation sits around it and reads what it is doing.

From there the evaluation layer is shaped to your system. We define what a good output looks like for your use case, set the scores that matter (faithfulness, hallucination, safety, relevance) and tune the alert thresholds to your baseline so the signals reflect your reality rather than a generic default. The result is a view that is live within days for the technical layers and refined over the following weeks as the evaluation criteria settle against real traffic, rather than a six-month integration project before anything is visible.

What you get once it is running.

In steady state, the value shows up as problems caught before anyone outside the team notices them. A quality score that slips after a model update raises an alert at the source rather than surfacing weeks later as a vague sense that the product got worse. A retrieval step that starts pulling weak context is visible in the traces before users complain about wrong answers. An infrastructure fault under an AI service is correlated to the symptom it caused rather than investigated separately. The recurring fire drills that come with running AI blind become routine signals handled early.

The same record answers the questions that arrive from outside engineering. When a regulator or a customer's auditor asks how a high-risk system behaves and what it was asked, the logs and traces are already there. When finance asks why the AI bill moved, the cost is broken down by step and tied to the quality it bought. The observability stops being a dashboard nobody has time to read and becomes the place the team, the auditor and the budget owner all get a straight answer.

We watch what you build, wherever it runs.

Seeing AI clearly is a different job from building it. Where you need the agents, retrieval systems or production integrations themselves designed and stood up, that is our AI work, including the production AI integration that puts a model safely behind real traffic. This pillar sits across whatever has been built, by us or by your own engineers, and reads how it behaves once it is live.

That independence is deliberate. We can instrument an AI system we had no hand in building, one assembled from open-weight models on our infrastructure, or one calling external services, and give each the same trace-level view and evaluation scoring. The observability is tied to no particular framework or builder, because OpenTelemetry keeps the instrumentation neutral. You get a clear view of the system you have in front of you, rather than one that only works if we wrote it for you.

What about the rest of the stack — Prometheus, Grafana, Loki?

The same pipeline runs them. The AI tracing is the demanding end of the work, but the foundation underneath is the standard open observability stack, and we run all of it: Prometheus for metrics, Grafana for dashboards and alerting, Loki for logs, and OpenTelemetry tying the traces together — open tools, hosted inside the EU on infrastructure we operate, with no per-host or per-gigabyte meter ticking against you. That last point matters more than it sounds, because SaaS observability platforms have a habit of billing aggressively for ingest and host count, to the point where the bill for watching the system can rival the bill for running it. Built on open components we host, the observability stops being a line item that grows with your success.

Two shifts in 2026 make the open stack the better foundation rather than just the cheaper one. The first is that observability is converging with cost: the most useful dashboards now put performance next to spend, so a slow query or an oversized service is seen against what it is costing, which is the same picture our FinOps practice works from. The second is that observability is the bedrock of site reliability engineering — service-level objectives and error budgets are meaningless without the metrics and traces to measure them, so the instrumentation we lay down is also what makes the reliability work in our platform engineering possible. One open, EU-hosted stack serves the AI, the infrastructure, the cost view and the reliability practice at once, rather than four tools and four bills.
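The dependency of error budgets on measured traffic is easy to show with arithmetic. This is a minimal sketch with invented numbers, assuming a request-based SLO; the function name and the return shape are our own choices for the example.

```python
def error_budget(slo, total_requests, failed_requests):
    """Error budget for a request-based SLO over one measurement window."""
    allowed = (1 - slo) * total_requests          # failures the SLO tolerates
    burned = failed_requests / allowed if allowed else float("inf")
    return {
        "allowed": allowed,
        "remaining": allowed - failed_requests,
        "burned": burned,                         # fraction of the budget used
    }

# Hypothetical window: 99.9% SLO, one million requests, 400 failures.
budget = error_budget(0.999, 1_000_000, 400)
```

With these figures about 1,000 failures are allowed and 40% of the budget is spent. Without the request and failure counts from the metrics pipeline, none of the three numbers can be computed.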

Questions buyers ask.

What is AI or LLM observability?

It is the ability to inspect and explain what an AI system does at runtime: the model calls it makes, the tools it invokes, the data it reaches, the decisions it takes, and whether the output was good enough. It goes beyond logging what happened to scoring whether it was right.

Why isn't traditional monitoring enough for AI?

Application monitoring tracks latency and error rates, but it cannot see whether an agent picked the right tool, why it chose a wrong branch, or whether the answer was correct. Agents need trace-level spans across calls, tools and memory, plus an evaluation layer that scores the output.

Can I keep AI traces and data inside the EU?

Yes. Self-hosted observability paired with EU-hosted models keeps prompts, outputs and traces in-region. This matters when sending that data to a third-party SaaS would breach GDPR, NIS2 or AI Act obligations.

What is sovereign LLM hosting?

Running open-weight models on infrastructure you control inside the EU, so prompts and outputs never leave your jurisdiction or pass through a provider that a foreign government could compel.

Do you monitor infrastructure as well as AI?

Yes. Networks, servers and AI agents are watched from one place, with host intrusion detection through Wazuh and active response at the edge. The infrastructure and the AI running on it share a single picture.

How does this connect to the AI Act?

High-risk AI systems carry logging and traceability duties. The observability we put in place produces that record as it runs, so the evidence an auditor asks for already exists instead of being reconstructed later.

What is the difference between LLM monitoring and observability?

Monitoring tracks known signals such as latency, cost, usage, errors and drift thresholds. Observability adds the harder question of whether the output was any good, scoring faithfulness, hallucination, safety and relevance. A response can be fast, cheap and technically successful while still being wrong, and only the observability layer catches that.

What does OpenTelemetry have to do with it?

OpenTelemetry is the open standard for instrumenting systems so the telemetry is portable rather than locked to one vendor. Building on it means your traces can land in a store we run inside the EU instead of a foreign SaaS, and the instrumentation survives changes to your model or framework without a rewrite.

What exactly gets scored in an AI evaluation?

Depending on the system: faithfulness to the source material, hallucination, answer relevance, safety, and for retrieval systems the relevance and quality of the context fetched. Each output gets a score so quality is a number you can alert on, and drift detection tracks those scores across prompt versions and model updates.

Can you trace multi-step agents across a whole conversation?

Yes, and for agents it is the point. Most real failures emerge across turns rather than in one call: context drift, compounding hallucinations, a wrong tool choice that cascades. We capture nested spans across every call, tool and hand-off, stitched into one execution tree, so you see which step broke rather than only the final reply.

Why not simply use Datadog or Splunk?

They are capable platforms, but they are US-headquartered SaaS, so your prompts, outputs and logs travel to a provider reachable under US law, which is a problem for regulated workloads. They also price by data volume, which grows fast with AI traces. Self-hosted, in-EU observability keeps the data in your jurisdiction and turns the cost into capacity we run rather than a per-gigabyte meter.

Which models do you host?

Open-weight models on infrastructure we operate inside the EU, chosen to fit the task and the hardware rather than a single default. Running open weights on our own metal is what keeps prompts and outputs in-region and free of a provider that a foreign government could compel, which is the sovereign part of sovereign AI.

How do you keep alerts from becoming noise?

Thresholds are tuned to your baseline rather than set to defaults, and alerts route to where your team already works so a real signal reaches a person while routine variation does not. The aim is that an alert means something needs attention, which is the only way alerting stays useful rather than ignored.
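A minimal sketch of what "tuned to your baseline" can mean: the threshold is derived from your own recent measurements instead of a fixed default. The mean-plus-three-standard-deviations rule and the sample latencies are assumptions for the example; real tuning usually also accounts for seasonality and traffic shape.

```python
from statistics import mean, stdev

def baseline_threshold(samples, k=3.0):
    """Alert threshold set k standard deviations above the observed mean."""
    return mean(samples) + k * stdev(samples)

def should_alert(value, samples, k=3.0):
    """True only when the value sits outside normal variation for this system."""
    return value > baseline_threshold(samples, k)

# Hypothetical recent latencies for one model endpoint, in milliseconds.
baseline_ms = [900, 920, 880, 910, 890, 930, 905]

routine = should_alert(950, baseline_ms)    # within normal variation
incident = should_alert(1400, baseline_ms)  # well outside it
```

For this endpoint the threshold lands near 956 ms, so 950 ms stays quiet and 1,400 ms pages someone. A generic 500 ms default would alert on every request.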

Do you observe RAG and retrieval systems specifically?

Yes. Retrieval systems have their own failure modes, so the trace covers both the retrieval and the generation, with scores for context relevance, retrieval quality, faithfulness and answer relevance. That is how you tell a hallucination caused by a weak retrieval step from one caused by the model, which a single quality number would hide.
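The distinction can be sketched as a simple decision over two scores. The 0.7 threshold and the labels are illustrative assumptions; the point is only that two separate scores locate the fault where one combined number would not.

```python
def diagnose(context_relevance, faithfulness, threshold=0.7):
    """Attribute a bad answer to the retrieval step or the generation step."""
    if context_relevance < threshold:
        return "retrieval"     # the fetched context was weak to begin with
    if faithfulness < threshold:
        return "generation"    # context was fine, the answer strayed from it
    return "ok"

# Hypothetical score pairs: (context relevance, faithfulness).
cases = [(0.3, 0.9), (0.9, 0.4), (0.9, 0.9)]
verdicts = [diagnose(c, f) for c, f in cases]
```

Retrieval is checked first on purpose: when the context is weak, the faithfulness score says little about the model itself.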

Does the instrumentation slow our systems down?

The overhead is small and bounded. Telemetry is emitted asynchronously and sampled where volume is high, so the trace capture does not sit on the critical path of a user request. The point of observability is to make the system more reliable, so adding meaningful latency to achieve it would defeat the purpose, and we tune the capture to avoid it.
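Sampling can be sketched as a deterministic decision per trace, so that every span of a kept trace is kept together. The hash-based bucket below is an illustration of the idea and an assumption of ours, not the exact algorithm any particular SDK uses.

```python
import hashlib

def keep_trace(trace_id: str, ratio: float) -> bool:
    """Keep roughly `ratio` of traces, with the same answer for the same trace id."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64   # stable value in [0, 1)
    return bucket < ratio

# Hypothetical trace ids; about one in ten should be kept at ratio 0.1.
kept = sum(keep_trace(f"trace-{i}", 0.1) for i in range(10_000))
```

Because the decision depends only on the trace id, every service handling the same request reaches the same verdict without coordinating.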

How does this connect to a SOC or SIEM?

The security telemetry from Wazuh and the edge controls feeds the same picture and can forward to a SIEM or a security operations workflow where you run one. Because the infrastructure, security and AI signals share a store, a security event and the system behaviour around it sit on one timeline rather than in separate tools, which shortens the path from alert to understanding what happened.

Can you observe AI that calls third-party APIs like OpenAI or Anthropic?

Yes. Calls to an external model are traced and scored the same way as calls to a model we host, so you see prompts, responses, latency and quality across providers in one view. The caveat is sovereignty: when a prompt goes to a third-party API, that content leaves your jurisdiction, and the observability makes that visible rather than hiding it. Where it matters, the trace data becomes the case for moving the workload to an EU-hosted open-weight model.

Does this replace our existing APM?

It does not have to. If your team is committed to an application performance tool for infrastructure health, the AI observability and evaluation can sit alongside it, owning the quality layer that an APM cannot see while the APM keeps the infrastructure layer. Where you would rather consolidate, we can carry both in one in-EU stack. The choice follows your setup rather than a demand to remove what already works.

Can we start with only the AI part, or only the infrastructure?

Yes. Some teams begin with AI observability because that is where they are flying blind, and add infrastructure and security monitoring later; others start with the host and intrusion-detection side and bring AI in as it reaches production. Because it is one stack rather than separate products, adding a layer later means instrumenting more rather than adopting a second tool, so you can start where the pain is and grow the picture from there.

Which observability tools do you use?

The standard open stack, hosted inside the EU: Prometheus for metrics, Grafana for dashboards and alerting, Loki for logs, and OpenTelemetry for traces, with Wazuh on the security side. Because they are open and we host them, there is no per-host or per-gigabyte SaaS meter, and no lock-in — the instrumentation is portable if you ever move. The same stack covers ordinary infrastructure, applications and AI, so it is one pipeline rather than separate products bolted together.

Observability review

Point us at your stack. We'll show you what you can't see.

Tell us what you run and where your AI sits. We map the blind spots across infrastructure and agents, and show what it takes to keep the traces in the EU. If a fix is yours to make in-house, we will say so.

Book an observability review · Back to the pillars →

◆ Operated within the European Union ◆ Traces stay in-region ◆ One named operator, answerable