AI · RAG & knowledge

RAG that grounds AI in your own data, with citations.

An AI is only as trustworthy as what it retrieves. We build retrieval systems that answer from your documents and data, cite their sources, and keep the index inside the EU rather than on a vendor's cloud.

Book a RAG assessment · How we build →

◆ Hybrid search · reranking ◆ Cited answers ◆ EU-resident index

Retrieval-augmented generation, or RAG, gives a language model your own documents and data to answer from, instead of relying on what it memorised in training. That grounding turns a confident guesser into a system you can trust, because every answer can cite the source it came from. The gap between a RAG demo and a RAG system you can deploy lives in the retrieval quality and the governance around it, which is the part Argus Root builds and operates, with the option to keep the whole pipeline inside the EU.

In short

Retrieval quality is the ceiling, not the model: state-of-the-art RAG answers about 63% of factual questions correctly and a naive setup just 44% — the gap is retrieval engineering.
Hybrid search (dense embeddings + BM25 keywords) is the minimum viable baseline, lifting recall 20–40% over dense-only across enterprise documents.
Cross-encoder reranking is the single biggest accuracy gain available, improving precision 18–42% by reordering the top candidates before the model sees them.
A real retrieval pipeline cuts hallucinations 70–90%, and grounded answers carry citations you can check — trust you can verify, not assert.
Match the complexity to the question: naive RAG for simple lookups, hybrid + rerank for most, GraphRAG or agentic retrieval only when the query genuinely needs multi-hop reasoning.

Grounding is what stops the guessing.

A model left to its training invents detail with total confidence. RAG fixes that by retrieving the relevant passages from your knowledge and putting them in front of the model, cited back to the source. Built well, it cuts hallucination from double digits into the low single digits. Built naively, it retrieves the wrong passages and grounds the model in noise, and since roughly 70% of RAG systems run without an evaluation framework, that failure usually goes unmeasured.

naive RAG vs production RAG

How a naive RAG setup compares with a production-grade one across the pipeline.
Stage	Naive RAG	Production RAG
Retrieval	Semantic search only	Hybrid: vector + keyword (BM25)
Ranking	Raw top-K results	Cross-encoder reranking
Chunking	Fixed-size splits	Semantic, with overlap
Grounding	An uncited answer	Citations to source documents
Evaluation	None	Hallucination rate, Precision@K
Where the index lives	Provider's cloud	EU, on infrastructure we run

The grounding works, and the numbers bear it out. A well-engineered RAG stack reduces hallucination by somewhere between 40 and 96% depending on the domain and the build, turning a model that guesses confidently into one whose answers rest on retrieved, citable source material. That is the whole point: not a model that knows your business, which it never will, but one that looks the answer up and shows its working, so a person can check it.

The catch is that retrieval quality is the ceiling on everything above it. If the retrieval step returns three irrelevant passages, the model either hallucinates to fill the gap or produces a useless non-answer, and no amount of prompt cleverness rescues it. This is why naive single-shot RAG, the kind that demos beautifully and stalls in production, is no longer enough: hybrid search, reranking and evaluation have moved from optional polish to baseline requirements. Most teams learn this the hard way, because the part that decides quality is the unglamorous retrieval layer rather than the model everyone wants to talk about.

Your knowledge base is your data.

A RAG system ingests your documents and turns them into a searchable index, which means your institutional knowledge now lives wherever that index lives. Managed RAG services keep it on the provider's cloud, which reintroduces the residency and jurisdiction question at the level of your most sensitive content.

We build the index on pgvector or Qdrant, on infrastructure inside the EU, with document-level access control, PII redaction and audit logging across the pipeline. The retrieval layer sits naturally on the Postgres we already run, and its data governance follows the same line as our compliance work. A retrieval system also serves as the grounding layer for the agents we build.

The exposure is larger than it first looks. A knowledge base is not one document but the distilled institutional memory of the organisation (the contracts, the policies, the internal know-how) concentrated into a single searchable index. That makes it both the highest-value asset to ground an AI on and the highest-stakes one to place. Handing that index to a managed service on a foreign cloud means handing over the organisation's memory wholesale, which is precisely the data a regulated business cannot afford to let cross a jurisdiction.

How do we build them?

The retrieval quality and the evaluation, not a wrapper over a vector store.

Ingestion & connectors

Your documents, wikis and databases brought into one index, with incremental updates so the knowledge stays current rather than frozen at launch.

Semantic chunking & embedding

Documents split on meaning rather than a fixed length, embedded with models suited to your content, so a retrieval returns whole ideas instead of fragments.

Hybrid retrieval & reranking

Vector and keyword search run together, then a cross-encoder reranks for true relevance, which catches exact strings that semantic search alone misses.

Cited generation

Answers that point back to the source passage, so a person can verify the claim instead of trusting the model on faith.

Evaluation harness

Hallucination rate, Precision@K and provenance coverage measured from day one, so a quality regression shows up before your users find it. See observability →

EU-resident index

The index on pgvector or Qdrant inside the EU, with access control and PII redaction, so your knowledge never leaves your jurisdiction.

We run the retrieval stack ourselves.

The vector index runs on the same PostgreSQL we operate for everything else, with pgvector handling millions of embeddings without a separate managed service to leak data into. We build retrieval systems for our own content and search, so the grounding, the reranking and the evaluation are techniques we run rather than read about. When the data is yours, keeping the index on infrastructure we operate inside the EU means your knowledge stays where it belongs. We measure retrieval quality against a held-out set of real questions rather than trusting a demo, because the distance between a system that answers 44% of questions and one you can rely on lives entirely in that measurement loop.

The retrieval pipeline, not a wrapper over a vector store: chunk and embed on ingest into a hybrid index; at query time retrieve across dense and keyword, rerank with a cross-encoder, enforce access control and PII handling at the retrieval layer, then generate a grounded answer with citations. The two stages — broad recall, then precise reranking — are where quality is won.

retrieval.yaml — chunking, hybrid search, rerank, ACL, grounded answer

# ingest: how documents are split and embedded (chunking decides quality)
ingest:
  chunk_size:    512          # tokens, recursive split on structure
  chunk_overlap: 64
  embed_model:   eu/bge-m3     # multilingual, EU-hosted

# query: hybrid recall, then precise reranking — the two-stage pattern
retrieve:
  mode:    hybrid            # dense + BM25, reciprocal-rank fusion
  top_k:   50
  rerank:
    model: bge-reranker-v2    # cross-encoder, +18-42% precision
    keep:  8
  filter:  acl(user.groups)   # permissions enforced at retrieval

# answer: grounded generation, cited or refused
answer:
  require_citations:     true
  refuse_if_unsupported: true     # no source → no answer

We operate: pgvector · Qdrant · Hybrid search · Cross-encoder rerank · Citations · EU-resident

Retrieval quality is the ceiling.

Everything a RAG system produces is bounded by what it retrieved. Give the model the right passages and it answers well; give it the wrong ones and the cleverest model on the market will either confabulate around the gap or refuse unhelpfully. The generation step cannot exceed the quality of the retrieval feeding it, which means the work that matters most is the part that finds the right context, not the part that writes the answer.

This is the blind spot in most RAG projects and most of the tools that claim to evaluate them. They measure whether the answer reads well, which tells you little, rather than whether the retrieval surfaced the right evidence, which tells you everything. We build and tune the retrieval as the core of the system, treating answer quality as a downstream consequence of getting the context right, because a RAG system that retrieves badly cannot be prompted into reliability. The model is interchangeable; the retrieval is where the engineering lives.

RAG, fine-tuning, long context or an agent?

These are not rivals so much as tools for different constraints, and the useful question is which your problem really has. RAG wins when the constraint is knowledge freshness and accuracy over private data that changes, because you update the index rather than retrain the model. Fine-tuning wins for style, format and lower latency on a stable task. Long context windows suit holding a whole conversation or document in view at once. An agent suits a multi-step task that needs decisions along the way.

The mature architectures in 2026 use them together rather than picking one. A question like "recall the exact termination clause" wants RAG and its citation; "summarise what we discussed" wants long context; "reason across the relationships between vendors" wants a graph and an agent. We choose the combination the problem calls for instead of forcing everything through one pattern, and we are candid when your need is better served by fine-tuning or a long-context call than by a retrieval system at all. The point is the right tool for the constraint, not RAG for its own sake.

How does chunking decide RAG quality?

Before anything is retrieved, the documents have to be split into pieces to index, and how they are split quietly determines how well retrieval works. Chunks too large bury the relevant sentence in noise; too small and they lose the context that makes them meaningful; split in the wrong place and a single idea is severed across two pieces that each retrieve poorly. It is the least discussed part of RAG and one of the most decisive, because a retrieval can only return the chunks the indexing created.

There is no single right answer, which is why we treat chunking as something to tune against your content rather than a default to apply. Splitting on meaning suits prose where ideas run across paragraphs; a simpler fixed split with overlap is faster and, on some structured content, measurably better, so the choice is empirical rather than dogmatic. We test the chunking strategy against your documents and your real queries, with the metadata and structure preserved so a retrieved passage carries where it came from. Getting this layer right is cheaper and more effective than any amount of work further up the stack.
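The simpler of the two strategies, a fixed split with overlap, can be sketched in a few lines. This is a minimal illustration, assuming whitespace-separated words stand in for model tokens; the function name is ours, and the 512 and 64 defaults simply mirror the values in the retrieval.yaml shown earlier rather than a recommendation for your content.

```python
def chunk_with_overlap(text, chunk_size=512, overlap=64):
    """Split text into windows of `chunk_size` tokens sharing `overlap` tokens.

    Whitespace-separated words stand in for model tokens in this sketch.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


# Tiny demo: ten tokens, windows of four, one token shared between neighbours.
sample = " ".join(f"w{i}" for i in range(10))
pieces = chunk_with_overlap(sample, chunk_size=4, overlap=1)
```

Each window repeats the last token of the one before it, so an idea that straddles a boundary still appears whole in at least one chunk when the overlap is wide enough. Whether that beats a split on meaning is the empirical question to test against real queries.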

Hybrid search, and the reranker teams skip.

Semantic search is excellent at meaning and poor at exact strings: ask it for a part number, a statutory reference or a specific name and pure vector similarity often misses, because those are matches of the literal token rather than the idea. Hybrid search runs semantic and keyword retrieval together so both kinds of query are caught, which on real enterprise content is the difference between a system that finds the clause and one that almost does.
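One common way to merge the two result lists is reciprocal-rank fusion, the method named in the retrieval.yaml above. The sketch below assumes each retriever returns document IDs best-first; the function name, the sample IDs and the constant k=60 are illustrative assumptions, not our production settings.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first ID lists into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    so a document ranked well by either retriever rises towards the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the order is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for one query from the two retrievers.
dense_hits = ["policy-12", "contract-7", "memo-3"]           # semantic neighbours
keyword_hits = ["contract-7", "part-AX-4411", "policy-12"]   # exact-string matches
fused = reciprocal_rank_fusion([dense_hits, keyword_hits])
```

The part number that only keyword search found still lands in the fused list, ahead of the passage that only vector similarity liked, which is the behaviour hybrid search is meant to buy.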

Then comes the step most teams skip and later regret: reranking. A cross-encoder takes the top candidates from retrieval and reorders them by true relevance to the query, removing the near-misses before the model ever sees them, and it frequently delivers more improvement than weeks spent choosing a better embedding model. The cost is latency, since reranking can multiply retrieval time several-fold, so we apply it where answer quality justifies the wait and tune the depth to the use case. Hybrid retrieval with a reranker is the combination that lifts a RAG system from demo to dependable, and it is exactly the part the failed builds left out.
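The two-stage shape can be sketched as follows. The `score` callable stands in for a cross-encoder, which in a real build would be a model scoring each query and passage pair; the word-overlap scorer here is a toy placeholder so the example runs on its own, and every name in it is an assumption.

```python
def rerank(query, candidates, score, keep=8):
    """Reorder recall-stage candidates by a pairwise relevance score.

    `score(query, passage)` is a placeholder for a cross-encoder model.
    `keep` is the rerank depth: fewer passages reach the model, with less noise.
    """
    scored = [(score(query, passage), passage) for passage in candidates]
    # The sort is stable, so equal scores keep their recall-stage order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored[:keep]]


def toy_overlap_score(query, passage):
    """Toy stand-in: fraction of query words that appear in the passage."""
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    return len(query_words & passage_words) / len(query_words) if query_words else 0.0


candidates = [
    "Notice periods for suppliers are listed in annex B",
    "Either party may terminate the contract with 90 days notice",
    "The termination clause of the contract requires written notice",
]
top = rerank("contract termination clause", candidates, toy_overlap_score, keep=2)
```

Because the scorer is called once per candidate, latency grows with the number of passages sent to it, which is why the recall depth and `keep` are tuned per use case rather than fixed.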

How do you evaluate a RAG system?

A RAG system you cannot measure is one you cannot trust or fix, and evaluation is genuinely harder than it looks because quality is not one number. It is at least four: whether retrieval found the relevant material, whether the retrieved context was on-point, whether the answer stayed faithful to that context, and whether the answer was genuinely correct. Each needs its own method, and a system that scores well on answer fluency can be failing badly on retrieval recall without anyone noticing.

We build the evaluation harness as part of the system rather than an afterthought, with a labelled question set and the metrics tracked from day one, so a regression shows up as a falling number before a user hits it. This is also how the slow drift is caught, the decay where a system that worked at launch degrades as documents change and queries shift, which is invisible without baselines to compare against. Teams that skip structured evaluation ship systems they cannot diagnose when they degrade, and a RAG system that cannot be diagnosed is one that quietly stops being trustworthy. Measurement is what keeps it honest over time, and it runs on our observability footing.
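The retrieval half of that harness comes down to a labelled question set and a couple of small functions. A minimal sketch, assuming document IDs as the unit of relevance; the question IDs, document IDs and labels below are made up for illustration.

```python
def precision_at_k(retrieved, relevant, k):
    """Share of the top-k retrieved IDs that are labelled relevant."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(1 for doc_id in top if doc_id in relevant) / len(top)


def recall_at_k(retrieved, relevant, k):
    """Share of the labelled-relevant IDs that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


# Hypothetical labelled set: question -> IDs a reviewer marked as relevant.
labelled = {
    "q1": {"doc-a", "doc-b"},
    "q2": {"doc-c"},
}
# Hypothetical retriever output for the same questions, best first.
runs = {
    "q1": ["doc-a", "doc-x", "doc-b", "doc-y"],
    "q2": ["doc-z", "doc-c", "doc-w", "doc-v"],
}
mean_p_at_2 = sum(precision_at_k(runs[q], labelled[q], 2) for q in labelled) / len(labelled)
mean_r_at_2 = sum(recall_at_k(runs[q], labelled[q], 2) for q in labelled) / len(labelled)
```

Tracked per release against the same labelled set, these two numbers are the baseline that makes a retrieval regression visible. Faithfulness and answer correctness need their own checks on top.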

Agentic and corrective retrieval.

The fire-and-forget RAG of a single retrieve-then-answer step is giving way to something that checks its own work. Agentic RAG lets the system reason about whether what it retrieved is good enough, retrieve again with a refined query if not, and fall back to another source when the knowledge base comes up short. That fallback is the corrective pattern, which keeps an answer current when the local index is stale. It trades a little latency and token cost for a meaningful gain in reliability, which for a serious application is a trade worth making.

The same thinking turns a single shot into a multi-stage pipeline: decomposing a complex question into sub-queries, routing each to the right source, filtering the context before synthesis, each stage with its own quality signal. It is more to build and considerably more reliable in production, which is the recurring lesson of RAG: the versions that work are the ones that did the unglamorous engineering the demos skip. We build the staged, self-correcting pipeline where the workload's reliability bar warrants it, and keep a simpler single-pass system where that is genuinely enough, because added stages are added cost and the right design is the simplest one that clears the bar.
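The self-correcting loop can be sketched as plain control flow. All four hooks below are hypothetical: in a real system the grader would typically be a model or a scoring heuristic, and the fallback another index or search source. The threshold and attempt count are illustrative defaults.

```python
def corrective_retrieve(query, retrieve, grade, rewrite, fallback,
                        threshold=0.5, max_attempts=2):
    """Retrieve, grade the result, retry with a rewritten query, then fall back.

    Hypothetical hooks:
      retrieve(query)        -> list of passages from the local index
      grade(query, passages) -> confidence in [0, 1]
      rewrite(query)         -> a refined query string
      fallback(query)        -> passages from another source
    Returns (passages, trace); trace records the quality signal of each attempt.
    """
    trace = []
    current = query
    for attempt in range(1, max_attempts + 1):
        passages = retrieve(current)
        confidence = grade(query, passages)
        trace.append({"stage": f"retrieve#{attempt}", "query": current,
                      "confidence": confidence})
        if confidence >= threshold:
            return passages, trace
        current = rewrite(current)
    trace.append({"stage": "fallback", "query": query, "confidence": None})
    return fallback(query), trace


# Toy hooks: the first phrasing misses the index, the rewritten query hits.
index = {"termination clause": ["Clause 14: termination requires 90 days notice."]}
passages, trace = corrective_retrieve(
    "how do we end the contract",
    retrieve=lambda q: index.get(q, []),
    grade=lambda q, found: 1.0 if found else 0.0,
    rewrite=lambda q: "termination clause",
    fallback=lambda q: [],
)
```

The trace is the point as much as the result: each attempt leaves a signal behind, so a weak answer can be traced to the stage that produced it.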

GraphRAG: when relationships matter more than similarity.

Vector search is built for similarity, and there is a whole class of question it cannot answer well. "Find documents about a topic" is its strength; "show every supplier connected to this vendor who also holds a contract with a competitor" is beyond it, because that is a question about relationships, not resemblance. For those, GraphRAG builds a knowledge graph where the entities (the people, companies and products) are nodes and the relationships between them are edges, so the system can reason across connections rather than only retrieve similar text.

The capability comes at a price, and we are plain about it. Building and maintaining the knowledge graph costs several times more than baseline RAG, so it earns its place only where the questions are genuinely relational and the extra cost returns real value: due diligence, fraud and supply-chain analysis, and investigative work where the answer lives in the connections. For the majority of knowledge-base use cases, hybrid retrieval over well-chunked documents is the right and far cheaper tool, and we will tell you when your questions do not warrant a graph rather than sell the more elaborate build by default.
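To make the difference concrete, here is the supplier question from above as a two-hop traversal over a toy graph. The entity names, relation labels and helper functions are all invented for illustration; a real build would sit on a graph store with entities extracted from your documents.

```python
# Hypothetical edges as (subject, relation, object) triples.
edges = [
    ("SupplierA", "supplies", "VendorX"),
    ("SupplierB", "supplies", "VendorX"),
    ("SupplierC", "supplies", "VendorY"),
    ("SupplierA", "has_contract_with", "CompetitorZ"),
    ("SupplierC", "has_contract_with", "CompetitorZ"),
    ("CompetitorZ", "competes_with", "VendorX"),
]


def subjects_of(relation, obj):
    """All subjects linked to `obj` by `relation`."""
    return {s for s, r, o in edges if r == relation and o == obj}


def suppliers_also_contracted_to_competitors(vendor):
    """Two-hop query: suppliers of `vendor` that also contract with its competitors."""
    competitors = subjects_of("competes_with", vendor)
    suppliers = subjects_of("supplies", vendor)
    return sorted(
        s for s in suppliers
        if any(s in subjects_of("has_contract_with", c) for c in competitors)
    )


flagged = suppliers_also_contracted_to_competitors("VendorX")
```

No amount of text similarity surfaces SupplierA here, because the answer is the intersection of two relationships rather than a passage that resembles the question.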

Citations: trust you can check.

An answer you have to take on faith is worth little in a setting that matters. The defining feature of a good RAG system is that every answer points back to the source passage it came from, so a person can read the original and confirm the claim rather than trust the model's word. That single property changes the relationship with the system: it stops being an oracle you either believe or doubt and becomes a research assistant whose work you can verify in seconds.

Provenance also does double duty as governance. The same citation that lets a user check an answer is the record that lets an auditor see what a regulated decision was based on, and the trail that lets you find and fix a wrong answer at its source rather than guessing at the model. We build cited generation as a requirement rather than a feature, with the provenance carried through the pipeline from the chunk to the answer, because in the domains where RAG is most useful (finance, law, healthcare, compliance) an answer that cannot be traced is an answer that cannot be used.
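Carrying provenance from chunk to answer is mostly a matter of never separating a passage from its source metadata. A minimal sketch, assuming a chunk record with a document ID and a locator; the field names and the sample document are hypothetical, and the refusal branch mirrors the `refuse_if_unsupported` setting in the retrieval.yaml above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str    # source document identifier
    locator: str   # where in the document, e.g. a clause or page label
    text: str      # the retrieved passage itself


def cited_answer(answer_text, supporting_chunks):
    """Attach provenance to an answer, or refuse when nothing supports it."""
    if not supporting_chunks:
        return {"answer": None, "citations": [], "refused": True}
    citations = [
        {"doc_id": c.doc_id, "locator": c.locator, "quote": c.text}
        for c in supporting_chunks
    ]
    return {"answer": answer_text, "citations": citations, "refused": False}


# Hypothetical retrieved passage and the answer generated from it.
chunk = Chunk("msa-2024.pdf", "clause 14.2",
              "Either party may terminate on 90 days written notice.")
result = cited_answer("Termination requires 90 days written notice.", [chunk])
unsupported = cited_answer("Termination requires 30 days notice.", [])
```

The citation list is the same record an auditor would read, so it is worth storing alongside the answer rather than rendering it only in the interface.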

Freshness: a knowledge base that does not go stale.

A knowledge base is only as good as its currency, and the quiet failure of many RAG systems is that they answer accurately from documents that are months out of date. The index has to keep pace with the source: new documents ingested, changed ones re-indexed, removed ones taken out, so the system answers from what is true now rather than what was true at launch. An answer correctly retrieved from a superseded policy is wrong in the way that matters most, because it looks right.

We build ingestion as a continuous pipeline rather than a one-time load, with incremental updates that keep the index current as your documents change, and connectors to the wikis, drives and databases where the knowledge really lives. This is also one half of guarding against drift: a system whose answers degrade because its knowledge aged is as broken as one whose retrieval regressed, and both are caught only by treating the knowledge base as a living system that is maintained. The freshness is what makes the difference between a RAG system you can rely on for a current answer and one that confidently tells you last year's truth.
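Incremental ingestion rests on knowing what changed since the last run. One simple approach, sketched here under the assumption that each document has a stable ID and that the index stores a content hash per document, is to diff the source against those hashes; the function names and sample documents are illustrative.

```python
import hashlib


def fingerprint(content):
    """Stable content hash used to detect changed documents."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def plan_sync(source_docs, indexed_hashes):
    """Compare the source of truth with the index and return the work to do.

    source_docs:    {doc_id: current text} as read through the connectors
    indexed_hashes: {doc_id: fingerprint stored at the last ingest}
    """
    current = {doc_id: fingerprint(text) for doc_id, text in source_docs.items()}
    return {
        "add": sorted(d for d in current if d not in indexed_hashes),
        "reindex": sorted(d for d in current
                          if d in indexed_hashes and current[d] != indexed_hashes[d]),
        "remove": sorted(d for d in indexed_hashes if d not in current),
    }


# Hypothetical state: one policy was revised, one memo retired, one handbook added.
indexed = {"policy": fingerprint("Leave policy v1"), "old-memo": fingerprint("Retired")}
source = {"policy": "Leave policy v2", "handbook": "New handbook"}
plan = plan_sync(source, indexed)
```

The `remove` list matters as much as the other two: a superseded document left in the index is exactly how a system ends up answering confidently from last year's policy.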

Access control and PII at the retrieval layer.

A knowledge base flattens your documents into one searchable surface, which is exactly the risk if access is not enforced inside it. Without document-level permissions, a RAG system will happily retrieve and surface a passage from a file the asker was never allowed to see, turning a helpful assistant into a data-leak with a friendly interface. The access control has to live at the retrieval layer, so the system can only ever return what the person asking is entitled to.

We enforce permissions at retrieval rather than hope a prompt holds the line, so an agent or a user cannot pull what they cannot access, and we redact personal data in the pipeline so sensitive identifiers are handled to the standard GDPR expects. Combined with the EU-resident index and audit logging, this makes the knowledge base a governed system rather than an open door, and it sits within the detection coverage our managed security runs. The system holding your institutional memory should be the most carefully gated thing you operate, not a search box that quietly ignores who is allowed to see what.
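In outline, the gate looks like this. The sketch assumes each chunk carries the groups allowed to read it and that the asker's groups are known at query time; the field names are hypothetical, and the single email pattern is a deliberately narrow stand-in for real PII detection.

```python
import re

# Narrow illustrative pattern; real PII handling covers far more than emails.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def allowed(chunk, user_groups):
    """A chunk is visible only if the user shares a group with its ACL."""
    return bool(set(chunk["acl"]) & set(user_groups))


def retrieve_for_user(candidates, user_groups):
    """Drop what the asker may not see, then redact emails in what remains."""
    visible = [c for c in candidates if allowed(c, user_groups)]
    return [{**c, "text": EMAIL.sub("[redacted-email]", c["text"])} for c in visible]


# Hypothetical candidates returned by the recall stage.
candidates = [
    {"id": "hr-1", "acl": ["hr"], "text": "Salary review contact: jane.doe@example.com"},
    {"id": "kb-7", "acl": ["hr", "staff"], "text": "Ask it-help@example.com for VPN access"},
]
results = retrieve_for_user(candidates, user_groups=["staff"])
```

In a production index the permission check is better expressed as a filter inside the search query itself, as the `filter: acl(user.groups)` line in the retrieval.yaml suggests, so that restricted chunks never enter the candidate set at all.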

From a search box to a retrieval pipeline.

The single-shot retrieve-and-answer that defined early RAG is being replaced by pipelines with distinct stages, because real questions are rarely answered well in one pass. A complex query is decomposed into parts, each part routed to the source most likely to hold its answer, the retrieved context filtered to remove the near-misses, and only then is the answer synthesised, each stage carrying its own quality signal so a weakness can be located rather than guessed at.

This is more to build than a search box bolted to a model, and it is the reason production RAG is an engineering discipline rather than a weekend integration. Each stage is an opportunity to improve quality and a place to measure it, which is how a system gets reliable enough to depend on and stays diagnosable when it slips. We build the pipeline to the depth the workload needs, no more, because every stage adds latency and cost, and the right design is the fewest stages that reach the reliability the use case demands rather than the most elaborate one available.
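The staged shape described above can be sketched as one function that records a signal at every stage. The hooks are hypothetical placeholders: in practice decomposition and routing are usually model calls or rules, and the filter is often the reranker.

```python
def run_pipeline(question, decompose, route, sources, keep_if):
    """Decompose, route, retrieve and filter, recording a signal per stage.

    Hypothetical hooks:
      decompose(question)         -> list of sub-queries
      route(sub_query)            -> key into `sources`
      sources[key](sub_query)     -> list of passages
      keep_if(sub_query, passage) -> bool, the context filter
    """
    signals = []
    context = []
    sub_queries = decompose(question)
    signals.append(("decompose", len(sub_queries)))
    for sub in sub_queries:
        source = route(sub)
        retrieved = sources[source](sub)
        kept = [p for p in retrieved if keep_if(sub, p)]
        signals.append((f"{source}:{sub}", f"{len(kept)}/{len(retrieved)} kept"))
        context.extend(kept)
    return context, signals


# Toy sources and hooks, invented for the example.
sources = {
    "contracts": lambda q: ["Clause 14 covers termination.", "Clause 3 covers fees."],
    "wiki": lambda q: ["Offboarding checklist for vendors."],
}
context, signals = run_pipeline(
    "termination terms and vendor offboarding",
    decompose=lambda q: [part.strip() for part in q.split(" and ")],
    route=lambda sub: "contracts" if "termination" in sub else "wiki",
    sources=sources,
    keep_if=lambda sub, p: any(word in p.lower() for word in sub.lower().split()),
)
```

When an answer is weak, the signals show whether the question was split badly, routed to the wrong source, or filtered too hard, which is what keeps a multi-stage system diagnosable.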

Who needs RAG, and when does something else fit?

RAG fits when people need accurate answers from a body of knowledge that is yours and that changes: a support team answering from current documentation, a professional services firm querying its own precedents, a regulated business that needs every answer traceable to a source. The common thread is private, changing knowledge where accuracy and provenance matter, which is precisely where a general model on its own is least reliable and most confidently wrong.

Where those conditions are absent, we will point you elsewhere. A stable body of knowledge that never changes might be served by fine-tuning; a single long document to reason over might fit a long-context call without any retrieval at all; a task that is really about taking actions wants the agent the retrieval would only feed. We build RAG where retrieval is the right answer and say so when it is not, because a retrieval system built for a problem that did not need one is cost and complexity with no payoff, and our interest is in the knowledge systems that earn their keep rather than the ones that merely sound modern.

How does it start?

An engagement begins with a body of knowledge and the questions you want answered from it. We take a representative slice of your documents, build a working retrieval pipeline over them, and test it against the real queries your people would ask, so you see early what the system can and cannot answer. That first pass usually reveals as much about the state of the documents as the technology, because retrieval exposes the gaps, duplicates and stale content a knowledge base accumulates.

From there it becomes an engineered system: the chunking tuned to your content, hybrid retrieval and reranking where they earn their cost, the evaluation harness in place, access control and the EU-resident index built in, and ingestion keeping it current. The work is sequenced and measured rather than assembled in one pass, because production RAG is a discipline of data quality, retrieval quality and evaluation loops.

Questions buyers ask.

What is RAG?

Retrieval-augmented generation pairs a language model with a search system over your own data. When a question comes in, the system retrieves the relevant passages and gives them to the model to answer from, with citations to the source. It grounds the model in your knowledge instead of its training alone.

Does RAG stop hallucinations?

It reduces them sharply rather than eliminating them. Grounding answers in retrieved source material and citing it can take hallucination from double digits to low single digits. The honest framing is that good RAG makes the model far more reliable and lets a person verify each answer, not that it makes it perfect.

Is RAG better than fine-tuning?

For most enterprise cases, yes, especially when the knowledge changes. RAG updates by changing the index rather than retraining the model, so it is cheaper and stays current. Fine-tuning still has its place for tone and format, but for factual grounding RAG is the more practical and scalable route.

Why are hybrid search and reranking important?

Pure semantic search struggles with exact strings such as a serial number or a statutory reference. Hybrid search runs semantic and keyword search together to catch both, and a cross-encoder reranker then scores the results for true relevance, removing noise before the model ever sees it. The two together are what lift answer quality.

Can the knowledge base stay inside the EU?

Yes. We build the index on pgvector or Qdrant on infrastructure inside the EU, not on a managed service that keeps your documents on a foreign cloud. Your content, the embeddings and the retrieval all stay in-region, with access control and PII redaction in the pipeline.

How do you know it is working?

We measure it. An evaluation harness tracks hallucination rate, Precision@K and provenance coverage from the start, which most RAG systems skip. Without those baselines a quality regression is invisible until a user hits it, so we treat evaluation as part of the build rather than an afterthought.

Doesn't a million-token context window make RAG obsolete?

No, it changes the cost maths without removing the need at enterprise scale. Long context suits holding one conversation or document in view; RAG suits precise, cited answers over a large, changing knowledge base, and it stays far cheaper per query at volume. The mature pattern in 2026 uses both: long context where it fits, retrieval where precision and citations matter.

What is agentic or corrective RAG?

A system that checks its own retrieval rather than answering in one blind shot. It can judge whether what it retrieved is good enough, retry with a refined query, and fall back to another source when the index is stale or thin. It costs a little more latency and tokens for a real gain in reliability, which is why we use it where the application's reliability bar warrants it and keep a simpler single-pass system where that is enough.

What is GraphRAG, and do we need it?

GraphRAG builds a knowledge graph of entities and their relationships, so the system can answer questions about connections that vector similarity cannot, such as which suppliers link to which competitors. It costs several times more than baseline RAG, so it earns its place only for genuinely relational questions in areas like due diligence or fraud. For most knowledge bases, hybrid retrieval is the right, cheaper tool, and we will say so.

How do you keep the knowledge base current?

With continuous ingestion rather than a one-time load: new documents indexed, changed ones re-indexed, removed ones taken out, through connectors to the wikis, drives and databases where your knowledge lives. An answer correctly retrieved from a superseded document is wrong in the way that matters most, so freshness is treated as part of keeping the system reliable, alongside guarding against retrieval drift.

Can a RAG system leak documents a user shouldn't see?

Only if access control is missing, which is why we enforce it at the retrieval layer. The system can return a passage only from files the asker is entitled to see, permissions are checked at retrieval rather than left to a prompt, and personal data is redacted in the pipeline. Without document-level access control, a knowledge base becomes a data leak with a friendly interface.

Why does chunking matter so much?

Because retrieval can only return the chunks the indexing created. Split documents badly and the relevant idea is buried in noise or severed across two pieces that each retrieve poorly. There is no universal right answer, so we tune the chunking against your content and real queries, preserving structure and metadata, because getting this layer right is cheaper and more effective than any work further up the stack.

When is RAG the wrong choice?

When the problem is not really about retrieving from changing private knowledge. A stable, unchanging body of knowledge may suit fine-tuning; a single long document may fit a long-context call with no retrieval; a multi-step task that takes actions wants an agent the retrieval would only feed. We build RAG where it is the right answer and point you elsewhere when it is not.

RAG assessment

Point us at your documents. We'll show you what the AI can answer.

Tell us the knowledge you want a model to draw on and the questions it should handle. We assess retrieval quality on a sample of your data, show the answers with their citations, and tell you what production would take, before you commit to anything.

Book a RAG assessment · Back to AI →

◆ Answers grounded and cited ◆ Index stays inside the EU ◆ One named operator, answerable