How do you build RAG systems that work in production?
Most RAG implementations are built to produce answers. The ones that work in high-stakes environments are built to produce answers with calibrated confidence, and to behave differently when confidence is low. The hard engineering work is not retrieval quality. It is teaching the system to know what it does not know.
Almost every RAG demo looks impressive. The user asks a question, the system retrieves the relevant content, the language model produces a fluent and authoritative answer with citations. The audience nods. The proof of concept gets approved.
Production is where the demo runs into the wall. The same system that answered confidently in testing answers confidently when it is wrong. Sources get cited that do not actually contain the claim. Edge cases produce plausible-sounding hallucinations. The fluency that made the demo compelling becomes the failure mode that makes the system unsafe. The gap between adoption and outcome is well documented: McKinsey's 2025 State of AI report finds that while 71% of organizations report regular generative AI use, only 17% attribute 5% or more of EBIT to it. The problem is rarely the technology. It is the gap between what works in testing and what works under load.
The conventional response to this is to tune the model: better embeddings, smarter chunking, stricter prompts, temperature adjustments. All of these help, marginally. None of them solves the underlying problem, which is that the system is being asked to produce an answer regardless of whether it has the basis for one.
The teams that get reliable retrieval-augmented generation into production approach it differently. They treat confidence as a product feature, not a model property. They design the system to behave differently when it does not know, and they expose that uncertainty to the user in ways that are useful rather than alarming. This article covers what that approach looks like in practice, drawn from a project where the cost of a confident hallucination would have been a safety issue, not a customer service one.
What is retrieval-augmented generation and how does it work?
Retrieval-augmented generation is a pattern for grounding language model outputs in specific source content rather than relying on the model's training data. The architecture has three core components. A retrieval layer takes the user's query, converts it into a vector embedding, and finds the most relevant chunks of content from a connected knowledge base. A generation layer takes those chunks, combines them with the original query, and uses a language model to produce an answer. A presentation layer formats the answer for the user, often with source attribution.
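The three layers can be sketched in a few lines. This is a toy illustration, not a production design: the bag-of-words `embed` function stands in for a real embedding model, `generate` is a placeholder where a language model call would go, and all names (`Chunk`, `retrieve`, `generate`, `present`) are hypothetical.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_id: str
    text: str


def embed(text: str) -> Counter:
    # Toy stand-in for an embedding model: a bag-of-words term count.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[Chunk], k: int = 2) -> list[tuple[Chunk, float]]:
    # Retrieval layer: score every chunk against the query, keep the top k.
    q = embed(query)
    scored = [(chunk, cosine(q, embed(chunk.text))) for chunk in chunks]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


def generate(query: str, hits: list[tuple[Chunk, float]]) -> str:
    # Generation layer: placeholder for a language model call (assumption).
    context = " ".join(chunk.text for chunk, _ in hits)
    return f"Answer to '{query}' based on: {context}"


def present(answer: str, hits: list[tuple[Chunk, float]]) -> dict:
    # Presentation layer: return the answer together with its sources and scores.
    return {
        "answer": answer,
        "sources": [chunk.doc_id for chunk, _ in hits],
        "scores": [score for _, score in hits],
    }
```

Note that `retrieve` always returns `k` chunks, even when none of them is relevant, and `generate` always produces text. Nothing in this skeleton knows whether the answer is supported.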
The appeal of RAG is obvious. It gives the language model access to current, specific, organization-owned information without retraining the model itself. It provides a path to citation and verifiability. It works against unstructured content that would be expensive to convert into structured data. For domains where the source material is too specialized, too current or too proprietary for a general-purpose model to handle, RAG is the standard pattern.
The architecture diagram makes this look like a solved problem. Pick a vector database, pick a chunking strategy, pick an embedding model, connect it to a language model, and you have a RAG system. The diagram does not show where reliability actually lives.
Why RAG is harder in production than the architecture diagrams suggest
The gap between a working RAG demo and a production RAG system is wide, and it is not where most teams expect it to be.
Most production failures do not come from the retrieval layer failing to find relevant content. They come from the generation layer producing a confident answer when the retrieved content does not actually support one. The system retrieves three documents, none of which contain the answer to the user's question, and the language model produces a fluent response anyway. The citations look real. The reasoning sounds coherent. The output is wrong, and the user has no way to tell.
This is not a theoretical concern. A 2025 Stanford study published in the Journal of Empirical Legal Studies evaluated the leading RAG-based legal research tools, including LexisNexis Lexis+ AI and Thomson Reuters Westlaw AI-Assisted Research, both marketed as "hallucination-free." The study found that these tools hallucinate between 17% and 33% of the time. The hallucinations included fabricated citations and, more dangerously, real citations attached to claims the source did not actually support. The systems were confident either way. The users could not always tell.
This is a different category of failure from what RAG was originally designed to address. The technique was introduced to reduce hallucination by grounding responses in source content. In practice, it reduces hallucination only when the source content actually contains the answer, and when the system is honest about whether it does. Neither condition is automatic.
The deeper engineering problem is therefore not retrieval quality. It is confidence. The system needs to know how much support the retrieved content actually provides for the answer it is about to generate, and it needs to behave differently when that support is weak. This is not something that emerges from the architecture. It has to be designed in.
What confidence looks like as a product feature
The project that taught us this most clearly was a generative AI assistant built for a maritime technology venture. The system answers questions from ship operators, fueling suppliers, crewing agencies and logistics firms on topics ranging from regulatory compliance to safety protocols to operational guidance. The source material lives across vessel-specific documentation, company procedures, regulatory updates and external advisories.
In maritime operations, the cost of a confident wrong answer is not a poor customer experience. It is a safety issue. A hallucinated regulatory citation or a fabricated safety procedure could result in real-world consequences far beyond reputational damage. The system had to be designed from the start around the assumption that the model would sometimes fail, and that those failures had to be visible to the user rather than hidden behind fluent prose.
The design response was to treat confidence as a first-class feature of the output, not as an internal model state. Every answer the system produces is presented alongside the source documents it was generated from. The user can verify the claim against the original text rather than relying on the model's assertion. When the retrieved content is thin or ambiguous, the system surfaces that fact rather than papering over it. The answer is structured so that the user knows what the system found, where it found it, and what to do if the answer does not look right.
This sounds straightforward when described. In practice, it requires deliberate engineering at every layer. The retrieval layer has to score and rank results in ways that the presentation layer can use. The generation layer has to be prompted to acknowledge uncertainty rather than smooth over it. The presentation layer has to display source links in ways that invite verification rather than discourage it. And the underlying product has to be designed for a user who is willing to verify, not one who wants to trust the system blindly.
The result is a system that is less impressive in demos and more reliable in production. That is not a coincidence. The two are connected.
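As one illustration of prompting the generation layer to acknowledge uncertainty, here is a minimal prompt builder. The instruction wording, the `INSUFFICIENT_SOURCES` sentinel and the function name are assumptions made for this sketch, not the prompt used in the project. A prompt like this does not guarantee the model will abstain, which is why it is only one layer of the design.

```python
def build_grounded_prompt(question: str, passages: list[tuple[str, str]]) -> str:
    """Build a prompt that restricts the model to numbered sources.

    `passages` is a list of (source_id, text) pairs from the retrieval layer.
    """
    lines = [
        "Answer the question using only the numbered sources below.",
        "Cite the source number after each claim, for example [1].",
        # Hypothetical sentinel the calling code can detect and handle.
        "If the sources do not contain the answer, reply exactly: INSUFFICIENT_SOURCES",
        "",
    ]
    for number, (source_id, text) in enumerate(passages, start=1):
        lines.append(f"[{number}] ({source_id}) {text}")
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)
```

The calling code can check the model output for the sentinel and switch the interface into a "no answer found" state instead of displaying a fluent guess.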
What teams underestimate when deploying RAG in production
The architectural decisions that determine whether a RAG system reaches production reliability tend to concentrate in areas that look secondary in the planning phase.
Chunking strategy
How source content is divided into retrievable units shapes everything downstream. Chunks that are too small lose context and produce fragmented answers. Chunks that are too large include irrelevant content that dilutes retrieval quality. The optimal strategy depends on the structure of the source material, the typical question shape, and the language model's context window. Most teams pick a default chunking strategy from a tutorial and never revisit it. That is usually the first thing that needs to change in production.
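To make the trade-off concrete, here is a minimal sketch of fixed-size chunking with overlap, measured in words. The `size` and `overlap` defaults are arbitrary placeholders, and real source material usually calls for structure-aware splitting (headings, sections, tables) rather than a fixed window.

```python
def chunk_words(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into word windows of `size`, each sharing `overlap` words with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and overlap must satisfy 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        # Stop once a window has reached the end of the text.
        if start + size >= len(words):
            break
    return chunks
```

Increasing `size` keeps more context per chunk at the cost of pulling in unrelated text. Increasing `overlap` reduces the chance that an answer is split across a boundary, at the cost of a larger index.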
Embedding quality
The embedding model determines how the system understands semantic similarity between queries and content. A general-purpose embedding model works for general-purpose content. For specialized domains, particularly ones with technical vocabulary that is not well represented in the model's training data, retrieval quality drops significantly. Domain-specific embeddings or fine-tuned embeddings can produce substantial accuracy improvements that no amount of prompt engineering will match.
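One way to ground this decision in evidence is to compare candidate embedding models on a labeled set of domain queries. The sketch below computes a hit rate at k: the share of queries for which at least one known-relevant document appears in the top k results. The `Retriever` callable is a hypothetical wrapper around whichever embedding model and index is being evaluated.

```python
from typing import Callable

# Hypothetical interface: (query, k) -> ranked list of document ids.
Retriever = Callable[[str, int], list[str]]


def hit_rate_at_k(
    retriever: Retriever,
    labeled_queries: list[tuple[str, set[str]]],
    k: int = 5,
) -> float:
    """Fraction of queries with at least one relevant document in the top k."""
    if not labeled_queries:
        return 0.0
    hits = 0
    for query, relevant_ids in labeled_queries:
        if relevant_ids & set(retriever(query, k)):
            hits += 1
    return hits / len(labeled_queries)
```

Running the same labeled set through a general-purpose retriever and a domain-tuned one gives a direct comparison, provided the labeled queries use the vocabulary real users would use.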
Confidence calibration
This is the layer most teams skip entirely. The system needs a way to estimate how well the retrieved content supports the requested answer, and that estimate needs to drive downstream behavior. Strong support means a confident answer. Weak support means a hedged answer, or no answer at all. The exact mechanism varies (retrieval scores, model self-evaluation, dedicated verification steps), but the principle is consistent: the system should not produce a confident answer in the absence of confident grounding.
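A minimal sketch of the simplest mechanism, retrieval scores driving the answer mode, might look like the following. The thresholds (`strong`, `weak`, `min_strong_hits`) are invented for illustration and would need to be tuned against labeled data. Raw similarity scores are not calibrated probabilities, so this is a starting point rather than a complete calibration layer.

```python
from enum import Enum


class Mode(Enum):
    CONFIDENT = "confident"
    HEDGED = "hedged"
    ABSTAIN = "abstain"


def decide_mode(
    scores: list[float],
    strong: float = 0.75,      # assumed threshold for a strongly matching chunk
    weak: float = 0.45,        # assumed floor below which nothing counts as support
    min_strong_hits: int = 2,  # assumed number of strong chunks needed for confidence
) -> Mode:
    """Map retrieval similarity scores to an answer mode."""
    if not scores or max(scores) < weak:
        # Nothing retrieved looks relevant: do not generate an answer.
        return Mode.ABSTAIN
    strong_hits = sum(1 for score in scores if score >= strong)
    if strong_hits >= min_strong_hits:
        return Mode.CONFIDENT
    # Some support exists, but not enough to answer without a caveat.
    return Mode.HEDGED
```

The useful property is that the decision is made before generation, so a fluent model output cannot override weak grounding.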
Source attribution and verifiability
The output is not just the answer. It is the answer plus the means to verify it. Every claim should be traceable to the source content the system retrieved, and the source should be exposed in a way that makes verification practical. This is partly a presentation problem and partly an architectural one. The retrieval layer has to preserve enough metadata to make source attribution meaningful, the generation layer has to maintain the link between claims and sources, and the interface has to surface that information at the moment of decision.
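One possible shape for an answer that keeps the link between claims and sources is sketched below. The field names (`doc_id`, `page`, `start`, `end`) are assumptions about what metadata the retrieval layer preserves. The point is that each claim carries its own sources, rather than the answer carrying a single list of documents.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SourceSpan:
    doc_id: str
    title: str
    page: int
    start: int  # character offset where the supporting passage begins on the page
    end: int    # character offset where it ends


@dataclass
class Claim:
    text: str
    sources: list[SourceSpan] = field(default_factory=list)


@dataclass
class Answer:
    claims: list[Claim]

    def unsupported(self) -> list[Claim]:
        # Claims with no attached source; the interface can flag or suppress these.
        return [claim for claim in self.claims if not claim.sources]
```

With this structure the interface can render each claim next to its passage, and anything returned by `unsupported()` can be withheld or visibly marked.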
Common mistakes when implementing RAG in production
1. Tuning the model when the problem is the system
Teams that hit reliability problems in production often respond by adjusting the language model: stricter prompts, lower temperature, different model providers. Sometimes this helps. More often the problem is not the model. The problem is that the system is asking the model to produce an answer regardless of whether the retrieved content supports one. No amount of model tuning fixes that.
2. Treating answers as binary
A RAG system that produces an answer when it should produce uncertainty, or that produces uncertainty when a confident answer is justified, is failing in different but equally costly ways. The system needs gradations of confidence, and those gradations need to be visible in the output. A flat answer format that does not distinguish between high and low confidence is not a complete RAG implementation. It is half of one.
3. Designing source attribution as decoration
Many RAG systems display source links underneath the answer as a kind of visual reassurance, without designing the links to actually support verification. The links go to long PDFs without highlighting the relevant passage. The cited document does not contain the claim. The user has no practical way to check. Source attribution is only meaningful when it can be acted on, and designing it for acting requires more engineering than displaying it as a footer.
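A basic guard against decorative citations is to check that a quoted passage actually appears in the cited document, and to return its location so the interface can highlight it. The sketch below handles verbatim quotes only, ignoring case and whitespace differences. Paraphrased claims need a different kind of check, which is not shown here. The function name is hypothetical.

```python
import re
from typing import Optional


def locate_quote(quote: str, document: str) -> Optional[tuple[int, int]]:
    """Return (start, end) character offsets of `quote` in `document`, or None if absent.

    Matching ignores case and treats any run of whitespace as equivalent,
    so line breaks introduced by PDF extraction do not cause false negatives.
    """
    words = quote.split()
    if not words:
        return None
    pattern = r"\s+".join(re.escape(word) for word in words)
    match = re.search(pattern, document, flags=re.IGNORECASE)
    return match.span() if match else None
```

A citation whose quote cannot be located should not be shown as a source, and the offsets for one that can be located are what make a link to the exact passage possible.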
4. Optimizing for the demo, not the user
The demo audience wants impressive answers. The production user wants reliable ones. These are different design goals, and the system designed for one is rarely the system needed for the other. Teams that optimize for the demo end up with fluent, confident, frequently wrong outputs. Teams that optimize for the user end up with systems that sometimes say "I do not know" and that occasionally produce no answer at all. The second kind of system is what production needs.
Where to start if you are evaluating a RAG architecture
Most teams reach the end of an article like this with the same question: where do we begin?
Here are a few practical starting points, in the order they should be addressed:
Define the confidence behavior before the retrieval architecture
What does a high-confidence answer look like? What does a low-confidence answer look like? What does the system do when it cannot answer at all? If you cannot clearly describe these three states, the conversation about architecture is premature. The hardest decisions in RAG are not technical; they are about what the product should do when it does not know.
Identify your accuracy threshold in operational terms
What error rate is acceptable, and what happens when an error occurs? In some domains, a wrong answer is an inconvenience. In others, it is a liability. The architecture decisions differ significantly between the two. As we covered in our piece on how to build a business case for AI before writing a line of code, the accuracy threshold needs to sit in the requirements before any model evaluation begins.
Test the system against the hardest queries in your domain, not the easiest ones
Most RAG demos succeed on broad, simple questions and fail on narrow, complex ones. The narrow complex queries are the ones that will determine whether the system is actually useful in production. Build your evaluation set from the questions that have historically been hardest to answer, not the ones that look impressive in a screenshot.
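A hard evaluation set should include questions the knowledge base cannot answer, because those are the cases where a confident response is most dangerous. The sketch below assumes a hypothetical `system` callable that returns `None` when it abstains. It only counts whether the system answered or abstained, and does not judge whether a given answer is correct, which needs a separate review step.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class EvalCase:
    query: str
    answerable: bool  # False when the knowledge base does not contain the answer


def score_abstention(
    cases: list[EvalCase],
    system: Callable[[str], Optional[str]],
) -> dict[str, int]:
    """Count how often the system answers or abstains, split by whether an answer exists."""
    counts = {
        "answered_answerable": 0,
        "abstained_unanswerable": 0,
        "answered_without_basis": 0,  # the confident-hallucination risk
        "abstained_needlessly": 0,
    }
    for case in cases:
        answered = system(case.query) is not None
        if case.answerable and answered:
            counts["answered_answerable"] += 1
        elif not case.answerable and not answered:
            counts["abstained_unanswerable"] += 1
        elif answered:
            counts["answered_without_basis"] += 1
        else:
            counts["abstained_needlessly"] += 1
    return counts
```

Tracking `answered_without_basis` separately from overall accuracy keeps the most costly failure visible instead of averaging it away.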
Design the source attribution as a verification tool, not a citation
Source links need to point to the specific passage that supports the claim, not just to the document that contains it. This is a small engineering distinction that has a large effect on how much the user trusts the system, and how much they should.
Treat the embedding model as a product decision, not a default
The right embedding model for your domain may not be the most popular one. It may need to be fine-tuned on domain-specific content. The investment is significant, but it is often the single largest lever for accuracy.
The teams that get RAG into production reliably are the ones who treat confidence design as the actual product, and retrieval as one of the components that supports it. The teams that struggle build retrieval first and assume confidence will follow.
Why RAG remains essential despite the rise of agentic systems
There is a current narrative that retrieval-augmented generation is being superseded by agentic architectures, tool use and longer context windows. The narrative is partly right. The capabilities adjacent to RAG have evolved significantly, and the boundary between RAG and adjacent patterns is increasingly blurred.
What has not changed is the underlying problem RAG solves. Most enterprise content lives outside the training data of any general-purpose model, and most decisions in regulated or high-stakes domains require traceable sourcing rather than model recall. Until both of those conditions change, some form of retrieval-augmented generation will remain part of the architecture. For regulated environments specifically, where source data cannot leave a security perimeter, the constraint shapes the system from the start, as we explored in our article on how regulated companies build AI without third-party APIs.
What is changing is what the surrounding system looks like. The retrieval layer is being augmented by tool calls, structured queries, multi-step reasoning and dynamic context construction. The principle stays the same: the system needs to be honest about what it knows. The implementation gets more sophisticated, but the engineering discipline does not change.
The teams that get this work right are not the ones building the most sophisticated RAG architectures. They are the ones who understand that the user's question is not "what is the answer" but "can I trust this answer enough to act on it." The architecture exists to make the second question answerable.
Build a RAG system that knows what it does not know
If you are designing or improving a RAG system for a high-reliability environment and want a clear-eyed view of where confidence engineering fits in your architecture, we are happy to work through it with you.