Hard-earned lessons from building and deploying real-world AI retrieval systems.
What I got wrong about RAG in my first three months
Have you ever felt that particular confidence that comes from understanding a system conceptually before you’ve built it? You’ve read the documentation. You understand the architecture: retrieval layer, embedding model, vector database, generation model, prompt. You can draw the diagram and explain it to someone who hasn’t read the documentation. You feel prepared.
Then you build it. And the diagram becomes useless almost immediately.
This is what happened to me in my first three months of working seriously with RAG systems. I had a solid background in production software architecture: systems that serve thousands of users, where reliability is critical and failures have real consequences. I assumed that experience would transfer seamlessly. Some of it did, though not in the way I expected, and the places where it didn’t transfer are where I made my biggest mistakes.
Here’s where I went wrong.
I treated retrieval as a solved problem
The first mistake was assuming that if vector search returned results, retrieval worked.
It didn’t. Returning results and returning relevant results are not the same thing. A vector database will always return the k most similar chunks for a query, even when none of them is actually useful. The system doesn’t tell you it doesn’t know. It gives you the closest match it has, and then the generation model does its best with insufficient context.
I work in mobile development, where a query to the database returns matching records (or nothing if there’s no match). The failure mode is explicit. In RAG, the failure mode is silent. The system returns something that seems plausible, but it may be incorrect or incomplete, and nothing in the process signals this.
The solution isn’t a better vector database, but better instrumentation. You need to log retrieval scores, inspect what is actually being retrieved for representative queries, and define a threshold below which the retrieved context triggers a fallback instead of a generation attempt. That threshold is a design decision, not a configuration value, and it took me a while to learn to treat it as such.
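As a minimal sketch of that idea, the snippet below logs every retrieval score and refuses to generate when nothing clears a threshold. The names `RetrievedChunk`, `select_context`, `answer`, the fallback text, and the 0.75 cutoff are all hypothetical, and it assumes the vector store reports a similarity score where higher means more similar.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("rag.retrieval")

# Hypothetical cutoff: derive the real value from inspecting representative queries.
MIN_RELEVANCE_SCORE = 0.75
FALLBACK_MESSAGE = "I could not find relevant information to answer that."


@dataclass
class RetrievedChunk:
    text: str
    score: float  # assumed: similarity score, higher is more similar


def select_context(query, chunks, min_score=MIN_RELEVANCE_SCORE):
    """Log every score, then keep only chunks at or above the threshold."""
    for rank, chunk in enumerate(chunks, start=1):
        logger.info("query=%r rank=%d score=%.3f", query, rank, chunk.score)
    return [chunk for chunk in chunks if chunk.score >= min_score]


def answer(query, chunks, generate):
    """Call `generate` only when some retrieved chunk is relevant enough."""
    context = select_context(query, chunks)
    if not context:
        # Explicit failure mode: no silent generation over weak context.
        return FALLBACK_MESSAGE
    return generate(query, context)
```

Here `generate` stands in for whatever function calls the generation model. The point of the design is that an empty context becomes an explicit branch in the code rather than something the model is left to paper over.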
I thought the system prompt was a guardrail
The second mistake was relying on prompt-level restrictions to enforce behavioral limits.
When you tell a model “don’t answer off-topic questions” or “don’t suggest follow-up questions”, the model tries to comply. But compliance is probabilistic, not guaranteed. The model weighs your instruction against everything it learned during training, including a strong tendency toward helpful, conversational behavior. Sometimes the instruction wins. Sometimes it doesn’t.
I realized this when the responses started deviating in ways that contradicted the explicit instructions in the system prompt. The instructions hadn’t changed, but the model had, following a vendor update, and the new version weighted the instructions differently.
The lesson should have been obvious: the behavior you need to guarantee can’t reside solely in a layer you don’t control. The prompt instructions are a first filter.
Critical behavioral constraints must be enforced at the output layer, with validation logic that runs before the response reaches the user. If a response contains unwanted patterns (such as conversational closings, out-of-context content, or specific phrases), these must be detected programmatically rather than left to a prompt instruction that the model may ignore.
This is not a workaround. This is how reliable systems are designed.
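An output-layer check of this kind could be sketched as follows. The pattern names, the regular expressions, and the `deliver` helper are hypothetical examples, not a complete policy; a real system would derive its patterns from the constraints it actually states in the system prompt.

```python
import re
from dataclasses import dataclass, field

# Hypothetical patterns for illustration only.
FORBIDDEN_PATTERNS = {
    "follow_up_offer": re.compile(
        r"\b(would you like|let me know if|feel free to ask)\b", re.IGNORECASE
    ),
    "conversational_closing": re.compile(
        r"\b(hope this helps|happy to help)\b", re.IGNORECASE
    ),
}


@dataclass
class ValidationResult:
    ok: bool
    violations: list = field(default_factory=list)


def validate_response(text):
    """Return which forbidden patterns, if any, appear in the model output."""
    violations = [
        name for name, pattern in FORBIDDEN_PATTERNS.items() if pattern.search(text)
    ]
    return ValidationResult(ok=not violations, violations=violations)


def deliver(text, fallback="Sorry, I can't provide an answer to that right now."):
    """Gate the response: only validated text reaches the user."""
    result = validate_response(text)
    return text if result.ok else fallback
```

Returning a fallback is only one option. Depending on the product, a failed check might instead trigger a retry, strip the offending sentence, or just be logged so that drift after a model update becomes visible.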
I underestimated the importance of chunk quality
The third mistake: I focused on the retrieval mechanism and ignored the data it retrieves from. The chunking strategy (i.e., how the source documents are divided before indexing) seemed like a preprocessing detail. It isn’t. It’s one of the most important architectural decisions in a RAG system, and it goes almost unnoticed until something goes wrong.
Chunks that are too small lose context. A sentence extracted from the middle of a technical explanation might be semantically similar to a query, but useless as the basis for a correct answer.
Chunks that are too large dilute the signal. The vector representation averages too many ideas, and retrieval becomes imprecise.
The problem is that poor chunking degrades quality gradually, not catastrophically. The system works. The answers are, for the most part, reasonable. But the quality ceiling is set by the chunk structure, and you don’t realize this until you start systematically evaluating the output against expected responses and find that the results are consistently mediocre.
Tuning these parameters is usually a lengthy process, but changes to chunking often have a greater impact than any modification to the retrieval layer.
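To make the size trade-off concrete, here is a minimal word-based chunker with the two parameters one would typically vary during that tuning. It is a sketch under assumptions: `chunk_words` is a hypothetical helper, the defaults are arbitrary, and the overlap between neighbouring chunks is a common mitigation for lost context that the text above does not itself prescribe. Real pipelines often split on sentences, headings, or tokens instead of whitespace.

```python
def chunk_words(text, chunk_size=200, overlap=40):
    """Split text into chunks of up to `chunk_size` words, sharing `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

Keeping the size and overlap as explicit arguments makes it straightforward to re-index the same documents under several settings and compare them against the same set of representative queries.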
There was no baseline
This was the mistake that made diagnosing all the others difficult.
If you don’t establish a documented baseline when the system is first deployed, you have no record of what correct answers should look like, no set of reference queries with expected results, and no retrieval quality metrics from the point at which you were satisfied with the system’s behavior.
When the behavior starts to change, there’s no point of reference. You can see that things are different, but you can’t demonstrate by how much, which layer changed, or when it started.
In traditional software, regression testing is standard practice. You define the expected behavior, run the tests, and detect deviations. RAG systems require the same discipline, adapted to probabilistic results.
This means a set of representative queries with documented acceptable responses, retrieval quality metrics monitored over time, and a process for re-running that benchmark after any significant change to the model, the index, the prompt, or any dependency.
This isn’t complicated. It’s the same engineering hygiene applied to a different system.
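A baseline check along these lines might look like the sketch below. It assumes each reference query has a known set of chunk IDs that should be retrieved, and it uses recall at k as the retrieval metric; the names `ReferenceQuery`, `run_benchmark`, `has_regressed`, and the 0.05 tolerance are hypothetical, and other metrics would fit the same structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReferenceQuery:
    query: str
    expected_ids: frozenset  # chunk IDs a good retrieval should return


def recall_at_k(retrieved_ids, expected_ids, k):
    """Fraction of the expected chunk IDs found in the top-k retrieved IDs."""
    if not expected_ids:
        return 1.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & expected_ids) / len(expected_ids)


def run_benchmark(retrieve, references, k=5):
    """Mean recall@k over the reference set; `retrieve` maps a query to ranked IDs."""
    if not references:
        raise ValueError("reference set is empty")
    scores = [recall_at_k(retrieve(ref.query), ref.expected_ids, k) for ref in references]
    return sum(scores) / len(scores)


def has_regressed(current, baseline, tolerance=0.05):
    """True when the current score falls below the recorded baseline minus a tolerance."""
    return current < baseline - tolerance
```

The baseline value itself would be recorded once, when the system is behaving as intended, and stored alongside the reference set. The tolerance exists because results are probabilistic; how wide to make it is a judgment call for each system.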
What needs to be done
To avoid these types of situations, I would consider these four aspects essential from day one:
- Validate retrieval first. Before adjusting prompts or evaluating the quality of the generated output, verify that what you are retrieving is actually relevant: log the scores, inspect the results, and define a minimum acceptance threshold.
- Validate the output as an architectural layer. Behavioral constraints applied only at the prompt level are fragile; a change in the model can break them. Any constraint important enough to specify is important enough to verify programmatically.
- Treat the chunking strategy as a first-class decision. It should be deliberately defined, its rationale documented, and tested with representative queries before indexing production data.
- Establish a documented baseline from the first deployment: a set of reference queries, expected results, and retrieval metrics recorded when the system is functioning as intended. Everything else is compared to that.
Note that none of these are advanced techniques. They are the application of basic production engineering discipline to a system that most teams still treat as experimental, even after releasing it to users.
That gap between the experimental mindset and production reality is where most RAG failures originate. Closing it doesn’t require better models or larger context windows. It requires treating the system as what it is: a system.
Conclusion
The four mistakes here aren’t exotic. They’re the predictable result of applying a traditional engineering mindset to a system that behaves differently in ways that aren’t always visible. The corrections are equally straightforward. What they require isn’t advanced technique. It’s the same production discipline applied earlier, and more deliberately, to a system that most teams are still treating as a prototype after they’ve already shipped it.