Observability in RAGs. What You Need to Measure Before Hitting the "Deploy" Button

There’s a version of RAG systems that looks production-ready, but it’s an illusion.

It passes manual tests. The responses are convincing. The architecture diagram on the whiteboard is fantastic. The team is happy. But three months after launch, nobody can say whether the system is working better or worse than on day one, mostly because nobody bothered to define what “working” meant in concrete, measurable terms.

This isn’t theory; it’s the default state of most RAGs out there right now.

In traditional software, observability is familiar territory: latency, error rates, alerts, and dashboards. In RAG systems, we need that same discipline, but applied to failures that don’t usually give warnings and to metrics that are nothing like what we’re used to.

This is what you need to have under control before going into production.

The pipeline has four layers (and each can break on its own)

A RAG system is not a monolithic block. It consists of at least four distinct layers, and each can degrade without the others noticing:

  • Retrieval layer. Returns chunks of text from the corpus. It fails if it returns garbage, if it returns something that sounds similar but is unrelated, or if the data is outdated.

  • Context assembly layer. Assembles these chunks for the model. It fails if the chunking strategy is poor, if the token limit is exceeded, or if the chunks are inconsistent.

  • Generation layer. The LLM writes the response. It fails with hallucinations, changes in tone, or if the model provider updates the version and changes how it responds.

  • Evaluation layer. Validates the output before the user sees it. Most systems don’t even have one, and that’s a design flaw.

If you only look at the end result, you’ll know something is wrong, but you’ll have no idea why.
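As a rough illustration of why per-layer visibility matters, the four layers can be instrumented separately so each one reports its own latency and outcome. Everything here is a hypothetical sketch: the `timed` helper and the stand-in layer functions are not from any particular framework.

```python
import time
from dataclasses import dataclass, field

@dataclass
class TraceSpan:
    """Timing and status for one layer, so each stage can degrade visibly."""
    stage: str
    latency_ms: float
    ok: bool

@dataclass
class PipelineTrace:
    spans: list = field(default_factory=list)

def timed(trace, stage, fn, *args):
    """Run one layer and record its latency and outcome as a separate span."""
    start = time.perf_counter()
    try:
        result, ok = fn(*args), True
    except Exception:
        result, ok = None, False
    trace.spans.append(TraceSpan(stage, (time.perf_counter() - start) * 1000, ok))
    return result

# Hypothetical stand-ins for the four layers:
def retrieve(query): return ["chunk-a", "chunk-b"]
def assemble(chunks): return "\n".join(chunks)
def generate(context): return f"answer based on {len(context)} chars of context"
def evaluate(answer): return answer is not None

trace = PipelineTrace()
chunks = timed(trace, "retrieval", retrieve, "some query")
context = timed(trace, "assembly", assemble, chunks)
answer = timed(trace, "generation", generate, context)
valid = timed(trace, "evaluation", evaluate, answer)
```

With this shape, "something is wrong" becomes "retrieval is fine but generation latency doubled", which is exactly the diagnosis a single end-to-end metric cannot give you.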

Retrieval quality: the metric everyone forgets

This is the most important indicator, and the one almost no one monitors in production. Why? Because it’s difficult. It’s easy to read a response and say “I like it,” but it’s much more laborious to analyze retrieval scores and decide if they are acceptable.

If you want to monitor this seriously, you need the following:

  • Track the distribution of scores. Record the similarity scores of the retrieved chunks for each query. If the average starts to drop, your index is degrading or the embedding model is behaving strangely.

  • Perform random manual reviews. Numbers aren’t everything; take a sample of real queries and manually check if what the system retrieved was truly useful or just “similar words.”

  • Low-confidence logs. Define a minimum threshold, and if the retrieval doesn’t reach that level, prevent the system from trying to fabricate a response. These logs will tell you where you have gaps in your knowledge base.

Without this, you’re flying blind on the problem that most degrades the quality of a RAG.
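A minimal sketch of the score tracking and low-confidence flagging described above. The threshold value and all names here are assumptions you would tune per embedding model, not part of any library.

```python
import statistics

# Assumption: a reasonable floor for cosine-similarity scores; tune per model.
LOW_CONFIDENCE_THRESHOLD = 0.35

def log_retrieval(query, scored_chunks, history):
    """Record the top similarity score per query and flag low-confidence hits.

    scored_chunks is a list of (chunk_id, score) pairs; history accumulates
    top scores so drift can be measured over time.
    """
    top_score = max(score for _, score in scored_chunks) if scored_chunks else 0.0
    history.append(top_score)
    return {
        "query": query,
        "top_score": top_score,
        "low_confidence": top_score < LOW_CONFIDENCE_THRESHOLD,
    }

def score_drift(history, window=100):
    """Compare the recent mean score against the older mean.

    A positive result means recent scores are dropping, which hints at
    index degradation or an embedding-model change.
    """
    if len(history) < 2 * window:
        return 0.0
    recent = statistics.mean(history[-window:])
    older = statistics.mean(history[:-window])
    return older - recent
```

The `low_confidence` flag doubles as the refusal signal: when it fires, the system can decline to answer instead of fabricating one, and the flagged queries map your knowledge-base gaps.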

It’s not enough to “review” the generation; you need a baseline

Reviewing responses before launch is fine, but it’s useless if you don’t have a reference point to compare against afterward. In production, keep in mind that the model is dealing with inputs you didn’t see in your tests.

A decent generation baseline needs:

  • A set of reference queries with normal and hard cases. You don’t need the exact answer, but rather the criteria to ensure it’s appropriate, avoids certain phrases, and remains faithful to the retrieved context.

  • Some constraint tests so that, for example, if you’ve told the prompt not to discuss certain topics, you actively test that this constraint is met every time you update the model or the prompt.

  • Anomaly detection in the output: look for unusual patterns such as excessively long responses, canned model phrases that shouldn’t be there, or out-of-scope content.

  • Review latency and costs, as reliability is also a matter of performance and money (don’t consider them “side effects”).

  • Keep in mind that latency is additive. That is, each layer adds time: retrieval takes time, assembly takes time, generation takes time, and validation takes time. What was fast in testing can be unbearable with real-world traffic.

  • Finally, the token tax can skyrocket if retrieval results in too many chunks or if they are very large. Monitor token consumption per request; if it suddenly increases, your architecture is inefficient.
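The constraint tests and anomaly checks above can be sketched as a cheap post-generation gate. The forbidden phrases, length limit, and word-overlap faithfulness heuristic are illustrative assumptions, not a substitute for proper evaluation; the point is that these checks run automatically after every model or prompt update.

```python
# Assumptions: example values you would replace with your own constraints.
FORBIDDEN_PHRASES = ["as an AI language model", "I cannot browse"]
MAX_ANSWER_CHARS = 2000

def check_answer(answer, context):
    """Run baseline checks on a generated answer; return a list of issues."""
    issues = []
    # Constraint test: phrases the prompt forbids must never appear.
    for phrase in FORBIDDEN_PHRASES:
        if phrase.lower() in answer.lower():
            issues.append(f"forbidden phrase: {phrase}")
    # Anomaly check: excessively long responses.
    if len(answer) > MAX_ANSWER_CHARS:
        issues.append("answer unusually long")
    # Crude faithfulness proxy: the answer should reuse vocabulary
    # from the retrieved context.
    ctx_words = set(context.lower().split())
    ans_words = set(answer.lower().split())
    overlap = len(ctx_words & ans_words) / max(len(ans_words), 1)
    if overlap < 0.2:
        issues.append("low overlap with retrieved context")
    return issues
```

Run this over your reference query set after each change; a sudden jump in issues is the regression signal you otherwise only get from user complaints.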

Minimum viable requirements for a production RAG

Before releasing it to users, here’s the minimum you should have:

  • A retrieval log with the query, chunk IDs, scores, and a “low confidence” flag.

  • A generation log with the sent context, model version, the complete response, and the validation results.

  • A process to run your reference queries after each change.

  • A metrics dashboard that allows you to see similarity scores, latency per stage, and token consumption.

  • A runbook outlining what to do if quality drops or the model starts misbehaving after an update.
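One possible shape for the first two items, written as JSON lines so a dashboard can later aggregate scores, latency, and token counts per request. All field names here are hypothetical; the structure is the point.

```python
import json
import time

def retrieval_log_entry(query, chunks, low_confidence):
    """One JSON-serializable record per retrieval, with IDs, scores, and flag."""
    return {
        "ts": time.time(),
        "query": query,
        "chunk_ids": [c["id"] for c in chunks],
        "scores": [c["score"] for c in chunks],
        "low_confidence": low_confidence,
    }

def generation_log_entry(context, model_version, response, validation):
    """One record per generation: sent context, model version, full response,
    and the evaluation-layer verdict."""
    return {
        "ts": time.time(),
        "context": context,
        "model_version": model_version,
        "response": response,
        "validation": validation,
    }

# Append each record as one JSON line to a log file or stream:
entry = retrieval_log_entry("refund policy?", [{"id": "c1", "score": 0.82}], False)
line = json.dumps(entry)
```

Logging the model version per request is what lets you attribute a quality drop to a provider-side update rather than to your own changes.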

Conclusion

Launching a RAG in production without observability isn’t “taking risks,” it’s a lack of information. You won’t know if the system improves or worsens, or who is responsible when it fails.

Observability doesn’t magically make the system better, but it’s the only thing that allows you to see reality in order to fix it.

Author
Raúl Ferrer
Published at
2024-10-22
License
CC BY-NC-SA 4.0