Here's How Re-ranking Fixes It
The problem nobody talks about until it happens in production
Three weeks ago, while reviewing the retrieval logs from some tests, I noticed something that made me stop and take a closer look.
These tests asked: “What are the governance requirements for AI systems in the EU?”
The system returned five documents. All of them scored above 0.75 on the similarity threshold from my previous article.
In theory, all five documents were relevant to AI governance. But on reading them, four dealt with process governance (audit logs, data lineage, version control), and only the fifth dealt with regulatory governance: Article 6 of the EU AI Act, risk classifications, and documentation requirements.
In this case, by chance, the LLM selected the regulatory document and therefore answered correctly.
But what would have happened if the LLM had prioritized the process documents and started building a response from them? The context would have seemed correct, and the answer would have sounded convincing. However, most likely, no one would have noticed until a compliance officer pointed out that we were documenting something incorrect.
It was then that I realized that the minScore threshold I had established in the previous article was insufficient. Relevance is not binary. And naive retrieval ranks documents based on a single signal: vector similarity. That signal alone is unreliable.
This is the failure mode that separates prototypes from production systems.
What naive retrieval actually does (and doesn’t)
In the article I published on how RAG works, I showed this configuration:
```java
ContentRetriever retriever = EmbeddingStoreContentRetriever.builder()
        .embeddingModel(embeddingModel)
        .embeddingStore(embeddingStore)
        .maxResults(5)
        .minScore(0.75)
        .build();
```

This retrieval system performs a single function: take the user's question, search the vector database, and return the N best results that exceed the similarity threshold.
The implicit assumption in this design is: “If the vector similarity is high enough, the document is relevant enough.”
But this assumption can be wrong more often than you might think.
Here’s why: Vector similarity measures semantic closeness in a mathematical space. However, relevance to a specific question is a matter of judgment. A document can be semantically similar to the question and still not answer it correctly.
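To make the distinction concrete, here is a toy sketch (not from the repository): cosine similarity over three hand-invented dimensions standing in for real embeddings. Both documents score above a 0.75 threshold against the question; the number alone cannot tell which one answers it.

```java
// Toy illustration: cosine similarity rewards topical closeness, not answers.
// The 3-d vectors are hand-made stand-ins for real embedding vectors.
public class SimilarityVsRelevance {

    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na  += a[i] * a[i];
            nb  += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        // Invented dimensions: [deadlines, regulation, project-management]
        double[] question       = {0.9, 0.8, 0.1}; // "AI Act compliance deadline?"
        double[] regulatoryDoc  = {0.8, 0.9, 0.0}; // actually answers the question
        double[] projectMgmtDoc = {0.9, 0.3, 0.6}; // about deadlines, wrong kind

        // Both scores come out above 0.75 — the threshold cannot separate them.
        System.out.printf("regulatory:  %.2f%n", cosine(question, regulatoryDoc));
        System.out.printf("projectMgmt: %.2f%n", cosine(question, projectMgmtDoc));
    }
}
```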
Example
A user asks, “What is the deadline for compliance with the AI Act?”. The system retrieves a document that deals extensively with deadlines and timelines. High similarity score. But that document deals with project management deadlines, not regulatory deadlines. Thus, the question is answered incorrectly, and the system has no mechanism to detect this error.
The vector model believes it has done its job, and the LLM believes it has good context. Only the user notices that the answer is wrong.
In a business environment, especially in regulated sectors, this is unacceptable. Under the EU AI Act, high-risk AI systems must maintain documentation demonstrating that decisions are explainable and justifiable. If retrieval fails silently, the audit log becomes misleading. As I've argued before, architecture matters more than prompts: a confident but unreliable system is worse than a slower but accurate one.
Two phases: broad search, precise ranking
The solution involves dividing the search into two distinct phases.
Phase 1: Broad Search. The similarity threshold should be reduced. This will yield a larger number of candidates (more than necessary, which will introduce some noise). However, the cost is low: only vector comparisons are performed.
Phase 2: Precise Ranking. Next, we reduce the candidates by applying a second filter. This filter is not based solely on vector similarity; it uses a more intelligent question: “Given the user’s specific question, is this document truly useful?”
This is called re-ranking, and it’s where the search quality improves significantly.
Here’s how it looks in code:
```java
@Service
public class ReRankingRagService {

    private static final int RETRIEVAL_CANDIDATES = 10;
    private static final double RERANK_MIN_SCORE = 0.8;

    private final ChatModel chatModel;
    private final ContentRetriever retriever;

    public ReRankingRagService(ChatModel chatModel,
                               EmbeddingModel embeddingModel,
                               EmbeddingStore<TextSegment> embeddingStore) {
        this.chatModel = chatModel;

        // Phase 1 — retrieve broadly
        this.retriever = EmbeddingStoreContentRetriever.builder()
                .embeddingModel(embeddingModel)
                .embeddingStore(embeddingStore)
                .maxResults(RETRIEVAL_CANDIDATES) // 10, not 5
                .minScore(0.5)                    // lower threshold
                .build();
    }
```

Observe the changes:
- maxResults: increased from 5 to 10 (broader reach)
- minScore: lowered from 0.75 to 0.5 (more candidates accepted)
We are deliberately retrieving more information and accepting documents that the first filter would have rejected.
This may seem incorrect at first glance: it introduces noise into the system. But that is precisely the goal. Phase 1 focuses on comprehensiveness: not omitting anything that could be relevant. Phase 2 will handle accuracy.
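The two-phase split can be sketched independently of LangChain4j. Everything below — the `Candidate` record, the scores, the relevance predicate — is invented for illustration, not part of the repository:

```java
import java.util.Comparator;
import java.util.List;
import java.util.function.BiPredicate;
import java.util.stream.Collectors;

// Minimal sketch of the two-phase pattern with in-memory stand-ins.
public class TwoPhaseRetrieval {

    record Candidate(String text, double similarity) {}

    // Phase 1: recall-oriented — low threshold, generous cap.
    static List<Candidate> retrieveBroadly(List<Candidate> index,
                                           double minScore, int maxResults) {
        return index.stream()
                .filter(c -> c.similarity() >= minScore)
                .sorted(Comparator.comparingDouble(Candidate::similarity).reversed())
                .limit(maxResults)
                .collect(Collectors.toList());
    }

    // Phase 2: precision-oriented — a smarter judgment per candidate.
    static List<Candidate> reRank(List<Candidate> candidates,
                                  BiPredicate<String, String> isRelevant,
                                  String question) {
        return candidates.stream()
                .filter(c -> isRelevant.test(question, c.text()))
                .collect(Collectors.toList());
    }
}
```

In the real service, the predicate in Phase 2 is an LLM call (or a dedicated re-ranker model); here it is just a function so the shape of the pipeline is visible.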
The re-ranking process: quality at the cost of latency
Once Phase 1 returns 10 candidates, Phase 2 processes them with a re-ranker. In the current implementation, the re-ranker is the LLM itself:
```java
public ReRankingResponse query(String userQuestion) {
    log.info("[RERANKING] Query: {}", userQuestion);

    // Phase 1: retrieve broadly
    List<Content> candidates = retriever.retrieve(Query.from(userQuestion));
    log.info("[RERANKING] Candidates retrieved: {}", candidates.size());

    // Phase 2: re-rank by asking the LLM about relevance
    List<Content> reranked = candidates.stream()
            .filter(c -> isRelevant(userQuestion, c.textSegment().text()))
            .collect(Collectors.toList());

    log.info("[RERANKING] After reranking: {} chunks (from {} candidates)",
            reranked.size(), candidates.size());

    if (reranked.isEmpty()) {
        return ReRankingResponse.noContext(userQuestion, candidates.size());
    }

    String context = reranked.stream()
            .map(c -> c.textSegment().text())
            .collect(Collectors.joining("\n\n---\n\n"));

    String answer = chatModel.chat(buildPrompt(userQuestion, context));
    return ReRankingResponse.of(userQuestion, answer,
            candidates.size(), reranked.size());
}
```

The filtering is performed by the following method:

```java
private boolean isRelevant(String question, String chunk) {
    String prompt = """
            Is the following text relevant to answer this question?
            Question: %s
            Text: %s
            Reply with only YES or NO.
            """.formatted(question, chunk.substring(0, Math.min(200, chunk.length())));

    String response = chatModel.chat(prompt).trim().toUpperCase();
    return response.startsWith("YES");
}
```

For each candidate, we ask the LLM: "Is this relevant?" The LLM answers YES or NO, which determines whether the candidate is kept or discarded.
However, this comes at a real latency cost. With 10 candidates and roughly 120 ms per LLM evaluation, the re-ranking step alone takes 1.2 seconds. Nevertheless, it is accurate: the LLM understands context, nuance, and intent, so it can reject a semantically similar document that doesn't actually answer the question.
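If the sequential LLM calls are the bottleneck, one mitigation is to fan the relevance checks out concurrently. This sketch is generic and not from the repository; it assumes your model client is thread-safe and your provider's rate limits allow parallel calls:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.function.Predicate;

// Sketch: evaluate candidates concurrently instead of one after another.
// The predicate stands in for the per-candidate LLM relevance call.
public class ParallelReRanker {

    static <T> List<T> filterConcurrently(List<T> candidates,
                                          Predicate<T> isRelevant,
                                          ExecutorService pool) {
        // Submit one relevance check per candidate.
        List<Future<Boolean>> verdicts = new ArrayList<>();
        for (T c : candidates) {
            verdicts.add(pool.submit(() -> isRelevant.test(c)));
        }
        // Collect verdicts in submission order, preserving the ranking.
        List<T> kept = new ArrayList<>();
        for (int i = 0; i < candidates.size(); i++) {
            try {
                if (verdicts.get(i).get()) {
                    kept.add(candidates.get(i));
                }
            } catch (InterruptedException | ExecutionException e) {
                throw new RuntimeException("relevance check failed", e);
            }
        }
        return kept;
    }
}
```

With a pool of 10 threads, the wall-clock cost of re-ranking 10 candidates approaches the latency of a single evaluation instead of ten.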
In production, this LLM-based filter would be replaced with a specialized re-ranker such as Cohere Rerank, BGE Reranker, or similar. These models are trained specifically for relevance ranking and are much faster. But the principle remains the same: a smarter second selection filter.
What changes with re-ranking?
Below, you can observe the log data.
In naive retrieval (Module 01):
```
Query: "What are the key AI governance requirements?"
Candidates retrieved: 5
Candidates passing minScore: 5
Chunks used for generation: 5
Context sufficiency: ✓ (contextFound=true)
```

In re-ranking retrieval (Module 02):

```
Query: "What are the key AI governance requirements?"
Candidates retrieved: 10
Candidates after re-ranking: 4
Chunks used for generation: 4
Rejected by re-ranker: 6
Context sufficiency: ✓ (contextFound=true)
```

That's a 60% rejection rate after the first phase. Those six documents scored above 0.5 similarity but didn't actually answer the question.
The trade-off you’re making
This trade-off matters on two fronts: business reliability and regulatory compliance.
For reliability. You’re sacrificing latency for accuracy. The system is slower, but more precise. In some areas, this is the right trade-off. In others, it isn’t.
The answer quality improves. Not because the LLM got better. Because the context it received was curated more carefully.
A conversational chatbot would certainly not tolerate an extra second of latency. But what about an audit trail system, a compliance document assistant, or a legal research tool? They will most likely accept latency in exchange for accuracy.
For regulatory compliance. This is where the EU AI Act comes into play.
Article 13 of the EU AI Act requires that high-risk AI systems maintain transparency: they must be able to explain their decisions. Such explanations must be based on evidence.
The hidden cost: more calls to the LLM
There’s a problem that’s often overlooked when promoting the re-ranking process.
With simple retrieval, the LLM makes one call per query: to generate the final response.
With re-ranking, the LLM makes n+1 calls per query: n calls to assess relevance, plus one to generate the response.
This means that if 1,000 queries are run daily on 10 candidates, 10,000 relevance assessments are generated. This is where the token tax becomes a reality. Costs multiply significantly.
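The arithmetic is simple enough to encode as a sanity check. The helper below is illustrative, not part of the repository:

```java
// Back-of-the-envelope: LLM calls per day with and without re-ranking.
public class CallBudget {

    static long callsPerDay(long queries, int candidates, boolean reranking) {
        // naive: 1 generation call per query
        // re-ranking: n relevance calls + 1 generation call per query
        return reranking ? queries * (candidates + 1L) : queries;
    }

    public static void main(String[] args) {
        System.out.println(callsPerDay(1_000, 10, false)); // 1000
        System.out.println(callsPerDay(1_000, 10, true));  // 11000
    }
}
```

At 1,000 queries a day, re-ranking with 10 candidates means 11,000 LLM calls instead of 1,000 — an 11× multiplier on the call budget before any caching.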
This is where specialized re-rankers make more sense: they are smaller, cheaper, and faster (although they introduce an additional dependency).
A Real-World Example: Where Naive Retrieval Fails
Let’s look at a case where the system failed until we added re-ranking.
Document A (retrieved by the naive system):

```
"The organization must maintain detailed records of all system decisions.
Records must be timestamped and immutable. Records must include the input
data, the decision, and the reasoning. This supports audit compliance."
```

Document B (retrieved by the naive system):

```
"Record management systems in enterprise environments should prioritize
availability and performance. Use modern database technologies. Consider
distributed systems for scalability."
```

Both documents scored above 0.75 on the query "What are the documentation requirements for AI systems?"
Notice that one answers the question and the other deals with database design.
The LLM, receiving both documents as context, had to choose which one to build upon. Sometimes it was right. Sometimes it wasn’t. The system lacked a mechanism to deprioritize Document B.
With re-ranking, Document B is filtered out immediately. The re-ranker asks: "Is this about the documentation requirements for AI systems?" The answer is NO, so Document B never reaches the LLM context.
The answer becomes reliable. The audit log shows why it became reliable.
The pattern: phase 1 recall, phase 2 precision
This two-phase pattern appears throughout retrieval systems:
- Dense retrieval (vectors) + sparse retrieval (keywords) = hybrid search (we’ll cover this next)
- Broad retrieval (many candidates) + re-ranking (filtered candidates) = ranking
- Retrieval (find candidates) + rewriting (improve queries) = query transformation
The pattern is always the same: cast a wide net first, then be precise.
In Module 01, you learned that naive retrieval with a good threshold works for simple cases. In Module 02, you’re learning that simple cases are rare in production.
Real documents are messy. Real questions are ambiguous. Real systems need multiple gates.
What does this mean for an audit log?
Remember the RetrievalAuditLog from Module 01? It recorded the following:
```
[RAG-AUDIT] queryId=46d61c6b | chunks=4 | contextFound=true | answer="..."
```

With re-ranking, the audit log must evolve. It now records:

```
[RERANKING-AUDIT] queryId=46d61c6b | candidates_retrieved=10 | candidates_after_reranking=4 | rejected_by_relevance_gate=6 | answer="..."
```

This additional data (the number of rejected documents) isn't for show. It's evidence: if someone questions an answer, you can explain precisely why those four documents were selected and why the other six weren't.
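A small sketch of how such a line could be assembled. The `Entry` record and method names are invented for illustration; only the field names follow the log format above:

```java
// Illustrative builder for the extended audit line.
public class ReRankingAuditLog {

    record Entry(String queryId, int retrieved, int kept) {}

    static String format(Entry e) {
        // The rejection count is derived, not stored — one less field to get wrong.
        return String.format(
                "[RERANKING-AUDIT] queryId=%s | candidates_retrieved=%d | "
                + "candidates_after_reranking=%d | rejected_by_relevance_gate=%d",
                e.queryId(), e.retrieved(), e.kept(), e.retrieved() - e.kept());
    }
}
```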
This relates to a broader principle I explored in Human Intervention Is Not a Feature: the system must be designed so that a human can understand and audit every decision. Re-ranking makes this possible.
Next: hybrid search (the real production pattern)
Re-ranking solves one problem: semantic relevance filtering. But it doesn’t solve another: what about exact matches?
If a user asks, “What is Article 9 of the EU AI Act?”, simple retrieval will find documents about AI governance and regulation, and these will score well. But if you have a document that literally contains the string “Article 9”, vector similarity might rank it below a semantically richer document that doesn’t explicitly mention Article 9.
This is where hybrid search comes in. Combining BM25 (exact match, keyword-based) with vector (semantic) search provides both accuracy and comprehensiveness.
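One common way to merge the two signal types is Reciprocal Rank Fusion (RRF), which scores each document by the reciprocal of its rank in each list; k = 60 is the constant proposed in the original RRF paper. This is a minimal generic sketch, not the repository's HybridSearchService:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of Reciprocal Rank Fusion: merge a keyword (BM25) ranking with a
// vector ranking into one list. Document IDs are plain strings here.
public class ReciprocalRankFusion {

    static List<String> fuse(List<String> bm25Ranking,
                             List<String> vectorRanking, int k) {
        Map<String, Double> scores = new HashMap<>();
        accumulate(scores, bm25Ranking, k);
        accumulate(scores, vectorRanking, k);
        return scores.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .map(Map.Entry::getKey)
                .toList();
    }

    private static void accumulate(Map<String, Double> scores,
                                   List<String> ranking, int k) {
        // rank is 0-based, RRF uses 1-based ranks: score += 1 / (k + rank + 1)
        for (int rank = 0; rank < ranking.size(); rank++) {
            scores.merge(ranking.get(rank), 1.0 / (k + rank + 1), Double::sum);
        }
    }
}
```

A document that ranks well in both lists accumulates score from both, so an exact "Article 9" keyword match and a semantically rich neighbor can both surface — which is exactly why a re-ranking pass on top of the fused list pays off.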
We’ll implement this in the next article and see why re-ranking becomes even more crucial when you have two retrieval signals instead of one.
The code
The complete implementation of ReRankingRagService can be found in Module 02 of the reliable-enterprise-ai repository (https://github.com/raulferrerdev/reliable-enterprise-ai), along with HybridSearchService and CompressingRagService.
Conclusion
Naive retrieval fails silently when it misjudges documents. You don't get an error; you get a confidently wrong answer.
Re-ranking adds a second evaluation filter that filters candidates by relevance, not just similarity. This improves accuracy at the cost of increased latency.
Most importantly for businesses: re-ranking creates an auditable decision-making process. You can document why documents were rejected, and under the EU AI Act, this documentation is mandatory if your system is classified as high risk.