A Pattern I Keep Running Into
I’ve been studying RAG systems, and there’s a failure mode that I didn’t fully appreciate at first. The system returns an answer with full confidence, no error signals, no indication that something went wrong. But when the answer is wrong, the reason is somewhere in the retrieval pipeline.
I want to work through what I’m learning about why this happens and how re-ranking addresses it.
The Pattern I Keep Seeing
When I look at how vector similarity search actually works, I notice something: high similarity scores don’t guarantee relevance in context.
Here’s an example that stuck with me: I was testing a system querying a knowledge base about AI governance. The system retrieved documents with 0.8+ similarity scores. Those documents were semantically related to the query. But semantically related isn’t the same as contextually correct.
What the system pulled back was mostly about process governance: how to manage logs, versioning, documentation. What the query was asking for was regulatory governance: the EU AI Act, compliance frameworks, oversight requirements.
Same vocabulary, different meaning. The similarity score is high, but the retrieved content is the wrong answer.
And the system didn’t flag this as a problem. It assembled the retrieved context, sent it to the LLM, and the LLM generated a plausible-sounding response based on the wrong foundation. The user got an answer. It sounded correct. It was confidently wrong.
This is the pattern that’s been nagging at me, and I think it’s important enough that I want to understand it better.
What I’m Realizing About Similarity vs. Relevance
The core issue, as I’m understanding it, is that vector similarity measures mathematical closeness in a latent space. It’s a distance metric. But distance isn’t the same as relevance.
Two documents can be mathematically close in vector space but contextually unrelated. Or they can be close AND related. Or close but relevant to a different aspect of the question than what you’re actually looking for.
The vector model doesn’t know the difference. It returns neighbors. It’s up to the rest of the system to figure out whether those neighbors are actually useful.
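A minimal sketch of what "closeness" means in code. The three vectors are hand-picked numbers for illustration, not output from a real embedding model, and the class and field names are hypothetical. The point is only that the function sees geometry: nothing in it knows which kind of governance the query meant.

```java
// Toy illustration: cosine similarity compares vector direction and nothing else.
final class CosineDemo {

    // Hand-picked vectors (assumption for illustration, not real embeddings).
    static final double[] QUERY = {0.9, 0.1, 0.4};          // "AI governance" question
    static final double[] PROCESS_DOC = {0.8, 0.2, 0.5};    // logs, versioning, documentation
    static final double[] REGULATORY_DOC = {0.7, 0.1, 0.6}; // EU AI Act, compliance

    /** Cosine similarity of two equal-length, non-zero vectors. */
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            throw new IllegalArgumentException("cosine is undefined for a zero vector");
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

With these made-up numbers, both documents score well above 0.8 against the query, and the process-governance vector scores slightly higher than the regulatory one. The ranking is mathematically valid and still wrong for the question being asked.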
When I think about production systems, especially in regulated contexts like education, this becomes a real problem. Under the EU AI Act, high-risk systems need explainable decisions. If your retrieval logic is “these vectors are closest” and that retrieval is fundamentally ambiguous (closest could mean right or wrong), then your audit trail is also ambiguous.
What Re-Ranking Does (And Doesn’t Do)
Re-ranking is an additional filtering step that comes after retrieval. Instead of trusting that the most similar vectors are the most relevant, you add a second gate that explicitly evaluates relevance.
The basic idea: cast a wide net in retrieval (get candidates), then apply a stricter relevance filter (keep only what’s actually useful).
Phase 1: Broad Recall
Retrieve more candidates than you’ll actually use. Lower your similarity threshold. The goal is to make sure you’re not missing potentially relevant documents.
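    // Lower the bar to capture potentially relevant documents
    ContentRetriever retriever = EmbeddingStoreContentRetriever.builder()
            .embeddingStore(embeddingStore) // assumed to be configured elsewhere
            .embeddingModel(embeddingModel) // assumed to be configured elsewhere
            .maxResults(10) // More candidates
            .minScore(0.5)  // Lower threshold
            .build();
Phase 2: Precise Re-ranking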
Pass those candidates through a relevance gate. This is where I’m still learning what works best.
Using a specialized re-ranking model (like Cohere Rerank) is typically more accurate than using a general-purpose LLM as a re-classifier, but it adds another service to the pipeline, with its own cost and latency. Using an LLM prompt to filter candidates needs no extra service, but it is less consistent and costs one model call per candidate.
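    private boolean isRelevant(String question, String chunk) {
        // The chunk must appear in the prompt; otherwise the model has nothing to judge
        String prompt = "Is this text relevant to answer: %s?%nText: %s%nReply ONLY YES or NO."
                .formatted(question, chunk);
        // Tolerate leading whitespace and lowercase replies
        return chatModel.chat(prompt).trim().toUpperCase().startsWith("YES");
    }
The cost-benefit tradeoff is something I’m still working through.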
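The two phases can also be sketched end to end without any framework, which makes the shape of the pattern easier to see. Everything here is hypothetical scaffolding: `Candidate`, `broadRecall`, and `rerank` are illustrative names, not LangChain4j types, and the relevance gate is passed in as a plain `BiPredicate` so it could be an LLM prompt, a dedicated re-ranking model, or a stub in a test.

```java
import java.util.Comparator;
import java.util.List;
import java.util.function.BiPredicate;

final class TwoStageRetrieval {

    /** A retrieved chunk plus its vector-similarity score (hypothetical type). */
    record Candidate(String text, double score) {}

    /** Phase 1: keep everything at or above minScore, best first, capped at maxResults. */
    static List<Candidate> broadRecall(List<Candidate> scored, double minScore, int maxResults) {
        return scored.stream()
                .filter(c -> c.score() >= minScore)
                .sorted(Comparator.comparingDouble(Candidate::score).reversed())
                .limit(maxResults)
                .toList();
    }

    /** Phase 2: keep only the candidates the relevance gate accepts, in their existing order. */
    static List<Candidate> rerank(String question,
                                  List<Candidate> candidates,
                                  BiPredicate<String, String> isRelevant) {
        return candidates.stream()
                .filter(c -> isRelevant.test(question, c.text()))
                .toList();
    }
}
```

Note that this gate only filters; it does not reorder. A re-ranker that returns a relevance score per candidate would sort by that score instead, which is closer to what dedicated re-ranking models do.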
Where This Gets Real: Educational Contexts
In an exam grading system, the re-ranking problem becomes acute. Consider a scenario:
A student’s response matches multiple documents:
- Document A: The correct answer from the official answer key
- Document B: A similar concept from a different lesson, but not what was taught
- Document C: A common misconception that uses similar vocabulary
Simple retrieval by vector similarity might rank B or C higher because they share more vocabulary with the student’s specific wording. The correct answer key (A) ranks third.
Result: The system grades the answer as incorrect, even though it matches the correct key.
This is the scenario that made me realize re-ranking isn’t optional for educational AI systems. It’s architectural.
What I’m Learning About Measuring Re-Ranking
As you move from naive retrieval to re-ranking, the metrics change.
Rejection rate: What percentage of retrieved candidates does the re-ranker discard? In high-precision systems, 40-50% rejection is common. But high rejection might also signal that your embedding model or chunking strategy is weak upstream.
Context sufficiency: Is the re-ranked context actually better for the LLM? This is harder to measure than it sounds. Faithfulness scores (does the LLM’s answer stay grounded in the context?) are useful but imperfect.
Token tax: Re-ranking increases inference costs. If you’re evaluating 10 candidates per query with a per-candidate LLM check, that’s 10 extra model calls before the final generation call. The cost-benefit depends on your use case.
Auditability: Can you explain why specific chunks were rejected? This matters under Article 13 of the EU AI Act (transparency requirements). If you’re logging retrieval decisions, you need to log rejection decisions too.
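Three of these metrics (rejection rate, token tax, auditability) fall out of the same data if every re-ranking decision is recorded. Below is a rough sketch of such a per-query log. The `RerankAudit` class and its field names are my own assumptions, not a standard schema or anything the EU AI Act prescribes, and `modelCalls()` assumes the per-candidate LLM check described earlier.

```java
import java.util.ArrayList;
import java.util.List;

final class RerankAudit {

    /** One logged decision; field names are illustrative, not a standard schema. */
    record Decision(String chunkId, double similarity, boolean kept, String reason) {}

    private final List<Decision> decisions = new ArrayList<>();

    /** Record both accepted and rejected chunks, so rejections can be explained later. */
    void log(String chunkId, double similarity, boolean kept, String reason) {
        decisions.add(new Decision(chunkId, similarity, kept, reason));
    }

    /** Share of evaluated candidates the re-ranker discarded (0.0 if none were evaluated). */
    double rejectionRate() {
        if (decisions.isEmpty()) {
            return 0.0;
        }
        long rejected = decisions.stream().filter(d -> !d.kept()).count();
        return (double) rejected / decisions.size();
    }

    /** Model calls for this query: one relevance check per candidate, plus one generation call. */
    int modelCalls() {
        return decisions.size() + 1;
    }

    /** Read-only snapshot for writing to whatever audit store is in use. */
    List<Decision> decisions() {
        return List.copyOf(decisions);
    }
}
```

Storing a short `reason` alongside each rejection is the part that matters for audits: a similarity score alone doesn't say why a chunk was dropped.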
I don’t have solid numbers yet for what “good” looks like across these metrics. I’m still learning.
What the Documentation Doesn’t Tell You
I’ve been reading through LangChain4j and re-ranking architecture docs, and there are some gaps that surprised me.
LLM-based re-ranking (using a general-purpose LLM to evaluate relevance) works inconsistently on long documents. If your chunks are too large, the re-ranker loses focus. There’s a sweet spot for chunk size that I’m still trying to find.
Cross-encoder re-rankers (which score the query and document together in a single pass) are slower than embedding-based similarity search, because nothing can be precomputed per document, but they are significantly more accurate. The latency cost can be prohibitive for latency-sensitive consumer applications.
Hybrid search (combining keyword search with semantic search, then re-ranking) seems to be the most robust pattern I’m seeing, but it also has the highest operational complexity.
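For the hybrid pattern, the keyword and semantic result lists have to be merged before re-ranking. The docs I've read don't settle on one merge method, so the sketch below uses reciprocal rank fusion (RRF) as an assumption: it is a common choice because it only needs ranks, not comparable scores. `HybridFusion` and `fuse` are hypothetical names, and document ids are plain strings.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class HybridFusion {

    /**
     * Reciprocal rank fusion: each ranking contributes 1 / (k + rank) per document,
     * with rank starting at 1. k = 60 is a commonly used default.
     * Ties are broken by document id so the output is deterministic.
     */
    static List<String> fuse(List<String> keywordRanking, List<String> vectorRanking, int k) {
        Map<String, Double> scores = new HashMap<>();
        for (List<String> ranking : List.of(keywordRanking, vectorRanking)) {
            for (int i = 0; i < ranking.size(); i++) {
                scores.merge(ranking.get(i), 1.0 / (k + i + 1), Double::sum);
            }
        }
        return scores.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<String, Double>comparingByKey()))
                .map(Map.Entry::getKey)
                .toList();
    }
}
```

A document that appears in both lists accumulates two contributions, so it tends to outrank one that is first in a single list only. The fused list would then go through the re-ranking gate like any other candidate set.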
I’m treating these as learnings, not established truth. Different systems and contexts will have different tradeoffs.
What I’m Still Trying to Figure Out
- When is re-ranking actually necessary versus optional? Where’s the line?
- What’s the right metric for measuring whether retrieval quality is “good enough”?
- How much does specialized vs. general-purpose re-ranking actually matter in practice?
- What chunk size minimizes silent failures without creating excessive fragmentation?
- How do you explain retrieval rejection decisions to non-technical users?
- In regulatory contexts, how detailed should rejection logging be?
These are the questions keeping me engaged with this problem.
Conclusion
Vector similarity is useful for broad retrieval, but it’s not sufficient for precise relevance in production systems, especially in contexts where confidence matters more than creativity.
Re-ranking adds friction and latency, but it also adds the ability to be explicit about what’s actually relevant. And in regulated contexts, that explicitness is often more valuable than speed.
I’m still learning what the right tradeoffs look like for different use cases. But I’m becoming more convinced that the pattern of broad-recall-then-precise-ranking is where reliability lives.
This is part of working through RAG architectures for educational AI. If you’ve built production RAG systems with re-ranking, I’d want to hear what patterns you’ve found.