The Embedding Gap. Why Your Vector Database Fails Before You Query It

Why the Wrong Embedding Model is an Irreversible Structural Flaw in Your RAG Pipeline.

Your RAG system looks good on the dashboard: queries return results, response times are within the SLA (Service Level Agreement), and the developers are happy. Then, in a business meeting, someone discovers that the AI isn’t finding the right documents, only the closest ones. And that’s not the same thing.

The flaw that many teams never diagnose occurs before a single query even accesses the vector database. It happens the moment you choose an embedding model without understanding what that choice actually encodes into your data. And once that data is indexed, fixing the error doesn’t mean tweaking a parameter, but re-embedding and reindexing everything. From scratch.

Below, we’ll see why embedding models aren’t interchangeable, and why the same document produces radically different vectors depending on the model you use.

Why Most Teams Don’t Understand That Embedding != Embedding

We might think that any embedding model converts text into “meaning,” and that this meaning is practically equivalent across models. It isn’t.

Each embedding model is trained on a specific corpus, with a specific objective function, optimized for a specific distribution of tasks. What all-MiniLM-L6-v2 encodes as “similar” is not the same as what nomic-embed-text encodes as “similar,” and neither matches what BGE-M3 considers proximate. The math is the same, but the semantic geometry is completely different. Two models can encode the same paragraph about GDPR compliance and place its vector in entirely different neighborhoods of their respective latent spaces.

This is important because vector search doesn’t retrieve documents; it retrieves neighbors in an embedding space. If that space was built with the wrong model for your domain, your retrieval pipeline is navigating the wrong map and will return results that look plausible but are factually wrong. The downstream large language model (LLM) then grounds its answer in that erroneous context, and you end up with a system that looks correct but is systematically incorrect.

The same three documents, embedded with three different models, occupy completely different neighborhoods in each model's vector space. A query placed at point Q retrieves document A from Model 1, document C from Model 2, and document B from Model 3, even though the query is identical.
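The situation in the figure can be reproduced with a toy sketch. The 2-D coordinates below are invented purely for illustration (real embeddings have hundreds of dimensions), and `model_1`, `model_2`, and `model_3` are hypothetical names, not real models. The point is only that the same nearest-neighbor math returns a different document in each space.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def nearest(space: dict[str, list[float]]) -> str:
    """Return the document closest to the query "Q" in one model's space."""
    query = space["Q"]
    docs = {name: vec for name, vec in space.items() if name != "Q"}
    return max(docs, key=lambda name: cosine(query, docs[name]))


# Invented coordinates: each dict stands for one model's embedding space.
spaces = {
    "model_1": {"Q": [0.9, 0.1], "A": [1.0, 0.0], "B": [0.0, 1.0], "C": [-1.0, 0.0]},
    "model_2": {"Q": [0.9, 0.1], "A": [0.0, 1.0], "B": [-1.0, 0.0], "C": [1.0, 0.0]},
    "model_3": {"Q": [0.9, 0.1], "A": [-1.0, 0.0], "B": [1.0, 0.0], "C": [0.0, 1.0]},
}

winners = {model: nearest(space) for model, space in spaces.items()}
print(winners)
```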

The practical consequence is that you cannot mix embedding models. If you index your documents with nomic-embed-text through Ollama and then decide to switch to BGE-M3 for better multilingual support, all the vectors in your Weaviate collection become useless. This isn’t about migrating schemas; it’s about rebuilding a complete semantic index from scratch. And if you have 10 terabytes of enterprise documentation, that’s not something you can fix on a Tuesday afternoon.
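One way to keep this mistake from happening silently is to store the model identity next to the vectors and refuse anything that comes from a different model. The sketch below is a minimal in-memory illustration of that idea; `GuardedIndex` is a hypothetical class, not part of Weaviate, Ollama, or any other library, and a real system would attach the same metadata to its collection.

```python
from dataclasses import dataclass, field


@dataclass
class GuardedIndex:
    """Toy in-memory index that remembers which model produced its vectors."""

    model_name: str
    dimension: int
    vectors: dict[str, list[float]] = field(default_factory=dict)

    def check(self, vector: list[float], model_name: str) -> None:
        # Matching dimensions are not enough: the model identity must match too.
        if model_name != self.model_name:
            raise ValueError(
                f"index built with {self.model_name!r}, got a vector from {model_name!r}"
            )
        if len(vector) != self.dimension:
            raise ValueError(
                f"expected {self.dimension} dimensions, got {len(vector)}"
            )

    def add(self, doc_id: str, vector: list[float], model_name: str) -> None:
        self.check(vector, model_name)
        self.vectors[doc_id] = vector


index = GuardedIndex(model_name="nomic-embed-text", dimension=768)
index.add("doc-1", [0.0] * 768, model_name="nomic-embed-text")

try:
    # A 1024-dimensional vector from another model is rejected, not stored.
    index.add("doc-2", [0.0] * 1024, model_name="bge-m3")
except ValueError as err:
    print(err)
```

The same `check` call can be run on every query vector before searching, so that a model swap fails loudly at the first request instead of quietly degrading results.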

Same document, different universes: How model training shapes semantic space

To understand why this happens, it’s helpful to precisely define what an embedding model does.

For example, when all-MiniLM-L6-v2 processes the sentence “The board approved the risk assessment protocol,” it produces a 384-dimensional vector. However, when we use nomic-embed-text to process the same sentence, we get a 768-dimensional vector, and when we use BGE-M3, we get a 1024-dimensional vector. But dimensionality isn’t the main issue: models with the same dimension can produce very different vectors for identical input because each model learned a distinct internal representation during training.

  • all-MiniLM-L6-v2 was trained primarily on pairs of sentences commonly found on the web. It excels at capturing general semantic similarity but has a known weakness with domain-specific terminology, particularly legal, financial, and regulatory language.

  • nomic-embed-text was designed with transparency and long-context performance in mind. Versions v1 and v1.5 support up to 8192 tokens using rotary positional embeddings (RoPE), which makes them significantly better suited for long documents than MiniLM, whose architecture caps input at 512 tokens (and whose default configuration truncates even earlier, at 256 word pieces).

  • BGE-M3 goes even further. It supports dense, sparse, and multi-vector retrieval within a single model, covers over 100 languages, and accepts input sequences of up to 8192 tokens. This flexibility comes at a higher cost: about 568 million parameters compared to roughly 22 million for all-MiniLM-L6-v2.

None of these models is inherently bad. Each is optimized for a different constraint profile.

Evaluating the Embedding Layer: Metrics That Really Matter

For an AI system to be reliable, it’s not enough to evaluate how it generates text; you first have to evaluate how it finds the information. If the database is flawed, the final result will be too.

The Database: Recall@K and the “Funnel” Problem

Recall@K measures what percentage of the relevant information we manage to retrieve from the database within the top K results. If Recall@5 is low (for example, 0.6), 40% of the necessary data never reaches the model. A reranker cannot solve this: it only reorders what has already been found, so if the initial search missed a document, there is nothing to sort.
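As a rough sketch of that argument, the snippet below computes Recall@5 for one query and then simulates an ideal reranker. The document IDs are invented, and the "reranker" is just a sort that pushes known-relevant items to the front, which is the best any reranker could do with these candidates.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


# Invented example: five relevant documents, only three were retrieved.
retrieved = ["d7", "d2", "d9", "d4", "d1"]
relevant = {"d1", "d2", "d4", "d5", "d6"}

before = recall_at_k(retrieved, relevant, k=5)

# An ideal reranker can only permute the candidates it was handed.
reranked = sorted(retrieved, key=lambda doc_id: doc_id not in relevant)
after = recall_at_k(reranked, relevant, k=5)

print(before, after)  # both 0.6: d5 and d6 were never candidates
```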

Relevance: MAP (Mean Average Precision)

It’s not just about finding the information, but where it appears. MAP measures the quality of the ordering. So, if the LLM only reads the first 3 text fragments, the most important information needs to be at the top.

A system that places relevant information in the fourth position is less effective than one that places it in the first, even if both have “found” the document.
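The difference between first and fourth position can be made concrete with a minimal sketch of average precision, the per-query quantity that MAP averages. The document IDs are invented for illustration.

```python
def average_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Mean of precision@i over every rank i that holds a relevant document."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs: list[tuple[list[str], set[str]]]) -> float:
    """MAP over several (retrieved, relevant) query runs."""
    return sum(average_precision(ret, rel) for ret, rel in runs) / len(runs)


# Both rankings "found" document a, but at different positions.
first = average_precision(["a", "x", "y", "z"], {"a"})
fourth = average_precision(["x", "y", "z", "a"], {"a"})
print(first, fourth)  # 1.0 0.25
```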

The “Acid Test”: Lower Similarity Limit

To determine if an embedding model is of high quality, we can perform this test:

  • Take documents you know are irrelevant to a query and measure their cosine similarity to it.

  • If the average score exceeds 0.65, the model has a problem: it cannot distinguish signal from noise.

In this case, the best solution is to change or retrain the embedding model, since the problem is structural and cannot be solved with prompt engineering or larger context windows.
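The test above can be sketched in a few lines. The vectors here are tiny invented placeholders; in practice they would come from your embedding model, and `NOISE_THRESHOLD` simply encodes the 0.65 rule of thumb used in this article.

```python
import math

NOISE_THRESHOLD = 0.65  # rule of thumb from this article, not a universal constant


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def noise_floor(query_vec: list[float], irrelevant_vecs: list[list[float]]) -> float:
    """Average cosine similarity between a query and known-irrelevant documents."""
    scores = [cosine(query_vec, vec) for vec in irrelevant_vecs]
    return sum(scores) / len(scores)


# Invented 2-D vectors standing in for real embeddings.
query = [1.0, 0.0]
irrelevant = [[0.0, 1.0], [1.0, 1.0]]

floor = noise_floor(query, irrelevant)
print(f"noise floor: {floor:.3f}, suspicious: {floor > NOISE_THRESHOLD}")
```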

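For systems that fall under the high-risk category of the EU AI Act, this evaluation is also a compliance matter: the Act’s transparency provisions (Article 13), together with its documentation and record-keeping requirements, call for a traceable record. At a minimum, that record should explain: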

  • The exact version of the model used.

  • How the input data was transformed.

  • The results of the quality assurance tests before the system launch.
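A record like this can be as simple as one JSON line written at indexing time. The sketch below is one possible shape, not a format prescribed by the regulation; `EmbeddingRecord`, its field names, and every value shown are hypothetical placeholders.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EmbeddingRecord:
    """One audit entry describing how an index was built (hypothetical schema)."""

    model_name: str
    model_version: str
    preprocessing: list[str]  # ordered transformation steps applied to the input
    recall_at_5: float        # pre-launch quality test result
    noise_floor: float        # average cosine score of irrelevant documents


# Placeholder values for illustration only.
record = EmbeddingRecord(
    model_name="bge-m3",
    model_version="example-version-tag",
    preprocessing=["strip_html", "split_into_chunks"],
    recall_at_5=0.0,
    noise_floor=0.0,
)

# sort_keys keeps the serialized form stable, which makes diffs between runs readable.
line = json.dumps(asdict(record), sort_keys=True)
print(line)
```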

Frequently Asked Questions

Multiple models at once?

It’s not recommended. Vectors produced by different models are not comparable, so mixing models creates isolated silos. For consistent search across your entire system, use a single model for the entire corpus. If you do use multiple models, you have to manage separate indexes and complex routing.

When to update the model?

The industry evolves rapidly (especially after the 2026 advancements). Ideally, test new models quarterly with your quality assurance tests, but don’t reindex everything until you see a real improvement that justifies the effort.

Ollama (on-premises) or API (cloud)?

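For the same model weights and numeric precision, the quality of the vector is the same (if the locally served build is quantized, vectors can differ slightly). The difference is strategic: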

  • On-premises (Ollama): More privacy (key for the EU AI Act), no token costs, and lower network latency.

  • API: Less technical complexity and no need for proprietary hardware.

The Myth of the 0.85 Similarity Score

A high score isn’t always good. If everything scores high, your model can’t distinguish between topics. What matters isn’t the absolute number, but the distance between a relevant and an irrelevant result.
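A small sketch makes the point. The scores below are invented: the first hypothetical model scores everything high, the second scores lower overall but separates relevant from irrelevant results much more clearly.

```python
from statistics import mean


def separation_margin(relevant_scores: list[float], irrelevant_scores: list[float]) -> float:
    """Gap between the average relevant score and the average irrelevant score."""
    return mean(relevant_scores) - mean(irrelevant_scores)


# Invented cosine scores for two hypothetical models on the same queries.
model_a = separation_margin([0.90, 0.88], [0.86, 0.84])  # high scores, tiny gap
model_b = separation_margin([0.62, 0.58], [0.22, 0.18])  # lower scores, wide gap

print(f"model A margin: {model_a:.2f}")
print(f"model B margin: {model_b:.2f}")
```

Under these assumed numbers, the model with the lower absolute scores is the more useful one for retrieval, because a threshold or a top-K cut can actually tell the two groups apart.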

Minimum Audit for Your System

If you want to know if your system is working today, do this:

  • Choose 20-30 real queries and mark their correct answers.

  • Measure the Recall@5: Do those answers appear among the top 5 results? (Minimum acceptable: 70%).

  • Check the similarity of irrelevant information: if documents you know are irrelevant average a cosine score above 0.65 against your queries, your underlying embedding model is flawed.
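The checklist above can be wired into a single function. This is a minimal sketch that assumes you have already run each labeled query through your own retrieval pipeline and collected the results; the `audit` function, the dictionary keys, and the sample data are all hypothetical, while the 0.70 and 0.65 thresholds are the ones given in the list.

```python
def audit(results: list[dict], k: int = 5,
          min_recall: float = 0.70, max_noise: float = 0.65) -> dict:
    """Average Recall@k and noise floor over labeled queries, plus a pass flag.

    Each result needs: "retrieved" (ranked doc IDs), "relevant" (set of correct
    doc IDs), and "irrelevant_scores" (cosine scores of known-irrelevant docs).
    """
    recalls = []
    noise_scores = []
    for result in results:
        top_k = set(result["retrieved"][:k])
        recalls.append(len(top_k & result["relevant"]) / len(result["relevant"]))
        noise_scores.extend(result["irrelevant_scores"])
    recall = sum(recalls) / len(recalls)
    noise = sum(noise_scores) / len(noise_scores)
    return {
        "recall_at_k": recall,
        "noise_floor": noise,
        "passed": recall >= min_recall and noise <= max_noise,
    }


# Two invented queries; a real audit would use the 20-30 labeled ones.
sample = [
    {"retrieved": ["a", "b", "c", "d", "e"], "relevant": {"a"},
     "irrelevant_scores": [0.30, 0.40]},
    {"retrieved": ["f", "g", "h", "i", "j"], "relevant": {"z"},
     "irrelevant_scores": [0.50, 0.40]},
]

report = audit(sample)
print(report)
```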

Conclusion

The most serious failure in a RAG system is not an obvious technical error, but retrieval that returns wrong data that looks right, sometimes for months. This destroys trust and leads to decisions based on false information.

Indexing is an architectural decision: choosing the embedding model is not a simple configuration adjustment; it is the foundation of the entire system. If you choose the wrong model from the start, any downstream component (such as the LLM or a reranker) will only be treating symptoms, not the underlying problem.

The failure occurs at the source: a vector database doesn’t fail when searching; it fails the moment the data is stored using an inappropriate model.

Author: Raúl Ferrer
Published at: 2026-02-28
License: CC BY-NC-SA 4.0

Raúl Ferrer
Software Architect & Tech Lead. Applying software and systems engineering principles in production to build reliable, observable, and maintainable AI. Author of iOS Architecture Patterns (Apress).
