Dense Retrievers Know More Than They Can Express

When we think of neural retrieval, the first assumption that comes to mind is that the role of a model is to represent information in a way that allows it to be retrieved by a related query. But this picture is overly simplistic: the role of a retrieval model is to represent information in a way that can be used by a scoring mechanism. All retrieval work, in essence, is constrained by the operator we choose to rank documents, and all model expressivity is bottlenecked by what this operator can capture.
This raises one important question: do retrieval models know more than they are able to express? In other words, despite scoring limitations, does the retrieval training process allow a model to learn richer representations than we assume, waiting to be extracted? The answer is yes, and even more than that: these representations are trivial to (at least partially) extract, and their distribution approaches that of natural language itself.
Retrieval: A Song of Representations and Scoring
Single-vector embedding models are not a good approach to retrieval, because their inherent limitations make them unsuitable for a lot of situations that are all too common in the real world.
This is a conclusion that is increasingly apparent in all domains, with multi-vector models vastly outperforming their single-vector equivalents with an order of magnitude fewer parameters in both agentic and multimodal settings, two areas which are rapidly growing in importance.
However, while it is tempting to write another article denouncing the evils of single-vector retrieval, it is much more interesting to think about why this is the case. On the surface, it might appear obvious: well, it's a single vector representing multi-faceted information, so the meaning is diluted!
This assumption is not incorrect: representing nuanced information in one vector is bound to create strong dilution. But it is incomplete, because what is diluted is not necessarily the representation itself: after all, LLMs are perfectly capable of creating complex task vectors encompassing nuances. What the single-vector setting limits, and what makes it so harmful to generalisation, is the expressiveness of scoring operators.
We can think about it this way: all (first stage) retrieval operations are effectively about enabling the use of an efficient scoring mechanism that can produce useful relevance rankings, to be further enhanced by a reranker or directly used by the final consumer, whether human or agentic. The role of the embedding model is not to perform this scoring by itself: it would be far too expensive to use a neural model, even if it had just a handful of parameters, to score millions of documents at query time.
Instead, as we discussed in our introduction, the role of the embedding model is to convert the information in the documents into a format that can be readily consumed by the efficient scoring mechanism we discussed above. Thus, training embedding models is a pure "representation learning" task, where we are provided with a constraint, the input format of the downstream scoring operator, and must learn representations that best enable it. In fact, the three great families of retrievers are shaped by their scoring operators: single-vector dense retrievers and their cosine similarity operator, sparse retrievers and various forms of weighted dot products, and multi-vector models and their MaxSim operator.
Late interaction models, such as Wholembedv3, are so powerful because of this operator: MaxSim allows for a level of fine-grained expressiveness in scoring that is simply not possible with single-vector cosine similarity. Late interaction is about preserving this expressivity by allowing information from both the documents and the query to interact as late in the scoring process as possible (hence the "late" interaction naming).
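To make the contrast between operators concrete, here is a minimal, dependency-free sketch of the three scoring families named above. It is an illustration under stated assumptions rather than any particular model's implementation: the function names, the toy 2-d vectors, and the use of mean pooling to build the single vector are all hypothetical choices made for the example.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Single-vector dense scoring: one vector per query and per document."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0

def sparse_dot(q_weights, d_weights):
    """Sparse scoring: weighted dot product over shared term ids (id -> weight dicts)."""
    return sum(w * d_weights[t] for t, w in q_weights.items() if t in d_weights)

def maxsim(query_vecs, doc_vecs):
    """Late interaction: each query token keeps only its best-matching document token."""
    return sum(max(cosine(q, d) for d in doc_vecs) for q in query_vecs)

def mean_pool(vecs):
    """Assumed pooling for the toy example: average the token vectors."""
    return [sum(col) / len(vecs) for col in zip(*vecs)]

# Toy example: the query has two unrelated facets, and the document
# contains an exact match for each, plus one unrelated token.
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_tokens = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

single_vector_score = cosine(mean_pool(query_tokens), mean_pool(doc_tokens))
late_interaction_score = maxsim(query_tokens, doc_tokens)
```

In this toy case MaxSim gives every query token a perfect match (a total of 2.0 for two tokens), while the pooled cosine drops to about 0.71 because the unrelated document token shifts the average. The two numbers are on different scales, so the point is only that the pooled score has lost information that the token-level score keeps.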
In fact, a few people in the ColBERT community sometimes rant that they don't like the term "multi-vector retrieval" very much. The reason for this is this simple fact: with our current understanding of retrieval, multi-vector approaches are currently required to enable MaxSim, which itself is the scoring operator that is currently required to enable late interaction retrieval. By no means is it the "perfect" operator, but it is a necessary evil to allow expressive scoring, until we develop better ways to do so.
And just like ColBERT is powerful because its scoring operator maximises expressiveness at the cost of additional engineering constraints, single-vector embeddings are brittle because their scoring operator favours engineering flexibility, with the opposite tradeoffs. But as we discussed: this is not an inherent feature of the models themselves, but of the way their final representations are cast to be made compatible with this operator.
As such, we find ourselves asking the same question that we did at the start of this article: do dense models contain more information than they are able to express, and, more importantly, can we extract this knowledge?
If you have read this far, you are probably assuming that we are asking this question because we found out that yes, you can. And even more than that: it's trivial to do so.
Extracting a model's vocabulary
To an extent, we know that it is very simple to train dense embedding models to become strong multi-vector retrievers, and this makes sense, as MaxSim scoring can be thought of as "cosine-similarity++" and would leverage a great deal of the same information.
But circling back to our previous point, this is an imperfect solution: yes, MaxSim and late interaction are very powerful. But they are also heavy, require solid engineering chops, and more importantly, while late interaction is technically a first-stage retrieval approach, it still needs its own "pre-first-stage" candidate generation stage in order for the MaxSim compute cost to be bearable.
But even more importantly in this context: this conversion is performed with retrieval-specific supervised training, making it less of an extraction and more of an adaptation: after all, you could make the case that ModernBERT itself contains retrieval information, because if you train it with retrieval supervision, it becomes a capable retriever. Not untrue, but missing the point.
No, instead, we want to see what information a model contains, without further training.
Sparse AutoEncoders
And to do so, we turn towards the field of interpretability. Specifically, we look into Sparse AutoEncoders, or SAEs. These models have become commonplace in studies trying to understand the "black box" of language models, rising to prominence through work spearheaded by Anthropic, and are now found all over both research and industry.
At their core, SAEs could not be simpler, as they're built on concepts that will be familiar to the vast majority of machine learning enthusiasts: they are shallow networks composed of a single encoder-decoder block. Training follows a simple reconstruction objective: the decoder has to reconstruct the original input after it has been projected through the encoder.
What makes SAEs interesting is the constraint that is added to the encoder: a sparsity penalty is added to ensure that each input is only able to activate a limited number of features in the encoder's latent space. The intermediate representation between the encoder and the decoder is effectively a large sparse vector, where the vast majority of dimensions have a value of 0.
The theory, which empirical results support to a good extent, is that doing so yields some sort of "latent vocabulary", in which we can analyse which tokens activate which features and better understand how information is represented within the large, dense, and otherwise uninterpretable internal activations of LLMs. This field of research is what led to the (in)famous Golden Gate Claude experiment, which we all dearly miss.
Being a retrieval-focused company, this made us think: if SAEs are capable of extracting a latent vocabulary that can be more-or-less clearly mapped to concepts, could this vocabulary also be useful for retrieval? And so, we designed a simple set of experiments: train SAEs on top of common retrievers, both internal and external, and explore the makeup of the resulting features.
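For readers who have not met SAEs before, the encoder, sparsity constraint, and decoder described in this section can be sketched in a few lines. This is a generic illustration, not the setup used in the experiments: it assumes a ReLU encoder with an L1 penalty on the latent (one common formulation; other variants, such as top-k, exist), and the dimensions, random weights, bias value, and penalty coefficient are all placeholders.

```python
import random

def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Return (sparse latent, reconstruction) for one input vector x."""
    pre = [h + b for h, b in zip(matvec(W_enc, x), b_enc)]
    latent = [max(0.0, h) for h in pre]  # ReLU: negative pre-activations become exactly 0
    recon = [r + b for r, b in zip(matvec(W_dec, latent), b_dec)]
    return latent, recon

def sae_loss(x, latent, recon, l1_coeff=1e-3):
    """Reconstruction error plus an L1 sparsity penalty on the latent."""
    mse = sum((a - b) ** 2 for a, b in zip(x, recon)) / len(x)
    l1 = sum(abs(z) for z in latent)
    return mse + l1_coeff * l1

# Placeholder sizes and untrained random weights, for illustration only.
rng = random.Random(0)
d_model, d_latent = 8, 32
W_enc = [[rng.gauss(0, 0.3) for _ in range(d_model)] for _ in range(d_latent)]
W_dec = [[rng.gauss(0, 0.3) for _ in range(d_latent)] for _ in range(d_model)]
b_enc = [-0.5] * d_latent  # a negative bias nudges features towards staying off
b_dec = [0.0] * d_model

x = [rng.gauss(0, 1) for _ in range(d_model)]
latent, recon = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, latent, recon)
active = [i for i, z in enumerate(latent) if z > 0]  # ids of the features that fired
```

In a trained SAE, the weights would be learned by minimising this loss over many activations, and the indices in `active` are what ends up playing the role of a vocabulary.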
The Latent Space is Zipfian
Before we say more about the nature of the representations we extracted, it's important to take a step back, and discuss Zipf's Law. Indeed, one of the core underlying aspects of most classical lexical approaches to NLP is the fact that natural language tends to follow a Zipfian distribution.
A picture is worth a thousand words, so let's first look at what a Zipfian looks like before we delve (ChatGPT did not write this, but we will not let it claim a good word) further:
Zipf's law is a simple empirical observation: when you gather a set of observations and sort them in decreasing order of frequency, the distribution is often such that the value of the nth entry is inversely proportional to n. In practice, what this means is that the second most common element will occur about half as often as the most common one, and the third most common about a third as often.
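As a small worked example, the sketch below computes the counts an ideal Zipf curve with exponent 1 would predict, and estimates the exponent of any list of counts with a least-squares fit in log-log space. The helper names and the top count of 12,000 are made up for illustration, and real data will only follow the curve approximately.

```python
import math

def zipf_expected(top_count, rank, exponent=1.0):
    """Expected count at a 1-based rank under an ideal Zipf curve."""
    return top_count / rank ** exponent

def fit_zipf_exponent(counts):
    """Sign-flipped least-squares slope of log(count) against log(rank).

    Needs at least two positive counts.
    """
    ranked = sorted((c for c in counts if c > 0), reverse=True)
    xs = [math.log(r) for r in range(1, len(ranked) + 1)]
    ys = [math.log(c) for c in ranked]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return -cov / var

# Ranks 1 to 5 for a most-frequent item seen 12,000 times:
# 12000, 6000, 4000, 3000, 2400
ideal = [zipf_expected(12000, r) for r in range(1, 6)]
exponent = fit_zipf_exponent(ideal)  # close to 1.0 for an ideal curve
```

A fitted exponent close to 1 on a straight log-log line is the usual informal sign that a distribution is roughly Zipfian.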
Most human languages naturally tend towards a quasi-Zipfian distribution: while they don't follow a perfect Zipf's curve, they all roughly espouse its shape. And as we know by now, humans are very good at optimisation: if a distribution is known and has well-defined properties, we will figure out how to take advantage of this.
And we did: for a long time, tools were specifically designed around this. The most famous such approach in retrieval is BM25, designed around TF-IDF features (Term Frequency - Inverse Document Frequency, essentially a way to give more weight to more discriminative terms) with a few additional tweaks and parameters to tune. In fact, BM25 is a fantastic example of our drive to optimise, in two ways: first, despite having been introduced in 1995, it remains today the Pareto-optimal way to do retrieval with lexical features. Second, the name itself is a throwback to the necessity of iterations to optimise: BM25 stands for Best Match 25, and the 25 simply refers to the fact that it was the 25th method the team had designed.
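For reference, a compact version of BM25 scoring looks roughly like the sketch below. It assumes the widely used Okapi formulation with a Lucene-style non-negative IDF and the common defaults `k1=1.2` and `b=0.75`; implementations differ in these details, and the function name and toy documents are only illustrative.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score every document (a list of terms) against the query terms."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Lucene-style IDF: rarer terms weigh more, and the value is never negative.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Saturating, length-normalised term frequency.
            norm_tf = tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(d) / avg_len))
            score += idf * norm_tf
        scores.append(score)
    return scores

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "zipf studied the frequency of words".split(),
]
scores = bm25_scores(["zipf", "the"], docs)
best = max(range(len(docs)), key=scores.__getitem__)  # index of the top-ranked document
```

The rare term "zipf" dominates the ranking, while the ubiquitous "the" contributes very little: this is the discriminative weighting that relies on a skewed, Zipf-like term distribution.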
Mainstream neural sparse methods, such as SPLADE, have largely done away with these assumptions: their training methods lead to a smoother curve, with fewer saturated features. But as it turns out, Latent Terms, the activated features we extracted through an SAE applied over retrievers, are distributed in a Zipfian way over large corpora.
In practice, this means that the vocabulary we extract via SAEs, which we call Latent Terms, follows a distribution that is broadly similar to that of human language. This is unlike current sparse retrieval methods, which are explicitly trained with sparsity-inducing methods that result in the absence of the characteristic saturated top of the curve that is present in natural words. And as you've probably guessed, having a discrete vocabulary with a distribution similar to that of lexical terms means that this vocabulary is readily usable with methods designed for lexical terms.
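One way to picture "readily usable with methods designed for lexical terms" is to treat each active SAE feature as a pseudo-word. The sketch below is a guess at the general shape of such a conversion, not the recipe used in this work: the per-token dictionaries, the `lt_` prefix, the activation threshold, and the choice to count one occurrence per activating token are all assumptions made for illustration.

```python
from collections import Counter

def latent_term_counts(token_activations, threshold=0.0):
    """Count, per latent feature, how many tokens activate it above the threshold."""
    counts = Counter()
    for activation in token_activations:  # one sparse dict per token
        for feature_id, value in activation.items():
            if value > threshold:
                counts[f"lt_{feature_id}"] += 1
    return counts

# Hypothetical SAE outputs for a three-token document: feature id -> activation.
doc_activations = [
    {17: 0.9, 4021: 0.2},
    {17: 0.4},
    {88: 1.3, 4021: 0.0},
]
pseudo_doc = latent_term_counts(doc_activations)
# Flatten into a token list that any lexical tool expecting words can consume.
pseudo_tokens = sorted(pseudo_doc.elements())
```

Once documents and queries are expressed as such pseudo-token lists, they can be indexed and scored by any off-the-shelf lexical engine exactly as if they were words.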
What does that vocabulary look like?
But before we get into its suitability for retrieval methods like BM25, let us first take a look at the vocabulary itself: what exactly does it capture? What do its features look like?
Thankfully, the vocabulary is rather small, with most of our experiments targeting a vocabulary size in the 65536 range. This enables qualitative analysis, which is then easy to scale up with the use of LLMs.
What this activation analysis revealed is that three broad, unevenly-distributed categories are present in the identified features: Lexical Features, which fire on a single term, Narrow Semantic Features, which capture multiple ways of referring to the same concept, and Broad Topical Features, activated by a wide range of terms around a similar topic.
The distribution between these three types differs somewhat between models, but the general trend holds: roughly 10% of the features are narrow semantic, around a third are purely lexical, and the rest, comprising over half the features, are broad topical ones.
Not only does it overcome failure cases, it makes for very strong retrievers
While the shape of the features is encouraging, it's time to put them to the real test: do Latent Terms simply have a Zipfian distribution, but ultimately do not carry meaningful enough signal, or are they effective retrieval features?
A (truncated) table is worth a thousand words:
| Method | SciFact | NFCorpus | FiQA | TREC-Covid | DBPedia | NQ | HotpotQA | FEVER |
|---|---|---|---|---|---|---|---|---|
| Lexical BM25 | 0.686 | 0.319 | 0.249 | 0.680 | 0.300 | 0.285 | 0.569 | 0.481 |
| SPLADE-v3 | 0.710 | 0.357 | 0.374 | 0.748 | 0.450 | 0.586 | 0.692 | 0.796 |
| Contriever | 0.655 | 0.313 | 0.274 | 0.448 | 0.377 | 0.419 | 0.542 | 0.581 |
| Nomic | 0.703 | 0.346 | 0.377 | 0.822 | 0.431 | 0.598 | 0.672 | 0.813 |
| GTE-ModernColBERT | 0.756 | 0.381 | 0.456 | 0.849 | 0.475 | 0.617 | 0.773 | 0.875 |
| Latent Terms + Contriever | 0.713 | 0.340 | 0.317 | 0.709 | 0.409 | 0.468 | 0.627 | 0.751 |
| Latent Terms + Nomic | 0.749 | 0.372 | 0.382 | 0.783 | 0.436 | 0.577 | 0.732 | 0.885 |
| Latent Terms + GTE-ModernColBERT | 0.730 | 0.374 | 0.399 | 0.759 | 0.387 | 0.509 | 0.653 | 0.814 |
In summary: not only are Latent Terms extracted with SAEs compatible with BM25, but they are able to achieve very competitive retrieval results, matching or outperforming their single-vector backbone on most datasets, whether the backbone is an older, weaker one (Contriever) or a more modern model (in this case, nomic-embed-text-v1.5). The overall performance does appear pretty strongly correlated with retrieval training, as the better text embedding model also produces the superior Latent Terms results.
Perhaps even more interestingly, the approach is competitive with SPLADE models from a similar era: SAE+BM25 over Nomic outperforms SPLADE-v3 on six of the eight datasets shown, despite the latter's heavy use of knowledge distillation from much more powerful models.
The final insight immediately apparent from the table is, again, that scoring operators matter immensely: while single-vector models lose to their Latent Terms cousins, GTE-ModernColBERT, a late interaction model using the powerful MaxSim operator, comfortably outperforms its Latent Terms counterpart, even though the latter remains strong.
But there is another benchmark we are interested in here: LIMIT. LIMIT, which we talked about previously, is a toy task: queries are very straightforward, and documents are essentially a long list of a person's attributes. It is purposefully designed to be trivial for approaches capturing fine-grained information in their scoring: "normal" BM25 reaches a Recall@20 of roughly 0.95, and GTE-ModernColBERT roughly 0.86. However, its very simple formulation is antagonistic to the limitations of single-vector scoring, and even large, 8-billion-parameter single-vector models fail to reach double-digit recall numbers. As such, the question is pretty clear: can Latent Terms, despite being built upon the same single-vector model, recover LIMIT performance far beyond the scoring limitations of the single-vector setting?
| Method | Recall (R)@10 | R@20 | R@100 | R@1000 |
|---|---|---|---|---|
| Lexical BM25 | 0.9440 | 0.9490 | 0.9645 | 0.9945 |
| SPLADE-v3 | 0.5760 | 0.6650 | 0.8095 | 0.9440 |
| GTE-ModernColBERT | 0.8430 | 0.8565 | 0.8720 | 0.8795 |
| Contriever | 0.0210 | 0.0265 | 0.0530 | 0.1250 |
| Latent Terms + GTE-ModernColBERT | 0.7985 | 0.8315 | 0.8915 | 0.9775 |
| Latent Terms + Contriever | 0.4140 | 0.5100 | 0.7295 | 0.9290 |
The answer is yes, although it is not perfect: while Contriever as a single-vector model reaches a Recall@100 of just 0.0530, its Latent Terms variant hits 0.7295. This confirms that even though it has been trained only in a single-vector setting, the model itself has learned information that allows it to avoid collapse on LIMIT, beyond what can be expressed in this same setting. As this avenue of research is still very young and primitive, this suggests that our current training methods teach models significant, meaningful signal that is just waiting for us to develop better ways to extract it.
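For readers unfamiliar with the metric, Recall@k as reported in the LIMIT table is conventionally the fraction of a query's relevant documents found in the top k results, averaged over queries. The sketch below shows that conventional definition; the evaluation harness actually used is not described here, so the function names, the per-query averaging, and the toy run are assumptions.

```python
def recall_at_k(ranked_doc_ids, relevant_doc_ids, k):
    """Fraction of the relevant documents that appear in the top-k results.

    Assumes the ranking contains no duplicate ids.
    """
    relevant = set(relevant_doc_ids)
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked_doc_ids[:k] if doc_id in relevant)
    return hits / len(relevant)

def mean_recall_at_k(runs, qrels, k):
    """Average recall@k over every query that has relevance judgements."""
    total = sum(recall_at_k(runs.get(q, []), rel, k) for q, rel in qrels.items())
    return total / len(qrels)

# Toy run: ranked results per query, and the set of relevant documents per query.
runs = {"q1": ["d3", "d1", "d9", "d2"], "q2": ["d5", "d7", "d4", "d6"]}
qrels = {"q1": {"d1", "d2"}, "q2": {"d4", "d8"}}

r_at_2 = mean_recall_at_k(runs, qrels, k=2)  # q1 finds 1 of 2, q2 finds 0 of 2
r_at_4 = mean_recall_at_k(runs, qrels, k=4)  # q1 finds 2 of 2, q2 finds 1 of 2
```

Under this definition recall can only grow with k, which is why the gap between methods narrows at R@1000 even when it is very large at R@10.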
Is this inherent to retrievers or do all language models contain a sparse vocab eagerly waiting to meet BM25?
But first, we need to think about where this information comes from. Earlier, we talked about the core goal of sparse auto-encoders in existing research: given a set of neural activations from a complex, highly uninterpretable language model, cast them into a sparse set of activations that can be studied to understand how these activations relate to language, thus better understanding models.
Then, we demonstrated that in the case of retrievers, these sparse activations approximate a natural language distribution, and that we can identify three "families" of features capturing different levels of lexical and semantic information. The combination of these factors enables methods designed for natural language, in this case BM25, to work extremely well and results in strong retrieval performance.
With these two factors in mind, there's one question that naturally follows: does this retrieval-friendly Latent Terms structure naturally emerge in the representations of encoder models, or are these extractable, meaningful features learned as a byproduct of retrieval-focused contrastive training?
| Method | SciFact | NFCorpus | FiQA | TREC-Covid | DBPedia | NQ | HotpotQA |
|---|---|---|---|---|---|---|---|
| Latent Terms + BERT | 0.585 | 0.216 | 0.131 | 0.212 | 0.134 | 0.165 | 0.345 |
| Latent Terms + Contriever | 0.713 | 0.340 | 0.317 | 0.709 | 0.409 | 0.468 | 0.627 |
The answer, quite clearly, is the latter: features extracted by SAEs over pre-trained language models do not contain the structure that enables Latent Terms to act as strong retrieval discriminators. This confirms one of our intuitions: the SAE process is not, by itself, creating structure or information that is useful for retrieval. However, it is a straightforward way to expose information that the model learns about what makes a given term (or token, in our case) in a document impact its relevance to a query, or vice-versa, in ways that the model can fail to express in a single-vector representation. This finding is also supported by the clear jump in quality between a weaker backbone (Contriever) and a stronger one (Nomic).
This opens up quite a lot of interesting questions: is this the best way to extract this information? Should we be figuring out training methods that are not so post-hoc, so that this information is extracted in an ever better way? Is the future of retrieval sparser than we have been led to believe?
Where can I learn more, and What's Next?
If you want to dig more into the scientific aspect of this approach, our new preprint is now up on arXiv.
As for what's next, this paper is the first of a series of findings that we intend to publish about this line of work. Characteristically, we expect these to be sparse, but you should stay tuned to find out more about what the vector space of retrievers is hiding. If this kind of research resonates with you, you should definitely get in touch and share your excitement.
Citation
If you'd like to formally reference this work, please cite the associated paper:
@misc{latentterms,
title={Latent Terms: Dense Retrievers Contain Trivially Extractable BM25-ready Zipfian Vocabularies},
author={Benjamin Clavié and Sean Lee and Aamir Shakir and Makoto P. Kato},
year={2026},
eprint={2605.29384},
archivePrefix={arXiv},
primaryClass={cs.IR},
url={https://arxiv.org/abs/2605.29384},
}