October 16, 2025
Fantastic (small) Retrievers and How to Train Them: mxbai-edge-colbert-v0

TLDR:
We introduce the mxbai-edge-colbert-v0 model family, at two model sizes: 17M and 32M. Both models are very strong baselines for future research, with the 17M variant outperforming ColBERTv2 and standing in a league of its own for small-model retrieval.
Introduction
This summer, we set out to prepare the next steps of an ambitious research roadmap, with the ultimate aim of designing ever-improving approaches to late interaction and multi-vector retrieval. As we started planning experimental work more thoroughly, we were faced with a problem when thinking about one important question: what model do we tinker with for early experiments?
Indeed, to try out new things, you need a strong baseline with a well-understood training process, so that the results of your experiments are meaningful. Ideally, this baseline would be both strong and small: scaling laws exist in information retrieval as they do in the rest of machine learning, and a capable small model lets you test ideas in record time before scaling them up with relatively little effort.
With this in mind, we still did not know what this tiny experimental testbed should be: not because there are no good ColBERT models out there, but because none of them quite met our needs!
ColBERTv2 is an excellent baseline, but it dates back to 2021, which, in AI terms, makes it ancient. GTE-ModernColBERT, the current state of the art, is a fantastic model, but it suffers from two problems: it is larger than we would like, and it is initialized from a pre-trained dense embedding model that is hard to reproduce, limiting our control over experiments. answerai-colbert-small-v1, while an extremely strong model at a great compact size, is also initialized from a pre-trained checkpoint, which is itself the average of not one but two hard-to-reproduce embedding models. Additionally, it has a MiniLM backbone, which means that, much like ColBERTv2, it suffers from the limitations of previous-generation encoders, such as the lack of long-context support.
Just a few weeks before we started pondering this, the Ettin collection of models had come out: among other things, it includes a replication of ModernBERT (with some tweaks) across a large range of model scales, from 17 million to 1 billion parameters. The two smallest sizes of Ettin, with 17 and 32 million parameters respectively, seemed like perfect matches: we quickly made the decision to train tiny, capable models that could support all of our future experiments. We also immediately decided that we would release these models publicly, as we believe open source releases to be the perfect home for models that can run on just about any hardware.
If you want the full details on our training process, please head over to the tech report. We have attempted to make it a true overview of "how to train sane, near state-of-the-art retrievers in 2025", and we highly encourage you to read it if this is of interest to you.
If you just want an overview of what we did and the HuggingFace links, however, you're in the right place.
Training Small ColBERTs: The Steps
Previous research on ColBERT models has indicated a pretty clear trend: all state-of-the-art ColBERTs are initialized from strong single-vector embedding models, which have undergone their own somewhat standardized multi-stage training process. This is likely due to a combination of reasons, ranging from MaxSim's learning constraints favouring already-strong representations over unaligned ones to the lack of a standardized ColBERT pre-training recipe, among other potential culprits.
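For readers less familiar with late interaction: a ColBERT-style model embeds every token of the query and the document, scores each query token by its best-matching document token, and sums those maxima. A minimal PyTorch sketch of this MaxSim operator (purely illustrative, not our training code):

```python
import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """Late-interaction (MaxSim) relevance score.

    query_emb: (num_query_tokens, dim) and doc_emb: (num_doc_tokens, dim),
    both assumed L2-normalised so dot products are cosine similarities.
    """
    # Token-to-token similarity matrix: (num_query_tokens, num_doc_tokens).
    sim = query_emb @ doc_emb.T
    # Each query token keeps only its best-matching document token,
    # and the per-token maxima are summed into a single relevance score.
    return sim.max(dim=1).values.sum()

# Example with random (normalised) embeddings at projection dimension 48.
q = torch.nn.functional.normalize(torch.randn(32, 48), dim=-1)
d = torch.nn.functional.normalize(torch.randn(300, 48), dim=-1)
print(maxsim_score(q, d))
```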
Training Dense Backbones
As such, before training our ColBERT models, we must first prepare suitable dense models to serve as backbones. Since the purpose is to build all-around well-performing baselines rather than to chase benchmarks, we opt for standardized methods and widely used datasets with limited overfitting potential.
The dense training process we opt for consists of three steps:
- First, we perform large-scale weakly supervised contrastive pre-training on around 150 million training pairs. This step warms up the model's representations, shifting them from Ettin's original language modelling objective to a similarity objective. The data is not of high quality, but it is large in volume, slowly nudging the embedding space in the right direction.
- Second, we perform supervised fine-tuning. This is the key step, where the model from the first stage is exposed to retrieval queries and their matching documents, with positives annotated by humans. Following standard practice, we perform hard-negative mining, so as to provide the model with believable-looking negative examples and teach it to distinguish near-matches from actually relevant documents.
- Third, a step that is less standard (for now): Stella-style knowledge distillation. This step is the key component of the Stella retrieval models, which are well known in the information retrieval community as very strong models for their size. Effectively, the aim here is to align the representations of our model with those of a much larger, better model. Curiosity got the best of us here: we are really big fans of the Stella models and wanted to explore this approach to distillation in depth.
Again, we provide more information on this step in the tech report, but broadly we adopted a simplified version of the Stella mixture of losses, inspired by MongoDB's recent report on LEAF-style distillation. We note that this step strongly improved the performance of our 32 million parameter model variant and resulted in a small-but-noticeable boost for the 17 million one.
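Our exact mixture of losses is detailed in the tech report; purely to give a feel for what "aligning representations with a teacher" can look like, here is a deliberately simplified sketch. The specific combination below (a cosine alignment term plus an in-batch similarity-structure term) is an illustration, not our recipe:

```python
import torch
import torch.nn.functional as F

def alignment_loss(student_emb: torch.Tensor,
                   teacher_emb: torch.Tensor,
                   proj: torch.nn.Linear) -> torch.Tensor:
    """Toy embedding-alignment objective in the spirit of Stella/LEAF distillation.

    student_emb: (batch, d_student); teacher_emb: (batch, d_teacher) from a
    frozen, much larger embedding model; proj maps d_student -> d_teacher.
    """
    projected = F.normalize(proj(student_emb), dim=-1)
    teacher = F.normalize(teacher_emb, dim=-1)
    # Term 1: pull each student embedding towards its teacher counterpart.
    cosine_term = 1.0 - (projected * teacher).sum(dim=-1).mean()
    # Term 2: match the teacher's in-batch similarity structure.
    structure_term = F.mse_loss(projected @ projected.T, teacher @ teacher.T)
    return cosine_term + structure_term

# Example shapes only: a small student projected into a 1024-dim teacher space.
proj = torch.nn.Linear(256, 1024)
print(alignment_loss(torch.randn(8, 256), torch.randn(8, 1024), proj))
```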
After this stage, here we are: we now have a viable backbone that is easy to produce using standardized methods!
Training a ColBERT model
We're now ready to move on to the next step: creating ColBERT models!
Ablations light the way
We decided to take this opportunity to also run many ablations, seeking to answer a few questions we still had about the underlying mechanisms of the standard training recipe. Namely, we were wondering:
- Is Muon a good optimizer for late interaction models?
- Does the projection dimension matter, and if so, at what point does performance begin to degrade rapidly?
- Is Qwen3-Reranker a good teacher for KL-divergence distillation over teacher scores?
- Do our proposed improvements to ColBERT projection heads also help models trained with state-of-the-art recipes, rather than only more academic setups built on a weaker base model?
- Do backbone models that have undergone Stella-style distillation produce better ColBERT models?
- Does the use, or not, of casing have an impact?
The answers to these questions, and many more, are, you guessed it, in the tech report! But, as a sneak peek, the answer to the first question is yes.
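As a concrete illustration of the third question above: distilling over teacher scores typically means matching the student's score distribution over a shared list of candidate documents per query to the teacher's, via a KL divergence. A generic sketch (the temperature and candidate-list setup are assumptions, not our exact configuration):

```python
import torch
import torch.nn.functional as F

def kl_distillation_loss(student_scores: torch.Tensor,
                         teacher_scores: torch.Tensor,
                         temperature: float = 1.0) -> torch.Tensor:
    """KL-divergence distillation over per-query candidate lists.

    Both tensors have shape (num_queries, num_candidates): e.g. MaxSim scores
    from the student and reranker scores from the teacher, for the same
    (query, candidate) pairs.
    """
    student_log_probs = F.log_softmax(student_scores / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_scores / temperature, dim=-1)
    # 'batchmean' averages the KL divergence over queries.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")

# Example: 8 queries, each with 32 scored candidates.
print(kl_distillation_loss(torch.randn(8, 32), torch.randn(8, 32)))
```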
Building on our findings from these ablations, we then used the best settings we uncovered to train our final checkpoints: mxbai-edge-colbert-v0, at both the 17M and 32M parameter scales. For everything not otherwise ablated, we followed the standardized training method introduced in JaColBERTv2.5.
So, how do they fare?
Short answer: surprisingly well!
For models that did nothing out of the ordinary to chase SotA performance, and that largely steered clear of contaminated data at the most important training stages, ours reach robust performance across the board:
Results on BEIR
| Model | AVG | MS MARCO | SciFact | Touche | FiQA | TREC-COVID | NQ | DBPedia |
|---|---|---|---|---|---|---|---|---|
| **Large Models (>100M)** | | | | | | | | |
| GTE-ModernColBERT-v1 | 0.547 | 0.453 | 0.763 | 0.312 | 0.453 | 0.836 | 0.618 | 0.480 |
| ColBERTv2 | 0.488 | 0.456 | 0.693 | 0.263 | 0.356 | 0.733 | 0.562 | 0.446 |
| **Medium Models (<35M)** | | | | | | | | |
| mxbai-edge-colbert-v0-32m | 0.521 | 0.450 | 0.740 | 0.313 | 0.390 | 0.775 | 0.600 | 0.455 |
| answerai-colbert-small-v1 | 0.534 | 0.434 | 0.740 | 0.250 | 0.410 | 0.831 | 0.594 | 0.464 |
| bge-small-en-v1.5 | 0.517 | 0.408 | 0.713 | 0.260 | 0.403 | 0.759 | 0.502 | 0.400 |
| snowflake-s | 0.519 | 0.402 | 0.722 | 0.235 | 0.407 | 0.801 | 0.509 | 0.410 |
| **Small Models (<25M)** | | | | | | | | |
| mxbai-edge-colbert-v0-17m | 0.490 | 0.416 | 0.719 | 0.316 | 0.326 | 0.713 | 0.551 | 0.410 |
| colbert-muvera-micro | 0.394 | 0.364 | 0.662 | 0.251 | 0.254 | 0.561 | 0.386 | 0.332 |
| all-MiniLM-L6-v2 | 0.419 | 0.365 | 0.645 | 0.169 | 0.369 | 0.472 | 0.439 | 0.323 |
Results on LongEmbed
| Model | AVG |
|---|---|
| **Large Models (>100M)** | |
| GTE-ModernColBERT-v1 (32k) | 0.898 |
| GTE-ModernColBERT-v1 (4k) | 0.809 |
| granite-embedding-english-r2 | 0.656 |
| ColBERTv2 | 0.428 |
| **Medium Models (<50M)** | |
| mxbai-edge-colbert-v0-32m (32k) | 0.849 |
| mxbai-edge-colbert-v0-32m (4k) | 0.783 |
| granite-embedding-small-english-r2 | 0.637 |
| answerai-colbert-small-v1 | 0.441 |
| bge-small-en-v1.5 | 0.312 |
| snowflake-arctic-embed-s | 0.356 |
| **Small Models (<25M)** | |
| mxbai-edge-colbert-v0-17m (32k) | 0.847 |
| mxbai-edge-colbert-v0-17m (4k) | 0.776 |
| all-MiniLM-L6-v2 | 0.298 |
| colbert-muvera-micro | 0.405 |
Our 17 million parameter model in particular is a standout performer, and we hope it will be a very strong baseline for many experiments to come. Despite its very low parameter count and a projection dimension of 48, a little over a third of the standard 128, it comfortably outperforms ColBERTv2. And it does so while scaling exceptionally well to longer contexts: its performance on LongEmbed very comfortably exceeds the current <1B-parameter state-of-the-art single-vector retriever on the LongEmbed leaderboard, by more than 19 NDCG@10 points.
Efficiency is the name of the game
Our models build upon the current wave of more efficient encoders, spearheaded by ModernBERT and carried on by subsequent models such as Ettin or ModernVBERT. As such, we designed them with efficiency in mind, attempting to minimize their computational requirements without degrading performance.
On top of their low parameter counts and the efficiency improvements inherent to the ModernBERT architecture, such as built-in unpadding and Flash Attention 2 support, we adopt very small final projection dimensions for our models, which makes them particularly light on memory (a rough per-token calculation follows the table below):
| Model | Params | Dim. | NDCG@10 | LoCo | GPU time | CPU time | Mem. (MB) |
|---|---|---|---|---|---|---|---|
| ColBERTv2 | 130M | 128 | 0.6198 | -- | 81s | 1540s | 732 |
| answerai-colbert-small-v1 | 33M | 96 | 0.6545 | -- | 59s | 621s | 549 |
| colbert-muvera-micro | 4M | 128 | 0.5599 | -- | 45s | 88s | 732 |
| mxbai-edge-colbert-v0-17m | 17M | 48 | 0.6405 | ✓ | 51s | 487s | 275 |
| mxbai-edge-colbert-v0-32m | 32M | 64 | 0.6520 | ✓ | 55s | 589s | 366 |
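To see where the savings in the memory column come from, here is a back-of-the-envelope calculation of raw per-token storage, assuming uncompressed fp16 token vectors (a simplification: real indexes add framework overhead and may compress vectors):

```python
# Raw storage for one token vector = projection dim * 2 bytes (fp16).
BYTES_PER_VALUE = 2  # fp16

for name, dim in [
    ("ColBERTv2 / colbert-muvera-micro", 128),
    ("answerai-colbert-small-v1", 96),
    ("mxbai-edge-colbert-v0-32m", 64),
    ("mxbai-edge-colbert-v0-17m", 48),
]:
    print(f"{name}: {dim * BYTES_PER_VALUE} bytes per stored token")

# 128 dims -> 256 bytes/token vs. 48 dims -> 96 bytes/token: the 17M model
# stores roughly 2.7x less per token than a standard 128-dim ColBERT, which
# is consistent with the Mem. column above (275 MB vs. 732 MB).
```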
We're particularly excited by the 17M variant's potential as an end-to-end retriever, or as a reranker following a static retriever, for on-edge use cases: it can embed dozens of documents in milliseconds on CPU, with a remarkably low memory footprint.
What's next
The models are already available on HuggingFace and supported in PyLate: mxbai-edge-colbert-v0-17m and mxbai-edge-colbert-v0-32m.
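Getting started takes a few lines: below is a minimal indexing-and-retrieval sketch following PyLate's documented usage pattern. The repository id and keyword arguments shown here are indicative; check the model cards and the PyLate documentation for the exact, up-to-date values.

```python
from pylate import indexes, models, retrieve

# Load the model (repository id shown here is indicative -- see the model card).
model = models.ColBERT(model_name_or_path="mixedbread-ai/mxbai-edge-colbert-v0-17m")

# Create a local index and register a couple of documents.
index = indexes.Voyager(index_folder="edge-colbert-index", index_name="demo", override=True)
retriever = retrieve.ColBERT(index=index)

documents = [
    "ColBERT is a late-interaction retriever based on per-token embeddings.",
    "MaxSim sums, for each query token, its best match among document tokens.",
]
doc_embeddings = model.encode(documents, is_query=False)
index.add_documents(documents_ids=["0", "1"], documents_embeddings=doc_embeddings)

# Encode a query and retrieve the top-k documents.
query_embeddings = model.encode(["what is late interaction?"], is_query=True)
print(retriever.retrieve(queries_embeddings=query_embeddings, k=2))
```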
With this release, we have killed two birds with one stone, releasing both the strongest edge-sized retrieval model to date, mxbai-edge-colbert-v0, and a set of extremely strong baselines to support further experimentation.
In the future, we intend to periodically update our edge-sized open source offerings to further disseminate our research findings in a bite-sized, anyone-can-use-it format.
If this sounds like something you'd like to contribute to, we are hiring across all technical positions! Take a look at the openings below and don't hesitate to apply if you feel like a good fit for any of them:
- Research: Research Staff and Research Interns
- Software: Software Engineer, Frontend Engineer and DevOps Engineer
- Product: Product Designer