October 16, 2025
Fantastic (small) Retrievers and How to Train Them: mxbai-edge-colbert-v0

TLDR:
We introduce the mxbai-edge-colbert-v0 model family, at two model sizes: 17M and 32M. Both models are very strong baselines for future research, with the 17M variant outperforming ColBERTv2 and standing in a league of its own for small-model retrieval.
Introduction
This summer, we set out to prepare the next steps of an ambitious research roadmap, with the ultimate aim of designing ever-improving approaches to late interaction and multi-vector retrieval. As we started planning experimental work more thoroughly, we were faced with a problem when thinking about one important question: what model do we tinker with for early experiments?
Indeed, to try out new things, you need a strong baseline with a well-understood training process, so that the results of your experiments are meaningful. Ideally, this baseline would be both strong and small: scaling laws exist in information retrieval as they do in the rest of machine learning, and a capable small model lets you test ideas in record time before scaling them up with relatively little effort.
With this in mind, we still did not know what this tiny experimental testbed should be: not because there are no good ColBERT models out there, but because none of them quite met our needs!
ColBERTv2 is an excellent baseline, but it dates back to 2021, which, in AI terms, makes it ancient. GTE-ModernColBERT, the current state of the art, is a fantastic model, but it suffers from two problems: it is larger than we would like, and it is initialized from a pre-trained dense embedding model that is hard to reproduce, limiting our control over experiments. answerai-colbert-small-v1, while an extremely strong model at a great compact size, is also initialized from a pre-trained checkpoint, which is itself the average of not one but two hard-to-reproduce embedding models. Additionally, it has a MiniLM backbone, which means that, much like ColBERTv2, it suffers from the limitations of previous-generation encoders, such as the lack of long-context support.
Just a few weeks before we started pondering this, the Ettin collection of models had come out: among other things, it includes a replication of ModernBERT (with some tweaks) across a large range of model scales, from 17 million to 1 billion parameters. The two smallest sizes of Ettin, with 17 and 32 million parameters respectively, seemed like perfect matches: we quickly made the decision to train tiny, capable models that could support all of our future experiments. We also immediately decided that we would release these models publicly, as we believe open source releases to be the perfect home for models that can run on just about any hardware.
If you want the full details on our training process, please head over to the tech report. We have attempted to make it a true overview of "how to train sane, near state-of-the-art retrievers in 2025", and we highly encourage you to read it if this is of interest to you.
If you just want an overview of what we did and the HuggingFace links, however, you're in the right place.
Training Small ColBERTs: The Steps
Previous research on ColBERT models has indicated a pretty clear trend: all state-of-the-art ColBERTs are initialized from strong single-vector embedding models, which have undergone their own somewhat standardized multi-stage training process. This is likely due to a combination of reasons, ranging from MaxSim's learning constraints favouring already-strong representations over unaligned ones to the lack of a standardized ColBERT pre-training recipe, among other potential culprits.
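For readers less familiar with late interaction: a ColBERT-style model embeds every token of the query and the document, scores each query token by its best-matching document token, and sums those maxima. A minimal PyTorch sketch of this MaxSim operator (purely illustrative, not our training code):

```python
import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """Late-interaction (MaxSim) relevance score.

    query_emb: (num_query_tokens, dim) and doc_emb: (num_doc_tokens, dim),
    both assumed L2-normalised so dot products are cosine similarities.
    """
    # Token-to-token similarity matrix: (num_query_tokens, num_doc_tokens).
    sim = query_emb @ doc_emb.T
    # Each query token keeps only its best-matching document token,
    # and the per-token maxima are summed into a single relevance score.
    return sim.max(dim=1).values.sum()

# Example with random (normalised) embeddings at projection dimension 48.
q = torch.nn.functional.normalize(torch.randn(32, 48), dim=-1)
d = torch.nn.functional.normalize(torch.randn(300, 48), dim=-1)
print(maxsim_score(q, d))
```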
Training Dense Backbones
As such, before training our ColBERT models, we must first prepare suitable dense models to serve as backbones. Since the purpose is to build all-around well-performing baselines rather than to chase benchmarks, we opt for standardized methods and widely used datasets with limited overfitting potential.
The dense training process we opt for consists of three steps:
- First, we perform large-scale weakly supervised contrastive pre-training on around 150 million training pairs. This step warms up the model's representations, shifting them from Ettin's original language modelling objective to a similarity objective. The data is not of high quality, but it is large in volume, slowly nudging the embedding space in the right direction.
- Second, we perform supervised fine-tuning. This is the key step, where the model from the first stage is exposed to retrieval queries and their matching documents, with positives annotated by humans. Following standard practice, we perform hard-negative mining, so as to provide the model with believable-looking negative examples and teach it to distinguish near-matches from actually relevant documents.
- Third, a step that is less standard (for now): Stella-style knowledge distillation. This step is the key component of the Stella retrieval models, which are well known in the information retrieval community as very strong models for their size. Effectively, the aim here is to align the representations of our model with those of a much larger, better model. Curiosity got the best of us here: we are really big fans of the Stella models and wanted to explore this approach to distillation in depth.
Again, we provide more information on this step in the tech report, but broadly we adopted a simplified version of the Stella mixture of losses, inspired by MongoDB's recent report on LEAF-style distillation. We note that this step strongly improved the performance of our 32 million parameter model variant and resulted in a small-but-noticeable boost for the 17 million one.
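Our exact mixture of losses is detailed in the tech report; purely to give a feel for what "aligning representations with a teacher" can look like, here is a deliberately simplified sketch. The specific combination below (a cosine alignment term plus an in-batch similarity-structure term) is an illustration, not our recipe:

```python
import torch
import torch.nn.functional as F

def alignment_loss(student_emb: torch.Tensor,
                   teacher_emb: torch.Tensor,
                   proj: torch.nn.Linear) -> torch.Tensor:
    """Toy embedding-alignment objective in the spirit of Stella/LEAF distillation.

    student_emb: (batch, d_student); teacher_emb: (batch, d_teacher) from a
    frozen, much larger embedding model; proj maps d_student -> d_teacher.
    """
    projected = F.normalize(proj(student_emb), dim=-1)
    teacher = F.normalize(teacher_emb, dim=-1)
    # Term 1: pull each student embedding towards its teacher counterpart.
    cosine_term = 1.0 - (projected * teacher).sum(dim=-1).mean()
    # Term 2: match the teacher's in-batch similarity structure.
    structure_term = F.mse_loss(projected @ projected.T, teacher @ teacher.T)
    return cosine_term + structure_term

# Example shapes only: a small student projected into a 1024-dim teacher space.
proj = torch.nn.Linear(256, 1024)
print(alignment_loss(torch.randn(8, 256), torch.randn(8, 1024), proj))
```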
After this stage, here we are: we now have a viable backbone that is easy to produce using standardized methods!
Training a ColBERT model
We're now ready to move on to the next step: creating ColBERT models!
Ablations light the way
We decided to take this opportunity to also run many ablations, seeking to answer a few questions we still had about the underlying mechanisms of the standard training recipe. Namely, we were wondering:
- Is Muon a good optimizer for late interaction models?
- Does the projection dimension matter, and if so, at what point does performance begin to degrade rapidly?
- Is Qwen3-Reranker a good teacher for KL-divergence distillation over teacher scores?
- Do our proposed improvements to ColBERT projection heads also help models trained with state-of-the-art recipes, rather than only more academic setups built on a weaker base model?
- Do backbone models that have undergone Stella-style distillation produce better ColBERT models?
- Does the use, or not, of casing have an impact?
The answers to these questions, and many more, are, you guessed it, in the tech report! But, as a sneak peek, the answer to the first question is yes.
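As a concrete illustration of the third question above: distilling over teacher scores typically means matching the student's score distribution over a shared list of candidate documents per query to the teacher's, via a KL divergence. A generic sketch (the temperature and candidate-list setup are assumptions, not our exact configuration):

```python
import torch
import torch.nn.functional as F

def kl_distillation_loss(student_scores: torch.Tensor,
                         teacher_scores: torch.Tensor,
                         temperature: float = 1.0) -> torch.Tensor:
    """KL-divergence distillation over per-query candidate lists.

    Both tensors have shape (num_queries, num_candidates): e.g. MaxSim scores
    from the student and reranker scores from the teacher, for the same
    (query, candidate) pairs.
    """
    student_log_probs = F.log_softmax(student_scores / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_scores / temperature, dim=-1)
    # 'batchmean' averages the KL divergence over queries.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")

# Example: 8 queries, each with 32 scored candidates.
print(kl_distillation_loss(torch.randn(8, 32), torch.randn(8, 32)))
```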
Building on our findings from these ablations, we then used the best settings we uncovered to train our final checkpoints: mxbai-edge-colbert-v0, at both the 17M and 32M parameter scales. For everything not otherwise ablated, we followed the standardized training method introduced in JaColBERTv2.5.
So, how do they fare?
Short answer: surprisingly well!
For models that did nothing out of the ordinary to chase SotA performance, and that largely steered clear of contaminated data at the most important training stages, ours reach robust performance across the board:
Results on BEIR
| Model | AVG | MS MARCO | SciFact | Touche | FiQA | TREC-COVID | NQ | DBPedia |
|---|---|---|---|---|---|---|---|---|
| **Large Models (>100M)** | | | | | | | | |
| GTE-ModernColBERT-v1 | 0.547 | 0.453 | 0.763 | 0.312 | 0.453 | 0.836 | 0.618 | 0.480 |
| ColBERTv2 | 0.488 | 0.456 | 0.693 | 0.263 | 0.356 | 0.733 | 0.562 | 0.446 |
| **Medium Models (<35M)** | | | | | | | | |
| mxbai-edge-colbert-v0-32m | 0.521 | 0.450 | 0.740 | 0.313 | 0.390 | 0.775 | 0.600 | 0.455 |
| answerai-colbert-small-v1 | 0.534 | 0.434 | 0.740 | 0.250 | 0.410 | 0.831 | 0.594 | 0.464 |
| bge-small-en-v1.5 | 0.517 | 0.408 | 0.713 | 0.260 | 0.403 | 0.759 | 0.502 | 0.400 |
| snowflake-s | 0.519 | 0.402 | 0.722 | 0.235 | 0.407 | 0.801 | 0.509 | 0.410 |
| **Small Models (<25M)** | | | | | | | | |
| mxbai-edge-colbert-v0-17m | 0.490 | 0.416 | 0.719 | 0.316 | 0.326 | 0.713 | 0.551 | 0.410 |
| colbert-muvera-micro | 0.394 | 0.364 | 0.662 | 0.251 | 0.254 | 0.561 | 0.386 | 0.332 |
| all-MiniLM-L6-v2 | 0.419 | 0.365 | 0.645 | 0.169 | 0.369 | 0.472 | 0.439 | 0.323 |
Results on LongEmbed
| Model | AVG |
|---|---|
| **Large Models (>100M)** | |
| GTE-ModernColBERT-v1 (32k) | 0.898 |
| GTE-ModernColBERT-v1 (4k) | 0.809 |
| granite-embedding-english-r2 | 0.656 |
| ColBERTv2 | 0.428 |
| **Medium Models (<50M)** | |
| mxbai-edge-colbert-v0-32m (32k) | 0.849 |
| mxbai-edge-colbert-v0-32m (4k) | 0.783 |
| granite-embedding-small-english-r2 | 0.637 |
| answerai-colbert-small-v1 | 0.441 |
| bge-small-en-v1.5 | 0.312 |
| snowflake-arctic-embed-s | 0.356 |
| **Small Models (<25M)** | |
| mxbai-edge-colbert-v0-17m (32k) | 0.847 |
| mxbai-edge-colbert-v0-17m (4k) | 0.776 |
| all-MiniLM-L6-v2 | 0.298 |
| colbert-muvera-micro | 0.405 |
Our 17 million parameter model in particular is a standout performer, and we hope it will be a very strong baseline for many experiments to come. Despite its very low parameter count and a projection dimension of 48, a little over a third of the standard 128, it comfortably outperforms ColBERTv2. And it does so while scaling exceptionally well to longer contexts: its performance on LongEmbed very comfortably exceeds the current <1B-parameter state-of-the-art single-vector retriever on the LongEmbed leaderboard, by more than 19 NDCG@10 points.
Efficiency is the name of the game
Our models build upon the current wave of more efficient encoders, spearheaded by ModernBERT and carried on by subsequent models such as Ettin or ModernVBERT. As such, we designed them with efficiency in mind, attempting to minimize their computational requirements without degrading performance.
On top of their low parameter counts and the efficiency improvements inherent to the ModernBERT architecture, such as built-in unpadding and Flash Attention 2 support, we adopt very small final projection dimensions for our models, which makes them particularly light on memory (a rough per-token calculation follows the table below):
| Model | Params | Dim. | NDCG@10 | LoCo | GPU time | CPU time | Mem. (MB) |
|---|---|---|---|---|---|---|---|
| ColBERTv2 | 130M | 128 | 0.6198 | -- | 81s | 1540s | 732 |
| answerai-colbert-small-v1 | 33M | 96 | 0.6545 | -- | 59s | 621s | 549 |
| colbert-muvera-micro | 4M | 128 | 0.5599 | -- | 45s | 88s | 732 |
| mxbai-edge-colbert-v0-17m | 17M | 48 | 0.6405 | ✓ | 51s | 487s | 275 |
| mxbai-edge-colbert-v0-32m | 32M | 64 | 0.6520 | ✓ | 55s | 589s | 366 |
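To see where the savings in the memory column come from, here is a back-of-the-envelope calculation of raw per-token storage, assuming uncompressed fp16 token vectors (a simplification: real indexes add framework overhead and may compress vectors):

```python
# Raw storage for one token vector = projection dim * 2 bytes (fp16).
BYTES_PER_VALUE = 2  # fp16

for name, dim in [
    ("ColBERTv2 / colbert-muvera-micro", 128),
    ("answerai-colbert-small-v1", 96),
    ("mxbai-edge-colbert-v0-32m", 64),
    ("mxbai-edge-colbert-v0-17m", 48),
]:
    print(f"{name}: {dim * BYTES_PER_VALUE} bytes per stored token")

# 128 dims -> 256 bytes/token vs. 48 dims -> 96 bytes/token: the 17M model
# stores roughly 2.7x less per token than a standard 128-dim ColBERT, which
# is consistent with the Mem. column above (275 MB vs. 732 MB).
```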
We're particularly excited by the 17M variant's potential as an end-to-end retriever, or as a reranker following a static retriever, for on-edge use cases: it can embed dozens of documents in milliseconds on CPU, with a remarkably low memory footprint.
What's next
The models are already available on HuggingFace and supported in PyLate: mxbai-edge-colbert-v0-17m and mxbai-edge-colbert-v0-32m.
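Getting started takes a few lines: below is a minimal indexing-and-retrieval sketch following PyLate's documented usage pattern. The repository id and keyword arguments shown here are indicative; check the model cards and the PyLate documentation for the exact, up-to-date values.

```python
from pylate import indexes, models, retrieve

# Load the model (repository id shown here is indicative -- see the model card).
model = models.ColBERT(model_name_or_path="mixedbread-ai/mxbai-edge-colbert-v0-17m")

# Create a local index and register a couple of documents.
index = indexes.Voyager(index_folder="edge-colbert-index", index_name="demo", override=True)
retriever = retrieve.ColBERT(index=index)

documents = [
    "ColBERT is a late-interaction retriever based on per-token embeddings.",
    "MaxSim sums, for each query token, its best match among document tokens.",
]
doc_embeddings = model.encode(documents, is_query=False)
index.add_documents(documents_ids=["0", "1"], documents_embeddings=doc_embeddings)

# Encode a query and retrieve the top-k documents.
query_embeddings = model.encode(["what is late interaction?"], is_query=True)
print(retriever.retrieve(queries_embeddings=query_embeddings, k=2))
```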
With this release, we have killed two birds with one stone, releasing both the strongest edge-sized retrieval model to date, mxbai-edge-colbert-v0, and a set of extremely strong baselines to support further experimentation.
In the future, we intend to periodically update our edge-sized open source offerings to further disseminate our research findings in a bite-sized, anyone-can-use-it format.
If this sounds like something you'd like to contribute to, we are hiring across all technical positions! Take a look at the openings below and don't hesitate to apply if you feel like a good fit for any of them:
- Research: Research Staff and Research Interns
- Software: Software Engineer, Frontend Engineer and DevOps Engineer
- Product: Product Designer