September 24, 2025

Our Research Vision, Part 1

At Mixedbread, we are trying to strike a balance that feels increasingly hard to get right: our goal is to be a research lab whose work directly feeds into a product, rather than a product company that happens to occasionally require research.

This has led us to think a lot about why we do research, how we do it, and how we can strike the right balance between offering an attractive product to users while contributing to the broader research community. In this blog post, we're sharing some of our early thoughts on our reasons for setting up Mixedbread as such, the way we approach research and decide what to work on, and how we intend to achieve this balance.

Why we do what we do

We believe that information retrieval and capital-S Search are foundational areas of research in the currently nascent AI era. There are two reasons for this belief.

The first one is that neural retrieval, at its core, is about understanding how to shape and align the embedding space so that the relationship between representations of seemingly very different things, queries and documents, can be effectively captured.
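As a toy sketch of that idea (the vectors below are made-up stand-ins for learned embeddings, not the output of any real encoder), retrieval in a shared embedding space reduces to nearest-neighbour search under cosine similarity:

```python
import numpy as np

def cosine_rank(query_vec, doc_vecs):
    """Rank documents by cosine similarity to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q  # dot product of unit vectors == cosine similarity
    return np.argsort(-scores), scores

# Toy 4-dimensional "embeddings"; a trained encoder would produce these
# for the query and document texts.
query = np.array([0.9, 0.1, 0.0, 0.1])
docs = np.array([
    [0.8, 0.2, 0.1, 0.0],   # points in roughly the same direction
    [0.0, 0.1, 0.9, 0.3],   # points elsewhere
])
order, scores = cosine_rank(query, docs)
print(order[0])  # → 0, the nearest document
```

The hard research problem is everything this sketch takes for granted: training the encoder so that a short query and a long, differently-worded document end up close together in the first place.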

The second reason is that Search, broadly defined, is a foundational tool for any intelligent agent. This is not to say that we are RAG absolutists: knowledge can and should, as much as it is feasible, be baked into the LLMs themselves.

But memorization, alone, is never going to solve problems that we commonly associate with intelligence. Even if it were not outright unreasonable, it would still be exceedingly inefficient to expect LLMs to memorize gigantic e-commerce catalogues that are changing on a daily basis, weekly financial reports, or even today's weather.

The world, and the billions of knowledge bases it contains, is a rapidly evolving environment. I do not expect my accountant to remember every single invoice I have issued over the past four years, nor to know exactly what type of tax treaty applies to every kind of international transaction. I do, however, expect them to possess tools that make surfacing these pieces of information trivial, should it be needed.

In fact, we agree with Nobel Prize-winning psychologist Daniel Kahneman's definition of human intelligence as not only the ability to reason, but also the ability to find relevant material in memory and to deploy attention when needed, insofar as we believe information retrieval to be a cornerstone of intelligence in general. To take the age-old example: geniuses of the past were not any less smart than 20th-century physicists, but the very act of creating new knowledge requires the ability to access, search through, and curate existing information.

This is what Search is all about: providing an intelligent agent, human or artificial, with the ability to look up information when it is needed, building on the assumption that the agent will then know what to do with it. In our view, perfecting search is potentially one of the most important research north stars in the field, secondary only to AGI itself.

Of course, there are many, many, many, many... many ways of representing information. And there are just as many ways of attempting to retrieve it. "Old school" retrieval has a lot of failure cases, but there are many promising ways to overcome these limitations as part of the endless quest for Perfect Search. Researching these, and understanding why and how they work, is the aim of Mixedbread.

Taking the Research out of the lab

While the above sounds very aspirational, it is also very abstract. If the past few years have taught us anything, it's that research conducted in the lab, with outcomes optimized for the lab, and with an academic paper as the sole dissemination aim, ultimately has low impact.

Our approach to figuring out what is worth pursuing can be summed up in just 5 words: Impactful research should be useful.

Of course, not all research will yield a usable artefact: long-term work is important! And sometimes, the outcome of weeks of work will be the learning that X does not work. However, it is key to remember that individual research items should always be thought of as part of larger projects, at least fuzzily defined, towards building something that will be useful.

Useful Research

In practice, this way of thinking manifests itself in two ways: ensuring that we conduct research that somehow contributes tangibly to an end goal; and working towards things that are usable in the real world.

Pillar 1: Tangible End Goals

Whenever we come up with a new idea, we ask ourselves: "If we spend N weeks on this, will we get something useful out of it?"

Useful, here, is defined very broadly: will this project lead to an open-source artefact that many people can use? Will it further our understanding in a way that meaningfully improves future models? Will that improvement be a direct performance gain, or will it facilitate our decision making? (Indeed, there are hundreds of knobs to turn when training retrievers, and their interactions are very poorly understood, so any insight into the underlying mechanisms is useful.)

If the answer to any of the questions above is yes, then it is likely worth doing. If, on the other hand, we are unable to understand how doing this work will lead to tangible benefits for a reasonable number of people, then it's probably not going to make it to the priority list.

The nice aspect of this way of thinking is that it is ultimately not very limiting: even "moonshot" projects can fall within it nicely, and we definitely have a few of those... On the other hand, it makes it considerably easier to stay focused: if we're unable to articulate why this would be useful beyond producing a paper, then we either need to think more about it, or it wasn't a great idea in the first place.

Pillar 2: Real-World Usability

The second underlying question that precedes everything we do is directly derived from the first one: is this going to be useful in the real world? Or, in other words, is X realistically useful?

It would be very easy, for example, to decide that performing reranking with a 1.7T param model is an acceptable research item. After all, with so many parameters, the metrics are going to look amazing. Maybe we could even fine-tune it and use it as a fully uncompressed multi-vector retrieval pipeline: imagine all the information you can squeeze into 16384 dimensions per token!

This could potentially make a great paper, and it would have numbers in bold that would be extremely hard to beat. State-of-the-art would be achieved internally in one triumphant release.

On the other hand, there would be serious concerns here: could this be served at a reasonable price? Could it be served at all without being a loss leader? Would the latency concerns be acceptable for all users, or even for any user at all?
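To make the storage side of those concerns concrete, here is a back-of-envelope estimate using the hypothetical numbers above; the document length and corpus size are assumptions for illustration, not measurements:

```python
# Back-of-envelope index size for uncompressed multi-vector retrieval.
# All numbers are illustrative assumptions, not measurements.
DIMS_PER_TOKEN = 16384      # the hypothetical giant model above
BYTES_PER_FLOAT = 4         # float32
TOKENS_PER_DOC = 512        # assumed average document length

bytes_per_doc = DIMS_PER_TOKEN * BYTES_PER_FLOAT * TOKENS_PER_DOC
print(bytes_per_doc // 2**20, "MiB per document")  # 32 MiB

# A modest 10-million-document corpus:
corpus_tib = 10_000_000 * bytes_per_doc / 2**40
print(f"{corpus_tib:.0f} TiB for the index")  # 305 TiB
```

Hundreds of tebibytes of vectors, before any query even arrives, is the kind of number that settles the question on its own.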

Realistically, the answer to those questions in this extreme example is no. The much more interesting situation, however, is when the answer is "Probably not as is, BUT it could be".

Indeed, a lot of Search, in practice, is about efficiency engineering. Some of the most interesting papers in our field are about creative indexing methods, techniques that allow extreme compression (or quantization) without loss of performance, or other tricks that (almost magically) make things go much faster and require much less hardware without compromising quality.
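One well-known trick of this kind is binary quantization: keeping only the sign bit of each embedding dimension and comparing vectors by Hamming distance. A minimal sketch, using random vectors as placeholder embeddings (the 32x figure holds for any float32 embeddings; retrieval quality on real models is what the literature measures):

```python
import numpy as np

def binary_quantize(embs):
    """Quantize float32 embeddings to 1 bit per dimension (sign bits)."""
    bits = (embs > 0).astype(np.uint8)
    return np.packbits(bits, axis=1)  # 8 dimensions per stored byte

def hamming_scores(q_packed, d_packed, dim):
    """Similarity = dim minus Hamming distance between packed sign bits."""
    xor = np.bitwise_xor(d_packed, q_packed)
    dist = np.unpackbits(xor, axis=1).sum(axis=1)
    return dim - dist

rng = np.random.default_rng(0)
docs = rng.standard_normal((1000, 1024)).astype(np.float32)   # stand-in corpus
query = rng.standard_normal(1024).astype(np.float32)

packed = binary_quantize(docs)
q_packed = binary_quantize(query[None, :])
scores = hamming_scores(q_packed, packed, 1024)

print(docs.nbytes // packed.nbytes)  # → 32, i.e. a 32x smaller index
```

A 32x reduction in index size (often with a float rescoring pass to recover accuracy) is exactly the almost-magical kind of result that makes serving search at scale viable.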

Hence, this is naturally something that is constantly at the back of our minds, and a lot of our day-to-day work sits at the intersection of engineering and research. The goal of our research, after all, is to be impactful. And to be impactful, something needs to run!

Striking a Balance

Finally, as mentioned in the introduction, a major concern of ours is how a for-profit company can produce meaningful research.

There are many examples in industry. You might be thinking of giants such as OpenAI and Anthropic, whose research very much directly feeds into products, or of startups such as HuggingFace and Prime Intellect, whose business models allow them to sustain pretty cool X users (who occasionally write papers and/or libraries) while ensuring money is flowing in.

Looking at other companies who have made it work, in various ways, it has become very clear to us that tradeoffs are inevitable, and that it takes trial and error as well as a domain-specific strategy to truly make this work. It's pretty clear to us that while we have a plan, it'll probably be flawed in many ways, so we expect to iterate on it as we move forward.

Some companies have taken the approach of going full closed-source. While they conduct world-class research, most of it is only ever going to be accessible through their products, with none of the inner workings published. Others are taking the fully open research pathway, and seeking profitability in other ways.

At Mixedbread, we want our research to directly feed into our product, and the feedback on our product to directly inform our research. This obviously means that we cannot, mechanically, be fully open: some things will remain closed source, especially as they are heavily embedded in our internal machinery.

However, a lot of what we do will be open, and we're aiming to set up a hybrid approach. We believe that this can be sustained in various forms:

  • Much of our research effort is focused on individual aspects of Search and Retrieval, to yield a greater understanding of certain components, and we have shared such insights in the past. We fully intend to continue sharing them with the broader community, as we're convinced that there is no path to "solved search" without the broader research effort.
  • We have begun releasing specialized tooling which can be used outside of our internal platform, and intend to continue to do so in the future, once again because we believe such releases are important to foster a more mature ecosystem.
  • Open source models have long been part of Mixedbread's DNA: our embedding models have seen wide adoption, and our rerankers were part of the first wave of research exploring lightweight LLMs as rerankers. In the future, we expect to sustain frequent open model releases, distilling the learnings from our private modelling work into small, personal-device-friendly models.

If this sounds good to you...

... How about joining us? We're currently hiring across all positions!

If you're interested in multimodal Information Retrieval research, very broadly defined, we are looking for:

  • Researchers, at all levels of seniority. This is a mix of what you might see called Research Scientist and Research Engineer elsewhere, or more broadly, Member of Technical Staff, Research. This is an umbrella position for our research team, where the actual responsibilities will be tailored to your research interests and our ongoing objectives.
  • Research interns, as part of our internship program. You will be matched with a specific, self-contained project. The goal of all of our internships is to help you build your understanding and your CV, and they're all designed with a publishable artefact (blog post, paper, model/code release, etc., depending on the project) in mind.

If you're more interested in the engineering side of making search work at scale, we're also hiring across all engineering positions.