In these difficult times, nothing beats a nice, warm bowl of soup.
Minestrone is one of the classic Italian soups: thick, hearty, and flavorful, combining beans, vegetables, and rice or pasta. Its taste is a product of assembling diverse ingredients. It’s a bit like borscht in Eastern Europe, casseroles in America, or a homemade stir-fry in Pacific Asia in that it combines available, inexpensive ingredients into a beloved dish.
We can use much the same kind of recipe for neural network models, according to a line of papers starting with Wortsman et al. (2022).
“Model soups” (alas, not “model casseroles” or “model stir-fries”) are a class of model ensembling techniques designed to mitigate the cost of optimizing training data and model hyperparameters. When training a neural network, you typically try different data and hyperparameter values and train multiple times, looking for the best-performing result. Training is very computationally expensive, and costs add up quickly.
Instead, model soups involve training multiple models with different hyperparameters and training data choices — the same as you usually would — but then combining them. The result is a higher-performing and more robust model than the single best performer. It doesn’t save costs because you still train multiple models, but you can get a better result for the same price.
The model soup approach has already proven useful for text-image multimodal embedding models (Wortsman et al. 2022) and generative large language models (Takuya et al. 2025). At Jina AI, we’ve begun using this technique to train our own models, and jina-embeddings-v3 and reader-lm-v2 both incorporate model soups.
In this article, we’re going to look at model soups and show the results of some of our work with them. Specifically:
- Can we use model soups to improve performance by merging models at different points in their training?
- Can we merge models trained with different datasets and for different tasks to obtain better performance and higher training efficiency than by training a single model?
This has important potential benefits:
- Model soups can have better and more robust performance.
- Multilingual embedding models often suffer from biases and performance failures caused by unequal amounts of training data. It would be a boon to be able to train the best model we can on each task or dataset individually and then combine them equally.
- We may be able to do better continuous learning and model updating by making changes to our models in a modular way, updating one component model at a time, and then remerging it with the others.
How Does it Work?
Merging the outputs of multiple models is an old technique in statistical decision theory. For example, it's common practice in weather forecasting to create multiple models, often made by different people with different assumptions, and then use a variety of mechanisms to average their predictions. If each model’s errors are independent of one another, then averaging their predictions will lead to answers with fewer errors.
For example, if you have three different models that each output a binary “yes” or “no”, and each is wrong 10% of the time, then a majority of the three will be wrong only 2.8% of the time. Five models, with a majority decision criterion, will be wrong only 0.856% of the time.
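Those figures follow directly from the binomial distribution. Here is a quick sanity check in Python (a minimal sketch; the 10% error rate and the independence assumption come from the example above):

```python
from math import comb

def majority_error(n_models: int, p_wrong: float) -> float:
    """Probability that a majority of n independent models is wrong,
    given that each one is wrong with probability p_wrong."""
    k_min = n_models // 2 + 1  # smallest number of wrong votes that forms a majority
    return sum(
        comb(n_models, k) * p_wrong**k * (1 - p_wrong)**(n_models - k)
        for k in range(k_min, n_models + 1)
    )

print(majority_error(3, 0.10))  # 0.028   -> wrong 2.8% of the time
print(majority_error(5, 0.10))  # 0.00856 -> wrong 0.856% of the time
```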
Averaging models works on the same principle, but instead of combining the outputs of different models, it combines the models themselves.
The approach used is an extension of stochastic weight averaging (Izmailov et al. 2018), which relies on insights into the loss landscapes of neural networks to show that simple weight averaging can improve model generalization performance under common conditions.
The actual mechanics of averaging models are disturbingly simple: you just average the weights of multiple models.
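In PyTorch terms, this amounts to averaging the models' state dictionaries entry by entry. A minimal sketch, assuming the checkpoints are fine-tuned variants of the same base model with identical architectures (the file names are hypothetical):

```python
import torch

def average_weights(state_dicts):
    """Uniformly average the weights of several fine-tuned variants
    of the same base model (a 'uniform soup')."""
    averaged = {}
    for name, tensor in state_dicts[0].items():
        if torch.is_floating_point(tensor):
            # Average the corresponding parameter across all checkpoints
            averaged[name] = torch.stack([sd[name] for sd in state_dicts]).mean(dim=0)
        else:
            # Integer buffers (e.g., position ids) are copied rather than averaged
            averaged[name] = tensor.clone()
    return averaged

# Hypothetical checkpoint files from separate fine-tuning runs
paths = ["run_a.pt", "run_b.pt", "run_c.pt"]
merged = average_weights([torch.load(p, map_location="cpu") for p in paths])
torch.save(merged, "merged_soup.pt")
```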

If this seems too easy, it’s important to note that there are limitations when merging models this way. You can’t just merge the weights of any two neural networks and expect it to work.
Model averaging only works on very similar models, i.e., models whose weights are not very different from each other to begin with. The way to ensure this is to pre-train one model and then create multiple variants of that model by fine-tuning them with different hyperparameters or different data. These models will typically be similar enough to average.
In more technical terms, pre-training usually produces a model whose weights are near the bottom of a loss basin, and fine-tuning doesn’t easily lead to escaping that loss basin. If all the models to be merged have weights in the same loss basin, then their weights will be fairly close to the same, and averaging them is likely to work. This is not guaranteed, but empirically, it seems to be true often enough to be useful.
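There is no cheap way to verify that two checkpoints share a loss basin, but a rough proxy you can compute before merging is how far apart their weights are relative to their overall magnitude. The sketch below is a heuristic diagnostic of this kind, not a test taken from the papers cited above:

```python
import torch

def relative_weight_distance(sd_a, sd_b):
    """Rough proxy for how different two checkpoints of the same
    architecture are: ||a - b|| / ||a|| over floating-point parameters."""
    diff_sq = norm_sq = 0.0
    for name, a in sd_a.items():
        if not torch.is_floating_point(a):
            continue
        b = sd_b[name]
        diff_sq += (a - b).pow(2).sum().item()
        norm_sq += a.pow(2).sum().item()
    return (diff_sq ** 0.5) / (norm_sq ** 0.5)

# Fine-tuned variants of one pre-trained model typically stay close together;
# a large relative distance is a warning sign that simple averaging may fail.
```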
Experimental Setup
Base Model: For the experiments described here, we used xlm-roberta-base from FacebookAI (Conneau et al. 2020) as our pre-trained base model. This model has 280 million parameters and has been pre-trained on 2.5TB of Common Crawl data containing text in roughly 100 languages. Before performing our experiments, we fine-tuned xlm-roberta-base on our curated sentence-pair training set for embedding training.
Training Data: Jina AI maintains custom-curated datasets for training. For the first experiment, we used sentence triplets specifically curated for contrastive training in six languages: English, Arabic, German, Spanish, Japanese, and Chinese. For the second experiment, we used task-specific training datasets in English.
Evaluation: We used relevant parts of the MMTEB benchmark set (Enevoldsen et al. 2025) and MIRACL benchmark (Zhang et al. 2023) to evaluate the models produced by our training and merging.
Experiment 1: Single-Run Averaging
For this experiment, we used contrastive sentence triplets in all six languages, mixed together, for a total of 6,000 training steps with a batch size of 1,024 items. Every 2,000 steps, we saved the model state for averaging, producing three checkpoints, each reflecting a different point in the training process.
We averaged the three models to produce a final model. We then tested the merged model and the three saved checkpoints against the MMTEB-STS and MIRACL benchmark sets.
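In outline, the checkpointing and merging logic looks like the sketch below. The contrastive loss, optimizer, and data loading are elided, and `average_weights` is the uniform-averaging helper sketched earlier; in practice the starting point is our fine-tuned encoder rather than the raw pre-trained model:

```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("xlm-roberta-base")

SAVE_EVERY, TOTAL_STEPS = 2_000, 6_000
checkpoints = []

for step in range(1, TOTAL_STEPS + 1):
    # ... one contrastive training step on the mixed six-language triplets ...
    if step % SAVE_EVERY == 0:
        # Snapshot the weights at this point in training (CPU copy)
        checkpoints.append({k: v.detach().cpu().clone()
                            for k, v in model.state_dict().items()})

# Average the three snapshots into a single soup and load it back
# (average_weights: the uniform-averaging helper defined earlier)
model.load_state_dict(average_weights(checkpoints))
```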
Our results are summarized in the table below:
| Model | MIRACL (avg 6 languages) | MMTEB-STS English (avg 8 benchmarks) | MMTEB-STS Multilingual (avg 6 benchmarks) | Average of 20 benchmarks |
|---|---|---|---|---|
| No triplet training | 0.3163 | 0.7859 | 0.7322 | 0.6276 |
| Step 2000 | 0.4631 | 0.7924 | 0.7561 | 0.6813 |
| Step 4000 | 0.4639 | 0.7902 | 0.7583 | 0.6812 |
| Step 6000 (final) | 0.4680 | 0.7891 | 0.7575 | 0.6818 |
| Merged model (all 3 stored checkpoints) | 0.4669 | 0.7910 | 0.7579 | 0.6823 |
Merging the stored checkpoints did not generally produce a better-performing model than the best individual checkpoint, either on individual benchmarks or on any of the three batteries of benchmarks used. However, it did produce the best model when all benchmarks are averaged together.
In individual benchmarks, the difference between the merged model and the best-performing checkpoint is in every case less than 0.01. This is true not only for the averages in the table above but for each individual test.
This demonstrates that merging different training checkpoints can produce a more robust model at very little performance cost.
Furthermore, by merging the different checkpoints, we can effectively guard against overtraining. Overtraining has recently become an important topic in neural network research (Springer et al., 2025): a network can be trained in a way that makes it harder to fine-tune and leaves it performing worse after further fine-tuning.
Since the best-performing checkpoint in our experiment is often not the last one, we have likely overtrained our model at 6,000 training steps. The merged model comes very close to matching the performance of the best checkpoint in all tests, removing the defects of overtraining.
Experiment 2: Averaging Models Trained for Different Tasks
For this experiment, we trained three models, each for a different common embedding task:
- Semantic similarity: Measuring the relative overlap or similarity in meaning between two texts, typically of comparable length.
- Document retrieval based on textual queries: Finding the documents that best satisfy a query. Queries are generally much shorter texts than the documents they match.
- Question answering: Finding the document that best answers a natural language question. Questions are also generally much shorter than the texts they match.
Training models for all three tasks at once is quite difficult because the goals are very dissimilar, and we hoped that model soups would improve the process.
Based on previous experience, we knew that each task required a different number of training epochs. The training is summarized below:
| Task | Training Steps (batch size = 1,024) | Training Dataset Size (in items) |
|---|---|---|
| Question Answering (QA) | 2,000 | 256,000 |
| Document Retrieval | 3,000 | 384,000 |
| Semantic Similarity (STS) | 1,000 | 128,000 |
This produced three models, which we then merged into a single model. We tested the resulting model against the portions of the MMTEB benchmark set relevant to those three tasks: MIRACL, NanoBEIR, and STSEval (English and Multilingual parts of MMTEB).
| Model | MIRACL (avg 6 languages) | NanoBEIR (avg 13 benchmarks) | MMTEB-STS English (avg 9 benchmarks) | MMTEB-STS Multilingual (avg 6 benchmarks) | Average 34 benchmarks |
|---|---|---|---|---|---|
| No triplet training | 0.3163 | 0.5089 | 0.7859 | 0.7322 | 0.5876 |
| QA training | 0.4489 | 0.5332 | 0.7843 | 0.7535 | 0.6237 |
| Retrieval training | 0.4272 | 0.5360 | 0.7766 | 0.7340 | 0.6154 |
| STS training | 0.1779 | 0.4519 | 0.7994 | 0.7651 | 0.5508 |
| Merged model | 0.4246 | 0.5309 | 0.7981 | 0.7640 | 0.6240 |
We see here that the task-specific trained models have the best performance on each task. MIRACL is primarily a question-answering benchmark, even if it’s called a retrieval one, and the QA-trained model outperforms all others on it, including the merged model. NanoBEIR is a more conventional information retrieval benchmark set, and we see the retrieval-trained model is the top performer on it. The semantic similarity (STS) model scores quite poorly on those benchmarks, but beats the others on explicitly STS tasks. For each category, the merged model performs more poorly than the single-task trained model.
But once again, if we average over all benchmarks, the merged model outperforms the others, although its score is only a very small improvement over that of the QA-trained model, which itself lags behind the STS-trained model on STS tasks.
We also merged just the QA and retrieval models and scored the resulting model on the same benchmarks:
| Model | MIRACL (avg 6 languages) | NanoBEIR (avg 13 benchmarks) | MMTEB-STS English (avg 9 benchmarks) | MMTEB-STS Multilingual (avg 6 benchmarks) | Average 34 tests | Average QA & IR (19 tests) | Average STS (15 tests) |
|---|---|---|---|---|---|---|---|
| Best task-trained model | 0.4489 | 0.5360 | 0.7994 | 0.7651 | 0.6237 | 0.5066 | 0.7857 |
| Merged model | 0.4246 | 0.5309 | 0.7981 | 0.7640 | 0.6240 | 0.4973 | 0.7845 |
| QA+Retrieval merged model | 0.4610 | 0.5404 | 0.7878 | 0.7498 | 0.6288 | 0.5153 | 0.7726 |
We see here that while we can improve performance on both question-answering and retrieval by merging models trained for those two tasks, adding the STS-trained model to the mix reduces task-specific performance in every category relative to the best task-trained model. This suggests that semantic similarity is, in some important respects, unlike QA and retrieval, and an STS-trained model is unsuited to merge with the other two.
This is likely because question-answering and retrieval involve matching short texts — questions and queries — to longer documents, while semantic similarity involves comparing documents of more similar length.
Wortsman et al. (2022) describe a selective approach to averaging that they call “greedy” merging. It involves taking one model, usually the best performing of a set of models, and then only adding those models to it that individually improve performance. With just three models, there was little point in using greedy merging for this experiment. However, we might imagine a case with more models and using a technique like this as a basis for determining the degree of similarity between tasks. We found here that semantic similarity is unlike the other two. We could then evaluate when one model can perform many tasks and when it’s more cost-effective to use a different model.
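For completeness, here is a minimal sketch of that greedy procedure, following the description in Wortsman et al. (2022). The `evaluate` function, which must return a held-out validation score for a given set of weights, is a placeholder you would supply, and `average_weights` is the uniform-averaging helper sketched earlier:

```python
def greedy_soup(state_dicts, evaluate):
    """Greedy soup: start from the best single model, then keep each
    additional model only if adding it to the average does not hurt
    the held-out validation score."""
    # Rank candidate models by their individual validation scores, best first
    ranked = sorted(state_dicts, key=evaluate, reverse=True)
    soup = [ranked[0]]
    best_score = evaluate(average_weights(soup))
    for candidate in ranked[1:]:
        trial_score = evaluate(average_weights(soup + [candidate]))
        if trial_score >= best_score:
            soup.append(candidate)
            best_score = trial_score
    return average_weights(soup)
```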
Soup’s on!
Model soups blend diversity into something greater than the sum of their parts. The value of this approach lies in its ability to offer greater consistency and robustness and to act as a safeguard against overtraining, at no additional training cost. Our experiments show that merging checkpoints or task-specialized models can enhance overall performance, even if it occasionally comes at the cost of task-specific peaks.
In the end, model soups offer a practical and very simple way to build more adaptable models, although the approach comes with some caveats. It’s not a panacea, and it’s applicable only when the models being merged are already very similar.
As they say on the internet, Your Mileage May Vary. But it’s cheap and easy to find out if model soups can help when you train your models.