Back in 2018, Google dropped BERT and it was a game-changer for NLP, way before the current LLM wave. Even now, tons of Small Language Models are built on BERT. ModernBERT, released in Dec. 2024, takes what we've learned from recent LLM developments and applies it back to these smaller models. The key moves? Better parameter-efficiency, code understanding, and long-context handling.
In this post, we'll break down how ModernBERT stacks up against two models we know inside and out: jina-XLM-RoBERTa (the multilingual backbone behind jina-embeddings-v3) and RoBERTa-large. Let's look at each model:
- ModernBERT (Dec. 2024) is a recently released SLM, developed collaboratively by Answer.AI, LightOn, and HuggingFace. It leverages modern optimizations like RoPE for an 8,192-token context window and GeGLU layers, boosting performance while maintaining efficiency.
- jina-XLM-RoBERTa (Sept. 2024) is a multilingual text embedding model based on Meta's XLM-RoBERTa. While the original XLM-RoBERTa enhances RoBERTa using the XLM large multilingual dataset, jina-XLM-RoBERTa takes it further with extended context training, RoPE implementation, and FlashAttention-2 support. This model serves as the backbone for jina-embeddings-v3.
- RoBERTa-large (July 2019), developed by Meta, is an enhanced version of BERT with 355 million parameters. Through extended training, larger datasets, and innovations like dynamic masking, it has achieved impressive results on key benchmarks including GLUE, SQuAD, and RACE. This makes it well-suited for various NLP tasks from text classification to question answering.
By comparing these models across three core aspects, we aim to highlight ModernBERT's effective design choices for fellow model developers and identify key development insights for future BERT-like models. We'll also share our learnings from developing jina-embeddings-v3 and discuss planned improvements for jina-embeddings-v4 and jina-reranker-v3.
## ModernBERT's Parameter-Efficiency
Let's first examine ModernBERT's approach to parameter-efficiency - it brings in several key insights from recent LLM developments. ModernBERT leverages three core strategies: a deeper but thinner architecture, a controlled vocabulary size, and progressive model upscaling starting from smaller models.
### Deep-And-Thin Architecture
ModernBERT-large goes deeper with 28 layers, while jina-XLM-RoBERTa and RoBERTa-large run at 24. But here's the interesting part - it stays close to RoBERTa-large in parameter count despite those extra layers. jina-XLM-RoBERTa needs more parameters since it's handling 89 languages, while the other two focus just on English.
Most of a transformer's parameters come from attention and fully-connected layers. ModernBERT stays competitive size-wise by going "thinner" - it runs 2,624 intermediate (feed-forward) dimensions across 28 layers, compared to RoBERTa-large's 4,096 across 24 layers. This "deeper" but thinner setup lets it hit its performance targets without bloating the model.
| | ModernBERT-large | jina-XLM-RoBERTa | RoBERTa-large |
|---|---|---|---|
| Parameters | 400M | 550M | 355M |
| Hidden states | 1,024 | 1,024 | 1,024 |
| Intermediate dims | 2,624 | 4,096 | 4,096 |
| Attention heads | 16 | 16 | 16 |
| Layers | 28 | 24 | 24 |
| Vocabulary size | 50,368 | 250,002 | 50,265 |
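To see where those parameters go, here's a rough back-of-the-envelope estimate of each encoder stack. It's a sketch only: it ignores biases, layer norms, and embeddings, and treating ModernBERT's GeGLU feed-forward block as a three-matrix gated MLP is our assumption, so the figures are approximations rather than official counts.

```python
def stack_params(layers: int, hidden: int, intermediate: int, gated: bool) -> int:
    """Rough parameter count for a transformer stack, ignoring biases and layer norms."""
    attention = 4 * hidden * hidden                              # Wq, Wk, Wv, Wo projections
    feed_forward = (3 if gated else 2) * hidden * intermediate   # GeGLU adds a gate matrix
    return layers * (attention + feed_forward)

# Plugging in the table above (gated=True for ModernBERT's GeGLU is an assumption):
print(stack_params(28, 1024, 2624, gated=True) / 1e6)   # ~343M - deeper but thinner
print(stack_params(24, 1024, 4096, gated=False) / 1e6)  # ~302M - shallower but wider
```

The thinner feed-forward block keeps ModernBERT's per-layer cost roughly level with RoBERTa-large's, which is how the four extra layers fit in without ballooning the total.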
This approach lines up with Meta's MobileLLM research, which found that for smaller models, depth matters more than width when it comes to capturing complex patterns and driving performance. Essentially, the ability to process information through more transformer layers proves more valuable than having wider layers for parallel processing.
Let's look at the data on how this deep-and-thin architecture performs.
| Benchmark | ModernBERT-large | jina-XLM-RoBERTa | RoBERTa-large |
|---|---|---|---|
| STS12 | 72.6 | 72.7 | 68.9 |
| STS13 | 84.9 | 83.9 | 81.0 |
| STS14 | 77.5 | 77.7 | 74.8 |
| STS15 | 84.8 | 85.8 | 84.1 |
| STS16 | 79.4 | 79.6 | 78.6 |
| STS17 | 87.5 | 87.2 | 87.2 |
| TRECCOVID | 61.1 | 59.6 | 49.3 |
| FiQA | 44.4 | 40.0 | 40.7 |
| NFCorpus | 32.6 | 30.6 | 27.9 |
| SciFact | 68.6 | 65.5 | 63.1 |
| Average | 69.3 | 68.2 | 65.6 |
Take jina-XLM-RoBERTa - it builds on RoBERTa-large's shallower-and-wider architecture but pumps up the vocabulary from 50K to 250K tokens and trains on more data. Yet ModernBERT still edges it out, suggesting the architectural shift makes a real difference in efficiency.
### Vocabulary Size Matters
First, let's look at how vocabulary parameters are counted in transformers. For any transformer, vocabulary parameters = number of distinct tokens × hidden size. Take jina-XLM-RoBERTa: with 250K tokens and 1,024 hidden dimensions, it needs 256M parameters just for vocabulary encoding - before handling any actual language tasks!
But here is the thing: vocabulary weights don't contribute to attention mechanisms - they're just lookup tables. For SLMs working with fixed parameter budgets, a larger vocabulary means fewer parameters available for attention layers, which do the actual language processing. This explains why English-only ModernBERT-large outperforms multilingual jina-XLM-RoBERTa despite being smaller - jina-XLM-RoBERTa allocates more parameters (47%!) to support multiple languages. ModernBERT's focused vocabulary not only improves performance but also speeds up inference, making it particularly effective for resource-constrained applications.
So now if we look at just the core model parameters (excluding vocabulary weights), ModernBERT actually packs more computational power than its peers: it dedicates 19% more parameters to actual language modeling than jina-XLM-RoBERTa and 15% more than RoBERTa-large!
| Model Specs | ModernBERT-large | jina-XLM-RoBERTa | RoBERTa-large |
|---|---|---|---|
| Language Support | English Only | 89 Languages | English Only |
| Vocab Size | 50.4K | 250K | 50.3K |
| Total Params | 400M | 550M | 355M |
| Vocab Params | 51M | 256M | 51M |
| Vocab Param Ratio | 13% | 47% | 14% |
| Core Model Params | 349M | 294M | 304M |
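The vocabulary/core split in this table is just a multiplication away. Here's a minimal sketch of that arithmetic; the reported totals are rounded, so the derived core figures only approximately match the table.

```python
HIDDEN_SIZE = 1024  # shared by all three models

models = {
    # name: (vocabulary size, reported total parameters)
    "ModernBERT-large": (50_368, 400e6),
    "jina-XLM-RoBERTa": (250_002, 550e6),
    "RoBERTa-large":    (50_265, 355e6),
}

for name, (vocab_size, total) in models.items():
    vocab_params = vocab_size * HIDDEN_SIZE   # embedding lookup table only
    core_params = total - vocab_params        # what's left for attention + FFN layers
    print(f"{name}: vocab {vocab_params/1e6:.0f}M "
          f"({vocab_params/total:.0%} of total), core {core_params/1e6:.0f}M")
```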
### Model Upscaling by "Weight Tiling"
In building the jina-BERT-v2 backbone, we found training SLMs from scratch was resource-intensive and complex. ModernBERT tackles this with a smart initialization approach called weight tiling - essentially bootstrapping ModernBERT-large from the weights of its smaller base version.
This technique isn't entirely new - it builds on DeepMind's work with Gopher and shows up in Microsoft's Phi-2 models too. But its application here is particularly effective for addressing the SLM training bottleneck.
This initialization strategy gives ModernBERT-large a significant advantage - instead of starting cold from scratch, it leverages pre-learned patterns from its smaller counterpart. It's proven particularly effective for scaling up language models in this size range.
We find that a warm started model rapidly recovers from a high initial loss (due to the added parameters) to a loss quite close to that of the base model. We are able to expand 417M parameters by over 3× in size and maintain performance greater than an equivalent fresh model trained from scratch to convergence, implying that the gains were not limited to the start of training. However, at larger sizes, the relative gains achieved at convergence diminish, especially with expansions in width. —— Gopher
The cyclical weight tiling isn't just a convenience - it aligns well with how attention matrices naturally exhibit periodic patterns. Gopher's research shows this approach really shines for SLMs (sub-9B parameters), though the benefits start tapering off as you move into larger model territories.
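As a toy illustration of the idea - not ModernBERT's actual initialization code - cyclic tiling just repeats the smaller checkpoint's weight matrix until it fills the larger one:

```python
import numpy as np

def tile_weights(small: np.ndarray, target_shape: tuple[int, int]) -> np.ndarray:
    """Initialize a larger weight matrix by cyclically repeating a smaller one."""
    reps_rows = -(-target_shape[0] // small.shape[0])  # ceil division
    reps_cols = -(-target_shape[1] // small.shape[1])
    tiled = np.tile(small, (reps_rows, reps_cols))     # repeat along both axes
    return tiled[:target_shape[0], :target_shape[1]]   # crop to the target size

# Toy example: grow a base-sized 768x768 projection into a large-sized 1024x1024 one.
base_weight = np.random.randn(768, 768).astype(np.float32)
large_weight = tile_weights(base_weight, (1024, 1024))
print(large_weight.shape)  # (1024, 1024)
```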
## ModernBERT's Code Modeling
ModernBERT brings a specialized approach to code understanding with its code-optimized tokenizer and training data. This fine-tuning for code processing pays off in both comprehension and retrieval tasks.
We ran a benchmark using the jina-embeddings-v2-code corpus, comparing three models as backbones: ModernBERT, jina-XLM-RoBERTa, and RoBERTa-large. The test? CodeSearchNet - matching text descriptions to code snippets. ModernBERT came out ahead of both alternatives on most tasks and on average.
| Task | ModernBERT-large | jina-XLM-RoBERTa | RoBERTa-large |
|---|---|---|---|
| AdvRetrieval | 0.342 | 0.363 | 0.331 |
| QueryRetrieval.python | 0.521 | 0.530 | 0.525 |
| QueryRetrieval.java | 0.679 | 0.633 | 0.644 |
| QueryRetrieval.javascript | 0.755 | 0.768 | 0.732 |
| QueryRetrieval.php | 0.815 | 0.781 | 0.755 |
| QueryRetrieval.ruby | 0.729 | 0.744 | 0.722 |
| QueryRetrieval.go | 0.833 | 0.809 | 0.796 |
| Retrieval.go | 0.778 | 0.750 | 0.759 |
| Retrieval.java | 0.840 | 0.792 | 0.796 |
| Retrieval.javascript | 0.817 | 0.792 | 0.757 |
| Retrieval.php | 0.852 | 0.805 | 0.796 |
| Retrieval.python | 0.849 | 0.816 | 0.787 |
| Retrieval.ruby | 0.849 | 0.796 | 0.803 |
| Avg. | 0.743 | 0.721 | 0.708 |
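To get a feel for this kind of benchmark, here's a simplified text-to-code retrieval sketch using mean-pooled embeddings. It is not our exact evaluation pipeline, and the checkpoint below is the public ModernBERT release rather than a retrieval-fine-tuned model, so treat the scores as illustrative only.

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "answerdotai/ModernBERT-large"  # swap in any backbone to compare
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).eval()

def embed(texts: list[str]) -> torch.Tensor:
    """Mean-pool the last hidden states into one normalized vector per input."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state         # (batch, seq_len, dim)
    mask = batch["attention_mask"].unsqueeze(-1)           # zero out padding positions
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    return torch.nn.functional.normalize(pooled, dim=-1)

query = "read a JSON file and return its contents as a dictionary"
snippets = [
    "def load_config(path):\n    import json\n    with open(path) as f:\n        return json.load(f)",
    "def add(a, b):\n    return a + b",
]
scores = embed([query]) @ embed(snippets).T                # cosine similarities
print(scores)  # with a retrieval-tuned backbone, the JSON snippet should score higher
```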
### The Tokenizer Edge
Let's dig into why ModernBERT handles code so well - it uses the OLMo tokenizer, which was specifically trained on code, rather than standard BERT/RoBERTa tokenizers.
A tokenizer breaks UTF-8 text into tokens that get mapped to vectors - these are what the model actually processes. During training, it learns to combine frequently occurring character sequences into single tokens. The difference? A standard tokenizer might break `init` into `in` + `it`, missing the programming context. ModernBERT's code-aware tokenizer keeps `init` intact as a single token.
Here's where it gets interesting with space handling: ModernBERT preserves Python's leading spaces as single tokens and differentiates between 4 vs 8 spaces - crucial for code structure. Meanwhile, jina-XLM-RoBERTa collapses all consecutive spaces into a single `_`, and RoBERTa-large treats each space as its own token. This means ModernBERT's encoder gets cleaner, more meaningful input when processing code, while the others are working with fractured, less coherent tokens.
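You can check this yourself with the Hugging Face tokenizers. The checkpoints below are the public releases we assume here, with xlm-roberta-large standing in for jina-XLM-RoBERTa's tokenizer:

```python
from transformers import AutoTokenizer

snippet = "def init(self):\n        return 1"  # note the 8-space indent

# Public checkpoints assumed here; xlm-roberta-large stands in for
# the jina-XLM-RoBERTa tokenizer.
for name in ("answerdotai/ModernBERT-large",
             "FacebookAI/xlm-roberta-large",
             "FacebookAI/roberta-large"):
    print(name, AutoTokenizer.from_pretrained(name).tokenize(snippet))
```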
## ModernBERT's Long-Context Handling
ModernBERT has made significant strides in processing long text, thanks to its extensive training corpus (300B tokens with 8,192-token samples) and advanced techniques like combined global and local attention.
To evaluate long-document handling capabilities, we used the MLDR dataset - a comprehensive long-text benchmark spanning 13 languages. Since ModernBERT currently supports only English, we focused on MLDR's English subset to benchmark ModernBERT against jina-XLM-RoBERTa. While both of these models can handle 8K-token inputs, RoBERTa-large was excluded from this benchmark due to its 512-token limit, which is insufficient for long-text analysis.
| | ModernBERT-large | jina-XLM-RoBERTa |
|---|---|---|
| MLDR-en | 0.351 | 0.290 |
ModernBERT's superior performance isn't just due to its extensive long-text training - it's largely thanks to its innovative combination of global and local attention mechanisms. Unlike jina-XLM-RoBERTa, which applies computationally expensive global attention in every layer, ModernBERT takes a more efficient approach. It alternates between global attention (used in every third layer with a RoPE theta of 160,000) and local attention (a 128-token sliding window with a RoPE theta of 10,000). This hybrid strategy maintains high performance while dramatically reducing training time.
In ModernBERT, every third layer employs global attention with a RoPE theta of 160,000 and the remaining layers use a 128 token, local sliding window attention with a RoPE theta of 10,000. —— ModernBERT
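Read literally, that schedule looks like the sketch below; exactly which layer indices count as "every third" is our assumption here:

```python
NUM_LAYERS = 28  # ModernBERT-large

def attention_config(layer_idx: int) -> dict:
    """Global attention on every third layer, local sliding-window attention elsewhere."""
    if layer_idx % 3 == 0:  # assumption: layers 0, 3, 6, ... are the global ones
        return {"type": "global", "rope_theta": 160_000, "window": None}
    return {"type": "local", "rope_theta": 10_000, "window": 128}

schedule = [attention_config(i) for i in range(NUM_LAYERS)]
n_global = sum(cfg["type"] == "global" for cfg in schedule)
print(f"{n_global} global / {NUM_LAYERS - n_global} local layers")  # 10 global / 18 local
```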
## The Bitter Lesson?
Scaling laws and the bitter lesson suggest that major performance improvements come primarily from increasing parameter counts and training data. This principle guided our approach of expanding the corpus and using LoRA for task-specific adaptations.
However, ModernBERT's success has revealed that we undervalued the power of architectural optimization. It demonstrates that SLMs can achieve exceptional results through better data-model efficiency, without necessarily scaling up parameters. A recent Stella Embeddings technical report reinforces this finding, indicating that current embedding model training methods can be improved without increasing corpus or model size.
Moving forward, we anticipate lower computational costs and smaller model sizes as we gain deeper insights into data utilization and implement ModernBERT's techniques. In the short term, we can implement straightforward improvements outlined in the ModernBERT paper - specifically integrating more code-related data and adopting a code-friendly tokenizer. More complex changes, like switching to a deep-and-thin architecture or bootstrapping large models from smaller ones, will require building backbone models from scratch - a more mid-term initiative.
While ModernBERT's efficiency is remarkable, its text-only limitation points to future challenges. As multimodal embedding models gain popularity, our next challenge is developing smarter, faster, and more capable search foundation models that can handle inputs for multimodal applications. These applications demand even longer context windows - an efficiency challenge that remains to be solved.
## Conclusion
Throughout this post, we've explored how ModernBERT advances BERT-family models through three key innovations: its deep-and-thin architecture, optimized tokenizer, and efficient scaling using weight tiling. These improvements enable ModernBERT to deliver outstanding performance in a relatively compact size, surpassing both RoBERTa-large and jina-XLM-RoBERTa across various tasks. ModernBERT demonstrates that architectural improvements can matter more than parameter size, opening doors for more efficient models. Its successful use of weight tiling shows how progressive scaling can reduce training costs while preserving or even boosting performance. Additionally, its compact vocabulary and targeted optimizations suggest growing opportunities for specialized SLMs in resource-limited settings.