


Today we're releasing jina-embeddings-v4, our new 3.8 billion parameter universal embedding model for text and images. It includes a set of task-specific LoRA adapters that optimize performance for the most popular retrieval tasks, including query-document retrieval, semantic matching, and code search. jina-embeddings-v4 achieves state-of-the-art retrieval performance on multimodal and multilingual tasks across MTEB, MMTEB, CoIR, LongEmbed, STS, Jina-VDR, CLIP, and ViDoRe benchmarks, with particular strength in processing visually rich content such as tables, charts, diagrams, and mixtures of them. The model supports both single-vector and multi-vector embeddings.

jina-embeddings-v4 is our most ambitious embedding model yet. As an open-source model, jina-embeddings-v4 outperforms leading closed-source embedding models from major providers, delivering 12% better performance than OpenAI's text-embedding-3-large
on multilingual retrieval (66.49 vs 59.27), a 28% improvement on long-document tasks (67.11 vs 52.42), better results than voyage-3
on code retrieval (71.59 vs 67.23), and matching Google's gemini-embedding-001
performance. This makes v4 the most capable open-source universal embedding model available today, offering researchers and developers enterprise-grade multimodal embedding capabilities with full transparency into the training process, architectural decisions, and model weights through our comprehensive technical report.

New Architecture
jina-embeddings-v4 is built on the Qwen2.5-VL-3B-Instruct
backbone (3.8B parameters). Text and image inputs are processed through a shared pathway: images are first converted to token sequences via a vision encoder, then both modalities are jointly processed by the language model decoder with contextual attention layers. Three task-specific LoRA adapters (60M parameters each) provide specialized optimization for retrieval, text-matching, and code tasks without modifying the frozen backbone weights. The architecture supports dual output modes: (1) single-vector embeddings (2048 dimensions, truncatable to 128) generated via mean pooling for efficient similarity search, and (2) multi-vector embeddings (128 dimensions per token) via projection layers for late-interaction retrieval strategies.
The upgrade from jina-embeddings-v3 to jina-embeddings-v4 represents a paradigm shift from text-only to multimodal embeddings. While v3 focused on optimizing text embeddings with task-specific LoRA adapters, v4 addresses the growing requirement for embedding both textual and visual content in unified representations.
Aspect | jina-embeddings-v3 | jina-embeddings-v4 |
---|---|---|
Backbone Model | jina-XLM-RoBERTa | Qwen2.5-VL-3B-Instruct |
Parameters (Base) | 559M | 3.8B |
Parameters (with adapters) | 572M | 3.8B + 60M per adapter |
Modalities | Text only | Text + Images (multimodal) |
Max Input Length | 8,192 tokens | 32,768 tokens |
Image Processing | None | Up to 20 megapixels, visually rich documents |
Multilingual Support | 89 languages | 29+ languages |
Vector Types | Single-vector only | Single-vector + Multi-vector (late interaction) |
Single-vector Dimensions | 1024 (MRL truncatable to 32) | 2048 (MRL truncatable to 128) |
Multi-vector Dimensions | Not available | 128 per token |
Task LoRA Specializations | • Asymmetric retrieval • Semantic similarity • Classification • Separation | • Asymmetric retrieval • Semantic similarity • Code retrieval |
Training Stages | 3-stage: Pre-training → Embedding fine-tuning → Adapter training | 2-stage: Joint pair training → Task-specific adapter training |
Loss Functions | InfoNCE, CoSent, Extended triplet loss | Joint InfoNCE + KL divergence for single/multi-vector |
Positional Encoding | RoPE (rotary base frequency tuning) | M-RoPE (Multimodal Rotary Position Embedding) |
Cross-modal Processing | N/A | Unified encoder (reduced modality gap) |
MRL Support | Yes | Yes |
Attention Implementation | FlashAttention2 | FlashAttention2 |
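To make the dual output modes described above more concrete, here is a minimal sketch of how the two embedding types could be derived from the backbone's last hidden states. The pooling and projection heads, tensor shapes, and hidden size are illustrative assumptions rather than the model's actual modules; see the technical report for the real implementation.

import torch
import torch.nn.functional as F

HIDDEN_DIM = 2048   # assumed backbone hidden size (illustrative)
MULTI_DIM = 128     # per-token dimension for late-interaction embeddings

# Hypothetical output heads standing in for the model's actual layers.
single_head = torch.nn.Identity()                    # pooled 2048-d vector is used directly
multi_head = torch.nn.Linear(HIDDEN_DIM, MULTI_DIM)  # per-token projection to 128 dims

def dual_outputs(last_hidden: torch.Tensor, attention_mask: torch.Tensor):
    """last_hidden: (batch, seq_len, HIDDEN_DIM); attention_mask: (batch, seq_len)."""
    mask = attention_mask.unsqueeze(-1).float()
    # (1) Single-vector mode: masked mean pooling over tokens, then L2 normalization.
    pooled = (last_hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    single_vec = F.normalize(single_head(pooled), dim=-1)      # (batch, 2048)
    # (2) Multi-vector mode: project every token for late-interaction retrieval.
    multi_vec = F.normalize(multi_head(last_hidden), dim=-1)   # (batch, seq_len, 128)
    return single_vec, multi_vec

# Toy run with random hidden states standing in for the backbone output.
hidden_states = torch.randn(2, 16, HIDDEN_DIM)
mask = torch.ones(2, 16, dtype=torch.long)
single_vec, multi_vec = dual_outputs(hidden_states, mask)
print(single_vec.shape, multi_vec.shape)  # torch.Size([2, 2048]) torch.Size([2, 16, 128])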
Backbone
The most significant architectural change in v4 is the switch of the backbone from XLM-RoBERTa
to Qwen2.5-VL-3B-Instruct
. This decision was driven by v4's core objective to create a universal embedding model that enables "true multimodal processing" where images are converted to token sequences and processed alongside text, eliminating the modality gap present in dual-encoder architectures.
The backbone selection aligns with several key design goals: Qwen2.5-VL's excellence in document understanding directly supports v4's strength in processing visually rich content like tables, charts, and screenshots. Its dynamic resolution capabilities enable v4 to handle images of up to 20 megapixels, as specified in the architecture. Its advanced positional encoding (M-RoPE) provides the foundation for v4's superior cross-modal alignment, with a 0.71 alignment score compared to 0.15 for OpenAI CLIP.
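As a rough illustration of what such an alignment score captures, the sketch below computes the mean cosine similarity between matched text-image pairs. The exact evaluation protocol is defined in the technical report, so treat this as a simplified stand-in for the reported metric; the data here is random toy data.

import numpy as np

def mean_pairwise_cosine(text_embs: np.ndarray, image_embs: np.ndarray) -> float:
    """Average cosine similarity over matched text/image pairs (row i pairs with row i)."""
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    v = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    return float(np.mean(np.sum(t * v, axis=1)))

# Toy data: with a unified encoder, an image embedding should land near its paired
# caption, pushing the score toward 1; a large modality gap pushes it toward 0.
rng = np.random.default_rng(0)
captions = rng.normal(size=(8, 2048))
images = captions + 0.3 * rng.normal(size=(8, 2048))  # images close to their captions
print(round(mean_pairwise_cosine(captions, images), 2))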
LoRA Adapters
V4 streamlines from v3's five tasks to three focused tasks, reflecting lessons learned about effectiveness and user adoption:
- Asymmetric retrieval (consolidating v3's query/passage adapters)
- Symmetric similarity (v3's text-matching equivalent for STS tasks)
- Code retrieval (learned from v2-code, missing in v3)
This consolidation removes v3's classification and separation adapters, focusing v4 on the most impactful embedding use cases - retrieval and STS.
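For readers unfamiliar with how adapters keep the 3.8B backbone frozen, here is a generic LoRA sketch: a trainable low-rank update scaled by alpha/r is added on top of a frozen linear layer, and only the small A and B matrices differ between tasks. This illustrates the general technique, not the model's actual adapter code; the rank and alpha values are arbitrary.

import torch

class LoRALinear(torch.nn.Module):
    """A frozen linear layer plus a trainable low-rank (LoRA) update."""

    def __init__(self, base: torch.nn.Linear, rank: int = 32, alpha: float = 32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # backbone weights stay frozen
            p.requires_grad = False
        self.lora_a = torch.nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = torch.nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus the task-specific low-rank correction (B @ A) applied to x.
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

# One adapter per task: only the small LoRA matrices differ between tasks.
layer = LoRALinear(torch.nn.Linear(2048, 2048))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable parameters in this layer: {trainable}")  # 2 * 32 * 2048 = 131072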
Output Embeddings
V4 introduces a dual-output system supporting both single-vector and multi-vector embeddings, whereas v3 only provided single-vector outputs. This addresses different retrieval scenarios:
- Single-vector mode: 2048-dimensional embeddings (truncatable to 128 via MRL) for efficient similarity search
- Multi-vector mode: 128 dimensions per token for late-interaction retrieval
This dual approach provides greater effectiveness with multi-vector representations, particularly in visually rich document retrieval, while maintaining efficiency for standard similarity tasks. The consistent 7-10% performance advantage of multi-vector over single-vector mode across visual tasks suggests that late interaction provides fundamentally better semantic matching for multimodal content.
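To illustrate how the two modes are used at query time, the sketch below contrasts single-vector cosine scoring (with optional Matryoshka-style truncation of the 2048-d vector) against the ColBERT-style MaxSim scoring commonly used for late-interaction retrieval. The embeddings here are random placeholders rather than model outputs; only the scoring logic is the point.

import numpy as np

def cosine_score(query_vec: np.ndarray, doc_vec: np.ndarray) -> float:
    """Single-vector mode: one dot product between L2-normalized embeddings."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vec / np.linalg.norm(doc_vec)
    return float(q @ d)

def truncate_mrl(vec: np.ndarray, dim: int = 128) -> np.ndarray:
    """Matryoshka-style truncation: keep the leading dims, then re-normalize."""
    v = vec[:dim]
    return v / np.linalg.norm(v)

def maxsim_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """Multi-vector mode (late interaction): match each query token with its most
    similar document token and sum the maxima."""
    q = query_tokens / np.linalg.norm(query_tokens, axis=1, keepdims=True)
    d = doc_tokens / np.linalg.norm(doc_tokens, axis=1, keepdims=True)
    sim = q @ d.T                       # (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())

rng = np.random.default_rng(0)
q_single, d_single = rng.normal(size=2048), rng.normal(size=2048)
print(cosine_score(q_single, d_single))                                       # full 2048 dims
print(cosine_score(truncate_mrl(q_single), truncate_mrl(d_single)))           # MRL-truncated to 128
print(maxsim_score(rng.normal(size=(8, 128)), rng.normal(size=(200, 128))))   # late interaction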
Parameter Size
While v4 is 6.7 times larger than v3 (3.8B vs 570M parameters), the text-only performance improvements are actually modest, suggesting the parameter scaling was primarily driven by multimodal requirements rather than text enhancement. On core text benchmarks, v4 achieves 66.49 on MMTEB compared to v3's 58.58 (14% improvement) and 55.97 on MTEB-EN versus v3's 54.33 (3% improvement). For code retrieval, v4 scores 71.59 on CoIR compared to v3's 55.07 (30% improvement), while long document performance shows v4 at 67.11 versus v3's 55.66 on LongEmbed (21% improvement). The substantial scaling becomes justified when considering v4's multimodal capabilities: achieving 84.11 nDCG@5 on visual document retrieval (Jina-VDR) and 90.17 on ViDoRe benchmarks - capabilities entirely absent in v3. The parameter increase thus represents our investment in multimodal functionality while maintaining competitive text performance, with the unified architecture eliminating the need for separate text and vision models while achieving 0.71 cross-modal alignment compared to 0.15 for traditional dual-encoder approaches.
Getting Started
For a quick vibe-check, try our text-to-image demo in the Search Foundation toolbox. We've prepared a collection of document images from our website, and you can also add your own image URLs. Simply type your query and press enter to see ranked results. You can treat it either as OCR or as content-based image retrieval, and feel free to try queries in non-English languages.
The demo is available at: https://jina.ai/api-dashboard/m0-image-rerank. Please note that using this demo consumes tokens from your primary API key. The demo may also feel a bit slow, since the server has to download every image from its URL and no image cache is implemented.
Via API
The code below shows how to use jina-embeddings-v4. You can pass a text string, a base64-encoded image, or an image URL. New users can get a Jina API key with 10 million free tokens.
curl https://api.jina.ai/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer JINA_API_KEY" \
  -d @- <<EOFEOF
{
  "model": "jina-embeddings-v4",
  "task": "text-matching",
  "input": [
    {"text": "A beautiful sunset over the beach"},
    {"text": "Un beau coucher de soleil sur la plage"},
    {"text": "海滩上美丽的日落"},
    {"text": "浜辺に沈む美しい夕日"},
    {"image": "https://i.ibb.co/nQNGqL0/beach1.jpg"},
    {"image": "https://i.ibb.co/r5w8hG8/beach2.jpg"},
    {"image": "iVBORw0KGgoAAAANSUhEUgAAABwAAAA4CAIAAABhUg/jAAAAMklEQVR4nO3MQREAMAgAoLkoFreTiSzhy4MARGe9bX99lEqlUqlUKpVKpVKpVCqVHksHaBwCA2cPf0cAAAAASUVORK5CYII="}
  ]
}
EOFEOF
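The same request can also be issued from Python. The sketch below mirrors the curl call above with the requests library; the response parsing assumes the endpoint's OpenAI-style layout with a "data" list, so adjust it if your response differs.

import requests

# Mirrors the curl example above: mixed text and image inputs in a single request.
resp = requests.post(
    "https://api.jina.ai/v1/embeddings",
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer JINA_API_KEY",  # replace with your API key
    },
    json={
        "model": "jina-embeddings-v4",
        "task": "text-matching",
        "input": [
            {"text": "A beautiful sunset over the beach"},
            {"text": "Un beau coucher de soleil sur la plage"},
            {"image": "https://i.ibb.co/nQNGqL0/beach1.jpg"},
        ],
    },
    timeout=60,
)
resp.raise_for_status()
# Assumes an OpenAI-style response body with a "data" list of embedding objects.
embeddings = [item["embedding"] for item in resp.json()["data"]]
print(len(embeddings), len(embeddings[0]))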
Due to limited GPU resources, our Embedding API currently supports documents up to 8K tokens in length, despite jina-embeddings-v4's native capability of handling up to 32K tokens. For applications requiring longer contexts beyond 8K tokens (such as Late Chunking), we recommend deploying our models through CSPs or self-hosting the model.
Via CSP Marketplaces
jina-embeddings-v4 will soon be available directly on AWS, Azure, and GCP at the prices listed there.
Via HuggingFace
For research and experimentation purposes, you can run the model locally from our Hugging Face page. We've prepared a Google Colab notebook that demonstrates how it works.
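As a starting point, a minimal local-usage sketch might look like the one below. The encode_text / encode_image helpers and their task / prompt_name arguments are assumptions based on the model card loaded via trust_remote_code; check the Hugging Face page for the exact, up-to-date signatures.

# A minimal local-usage sketch. The encode_text / encode_image helpers and their
# task / prompt_name arguments are assumptions based on the model card; consult
# https://huggingface.co/jinaai/jina-embeddings-v4 for the exact, up-to-date API.
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "jinaai/jina-embeddings-v4",
    trust_remote_code=True,        # loads the repository's custom embedding code
    torch_dtype=torch.float16,
)

# Single-vector text embeddings with the retrieval adapter (assumed arguments).
query_embeddings = model.encode_text(
    texts=["What does the Q3 revenue chart show?"],
    task="retrieval",
    prompt_name="query",
)

# Image embeddings for visually rich documents (assumed helper and arguments).
image_embeddings = model.encode_image(
    images=["https://i.ibb.co/nQNGqL0/beach1.jpg"],
    task="retrieval",
)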

Conclusion
jina-embeddings-v4 represents our most significant leap yet—a 3.8 billion parameter universal embedding model that processes text and images through a unified pathway, supporting both dense and late-interaction retrieval while outperforming proprietary models from Google, OpenAI and Voyage AI especially on visually rich document retrieval. But this capability didn't emerge in isolation; it's the culmination of four generations of solving fundamental limitations.
When we started with jina-embeddings-v1
back in early 2022, everyone assumed more data meant better performance. We proved the opposite—filtering 1.5 billion pairs down to 385 million high-quality examples outperformed much larger datasets. The lesson: curation beats collection.

But users kept hitting BERT's 512-token wall. Training on longer sequences seemed expensive, until jina-embeddings-v2
revealed an elegant solution: train short, deploy long. ALiBi's linear attention biases let models trained on 512 tokens seamlessly handle 8,192 tokens at inference. We got more capability for less compute.

jina-embeddings-v2
's success exposed another constraint—different tasks needed different optimizations. Rather than building separate models, jina-embeddings-v3 used tiny 60M LoRA adapters to customize a 570M base model for any task. One model became five specialized models.

Even with task specialization, we remained text-only while users needed visual understanding. Standard CLIP-based models like jina-clip-v1 and jina-clip-v2 use separate encoders, creating a "modality gap" where similar content in different formats ends up far apart. Like our recently released jina-reranker-m0, jina-embeddings-v4 eliminates this entirely: one unified pathway processes everything, removing the gap rather than bridging it.

Both jina-embeddings-v4 and jina-reranker-m0 share a fundamental shift: using LLMs as backbones instead of encoder-only models. This isn't coincidental; it reflects a deep advantage most miss. Encoder-only models create "modality gaps" where images cluster separately from text, while decoder-only models open up possibilities that weren't achievable with encoder-only architectures, including true mixed-modality representation and explainability.
Our key insight: embeddings and generation are both about understanding semantics. LLMs that excel at generation naturally excel at representation. We believe the future lies in unified architectures where embedding and reranking emerge from the same search foundation model—and that's exactly what Jina AI is building toward.