


Today we're releasing jina-embeddings-v4, our new 3.8 billion parameter universal embedding model for text and images. It includes a set of task-specific LoRA adapters that optimize performance for the most popular retrieval tasks, including query-document retrieval, semantic matching, and code search. jina-embeddings-v4 achieves state-of-the-art retrieval performance on multimodal and multilingual tasks across MTEB, MMTEB, CoIR, LongEmbed, STS, Jina-VDR, CLIP, and ViDoRe benchmarks, with particular strength in processing visually rich content such as tables, charts, diagrams, and mixtures of them. The model supports both single-vector and multi-vector embeddings.

jina-embeddings-v4 is our most ambitious embedding model yet. As an open-source model, jina-embeddings-v4 outperforms leading closed-source embedding models from major providers, delivering 12% better performance than OpenAI's text-embedding-3-large
on multilingual retrieval (66.49 vs 59.27), a 28% improvement on long-document tasks (67.11 vs 52.42), better results than voyage-3
on code retrieval (71.59 vs 67.23), and matching Google's gemini-embedding-001
performance. This makes v4 the most capable open-source universal embedding model available today, offering researchers and developers enterprise-grade multimodal embedding capabilities with full transparency into the training process, architectural decisions, and model weights through our comprehensive technical report.

New Architecture
jina-embeddings-v4 is built on the Qwen2.5-VL-3B-Instruct
backbone (3.8B parameters). Text and image inputs are processed through a shared pathway: images are first converted to token sequences via a vision encoder, then both modalities are jointly processed by the language model decoder with contextual attention layers. Three task-specific LoRA adapters (60M parameters each) provide specialized optimization for retrieval, text-matching, and code tasks without modifying the frozen backbone weights. The architecture supports dual output modes: (1) single-vector embeddings (2048 dimensions, truncatable to 128) generated via mean pooling for efficient similarity search, and (2) multi-vector embeddings (128 dimensions per token) via projection layers for late-interaction retrieval strategies.
The upgrade from jina-embeddings-v3 to jina-embeddings-v4 represents a paradigm shift from text-only to multimodal embeddings. While v3 focused on optimizing text embeddings with task-specific LoRA adapters, v4 addresses the growing requirement for embedding both textual and visual content in unified representations.
Aspect | jina-embeddings-v3 | jina-embeddings-v4 |
---|---|---|
Backbone Model | jina-XLM-RoBERTa | Qwen2.5-VL-3B-Instruct |
Parameters (Base) | 559M | 3.8B |
Parameters (with adapters) | 572M | 3.8B + 60M per adapter |
Modalities | Text only | Text + Images (multimodal) |
Max Input Length | 8,192 tokens | 32,768 tokens |
Image Processing | None | Up to 20 megapixels, visually rich documents |
Multilingual Support | 89 languages | 29+ languages |
Vector Types | Single-vector only | Single-vector + Multi-vector (late interaction) |
Single-vector Dimensions | 1024 (MRL truncatable to 32) | 2048 (MRL truncatable to 128) |
Multi-vector Dimensions | Not available | 128 per token |
Task LoRA Specializations | • Asymmetric retrieval • Semantic similarity • Classification • Separation | • Asymmetric retrieval • Semantic similarity • Code retrieval |
Training Stages | 3-stage: Pre-training → Embedding fine-tuning → Adapter training | 2-stage: Joint pair training → Task-specific adapter training |
Loss Functions | InfoNCE, CoSent, Extended triplet loss | Joint InfoNCE + KL divergence for single/multi-vector |
Positional Encoding | RoPE (rotary base frequency tuning) | M-RoPE (Multimodal Rotary Position Embedding) |
Cross-modal Processing | N/A | Unified encoder (reduced modality gap) |
MRL Support | Yes | Yes |
Attention Implementation | FlashAttention2 | FlashAttention2 |
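To make the dual output modes described above more concrete, here is a minimal sketch of how the two embedding types could be derived from the backbone's last hidden states. The pooling and projection heads, tensor shapes, and hidden size are illustrative assumptions rather than the model's actual modules; see the technical report for the real implementation.

import torch
import torch.nn.functional as F

HIDDEN_DIM = 2048   # assumed backbone hidden size (illustrative)
MULTI_DIM = 128     # per-token dimension for late-interaction embeddings

# Hypothetical output heads standing in for the model's actual layers.
single_head = torch.nn.Identity()                    # pooled 2048-d vector is used directly
multi_head = torch.nn.Linear(HIDDEN_DIM, MULTI_DIM)  # per-token projection to 128 dims

def dual_outputs(last_hidden: torch.Tensor, attention_mask: torch.Tensor):
    """last_hidden: (batch, seq_len, HIDDEN_DIM); attention_mask: (batch, seq_len)."""
    mask = attention_mask.unsqueeze(-1).float()
    # (1) Single-vector mode: masked mean pooling over tokens, then L2 normalization.
    pooled = (last_hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    single_vec = F.normalize(single_head(pooled), dim=-1)      # (batch, 2048)
    # (2) Multi-vector mode: project every token for late-interaction retrieval.
    multi_vec = F.normalize(multi_head(last_hidden), dim=-1)   # (batch, seq_len, 128)
    return single_vec, multi_vec

# Toy run with random hidden states standing in for the backbone output.
hidden_states = torch.randn(2, 16, HIDDEN_DIM)
mask = torch.ones(2, 16, dtype=torch.long)
single_vec, multi_vec = dual_outputs(hidden_states, mask)
print(single_vec.shape, multi_vec.shape)  # torch.Size([2, 2048]) torch.Size([2, 16, 128])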
Backbone
The most significant architectural change in v4 is the switch of the backbone from XLM-RoBERTa
to Qwen2.5-VL-3B-Instruct
. This decision was driven by v4's core objective to create a universal embedding model that enables "true multimodal processing" where images are converted to token sequences and processed alongside text, eliminating the modality gap present in dual-encoder architectures.
The backbone selection aligns with several key design goals: Qwen2.5-VL's excellence in document understanding directly supports v4's strength in processing visually rich content like tables, charts, and screenshots. Its dynamic resolution capabilities enable v4 to handle images of up to 20 megapixels, as specified in the architecture. Its advanced positional encoding (M-RoPE) provides the foundation for v4's superior cross-modal alignment, with a 0.71 alignment score compared to 0.15 for OpenAI CLIP.
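As a rough illustration of what such an alignment score captures, the sketch below computes the mean cosine similarity between matched text-image pairs. The exact evaluation protocol is defined in the technical report, so treat this as a simplified stand-in for the reported metric; the data here is random toy data.

import numpy as np

def mean_pairwise_cosine(text_embs: np.ndarray, image_embs: np.ndarray) -> float:
    """Average cosine similarity over matched text/image pairs (row i pairs with row i)."""
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    v = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    return float(np.mean(np.sum(t * v, axis=1)))

# Toy data: with a unified encoder, an image embedding should land near its paired
# caption, pushing the score toward 1; a large modality gap pushes it toward 0.
rng = np.random.default_rng(0)
captions = rng.normal(size=(8, 2048))
images = captions + 0.3 * rng.normal(size=(8, 2048))  # images close to their captions
print(round(mean_pairwise_cosine(captions, images), 2))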
LoRA Adapters
V4 streamlines from v3's five tasks to three focused tasks, reflecting lessons learned about effectiveness and user adoption:
- Asymmetric retrieval (consolidating v3's query/passage adapters)
- Symmetric similarity (v3's text-matching equivalent for STS tasks)
- Code retrieval (learned from v2-code, missing in v3)
This consolidation removes v3's classification and separation adapters, focusing v4 on the most impactful embedding use cases - retrieval and STS.
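For readers unfamiliar with how adapters keep the 3.8B backbone frozen, here is a generic LoRA sketch: a trainable low-rank update scaled by alpha/r is added on top of a frozen linear layer, and only the small A and B matrices differ between tasks. This illustrates the general technique, not the model's actual adapter code; the rank and alpha values are arbitrary.

import torch

class LoRALinear(torch.nn.Module):
    """A frozen linear layer plus a trainable low-rank (LoRA) update."""

    def __init__(self, base: torch.nn.Linear, rank: int = 32, alpha: float = 32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # backbone weights stay frozen
            p.requires_grad = False
        self.lora_a = torch.nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = torch.nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus the task-specific low-rank correction (B @ A) applied to x.
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

# One adapter per task: only the small LoRA matrices differ between tasks.
layer = LoRALinear(torch.nn.Linear(2048, 2048))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable parameters in this layer: {trainable}")  # 2 * 32 * 2048 = 131072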
Output Embeddings
V4 introduces a dual-output system supporting both single-vector and multi-vector embeddings, whereas v3 only provided single-vector outputs. This addresses different retrieval scenarios:
- Single-vector mode: 2048-dimensional embeddings (truncatable to 128 via MRL) for efficient similarity search
- Multi-vector mode: 128 dimensions per token for late-interaction retrieval
This dual approach provides greater effectiveness with multi-vector representations, particularly in visually rich document retrieval, while maintaining efficiency for standard similarity tasks. The consistent 7-10% performance advantage of multi-vector over single-vector mode across visual tasks suggests that late interaction provides fundamentally better semantic matching for multimodal content.
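To illustrate how the two modes are used at query time, the sketch below contrasts single-vector cosine scoring (with optional Matryoshka-style truncation of the 2048-d vector) against the ColBERT-style MaxSim scoring commonly used for late-interaction retrieval. The embeddings here are random placeholders rather than model outputs; only the scoring logic is the point.

import numpy as np

def cosine_score(query_vec: np.ndarray, doc_vec: np.ndarray) -> float:
    """Single-vector mode: one dot product between L2-normalized embeddings."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vec / np.linalg.norm(doc_vec)
    return float(q @ d)

def truncate_mrl(vec: np.ndarray, dim: int = 128) -> np.ndarray:
    """Matryoshka-style truncation: keep the leading dims, then re-normalize."""
    v = vec[:dim]
    return v / np.linalg.norm(v)

def maxsim_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """Multi-vector mode (late interaction): match each query token with its most
    similar document token and sum the maxima."""
    q = query_tokens / np.linalg.norm(query_tokens, axis=1, keepdims=True)
    d = doc_tokens / np.linalg.norm(doc_tokens, axis=1, keepdims=True)
    sim = q @ d.T                       # (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())

rng = np.random.default_rng(0)
q_single, d_single = rng.normal(size=2048), rng.normal(size=2048)
print(cosine_score(q_single, d_single))                                       # full 2048 dims
print(cosine_score(truncate_mrl(q_single), truncate_mrl(d_single)))           # MRL-truncated to 128
print(maxsim_score(rng.normal(size=(8, 128)), rng.normal(size=(200, 128))))   # late interaction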
Parameter Size
While v4 is 6.7 times larger than v3 (3.8B vs 570M parameters), the text-only performance improvements are actually modest, suggesting the parameter scaling was primarily driven by multimodal requirements rather than text enhancement. On core text benchmarks, v4 achieves 66.49 on MMTEB compared to v3's 58.58 (14% improvement) and 55.97 on MTEB-EN versus v3's 54.33 (3% improvement). For code retrieval, v4 scores 71.59 on CoIR compared to v3's 55.07 (30% improvement), while long document performance shows v4 at 67.11 versus v3's 55.66 on LongEmbed (21% improvement). The substantial scaling becomes justified when considering v4's multimodal capabilities: achieving 84.11 nDCG@5 on visual document retrieval (Jina-VDR) and 90.17 on ViDoRe benchmarks - capabilities entirely absent in v3. The parameter increase thus represents our investment in multimodal functionality while maintaining competitive text performance, with the unified architecture eliminating the need for separate text and vision models while achieving 0.71 cross-modal alignment compared to 0.15 for traditional dual-encoder approaches.
Getting Started
For a quick vibe-check, try our text-to-image demo in the Search Foundation toolbox. We've prepared a collection of document images from our website, and you can also add your own image URLs. Simply type your query and press enter to see ranked results. You can treat it either as OCR or as content-based image retrieval, and feel free to try queries in non-English languages.
The demo is available at: https://jina.ai/api-dashboard/m0-image-rerank. Please note that using this demo consumes tokens from your primary API key. The demo may also feel a bit slow, since the server has to download every image from its URL and no image cache is implemented.
Via API
The code below shows how to use jina-embeddings-v4. You can pass a text string, a base64-encoded image, or an image URL. New users can get a Jina API key with 10 million free tokens.
curl https://api.jina.ai/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer JINA_API_KEY" \
  -d @- <<EOFEOF
{
  "model": "jina-embeddings-v4",
  "task": "text-matching",
  "input": [
    {"text": "A beautiful sunset over the beach"},
    {"text": "Un beau coucher de soleil sur la plage"},
    {"text": "海滩上美丽的日落"},
    {"text": "浜辺に沈む美しい夕日"},
    {"image": "https://i.ibb.co/nQNGqL0/beach1.jpg"},
    {"image": "https://i.ibb.co/r5w8hG8/beach2.jpg"},
    {"image": "iVBORw0KGgoAAAANSUhEUgAAABwAAAA4CAIAAABhUg/jAAAAMklEQVR4nO3MQREAMAgAoLkoFreTiSzhy4MARGe9bX99lEqlUqlUKpVKpVKpVCqVHksHaBwCA2cPf0cAAAAASUVORK5CYII="}
  ]
}
EOFEOF
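The same request can also be issued from Python. The sketch below mirrors the curl call above with the requests library; the response parsing assumes the endpoint's OpenAI-style layout with a "data" list, so adjust it if your response differs.

import requests

# Mirrors the curl example above: mixed text and image inputs in a single request.
resp = requests.post(
    "https://api.jina.ai/v1/embeddings",
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer JINA_API_KEY",  # replace with your API key
    },
    json={
        "model": "jina-embeddings-v4",
        "task": "text-matching",
        "input": [
            {"text": "A beautiful sunset over the beach"},
            {"text": "Un beau coucher de soleil sur la plage"},
            {"image": "https://i.ibb.co/nQNGqL0/beach1.jpg"},
        ],
    },
    timeout=60,
)
resp.raise_for_status()
# Assumes an OpenAI-style response body with a "data" list of embedding objects.
embeddings = [item["embedding"] for item in resp.json()["data"]]
print(len(embeddings), len(embeddings[0]))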
Due to limited GPU resources, our Embedding API currently supports documents up to 8K tokens in length, despite jina-embeddings-v4's native capability of handling up to 32K tokens. For applications requiring longer contexts beyond 8K tokens (such as Late Chunking), we recommend deploying our models through CSPs or self-hosting the model.
Via CSP Marketplaces
jina-embeddings-v4 will soon be available directly on AWS, Azure, and GCP at the prices listed there.
Via HuggingFace
For research and experimentation purposes, you can run the model locally from our Hugging Face page. We've prepared a Google Colab notebook that demonstrates how it works.
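As a starting point, a minimal local-usage sketch might look like the one below. The encode_text / encode_image helpers and their task / prompt_name arguments are assumptions based on the model card loaded via trust_remote_code; check the Hugging Face page for the exact, up-to-date signatures.

# A minimal local-usage sketch. The encode_text / encode_image helpers and their
# task / prompt_name arguments are assumptions based on the model card; consult
# https://huggingface.co/jinaai/jina-embeddings-v4 for the exact, up-to-date API.
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "jinaai/jina-embeddings-v4",
    trust_remote_code=True,        # loads the repository's custom embedding code
    torch_dtype=torch.float16,
)

# Single-vector text embeddings with the retrieval adapter (assumed arguments).
query_embeddings = model.encode_text(
    texts=["What does the Q3 revenue chart show?"],
    task="retrieval",
    prompt_name="query",
)

# Image embeddings for visually rich documents (assumed helper and arguments).
image_embeddings = model.encode_image(
    images=["https://i.ibb.co/nQNGqL0/beach1.jpg"],
    task="retrieval",
)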

Conclusion
jina-embeddings-v4 represents our most significant leap yet—a 3.8 billion parameter universal embedding model that processes text and images through a unified pathway, supporting both dense and late-interaction retrieval while outperforming proprietary models from Google, OpenAI and Voyage AI especially on visually rich document retrieval. But this capability didn't emerge in isolation; it's the culmination of four generations of solving fundamental limitations.
When we started with jina-embeddings-v1
back in early 2022, everyone assumed more data meant better performance. We proved the opposite—filtering 1.5 billion pairs down to 385 million high-quality examples outperformed much larger datasets. The lesson: curation beats collection.

But users kept hitting BERT's 512-token wall. Training on longer sequences seemed expensive, until jina-embeddings-v2
revealed an elegant solution: train short, deploy long. ALiBi's linear attention biases let models trained on 512 tokens seamlessly handle 8,192 tokens at inference. We got more capability for less compute.

jina-embeddings-v2
's success exposed another constraint—different tasks needed different optimizations. Rather than building separate models, jina-embeddings-v3 used tiny 60M LoRA adapters to customize a 570M base model for any task. One model became five specialized models.

Even with task specialization, we remained text-only while users needed visual understanding. Standard CLIP-based models like jina-clip-v1 and jina-clip-v2 use separate encoders, creating a "modality gap" where similar content in different formats ends up far apart. Like our recently released jina-reranker-m0, jina-embeddings-v4 eliminates this entirely: one unified pathway processes everything, removing the gap rather than bridging it.

Both jina-embeddings-v4 and jina-reranker-m0 share a fundamental shift: using LLMs as backbones instead of encoder-only models. This isn't coincidental; it reflects a deep advantage most miss. Encoder-only models create "modality gaps" where images cluster separately from text, while decoder-only models open up possibilities that weren't achievable with encoder-only architectures, including true mixed-modality representation and explainability.
Our key insight: embeddings and generation are both about understanding semantics. LLMs that excel at generation naturally excel at representation. We believe the future lies in unified architectures where embedding and reranking emerge from the same search foundation model—and that's exactly what Jina AI is building toward.