
jina-embeddings-v4

Universal embedding model for multimodal and multilingual retrieval
Release Post

License: CC-BY-NC-4.0
Release Date: 2025-06-24
Input: Text, Image, PDF
Output: Vector, Multi-Vector

Model Details
Parameters: 3.8B
Input Token Length: 32K
Input Image Size: 768×28×28
Output Dimension: 2048
Language Support: 🌍 Multilingual

Related Models
jina-embeddings-v3
jina-clip-v2

Tags
multimodal-embedding
document-retrieval
multilingual
multi-vector
long-context
production
matryoshka

Available via
Jina API, Commercial License, Hugging Face
Publications (1)
arXiv
June 24, 2025
jina-embeddings-v4: Universal Embeddings for Multimodal Multilingual Retrieval

Overview

Jina Embeddings V4 is a 3.8-billion-parameter multimodal embedding model that provides unified text and image representations. Built on the Qwen2.5-VL-3B-Instruct backbone, it supports both single-vector embeddings and late-interaction-style multi-vector embeddings, addressing limitations of traditional CLIP-style dual-encoder models. The model incorporates three task-specific LoRA adapters (60M parameters each) that optimize performance for different retrieval scenarios, including asymmetric query-document retrieval, semantic text similarity, and code search, without modifying the frozen backbone weights. Through a unified processing pathway that reduces the modality gap found in conventional architectures, it performs strongly on visually rich content such as tables, charts, diagrams, screenshots, and mixed-media formats. The model is multilingual, handles input texts of up to 32,768 tokens, and accepts images resized to 20 megapixels, making it suitable for document retrieval and cross-modal search applications across languages and domains.
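
As a concrete starting point, the sketch below shows how the model might be loaded from Hugging Face and used to embed a query, a passage, and an image into the same space. The encode_text/encode_image helpers and their task/prompt_name arguments follow the conventions in the release materials; exact method names and signatures live in the checkpoint's custom code and may differ, so treat this as an assumption to verify against the model card.

```python
# A minimal sketch, assuming the jinaai/jina-embeddings-v4 checkpoint's
# custom code exposes encode_text/encode_image helpers with a `task`
# argument, as described in the release materials -- verify these names
# against the actual model card before relying on them.
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "jinaai/jina-embeddings-v4",
    trust_remote_code=True,  # the encode helpers ship with the repo's custom code
)

# Queries and passages share one encoder; asymmetric prefixes are
# selected via prompt_name when using the retrieval adapter.
query_emb = model.encode_text(
    texts=["How does late-interaction retrieval work?"],
    task="retrieval",
    prompt_name="query",
)
passage_emb = model.encode_text(
    texts=["Late interaction scores each query token against document tokens."],
    task="retrieval",
    prompt_name="passage",
)

# Images pass through the same backbone and land in the same embedding space.
image_emb = model.encode_image(
    images=["https://example.com/chart.png"],  # hypothetical URL
    task="retrieval",
)
```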

Methods

Jina Embeddings V4 implements a unified multimodal language model architecture that differs from CLIP-style dual-encoder approaches. The model processes inputs through a shared pathway where images are first converted to token sequences via a vision encoder, then both text and image modalities are processed together by the language model decoder with contextual attention layers. This architecture supports two output modes to accommodate different use cases: single-vector embeddings that produce 2048-dimensional vectors truncatable down to 128 dimensions through Matryoshka Representation Learning, generated via mean pooling for efficient similarity search; and multi-vector embeddings that output 128 dimensions per token via projection layers for late interaction style retrieval. The model includes three task-specific LoRA adapters that provide specialized optimization: the retrieval adapter uses prefix-based asymmetric encoding with hard negatives training for query-document scenarios, the text-matching adapter employs CoSENT loss for semantic similarity tasks, and the code adapter focuses on natural language-to-code retrieval applications. Training occurs in two phases: initial pair training using contrastive InfoNCE loss with both text-text and text-image pairs from over 300 sources, followed by task-specific fine-tuning of the three LoRA adapters using triplet-based methods and specialized loss functions tailored to each domain's requirements.
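
To make the multi-vector mode concrete, here is a small, self-contained sketch of late-interaction (MaxSim) scoring as popularized by ColBERT: each query token vector is matched against its best document token vector, and the per-token maxima are summed. The random arrays stand in for the 128-dimensions-per-token output described above; this illustrates the scoring rule, not the model's exact implementation.

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """Late-interaction (MaxSim) scoring over per-token embeddings.

    query_vecs: (num_query_tokens, dim), doc_vecs: (num_doc_tokens, dim),
    e.g. dim = 128 as in the multi-vector output mode described above.
    """
    # L2-normalize so dot products become cosine similarities.
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sim = q @ d.T  # (num_query_tokens, num_doc_tokens)
    # For each query token, keep its best-matching document token, then sum.
    return float(sim.max(axis=1).sum())

# Toy usage: rank two "documents" against one "query".
rng = np.random.default_rng(0)
query = rng.normal(size=(6, 128))
doc_a = rng.normal(size=(40, 128))
doc_b = rng.normal(size=(55, 128))
ranked = sorted(
    [("doc_a", maxsim_score(query, doc_a)), ("doc_b", maxsim_score(query, doc_b))],
    key=lambda t: t[1],
    reverse=True,
)
print(ranked)
```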

Performance

Jina Embeddings V4 achieves competitive performance across multiple benchmark categories. On visual document retrieval, it averages 72.19 on the JinaVDR benchmark versus 64.50 for ColPali-v1.2, and 84.11 on ViDoRe versus 83.90 for ColPali, with the multi-vector mode reaching 90.17 on ViDoRe. For cross-modal retrieval, it scores 84.11 on the CLIP Benchmark, ahead of jina-clip-v2 (81.12) and nllb-clip-large-siglip (83.19). In text retrieval, it achieves 55.97 on MTEB-en and 66.49 on MMTEB, with notably strong long-document performance: 67.11 on LongEmbed versus 55.66 for its predecessor. Semantic textual similarity is solid at 85.89 on English STS tasks and 72.70 on multilingual STS benchmarks. Code retrieval reaches 71.59 on the CoIR benchmark, though specialized models such as voyage-code-3 (77.33) score higher in this domain. The model also shows much tighter cross-modal alignment, scoring 0.71 versus 0.15 for OpenAI CLIP, addressing the modality gap common in multimodal models. Multi-vector mode consistently outperforms single-vector mode on visually rich tasks, while single-vector mode offers efficient performance for standard retrieval scenarios.

Best Practice

To use Jina Embeddings V4 effectively, select the LoRA adapter that matches your application. Use the 'retrieval' adapter for asymmetric query-document retrieval, where queries and documents have different structures, and apply the proper prefixes to distinguish query from passage content. The 'text-matching' adapter suits semantic similarity and symmetric retrieval tasks, where the goal is finding similar content rather than answers to queries, making it appropriate for document clustering, duplicate detection, and content recommendation. For programming-related applications, the 'code' adapter is optimized for natural-language-to-code retrieval, code-to-code similarity search, and technical question answering. Choose the output mode based on your performance and efficiency requirements: single-vector embeddings offer efficient similarity search and suit storage-constrained environments, with Matryoshka truncation allowing reduction from 2048 to 128-512 dimensions at an acceptable quality trade-off, while multi-vector embeddings provide higher precision for complex retrieval tasks, particularly for visually rich documents where late-interaction scoring captures fine-grained relationships. The unified architecture processes mixed text-image inputs without separate encoders or OCR preprocessing for visual documents, and its cross-modal alignment and multilingual support make it suitable for international applications. For production deployments, account for the 60M-parameter overhead per LoRA adapter when planning memory; all three adapters can be kept resident simultaneously with less than 2% additional memory footprint, enabling flexible task switching at inference time.
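
The dimension trade-off above is easy to apply client-side. Because the 2048-dimensional single-vector output is trained with Matryoshka Representation Learning, a shorter embedding is obtained by keeping the leading components and re-normalizing; a minimal sketch:

```python
import numpy as np

def truncate_matryoshka(vec: np.ndarray, dim: int = 256) -> np.ndarray:
    """Shrink a Matryoshka-trained embedding by keeping the first `dim`
    components and re-normalizing, so cosine similarity stays meaningful."""
    if not 128 <= dim <= vec.shape[-1]:
        raise ValueError("dim should be between 128 and the full output size")
    v = vec[..., :dim]
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Example: a full 2048-d embedding reduced to 256 dims for cheaper storage.
full = np.random.default_rng(1).normal(size=2048)  # stand-in for a real embedding
small = truncate_matryoshka(full, dim=256)
print(small.shape, float(np.linalg.norm(small)))  # (256,) 1.0
```
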
Blogs that mention this model
June 25, 2025 • 12 minutes read
Jina Embeddings v4: Universal Embeddings for Multimodal Multilingual Retrieval
Jina Embeddings v4 is a 3.8 billion parameter universal embedding model for multimodal and multilingual retrieval that supports both single-vector and multi-vector embedding outputs.
Jina AI
March 07, 2025 • 14 minutes read
Long-Context Embedding Models are Blind Beyond 4K Tokens
We investigate embedding models on new "needle-in-haystack" tasks and find that beyond 4K tokens, they're just rolling dice - even with exact lexical matches or query expansion, they can't tell signal from noise in long context.
Saahil Ognawala
Alex C-G
January 22, 2025 • 10 minutes read
What Should We Learn From ModernBERT?
Bigger training data, efficient parameter sizing, and a deep-but-thin architecture, ModernBERT sets a direction for future BERT-like models.
Nan Wang
Alex C-G