Overview
Jina Embeddings V4 is a 3.8-billion-parameter multimodal embedding model that provides unified text and image representations. Built on the Qwen2.5-VL-3B-Instruct backbone, it supports both single-vector embeddings and multi-vector embeddings in the late-interaction style, addressing limitations of traditional CLIP-style dual-encoder models. The model incorporates three task-specific LoRA adapters (60M parameters each) that optimize performance for different retrieval scenarios, including asymmetric query-document retrieval, semantic text similarity, and code search, without modifying the frozen backbone weights. It performs strongly on visually rich content such as tables, charts, diagrams, screenshots, and mixed-media formats through a unified processing pathway that reduces the modality gap present in conventional architectures. The model is multilingual and handles input texts of up to 32,768 tokens, with images resized to 20 megapixels, making it suitable for document retrieval and cross-modal search applications across languages and domains.
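
A minimal usage sketch is shown below, assuming the Hugging Face checkpoint jinaai/jina-embeddings-v4 ships custom code exposing encode_text and encode_image helpers with a task argument, following the pattern of earlier Jina embedding releases; the method and parameter names here are assumptions, so consult the official model card for the definitive interface.

```python
from transformers import AutoModel

# Load the checkpoint together with its bundled custom code (assumed model ID).
model = AutoModel.from_pretrained("jinaai/jina-embeddings-v4", trust_remote_code=True)

# Single-vector text embeddings for asymmetric retrieval
# (encode_text, task, and prompt_name are assumed names).
query_embeddings = model.encode_text(
    texts=["How does late-interaction retrieval work?"],
    task="retrieval",
    prompt_name="query",
)

# Image embeddings produced through the same unified pathway
# (encode_image is an assumed name; the URL is hypothetical).
image_embeddings = model.encode_image(
    images=["https://example.com/chart.png"],
    task="retrieval",
)
```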
Methods
Jina Embeddings V4 implements a unified multimodal language model architecture that differs from CLIP-style dual-encoder approaches. The model processes inputs through a shared pathway where images are first converted to token sequences via a vision encoder, then both text and image modalities are processed together by the language model decoder with contextual attention layers. This architecture supports two output modes to accommodate different use cases: single-vector embeddings that produce 2048-dimensional vectors truncatable down to 128 dimensions through Matryoshka Representation Learning, generated via mean pooling for efficient similarity search; and multi-vector embeddings that output 128 dimensions per token via projection layers for late interaction style retrieval. The model includes three task-specific LoRA adapters that provide specialized optimization: the retrieval adapter uses prefix-based asymmetric encoding with hard negatives training for query-document scenarios, the text-matching adapter employs CoSENT loss for semantic similarity tasks, and the code adapter focuses on natural language-to-code retrieval applications. Training occurs in two phases: initial pair training using contrastive InfoNCE loss with both text-text and text-image pairs from over 300 sources, followed by task-specific fine-tuning of the three LoRA adapters using triplet-based methods and specialized loss functions tailored to each domain's requirements.
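
To make the multi-vector output mode concrete, the sketch below scores a query-document pair with ColBERT-style MaxSim over 128-dimensional token embeddings: each query token is matched to its most similar document token and the similarities are summed. This is an illustrative implementation of late-interaction scoring, not the model's own code, and details such as normalization may differ.

```python
import numpy as np

def late_interaction_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """ColBERT-style MaxSim: for each query token, take the highest cosine
    similarity against any document token, then sum over query tokens."""
    # L2-normalize token embeddings so dot products are cosine similarities.
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sim = q @ d.T                        # (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())  # max over doc tokens, sum over query tokens

# Toy example with random 128-dimensional token embeddings.
rng = np.random.default_rng(0)
query_tokens = rng.normal(size=(6, 128))    # 6 query tokens
doc_tokens = rng.normal(size=(300, 128))    # 300 document tokens
print(late_interaction_score(query_tokens, doc_tokens))
```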
Performance
Jina Embeddings V4 achieves competitive performance across multiple benchmark categories. On visual document retrieval, it scores 72.19 average on the JinaVDR benchmark compared to 64.50 for ColPali-v1.2, and 84.11 average on ViDoRe compared to 83.90 for ColPali, with the multi-vector mode reaching 90.17 on ViDoRe. For cross-modal retrieval, the model scores 84.11 on CLIP Benchmark, compared to jina-clip-v2 (81.12) and nllb-clip-large-siglip (83.19). In text retrieval tasks, it achieves 55.97 on MTEB-en and 66.49 on MMTEB, with notable performance in long document processing at 67.11 on LongEmbed compared to 55.66 for its predecessor. The model demonstrates solid semantic text similarity performance with 85.89 on English STS tasks and 72.70 on multilingual STS benchmarks. Code retrieval capabilities reach 71.59 on CoIR benchmark, though specialized models like voyage-code-3 (77.33) achieve higher scores in this domain. The model shows improved cross-modal alignment with a score of 0.71 compared to 0.15 for OpenAI CLIP, addressing the modality gap issue in multimodal models. Multi-vector mode consistently outperforms single-vector mode on visually rich tasks, while single-vector mode provides efficient performance for standard retrieval scenarios.
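
The reported cross-modal alignment improvement (0.71 versus 0.15 for OpenAI CLIP) refers to how close matching text and image embeddings sit in the shared space. The exact measurement protocol is not given here; one common way to quantify it, shown below with hypothetical paired embedding matrices, is the average cosine similarity between each text embedding and its paired image embedding.

```python
import numpy as np

def mean_paired_cosine(text_embs: np.ndarray, image_embs: np.ndarray) -> float:
    """Average cosine similarity between the i-th text and i-th image embedding."""
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    v = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    return float((t * v).sum(axis=1).mean())

# Toy example with random stand-ins for paired caption/image embeddings.
rng = np.random.default_rng(0)
texts = rng.normal(size=(100, 2048))
images = rng.normal(size=(100, 2048))
print(mean_paired_cosine(texts, images))  # near 0 for random pairs, higher when aligned
```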
Best Practice
To use Jina Embeddings V4 effectively, select the LoRA adapter that matches your application. Use the 'retrieval' adapter for asymmetric query-document retrieval, where queries and documents have different structures, and apply the proper prefixes to distinguish query from passage content. The 'text-matching' adapter suits semantic similarity tasks and symmetric retrieval, where the goal is to find similar content rather than answers to queries, making it appropriate for document clustering, duplicate detection, and content recommendation systems. For programming-related applications, the 'code' adapter is optimized for natural language-to-code retrieval, code-to-code similarity search, and technical question answering. Choose the output mode based on your performance and efficiency requirements. Single-vector embeddings offer efficient similarity search and suit storage-constrained environments, since their dimensions can be truncated from 2048 down to 128-512 with acceptable quality trade-offs. Multi-vector embeddings provide higher precision for complex retrieval tasks, particularly for visually rich documents where late-interaction scoring captures fine-grained relationships. The unified architecture processes mixed text-image inputs without separate encoders or OCR preprocessing for visual documents, and the model's cross-modal alignment and multilingual support make it suitable for international applications. For production deployments, account for the 60M-parameter overhead per LoRA adapter when planning memory: all three adapters can be kept loaded simultaneously with less than 2% additional memory footprint, enabling flexible task switching at inference time.
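
The snippet below sketches two of these recommendations: a lookup from use case to adapter name, and Matryoshka truncation of a single-vector embedding with re-normalization so cosine similarities stay comparable. The adapter names come from this section, but the mapping and how the task value is ultimately passed to the encoding API are assumptions for illustration.

```python
import numpy as np

# Adapter names as described above; how they are passed to the encoding API
# is an assumption and may differ from the official interface.
TASK_FOR_USE_CASE = {
    "query_document_search": "retrieval",
    "semantic_similarity": "text-matching",
    "code_search": "code",
}

def truncate_matryoshka(embedding: np.ndarray, dim: int = 256) -> np.ndarray:
    """Keep the leading `dim` dimensions of a Matryoshka-trained embedding
    and re-normalize so cosine similarities remain comparable."""
    truncated = embedding[:dim]
    return truncated / np.linalg.norm(truncated)

# Example: shrink a 2048-dimensional single-vector embedding to 256 dimensions.
full = np.random.default_rng(1).normal(size=2048)
small = truncate_matryoshka(full, dim=256)
print(small.shape, float(np.linalg.norm(small)))  # (256,) ~1.0
```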