Overview
Jina CLIP v2 revolutionizes multimodal AI by bridging the gap between visual and textual understanding across 89 languages. This model solves critical challenges in global e-commerce, content management, and cross-cultural communication by enabling accurate image-text matching regardless of language barriers. For businesses expanding internationally or managing multilingual content, it eliminates the need for separate models per language or complex translation pipelines. The model particularly shines in scenarios requiring precise visual search across language boundaries, such as global marketplace product discovery or multilingual digital asset management.
Methods
At its core, Jina CLIP v2 employs a dual-encoder architecture that pairs a Jina XLM-RoBERTa text encoder (561M parameters) with an EVA02-L14 vision encoder (304M parameters). The text encoder handles 89 languages and accepts inputs of up to 8,192 tokens, while the vision encoder processes high-resolution images up to 512x512 pixels. The model also incorporates Matryoshka representation learning, which allows embedding dimensions to be truncated from 1024 down to 64 while preserving most of the performance. Both encoders project their outputs into a shared semantic space, where similar concepts align regardless of their original modality or language.
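To make the dual-encoder flow concrete, the sketch below embeds text and an image into the shared space and compares them with cosine similarity. It is a minimal example, assuming the Hugging Face checkpoint jinaai/jina-clip-v2 and its encode_text/encode_image helpers loaded via trust_remote_code; the sample texts, image URL, and the 512-dimension truncation are illustrative only.

```python
# Minimal sketch: encode text and an image into the shared semantic space.
# Assumes the "jinaai/jina-clip-v2" checkpoint exposes encode_text / encode_image
# helpers when loaded with trust_remote_code (per its model card).
import numpy as np
from transformers import AutoModel

model = AutoModel.from_pretrained("jinaai/jina-clip-v2", trust_remote_code=True)

texts = [
    "A red handbag with a gold clasp",    # English
    "Un bolso rojo con cierre dorado",    # Spanish
]
images = ["https://example.com/red-handbag.jpg"]  # hypothetical URL; local PIL images also work

# truncate_dim taps the Matryoshka property: a prefix of the 1024-d vector is still usable.
text_embs = model.encode_text(texts, truncate_dim=512)
image_embs = model.encode_image(images, truncate_dim=512)

def cosine(a, b):
    # Both modalities live in the same space, so cosine similarity is meaningful across them.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

for text, emb in zip(texts, text_embs):
    print(text, "->", cosine(emb, image_embs[0]))
```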
Performance
The model achieves state-of-the-art performance with 98.0% accuracy on Flickr30k image-to-text retrieval tasks, surpassing both its predecessor and NLLB-CLIP-SigLIP. In multilingual scenarios, it demonstrates up to 4% improvement over NLLB-CLIP-SigLIP in cross-lingual image retrieval tasks, despite having fewer parameters than its largest competitor. The model maintains strong performance even when embeddings are compressed: reducing dimensions by 75% still preserves over 99% of performance across text, image, and cross-modal tasks. On the comprehensive Multilingual MTEB benchmarks, it achieves 69.86% on retrieval and 67.77% on semantic similarity tasks, performing competitively with specialized text embedding models.
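The dimension-compression claim is easiest to see as an operation on stored vectors: keep the leading components of each Matryoshka embedding and re-normalize before computing cosine similarity. The snippet below is illustrative only; full_embs stands in for real jina-clip-v2 embeddings (random vectors are used here just to keep the example self-contained).

```python
# Illustration only: how Matryoshka truncation is applied to precomputed embeddings.
import numpy as np

def truncate_and_renormalize(full_embs: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components, then re-normalize so cosine similarity still applies."""
    truncated = full_embs[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.clip(norms, 1e-12, None)

# Stand-in for (n, 1024) L2-normalized jina-clip-v2 embeddings.
full_embs = np.random.randn(4, 1024)
full_embs /= np.linalg.norm(full_embs, axis=1, keepdims=True)

embs_256 = truncate_and_renormalize(full_embs, 256)  # 75% smaller index footprint
similarity = embs_256 @ embs_256.T                   # cosine similarities at 256 dims
print(similarity.shape)
```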
Best Practice
For optimal deployment, users should consider several key factors. The model requires CUDA-capable hardware for efficient processing, with memory requirements scaling with batch size and image resolution. To optimize API costs and performance, resize images to 512x512 pixels before processing; larger images are automatically tiled, which increases token usage and processing time. The model excels at matching images with descriptive text across languages, making it particularly effective for e-commerce product search, content recommendation systems, and visual search applications. It may struggle with abstract concepts, fine-grained visual detail analysis, or highly specialized domain-specific content. When using the Matryoshka representation feature, weigh dimension reduction against accuracy: 64-dimension embeddings still perform well, but critical applications may benefit from higher dimensions.
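One way to follow the resizing advice is to downscale and pad images client-side before sending them for embedding. The sketch below is a hedged example using Pillow; the 512x512 target matches the model's native resolution, while the file names and white-padding choice are arbitrary assumptions.

```python
# Minimal preprocessing sketch: fit an image onto a 512x512 canvas before embedding,
# so the model does not need to tile a larger input.
from PIL import Image

def prepare_image(path: str, size: int = 512) -> Image.Image:
    """Downscale the longest side to `size` (preserving aspect ratio) and pad to a square canvas."""
    img = Image.open(path).convert("RGB")
    img.thumbnail((size, size), Image.LANCZOS)            # only downscales, never upscales
    canvas = Image.new("RGB", (size, size), (255, 255, 255))
    offset = ((size - img.width) // 2, (size - img.height) // 2)
    canvas.paste(img, offset)
    return canvas

ready = prepare_image("product_photo.jpg")   # hypothetical file name
ready.save("product_photo_512.jpg")
```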