jina-clip-v2

Multilingual multimodal embeddings for text and images
Release Post
License: CC-BY-NC-4.0
Release Date: 2024-11-05
Input: Image, Text
Output: Vector
Model Details
Parameters: 865M
Input Token Length: 8K
Input Image Size: 512x512
Output Dimension: 1024
Language Support
šŸŒ Multilingual support
Related Models
jina-clip-v1
Tags
multimodal-embedding
image-text-alignment
multilingual
large-context
instruction-tuned
masked-region-learning
production
cross-lingual-retrieval
zero-shot-classification
modality-gap-aware
Available via
Jina API, Commercial License, AWS SageMaker, Microsoft Azure, Google Cloud, Hugging Face
Publications (1)
ICLR 2025
December 12, 2024
jina-clip-v2: Multilingual Multimodal Embeddings for Text and Images

Overview

Jina CLIP v2 bridges visual and textual understanding across 89 languages. The model addresses core challenges in global e-commerce, content management, and cross-cultural communication by enabling accurate image-text matching across language barriers. For businesses expanding internationally or managing multilingual content, it removes the need for separate per-language models or complex translation pipelines. It is particularly strong in scenarios that require precise visual search across language boundaries, such as product discovery in global marketplaces or multilingual digital asset management.

Methods

At its core, Jina CLIP v2 employs a dual-encoder architecture that combines a Jina XLM-RoBERTa text encoder (561M parameters) with an EVA02-L14 vision encoder (304M parameters). The text encoder processes content in 89 languages with a context window of 8,192 tokens (the 8K input token length listed above), while the vision encoder handles high-resolution images up to 512x512 pixels. The model introduces Matryoshka representation learning, which enables dynamic embedding dimension adjustment from 1024 down to 64 dimensions while preserving performance. Both text and images pass through their respective encoders and are projected into a shared semantic space, where similar concepts align regardless of their original modality or language.
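As a concrete illustration of the shared embedding space and Matryoshka truncation, here is a minimal sketch using the Hugging Face checkpoint. It assumes the checkpoint exposes encode_text/encode_image helpers with a truncate_dim argument, as jina-clip checkpoints have documented; the sample texts, image path, and similarity helper are illustrative, not part of an official API.

```python
# Minimal sketch: cross-modal encoding with Matryoshka truncation.
# Assumes the Hugging Face checkpoint provides encode_text/encode_image
# with a truncate_dim argument; check the model card for the exact API.
import numpy as np
from transformers import AutoModel

model = AutoModel.from_pretrained("jinaai/jina-clip-v2", trust_remote_code=True)

texts = ["A scenic mountain lake", "Ein malerischer Bergsee"]  # English + German
images = ["photo.jpg"]  # hypothetical local path or URL

# Full 1024-dim embeddings in the shared text-image space
text_emb = model.encode_text(texts)
image_emb = model.encode_image(images)

# Matryoshka: keep only the leading 256 dimensions instead of all 1024
text_emb_256 = model.encode_text(texts, truncate_dim=256)
image_emb_256 = model.encode_image(images, truncate_dim=256)

def cosine(a, b):
    # Re-normalize after truncation so scores stay comparable
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(a @ b)

# Both language variants should land near the same image at either dimension
print(cosine(text_emb[0], image_emb[0]), cosine(text_emb_256[0], image_emb_256[0]))
```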

Performance

The model achieves state-of-the-art performance with 98.0% accuracy on Flickr30k image-to-text retrieval, surpassing both its predecessor and NLLB-CLIP-SigLIP. In multilingual scenarios, it shows up to a 4% improvement over NLLB-CLIP-SigLIP on cross-lingual image retrieval, despite having fewer parameters than its largest competitor. Performance also holds up under compression: reducing embedding dimensions by 75% (from 1024 to 256) still preserves over 99% of performance across text, image, and cross-modal tasks. On the Multilingual MTEB benchmarks, it scores 69.86% on retrieval and 67.77% on semantic similarity, competitive with specialized text embedding models.

Best Practice

For optimal deployment, consider several key factors. The model requires CUDA-capable hardware for efficient processing, with memory requirements scaling with batch size and image resolution. To optimize API costs and performance, resize images to 512x512 pixels before processing: larger images are automatically tiled, increasing token usage and processing time. The model excels at matching images with descriptive text across languages, but may struggle with abstract concepts, fine-grained visual detail, or highly specialized domain-specific content. It is particularly effective for e-commerce product search, content recommendation systems, and visual search applications. When using the Matryoshka representation feature, weigh dimension reduction against accuracy: 64-dimension embeddings remain strong, but critical applications may benefit from higher dimensions. A client-side sketch of the resize practice follows.
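The sketch below applies the resize guidance with Pillow before calling the Jina embeddings endpoint. The payload shape (object inputs with text/image fields, a dimensions parameter for Matryoshka truncation) is an assumption based on the public API documentation and should be verified against the current docs; the API key placeholder and file name are hypothetical.

```python
# Minimal sketch of the resize-before-upload practice for the Jina API.
# Payload details are assumptions; confirm against the current API docs.
import base64
import io

import requests
from PIL import Image

def to_512_b64(path: str) -> str:
    """Downscale an image to fit within 512x512 and return it base64-encoded."""
    img = Image.open(path).convert("RGB")
    img.thumbnail((512, 512))  # avoids automatic tiling of larger inputs
    buf = io.BytesIO()
    img.save(buf, format="JPEG")
    return base64.b64encode(buf.getvalue()).decode()

resp = requests.post(
    "https://api.jina.ai/v1/embeddings",
    headers={"Authorization": "Bearer <YOUR_JINA_API_KEY>"},  # placeholder key
    json={
        "model": "jina-clip-v2",
        "dimensions": 512,  # Matryoshka: smaller vectors, slight accuracy cost
        "input": [
            {"text": "warm winter jacket"},
            {"image": to_512_b64("product.jpg")},  # hypothetical file
        ],
    },
)
resp.raise_for_status()
print([d["embedding"][:4] for d in resp.json()["data"]])
```
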
Blogs that mention this model
April 08, 2025 • 21 minutes read
jina-reranker-m0: Multilingual Multimodal Document Reranker
Introducing jina-reranker-m0, our new multilingual multimodal reranker for retrieving visual documents, with SOTA performance on multilingual long documents and code searching tasks.
Jina AI
January 07, 2025 • 6 minutes read
Text-Image Global Contrastive Alignment and Token-Patch Local Alignment
CLIP can visualize token-patch similarities; however, this is more of a post-hoc interpretability trick than robust, model-native "attention". Here's why.
Han Xiao
December 16, 2024 • 2 minutes read
Re·Search: Order 2024 Yearbook of Search Foundation Advances
Discover Re·Search, our premium yearbook showcasing our best research articles and search foundation models in 2024. Featuring a spot UV-coated hardcover, 160 full-color pages, and meticulous design throughout. Available worldwide at $35, shipping included.
Jina AI
December 12, 2024 • 12 minutes read
Scaling Test-Time Compute For Embedding Models
Better results scale with compute—more on learning, more on search. A good pretrained model takes you far, but test-time compute takes you further. It's time to recognize this paradigm of test-time compute, even for embedding models.
Han Xiao
November 21, 2024 • 9 minutes read
Jina CLIP v2: Multilingual Multimodal Embeddings for Text and Images
Jina-CLIP v2, a 0.9B multimodal embedding model with multilingual support of 89 languages, high image resolution at 512x512, and Matryoshka representations.
Jina AI