This model has been superseded by newer models.

jina-clip-v1

Multimodal embedding model for images and English text
Release Post
License: Apache-2.0
Release Date: 2024-06-05
Input: Image, Text
Output: Vector
Model Details
Parameters: 223M
Input Token Length: 8K
Input Image Size: 224x224
Output Dimension: 768
Language Support
🇺🇸 English
Related Models
jina-clip-v2
jina-embeddings-v3
jina-colbert-v2
Tags
multimodal-embedding
image-text-alignment
english-only
zero-shot-classification
cross-modal-search
long-text-support
unified-embeddings
text-to-text
text-to-image
visual-semantic
Available via
Jina API, AWS SageMaker, Microsoft Azure, Hugging Face
Publications (1)
ICML 2024
May 30, 2024
Jina CLIP: Your CLIP Model Is Also Your Text Retriever

Overview

Jina CLIP v1 revolutionizes multimodal AI by being the first model to excel equally in both text-to-text and text-to-image retrieval tasks. Unlike traditional CLIP models that struggle with text-only scenarios, this model achieves state-of-the-art performance across all retrieval combinations while maintaining a remarkably compact 223M parameter size. The model addresses a critical industry challenge by eliminating the need for separate models for text and image processing, reducing system complexity and computational overhead. For teams building search systems, recommendation engines, or content analysis tools, Jina CLIP v1 offers a single, efficient solution that handles both text and visual content with exceptional accuracy.
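
Because the same checkpoint is published on Hugging Face (see "Available via" above), a minimal sketch of using one model for both modalities might look like the following. The `jinaai/jina-clip-v1` repository name matches the public listing, but the `encode_text`/`encode_image` helper methods and their arguments are assumptions based on typical usage and may differ from the current model card.

```python
# Minimal usage sketch (assumed helper methods, not verified against the
# current model card): one checkpoint embeds both text and images.
from transformers import AutoModel

model = AutoModel.from_pretrained("jinaai/jina-clip-v1", trust_remote_code=True)

# Text and images land in the same 768-dimensional embedding space.
text_emb = model.encode_text(["A photo of a blue cat", "A photo of a red cat"])
image_emb = model.encode_image(["https://example.com/blue-cat.jpg"])  # hypothetical URL

print(text_emb.shape, image_emb.shape)  # both (n, 768)
```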

Methods

The model's architecture combines an adapted Jina BERT v2 text encoder with the EVA-02 image encoder from the Beijing Academy of Artificial Intelligence. The text encoder supports sequences up to 8,192 tokens - over 100 times longer than the original CLIP's 77-token limit - while the image encoder divides each 224x224 input into 16x16-pixel patch tokens. Training follows a three-step approach: first, aligning image-caption pairs while maintaining text understanding through interleaved text-pair training; second, incorporating AI-generated longer text descriptions of images; and finally, using hard negative text triplets to sharpen semantic distinctions. This methodology lets the model perform well on both short captions and detailed textual descriptions while preserving strong visual understanding.
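
As a conceptual illustration (not the official training code), the alignment stages can be thought of as mixing two symmetric InfoNCE losses: one over image-caption pairs and one over text pairs, so text-only retrieval quality is preserved while the modalities are aligned. The temperature and the equal weighting below are illustrative assumptions, not the paper's exact schedule.

```python
# Conceptual sketch of the joint contrastive objective used for alignment.
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of paired embeddings."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature              # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

def joint_loss(img_emb, caption_emb, text_emb_a, text_emb_b):
    # Mix image-caption alignment with text-pair alignment in one step;
    # the 50/50 weighting is an assumption for illustration only.
    return 0.5 * info_nce(img_emb, caption_emb) + 0.5 * info_nce(text_emb_a, text_emb_b)
```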

Performance

Jina CLIP v1 demonstrates remarkable improvements over OpenAI's original CLIP across all benchmarks. In text-only retrieval, it achieves a 165% performance increase with a score of 0.429 compared to CLIP's 0.162. For image-related tasks, it shows consistent improvements: 2% better in text-to-image retrieval (0.899), 6% in image-to-text retrieval (0.803), and 12% in image-to-image retrieval (0.916). The model particularly shines in zero-shot visual classification tasks, successfully categorizing images without prior training on specific domains. When evaluated on standard benchmarks like MTEB for text retrieval, CIFAR-100 for image tasks, and Flickr8k/30k and MSCOCO Captions for cross-modal performance, it consistently outperforms specialized single-modality models while maintaining competitive performance in cross-modal tasks.
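
To make the zero-shot classification claim concrete, here is a rough sketch of how such classification typically works with a CLIP-style model: class names become text prompts, the image is embedded once, and the most similar prompt wins. It reuses the assumed `encode_text`/`encode_image` helpers from the earlier sketch; the prompt template is also an assumption.

```python
# Zero-shot classification via cosine similarity between the image embedding
# and embeddings of class-name prompts (no task-specific training).
import numpy as np

def zero_shot_classify(model, image_url: str, class_names: list[str]) -> str:
    prompts = [f"a photo of a {name}" for name in class_names]
    text_emb = np.asarray(model.encode_text(prompts))
    img_emb = np.asarray(model.encode_image([image_url]))[0]

    # Normalize, then score each class prompt against the image.
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    img_emb = img_emb / np.linalg.norm(img_emb)
    scores = text_emb @ img_emb
    return class_names[int(scores.argmax())]
```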

Best Practice

To effectively deploy Jina CLIP v1, teams should consider both its capabilities and resource requirements. The model processes images in 224x224 pixel tiles, with each tile consuming 1,000 tokens of processing capacity. For optimal performance, implement efficient image preprocessing to match these dimensions. While the model excels at both short and long text processing, it currently only supports English language input. Teams should carefully consider token usage: text requires approximately 1.1 tokens per word, while images are processed in tiles (e.g., a 750x500 pixel image requires 12 tiles, consuming 12,000 tokens). The model is available through both the Jina Embeddings API and as an open-source release on Hugging Face under the Apache 2.0 license, offering flexibility in deployment options. For production environments, consider using the AWS Marketplace or Azure deployment options, which provide optimized infrastructure setups.
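
The tile-based accounting described above can be estimated before sending a request. The sketch below applies exactly the rules quoted in this section (224x224 tiles at roughly 1,000 tokens each, about 1.1 tokens per word), so the constants come from this page rather than from the API reference and should be treated as approximations.

```python
# Rough token-cost estimator based on the rules stated above.
import math

TILE_SIZE = 224
TOKENS_PER_TILE = 1_000
TOKENS_PER_WORD = 1.1   # approximation quoted above, not an exact tokenizer count

def image_token_cost(width_px: int, height_px: int) -> int:
    tiles = math.ceil(width_px / TILE_SIZE) * math.ceil(height_px / TILE_SIZE)
    return tiles * TOKENS_PER_TILE

def text_token_cost(text: str) -> int:
    return math.ceil(len(text.split()) * TOKENS_PER_WORD)

# The 750x500 example from this section: 4 x 3 = 12 tiles -> 12,000 tokens.
assert image_token_cost(750, 500) == 12_000
```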
Blogs that mention this model
April 08, 2025 • 21 minutes read
jina-reranker-m0: Multilingual Multimodal Document Reranker
Introducing jina-reranker-m0, our new multilingual multimodal reranker for retrieving visual documents, with SOTA performance on multilingual long documents and code searching tasks.
Jina AI
December 12, 2024 • 12 minutes read
Scaling Test-Time Compute For Embedding Models
Better results scale with compute—more on learning, more on search. A good pretrained model takes you far, but test-time compute takes you further. It's time to recognize this paradigm of test-time compute, even for embedding models.
Han Xiao
December 04, 2024 • 13 minutes read
Still Need Chunking When Long-Context Models Can Do It All?
Comparing how long-context embedding models perform with different chunking strategies to find the optimal approach for your needs.
Michael Günther
Alex C-G
November 21, 2024 • 9 minutes read
Jina CLIP v2: Multilingual Multimodal Embeddings for Text and Images
Jina-CLIP v2, a 0.9B multimodal embedding model with multilingual support of 89 languages, high image resolution at 512x512, and Matryoshka representations.
Jina AI
October 29, 2024 • 11 minutes read
Beyond CLIP: How Jina-CLIP Advances Multimodal Search
Learn how Jina-CLIP enhances OpenAI's CLIP with better retrieval accuracy and more diverse results through unified text-image embeddings.
Bo Wang
Alex C-G