Overview
Jina CLIP v1 revolutionizes multimodal AI by being the first model to excel equally in both text-to-text and text-to-image retrieval tasks. Unlike traditional CLIP models that struggle with text-only scenarios, this model achieves state-of-the-art performance across all retrieval combinations while maintaining a remarkably compact 223M parameter size. The model addresses a critical industry challenge by eliminating the need for separate models for text and image processing, reducing system complexity and computational overhead. For teams building search systems, recommendation engines, or content analysis tools, Jina CLIP v1 offers a single, efficient solution that handles both text and visual content with exceptional accuracy.
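To make the single-model workflow concrete, here is a minimal sketch that embeds text documents and an image with the same model and ranks both against one text query. It assumes the open-source Hugging Face release (jinaai/jina-clip-v1) exposes encode_text and encode_image helpers via trust_remote_code, and the image file name is a placeholder; consult the model card for the exact interface.

```python
# Minimal sketch of single-model retrieval over text and images.
# Assumes the Hugging Face release exposes encode_text / encode_image
# helpers via trust_remote_code; the image path is a placeholder.
import numpy as np
from transformers import AutoModel

model = AutoModel.from_pretrained("jinaai/jina-clip-v1", trust_remote_code=True)

docs = ["A recipe for blueberry muffins", "Quarterly sales report, Q3 2024"]
images = ["photo_of_muffins.jpg"]  # hypothetical local file or URL

doc_emb = np.asarray(model.encode_text(docs))      # text-to-text retrieval side
img_emb = np.asarray(model.encode_image(images))   # cross-modal side, same vector space
query = np.asarray(model.encode_text(["baking instructions"]))

def cosine(a, b):
    # Cosine similarity between every row of a and every row of b.
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

print(cosine(query, doc_emb))  # rank text documents against the query
print(cosine(query, img_emb))  # rank images with the very same query embedding
```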
Methods
The model's architecture represents a significant innovation in multimodal AI design, combining an adapted Jina BERT v2 text encoder with the EVA-02 image encoder from the Beijing Academy of Artificial Intelligence (BAAI). The text encoder supports sequences of up to 8,192 tokens - over 100 times longer than the original CLIP's 77-token limit - while the image encoder splits input images into 16x16-pixel patch tokens. The training process follows a novel three-step approach: first, aligning image-caption pairs while maintaining text understanding through interleaved text-pair training; second, incorporating AI-generated longer text descriptions of images; and finally, using hard negative text triplets to sharpen the model's ability to distinguish closely related meanings. This training methodology enables the model to maintain high performance on both short captions and detailed textual descriptions while preserving strong visual understanding.
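At the core of the image-caption alignment step is a CLIP-style symmetric contrastive (InfoNCE) objective. The PyTorch sketch below shows that objective in simplified form; the batch size, embedding dimension, and temperature are illustrative choices, not the paper's actual training settings.

```python
# Simplified CLIP-style symmetric InfoNCE loss used for image-caption alignment.
# Shapes and the temperature value are illustrative only.
import torch
import torch.nn.functional as F

def contrastive_loss(text_emb: torch.Tensor, image_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    # Normalize both sides so the dot product is a cosine similarity.
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)

    # Similarity matrix: entry (i, j) compares caption i with image j.
    logits = text_emb @ image_emb.T / temperature
    targets = torch.arange(len(text_emb), device=text_emb.device)

    # Matching pairs sit on the diagonal; average both retrieval directions.
    loss_t2i = F.cross_entropy(logits, targets)
    loss_i2t = F.cross_entropy(logits.T, targets)
    return (loss_t2i + loss_i2t) / 2

# Example with random embeddings (batch of 8, 768-dimensional vectors).
loss = contrastive_loss(torch.randn(8, 768), torch.randn(8, 768))
```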
Performance
Jina CLIP v1 demonstrates remarkable improvements over OpenAI's original CLIP across all benchmarks. In text-only retrieval, it scores 0.429 versus CLIP's 0.162, a 165% improvement. For image-related tasks, it shows consistent gains: 2% better in text-to-image retrieval (0.899), 6% in image-to-text retrieval (0.803), and 12% in image-to-image retrieval (0.916). The model also shines in zero-shot visual classification, categorizing images into classes it was never explicitly trained on. Evaluated on standard benchmarks - MTEB for text retrieval, CIFAR-100 for image tasks, and Flickr8k/30k and MSCOCO Captions for cross-modal retrieval - it performs on par with specialized single-modality models while delivering strong cross-modal performance.
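Zero-shot classification with an embedding model of this kind reduces to comparing an image embedding against embeddings of textual class prompts in the shared vector space. A minimal sketch, again assuming the hypothetical encode_text and encode_image helpers from the earlier example, with placeholder labels and file name:

```python
# Zero-shot classification as "nearest class prompt" in the shared embedding space.
# Assumes the same hypothetical encode_text / encode_image helpers as above.
import numpy as np
from transformers import AutoModel

model = AutoModel.from_pretrained("jinaai/jina-clip-v1", trust_remote_code=True)

labels = ["cat", "dog", "airplane"]
prompts = [f"a photo of a {label}" for label in labels]

label_emb = np.asarray(model.encode_text(prompts))
image_emb = np.asarray(model.encode_image(["unknown_animal.jpg"]))  # hypothetical file

# Cosine similarity between the image and each class prompt.
label_emb = label_emb / np.linalg.norm(label_emb, axis=-1, keepdims=True)
image_emb = image_emb / np.linalg.norm(image_emb, axis=-1, keepdims=True)
scores = image_emb @ label_emb.T

print(labels[int(scores.argmax())])  # predicted class, with no task-specific training
```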
Best Practice
To deploy Jina CLIP v1 effectively, teams should consider both its capabilities and its resource requirements. The model processes images in 224x224-pixel tiles, with each tile consuming 1,000 tokens of processing capacity, so efficient image preprocessing that matches these dimensions pays off. While the model excels at both short and long text processing, it currently supports English-language input only. Teams should also budget token usage carefully: text requires approximately 1.1 tokens per word, while images are processed in tiles (e.g., a 750x500-pixel image requires 12 tiles, consuming 12,000 tokens); the sketch below illustrates the arithmetic. The model is available both through the Jina Embeddings API and as an open-source release on Hugging Face under the Apache 2.0 license, offering flexibility in deployment. For production environments, consider the AWS Marketplace or Azure deployment options, which provide optimized infrastructure setups.
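For capacity planning, the figures quoted above translate into a simple estimator. The sketch below only illustrates the arithmetic (roughly 1.1 tokens per word, 224x224 tiles at 1,000 tokens each); it is not an official billing calculator.

```python
# Back-of-the-envelope token estimator based on the figures quoted above:
# ~1.1 tokens per word of text, 224x224-pixel tiles at 1,000 tokens per tile.
import math

def text_tokens(text: str, tokens_per_word: float = 1.1) -> int:
    # Rough word-based estimate; real tokenizers may differ.
    return math.ceil(len(text.split()) * tokens_per_word)

def image_tokens(width: int, height: int,
                 tile_size: int = 224, tokens_per_tile: int = 1_000) -> int:
    # Number of 224x224 tiles needed to cover the image, times cost per tile.
    tiles = math.ceil(width / tile_size) * math.ceil(height / tile_size)
    return tiles * tokens_per_tile

print(text_tokens("a short product description of about ten words here"))  # 9 words -> 10 tokens
print(image_tokens(750, 500))  # 4 x 3 = 12 tiles -> 12,000 tokens
```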