Overview
Jina Embeddings v2 Base Chinese breaks new ground as the first open-source model to seamlessly handle both Chinese and English text with an 8,192-token context length. This bilingual model addresses a critical challenge in global business: the need for accurate, long-form document processing across Chinese and English content. Unlike traditional approaches that struggle with cross-lingual understanding or require a separate model per language, this model maps equivalent meanings in both languages to the same embedding space, making it invaluable for organizations expanding globally or managing multilingual content.
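As a concrete illustration of the shared embedding space, here is a minimal usage sketch. It assumes the model is published on Hugging Face as jinaai/jina-embeddings-v2-base-zh and exposes an encode helper via its custom model code; verify the exact API against the official model card.

```python
# Minimal sketch: encode an English/Chinese translation pair and check
# that the two sentences land close together in the shared space.
import numpy as np
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "jinaai/jina-embeddings-v2-base-zh", trust_remote_code=True
)

# Equivalent sentences in the two languages should map to nearby vectors.
embeddings = model.encode(["How is the weather today?", "今天天气怎么样?"])

cos_sim = np.dot(embeddings[0], embeddings[1]) / (
    np.linalg.norm(embeddings[0]) * np.linalg.norm(embeddings[1])
)
print(f"cross-lingual cosine similarity: {cos_sim:.3f}")
```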
Methods
The model's architecture combines a BERT-based backbone with symmetric bidirectional ALiBi (Attention with Linear Biases), enabling efficient processing of long sequences without the traditional 512-token limitation. Training follows a carefully orchestrated three-phase approach: initial pre-training on high-quality bilingual data, followed by primary and secondary fine-tuning stages. This staged training strategy, combined with a compact 161M-parameter architecture producing 768-dimensional embeddings, achieves remarkable efficiency while maintaining balanced performance across both languages. The symmetric bidirectional ALiBi mechanism is a significant innovation: because it biases attention scores by token distance instead of relying on learned absolute position embeddings, the model can handle documents up to 8,192 tokens in length, a capability previously limited to proprietary solutions.
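To make the mechanism concrete, the sketch below computes the bias matrix for a symmetric (distance-only) ALiBi variant in NumPy. The slope schedule follows the original ALiBi paper for power-of-two head counts; the model's exact implementation may differ.

```python
import numpy as np

def alibi_slopes(num_heads: int) -> np.ndarray:
    # Geometric slope schedule from the ALiBi paper: head h (1-indexed)
    # gets slope 2^(-8h/n). Valid as written for power-of-two head counts.
    return np.array([2 ** (-8 * (h + 1) / num_heads) for h in range(num_heads)])

def symmetric_alibi_bias(seq_len: int, num_heads: int) -> np.ndarray:
    # Symmetric bidirectional variant: the penalty depends only on the
    # absolute distance |i - j|, so tokens attend equally in both directions.
    positions = np.arange(seq_len)
    distance = np.abs(positions[:, None] - positions[None, :])  # (L, L)
    slopes = alibi_slopes(num_heads)                            # (H,)
    # Bias added to attention logits before softmax: shape (H, L, L).
    return -slopes[:, None, None] * distance[None, :, :]

bias = symmetric_alibi_bias(seq_len=8, num_heads=4)
print(bias.shape)  # (4, 8, 8): per-head, distance-proportional penalties
```

Because the bias is a fixed function of distance rather than a learned table of positions, the same attention pattern extrapolates to sequences far longer than those seen in training.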
Performance
In benchmarks on the Chinese MTEB (C-MTEB) leaderboard, the model demonstrates exceptional performance among models under 0.5 GB, particularly excelling in Chinese-language tasks. It significantly outperforms OpenAI's text-embedding-ada-002 in Chinese-specific applications while remaining competitive on English tasks. A notable improvement in this release is the refined similarity score distribution, which addresses the score inflation present in the preview version: the model now produces more distinct, better-separated similarity scores that more accurately reflect the semantic relationships between texts. This enhancement is particularly evident in comparative tests, where the model discriminates more cleanly between related and unrelated content in both languages.
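The improved score separation can be sanity-checked with a quick comparison of a translation pair against an unrelated pair; the sentences below are illustrative, and the model ID is assumed as in the Overview sketch.

```python
# Hedged sanity check: a translation pair should score clearly higher
# than an unrelated cross-lingual pair.
import numpy as np
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "jinaai/jina-embeddings-v2-base-zh", trust_remote_code=True
)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

pairs = [
    ("The cat sleeps on the sofa.", "猫在沙发上睡觉。"),    # related (translation)
    ("The cat sleeps on the sofa.", "股市今天大幅下跌。"),  # unrelated
]
for en, zh in pairs:
    e = model.encode([en, zh])
    print(f"{cosine(e[0], e[1]):.3f}  {en!r} <-> {zh!r}")
```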
Best Practice
The model requires 322 MB of storage and can be deployed through multiple channels, including AWS SageMaker (us-east-1 region) and the Jina AI API. GPU acceleration isn't mandatory, but it significantly improves throughput for production workloads. The model excels in applications such as document analysis, multilingual search, and cross-lingual information retrieval, though users should note that it is specifically optimized for Chinese-English bilingual scenarios. For optimal results, segment input text cleanly; although the model accepts up to 8,192 tokens, breaking very long documents into semantically meaningful chunks generally yields better retrieval quality. For latency-critical processing of very short texts, smaller specialized models may be a better fit.
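For API-based deployment, a hedged sketch of chunking a long document and requesting embeddings might look like the following; the endpoint and payload shape should be verified against the current Jina AI API documentation, and report.txt and JINA_API_KEY are placeholders.

```python
import os
import requests

def chunk_text(text: str, max_chars: int = 2000) -> list[str]:
    # Naive paragraph-based chunking; in practice, split on semantic
    # boundaries (sections, headings) rather than a fixed character budget.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for p in paragraphs:
        if current and len(current) + len(p) > max_chars:
            chunks.append(current)
            current = p
        else:
            current = f"{current}\n\n{p}" if current else p
    if current:
        chunks.append(current)
    return chunks

# Assumed endpoint and payload shape; JINA_API_KEY is a placeholder.
resp = requests.post(
    "https://api.jina.ai/v1/embeddings",
    headers={"Authorization": f"Bearer {os.environ['JINA_API_KEY']}"},
    json={
        "model": "jina-embeddings-v2-base-zh",
        "input": chunk_text(open("report.txt").read()),
    },
    timeout=60,
)
resp.raise_for_status()
embeddings = [item["embedding"] for item in resp.json()["data"]]
```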