Note: This model has been superseded by newer models.

jina-embeddings-v2-base-zh

Chinese-English bilingual embeddings with SOTA performance
Release Post
License: Apache-2.0
Release Date: 2024-01-09
Input: Text
Output: Vector
Model Details
Parameters: 161M
Input Token Length: 8K
Output Dimension: 768
Language Support
🇺🇸 English
🇨🇳 Chinese
Related Models
jina-embeddings-v2-base-en
jina-embeddings-v3
Tags
text-embedding
chinese
multilingual
base-model
production
long-context
high-dimension
Available via
Jina API, AWS SageMaker, Microsoft Azure, Hugging Face
Publications (1)
arXiv
February 26, 2024
Multi-Task Contrastive Learning for 8192-Token Bilingual Text Embeddings

Overview

Jina Embeddings v2 Base Chinese breaks new ground as the first open-source model to seamlessly handle both Chinese and English text with an unprecedented 8,192 token context length. This bilingual powerhouse addresses a critical challenge in global business: the need for accurate, long-form document processing across Chinese and English content. Unlike traditional models that struggle with cross-lingual understanding or require separate models for each language, this model maps equivalent meanings in both languages to the same embedding space, making it invaluable for organizations expanding globally or managing multilingual content.

Methods

The model's architecture combines a BERT-based backbone with symmetric bidirectional ALiBi (Attention with Linear Biases), enabling efficient processing of long sequences without the traditional 512-token limitation. The training process follows a carefully orchestrated three-phase approach: initial pre-training on high-quality bilingual data, followed by primary and secondary fine-tuning stages. This methodical training strategy, coupled with the model's 161M parameters and 768-dimensional output, achieves remarkable efficiency while maintaining balanced performance across both languages. The symmetric bidirectional ALiBi mechanism represents a significant innovation, allowing the model to handle documents up to 8,192 tokens in length—a capability previously limited to proprietary solutions.
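The symmetric bidirectional ALiBi mechanism described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the model's actual implementation; the slope schedule follows the common ALiBi recipe for head-count geometric sequences, and the head count of 12 is assumed from the BERT-base backbone:

```python
import numpy as np

def alibi_bias(n_heads: int, seq_len: int) -> np.ndarray:
    """Symmetric bidirectional ALiBi: each attention head adds a linear
    penalty proportional to the distance |i - j| between query position i
    and key position j. Because the bias depends only on distance, no
    learned positional embeddings are needed, which is what lets the
    model extrapolate to long inputs (here, up to 8,192 tokens)."""
    # Head-specific slopes form a geometric sequence, as in the ALiBi recipe.
    slopes = np.array([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
    positions = np.arange(seq_len)
    # Symmetric distance matrix: the penalty for attending j from i equals
    # the penalty for attending i from j, i.e. attention is bidirectional
    # (encoder-style) rather than causal.
    distance = np.abs(positions[:, None] - positions[None, :])
    return -slopes[:, None, None] * distance[None, :, :]

bias = alibi_bias(n_heads=12, seq_len=6)
# bias[h] is added to head h's pre-softmax attention scores.
```

Each head's scores are penalized more strongly the farther apart two tokens are, with the penalty rate varying per head, so some heads stay local while others attend broadly.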

Performance

In benchmarks on the Chinese MTEB (C-MTEB) leaderboard, the model demonstrates exceptional performance among models under 0.5GB, particularly excelling in Chinese language tasks. It significantly outperforms OpenAI's text-embedding-ada-002 in Chinese-specific applications while maintaining competitive performance in English tasks. A notable improvement in this release is the refined similarity score distribution, addressing the score inflation issues present in the preview version. The model now provides more distinct and logical similarity scores, ensuring more accurate representation of semantic relationships between texts. This enhancement is particularly evident in comparative tests, where the model shows superior discrimination between related and unrelated content in both languages.
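The refined score distribution mentioned above is something users can observe directly when comparing embeddings. A sketch with hypothetical low-dimensional vectors standing in for the model's 768-dimensional output: a well-calibrated embedding model should leave a clear gap between the cosine similarity of related and unrelated text pairs.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Standard metric for comparing embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 4-d embeddings standing in for real 768-d model output.
query     = np.array([0.9, 0.1, 0.2, 0.1])
related   = np.array([0.8, 0.2, 0.3, 0.1])   # semantically close text
unrelated = np.array([0.1, 0.9, 0.1, 0.8])   # semantically distant text

# Score inflation means these two values cluster together; a refined
# distribution keeps them clearly separated.
assert cosine_similarity(query, related) > cosine_similarity(query, unrelated)
```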

Best Practice

The model requires 322MB of storage and can be deployed through multiple channels including AWS SageMaker (us-east-1 region) and the Jina AI API. While GPU acceleration isn't mandatory, it can significantly improve processing speed for production workloads. The model excels in various applications including document analysis, multilingual search, and cross-lingual information retrieval, but users should note that it's specifically optimized for Chinese-English bilingual scenarios. For optimal results, input text should be properly segmented, and while the model can handle up to 8,192 tokens, breaking extremely long documents into semantically meaningful chunks is recommended for better performance. The model may not be suitable for tasks requiring real-time processing of very short texts where lower-latency, specialized models might be more appropriate.
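The chunking recommendation above can be sketched as follows. This is a naive whitespace-token chunker with overlap, shown only to illustrate the idea; a real pipeline would count tokens with the model's own tokenizer and prefer semantic boundaries (paragraphs, sentences) over a fixed window, and the `max_tokens` and `overlap` values are illustrative, not recommendations from the model card:

```python
def chunk_text(text: str, max_tokens: int = 1024, overlap: int = 64) -> list[str]:
    """Split text into overlapping chunks of at most max_tokens
    whitespace-delimited tokens. Overlap preserves some context across
    chunk boundaries so no sentence is embedded entirely without its
    surroundings."""
    tokens = text.split()
    if len(tokens) <= max_tokens:
        return [text]
    chunks, step = [], max_tokens - overlap
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break
    return chunks
```

Each resulting chunk is embedded separately, and the per-chunk vectors can then be indexed individually or pooled, depending on the retrieval setup.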
Blogs that mention this model
April 29, 2024 • 7 minutes read
Jina Embeddings and Reranker on Azure: Scalable Business-Ready AI Solutions
Jina Embeddings and Rerankers are now available on Azure Marketplace. Enterprises that prioritize privacy and security can now easily integrate Jina AI's state-of-the-art models right in their existing Azure ecosystem.
Susana Guzmán
February 28, 2024 • 3 minutes read
Revolutionizing Bilingual Text Embeddings with Multi-Task Contrastive Learning
Our new paper explores how our Spanish-English and German-English models use multi-task contrastive learning and a sophisticated data pipeline to master language understanding and cross-lingual efficiency for texts up to 8192 tokens.
Jina AI
January 31, 2024 • 16 minutes read
A Deep Dive into Tokenization
Tokenization, in LLMs, means chopping input texts up into smaller parts for processing. So why are embeddings billed by the token?
Scott Martens
January 26, 2024 • 13 minutes read
Jina Embeddings v2 Bilingual Models Are Now Open-Source On Hugging Face
Jina AI's open-source bilingual embedding models for German-English and Chinese-English are now on Hugging Face. We’re going to walk through installation and cross-language retrieval.
Scott Martens
January 09, 2024 • 12 minutes read
8K Token-Length Bilingual Embeddings Break Language Barriers in Chinese and English
The first bilingual Chinese-English embedding model with 8192 token-length.
Jina AI