In our recently published paper, Multi-Task Contrastive Learning for 8192-Token Bilingual Text Embeddings, we describe in detail how we developed our German-English and Spanish-English bilingual text embedding models.
Multi-Task Contrastive Learning for 8192-Token Bilingual Text Embeddings
We introduce a novel suite of state-of-the-art bilingual text embedding models that are designed to support English and another target language. These models are capable of processing lengthy text inputs with up to 8192 tokens, making them highly versatile for a range of natural language processing tasks such as text retrieval, clustering, and semantic textual similarity (STS) calculations. By focusing on bilingual models and introducing a unique multi-task learning objective, we have significantly improved the model performance on STS tasks, which outperforms the capabilities of existing multilingual models in both target language understanding and cross-lingual evaluation tasks. Moreover, our bilingual models are more efficient, requiring fewer parameters and less memory due to their smaller vocabulary needs. Furthermore, we have expanded the Massive Text Embedding Benchmark (MTEB) to include benchmarks for German and Spanish embedding models. This integration aims to stimulate further research and advancement in text embedding technologies for these languages.

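Our approach combines multi-task contrastive learning with an advanced data-processing pipeline, focusing on bilingual capability while supporting texts of up to 8192 tokens. This lets our models excel both at understanding the target language and at performing cross-lingual evaluation efficiently.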
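
To make the contrastive part of this concrete, here is a minimal sketch of a generic in-batch contrastive (InfoNCE-style) loss over paired texts. It is an illustration only, not the exact multi-task objective from the paper: the function names, the toy 2-D vectors, and the temperature of 0.05 are all assumptions chosen for readability.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(queries, targets, temperature=0.05):
    """Generic in-batch contrastive loss.

    queries[i] and targets[i] form a positive pair; every other target
    in the batch acts as a negative for queries[i].
    The temperature value is an assumed, illustrative default.
    """
    losses = []
    for i, query in enumerate(queries):
        logits = [cosine(query, target) / temperature for target in targets]
        # Log-sum-exp with the max subtracted for numerical stability.
        largest = max(logits)
        log_denominator = largest + math.log(
            sum(math.exp(x - largest) for x in logits)
        )
        # Negative log-probability of picking the true partner.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy 2-D "embeddings" of three English texts and their German translations.
english = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
german = [[0.9, 0.1], [0.1, 0.9], [-0.9, -0.1]]

aligned_loss = info_nce_loss(english, german)
shuffled_loss = info_nce_loss(english, german[::-1])
print(f"aligned pairs:  {aligned_loss:.6f}")
print(f"shuffled pairs: {shuffled_loss:.6f}")
```

The loss is close to zero when each text sits nearest to its own translation, and it grows sharply when the pairs are mismatched. That pressure is what pulls translations together in the embedding space during training.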
Aquí Se Habla Español: Top-Quality Spanish-English Embeddings and 8k Context
Jina AI's new bilingual Spanish-English embedding model brings the state-of-the-art in AI to half a billion Spanish speakers.

Ich bin ein Berliner: German-English Bilingual Embeddings with 8K Token Length
Jina AI introduces a German/English bilingual embedding model, featuring an extensive 8,192-token length, specifically designed to support German businesses thriving in the U.S. market.

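In addition to the bilingual models covered in the paper, we have also developed a Chinese-English bilingual model and an English monolingual model. These additions reflect our commitment to serving a broad range of language needs and to improving language processing capabilities.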
8K Token-Length Bilingual Embeddings Break Language Barriers in Chinese and English
The first bilingual Chinese-English embedding model with 8192 token-length.

Jina AI Launches World's First Open-Source 8K Text Embedding, Rivaling OpenAI
Jina AI introduces jina-embeddings-v2, the world's first open-source model boasting an 8K context length. Matching the prowess of OpenAI's proprietary models, this innovation is now publicly accessible on Huggingface, signaling a significant milestone in the landscape of text embeddings.

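Our bilingual models are notable for their efficiency: an optimized vocabulary size reduces both the parameter count and the memory footprint. This efficiency underscores our commitment to building language processing tools that are both powerful and resource-efficient.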
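
A rough back-of-the-envelope calculation shows why vocabulary size matters. A model's token-embedding table has one vector per vocabulary entry, so its parameter count is vocabulary size times hidden dimension. The numbers below are hypothetical round figures chosen for illustration, not the actual sizes of any Jina AI or third-party model, and the helper name is likewise an assumption.

```python
def embedding_table_cost(vocab_size, hidden_dim, bytes_per_param=2):
    """Return (parameter count, bytes) for a token-embedding table.

    bytes_per_param=2 assumes 16-bit weights; use 4 for 32-bit weights.
    """
    params = vocab_size * hidden_dim
    return params, params * bytes_per_param


# Hypothetical sizes for illustration only.
HIDDEN_DIM = 768
bilingual_params, bilingual_bytes = embedding_table_cost(60_000, HIDDEN_DIM)
multilingual_params, multilingual_bytes = embedding_table_cost(250_000, HIDDEN_DIM)

saved_params = multilingual_params - bilingual_params
saved_megabytes = (multilingual_bytes - bilingual_bytes) / 1_000_000

print(f"bilingual table:    {bilingual_params:,} parameters")
print(f"multilingual table: {multilingual_params:,} parameters")
print(f"difference:         {saved_params:,} parameters, {saved_megabytes:.1f} MB")
```

Under these assumed sizes, the smaller vocabulary removes well over a hundred million parameters from the embedding table alone, before any other part of the network is considered.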
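Following the paper's release, we extended the Massive Text Embedding Benchmark (MTEB) with benchmarks for English-German and English-Spanish embedding models. This extension is part of our effort to advance research and progress in text embedding technology for non-English languages.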
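At Jina AI, our goal is to contribute to the NLP field by improving multilingual processing and understanding through our work on bilingual and monolingual text embedding models.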