Press release

February 28, 2024

Revolutionizing Bilingual Text Embeddings with Multi-Task Contrastive Learning

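Our new paper explores how our Spanish-English and German-English models use multi-task contrastive learning and a sophisticated data pipeline to master language understanding and cross-lingual processing for texts of up to 8192 tokens.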

Jina AI • 3 min read

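In our recently published paper, Multi-Task Contrastive Learning for 8192-Token Bilingual Text Embeddings, we describe in detail the development of our German-English and Spanish-English bilingual text embedding models.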

Multi-Task Contrastive Learning for 8192-Token Bilingual Text Embeddings

We introduce a novel suite of state-of-the-art bilingual text embedding models that are designed to support English and another target language. These models are capable of processing lengthy text inputs with up to 8192 tokens, making them highly versatile for a range of natural language processing tasks such as text retrieval, clustering, and semantic textual similarity (STS) calculations. By focusing on bilingual models and introducing a unique multi-task learning objective, we have significantly improved the model performance on STS tasks, which outperforms the capabilities of existing multilingual models in both target language understanding and cross-lingual evaluation tasks. Moreover, our bilingual models are more efficient, requiring fewer parameters and less memory due to their smaller vocabulary needs. Furthermore, we have expanded the Massive Text Embedding Benchmark (MTEB) to include benchmarks for German and Spanish embedding models. This integration aims to stimulate further research and advancement in text embedding technologies for these languages.

arXiv.org · Isabelle Mohr

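Our approach combines multi-task contrastive learning with an advanced data processing pipeline, focusing on bilingual capability while supporting texts of up to 8192 tokens. As a result, our models perform well both at understanding the target language and at cross-lingual evaluation tasks.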
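To make the idea of contrastive learning on bilingual pairs concrete, here is a minimal sketch of a generic in-batch InfoNCE loss in plain Python. It is an illustration under stated assumptions, not the paper's exact multi-task objective: the function names, the temperature value, and the toy 2-D vectors are all hypothetical, and a real model would compute this over learned embeddings in a deep learning framework.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(queries, targets, temperature=0.05):
    """Mean in-batch InfoNCE loss.

    queries[i] is assumed to be paired with targets[i]; every other
    target in the batch serves as a negative for that query.
    The temperature default is a hypothetical value.
    """
    total = 0.0
    for i, query in enumerate(queries):
        logits = [cosine(query, t) / temperature for t in targets]
        # Log-sum-exp with max subtraction for numerical stability.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
        total += log_denominator - logits[i]
    return total / len(queries)


# Toy 2-D "embeddings" (hypothetical): two English sentences and their
# German counterparts, once correctly paired and once swapped.
english = [[1.0, 0.0], [0.0, 1.0]]
german_aligned = [[0.9, 0.1], [0.1, 0.9]]
german_swapped = [[0.1, 0.9], [0.9, 0.1]]

aligned_loss = info_nce_loss(english, german_aligned)
swapped_loss = info_nce_loss(english, german_swapped)
print(f"aligned pairs: {aligned_loss:.6f}")
print(f"swapped pairs: {swapped_loss:.6f}")
```

The loss is small when each sentence sits closest to its own translation and large when the pairs are swapped, which is the signal that pulls matching texts together across languages. A multi-task setup would combine a loss like this with other objectives, for example one tailored to semantic textual similarity.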

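In addition to the bilingual models covered in the paper, we have also developed a Chinese-English bilingual model and an English monolingual model. These additions reflect our commitment to serving a broad range of language needs and to improving language processing capabilities.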

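Our bilingual models are notable for their efficiency: an optimized vocabulary size reduces both parameter count and memory footprint. This efficiency reflects our commitment to building language processing tools that are both powerful and resource-efficient.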
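The link between vocabulary size and parameter count can be shown with simple arithmetic: a token embedding table holds one vector per vocabulary entry, so its size is vocabulary size times hidden dimension. The sketch below uses hypothetical numbers chosen only for illustration; they are not the actual vocabulary sizes or dimensions of the models discussed here.

```python
def embedding_table_params(vocab_size, hidden_dim):
    """Number of parameters in a token embedding table."""
    return vocab_size * hidden_dim


def fp32_megabytes(num_params):
    """Memory in megabytes (10^6 bytes) at 4 bytes per parameter."""
    return num_params * 4 / 1_000_000


# All three values below are hypothetical, for illustration only.
hidden_dim = 768
bilingual_vocab = 60_000
multilingual_vocab = 250_000

bilingual = embedding_table_params(bilingual_vocab, hidden_dim)
multilingual = embedding_table_params(multilingual_vocab, hidden_dim)
saved = multilingual - bilingual

print(f"bilingual table:    {bilingual:,} parameters")
print(f"multilingual table: {multilingual:,} parameters")
print(f"saved:              {saved:,} parameters, "
      f"{fp32_megabytes(saved):.2f} MB in fp32")
```

Under these assumed numbers, the smaller vocabulary removes most of the embedding table, which is where a bilingual model can save parameters and memory relative to a broadly multilingual one without touching the rest of the network.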

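Following the paper's publication, we extended the Massive Text Embedding Benchmark (MTEB) with benchmarks for English-German and English-Spanish embedding models. This extension is part of our effort to advance research and progress in text embedding technology for non-English languages.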

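At Jina AI, our goal is to contribute to the NLP field by improving multilingual processing and understanding through our work on bilingual and monolingual text embedding models.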
