Jina ColBERT v2: Multilingual Late Interaction Retriever for Embedding and Reranking

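Today, we're excited to release Jina ColBERT v2 (jina-colbert-v2), an advanced late interaction retrieval model built on the ColBERT architecture. This new language model improves on the performance of jina-colbert-v1-en and adds multilingual support and dynamic output dimensions.

This new release offers the following features:

Superior retrieval performance compared with the original ColBERT-v2 (+6.5%) and our previous release, jina-colbert-v1-en (+5.4%).
Multilingual support for 89 languages, with strong performance across major global languages.
User-controlled output embedding sizes through Matryoshka representation learning, so users can flexibly balance efficiency against precision.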

Technical Summary of jina-colbert-v2

The full technical report is available on arXiv:

Jina-ColBERT-v2: A General-Purpose Multilingual Late Interaction Retriever

Multi-vector dense models, such as ColBERT, have proven highly effective in information retrieval. ColBERT's late interaction scoring approximates the joint query-document attention seen in cross-encoders while maintaining inference efficiency closer to traditional dense retrieval models, thanks to its bi-encoder architecture and recent optimizations in indexing and search. In this paper, we introduce several improvements to the ColBERT model architecture and training pipeline, leveraging techniques successful in the more established single-vector embedding model paradigm, particularly those suited for heterogeneous multilingual data. Our new model, Jina-ColBERT-v2, demonstrates strong performance across a range of English and multilingual retrieval tasks, while also cutting storage requirements by up to 50% compared to previous models.

arXiv.org, Rohan Jha

	jina-colbert-v2	jina-colbert-v1-en	Original ColBERTv2
Average of 14 English BEIR tasks	0.521	0.494	0.489
Multilingual	89 languages	English-only	English-only
Output dimensions	128, 96, or 64	Fixed 128	Fixed 128
Max query length	32 tokens	32 tokens	32 tokens
Max document length	8192 tokens	8192 tokens	512 tokens
Parameters	560M	137M	110M
Model size	1.1GB	550MB	438MB

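Asymmetric Embedding in ColBERT

ColBERT builds on the BERT architecture by adding late interaction and asymmetric query-document encoding.

The asymmetric nature of ColBERT means that when using models like jina-colbert-v2 or jina-colbert-v1-en, you need to specify whether you are embedding a query, a document, or both (for reranking). This added flexibility improves performance over homogeneous embedding models in retrieval tasks.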
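To make late interaction concrete, here is a minimal sketch of ColBERT-style MaxSim scoring in plain Python. The query and the document are each represented by one vector per token; every query token vector is matched to its most similar document token vector, and those maxima are summed. The 2-dimensional vectors and the names `dot`, `maxsim_score`, `doc_a`, and `doc_b` are made up for illustration. Real embeddings come from the model and have 128, 96, or 64 dimensions.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vectors, doc_vectors):
    """Late interaction score: for each query token vector, take the best
    similarity over all document token vectors, then sum those maxima."""
    return sum(max(dot(q, d) for d in doc_vectors) for q in query_vectors)


# Toy token embeddings (illustrative values, not model output).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
doc_b = [[0.1, 0.2], [0.3, 0.1]]

scores = {
    "doc_a": maxsim_score(query, doc_a),
    "doc_b": maxsim_score(query, doc_b),
}

# Rank documents from best to worst match.
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking, scores)
```

Because documents and queries are embedded separately, document vectors can be computed once and stored, while only the short query has to be encoded at search time.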

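Multilingual Support for Over 89 Languages

Jina ColBERT v2 has extensive multilingual capabilities, designed to meet the demands of modern, globalized information retrieval and AI applications. The training corpus for jina-colbert-v2 covers 89 languages, with additional training stages for major international languages including Arabic, Chinese, English, French, German, Japanese, Russian, and Spanish, as well as programming languages. Training also included a corpus of aligned bilingual texts to unlock cross-lingual potential, so that queries and documents in different languages can be matched in reranking and retrieval tasks.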

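Figure: Pretraining dataset distribution by language (specified by ISO-639 code), on a logarithmic scale. The chart highlights the dominance of English and Chinese in the training data.

Today, Jina ColBERT v2 stands out as the only multilingual ColBERT-like model that generates compact embeddings, and it significantly outperforms BM25-based retrieval in every language tested on the MIRACL benchmark.

Figure: Jina ColBERT v2 performance compared with BM25 across 16 languages on the MIRACL benchmark (bar chart).

Furthermore, on English-language retrieval tasks, Jina ColBERT v2 exceeds the performance of its predecessor jina-colbert-v1-en and the original ColBERT v2 model, with performance comparable to the highly specialized English-only AnswerAI-ColBERT-small model.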

Model name	Average score (14 BEIR English benchmarks)	Multilingual support
jina-colbert-v2	0.521	Multilingual
jina-colbert-v1-en	0.494	English-only
ColBERT v2.0	0.489	English-only
AnswerAI-ColBERT-small	0.549	English-only

Figure: Evaluation of jina-colbert-v2 on the English datasets of the BEIR benchmark (bar chart that also includes other models, such as BM25).

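Matryoshka Representation Learning

Matryoshka Representation Learning is a technique for training models to support different output vector sizes while minimizing any loss in accuracy. We train the hidden layers of the network with several different linear projection heads (the final layers of a neural network), each supporting a different output size. Jina ColBERT v2 supports output vectors of 128, 96, and 64 dimensions.

Jina ColBERT v2 produces 128-dimension output embeddings by default, but it can also produce 96- and 64-dimension embeddings that perform nearly identically while being 25% and 50% shorter, respectively.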
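The idea of several projection heads sharing one set of hidden states can be sketched as follows. This is a toy illustration only: the weights are random rather than trained, the hidden size of 16 is a made-up value, and `make_heads` and `project` are hypothetical helper names, not part of any Jina library. In actual training, the heads are learned so that each output size preserves retrieval quality.

```python
import random

HEAD_SIZES = (128, 96, 64)  # output sizes named in the text


def make_heads(hidden_dim, sizes=HEAD_SIZES, seed=0):
    """Create one toy linear projection head (a weight matrix) per output size."""
    rng = random.Random(seed)
    return {
        size: [[rng.gauss(0.0, 1.0) for _ in range(hidden_dim)] for _ in range(size)]
        for size in sizes
    }


def project(hidden_states, head):
    """Apply a linear head to every token's hidden state: one output vector per token."""
    return [
        [sum(w * h for w, h in zip(row, token)) for row in head]
        for token in hidden_states
    ]


rng = random.Random(1)
hidden_dim = 16  # toy value; the real model's hidden size is much larger
hidden_states = [[rng.gauss(0.0, 1.0) for _ in range(hidden_dim)] for _ in range(5)]

heads = make_heads(hidden_dim)
outputs = {size: project(hidden_states, head) for size, head in heads.items()}

for size, vectors in outputs.items():
    saving = 1 - size / 128
    print(f"{size}-dim head: {len(vectors)} token vectors, {saving:.0%} shorter than 128")
```

The number of token vectors stays the same for every head. Only the length of each vector changes, which is where the 25% and 50% reductions come from.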

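The table below shows the nDCG performance of jina-colbert-v2 for the top ten results (nDCG@10) over six BEIR benchmarks. Based on the rounded averages shown, the difference between 128 and 96 dimensions is just over 1% (0.007 nDCG@10), and the difference between 128 and 64 dimensions is about 1.6% (0.009 nDCG@10).

Output dimensions	Average score (nDCG@10 over 6 benchmarks)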
128	0.565
96	0.558
64	0.556

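Figure: Jina ColBERT v2 performance at different output dimensions (bar chart of BEIR scores on datasets from nfcorpus to msmarco, including the 64-dimension variant).

Reducing the size of the output vectors saves space and speeds up applications, such as vector-based information retrieval, that have to compare different vectors or measure the distance between them.

This has significant cost consequences, even when counting only the reduction in storage costs. For example, according to Qdrant's cloud cost calculator, storing 100 million documents on AWS with a 128-dimension vector each has an estimated cost of US$1,319.24 per month. At 64 dimensions, this falls to US$659.62.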
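The two figures above are consistent with storage cost scaling linearly with vector size. The short sketch below reproduces the 64-dimension estimate from the 128-dimension one under that assumption. The function name `estimated_monthly_cost` is hypothetical, and linear scaling is an assumption rather than Qdrant's published pricing formula.

```python
BASELINE_DIM = 128
BASELINE_MONTHLY_USD = 1319.24  # estimate quoted above for 128-dimension vectors


def estimated_monthly_cost(dim, baseline_dim=BASELINE_DIM,
                           baseline_cost=BASELINE_MONTHLY_USD):
    """Scale the quoted estimate linearly with vector size (an assumption)."""
    return round(baseline_cost * dim / baseline_dim, 2)


cost_64 = estimated_monthly_cost(64)
monthly_saving = round(BASELINE_MONTHLY_USD - cost_64, 2)
print(f"64 dimensions: ${cost_64:,.2f} per month, saving ${monthly_saving:,.2f}")
```

For any real deployment, the calculator itself remains the reference, since pricing can depend on factors other than vector size.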

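Getting Started with Jina ColBERT v2

Jina ColBERT v2 is available via the Jina Search Foundation API, the AWS marketplace, and Azure. It is also available on Hugging Face, but for non-commercial use only (CC BY-NC-4.0).

Via the Jina Search Foundation API

For Embedding

The following curl command shows how to specify the input and options to get document embeddings from jina-colbert-v2 via the Jina Embeddings API. To get vectors of your preferred size, specify 128 or 64 for the dimensions parameter. This parameter is optional, and its default value is 128.

Input documents longer than 8192 tokens will be truncated.

Specify your Jina API key in the authorization header, Authorization: Bearer <YOUR JINA API KEY>: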

# Set "dimensions" to 64 for half-size vectors.
# "input_type" is "document" here; for query embeddings, see below.
curl https://api.jina.ai/v1/multi-vector \
	 -H "Content-Type: application/json" \
	 -H "Authorization: Bearer <YOUR JINA API KEY>" \
	 -d '{
	"model": "jina-colbert-v2",
	"dimensions": 128,
	"input_type": "document",
	"embedding_type": "float",
	"input": [
		"Your document text string goes here",
		"You can send multiple texts",
		"Each text can be up to 8192 tokens long"
    ]}'

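To get query embeddings, set the input_type parameter to query instead of document. Note that queries have much stricter size limits than documents: they are truncated at 32 tokens. Query encoding always returns 32 token embeddings, including embeddings for the padding when the query is shorter than 32 tokens.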

# Set "dimensions" to 64 for half-size vectors.
# "input_type" must be "query" to get query embeddings.
curl https://api.jina.ai/v1/multi-vector \
	 -H "Content-Type: application/json" \
	 -H "Authorization: Bearer <YOUR JINA API KEY>" \
	 -d '{
	"model": "jina-colbert-v2",
	"dimensions": 128,
	"input_type": "query",
	"embedding_type": "float",
	"input": [
		"Your query text string goes here",
		"You can send multiple texts",
		"Each query text can be up to 32 tokens long"
    ]}'
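The same two requests can be issued from Python. The sketch below uses only the standard library and mirrors the fields of the curl commands. The helper names `build_payload` and `embed` are hypothetical, and the `data` and `embeddings` keys read from the response are taken from the Qdrant example later in this post rather than from a formal API specification.

```python
import json
import urllib.request

API_URL = "https://api.jina.ai/v1/multi-vector"  # endpoint from the curl examples


def build_payload(texts, input_type="document", dimensions=128):
    """Build the JSON body for a jina-colbert-v2 embedding request."""
    if input_type not in ("document", "query"):
        raise ValueError("input_type must be 'document' or 'query'")
    return {
        "model": "jina-colbert-v2",
        "dimensions": dimensions,
        "input_type": input_type,
        "embedding_type": "float",
        "input": list(texts),
    }


def embed(texts, api_key, input_type="document", dimensions=128):
    """POST the request and return one multi-vector (a list of token vectors) per text."""
    request = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(texts, input_type, dimensions)).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # Assumed response shape: {"data": [{"embeddings": [[...], ...]}, ...]}
    return [row["embeddings"] for row in body["data"]]
```

With a valid key, `embed(["What is ColBERT?"], api_key, input_type="query")` should return a single entry holding the query's token vectors, and `input_type="document"` does the same for documents.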

For Reranking

To use jina-colbert-v2 via the Jina Reranker API, passing in a query and several documents and getting back rankable match scores, construct your request as follows:

curl https://api.jina.ai/v1/rerank \
	 -H "Content-Type: application/json" \
	 -H "Authorization: Bearer <YOUR JINA API KEY>" \
	 -d '{
      "model": "jina-colbert-v2",
      "query": "What is the population of Berlin?",
      "top_n": 3,
      "documents": [
        "Berlin's population grew by 0.7 percent in 2023 compared with the previous year. Accordingly, around 27,300 more residents lived in Berlin at the end of the last year than in 2022. Those of 30 to under 40 years old form the numerically largest age group. With roughly 881,000 foreign residents from around 170 nations and an average age of the population of 42.5 years old.",
        "Mount Berlin is a glacier-covered volcano in Marie Byrd Land, Antarctica, 100 kilometres (62 mi) from the Amundsen Sea. It is a roughly 20-kilometre-wide (12 mi) mountain with parasitic vents that consists of two coalesced volcanoes: Berlin proper with the 2-kilometre-wide (1.2 mi) Berlin Crater and Merrem Peak with a 2.5-by-1-kilometre-wide (1.55 mi × 0.62 mi) crater, 3.5 kilometres (2.2 mi) away from Berlin.",
        "Population as of 31.12.2023 by nationality and federal states Land\tTotal\tGermans\tForeigners\tincluding EU-states number\t%\tnumber\t%",
        "The urban area of Berlin has a population of over 4.5 million and is therefore the most populous urban area in Germany. The Berlin-Brandenburg capital region has around 6.2 million inhabitants and is Germany's second-largest metropolitan region after the Rhine-Ruhr region, and the sixth-biggest metropolitan region by GDP in the European Union.",
        "Irving Berlin (born Israel Beilin) was an American composer and songwriter. His music forms a large part of the Great American Songbook. Berlin received numerous honors including an Academy Award, a Grammy Award, and a Tony Award.",
        "Berlin is a town in the Capitol Planning Region, Connecticut, United States. The population was 20,175 at the 2020 census.",
        "Berlin is the capital and largest city of Germany, both by area and by population. Its more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits.",
        "Berlin, Berlin ist eine für die ARD produzierte Fernsehserie, die von 2002 bis 2005 im Vorabendprogramm des Ersten ausgestrahlt wurde. Regie führten unter anderem Franziska Meyer Price, Christoph Schnee, Sven Unterwaldt Jr. und Titus Selge."
        ]
    }'

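Note the top_n parameter, which specifies the number of documents you want to retrieve. For example, if your application only uses the top match, set top_n to 1.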
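As a rough Python counterpart to the curl request, the sketch below builds the same body with the standard library. The helper names `build_rerank_payload` and `rerank` are hypothetical, and capping top_n at the number of documents is a convenience chosen for this example, not documented API behavior. The response is returned unparsed because its schema is not described in this post.

```python
import json
import urllib.request

RERANK_URL = "https://api.jina.ai/v1/rerank"  # endpoint from the curl example


def build_rerank_payload(query, documents, top_n=None):
    """Build the JSON body for a jina-colbert-v2 rerank request."""
    documents = list(documents)
    if not documents:
        raise ValueError("at least one document is required")
    if top_n is None:
        top_n = len(documents)
    # Keep top_n between 1 and the number of documents supplied.
    top_n = max(1, min(top_n, len(documents)))
    return {
        "model": "jina-colbert-v2",
        "query": query,
        "top_n": top_n,
        "documents": documents,
    }


def rerank(query, documents, api_key, top_n=None):
    """POST the rerank request and return the decoded JSON response as-is."""
    request = urllib.request.Request(
        RERANK_URL,
        data=json.dumps(build_rerank_payload(query, documents, top_n)).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `rerank("What is the population of Berlin?", documents, api_key, top_n=1)` would request only the best match.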

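For code snippets in Python and other programming languages and frameworks, go to the Jina AI Embeddings API page, or select jina-colbert-v2 from the drop-down menu on the Jina Reranker API page.

Via Stanford ColBERT

You can also use Jina ColBERT v2 as a drop-in replacement for ColBERT v2 in the Stanford ColBERT library. Just specify jinaai/jina-colbert-v2 as the model source: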

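from colbert.infra import ColBERTConfig
from colbert.modeling.checkpoint import Checkpoint

# Load the checkpoint from Hugging Face with the default ColBERT configuration.
ckpt = Checkpoint("jinaai/jina-colbert-v2", colbert_config=ColBERTConfig())
queries = ["Your list of queries"]
docs = ["Your list of texts"]

# Queries and documents are encoded asymmetrically, so use the matching method for each.
query_vectors = ckpt.queryFromText(queries)
doc_vectors = ckpt.docFromText(docs)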

⚠️ To use the code above, you must install einops and flash_attn.

Via RAGatouille

Jina ColBERT v2 is also integrated into RAGatouille. You can download and use it via the RAGPretrainedModel.from_pretrained() method:

from ragatouille import RAGPretrainedModel

RAG = RAGPretrainedModel.from_pretrained("jinaai/jina-colbert-v2")
docs = ["Your list of texts"]
RAG.index(docs, index_name="your_index_name")
query = "Your query"
results = RAG.search(query)

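⚠️ To use the code above, you must install einops and flash_attn.

Via Qdrant

Since version 1.10, Qdrant has supported multi-vectors and late-interaction models. Existing users of the Qdrant engine, whether local or managed cloud versions, can integrate jina-colbert-v2 directly through Qdrant's client.

Create a new collection using the MAX_SIM operation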

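from qdrant_client import QdrantClient, models

qdrant_client = QdrantClient(
    url="<YOUR_ENDPOINT>",
    api_key="<YOUR_API_KEY>",
)

qdrant_client.create_collection(
    collection_name="{collection_name}",
    # Unnamed (default) vector, so that the upsert and query calls below,
    # which pass vectors without a name, resolve to this configuration.
    vectors_config=models.VectorParams(
        size=128,  # must match the "dimensions" used when embedding
        distance=models.Distance.COSINE,
        # MAX_SIM scores each query token against its best-matching document token.
        multivector_config=models.MultiVectorConfig(
            comparator=models.MultiVectorComparator.MAX_SIM
        ),
    ),
)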

⚠️ Setting the multivector_config parameter correctly is essential for using ColBERT-style models in Qdrant.

Insert documents into multi-vector collections

import requests
from qdrant_client import QdrantClient, models

url = 'https://api.jina.ai/v1/multi-vector'

headers = {
    'Content-Type': 'application/json',
    'Authorization': 'Bearer <YOUR BEARER>'
}

data = {
    'model': 'jina-colbert-v2',
    'input_type': 'document',  # these texts are stored in the collection, so embed them as documents
    'embedding_type': 'float',
    'input': [
        'Your text string goes here',
        'You can send multiple texts',
        'Each text can be up to 8192 tokens long'
    ]
}

response = requests.post(url, headers=headers, json=data)
rows = response.json()["data"]

qdrant_client = QdrantClient(
    url="<YOUR_ENDPOINT>",
    api_key="<YOUR_API_KEY>",
)

for i, row in enumerate(rows):
    qdrant_client.upsert(
        collection_name="{collection_name}",
        points=[
            models.PointStruct(
                id=i,  
                vector=row["embeddings"],  
                payload={"text": data["input"][i]} 
            )
        ],
    )

Query collections

from qdrant_client import QdrantClient, models
import requests

url = 'https://api.jina.ai/v1/multi-vector'

headers = {
    'Content-Type': 'application/json',
    'Authorization': 'Bearer <YOUR BEARER>'
}


data = {
    'model': 'jina-colbert-v2',
    "input_type": "query",
    "embedding_type": "float",
    "input": [
        "how many tokens in an input do Jina AI's embedding models support?"
    ]
}

response = requests.post(url, headers=headers, json=data)
vector = response.json()["data"][0]["embeddings"]


qdrant_client = QdrantClient(
    url="<YOUR_ENDPOINT>",
    api_key="<YOUR_API_KEY>",
)

results = qdrant_client.query_points(
    collection_name="{collection_name}",
    query=vector,
)

print(results)

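Summary

Jina ColBERT v2 (jina-colbert-v2) builds on the high performance of jina-colbert-v1-en and extends its capabilities to a wide range of global languages. With support for multiple embedding sizes, jina-colbert-v2 lets users tune the trade-off between precision and efficiency for their specific use cases, potentially bringing significant savings in time and computing costs.

This model combines all these features into a single, competitively priced package, accessible via an intuitive web API and compatible with any computing framework that supports HTTP requests. Try it out with 1 million free tokens to see how it can enhance your applications and processes.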