Jina ColBERT v2: Multilingual Late Interaction Retriever for Embedding and Reranking

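Today, we're excited to release Jina ColBERT v2 (jina-colbert-v2), an advanced late interaction retrieval model built on the ColBERT architecture. This new language model improves on the performance of jina-colbert-v1-en and adds multilingual support and dynamic output dimensions.

This release highlights the following features:

Superior retrieval performance compared to the original ColBERT-v2 (+6.5%) and our previous release, jina-colbert-v1-en (+5.4%).
Multilingual support for 89 languages, with strong performance across major global languages.
User-controlled output embedding sizes through Matryoshka representation learning, so users can flexibly balance efficiency against precision.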

jina-colbert-v2 Technical Summary

The full technical report is available on arXiv:

Jina-ColBERT-v2: A General-Purpose Multilingual Late Interaction Retriever

Multi-vector dense models, such as ColBERT, have proven highly effective in information retrieval. ColBERT's late interaction scoring approximates the joint query-document attention seen in cross-encoders while maintaining inference efficiency closer to traditional dense retrieval models, thanks to its bi-encoder architecture and recent optimizations in indexing and search. In this paper, we introduce several improvements to the ColBERT model architecture and training pipeline, leveraging techniques successful in the more established single-vector embedding model paradigm, particularly those suited for heterogeneous multilingual data. Our new model, Jina-ColBERT-v2, demonstrates strong performance across a range of English and multilingual retrieval tasks, while also cutting storage requirements by up to 50% compared to previous models.

arXiv.org, Rohan Jha

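	jina-colbert-v2	jina-colbert-v1-en	Original ColBERTv2
Average of 14 English BEIR tasks	0.521	0.494	0.489
Multilingual	89 languages	English-only	English-only
Output dimensions	128, 96, or 64	Fixed 128	Fixed 128
Max query length	32 tokens	32 tokens	32 tokens
Max document length	8192 tokens	8192 tokens	512 tokens
Parameters	560M	137M	110M
Model size	1.1GB	550MB	438MB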

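Asymmetric Embedding in ColBERT

ColBERT builds on the BERT architecture by adding late interaction and asymmetric query-document encoding.

The asymmetric nature of ColBERT means that when using models like jina-colbert-v2 or jina-colbert-v1-en, you need to specify whether you are embedding a query, a document, or both (for reranking). This added flexibility improves performance over homogeneous embedding models in retrieval tasks.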
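To make "late interaction" concrete, here is a minimal sketch of the MaxSim scoring commonly used with ColBERT-style models: each query token vector is matched with its most similar document token vector, and those maxima are summed. The function names and the random vectors are assumptions made for illustration; they stand in for real model output and are not the library's implementation.

```python
import numpy as np


def normalize(vectors: np.ndarray) -> np.ndarray:
    """L2-normalize each row so dot products become cosine similarities."""
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)


def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """Late-interaction score between one query and one document.

    query_vecs has shape (num_query_tokens, dim) and doc_vecs has shape
    (num_doc_tokens, dim); rows are assumed to be L2-normalized.
    """
    similarities = query_vecs @ doc_vecs.T  # (num_query_tokens, num_doc_tokens)
    # Keep the best-matching document token per query token, then sum.
    return float(similarities.max(axis=1).sum())


# Random stand-ins for token embeddings (hypothetical data, not model output).
rng = np.random.default_rng(0)
query = normalize(rng.normal(size=(4, 128)))
doc_unrelated = normalize(rng.normal(size=(10, 128)))
doc_matching = np.vstack([doc_unrelated[:5], query])  # contains the query tokens

score_unrelated = maxsim_score(query, doc_unrelated)
score_matching = maxsim_score(query, doc_matching)
print(score_unrelated, score_matching)
```

Because query and document token vectors are compared only at scoring time, the document vectors can be computed and indexed ahead of time, which is what separates this design from a cross-encoder.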

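Multilingual Support for Over 89 Languages

Jina ColBERT v2 has extensive multilingual capabilities, designed to meet the demands of modern, globalized information retrieval and AI applications. The training corpus for jina-colbert-v2 covers 89 languages, with additional training stages for major international languages, including Arabic, Chinese, English, French, German, Japanese, Russian, and Spanish, as well as programming languages. Training also included a corpus of aligned bilingual texts to unlock cross-lingual potential, allowing queries and documents in different languages to be matched in reranking and retrieval tasks.

[Image: chart of the language distribution in the training data, dominated by English and Chinese.] Distribution of the pretraining dataset by language (given as ISO-639 codes), shown on a logarithmic scale.

As of today, Jina ColBERT v2 is the only multilingual ColBERT-like model that generates compact embeddings, and it significantly outperforms BM25-based retrieval in every language tested on the MIRACL benchmark.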

[Image: bar chart comparing jina-colbert-v2 and BM25 on multilingual tasks.] Jina ColBERT v2 performance compared to BM25 for 16 languages on the MIRACL benchmark.

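Furthermore, on English-language retrieval tasks, Jina ColBERT v2 outperforms its predecessor jina-colbert-v1-en and the original ColBERT v2 model, with performance comparable to the highly specialized English-only AnswerAI-ColBERT-small model.

Model name	Average score (14 English-only BEIR benchmarks)	Multilingual support
jina-colbert-v2	0.521	Multilingual
jina-colbert-v1-en	0.494	English-only
ColBERT v2.0	0.489	English-only
AnswerAI-ColBERT-small	0.549	English-only

[Image: bar chart of evaluation results for models such as jina-colbert and BM25 on English BEIR datasets.] Evaluation of jina-colbert-v2 on selected English-only datasets of the BEIR benchmark.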

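Matryoshka Representation Learning

Matryoshka Representation Learning is a technique for training models to support different output vector sizes while minimizing any loss in accuracy. We train the hidden layers of the network with several different linear projection heads (the final layers of a neural network), each supporting a different output size. Jina ColBERT v2 supports output vectors of 128, 96, and 64 dimensions.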
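The idea of several projection heads sharing one set of hidden layers can be sketched as follows. This shows only the forward pass and the resulting shapes, not the training procedure; the hidden width of 1024, the random weights, and the L2 normalization are assumptions made for illustration rather than details confirmed by this article.

```python
import numpy as np

rng = np.random.default_rng(0)

HIDDEN_SIZE = 1024           # hypothetical width of the shared hidden states
OUTPUT_DIMS = (128, 96, 64)  # output sizes named in the text

# One linear projection head per output size. Random weights stand in for
# trained ones (assumption for illustration).
heads = {
    dim: rng.normal(scale=HIDDEN_SIZE ** -0.5, size=(HIDDEN_SIZE, dim))
    for dim in OUTPUT_DIMS
}


def project(hidden_states: np.ndarray, dim: int) -> np.ndarray:
    """Map per-token hidden states to `dim`-sized, L2-normalized vectors."""
    vectors = hidden_states @ heads[dim]
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)


# Hidden states for a 12-token input (random stand-in for encoder output).
hidden_states = rng.normal(size=(12, HIDDEN_SIZE))
outputs = {dim: project(hidden_states, dim) for dim in OUTPUT_DIMS}

for dim, vectors in outputs.items():
    print(dim, vectors.shape)
```

Every head reads the same hidden states, so the encoder runs once regardless of which output size is requested; only the final projection differs.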

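By default, Jina ColBERT v2 produces 128-dimension output embeddings, but it can also produce 96- and 64-dimension embeddings that perform nearly as well while being 25% and 50% shorter, respectively.

The table below shows the nDCG@10 performance (nDCG over the top ten results) of jina-colbert-v2 on six BEIR benchmarks. The average score falls by only 0.007 (about 1.2% relative) from 128 to 96 dimensions, and by 0.009 (about 1.6% relative) from 128 to 64 dimensions.

Output dimensions	Average score (nDCG@10 over 6 benchmarks)
128	0.565
96	0.558
64	0.556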
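As a quick check on the trade-off, the short script below recomputes the absolute and relative drop in average nDCG@10 from the table, alongside the relative vector size. It uses only the three scores listed above; the variable names are illustrative.

```python
# Average nDCG@10 per output dimension, copied from the table above.
scores = {128: 0.565, 96: 0.558, 64: 0.556}
baseline = scores[128]

relative_drop = {}
for dim, score in scores.items():
    drop = baseline - score
    relative_drop[dim] = round(100 * drop / baseline, 1)
    size_percent = 100 * dim / 128
    print(
        f"{dim:>3} dims: nDCG@10={score:.3f}, "
        f"drop={drop:.3f} ({relative_drop[dim]}% relative), "
        f"vector size={size_percent:.0f}% of default"
    )
```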

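[Image: bar chart of BEIR benchmark scores on datasets from nfcorpus to msmarco, with jina-colbert-v2.64 performing well.] Jina ColBERT v2 performance at different output dimensions.

Reducing the size of the output vectors saves space and speeds up applications, such as vector-based information retrieval, that have to compare vectors or measure the distance between them.

This has significant cost consequences, even in storage alone. For example, using Qdrant's cloud cost calculator, storing 100 million documents on AWS with a 128-dimension vector for each has an estimated cost of US$1,319.24 per month. At 64 dimensions, this falls to US$659.62.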
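The halving follows directly from the size of the raw vectors. The sketch below estimates raw float32 storage under stated assumptions: 4 bytes per value, and a hypothetical `tokens_per_doc` parameter for the number of vectors stored per document (1 matches the one-vector-per-document framing of the example above). It ignores index overhead and is not how Qdrant computes its prices; the two monthly figures are simply the ones quoted in the text.

```python
BYTES_PER_FLOAT32 = 4


def raw_vector_storage_gb(num_docs: int, tokens_per_doc: int, dim: int) -> float:
    """Raw float32 storage in GB, ignoring index and payload overhead."""
    return num_docs * tokens_per_doc * dim * BYTES_PER_FLOAT32 / 1e9


NUM_DOCS = 100_000_000
TOKENS_PER_DOC = 1  # hypothetical: one stored vector per document

size_128 = raw_vector_storage_gb(NUM_DOCS, TOKENS_PER_DOC, 128)
size_64 = raw_vector_storage_gb(NUM_DOCS, TOKENS_PER_DOC, 64)

# Monthly cost figures quoted in the text above.
cost_128, cost_64 = 1319.24, 659.62

print(f"128 dims: {size_128:.1f} GB, 64 dims: {size_64:.1f} GB")
print(f"storage ratio: {size_64 / size_128:.2f}, cost ratio: {cost_64 / cost_128:.2f}")
```

With a multi-vector model the number of stored vectors grows with document length, so raising `tokens_per_doc` scales both sizes equally and leaves the ratio unchanged.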

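Getting Started with Jina ColBERT v2

Jina ColBERT v2 is available via the Jina Search Foundation API, the AWS marketplace, and Azure. It is also available on Hugging Face, but for non-commercial use only (CC BY-NC-4.0).

Via Jina Search Foundation API

For embedding

The following curl command shows how to specify the input and options to get document embeddings from jina-colbert-v2 via the Jina Embeddings API. To get the vector size you want, specify 128 or 64 for the dimensions parameter. This parameter is optional and defaults to 128.

Input documents longer than 8192 tokens will be truncated.

Specify your Jina API key in the authorization header Authorization: Bearer <YOUR JINA API KEY>: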

# "dimensions" may be set to 64 for half-size vectors.
# "input_type" is "document" here; see below for query embeddings.
curl https://api.jina.ai/v1/multi-vector \
	 -H "Content-Type: application/json" \
	 -H "Authorization: Bearer <YOUR JINA API KEY>" \
	 -d '{
	"model": "jina-colbert-v2",
	"dimensions": 128,
	"input_type": "document",
	"embedding_type": "float",
	"input": [
		"Your document text string goes here",
		"You can send multiple texts",
		"Each text can be up to 8192 tokens long"
	]}'

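To get query embeddings, set the input_type parameter to query instead of document. Note that queries have much stricter size limits than documents: they are truncated at 32 tokens. Query encoding always returns 32 token embeddings; if the query is shorter than 32 tokens, the result includes embeddings for the padding.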

# "dimensions" may be set to 64 for half-size vectors.
# "input_type" must be "query" to get query embeddings.
curl https://api.jina.ai/v1/multi-vector \
	 -H "Content-Type: application/json" \
	 -H "Authorization: Bearer <YOUR JINA API KEY>" \
	 -d '{
	"model": "jina-colbert-v2",
	"dimensions": 128,
	"input_type": "query",
	"embedding_type": "float",
	"input": [
		"Your query text string goes here",
		"You can send multiple texts",
		"Each query text can be up to 32 tokens long"
	]}'
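For readers working in Python, the two curl requests above can be assembled with a small helper like the one sketched below. The helper name `build_multivector_request` and its validation rules are assumptions made for illustration and are not part of any Jina client library; the endpoint, field names, and limits are the ones shown above. Nothing is sent over the network here.

```python
import json

API_URL = "https://api.jina.ai/v1/multi-vector"
ALLOWED_DIMENSIONS = (128, 64)  # values the text above names for `dimensions`
MAX_TOKENS = {"document": 8192, "query": 32}  # truncation limits from the text


def build_multivector_request(texts, input_type="document", dimensions=128,
                              api_key="<YOUR JINA API KEY>"):
    """Return (headers, payload) for the multi-vector endpoint.

    Hypothetical helper: it only assembles and validates the request.
    """
    if input_type not in MAX_TOKENS:
        raise ValueError(f"input_type must be one of {sorted(MAX_TOKENS)}")
    if dimensions not in ALLOWED_DIMENSIONS:
        raise ValueError(f"dimensions must be one of {ALLOWED_DIMENSIONS}")
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    }
    payload = {
        "model": "jina-colbert-v2",
        "dimensions": dimensions,
        "input_type": input_type,
        "embedding_type": "float",
        "input": list(texts),
    }
    return headers, payload


headers, payload = build_multivector_request(
    ["What is late interaction?"], input_type="query", dimensions=64
)
print(json.dumps(payload, indent=2))
# To send it (requires the `requests` package and a real API key):
# response = requests.post(API_URL, headers=headers, json=payload)
```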

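For reranking

To use jina-colbert-v2 via the Jina Reranker API, passing in a query and several documents and getting back rankable match scores, construct your request like the one below: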

# The JSON body is passed on stdin through a quoted heredoc, so apostrophes
# in the documents (for example "Berlin's") do not break shell quoting.
curl https://api.jina.ai/v1/rerank \
	 -H "Content-Type: application/json" \
	 -H "Authorization: Bearer <YOUR JINA API KEY>" \
	 -d @- <<'EOF'
{
      "model": "jina-colbert-v2",
      "query": "What is the population of Berlin?",
      "top_n": 3,
      "documents": [
        "Berlin's population grew by 0.7 percent in 2023 compared with the previous year. Accordingly, around 27,300 more residents lived in Berlin at the end of the last year than in 2022. Those of 30 to under 40 years old form the numerically largest age group. With roughly 881,000 foreign residents from around 170 nations and an average age of the population of 42.5 years old.",
        "Mount Berlin is a glacier-covered volcano in Marie Byrd Land, Antarctica, 100 kilometres (62 mi) from the Amundsen Sea. It is a roughly 20-kilometre-wide (12 mi) mountain with parasitic vents that consists of two coalesced volcanoes: Berlin proper with the 2-kilometre-wide (1.2 mi) Berlin Crater and Merrem Peak with a 2.5-by-1-kilometre-wide (1.55 mi × 0.62 mi) crater, 3.5 kilometres (2.2 mi) away from Berlin.",
        "Population as of 31.12.2023 by nationality and federal states Land\tTotal\tGermans\tForeigners\tincluding EU-states number\t%\tnumber\t%",
        "The urban area of Berlin has a population of over 4.5 million and is therefore the most populous urban area in Germany. The Berlin-Brandenburg capital region has around 6.2 million inhabitants and is Germany's second-largest metropolitan region after the Rhine-Ruhr region, and the sixth-biggest metropolitan region by GDP in the European Union.",
        "Irving Berlin (born Israel Beilin) was an American composer and songwriter. His music forms a large part of the Great American Songbook. Berlin received numerous honors including an Academy Award, a Grammy Award, and a Tony Award.",
        "Berlin is a town in the Capitol Planning Region, Connecticut, United States. The population was 20,175 at the 2020 census.",
        "Berlin is the capital and largest city of Germany, both by area and by population. Its more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits.",
        "Berlin, Berlin ist eine für die ARD produzierte Fernsehserie, die von 2002 bis 2005 im Vorabendprogramm des Ersten ausgestrahlt wurde. Regie führten unter anderem Franziska Meyer Price, Christoph Schnee, Sven Unterwaldt Jr. und Titus Selge."
        ]
}
EOF

Note the top_n argument, which specifies the number of documents you want to retrieve. For example, if your application only uses the top match, set top_n to 1.
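The effect of top_n can be pictured as sorting the documents by relevance score and keeping the first n. The sketch below does this locally on made-up scores; the scores, the short document labels, and the helper name `top_n_results` are hypothetical and do not represent actual API output.

```python
def top_n_results(scored_documents, top_n):
    """Return the top_n (score, text) pairs, highest score first."""
    ranked = sorted(scored_documents, key=lambda pair: pair[0], reverse=True)
    return ranked[:top_n]


# Hypothetical relevance scores, made up for illustration only.
scored_documents = [
    (0.12, "document about Mount Berlin, Antarctica"),
    (0.87, "document about Berlin, capital of Germany"),
    (0.35, "document about Irving Berlin"),
    (0.74, "document about the Berlin urban area"),
]

best = top_n_results(scored_documents, top_n=1)
top_three = top_n_results(scored_documents, top_n=3)

for score, text in top_three:
    print(f"{score:.2f}  {text}")
```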

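For code snippets in Python and other programming languages and frameworks, go to the Jina AI Embeddings API page, or select jina-colbert-v2 from the drop-down menu on the Jina Reranker API page.

Via Stanford ColBERT

You can also use Jina ColBERT v2 as a drop-in replacement for ColBERT v2 in the Stanford ColBERT library. Just specify jinaai/jina-colbert-v2 as the model source: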

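from colbert.infra import ColBERTConfig
from colbert.modeling.checkpoint import Checkpoint

# Load the Jina checkpoint through Stanford ColBERT's Checkpoint class.
ckpt = Checkpoint("jinaai/jina-colbert-v2", colbert_config=ColBERTConfig())
queries = ["Your list of texts"]
# queryFromText encodes the texts as queries; use docFromText to encode documents.
query_vectors = ckpt.queryFromText(queries)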

⚠️ You must install einops and flash_attn to use the code above.

Via RAGatouille

Jina ColBERT v2 is similarly integrated into RAGatouille. You can download and use it via the RAGPretrainedModel.from_pretrained() method:

from ragatouille import RAGPretrainedModel

RAG = RAGPretrainedModel.from_pretrained("jinaai/jina-colbert-v2")
docs = ["Your list of texts"]
RAG.index(docs, index_name="your_index_name")
query = "Your query"
results = RAG.search(query)

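⚠️ You must install einops and flash_attn to use the code above.

Via Qdrant

As of version 1.10, Qdrant supports multivectors and late-interaction models. Existing users of the Qdrant engine, whether local or managed cloud, can integrate jina-colbert-v2 directly using Qdrant's client.

Creating a new collection using the MAX_SIM operation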

from qdrant_client import QdrantClient, models

qdrant_client = QdrantClient(
    url="<YOUR_ENDPOINT>",
    api_key="<YOUR_API_KEY>",
)

qdrant_client.create_collection(
    collection_name="{collection_name}",
    vectors_config={
        "colbert": models.VectorParams(
            size=128,
            distance=models.Distance.COSINE,
            multivector_config=models.MultiVectorConfig(
                comparator=models.MultiVectorComparator.MAX_SIM
            ),
        )
    }
)

⚠️ Correctly setting the multivector_config parameter is essential for using ColBERT-style models in Qdrant.

Inserting documents into multivector collections

import requests
from qdrant_client import QdrantClient, models

url = 'https://api.jina.ai/v1/multi-vector'

headers = {
    'Content-Type': 'application/json',
    'Authorization': 'Bearer <YOUR BEARER>'
}

data = {
    'model': 'jina-colbert-v2',
    'input_type': 'document',  # these texts are stored as documents, not queries
    'embedding_type': 'float',
    'input': [
        'Your text string goes here',
        'You can send multiple texts',
        'Each text can be up to 8192 tokens long'
    ]
}

response = requests.post(url, headers=headers, json=data)
rows = response.json()["data"]

qdrant_client = QdrantClient(
    url="<YOUR_ENDPOINT>",
    api_key="<YOUR_API_KEY>",
)

for i, row in enumerate(rows):
    qdrant_client.upsert(
        collection_name="{collection_name}",
        points=[
            models.PointStruct(
                id=i,  
                vector={"colbert": row["embeddings"]},  # "colbert" is the named vector defined at collection creation
                payload={"text": data["input"][i]} 
            )
        ],
    )
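The loop above sends one upsert request per document. As a rough sketch of batching, the helper below turns the API rows into plain point dictionaries that could be sent in a single call. The name `rows_to_points` and the tiny sample rows are hypothetical; the vector name "colbert" matches the collection created earlier, and the "embeddings" key is the one used in the code above.

```python
def rows_to_points(rows, texts, vector_name="colbert", start_id=0):
    """Pair each API row with its source text as a Qdrant-style point dict."""
    if len(rows) != len(texts):
        raise ValueError("rows and texts must have the same length")
    return [
        {
            "id": start_id + i,
            "vector": {vector_name: row["embeddings"]},  # named multivector
            "payload": {"text": text},
        }
        for i, (row, text) in enumerate(zip(rows, texts))
    ]


# Tiny made-up rows standing in for response.json()["data"].
rows = [
    {"embeddings": [[0.1, 0.2], [0.3, 0.4]]},
    {"embeddings": [[0.5, 0.6]]},
]
texts = ["first text", "second text"]

points = rows_to_points(rows, texts)
print(points[0])
# With qdrant_client installed, the batch could then go in one request:
# qdrant_client.upsert(
#     collection_name="{collection_name}",
#     points=[models.PointStruct(**p) for p in points],
# )
```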

Querying collections

from qdrant_client import QdrantClient, models
import requests

url = 'https://api.jina.ai/v1/multi-vector'

headers = {
    'Content-Type': 'application/json',
    'Authorization': 'Bearer <YOUR BEARER>'
}


data = {
    'model': 'jina-colbert-v2',
    "input_type": "query",
    "embedding_type": "float",
    "input": [
        "how many tokens in an input do Jina AI's embedding models support?"
    ]
}

response = requests.post(url, headers=headers, json=data)
vector = response.json()["data"][0]["embeddings"]


qdrant_client = QdrantClient(
    url="<YOUR_ENDPOINT>",
    api_key="<YOUR_API_KEY>",
)

results = qdrant_client.query_points(
    collection_name="{collection_name}",
    query=vector,
    using="colbert",  # named multivector defined when the collection was created
)

print(results)

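Conclusion

Jina ColBERT v2 (jina-colbert-v2) builds on the high performance of jina-colbert-v1-en and extends its capabilities to a wide range of global languages. With support for multiple embedding sizes, jina-colbert-v2 lets users tune the precision/efficiency trade-off to their specific use case, potentially bringing significant savings in time and computing costs.

This model combines all these features in a single, competitively priced package, accessible through an intuitive web API and compatible with any computing framework that supports HTTP requests. Try it out with 1 million free tokens to see how it can enhance your applications and processes.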