Jina ColBERT v2: Multilingual Late Interaction Retriever for Embedding and Reranking

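Today, we are excited to release Jina ColBERT v2 (jina-colbert-v2), an advanced late interaction retrieval model built on the ColBERT architecture. This new language model improves on the performance of jina-colbert-v1-en and adds multilingual support and dynamic output dimensions.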

This new release has the following features:

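Superior retrieval performance compared to the original ColBERT-v2 (+6.5%) and our previous release, jina-colbert-v1-en (+5.4%).
Multilingual support for 89 languages, with strong performance across major global languages.
User-controlled output embedding sizes through Matryoshka representation learning, so users can flexibly balance efficiency and precision.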

Technical Summary of jina-colbert-v2

The full technical report is available on arXiv:

Jina-ColBERT-v2: A General-Purpose Multilingual Late Interaction Retriever

Multi-vector dense models, such as ColBERT, have proven highly effective in information retrieval. ColBERT's late interaction scoring approximates the joint query-document attention seen in cross-encoders while maintaining inference efficiency closer to traditional dense retrieval models, thanks to its bi-encoder architecture and recent optimizations in indexing and search. In this paper, we introduce several improvements to the ColBERT model architecture and training pipeline, leveraging techniques successful in the more established single-vector embedding model paradigm, particularly those suited for heterogeneous multilingual data. Our new model, Jina-ColBERT-v2, demonstrates strong performance across a range of English and multilingual retrieval tasks, while also cutting storage requirements by up to 50% compared to previous models.

arXiv.org · Rohan Jha

	jina-colbert-v2	jina-colbert-v1-en	Original ColBERTv2
Average of 14 English BEIR tasks	0.521	0.494	0.489
Multilingual	89 languages	English-only	English-only
Output dimensions	128, 96, or 64	Fixed 128	Fixed 128
Max query length	32 tokens	32 tokens	32 tokens
Max document length	8192 tokens	8192 tokens	512 tokens
Parameters	560M	137M	110M
Model size	1.1GB	550MB	438MB

Asymmetric Embedding in ColBERT

ColBERT builds on the BERT architecture by adding late interaction and asymmetric query-document encoding.

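The asymmetric nature of ColBERT means that when you use models like jina-colbert-v2 or jina-colbert-v1-en, you need to specify whether you are embedding a query, a document, or both (for reranking). This added flexibility improves performance over homogeneous embedding models in retrieval tasks.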
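To make the idea of late interaction concrete, here is a minimal sketch of the MaxSim scoring rule generally associated with ColBERT-style models: each query token vector is matched to its most similar document token vector, and those maxima are summed. The function names and the random vectors are illustrative assumptions, not output from jina-colbert-v2.

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """Late-interaction score between one query and one document.

    query_vecs: (num_query_tokens, dim) token embeddings, L2-normalized
    doc_vecs:   (num_doc_tokens, dim) token embeddings, L2-normalized
    """
    # Similarity of every query token to every document token.
    sim = query_vecs @ doc_vecs.T  # shape: (num_query_tokens, num_doc_tokens)
    # Keep the best-matching document token per query token, then sum.
    return float(sim.max(axis=1).sum())

# Random stand-ins for token embeddings (assumption: 32 query tokens, 128 dims).
rng = np.random.default_rng(0)
query = l2_normalize(rng.normal(size=(32, 128)))
doc_a = l2_normalize(rng.normal(size=(200, 128)))
# doc_b contains the query's own token vectors, so every query token
# finds a perfect match in it.
doc_b = np.vstack([doc_a[:100], query])

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(score_a < score_b)
```

Because queries and documents are encoded separately, document token vectors can be computed and stored ahead of time, and only the inexpensive MaxSim step runs at query time.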

Multilingual Support For Over 89 Languages

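Jina ColBERT v2 has extensive multilingual capabilities, designed to meet the needs of modern, globalized information retrieval and AI applications. The training corpus for jina-colbert-v2 covers 89 languages, with additional training stages for major international languages (including Arabic, Chinese, English, French, German, Japanese, Russian, and Spanish) as well as programming languages. Training also included a corpus of aligned bilingual texts to unlock cross-lingual potential, so that queries and documents in different languages can be matched in reranking and retrieval tasks.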

Figure: Language distribution of the pre-training dataset, by ISO-639 code, on a logarithmic scale.

Jina ColBERT v2 is currently the only multilingual ColBERT-like model that generates compact embeddings, and it significantly outperforms BM25-based retrieval in every language tested on the MIRACL benchmark.

Figure: Performance of Jina ColBERT v2 compared with BM25 across 16 languages on the MIRACL benchmark.

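Furthermore, on English-language retrieval tasks, Jina ColBERT v2 outperforms its predecessor jina-colbert-v1-en and the original ColBERT v2 model, and it approaches the performance of AnswerAI-ColBERT-small, a specialized English-only model (an average of 0.521 versus 0.549 in the table below).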

Model name	Average score (14 BEIR English-only benchmarks)	Multilingual support
jina-colbert-v2	0.521	Multilingual
jina-colbert-v1-en	0.494	English-only
ColBERT v2.0	0.489	English-only
AnswerAI-ColBERT-small	0.549	English-only

Figure: Evaluation of jina-colbert-v2 on English-only datasets from the BEIR benchmark.

Matryoshka Representation Learning

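Matryoshka Representation Learning is a technique for training models to support different output vector sizes while minimizing any loss in accuracy. The hidden layers of the network are trained with several different linear projection heads (the final layers of the neural network), each supporting a different output size. Jina ColBERT v2 supports output vectors of 128, 96, and 64 dimensions.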

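By default, Jina ColBERT v2 produces 128-dimension output embeddings, but it can also produce 96- and 64-dimension embeddings that perform nearly as well while being 25% and 50% shorter, respectively.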
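The arrangement described above, with one linear projection head per output size on top of shared hidden layers, can be sketched roughly as follows. This is only an illustration of the shapes involved: the hidden width of 1024, the random weights, and the names `heads` and `project` are assumptions, not the model's actual parameters or code.

```python
import numpy as np

rng = np.random.default_rng(0)

HIDDEN_SIZE = 1024            # assumed encoder hidden width, for illustration only
OUTPUT_SIZES = (128, 96, 64)  # output sizes named in the text

# One linear projection head per output size.
# Random weights stand in for trained ones.
heads = {d: rng.normal(scale=0.02, size=(HIDDEN_SIZE, d)) for d in OUTPUT_SIZES}

def project(hidden_states: np.ndarray, dim: int) -> np.ndarray:
    """Map (num_tokens, HIDDEN_SIZE) hidden states to unit-length (num_tokens, dim) vectors."""
    if dim not in heads:
        raise ValueError(f"unsupported output size: {dim}")
    out = hidden_states @ heads[dim]
    return out / np.linalg.norm(out, axis=1, keepdims=True)

# Ten token positions from the shared hidden layers (random stand-ins).
hidden = rng.normal(size=(10, HIDDEN_SIZE))
embeddings = {d: project(hidden, d) for d in OUTPUT_SIZES}
shapes = {d: e.shape for d, e in embeddings.items()}
print(shapes)
```

The same hidden states feed every head, so choosing a smaller output size changes only the final projection and the size of what is stored.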

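The table below shows the nDCG@10 performance (top ten results) of jina-colbert-v2, averaged over six datasets from the BEIR benchmark. Based on the averages in the table, the score drops by only 0.007 between 128 and 96 dimensions and by 0.009 between 128 and 64 dimensions, which is under one point of nDCG@10 in both cases.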

Output dimensions	Average score (nDCG@10 across 6 benchmarks)
128	0.565
96	0.558
64	0.556

Figure: Performance of Jina ColBERT v2 at different output dimensions.

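Reducing the size of the output vectors saves space and speeds up applications such as vector-based information retrieval, which have to compare vectors or measure the distance between them.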

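The reduction in storage alone has significant cost consequences. For example, according to Qdrant's cloud cost calculator, storing 100 million documents on AWS with one 128-dimension vector each has an estimated cost of US$1,319.24 per month. At 64 dimensions, this falls to US$659.62.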
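The arithmetic behind these figures can be checked with a short sketch. It assumes 32-bit floats (4 bytes per dimension) and that cost scales linearly with vector size, which matches the two quoted prices but is not a statement about how Qdrant actually prices storage. The function names are illustrative.

```python
BYTES_PER_FLOAT32 = 4  # assumption: embeddings stored as 32-bit floats

def vector_bytes(dim: int) -> int:
    """Raw size in bytes of one float32 vector with `dim` dimensions."""
    return dim * BYTES_PER_FLOAT32

def scaled_cost(base_cost: float, base_dim: int, new_dim: int) -> float:
    """Cost at `new_dim`, assuming cost is proportional to vector size."""
    return round(base_cost * new_dim / base_dim, 2)

cost_128 = 1319.24  # monthly estimate quoted in the text for 128 dimensions
cost_64 = scaled_cost(cost_128, 128, 64)

print(vector_bytes(128), vector_bytes(64))  # raw bytes per vector
print(cost_64)                              # estimated monthly cost at 64 dimensions
```

Under these assumptions, halving the dimension halves both the raw vector size and the estimated bill.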

Getting Started with Jina ColBERT v2

Jina ColBERT v2 is available via the Jina Search Foundation API, the AWS Marketplace, and on Azure. It is also available for non-commercial use only (CC BY-NC-4.0) via Hugging Face.

Via Jina Search Foundation API

For Embedding

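The following curl command shows how to specify the input and options for getting document embeddings from jina-colbert-v2 via the Jina Embeddings API. To get vectors of your preferred size, set the dimensions parameter to 128 or 64. This parameter is optional, and the default value is 128.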

Input documents longer than 8192 tokens are truncated.

Specify your Jina API key in the authorization header Authorization: Bearer <YOUR JINA API KEY>:

# Set "dimensions" to 64 to halve the vector size.
# "input_type" is "document" here; use "query" for query embeddings.
curl https://api.jina.ai/v1/multi-vector \
	 -H "Content-Type: application/json" \
	 -H "Authorization: Bearer <YOUR JINA API KEY>" \
	 -d '{
	"model": "jina-colbert-v2",
	"dimensions": 128,
	"input_type": "document",
	"embedding_type": "float",
	"input": [
		"Your document text string goes here",
		"You can send multiple texts",
		"Each text can be up to 8192 tokens long"
    ]}'
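For readers who prefer to build the same request in Python, the following sketch constructs an equivalent JSON body and HTTP request using only the standard library. The helper names `build_payload` and `build_request` are hypothetical; the endpoint, field names, and placeholder key are taken from the curl command above. The request is constructed but not sent.

```python
import json
import urllib.request

API_URL = "https://api.jina.ai/v1/multi-vector"

def build_payload(texts, input_type="document", dimensions=128):
    """Build the JSON body for an embedding request (hypothetical helper)."""
    if input_type not in ("document", "query"):
        raise ValueError("input_type must be 'document' or 'query'")
    return {
        "model": "jina-colbert-v2",
        "dimensions": dimensions,
        "input_type": input_type,
        "embedding_type": "float",
        "input": list(texts),
    }

def build_request(payload, api_key):
    """Wrap the payload in a POST request carrying the bearer token."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )

payload = build_payload(["Your document text string goes here"], dimensions=64)
request = build_request(payload, api_key="<YOUR JINA API KEY>")
# To send it, pass `request` to urllib.request.urlopen and parse the
# response body with json.load.
```

Serializing with `json.dumps` guarantees a well-formed body, which is easy to get wrong when editing JSON by hand inside a shell string.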

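To get query embeddings, set the input_type parameter to query instead of document. Note that queries have much stricter size limits than documents: they are truncated at 32 tokens. Query encoding always returns 32 tokens, including embeddings for the padding when the query is shorter than 32 tokens.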

# Set "dimensions" to 64 to halve the vector size.
# "input_type" must be "query" to get query embeddings.
curl https://api.jina.ai/v1/multi-vector \
	 -H "Content-Type: application/json" \
	 -H "Authorization: Bearer <YOUR JINA API KEY>" \
	 -d '{
	"model": "jina-colbert-v2",
	"dimensions": 128,
	"input_type": "query",
	"embedding_type": "float",
	"input": [
		"Your query text string goes here",
		"You can send multiple texts",
		"Each query text can be up to 32 tokens long"
    ]}'
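The fixed query length described above (truncate at 32 tokens, pad shorter queries up to 32) can be pictured with a small sketch. The whitespace tokenization and the `[MASK]` padding symbol are assumptions made for illustration; the actual tokenizer and padding token used by the model are not stated in this article.

```python
QUERY_MAXLEN = 32
PAD_TOKEN = "[MASK]"  # hypothetical padding symbol, for illustration only

def fit_query(tokens, maxlen=QUERY_MAXLEN, pad=PAD_TOKEN):
    """Truncate or pad a token list so it always has exactly `maxlen` entries."""
    tokens = list(tokens)[:maxlen]
    return tokens + [pad] * (maxlen - len(tokens))

short_query = fit_query("what is the population of berlin".split())
long_query = fit_query(["tok"] * 50)

print(len(short_query), len(long_query))
```

Since every position yields one embedding, a query always produces 32 vectors, whether the positions hold real tokens or padding.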

For Reranking

To use jina-colbert-v2 via the Jina Reranker API, passing in one query and several documents and getting back rankable match scores, construct your request like the one below:

curl https://api.jina.ai/v1/rerank \
	 -H "Content-Type: application/json" \
	 -H "Authorization: Bearer <YOUR JINA API KEY>" \
	 -d '{
      "model": "jina-colbert-v2",
      "query": "What is the population of Berlin?",
      "top_n": 3,
      "documents": [
        "Berlin's population grew by 0.7 percent in 2023 compared with the previous year. Accordingly, around 27,300 more residents lived in Berlin at the end of the last year than in 2022. Those of 30 to under 40 years old form the numerically largest age group. With roughly 881,000 foreign residents from around 170 nations and an average age of the population of 42.5 years old.",
        "Mount Berlin is a glacier-covered volcano in Marie Byrd Land, Antarctica, 100 kilometres (62 mi) from the Amundsen Sea. It is a roughly 20-kilometre-wide (12 mi) mountain with parasitic vents that consists of two coalesced volcanoes: Berlin proper with the 2-kilometre-wide (1.2 mi) Berlin Crater and Merrem Peak with a 2.5-by-1-kilometre-wide (1.55 mi × 0.62 mi) crater, 3.5 kilometres (2.2 mi) away from Berlin.",
        "Population as of 31.12.2023 by nationality and federal states Land\tTotal\tGermans\tForeigners\tincluding EU-states number\t%\tnumber\t%",
        "The urban area of Berlin has a population of over 4.5 million and is therefore the most populous urban area in Germany. The Berlin-Brandenburg capital region has around 6.2 million inhabitants and is Germany's second-largest metropolitan region after the Rhine-Ruhr region, and the sixth-biggest metropolitan region by GDP in the European Union.",
        "Irving Berlin (born Israel Beilin) was an American composer and songwriter. His music forms a large part of the Great American Songbook. Berlin received numerous honors including an Academy Award, a Grammy Award, and a Tony Award.",
        "Berlin is a town in the Capitol Planning Region, Connecticut, United States. The population was 20,175 at the 2020 census.",
        "Berlin is the capital and largest city of Germany, both by area and by population. Its more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits.",
        "Berlin, Berlin ist eine für die ARD produzierte Fernsehserie, die von 2002 bis 2005 im Vorabendprogramm des Ersten ausgestrahlt wurde. Regie führten unter anderem Franziska Meyer Price, Christoph Schnee, Sven Unterwaldt Jr. und Titus Selge."
        ]
    }'

Note the top_n argument, which specifies the number of documents you want returned. For example, if your application only uses the top match, set top_n to 1.

For code snippets in Python and other programming languages and frameworks, see the Jina AI Embeddings API page, or select jina-colbert-v2 from the drop-down menu on the Jina Reranker API page.

Via Stanford ColBERT

You can also use Jina ColBERT v2 as a drop-in replacement for ColBERT v2 in the Stanford ColBERT library. Just specify jinaai/jina-colbert-v2 as the model source:

from colbert.infra import ColBERTConfig
from colbert.modeling.checkpoint import Checkpoint

ckpt = Checkpoint("jinaai/jina-colbert-v2", colbert_config=ColBERTConfig())
# queryFromText encodes the texts as queries; use ckpt.docFromText for documents.
queries = ["Your list of texts"]
query_vectors = ckpt.queryFromText(queries)

⚠️

You must install einops and flash_attn to use the above code.

Via RAGatouille

Jina ColBERT v2 is similarly integrated into RAGatouille. You can download and use it via the RAGPretrainedModel.from_pretrained() method:

from ragatouille import RAGPretrainedModel

RAG = RAGPretrainedModel.from_pretrained("jinaai/jina-colbert-v2")
docs = ["Your list of texts"]
RAG.index(docs, index_name="your_index_name")
query = "Your query"
results = RAG.search(query)

⚠️

You must install einops and flash_attn to use the above code.

Via Qdrant

Since version 1.10, Qdrant has supported multi-vectors and late-interaction models. Existing users of the Qdrant engine, whether local or managed cloud, can integrate jina-colbert-v2 directly using Qdrant's client.

Creating a new collection using the MAX_SIM operation

from qdrant_client import QdrantClient, models

qdrant_client = QdrantClient(
    url="",
    api_key="",
)

qdrant_client.create_collection(
    collection_name="{collection_name}",
    vectors_config={
        "colbert": models.VectorParams(
            size=128,
            distance=models.Distance.COSINE,
            multivector_config=models.MultiVectorConfig(
                comparator=models.MultiVectorComparator.MAX_SIM
            ),
        )
    }
)

⚠️

Setting the multivector_config parameter correctly is essential for using ColBERT-style models in Qdrant.

Inserting documents into multi-vector collections

import requests
from qdrant_client import QdrantClient, models

url = 'https://api.jina.ai/v1/multi-vector'

headers = {
    'Content-Type': 'application/json',
    'Authorization': 'Bearer <YOUR JINA API KEY>'
}

data = {
    'model': 'jina-colbert-v2',
    'input_type': 'document',  # these texts are stored as documents, not queries
    'embedding_type': 'float',
    'input': [
        'Your text string goes here',
        'You can send multiple texts',
        'Each text can be up to 8192 tokens long'
    ]
}

response = requests.post(url, headers=headers, json=data)
rows = response.json()["data"]

qdrant_client = QdrantClient(
    url="",
    api_key="",
)

for i, row in enumerate(rows):
    qdrant_client.upsert(
        collection_name="{collection_name}",
        points=[
            models.PointStruct(
                id=i,  
                vector={"colbert": row["embeddings"]},  # named vector, as configured in the collection
                payload={"text": data["input"][i]} 
            )
        ],
    )

Querying collections

from qdrant_client import QdrantClient, models
import requests

url = 'https://api.jina.ai/v1/multi-vector'

headers = {
    'Content-Type': 'application/json',
    'Authorization': 'Bearer <YOUR JINA API KEY>'
}


data = {
    'model': 'jina-colbert-v2',
    "input_type": "query",
    "embedding_type": "float",
    "input": [
        "how many tokens in an input do Jina AI's embedding models support?"
    ]
}

response = requests.post(url, headers=headers, json=data)
vector = response.json()["data"][0]["embeddings"]


qdrant_client = QdrantClient(
    url="",
    api_key="",
)

results = qdrant_client.query_points(
    collection_name="{collection_name}",
    query=vector,
    using="colbert",  # name of the multi-vector field in the collection
)

print(results)

Summary

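Jina ColBERT v2 (jina-colbert-v2) builds on the high performance of jina-colbert-v1-en and expands its capabilities to a wide range of global languages. With support for multiple embedding sizes, jina-colbert-v2 lets users tune the trade-off between precision and efficiency for their specific use case, potentially bringing significant savings in time and computing costs.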

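This model combines all of these features into a single, competitively priced package, accessible via an intuitive web API and compatible with any computing framework that supports HTTP requests. Try it out for yourself with 1 million free tokens to see how it can enhance your applications and processes.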