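Jina ColBERT v2: Multilingual Late Interaction Retriever for Embedding and Reranking

We are pleased to release Jina ColBERT v2 (jina-colbert-v2), an advanced late interaction retrieval model built on the ColBERT architecture. This new language model improves on the performance of jina-colbert-v1-en and adds multilingual support and dynamic output dimensions.

The main highlights of this new release are:

Superior retrieval performance compared to the original ColBERT-v2 (+6.5%) and our previous release, jina-colbert-v1-en (+5.4%).
Multilingual support for 89 languages, delivering strong performance across major global languages.
User-controlled output embedding sizes through Matryoshka representation learning, enabling a flexible balance between efficiency and accuracy.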

Technical Summary of jina-colbert-v2

The full technical report is available on arXiv:

Jina-ColBERT-v2: A General-Purpose Multilingual Late Interaction Retriever

Multi-vector dense models, such as ColBERT, have proven highly effective in information retrieval. ColBERT's late interaction scoring approximates the joint query-document attention seen in cross-encoders while maintaining inference efficiency closer to traditional dense retrieval models, thanks to its bi-encoder architecture and recent optimizations in indexing and search. In this paper, we introduce several improvements to the ColBERT model architecture and training pipeline, leveraging techniques successful in the more established single-vector embedding model paradigm, particularly those suited for heterogeneous multilingual data. Our new model, Jina-ColBERT-v2, demonstrates strong performance across a range of English and multilingual retrieval tasks, while also cutting storage requirements by up to 50% compared to previous models.

arXiv.org · Rohan Jha

	jina-colbert-v2	jina-colbert-v1-en	Original ColBERTv2
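Average of 14 English BEIR tasks	0.521	0.494	0.489
Multilingual	89 languages	English-only	English-only
Output dimensions	128, 96, or 64	Fixed 128	Fixed 128
Max query length	32 tokens	32 tokens	32 tokens
Max document length	8192 tokens	8192 tokens	512 tokens
Parameters	560M	137M	110M
Model size	1.1GB	550MB	438MB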

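Asymmetric Embedding in ColBERT

ColBERT builds on the BERT architecture by adding late interaction and asymmetric query-document encoding.

The asymmetric nature of ColBERT means that when using models like jina-colbert-v2 or jina-colbert-v1-en, you need to specify whether you are embedding a query, a document, or both (for reranking purposes). This added flexibility improves performance over homogeneous embedding models in retrieval tasks.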
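To make late interaction more concrete, here is a minimal sketch of ColBERT-style MaxSim scoring in plain Python. It assumes the query and the document have already been encoded into lists of unit-length token vectors; the helper names (`dot`, `maxsim_score`) and the toy 2-dimensional vectors are illustrative only and are not part of any Jina API.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: List[Vector], doc_vecs: List[Vector]) -> float:
    """Late interaction score: for every query token, take its best-matching
    document token (maximum similarity), then sum over the query tokens."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy 2-dimensional "token embeddings" (hypothetical values, unit length).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]  # covers both query directions
doc_b = [[0.6, 0.8]]                          # partial match only

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(score_a, score_b)
```

Because the query and document vectors are produced separately and only meet in this final scoring step, document vectors can be computed and indexed ahead of time.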

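Multilingual Support for Over 89 Languages

Jina ColBERT v2 has extensive multilingual capabilities, designed to meet the demands of modern, globalized information retrieval and AI applications. The training corpus for jina-colbert-v2 covers 89 languages, with additional stages of training for major international languages including Arabic, Chinese, English, French, German, Japanese, Russian, and Spanish, as well as programming languages. The training also included a corpus of aligned bilingual texts to unlock cross-lingual potential, allowing queries and documents in different languages to be matched in reranking and retrieval tasks.

Figure: Pretraining dataset distribution by language (specified by ISO-639 code), on a logarithmic scale.

Currently, Jina ColBERT v2 is the only multilingual ColBERT-like model that generates compact embeddings, and it significantly outperforms BM25-based retrieval across all languages tested on the MIRACL benchmark.

Figure: jina-colbert-v2 performance compared to BM25 for 16 languages on the MIRACL benchmark.

Furthermore, on English-language retrieval tasks, Jina ColBERT v2 exceeds the performance of its predecessor jina-colbert-v1-en and the original ColBERT v2 model, with performance approaching that of the English-only specialist model AnswerAI-ColBERT-small.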

Model name	Average score (14 BEIR English-only benchmarks)	Multilingual support
jina-colbert-v2	0.521	Multilingual
jina-colbert-v1-en	0.494	English-only
ColBERT v2.0	0.489	English-only
AnswerAI-ColBERT-small	0.549	English-only

Figure: Evaluation of jina-colbert-v2 on the English-only datasets of the BEIR benchmark.

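Matryoshka Representation Learning

Matryoshka Representation Learning is a technique for training models to support different output vector sizes with minimal loss in accuracy. The hidden layers of the network are trained with several different linear projection heads (the final layers of a neural network), each supporting a different output size. Jina ColBERT v2 supports output vectors of 128, 96, and 64 dimensions.

Jina ColBERT v2 produces 128-dimension output embeddings by default, but it can also produce 96- and 64-dimension embeddings that are nearly identical in performance while being 25% and 50% shorter, respectively.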
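The idea of several projection heads sharing one hidden representation can be pictured with the toy sketch below. It is an illustration under loud assumptions, not the model's training code: the hidden size of 8 and the head sizes 4, 3, and 2 stand in for the real model's dimensions, the weights are random rather than learned, and `make_head` and `project` are hypothetical helper names.

```python
import random
from typing import Dict, List

random.seed(0)

HIDDEN_SIZE = 8          # toy value; the real hidden size is much larger
HEAD_SIZES = (4, 3, 2)   # stand-ins for the real 128, 96, and 64


def make_head(out_dim: int, in_dim: int) -> List[List[float]]:
    """Create a random (out_dim x in_dim) weight matrix for one linear head."""
    return [[random.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]


def project(weights: List[List[float]], hidden: List[float]) -> List[float]:
    """Apply a bias-free linear projection: one dot product per output row."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weights]


# One projection head per supported output size, all fed the same hidden state.
heads: Dict[int, List[List[float]]] = {d: make_head(d, HIDDEN_SIZE) for d in HEAD_SIZES}

hidden_state = [random.uniform(-1, 1) for _ in range(HIDDEN_SIZE)]
outputs = {d: project(w, hidden_state) for d, w in heads.items()}

for dim, vec in outputs.items():
    print(dim, len(vec))
```

Each head maps the same hidden state to a vector of its own size, which is what lets a caller choose the output dimension at request time.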

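The table below shows the nDCG performance of jina-colbert-v2 for the top ten results (nDCG@10) over six BEIR benchmarks. The difference in performance between 128 and 96 dimensions is only about 1%, and the difference between 128 and 64 dimensions is under 1.5%.

Output dimensions	Average score (nDCG@10 for 6 benchmarks)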
128	0.565
96	0.558
64	0.556

Figure: Jina ColBERT v2 performance at different output dimensions.

Reducing the size of the output vectors saves space and speeds up applications such as vector-based information retrieval, which have to compare different vectors or measure the distance between them.

This also brings significant cost savings through reduced storage. For example, using Qdrant's cloud cost calculator, storing 100 million documents on AWS with a 128-dimension vector for each has an estimated cost of US$1,319.24 per month. At 64 dimensions, this falls to US$659.62.
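The halving follows directly from the raw vector size, as the short calculation below suggests. It is a back-of-the-envelope sketch that assumes 4-byte float32 values, one vector per document as in the quoted scenario, and no index or metadata overhead; `embedding_bytes` is a hypothetical helper, and this is not how Qdrant's calculator works internally.

```python
def embedding_bytes(num_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw size in bytes of num_vectors vectors with `dim` values each."""
    return num_vectors * dim * bytes_per_value


NUM_VECTORS = 100_000_000  # the 100 million figure quoted above

size_128 = embedding_bytes(NUM_VECTORS, 128)
size_64 = embedding_bytes(NUM_VECTORS, 64)

print(f"128 dims: {size_128 / 1e9:.1f} GB")
print(f" 64 dims: {size_64 / 1e9:.1f} GB")
print(f"size ratio: {size_64 / size_128:.2f}")

# The quoted monthly estimates show the same ratio.
cost_ratio = 659.62 / 1319.24
print(f"cost ratio: {cost_ratio:.2f}")
```

Under these assumptions the raw vectors take 51.2 GB at 128 dimensions and 25.6 GB at 64, matching the 2:1 ratio of the quoted prices.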

Getting Started with Jina ColBERT v2

Jina ColBERT v2 is available via the Jina Search Foundation API, the AWS Marketplace, and Azure. It is also available via Hugging Face for non-commercial use only (CC BY-NC-4.0).

Via Jina Search Foundation API

For Embedding

The following curl command shows how to specify the input and options to get document embeddings from jina-colbert-v2 via the Jina Embeddings API. To get vectors of your preferred size, specify 128 or 64 for the dimensions parameter. This parameter is optional, and the default value is 128.

Input documents longer than 8192 tokens are truncated.

Specify your Jina API key in the authorization header, Authorization: Bearer <YOUR JINA API KEY>:

# Set "dimensions" to 64 to halve the vector size.
# For query embeddings, set "input_type" to "query" (see below).
curl https://api.jina.ai/v1/multi-vector \
	 -H "Content-Type: application/json" \
	 -H "Authorization: Bearer <YOUR JINA API KEY>" \
	 -d '{
	"model": "jina-colbert-v2",
	"dimensions": 128,
	"input_type": "document",
	"embedding_type": "float",
	"input": [
		"Your document text string goes here",
		"You can send multiple texts",
		"Each text can be up to 8192 tokens long"
    ]}'

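To get query embeddings, set the input_type parameter to query instead of document. Note that queries have much stricter size limits than documents: they are truncated at 32 tokens. Query encoding always returns 32 tokens, including embeddings for the padding if the query is shorter than 32 tokens.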

# Set "dimensions" to 64 to halve the vector size.
# "input_type" must be "query" to get query embeddings.
curl https://api.jina.ai/v1/multi-vector \
	 -H "Content-Type: application/json" \
	 -H "Authorization: Bearer <YOUR JINA API KEY>" \
	 -d '{
	"model": "jina-colbert-v2",
	"dimensions": 128,
	"input_type": "query",
	"embedding_type": "float",
	"input": [
		"Your query text string goes here",
		"You can send multiple texts",
		"Each query text can be up to 32 tokens long"
    ]}'
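The fixed query length described above can be pictured with a small sketch. It only illustrates the truncate-or-pad behavior: the whitespace split stands in for the model's real tokenizer, and `PAD_TOKEN` and `fit_query` are hypothetical names chosen for this example.

```python
from typing import List

MAX_QUERY_TOKENS = 32
PAD_TOKEN = "<pad>"  # hypothetical placeholder; the real tokenizer's symbol may differ


def fit_query(tokens: List[str], length: int = MAX_QUERY_TOKENS) -> List[str]:
    """Truncate to `length` tokens, or pad on the right until `length` is reached."""
    clipped = tokens[:length]
    return clipped + [PAD_TOKEN] * (length - len(clipped))


short_query = "what is the population of berlin".split()  # 6 tokens
long_query = ["tok"] * 40                                  # 40 tokens

print(len(fit_query(short_query)), len(fit_query(long_query)))
```

Both calls yield exactly 32 positions, which is why a query response always carries 32 token embeddings regardless of how short the query text is.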

For Reranking

To use jina-colbert-v2 via the Jina Reranker API, passing in one query and several documents and getting back match scores you can rank by, construct your request like the one below:

curl https://api.jina.ai/v1/rerank \
	 -H "Content-Type: application/json" \
	 -H "Authorization: Bearer <YOUR JINA API KEY>" \
	 -d '{
      "model": "jina-colbert-v2",
      "query": "What is the population of Berlin?",
      "top_n": 3,
      "documents": [
        "Berlin's population grew by 0.7 percent in 2023 compared with the previous year. Accordingly, around 27,300 more residents lived in Berlin at the end of the last year than in 2022. Those of 30 to under 40 years old form the numerically largest age group. With roughly 881,000 foreign residents from around 170 nations and an average age of the population of 42.5 years old.",
        "Mount Berlin is a glacier-covered volcano in Marie Byrd Land, Antarctica, 100 kilometres (62 mi) from the Amundsen Sea. It is a roughly 20-kilometre-wide (12 mi) mountain with parasitic vents that consists of two coalesced volcanoes: Berlin proper with the 2-kilometre-wide (1.2 mi) Berlin Crater and Merrem Peak with a 2.5-by-1-kilometre-wide (1.55 mi × 0.62 mi) crater, 3.5 kilometres (2.2 mi) away from Berlin.",
        "Population as of 31.12.2023 by nationality and federal states Land\tTotal\tGermans\tForeigners\tincluding EU-states number\t%\tnumber\t%",
        "The urban area of Berlin has a population of over 4.5 million and is therefore the most populous urban area in Germany. The Berlin-Brandenburg capital region has around 6.2 million inhabitants and is Germany's second-largest metropolitan region after the Rhine-Ruhr region, and the sixth-biggest metropolitan region by GDP in the European Union.",
        "Irving Berlin (born Israel Beilin) was an American composer and songwriter. His music forms a large part of the Great American Songbook. Berlin received numerous honors including an Academy Award, a Grammy Award, and a Tony Award.",
        "Berlin is a town in the Capitol Planning Region, Connecticut, United States. The population was 20,175 at the 2020 census.",
        "Berlin is the capital and largest city of Germany, both by area and by population. Its more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits.",
        "Berlin, Berlin ist eine für die ARD produzierte Fernsehserie, die von 2002 bis 2005 im Vorabendprogramm des Ersten ausgestrahlt wurde. Regie führten unter anderem Franziska Meyer Price, Christoph Schnee, Sven Unterwaldt Jr. und Titus Selge."
        ]
    }'

The top_n argument specifies the number of documents you want to retrieve. For example, if your application only uses the top match, set top_n to 1.
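Once a response comes back, the scores need to be mapped back to the original texts. The sketch below shows one way to do that, assuming the response body carries a results list whose entries have an index into the submitted documents and a relevance_score; treat that shape as an assumption to verify against the API reference. The `ranked_texts` helper and the mock scores are made up for illustration.

```python
from typing import Dict, List, Tuple


def ranked_texts(response: Dict, documents: List[str]) -> List[Tuple[float, str]]:
    """Map each result back to its source text, sorted by descending score."""
    results = sorted(response["results"], key=lambda r: r["relevance_score"], reverse=True)
    return [(r["relevance_score"], documents[r["index"]]) for r in results]


documents = [
    "doc about Berlin's population",
    "doc about Mount Berlin",
    "doc about Irving Berlin",
]

# Hypothetical response body; the scores are invented for this example.
mock_response = {
    "results": [
        {"index": 2, "relevance_score": 0.12},
        {"index": 0, "relevance_score": 0.87},
    ]
}

ranking = ranked_texts(mock_response, documents)
for score, text in ranking:
    print(f"{score:.2f}  {text}")
```

With top_n set to 2, only two entries would be present, as in the mock above, so the helper never assumes every submitted document appears in the results.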

For code snippets in Python and other programming languages and frameworks, go to the Jina AI Embeddings API page, or select jina-colbert-v2 from the drop-down menu on the Jina Reranker API page.

Via Stanford ColBERT

You can also use Jina ColBERT v2 as a drop-in replacement for ColBERT v2 in the Stanford ColBERT library. Just specify jinaai/jina-colbert-v2 as the model source:

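from colbert.infra import ColBERTConfig
from colbert.modeling.checkpoint import Checkpoint

# Load the checkpoint from Hugging Face with the default ColBERT configuration.
ckpt = Checkpoint("jinaai/jina-colbert-v2", colbert_config=ColBERTConfig())
# queryFromText encodes its inputs as queries, so the list is named accordingly.
queries = ["Your list of texts"]
query_vectors = ckpt.queryFromText(queries)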

⚠️

You must install einops and flash_attn to use the above code.

Via RAGatouille

Jina ColBERT v2 is also integrated into RAGatouille. You can download and use it via the RAGPretrainedModel.from_pretrained() method:

from ragatouille import RAGPretrainedModel

RAG = RAGPretrainedModel.from_pretrained("jinaai/jina-colbert-v2")
docs = ["Your list of texts"]
RAG.index(docs, index_name="your_index_name")
query = "Your query"
results = RAG.search(query)

⚠️

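You must install einops and flash_attn to use the above code.

Via Qdrant

Since version 1.10, Qdrant has supported multi-vectors and late interaction models. Existing users of the Qdrant engine, whether running locally or on the managed cloud version, can integrate jina-colbert-v2 directly using Qdrant's client.

Creating a new collection using the MAX_SIM operation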

from qdrant_client import QdrantClient, models

qdrant_client = QdrantClient(
    url="<YOUR_ENDPOINT>",
    api_key="<YOUR_API_KEY>",
)

qdrant_client.create_collection(
    collection_name="{collection_name}",
    vectors_config={
        "colbert": models.VectorParams(
            size=128,
            distance=models.Distance.COSINE,
            multivector_config=models.MultiVectorConfig(
                comparator=models.MultiVectorComparator.MAX_SIM
            ),
        )
    }
)

⚠️

Setting the multivector_config parameter correctly is essential for using ColBERT-style models in Qdrant.

Inserting documents into multi-vector collections

import requests
from qdrant_client import QdrantClient, models

url = 'https://api.jina.ai/v1/multi-vector'

headers = {
    'Content-Type': 'application/json',
    'Authorization': 'Bearer <YOUR BEARER>'
}

data = {
    'model': 'jina-colbert-v2',
    'input_type': 'document',  # these texts are indexed as documents, not queries
    'embedding_type': 'float',
    'input': [
        'Your text string goes here',
        'You can send multiple texts',
        'Each text can be up to 8192 tokens long'
    ]
}

response = requests.post(url, headers=headers, json=data)
rows = response.json()["data"]

qdrant_client = QdrantClient(
    url="<YOUR_ENDPOINT>",
    api_key="<YOUR_API_KEY>",
)

for i, row in enumerate(rows):
    qdrant_client.upsert(
        collection_name="{collection_name}",
        points=[
            models.PointStruct(
                id=i,  
                vector={"colbert": row["embeddings"]},  # the collection defines a named vector "colbert"
                payload={"text": data["input"][i]} 
            )
        ],
    )

Querying collections

from qdrant_client import QdrantClient, models
import requests

url = 'https://api.jina.ai/v1/multi-vector'

headers = {
    'Content-Type': 'application/json',
    'Authorization': 'Bearer <YOUR BEARER>'
}


data = {
    'model': 'jina-colbert-v2',
    "input_type": "query",
    "embedding_type": "float",
    "input": [
        "how many tokens in an input do Jina AI's embedding models support?"
    ]
}

response = requests.post(url, headers=headers, json=data)
vector = response.json()["data"][0]["embeddings"]


qdrant_client = QdrantClient(
    url="<YOUR_ENDPOINT>",
    api_key="<YOUR_API_KEY>",
)

results = qdrant_client.query_points(
    collection_name="{collection_name}",
    query=vector,
    using="colbert",  # name of the multivector field defined at collection creation
)

print(results)

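Summary

Jina ColBERT v2 (jina-colbert-v2) builds on the high performance of jina-colbert-v1-en and extends its capabilities to a wide range of global languages. With support for multiple embedding sizes, it lets users tune the precision/efficiency trade-off to suit their specific use cases, which can bring significant savings in time and computing costs.

This model combines all these features into a single, competitively priced package, accessible via an intuitive web API and compatible with any computing framework that supports HTTP requests. Try it out for yourself with 1 million free tokens to see how it can enhance your applications and processes.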