Jina Embeddings v4: Universal Embeddings for Multimodal Multilingual Retrieval

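jina-embeddings-v4: Universal Embeddings for Multimodal Multilingual Retrieval

We introduce jina-embeddings-v4, a 3.8 billion parameter multimodal embedding model that unifies text and image representations through a novel architecture supporting both single-vector and multi-vector embeddings in the late interaction style. The model incorporates task-specific Low-Rank Adaptation (LoRA) adapters to optimize performance across diverse retrieval scenarios, including query-based information retrieval, cross-modal semantic similarity, and programming code search. Comprehensive evaluations demonstrate that jina-embeddings-v4 achieves state-of-the-art performance on both single-modal and cross-modal retrieval tasks, with particular strength in processing visually rich content such as tables, charts, diagrams, and mixed-media formats. To facilitate evaluation of this capability, we also introduce Jina-VDR, a novel benchmark specifically designed for visually rich image retrieval.

arXiv.org | Michael Günther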

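Today we are releasing jina-embeddings-v4, our new 3.8 billion parameter universal embedding model for text and images. It includes a set of task-specific LoRA adapters that optimize performance for the most popular retrieval tasks, including query-document retrieval, semantic matching, and code search. jina-embeddings-v4 achieves state-of-the-art retrieval performance on multimodal and multilingual tasks across the MTEB, MMTEB, CoIR, LongEmbed, STS, Jina-VDR, CLIP, and ViDoRe benchmarks, with particular strength in processing visually rich content such as tables, charts, diagrams, and mixtures of them. The model supports both single-vector and multi-vector embeddings.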

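Performance of jina-embeddings-v4 on visual document retrieval and multimodal benchmarks. The boxplot distributions show mean scores and performance variability for embedding models across six benchmark categories: ViDoRe (vision document retrieval), Jina-VDR (comprehensive visual document retrieval), Wikimedia Commons Retrieval (multilingual document-description matching), GitHub README Retrieval (code documentation retrieval), Tweet Stock Retrieval (financial chart analysis), and CLIP Benchmark (general text-to-image retrieval). The jina-embeddings-v4 variants (highlighted in cyan) show state-of-the-art performance on visually rich document tasks, with the multi-vector version achieving the highest scores on the specialized visual document benchmarks (90.2 on ViDoRe, 80.2 on Jina-VDR) while remaining competitive on general multimodal retrieval (84.1 on CLIP Benchmark). Models are ranked by mean performance within each benchmark category, and individual data points show the score distribution across multiple evaluation tasks.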

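jina-embeddings-v4 is our most ambitious embedding model yet. As an open-source model, it outperforms leading closed-source embedding models from major providers: it scores 12% higher than OpenAI's text-embedding-3-large on multilingual retrieval (66.49 vs 59.27), 28% higher on long-document tasks (67.11 vs 52.42), 15% higher than voyage-3 on code retrieval (71.59 vs 67.23), and matches the performance of Google's gemini-embedding-001. This makes v4 the most capable open-source universal embedding model available today. It offers researchers and developers enterprise-grade multimodal embedding capabilities, with full transparency into the training process, architectural decisions, and model weights through our comprehensive technical report.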

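Performance of jina-embeddings-v4 on five retrieval benchmarks. The chart shows boxplot distributions with mean scores for each model across text retrieval, code retrieval, multilingual retrieval, long-context retrieval, and semantic textual similarity (STS) benchmarks. jina-embeddings-v4 (highlighted in cyan) shows competitive or state-of-the-art performance in every evaluation category, with particularly strong results in text retrieval and STS. Models are ranked by mean performance within each benchmark category, and individual data points show the score distribution across multiple evaluation tasks.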

New Architecture

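Architecture of jina-embeddings-v4. The model is built on the Qwen2.5-VL-3B-Instruct backbone (3.8B parameters). Text and image inputs are processed through a shared pathway: images are first converted to token sequences by a vision encoder, then both modalities are jointly processed by the language model decoder with contextual attention layers. Three task-specific LoRA adapters (60M parameters each) provide specialized optimization for retrieval, text-matching, and code tasks without modifying the frozen backbone weights. The architecture supports two output modes: (1) single-vector embeddings (2048 dimensions, truncatable to 128) generated via mean pooling for efficient similarity search, and (2) multi-vector embeddings (128 dimensions per token) generated via projection layers for late-interaction retrieval strategies.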
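The single-vector output path described in the caption above can be sketched in a few lines. This is a minimal illustration in plain Python, assuming the token-level hidden states are already available as lists of floats. The function names `mean_pool` and `truncate_embedding` are hypothetical and are not part of any Jina library. Truncation keeps the leading dimensions and re-normalizes, which is the usual way Matryoshka-style embeddings are shortened.

```python
import math


def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one vector."""
    count = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / count for d in range(dim)]


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def truncate_embedding(vec, dim):
    """Keep the first `dim` components and re-normalize (hypothetical helper)."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    return l2_normalize(vec[:dim])


# Toy data: two tokens with 4-dimensional hidden states
# (the real model uses 2048 dimensions, truncatable to 128).
tokens = [[1.0, 2.0, 3.0, 4.0], [3.0, 2.0, 1.0, 0.0]]
pooled = mean_pool(tokens)
short = truncate_embedding(pooled, 2)
```

Shorter vectors reduce index size and search cost; how much quality is lost at each length depends on the model and task.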

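Upgrading from jina-embeddings-v3 to jina-embeddings-v4 represents a paradigm shift from text-only to multimodal embeddings. While v3 focused on optimizing text embeddings with task-specific LoRA adapters, v4 addresses the growing need to embed both textual and visual content in a unified representation.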

Aspect	jina-embeddings-v3	jina-embeddings-v4
Backbone Model	jina-XLM-RoBERTa	Qwen2.5-VL-3B-Instruct
Parameters (Base)	559M	3.8B
Parameters (with adapters)	572M	3.8B + 60M per adapter
Modalities	Text only	Text + images (multimodal)
Max Input Length	8,192 tokens	32,768 tokens
Image Processing	None	Up to 20 megapixels, visually rich documents
Multilingual Support	89 languages	29+ languages
Vector Types	Single-vector only	Single-vector + multi-vector (late interaction)
Single-vector Dimensions	1024 (MRL truncatable to 32)	2048 (MRL truncatable to 128)
Multi-vector Dimensions	Not available	128 per token
Task LoRA Specializations	• Asymmetric retrieval • Semantic similarity • Classification • Separation	• Asymmetric retrieval • Semantic similarity • Code retrieval
Training Stages	3 stages: pre-training → embedding fine-tuning → adapter training	2 stages: joint pair training → task-specific adapter training
Loss Functions	InfoNCE, CoSent, extended triplet loss	Joint InfoNCE + KL divergence for single-vector/multi-vector
Positional Encoding	RoPE (rotary base frequency tuning)	M-RoPE (Multimodal Rotary Position Embedding)
Cross-modal Processing	N/A	Unified encoder (reduced modality gap)
MRL Support	Yes	Yes
Attention Implementation	FlashAttention2	FlashAttention2

Backbone

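The most significant architectural change in v4 is the switch of backbone from XLM-RoBERTa to Qwen2.5-VL-3B-Instruct. This decision was driven by v4's core objective of creating a universal embedding model capable of "true multimodal processing", in which images are converted to token sequences and processed alongside text, eliminating the modality gap present in dual-encoder architectures.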

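The backbone choice aligns with several key design goals. Qwen2.5-VL's strength in document understanding directly supports v4's strength in processing visually rich content such as tables, charts, and screenshots. Its dynamic resolution capability lets v4 handle images resized to 20 megapixels, as specified in the architecture. Its advanced positional encoding provides the foundation for v4's superior cross-modal alignment, with an alignment score of 0.71 compared to 0.15 for OpenAI CLIP.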
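The text above does not say how the alignment score is computed. One plausible reading, used here purely as an assumption, is the mean cosine similarity between each image embedding and the embedding of its matching text. The sketch below implements that reading in plain Python; `alignment_score` is a hypothetical helper, and the toy vectors are invented for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def alignment_score(text_embs, image_embs):
    """Mean cosine similarity over matched pairs.

    Assumption: text_embs[i] describes image_embs[i]. This is one possible
    definition of cross-modal alignment, not necessarily the one behind the
    0.71 and 0.15 figures quoted in the text.
    """
    if not text_embs or len(text_embs) != len(image_embs):
        raise ValueError("need the same non-zero number of text and image embeddings")
    sims = [cosine(t, i) for t, i in zip(text_embs, image_embs)]
    return sum(sims) / len(sims)


# Toy 2-dimensional embeddings for two matched text-image pairs.
text_embs = [[1.0, 0.0], [0.0, 1.0]]
image_embs = [[1.0, 0.0], [1.0, 1.0]]
score = alignment_score(text_embs, image_embs)
```

Under this definition, a score near 1 means matched texts and images land close together in the shared space, while a low score indicates a modality gap.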

LoRA Adapters

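v4 streamlines v3's five tasks into three focused tasks, reflecting lessons learned about effectiveness and user adoption:

Asymmetric retrieval (consolidating v3's query/passage adapters)
Symmetric similarity (the equivalent of v3's text-matching adapter, for STS tasks)
Code retrieval (learned from v2-code, missing in v3)

This consolidation removes v3's classification and separation adapters, focusing v4 on the most impactful embedding use cases: retrieval and STS.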
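To make the adapter idea concrete, here is a small sketch of how a LoRA update augments a frozen weight matrix: the output is `W x + (alpha / r) * B (A x)`, where only the low-rank matrices `A` and `B` differ per task. This illustrates the general LoRA technique with toy 2×2 values. It is not Jina's training or inference code, and the task labels, `lora_forward`, and `ADAPTERS` are hypothetical names chosen for the example.

```python
def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def lora_forward(W, A, B, x, alpha=1.0):
    """Compute y = W x + (alpha / r) * B (A x).

    W is the frozen base weight (d_out x d_in); A (r x d_in) and B (d_out x r)
    are the small trainable adapter matrices of rank r.
    """
    rank = len(A)
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * delta for b, delta in zip(base, low_rank)]


# Frozen backbone weight shared by all tasks (toy identity matrix).
W = [[1.0, 0.0], [0.0, 1.0]]

# One rank-1 adapter per task; the labels are illustrative only.
ADAPTERS = {
    "retrieval": {"A": [[1.0, 1.0]], "B": [[0.5], [0.0]]},
    "text-matching": {"A": [[1.0, -1.0]], "B": [[0.0], [0.5]]},
    "code": {"A": [[0.0, 1.0]], "B": [[1.0], [1.0]]},
}

x = [2.0, 1.0]
outputs = {
    task: lora_forward(W, params["A"], params["B"], x)
    for task, params in ADAPTERS.items()
}
```

Switching tasks only swaps the small `A` and `B` matrices; the backbone weight `W` is never modified, which is why each adapter adds relatively few parameters.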

Output Embeddings

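v4 introduces a dual-output system that supports both single-vector and multi-vector embeddings, whereas v3 provided only single-vector output. This addresses different retrieval scenarios:

Single-vector mode: 2048-dimensional embeddings (truncatable to 128 via MRL) for efficient similarity search
Multi-vector mode: 128 dimensions per token for late-interaction retrieval

This dual approach delivers greater effectiveness with multi-vector representations, particularly in visually rich document retrieval, while keeping standard similarity tasks efficient. Multi-vector mode consistently outperforms single-vector mode by 7-10% across visual tasks, which suggests that late interaction provides fundamentally better semantic matching for multimodal content.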
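Late-interaction scoring is commonly implemented as "MaxSim": for each query token vector, take its best match among the document's token vectors, then sum those maxima. The sketch below assumes that formulation, which the text does not spell out, and uses toy 2-dimensional vectors in place of the 128-dimensional per-token vectors. `maxsim_score` is a hypothetical helper name.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vectors, doc_vectors):
    """Sum, over query tokens, of the best dot product with any document token."""
    return sum(max(dot(q, d) for d in doc_vectors) for q in query_vectors)


# Toy multi-vector embeddings: one vector per token (or image patch).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
doc_b = [[0.6, 0.4], [0.4, 0.6]]

scores = {
    "doc_a": maxsim_score(query, doc_a),
    "doc_b": maxsim_score(query, doc_b),
}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Because every query token is matched independently, fine-grained details such as a single table cell or chart label can contribute to the score. The trade-off is that each document stores one vector per token instead of a single vector.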

Parameter Size

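While v4 is 6.7 times larger than v3 (3.8B vs 570M parameters), the improvements in text-only performance are actually modest, suggesting that the parameter scaling was driven primarily by multimodal requirements rather than text enhancement. On core text benchmarks, v4 scores 66.49 on MMTEB compared to v3's 58.58 (a 14% improvement) and 55.97 on MTEB-EN compared to v3's 54.33 (a 3% improvement). For code retrieval, v4 scores 71.59 on CoIR compared to v3's 55.07 (a 30% improvement), and for long documents v4 scores 67.11 on LongEmbed compared to v3's 55.66 (a 21% improvement). The substantial scaling is justified once v4's multimodal capabilities are considered: it achieves 84.11 nDCG@5 on visual document retrieval (Jina-VDR) and 90.17 on the ViDoRe benchmark, capabilities that are entirely absent in v3. The parameter increase therefore represents our investment in multimodal functionality while maintaining competitive text performance. The unified architecture removes the need for separate text and vision models and achieves a cross-modal alignment of 0.71, compared to 0.15 for traditional dual-encoder approaches.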
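The percentages quoted above are relative gains, computed as (v4 − v3) / v3. The short check below reproduces them from the scores given in the text; the helper name `relative_gain` is only for illustration.

```python
def relative_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


# (v4 score, v3 score) pairs as quoted in the text.
SCORES = {
    "MMTEB": (66.49, 58.58),
    "MTEB-EN": (55.97, 54.33),
    "CoIR": (71.59, 55.07),
    "LongEmbed": (67.11, 55.66),
}

gains = {name: round(relative_gain(v4, v3)) for name, (v4, v3) in SCORES.items()}
size_ratio = round(3.8e9 / 570e6, 1)  # parameter count ratio, v4 over v3
```

Rounded to whole percentages, this gives 14, 3, 30, and 21 for the four benchmarks, alongside a 6.7x increase in parameter count.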

Getting Started

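For a quick check, try our text-to-image demo in the Search Foundation toolbox. We have prepared a collection of document images from our website, and you can also add your own image URLs. Type your query and press Enter to see the ranked results. You can treat it like OCR or like content-based image retrieval, and you are welcome to try queries in languages other than English.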

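The demo is available at: https://jina.ai/api-dashboard/m0-image-rerank

Please note that using this demo consumes tokens from your primary API key. The demo may also seem a bit slow, because the server has to download every image from its URL and no image caching is implemented.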

Via API

The code below shows how to use jina-embeddings-v4. You can pass a text string, a base64-encoded image, or an image URL. New users can get a Jina API key with 10 million free tokens.

curl https://api.jina.ai/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer JINA_API_KEY" \
  -d @- <<EOFEOF
  {
    "model": "jina-embeddings-v4",
    "task": "text-matching",
    "input": [
        {
            "text": "A beautiful sunset over the beach"
        },
        {
            "text": "Un beau coucher de soleil sur la plage"
        },
        {
            "text": "海滩上美丽的日落"
        },
        {
            "text": "浜辺に沈む美しい夕日"
        },
        {
            "image": "https://i.ibb.co/nQNGqL0/beach1.jpg"
        },
        {
            "image": "https://i.ibb.co/r5w8hG8/beach2.jpg"
        },
        {
            "image": "iVBORw0KGgoAAAANSUhEUgAAABwAAAA4CAIAAABhUg/jAAAAMklEQVR4nO3MQREAMAgAoLkoFreTiSzhy4MARGe9bX99lEqlUqlUKpVKpVKpVCqVHksHaBwCA2cPf0cAAAAASUVORK5CYII="
        }
    ]
  }
EOFEOF
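The same request can be assembled from Python using only the standard library. The sketch below mirrors the curl call: same endpoint, model name, task, and input objects. The helper names `build_request` and `fetch_embeddings` are hypothetical, and the response parsing assumes an OpenAI-style body with a `data` list whose items carry an `embedding` field, which is an assumption that should be checked against the API documentation.

```python
import json
import urllib.request

API_URL = "https://api.jina.ai/v1/embeddings"


def build_request(inputs, api_key, task="text-matching"):
    """Build (but do not send) a POST request for the embeddings endpoint."""
    payload = {"model": "jina-embeddings-v4", "task": task, "input": inputs}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def fetch_embeddings(request):
    """Send the request and return one vector per input.

    Assumption: the response looks like {"data": [{"embedding": [...]}, ...]}.
    """
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return [item["embedding"] for item in body["data"]]


# Text and image inputs can be mixed in one call, as in the curl example.
inputs = [
    {"text": "A beautiful sunset over the beach"},
    {"image": "https://i.ibb.co/nQNGqL0/beach1.jpg"},
]
request = build_request(inputs, api_key="JINA_API_KEY")  # replace the placeholder key
```

Calling `fetch_embeddings(request)` with a real key would perform the network call; it is left out of the sketch so that the block runs without credentials.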

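Due to limited GPU resources, our Embedding API currently supports documents of up to 8K tokens, even though jina-embeddings-v4 itself can handle up to 32K tokens. For applications that need contexts longer than 8K tokens (such as Late Chunking), we recommend deploying our models through CSPs or self-hosting the model.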

Via CSP Marketplaces

jina-embeddings-v4 will soon be available directly on AWS, Azure, and GCP at the prices listed on those marketplaces.

Via HuggingFace

For research and experimentation, you can run the model locally from our Hugging Face page. We have prepared a Google Colab notebook that demonstrates how it works.

Conclusion

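jina-embeddings-v4 represents our most significant leap yet: a 3.8 billion parameter universal embedding model that processes text and images through a unified pathway and supports both dense and late-interaction retrieval, while outperforming proprietary models from Google, OpenAI, and Voyage AI, especially on visually rich document retrieval. But this capability did not emerge in isolation; it is the culmination of four generations of solving fundamental limitations.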

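When we started with jina-embeddings-v1 in early 2022, everyone assumed that more data meant better performance. We proved the opposite: filtering 1.5 billion pairs down to 385 million high-quality examples outperformed much larger datasets. The lesson: curation beats collection.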

Jina Embeddings: A Novel Set of High-Performance Sentence Embedding Models

Jina Embeddings constitutes a set of high-performance sentence embedding models adept at translating textual inputs into numerical representations, capturing the semantics of the text. These models excel in applications like dense retrieval and semantic textual similarity. This paper details the development of Jina Embeddings, starting with the creation of high-quality pairwise and triplet datasets. It underlines the crucial role of data cleaning in dataset preparation, offers in-depth insights into the model training process, and concludes with a comprehensive performance evaluation using the Massive Text Embedding Benchmark (MTEB). Furthermore, to increase the model’s awareness of grammatical negation, we construct a novel training and evaluation dataset of negated and non-negated statements, which we make publicly available to the community.

arXiv.org | Michael Günther

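But users kept hitting BERT's 512-token wall. Training on longer sequences seemed expensive, until jina-embeddings-v2 revealed an elegant solution: train short, deploy long. ALiBi's linear attention biases let models trained on 512 tokens seamlessly handle 8,192 tokens at inference. We got more capability for less compute.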

Jina Embeddings 2: 8192-Token General-Purpose Text Embeddings for Long Documents

Text embedding models have emerged as powerful tools for transforming sentences into fixed-sized feature vectors that encapsulate semantic information. While these models are essential for tasks like information retrieval, semantic clustering, and text re-ranking, most existing open-source models, especially those built on architectures like BERT, struggle to represent lengthy documents and often resort to truncation. One common approach to mitigate this challenge involves splitting documents into smaller paragraphs for embedding. However, this strategy results in a much larger set of vectors, consequently leading to increased memory consumption and computationally intensive vector searches with elevated latency. To address these challenges, we introduce Jina Embeddings 2, an open-source text embedding model capable of accommodating up to 8192 tokens. This model is designed to transcend the conventional 512-token limit and adeptly process long documents. Jina Embeddings 2 not only achieves state-of-the-art performance on a range of embedding-related tasks in the MTEB benchmark but also matches the performance of OpenAI’s proprietary ada-002 model. Additionally, our experiments indicate that an extended context can enhance performance in tasks such as NarrativeQA.

arXiv.org | Michael Günther
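The "train short, deploy long" idea rests on how ALiBi biases attention: instead of learned position embeddings, each head subtracts a penalty proportional to the distance between two tokens. Because the rule is the same for any sequence length, it extends to positions never seen in training. The sketch below assumes a symmetric (bidirectional) variant suited to encoders and the geometric head slopes used for power-of-two head counts; the function names are hypothetical and this is not the model's actual implementation.

```python
def alibi_slopes(num_heads):
    """Per-head slopes 2^(-8/n), 2^(-16/n), ... for a power-of-two head count n."""
    return [2.0 ** (-8.0 * (head + 1) / num_heads) for head in range(num_heads)]


def alibi_bias(seq_len, slope):
    """Bias matrix added to attention logits: -slope * |i - j| (symmetric variant)."""
    return [[-slope * abs(i - j) for j in range(seq_len)] for i in range(seq_len)]


slopes = alibi_slopes(8)

# The same rule produces the bias for a short and a long sequence;
# toy lengths 4 and 8 stand in for 512 and 8,192 tokens.
short_bias = alibi_bias(4, slopes[0])
long_bias = alibi_bias(8, slopes[0])

# The bias for the longer sequence contains the shorter one as its top-left block.
top_left = [row[:4] for row in long_bias[:4]]
```

Since `top_left` equals `short_bias`, everything the model learned about nearby tokens still applies at longer lengths, and more distant tokens simply receive a larger penalty.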

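The success of jina-embeddings-v2 exposed another constraint: different tasks need different optimizations. Rather than building separate models, jina-embeddings-v3 used tiny 60M LoRA adapters to customize a 570M base model for any task. One model became five specialized models.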

jina-embeddings-v3: Multilingual Embeddings With Task LoRA

We introduce jina-embeddings-v3, a novel text embedding model with 570 million parameters, achieves state-of-the-art performance on multilingual data and long-context retrieval tasks, supporting context lengths of up to 8192 tokens. The model includes a set of task-specific Low-Rank Adaptation (LoRA) adapters to generate high-quality embeddings for query-document retrieval, clustering, classification, and text matching. Evaluation on the MTEB benchmark shows that jina-embeddings-v3 outperforms the latest proprietary embeddings from OpenAI and Cohere on English tasks, while achieving superior performance compared to multilingual-e5-large-instruct across all multilingual tasks. With a default output dimension of 1024, users can flexibly reduce the embedding dimensions to as low as 32 without compromising performance, enabled by Matryoshka Representation Learning.

arXiv.org | Saba Sturua

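Even with task specialization, we remained text-only while users needed visual understanding. Standard CLIP-based models such as jina-clip-v1 and jina-clip-v2 use separate encoders, which creates a "modality gap" where similar content in different formats ends up far apart. Like our recently released jina-reranker-m0, jina-embeddings-v4 eliminates this gap entirely: one unified pathway processes everything, removing the gap rather than bridging it.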

jina-embeddings-v4: Universal Embeddings for Multimodal Multilingual Retrieval

We introduce jina-embeddings-v4, a 3.8 billion parameter multimodal embedding model that unifies text and image representations through a novel architecture supporting both single-vector and multi-vector embeddings in the late interaction style. The model incorporates task-specific Low-Rank Adaptation (LoRA) adapters to optimize performance across diverse retrieval scenarios, including query-based information retrieval, cross-modal semantic similarity, and programming code search. Comprehensive evaluations demonstrate that jina-embeddings-v4 achieves state-of-the-art performance on both single-modal and cross-modal retrieval tasks, with particular strength in processing visually rich content such as tables, charts, diagrams, and mixed-media formats. To facilitate evaluation of this capability, we also introduce Jina-VDR, a novel benchmark specifically designed for visually rich image retrieval.

arXiv.org | Michael Günther

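Both jina-embeddings-v4 and jina-reranker-m0 share a fundamental shift: they use LLMs as backbones instead of encoder-only models. This is not a coincidence. It reflects a deep advantage that most people miss: encoder-only models create "modality gaps" in which images cluster separately from text, whereas decoder-only models open up possibilities that encoder-only architectures cannot achieve, including true mixed-modality representation and explainability.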

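Our key insight: embeddings and generation are both about understanding semantics. LLMs that excel at generation naturally excel at representation. We believe the future lies in unified architectures where embedding and reranking emerge from the same search foundation model, and that is exactly what Jina AI is building toward.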