Jina Code Embeddings: State-of-the-Art Code Retrieval Embedding Models at 0.5B and 1.5B Parameters

Efficient Code Embeddings from Code Generation Models

jina-code-embeddings is a novel code embedding model suite designed to retrieve code from natural language queries, perform technical question-answering, and identify semantically similar code snippets across programming languages. It makes innovative use of an autoregressive backbone pre-trained on both text and code, generating embeddings via last-token pooling. We outline the training recipe and demonstrate state-of-the-art performance despite the relatively small size of the models, validating this approach to code embedding model construction.

arXiv.org | Daria Kryvosheieva

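Today we are releasing jina-code-embeddings, a new suite of code embedding models in two sizes, 0.5B and 1.5B parameters, along with GGUF quantizations of both. The models are built on autoregressive code generation LLMs and achieve state-of-the-art retrieval performance despite their compact size. They support more than 15 programming languages, including Python, JavaScript, Java, C++, C#, Go, Rust, TypeScript, SQL, MATLAB, R, Swift, Kotlin, HTML/CSS, PHP, Ruby, Scala, Perl, and Shell.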

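jina-code-embeddings achieves an average score of 78.41% (0.5B) and 79.04% (1.5B) across 25 code retrieval benchmarks. The 0.5B model outperforms Qwen3-Embedding-0.6B by about 5 percentage points (78.41% vs. 73.49%) despite being roughly 20% smaller, while the 1.5B model is on par with voyage-code-3 (79.23%) and surpasses gemini-embedding-001 (77.38%). Both of those are proprietary models with undisclosed architectures.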

Model	Parameters	Overall AVG	MTEB Code AVG
jina-code-embeddings-1.5b	1.54B	79.04%	78.94%
jina-code-embeddings-0.5b	494M	78.41%	78.72%
voyage-code-3	Unknown*	79.23%	79.84%
gemini-embedding-001	Unknown*	77.38%	76.48%
jina-embeddings-v4	3.8B	74.11%	74.87%
Qwen3-Embedding-0.6B	600M	73.49%	74.69%

*Closed-source models with undisclosed architecture

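Both models are trained with five task-specific instruction prefixes for different retrieval scenarios. Each task supports a query role and a document role for asymmetric retrieval. For example, you can embed queries with nl2code_query and documents with nl2code_document.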

Task	Use Case	Instruction Prefix
`nl2code`	"How to read CSV" → `pandas.read_csv()`	"Find the most relevant code snippet given the following query:\n"
`qa`	Technical Q&A retrieval	"Find the most relevant answer given the following question:\n"
`code2code`	Finding similar implementations	"Find an equivalent code snippet given the following code snippet:\n"
`code2nl`	Code to documentation	"Find the most relevant comment given the following code snippet:\n"
`code2completion`	Autocomplete scenarios	"Find the most relevant completion given the following start of code snippet:\n"
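
To make the table concrete, here is a minimal sketch of how a prefix could be prepended to a query before tokenization. The dictionary copies the prefixes from the table above, which the `transformers` example later in this post applies on the query side. The names `INSTRUCTION_PREFIXES` and `build_query_input` are hypothetical helpers for illustration, not part of any Jina library, and document-side prefixes are omitted because the table does not list them.

```python
# Query-side instruction prefixes, copied from the table above.
INSTRUCTION_PREFIXES = {
    "nl2code": "Find the most relevant code snippet given the following query:\n",
    "qa": "Find the most relevant answer given the following question:\n",
    "code2code": "Find an equivalent code snippet given the following code snippet:\n",
    "code2nl": "Find the most relevant comment given the following code snippet:\n",
    "code2completion": "Find the most relevant completion given the following start of code snippet:\n",
}


def build_query_input(task: str, text: str) -> str:
    """Prepend the task's instruction prefix to the raw query text (hypothetical helper)."""
    if task not in INSTRUCTION_PREFIXES:
        raise ValueError(f"Unknown task: {task!r}")
    return INSTRUCTION_PREFIXES[task] + text


# The model would then embed this full string rather than the bare query.
print(build_query_input("nl2code", "How to read CSV"))
```

With `sentence-transformers`, passing `prompt_name` to `encode` plays the same role, so a helper like this is only needed when you build inputs by hand.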

Training Recipe

We use pre-trained code generation models as the embedding backbone. Our models are built on Qwen2.5-Coder-0.5B and 1.5B, with the following characteristics:

Feature	jina-code-embeddings-0.5b	jina-code-embeddings-1.5b
Base Model	Qwen2.5-Coder-0.5B	Qwen2.5-Coder-1.5B
Embedding Dimensions	896	1536
Matryoshka Dimensions	64, 128, 256, 512, 896	128, 256, 512, 1024, 1536
Max Sequence Length	32,768 tokens	32,768 tokens
Pooling Strategy	Last-token pooling	Last-token pooling
Attention	FlashAttention2	FlashAttention2
Data Type	BFloat16	BFloat16

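Traditional code embedding models face a fundamental bottleneck: there are simply not enough high-quality comment-code pairs for supervised training. By starting from Qwen2.5-Coder, which was pre-trained on 5.5 trillion tokens spanning more than 92 programming languages, we inherit a deep semantic understanding of programming constructs, cross-language pattern recognition, and built-in knowledge of syntax and idioms. Contrastive fine-tuning then adapts this knowledge to retrieval tasks with minimal aligned data, sidestepping the data scarcity that constrains encoder-only models.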

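For underrepresented tasks such as cross-framework code translation, we generated synthetic data with LLMs and manually validated every synthetic example for quality. Our training data combines the existing training splits of MTEB code tasks with adapted public datasets, including CommitPackFT, SWE-Bench, Spider, MBPP, and CodeSearchNet.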

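Unlike jina-embeddings-v3 and v4, we did not use LoRA and instead performed full post-training directly. For small models like ours (494M and 1.54B parameters), LoRA's parameter efficiency becomes less compelling: when capacity is limited, adapter overhead can actually hurt performance. We needed every parameter working on the embedding task. Even for multi-task scenarios, task-specific instruction prefixes are cleaner than multiple LoRA adapters. Instead of switching weight configurations, we simply prepend a different instruction, which is leaner and closer to how LLMs naturally process conditioning information.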

Training was efficient: both models were trained with contrastive learning using the InfoNCE loss on 4 A100 80GB GPUs, taking only 8.3 hours for the 0.5B model and 12 hours for the 1.5B model.
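
For readers unfamiliar with InfoNCE, the following is a minimal NumPy sketch of the loss, not the actual training code. It assumes in-batch negatives (each query's positive is the document at the same index, and every other document in the batch is a negative) and a temperature of 0.05. Both are illustrative assumptions, since this post does not specify them.

```python
import numpy as np


def info_nce_loss(query_emb, doc_emb, temperature=0.05):
    """InfoNCE with in-batch negatives (temperature is an assumed value).

    query_emb, doc_emb: arrays of shape (batch, dim); row i of doc_emb is the
    positive for row i of query_emb, all other rows act as negatives.
    """
    # Cosine similarity: L2-normalize, then take dot products.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    logits = (q @ d.T) / temperature

    # Numerically stable log-softmax over the documents for each query.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # The positive pairs sit on the diagonal.
    return float(-np.mean(np.diag(log_probs)))


# Random stand-in embeddings, for illustration only.
rng = np.random.default_rng(0)
queries = rng.normal(size=(8, 16))
documents = queries + 0.1 * rng.normal(size=(8, 16))  # noisy positives
print(info_nce_loss(queries, documents))
```

The loss shrinks as each query moves closer to its own document and further from the other documents in the batch, which is the behavior retrieval needs.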

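Finally, we benchmarked different pooling strategies. Last-token pooling achieved an overall average of 78.41%, consistently outperforming mean pooling (77.20%) and latent attention pooling (78.27%) across all benchmark categories. This 1.2 percentage point advantage over mean pooling led us to break with the mean pooling tradition we established in jina-embeddings-v2, v3, and v4. As more retrieval models are built on decoder-only LLMs, last-token pooling becomes the natural choice: under unidirectional attention only the final token has attended to the entire sequence, so averaging in earlier tokens mixes in representations that saw only a prefix. Mean pooling can work, and it is often easier to train in the early steps (possibly due to its convex optimization landscape), but our experiments consistently showed that it plateaus below the performance ceiling reached by last-token pooling.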
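
The difference between the two main strategies can be sketched in a few lines of NumPy. This is an illustrative comparison under the assumption of right-padded batches, not the models' internal implementation. `hidden` stands in for the final hidden states and `mask` for the attention mask.

```python
import numpy as np


def mean_pool(hidden, mask):
    """Average the hidden states of all non-padding tokens."""
    weights = mask[..., None].astype(hidden.dtype)
    return (hidden * weights).sum(axis=1) / weights.sum(axis=1)


def last_token_pool(hidden, mask):
    """Take the hidden state of the last non-padding token (right padding assumed)."""
    last_index = mask.sum(axis=1) - 1
    return hidden[np.arange(hidden.shape[0]), last_index]


# Toy batch: 2 sequences, 3 positions, hidden size 2; the second sequence has one pad token.
hidden = np.arange(12, dtype=np.float64).reshape(2, 3, 2)
mask = np.array([[1, 1, 1], [1, 1, 0]])

print(mean_pool(hidden, mask))        # one averaged vector per sequence
print(last_token_pool(hidden, mask))  # one final-token vector per sequence
```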

Getting Started

Both models are available through our Search Foundation API and work with popular frameworks including sentence-transformers, transformers, and llama.cpp.

Via API

curl https://api.jina.ai/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $JINA_API_KEY" \
  -d @- <<EOFEOF
  {
    "model": "jina-code-embeddings-1.5b",
    "input": ["print hello world in python"],
    "task": "nl2code.passage"
  }
EOFEOF

Via `sentence-transformers`

from sentence_transformers import SentenceTransformer

# Load the model (choose 0.5b or 1.5b)
model = SentenceTransformer(
    "jinaai/jina-code-embeddings-1.5b",
    model_kwargs={"torch_dtype": "bfloat16"},
    tokenizer_kwargs={"padding_side": "left"}
)

# Natural language to code
queries = ["print hello world in python", "initialize array of 5 zeros in c++"]
documents = ["print('Hello World!')", "int arr[5] = {0, 0, 0, 0, 0};"]

# Generate embeddings with task-specific prefixes
query_embeddings = model.encode(queries, prompt_name="nl2code_query")
document_embeddings = model.encode(documents, prompt_name="nl2code_document")

# Compute similarity
similarity = model.similarity(query_embeddings, document_embeddings)
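
Once you have a similarity matrix with one row per query and one column per document, retrieval reduces to picking the highest-scoring column in each row. The sketch below shows that step in isolation. `top_match_per_query` is a hypothetical helper, and the scores are made-up placeholders rather than real model output.

```python
import numpy as np


def top_match_per_query(similarity):
    """Return (best_document_index, score) for each query row (hypothetical helper)."""
    scores = np.asarray(similarity, dtype=np.float32)
    best = scores.argmax(axis=1)
    return [(int(doc), float(scores[query, doc])) for query, doc in enumerate(best)]


# Placeholder scores for illustration only, not real model output.
example_scores = [[0.8, 0.1],
                  [0.2, 0.7]]

for query_index, (doc_index, score) in enumerate(top_match_per_query(example_scores)):
    print(f"query {query_index} -> document {doc_index} (score {score:.2f})")
```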

Via `transformers`

import torch  # needed for torch.arange in last_token_pool
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

def last_token_pool(last_hidden_states, attention_mask):
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size), sequence_lengths]

tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-code-embeddings-1.5b')
model = AutoModel.from_pretrained('jinaai/jina-code-embeddings-1.5b')

# Apply task-specific prefix
query = "Find the most relevant code snippet given the following query:\nprint hello world"
code = "Candidate code snippet:\nprint('Hello World!')"

# Tokenize and embed
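batch_dict = tokenizer([query, code], padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch_dict)
embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# L2-normalize so that the dot product equals cosine similarity
embeddings = F.normalize(embeddings, p=2, dim=1)
similarity = embeddings[0] @ embeddings[1]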

Matryoshka Embedding Truncation

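Both models were trained with Matryoshka representation learning, at dimensions [64, 128, 256, 512, 896] for the 0.5B model and [128, 256, 512, 1024, 1536] for the 1.5B model, so you can truncate embeddings without recomputing them: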

# Full embeddings: 896d (0.5B) or 1536d (1.5B)
text = "print hello world in python"
full_embedding = model.encode(text)

# Truncate to a trained Matryoshka dimension for efficiency
small_embedding = full_embedding[:256]  # Trained dimension for both models
tiny_embedding = full_embedding[:128]   # Trained dimension for both models; 0.5B also supports 64d

This flexibility lets you trade off quality against efficiency according to your needs.
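
If you compare truncated embeddings with a plain dot product, you will likely want to re-normalize them after slicing, because a prefix of a unit vector is generally not unit length. Below is a minimal sketch of that step. `truncate_embedding` is a hypothetical helper, and the random vector merely stands in for a real 896-dimensional embedding.

```python
import numpy as np


def truncate_embedding(embedding, dim):
    """Keep the first `dim` components and re-normalize to unit length (hypothetical helper)."""
    vec = np.asarray(embedding, dtype=np.float32)
    if dim > vec.shape[-1]:
        raise ValueError(f"dim={dim} exceeds embedding size {vec.shape[-1]}")
    truncated = vec[..., :dim]
    norm = np.linalg.norm(truncated, axis=-1, keepdims=True)
    return truncated / np.clip(norm, 1e-12, None)  # guard against division by zero


# Random stand-in for a 0.5B model embedding, for illustration only.
rng = np.random.default_rng(0)
full = rng.normal(size=896)

small = truncate_embedding(full, 256)
print(small.shape)  # (256,)
```

Cosine similarity is unaffected by vector length, so re-normalization matters mainly when your vector store or scoring code assumes unit-length inputs.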

Conclusion

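jina-code-embeddings shows that effective code embedding models do not need to be large. By building on code generation models and applying targeted fine-tuning, we achieve state-of-the-art performance with models of about 1.5B parameters or fewer.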

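The strong results from these compact models (0.5B and 1.5B) support our thesis: the right foundation matters more than parameter count. Generation models understand code semantics, and that understanding transfers directly to representation tasks.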

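This is in line with our broader vision at Jina AI: unified architectures in which embedding and generation stem from the same foundation, pushing the boundaries of what search foundation models can achieve.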