


Today we're releasing jina-code-embeddings, a new suite of code embedding models in two sizes (0.5B and 1.5B parameters), along with GGUF quantizations for both. Built on autoregressive code generation LLMs, these models achieve state-of-the-art retrieval performance despite their compact size. They support more than 15 programming languages, including Python, JavaScript, Java, C++, C#, Go, Rust, TypeScript, SQL, MATLAB, R, Swift, Kotlin, HTML/CSS, PHP, Ruby, Scala, Perl, and Shell.
jina-code-embeddings achieves 78.41% (0.5B) and 79.04% (1.5B) average performance across 25 code retrieval benchmarks. The 0.5B model outperforms Qwen3-Embedding-0.6B by nearly 5 percentage points despite being roughly 20% smaller, while the 1.5B variant matches voyage-code-3 (79.23%) and exceeds gemini-embedding-001 (77.38%), both proprietary models with undisclosed architectures.
Model | Parameters | Overall AVG | MTEB Code AVG |
---|---|---|---|
**jina-code-embeddings-1.5b** | 1.54B | 79.04% | 78.94% |
**jina-code-embeddings-0.5b** | 494M | 78.41% | 78.72% |
voyage-code-3 | Unknown* | 79.23% | 79.84% |
gemini-embedding-001 | Unknown* | 77.38% | 76.48% |
jina-embeddings-v4 | 3.8B | 74.11% | 74.87% |
Qwen3-Embedding-0.6B | 600M | 73.49% | 74.69% |
*Closed-source models with undisclosed architecture

Both models were trained with five task-specific instruction prefixes for different retrieval scenarios, each supporting both query and document roles for asymmetric retrieval. For example, you can use nl2code_query to embed queries and nl2code_document to embed documents; a minimal sketch of how these prefixes are applied follows the table below.
Task | Use Case | Instruction Prefix |
---|---|---|
nl2code | "How to read CSV" → pandas.read_csv() | "Find the most relevant code snippet given the following query:\n" |
qa | Technical Q&A retrieval | "Find the most relevant answer given the following question:\n" |
code2code | Finding similar implementations | "Find an equivalent code snippet given the following code snippet:\n" |
code2nl | Code to documentation | "Find the most relevant comment given the following code snippet:\n" |
code2completion | Autocomplete scenarios | "Find the most relevant completion given the following start of code snippet:\n" |
Training Recipe
We use pre-trained code generation models as embedding backbones. Built on Qwen2.5-Coder-0.5B and Qwen2.5-Coder-1.5B, our models feature:
Feature | jina-code-embeddings-0.5b | jina-code-embeddings-1.5b |
---|---|---|
Base Model | Qwen2.5-Coder-0.5B | Qwen2.5-Coder-1.5B |
Embedding Dimensions | 896 | 1536 |
Matryoshka Dimensions | 64, 128, 256, 512, 896 | 128, 256, 512, 1024, 1536 |
Max Sequence Length | 32,768 tokens | 32,768 tokens |
Pooling Strategy | Last-token pooling | Last-token pooling |
Attention | FlashAttention2 | FlashAttention2 |
Data Type | BFloat16 | BFloat16 |
Traditional code embedding models face a fundamental bottleneck: there simply aren't enough high-quality comment-code pairs for supervised training. By starting from Qwen2.5-Coder, pre-trained on 5.5 trillion tokens spanning 92+ programming languages, we inherit deep semantic understanding of programming constructs, cross-language pattern recognition, and built-in knowledge of syntax and idioms. Contrastive fine-tuning then adapts this knowledge for retrieval tasks with minimal aligned data, sidestepping the data scarcity that constrains encoder-only models.
For underrepresented tasks like cross-framework code translations, we generated synthetic data using LLMs, with every synthetic example manually validated for quality. Our training data combined existing MTEB code task training splits with adapted public datasets including CommitPackFT, SWE-Bench, Spider, MBPP, and CodeSearchNet.
Unlike jina-embeddings-v3 and v4, we didn't use LoRA and went straight to full-parameter post-training. For small models like ours (494M and 1.54B parameters), LoRA's parameter efficiency becomes less compelling: the adapter overhead can actually hurt performance when you have limited capacity, and we needed every parameter working on the embedding task. Even for multi-task scenarios, task-specific instruction prefixes proved cleaner than multiple LoRA adapters. Instead of switching weight configurations, we simply prepend different instructions, which is much leaner and more aligned with how LLMs naturally process conditional information.
Training was remarkably efficient: both models were trained using contrastive learning with InfoNCE loss on 4x A100 80GB GPUs, completing in just 8.3 hours for the 0.5B model and 12 hours for the 1.5B variant.
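For readers unfamiliar with the objective, here is a minimal sketch of InfoNCE with in-batch negatives over paired query/document embeddings. The temperature, batch size, and dimensions are illustrative placeholders, not our actual training hyperparameters.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb, doc_emb, temperature=0.05):
    # L2-normalize so the dot product is cosine similarity
    query_emb = F.normalize(query_emb, dim=-1)
    doc_emb = F.normalize(doc_emb, dim=-1)
    # (batch, batch) similarity matrix: row i's positive is column i,
    # every other column in that row acts as an in-batch negative
    logits = query_emb @ doc_emb.T / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)

# Random tensors stand in for the pooled query/document embeddings
loss = info_nce_loss(torch.randn(8, 896), torch.randn(8, 896))
```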
Finally, we benchmarked different pooling strategies. Last-token pooling achieved a 78.41% overall average, consistently outperforming mean pooling (77.20%) and latent attention pooling (78.27%) across all benchmark categories. This 1.2-percentage-point advantage over mean pooling led us to break from the mean-pooling tradition we established in jina-embeddings-v2, v3, and v4. As more retrieval models build on decoder-only LLMs, last-token pooling becomes the natural choice: mean pooling simply doesn't align well with unidirectional attention mechanisms. While mean pooling can work and often trains more easily in early steps (likely due to a smoother optimization landscape), our experiments consistently show it plateaus below the performance ceiling that last-token pooling reaches.
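To make the comparison concrete, here is a minimal sketch of the two strategies over a padded batch. The last-token variant below assumes right-padded inputs; the transformers example later in this post also handles left padding.

```python
import torch

def mean_pool(hidden_states, attention_mask):
    # Average hidden states over non-padding positions
    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)   # (batch, seq, 1)
    return (hidden_states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

def last_token_pool_right_padded(hidden_states, attention_mask):
    # Take each sequence's final non-padding token. Under causal attention,
    # this is the only position that has attended to the whole sequence,
    # which is why it suits decoder-only backbones.
    lengths = attention_mask.sum(dim=1) - 1
    return hidden_states[torch.arange(hidden_states.size(0)), lengths]
```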
Getting Started
Both models work seamlessly via our Search Foundation API and with popular frameworks including sentence-transformers, transformers, and llama.cpp.
Via API
curl https://api.jina.ai/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $JINA_API_KEY" \
  -d @- <<EOFEOF
{
"model": "jina-code-embeddings-1.5b",
"input": ["print hello world in python"],
"task": "nl2code.passage"
}
EOFEOF
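If you prefer Python over curl, an equivalent request might look like the sketch below. It assumes the same endpoint and payload, and that the response follows the usual OpenAI-style {"data": [{"embedding": ...}]} layout; since the input string is a natural-language query, the query-side task is used.

```python
import os
import requests

# Same request as the curl example above, sent from Python
resp = requests.post(
    "https://api.jina.ai/v1/embeddings",
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {os.environ['JINA_API_KEY']}",
    },
    json={
        "model": "jina-code-embeddings-1.5b",
        "input": ["print hello world in python"],
        "task": "nl2code.query",
    },
)
resp.raise_for_status()
embedding = resp.json()["data"][0]["embedding"]   # list of floats
```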
Via sentence-transformers
from sentence_transformers import SentenceTransformer
# Load the model (choose 0.5b or 1.5b)
model = SentenceTransformer(
    "jinaai/jina-code-embeddings-1.5b",
    model_kwargs={"torch_dtype": "bfloat16"},
    tokenizer_kwargs={"padding_side": "left"},
)
# Natural language to code
queries = ["print hello world in python", "initialize array of 5 zeros in c++"]
documents = ["print('Hello World!')", "int arr[5] = {0, 0, 0, 0, 0};"]
# Generate embeddings with task-specific prefixes
query_embeddings = model.encode(queries, prompt_name="nl2code_query")
document_embeddings = model.encode(documents, prompt_name="nl2code_document")
# Compute similarity
similarity = model.similarity(query_embeddings, document_embeddings)
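From here, ranking is just an argmax over the similarity matrix; continuing the snippet above:

```python
# Pick the best-matching document for each query
best = similarity.argmax(dim=1)
for query, idx in zip(queries, best):
    print(f"{query!r} -> {documents[idx]!r}")
```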
Via transformers
from transformers import AutoModel, AutoTokenizer
import torch
import torch.nn.functional as F

def last_token_pool(last_hidden_states, attention_mask):
    # With left padding, the last position holds the final real token for every sequence
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size), sequence_lengths]

tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-code-embeddings-1.5b')
model = AutoModel.from_pretrained('jinaai/jina-code-embeddings-1.5b')

# Apply task-specific prefixes (query vs. document role)
query = "Find the most relevant code snippet given the following query:\nprint hello world"
code = "Candidate code snippet:\nprint('Hello World!')"

# Tokenize, embed, and pool the last token
batch_dict = tokenizer([query, code], padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch_dict)
embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
embeddings = F.normalize(embeddings, p=2, dim=1)  # L2-normalize for cosine similarity
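With L2-normalized embeddings, cosine similarity reduces to a dot product; a short usage continuation of the snippet above:

```python
# Score the query against the code snippet: normalized dot product = cosine similarity
scores = embeddings @ embeddings.T
print(scores[0, 1].item())
```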
Matryoshka Embeddings Cut-Off
Both models were trained with Matryoshka representation learning ([64, 128, 256, 512, 896] dimensions for the 0.5B model, [128, 256, 512, 1024, 1536] for the 1.5B model), allowing you to truncate embeddings without recomputing them:
# Full embeddings: 896d (0.5B) or 1536d (1.5B)
full_embedding = model.encode(text)
# Truncate to smaller dimensions for efficiency
small_embedding = full_embedding[:256] # Works for both models
tiny_embedding = full_embedding[:128] # 0.5B supports down to 64d
This flexibility enables trading off between performance and efficiency based on your requirements.
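One caveat worth noting: if you compare truncated embeddings with cosine similarity, it's good practice to re-normalize them after slicing, since truncation changes the vector norm. A minimal sketch, assuming full_embedding is the NumPy array returned by model.encode above:

```python
import numpy as np

def truncate_and_normalize(embedding, dim):
    # Slice to the target Matryoshka dimension, then restore unit norm
    truncated = embedding[:dim]
    return truncated / np.linalg.norm(truncated)

small = truncate_and_normalize(full_embedding, 256)
```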
Conclusion
jina-code-embeddings demonstrates that effective code embeddings don't require massive scale. By building on code generation models and applying targeted contrastive fine-tuning, we achieve state-of-the-art performance with models in the 0.5B to 1.5B parameter range.
The strong results from such compact models (0.5B/1.5B) validate our thesis: the right foundation matters more than parameter count. Generation models understand code semantics—that understanding transfers directly to representation tasks.
This aligns with our broader vision at Jina AI: unified architectures where embedding and generation emerge from the same foundation, pushing the boundaries of what's possible with search foundation models.