jina-code-embeddings-1.5b

Efficient code embeddings from code generation models
Release Post
License: CC-BY-NC-4.0
Release Date: 2025-09-01
Input: Text (Code)
Output: Vector
Matryoshka Dimensions: 128, 256, 512, 1024, 1536
Model Details
Parameters: 1.5B
Input Token Length: 32K
Output Dimension: 1536
Language Support
🌍 Multilingual support
Quantizations
GGUF
Related Models: jina-code-embeddings-0.5b, jina-embeddings-v2-base-code
Tags: code-embeddings, programming-languages, semantic-code-search, code-similarity, long-context, text-embeddings, multilingual-code, docstring-search
Available via: Jina API, AWS SageMaker, Microsoft Azure, Google Cloud, Hugging Face
Publications (1)
arXiv, August 31, 2025: Efficient Code Embeddings from Code Generation Models

Overview

jina-code-embeddings-1.5b is a 1.54-billion-parameter model that marks a significant advance in code retrieval. Built on the Qwen2.5-Coder-1.5B backbone with last-token pooling, it moves beyond traditional training on limited aligned data and instead leverages vast unaligned code and documentation corpora. The model implements task-specific instructions across five categories (NL2Code, TechQA, Code2Code, Code2NL, and Code2Completion), each with distinct prefixes for queries and documents, and supports Matryoshka representation learning for flexible embedding truncation. Despite its larger size, it retains practical deployment characteristics while achieving benchmark performance competitive with substantially larger alternatives.
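
As a minimal sketch of how these pieces fit together, the following assumes the model is published on Hugging Face as jinaai/jina-code-embeddings-1.5b and loads through transformers; the prefix strings shown are hypothetical placeholders, not the official instruction text.

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "jinaai/jina-code-embeddings-1.5b"  # assumed Hugging Face repo id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
tokenizer.padding_side = "right"  # so the last non-padding token is easy to locate
model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True).eval()

def embed(texts, dim=1536):
    # Encode texts with last-token pooling, then truncate to a Matryoshka dimension.
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=512, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state          # (B, T, H)
    last = batch["attention_mask"].sum(dim=1) - 1           # index of last real token
    pooled = hidden[torch.arange(hidden.size(0)), last]     # (B, H) last-token pooling
    return F.normalize(pooled[:, :dim], p=2, dim=1)         # Matryoshka truncation

# Hypothetical NL2Code-style prefixes; use the prefixes shipped with the model.
query = "Find the most relevant code snippet given the following query: reverse a linked list"
doc = "Candidate code snippet: def reverse(head): ..."
q_emb, d_emb = embed([query, doc], dim=512)
print(float(q_emb @ d_emb))  # cosine similarity of L2-normalized vectors

The dim argument corresponds to the Matryoshka dimensions listed above; truncating and re-normalizing trades a little accuracy for smaller, cheaper vectors.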

Methods

The model is trained contrastively with an InfoNCE loss at temperature τ=0.05, a batch size of 256 (adjusted for memory efficiency), and a sequence length of 512. Training ran for 1,500 steps on four A100 GPUs and took 12 hours. The training data combines MTEB splits, CoSQA+, CodeSearchNet, CommitPackFT, and GPT-4o synthetic data covering underrepresented scenarios such as framework translations. Task-specific prefixes make the retrieval intent explicit; Code2Code queries, for example, use 'Find an equivalent code snippet given the following code snippet:'. Ablations confirmed last-token pooling as the superior pooling strategy. Because every in-batch combination serves as a positive or negative pair, contrastive learning multiplies the effective training signal, as illustrated in the sketch below.
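
To make the in-batch positive/negative point concrete, here is a generic InfoNCE sketch at τ=0.05; it illustrates the objective described above, not the actual training code.

import torch
import torch.nn.functional as F

def info_nce(query_emb, doc_emb, tau=0.05):
    # query_emb, doc_emb: (B, D) L2-normalized embeddings of aligned query/document pairs.
    logits = query_emb @ doc_emb.T / tau     # (B, B): diagonal = positives, off-diagonal = in-batch negatives
    targets = torch.arange(logits.size(0))   # the i-th query matches the i-th document
    return F.cross_entropy(logits, targets)

# Example with random unit vectors at the training batch size and output dimension.
q = F.normalize(torch.randn(256, 1536), dim=1)
d = F.normalize(torch.randn(256, 1536), dim=1)
print(info_nce(q, d).item())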

Performance

The model achieves a 79.04% overall average and a 78.94% MTEB Code average, setting new benchmarks for its parameter class. Standout scores include 98.41% on HumanEval, 90.13% on MBPP, 98.02% on WikiSQL, and 99.44% on CodeChefXLang. Code-to-code retrieval reaches 92.54% on CodeTransOceanContest, while NL2Code delivers 86.45% on COIR-CodeSearchNet and 96.34% on Doc2Code. Technical Q&A achieves 92.37% on StackOverflowQA. The model surpasses larger alternatives and improves consistently over the 0.5B variant, particularly on complex tasks such as SWE-Bench (86.33% vs. 83.00%).

Best Practice

Choose instruction prefixes according to the retrieval task and keep them consistent across the pipeline. The model's added capacity suits complex scenarios involving multiple paradigms and extensive codebases. Profile your use case to find the Matryoshka dimension that best balances quality against resource cost, and use a batch size of 256 in production to stay aligned with training. The 99.44% CodeChefXLang score makes it a strong fit for cross-repository and cross-language search. Deploy it as the primary retrieval component in RAG systems and consider confidence scoring based on embedding similarities. It is well suited to enterprise deployments that need both performance and efficiency with sub-second latency; cache frequently used embeddings and use hierarchical indexing for speed, as in the sketch below.
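
The dimension trade-off, embedding cache, and confidence-scoring suggestions above can be combined in a small in-memory index. This sketch assumes full 1536-dimensional, L2-normalized embeddings are produced elsewhere (for instance by an encoder like the one sketched under Overview); a real deployment would swap in a vector database or a hierarchical coarse-to-fine index.

import torch
import torch.nn.functional as F

class CodeIndex:
    # Tiny in-memory index: Matryoshka-truncated vectors plus an embedding cache.

    def __init__(self, dim=512):
        self.dim = dim                    # one of the Matryoshka dims: 128/256/512/1024/1536
        self.cache = {}                   # text -> truncated, re-normalized vector
        self.keys, self.vectors = [], []

    def _truncate(self, emb):
        # Keep the leading `dim` components and re-normalize so dot products stay cosine similarities.
        return F.normalize(emb[: self.dim], p=2, dim=0)

    def add(self, text, full_emb):
        vec = self.cache.setdefault(text, self._truncate(full_emb))
        self.keys.append(text)
        self.vectors.append(vec)

    def search(self, query_full_emb, k=5, min_score=None):
        q = self._truncate(query_full_emb)
        sims = torch.stack(self.vectors) @ q          # cosine similarity against all entries
        top = torch.topk(sims, k=min(k, len(self.keys)))
        hits = [(self.keys[i], float(s)) for s, i in zip(top.values, top.indices)]
        # Optional confidence threshold on the similarity score.
        return [h for h in hits if min_score is None or h[1] >= min_score]
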
Blogs that mention this model
September 04, 2025 • 6 minutes read
Jina Code Embeddings: SOTA Code Retrieval at 0.5B and 1.5B
Code generation LLMs → code embeddings: 0.5B/1.5B models achieve SOTA performance across 25 code retrieval benchmarks.
Jina AI
Green "Code Embeddings" text displayed in a LED dot style on a black background, evoking a futuristic and technological atmos