jina-code-embeddings-0.5b

Efficient code embeddings from code generation models
Release Post
License: CC-BY-NC-4.0
Release Date: 2025-09-01
Input: Text (Code)
Output: Vector
Matryoshka Dimensions: 64, 128, 256, 512, 896
Model Details
Parameters: 0.5B
Input Token Length: 32K
Output Dimension: 896
Language Support
🌍 Multilingual support
Quantizations
GGUF
Related Models
jina-code-embeddings-1.5b
jina-embeddings-v2-base-code
Tags
code-embeddings
programming-languages
semantic-code-search
code-similarity
long-context
text-embeddings
multilingual-code
docstring-search
Available via
Jina API, AWS SageMaker, Microsoft Azure, Google Cloud, Hugging Face
Publications (1)
Efficient Code Embeddings from Code Generation Models (arXiv, August 31, 2025)

Overview

jina-code-embeddings-0.5b is a 494-million-parameter code embedding model designed for retrieving code from natural language queries, answering technical questions, and identifying similar code across languages. Built on the Qwen2.5-Coder-0.5B backbone, it generates embeddings via last-token pooling and addresses a fundamental limitation of traditional code embedding models: their reliance on scarce aligned data such as comments and docstrings. Instead, the model leverages the abundant unaligned code and documentation used in LLM training, achieving state-of-the-art performance despite its compact size. It supports five task categories, each with a specific instruction prefix: NL2Code, TechQA, Code2Code, Code2NL, and Code2Completion. The model also implements Matryoshka representation learning, producing truncatable embeddings that allow flexible precision-resource trade-offs.
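A minimal usage sketch of this workflow, assuming the model is published on Hugging Face and loads with the standard transformers AutoModel/AutoTokenizer classes; the repo id, the prefix-to-text concatenation format, and the example query are assumptions, while the NL2Code query prefix is the one quoted in the Methods section below.

```python
# Sketch only: repo id and prompt formatting are assumptions, not the official API.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "jinaai/jina-code-embeddings-0.5b"   # assumed Hugging Face repo id
NL2CODE_QUERY_PREFIX = "Find the most relevant code snippet given the following query:"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.padding_side = "right"                 # last-token pooling below assumes right padding
model = AutoModel.from_pretrained(MODEL_ID).eval()

def embed(texts, prefix="", dim=896):
    """Embed texts with last-token pooling, then Matryoshka-truncate and unit-normalize."""
    batch = tokenizer([f"{prefix} {t}".strip() for t in texts],
                      padding=True, truncation=True, max_length=512,
                      return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state            # (B, T, H)
    last = batch["attention_mask"].sum(dim=1) - 1             # index of last real token per sequence
    pooled = hidden[torch.arange(hidden.size(0)), last]       # (B, H) last-token pooling
    return F.normalize(pooled[:, :dim], p=2, dim=-1)          # truncate leading dims, renormalize

query_vecs = embed(["parse a JSON file into a dict"], prefix=NL2CODE_QUERY_PREFIX, dim=256)
```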

Methods

The model is trained contrastively with an InfoNCE loss using temperature τ=0.05, batch size 512, and sequence length 512. Training data includes MTEB code tasks, CoSQA+, adapted public datasets, and GPT-4o-generated synthetic data for rare scenarios. Task-specific instruction prefixes condition the model differently for queries and documents; for example, NL2Code uses 'Find the most relevant code snippet given the following query:' for queries. Training on four A100 GPUs for 1,500 steps took 8.3 hours. In ablation studies, last-token pooling outperformed mean pooling and latent attention pooling. The contrastive objective treats each query-document pair as a positive example and all cross-combinations within the batch as negatives.
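As an illustration, a minimal PyTorch sketch of the in-batch InfoNCE objective described above, with τ=0.05, positives on the diagonal, and all other in-batch combinations serving as negatives; tensor names and shapes are illustrative, not taken from the training code.

```python
import torch
import torch.nn.functional as F

def info_nce(query_emb: torch.Tensor, doc_emb: torch.Tensor, tau: float = 0.05) -> torch.Tensor:
    """In-batch contrastive loss: row i of each matrix forms a positive pair,
    and every other row in the batch acts as a negative."""
    q = F.normalize(query_emb, p=2, dim=-1)    # (B, D)
    d = F.normalize(doc_emb, p=2, dim=-1)      # (B, D)
    logits = q @ d.T / tau                     # (B, B) cosine similarities scaled by temperature
    targets = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, targets)    # diagonal entries are the positives
```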

Performance

The model achieves a 78.41% overall average and a 78.72% MTEB Code average across benchmarks. Notable scores include 96.77% on HumanEval, 89.01% on MBPP, 98.31% on WikiSQL, and 99.70% on CodeChefXLang. It outperforms the similarly sized Qwen3-Embedding-0.6B as well as larger models such as jina-embeddings-v4 (74.11%) and gemini-embedding-001 (77.38%). It excels at code-to-code retrieval, scoring 90.37% on CodeTransOceanContest, and shows strong NL2Code performance with 85.73% on COIR-CodeSearchNet and 95.98% on Doc2Code. Its technical Q&A capability is demonstrated by 91.04% on StackOverflowQA.

Best Practice

Always use the appropriate task-specific instruction prefixes for queries and documents. Leverage Matryoshka embeddings to balance quality against resources: start with the full 896 dimensions and truncate as needed. The recommended batch size is 512 with a sequence length of 512 tokens. Use cosine similarity to compare embeddings. The model is well suited to multilingual code search given its 99.70% CodeChefXLang performance, and its compact size makes it a good fit for edge deployment and real-time applications. Consider two-stage retrieval, with initial candidates from this model followed by a reranker; cache frequently accessed embeddings and use hierarchical indexing for large codebases.
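A minimal sketch of the truncate-and-compare pattern described above, assuming full 896-dimensional embeddings have already been computed (for example with an embed helper like the one sketched in the Overview); the 256-dimension cut, array shapes, and function names are illustrative.

```python
import numpy as np

def truncate_and_normalize(emb: np.ndarray, dim: int = 256) -> np.ndarray:
    """Keep the leading Matryoshka dimensions and re-normalize so that
    cosine similarity reduces to a dot product."""
    cut = emb[..., :dim]
    return cut / np.linalg.norm(cut, axis=-1, keepdims=True)

def search(query_vec: np.ndarray, corpus_vecs: np.ndarray, dim: int = 256, top_k: int = 10):
    """Rank corpus_vecs (N, 896) against query_vec (896,) by cosine similarity
    on truncated embeddings; return indices and scores of the top hits."""
    q = truncate_and_normalize(query_vec, dim)
    c = truncate_and_normalize(corpus_vecs, dim)
    scores = c @ q                              # cosine similarity on unit vectors
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]
```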
Blogs that mention this model
September 04, 2025 • 6 minutes read
Jina Code Embeddings: SOTA Code Retrieval at 0.5B and 1.5B
Code generation LLMs → code embeddings: 0.5B/1.5B models achieve SOTA performance across 25 code retrieval benchmarks.
Jina AI
Green "Code Embeddings" text displayed in a LED dot style on a black background, evoking a futuristic and technological atmos