Overview
jina-code-embeddings-1.5b is a 1.54 billion parameter model representing a significant advance in code retrieval. Built on the Qwen2.5-Coder-1.5B backbone with last-token pooling, it moves beyond traditional training on limited aligned data to leverage vast unaligned code and documentation corpora. The model implements task-specific instructions across five categories: NL2Code, TechQA, Code2Code, Code2NL, and Code2Completion, each with distinct prefixes for queries and documents. It supports Matryoshka representation learning for flexible embedding truncation. Despite its larger size, it maintains practical deployment characteristics while achieving benchmark performance competitive with substantially larger alternatives.
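As a rough illustration of how the instruction-prefix scheme and last-token pooling might be wired together, the sketch below loads the checkpoint with Hugging Face transformers and prepends the Code2Code query prefix quoted in the Methods section. It assumes the checkpoint loads as a standard AutoModel exposing last_hidden_state; the document-side prefix and pooling details are assumptions, not confirmed API.

```python
# Minimal sketch: task-specific prefixes + last-token pooling.
# Prefix strings other than the Code2Code query prefix cited in this
# card are placeholders, not official prompts.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "jinaai/jina-code-embeddings-1.5b"  # assumed Hugging Face ID
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True).eval()

tokenizer.padding_side = "right"              # so last-token indexing below is valid
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

QUERY_PREFIX = "Find an equivalent code snippet given the following code snippet: "
DOC_PREFIX = ""  # hypothetical: the document-side prefix differs per task

def embed(texts, prefix):
    batch = tokenizer([prefix + t for t in texts], padding=True,
                      truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state          # (B, T, H)
    # Last-token pooling: hidden state of the final non-pad token per sequence.
    last_idx = batch["attention_mask"].sum(dim=1) - 1
    pooled = hidden[torch.arange(hidden.size(0)), last_idx]
    return torch.nn.functional.normalize(pooled, dim=-1)

query_emb = embed(["def add(a, b): return a + b"], QUERY_PREFIX)
doc_embs = embed(["sum = lambda x, y: x + y", "print('hello')"], DOC_PREFIX)
print(query_emb @ doc_embs.T)  # cosine similarities after normalization
```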
Methods
The model is trained contrastively with an InfoNCE loss using temperature τ=0.05, batch size 256 (adjusted for memory efficiency), and sequence length 512. Training ran for 1,500 steps on four A100 GPUs and took 12 hours. Training data includes MTEB splits, CoSQA+, CodeSearchNet, CommitPackFT, and GPT-4o synthetic data for underrepresented scenarios such as framework translations. Task-specific prefixes enable nuanced task understanding; for example, Code2Code queries use 'Find an equivalent code snippet given the following code snippet:'. Last-token pooling was confirmed superior through ablation. Contrastive learning multiplies the training signal through in-batch pairings: each query's paired document is its positive, and every other document in the batch serves as a negative.
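To make the contrastive objective concrete, here is a minimal sketch of InfoNCE with in-batch negatives at the stated temperature τ=0.05. It illustrates the general recipe rather than Jina's exact training code, and the embedding dimension below is illustrative.

```python
# Sketch of InfoNCE with in-batch negatives (temperature 0.05).
# Row i of the document batch is the positive for query i; all other
# rows act as negatives. Illustrative, not the official training code.
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb, doc_emb, temperature=0.05):
    # query_emb, doc_emb: (B, D) L2-normalized embeddings, row i paired with row i
    logits = query_emb @ doc_emb.T / temperature          # (B, B) similarity matrix
    labels = torch.arange(query_emb.size(0), device=query_emb.device)
    return F.cross_entropy(logits, labels)                # diagonal entries are positives

# Toy usage with random unit vectors standing in for model outputs.
q = F.normalize(torch.randn(256, 1536), dim=-1)   # batch size 256 as in training
d = F.normalize(torch.randn(256, 1536), dim=-1)
print(info_nce_loss(q, d).item())
```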
Performance
The model achieves a 79.04% overall average and a 78.94% MTEB Code average, establishing new benchmarks for its parameter class. Standout scores include 98.41% on HumanEval, 90.13% on MBPP, 98.02% on WikiSQL, and 99.44% on CodeChefXLang. Code-to-code retrieval reaches 92.54% on CodeTransOceanContest. NL2Code delivers 86.45% on COIR-CodeSearchNet and 96.34% on Doc2Code, and technical Q&A achieves 92.37% on StackOverflowQA. The model surpasses larger alternatives and shows consistent improvements over the 0.5B variant, particularly on complex tasks such as SWE-Bench (86.33% vs. 83.00%).
Best Practice
Employ instruction prefixes strategically based on retrieval requirements, keeping them consistent across the pipeline. The model's enhanced capacity is well suited to complex scenarios involving multiple paradigms and extensive codebases. Profile your use cases to determine the optimal Matryoshka dimension, balancing quality against resources. Use batch size 256 in production to align with training. The model excels at cross-repository and cross-language searches, given its 99.44% CodeChefXLang performance. Implement it as the primary retrieval component in RAG systems, and consider confidence scoring based on embedding similarities. It is well suited to enterprise deployments requiring both performance and efficiency with sub-second latency; cache frequently used embeddings and use hierarchical indexing for speed.
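As one possible way to apply the Matryoshka and confidence-scoring advice above, the sketch below truncates full embeddings to a smaller dimension, re-normalizes, and uses cosine similarity as a retrieval confidence signal. The 256-dimension cut and 0.7 threshold are illustrative choices, not values from the model card.

```python
# Sketch: Matryoshka truncation plus similarity-based confidence scoring.
# The 256-dim cut and 0.7 threshold are illustrative, not official values.
import numpy as np

def truncate_matryoshka(embeddings: np.ndarray, dim: int = 256) -> np.ndarray:
    """Keep the first `dim` components and re-normalize to unit length."""
    truncated = embeddings[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.clip(norms, 1e-12, None)

def retrieve(query: np.ndarray, corpus: np.ndarray, top_k: int = 5,
             min_confidence: float = 0.7):
    """Return (index, cosine similarity) pairs above a confidence threshold."""
    scores = corpus @ query                  # unit vectors -> cosine similarity
    order = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in order if scores[i] >= min_confidence]

# Toy usage with random vectors standing in for real embeddings.
rng = np.random.default_rng(0)
corpus = truncate_matryoshka(rng.standard_normal((1000, 1536)))
query = truncate_matryoshka(rng.standard_normal((1, 1536)))[0]
print(retrieve(query, corpus))
```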