Overview
jina-code-embeddings-1.5b is a 1.54 billion parameter model representing a significant advance in code retrieval. Built on the Qwen2.5-Coder-1.5B backbone with last-token pooling, it moves beyond traditional training on limited aligned data to leverage vast unaligned code and documentation corpora. The model implements task-specific instructions across five categories: NL2Code, TechQA, Code2Code, Code2NL, and Code2Completion, each with distinct prefixes for queries and documents. It supports Matryoshka representation learning for flexible embedding truncation. Despite its larger size, it maintains practical deployment characteristics while achieving benchmark performance competitive with substantially larger alternatives.
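As a rough illustration of how the instruction-prefix scheme and last-token pooling might be wired together, the sketch below loads the checkpoint with Hugging Face transformers and prepends the Code2Code query prefix quoted in the Methods section. It assumes the checkpoint loads as a standard AutoModel exposing last_hidden_state; the document-side prefix and pooling details are assumptions, not confirmed API.

```python
# Minimal sketch: task-specific prefixes + last-token pooling.
# Prefix strings other than the Code2Code query prefix cited in this
# card are placeholders, not official prompts.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "jinaai/jina-code-embeddings-1.5b"  # assumed Hugging Face ID
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True).eval()

tokenizer.padding_side = "right"              # so last-token indexing below is valid
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

QUERY_PREFIX = "Find an equivalent code snippet given the following code snippet: "
DOC_PREFIX = ""  # hypothetical: the document-side prefix differs per task

def embed(texts, prefix):
    batch = tokenizer([prefix + t for t in texts], padding=True,
                      truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state          # (B, T, H)
    # Last-token pooling: hidden state of the final non-pad token per sequence.
    last_idx = batch["attention_mask"].sum(dim=1) - 1
    pooled = hidden[torch.arange(hidden.size(0)), last_idx]
    return torch.nn.functional.normalize(pooled, dim=-1)

query_emb = embed(["def add(a, b): return a + b"], QUERY_PREFIX)
doc_embs = embed(["sum = lambda x, y: x + y", "print('hello')"], DOC_PREFIX)
print(query_emb @ doc_embs.T)  # cosine similarities after normalization
```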
Methods
The model is trained contrastively with an InfoNCE loss using temperature τ=0.05, batch size 256 (adjusted for memory efficiency), and sequence length 512. Training ran for 1,500 steps on four A100 GPUs and took 12 hours. Training data includes MTEB splits, CoSQA+, CodeSearchNet, CommitPackFT, and GPT-4o synthetic data for underrepresented scenarios such as framework translations. Task-specific prefixes enable nuanced task understanding; for example, Code2Code queries use 'Find an equivalent code snippet given the following code snippet:'. Last-token pooling was confirmed superior through ablation. Contrastive learning multiplies the training signal through in-batch pairings: each query's paired document is its positive, and every other document in the batch serves as a negative.
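To make the contrastive objective concrete, here is a minimal sketch of InfoNCE with in-batch negatives at the stated temperature τ=0.05. It illustrates the general recipe rather than Jina's exact training code, and the embedding dimension below is illustrative.

```python
# Sketch of InfoNCE with in-batch negatives (temperature 0.05).
# Row i of the document batch is the positive for query i; all other
# rows act as negatives. Illustrative, not the official training code.
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb, doc_emb, temperature=0.05):
    # query_emb, doc_emb: (B, D) L2-normalized embeddings, row i paired with row i
    logits = query_emb @ doc_emb.T / temperature          # (B, B) similarity matrix
    labels = torch.arange(query_emb.size(0), device=query_emb.device)
    return F.cross_entropy(logits, labels)                # diagonal entries are positives

# Toy usage with random unit vectors standing in for model outputs.
q = F.normalize(torch.randn(256, 1536), dim=-1)   # batch size 256 as in training
d = F.normalize(torch.randn(256, 1536), dim=-1)
print(info_nce_loss(q, d).item())
```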
Performance
The model achieves a 79.04% overall average and a 78.94% MTEB Code average, establishing new benchmarks for its parameter class. Standout scores include 98.41% on HumanEval, 90.13% on MBPP, 98.02% on WikiSQL, and 99.44% on CodeChefXLang. Code-to-code retrieval reaches 92.54% on CodeTransOceanContest. NL2Code delivers 86.45% on COIR-CodeSearchNet and 96.34% on Doc2Code, and technical Q&A achieves 92.37% on StackOverflowQA. The model surpasses larger alternatives and shows consistent improvements over the 0.5B variant, particularly on complex tasks such as SWE-Bench (86.33% vs. 83.00%).
Best Practice
Employ instruction prefixes strategically based on retrieval requirements, keeping them consistent across the pipeline. The model's enhanced capacity is well suited to complex scenarios involving multiple paradigms and extensive codebases. Profile your use cases to determine the optimal Matryoshka dimension, balancing quality against resources. Use batch size 256 in production to align with training. The model excels at cross-repository and cross-language searches, given its 99.44% CodeChefXLang performance. Implement it as the primary retrieval component in RAG systems, and consider confidence scoring based on embedding similarities. It is well suited to enterprise deployments requiring both performance and efficiency with sub-second latency; cache frequently used embeddings and use hierarchical indexing for speed.
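As one possible way to apply the Matryoshka and confidence-scoring advice above, the sketch below truncates full embeddings to a smaller dimension, re-normalizes, and uses cosine similarity as a retrieval confidence signal. The 256-dimension cut and 0.7 threshold are illustrative choices, not values from the model card.

```python
# Sketch: Matryoshka truncation plus similarity-based confidence scoring.
# The 256-dim cut and 0.7 threshold are illustrative, not official values.
import numpy as np

def truncate_matryoshka(embeddings: np.ndarray, dim: int = 256) -> np.ndarray:
    """Keep the first `dim` components and re-normalize to unit length."""
    truncated = embeddings[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.clip(norms, 1e-12, None)

def retrieve(query: np.ndarray, corpus: np.ndarray, top_k: int = 5,
             min_confidence: float = 0.7):
    """Return (index, cosine similarity) pairs above a confidence threshold."""
    scores = corpus @ query                  # unit vectors -> cosine similarity
    order = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in order if scores[i] >= min_confidence]

# Toy usage with random vectors standing in for real embeddings.
rng = np.random.default_rng(0)
corpus = truncate_matryoshka(rng.standard_normal((1000, 1536)))
query = truncate_matryoshka(rng.standard_normal((1, 1536)))[0]
print(retrieve(query, corpus))
```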