Overview
jina-code-embeddings-0.5b is a 494-million-parameter code embedding model designed for retrieving code from natural language queries, answering technical questions, and identifying similar code across languages. Built on the Qwen2.5-Coder-0.5B backbone, it generates embeddings via last-token pooling and addresses a fundamental limitation of traditional code embedding models: their reliance on scarce aligned data such as comments and docstrings. By instead leveraging the abundant unaligned code and documentation used in LLM training, the model achieves state-of-the-art performance despite its compact size. It supports five task categories with specific instruction prefixes: NL2Code, TechQA, Code2Code, Code2NL, and Code2Completion. The model implements Matryoshka representation learning, producing truncatable embeddings that allow flexible precision-resource trade-offs.
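The snippet below is a minimal usage sketch of the query/document workflow described above: it applies a task-specific prefix, performs last-token pooling, and optionally truncates the Matryoshka embedding before normalizing. The Hub identifier, the plain transformers loading path, the document-side prefix, and the 256-dimension truncation are assumptions for illustration; only the NL2Code query prefix comes from this card.

```python
# Minimal sketch, assuming the model is published on the Hugging Face Hub as
# "jinaai/jina-code-embeddings-0.5b" and loads as a standard decoder encoder.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "jinaai/jina-code-embeddings-0.5b"  # assumed Hub identifier
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
tokenizer.padding_side = "right"               # needed for the last-token pooling below
model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True).eval()

def embed(texts, prefix, dim=None):
    """Embed texts with a task-specific prefix; optionally truncate (Matryoshka)."""
    batch = tokenizer([prefix + t for t in texts], padding=True,
                      truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state           # (B, T, H)
    # Last-token pooling: take the hidden state of the final non-padding token.
    last = batch["attention_mask"].sum(dim=1) - 1            # (B,)
    pooled = hidden[torch.arange(hidden.size(0)), last]
    if dim is not None:                                       # Matryoshka truncation
        pooled = pooled[:, :dim]
    return F.normalize(pooled, dim=-1)

# NL2Code query prefix from the model card; document-side prefix is a placeholder.
query = embed(["read a CSV file into a dataframe"],
              "Find the most relevant code snippet given the following query:\n",
              dim=256)
doc = embed(["import pandas as pd\ndf = pd.read_csv('data.csv')"],
            "Candidate code snippet:\n",  # placeholder, not the official prefix
            dim=256)
print("cosine similarity:", (query @ doc.T).item())
```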
Methods
The model is trained contrastively with the InfoNCE loss (temperature τ = 0.05), a batch size of 512, and a sequence length of 512 tokens. Training data includes MTEB code tasks, CoSQA+, adapted public datasets, and GPT-4o-generated synthetic data for rare scenarios. Task-specific instruction prefixes condition the model differently for queries and documents; for example, NL2Code queries are prefixed with 'Find the most relevant code snippet given the following query:'. Training ran for 1500 steps on four A100 GPUs and took 8.3 hours. In ablation studies, last-token pooling outperformed both mean pooling and latent attention pooling. The contrastive objective treats matched query-document pairs as positives and all cross-combinations within a batch as negatives.
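As a concrete illustration of the objective described above, here is a minimal sketch of in-batch InfoNCE with τ = 0.05, written one-directionally (queries scored against documents); the batch size matches the 512 reported here, while the embedding dimension is chosen purely for illustration.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(q, d, tau=0.05):
    """q, d: (B, D) L2-normalized embeddings; row i of q and d form a positive pair."""
    logits = (q @ d.T) / tau                        # (B, B) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)          # diagonal = positives, rest = in-batch negatives

# Toy check with random normalized embeddings (dimension is illustrative only).
q = F.normalize(torch.randn(512, 896), dim=-1)
d = F.normalize(torch.randn(512, 896), dim=-1)
print(info_nce_loss(q, d).item())
```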
Performance
The model achieves a 78.41% overall average and a 78.72% MTEB-Code average across benchmarks. Notable scores include 96.77% on HumanEval, 89.01% on MBPP, 98.31% on WikiSQL, and 99.70% on CodeChefXLang. It outperforms the similarly sized Qwen3-Embedding-0.6B as well as larger models such as jina-embeddings-v4 (74.11%) and gemini-embedding-001 (77.38%). It excels at code-to-code retrieval, scoring 90.37% on CodeTransOceanContest, and shows strong NL2Code performance with 85.73% on COIR-CodeSearchNet and 95.98% on Doc2Code. Technical Q&A capability is demonstrated by 91.04% on StackOverflowQA.
Best Practice
Always use the appropriate task-specific instruction prefixes for queries and documents. Leverage Matryoshka embeddings to balance quality against resources: start with full dimensions and truncate as needed. Keep inputs within the 512-token sequence length (training also used a batch size of 512). Use cosine similarity to compare embeddings. The 99.70% score on CodeChefXLang makes the model well suited to cross-language code search. Consider two-stage retrieval, taking initial candidates from this model and then reranking them. The compact size makes it a good fit for edge deployment and real-time applications. Cache frequently accessed embeddings and implement hierarchical indexing for large codebases.
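The following is a minimal sketch of the two-stage pattern suggested above, using truncated Matryoshka vectors for a cheap first pass over a cached index and full-dimension cosine similarity to rescore the shortlist; the function name, dimension choices, and candidate counts are illustrative assumptions, not recommendations from this card.

```python
import torch
import torch.nn.functional as F

def two_stage_search(query_emb, doc_embs, coarse_dim=128, top_k=100, final_k=10):
    """query_emb: (D,), doc_embs: (N, D) full-dimension, unnormalized embeddings."""
    # Stage 1: cheap scoring with truncated, re-normalized Matryoshka vectors.
    q_c = F.normalize(query_emb[:coarse_dim], dim=-1)
    d_c = F.normalize(doc_embs[:, :coarse_dim], dim=-1)
    shortlist = torch.topk(d_c @ q_c, k=top_k).indices
    # Stage 2: exact full-dimension cosine similarity on the shortlist only.
    q_f = F.normalize(query_emb, dim=-1)
    d_f = F.normalize(doc_embs[shortlist], dim=-1)
    order = torch.topk(d_f @ q_f, k=final_k).indices
    return shortlist[order]

# Toy usage with random tensors standing in for a cached embedding index.
docs = torch.randn(10_000, 896)
query = torch.randn(896)
print(two_stage_search(query, docs))
```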