jina-embeddings-v2-base-code

Optimized for code and docstring search
License: Apache-2.0
Release Date: 2024-02-05
Input: Text (Code)
Output: Vector
Model Details
Parameters: 161M
Input Token Length: 8K
Output Dimension: 768
Language Support
🇺🇸 English
Related Models
jina-embeddings-v2-base-en
Tags
code-embeddings
programming-languages
semantic-code-search
code-similarity
long-context
text-embeddings
multilingual-code
docstring-search
Available via
Jina API, AWS SageMaker, Microsoft Azure, Hugging Face

Overview

Jina Embeddings v2 Base Code addresses a common problem in software development: navigating and understanding large codebases. It lets developers search code in natural language across 30 programming languages. Unlike traditional code search tools that rely on exact pattern matching, the model captures the semantic meaning of code, so developers can find relevant snippets from plain English descriptions. This is particularly useful for teams maintaining large legacy codebases, developers onboarding to new projects, and organizations that want to improve code reuse and documentation practices.

Methods

The model is built on an architecture designed for code understanding. At its core is a transformer-based neural network with 161 million parameters, trained on diverse programming language datasets with an emphasis on six major languages: Python, JavaScript, Java, PHP, Go, and Ruby. Its extended context window of 8,192 tokens lets it process entire functions, or several files at once, in a single input. The model produces dense 768-dimensional embeddings that capture both the syntactic structure and the semantic meaning of code, so it can relate code segments that achieve the same goal through different programming patterns or syntax.
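
To make the idea of comparing embeddings concrete, here is a minimal sketch of ranking code snippets against a query by cosine similarity. The vectors are tiny hand-made stand-ins, not real model output (the model itself produces 768-dimensional vectors), and the names `cosine_similarity` and `rank_snippets` are illustrative rather than part of any Jina library.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here: a zero vector matches nothing
    return dot / (norm_a * norm_b)


def rank_snippets(
    query_vec: Sequence[float],
    snippet_vecs: Dict[str, Sequence[float]],
) -> List[Tuple[str, float]]:
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [
        (name, cosine_similarity(query_vec, vec))
        for name, vec in snippet_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional stand-ins for real 768-dimensional embeddings.
toy_index = {
    "read_file.py": [0.9, 0.1, 0.0],
    "read_file.go": [0.8, 0.2, 0.1],
    "sort_list.py": [0.0, 0.1, 0.9],
}
toy_query = [1.0, 0.0, 0.0]  # imagined embedding of "open a file and read it"

ranking = rank_snippets(toy_query, toy_index)
```

In this toy setup the two file-reading snippets rank above the sorting snippet even though they are written in different languages, which is the behavior the embeddings are meant to provide.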

Performance

In benchmark testing, Jina Embeddings v2 Base Code leads in nine of the fifteen CodeSearchNet benchmarks. Compared with models from Microsoft and Salesforce, it achieves better results with a smaller footprint. The model does especially well at cross-language code understanding, matching functionally equivalent snippets written in different programming languages. Its 8,192-token context window is most valuable for large functions and complex code files, where it significantly outperforms models that handle only a few hundred tokens. At 307MB (unquantized), the model is compact enough for fast inference while remaining accurate on code similarity and search tasks.
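
As a rough sanity check on the size figure, the arithmetic below shows one way to arrive at 307MB from the 161 million parameters quoted in the Methods section. It rests on two assumptions the page does not state: that weights are stored as 16-bit floats, and that "MB" is used in the binary sense (MiB, 2**20 bytes).

```python
# Back-of-envelope check only; both constants marked "assumption" are guesses.
PARAMETERS = 161_000_000   # parameter count quoted in the Methods section
BYTES_PER_WEIGHT = 2       # assumption: 16-bit weights
BYTES_PER_MIB = 2 ** 20    # assumption: "MB" means MiB

size_mib = PARAMETERS * BYTES_PER_WEIGHT / BYTES_PER_MIB
print(f"{size_mib:.0f} MiB")  # prints "307 MiB"
```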

Best Practice

To deploy Jina Embeddings v2 Base Code effectively, teams should consider several practical points. The model works with popular vector databases such as MongoDB, Qdrant, and Weaviate, which makes it straightforward to build scalable code search systems. Preprocess code so that each input fits within the 8,192-token limit, which accommodates most function and class definitions. Although the model supports 30 programming languages, it performs best on its six core languages: Python, JavaScript, Java, PHP, Go, and Ruby. Use batch processing when indexing code at scale. The model also fits into RAG pipelines for automated documentation generation and code understanding, though very large codebases need an appropriate chunking strategy. For production deployments, consider the AWS SageMaker endpoint for managed inference, and add caching to improve query performance.
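
The indexing advice above (chunk to fit the token limit, embed in batches, cache results) can be sketched as follows. Everything here is illustrative: `embed_batch` is a hypothetical callable standing in for whichever endpoint you use, `chunk_source` and `CachingEmbedder` are made-up names, the batch size of 32 is arbitrary, and whitespace splitting is only a rough proxy for the model's real tokenizer (it will usually undercount tokens in code).

```python
import hashlib
from typing import Callable, Dict, Iterator, List, Sequence

MAX_TOKENS = 8192  # context limit stated for the model

Vector = List[float]
EmbedBatch = Callable[[Sequence[str]], List[Vector]]  # hypothetical backend


def chunk_source(text: str, max_tokens: int = MAX_TOKENS) -> List[str]:
    """Split source text on line boundaries so each chunk stays within budget.

    Token counts are approximated by whitespace-separated words; a real
    deployment should count with the model's own tokenizer instead.
    A single line that exceeds the budget is kept as its own chunk.
    """
    chunks: List[str] = []
    current: List[str] = []
    used = 0
    for line in text.splitlines(keepends=True):
        cost = len(line.split())
        if current and used + cost > max_tokens:
            chunks.append("".join(current))
            current, used = [], 0
        current.append(line)
        used += cost
    if current:
        chunks.append("".join(current))
    return chunks


def batched(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


class CachingEmbedder:
    """Embed texts in batches, reusing vectors for text seen before."""

    def __init__(self, embed_batch: EmbedBatch, batch_size: int = 32) -> None:
        self._embed_batch = embed_batch
        self._batch_size = batch_size
        self._cache: Dict[str, Vector] = {}

    @staticmethod
    def _key(text: str) -> str:
        # Hash the content so identical chunks share one cache entry.
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def embed(self, texts: Sequence[str]) -> List[Vector]:
        # Collect uncached texts once each, preserving first-seen order.
        missing: List[str] = []
        seen = set()
        for text in texts:
            key = self._key(text)
            if key not in self._cache and key not in seen:
                seen.add(key)
                missing.append(text)
        # Only the uncached texts are sent to the backend, in batches.
        for batch in batched(missing, self._batch_size):
            for text, vector in zip(batch, self._embed_batch(batch)):
                self._cache[self._key(text)] = vector
        return [self._cache[self._key(text)] for text in texts]
```

In practice `embed_batch` would wrap a call to one of the deployment options listed for the model, and the in-memory dictionary could be replaced by a persistent store so that re-indexing unchanged files costs nothing.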
Blogs that mention this model
April 08, 2025 • 21 minutes read
jina-reranker-m0: Multilingual Multimodal Document Reranker
Introducing jina-reranker-m0, our new multilingual multimodal reranker for retrieving visual documents, with SOTA performance on multilingual long documents and code searching tasks.
Jina AI
September 27, 2024 • 15 minutes read
Migration From Jina Embeddings v2 to v3
We collected some tips to help you migrate from Jina Embeddings v2 to v3.
Alex C-G
Scott Martens
April 29, 2024 • 7 minutes read
Jina Embeddings and Reranker on Azure: Scalable Business-Ready AI Solutions
Jina Embeddings and Rerankers are now available on Azure Marketplace. Enterprises that prioritize privacy and security can now easily integrate Jina AI's state-of-the-art models right in their existing Azure ecosystem.
Susana Guzmán
February 05, 2024 • 4 minutes read
Elevate Your Code Search with New Jina Code Embeddings
New jina-embeddings-v2-base-code is optimized for code & docstring search. This powerful model supports searches between English and 30 widely-used programming languages, all with 8192 context length and SOTA performance.
Jina AI
Jina AI © 2020-2025.