Note: This model has been superseded by newer models such as jina-embeddings-v3.

jina-embeddings-v2-base-en

On par with OpenAI's text-embedding-ada-002.

License: Apache-2.0
Release Date: 2023-10-28
Input: Text
Output: Vector
Model Details
Parameters: 137M
Input Token Length: 8K
Output Dimension: 768

Language Support
🇺🇸 English

Related Models
jina-embedding-b-en-v1
jina-embeddings-v3

Tags
text-embeddings, english, long-context, production-ready, multi-task-learning, semantic-search, document-retrieval, high-performance, bert-based, fine-tunable, rag-optimized, 8k-context

Available via
Jina API, AWS SageMaker, Microsoft Azure, Hugging Face
Publications (3)

arXiv, September 07, 2024: Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models
arXiv, February 26, 2024: Multi-Task Contrastive Learning for 8192-Token Bilingual Text Embeddings
arXiv, October 30, 2023: Jina Embeddings 2: 8192-Token General-Purpose Text Embeddings for Long Documents

Overview

Jina Embeddings v2 Base English is a groundbreaking open-source text embedding model that solves the critical challenge of processing long documents while maintaining high accuracy. Organizations struggling with analyzing extensive legal documents, research papers, or financial reports will find this model particularly valuable. It stands out by handling documents up to 8,192 tokens in length—16 times longer than traditional models—while matching the performance of OpenAI's proprietary solutions. With a compact size of 0.27GB and efficient resource usage, it offers an accessible solution for teams seeking to implement advanced document analysis without excessive computational overhead.
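Once documents are embedded into the model's 768-dimensional vectors, relevance is typically scored with cosine similarity. A minimal sketch of that comparison step (the toy 4-dimensional vectors below are stand-ins for real model output, not values produced by the model):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors: the cosine of
    the angle between them, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional stand-ins for the model's 768-dimensional outputs.
query = np.array([0.1, 0.3, 0.5, 0.1])
doc_a = np.array([0.1, 0.3, 0.5, 0.1])     # same direction  -> similarity ~1.0
doc_b = np.array([-0.1, -0.3, -0.5, -0.1])  # opposite direction -> similarity ~-1.0

print(cosine_similarity(query, doc_a))
print(cosine_similarity(query, doc_b))
```

In a retrieval setting, the query embedding is compared against every document embedding this way (or via an approximate nearest-neighbor index), and the highest-scoring documents are returned.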

Methods

The model's architecture combines a BERT Small backbone with an innovative symmetric bidirectional ALiBi (Attention with Linear Biases) mechanism, eliminating the need for traditional positional embeddings. This architectural choice enables the model to extrapolate far beyond its training length of 512 tokens, handling sequences up to 8,192 tokens without performance degradation. The training process involved two key phases: initial pretraining on the C4 dataset, followed by refinement on Jina AI's curated collection of over 40 specialized datasets. This diverse training data, including challenging negative examples and varied sentence pairs, ensures robust performance across different domains and use cases. The model produces 768-dimensional dense vectors that capture nuanced semantic relationships, achieved with a relatively modest 137M parameters.
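The bidirectional ALiBi idea can be illustrated in a few lines: instead of adding positional embeddings to the input, each attention logit is penalized by a per-head slope times the token distance |i − j|, which is defined for any sequence length and so extrapolates past the training length. This is an illustrative numpy sketch using the geometric slope schedule from the ALiBi paper, not Jina's exact implementation:

```python
import numpy as np

def symmetric_alibi_bias(seq_len: int, num_heads: int) -> np.ndarray:
    """Symmetric bidirectional ALiBi: bias each attention logit by
    -slope * |i - j|. No learned positional embeddings are needed, and
    the bias is defined for sequences longer than any seen in training.
    Slopes decay geometrically across heads."""
    slopes = np.array([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    positions = np.arange(seq_len)
    distance = np.abs(positions[:, None] - positions[None, :])  # |i - j|
    return -slopes[:, None, None] * distance[None, :, :]        # (heads, seq, seq)

bias = symmetric_alibi_bias(seq_len=5, num_heads=2)
print(bias.shape)      # one (seq, seq) bias matrix per head
print(bias[0, 0, :3])  # zero on the diagonal, increasingly negative with distance
```

Because the penalty depends only on relative distance, the same function produces a valid bias matrix for 512 tokens or 8,192 tokens, which is what lets the model generalize beyond its 512-token training length.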

Performance

In real-world testing, Jina Embeddings v2 Base English demonstrates exceptional capabilities across multiple benchmarks. It outperforms OpenAI's text-embedding-ada-002 in several key metrics: classification (73.45% vs 70.93%), reranking (85.38% vs 84.89%), retrieval (56.98% vs 56.32%), and summarization (31.6% vs 30.8%). These numbers translate to practical advantages in tasks like document classification, where the model shows superior ability to categorize complex texts, and in search applications, where it better understands user queries and finds relevant documents. However, users should note that performance may vary when dealing with highly specialized domain-specific content not represented in the training data.

Best Practice

To effectively deploy Jina Embeddings v2 Base English, teams should consider several practical aspects. The model requires CUDA-capable hardware for optimal performance, though its efficient architecture means it can run on consumer-grade GPUs. It's available through multiple channels: direct download from Hugging Face, AWS Marketplace deployment, or the Jina AI API with 10M free tokens. For production deployments, AWS SageMaker in the us-east-1 region offers the most scalable solution. The model excels at general-purpose text analysis but may not be the best choice for highly specialized scientific terminology or domain-specific jargon without fine-tuning. When processing long documents, consider breaking them into meaningful semantic chunks rather than arbitrary splits to maintain context integrity. For optimal results, implement proper text preprocessing and ensure clean, well-formatted input data.
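The advice above to split on meaningful semantic boundaries rather than arbitrary offsets can be sketched as greedy sentence-packing under a token budget. This is a simplified illustration, not an official Jina utility: a real pipeline would count tokens with the model's own tokenizer and use smarter boundary detection than a regex.

```python
import re

def chunk_by_sentences(text: str, max_tokens: int = 8192) -> list[str]:
    """Greedily pack whole sentences into chunks, never splitting
    mid-sentence, so each chunk stays a coherent semantic unit.
    Token count is approximated by whitespace-separated words here."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks

doc = "First point. Second point follows. A third, longer observation ends here."
print(chunk_by_sentences(doc, max_tokens=6))
```

Each resulting chunk can then be embedded independently, staying within the model's 8K-token input limit while preserving sentence-level context.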
Blogs that mention this model
December 17, 2024 • 12 minutes read
Text Embeddings Fail to Capture Word Order and How to Fix It
Text embedding models struggle with capturing subtle linguistic nuances like word order, directional relationships, temporal sequences, causal connections, comparisons, and negation. Understanding these challenges is key to improving model performance.
Bo Wang
Alex C-G
October 25, 2024 • 19 minutes read
Finding Optimal Breakpoints in Long Documents Using Small Language Models
We trained three small language models to better segment long documents into chunks, and here are the key lessons we learned.
Andrei Ungureanu
Alex C-G
October 15, 2024 • 9 minutes read
Fact-Checking with New Grounding API in Jina Reader
With the new g.jina.ai, you can easily ground statements to reduce LLM hallucinations or improve the integrity of human-written content.
Jina AI
September 27, 2024 • 15 minutes read
Migration From Jina Embeddings v2 to v3
We collected some tips to help you migrate from Jina Embeddings v2 to v3.
Alex C-G
Scott Martens
September 18, 2024 • 10 minutes read
Jina Embeddings v3: A Frontier Multilingual Embedding Model
jina-embeddings-v3 is a frontier multilingual text embedding model with 570M parameters and 8192 token-length, outperforming the latest proprietary embeddings from OpenAI and Cohere on MTEB.
Jina AI