Warning: this model has been superseded by newer models.

jina-embedding-b-en-v1

The first version of the Jina Embedding model, the OG.
License: Apache-2.0
Release Date: 2023-06-17
Input: Text
Output: Vector
Model Details
Parameters: 110M
Input Token Length: 512
Output Dimension: 768
Language Support
🇺🇸 English
Related Models
jina-embeddings-v2-base-en
jina-embeddings-v3
Tags
text-embedding
english
base-model
legacy
bert-based
production
Available via
Hugging Face
Publications (1)
EMNLP 2023
July 20, 2023
Jina Embeddings: A Novel Set of High-Performance Sentence Embedding Models

Overview

Jina Embedding B v1 is a specialized text embedding model that transforms English text into high-dimensional numerical representations while preserving semantic meaning. It addresses the need for efficient, accurate text embeddings in production environments and is particularly valuable for organizations that must balance computational cost against embedding quality. With 110M parameters generating 768-dimensional embeddings, it is a practical choice for teams implementing semantic search, document clustering, or content recommendation without extensive computational resources.
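As an illustration of how such embeddings are used (a toy sketch, not the model's actual API — the vectors below are random stand-ins for real model output), semantic search reduces to cosine-similarity ranking over 768-dimensional vectors:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_documents(query_vec: np.ndarray, doc_vecs: list) -> list:
    """Return document indices sorted by similarity to the query (best first)."""
    scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)

# Toy 768-dimensional vectors standing in for real model output.
rng = np.random.default_rng(0)
query = rng.normal(size=768)
docs = [
    query + 0.1 * rng.normal(size=768),  # near-duplicate of the query
    rng.normal(size=768),                # unrelated document
    rng.normal(size=768),                # unrelated document
]
print(rank_documents(query, docs))  # index 0 (the near-duplicate) ranks first
```

In a real pipeline the vectors would come from the model itself (for example, via its Hugging Face checkpoint) rather than a random generator; the ranking logic is unchanged.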

Methods

The model employs a T5 encoder-based architecture enhanced with mean pooling to generate fixed-length representations. Trained on the carefully curated Linnaeus-Clean dataset, which contains 385 million high-quality sentence pairs filtered down from an initial 1.6 billion pairs, the model underwent a two-phase training process. The first phase utilized contrastive learning with InfoNCE loss on text pairs, while the second phase incorporated triplet training to refine the model's ability to distinguish between similar and dissimilar content. This innovative training approach, combined with rigorous data filtering including language detection and consistency checking, enables the model to capture nuanced semantic relationships effectively.
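The InfoNCE objective from the first training phase can be sketched as follows. This is a simplified numpy version with in-batch negatives; the temperature value is illustrative, not taken from the paper:

```python
import numpy as np

def info_nce_loss(queries: np.ndarray, positives: np.ndarray, tau: float = 0.05) -> float:
    """InfoNCE with in-batch negatives: each query's positive is the matching
    row in `positives`; every other row in the batch serves as a negative."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = q @ p.T / tau                       # (batch, batch) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs.diagonal().mean())   # NLL of picking the true pair

rng = np.random.default_rng(1)
q = rng.normal(size=(4, 768))
loss_random = info_nce_loss(q, rng.normal(size=(4, 768)))          # unrelated pairs
loss_aligned = info_nce_loss(q, q + 0.01 * rng.normal(size=(4, 768)))  # near-identical pairs
print(loss_aligned < loss_random)  # aligned pairs yield a lower loss
```

Minimizing this loss pulls each query toward its paired positive and away from the other pairs in the batch, which is what lets the model separate similar from dissimilar content.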

Performance

In real-world evaluations, Jina Embedding B v1 demonstrates impressive capabilities, particularly in semantic textual similarity tasks. The model achieves state-of-the-art performance on STS12 with a score of 0.751, surpassing established models like all-mpnet-base-v2 and all-minilm-l6-v2. It shows strong performance across various benchmarks while maintaining efficient inference times. However, users should note that the model is specifically optimized for English language content and may not perform optimally on multilingual or code-specific tasks. The model has since been superseded by jina-embeddings-v2-base-en and jina-embeddings-v3, which offer enhanced performance across a broader range of use cases.
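STS benchmarks such as STS12 score a model by the Spearman correlation between its cosine similarities and human judgments. A minimal sketch of that metric (toy data, no tie handling — not actual STS12 pairs):

```python
def spearman(xs, ys):
    """Spearman rank correlation (no tie handling): Pearson correlation of ranks."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0.0] * len(vals)
        for rank, idx in enumerate(order):
            r[idx] = float(rank)
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

model_sims = [0.91, 0.15, 0.78, 0.40]  # cosine similarities from embeddings
gold_scores = [4.8, 0.5, 4.1, 2.0]     # human similarity judgments (0-5 scale)
print(spearman(model_sims, gold_scores))  # identical ranking gives correlation 1.0
```

Because the metric is rank-based, a model only needs to order sentence pairs the way humans do; the absolute similarity values do not matter.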

Best Practice

For optimal deployment, the model benefits from a CUDA-capable GPU, though its moderate size also allows efficient inference on standard hardware. It accepts input sequences up to 512 tokens and is well-suited to production environments where consistent, reliable embedding generation is crucial. It performs best on English content and fits applications such as semantic search, document similarity comparison, and content recommendation. For new projects, teams should consider the newer v2 or v3 versions, which offer improved performance and broader language support. The model is not recommended for tasks requiring multilingual understanding or specialized domain knowledge outside general English text.
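Because input is capped at 512 tokens, longer documents must be truncated or chunked before embedding. A simple overlapping chunker, sketched here with whitespace tokens (a real deployment would count tokens with the model's own tokenizer, and the overlap size is an illustrative choice):

```python
def chunk_text(text: str, max_tokens: int = 512, overlap: int = 32) -> list:
    """Split text into overlapping chunks of at most `max_tokens` whitespace tokens."""
    tokens = text.split()
    if len(tokens) <= max_tokens:
        return [text]
    step = max_tokens - overlap  # advance so consecutive chunks share `overlap` tokens
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break
    return chunks

doc = " ".join(f"w{i}" for i in range(1200))  # a 1200-token toy document
chunks = chunk_text(doc)
print(len(chunks), [len(c.split()) for c in chunks])
```

Each chunk is then embedded separately, and chunk-level vectors can be averaged or searched individually depending on the application.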