Featured • Tech blog • August 23, 2023

Training Smarter, Not Harder: Slimming Down Sentence Embeddings

Jina AI trains top-tier sentence embeddings with 20% of the data. How? Maniacal filtering and shrewd sampling unmask the waste in Big Tech's data-hungry doctrine. Our streamlined embeddings rival the titans' through meticulous curation, not mindless consumption. Technical ingenuity, not gluttony.
Engineering Group • 4 minutes read

At Jina AI, sentence embedding models are a critical component of many natural language processing and multimodal AI systems we develop. By encoding semantic information into vector representations, these models equip systems to assess similarity, retrieve relevant passages, and even generate embeddings tailored to specific downstream tasks.
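
As a quick illustration of that use, the snippet below encodes two sentences and scores their similarity with the sentence-transformers library. The jina-embedding-s-en-v1 checkpoint named here is the small model released with this work; any sentence embedding model is used the same way.

from sentence_transformers import SentenceTransformer, util

# Small JINA EMBEDDINGS checkpoint on Hugging Face; substitute any
# sentence embedding model, the usage pattern is identical.
model = SentenceTransformer("jinaai/jina-embedding-s-en-v1")

sentences = ["How do I renew my passport?", "Passport renewal procedure"]
vecs = model.encode(sentences, normalize_embeddings=True)

# Cosine similarity close to 1.0 means the sentences are semantically close.
print(util.cos_sim(vecs[0], vecs[1]))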

However, a lingering concern around sentence embedding models has been their immense appetite for training data. Models like Sentence-BERT and Sentence-T5 are trained on billions of sentence pairs, demanding substantial computational resources. This imposes both financial and environmental strains that limit access to and adoption of these models.

In our new paper, we demonstrate that with careful data filtering and training methodology, compelling sentence embeddings can be attained using far less data than previously assumed. Our newly developed JINA EMBEDDINGS models deliver performance rivaling state-of-the-art models while reducing training samples by over 80%.

Paper: Jina Embeddings: A Novel Set of High-Performance Sentence Embedding Models (Michael Günther et al., arXiv.org). Jina Embeddings constitutes a set of high-performance sentence embedding models adept at translating various textual inputs into numerical representations, capturing the semantic essence of the text. The models excel in applications such as dense retrieval and semantic textual similarity.

The Power of Preprocessing

Many datasets for training sentence embedding models contain duplicates, non-English samples, and low-quality pairs with minimal semantic similarity. By meticulously removing such cases, we condensed our initial dataset from 1.5 billion pairs down to a refined 385 million high-quality samples.

Figure: composition of the 385 million pairwise training samples, broken down by source such as Reddit and Stack Exchange.

We applied a multi-step filtering process encompassing:

  • De-duplication: Removing duplicate entries.
  • Language filtering: Eliminating non-English sentences.
  • Consistency filtering: Excluding pairs with low embedding similarity under an auxiliary model (a sketch follows this list). This step alone screened out 84.3% of the Reddit dataset.
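
For intuition, here is a minimal sketch of the consistency-filtering step, assuming a simple thresholding scheme: score each pair with an auxiliary embedding model and drop pairs whose cosine similarity falls below a cutoff. The auxiliary checkpoint and the 0.6 threshold are illustrative placeholders, not the paper's exact configuration.

import numpy as np
from sentence_transformers import SentenceTransformer

# Placeholder auxiliary model and threshold; the paper's actual setup may differ.
aux_model = SentenceTransformer("all-MiniLM-L6-v2")

def consistency_filter(pairs, threshold=0.6):
    # With normalized embeddings, the row-wise dot product is cosine similarity.
    queries = aux_model.encode([q for q, _ in pairs], normalize_embeddings=True)
    targets = aux_model.encode([t for _, t in pairs], normalize_embeddings=True)
    sims = np.sum(queries * targets, axis=1)
    return [pair for pair, sim in zip(pairs, sims) if sim >= threshold]

Applied to a noisy source like Reddit title-body pairs, a filter of this kind is what screened out the bulk of the samples.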

An ablation study on the impact of these steps showed that consistency filtering, in particular, gave significant gains on retrieval tasks. This indicates that training on a smaller set of truly relevant pairs trumps indiscriminate exposure to billions of noisy examples.

Shrewd Sampling During Training

In addition to filtering the data itself, we employed smart sampling strategies when creating training batches. Our parallelized approach trains on all datasets simultaneously, but ensures each batch contains examples from just one dataset.

The sampling rates for drawing datasets are weighted based on their relative sizes and quality. This prevents overfitting on smaller datasets while ensuring sufficient exposure to key high-value datasets.

Together with filtering, the weighted sampling reduces the actual number of training pairs encountered to only 180 million. This frugal fine-tuning strategy departs drastically from the data-hungry doctrines of Big Tech.
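
A simplified sketch of this batching scheme, with invented dataset names and weights: each batch is drawn entirely from one dataset, and datasets are picked in proportion to their weights.

import random

# Toy stand-ins for the real corpora; names and weights are invented for illustration.
datasets = {
    "reddit": [("title A", "body A"), ("title B", "body B")],
    "stackexchange": [("question C", "answer C"), ("question D", "answer D")],
}
weights = {"reddit": 0.7, "stackexchange": 0.3}

def sample_batch(batch_size=4):
    # Pick one dataset per batch (weighted), then fill the batch from it alone.
    name = random.choices(list(weights), weights=list(weights.values()), k=1)[0]
    return [random.choice(datasets[name]) for _ in range(batch_size)]

print(sample_batch())

Keeping each batch single-source also means that, under a contrastive objective, in-batch negatives come from the same distribution as the positives, which presumably makes them more informative.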

Triplet Training Targets Specificity

After pre-training on paired data, we incorporate a triplet training phase using smaller datasets. Here, each sample contains a query, a positive match, and a negative match. The model learns to embed the query closer to the positive than to the negative.
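
Written as code, that objective is a cosine-based triplet margin loss. The sketch below is a generic formulation under an assumed margin of 0.5; the exact loss in the paper may be parameterized differently.

import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.5):
    # Penalize the model unless the anchor is more similar to the positive
    # than to the negative by at least `margin`.
    pos_sim = F.cosine_similarity(anchor, positive)
    neg_sim = F.cosine_similarity(anchor, negative)
    return F.relu(neg_sim - pos_sim + margin).mean()

# Toy usage with random 768-dimensional embeddings:
anchor, positive, negative = (torch.randn(8, 768) for _ in range(3))
print(triplet_loss(anchor, positive, negative))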

This second stage exposes models to hard negatives, making embeddings more discriminative. It also incorporates custom negation data to help distinguish contradictory statements. Our jina-large-v1 model saw accuracy on hard negations improve from 16.6% to 65.4% after this triplet training.

Table: model performance on the negation dataset, with parameter counts and training-sample counts, for models hosted on Hugging Face.

Adding triplet training atop pre-training on filtered pairs yields embeddings that combine broad generalizability with fine-grained specificity.

Proof is in the Benchmarks

We benchmarked our JINA EMBEDDINGS against competing models such as Sentence-T5 on established benchmarks including MTEB and BEIR.

Table: model scores on RR, RT, and STS (semantic textual similarity) tasks.

Remarkably, our 110 million parameter jina-base-v1 model matched or exceeded the performance of the 330 million parameter sentence-t5-large on numerous retrieval and reranking tasks. Even the small 35 million parameter jina-small-v1 showed competitive scores, demonstrating that data and training efficiencies unlock smaller yet powerful models.

Table: nDCG@10 comparison of models on retrieval tasks. Full results are on the MTEB Leaderboard, a Hugging Face Space by mteb.

The Bigger Picture

With meticulous data curation and an inventive training approach, we match state-of-the-art sentence embedding performance while slashing data demands by over 80%. This carries profound implications:

  • Reduced resources: Lower data requirements cut computing infrastructure and energy needs for training.
  • Increased accessibility: Efficient data usage unlocks performant small models, expanding access for organizations without massive compute budgets.
  • Responsible AI: Judicious data usage and smaller models align with principles of responsible AI by limiting wasteful resource consumption.

Our work shows that masterful model development need not equate to mindless data consumption. The technical ingenuity behind JINA EMBEDDINGS sets an important precedent: great NLP results can be achieved while exercising data frugality.

Categories: Featured • Tech blog
